Rcount: dealing with multi-mapping reads in RNAseq data

Rcount: simple and flexible RNA-Seq read counting

Marc W. Schmid* and Ueli Grossniklaus

Institute of Plant Biology and Zürich-Basel Plant Science Center, University of Zurich, 8008 Zürich, Switzerland

Bioinformatics. doi:10.1093/bioinformatics/btu680, PMID: 25322836

Nate showed me this paper today, which is of some interest to us given my obsession with finding a better way to deal with multi-mapping reads in small RNA-seq data (e.g., with the butter program). The paper describes a tool called Rcount, which is a read counter for ‘normal’ mRNA-seq data. As described in the paper, Rcount takes in a BAM file and deals with multireads. According to Figure 1 of the paper, it does this by using the local density of uniquely mapped reads to make a probability assessment: the more uniquely mapped reads in an area, the more likely it is that the multiread also came from that location. Rcount then places the read, noting the calculated probability in the SAM line with a custom tag. Finally, Rcount performs another task (handling reads that overlap more than one gene annotation) and counts up reads in annotated genes for the user.
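As a rough illustration of the density-guided idea described above, here is a minimal Python sketch. It is not the actual code of Rcount or butter. The function names, the `unique_counts` dictionary (locus label mapped to a count of nearby uniquely mapped reads), and the pseudocount are all assumptions made for the example.

```python
import random


def placement_probabilities(candidate_loci, unique_counts, pseudocount=1.0):
    """Weight each candidate locus by its local count of uniquely mapped reads.

    candidate_loci: distinct locus labels where the multiread aligns (hypothetical format).
    unique_counts:  dict of locus label -> number of uniquely mapped reads nearby.
    pseudocount:    assumed smoothing term so loci without unique reads keep some weight.
    """
    weights = [unique_counts.get(locus, 0) + pseudocount for locus in candidate_loci]
    total = sum(weights)
    if total == 0:
        # No unique-read evidence anywhere: fall back to equal probabilities.
        return {locus: 1.0 / len(candidate_loci) for locus in candidate_loci}
    return {locus: w / total for locus, w in zip(candidate_loci, weights)}


def place_multiread(candidate_loci, unique_counts, rng=None, pseudocount=1.0):
    """Pick one locus at random, in proportion to its probability.

    Returns the chosen locus and the probability it was assigned, which is
    the kind of value one could record alongside the alignment.
    """
    rng = rng or random.Random()
    probs = placement_probabilities(candidate_loci, unique_counts, pseudocount)
    loci = list(probs)
    chosen = rng.choices(loci, weights=[probs[locus] for locus in loci], k=1)[0]
    return chosen, probs[chosen]
```

For example, with 9 unique reads near one locus, 1 near the other, and `pseudocount=0`, the two candidate loci get probabilities 0.9 and 0.1. Passing a seeded `random.Random` as `rng` makes the placement reproducible from run to run.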

Rcount is clearly geared toward counting reads in annotated genes with reference to mRNA-seq data. For that reason, I doubt the program itself will be that useful for small RNA-seq data, where we are not generally interested in counting reads in pre-defined intervals (like gene annotations). But it is striking that Rcount uses pretty much exactly the method that my butter program uses for assigning reads: using the density of the unique mappers to create a probability set that guides decisions on multi-mappers. I think Nate is going to try to use Rcount for small RNA-seq data.

I don’t think this precludes continued development of butter or its successor, because Rcount is pretty clearly geared toward mRNA-seq data. But it is worth testing, if possible, against butter and other methods for small RNA-seq, to try to determine for our own lab purposes an optimal method for aligning multi-mapped small RNA-seq reads that is both precise and reproducible.

– Mike Axtell

plantDARIO – a web-based tool for small RNA-seq analysis in select plant genomes

Patra et al. (2014). plantDARIO: web based quantitative and qualitative analysis of small RNA-seq data in plants. Frontiers in Plant science.

doi: 10.3389/fpls.2014.00708
PMID: 25566282

This manuscript describes a web-based service for the annotation of small RNA-producing genes in Arabidopsis thaliana, Beta vulgaris, and Solanum lycopersicum (the authors also state that they plan to extend the number of plant species to “…include most of the available plant genomes.”). Users provide aligned small RNA data in BAM or BED format, and the authors provide a script for condensing reads aligned to the same position. Thus the authors reduce the burden of large data transfers. The web server parses the aligned small RNA data with respect to several pre-loaded annotation tracks, including known miRNAs (from miRBase), known tasi-RNAs, tRNAs, and other ncRNAs from Rfam. Global stats are spit out for the library. Clusters of reads that don’t overlap any annotated regions are flagged, and some miRNA-finding and snoRNA-finding programs are run. Results can be integrated into other publicly available genome browsers for the species of interest, located on other servers.
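To make the read-condensation step mentioned above concrete, here is a minimal Python sketch of collapsing reads that align to the same position into a single record with a count. This is not the plantDARIO script itself. The input format (tuples of chromosome, start, end, strand) and the function name are assumptions made for illustration.

```python
from collections import Counter


def condense_alignments(alignments):
    """Collapse alignments that share the same position into one record each.

    alignments: iterable of (chrom, start, end, strand) tuples, one per aligned
                read (a hypothetical, simplified stand-in for BAM/BED records).
    Returns a sorted list of (chrom, start, end, strand, count) tuples.
    """
    counts = Counter(alignments)
    return sorted(
        (chrom, start, end, strand, n)
        for (chrom, start, end, strand), n in counts.items()
    )


reads = [
    ("Chr1", 100, 121, "+"),
    ("Chr1", 100, 121, "+"),
    ("Chr1", 100, 121, "-"),
    ("Chr2", 500, 524, "+"),
]
condensed = condense_alignments(reads)
```

Here four input alignments condense to three records, with the two identical plus-strand reads on Chr1 merged into one record carrying a count of 2. For a deeply sequenced small RNA library, where many reads are identical, this kind of collapsing is what shrinks the upload.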

I found this manuscript interesting for a couple of reasons. First, I had often wondered about how to make my own small RNA-seq program, ShortStack, available as a web service. I have not done this, primarily because the input for ShortStack is raw small RNA-seq data, or BAM files of aligned small RNA-seq data, along with the reference genome. This would be tedious for users to upload because of the file sizes. The large file sizes could also place a big demand on the server, as could the large number of CPU cycles each analysis might need. It looks like the authors of plantDARIO have gotten around this issue by outsourcing the alignments to the user and enforcing a read-condensation scheme.

The second thing I found interesting about this work was a brief mention of the alignment methods. In particular, the authors state “Unlike many other mapping tools, segemehl has full support for multiple-mapping reads which is very important for small RNA-seq”. I am quite interested in improving how multi-mapped small RNA-seq reads are placed and used (see butter). I had not heard of the program “segemehl” before. The relevant paper is Otto et al., 2014, which I will need to put on my reading list.

The third thing I was interested in was the method for annotating small RNA clusters that didn’t overlap a known gene. The authors are using a tool called “blockbuster”, which was described in another earlier paper from this group, Langenberger et al. 2009. Will have to check this out too.

My final thoughts on this paper have to do with comparing a web-based service like plantDARIO to a stand-alone program like ShortStack. The authors of this paper make a plug for a web-based service and ding stand-alone programs by stating “The other sncRNA prediction tools need to be downloaded, installed and run locally, requiring more than basic computer skills.” Well yes, this is true. But there are significant advantages of a stand-alone vs. their approach to web-based analysis. With a standalone, you can use any genome assembly or assembly version you want. But with their approach, you are limited to whatever they have pre-configured. Moving to new species, or even updating with a newer genome assembly version, is not possible except by requesting the authors to update their site. There is a lot more flexibility to be gained with a standalone.

In any event, an interesting read. I’m looking forward to trying out the tool, and to reading some more of the background methods, especially alignments and de-novo cluster finding.

PS. One error: My ShortStack paper is erroneously cited as “Allen et al. (2013)” instead of “Axtell (2013)”. The author lists of my paper and a 2004 paper from the Carrington Lab, with Ed Allen as lead author, appear to have been swapped in the ref. cited section.

–Mike Axtell