Author Archives: mja18

Improving ShortStack performance on highly fragmented genomes

Highly fragmented reference genomes (with thousands or more short contigs or scaffolds) have been a persistent challenge for our small RNA-seq analysis program ShortStack. During a run, ShortStack needs to retrieve genomic sub-sequences for analysis of predicted RNA secondary structure. This is required to identify MIRNA hairpins. Early on I made the decision to use the samtools faidx function as the engine to retrieve genome sub-sequences. This was just pragmatic and lazy .. samtools was already required for other portions of ShortStack’s analysis, and there wasn’t a need to reinvent the wheel. However, when we started to do runs against highly fragmented genome assemblies, we found analysis was very slow. The slowness was traced to the samtools faidx function, which is very sensitive to the number of contigs/references.

The first attempt to fix this issue was in version 3.0, when I introduced the use ‘genome-stitching’.  When the reference genome had more than 50 sequences, and some were < 1Mb in size, the references were ‘stitched’ together into larger assemblies of at least 5Mb each. This worked very well to speed up the process of MIRNA searching, but had some disadvantages:

  1. The end results (bam file, results, etc), were in the coordinate system of the ‘stitched’ genome, not the original. This was quite inconvenient for users.
  2. The reads had to be aligned to the ‘stitched’ genome, which necessitated building a new bowtie index — this added total time to the run.
  3. I recently noticed a bug as well .. despite my efforts to prevent it, sometimes de novo small RNA clusters were found that spanned the junctions between ‘stitched’ genome segments.

In release 3.8, I’ve removed the drawbacks of the old stitching approach, while keeping the speed-up. The solution was simple and I’m kicking myself for not thinking about it earlier. Fragmented genomes are still stitched during the run (now into larger chunks, > 50Mb), but the stitched version is used only for sequence lookups. To get a sequence, the ‘real’ coordinates are converted to the ‘stitched’ coordinates, and the stitched genome used for the samtools faidx query. This keeps the speed up in retrieving sequences for folding, but the results are still reported in the original genome’s coordinate system. This approach eliminates all 3 of the problems listed above.

To demonstrate the speed-up, I compared the performance of 3.8 to 3.7.2 run with nostitch with the following conditions, using a highly fragmented reference genome I’m working on with a collaborator at the moment:

  • Reference genome: 382,375 scaffolds, totaling 3.75Gb of sequence. bowtie indices pre-built.
  • Alignments: 202,147,307 reads (pre aligned in single bam file).
  • mincov = 0.5rpm

Performance:

  • 3.7.2 (under nostitch) : 3 hrs, 14 minutes
  • 3.8 (improved stitch, still reporting in original coordinates): 1 hr. 13 minutes

So, the performance gain is clear, nearly three times as fast. Because so many new genome assemblies are highly fragmented drafts, this performance enhancement may matter to many users.

 

NO BAR CHARTS! (Weissgerber et al. 2015)

I came across this article in PLoS Biology (Weissgerber et al. 2015 doi: 10.1371/journal.pbio.1002128) and it very clearly laid out the issues of using barcharts to display continuous data, especially with low sample sizes. Bottom line: barcharts are misleading summaries of the data, and it’s better to show the data themselves with univariate scatter plots or boxplots. Figure 1 from their paper dramatically shows how barcharts can mask very different distributions of the data:

Figure 1:  Weissgerber et al. (2015).

Figure 1: Weissgerber et al. (2015).

Figure 3 from this paper is also very striking. Here we can easily see the fallacy of using standard error of the mean (SEM), or standard deviation (SD) for non-normally distributed, low sample size data:

Figure 3: Weissgerber et al. (2015)

Figure 3: Weissgerber et al. (2015)

Some people in my lab probably remember me complaining about barcharts for qRT-PCR experiments where n=3 or n=6 on one of our recent papers. I prevailed, and we had scatter plots instead of barcharts. Maybe now my motivations are clearer? 🙂

Figure 7D from our recent paper (Coruh et al. 2015 Plant Cell doi: 10.1105/tpc.15.00228) .. note avoidance of barchart!

Figure 7D from our recent paper (Coruh et al. 2015 Plant Cell doi: 10.1105/tpc.15.00228) .. note avoidance of barchart!

ShortStack now on iPlant

A big thanks to Nate Johnson for coordinating the loading of ShortStack 3.3 onto iPlant’s Discovery Environment. There is now a ‘public app’ for ShortStack on the discovery environment. Just search ‘ShortStack’ in the apps.

This will allow users to leverage iPlant’s compute resources instead of their local machines to run ShortStack. It also uses a gui interface instead of the command line to run the software.

Anyone who uses this, please let me know! (mja18@psu.edu) .. would love to track how much use the community is getting out of the iPlant app.

CRISPR/Cas9 ‘toolbox’ of vectors for plants

Since today’s seminar speaker here is Dr. Daniel Voytas from Minnesota, I went looking for some of his recent papers. I came across this recent one, from him and his collaborators, reporting a series of vectors optimized for plant transgenesis with various types of CRISPR activities, including multi-plexed targeting and transcriptional activation. At a glance the experiments look convincing, and the vector series is available at Addgene. Folks in my lab who are interested in multi-plexed CRISPR/Cas9 targeting might want to look at this study.

A CRISPR/Cas9 Toolbox for Multiplexed Plant Genome Editing and Transcriptional Regulation. (2015; Plant Physiology. doi: 10.1104/pp.15.00636)

Untitled

Plant miRNA evolution update

So here’s another figure I’ve prepared for the Plant Cell review article I am writing. This is an update on the patterns of miRNA conservation across land plants according solely to the information present in miRBase release 21.

For this analysis, I first placed each land plant miRNA family in miRBase 21 into one of eight broad plant groups (Eudicots-Rosids, Eudicots-Asterids, Basal Eudicots, Monocots, Basal Angiosperms, Gymnosperms, Lycophytes, and Bryophytes). I then defined a conserved miRNA family as one which had at least one high-confidence annotation, and which was annotated in two or more of these broad groups. By this definition, there are just 36 conserved families out of the more than 2,000 that are currently annotated.

grid_fig

This figure also highlights a major issue in the large-scale analysis of land plant miRNA conservation .. highly unequal sampling density within the different groups. Rosids and monocots have received the most attention, with large numbers of species represented, and high numbers of overall annotations (barcharts at the top). In contrast, basal eudicots, basal angiosperms, lycophytes, and bryophytes are much less well-sampled, with each group represented at present by just one species in miRBase 21. Inferring secondary losses in some of these lineages (basal eudicots, basal angiosperms, lycophytes) is NOT believable.

This chart also illustrates how important the high-quality miRNA annotation set for bryophytes is (the sole species represented is Physcomitrella patens). Based on the presence of high-confidence annotations in both Physcomitrella and one or more angiosperm group, we can confidently say that nine families (miR156, miR160, miR166, miR171, miR319, miR390, miR477, miR529, and miR535) were most likely present in the last common ancestor of all land plants. Another three families (miR167, miR395, and miR408) might also belong in the ultra-conserved set, but they are less certain because their Physcomitrella annotations are not high-confidence. Another two families (miR396 and miR482) clearly predate divergence of all seed plants. Other patterns are less certain because of the frequent presence of annotations that are not (yet) known to be high confidence. I think this again highlights the need for a systematic review of miRNA annotations, based on re-analysis of all available small RNA-seq data with a single, high-confidence MIRNA identification methodology. We are working toward this goal in my lab.

Plant microRNAs in miRBase 21

So I’ve been asked to contribute a review article to The Plant Cell on the general subject of microRNA / small RNA evolution in plants.  I set out to make an up to date figure on microRNA annotations across plants based solely on the annotations in miRBase 21. The results were surprising to me.

72 different land plant species are represented in miRBase 21, with a total of 2,247 different miRNA FAMILIES annotated. Note, that is families, not loci. Seemingly a huge diversity of different miRNA sequences.

But, the issue is that many of these annotations are likely to be false positives .. either other types of regulatory small RNAs, or even worse, just degraded garbage. The curators of miRBase, since release 20, have set out to try and make a ‘high-confidence’ list of microRNA loci, based on their internal parsing of public small RNA-seq data. (See their paper in NAR). In miRBase 21, from plants, there are just 176 high-confidence miRNA families (a high-confidence family is a family for which at least one locus has a high-confidence designation).  Furthermore, only 17 of the 72 plant species have ANY high-confidence annotations at all! The scatterplot below illustrates this, as we can see that most species have few annotated families and no high-confidence ones.

miRBase_scatterplot

Scatterplot of numbers of high-confidence miRNA families from 72 plant species as a function of the total number of annotated families. Data are coded according to broad taxonomic groups. A few species of interest are labeled. Data were processed from miRBase 21.

Of course, there are some caveats here. The biggest is that the ‘high-confidence’ designation is based on whether or not the miRBase folks have analyzed the available high-throughput small RNA-seq data for a given species. In some cases, they may not have (though I haven’t dug into that specifically). The second caveat is that all species are NOT equally treated here. Some species (for instance, Oryza sativa, Arabidopsis thaliana) have high-quality reference genomes and have had lots of experimental attention over the years. Many others have neither of these traits, and so their annotations are necessarily more piecemeal. Overall I think this points out the need clearly for a more uniform approach to retrospective analysis of miRBase annotations.

I was also pleased to see that Physcomitrella patens has a very high percentage of high-confidence miRNA families .. since my lab annotated the vast majority of those families!

— Mike Axtell

High temps trigger 24mer siRNAs in Arabidopsis?

I came across this article, published about a year ago, in Plant Physiology and Biochemistry and was intrigued:

Insight into small RNA abundance and expression in high- and low-temperature stress response using deep sequencing in Arabidopsis

doi: 10.1016/j.plaphy.2014.09.007

The experiment was small RNA-seq on heat-stressed (HT .. 36C for one day), normal temperature (NT) and low-temperature (LT .. 4C for one day ) Arabidopsis plants. The basic observation that I found intriguing was that the high-temp stress library had quite a few 24mers, while the cold and normal temp. libraries had a lot fewer. (See Figure 1 from the paper below):

Baev_Fig1

Does this imply that heat stress activates the 24nt siRNA pathway?

The major issue with the experiment however is a lack of replication and independent verification: Just one library per treatment so there are no biological replicates to assess the reproducibility, so the result I think is provisional at best.

However, there are some earlier reports that also show effects of heat stress on bulk small RNA functions. Ito et al. (2011) showed that the Arabidopsis ONSEN retrotransposon was transcribed and reverse-transcribed in response to heat stress, but that actual transpositions were prevented by the het-siRNA pathway. Zhong et al. (2013) showed that heat stress de-activated the trans-acting siRNA pathway (which mostly makes 21 and 22mer siRNAs that can also behave like heterochromatic siRNAs). I think that understanding the role of heat-induced small RNA profile changes will be quite interesting.

Rcount: dealing with multi-mapping reads in RNAseq data

Rcount: simple and flexible RNA-Seq read counting

Marc W. Schmid* and Ueli Grossniklaus

Institute of Plant Biology and Zu€rich-Basel Plant Science Center, University of Zurich, 8008 Zu€rich, Switzerland

Bioinformatics. doi:10.1093/bioinformatics/btu680, PMID: 25322836

Nate showed me this paper today which is of some interest to us given my obsession with finding a better way to deal with the issue of multi-mapping reads in small RNA-seq data (e.g., with the butter program). This paper describes a tool called Rcount, which is a counter for ‘normal’ mRNA-seq data. As described in the paper, Rcount takes in a BAM file, and deals with multireads. According to figure 1 (copied below), the way they do this is to use the density of local uniquely mapped reads and make a probability assessment… the more uniquely mapped reads in an area, the more likely it is that the multi-read also came from that location. They then place it, noting their calculated probability in the SAM line with a custom tag. Rcount then performs another task (dealing with counting reads that overlap more than one gene annotation) and counts up reads in annotated genes for the user.

Rcount is clearly geared toward counting reads in annotated genes with reference to mRNA-seq data. For that reason, I doubt the program itself will be that useful for small RNA-seq data, where we are not generally interested in counting reads in pre-defined intervals (like gene annotations). But it is striking that Rcount is using pretty much exactly the method that my butter program uses for assigning reads … using the density of the unique mappers to create a probability set used to guide decisions on multi-mappers. I think Nate is going to try and use Rcount for small RNA-seq data.

I don’t think this precludes continued development of butter or it’s successor, because Rcount is pretty clearly geared toward mRNA-seq data. But it is worth testing, if possible, against butter and other methods for small RNA-seq to try and determine for our own lab purposes an optimal method for aligning multi-mapped small RNA-seq reads that is both precise and reproducible.

– Mike Axtell

plantDARIO – a web-based tool for small RNA-seq analysis in select plant genomes

Patra et al. (2014). plantDARIO: web based quantitative and qualitative analysis of small RNA-seq data in plants. Frontiers in Plant science.

doi: 10.3389/fpls.2014.00708
PMID: 25566282

This manuscript describes a web-based service for the annotation of small RNA-producing genes in Arabidopsis thaliana, Beta vulgaris, and Solanum lycopersicum (the authors also state that they plan to extend the number of plant species to “…include most of the available plant genomes.”. Users provide aligned small RNA data in BAM or bed format, and the authors provide a script for condensing reads aligned to the same position. Thus the authors reduce the burden of large data transfers. The web server parses the aligned small RNA data with respect to several pre-loaded annotation tracks, including known miRNAs (from miRBase), known tasi-RNAs, tRNAs, and other ncRNAs from Rfam. Global stats are spit out for the library. Clusters of reads that don’t overlap any annotated regions are flagged, and some miRNA finding and snoRNA finding programs are run. Results can be integrated onto other publically available genome browsers for the species of interest, located on other servers.

I found this manuscript interesting for a couple of reasons. First, I had often wondered about how to make my own small RNA-seq program, ShortStack, available as a web-service. I have not done this, primarily because the input for ShortStack is raw small RNA-seq data, or BAM files of aligned small RNA-seq data, along with the reference genome. This would be tedious to upload for users because of the file sizes. The large file sizes could also place a big demand of the server, as could the intense number of CPU cycles that might be run. It looks like the authors of plantDARIO have gone around this issue by outsourcing the alignments to the user, and enforcing a read-condensation scheme.

The second thing I found interesting about this work was a brief mention of the alignment methods. In particular, the authors state “Unlike many other mapping tools, segemehl has full support for multiple-mapping reads which is very important for small RNA-seq”. I am quite interested in improving the treatment of how multi-mapped small RNA-seq reads are placed and used (see butter). I have not heard of the program “segemehl” before. The relevant paper is Otto et al., 2014, which I will need to put on my reading list.

The third thing I was interested in was the method for annotating small RNA clusters that didn’t overlap a known gene. The authors are using a tool called “blockbuster”, which was described in another earlier paper from this group, Langenberger et al. 2009. Will have to check this out too.

My final thoughts on this paper have to do with comparing a web-based service like plantDARIO to a stand-alone program like ShortStack. The authors of this paper make a plug for a web-based service and ding stand-alone programs by stating “The other sncRNA prediction tools need to be downloaded, installed and run locally, requiring more than basic computer skills.” Well yes, this is true. But there are significant advantages of a stand-alone vs. their approach to web-based analysis. With a standalone, you can use any genome assembly or assembly version you want. But with their approach, you are limited to whatever they have pre-configured. Moving to new species, or even updating with a newer genome assembly version, is not possible except by requesting the authors to update their site. There is a lot more flexibility to be gained with a standalone.

In any event, an interesting read. I’m looking forward to trying out the tool, and to reading some more of the background methods, especially alignments and de-novo cluster finding.

PS. One error: My ShortStack paper is erroneously cited as “Allen et al. (2013)” instead of “Axtell (2013)”. The author lists of my paper and a 2004 paper from the Carrington Lab, with Ed Allen as lead author, appear to have been swapped in the ref. cited section.

–Mike Axtell

“Clean” gene replacement in Physcomitrella is hard.

Paper: Recombination products suggest the frequent occurrence of aberrant gene replacement in the moss Physcomitrella patens by Wendeler et al.

The Plant Journal .. doi: 10.1111/tpj.12749 .. PMID: 25557140

This paper examines in detail what happen around the PpCOL2 locus of Physcomitrella patens during gene replacement experiments. They find that complex re-arrangements were very frequent. In particular, a number of transformed lines where PCR analysis across the predicted genome / replacement construct junctions was positive had other copies of the target locus in the genome. They find that this is RAD51-dependent (there are two non-redundant RAD51 genes in Physco, -a and -b).

The main take-away for me in this paper was that PCR analysis of junctions is insufficient to screen for gene replacement lines in Physco. For instance, in their southern blotting experiments, the authors find that “gene replacement with correct recombination junction fragments and deletion of the original sequence was obtained in only two [sic] out of 9 targeted lines” .. where these “9 targeted lines” were all positive for both junction PCRs. Yikes. The southern blot is in figure 3B.

This matches with our own past experiences of gene replacement in Physocmitrella. It is not trivial to actually get rid of targeted sequence. Perhaps with a-miRNAs and also perhaps CRISPR-Cas9, gene replacement in this organism might be deprecated.

– Mike Axtell