Improving ShortStack performance on highly fragmented genomes

Highly fragmented reference genomes (with thousands or more short contigs or scaffolds) have been a persistent challenge for our small RNA-seq analysis program ShortStack. During a run, ShortStack needs to retrieve genomic sub-sequences for analysis of predicted RNA secondary structure. This is required to identify MIRNA hairpins. Early on I made the decision to use the samtools faidx function as the engine to retrieve genome sub-sequences. This was just pragmatic and lazy .. samtools was already required for other portions of ShortStack’s analysis, and there wasn’t a need to reinvent the wheel. However, when we started to do runs against highly fragmented genome assemblies, we found analysis was very slow. The slowness was traced to the samtools faidx function, which is very sensitive to the number of contigs/references.

The first attempt to fix this issue was in version 3.0, when I introduced the use ‘genome-stitching’.  When the reference genome had more than 50 sequences, and some were < 1Mb in size, the references were ‘stitched’ together into larger assemblies of at least 5Mb each. This worked very well to speed up the process of MIRNA searching, but had some disadvantages:

  1. The end results (bam file, results, etc), were in the coordinate system of the ‘stitched’ genome, not the original. This was quite inconvenient for users.
  2. The reads had to be aligned to the ‘stitched’ genome, which necessitated building a new bowtie index — this added total time to the run.
  3. I recently noticed a bug as well .. despite my efforts to prevent it, sometimes de novo small RNA clusters were found that spanned the junctions between ‘stitched’ genome segments.

In release 3.8, I’ve removed the drawbacks of the old stitching approach, while keeping the speed-up. The solution was simple and I’m kicking myself for not thinking about it earlier. Fragmented genomes are still stitched during the run (now into larger chunks, > 50Mb), but the stitched version is used only for sequence lookups. To get a sequence, the ‘real’ coordinates are converted to the ‘stitched’ coordinates, and the stitched genome used for the samtools faidx query. This keeps the speed up in retrieving sequences for folding, but the results are still reported in the original genome’s coordinate system. This approach eliminates all 3 of the problems listed above.

To demonstrate the speed-up, I compared the performance of 3.8 to 3.7.2 run with nostitch with the following conditions, using a highly fragmented reference genome I’m working on with a collaborator at the moment:

  • Reference genome: 382,375 scaffolds, totaling 3.75Gb of sequence. bowtie indices pre-built.
  • Alignments: 202,147,307 reads (pre aligned in single bam file).
  • mincov = 0.5rpm


  • 3.7.2 (under nostitch) : 3 hrs, 14 minutes
  • 3.8 (improved stitch, still reporting in original coordinates): 1 hr. 13 minutes

So, the performance gain is clear, nearly three times as fast. Because so many new genome assemblies are highly fragmented drafts, this performance enhancement may matter to many users.


Mobile sRNAs can induce methylation on a genome-wide scale

The paper I chose to discuss was “Mobile small RNAs regulate genome-wide DNA methylation”, from the Ecker and Baulcombe groups. (PMID: 26787884, doi: 10.1073/pnas.1515072113). This goal of this study was to identify mobile sRNA loci and methylation loci, based on their interaction. To do this, the authors used shoot/root grafts of wild types: Col-0 and C24 and a mutant lacking siRNA formation: dcl234, as to elucidate the requirement of a mobile signal. Mobile sRNAs were identified where WT shoot could produce transcripts that were sequenced in dcl234 roots.

They identified 3 relevant classes of mobile-sRNAs and targets: direct interaction, indirect interaction and de novo methylation.  Direct interaction was shown where mobile-sRNAs were were enriched in a methylation site, whereas indirect had methylation due to mobile-sRNAs, but have no clear sRNA culprit. De novo loci are shown where methylation is induced by mobile-sRNAs in Col-0 root by a C24 shoot, but is not present with a Col-0 shoot. This study found widespread accounts of direct and indirect loci, as well as significant de novo methylated sited.

The huge quantity of loci identified as indirectly methylated was an interesting point of this paper. The authors have several thoughts for why this might occur, focusing on a possible secondary signal bridging the mobile signal and methylation or the possibility of aggressive threshold for mobile-sRNA significance inducing false negatives. Another point brought up is the possibility of requiring perfect matching with sRNA alignment resulting in missing valid secondary alignment of transcripts. This certainly seems possible to me, as allowing for a single mis-match opens a much more inclusive set of sRNA targets. These could be biologically relevant despite the mis-match.

Genomically, these direct and indirectly targeted loci are localized in transposable elements, while depleted in coding regions. This is true with both CHH and CHG methylation. Using mutant libraries from another study (Stroud et al. 2013 – Cell), the authors made a clear connection that mobile-sRNA methylation targets are dependent on DRM1/2. This makes a strong case for methylation through an AGO-dependent pathway.

This was an interesting paper, which gave strong evidence to support several of the claims. As an observational study, it gave support to previous work which indicated methylation but on a loci-specific scale. It is clear that this is taking place on a much broader level.

Control meiotic crossover by DNA methylation in Arabidopsis

We discussed a paper by Yelina et al. (2015) titled “DNA methylation epigenetically silences crossover hot spots and controls chromosomal domains of meiotic recombination in Arabidopsis” (PMID: 26494791; doi: 10.1101/gad.270876.115) in our journal club today. It’s a pretty interesting paper with intriguing topic and smart experimental designs.

The authors previously identified two meiotic crossover hot spots, 3a and 3b on subtelomeric regions of Chromasome 3 in Arabidopsis (Yelina et al., 2012). These crossover hot spots have low CG methylation compared to average genome methylation level or regions in between 3a and 3b. They then tested if increasing DNA methylation in 3a and 3b could suppress meiotic crossover rate by expressing inverted-repeat (IR) transgene, which would trigger RdDM in targeted regions. Interestingly, meiotic crossover rate significantly decreased in several IR expressed lines (Figure 1 and 2, Table 1). Other RdDM markers, such as increased H3K9me2 and denser nucleosome occupancy were also detected in IR targeted regions (Figure 3). There results indicate that crossover rate is negatively correlated with DNA methylation in euchromatic regions.

It is thus obvious to ask that if crossover rate would significantly elevate if genome-wide demethylation occurred. The authors used met1/+ plants to test this hypothesis. However, overall crossover rates were similar in Col/Ler vs met1 Col/Ler. Regional remodeling of crossover around subtelomeric and pericentromeric regions was observed (Figure 4C). They showed that the remodeling of crossover in met1 was dependent on crossover interference pathway (Figure 4G). The analysis of crossover in met1 mutant suggests that genome-wide demethylation has different effect on crossover in euchromatic and centromeric regions. Very interestingly, the met1 mutation causes increased crossover in euchromatic regions, but vise versa in pericentromeric regions (Figure 5D). They further showed that double strand DNA breakage (DSB) was similar in met1 and WT in Arabidopsis (Figure 6), which ruled out the possibility that crossover remodeling in met1 was due to altered DSB.

A few brilliant technique/experiments were used in this research. I think it’s very smart to study meiotic crossover by studying pollen DNA. More information about pollen typing could be found in Drouaud and Mézard (2011). The crossover detecting system by using GFP/RFP that inserts into different positions on same chromosome is also very neat. More information about the GFP/RFP lines can be found in Yelina et al. (2012) paper.


Drouaud, J., & Mézard, C. (2011). Characterization of meiotic crossovers in pollen from Arabidopsis thaliana. DNA Recombination: Methods and Protocols, 223-249.

Yelina, N. E., Choi, K., Chelysheva, L., Macaulay, M., De Snoo, B., Wijnker, E., … & Mezard, C. (2012). Epigenetic remodeling of meiotic crossover frequency in Arabidopsis thaliana DNA methyltransferase mutants. PLoS Genet, 8(8), e1002844.

Yelina, N. E., Lambing, C., Hardcastle, T. J., Zhao, X., Santos, B., & Henderson, I. R. (2015). DNA methylation epigenetically silences crossover hot spots and controls chromosomal domains of meiotic recombination in Arabidopsis. Genes Dev, 29, 2183-2202.


Arabidopsis RNASE THREE LIKE2 modulates the expression of protein-coding genes via 24-nucleotide small interfering RNA-directed DNA methylation

Elvira-Matelot E, Hachet M, Shamandi N, Comella P, Saez-Vasquez J, Zytnicki M, Vaucheret H

The Arabidopsis RTL2 is an RNASE THREE LIKE protein with one RNAseIII domain and two dsRNA-binding domains. Its transient over-expression in plants is known to enhance the production of exogenous siRNAs. Here the authors investigate its role in the production of endogenous siRNAs.

The ectopic expression of RTL2 stimulates the production of siRNAs from artificial and natural Inverted Repeats constructs. Both its domains are necessary for the RTL2-dependent production of siRNAs from dsRNAs. Contrarily, in other cases the over-expression of RTL2 reduces the production of siRNAs from dsRNAs. The opposite effect of RTL2 on dsRNA substrates likely depends on the structure and/or sequence of the dsRNAs.

Interestingly, the over-expression of RTL2 also stimulates the production of RNA molecules larger than 24 nts from artificial and natural Inverted Repeats constructs. The authors also demonstrate that RTL2 cannot substitute for the function of DCL2, DCL3 and DCL4. The suggested hypothesis is that RTL2 could process dsRNAs into RNA molecules longer than 24 nts, which could subsequently be processed by DCL proteins into siRNAs. In some cases the cleavage of RTL2 results in a better processing by DCLs, in some other cases it results in a worse processing by DCLs.

By a sliding window approach combined with the analysis of reads at each position of the genome, a total of 481 sRNA loci are found to be differentially expressed in the rtl2 mutant compared to wt: 183 are RTL2-dependent loci, 298 are RTL2-sensitive loci. These RTL2-targeted sRNA loci produce siRNAs that are mostly dependent on DCL2DCL3DCL4, Pol IV, Pol V and DRM2, indicating the involvement of these sRNA loci in the RdDM process.

Recent works reported that siRNA precursors, named P4R2 RNAs (Pol IV- and RDR2-dependent), preferentially start with an A or a G, and preferentially end with a U (Blevins et al. 2015 and Zhai et al., 2015). The authors found that the RTL2 loci show different 5’ and 3’ nucleobase preferences compared to the total Pol IV loci previously identified. This tendency is reverted to the same behavior observed for the total Pol IV loci in the rtl2 mutant, suggesting that RLT2 modifies the 5’ and 3’ end compositions of the target loci, making them similar to those of the total P4R2 RNAs.

Screen Shot 2016-02-09 at 11.58.55 PMScreen Shot 2016-02-09 at 11.59.00 PM

RTL2 targets mainly TEs and intergenic regions but also protein-coding genes, influencing the DNA methylation and mRNA expression level of the target loci.

The biogenesis of siRNAs is not yet completely understood and these findings suggest that the siRNA precursors might undergo multiple successive processings operated by different proteins. The one precursor one siRNA model might not be valid for all sRNA genes, leaving the question: what are the genetic/epigenetic features of the sRNA genes that differentiate their sRNA precursors production and processing?


NO BAR CHARTS! (Weissgerber et al. 2015)

I came across this article in PLoS Biology (Weissgerber et al. 2015 doi: 10.1371/journal.pbio.1002128) and it very clearly laid out the issues of using barcharts to display continuous data, especially with low sample sizes. Bottom line: barcharts are misleading summaries of the data, and it’s better to show the data themselves with univariate scatter plots or boxplots. Figure 1 from their paper dramatically shows how barcharts can mask very different distributions of the data:

Figure 1:  Weissgerber et al. (2015).

Figure 1: Weissgerber et al. (2015).

Figure 3 from this paper is also very striking. Here we can easily see the fallacy of using standard error of the mean (SEM), or standard deviation (SD) for non-normally distributed, low sample size data:

Figure 3: Weissgerber et al. (2015)

Figure 3: Weissgerber et al. (2015)

Some people in my lab probably remember me complaining about barcharts for qRT-PCR experiments where n=3 or n=6 on one of our recent papers. I prevailed, and we had scatter plots instead of barcharts. Maybe now my motivations are clearer? 🙂

Figure 7D from our recent paper (Coruh et al. 2015 Plant Cell doi: 10.1105/tpc.15.00228) .. note avoidance of barchart!

Figure 7D from our recent paper (Coruh et al. 2015 Plant Cell doi: 10.1105/tpc.15.00228) .. note avoidance of barchart!

A One Precursor One siRNA Model for Pol IV-Dependent siRNA Biogenesis

A One Precursor One siRNA Model for Pol IV-Dependent siRNA Biogenesis (Zhai J, Bischof S, Wang H, Feng S, Lee TF, Teng C, Chen X, Park SY, Liu L, Gallego-Bartolome J, Liu W, Henderson IR, Meyers BC, Ausin I, Jacobsen SE, PMID: 26451488)

In this work the authors demonstrate that the Arabidopsis Pol IV-dependent siRNA precursors, named P4RNAs, are not as long as it was previously assumed: P4RNAs are indeed 30÷40-nt. The characterization of the P4RNAs length and sequence composition give insights to the mechanisms of Pol IV transcription initiation and termination and of DCL processing of the P4RNAs into siRNAs.

P4RNAs are the precursors of Pol IV siRNAs

P4RNAs are 30÷40-nt, as shown by the size distribution of the PATH libraries, and are dependent on both Pol IV and RDR2, suggesting that in vivo the two enzymes work in tight association.

Multiple experiments confirm that these long RNAs are the precursors of siRNAs and not misprocessed siRNAs, for example in the dcl2/3/4 mutant, siRNAs are mainly lost while P4RNAs are increased in abundance but AGO4 still selectively binds to the remaining 22-24-nt siRNAs and not to the longer RNAs. At Pol IV siRNA loci, siRNAs and P4RNAs show positively correlated abundances and interestingly, restricting the analysis on the Pol IV siRNA loci with a strand bias of siRNA accumulation and DNA methylation, the P4RNAs accumulation shows the same strand bias. This result suggests that Pol IV-derived strands, rather than the RDR2-derived strands, are strongly favored to become the final 24-nt siRNAs.

Because of the small length of P4RNAs on average only one 24-nt siRNA is processed by each P4RNA precursor.

P4RNA 5’ end

Pol IV is demonstrated to have retained the same TSS preference from its evolutionary ancestor Pol II (Y/R rule) but the two polymerases are here shown to occupy different genomic territories.

At 5’ end, P4RNAs are enriched in A, as it is known for the siRNAs, and the majority appear to have a 5’ monophosphate: I think this last result was in some way expected because of the cloning technique used to construct the PATH libraries.

P4RNA 3’end

P4RNAs that perfectly match to the genome are shown to have an enrichment of ACU in their three last positions, but more than 50% of the total P4RNAs present mismatches at their 3’ ends and these non-templated P4RNAs have a different nucleotide pattern in their 3’ end. 3’ end mismatches are enriched in CG dinucleotides, being C the last matched base and G the first mismatched base, so where a C is found on the template DNA. The level of nucleotide mismatches at 3’ end is strongly decreased in ddm1/dcl3 compared to dcl3, proving the DNA methylation is influencing the misincorporation of nucleotides by Pol IV. In this model, the DNA cytosine methylation causes the termination of Pol IV transcription to give rise to the short siRNA precursors. What still remains unclear to me is: why exactly after 30÷40-nt? It would be interesting to know what is the frequency of finding a methylated C after a Pol II-like TSS in the genome.

By contrast to the P4RNAs, only 1% of the total siRNAs have mismatches at their 3’ end. This result, together with the shared 5’ A enrichment and strand bias between siRNAs and P4RNAs, suggest that siRNAs are preferentially cleaved from the 5’ portion of their P4RNA precursors.


Another recent work “Identification of Pol IV and RDR2-dependent precursors of 24 nt siRNAs guiding de novo DNA methylation in Arabidopsis” (Blevins T, Podicheti R, Mishra V, Marasco M, Wang J, Rusch D, Tang H, Pikaard CS, PMID: 26430765) confirms the short nature of the siRNA precursors but with a main difference: here, the precursors of siRNAs are found to have a strong preference for a 5’ purine but with similar frequencies for A and G. Compared to precursors with 5’ A, those with 5’ G have 3’ end pattern more similar to that of siRNAs, suggesting that these 5’ G precursors might be processed from their 3’ portion to give rise to siRNAs. It would be interesting to understand why these 5’ G siRNA precursors were not observed in the previous described work.


ShortStack now on iPlant

A big thanks to Nate Johnson for coordinating the loading of ShortStack 3.3 onto iPlant’s Discovery Environment. There is now a ‘public app’ for ShortStack on the discovery environment. Just search ‘ShortStack’ in the apps.

This will allow users to leverage iPlant’s compute resources instead of their local machines to run ShortStack. It also uses a gui interface instead of the command line to run the software.

Anyone who uses this, please let me know! ( .. would love to track how much use the community is getting out of the iPlant app.

CRISPR/Cas9 ‘toolbox’ of vectors for plants

Since today’s seminar speaker here is Dr. Daniel Voytas from Minnesota, I went looking for some of his recent papers. I came across this recent one, from him and his collaborators, reporting a series of vectors optimized for plant transgenesis with various types of CRISPR activities, including multi-plexed targeting and transcriptional activation. At a glance the experiments look convincing, and the vector series is available at Addgene. Folks in my lab who are interested in multi-plexed CRISPR/Cas9 targeting might want to look at this study.

A CRISPR/Cas9 Toolbox for Multiplexed Plant Genome Editing and Transcriptional Regulation. (2015; Plant Physiology. doi: 10.1104/pp.15.00636)


Annotation of miRNAs and phasiRNAs from Norway Spruce

This week I selected a recent article by Xia et al. (2015) titled Extensive Families of miRNAs and PHAS Loci in Norway Spruce Demonstrate the Origins of Complex phasiRNA Networks in Seed Plants” (doi: 10.1093/molbev/msv164, PMID: 26318183) for our journal club discussion. This was a pretty neat paper, with some interesting insights about the phasiRNA and miR482/2118 superfamily network in spruce. I also looked up the Norway spruce genome paper (Nystedt et al. 2013) to check some details about the small RNA seq data, and it seemed like the genome also has some distinct characteristics. Firstly, the coniferous Norway spruce (Picea abies) has a considerably large genome -20 gigabase pairs (the biggest plant genome assayed up to now is a monocot Paris japonica with 150 Gbp genome). Despite its huge size, the number of coding genes in spruce is comparable to that of arabidopsis, which has a much smaller genome ~135 Mbp (Nystedt et al. 2013). The bulk of the spruce genome is contributed by transposons, but the overall frequency of repeat-associated 24 mer small RNAs is also much lower in spruce compared to angiosperms. There were previously some conflicting studies about the presence of 24 mers in spruce, but the small RNA seq data from as many as 22 samples in the Nystedt et al. (2013) study showed that 24 mer small RNAs are indeed expressed in spruce but mainly in reproductive tissues. Based on these small RNA seq datasets, Nystedt et al. (2013) also reported 2,719 de novo miRNA gene annotations in spruce using UEA sRNA workbench tools, which is quite big compared to the typical range (~100 – 750) of miRNA annotations generated in diverse plant species listed in miRBase 21.

Xia et al (2015) reanalyzed the small RNA sequencing data (~352 million reads) published in spruce genome paper for de novo discovery of miRNA and phased siRNA loci. Their relatively more stringent pipeline for de novo annotation generated a much smaller set of 585 miRNA loci than previously reported (Nystedt et al. 2013). Nearly half of these miRNA loci were new without any known homologs in miRBase. Subsequent comparison of the mature sequences with ginkgo small RNAs indicated one-third of the new loci may be conifer-specific. The miRNA discovery pipeline included PatMaN for mapping small RNAs to the spruce genome, which were then filtered based on a set of conditions and then passed on to mireap for de novo annotation (Xia et al. 2015). One of these conditions was selecting only 20-22 mer fraction of the mapped small RNAs for miRNA gene discovery, but it was not clearly explained why the 23-24 mer fraction was discarded. Our analysis of Amborella small RNAs indicated that 24 mer miRNAs are expressed in the basal lineage of angiosperm (Amborella genome project, 2013) so I was interested to see if any such long miRNAs are also present in gymnosperms.

Among the Xia et al. (2015)- reported unique miRNA sequences, 21 mers were most predominant as expected (~46%), followed by 22 mers (20%, 119 miRNAs) and the rest were 20 mers. The most striking result was that spruce showed an expanded miR482/2118 superfamily (26 members) with relatively higher mature miRNA sequence diversity (only half of the positions in the consensus sequence was deeply conserved). These members not only triggered phasiRNAs from hundreds of NB-LRR genes (similar to miR482/2118 function in dicots), but also targeted noncoding transcripts in reproductive tissues (similar to grasses). This led to the authors propose a dual function of miR482/2118 family in the gymnosperms. Some of the spruce miR482 precursors also showed strong evidence of evolutionary emergence from NB-LRR genes. The miR390-TAS3 phasiRNA network in spruce also showed high diversity in miR390 target sites (1 -3 sites per transcript) as well as tasiARF regions. The spruce TAS3 gene family was also the largest one (16 genes) identified to-date. These results indicate an extensive network of miRNA and phasiRNAs in spruce, and it will be really interesting to figure out their actual function. The spruce 24 mer siRNAs were not analyzed in detail by Xia et al (2015), and I am curious about the role of these in heretrochromatin silencing in spruce. The authors suggested that the phased siRNAs may have a role in regulating the transposons in spruce genome, but there is no clear evidence of these targeting the transposons. Another interesting point is how the handling of multi-mappers by different tools affects the number of miRNA discovery. It will be great to compare these annotations with de novo annotations from ShortStack, which has a relatively better approach for assigning multi-mappers compared to PatMan, which I believe keeps all possible locations for multi-mapped reads.


Xia, Rui, et al. “Extensive Families of miRNAs and PHAS Loci in Norway Spruce Demonstrate the Origins of Complex phasiRNA Networks in Seed Plants.” Molecular biology and evolution (2015): msv164.

Nystedt, Björn, et al. “The Norway spruce genome sequence and conifer genome evolution.” Nature 497.7451 (2013): 579-584.

Plant miRNA evolution update

So here’s another figure I’ve prepared for the Plant Cell review article I am writing. This is an update on the patterns of miRNA conservation across land plants according solely to the information present in miRBase release 21.

For this analysis, I first placed each land plant miRNA family in miRBase 21 into one of eight broad plant groups (Eudicots-Rosids, Eudicots-Asterids, Basal Eudicots, Monocots, Basal Angiosperms, Gymnosperms, Lycophytes, and Bryophytes). I then defined a conserved miRNA family as one which had at least one high-confidence annotation, and which was annotated in two or more of these broad groups. By this definition, there are just 36 conserved families out of the more than 2,000 that are currently annotated.


This figure also highlights a major issue in the large-scale analysis of land plant miRNA conservation .. highly unequal sampling density within the different groups. Rosids and monocots have received the most attention, with large numbers of species represented, and high numbers of overall annotations (barcharts at the top). In contrast, basal eudicots, basal angiosperms, lycophytes, and bryophytes are much less well-sampled, with each group represented at present by just one species in miRBase 21. Inferring secondary losses in some of these lineages (basal eudicots, basal angiosperms, lycophytes) is NOT believable.

This chart also illustrates how important the high-quality miRNA annotation set for bryophytes is (the sole species represented is Physcomitrella patens). Based on the presence of high-confidence annotations in both Physcomitrella and one or more angiosperm group, we can confidently say that nine families (miR156, miR160, miR166, miR171, miR319, miR390, miR477, miR529, and miR535) were most likely present in the last common ancestor of all land plants. Another three families (miR167, miR395, and miR408) might also belong in the ultra-conserved set, but they are less certain because their Physcomitrella annotations are not high-confidence. Another two families (miR396 and miR482) clearly predate divergence of all seed plants. Other patterns are less certain because of the frequent presence of annotations that are not (yet) known to be high confidence. I think this again highlights the need for a systematic review of miRNA annotations, based on re-analysis of all available small RNA-seq data with a single, high-confidence MIRNA identification methodology. We are working toward this goal in my lab.