Highly fragmented reference genomes (with thousands or more short contigs or scaffolds) have been a persistent challenge for our small RNA-seq analysis program ShortStack. During a run, ShortStack needs to retrieve genomic sub-sequences for analysis of predicted RNA secondary structure. This is required to identify MIRNA hairpins. Early on I made the decision to use the samtools faidx function as the engine to retrieve genome sub-sequences. This was just pragmatic and lazy .. samtools was already required for other portions of ShortStack’s analysis, and there wasn’t a need to reinvent the wheel. However, when we started to do runs against highly fragmented genome assemblies, we found analysis was very slow. The slowness was traced to the samtools faidx function, which is very sensitive to the number of contigs/references.
The first attempt to fix this issue was in version 3.0, when I introduced the use ‘genome-stitching’. When the reference genome had more than 50 sequences, and some were < 1Mb in size, the references were ‘stitched’ together into larger assemblies of at least 5Mb each. This worked very well to speed up the process of MIRNA searching, but had some disadvantages:
- The end results (bam file, results, etc), were in the coordinate system of the ‘stitched’ genome, not the original. This was quite inconvenient for users.
- The reads had to be aligned to the ‘stitched’ genome, which necessitated building a new bowtie index — this added total time to the run.
- I recently noticed a bug as well .. despite my efforts to prevent it, sometimes de novo small RNA clusters were found that spanned the junctions between ‘stitched’ genome segments.
In release 3.8, I’ve removed the drawbacks of the old stitching approach, while keeping the speed-up. The solution was simple and I’m kicking myself for not thinking about it earlier. Fragmented genomes are still stitched during the run (now into larger chunks, > 50Mb), but the stitched version is used only for sequence lookups. To get a sequence, the ‘real’ coordinates are converted to the ‘stitched’ coordinates, and the stitched genome used for the samtools faidx query. This keeps the speed up in retrieving sequences for folding, but the results are still reported in the original genome’s coordinate system. This approach eliminates all 3 of the problems listed above.
To demonstrate the speed-up, I compared the performance of 3.8 to 3.7.2 run with nostitch with the following conditions, using a highly fragmented reference genome I’m working on with a collaborator at the moment:
- Reference genome: 382,375 scaffolds, totaling 3.75Gb of sequence. bowtie indices pre-built.
- Alignments: 202,147,307 reads (pre aligned in single bam file).
- mincov = 0.5rpm
- 3.7.2 (under nostitch) : 3 hrs, 14 minutes
- 3.8 (improved stitch, still reporting in original coordinates): 1 hr. 13 minutes
So, the performance gain is clear, nearly three times as fast. Because so many new genome assemblies are highly fragmented drafts, this performance enhancement may matter to many users.