Don’t trust read counts (..and identify species based on the mitochondrial sequence)

Recently, we wanted to verify that the sample we have comes from the species we believe it comes from. Since mitochondrial sequences seem like a good marker for such analysis, I’ve written following script for PacBio data. Key parts of the script include:

blasr -nproc 63 -bestn 1 -sam -minPctIdentity 85 $reads_1 $reference_mitochondrion_genome -sa ${reference_mitochondrion_genome}.sa > $SAM    
I use blasr to report best hit for reads with at least 85 % identity.    
samtools view -bhS -F 4 -q 0 $SAM > $BAM    
I remove unmapped reads and reads with lower MAPQ quality than 30. Later I also remove all supplementary alignments (-F 2048) and then I simply output species with the highest number of the reads.  
In the complementary analysis, I retrieve all reads mapping to my mitochondrial reference from the previous step and use blastn to identify species with the best hits. I only keep alignments longer than 1,000 bp and percentage identity of at least 85 %.

Does this seem like too much work for too little gain? Not really. For our sample, if I wouldn’t do all the quality filtering, I would end up with the wrong answer. Why?

I have a list of thousands of species. What if I told you that over 5,000 reads mapped to species B, whereas all other species had much, much lower read counts (more than an order of the magnitude lower). If species A has less than 200 reads, is there any doubt that we are indeed looking at species B? There shouldn’t be, right?

Let’s look at the plots:

species A

Screen Shot 2015-10-23 at 2.12.41 PM

All parts of the mitochondrial genome of species A are covered, although the reads do look noisy, which is completely fine since we are looking at PacBio long reads. Now let’s look at species B:

Screen Shot 2015-10-23 at 2.30.47 PM

 

Interesting, huh? All these abundant reads map only to one single region of the mitochondrial genome (circled in the red). Here is the closer view of that region:

Screen Shot 2015-10-23 at 2.18.20 PM

The take home message from this is that it’s simply not enough to look only at read counts.

Let me show you another plot with mapping qualities:

Screen Shot 2015-10-23 at 2.20.31 PM

And now another with template length TLEN, which is 9th column in SAM format.

Screen Shot 2015-10-23 at 2.22.16 PM

Please note that the Y axis is cut because otherwise it would go to the roof and we wouldn’t see blue histogram which is what we are interested in.

Don’t trust the read counts. Always examine the data visually and check for the mapping quality, alignment lengths, etc. As I showed at the beginning of this post, even adding very simple quality filters already helps us get it right and identify the species correctly:)

So let us finish with the introductory sentence:

Don’t trust read counts – always look at the data!