I received an interesting question about how to process very long paired-end RNA-seq reads before TopHat alignment:
“We have a set of paired-end RNA-Seq raw data with Adaptors and Barcode in both Read1 and Read2 sequences. Also, because each read sequence is 150bp long and the library insert size is only 180bp, it is possible that Read1 and Read2 have an overlapped region. How can I deal with this kind of datasets to remove the Adaptors, Barcodes and the overlapped sequences?”
Similar questions have been asked before, so I am posting my answer on the blog to share with everyone!
Below is my answer:
“I hope the following helps.
If your paired reads overlap, TopHat won’t map the pairs in a straightforward manner.
There are a few options here:
(1) Stitch R1 and R2 into one longer read and treat it as if it were a single molecule. This is what we do to make Illumina reads as long as 454 reads in silico (2x300bp pairs turn into reads of up to roughly 600bp).
… However, in your case, the stitched read would be only 30bp longer than a single original read (180bp versus 150bp), so I would not take this option.
(2) You can trim both R1 and R2 to 75bp and run TopHat with the -r 0 option (-r sets the expected inner distance between the mates). Alternatively, you may want to check the insert size computationally (http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics), since the actual insert length may deviate from the 180bp the core reported. You could then trim R1 and R2 to a series of lengths, down to 50bp each, run TopHat with -r values from 0 to 50, and see which combination maps best.
(3) You can align R1 and R2 separately. If the Picard run above shows deviation in the insert length and potential adapter read-through, you will need to trim the adapters (https://www.biostars.org/p/63044/). If the libraries were made strand-specific, you will have to align them separately (http://seqanswers.com/forums/showthread.php?t=64806).
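The arithmetic behind options (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 180bp figure is taken to be the fragment length without adapters, TopHat's -r (--mate-inner-dist) is taken to mean the fragment length minus the two read lengths, and the function names are made up for this post rather than taken from any tool.

```python
# Illustrative helpers only; these names are hypothetical and are not
# part of TopHat, Picard, or any other tool mentioned in this post.

def overlap_and_merged_length(read_len, fragment_len):
    """Return (overlap, merged_len) for one read pair.

    Assumes read_len <= fragment_len, i.e. no adapter read-through.
    """
    overlap = max(0, 2 * read_len - fragment_len)
    merged_len = 2 * read_len - overlap
    return overlap, merged_len


def inner_distance(read_len, fragment_len):
    """Nominal inner distance between mates: fragment minus both reads."""
    return fragment_len - 2 * read_len


def trim_record(record, length):
    """Trim one FASTQ record (header, seq, plus, qual) to a fixed length.

    Sequence and quality string are cut together so they stay in sync.
    """
    header, seq, plus, qual = record
    return header, seq[:length], plus, qual[:length]


if __name__ == "__main__":
    fragment_len = 180  # assumed: fragment length reported by the core

    # Option (1): how much longer does stitching make a 150bp read?
    print(overlap_and_merged_length(150, fragment_len))  # (120, 180)

    # Option (2): nominal inner distance for a few candidate trim lengths
    for read_len in (75, 60, 50):
        print(read_len, inner_distance(read_len, fragment_len))
```

Under these assumptions the stitched read is 180bp, only 30bp longer than a single 150bp read, and the nominal inner distance is 30bp for 75bp reads and 80bp for 50bp reads. Since the real insert length may differ from 180bp, the insert size measured by Picard should guide which -r values are worth trying.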
I hope this helps!”
Thank you so much!