Bloom-filter inspired testing of pooled samples (and splitting of swabs!)

In the current situation (the COVID-19 pandemic), a lot has been debated about pooling samples in order to save time, cost, or reagents. It has been suggested that one could pool 32 or 64 samples and test them all at once. If the pooled sample is negative, great: all individual samples are negative. If it's positive, further re-testing is warranted. Of course, different more or less creative versions of pooling are possible, such as pooling rows and diagonals in order to reduce the total number of tests.

The disadvantage of these approaches is that at least some re-testing is pretty much unavoidable. Ideally, one would pool in a manner that requires no re-testing at all. And here is where the idea of hashing, or a chemical Bloom filter, comes in.

Imagine you split your sample and pool it with other samples, but you are smart about how you do this. You distribute each part of each sample on your 96-well plate in a pre-defined manner. That way, based on the combination of positive wells, you can tell exactly which of the original samples were positive. How, you ask? The explanation, code, and the whole original idea by Tomáš Janoušek (@Liskni_si) are described at https://github.com/liskin/covid19-bloom and illustrated below:

schematics of the pooling and detection process
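To make the hashing idea concrete, here is a toy sketch (NOT the original implementation, which lives in the repository above) of how k wells could be derived for each sample by hashing a made-up sample ID:

# toy sketch: derive k=5 well indices (0-95) for a hypothetical sample ID via md5
sample="sample_0042"
k=5
for i in $(seq 1 $k); do
  h=$(echo -n "${sample}:${i}" | md5sum | cut -c1-8)
  echo "pipette part ${i} of ${sample} into well $(( 16#${h} % 96 ))"
done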

When the number of positive samples per 1,000 is low, such as below 1% (fewer than 10 positive samples per 1,000 tested), the number of false positives is remarkably low. For example, with 8/1,000 positive samples and each sample split into 5 parts, a single 96-well PCR run tells you exactly which those 8 positive samples are, plus, in this specific scenario, 1 extra sample that is a false positive (although validating all 8+1 samples is not that labor-intensive and could be done).

the simulation of the bloom-filter inspired sample pooling

There won't be any positive samples you would algorithmically miss (a Bloom filter has no false negatives), and there will be few (if any) false positives.
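The decoding step is then just the Bloom-filter membership test: a sample is called positive if and only if every one of its wells came back positive. A hypothetical sketch, matching the toy hashing above:

# hypothetical decode: call a sample positive iff ALL of its k wells lit up
positive_wells=" 3 17 42 58 90 "   # made-up list of PCR-positive wells
sample="sample_0042"
called_positive=1
for i in $(seq 1 5); do
  h=$(echo -n "${sample}:${i}" | md5sum | cut -c1-8)
  well=$(( 16#${h} % 96 ))
  [[ "${positive_wells}" == *" ${well} "* ]] || called_positive=0
done
(( called_positive )) && echo "${sample} is positive"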

The caveat is the act of pooling itself; while adding 40 or 50 samples to a single tube is definitely easy for a robot, it could be error-prone when done manually. While a human could pool rows or columns relatively easily, asking someone to pipette 1,000 samples, each of them into 3 wells, and in a pre-defined manner on top of that, might simply be too much of an ask. Reducing the number of tested samples to 500 and capping the number of samples per tube gives the following plot:

sample pooling, with max of 10-16 samples per tube

Here, each of the 500 samples is split into three parts, and the number of real positives versus false positives is plotted, while keeping the total number of pooled samples per tube low.

This all brings us to the discussion of the next bottleneck: RNA extraction. In all this pooling, what do we actually pool? Extracted RNA? If that's the case, we still need to do 1,000 extractions for 1,000 samples, and then split those 1,000 extracts into 2, 3, 4, or 5 parts (a parameter that can be chosen). And we know that with all this testing, the extraction kits are running low.

We also know that the viral load in a patient changes as the disease progresses: someone who might have tested negative yesterday could be positive today. So when a swab is taken, the viral load might be low to begin with, without any further amplification step. However, we do amplify, over many cycles. So perhaps it doesn't matter whether we start with 1/3 or 1/4 of the original load. This means that if a swab could be mechanically split, with sufficient viral load still present in each part, one could pool these parts of the original swab.

Thus, for 1,000 samples, only 96 RNA extractions are needed. And we can still confidently say which individual samples are positive. 

In conclusion, for a population of samples with a small percentage of positives, chemical Bloom filters could represent a fast and efficient way to reduce the number of performed tests and save reagents (1,000 samples can be tested using 96 pooled tubes).


The data for the two plots above can be found below (first for 1,000 samples, then for 500 samples):

nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
1              1000  4        41.67          0.00
2              1000  4        41.67          0.01
3              1000  5        52.08          0.03
4              1000  5        52.08          0.07
5              1000  5        52.08          0.19
6              1000  5        52.08          0.40
7              1000  5        52.08          0.78
8              1000  5        52.08          1.29
9              1000  5        52.08          2.12
10             1000  5        52.08          3.23
11             1000  5        52.08          4.97
12             1000  5        52.08          7.25
13             1000  5        52.08          10.00
14             1000  5        52.08          13.27
15             1000  5        52.08          17.13
16             1000  4        41.67          21.30
17             1000  4        41.67          25.41
18             1000  4        41.67          29.96
19             1000  4        41.67          35.00
20             1000  4        41.67          40.67
25             1000  3        31.25          75.94
30             1000  3        31.25          114.93
35             1000  3        31.25          160.06
40             1000  2        20.83          208.18
45             1000  2        20.83          244.67
50             1000  2        20.83          279.05

nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
1              500   3        15.62          0.00
2              500   3        15.62          0.04
3              500   3        15.62          0.17
4              500   3        15.62          0.36
5              500   3        15.62          0.56
6              500   3        15.62          0.93
7              500   3        15.62          1.44
8              500   3        15.62          2.14
9              500   3        15.62          2.96
10             500   3        15.62          3.95
11             500   3        15.62          5.07
12             500   3        15.62          6.32
13             500   3        15.62          7.64
14             500   3        15.62          9.18
15             500   3        15.62          10.99
16             500   3        15.62          12.96
17             500   3        15.62          15.08
18             500   3        15.62          17.34
19             500   3        15.62          19.80
20             500   3        15.62          22.37
25             500   3        15.62          37.95
30             500   3        15.62          57.12
35             500   3        15.62          79.27
40             500   2        10.42          98.58
45             500   2        10.42          115.78
50             500   2        10.42          131.52


Disclaimer: We are not epidemiologists and we are publishing this with an intent to induce further discussions and improvements. Our conclusions are not intended as public-health advice and shouldn’t be regarded as such.

Where have my nanopore reads disappeared?

You might have been so excited to see your fastq reads coming out of the MinION immediately that you might not have (initially) realized that the data you started to analyze are NOT all the data there is! If your computer wasn't able to do all the base-calling on the fly, there will be plenty of data in your skip folder; in our recent run, this was as much as 90% of it!
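A quick way to check whether you are affected (the run path below is hypothetical; adjust it to your own directory layout):

# count all .fast5 files in the run vs the ones that were never base-called
find /path/to/run/fast5 -name "*.fast5" | wc -l
find /path/to/run/fast5/skip -name "*.fast5" | wc -l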

One way to fix this is to catch up with the base-calling using albacore on a Linux machine. In my case, I achieved it using the following commands:

PYTHON_VERSION=3.5
conda create --prefix /biomonika/conda/nanopore_basecalling python=${PYTHON_VERSION} anaconda
source activate nanopore_basecalling
conda update --all
wget https://mirror.oxfordnanoportal.com/software/analysis/ont_albacore-2.2.7-cp35-cp35m-manylinux1_x86_64.whl
pip install ont_albacore-2.2.*.whl

So what does our job-submitting loop look like?

for i in `seq 0 48`; do
  echo $i;
  sbatch basecalling_job.sh /biomonika/nanopore_run/data/reads/20180426_2154_/fast5/skip/${i} ${i};
done;

And these are the insides of basecalling_job.sh:

#!/bin/bash
#SBATCH --job-name=biomonika
#SBATCH --output=biomonika-%j.out
#SBATCH --error=biomonika-%j.err
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=63
#SBATCH --cpus-per-task=1

set -x # print commands as they run
set -e # abort on the first error

input_dir=$1   # folder with the skipped .fast5 files
output_dir=$2  # where the base-called output will go

echo "input dir: " ${input_dir}
echo "output dir: " ${output_dir}

which conda
conda info

source activate /galaxy/home/biomonika/conda/nanopore_basecalling
read_fast5_basecaller.py --flowcell FLO-MIN106 --kit SQK-LSK108 --barcoding --output_format fastq --input ${input_dir} --save_path ${output_dir} --worker_threads 63
echo "Done."

As usual, please comment if you have suggestions or advice. Happy base-calling!

How to work with CRAM files (by converting them to bam files:)

I recently downloaded a trio from the 1000 human genomes project that happened to be in CRAM format. My first impulse was – of course – to convert it into a format I am comfortable with, such as bam.

I quickly found the following command online:

java -jar cramtools-<version>.jar bam --input-cram-file <cram file> --reference-fasta-file <reference fasta file> --output-bam-file <output bam file>

thinking to myself that it must be rather frustrating to develop a tool whose frustrated users desperately push all the buttons just to get to their chubby but lovely bam files as quickly as possible (this is an exaggeration, in case you wondered ;-).

OK, let’s try the conversion:

input=NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

But I got an error message saying something like “cramtools ReferenceSource Downloaded sequence is corrupt” and complaining about a mismatch in md5sums.

Is my reference incorrect? Should I search the internet for the magic file with md5sum of 6aef897c3d6ff0c78aff06ac189178dd that cramtools seems to want?

(@md5sam pointed out to me that you can indeed search by md5sum, in case you wondered: find . -type f -exec md5sum {} + | grep '^6aef897c3d6ff0c78aff06ac189178dd')

OK, so I went back to reading about CRAM files and if you are curious too, you can take a look at this publication: Efficient storage of high throughput DNA sequencing data using reference-based compression.

Turns out, storing pretty much everything in our BAM files is pretty wasteful, even if they are compressed (SAM files are not even compressed, but they are human-readable, which is kind of nice at times). We are all human, and our DNA is extremely similar to each other's. You and I might differ by as few as 3,000,000 SNPs (plus insertions, deletions, SVs, and sometimes more complex rearrangements).

So the idea is this:

Instead of storing the human reference, let's just store where our reads differ from it.

As simple as that.

OK, so what do I do now? The reference must of course always be present, otherwise the data is useless (knowing that something differs, but not what it differs from, isn't particularly helpful). The reference could be reachable online, but you will likely want it on your local system instead. The easiest way to achieve this is probably through these commands:

/path_to_samtools/samtools-1.3.1/misc/seq_cache_populate.pl -root /home/biomonika/reference /home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa

export REF_PATH=/home/biomonika/reference/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s

export REF_CACHE=/home/biomonika/reference/%2s/%2s/%s
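The %2s/%2s/%s pattern means the cache is keyed by the md5sum of each reference sequence, split into two characters, two characters, and the rest. So, assuming the cache has been populated, the md5 from the error message above would resolve like this:

md5=6aef897c3d6ff0c78aff06ac189178dd
ls /home/biomonika/reference/${md5:0:2}/${md5:2:2}/${md5:4}
# -> /home/biomonika/reference/6a/ef/897c3d6ff0c78aff06ac189178dd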

Note that you need a newer version of samtools installed, such as 1.3.1 (older versions could precede CRAM development).

Now, let’s try again:

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

Yeaay – we just got the bam file. The difference in sizes is quite dramatic, though: 4.6 GB for the CRAM versus 10.4 GB for the BAM. (PS: If you don't like cramtools, samtools can do the conversion too.)
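For completeness, here is a sketch of the same conversion with samtools (it detects the CRAM input automatically; -T points to the reference fasta):

samtools view -b -T /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -o ${input}.bam ${input}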

Big thanks to the researchers from the 1000 Genomes Project who put together this very helpful README:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md

Don’t trust read counts (…and identify species based on the mitochondrial sequence)

Recently, we wanted to verify that the sample we have comes from the species we believe it comes from. Since mitochondrial sequences seem like a good marker for such an analysis, I've written the following script for PacBio data. Key parts of the script include:

blasr -nproc 63 -bestn 1 -sam -minPctIdentity 85 $reads_1 $reference_mitochondrion_genome -sa ${reference_mitochondrion_genome}.sa > $SAM

I use blasr to report the best hit for each read, with at least 85% identity.

samtools view -bhS -F 4 -q 30 $SAM > $BAM

I remove unmapped reads (-F 4) and reads with MAPQ lower than 30 (-q 30). Later I also remove all supplementary alignments (-F 2048) and then simply output the species with the highest number of reads.

In the complementary analysis, I retrieve all the reads mapping to my mitochondrial reference from the previous step and use blastn to identify the species with the best hits. I only keep alignments longer than 1,000 bp with a percentage identity of at least 85%.
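The “species with the highest number of reads” step itself can be very short; here is a sketch, assuming the mitochondrial reference is a multi-fasta with one sequence per species and $BAM has already been filtered as above:

# count reads per reference sequence (i.e., per species) and show the top hits
samtools sort -o sorted.bam $BAM
samtools index sorted.bam
samtools idxstats sorted.bam | sort -k3,3nr | head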

Does this seem like too much work for too little gain? Not really. For our sample, if I hadn't done all the quality filtering, I would have ended up with the wrong answer. Why?

I have a list of thousands of species. What if I told you that over 5,000 reads mapped to species B, whereas all other species had much, much lower read counts (more than an order of magnitude lower)? If species A has fewer than 200 reads, is there any doubt that we are indeed looking at species B? There shouldn't be, right?

Let’s look at the plots:

coverage plot for species A

All parts of the mitochondrial genome of species A are covered, although the reads do look noisy, which is completely fine since we are looking at PacBio long reads. Now let’s look at species B:

coverage plot for species B

Interesting, huh? All these abundant reads map to one single region of the mitochondrial genome (circled in red). Here is a closer view of that region:

close-up of the region covered by the abundant reads

The take home message from this is that it’s simply not enough to look only at read counts.

Let me show you another plot with mapping qualities:

histogram of mapping qualities

And now another with the template length TLEN, which is the 9th column in SAM format.

histogram of template lengths (TLEN)

Please note that the Y axis is cut off, because otherwise it would go through the roof and we wouldn't see the blue histogram, which is what we are interested in.

Don't trust the read counts. Always examine the data visually and check the mapping quality, alignment lengths, etc. As I showed at the beginning of this post, even adding very simple quality filters already helps us get it right and identify the species correctly :)

So let us finish with the introductory sentence:

Don’t trust read counts – always look at the data!

Trim now or never – PE and MP adaptor removal

I might have a slight OCD when it comes to dealing with adapters. I like to remove them all 🙂 I will now provide a detailed post about how to remove adapters in PE and MP reads, based on my own experience. (Note that the default set of adaptors provided with Trimmomatic might work just fine for you.)

Summary: depending on the success of your sequencing run, removing adaptors and trimming of PE reads might or might not be important. However, properly dealing with MP reads will have a major impact on your downstream analysis, and you SHOULD pay proper attention to it.

Sanity check

Usually, forward and reverse reads come in two fastq files named R1 and R2 (sometimes they come in one file, but I just deinterleave them since I prefer separate files). First I make sure that both of these files have the same number of sequences (and hence were copied correctly). This can be done with FastQC at the same time you check the quality of sequences and duplication levels. Next, I check that the sequence length equals the quality score length for each entry and validate the files with fastQValidator. I also note the md5sum, a unique hash of the file (even the deletion of one character will alter the md5sum). This might be a bit of an overkill, but better to find out sooner than later. I use a script that looks something like this:

sanity_check.sh


#!/bin/bash

function check_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1.fastq/R2.fastq/g')

echo "Check if length of sequence equals quality score length"
bioawk -c fastx '{if (length($seq)!=length($qual)) print "Offending: " NR}' $R1
bioawk -c fastx '{if (length($seq)!=length($qual)) print "Offending: " NR}' $R2

echo "run FastQ Validator"
fastQValidator --file $R1
fastQValidator --file $R2
echo "Finished validation."
echo "Check md5 sums:";
echo `md5sum $R1`
echo `md5sum $R2`
echo "--------"
}

#validate Fastq file
echo "File: " $1
check_data $1


Adapter detection

Now you want to know which adaptors are in your data. The easiest way is probably to ask the lab person who generated the data. What you will usually observe with Illumina is the indexed adaptor at the beginning of your F reads (plus sometimes some fake polyA stretches that drive me a bit crazy) and the universal adaptor at the end of your R reads. I usually take a look with egrep to see what those sequences look like and how abundant they are:

>PrefixAdapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>PrefixUniversal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

So if you used indexed adaptor 7 from Illumina you would type:

egrep --color "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" R1.fastq

egrep --color "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" R2.fastq

Some adaptors might be present only partially, so this count is rather a lower bound.

example egrep output with adaptor matches highlighted
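To get a feel for the partial adaptors, you can also count matches to just a prefix of the adaptor (a rough sketch; the 20-bp prefix below is an arbitrary choice):

# full-length matches vs matches to only the first 20 bases of the indexed adaptor
fgrep -c "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" R1.fastq
fgrep -c "GATCGGAAGAGCACACGTCT" R1.fastq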

Trimming PAIRED-END reads – THEORY

Now you have an idea of how many adaptors you have. My software of choice is Trimmomatic, because it

  • trims adaptors
  • removes low quality sequences if you wish to do so
  • keeps the order of the reads – I usually choose to work only with pairs where both F and R reads survived trimming

I know that a lot of people run Trimmomatic without checking which types of adaptors they actually have in the data, and it often works. I would recommend reading pages 5 and 6 of the Trimmomatic manual – it turns out you can run your trimming in either simple or palindrome mode.

Here is the summary of that:

“With ‘simple’ trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.

‘Palindrome’ trimming is specifically designed for the case of ‘reading through’ a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are ‘in silico ligated’ onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates ‘read-through’, the forward read is clipped and the reverse read dropped (since it contains no new data).

Naming of the sequences indicates how they should be used. For ‘Palindrome’ clipping, the sequence names should both start with ‘Prefix’, and end in ‘/1’ for the forward adapter and ‘/2’ for the reverse adapter. All other sequences are checked using ‘simple’ mode. Sequences with names ending in ‘/1’ or ‘/2’ will be checked only against the forward or reverse read. Sequences not ending in ‘/1’ or ‘/2’ will be checked against both the forward and reverse read. If you want to check for the reverse-complement of a specific sequence, you need to specifically include the reverse-complemented form of the sequence as well, with another name.” http://www.usadellab.org/cms/?page=trimmomatic

For paired end reads I use palindrome mode:

>PrefixAdapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>PrefixUniversal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

For simple mode I would choose the following adaptor sequences:

>Adapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>Universal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

Sometimes you don't know which adaptor sequences you have; in that case I would check whether any of these are present: MEDIUM_contaminant_list (please note that this list is very incomplete and might not cover new adaptor sequences from Illumina) and check them with the following script:

find_cMEDIUM.sh

#!/bin/bash
file=$1;
echo "FILE PROCESSING: " $file;
rm -f ${file}_MEDIUMadapt_contamination_log.txt; #remove previous logs

while read line; do
  name=$(echo $line | cut -f1 -d' ');
  seq=$(echo $line | cut -f2 -d' ');
  echo -ne $(date) $name "looking for $seq\t" >>${file}_MEDIUMadapt_contamination_log.txt;
  fgrep -c $seq ${file} >>${file}_MEDIUMadapt_contamination_log.txt;
done <MEDIUM_contaminant_list.txt

There are for sure better ways to do this, but since I am running this on every dataset just once, I never bothered to optimize 🙂

Trimming PAIRED-END reads – HANDS ON

I use a script that looks like this:

#!/bin/bash

function trim_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1/R2/g')
adapter_file=$2
java -jar /usr/bin/biomonika/Trimmomatic-0.32/trimmomatic-0.32.jar PE -threads 60 -phred33 -trimlog ${R1}_trimlog $R1 $R2 ${R1}_trimmed.fastq ${R1}_Fsingle.fastq ${R2}_trimmed.fastq ${R1}_Rsingle.fastq ILLUMINACLIP:adaptors_sequences/${adapter_file}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30;
rm ${R1}_Fsingle.fastq ${R1}_Rsingle.fastq;
}

#TRIM PAIRED END READS

trim_data PE_R1.fastq customPalindrome_7_UNI.fa

Finally, I perform a sanity check similar to the one at the beginning, to make sure these files are ready for downstream analysis. In general, I would recommend trimming rather less than more – I have seen people trim 15 bp from the beginning and 5 bp from the end of 100-bp reads because of a small nucleotide bias in the FastQC report, and I don't think it's worth it: you will get fewer uniquely mapped reads, since extremely short reads are prone to mapping to multiple positions. However, really bad reads should be discarded, as well as low-quality bases at the read ends, but that's different from trimming everything without looking at the actual quality scores.

Trimming MATE-PAIR reads

Here things get tricky, since mate-pair reads will contain not only the Illumina adaptors we are all familiar with but also biotinylated adapters. Here is a figure that explains how this works:

illustration of mate-pair library construction

Data Processing of Nextera® Mate Pair Reads on Illumina Sequencing Platforms

So, the DNA gets circularized, thanks to which fragments originally a few kb apart come close together (biotinylated adapters are in green, whereas traditional Illumina adaptors are pink and gray). It all might sound too good to be true, and indeed things can get a bit messy – what if your enrichment wasn't 100% and for some reason you ended up sequencing fragments without biotinylated adapters? Then those fragments would be only 250 bp or so apart, and not a few kb, right? Poor scaffolders dealing with these “fake MP” then 🙂

Here is the figure summarizing the cases that can happen:

figure summarizing the possible read-orientation cases

Data Processing of Nextera® Mate Pair Reads on Illumina Sequencing Platforms

Please take a careful look at the resulting orientation of these reads – PE reads are natively in FR orientation and MP reads in RF orientation (case a). This means that we are going to end up with some “paired-end contamination” in our MP libraries. In our lab, we usually see less than ~1/3 of such contamination, which we consider OK.

plot of the paired-end contamination levels in our MP libraries

 

For the unknown reads, the orientation could not be determined from the position of the biotinylated adaptors, but in our experience these are all MP reads (evaluated by mapping to a reference and plotting the insert size distribution).
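A sketch of that evaluation (file names are hypothetical): map the unknown reads to the reference, then pull out the TLEN (column 9) of properly paired reads and plot its histogram:

# collect positive template lengths for plotting the insert size distribution
samtools view -f 2 unknown_mapped.bam | awk '$9 > 0 {print $9}' > insert_sizes.txt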

I recommend using software called NxTrim to split your reads into these categories – I then recommend you continue with the MP and UNKNOWN subsets for scaffolding/assembly.

The biotinylated adaptors are now gone, so you can go ahead and remove the Illumina adaptors. Please note that NxTrim will reverse-complement reads for you, so they are now going to be in FR orientation just as paired-end reads usually are. This means that you will find the indexed and universal adaptors at rather unusual places.

forward reads (RC means reverse complement; numbers show how many times each adaptor appeared, and where in the read):

file              PrefixAdapter7/1  PrefixUniversal/2  RC PrefixAdapter7/1  RC PrefixUniversal/2
R1_MP.fastq       0                 0                  328669 (end)         0
R1_PE.fastq       1734 (beginning)  0                  1                    0
R1_UNKNOWN.fastq  0                 0                  2697712 (end)        0

reverse reads (RC means reverse complement; numbers show how many times each adaptor appeared, and where in the read):

file              PrefixAdapter7/1  PrefixUniversal/2  RC PrefixAdapter7/1  RC PrefixUniversal/2
R2_MP.fastq       0                 0                  0                    135757 (end)
R2_PE.fastq       0                 798 (beginning)    10                   0
R2_UNKNOWN.fastq  1                 0                  0                    0

It makes sense – an adaptor that was at the beginning of a read before reverse complementing will now end up at the end of the read.

The final code for removing adaptors from reads after NxTrim will then look like this:

#!/bin/bash

function trim_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1.fastq/R2.fastq/g')
adapter_file=$2
java -jar /usr/bin/biomonika/Trimmomatic-0.32/trimmomatic-0.32.jar PE -threads 60 -phred33 -trimlog ${R1}_trimlog $R1 $R2 ${R1}_trimmed.fastq ${R1}_Fsingle.fastq ${R2}_trimmed.fastq ${R1}_Rsingle.fastq ILLUMINACLIP:adaptors_sequences/${adapter_file}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30;
rm ${R1}_Fsingle.fastq ${R1}_Rsingle.fastq;
}

#trim data

#TRIM MATE PAIR READS
trim_data MP_R1_mp.fastq customSimple_RC-7_RC-UNI.fa
trim_data MP_R1_pe.fastq customSimple_7_UNI.fa
trim_data MP_R1_unknown.fastq customSimple_RC-7_RC-UNI.fa

with the following adaptor sequences:

customSimple_RC-7_RC-UNI.fa

>Adapter7RC/1
CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
>UniversalRC/2
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

customSimple_7_UNI.fa

>Adapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>Universal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

Again, if your adaptor counts are really low, you might decide to just trim for quality, since removing 1,000 reads with forgotten adaptors really won't make a difference 🙂

Additional notes

Sometimes it is believed that adapters just magically disappear because mappers and assemblers are smart (well, some of them are, e.g. some mappers will soft-clip these adapter sequences automatically for you, but how can you then tell whether a soft-clipped sequence is adaptor or something biologically meaningful?).
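If you want to see how much soft clipping your mapper did, here is a quick sketch (mapped.bam is a placeholder; column 6 of SAM is the CIGAR string):

# count alignments whose CIGAR string contains a soft clip (S)
samtools view mapped.bam | awk '$6 ~ /S/' | wc -l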

Extract sequence from fasta file at particular positions only

There are thousands of ways to extract sequence from a fasta file, and this is my favorite:

  1. install samtools (http://samtools.sourceforge.net/) (I am sure you have done this already.)
  2. index the reference: samtools faidx reference.fasta
  3. cut out the read/contig/scaffold you are interested in and display only a portion of it, e.g. here basepairs 1-10: samtools faidx reference.fasta contig:1-10
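samtools faidx also accepts several regions at once and writes fasta to stdout, so you can grab multiple pieces in one go (the contig names below are placeholders):

samtools faidx reference.fasta contig1:1-10 contig2:100-200 > subset.fasta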