using slurm to parallelize per chromosome

I recently needed to parallelize variant calling and decided to do it by chromosome (one might argue that chromosomes have unequal sizes and that equal-sized chunks would be better). I therefore needed my script to use 22 processors, one per chromosome, so the header of my sbatch script looked like this:

 

#!/bin/bash
#SBATCH -C new
#SBATCH --nodes=1
#SBATCH --ntasks=22
#SBATCH -t 0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=biomonika@psu.edu

Now, I need to run 22 instances of my variant caller. I can do it like this:

array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) #chromosomes to use
for chromosome in "${array[@]}"; do 
    echo $chromosome;  
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE python naive_variant_caller_for_region_file.py --bam=${var_folder}/${chromosome}_${motif}.bam --index=${var_folder}/${chromosome}_${motif}.bam.bai --reference_genome_filename=${reference} --regions_filename=${var_folder}/${chromosome}_${motif}.bed --output_vcf_filename=${var_folder}/${chromosome}_${out}.vcf &
done; 
wait #all chromosomes call variants/errors in parallel

This way, all my jobs get submitted, and the wait command makes sure the script won’t continue until all chromosomes have finished.
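While those srun steps are running, you can quickly check that all 22 tasks were actually launched; a small sketch (squeue's -s flag lists job steps, the user name is just a placeholder):

squeue -u biomonika -s   #list the individual job steps of your running jobs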

split large fasta file into multiple fasta files on any computer

You just downloaded your favorite genome. It’s a huge fasta file and now you want to do something with it, but running anything on this single file would take forever. You need to split it by chromosomes or contigs, but you are not on your favorite computer where your magic toolbox is installed. Worry no more! As always, UNIX has many built-in tools that can do almost anything, but people rarely talk about them.

csplit panTro5.fa '/>chr.*/' '{*}'
#every sequence now sits in its own file named xx00, xx01, ...
for a in x*; do echo $a; mv $a "$(head -1 $a).txt"; done;
#we just renamed each file using its first line (the fasta header)

Of course, if your fasta headers start with anything other than >chr, you would modify the csplit command and replace chr with whatever characters your headers start with. Luckily, you can use csplit to split any text file, making it a robust and useful tool for a field that is being flooded by new file formats as we speak.
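If your headers don’t share a common prefix, a slightly more generic version of the same idea might look like this (a hedged sketch, assuming GNU csplit with the -z flag to skip empty pieces; stripping the leading ">" keeps the new file names clean):

csplit -z genome.fa '/>/' '{*}'
#every record now sits in its own file named xx00, xx01, ...
for a in xx*; do
    name=$(head -1 "$a" | tr -d '>' | cut -d' ' -f1) #header without '>' and without the description
    mv "$a" "${name}.fa"
done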

ParalleLAZYing in R

I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created an output for each motif separately.

This is a wonderful candidate situation for parallelization since:

  • I can process each motif independently
  • I can assign each motif to a single processor
  • Workers/processes don’t have to communicate
  • They write their output to separate files (not a shared one, so they don’t have to fight over access to it)

I quickly googled the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:

library(parallel)

motif_list <- c("CGG","CGGT","CG","CGT") #my loooong list of motifs, shortened for illustration purposes

# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores, type="FORK")

print("PARALLEL")
ptm <- proc.time() #start timer
parLapply(cl, motif_list,
  function(motif)
    processMotif(motif) #PROCESS EACH MOTIF INDIVIDUALLY
)
stopCluster(cl)
proc.time() - ptm #stop timer

print("NON-PARALLEL")
ptm <- proc.time() #start timer
for (motif in motif_list) {
  processMotif(motif)
}
proc.time() - ptm #stop timer

Let’s check how much time this chunk takes to run:

   user  system elapsed 
908.340 300.906 432.906

And let’s compare that with the for-loop solution:

   user  system elapsed
8544.826 3079.385 6089.453

Happy paralleLAZYing!

(The code was run on GNU/Linux with 64 processors and 500GB RAM.)

On intriguing exit codes of your scripts

I run a lot of scripts (these days using slurm) and, as is often the case in bioinformatics, they fail more often than I would wish for. To be confident that the rest of the pipeline won’t continue whenever any of my commands returns a non-zero exit code (which suggests a problem), I use the following two lines at the beginning of my scripts:

set -e causes the shell to exit if any subcommand or pipeline returns a non-zero status (read more here about when it is, and isn’t, a good idea to use it)
set -x prints each command right before it is executed, so the logfiles become very verbose and arguably easier to debug
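At the top of a script this simply looks like:

#!/bin/bash
set -e #abort on the first command that exits with a non-zero status
set -x #echo each command before running it, so the log shows exactly what ran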

However, this time an interesting thing happened. I was running Tandem Repeats Finder and my script kept crashing, although the output looked completely normal and all sequences in my input fasta file were processed. I subsampled my input into a very tiny file and it happened again, with return code 63, which doesn’t have any traditional meaning. I cleaned my fasta headers, removed all whitespace and prepared a very neat fasta file, and the script crashed again, leaving hundreds of dot characters in my logfile. Interestingly, when I ran the whole thing from the command line, Tandem Repeats Finder finished by saying “Done” and left no nasty dots in the output.

This time I submitted a script with only 5 sequences, and the returned error code was 5.

I did the same thing with 2 sequences, and the returned error code was 2.

I ran the command again from the command line (it finished with a graceful “Done.” and no dots) and typed echo $?

$? stores the exit status of the last executed command, and it was surprisingly non-zero. So although it “seemed” that everything was fine from the command line, it actually wasn’t.

Or was it?

It turns out everything was fine; the software had simply decided, for some reason, to return the number of sequences as its exit code instead of the 0 one would expect.

Going back to the manual, I found out that with the -ngs parameter, Tandem Repeats Finder finally returns 0 when everything is fine. So this was one of those rare moments when I was being unjustly mistrustful :-)


The general consensus is to use exit codes to report important errors, not runtime statistics. There is a set of reserved exit codes with special meanings, as described here. If you decide to use exit codes differently, it is good practice to document that in the manual.
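If you are stuck with a tool that reports something other than errors through its exit code, you can capture the status explicitly instead of letting set -e abort the whole pipeline. A minimal sketch (some_tool is a placeholder for whatever command misbehaves):

set -e
some_tool input.fa > output.txt || status=$? #the || prevents set -e from killing the script
echo "exit status: ${status:-0}" #for Tandem Repeats Finder without -ngs, this may just be the sequence count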

Don’t trust read counts (..and identify species based on the mitochondrial sequence)

Recently, we wanted to verify that the sample we have comes from the species we believe it comes from. Since mitochondrial sequences seem like a good marker for such an analysis, I’ve written the following script for PacBio data. Key parts of the script include:

blasr -nproc 63 -bestn 1 -sam -minPctIdentity 85 $reads_1 $reference_mitochondrion_genome -sa ${reference_mitochondrion_genome}.sa > $SAM

I use blasr to report the best hit for each read with at least 85% identity.

samtools view -bhS -F 4 -q 30 $SAM > $BAM

I remove unmapped reads and reads with a MAPQ lower than 30. Later I also remove all supplementary alignments (-F 2048) and then simply output the species with the highest number of reads.

In a complementary analysis, I retrieve all reads mapping to my mitochondrial reference from the previous step and use blastn to identify the species with the best hits. I only keep alignments longer than 1,000 bp with a percentage identity of at least 85%.
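The filtering and counting steps described above could look roughly like this (a sketch only; file names are placeholders):

samtools view -b -F 2048 $BAM > primary.bam #drop supplementary alignments
samtools view primary.bam | cut -f3 | sort | uniq -c | sort -rn | head #reads per reference sequence, most abundant species first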

Does this seem like too much work for too little gain? Not really. For our sample, if I hadn’t done all the quality filtering, I would have ended up with the wrong answer. Why?

I have a list of thousands of species. What if I told you that over 5,000 reads mapped to species B, whereas all other species had much, much lower read counts (more than an order of magnitude lower)? If species A has fewer than 200 reads, is there any doubt that we are indeed looking at species B? There shouldn’t be, right?

Let’s look at the plots:

species A

[coverage plot: reads mapped along the mitochondrial genome of species A]

All parts of the mitochondrial genome of species A are covered, although the reads do look noisy, which is completely fine since we are looking at PacBio long reads. Now let’s look at species B:

[coverage plot: reads mapped along the mitochondrial genome of species B]

 

Interesting, huh? All these abundant reads map to one single region of the mitochondrial genome (circled in red). Here is a closer view of that region:

[close-up of the single covered region of the species B mitochondrial genome]

The take-home message is that it’s simply not enough to look at read counts alone.

Let me show you another plot with mapping qualities:

[plot of mapping qualities for the reads in question]

And now another with the template length TLEN, which is the 9th column in SAM format.

[histogram of template lengths (TLEN)]

Please note that the Y axis is cut off, because otherwise it would go through the roof and we wouldn’t see the blue histogram, which is what we are interested in.

Don’t trust the read counts. Always examine the data visually and check the mapping quality, alignment lengths, etc. As I showed at the beginning of this post, even adding very simple quality filters already helps us get it right and identify the species correctly :)
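Pulling those numbers straight out of the BAM file is quick; a rough sketch with a placeholder file name:

samtools view sample.bam | cut -f5 | sort -n | uniq -c #distribution of mapping qualities (column 5, MAPQ)
samtools view sample.bam | cut -f9 | sort -n | uniq -c #distribution of template lengths (column 9, TLEN)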

So let us finish with the same sentence we started with:

Don’t trust read counts – always look at the data!

Trim now or never – PE and MP adaptor removal

I might have a slight OCD when it comes to dealing with adapters. I like to remove them all 🙂 I will now provide a detailed post about how to remove adapters from PE and MP reads, based on my own experience. (Note that the default set of adaptors provided with Trimmomatic might work just as well for you.)

Summary: depending on the success of your sequencing run, removing adaptors and trimming PE reads might or might not be important. However, properly dealing with MP reads will have a major impact on your downstream analysis, and you just SHOULD pay proper attention to it.

Sanity check

Usually, forward and reverse reads come in two fastq files named R1 and R2 (sometimes they come in one file, but I just deinterleave them since I prefer separate files; a small deinterleaving sketch follows after the script below). First I make sure that both files have the same number of sequences (and hence were copied correctly). This can be done with FastQC at the same time you check the quality of sequences and duplication levels. Next, I check that the sequence length equals the quality score length for each entry and validate the files with fastQValidator. I also note the md5sum, a unique hash of the file (even the deletion of a single character will alter the md5sum). This might be a bit of an overkill, but better to find out sooner than later. I use a script that looks something like this:

sanity_check.sh


#!/bin/bash

function check_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1.fastq/R2.fastq/g')

echo "Check if length of sequence equals quality score length"
bioawk -c fastx '{if (length($seq)!=length($qual)) print "Offending: " NR}' $R1
bioawk -c fastx '{if (length($seq)!=length($qual)) print "Offending: " NR}' $R2

echo "run FastQ Validator"
fastQValidator --file $R1
fastQValidator --file $R2
echo "Finished validation."
echo "Check md5 sums:";
md5sum $R1
md5sum $R2
echo "--------"
}

#validate Fastq file
echo "File: " $1
check_data $1
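As mentioned above, when the reads arrive interleaved in a single file, I deinterleave them first. A minimal sketch, assuming standard 4-line fastq records and placeholder file names:

paste - - - - - - - - < interleaved.fastq \
  | tee >(cut -f1-4 | tr '\t' '\n' > sample_R1.fastq) \
  | cut -f5-8 | tr '\t' '\n' > sample_R2.fastq
#paste glues each read pair (8 lines) onto one tab-separated line; the two cuts split it back into R1 and R2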


Adapter detection

Now you want to know which adaptors are in your data. The easiest way is probably to ask the lab person who generated the data. What you will usually observe with Illumina is the indexed adaptor at the beginning of your F reads (plus sometimes some fake polyA stretches that drive me a bit crazy) and the universal adaptor at the end of your R reads. I usually use egrep to take a look at what those sequences look like and how abundant they are:

>PrefixAdapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>PrefixUniversal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

So if you used indexed adaptor 7 from Illumina you would type:

egrep --color "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" R1.fastq

egrep --color "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" R2.fastq

Some adaptors might be present only partially, so this will rather be a lower bound.
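One way to catch partial adaptors as well is to count only a prefix of the adaptor sequence, for example its first 20 bp:

fgrep -c "GATCGGAAGAGCACACGTCT" R1.fastq #first 20 bp of the indexed adaptor shown above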

[screenshot: egrep highlighting adaptor matches in the reads]

Trimming PAIRED-END reads – THEORY

Now you have an idea of how many adaptors you have. My software of choice is Trimmomatic, because it

  • trims adaptors
  • removes low quality sequences if you wish to do so
  • keeps the order of the reads – I usually choose to work only with pairs where both the F and the R read survived trimming

I know that a lot of people run Trimmomatic without checking which types of adaptors they actually have in the data, and it often works. I would recommend reading pages 5 and 6 of the Trimmomatic manual – it turns out you can run your trimming in either simple or palindrome mode.

Here is the summary of that:

“With ‘simple’ trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.

‘Palindrome’ trimming is specifically designed for the case of ‘reading through’ a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are ‘in silico ligated’ onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates ‘read-through’, the forward read is clipped and the reverse read dropped (since it contains no new data).

Naming of the sequences indicates how they should be used. For ‘Palindrome’ clipping, the sequence names should both start with ‘Prefix’, and end in ‘/1’ for the forward adapter and ‘/2’ for the reverse adapter. All other sequences are checked using ‘simple’ mode. Sequences with names ending in ‘/1’ or ‘/2’ will be checked only against the forward or reverse read. Sequences not ending in ‘/1’ or ‘/2’ will be checked against both the forward and reverse read. If you want to check for the reverse-complement of a specific sequence, you need to specifically include the reverse-complemented form of the sequence as well, with another name.” http://www.usadellab.org/cms/?page=trimmomatic

For paired end reads I use palindrome mode:

>PrefixAdapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>PrefixUniversal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

For simple mode I would choose the following adaptor sequences:

>Adapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>Universal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

If you don’t know which adaptor sequences you have, I would check whether any of these are present: MEDIUM_contaminant_list (please note that this list is very incomplete and might not cover new adaptor sequences from Illumina) and check for them with the following script:

find_cMEDIUM.sh

#!/bin/bash
file=$1;
echo "FILE PROCESSING: " $file;
rm -f ${file}_MEDIUMadapt_contamination_log.txt; #remove previous logs

while read line; do
    name=$(echo $line | cut -f1 -d' ');
    seq=$(echo $line | cut -f2 -d' ');
    echo -ne "$(date) $name looking for $seq\t" >>${file}_MEDIUMadapt_contamination_log.txt;
    fgrep -c $seq ${file} >>${file}_MEDIUMadapt_contamination_log.txt;
done <MEDIUM_contaminant_list.txt

There are surely better ways to do this, but since I run this on every dataset just once, I never bothered to optimize 🙂

Trimming PAIRED-END reads – HANDS ON

I use a script that looks like this:

#!/bin/bash

function trim_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1/R2/g')
adapter_file=$2
java -jar /usr/bin/biomonika/Trimmomatic-0.32/trimmomatic-0.32.jar PE -threads 60 -phred33 -trimlog ${R1}_trimlog $R1 $R2 ${R1}_trimmed.fastq ${R1}_Fsingle.fastq ${R2}_trimmed.fastq ${R1}_Rsingle.fastq ILLUMINACLIP:adaptors_sequences/${adapter_file}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30;
rm ${R1}_Fsingle.fastq ${R1}_Rsingle.fastq;
}

#TRIM PAIRED END READS

trim_data PE_R1.fastq customPalindrome_7_UNI.fa

Finally, I perform a similar sanity check as at the beginning to make sure that these files are ready for downstream analysis. In general, I would recommend trimming less rather than more – I have seen people trim 15 bp from the beginning and 5 bp from the end of 100 bp reads because of a small nucleotide bias in the FastQC report, and I don’t think it’s worth it: you will get fewer uniquely mapped reads, since extremely short reads are prone to mapping to multiple positions. However, really bad reads should be discarded, as should low quality bases at the read ends, but that’s different from trimming everything without looking at the actual quality scores.

Trimming MATE-PAIR reads

Here things get tricky, since mate-pair reads contain not only the Illumina adaptors we are all familiar with but also biotinylated adapters. Here is a figure that explains how this works:

 

[figure: Nextera mate-pair library construction]

 

Data Processing of Nextera® Mate Pair Reads on Illumina Sequencing Platforms

So, the DNA gets circularized, thanks to which fragments originally a few kb apart come close together (biotinylated adapters are in green, whereas traditional Illumina adaptors are pink and gray). It all might sound too good to be true, and indeed things can get a bit messy – what if your enrichment wasn’t 100% and for some reason you ended up sequencing fragments without biotinylated adapters? Then those fragments would be only 250 bp or so apart and not a few kb, right? Poor scaffolders dealing with these “fake MP” then 🙂

Here is a figure summarizing the cases that can happen:

[figure: possible read configurations in a Nextera mate-pair library]

Data Processing of Nextera® Mate Pair Reads on Illumina Sequencing Platforms

Please take a careful look at the resulting orientation of these reads – PE reads are natively in FR orientation and MP reads in RF orientation (case a). This means that we are going to end up with some “paired-end contamination” in our MP libraries. In our lab, we usually see less than ~1/3 of such contamination, which we consider OK.

[figure: paired-end contamination in the MP library]

 

For the unknown reads, the orientation couldn’t be determined from the position of the biotinylated adaptors, but in our experience these are all MP reads (evaluated by mapping to a reference and plotting the insert size distribution).

I recommend using the software NxTrim to split your reads into these categories – I then recommend continuing with the MP and UNKNOWN subsets for scaffolding/assembly.
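A typical NxTrim invocation looks roughly like this (a sketch only; check nxtrim --help for the exact flags and output naming of your version, the file names are placeholders):

nxtrim -1 MP_R1.fastq.gz -2 MP_R2.fastq.gz -O MP
#splits the library into mp, pe, se and unknown subsets named after the -O prefix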

The biotinylated adaptors are now gone, so you can go ahead and remove the Illumina adaptors. Please note that NxTrim will reverse complement reads for you, so they are now going to be in FR orientation, just as paired-end reads usually are. This means that you will find indexed and universal adaptors at rather unusual places.

Forward reads (RC means reverse complement; numbers show how many times each adaptor was found, with its position in the read in parentheses):

                   PrefixAdapter7/1    PrefixUniversal/2   RC PrefixAdapter7/1   RC PrefixUniversal/2
R1_MP.fastq        0                   0                   328669 (end)          0
R1_PE.fastq        1734 (beginning)    0                   1                     0
R1_UNKNOWN.fastq   0                   0                   2697712 (end)         0

Reverse reads (RC means reverse complement; numbers show how many times each adaptor was found, with its position in the read in parentheses):

                   PrefixAdapter7/1    PrefixUniversal/2   RC PrefixAdapter7/1   RC PrefixUniversal/2
R2_MP.fastq        0                   0                   0                     135757 (end)
R2_PE.fastq        0                   798 (beginning)     10                    0
R2_UNKNOWN.fastq   1                   0                   0                     0

It makes sense – an adaptor that was at the beginning of a read before reverse complementing is now going to end up at the end of the read.

The final code for removing adaptors from reads after NxTrim will then look like this:

#!/bin/bash

function trim_data {
R1=$1;
R2=$(echo $R1 | sed 's/R1.fastq/R2.fastq/g')
adapter_file=$2
java -jar /usr/bin/biomonika/Trimmomatic-0.32/trimmomatic-0.32.jar PE -threads 60 -phred33 -trimlog ${R1}_trimlog $R1 $R2 ${R1}_trimmed.fastq ${R1}_Fsingle.fastq ${R2}_trimmed.fastq ${R1}_Rsingle.fastq ILLUMINACLIP:adaptors_sequences/${adapter_file}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30;
rm ${R1}_Fsingle.fastq ${R1}_Rsingle.fastq;
}

#trim data

#TRIM MATE PAIR READS
trim_data MP_R1_mp.fastq customSimple_RC-7_RC-UNI.fa
trim_data MP_R1_pe.fastq customSimple_7_UNI.fa
trim_data MP_R1_unknown.fastq customSimple_RC-7_RC-UNI.fa

with the following adaptor sequences:

customSimple_RC-7_RC-UNI.fa

>Adapter7RC/1
CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
>UniversalRC/2
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

customSimple_7_UNI.fa

>Adapter7/1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>Universal/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
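
If you need to build the reverse-complemented adaptor file yourself, a one-liner like this will do (shown here with the Adapter7 sequence from above):

echo "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG" | rev | tr 'ACGT' 'TGCA'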

Again, if your adaptor counts are really low, you might decide to just trim for quality, since removing 1,000 reads with forgotten adaptors really won’t make a difference 🙂

Additional notes

Sometimes it is believed that adapters just magically disappear because mappers and assemblers are smart (well, some of them are, e.g. some mappers will soft-clip these adapter sequences automatically for you, but how can you then tell whether a soft-clipped sequence is an adaptor or something biologically meaningful?).

speed up alignments with blasr

Blasr is my favorite aligner for a multitude of reasons – the -m option is very powerful for getting exactly the output formatting I need (finally someone agrees that the lengths of both the query and the reference should be included in the output :), SAM is also an option, and reporting only the best n hits is very convenient.

-bestn n (10)

               Report the top ‘n’ alignments.

When you use blasr often, it’s a good idea to index the reference. However, one needs to (redundantly) provide paths to both the reference and the index.

  • Use a precomputed suffix array for faster startup
    • > sawriter hg19.fasta.sa hg19.fasta #First precompute the suffix array
    • > blasr reads.bas.h5 hg19.fasta -sa hg19.fasta.sa

https://github.com/PacificBiosciences/blasr
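Putting the pieces together, a typical run with the suffix array, best-hit filtering and tabular output might look like this (a sketch using the older PacBio-style blasr options seen elsewhere in this post; file names are placeholders):

sawriter ref.fasta.sa ref.fasta #precompute the suffix array once
blasr reads.fasta ref.fasta -sa ref.fasta.sa -nproc 8 -bestn 1 -m 4 > hits.m4 #tabular output, best hit only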

Extract sequence from fasta file at particular positions only

There are thousands of ways to extract a sequence from a fasta file, and this is my favorite:

  1. install samtools (http://samtools.sourceforge.net/) (I am sure you have done this already.)
  2. index the reference:
     samtools faidx reference.fasta
  3. cut out the read/contig/scaffold you are interested in and display only a portion of it, e.g. here basepairs 1-10:
     samtools faidx reference.fasta contig:1-10
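The same command also accepts whole sequences or several regions at once (names here are placeholders):

samtools faidx reference.fasta contig #the entire sequence
samtools faidx reference.fasta contig:1-10 contig2:100-200 > regions.fa #several regions in one go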

join two files on common column, report NAN whenever value in second file is missing

Recently I wanted to present a table with information from various sources – some blast results, some RepeatMasker results, some outputs of custom-made scripts. Cool. Let’s just put it all in Excel, do some conditional formatting and make it pretty. All I need to do is combine multiple tables into one big nice table, so CTRL+C and CTRL+V will do the job. An hour later, I felt like this paperclip:

[image: the Office Assistant paperclip]

from @tim_yates https://mobile.twitter.com/tim_yates/status/367297797709504513?p=v

The issue is that some tables are missing some of the values. Some tables are sorted in one fashion and others in another, based on how naturally or numerically whoever sorted them felt. So let’s forget Excel and join them with proper tools. To begin, let’s join two tables, each with 2 columns separated by a tab.

join -t $'\t' table1 table2

This simple command joins two sorted tables (e.g. sorted with sort -b) based on a common column (by default the first column). It uses a tab as the separator. The command above is an INNER JOIN: only those rows are reported for which there is an entry in both files.

[Venn diagram of an inner join, from http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/]

In my case, table2 has fewer rows than table1, but I want to produce all rows from table1. What I need is a LEFT OUTER JOIN. Whenever there is no corresponding value available in table2, an empty string will be produced.

[Venn diagram of a left outer join, from http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/]

This can be achieved with the -a parameter, in our case -a 1 (the first file).

     -a file_number    In addition to the default output, produce a line for each unpairable line in file file_number [man join]

Empty strings are hardly anyone’s favorite strings, so let’s now use NAN instead of empty strings for all missing values.

join -t $'\t' -a 1 -e "NAN" table1 table2

Nothing happened. Still, only empty strings are reported instead of the desired “NAN”. It turns out you need to be specific about which output you want:

join -t $'\t' -a 1 -e "NAN" -o 0,1.2,2.2 table1 table2

In this simple scenario, 0 outputs the common column (the one we are joining on), while 1.2 and 2.2 output the second column of table1 and table2, respectively.
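A tiny worked example with made-up values shows the effect:

printf 'geneA\t10\ngeneB\t20\ngeneC\t30\n' | sort -b > table1
printf 'geneA\t0.5\ngeneC\t0.9\n' | sort -b > table2
join -t $'\t' -a 1 -e "NAN" -o 0,1.2,2.2 table1 table2
#geneA   10    0.5
#geneB   20    NAN
#geneC   30    0.9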

vigilance needed: 1/10

Happy joining.

Resources:

A Visual Explanation of SQL Joins; http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

join – relational database operator; http://pubs.opengroup.org/onlinepubs/007904975/utilities/join.html