using slurm to parallelize per chromosome

I recently needed to parallelize variant calling and decided to do it by chromosome (one might argue that chromosomes have unequal sizes and that equal-sized chunks would be better). I therefore needed my script to use 22 processors, one for each chromosome, so the header of my sbatch script looked like this:


#!/bin/bash
#SBATCH -C new                        #constraint: run only on nodes with the "new" feature
#SBATCH --nodes=1                     #keep all tasks on a single node
#SBATCH --ntasks=22                   #one task per chromosome
#SBATCH -t 0                          #a time limit of zero requests no time limit
#SBATCH --mail-type=ALL
#SBATCH --mail-user=biomonika@psu.edu

Now, I need to run 22 instances of my variant caller. I can do it like this:

array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) #chromosomes to use
for chromosome in "${array[@]}"; do
    echo "$chromosome"
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE \
        python naive_variant_caller_for_region_file.py \
        --bam=${var_folder}/${chromosome}_${motif}.bam \
        --index=${var_folder}/${chromosome}_${motif}.bam.bai \
        --reference_genome_filename=${reference} \
        --regions_filename=${var_folder}/${chromosome}_${motif}.bed \
        --output_vcf_filename=${var_folder}/${chromosome}_${out}.vcf &
done
wait #all chromosomes call variants/errors in parallel

This way, all my jobs get submitted, and the wait command makes sure the script won't continue until all chromosomes have finished.
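One caveat worth noting: a bare wait succeeds even if some of the background jobs failed. If you want the script to notice failures, a minimal sketch (run_chromosome is a hypothetical placeholder for the srun line above) is to collect the PIDs and wait on each one:

pids=()
for chromosome in "${array[@]}"; do
    run_chromosome "$chromosome" &   #placeholder for the srun command above
    pids+=($!)                       #remember the PID of this background job
done
for pid in "${pids[@]}"; do
    wait "$pid"                      #returns that job's exit code, so failures can be caught
done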

split large fasta file into multiple fasta files on any computer

You just downloaded your favorite genome. It's a huge fasta file and now you want to do something with it, but running anything on this single file would take forever. You need to split it by chromosomes or contigs, but you are not on your favorite computer where your magic toolbox is installed. Worry no more! As always, UNIX has many built-in tools that can do almost anything, but people rarely talk about them.

csplit panTro5.fa '/>chr.*/' '{*}'
#every sequence is now in its own file, named xx00, xx01, ...
for a in xx*; do echo "$a"; mv "$a" "$(head -1 "$a" | tr -d '>').txt"; done
#we just renamed each file using its first line (the fasta header), stripping the leading >

Of course, if your fasta headers start with anything other than >chr, you would modify the csplit pattern and replace chr with whatever characters your headers start with, as in the sketch below. Luckily, you can use csplit to split any text file, making it a robust and useful tool for bioinformatics, which is being flooded by new file formats as we speak.
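For instance, a couple of variations (assembly.fa is a hypothetical file name):

#headers that start with >scaffold instead of >chr
csplit assembly.fa '/>scaffold/' '{*}'
#or split on every fasta header, whatever follows the >
csplit assembly.fa '/^>/' '{*}'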

on intriguing exit codes of your scripts

I run a lot of scripts (these days using slurm) and, as is often the case in bioinformatics, they fail more often than I would wish for. To be confident that the rest of the pipeline won't continue any time one of my commands returns a non-zero exit code (which suggests a problem), I use the following two lines at the beginning of my scripts:

set -e #causes the shell to exit if any subcommand or pipeline returns a non-zero status (read more about when it is and is not a good idea to use it)
set -x #each command gets printed right before it is executed, so the logfiles become very verbose and arguably easier to debug :)

However, this time something interesting happened. I was running Tandem Repeats Finder and my script kept crashing, although the output looked completely normal and all sequences in my input fasta file had been processed. I subsampled my input into a very tiny file and it happened again, with return code 63, which has no traditional meaning. I cleaned my fasta headers, removed all whitespace and prepared a very neat fasta file, and the script crashed again, leaving hundreds of dot characters in my logfile. Interestingly, when I ran the whole thing from the command line, Tandem Repeats Finder finished by saying "Done" and left no nasty dots in the output.

This time I submitted a script with only 5 sequences, and the returned error code was 5.

I did the same thing with 2 sequences, and the returned error code was 2.

I ran the command again from the command line (which finished with a graceful "Done." and no dots) and typed echo $?

$? stores the exit code of the last executed command, and it was surprisingly non-zero. So although everything "seemed" fine from the command line, it actually wasn't.
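For reference, a quick reminder of how $? behaves:

true
echo $?   #prints 0, the conventional success
false
echo $?   #prints 1, a non-zero failure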

Or was it?

It turns out everything was fine; the software just decided, for some reason, to return the number of sequences as an exit code instead of the 0 one would expect.

Going back to the manual, I found out that with the -ngs parameter, Tandem Repeats Finder finally returns 0 when everything is fine. So this was one of those rare moments when I was being unjustly mistrustful :-)
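For the record, a sketch of the fixed invocation (input.fa is a placeholder and the numeric parameters are the recommended defaults from the TRF manual; with -ngs the results go to standard output, hence the redirect):

trf input.fa 2 7 7 80 10 50 500 -ngs > input.dat
echo $?   #now 0 on success, not the number of sequences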


The standard consensus is to use exit codes to report important errors, not runtime features. There is a set of reserved exit codes with special meanings, as described here. If you decide to use exit codes differently, it is good practice to document that in your manual.
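For instance, a minimal sketch of such self-documented behavior ($input is a placeholder; 66 is EX_NOINPUT from sysexits.h, one common convention):

if [ ! -s "$input" ]; then
    echo "input file missing or empty" >&2
    exit 66   #EX_NOINPUT from sysexits.h; anything non-zero signals failure
fi
exit 0        #reserve 0 for success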

speed up alignments with blasr

Blasr is my favorite aligner for a multitude of reasons: the -m option is very powerful in getting exactly the output formatting I need (finally someone agrees that the lengths of both query and reference should be included in the output :), SAM output is also an option, and reporting only the best n hits is very convenient.

-bestn n (10)
    Report the top 'n' alignments.
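A hypothetical invocation combining the two (file names are placeholders; -m 4 is one of blasr's tabular formats and includes both query and reference lengths; note that newer blasr releases spell these options with double dashes):

blasr reads.fasta reference.fasta -m 4 -bestn 1 > alignments.m4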

When you use blasr often, it's a good idea to index the reference. However, one needs to (redundantly) provide paths to both the reference and the index.

• Use a precomputed suffix array for faster startup:

sawriter hg19.fasta.sa hg19.fasta #first precompute the suffix array
blasr reads.bas.h5 hg19.fasta -sa hg19.fasta.sa

https://github.com/PacificBiosciences/blasr

join two files on common column, report NAN whenever value in second file is missing

Recently I wanted to present a table with information from various sources: some blast results, some RepeatMasker results, some outputs of custom-made scripts. Cool. Let's just put it all in Excel, do some conditional formatting and make it pretty. All I need to do is combine multiple tables into one big, nice table, so CTRL+C and CTRL+V will do the job. An hour later, I felt like this paperclip:

[image: the paperclip (Clippy)]

from @tim_yates https://mobile.twitter.com/tim_yates/status/367297797709504513?p=v

The issue is, some tables are missing some of the values. Some tables are sorted in one fashion and others in another, depending on how naturally or numerically whoever sorted them felt. So let's forget Excel and join them with proper tools. To begin, let's join two tables, each with 2 columns separated by a tab.

join -t $'\t' table1 table2

This simple command joins two sorted tables (e.g. sorted with sort -b) on their common column (by default the first column). It uses the tab character as the separator. The command above is an INNER JOIN: only those rows will be reported for which there is an entry in both files.

[image: Venn diagram of an INNER JOIN; from http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/]

In my case, table2 has fewer rows than table1, but I want to produce all rows from table1. What I need is a LEFT OUTER JOIN. Whenever there is no corresponding value in table2, an empty string will be produced.

[image: Venn diagram of a LEFT OUTER JOIN; from http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/]

This can be achieved with the -a parameter, in our case -a 1 (the first file).

-a file_number    In addition to the default output, produce a line for each
                  unpairable line in file file_number [man join]
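So the left outer join of our two tables becomes:

join -t $'\t' -a 1 table1 table2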

Empty strings are among most people's least favorite strings, so let's now use NAN instead of empty strings for all missing values.

join -t $'\t' -a 1 -e "NAN" table1 table2

Nothing happened. Still, only empty strings are reported instead of the desired "NAN". It turns out you need to be specific about which output you want:

join -t $'\t' -a 1 -e "NAN" -o 0,1.2,2.2 table1 table2

In this simple scenario, 0 outputs the common column (the one we are joining on), while 1.2 and 2.2 output the second column of table1 and table2, respectively.
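To make this concrete, a tiny worked example with hypothetical files (both sorted on the first column, fields separated by tabs):

$ cat table1
chr1	100
chr2	200
chr3	300
$ cat table2
chr1	0.5
chr3	0.9
$ join -t $'\t' -a 1 -e "NAN" -o 0,1.2,2.2 table1 table2
chr1	100	0.5
chr2	200	NAN
chr3	300	0.9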

vigilance needed: 1/10

Happy joining.

Resources:

A Visual Explanation of SQL Joins; http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

join – relational database operator; http://pubs.opengroup.org/onlinepubs/007904975/utilities/join.html