Using Slurm to parallelize per chromosome

I recently needed to parallelize variant calling and decided to do it by chromosome (one might argue that chromosomes differ in size, so equal-sized chunks would balance the load better). My script therefore needed 22 processors, one per chromosome, and the header of my sbatch script looked like this:

 

#!/bin/bash
#SBATCH -C new
#SBATCH --nodes=1
#SBATCH --ntasks=22
#SBATCH -t 0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=biomonika@psu.edu
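
Since the loop further down launches one job step per chromosome, it can help to confirm that the allocation really provides at least 22 tasks; as far as I know, steps that do not fit the allocation wait for resources instead of running in parallel. The following is only a minimal sketch: SLURM_NTASKS is the variable Slurm exports inside a job when --ntasks is given, while n_chromosomes, allocated, status and the fallback value are names and assumptions introduced here for illustration.

```shell
#!/bin/bash
n_chromosomes=22    # one job step per chromosome, matching --ntasks=22 in the header

# SLURM_NTASKS is set by Slurm inside the job; the fallback is an assumption
# so that the check can also be tried outside of Slurm.
allocated=${SLURM_NTASKS:-$n_chromosomes}

if [ "$allocated" -ge "$n_chromosomes" ]; then
    status="ok"
else
    status="too few tasks"
fi
echo "allocated=$allocated needed=$n_chromosomes status=$status"
```

In a real sbatch script one would probably exit early when the status is not "ok" rather than just print it.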

Now, I need to run 22 instances of my variant caller. I can do it like this:

array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) #chromosomes to use
for chromosome in "${array[@]}"; do
    echo "$chromosome"
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE \
        python naive_variant_caller_for_region_file.py \
        --bam="${var_folder}/${chromosome}_${motif}.bam" \
        --index="${var_folder}/${chromosome}_${motif}.bam.bai" \
        --reference_genome_filename="${reference}" \
        --regions_filename="${var_folder}/${chromosome}_${motif}.bed" \
        --output_vcf_filename="${var_folder}/${chromosome}_${out}.vcf" &
done
wait #all chromosomes call variants/errors in parallel

This way, all 22 job steps are launched in the background, and the wait command makes sure the script won't continue until every chromosome has finished.
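
One caveat: a bare wait returns 0 even if some of the background commands failed, so a crashed chromosome can go unnoticed. A possible refinement is to remember the PID of each background command and wait on them one by one. The sketch below is an illustration only: call_variants is a hypothetical stand-in for the real srun command (it deliberately fails for chromosome 3), and pids and failed are names made up for this example.

```shell
#!/bin/bash
# Hypothetical stand-in for the real srun call; fails for chromosome 3
# so that the failure check has something to report.
call_variants() {
    local chromosome=$1
    [ "$chromosome" -ne 3 ]
}

pids=()    # indexed by chromosome number, holds the PID of each background job
for chromosome in 1 2 3 4; do
    call_variants "$chromosome" &
    pids[$chromosome]=$!
done

failed=()
for chromosome in "${!pids[@]}"; do
    # "wait PID" returns the exit status of that particular job
    if ! wait "${pids[$chromosome]}"; then
        failed+=("$chromosome")
    fi
done

if [ "${#failed[@]}" -gt 0 ]; then
    echo "failed chromosomes: ${failed[*]}"
else
    echo "all chromosomes finished"
fi
```

In the real script, the call_variants line would be replaced by the srun command from the loop above, and the array would hold all 22 chromosomes.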

About mqm5775 (https://www.biostars.org/u/2884/): Bioinformatics of sequences. Sex chromosomes. Enjoying DNA in my computer. Great ape Y chromosome evolution, specifically heterochromatin variability and analysis of male fertility genes. Creative use of visualization, gene expression of multi-copy gene families, genome assembly. Enthusiastic about learning and applying new technologies: Pacific Biosciences (expert experience), Oxford Nanopore, BioNano Genomics.
