Reading text files with variable number of columns in R.

Do you ever need to read a text file while not knowing what’s the maximum number of columns it can have? Well, I do 🙂 Turns out read.table function only looks at first five columns to determine the number of columns as ?read.table help page says:

 

The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill orblank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

So what’s the workaround? Find out the maximum in advance!

  no_col <- max(count.fields(file, sep = "\t"))

  dataI <- read.table(file,sep="\t",fill=TRUE,header = F,col.names=c("chr", "start", "end", "length",1:no_col))

Since we want to fill the blanks, we will use fill=TRUE function and we may also decide to name our columns numerically as col.names does here. Thus, the columns will be named chr, start, end, length and 1, 2, 3 … maxColumn.

 

using slurm to parallelize per chromosome

I recently needed to parallelize variant calling and decided to do this by chromosome (one might argue that chromosomes have unequal sizes and that equal chunks might be better). Therefore, I needed my script to use 22 processors, one per each chromosome. Hence, the header of my sbatch script looked like this:

 

#!/bin/bash
#SBATCH -C new
#SBATCH --nodes=1
#SBATCH --ntasks=22
#SBATCH -t 0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=biomonika@psu.edu

Now, I need to run 22 instances of my variant caller. I can do it like this:

array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) #chromosomes to use
for chromosome in "${array[@]}"; do 
    echo $chromosome;  
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE python naive_variant_caller_for_region_file.py --bam=${var_folder}/${chromosome}_${motif}.bam --index=${var_folder}/${chromosome}_${motif}.bam.bai --reference_genome_filename=${reference} --regions_filename=${var_folder}/${chromosome}_${motif}.bed --output_vcf_filename=${var_folder}/${chromosome}_${out}.vcf &
done; 
wait #all chromosomes call variants/errors in parallel

This way, all my jobs get submitted and then wait command makes sure I won’t continue until all chromosomes finished.