September | 2017 | Bioinformatics Bits and Pieces

Do you ever need to read a text file without knowing the maximum number of columns it can have? Well, I do 🙂 It turns out that the read.table function only looks at the first five lines to determine the number of columns, as the ?read.table help page says:

The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

So what’s the workaround? Find out the maximum in advance!

  # file: path to the tab-separated input file
  no_col <- max(count.fields(file, sep = "\t"))

  dataI <- read.table(file, sep = "\t", fill = TRUE, header = FALSE, col.names = c("chr", "start", "end", "length", 1:no_col))

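Since we want to fill the blanks, we use the fill=TRUE argument, and we can name the columns ourselves through col.names. Because read.table passes the names through make.names by default (check.names=TRUE), the columns end up as chr, start, end, length and X1, X2, X3 … up to X followed by maxColumn. Note that putting four fixed names in front of 1:no_col gives four more columns than the widest row has, and the surplus columns simply stay empty; use 1:(no_col - 4) if you want exactly no_col columns.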
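To see the problem and the workaround side by side, here is a minimal sketch on a throwaway file. The data are made up for illustration, and the names tmp, guessed and fixed are placeholders of mine, not part of the original snippet. It assumes the first four fields are always present, so that no_col is at least 4.

```r
# Build a small tab-separated file whose widest row comes after line 5
# (made-up data, for illustration only).
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "chr1\t100\t200\t100",
  "chr1\t300\t400\t100",
  "chr2\t100\t200\t100",
  "chr2\t300\t400\t100",
  "chr3\t100\t200\t100",
  "chr3\t300\t400\t100\tA\tB"   # 6 fields, but only on line 6
), tmp)

# Default behaviour: the column count is guessed from the first five lines,
# so the two extra fields on line 6 do not get columns of their own.
guessed <- read.table(tmp, sep = "\t", fill = TRUE, header = FALSE)
print(dim(guessed))

# Workaround: count the fields on every line first.
no_col <- max(count.fields(tmp, sep = "\t"))

# Four fixed names plus numbered names for the remaining columns
# (assumes no_col >= 4; seq_len() would fail on a negative count).
fixed <- read.table(tmp, sep = "\t", fill = TRUE, header = FALSE,
                    col.names = c("chr", "start", "end", "length",
                                  seq_len(no_col - 4)))
print(dim(fixed))
print(names(fixed))
```

Comparing dim(guessed) with dim(fixed) shows whether the wide row was read into its own columns. If you would rather keep purely numeric column names, passing check.names = FALSE to read.table should stop them from being rewritten.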

  array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) # chromosomes to use

  for chromosome in "${array[@]}"; do
    echo $chromosome
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE \
      python naive_variant_caller_for_region_file.py \
      --bam=${var_folder}/${chromosome}_${motif}.bam \
      --index=${var_folder}/${chromosome}_${motif}.bam.bai \
      --reference_genome_filename=${reference} \
      --regions_filename=${var_folder}/${chromosome}_${motif}.bed \
      --output_vcf_filename=${var_folder}/${chromosome}_${out}.vcf &
  done
  wait # all chromosomes call variants/errors in parallel


Monthly Archives: September 2017

Reading text files with variable number of columns in R.

Using Slurm to parallelize per chromosome