Why you should always specify colClasses when reading tables in R

Today, I was importing table that looked like this into R:

with a command:

 r<-read.table("passes_stat",colClasses = c("integer","character"))

Why did I put an extra effort in specifying colClasses? That’s because if I wouldn’t, read.table function would have to make an educated guess about what type of the data my columns contain.

This would not only take longer (see the post here), but in this particular case it would also ruin my leading zeroes – see for example line number 5. Additionally, I might have not even noticed that it happened.

 r<-read.table("passes_stat")
 View(r)

How to work with CRAM files (by converting them to bam files:)

I recently downloaded trio from 1000 human genomes that happened to be in a CRAM format. My first impulse was – of course – to convert it into the format I am comfortable with, such as bam.

I quickly found following command online:

java -jar cramtools-<version>.jar bam –input-cram-file <cram file> –reference-fasta-file <reference fasta file> –output-bam-file <output bam file>

thinking to myself that it must be rather frustrating to develop a tool where the frustrated users desperately push all the buttons to get to their chubby, but lovely bam files as quickly as possible (this is exaggeration in case you wondered;-).

OK, let’s try the conversion:

input=NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

But I got error message saying something like “cramtools ReferenceSource Downloaded sequence is corrupt” and complaining about mismatch in md5sums.

Is my reference incorrect? Should I search the internet for the magic file with md5sum of 6aef897c3d6ff0c78aff06ac189178dd that cramtools seems to want?

(@md5sam pointed to me you indeed can search by md5sum in case you wondered: find . -type f -exec md5sum {} + | grep ‘^6aef897c3d6ff0c78aff06ac189178dd’)

OK, so I went back to reading about CRAM files and if you are curious too, you can take a look at this publication: Efficient storage of high throughput DNA sequencing data using reference-based compression.

Turns out, storing pretty much everything in our BAM files is pretty wasteful, even if they are compressed (SAM files are not even compressed but they are human readable which is kind of nice at times). We are all humans and our DNA is extremely similar to each other. You and me might differ just by as few as 3,000,000 SNPs (and then insertions, deletions, SVs and sometimes more complex rearrangments).

So the idea is this:

Instead of storing human reference, let’s just store where our reads differ from it.

As simple as that.

OK, so what do I do now? The reference must be of course always present, otherwise the data is useless (knowing something differs, but not knowing what it differs isn’t particularly helpful. The reference could be reachable online, but you will likely want it to be on your local system instead. The easiest way to achieve this is probably through these commands:

/path_to_samtools/samtools-1.3.1/misc/seq_cache_populate.pl -root /home/biomonika/reference /home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa

export REF_PATH=/home/biomonika/reference/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s

export REF_CACHE=/home/biomonika/reference/%2s/%2s/%s

Note that you need to have version newer version of samtools installed, such as 1.3.1 (older versions could precede CRAM development).

Now, let’s try again:

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

Yeaay – we just got the bam file. Although, the difference in sizes is quite dramatic: 4.6 GB versus 10.4 GB. (PS: If you don’t like cramtools, samtools can do the conversion too)

Big thanks to researchers from 1000genome who put together this very helpful README:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md