I recently downloaded trio from 1000 human genomes that happened to be in a CRAM format. My first impulse was – of course – to convert it into the format I am comfortable with, such as bam.
I quickly found following command online:
java -jar cramtools-<version>.jar bam –input-cram-file <cram file> –reference-fasta-file <reference fasta file> –output-bam-file <output bam file>
thinking to myself that it must be rather frustrating to develop a tool where the frustrated users desperately push all the buttons to get to their chubby, but lovely bam files as quickly as possible (this is exaggeration in case you wondered;-).
OK, let’s try the conversion:
input=NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram
java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;
But I got error message saying something like “cramtools ReferenceSource Downloaded sequence is corrupt” and complaining about mismatch in md5sums.
Is my reference incorrect? Should I search the internet for the magic file with md5sum of 6aef897c3d6ff0c78aff06ac189178dd that cramtools seems to want?
(@md5sam pointed to me you indeed can search by md5sum in case you wondered: find . -type f -exec md5sum {} + | grep ‘^6aef897c3d6ff0c78aff06ac189178dd’)
OK, so I went back to reading about CRAM files and if you are curious too, you can take a look at this publication: Efficient storage of high throughput DNA sequencing data using reference-based compression.
Turns out, storing pretty much everything in our BAM files is pretty wasteful, even if they are compressed (SAM files are not even compressed but they are human readable which is kind of nice at times). We are all humans and our DNA is extremely similar to each other. You and me might differ just by as few as 3,000,000 SNPs (and then insertions, deletions, SVs and sometimes more complex rearrangments).
So the idea is this:
Instead of storing human reference, let’s just store where our reads differ from it.
As simple as that.
OK, so what do I do now? The reference must be of course always present, otherwise the data is useless (knowing something differs, but not knowing what it differs isn’t particularly helpful. The reference could be reachable online, but you will likely want it to be on your local system instead. The easiest way to achieve this is probably through these commands:
/path_to_samtools/samtools-1.3.1/misc/seq_cache_populate.pl -root /home/biomonika/reference /home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
export REF_PATH=/home/biomonika/reference/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s
export REF_CACHE=/home/biomonika/reference/%2s/%2s/%s
Note that you need to have version newer version of samtools installed, such as 1.3.1 (older versions could precede CRAM development).
Now, let’s try again:
java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;
Yeaay – we just got the bam file. Although, the difference in sizes is quite dramatic: 4.6 GB versus 10.4 GB. (PS: If you don’t like cramtools, samtools can do the conversion too)
Big thanks to researchers from 1000genome who put together this very helpful README:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md