Do not fold: find out what non-ACTGN characters you have in your .fasta file

While some software packages blissfully ignore unusual characters or whitespace, others will complain. Today I ran two pipelines over the same file and each reported a different sequence length. So I ran this handy command:

cat sequence.fa | fold -w1 | sort | uniq -c

The beauty of the fold command is that it wraps any long continuous sequence at a fixed width; with -w1, it breaks after every single character, so uniq -c then counts each distinct character in the file. And voilà, here are the results:

And indeed, the two pipelines disagreed by 375 characters. I must have somehow introduced spaces during the concatenation step, when I was building a single sequence out of many smaller ones.
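If the culprit really is stray whitespace, a quick fix is to strip spaces and tabs (but not the newlines that separate FASTA records) and re-run the count. A minimal sketch, assuming the same sequence.fa filename as above; the .clean.fa output name is just an illustration:

```shell
# Strip stray spaces and tabs but keep newlines (FASTA headers need them)
tr -d ' \t' < sequence.fa > sequence.clean.fa

# Re-count the characters to confirm only ACTGN (and '>') remain
fold -w1 sequence.clean.fa | sort | uniq -c
```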


Why you shouldn’t postpone learning how to use Conda

I was recently a teaching assistant for the 2018 PSU Bootcamp on Reproducible Research, where I gave a workshop on how to use Conda. Here are my slides. While preparing for the workshop, I came across this blog post, which is a wonderful introduction to Conda. Before I read it, everything was very confusing: conda, bioconda, miniconda, anaconda… how many condas are even out there? The post not only helped, but motivated me to really start using conda in everyday life. Bioinformatics can be a dependency nightmare, and all these condas definitely make one's life much easier!

Where have my nanopore reads disappeared?

You might have been so excited to see your fastq reads coming immediately out of the MinION that you might not have (initially) realized that the data you started to analyze are NOT all the data there is! In case your computer wasn't able to do all the base-calling on the fly, there will be plenty of data in your skip folder; in our recent run, this was as much as 90% of it!

One way to fix this is to catch up with the base-calling using albacore on a Linux machine. In my case, I achieved it using the following commands:

PYTHON_VERSION=3.5
conda create --prefix /biomonika/conda/nanopore_basecalling python=${PYTHON_VERSION} anaconda
source activate nanopore_basecalling
conda update --all
wget https://mirror.oxfordnanoportal.com/software/analysis/ont_albacore-2.2.7-cp35-cp35m-manylinux1_x86_64.whl
pip install ont_albacore-2.2.*.whl

So what does our loop for submitting jobs look like?

for i in $(seq 0 48); do
  echo ${i}
  sbatch basecalling_job.sh /biomonika/nanopore_run/data/reads/20180426_2154_/fast5/skip/${i} ${i}
done

And these are the insides of basecalling_job.sh:

#!/bin/bash
#SBATCH --job-name=biomonika
#SBATCH --output=biomonika-%j.out
#SBATCH --error=biomonika-%j.err
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=63
#SBATCH --cpus-per-task=1

set -x
set -e

input_dir=$1
output_dir=$2

echo "input dir: " ${input_dir}
echo "output dir: " ${output_dir}

which conda
conda info

source activate /galaxy/home/biomonika/conda/nanopore_basecalling
read_fast5_basecaller.py --flowcell FLO-MIN106 --kit SQK-LSK108 --barcoding --output_format fastq --input ${input_dir} --save_path ${output_dir} --worker_threads 63
echo "Done."
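Once the jobs are running, a quick way to check how many of them have finished is to count the log files that reached the final echo. A small sketch, relying only on the --output=biomonika-%j.out pattern and the closing "Done." from the script above:

```shell
# Count how many job logs contain the final "Done." line
grep -l "Done." biomonika-*.out | wc -l
```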

As usual, please comment if you have suggestions or advice. Happy base-calling!

Few thoughts on unionization at Penn State

Monday, April 10th is the first election date, and while the campaigning is in full swing, I feel there is still much confusion about the whole process. I decided to write this post as a student who has been interested in the subject of unionization since the beginning. Perhaps it will be of interest to those who have not been receiving any emails about the topic, or at least not as many as we (who are eligible to vote) have.

Who is this “third-party” that seeks to represent graduate students in a union?

If you are confused about who this third party is, that's because the emails from the Penn State administration never mention the group by name. The organization is called The Coalition of Graduate Employees (CGE) and their website is http://cge-psu.com/


As they state on their website, they are a “group of graduate assistants (GAs, TAs, and RAs) who are working towards the goal of a graduate employee union at Penn State”. Since they belong to the very group they seek to represent, I do not think it's accurate to refer to them as a third party, as happened for example here.

Why would we need a union?

If you're a graduate student, every semester you get a contract, you sign it, and that's it. But what if you could negotiate what's in the contract? Perhaps add something, perhaps remove something. Or perhaps you think that your contract and the package of benefits that comes with it (such as health insurance) was already designed with your best interests in mind. And what if you run into serious issues during your graduate experience? Who will advocate for you?

Don’t we already have student organizations that advocate for us?

We do! For example, the Graduate and Professional Student Association (GPSA; http://sites.psu.edu/gpsapsu/) advocates for students. However, legally speaking, their voice is advisory and not binding. Should a union be formed, its representatives would have much stronger negotiating power than other student organizations ever could.

How common is this union thing?

Quite common! Of the 14 universities in the Big Ten, 6 are unionized (Michigan State; Rutgers; University of Illinois at Urbana-Champaign; University of Iowa; University of Michigan; and University of Wisconsin at Madison). Source: http://cge-psu.com/what-is-a-graduate-union/other-graduate-unions/

For example, you can visit Michigan State union website here: http://geuatmsu.org/

What are the reasons NOT to want a union?

The Graduate School has mentioned a few in their emails, although to me they feel more like a campaign and less like reasons I am personally concerned about.

I do not see how a union could affect the relationship with my advisor. I talked to a few students who had been represented by unions at other universities, and none of them seemed to think it hurt the relationship; more often, they said they didn't notice any effect at all.

I do not see how a union could limit what research outputs can or cannot be put into a thesis. I am not aware of such a case, and I don't know which modification of a contract could possibly make this a real concern.

Yes, I see this as partially a goal. I do agree that teaching experience is important for students who seek to pursue an academic career. However, there is a difference between gaining teaching experience and being overworked yet financially struggling. There are majors where students are expected to teach every single semester of their Ph.D. There are labs where students are expected to work every weekend. There are courses where graduate students spend >30 hours each week preparing lectures, teaching, and grading, leaving little time for their research.

Are we students or are we workers?

If you keep hearing about the PLRB, note that it stands for the Pennsylvania Labor Relations Board. Penn State has challenged the claim that graduate students perform work for the university and argued that they shouldn't be allowed to form a union. This led to an official hearing in which the PLRB listened to both sides: The Coalition of Graduate Employees (CGE) and the group of lawyers hired by Penn State, who took the position that graduate students do not perform work but are in fact just and only students.

I highlighted the last sentence since that's what the witnesses talked about during the hearing. You can read the full transcripts here. If you're interested, one of the more “spicy” discussions is on page 631 of this document. I cherry-picked a teaser for you:

In March, the PLRB issued an order to receive information about eligible voters, including a list of their names, home addresses, and university email addresses.

Then we all received this news/email:

Graduate students can expect to be visited or contacted by representatives from CGE/PSEA. These representatives may make visits to graduate students’ homes, both announced and unannounced, but students have the right to decline any request to meet with them at home or on campus.

This looks like nonsense, and CGE has distanced themselves from this message. To be honest, I do not understand why anyone would issue such a statement. Why would any organization of this type want to visit anyone at home, let alone unannounced? This is a mystery to me.

Anyway, the final decision of the PLRB is that graduate students are workers and are allowed to form a union, which is why we have an election. And that brings us back to the beginning of my post.

To conclude, both parties, CGE as well as the Penn State administration, agree that you should get informed and vote. (I know I will.)

NOTE THAT THESE OPINIONS ARE MINE AND MINE ONLY. Please let me know if you find any factual inaccuracies and I’ll be happy to correct them.


seamless filtering of columns with dplyr without losing rownames

If you use R and haven't tried dplyr yet, go give it a try; it will be an afternoon well spent. As much as I love this package, I probably haven't used it as much as I should have, because it requires you to shift the way you think about problems a bit. To me, that's similar to transitioning from base R plotting functions to ggplot.

Today I worked with a dataframe where I needed to keep only the rows in which at least one of the columns has a value greater than, let's say, 15. So I did this:

filter_all(myDF, any_vars(. > 15))

However, what happened was that I lost all my rownames, where I keep important information. This is something that the author of dplyr, Hadley Wickham, would refer to as a “feature, not a bug”: in complicated queries, rownames are hard to get right, and it's better not to return anything than to get it wrong.

Therefore, I will now need to keep my rownames as an additional column. And just today I came across a very nice library that lets you do that easily: tibble (note that it can do much more than this :-). I like the readability of tibble commands a lot.

So here is my code:

myDF<-myDF %>% rownames_to_column('new_column') 
myDF<-filter_at(myDF, vars(-new_column), any_vars(. > 15))
myDF<-myDF %>% column_to_rownames('new_column')

This can be turned into a one-liner, if that's what you're into.

Why you should always specify colClasses when reading tables in R

Today, I was importing a table that looked like this into R:

with a command:

 r<-read.table("passes_stat",colClasses = c("integer","character"))

Why did I put the extra effort into specifying colClasses? Because if I hadn't, the read.table function would have had to make an educated guess about what type of data my columns contain.

This would not only take longer (see the post here), but in this particular case it would also ruin my leading zeroes (see, for example, line number 5). Worse, I might not even have noticed that it happened.

 r<-read.table("passes_stat")
 View(r)

How to work with CRAM files (by converting them to bam files:)

I recently downloaded a trio from the 1000 Genomes Project that happened to be in CRAM format. My first impulse was, of course, to convert it into a format I am comfortable with, such as bam.

I quickly found the following command online:

java -jar cramtools-<version>.jar bam --input-cram-file <cram file> --reference-fasta-file <reference fasta file> --output-bam-file <output bam file>

thinking to myself that it must be rather frustrating to develop a tool whose frustrated users desperately push all the buttons just to get to their chubby but lovely bam files as quickly as possible (this is an exaggeration, in case you wondered ;-).

OK, let’s try the conversion:

input=NA12878.alt_bwamem_GRCh38DH.20150718.CEU.low_coverage.cram

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

But I got an error message saying something like “cramtools ReferenceSource Downloaded sequence is corrupt” and complaining about a mismatch in md5sums.

Is my reference incorrect? Should I search the internet for the magic file with the md5sum of 6aef897c3d6ff0c78aff06ac189178dd that cramtools seems to want?

(@md5sam pointed out to me that you can indeed search by md5sum, in case you wondered: find . -type f -exec md5sum {} + | grep '^6aef897c3d6ff0c78aff06ac189178dd')

OK, so I went back to reading about CRAM files; if you are curious too, you can take a look at this publication: Efficient storage of high throughput DNA sequencing data using reference-based compression.

Turns out, storing pretty much everything in our BAM files is pretty wasteful, even though they are compressed (SAM files are not even compressed, but they are human-readable, which is kind of nice at times). We are all human, and our DNA is extremely similar to each other's. You and I might differ by as few as 3,000,000 SNPs (plus insertions, deletions, SVs, and sometimes more complex rearrangements).

So the idea is this:

Instead of storing human reference, let’s just store where our reads differ from it.

As simple as that.

OK, so what do I do now? The reference must, of course, always be present, otherwise the data is useless (knowing that something differs, but not knowing from what it differs, isn't particularly helpful). The reference could be reachable online, but you will likely want it on your local system instead. The easiest way to achieve this is probably through these commands:

/path_to_samtools/samtools-1.3.1/misc/seq_cache_populate.pl -root /home/biomonika/reference /home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa

export REF_PATH=/home/biomonika/reference/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s

export REF_CACHE=/home/biomonika/reference/%2s/%2s/%s
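The %2s/%2s/%s pattern describes how the cache is laid out on disk: the first two hex characters of a sequence's md5sum, then the next two, then the rest. As a sketch, expanding the pattern by hand for the md5sum cramtools asked for earlier:

```shell
# Expand the REF_CACHE pattern %2s/%2s/%s for one md5sum
md5=6aef897c3d6ff0c78aff06ac189178dd
echo "/home/biomonika/reference/$(echo $md5 | cut -c1-2)/$(echo $md5 | cut -c3-4)/$(echo $md5 | cut -c5-)"
# → /home/biomonika/reference/6a/ef/897c3d6ff0c78aff06ac189178dd
```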

Note that you need a newer version of samtools installed, such as 1.3.1 (older versions predate CRAM support).

Now, let’s try again:

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

Yeaay, we just got the bam file. The difference in sizes is quite dramatic, though: 4.6 GB versus 10.4 GB. (PS: If you don't like cramtools, samtools can do the conversion too.)

Big thanks to the researchers from the 1000 Genomes Project who put together this very helpful README:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md

Reading text files with variable number of columns in R.

Do you ever need to read a text file without knowing the maximum number of columns it can have? Well, I do 🙂 Turns out the read.table function only looks at the first five lines to determine the number of columns, as the ?read.table help page says:


The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

So what’s the workaround? Find out the maximum in advance!

  no_col <- max(count.fields(file, sep = "\t"))

  dataI <- read.table(file,sep="\t",fill=TRUE,header = F,col.names=c("chr", "start", "end", "length",1:no_col))

Since we want to fill the blanks, we use the fill=TRUE argument, and we may also decide to name our columns numerically, as col.names does here. Thus, the columns will be named chr, start, end, length and 1, 2, 3 … maxColumn.
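The same maximum can also be computed from the shell before the file ever reaches R. A small sketch, the counterpart of max(count.fields(file, sep = "\t")) above; the file.tsv name is just an illustration:

```shell
# Find the maximum number of tab-separated columns in a file
awk -F'\t' '{ if (NF > max) max = NF } END { print max }' file.tsv
```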


using slurm to parallelize per chromosome

I recently needed to parallelize variant calling and decided to do it by chromosome (one might argue that chromosomes have unequal sizes and that equal-sized chunks would be better). Therefore, I needed my script to use 22 processors, one per chromosome. Hence, the header of my sbatch script looked like this:


#!/bin/bash
#SBATCH -C new
#SBATCH --nodes=1
#SBATCH --ntasks=22
#SBATCH -t 0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=biomonika@psu.edu

Now, I need to run 22 instances of my variant caller. I can do it like this:

array=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22) #chromosomes to use
for chromosome in "${array[@]}"; do 
    echo $chromosome;  
    time srun -C new --nodes=1 --ntasks=1 --time=INFINITE python naive_variant_caller_for_region_file.py --bam=${var_folder}/${chromosome}_${motif}.bam --index=${var_folder}/${chromosome}_${motif}.bam.bai --reference_genome_filename=${reference} --regions_filename=${var_folder}/${chromosome}_${motif}.bed --output_vcf_filename=${var_folder}/${chromosome}_${out}.vcf &
done; 
wait #all chromosomes call variants/errors in parallel

This way, all my jobs get submitted, and the wait command makes sure the script won't continue until all chromosomes have finished.
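The fan-out/wait pattern above can be sketched without slurm at all; a minimal stand-in, with echo playing the role of the srun call:

```shell
# Launch 22 background jobs, then block until every one has finished
for chromosome in $(seq 1 22); do
    echo "processing chromosome ${chromosome}" &   # stand-in for the srun ... & line
done
wait  # returns only after all background jobs exit
echo "all chromosomes done"
```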