Do not fold: find out what non-ACTGN characters you have in your .fasta file

While some software packages blissfully ignore unusual characters or whitespace, others will complain. Today I ran two pipelines over the same file and they reported different sequence lengths. So I ran this handy command:

cat sequence.fa | fold -w1 | sort | uniq -c

The beauty of the fold command is that it wraps any long continuous sequence at a fixed number of characters, even after every single one. Combined with sort and uniq -c, this gives a count for each distinct character in the file. And voilà, here are the results:

And yay, the two pipelines indeed disagreed by 375 characters. I must have somehow introduced spaces during the concatenation step, when I was building a single sequence out of many smaller ones.
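
One caveat: the command above also counts the characters in the FASTA header lines (the ones starting with ">"). Here is a minimal sketch of a variant that skips headers and, separately, tallies only the characters outside A, C, G, T and N. The demo file and the function names count_chars and count_unexpected are made up for illustration; swap in your own file.

```shell
# Hypothetical demo file, created only so this sketch is self-contained.
demo=$(mktemp)
printf '>seq1 demo\nACGT NNAC\nGGTx\n' > "$demo"

# Count every character on sequence lines, skipping FASTA headers (lines starting with ">").
count_chars() {
    grep -v '^>' "$1" | fold -w1 | sort | uniq -c
}

# Count only characters outside the expected alphabet (A, C, G, T, N in either case).
# Newlines are deleted too, so line breaks are not reported as unexpected.
count_unexpected() {
    grep -v '^>' "$1" | tr -d 'ACGTNacgtn\n' | wc -c | tr -d ' '
}

count_chars "$demo"
echo "unexpected characters: $(count_unexpected "$demo")"
```

In the demo file, the stray space and the "x" are the two unexpected characters. If lowercase (soft-masked) bases should also be flagged in your data, drop "acgtn" from the tr set.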

Why you shouldn’t postpone learning how to use Conda

I was recently a teaching assistant for the 2018 PSU Bootcamp on Reproducible Research, where I gave a workshop on how to use Conda. Here are my slides. While preparing for this workshop, I came across this wonderful blog post, which is an excellent introduction to Conda. Before I read it, everything was very confusing: conda, bioconda, miniconda, anaconda... how many condas are even out there? But this post not only helped, it also motivated me to really start using conda in everyday life. Bioinformatics can be a dependency nightmare, but all these condas definitely make one’s life much easier!

Seamless filtering of columns with dplyr without losing rownames

If you use R and haven’t used dplyr yet, go and give it a try; it will be an afternoon well spent. As much as I love this package, I probably haven’t used it as much as I should have, because it requires you to shift your way of thinking about problems a bit. To me, that’s similar to transitioning from base R plotting functions to ggplot.

Today I worked with a data frame in which I needed to keep only the rows where at least one column has a value greater than, let’s say, 15. So I did this:

filter_all(myDF, any_vars(. > 15))

However, I lost all my rownames, where I keep important information. This is something that the author of dplyr, Hadley Wickham, would refer to as a “feature, not a bug”: in complicated queries, rownames are hard to get right, and it’s better not to return anything than to get it wrong.

Therefore, I now need to keep my rownames as an additional column. And just today I came across a very nice library that lets you do that easily: tibble (note that it can do much more than this). I like the readability of tibble commands a lot.

So here is my code:

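library(dplyr)
library(tibble)
myDF <- myDF %>% rownames_to_column('new_column')             # stash rownames in a regular column
myDF <- filter_at(myDF, vars(-new_column), any_vars(. > 15))  # filter on every column except new_column
myDF <- myDF %>% column_to_rownames('new_column')             # turn the column back into rownames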

This can be turned into a one-liner, if that’s what you’re into.