Bloom-filter inspired testing of pooled samples (and splitting of swabs!)

In the current situation (COVID-19 pandemic), a lot has been debated about pooling samples in order to save time, cost, or reagents. It has been suggested that one could pool samples of 32 or 64 samples and test them all at once. If the pooled sample is negative, great — all individual samples are negative. If it’s positive, further re-testing is warranted. Of course, different more or less creative versions of pooling are possible, such as pooling rows and diagonals in order to reduce the total number of tests.

The disadvantage of these approaches is that at least some re-testing is pretty much unavoidable. Thus, ideally, one would pool in a manner that would require no re-testing at all. And here’s where the idea of hashing or chemical bloom filter comes in.

Imagine you split your sample and pool it with other samples. But, you will be smart about how you do this. You will distribute each part of each of your sample on your 96-well plate in a pre-defined manner. That way, based on the combination of positive wells, you could tell exactly which of the original samples were positive. How, you ask? The explanation, code and the whole original idea by Tomáš Janoušek (@Liskni_si) are described here: and illustrated below:

schematics of the pooling and detection process

When the number of positive samples per 1,000 is low, such as below 1% (less than 10 positive samples per 1,000 tested), the number of false positives is remarkably low. For example, for 8/1000 positive samples, splitting each sample into 5, only single PCR reaction is needed on a 96-well plate to tell you exactly which those 8 positive samples are, and, in this specific scenario, also 1 extra sample that is false positive (although validating all 8+1 samples is not that labor-intensive and could be done).

the simulation of the bloom-filter inspired sample pooling

There won’t be any positive samples you would algorithmically miss (bloom filter doesn’t have false negatives) and also few (if any) false positives.

The caveat is the act of pooling itself; while adding 40 or 50 samples to a single tube is definitely easy for a robot, it could be error-prone when done manually. While a human could pool rows or columns relatively easily, asking someone to pipette 1,000 samples, and moreover each of those into 3 wells, more so in a pre-defined manner, might simply be too much of an ask. Reducing the number of tested samples to 500 and capping the number of samples in a tube will generate the following plot:

sample pooling, with max of 10-16 samples per tube

Here, all 500 samples are separated into three pieces, and the number of real positives versus false positives is plotted, while keeping the total number of pooled samples in a tube low.

This all brings us to the discussion of the next bottleneck: RNA extraction. In all this pooling, what do we actually pool? Extracted RNA? If that’s the case, we still need to do 1,000 extractions for 1,000 samples. And then split those 1,000 extractions into 2, 3, 4 or 5 parts (a parameter that can be chosen). And we know that with all this testing, the extraction kits are running low.

We also know that with the progress of the disease a viral load in a patient changes — someone who might have tested negative yesterday could be positive today. So when a swab is taken, a viral load might be low to begin with, without any further amplification step. However, we do amplify, over many cycles. So perhaps it doesn’t matter if we start with 1/3rd or 1/4th of the original load. This means that if a swab could be mechanically split, with the sufficient viral load still there, one could pool these parts of the original swab.

Thus, for 1,000 samples, only 96 RNA extractions are needed. And we can still confidently say which individual samples are positive. 

In conclusion, for a population of samples with a small percentage of positives, the chemical bloom filters could represent a fast and efficient way to reduce the number of performed tests and save reagents (1,000 samples can be tested using 96 pooled tubes).

The data for the two plots above can be found below:

nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
1              1000  4        41.67          0.00
2              1000  4        41.67          0.01
3              1000  5        52.08          0.03
4              1000  5        52.08          0.07
5              1000  5        52.08          0.19
6              1000  5        52.08          0.40
7              1000  5        52.08          0.78
8              1000  5        52.08          1.29
9              1000  5        52.08          2.12
10             1000  5        52.08          3.23
11             1000  5        52.08          4.97
12             1000  5        52.08          7.25
13             1000  5        52.08          10.00
14             1000  5        52.08          13.27
15             1000  5        52.08          17.13
16             1000  4        41.67          21.30
17             1000  4        41.67          25.41
18             1000  4        41.67          29.96
19             1000  4        41.67          35.00
20             1000  4        41.67          40.67
25             1000  3        31.25          75.94
30             1000  3        31.25          114.93
35             1000  3        31.25          160.06
40             1000  2        20.83          208.18
45             1000  2        20.83          244.67
50             1000  2        20.83          279.05
nRealPositive  nAll  kHashes  samplesInTube  avgFalsePositives
1              500   3        15.62          0.00
2              500   3        15.62          0.04
3              500   3        15.62          0.17
4              500   3        15.62          0.36
5              500   3        15.62          0.56
6              500   3        15.62          0.93
7              500   3        15.62          1.44
8              500   3        15.62          2.14
9              500   3        15.62          2.96
10             500   3        15.62          3.95
11             500   3        15.62          5.07
12             500   3        15.62          6.32
13             500   3        15.62          7.64
14             500   3        15.62          9.18
15             500   3        15.62          10.99
16             500   3        15.62          12.96
17             500   3        15.62          15.08
18             500   3        15.62          17.34
19             500   3        15.62          19.80
20             500   3        15.62          22.37
25             500   3        15.62          37.95
30             500   3        15.62          57.12
35             500   3        15.62          79.27
40             500   2        10.42          98.58
45             500   2        10.42          115.78
50             500   2        10.42          131.52


Disclaimer: We are not epidemiologists and we are publishing this with an intent to induce further discussions and improvements. Our conclusions are not intended as public-health advice and shouldn’t be regarded as such.

Do not fold, find out what non-ACTGN characters do you have in your .fasta file?

While some software packages blissfully ignore unusual characters or whitespaces, others will complain. Today I ran two pipelines over the same file and both reported different sequence length. So, I ran this handy command:

cat sequence.fa | fold -w1 | sort | uniq -c

The beauty of fold command is that it will wrap any long continuous sequence at a fixed number of characters. Even after every single one. And voila, here are the results:

And yeay, indeed two pipelines didn’t agree by 375 characters. I must have somehow introduced spaces during the concatenation step when I was building a single sequence out of many smaller.



Why you shouldn’t postpone learning how to use Conda

I was recently a teaching assistant for 2018 PSU Bootcamp on Reproducible Research where I gave a workshop on how to use Conda. Here are my slides. While preparing for this workshop, I came across this wonderful blog post that represents a wonderful introduction to Conda. Before I read it, all was very confusing: conda, bioconda, miniconda, anaconda.. how many condas are even out there? But this post not only helped, but motivated me to really start using conda in everyday life. Bioinformatics can be a dependency nightmare, but all these condas definitely make one’s life much easier!

Where have my nanopore reads dissapeared?

You might have been so excited to see your fastq reads coming immediately out of the MinION that you might not (initially) realized that the data you started to analyze are NOT all the data there is! In case your computer wasn’t able to do all the base-calling on the fly, there will be plenty of the data in your skip folder — in our recent run, this was as much as 90% of it!

One way to fix this is to catch up with the base-calling using albacore on Linux machine. In my case, I achieved it using following commands:

conda create --prefix /biomonika/conda/nanopore_basecalling python=${PYTHON_VERSION} anaconda
source activate nanopore_basecalling
conda update --all
pip install ont_albacore-2.2.*.whl

So how does our loop submitting jobs look like?

for i in `seq 0 48`; do
  echo $i;
  sbatch /biomonika/nanopore_run/data/reads/20180426_2154_/fast5/skip/${i} ${i};

And this is the insides of

#SBATCH --job-name=biomonika
#SBATCH --output=biomonika-%j.out
#SBATCH --error=biomonika-%j.err
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=63
#SBATCH --cpus-per-task=1

set -x
set -e


echo "input dir: " ${input_dir}
echo "output dir: " ${output_dir}

which conda
conda info

source activate /galaxy/home/biomonika/conda/nanopore_basecalling --flowcell FLO-MIN106 --kit SQK-LSK108 --barcoding --output_format fastq --input ${input_dir} --save_path ${output_dir} --worker_threads 63
echo "Done."

As usual, please comment if you have suggestions or advice. Happy base-calling!

Few thoughts on unionization at Penn State

Monday, April 10th is the first election date and while the campaigning is at full speed, I feel like there is still much confusion about the whole process. I decided to write this post as a student who has been interested in the subject of unionization since the beginning. Perhaps this will be of interest to those who have not been receiving any emails about the topic, or at least not as much as we (who are eligible to vote) have.

Who is this “third-party” that seeks to represent graduate students in a union?

If you are confused about who this third-party is, that’s because the emails from Penn State administration never mention this group. The organization is called The Coalition of Graduate Employees (CGE) and their website is


As they state on their website, they are “group of graduate assistants (GAs, TAs, and RAs) who are working towards the goal of a graduate employee union at Penn State”. Since they are in the group they seek to represent, I do not think it’s accurate to refer to them as third-party, as happened for example here.

Why would we need a union?

If you’re a graduate student, every semester you get a contract, sign it and that’s it. But what if you could negotiate what’s in the contract? Perhaps add something, perhaps remove something. Or, perhaps you think that your contract and the package of benefits that come with it (such as health insurance) was already designed with your best interests in mind. And what if you have serious issues during your graduate experience? Who will advocate for you?

Don’t we already have student organizations that advocate for us?

We do! For example, Graduate and Professional Student Association (GPSA; advocates for students. However, legally speaking, their voice is advisory and not binding. Shall union be formed, union representatives would have much stronger negotiating power than other student organizations ever could.

How common is this union thing?

Quite common! Out of the 14 Universities within the Big Ten, 6 of them are unionized (Michigan State; Rutgers; University of Illinois at Urbana-Champaign; University of Iowa; University of Michigan; and University of Wisconsin at Madison). Source:

For example, you can visit Michigan State union website here:

What are the reasons NOT to want a union?

Graduate school has mentioned few in their emails, although to me they feel more like a campaign and less like reasons I am personally concerned about.

I do not see how a union could affect the relationship with my advisor. I talked to few students who were represented by unions before at different unis and none of them seemed to think it hurt their relationship, more often they thought they didn’t notice any effect.

I do not see how a union could limit what research outputs can or cannot be put into a thesis. I am not aware of such case and I don’t know which modification of a contract could possibly make this a real concern.

Yes, I see this as partially a goal. I do agree that teaching experience is important for students who seek to pursue an academic career. However, there is a difference between teaching experience and between being overworked and yet financially struggling. There are majors where students are expected to teach every single semester of their Ph.D. experience. There are labs where students are expected to work every weekend. There are courses where graduate students spend >30 hours each week preparing lectures, teaching, and grading, leaving little time for their research.

Are we students or are we workers?

If you keep hearing about PLRB, then note that it stands for the Pennsylvania Labor Relations Board. Penn State has challenged that graduate students perform, in fact, work for the university and argued that they shouldn’t be allowed to form a union. This has led to an official hearing in which PLRB listened to both sides: The Coalition of Graduate Employees (CGE) and the group of lawyers hired by Penn State who took the position that graduate students do not perform work, but are in fact just and only students. 

I highlighted the last sentence since that’s what the witnesses talked about during the hearing. You can read the full transcripts here. If you’re interested, one of more “spicy” discussion is on page 631 of this document. I cherry-picked a teaser for you:

In March, PLRB issued the order to receive an information about eligible voters, including a list of the names, home addresses, and university email addresses.

Then we all received this news/email.

Graduate students can expect to be visited or contacted by representatives from CGE/PSEA. These representatives may make visits to graduate students’ homes, both announced and unannounced, but students have the right to decline any request to meet with them at home or on campus.

This looks like nonsense and CGE has distanced themselves from this message. To be honest, I do not understand why anyone would issue such a statement. Why would any organization of this type want to visit anyone in their home? More so unannounced? This is a mystery to me.

Anyway, the final decision of PLRB is that the graduate students are workers and allowed to form a union, which is why we have an election. That brings us to a beginning of my post.

To conclude, both parties: CGE, as well as Penn State administration, agree that you should get informed and vote. (I know I will.)

NOTE THAT THESE OPINIONS ARE MINE AND MINE ONLY. Please let me know if you find any factual inaccuracies and I’ll be happy to correct them.




seamless filtering of columns with dplyr without loosing rownames

If you use R and haven’t used dplyr yet, go and give it a try, it will be afternoon well spent. As much as I love this package, I probably haven’t used it yet as much as I should have – that’s because it requires you to shift your way of thinking about the problems a bit. To me, that’s similar as transitioning from basic R plotting functions to ggplot.

Today I worked with the dataframe where I needed to only keep rows where at least one of the columns has value greater than let’s say 15. So I did this:

filter_all(myDF, any_vars(. > 15))

However what happened was that I lost all my rownames where I keep important information. This is something that author of dplyr Hadley Wickham would refer to as “feature, not a bug”. This is because in complicated queries, rownames are hard to get right and it’s better to not to return anything than to get it wrong.

Therefore,  I will now need to keep my rownames as an additional column. And just today I came across very nice library that allows you do do that easily: tibble (note that it can do much more than this:). I like the readability of tibble commands a lot.

So here is my code:

myDF<-myDF %>% rownames_to_column('new_column') 
myDF<-filter_at(myDF, vars(-new_column), any_vars(. > 15))
myDF<-myDF %>% column_to_rownames('new_column')

This can be turn into one-liner, if that’s what you’re into.

Why you should always specify colClasses when reading tables in R

Today, I was importing table that looked like this into R:

with a command:

 r<-read.table("passes_stat",colClasses = c("integer","character"))

Why did I put an extra effort in specifying colClasses? That’s because if I wouldn’t, read.table function would have to make an educated guess about what type of the data my columns contain.

This would not only take longer (see the post here), but in this particular case it would also ruin my leading zeroes – see for example line number 5. Additionally, I might have not even noticed that it happened.


How to work with CRAM files (by converting them to bam files:)

I recently downloaded trio from 1000 human genomes that happened to be in a CRAM format. My first impulse was – of course – to convert it into the format I am comfortable with, such as bam.

I quickly found following command online:

java -jar cramtools-<version>.jar bam –input-cram-file <cram file> –reference-fasta-file <reference fasta file> –output-bam-file <output bam file>

thinking to myself that it must be rather frustrating to develop a tool where the frustrated users desperately push all the buttons to get to their chubby, but lovely bam files as quickly as possible (this is exaggeration in case you wondered;-).

OK, let’s try the conversion:


java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

But I got error message saying something like “cramtools ReferenceSource Downloaded sequence is corrupt” and complaining about mismatch in md5sums.

Is my reference incorrect? Should I search the internet for the magic file with md5sum of 6aef897c3d6ff0c78aff06ac189178dd that cramtools seems to want?

(@md5sam pointed to me you indeed can search by md5sum in case you wondered: find . -type f -exec md5sum {} + | grep ‘^6aef897c3d6ff0c78aff06ac189178dd’)

OK, so I went back to reading about CRAM files and if you are curious too, you can take a look at this publication: Efficient storage of high throughput DNA sequencing data using reference-based compression.

Turns out, storing pretty much everything in our BAM files is pretty wasteful, even if they are compressed (SAM files are not even compressed but they are human readable which is kind of nice at times). We are all humans and our DNA is extremely similar to each other. You and me might differ just by as few as 3,000,000 SNPs (and then insertions, deletions, SVs and sometimes more complex rearrangments).

So the idea is this:

Instead of storing human reference, let’s just store where our reads differ from it.

As simple as that.

OK, so what do I do now? The reference must be of course always present, otherwise the data is useless (knowing something differs, but not knowing what it differs isn’t particularly helpful. The reference could be reachable online, but you will likely want it to be on your local system instead. The easiest way to achieve this is probably through these commands:

/path_to_samtools/samtools-1.3.1/misc/ -root /home/biomonika/reference /home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa

export REF_PATH=/home/biomonika/reference/%2s/%2s/%s:

export REF_CACHE=/home/biomonika/reference/%2s/%2s/%s

Note that you need to have version newer version of samtools installed, such as 1.3.1 (older versions could precede CRAM development).

Now, let’s try again:

java -jar /home/biomonika/cramtools/cramtools-3.0.jar bam --input-cram-file ${input} --reference-fasta-file /galaxy/home/biomonika/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-bam-file ${input}.bam;

Yeaay – we just got the bam file. Although, the difference in sizes is quite dramatic: 4.6 GB versus 10.4 GB. (PS: If you don’t like cramtools, samtools can do the conversion too)

Big thanks to researchers from 1000genome who put together this very helpful README:

Reading text files with variable number of columns in R.

Do you ever need to read a text file while not knowing what’s the maximum number of columns it can have? Well, I do 🙂 Turns out read.table function only looks at first five columns to determine the number of columns as ?read.table help page says:


The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill orblank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

So what’s the workaround? Find out the maximum in advance!

  no_col <- max(count.fields(file, sep = "\t"))

  dataI <- read.table(file,sep="\t",fill=TRUE,header = F,col.names=c("chr", "start", "end", "length",1:no_col))

Since we want to fill the blanks, we will use fill=TRUE function and we may also decide to name our columns numerically as col.names does here. Thus, the columns will be named chr, start, end, length and 1, 2, 3 … maxColumn.