ParalleLAZYing in R

I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created a separate output for each one.

This is a wonderful candidate for parallelization since:

  • I can process each motif independently
  • I can assign each motif to a single processor
  • Workers/processes don’t have to communicate
  • They write their output to separate files (not a shared one, so they don’t have to fight over access to it)
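
To make the last point concrete, here is a minimal sketch of a per-motif worker that writes to its own file. The post does not show processMotif(), so count_motif(), the toy sequences vector, and the output file names are all hypothetical stand-ins rather than the code that was actually run.

```r
# Hypothetical stand-in for processMotif(); the real analysis is not shown in the post.
# `sequences` is toy data standing in for the large file already loaded in memory.
sequences <- c("ACGGTCGG", "TTCGTACG", "CGCGCGGT")

count_motif <- function(motif, out_dir = tempdir()) {
  # Count non-overlapping literal matches of the motif in each sequence
  hits <- lengths(regmatches(sequences, gregexpr(motif, sequences, fixed = TRUE)))
  # One output file per motif, so no two workers ever write to the same file
  out_file <- file.path(out_dir, paste0("motif_", motif, ".txt"))
  write.table(data.frame(sequence = sequences, count = hits),
              file = out_file, quote = FALSE, row.names = FALSE, sep = "\t")
  out_file
}

# Sequential call for illustration; each call is independent of the others
out_files <- vapply(c("CGG", "CG"), count_motif, character(1))
```

Because each call touches only its own file and only reads the shared data, calls like these can be handed to separate workers without any locking.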

I quickly googled my way to the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:

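library(parallel)

# My loooong list of motifs, shortened for illustration purposes
# (named `motifs` rather than `list` to avoid masking base::list)
motifs <- c("CGG", "CGGT", "CG", "CGT")

# Calculate the number of cores, leaving one free for the rest of the system
no_cores <- detectCores() - 1
# Initiate cluster. FORK workers are copies of the current R process (Unix-like
# systems only), so processMotif() and the loaded data must already exist here.
cl <- makeCluster(no_cores, type = "FORK")

print("PARALLEL")
ptm <- proc.time() # start timer
parLapply(cl, motifs, function(motif) {
  processMotif(motif) # process each motif individually
})
stopCluster(cl)
proc.time() - ptm # stop timer

print("NON-PARALLEL")
ptm <- proc.time() # start timer
for (motif in motifs) {
  processMotif(motif)
}
proc.time() - ptm # stop timer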

Let’s check how much time the parallel part of this chunk takes to run (in seconds):

   user  system elapsed 
908.340 300.906 432.906

And let’s compare that with the for-loop solution, which takes roughly 14 times longer in elapsed time:

    user   system  elapsed
8544.826 3079.385 6089.453

Happy paralleLAZYing!

(The code was run on GNU/Linux with 64 processors and 500GB RAM.)
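
A FORK cluster is only available on Unix-like systems such as the one above. For readers on Windows, the following is a hedged sketch of the same pattern with a PSOCK cluster. The processMotif() body, the toy sequences vector, and the choice of two workers are assumptions made purely for illustration; they are not what was run for the timings in this post.

```r
library(parallel)

# Hypothetical stand-ins: the post's real processMotif() and data are not shown
sequences <- c("ACGGTCGG", "TTCGTACG", "CGCGCGGT")
processMotif <- function(motif) {
  # Total number of non-overlapping literal matches across all sequences
  sum(lengths(regmatches(sequences, gregexpr(motif, sequences, fixed = TRUE))))
}
motifs <- c("CGG", "CGGT", "CG", "CGT")

# PSOCK workers are fresh R sessions (available on Windows, macOS and Linux)
cl <- makeCluster(2, type = "PSOCK")
# Unlike FORK workers, they do not inherit the parent's objects, so everything
# the worker function relies on is exported explicitly. envir = environment()
# exports from wherever this code is being run.
clusterExport(cl, varlist = c("sequences", "processMotif"), envir = environment())

results <- parLapply(cl, motifs, function(motif) processMotif(motif))
stopCluster(cl)

names(results) <- motifs
```

The trade-off is that exported objects are copied to every worker, which matters when the data is as large as the files described here; that copying is what FORK avoids.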

About mqm5775 (https://www.biostars.org/u/2884/): Bioinformatics of sequences. Sex chromosomes. Enjoying DNA in my computer. Great ape Y chromosome evolution, specifically heterochromatin variability and analysis of male fertility genes. Creative use of visualization, gene expression of multi-copy gene families, genome assembly. Enthusiastic about learning and applying new technologies: Pacific Biosciences (expert experience), Oxford Nanopore, BioNano Genomics.

One thought on “ParalleLAZYing in R”

  1. Nice tutorial on implementing parallel processing in R.

    Modern laptops and PCs have multi-core processors and a sufficient amount of memory, and one can use them to generate outputs quickly.

    Once you learn how to parallelize your code, your only regret will be that you didn’t learn it sooner.

    https://stantyan.com
