ParalleLAZYing in R

I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created a separate output for each one.

This is a wonderful candidate situation for parallelization since:

  • I can process each motif independently
  • I can assign each motif to a single processor
  • Workers/processes don’t have to communicate
  • They write their output to separate files (not a shared one, so they don’t have to fight over access to it)
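
To make these four points concrete, here is a minimal sketch of what such an independent per-motif task might look like. Everything in it is an assumption for illustration: the function name process_motif_sketch, the toy sequence, and the output file naming are hypothetical and only stand in for the real processing.

```r
# Hypothetical stand-in for the real per-motif analysis: count the
# (non-overlapping) occurrences of a motif in a sequence and write the
# result to a file that belongs to this motif alone.
process_motif_sketch <- function(motif, sequence, out_dir) {
  hits <- gregexpr(motif, sequence, fixed = TRUE)[[1]]
  n_hits <- if (hits[1] == -1) 0L else length(hits)  # -1 means "no match"
  out_file <- file.path(out_dir, paste0(motif, "_count.txt"))
  writeLines(paste(motif, n_hits), out_file)
  out_file
}

toy_sequence <- "ACGGTTCGGACGT"   # toy data, not a real genome
out_dir <- tempdir()

# Each call depends only on its own motif and writes only its own file
out_files <- vapply(c("CGG", "CGGT", "CG", "CGT"), process_motif_sketch,
                    character(1), sequence = toy_sequence, out_dir = out_dir)
```

Because no call reads or writes anything that another call touches, the calls can run in any order, or all at once.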

I quickly googled the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:

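library(parallel)

# my loooong list of motifs, shortened for illustration purposes
# (named `motifs` rather than `list` to avoid masking base::list)
motifs <- c("CGG", "CGGT", "CG", "CGT")

# Calculate the number of cores, leaving one free for the rest of the system
no_cores <- detectCores() - 1
# Initiate cluster ("FORK" is available on Unix-like systems only).
# processMotif() and the loaded data must already exist at this point:
# forked workers inherit the workspace as it is when the cluster is created.
cl <- makeCluster(no_cores, type = "FORK")

print("PARALLEL")
ptm <- proc.time() # start timer
parLapply(cl, motifs, function(motif) {
  processMotif(motif) # PROCESS EACH MOTIF INDIVIDUALLY
})
stopCluster(cl)
proc.time() - ptm # stop timer

print("NON-PARALLEL")
ptm <- proc.time() # start timer
for (motif in motifs) {
  processMotif(motif)
}
proc.time() - ptm # stop timer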

Let’s check how long the parallel version takes to run (times are in seconds):

   user  system elapsed 
908.340 300.906 432.906

And let’s compare that with the for-loop solution:

    user   system  elapsed
8544.826 3079.385 6089.453
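
The elapsed (wall-clock) times reported above are in seconds, so the speedup can be worked out directly. A small sketch of that arithmetic, using only the two elapsed values printed above (the variable names are just for illustration):

```r
elapsed_parallel <- 432.906    # seconds, parLapply version
elapsed_loop     <- 6089.453   # seconds, for-loop version

# How many times faster the parallel run finished
speedup <- elapsed_loop / elapsed_parallel
round(speedup, 1)

# The same two timings expressed in whole minutes
round(c(parallel = elapsed_parallel, loop = elapsed_loop) / 60)
```

In other words, the parallel run finished in roughly 7 minutes instead of roughly 101, about a 14-fold improvement in waiting time.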

Happy paralleLAZYing!

(The code was run on GNU/Linux with 64 processors and 500 GB of RAM.)
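
A note on portability: type="FORK" relies on process forking, which is available on Unix-like systems such as the one above but not on Windows. Below is a minimal sketch of the same pattern with a "PSOCK" cluster instead. The function count_motif and the toy sequence are hypothetical placeholders, not my actual processing code; the point is only that PSOCK workers start as fresh R sessions, so any objects they need have to be sent to them explicitly, here with clusterExport.

```r
library(parallel)

motifs <- c("CGG", "CGGT", "CG", "CGT")
toy_sequence <- "ACGGTTCGGACGT"   # hypothetical stand-in for the big file

# Hypothetical placeholder for the real per-motif processing:
# counts non-overlapping occurrences of the motif in toy_sequence
count_motif <- function(motif) {
  hits <- gregexpr(motif, toy_sequence, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

cl <- makeCluster(2, type = "PSOCK")   # 2 workers keep the sketch light
clusterExport(cl, "toy_sequence")      # workers do not inherit the workspace
counts <- parLapply(cl, motifs, count_motif)
stopCluster(cl)
```

With data as large as mine, keep in mind that each PSOCK worker holds its own copy of every exported object, whereas forked workers share the parent's memory until they modify it.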