I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created a separate output for each motif.
This is a wonderful candidate for parallelization, since:
- I can process each motif independently
- I can assign each motif to a single processor
- Workers/processes don’t have to communicate
- They write their output to separate files (not a shared one, so they don't have to fight over access to it); a rough sketch of such a worker follows this list
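For concreteness, here is a rough sketch of what such a per-motif worker could look like. The processMotif() function, the sequences object (the big file already loaded into memory as a character vector), and the counting logic are all hypothetical stand-ins for my real analysis; the only point is that each call reads shared data and writes its own output file:

processMotif <- function(motif) {
  # count occurrences of the motif in each sequence (base R only)
  hits <- vapply(sequences, function(s) {
    m <- gregexpr(motif, s, fixed = TRUE)[[1]]
    if (m[1] == -1) 0L else length(m)
  }, integer(1))
  # every worker writes to its own file, so there is no shared file to fight over
  write.csv(data.frame(motif = motif, hits = hits),
            file = paste0("motif_", motif, ".csv"), row.names = FALSE)
}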
I quickly googled my way to the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:
library(parallel)

motif_list <- c("CGG", "CGGT", "CG", "CGT") # my loooong list of motifs, shortened for illustration purposes

# Calculate the number of cores, leaving one free
no_cores <- detectCores() - 1

# Initiate a fork-based cluster (fork clusters are only available on Unix-alike systems)
cl <- makeCluster(no_cores, type = "FORK")

print("PARALLEL")
ptm <- proc.time() # start timer
parLapply(cl, motif_list,
          function(motif)
            processMotif(motif) # PROCESS EACH MOTIF INDIVIDUALLY
)
stopCluster(cl)
proc.time() - ptm # stop timer

print("NON-PARALLEL")
ptm <- proc.time() # start timer
for (motif in motif_list) {
  processMotif(motif)
}
proc.time() - ptm # stop timer
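Two asides that aren't part of the snippet above but are worth knowing. First, type = "FORK" is only available on Unix-alike systems; on Windows you would use a PSOCK cluster and explicitly export whatever the workers need. Second, the same parallel package offers mclapply() as a one-call shorthand for this fork-based pattern. A rough sketch of both, reusing motif_list and no_cores from above and still assuming processMotif() is defined:

# mclapply(): one-call fork-based parallelism (Unix-alike systems only)
mclapply(motif_list, processMotif, mc.cores = no_cores)

# PSOCK cluster: works on Windows too, but workers start with empty environments,
# so functions and data they need must be exported explicitly
cl <- makeCluster(no_cores, type = "PSOCK")
clusterExport(cl, "processMotif") # plus any objects processMotif relies on
parLapply(cl, motif_list, processMotif)
stopCluster(cl)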
Let’s check how much time the parLapply chunk takes to run:
user system elapsed
908.340 300.906 432.906
And let’s compare that with the for-loop solution:
user system elapsed
8544.826 3079.385 6089.453
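In wall-clock terms (the elapsed column), the parallel version is roughly 14 times faster: about 6089 seconds versus 433 seconds.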
Happy paralleLAZYing!
(The code was run on GNU/Linux with 64 processors and 500 GB of RAM.)