I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created a separate output for each one.
This is a wonderful candidate situation for parallelization since:
- I can process each motif independently
- I can assign each motif to a single processor
- Workers/processes don’t have to communicate
- They write their output to separate files (not a shared one, so they don’t have to fight over access to it)
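To make the last point concrete, here is a minimal sketch of a per-motif worker that writes to its own file. Everything in it is an assumption for illustration only: the toy `sequences` vector, the `process_motif` name and its arguments, and the choice to count non-overlapping matches are not taken from my actual analysis.

```r
# Hypothetical stand-in for the large data set held in memory.
sequences <- c("ACGGTCGG", "TTCGTACG", "CGGTCGGT")

# Hypothetical worker: counts non-overlapping matches of one motif in each
# sequence and writes the counts to a file named after that motif.
process_motif <- function(motif, seqs, out_dir) {
  hits <- gregexpr(motif, seqs, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions.
  counts <- vapply(hits, function(h) sum(h > 0), integer(1))
  out_file <- file.path(out_dir, paste0(motif, ".txt"))
  writeLines(as.character(counts), out_file)
  out_file
}

out_dir <- tempdir()
files <- vapply(c("CGG", "CGT"), process_motif, character(1),
                seqs = sequences, out_dir = out_dir)
```

Because each call touches only its own output file, the calls can run in any order, or at the same time, without any locking.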
I quickly googled the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:
    library(parallel)

    # My loooong list of motifs, shortened for illustration purposes.
    # (Named `motifs` rather than `list` to avoid masking base::list.)
    motifs <- c("CGG", "CGGT", "CG", "CGT")

    # Calculate the number of cores, leaving one free
    no_cores <- detectCores() - 1

    # Initiate cluster (type = "FORK" works on Unix-like systems only)
    cl <- makeCluster(no_cores, type = "FORK")

    print("PARALLEL")
    ptm <- proc.time()  # start timer
    # Process each motif individually
    parLapply(cl, motifs, function(motif) processMotif(motif))
    stopCluster(cl)
    proc.time() - ptm   # stop timer

    print("NON-PARALLEL")
    ptm <- proc.time()  # start timer
    for (motif in motifs) {
      processMotif(motif)
    }
    proc.time() - ptm   # stop timer
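A FORK cluster is not available on Windows. As a hedged sketch of what a more portable variant might look like, the snippet below uses a PSOCK cluster instead. The `processMotif` body here is a hypothetical placeholder (it just returns the motif length), and the worker count of 2 is an arbitrary assumption.

```r
library(parallel)

motifs <- c("CGG", "CGGT", "CG", "CGT")

# Hypothetical placeholder for the real per-motif analysis.
processMotif <- function(motif) nchar(motif)

# PSOCK workers start as fresh R sessions, so anything they need
# (functions, data) has to be exported to them explicitly.
cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, "processMotif", envir = environment())

results <- parLapply(cl, motifs, function(motif) processMotif(motif))
stopCluster(cl)
```

The trade-off is that exported objects are copied to every worker, which can be costly for a very large in-memory file. Forked workers instead share the parent's memory until they modify it.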
Let’s check how much time this chunk takes to run:
       user   system  elapsed
    908.340  300.906  432.906
And let’s compare that with the for-loop solution:
        user    system   elapsed
    8544.826  3079.385  6089.453
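To put those two elapsed times side by side, here is a quick back-of-the-envelope calculation. The worker count of 63 is an assumption: it is what `detectCores() - 1` would give if `detectCores()` reported 64 on this machine.

```r
# Elapsed (wall-clock) times in seconds, copied from the output above.
parallel_elapsed <- 432.906
serial_elapsed   <- 6089.453

# How many times faster the parallel run finished.
speedup <- serial_elapsed / parallel_elapsed

# Assumed number of workers (64 processors minus one).
n_workers <- 63

# Fraction of the ideal n_workers-fold speedup actually achieved.
efficiency <- speedup / n_workers

round(c(speedup = speedup, efficiency = efficiency), 2)
```

That works out to a roughly 14-fold speedup. This is well short of the ideal for that many workers, but still the difference between well over an hour and a few minutes.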
Happy paralleLAZYing!
(The code was run on GNU/Linux with 64 processors and 500GB RAM.)
Nice tutorial on implementing parallel processing in R.
Modern laptops and PCs have multi-core processors and a sufficient amount of memory, and one can use both to generate outputs quickly.
Once you learn how to parallelize your code, your only regret will be not having learned it sooner.