I found myself loading and processing huge files in R. For each of those files, I needed to analyze many DNA motifs, so I naturally started with a for loop that, once my gargantuan file was comfortably sitting in memory, went motif by motif and created an output for each motif separately.
This is a wonderful candidate for parallelization, since:
- I can process each motif independently
- I can assign each motif to a single processor
- Workers/processes don’t have to communicate
- Each worker writes its output to a separate file (not a shared one, so they don’t have to fight over access to it); a minimal sketch of such a worker function follows this list
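The post doesn’t show processMotif itself, so here is a minimal sketch of what a per-motif worker could look like; the dna_sequences vector and the output file naming are hypothetical stand-ins for the actual analysis:

# Hypothetical worker: count occurrences of one motif in each sequence
# and write the counts to that motif's own output file.
processMotif <- function(motif) {
  hits <- gregexpr(motif, dna_sequences, fixed = TRUE)        # dna_sequences: character vector of reads
  counts <- vapply(hits, function(h) sum(h > 0), integer(1))  # gregexpr returns -1 when there is no match
  write.csv(data.frame(sequence_id = seq_along(counts), count = counts),
            file = paste0("motif_", motif, ".csv"), row.names = FALSE)
}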
I quickly googled the following resource by Max Gordon, which demonstrates that a basic implementation of parallelization (in simple cases like this one) can be very straightforward in R:
library(parallel)

motif_list <- c("CGG", "CGGT", "CG", "CGT")  # my loooong list of motifs, shortened for illustration purposes

# Calculate the number of cores
no_cores <- detectCores() - 1

# Initiate cluster
cl <- makeCluster(no_cores, type = "FORK")

print("PARALLEL")
ptm <- proc.time()  # start timer
parLapply(cl, motif_list, function(motif)
  processMotif(motif)  # PROCESS EACH MOTIF INDIVIDUALLY
)
stopCluster(cl)
proc.time() - ptm  # stop timer

print("NON-PARALLEL")
ptm <- proc.time()  # start timer
for (motif in motif_list) {
  processMotif(motif)
}
proc.time() - ptm  # stop timer
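One portability note: FORK clusters are only available on Unix-like systems, so the snippet above won’t run on Windows. A PSOCK cluster (the default) works everywhere, but you have to ship your function and data to the workers explicitly with clusterExport. A minimal sketch, assuming processMotif and a hypothetical dna_sequences object holding the loaded file live in the global environment:

library(parallel)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)  # default PSOCK cluster, works on Windows too

# Copy the worker function and the loaded data to each worker process
clusterExport(cl, c("processMotif", "dna_sequences"))

parLapply(cl, motif_list, processMotif)
stopCluster(cl)

On Linux or macOS, mclapply(motif_list, processMotif, mc.cores = no_cores) is an even shorter fork-based equivalent.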
Let’s check how much time the parallel version takes to run:
user system elapsed
908.340 300.906 432.906
And let’s compare that with the for-loop solution:
user system elapsed
8544.826 3079.385 6089.453

That’s roughly a 14x speedup in elapsed (wall-clock) time.
Happy paralleLAZYing!
(The code was run on GNU/Linux with 64 processors and 500GB RAM.)