A No BS Guide to the Basics of Parallelization in R

wrathematics

What is parallelization?
Parallelization is using multiple processing cores to, hopefully, make your programs run faster than serial code, which is the use of just one processing core.

Parallel code is not always faster than its serial counterpart (but if you're doing it right and you're careful about what you parallelize, it will be --- remember, that's your goal here).  Don't rush to parallelize something just because you can.  Look for the serious bottlenecks in your code.

Think of your code like the checkout line at a supermarket.  The shoppers are the "component jobs" of the code and the cashier is the processor.  Usually (universally?), each customer gets one cashier.  But imagine suddenly you can chop up your shopping cart into 4 pieces and pass them along to 4 cashiers.  If you're only buying 7 things and they're all easy to ring up, this is probably dumb, because in the end the cashiers are going to have to coordinate and tally for final billing (an inherent sunken time cost to this approach).  On the other hand, if you're buying half of the entire store and the items are slow to ring up (like clothing), this kind of cashier scheme might seriously speed things up for you.  My point is that for some people it makes sense to send to multiple cashiers, and for some it really doesn't make sense to do this.  The same is true of your code.

If you are a computer science nerd who wants to write a 100,000 word dissertation about why this is not an appropriate metaphor, please send your submissions to nobodycares@shutupihateyou.org.

Ok, so that's (kind of) how it works.  So why should you even bother to learn how to parallelize your code?  Because:

  1. Computers aren't really getting faster, just stuffed full of more cores
  2. Other people care about parallel code (i.e., looks good on a resume)
  3. R is slow
  4. Easy way to speed up your code (in my opinion, MUCH easier than Rcpp and, usually, vectorization)

Just remember to try basic optimization before you jump to parallelization.  Slow serial code produces slow parallel code.

 

How does parallelization work in R?

There are some things in R which usualy parallelize very naturally.  These are:

  1. lapply() calls
  2. probably most other "ply" functions
  3. for loops

As hinted at above, don't necessarily just go hog wild and replace every for loop with a parallel implementation (especially if you're on Windows, which has issues).  Good candidates for easy parallelization usually involve many independent calculations (e.g., checking many models at once and ranking by AIC/BIC/etc, bootstrapping, cross-validation, monte carlo simulation, ...).  Bad candidates for parallelization include things that couldn't possibly be bottlenecking anything, so just leave them alone.  Additionally, some things are, as far as I'm concerned, immune to parallelization; but I'm dumb, so that's probably wrong.

As of R version 2.14, your parallel tools are in the parallel package.  This comes with R, so you don't even have to install it, assuming R is installed (if you don't have R installed and you are still reading this, you have some odd priorities in your life).  Just load it up by calling 'library(parallel)' in R.  This package basically merges the old multicore and snow packages, so that you get mclapply() from multicore and the clusterApply family from snow.  You can also use the foreach package for parallelization, but I personally don't.  Experiment for yourself and come to your own conclusions, but for the remainder, I'm not going to talk about it.

Thankfully, parallelization in R is fairly straight forward.  There are just a few functions you have to become familiar with, and then you let your operating system do most of the magic behind the scenes.  The main workhorses for parallelization in R via library(parallel) are:

  1. mclapply() --- relies on system forking.  Works on any POSIX-like operating system (Linux, Mac OS X, etc---basically all but Windows).  Usually reasonably fast.
  2. clusterApply(), clusterApplyLB(), etc --- relies on voodoo.  Necessary if you're working on Windows.  Will work on POSIX OS's too if you're lazy, but mclapply() is pretty much guaranteed to be faster.  Often slower than serial implementations (which is rare for mclapply()).
  3. The detectCores() function determines the number of cores available to R, which can be very handy, if only for setting defaults.

These functions that facilitate parallelization, mclapply() and the clusterApply functions, all behave kind of like lapply().  In fact, mclapply() is kind of giving the game away by its name, as it is the multiple core lapply.  If you're familiar with lapply(), then you're already halfway there.

 

Stop typing things no one wants to read and give me an example

We're going to simulate 1,000,000 runs of the Monty Hall game (switching every time) to see if we've been lied to all these years about the probability of winning with this switching strategy really being 2/3.  The serial (one core) implementation might look like this:

system.time({
	doors <- 1:3
	runs <- 1e6
	game.outputs <- numeric(runs)
	for (run in 1:runs){
		prize.door <- sample(doors, size=1)
		choice <- sample(doors, size=1)

		if (choice!=prize.door) game.outputs[run] <- 1 # Always switch
	}
	avg <- mean(game.outputs)
})[3]

When we do this, after 7.354 seconds we get a winning average of 0.666297, or approximately 0.666.  Clearly this result shows that Bayes' Theorem is the work of Satan.  From Satan to voodoo, let's run the simulation on multiple cores. Using the code below (Edit:  Correction thanks to Owe Jessen)

#########################################
# ---------------------------------------
# Functions
# ---------------------------------------
#########################################

# One simulation of the Monty Hall game
onerun <- function(.){ # Function of no arguments
	doors <- 1:3
	prize.door <- sample(doors, size=1)
	choice <- sample(doors, size=1)

	if (choice==prize.door) return(0) else return(1) # Always switch
}

# Many simulations of Monty Hall games
MontyHall <- function(runs, cores=detectCores()){
	require(parallel)
	# clusterApply() for Windows
	if (Sys.info()[1] == "Windows"){
		cl <- makeCluster(cores)
		runtime <- system.time({
			avg <- mean(unlist(clusterApply(cl=cl, x=1:runs, fun=onerun)))
		})[3]
		stopCluster(cl) # Don't forget to do this--I frequently do

	# mclapply() for everybody else
	} else {
		runtime <- system.time({
			avg <- mean(unlist(mclapply(X=1:runs, FUN=onerun, mc.cores=cores)))
		})[3]
	}
	return(list(avg=avg, runtime=runtime))
}

#########################################
# ---------------------------------------
# Outputs
# ---------------------------------------
#########################################

run1 <- rbind(c(MontyHall(1e6, cores=1), "cores"=1))
run2 <- rbind(c(MontyHall(1e6, cores=4), "cores"=4))
run3 <- rbind(c(MontyHall(1e6, cores=8), "cores"=8))
rbind(run1, run2, run3)

we get

avg      runtime cores
[1,] 0.666452 8.395   1
[2,] 0.665846 2.876   4
[3,] 0.666336 2.553   8

So for us, there was some pretty good speedup in going from 1 to 4 cores (but a little slow down in going from an ordinary serial implementation to mclapply() with mc.cores=1).  Notice that the one core time here is roughly 3 times the 4 core time.  In an ideal world of perfect magical rainbows, the 4 core time would be 4 times as fast as the 1 core time.  The closer you can get to this kind of optimal scaling, the better, but getting anything close to this ideal is rare.  The kind of scaling we are seeing here is fairly typical and, frankly, not too bad.

Ok, so why the small jump from 4 to "8" cores?  Well, I don't have 8 cores; this computer only has 4.  So what's the deal?  You can set the "number of cores" argument in these functions to a higher value than what you actually have, and sometimes you can get some interesting speedup.  Slightly more on this is discussed below in point 5.

Now, if you're on Windows, I'm not sure that this will be faster in parallel than in serial.  The clusterApply family of functions have, in my experience, very odd behavior on Linux, and I can only assume that this odd behavior also exists on Windows.  What I'm trying to say is that your mileage/kilometerage(?) may vary.

UPDATE:  It strikes me that because of the fat startup costs of the clusterApply family, that this kind of scheme, while great for mclapply() (my primary tool), is inappropriate.  So what you might do instead is something like


runs <- 1e6
manyruns <- function(n) mean(unlist(lapply(X=1:(runs/4), FUN=onerun)))

library(parallel)
cores <- 4
cl <- makeCluster(cores)

# Send function to workers
tobeignored <- clusterEvalQ(cl, {
	onerun <- function(.){ # Function of no arguments
		doors <- 1:3
		prize.door <- sample(doors, size=1)
		choice <- sample(doors, size=1)
		if (choice==prize.door) return(0) else return(1) # Always switch
	}
; NULL
})

# Send runs to the workers
tobeignored <- clusterEvalQ(cl, {runs <- 1e6; NULL})
runtime <- system.time({
	avg <- mean(unlist(clusterApply(cl=cl, x=rep(runs, 4), fun=manyruns)))
})[3]
stopCluster(cl)

cbind(avg, runtime)

with output

> cbind(avg, runtime)
             avg runtime
elapsed 0.666466   3.644

So what we are doing here is chopping up the work so that, with 4 cores, each one of the 4 cores is initialized to do 1/4 of the work all at once.  It seems the title of this entry should be changed...whoopsies.

 

Other thoughts

  1. RTFM.  See 'library(help="parallel")' in R
  2. Check out the book Parallel R by McCallum and Weston, O'Reilly publishing.  I've only thumbed it, but I liked what I saw, and it's cheap.
  3. Practice and experiment!
  4. The "parallel functions" all assume you have just one input.  If you have multiple inputs you want to feed in parallel (i.e., multiple things you want to vary), this problem can easily be remedied by dumping everything into strings with separater characters, then inside the function that gets fed to mclapply/clusterApply, unpack the single input into its multiple component inputs.
  5. When using mclapply(), mc.cores is the number of forks you wish to spawn, not necessarily the exact number of physical cores you want to use (or even have).  The way cores get utilized in dealing with the forks is (thankfully) handled by the operating system.  However, as a result of this, if you experiment with different values of mc.cores, you may sometimes find non-trivial speedup by setting it to a value higher than detectCores().  Speaking from practice (not theory), the returns quickly diminish when increasing the cores beyond what your computer physically has.  This may also work for the clusterApply family, but it might not; don't know as I tend not to use those functions.

To close, even though I (jokingly) discouraged it at the top of the article, all comments are certainly welcome.


11 comments on “A No BS Guide to the Basics of Parallelization in R

  1. Corey Chivers on said:

    Great post. I really like your writing style. I\'ve been using snowfall with sfApply() and I found this quite helpful. Thanks!

  2. Owe Jessen on said:

    Thanks for the short intro. Just some nitpicks: I got an error with your example: \"Fehler in checkForRemoteErrors(val) :
    1000 nodes produced errors; first error: Objekt \'doors\' nicht gefunden\" which i was able to heal by putting doors into function onerun. Here are the results on my windows with 2 clusters:> run1 run2 run3 rbind(run1, run2, run3)
    avg runtime cores
    [1,] 0.66887 36.79 1
    [2,] 0.66498 25.68 2
    [3,] 0.66656 33.32 4

    which is considerably worse than the unthreaded result
    elapsed
    1.33

    • wrathematics on said:

      You are most correct. That\'s where doors originally was, and it seems I changed it at the last minute, and I\'m not entirely sure why, as I obviously didn\'t test it. Thanks.

  3. Erin Hodgess on said:

    Great article! Your style is excellent.

  4. Alexis on said:

    Thank you so much! Can you say a little more about multiple inputs? (The parallel and lapply documentation are rather scant) If I want to pass a function which takes arguments, for example:

    MyFunction <- function(a,b,c) {
    stuff happens here
    return(result)
    }

    mclapply(X=1:1000, FUN=MyFunction(a,b,c), mc.cores=detectCores())

    I *think* my trying to pass arguments like this is what is giving: Error in get(as.character(FUN), mode = "function", envir = envir) : n object 'FUN' of mode 'function' was not foundn"

    Thanks!
    Alexis

    • wrathematics on said:

      Yes, this is not allowed. You would need to do something like collapse all of your inputs into strings with some kind of separator character. So if you want to evaluate MyFunction(a,b,c), you first need to do something like

      x <- paste(a, b, c, sep="@")

      Then make a \"new\" function that knows what to do with these string inputs, say

      mcMyFunction <- function(x){
      x <- unlist(strsplit(x, split="@"))
      MyFunction(x[1], x[2], x[3])
      }

      So in the context of mclapply(), you collapse all of your n-tuple inputs into single strings, and call the collection of them X. Then you just do

      mclapply(X=X, FUN=mcMyFunction,
      mc.cores=cores)

      • Alexis on said:

        Gosh thanks! Ok, so what if one of my arguments is some complicated data structure, like a list of lists of vectors of stuff (and things). Is there a good way to convert such things back and forth from paste-able representations?

        • wrathematics on said:

          Interesting; I\'ve never needed anything like this before, but the same principle applies. You would just want to use the indices for your lists to be the combined strings. But you\'d have to be a little careful about how you do it.

          Without knowing more of what you have in mind, it\'s a little difficult to come up with an example. The way you would want to chop things up with something that complicated would depend heavily on what you\'re actually trying to do, because certain ways of distributing the computations probably wouldn\'t make any sense. Feel free to use the contact form on my About page if you want to discuss this in more detail.

          • Alexis on said:

            From your response, I\'m not sure if I was clear or not: I want to pass the complex data structures *themselves*, not simply their indicies (the complex structures both guide and also facilitate the efficient completion of MyFunction).

          • wrathematics on said:

            I don\'t understand the dilemma. If you have everything stored in a global list (which itself may be a list of lists, whose objects themselves are lists of lists, ...), then having the indices is equivalent to having the objects. You just pass the indices to a function which knows \"what to do\" with them and then calls on the real function, much in the same way that mcMyFunction does above.

            I may not fully understand or appreciate the difficulty of your problem; an example might help, but we are quickly reaching the limit of comment depth that WordPress will allow. If you want to discuss this further, it might be best to do so in a different venue, such as email.

            Best of luck.

  5. Brandon Weinberg on said:

    \"If you are a computer science nerd who wants to write a 100,000 word dissertation about why this is not an appropriate metaphor, please send your submissions to nobodycares@shutupihateyou.org.\"

    - Not only is the cashier metaphor effective, the pre-emptive FU is perfect there. Real educators hate those losers, thank you for adding that :)

Leave a Reply

Your email address will not be published. Required fields are marked *

*


six × = forty eight

48,091 Spam Comments Blocked so far by Spam Free Wordpress

HTML tags are not allowed.