• Great post. I really like your writing style. I've been using snowfall with sfApply() and I found this quite helpful. Thanks!

  • Owe Jessen

    Thanks for the short intro. Just some nitpicks: I got an error with your example: "Error in checkForRemoteErrors(val) :
    1000 nodes produced errors; first error: object 'doors' not found", which I was able to fix by moving doors into the function onerun. Here are the results on my Windows machine with 2 clusters:

    > rbind(run1, run2, run3)
             avg runtime cores
    [1,] 0.66887   36.79     1
    [2,] 0.66498   25.68     2
    [3,] 0.66656   33.32     4

    which is considerably worse than the unthreaded result.

    • wrathematics

      You are most correct. That's where doors originally was, and it seems I changed it at the last minute, and I'm not entirely sure why, as I obviously didn't test it. Thanks.

  • Erin Hodgess

    Great article! Your style is excellent.

  • Alexis

    Thank you so much! Can you say a little more about multiple inputs? (The parallel and lapply documentation is rather scant.) If I want to pass a function which takes arguments, for example:

    MyFunction <- function(a, b, c) {
      # stuff happens here
    }

    mclapply(X = 1:1000, FUN = MyFunction(a, b, c), mc.cores = detectCores())

    I *think* my trying to pass arguments like this is what is giving: Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found


    • wrathematics

      Yes, this is not allowed. You would need to do something like collapse all of your inputs into strings with some kind of separator character. So if you want to evaluate MyFunction(a,b,c), you first need to do something like

      x <- paste(a, b, c, sep="@")

      Then make a "new" function that knows what to do with these string inputs, say

      mcMyFunction <- function(x) {
        x <- unlist(strsplit(x, split = "@"))
        MyFunction(x[1], x[2], x[3])
      }

      So in the context of mclapply(), you collapse all of your n-tuple inputs into single strings, and call the collection of them X. Then you just do

      mclapply(X = X, FUN = mcMyFunction, mc.cores = detectCores())

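
      Putting the pieces together, here is a complete, runnable sketch of this string-packing workaround. MyFunction's body and the example inputs are made up purely for illustration:

```r
library(parallel)

# Hypothetical function of three arguments (stand-in for the real one)
MyFunction <- function(a, b, c) {
  as.numeric(a) + as.numeric(b) * as.numeric(c)
}

# Pack each (a, b, c) triple into a single string with "@" as separator
X <- c("1@2@3", "4@5@6")

# The "new" function that unpacks a string and calls the real one
mcMyFunction <- function(x) {
  x <- unlist(strsplit(x, split = "@"))
  MyFunction(x[1], x[2], x[3])
}

results <- mclapply(X = X, FUN = mcMyFunction, mc.cores = 2)
```

      Note this only works cleanly when the packed values are scalars whose string representations round-trip safely (no "@" inside them).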
      • Alexis

        Gosh thanks! Ok, so what if one of my arguments is some complicated data structure, like a list of lists of vectors of stuff (and things). Is there a good way to convert such things back and forth from paste-able representations?

        • wrathematics

          Interesting; I've never needed anything like this before, but the same principle applies. You would just want to use the indices for your lists to be the combined strings. But you'd have to be a little careful about how you do it.

          Without knowing more of what you have in mind, it's a little difficult to come up with an example. The way you would want to chop things up with something that complicated would depend heavily on what you're actually trying to do, because certain ways of distributing the computations probably wouldn't make any sense. Feel free to use the contact form on my About page if you want to discuss this in more detail.

          • Alexis

            From your response, I'm not sure if I was clear or not: I want to pass the complex data structures *themselves*, not simply their indices (the complex structures both guide and facilitate the efficient completion of MyFunction).

          • wrathematics

            I don't understand the dilemma. If you have everything stored in a global list (which itself may be a list of lists, whose objects themselves are lists of lists, ...), then having the indices is equivalent to having the objects. You just pass the indices to a function which knows "what to do" with them and then calls on the real function, much in the same way that mcMyFunction does above.
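
            For instance, a minimal sketch of that index-passing pattern (big.list and realFunction here are hypothetical stand-ins):

```r
library(parallel)

# A global list of arbitrarily nested structures (made-up example data)
big.list <- list(list(x = 1:3), list(x = 4:6), list(x = 7:9))

# The "real" function that operates on one full structure
realFunction <- function(item) sum(item$x)

# The worker receives only an index; forked workers can read big.list
# because fork gives each child a (copy-on-write) view of the parent's data
byIndex <- function(i) realFunction(big.list[[i]])

results <- mclapply(seq_along(big.list), byIndex, mc.cores = 2)
```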

            I may not fully understand or appreciate the difficulty of your problem; an example might help, but we are quickly reaching the limit of comment depth that WordPress will allow. If you want to discuss this further, it might be best to do so in a different venue, such as email.

            Best of luck.

    • Dante

      I know this is an old article but the original response to this comment is wrong (at least it is as of now) and may undermine the effectiveness of the internet.

      For multiple inputs:
      mclapply(X = 1:1000, FUN = MyFunction, b = your_b, c = your_c, mc.cores = detectCores())
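
      For what it's worth, mclapply() does forward extra named arguments to FUN through its ... argument, and mcmapply() handles the case where all arguments vary. A sketch of both styles (MyFunction's body is invented for illustration):

```r
library(parallel)

MyFunction <- function(a, b, c) a + b * c  # hypothetical body

# b and c fixed for every call: mclapply forwards them via '...',
# while X supplies the varying first argument a
r1 <- mclapply(X = 1:4, FUN = MyFunction, b = 10, c = 2, mc.cores = 2)

# all three arguments varying in lockstep: use mcmapply instead
r2 <- mcmapply(MyFunction, a = 1:4, b = 5:8, c = 9:12, mc.cores = 2)
```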

  • Brandon Weinberg

    "If you are a computer science nerd who wants to write a 100,000 word dissertation about why this is not an appropriate metaphor, please send your submissions to nobodycares@shutupihateyou.org."

    - Not only is the cashier metaphor effective, the pre-emptive FU is perfect there. Real educators hate those losers, thank you for adding that :)

  • Bill Engels

    Thanks for the very helpful tutorial!
    As for Monty Hall, it looks to me like your program does not include the switch. So I think what you're simulating is the probability of losing if she NEVER switches, which is 2/3 of course. Programming in the switch would add a few lines to the code.
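
    For reference, a self-contained simulation that does include the switch (a fresh sketch, not the post's original code):

```r
# Monty Hall with an explicit switch decision
onerun <- function(switch.doors) {
  doors <- 1:3
  prize <- sample(doors, 1)
  guess <- sample(doors, 1)
  # Monty opens a door hiding a goat that is not the contestant's pick
  openable <- setdiff(doors, c(prize, guess))
  opened <- if (length(openable) == 1) openable else sample(openable, 1)
  final <- if (switch.doors) setdiff(doors, c(guess, opened)) else guess
  final == prize
}

set.seed(42)
p.switch <- mean(replicate(1e4, onerun(TRUE)))   # near 2/3
p.stay   <- mean(replicate(1e4, onerun(FALSE)))  # near 1/3
```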


    Great article.
    I ran the code successfully with three different workloads. Here are my benchmark results:

    # workload: 1e4
    #           avg runtime cores
    # [1,]   0.6606    1.50     1
    # [2,]   0.6688    0.85     4
    # [3,]   0.6588    0.81     8

    # workload: 1e5
    #           avg runtime cores
    # [1,]   0.6657   15.19     1
    # [2,]  0.66728    8.45     4
    # [3,]  0.66618    8.11     8

    # workload: 1e6
    #           avg runtime cores
    # [1,] 0.666071  163.29     1
    # [2,] 0.666403   86.90     4
    # [3,] 0.666151   84.56     8

  • annoporci

    Your code for the Monty Hall problem already assumes a thorough understanding of the problem, in that it is based on the idea that you win iff your initial guess was wrong: it's a little too smart already.

  • Guillermo Ponce

    Hi, really informative post! Thanks.

    I have been dealing with a huge list of time series elements. Using mclapply() with that huge list makes a copy of it for each of the cores you are using (that's part of forking). So, if your list takes 20% of your RAM and you have 120 cores, you had better use only 5 or 6 of them, right? Well, what if I need to improve the processing time over large data sets?

    At this point I had to fall back on the perhaps less efficient gather/scatter approach, using:

    vcluster <- makePSOCKcluster(120)

    list.result <- parLapply(vcluster, big.list, myFunc)

    Using this approach I don't get R crashes due to memory issues, while with mclapply() I reach memory limits pretty quickly and eventually a crash if I try to use more than 5 cores. I was able to see these behaviors by watching htop on Linux.
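
    A corrected, self-contained sketch of that gather/scatter pattern (the list, worker function, and cluster size below are placeholders):

```r
library(parallel)

big.list <- as.list(1:100)       # stand-in for the huge list of series
myFunc <- function(x) x^2        # hypothetical per-element work

vcluster <- makePSOCKcluster(2)  # note the spelling: makePSOCKcluster
# parLapply's second argument is the data to scatter, not the cluster object
list.result <- parLapply(vcluster, big.list, myFunc)
stopCluster(vcluster)
```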

    What are your suggestions to work with large data sets and trying to parallelize a function that deals with each one of the elements on my big list (67 million elements)?