R at 12,000 Cores
I am very happy to introduce a new set of packages that has just hit the CRAN. We are calling it the Programming with Big Data in R Project, or pbdR for short (or as I like to jokingly refer to it, 'pretty bad for dyslexics'). You can find out more about the pbdR project at http://r-pbd.org/
The packages are a natural programming framework that are, from the user's point of view, a very simple extension of R's natural syntax, but running in parallel over MPI and handling big data sets with ease. Much of the parallelism we offer is implicit, meaning that you can use code you are already using while achieving massive performance gains.
The packages are free as in beer, and free as in speech. You could call them "free and open source", or libre software. The source code is free for everyone to look at, extend, re-use, whatever, forever.
At present, the project consists of 4 packages: pbdMPI, pbdSLAP, pbdBASE, and pbdDMAT. The pbdMPI package offers simplified hooks into MPI, making explicit parallel programming over much simpler, and sometimes much faster than with Rmpi. Next up the chain is pbdSLAP, which is a set of libraries pre-bundled for the R user, to greatly simplify complicated installations. The last two packages, pbdBASE and pbdDMAT, offer high-level R syntax for computing with distributed matrix objects at low-level programming speed. The only system requirements are that you have R and an MPI installation.
We have attempted to extensively document the project in a collection of package vignettes; but really, if you are already using R, then much of the work is already familiar to you. Want to take the svd of a matrix? Just use svd(x) or La.svd(x), only "x" is now a distributed matrix object.
Not only can we perform computations faster in this way, but because our objects are distributed, then assuming you have the ram, the sky is the limit for the amount of data you can analyze in R 2.15.1 and earlier. These versions of core R have an indexing limitation, where, for example, the largest square matrix one can analyze is 46340 by 46340. However, we have routinely been working with much larger matrices using our system.
If I sound excited about this work, it's because I am. However, I would caution that the work is still in its infancy. The team consists of 4 people (with 2 main developers), none of whom are developing the project full time, and the project began in earnest in July. We have accomplished some important work, but there is more yet to be done. We hope that others will become as excited by this project as we are, though I would caution that this is the initial release with version number 0.1.
We have made every effort (and then some) to test this in every way imaginable and make our first release as stable as possible. But the fact of the matter is that all software contains bugs. Our documentation probably contains typos or outright errors. If you have questions/comments/bugs/suggestions, feel free to drop by our google group.
Simply put, we want this project to kick ass. Every one of us on the project is an R lover. We love the language, we love the community, and we're tired of people talking about how R isn't relevant because it doesn't scale. And frankly, we very much want critics of the language to have something we can put into their pipes to smoke.
Current Work and Future Plans
So far, we have a distributed dense matrix class and many methods for this class which act as copycat functions that look and behave very much like core R's functions. These include "[", apply(), "+", "%*%", prcomp(), ... .
And here's the 'but'. We do not yet have lm(). Though lm.fit() should be coming "soon". It is a surprisingly difficult problem to resolve, even if you know a thing or two about computational linear algebra. On this front, we have learned much from core R's trail blazing, especially when dealing with a rank deficient model matrix.
Under the Hood
This work very much stands on the shoulders of giants. We use proven high performance libraries to power much of the analytics for the user. At the heart of everything is MPI, which is no big surprise to anyone who read the article title. However, much of the computations are handled by the well-known linear algebra library ScaLAPACK. This is a proven, massively high performance library for dense linear algebra, which is a scalable extension of the libraries that power serial R's own linear algebra operations. The version of ScaLAPACK that we ship is the free version from netlib, although it is possible to use commercial distributions of ScaLAPACK, such as MKL, with our system (instructions are in the pbdSLAP vignette).
Scaling R to New Heights
We have already used these packages to evaluate some big matrices with some big machines. But we aren't planning to stop here.
One interesting note is that we can now officially run R at a scale where Rcpp is actually slowing things down in an interesting way. When you start up 10,000 R processes, you are basically overwhelming even a parallel file system as it tries to load the executables and libraries. Our initial attempts to start up all the R interpreters at that scale took over an hour. Add to that the fairly hefty 20mb(WRONG, SEE EDIT BELOW) Rcpp package, and this increases the time it takes just to load everything up. So while I love Rcpp, and strongly, emphatically encourage it to everyone wanting to speed up R at a sub-500 processor scale, looking forward, it's just too hefty a cost for us and this project. So while several of our packages currently depend on Rcpp, that will probably change with our next release (sorry Dirk :[ ).
Early benchmarks are promising. We have done, for example, principal components analyses via singular value decomposition with matrices as large as 100,000x100,000 in under an hour. We have solved systems of 70,000 equations in 70,000 unknowns. We have scaled computations from 2 to --- true to the title --- 12,000 cores.
In short, R now scales.
Edit: Whoopsies, as Dirk correctly points out, this is not the correct figure. We're finding that in use, Rcpp is taking about 3.5MB of ram, which is, believe it or not, still somewhat hefty for a 10,000+ core scale, but nowhere near the insanity of 20MB.