I am very happy to introduce a new set of packages that has just hit CRAN. We are calling it the *Programming with Big Data in R Project*, or **pbdR** for short (or as I like to jokingly refer to it, '**p**retty **b**ad for **d**yslexics'). You can find out more about the **pbdR** project at http://r-pbd.org/

The packages form a natural programming framework that is, from the user's point of view, a very simple extension of R's native syntax, yet one that runs in parallel over MPI and handles big data sets with ease. Much of the parallelism we offer is implicit, meaning that you can keep using code you already have while achieving massive performance gains.

The packages are free as in beer, and free as in speech. You could call them "free and open source", or libre software. The source code is free for everyone to look at, extend, re-use, whatever, forever.

At present, the project consists of 4 packages: **pbdMPI**, **pbdSLAP**, **pbdBASE**, and **pbdDMAT**. The **pbdMPI** package offers simplified hooks into MPI, making explicit parallel programming over MPI much simpler, and sometimes much faster, than with **Rmpi**. Next up the chain is **pbdSLAP**, a set of libraries pre-bundled for the R user to greatly simplify complicated installations. The last two packages, **pbdBASE** and **pbdDMAT**, offer high-level R syntax for computing with distributed matrix objects at low-level programming speed. The only system requirements are R and an MPI installation.
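To give a flavor of the explicit side, here is a minimal "hello world" sketch using **pbdMPI** (assuming the package is installed and an MPI runtime is available; launched with something like `mpiexec -np 4 Rscript hello.r`):

```r
# hello.r -- a minimal pbdMPI sketch; run with: mpiexec -np 4 Rscript hello.r
library(pbdMPI)

init()                                     # set up the MPI communicator
comm.print(comm.size(), all.rank = FALSE)  # total number of ranks, printed once
cat("Hello from rank", comm.rank(), "\n")  # each rank reports itself
finalize()                                 # shut down MPI cleanly
```

The point is that the MPI boilerplate is reduced to a handful of R-flavored calls.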

We have attempted to document the project extensively in a collection of package vignettes; but really, if you are already using R, then much of this will already be familiar to you. Want to take the SVD of a matrix? Just use svd(x) or La.svd(x), only "x" is now a distributed matrix object.
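In other words, the serial idiom carries over nearly unchanged. A sketch of what that looks like (assuming **pbdDMAT** is installed and the script is launched via mpiexec; here the ordinary matrix is built on every rank purely for illustration):

```r
# svd.r -- sketch of the drop-in idiom; run with: mpiexec -np 4 Rscript svd.r
library(pbdDMAT)

init.grid()                                  # set up the 2-d process grid

x  <- matrix(rnorm(100 * 100), nrow = 100)   # an ordinary R matrix
dx <- as.ddmatrix(x)                         # distribute it as a ddmatrix

s <- svd(dx)                                 # same syntax as serial R's svd()
comm.print(s$d[1:5])                         # leading singular values, printed once

finalize()
```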

Not only can we perform computations faster this way, but because our objects are distributed, the sky is the limit (RAM permitting) for how much data you can analyze. By contrast, core R through version 2.15.1 has an indexing limitation: objects are indexed by 32-bit integers, capping a vector or matrix at 2^31 - 1 elements, so the largest square matrix one can analyze is 46340 by 46340 (46341^2 already exceeds 2^31 - 1). We have routinely been working with much larger matrices using our system.

If I sound excited about this work, it's because I am. However, I would caution that the work is still in its infancy. The team consists of 4 people (with 2 main developers), none of whom are developing the project full time, and the project began in earnest in July. We have accomplished some important work, but there is more yet to be done. We hope that others will become as excited by this project as we are, though I would caution that this is the initial release with version number 0.1.

We have made every effort (and then some) to test this in every way imaginable and make our first release as stable as possible. But the fact of the matter is that all software contains bugs, and our documentation probably contains typos or outright errors. If you have questions/comments/bugs/suggestions, feel free to drop by our Google group.

Simply put, we want this project to kick ass. Every one of us on the project is an R lover. We love the language, we love the community, and we're *tired* of people talking about how R isn't relevant because it doesn't scale. And frankly, we very much want to give critics of the language something to put in their pipes and smoke.

**Current Work and Future Plans**

So far, we have a distributed dense matrix class and many methods for this class, which act as copycat functions that look and behave very much like core R's own. These include "[", apply(), "+", "%*%", prcomp(), and so on.
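Under the same assumptions as before (**pbdDMAT** installed, launched via mpiexec), a few of those copycat methods in action might look like this sketch:

```r
# methods.r -- sketch of copycat methods on a ddmatrix; run via mpiexec
library(pbdDMAT)

init.grid()

x  <- matrix(rnorm(1000 * 10), nrow = 1000)
dx <- as.ddmatrix(x)             # distribute the ordinary matrix

sds <- apply(dx, 2, sd)          # column standard deviations, in parallel
xtx <- t(dx) %*% dx              # crossproduct, same syntax as serial R
sub <- dx[1:10, ]                # indexing works too

finalize()
```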

And here's the 'but': we do not yet have lm(), though lm.fit() should be coming "soon". It is a surprisingly difficult problem to solve, even if you know a thing or two about computational linear algebra. On this front, we have learned much from core R's trailblazing, especially in dealing with a rank-deficient model matrix.
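To see what serial R already contends with here, a quick illustration in plain serial R (no pbdR needed): with an exactly collinear column, the model matrix loses rank, and lm() has to pivot and flag the aliased coefficient. Reproducing that behavior on a distributed matrix is a large part of the difficulty.

```r
# Rank deficiency in serial R: lm() pivots and reports NA for aliased terms
set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1                     # perfectly collinear with x1
X  <- cbind(1, x1, x2)

qr(X)$rank                       # 2, not 3: the model matrix is rank deficient
coef(lm(rnorm(20) ~ x1 + x2))    # coefficient for x2 comes back as NA
```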

**Under the Hood**

This work very much stands on the shoulders of giants. We use proven high-performance libraries to power much of the analytics for the user. At the heart of everything is MPI, which is no big surprise to anyone who read the article title. Beyond that, much of the computation is handled by the well-known linear algebra library ScaLAPACK. This is a proven, high-performance library for dense linear algebra, and a scalable extension of the libraries that power serial R's own linear algebra operations. The version of ScaLAPACK that we ship is the free version from netlib, although it is possible to use commercial distributions of ScaLAPACK, such as the one in MKL, with our system (instructions are in the **pbdSLAP** vignette).

**Scaling R to New Heights**

We have already used these packages to evaluate some big matrices with some big machines. But we aren't planning to stop here.

One interesting note is that we can now officially run R at a scale where **Rcpp** is actually slowing things down in an interesting way. When you start up 10,000 R processes, you basically overwhelm even a parallel file system as it tries to load the executables and libraries. Our initial attempts to start up all the R interpreters at that scale took over an hour. Add to that the fairly hefty 20MB (WRONG, SEE EDIT BELOW) **Rcpp** package, and it takes even longer just to load everything up. So while I *love* **Rcpp**, and strongly, emphatically encourage it to everyone wanting to speed up R at a sub-500-processor scale, looking forward, it's just too hefty a cost for us and this project. So while several of our packages currently depend on **Rcpp**, that will probably change with our next release (sorry Dirk :[ ).

Early benchmarks are promising. We have, for example, performed principal components analysis via singular value decomposition on matrices as large as 100,000x100,000 in under an hour. We have solved systems of 70,000 equations in 70,000 unknowns. And we have scaled computations from 2 to --- true to the title --- 12,000 cores.

In short, R now scales.

Edit: Whoopsies, as Dirk correctly points out, this is not the correct figure. We're finding that, in use, Rcpp takes about 3.5MB of RAM, which is, believe it or not, still somewhat hefty at a 10,000+ core scale, but nowhere near the insanity of 20MB.

Awesome

Thanks!

This looks like a very interesting and exciting new project. I just pushed this up on r-bloggers.com and I hope you will find more contributors to this project.

Good luck!

Tal

Awesome, thank you!

your link for “put that in your pipe and smoke it” is wrong. you meant to link to the Dowager Countess of Grantham saying it. http://www.youtube.com/watch?v=DfNg47FsOU4 You’re welcome.

I'm suddenly craving tea and boiled food.

Hi,

Nice post, and extremely promising packages--as I already said in private mail. Now, could you explain to me how you measured the cost of Rcpp to be 20MB? On my 64-bit Linux system, libRcpp.so is 344KB, or roughly 1/60 of the size you claim. Did you forget to turn off the `-g` switch for debugging symbols?

Cheers, Dirk

Regarding your spooling-up problems: I saw someone have similar issues when running parallel Python stuff. If you skip to about 19:00, you can hear him talk about his issue. I think I heard that they solved their problem, though I don't know how. You might want to get in touch to see if their problem is similar to yours and if their solution can be adapted.

http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-python-for-high-performance-computing-4899211

Very interesting; I'll check this out later. Thanks.

A big part of the problem is that the R interpreter is pretty bloated, and so dragging around a bunch of interpreted stuff that you aren't using while running jobs in batch doesn't make a lot of sense. So eventually, we're probably going to have to build a version of R that throws away a bunch of things that are unneeded for us.

It's kind of funny though to be working at a scale where you have access to terabytes of memory, but 50MB is considered monstrously large.

Sounds marvelous! Glad to see it's under the GPL. Thanks for doing this.

We're all big GPL fans, so you can reasonably expect our work to always be GPL >= 2.

That's an impressive start. Now get back to work and finish lm ASAP!

Mean Old (former) Boss

You're not the boss of me!!!

That's a fantastic start!!!

This is pretty impressive. I'm looking forward to seeing the modelling functions broadened.

What seems to be surprisingly difficult about a distributed lm, if I may ask?

By the way, do you also have a distributed data.frame?

I don't understand why Rcpp is a bottleneck. Are you using a shared filesystem for executables? Shouldn't Rcpp be loaded independently at each MPI node? At most, the contention should be equal to P at each node, where P is the number of MPI processes at each node in the cluster.