How Much of R is Written in R?
My boss sent me an email (on my day off!) asking me just how much of R is written in the R language. This is very simple to answer if you use R and a Unix-like system. It also gives me a good excuse to defend the title of this blog. It's librestats, not projecteulerstats, after all.
So I grabbed the R-2.13.1 source package from CRAN and wrote up a little script that looks at all .R, .c, and .f files in the archive and records the language (R, C, or Fortran), the number of lines of code, and the file the code came from. Then it's just a matter of dumping all that to a csv (converted to .xls (in LibreOffice) because WordPress hates freedom).
We'll talk in a minute about just how you would generate that csv--but first let's address the original question.
A respectable majority of the source code files of core R (729 of the 1360 examined) are written in R:
At first glance, it seems like Fortran doesn't give much of a contribution. However, when we look at the proportion of lines of code, we see something more reasonable:
So there you have it. Roughly 22% of R is written in R. I know some people want R to be written in R for some crazy reason; but really, if anything, that 22% is too high. Trust me, you really want C and Fortran to be doing all the heavy lifting so that things stay nice and peppy.
Besides, this is a fairly irrelevant issue, in my opinion. What matters is that people outside of Core R are writing in R. Look at the extra packages repo and you'll see a very different story from the above graphic. That's something SAS certainly can't say, since people who want to do anything other than call some cookie-cutter SAS proc have to use IML or that ridiculous SAS macro language--each of which is somehow even more of a hilarious mess than base SAS.
Ok, so how do we get that data? I actually have a much better script than the one I'm about to describe. The new one automatically grabs every source package from CRAN that you don't already have and starts digging in on them, dumping everything out into one big csv so you can watch trends over time. It's interesting to see the transition from R being almost entirely (92%) in C to seeing it slowly drop down to ~52%. But that's a different post for a different day because I have a few kinks to work out with that script before I would feel comfortable releasing it.
So here's how this system works. It's basically the dumbest possible solution; I'm pretty good at those, if I may say so myself. The shell script hops into the R-version/src/ folder and gets a line count of each .R, .c, and .f file under it. That's it; here it is:
#!/bin/sh
outdir="/path/to/where/you/want/the/csv/dumped"
rdir="/path/to/R/source/root/directory/to/be/examined" #eg, ~/R-2.13.1/
cd "$rdir/src"
# Each sed chain turns "123 ./dir/subdir/file.R" into "123,file.R".
# The patterns passed to find are quoted so the shell does not expand them first.
for rfile in `find . -name "*.R"`
do
  loc=`wc -l $rfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
  echo "R,$loc" >> $outdir/r_source_loc.csv
done
for cfile in `find . -name "*.c"`
do
  loc=`wc -l $cfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
  echo "C,$loc" >> $outdir/r_source_loc.csv
done
for ffile in `find . -name "*.f"`
do
  loc=`wc -l $ffile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
  echo "Fortran,$loc" >> $outdir/r_source_loc.csv
done
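For what it's worth, here is a minimal sketch of the same counting idea without the sed gymnastics. It is an illustration only, not the script that produced the numbers in this post: the count_loc function and its three arguments are hypothetical names made up for the example. It assumes file names contain no newlines, and it writes the same language,lines,filename rows to standard output.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): count_loc LABEL PATTERN DIR
# Prints one "LABEL,lines,filename" row for each file under DIR matching PATTERN.
count_loc() {
    label=$1
    pattern=$2
    dir=$3
    find "$dir" -type f -name "$pattern" | while IFS= read -r f
    do
        # Reading from stdin keeps the file name out of wc's output;
        # tr strips the padding some wc implementations add.
        n=$(wc -l < "$f" | tr -d ' ')
        echo "$label,$n,$(basename "$f")"
    done
}

# Only run when a source directory is given as the first argument.
if [ -n "${1:-}" ]; then
    count_loc R '*.R' "$1"
    count_loc C '*.c' "$1"
    count_loc Fortran '*.f' "$1"
fi
```

Saved as, say, count_loc.sh (also a made-up name), it could be run as `sh count_loc.sh ~/R-2.13.1/src > r_source_loc.csv`. Using basename instead of a fixed number of sed substitutions means the directory depth no longer matters.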
Then the R script just does exactly what you'd think, given the data (take a look at the "csv" for examples).
r.loc <- read.csv("r_source_loc.csv",header=FALSE)
a <-r.loc[which(r.loc[1] == "R"),][2]
b <-r.loc[which(r.loc[1] == "C"),][2]
c <-r.loc[which(r.loc[1] == "Fortran"),][2]
files.total <- length(a[,1])+length(b[,1])+length(c[,1])
loc.total <- sum(a)+sum(b)+sum(c)
cat(sprintf("\nNumber .R source files:\t\t %d\nNumber .c source files:\t\t %d\nNumber .f source files:\t\t %d\n",length(a[,1]),length(b[,1]),length(c[,1])))
cat(sprintf("-------------------------------------"))
cat(sprintf("\nTotal source files examined:\t %d\n\n",length(a[,1])+length(b[,1])+length(c[,1])))
cat(sprintf("\nLines of R code:\t %d\nLines of C code:\t %d\nLines of Fortran code:\t %d\n",sum(a),sum(b),sum(c)))
cat(sprintf("-------------------------------"))
cat(sprintf("\nTotal lines of code:\t %d\n\n",loc.total))
cat(sprintf("\nAmong all lines of code being either R, C, or Fortran:\n"))
cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in Fortran:\t %f\n",100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total))
png("pct_r_source_files.png")
barplot(c(length(a[,1])/files.total,length(b[,1])/files.total,length(c[,1])/files.total),main="Percent of Core R Sourcecode Files",names.arg=c("R","C","Fortran"))
dev.off()
png("pct_r_code.png")
barplot(c(100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total),main="Percent of Core R Lines of Code",names.arg=c("R","C","Fortran"))
dev.off()
From the R script, we can get precise figures, which I prefer to pictures any day. But I seem to be an outlier in this regard...
Number .R source files:          729
Number .c source files:          586
Number .f source files:          45
-------------------------------------
Total source files examined:     1360

Lines of R code:         149520
Lines of C code:         346778
Lines of Fortran code:   175409
-------------------------------
Total lines of code:     671707

Among all lines of code being either R, C, or Fortran:
% code in R:             22.259705
% code in C:             51.626379
% code in Fortran:       26.113916
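As a sanity check, the percentages can also be recomputed straight from the csv without R. This is a hedged sketch, assuming rows of the form language,lines,filename as in r_source_loc.csv. The summarize_loc name is hypothetical, and the output format (language, total lines, percent to two decimals) is my own choice rather than anything the R script prints.

```shell
#!/bin/sh
# Hypothetical helper: summarize_loc CSV
# Sums column 2 (lines) per value of column 1 (language) and prints
# "language lines percent" for each language, sorted by language name.
summarize_loc() {
    awk -F, '
        { loc[$1] += $2; total += $2 }
        END {
            for (lang in loc)
                printf "%s %d %.2f\n", lang, loc[lang], 100 * loc[lang] / total
        }
    ' "$1" | LC_ALL=C sort
}
```

Fed the three totals above (149520, 346778, and 175409 lines), this gives 22.26% for R, 51.63% for C, and 26.11% for Fortran, which agrees with the R output after rounding.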

Well done!
I have once thought about this question, but did not analyze the source code.
Thanks!
Scale is screwed up on your first graph
Hah, good catch. The second one is a little wonky too, now that you mention it.
The counts for R 2.13.0 are surprisingly larger:
R 778
c 599
f 45
h 240
S 4
These were obtained using the numbers program (http://shape-of-code.coding-guidelines.com/2011/05/30/searching-for-inaccurate-literals-in-r/) which operates directly on the uncompressed tar file.
This is great. That 22% should be rewritten in a high-level language with 0-indexed arrays, sensible data structures, and normal scope rules. This wouldn't look much like R (it'd just be a bunch of libraries in a high-level language), but it would drastically increase the uptake of the R libraries by users (the lousiness and novelty of the S language is the biggest barrier). Anyone interested in joining me?
I'd definitely be inclined to use that, but that kind of development is way above me.
Sounds like a great idea, though!
nice post. you may be interested in the cloc tool
Thanks, I'd never heard of cloc before. It's quite fancy.
Why not use R to do this?
i.e. use R to tabulate the file types in R's source code.
e.g.
rUrl <- "http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz"
(temp <- tempfile(fileext = ".tar.gz"))
# may take a little while to grab a 20mb file.
system.time(download.file(rUrl, temp))
str(file.info(temp)) # 21mb file
# extract only filenames from a compressed tarred file (on windows)
system.time(filePaths <- untar(tarfile = temp,
files = NULL, list = T, compressed = NA,
verbose = FALSE, tar = Sys.getenv("TAR")))
str(filePaths)
head(filePaths, 2e1L)
# focus on file extensions
fileNames <- basename(filePaths)
ext <- sub(".+(\\.[A-Za-z]+$)", "\\1", fileNames)
# top 50 file types
head(numTypes <- sort(table(ext[grep("\\.", ext)]), decreasing = TRUE), 50)
# .R, .c, .f only
rcf <- unlist(strsplit("rcf", ""))
(types <- paste(".", c(rcf, toupper(rcf)), sep = ""))
(ans <- ....
> rUrl <- "http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz"
> (temp <- tempfile(fileext = ".tar.gz"))
> # may take a little while to grab a 20mb file.
> system.time(download.file(rUrl, temp))
trying URL 'http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz'
Content type 'application/x-gzip' length 22063747 bytes (21.0 Mb)
opened URL
downloaded 21.0 Mb
user system elapsed
0.19 1.21 31.24
> str(file.info(temp)) # 21mb file
'data.frame': 1 obs. of 7 variables:
$ size : num 22063747
....
> (ans <- ....
> round(ans / sum(ans) * 1e2L, 0)
.R .c .f
55 42 3
> # remove 20mb source code.
> file.remove(temp)
[1] TRUE
Cool approach, but unless I misread, this doesn't get the lines of code, which I think is also an important measure. I'm sure it's possible to get that in R (without using system() to call wc), but I don't know how.
Generally, my feeling is that R is very good at what R does, but that it's not really all that well suited for shell tasks. R can be used as a general scripting language, but it's nowhere near my top choice for that.
Plus, I just love sed's regex syntax. I honestly think it's adorable. Thanks for the cool idea, though!
Nice job. Now get back to work! Heh heh!
Bob Muenchen
(the mean boss who asked for this info on a weekend)
Best boss anybody could hope for, no joke.
Similar stuff can be found on ohloh, also with changes in time: http://www.ohloh.net/p/rproject
Wow! This is really cool. Your graphs are especially beautiful.
I was actually working on something that would grab all the historical data, and was basically done with it except for fixing some weird problems that occur when you run it more than once (to update it without having to re-do everything that's already done). I'll probably still finish it up and post it since I'm nearly done, but I have to say that you guys put me to shame!
Pingback: How Much of R is Written in R Part 2: Contributed Packages « librestats
Pingback: Ma quanto R è scritto in R? « Chemiomet[R]ia
Pingback: Introduction to programming in C/C++ | NerdaHolyC