How Much of R is Written in R?

wrathematics

My boss sent me an email (on my day off!) asking me just how much of R is written in the R language. This is very simple to answer if you use R and a Unix-like system. It also gives me a good excuse to defend the title of this blog. It's librestats, not projecteulerstats, after all.

So I grabbed the R-2.13.1 source package from CRAN and wrote up a little script that would look at all .R, .c, and .f files in the archive and record the language (R, C, or Fortran), the number of lines of code, and the file the code came from. Then it's just a matter of dumping all that to a csv (converted to .xls in LibreOffice, because WordPress hates freedom).

We'll talk in a minute about just how you would generate that csv--but first let's address the original question.

By a respectable majority (729 of 1,360, or about 54%), the source code files of core R are written in R:

At first glance, it seems like Fortran doesn't contribute much. However, when we look at the proportion of lines of code, we see something more reasonable:

So there you have it. Roughly 22% of R is written in R. I know some people want R to be written in R for some crazy reason; but really, if anything, that 22% is too high. Trust me, you really want C and Fortran to be doing all the heavy lifting so that things stay nice and peppy.

Besides, this is a fairly irrelevant issue, in my opinion. What matters is that people outside of Core R are writing in R. Look at the extra packages repo and you'll see a very different story from the above graphic. That's something SAS certainly can't say, since people who want to do anything other than call some cookie-cutter SAS proc have to use IML or that ridiculous SAS macro language--each of which is somehow even more of a hilarious mess than base SAS.

OK, so how do we get that data? I actually have a much better script than the one I'm about to describe. The new one automatically grabs every source package from CRAN that you don't already have and starts digging in on them, dumping everything out into one big csv so you can watch the trend over time. It's interesting to see R go from being almost entirely (92%) in C to slowly dropping down to ~52%. But that's a different post for a different day, because I have a few kinks to work out with that script before I would feel comfortable releasing it.

So here's how this system works. It's basically the dumbest possible solution; I'm pretty good at those, if I may say so myself. Basically, the shell script hops across the R-version/src/ folder and gets a line count of each .R, .c, and .f file. That's it; here it is:

#!/bin/sh

outdir="/path/to/where/you/want/the/csv/dumped"

rdir="/path/to/R/source/root/directory/to/be/examined" # e.g., ~/R-2.13.1/
cd "$rdir/src" || exit 1

# wc prints "<lines> ./path/to/file"; sed turns that into "<lines>,file".
# The for loops assume there is no whitespace in any file name.
for rfile in $(find . -name '*.R')
do
	loc=$(wc -l "$rfile" | sed -e 's/^ *\([0-9]*\) .*\//\1,/')
	echo "R,$loc" >> "$outdir/r_source_loc.csv"
done

for cfile in $(find . -name '*.c')
do
	loc=$(wc -l "$cfile" | sed -e 's/^ *\([0-9]*\) .*\//\1,/')
	echo "C,$loc" >> "$outdir/r_source_loc.csv"
done

for ffile in $(find . -name '*.f')
do
	loc=$(wc -l "$ffile" | sed -e 's/^ *\([0-9]*\) .*\//\1,/')
	echo "Fortran,$loc" >> "$outdir/r_source_loc.csv"
done
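
To see what each loop iteration writes, it helps to run the wc-to-csv step on a single line in isolation. The sketch below is an illustration only: the function name wc_to_csv and the sample paths are made up, and it uses a single sed expression (skip leading blanks, keep the count, drop everything through the last slash), which is an assumption of this sketch and not necessarily the expression used in the script above.

```shell
#!/bin/sh

# Hypothetical helper (not part of the original script): turn one line of
# `wc -l` output, "<lines> ./some/dir/file", into "<lines>,file".
wc_to_csv() {
	# ^ *          optional leading blanks (some wc implementations pad the count)
	# \([0-9]*\)   the line count, captured as \1
	#  .*\/        a space, then everything up to and including the last slash
	sed -e 's/^ *\([0-9]*\) .*\//\1,/'
}

# Made-up sample lines in the shape that `wc -l ./path/to/file` prints.
echo "123 ./library/base/R/foo.R" | wc_to_csv
echo "     45 ./appl/bar.f" | wc_to_csv
```

Prefixing each result with the language name gives the row that lands in the csv, for example R,123,foo.R for the first sample line.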

Then the R script just does exactly what you'd think, given the data (take a look at the "csv" for examples).

# Columns: 1 = language, 2 = lines of code, 3 = file name
r.loc <- read.csv("r_source_loc.csv", header=FALSE)

# Per-file line counts for each language
loc.r <- r.loc[r.loc[[1]] == "R", 2]
loc.c <- r.loc[r.loc[[1]] == "C", 2]
loc.f <- r.loc[r.loc[[1]] == "Fortran", 2]

# File counts and line totals, in the order R, C, Fortran
n.files <- c(length(loc.r), length(loc.c), length(loc.f))
n.lines <- c(sum(loc.r), sum(loc.c), sum(loc.f))
files.total <- sum(n.files)
loc.total <- sum(n.lines)

cat(sprintf("\nNumber .R source files:\t\t %d\nNumber .c source files:\t\t %d\nNumber .f source files:\t\t %d\n", n.files[1], n.files[2], n.files[3]))
cat("-------------------------------------")
cat(sprintf("\nTotal source files examined:\t %d\n\n", files.total))

cat(sprintf("\nLines of R code:\t %d\nLines of C code:\t %d\nLines of Fortran code:\t %d\n", n.lines[1], n.lines[2], n.lines[3]))
cat("-------------------------------")
cat(sprintf("\nTotal lines of code:\t %d\n\n", loc.total))

cat("\nAmong all lines of code being either R, C, or Fortran:\n")
cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in Fortran:\t %f\n", 100*n.lines[1]/loc.total, 100*n.lines[2]/loc.total, 100*n.lines[3]/loc.total))

# Both plots are labelled in percent, so scale the proportions by 100
png("pct_r_source_files.png")
barplot(100*n.files/files.total, main="Percent of Core R Sourcecode Files", names.arg=c("R","C","Fortran"))
dev.off()

png("pct_r_code.png")
barplot(100*n.lines/loc.total, main="Percent of Core R Lines of Code", names.arg=c("R","C","Fortran"))
dev.off()

From the R script, we can get precise figures, which I prefer to pictures any day. But I seem to be an outlier in this regard...

Number .R source files:		 729
Number .c source files:		 586
Number .f source files:		 45
-------------------------------------
Total source files examined:	 1360

Lines of R code:	 149520
Lines of C code:	 346778
Lines of Fortran code:	 175409
-------------------------------
Total lines of code:	 671707

Among all lines of code being either R, C, or Fortran:
% code in R:		 22.259705
% code in C:		 51.626379
% code in Fortran:	 26.113916
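
The percentages above are just each language's line total divided by the grand total (for R, 149520 / 671707, or about 22.26%), so they can be cross-checked without R. The following is a rough sketch, assuming a csv in the language,lines,filename layout described earlier; the function name loc_summary and the three sample rows are invented for illustration and are not real R source data.

```shell
#!/bin/sh

# Hypothetical helper: read "language,lines,filename" rows on stdin and print
# one "language,files,lines,percent-of-all-lines" row per language.
loc_summary() {
	awk -F, '
		{ files[$1]++; lines[$1] += $2; total += $2 }
		END {
			for (lang in lines)
				printf "%s,%d,%d,%.2f\n", lang, files[lang], lines[lang], 100 * lines[lang] / total
		}
	' | sort   # awk does not guarantee iteration order, so sort the rows
}

# Made-up sample rows (not taken from the R sources).
printf '%s\n' "R,20,a.R" "R,30,b.R" "C,150,c.c" | loc_summary
```

On real data, something like loc_summary < r_source_loc.csv should give the same totals that the R script reports.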


18 comments on “How Much of R is Written in R?”

  1. Well done!
    I have once thought about this question, but did not analyze the source code.

  2. Scale is screwed up on your first graph

  3. Derek Jones said:

    The counts for R 2.13.0 are surprisingly larger:

    R 778
    c 599
    f 45
    h 240
    S 4

    These were obtained using the numbers program (http://shape-of-code.coding-guidelines.com/2011/05/30/searching-for-inaccurate-literals-in-r/) which operates directly on the uncompressed tar file.

  4. Kyle Gorman (@killa__bee) said:

    This is great. That 22% should be rewritten in a high-level language with 0-indexed arrays, sensible data structures, and normal scope rules. This wouldn't look much like R (it'd just be a bunch of libraries in a high-level language), but it would drastically increase the uptake of the R libraries by users (the lousiness and novelty of the S language is the biggest barrier). Anyone interested in joining me?

    • wrathematics said:

      I'd definitely be inclined to use that, but that kind of development is way above me.

      Sounds like a great idea, though!

  5. ian fellows said:

    nice post. you may be interested in the cloc tool

  6. johngavinblog said:

    Why not use R to do this?
    i.e. use R to tabulate the file types in R's source code.

    e.g.
    rUrl <- "http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz"
    (temp <- tempfile(fileext = ".tar.gz"))
    # may take a little while to grab a 20mb file.
    system.time(download.file(rUrl, temp))
    str(file.info(temp)) # 21mb file
    # extract only filenames from a compressed tarred file (on windows)
    system.time(filePaths <- untar(tarfile = temp,
    files = NULL, list = T, compressed = NA,
    verbose = FALSE, tar = Sys.getenv("TAR")))
    str(filePaths)
    head(filePaths, 2e1L)
    # focus on file extensions
    fileNames <- basename(filePaths)
    ext <- sub(".+(\\.[A-Za-z]+$)", "\\1", fileNames)
    # top 50 file types
    head(numTypes <- sort(table(ext[grep("\\.", ext)]), decreasing = TRUE), 50)
    # .R, .c, .f only
    rcf <- unlist(strsplit("rcf", ""))
    (types <- paste(".", c(rcf, toupper(rcf)), sep = ""))
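    (ans <- ...
    > rUrl <- "http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz"
    > (temp <- tempfile(fileext = ".tar.gz"))
    > # may take a little while to grab a 20mb file.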
    > system.time(download.file(rUrl, temp))
    trying URL 'http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz'
    Content type 'application/x-gzip' length 22063747 bytes (21.0 Mb)
    opened URL
    downloaded 21.0 Mb

    user system elapsed
    0.19 1.21 31.24
    > str(file.info(temp)) # 21mb file
    'data.frame': 1 obs. of 7 variables:
    $ size : num 22063747

    ....

    > (ans <- ...
    > round(ans / sum(ans) * 1e2L, 0)

    .R .c .f
    55 42 3
    > # remove 20mb source code.
    > file.remove(temp)
    [1] TRUE

    • wrathematics said:

      Cool approach, but unless I misread, this doesn't get the lines of code, which I think is also an important measure. I'm sure it's possible to get that in R (without using system() to call wc), but I don't know how.

      Generally, my feeling is that R is very good at what R does, but that it's not really all that well suited for shell tasks. R can be used as a general scripting language, but it's nowhere near my top choice for that.

      Plus, I just love sed's regex syntax. I honestly think it's adorable. Thanks for the cool idea, though!

  7. Robert A. Muenchen said:

    Nice job. Now get back to work! Heh heh!

    Bob Muenchen
    (the mean boss who asked for this info on a weekend)

  8. Similar stuff can be found on ohloh, also with changes in time: http://www.ohloh.net/p/rproject

    • wrathematics said:

      Wow! This is really cool. Your graphs are especially beautiful.

      I was actually working on something that would grab all the historical data, and was basically done with it except for fixing some weird problems that occur when you run it more than once (to update it without having to re-do everything that's already done). I'll probably still finish it up and post it since I'm nearly done, but I have to say that you guys put me to shame!

  9. Pingback: How Much of R is Written in R Part 2: Contributed Packages « librestats

  10. Pingback: Ma quanto R è scritto in R? « Chemiomet[R]ia

  11. Pingback: Introduction to programming in C/C++ | NerdaHolyC
