Statistical Software Popularity on Google Scholar
Background (probably boring)
Several months ago, my boss and I were discussing how he got the data for his software popularity article; the rest of the background discussion pertains to those plots, so I would recommend going over to take a look before continuing on (or just skip to the next section if you're impatient). Specifically, we were talking about his figures 7 and 11. Basically he was manually performing searches and recording whatever value he was interested in---by hand---like some kind of barbarian. This is a bad idea for at least 3 reasons:
- It's a waste of time to scrape the web manually. Writing a web scraper, especially if what you want to scrape is very uncomplicated (as is the case here), is easy in pretty much every scripting language. Spending a little time up front in developing a scraper will quickly pay dividends.
- If you ever want to change your operational definition (in this case, these are effectively the search queries), then you have to go grab all the old data. This is especially awful if you are trying to get a sense for how things perform over time (as we are here). With a scraper, this is no problem. By hand, this could be hours of laborious, menial work.
- This is the reason computers were invented! Put down that mouse and open up vim! (I have heard that emacs has similar capabilities to vim, although unfortunately I was never able to test this myself since I could not find the text editor in the emacs operating system).
There's another good reason to do this with a scraper instead of by hand, but it is more specific to our particular task. Search engine queries are not going to produce the same results across different days. It just doesn't work that way for a whole host of reasons. So if, as here, you decide to look at the number of Google Scholar hits for various searches across years, you have to grab all years at once for a sort of "snapshot" in time of how Google is deciding to index things on that day. If you grab historical data and slowly build on it (as had been done), then the counts for different years come from different states of the index, and any inference is dubious.
For that matter, it is probably a good idea to point out that this data is a description of exactly what we are saying; it is an examination of Google Scholar hits for various queries. I would hesitate to say that this describes the world of publishing at large, although given Google's monstrous scope in all things they do, I would not be surprised if this were such a reflection (I'm just not claiming that it is!).
Additionally, again given that search engines are, as far as the end user is concerned, mystical voodoo, I don't necessarily trust the numbers given here in absolute terms. I might be inclined, however, to put more faith in the relative growth/decay. Although even that is odd. If you look at the timeplot in the software popularity article, you will see a massive spike and then equally massive cratering of SPSS. Why might this be? I suppose the economy could be partly to blame (research funds are drying up everywhere and have been for some time), but certainly not entirely. I have no idea what would cause such a sharp rise and then fall except that maybe searching for "SPSS" is hitting a lot of false positives, in addition to some other spectrum of explanations.
SAS has a somewhat similar behavior, although certainly not as pronounced. The SAS one is a little more believable. It is reasonable to think that the crash of the global economy hurt SAS, and that, for SAS Institute, this couldn't have come at a worse time, since competitors such as R are steadily eating away at the SAS userbase. But I'm still not sure that's a complete explanation for the behavior seen here either.
What I'm saying is that you should probably take these numbers with a grain of salt.
Finally, there are a few changes here over his previous versions of that graph. If you had seen it before, you would have seen Stata in a much stronger position. That was because we had been getting a lot of false positives: the word "stata" is the feminine past participle of the verb "to be" in Italian. No, seriously. Some other less noticeable changes were made, but all of them are completely transparent in the scraper code posted at the end of this post. You can see exactly what we did and how we did it, and then angrily post to your newsgroup and/or bbs of choice about it, you giant nerd.
The Data
Ok, so now that boring crap you skipped is out of the way, let's talk about the data. The most recent data was collected on Monday, April 9th, 2012, at around 4:00pm EDT. You can grab your very own copy of the data here. I'm not sure why you would want it, but why does anyone want anything? A timeplot of this data is over at the software popularity article. But who cares; timeplots are for boring nerds. The real sweetness is in this sexy thing:
Isn't that beautiful? Made, of course, with the amazing ggplot2 package for R. I'm not sure what this type of plot is called. It probably has a name, but I've always called it a "market share plot", since that's basically how they're used. Anyone who's ever played Civilization III will be very familiar with these kinds of plots (they also show up a lot in political horse races).
Basically, as the name sort of suggests, the horizontal slices are capturing a sense for how much market share (proportional use) each software has in that time frame. To make your very own sexy market share plot using the example data set linked above (and again here since I know you're lazy), you could do something like this:
library(ggplot2)
library(reshape2)
library(scales)
Scholar <- read.csv("scholarly_impact_2012.4.9.csv")
Little6 <- c("JMP","Minitab","Stata","Statistica","Systat","R")
Subset <- Scholar[ , Little6]
Year <- rep(Scholar$Year, length(Subset))
ScholarLong <- melt(Subset)
names(ScholarLong) <- c("Software", "Hits")
ScholarLong <- data.frame(Year, ScholarLong)
#png("marketshare.png")
ggplot(ScholarLong, aes(Year, Hits, group=Software)) +
  geom_smooth(aes(fill=Software), position="fill") +
  coord_flip() +
  scale_x_continuous("Year", trans="reverse") +
  scale_y_continuous("Proportion of Google Scholar Hits For Each Software", labels = NULL) +
  ggtitle("Market Share") +
  theme(axis.ticks = element_blank())
#dev.off()
Now maybe this type of data (for the reasons outlined above in the boring part) isn't really appropriate for this kind of plot, but at the very least we can appreciate it as a gorgeous plot, even if we don't entirely trust it.
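If you'd rather see the raw proportions behind a plot like this, here is a minimal sketch in shell. It assumes a CSV laid out like the scraper's output (a Year column followed by one column of hit counts per package). The function name market_share is made up for illustration, and the sample numbers are invented, not real Google Scholar counts. Note that these are unsmoothed row-wise shares, whereas the plot above runs the data through geom_smooth, so the two won't match exactly.

```shell
#!/bin/bash
# market_share: hypothetical helper that reads a CSV on stdin
# (header row, then Year followed by hit counts) and prints each
# count as a proportion of that year's total.
market_share() {
    awk -F, -v OFS=, '
    NR == 1 { print; next }                      # pass the header through
    {
        total = 0
        for (i = 2; i <= NF; i++) total += $i    # row total, skipping Year
        out = $1
        for (i = 2; i <= NF; i++)
            out = out OFS (total > 0 ? sprintf("%.3f", $i / total) : "NA")
        print out
    }'
}

# Made-up sample data, for illustration only
printf 'Year,JMP,Minitab,Stata\n2009,100,300,600\n2010,200,200,400\n' | market_share
```

Each output row should sum to 1 (give or take rounding), which is all position="fill" is really doing before the smoothing gets involved.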
The Scraper
Ok, so finally, here's the code for the scraper. Most of you can safely ignore this because, no, I didn't write it in R---I wrote it in shell. Some people seem to get very angsty when anyone does anything outside of R. R is amazing at what it does. I'm one of R's biggest fans, but there are some tasks that, in my opinion, are just better suited for the shell. Yes I know I can just use system(), but I don't want to. Stop complaining so much, voice in my head.
The shebang here is for bash, and the script leans on bash-specific features (C-style for loops, $RANDOM, echo -e), so it is unlikely to run under a strictly POSIX shell such as dash. If you're the kind of weirdo who insists on avoiding bash, zsh will probably cope.
#!/bin/bash
# For each year starting from the given first year up to the given last year, this script
# scrapes google scholar for chosen search strings.
# ----------------------------------------
# Changeable options
# ----------------------------------------
# SEARCH STRINGS: Separate queries by a space; enclose in quotes with %22; use + for space;
queries="
BDMP
JMP+AND+%22SAS+Institute%22
Minitab
SPSS
%22SAS+Institute%22+-JMP
Statacorp
%22Statsoft+Statistica%22
Systat
%22the+R+software%22+OR+%22the+R+project%22+OR+%22r-project.org%22+OR+hmisc+OR+ggplot2+OR+RTextTools
%22s-plus%22%2Btibco+OR+%22s-plus%22%2B%22insightful%22
"
# QUERY NAMES: What the column titles in the first row of the output .csv should be.
# Order should match order of queries above. This only affects the way the output file looks.
qnames="
BDMP
JMP
Minitab
SPSS
SAS
Stata
Statistica
Systat
R
SPlus
"
# First and last years to consider
firstyear="1995"
#lastyear=$(date +%Y) # Current year
lastyear=$(( $(date +%Y) - 1 )) # Previous year
workdir=/tmp
outdir=~/scraper/google
# ----------------------------------------
# Don't touch
# ----------------------------------------
firstrun=TRUE
cd $workdir
# Ask date for each field directly rather than parsing its default output,
# whose layout depends on the locale. Leading zeros are stripped.
mon=$(date +%m | sed 's/^0//')
day=$(date +%d | sed 's/^0//')
yr=$(date +%Y)
# Output file: current month.day.year.csv
#outfil="scholarly_impact_${mon}.${day}.${yr}.csv" #original outfile name
outfil="scholarly_impact_${yr}.${mon}.${day}.csv"
if [ ! -e ${outdir}/${outfil} ]; then
    touch ${outdir}/${outfil}
    echo -n "Year," >> $outdir/$outfil
    for q in $qnames; do
        echo -n "$q," >> $outdir/$outfil
    done
    sed -i 's/.$//' $outdir/$outfil
    echo "" >> $outdir/$outfil
else
    echo -e "\n\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
    echo -e "!\t\tWARNING: OUTPUT FILE EXISTS\t\t !\n! New output will be appended--this shouldn't be happening !"
    echo -e "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
    echo -e "Giving you 10 seconds to change your mind...\n"
    sleep 10
fi
years=""
for (( year=${firstyear}; year<=${lastyear}; year++ )); do
    years="$years $year"
done
for year in $years; do
    numbers=""
    for query in $queries; do
        # Waits between 10 and 21 seconds before continuing (skipped on the first request)
        if [ $firstrun = FALSE ]; then
            sleepfullsecs=$(( RANDOM % 11 + 10 ))
            sleepfracsecs=`echo "scale=4; $(( RANDOM % 10000 )) / 10000" | bc -l`
            sleepsecs="${sleepfullsecs}${sleepfracsecs}"
            echo -e "\n\nSleeping for $sleepsecs seconds so we don't get flagged as a bot.\n"
            for (( sleepy=${sleepfullsecs}; sleepy>=1; sleepy-- )); do
                echo -n $sleepy
                sleep .25; echo -n "."
                sleep .25; echo -n "."
                sleep .25; echo -n "."
                sleep .25
            done
            echo -n "0 and 0$sleepfracsecs"; sleep $sleepfracsecs; echo -e "\n\n"
        else
            firstrun=FALSE
        fi
        url="http://scholar.google.com/scholar?hl=en&num=1&q=${query}&btnG=Search&as_sdt=1%2C43&as_ylo=${year}&as_yhi=${year}&as_vis=1"
        #wget --user-agent=${useragent} --referer=$referer --output-document=goog --tries=20 $url
        wget --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --referer="http://scholar.google.com/" --output-document=goog --tries=20 "$url"
        # Query may find nothing
        teststring=`awk -F "No pages were found containing" '{print $2}' goog`
        if [ "$teststring" = "" ]; then
            # Keep only the text after "Results 1 - 1 of", then strip it down to the bare number
            sed -n -i 's/.*Results <b>1<\/b> - <b>1<\/b> of //p' goog
            sed -i -e 's/about //' -e 's/<b>//' -e 's/,//g' goog
            number=`awk -F "</b>" '{print $1}' goog`
        else
            number=0
        fi
        numbers="${numbers},$number"
        echo -e "\n\n Number found: $number \n\n"
        rm goog
    done
    # One row per year: the year followed by the hit count for each query
    echo "${year}${numbers}" >> ${outdir}/${outfil}
done
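The two fiddliest parts of the script are getting the queries into URL-ready form and pulling the hit count out of the page. Here is a small standalone sketch of both, so you can poke at them without hammering Google. The function names encode_query and extract_hits are made up for illustration, and the sample HTML line is a hypothetical stand-in shaped to match the patterns the script searches for. Google's real markup has very likely changed since this was written, so treat the parsing as an example of the technique rather than something that works against the live site.

```shell
#!/bin/bash
# encode_query: hypothetical helper that turns a human-readable query into
# the form used in the queries list: literal + becomes %2B, double quotes
# become %22, and spaces become +.
encode_query() {
    printf '%s\n' "$1" | sed -e 's/+/%2B/g' -e 's/"/%22/g' -e 's/ /+/g'
}

# extract_hits: hypothetical helper that applies the same sed/awk steps as
# the scraper, but to stdin instead of the downloaded file.
extract_hits() {
    sed -n 's/.*Results <b>1<\/b> - <b>1<\/b> of //p' |
        sed -e 's/about //' -e 's/<b>//' -e 's/,//g' |
        awk -F '</b>' '{print $1}'
}

encode_query 'JMP AND "SAS Institute"'

# Made-up results line, for illustration only
sample='<td>Results <b>1</b> - <b>1</b> of about <b>12,300</b>. (<b>0.05</b> sec)</td>'
printf '%s\n' "$sample" | extract_hits
```

The first call should print the same string as the JMP entry in the queries list, and the second should print the bare number with the comma removed. If the scraper ever starts returning blanks, feeding a saved copy of the page through extract_hits is a quick way to find out whether the markup moved.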
SPSS changed its name to PASW in 2009 when it was acquired by IBM (http://en.wikipedia.org/wiki/SPSS). You may want to re-run your analysis using that term and see what you get.
I remember when it was named PASW; still, everyone, in psychological research at least, called it SPSS. I would imagine that any researcher using PASW would've written it like PASW SPSS in their articles.
Agreed. I don't know anyone that stopped calling it SPSS.
I'm a researcher and have been trained to be "very thorough" or else your work will be questioned.
As your Wikipedia article points out, the PASW name existed for only one year. When IBM bought them, they wisely reverted back to the name everyone knew. When citing the use of PASW, people should have said that it was from SPSS, Inc. which should have us covered. It's certainly worth trying though. We'll add it to see if it adds anything. The new count-based graph should be up by 4/13/2012 on http://r4stats.com/popularity.
I should have replied to Laura O'Grady but started a separate thread by accident, so here I am replying to my own post. We looked at PASW excluding SPSS and got a small number of messy hits. They included Plant Available Soil Water and Pluent Abdominal Segment Width. So we left both graphs (here and in the popularity article) unchanged. It was definitely worth a try though.
I'm not a programmer but I wonder if there is a way to write in something that can exclude "Plant available soil water" etc., like:
If PASW near 'Plant available soil water' then skip.
Probably not worth the effort other than as an intellectual exercise as it has already been pointed out the likely use of PASW is limited.
I've had research articles sent back from an editor because I didn't state the version number of SPSS I was using (in case a bug is revealed later, which could call my analysis into question).
I don't think Google Scholar has a "near" function as some software does, but it uses the minus sign to exclude things. That would work fine in this case, but we learned that you have to be careful not to make Google Scholar queries too complex. The logic seems to fall apart eventually. You can test this by adding a very large number of "or" conditions. The values should always increase but they can eventually decrease. That may have only happened when there were also some "and" conditions, so the test may not be as easy as that. We were totally surprised though that the logic ever failed, given the popularity of Google.
In this case the number of additional hits was tiny though.
I love your "market share" chart - I prefer that as a way of representing the data to the overlaid time series chart. I wondered why SAS wasn't represented though, until I realised the "JMP" segment represents both SAS and JMP. I'm surprised at the early dominance of Systat though -- and do you know why it doesn't appear in the chart at r4stats.com/popularity? Anyway, thanks for providing this background information on how the data were collected.
That market share plot mirrors Fig. 7b in r4stats.com/popularity in which SPSS and SAS are left out.
amazing post. love the ggplot!
I think there is an erratum in your bash script: it is BMDP instead of BDMP. BDMP stands for 2,6-dimethylphenyl or other chemicals.
Nice catch Julio! Since we only plotted the top 6 packages, then dropped SAS and SPSS to plot the next 6, BMDP didn't show up. In 2011 it's in last place with only 554 articles.