Tuesday, April 22, 2014

Hacking Google Scholar (a little bit)

This post is about two things I really like, the R statistical programming language and Google Scholar, and my (tiny) effort to bring them together.  I've written quite a lot about R.  I use it all the time at my job, and have frequently used it in my published work.  I'll admit, I don't usually cite R when I use it. In fact, I don't even acknowledge that I've used it.  Usually I just report some statistical stuff and move on, never clarifying the actual software used to get the result (or whether I did it with a slide rule and pencil, which reminds me of the ad image below. It isn't at all related to this post, but it's probably the best thing I've seen in weeks).

However, you can cite R in your publications (and maybe I should).  In fact, R has a handy function called citation whose only purpose is to tell you how to cite R.  Here's what it says.


To cite R in publications use:

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 

[Image: IBM ad from the 1950s]

there are so many things I love about this ad:
-white shirt, slick hair, and a tie - no exceptions
-no women (of course, these are engineers!)
-the phrase "routine repetitive figuring"
-the little atomic motif next to IBM's logo
-the earnest look on everyone's face (or maybe it's concern, they're all about to get laid off I think)
-the enormity of the computer
-FREAKING SLIDE RULES!!!!!

also, wouldn't it be nice if we still talked about computer power in engineer equivalents, not unlike horsepower?  imagine saying a 2 giga-engineer processor. more tangible than a petaflop, whatever that is.
 
That's it.  In older versions of R it had been R Development Core Team, but they dropped the Development part.

What's interesting is that when people add this to their favorite citation manager, like EndNote, it inevitably gets mangled into the name R.C. Team (or, for older citations, R.D.C. Team).
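For anyone who wants to see where the citation comes from, here is a minimal sketch of pulling it out of R itself. It uses only citation() and toBibtex(), both of which ship with base R in the utils package. The variable names (r_cite, r_bib, stats_cite) are just mine for illustration. As far as I can tell, the BibTeX entry R produces wraps the author in double braces, which is the usual BibTeX way of marking a corporate author so that it doesn't get split into first name, middle initial, and last name. I can't promise every citation manager honors that.

```r
# citation() lives in the utils package, which is attached by default.
r_cite <- citation()

# Print the human-readable "To cite R in publications use:" text.
print(r_cite)

# Convert the same citation to a BibTeX entry for a reference manager.
r_bib <- toBibtex(r_cite)
print(r_bib)

# The same function works for individual packages, for example "stats".
stats_cite <- citation("stats")
print(stats_cite)
```

Copying the BibTeX entry as-is, rather than retyping the author by hand, is probably the easiest way to keep the name in one piece.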

I really love this. I see it all the time in papers.  So for giggles I made a Google Scholar account for R.D.C. Team.  R.D.C. is doing pretty well with almost 14,000 citations as of 4/14.  

Actually there is some interesting stuff that can be gleaned from this novelty Google Scholar account. First, notice the amazing growth of citations for R. 2013 saw almost as many citations as all the past years combined! This fits with my own experience. I learned R in 2004 or 2005, and back then it was unusual to meet R users, especially outside of stats departments or those folks (like me) who worked on microarrays.  Now it is practically a standard. Where I work we commonly ask interviewees if they use R, mainly because the primary alternative, SAS, costs a fortune.  

Also, the bulk of citations are to R.D.C. and not the more recent R.C. Team. I'm not exactly sure when the D was dropped, but I think it was in the last couple of years.  Because so many citations come from 2013, this suggests that people are either using an old version of R and finding the old citation there, or more likely, just copying an out-of-date citation from an existing paper. 

Another interesting thing is the myriad ways in which people screw up the citation. Here's an example of the old standby, RDC Team.  Here it's mangled to R Development CT.  And if you're not into that whole brevity thing, you can spell out RDC's full name; here's one that mangled it to Team R Development Core.

This leads me to my final thoughts. I don't feel bad for not citing R. I do cite specific R packages, but if I just compute a correlation or regression in R I'm not going to cite it.  In these cases I use R out of convenience, not out of necessity. It feels no different than citing Microsoft Excel for helping me organize my data, which seems ridiculous. That said, if you are going to the trouble of citing R, do it correctly!! Here's a nice page describing how.