Wednesday, September 3, 2014

Unsolicited Email

In any profession, one of the downsides of attending conferences or publishing papers is that your email address gets out into the world and you become a target of unsolicited junk email. Most of what I receive is trying to sell me a scientific product, get me to attend a conference, or get me to publish in a journal no one has ever heard of. I never open the bulk of it.

Recently I've received a ton of crap.  

Today, as a sort of silver-lining, I got my all-time favorite unsolicited email, and one that I actually opened and read completely.  Check it out below. The subject line was "Get all the DNA out of urine!".  YES!!

Best part: "Urine is a veritable gold mine...".  I couldn't agree more!

Anyway, I'm in a select group (I assume) of scientists chosen to be beta-testers of Zymo Research's Extract-ALL(TM) Urine DNA Kit, so I got that going for me, which is nice.



Thursday, July 10, 2014

Duke Divinity School says it answers what science cannot

I stumbled on an article in the Duke Chronicle with the title above.  I assumed it was probably a young journalist getting a little carried away with the headline, but read it anyway.  In fact, the headline was pretty factual.  Here are some interesting nuggets:


Duke Chapel
Students in the Divinity graduate programs come from a wide variety of backgrounds, but all of them come to seek further study in the field of faith. Each come having accepted the fundamentals of their Christian faith—just as a mathematics graduate student accepts the concept of numbers, or a medical student accepts chemistry, Hays said.
[Hays is the Dean of the Div. School.]

“There is no field at Duke that doesn’t take on presuppositions,” Myers said. “I don’t think the argument should be about the crazy claims that the Christian Church makes because we all have crazy presuppositions."

[Myers is a grad student at the Div. School]

“Science seeks to describe empirical phenomena in a material world,” Hays said. “It describes how things work. Science cannot answer questions about why it exists or for what purposes or how it came to be. Those are the questions that theology tries to address.”

[Hays again]


The study of what it means to be human is at the heart of humanities studies, and that is where religion plays a role, said Carnes, who wants to become a theology professor. 
“Humanities in general have something to do with what it means to be a human in a way that math and science can’t fully address,” she said.
[Carnes is another grad student at the Div. School]
I am a practicing scientist. I do not have a very sophisticated understanding of philosophy, and I'll take it as given that there is some lack of "proof" for the theory of numbers or atoms, as Hays claims. However, I absolutely reject the reasoning that a belief that 2+2=4 is essentially the same as a belief in the Christian god (or any other god). That's nuts! Chemistry and Christianity are simply not on equal footing in terms of supporting evidence, no matter how intractable the ultimate proof of atomic theory may be. 
These little philosophical sleights of hand, like the one used by Hays, create a superficial sense that science and religious belief are equivalent in factuality and veracity. This kind of thinking just blows my mind. I also find it very disheartening, because I have a hunch that these tricks are one of the more pernicious roots of the rejection of science. It's much easier to cast science aside if you pose it as a belief system, or as equivalent to a religious belief system.  
Second, what does it mean to study why things exist or for what purpose [why do carrots exist, and for what purpose are supernovae?], or what it means to be human? I suppose in some sense I've been studying what it means to be human every day for the last 34 years. I'm sad to report I've had no breakthroughs :). Honestly, I'm not even sure I'd recognize a breakthrough if it occurred.  

Friday, June 13, 2014

Programming Geometric Art

A few weeks ago I was listening to the brilliant podcast 99% Invisible. The particular episode was all about quatrefoils, which are cloverleaf-shaped geometric designs. In the episode they point out that quatrefoils feature prominently in high-class objects, like Gothic cathedrals and luxury products (e.g. Louis Vuitton bags). The episode really stuck with me. It was so interesting. These simple shapes can convey so much meaning, much of it subconsciously (and now, for you, more consciously, because you too are going to start seeing quatrefoils everywhere, especially my friends at Duke!)

I have always been fascinated by kaleidoscopes and geometric designs. I love Arabic art and Gothic patterns. There is something soothing about geometric designs: they have order and organization, but at the same time an organic familiarity, because similar processes arise in nature.  



This got me thinking. At a base level, geometric art is just an algorithm: draw a line here, move 10 degrees over, draw another line, repeat. I know a little about algorithms, and I started to wonder if I could make anything pretty in R (an ironic choice, because 99% of what I do in R is decidedly not pretty). At first I set out to make quatrefoils, but I never did figure it out (if you do, paste your code in the comments section below). Then I reset my sights on just making something interesting. 
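(For what it's worth, one crude starting point for a quatrefoil might be to just overlap four circles arranged on the axes. That isn't a true quatrefoil, since a real one keeps only the outer arcs, but it gets at the cloverleaf flavor, in the same base-graphics style as the code further down.)

##quatrefoil-ish (a rough sketch: four overlapping unit circles, not a true quatrefoil)
rad = function(x) (x*pi)/180
theta = rad(seq(0, 360, 1))
plot(0, 0, ty='n', xlim=c(-2.5,2.5), ylim=c(-2.5,2.5), asp=1)
#one unit circle centered one unit up, down, left, and right of the origin
for (a in c(0, 90, 180, 270)) {
    lines(cos(rad(a)) + cos(theta), sin(rad(a)) + sin(theta), lwd=0.5)
}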

Below are the first 5 pleasing things I've managed to produce, along with the R code that draws them. The first 4 were designs I made intentionally. It's a fun exercise: find or imagine something you want to draw, then think about the algorithm that would make it, and finally translate that into code. It's one of those tasks that uses a lot of your brain.

Of the images below, I'm most proud of the one called "angles", because I got the process from idea to algorithm to code on the first try! All the others took some trial and error, with interesting squiggles and blobs emerging along the way.       

These pictures raise a lot of questions that are beyond my grasp of geometry and trig. For example, in the mystic rose below, why do all those internal circles arise? What is their radius? Why are there voids? There must be some way to derive answers to these things mathematically, and if I meditated on it long enough I could probably figure it out. Unfortunately, for now I'm a little too busy to remedy my own ignorance.

Also, "mystic rose" isn't my name.  I did some googling and that seems to be the standard name for that object.  I'm a geneticist.  I'd have called it the "full diallel". :) (And a hellacious one at that. 625 crosses. You'd need to be pretty ambitious or foolish.)   

Finally, if you decide to run the R code below, it is best run in regular old vanilla R, because it draws one line at a time and you can watch the pattern emerge (and if you have an old, slow computer like mine, it will emerge at a nice human pace). When I plotted these in RStudio, the image wouldn't print to the screen until it was done, which kind of kills the effect (I recommend running some of the code; it's mesmerizing).   

##mystic rose
#helper to convert degrees to radians
rad = function(x) (x*pi)/180
#start with an empty plot spanning the unit circle
plot(c(0,0), c(0,0), ty='n', xlim=c(-1,1), ylim=c(-1, 1))

#place a point every 15 degrees around the circle and connect every pair of points
deg.rot = 15
for (i in seq(0, 360, deg.rot)) {
    for (j in seq(0, 360, deg.rot)) {
        lines(c(cos(rad(i)), cos(rad(j))), c(sin(rad(i)), sin(rad(j))), lwd=0.5)
    }
}
mystic rose 
##circles
#eight directions around the circle (0 and 360 coincide)
v = seq(0,360,45)
rad = function(x) (x*pi)/180
plot(0,0, ty='n', xlim=c(-20,20), ylim=c(-20,20))
#for each direction, stamp 10 circles; x steps outward with j while y stays fixed at sin(i)*10
for (i in v) {
    for (j in seq(1, 10, 1)){
        symbols(cos(rad(i))*j, sin(rad(i))*10, circles=1, add=T)
    }
}

circles

##angles
rad = function(x) (x*pi)/180
v = seq(45,360,45)
steps = seq(0,1,.025)
plot(c(0,0), c(0,0), ty='n', xlim=c(-1,1), ylim=c(-1, 1))
#for each pair of adjacent 45-degree rays, connect a point sliding out along one ray
#to a point sliding in along the next; the chords trace out a curved envelope
for (i in v) {
    next_angle = i + 45
    for (j in steps) {
        lines(c(cos(rad(i))*j, cos(rad(next_angle))*(1-j)), c(sin(rad(i))*j, sin(rad(next_angle))*(1-j)), lwd=0.5)
    }
}

angles 
##betas
#shape parameters that grow (sub-linearly) with an ever-increasing sample size
v = seq(2, 1502, 15)**.75
plot(0,0, xlim=c(0,1), ylim=c(-20,20), ty='n')
#draw each symmetric beta density plus its mirror image below the x-axis
for (i in v) {
    lines(seq(0,1,0.01), dbeta(seq(0,1,.01), shape1=i, shape2=i), lwd=.1)
    lines(seq(0,1,0.01), -dbeta(seq(0,1,.01), shape1=i, shape2=i), lwd=.1)
}
Betas also tells a story. It's drawn using a beta distribution, and it shows the uncertainty around a proportion estimated to be at 50% frequency as a function of an ever-increasing sample size. The uncertainty shrinks really fast at first, but quickly hits diminishing returns; by the end, adding hundreds of samples does very little to increase statistical power.
betas
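A quick numeric check of that diminishing-returns point (a back-of-the-envelope addition, not part of the figure's code): the standard error of a proportion estimated at 0.5 shrinks like 1/sqrt(n), so the early samples buy you far more than the later ones.

##standard error of a proportion estimated at 0.5, for increasing sample sizes
n = c(10, 50, 100, 500, 1000)
round(sqrt(0.5*0.5/n), 3)   #0.158 0.071 0.050 0.022 0.016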
##trig functions
v = seq(0,pi*.25, pi/10)
v2 = seq(0,pi*0.5, pi/200)
plot(c(0,0), c(0,0), ty='n', xlim=c(-120,120), ylim=c(-150, 150))
#layer many weighted sums of sines and cosines (and their mirror images), varying the weights
for (i in v) {
    for (cons1 in seq(0,100,10)) {
        for (cons2 in seq(0,100,5)) {
            xf = function(i) cons1*cos(i) + cos(3*i) + cos(2*i) + cos(4*i)
            yf = function(i) cons2*sin(i) + cons1*sin(3*i)
            lines(xf(v2), yf(v2), lwd=.05)
            lines(-xf(v2), -yf(v2), lwd=.05)
        }
    }
}
trig functions

Tuesday, April 22, 2014

Hacking Google Scholar (a little bit)

This post is about 2 things I really like, the R statistical programming language and Google Scholar, and my (tiny) effort to bring them together.  I've written quite a lot about R.  I use it all the time at my job, and have frequently used it in my published work.  I'll admit, I don't usually cite R when I use it. In fact, I don't even acknowledge that I've used it.  Usually I just report some statistical stuff and move on, never clarifying the actual software used to get the result (or whether I did it with a slide rule and pencil, which reminds me of the ad image below that isn't at all related to this post, but is probably the best thing I've seen in weeks.) 

However, you can cite R in your publications (and maybe I should). In fact, R has a handy function called citation whose only purpose is to tell you how to cite R. Here's what it says. 


To cite R in publications use:

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 

IBM ad from the 1950s

There are so many things I love about this ad:
-white shirt, slick hair, and a tie - no exceptions
-no women (of course, these are engineers!)
-the phrase "routine repetitive figuring"
-the little atomic motif next to IBM's logo
-the earnest look on everyone's face (or maybe it's concern; I think they're all about to get laid off)
-the enormity of the computer
-FREAKING SLIDE RULES!!!!!

Also, wouldn't it be nice if we still talked about computer power in engineer equivalents, not unlike horsepower? Imagine saying "a 2 giga-engineer processor". More tangible than a petaflop, whatever that is.
That's it.  In older versions of R it had been R Development Core Team, but they dropped the Development part.
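If you want that citation in a form a reference manager can import directly, something like this should also work (toBibtex is part of base R's utils package):

##print the canonical citation, then convert it to a BibTeX entry
citation()
toBibtex(citation())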

What's interesting is that inevitably when people add this to their favorite citation manager like Endnote it gets mangled into the name R.C. Team (or for older citations R.D.C. Team).  

I really love this. I see it all the time in papers.  So for giggles I made a Google Scholar account for R.D.C. Team.  R.D.C. is doing pretty well with almost 14,000 citations as of 4/14.  

Actually there is some interesting stuff that can be gleaned from this novelty Google Scholar account. First, notice the amazing growth of citations for R. 2013 saw almost as many citations as all the past years combined! This fits with my own experience. I learned R in 2004 or 2005, and back then it was unusual to meet R users, especially outside of stats departments or those folks (like me) who worked on microarrays.  Now it is practically a standard. Where I work we commonly ask interviewees if they use R, mainly because the primary alternative, SAS, costs a fortune.  

Also, the bulk of citations are to R.D.C. and not the more recent R.C. Team. I'm not exactly sure when the D was dropped, but I think it was in the last couple of years.  Because so many citations come from 2013, this suggests that people are either using an old version of R and finding the old citation there, or more likely, just copying an out-of-date citation from an existing paper. 

Another interesting thing is the myriad ways in which people screw up the citation. Here's an example of the old standby, RDC Team. Here it's mangled to R Development CT. And if you're not into that whole brevity thing, you can spell out RDC's full name; here's one that mangled it to Team R Development Core.  

This leads me to my final thoughts. I don't feel bad for not citing R. I do cite specific R packages, but if I just compute a correlation or regression in R I'm not going to cite it.  In these cases I use R out of convenience, not out of necessity. It feels no different than citing Microsoft Excel for helping me organize my data, which seems ridiculous. That said, if you are going to the trouble of citing R, do it correctly!! Here's a nice page describing how.

Saturday, January 18, 2014

My paper on Indel Variation in Mimulus is published: Omissions and an Admission

Mimulus guttatus
Edit: Below I point out two relevant papers I wish I had discovered and cited in my recent publication.  Here is a third paper that is also highly relevant to our manuscript, and also escaped my notice until it was too late!

My latest paper just came out in Genome Biology and Evolution (here). In it we describe insertion/deletion (indel) variation in a natural population of the wildflower Mimulus guttatus. We sequenced 10 inbred lines derived from this population and found millions of SNPs and thousands of indels. One of the more striking findings was the large number of indels associated with a certain class of plant disease resistance genes called NBS-LRRs. We noted in the paper that this same observation had been made in soybeans and Arabidopsis and cited the appropriate papers. Once our paper was in press, I found 2 additional papers, one in barley and the other in musk melon, that had also found lots of indels associated with NBS-LRRs. Unfortunately it was too late to cite these 2 papers. My apologies to the authors for the omission.

Next, an admission. In the paper I wanted to compare the spectrum of allele frequencies in the population (often called the site frequency spectrum, or SFS) for indel mutations vs. various forms of SNP mutation (see figure 2 and table 2 in the paper). I also wanted to compare these to the expected frequency distribution under the standard neutral model, since this serves as a kind of null expectation in population genetics. Here's my admission: until very recently I did not know how to compute the SFS under the standard neutral model. So in my paper I used coalescent simulations to estimate it. From this estimate I found that the expected mean neutral SFS for the rare (minor) allele was about 0.222 (which means the mean frequency for the common (major) allele is about 0.778). Recently I found a paper that describes the exact calculation very clearly; the relevant formula is their equation 6, which says that under the standard neutral model the expected proportion of variable sites with derived-allele count i in a sample of n chromosomes is proportional to 1/i. So let's recompute the expected mean SFS for the minor allele using the exact calculation. I'll do it in R below.

#first make a vector of possible allele counts

#remember we had 10 inbred lines, so there are 9 possible states for a SNP site
ac = 1:9 # gives 1 2 3 4 5 6 7 8 9

#now implement eqn 6 from the paper noted above
unfolded_SFS = (1/ac) / (sum(1/ac))

#now fold the SFS to get the folded SFS

folded_SFS = c(unfolded_SFS[1]+unfolded_SFS[9], unfolded_SFS[2]+unfolded_SFS[8], unfolded_SFS[3]+unfolded_SFS[7], unfolded_SFS[4]+unfolded_SFS[6], unfolded_SFS[5])

#finally figure out the expected mean

exp_mean = sum((1:5/10)*folded_SFS) # gives 0.2282228

So there you have it, the exact expected mean minor allele frequency is about 0.228.  My coalescent simulations were close at 0.222.  Now I know how to get the exact SFS (and it's pretty easy), but like the omissions noted above, it's a little too late to change my paper!
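And since it really is that easy, here is the same calculation wrapped up as a little function for an arbitrary sample size (a quick generalization I'm adding here for convenience, not something from the paper):

##expected mean minor allele frequency under the standard neutral model
exp_mean_maf = function(n) {
    ac = 1:(n-1)                       #possible allele counts at a variable site
    unfolded_SFS = (1/ac) / sum(1/ac)  #eqn 6: proportion of sites with count i is proportional to 1/i
    maf = pmin(ac, n - ac) / n         #minor allele frequency for each count
    sum(maf * unfolded_SFS)
}
exp_mean_maf(10)   #gives ~0.228, matching the number above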