
Friday, June 13, 2014

Programming Geometric Art

A few weeks ago I was listening to the brilliant podcast 99% Invisible. The particular episode was all about quatrefoils, which are cloverleaf-shaped geometric designs. In the episode they point out that quatrefoils feature prominently in high-class objects, like gothic cathedrals, and in luxury products (e.g. Louis Vuitton bags). The episode really stuck with me. It was so interesting. These simple shapes can convey so much meaning, much of it subconsciously (and now, for you, more consciously, because you too are going to start seeing quatrefoils everywhere, especially my friends at Duke!)

I have always been fascinated with kaleidoscopes and geometric designs. I love Arabic art and gothic patterns. There's something soothing about geometric designs: they have order and organization, but at the same time an organic familiarity, because similar processes arise in nature.



This got me thinking. At a base level geometric art is just an algorithm. Draw a line here, move 10 degrees over, draw another line, repeat. I know a little about algorithms and I started to wonder if I could make anything pretty in R (R being an ironic choice because 99% of what I do in R is decidedly not pretty). At first I set out to make quatrefoils, but I never did figure it out (and if you do, paste your code in the comments section below).  Then I reset my sights on just making something interesting. 
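Just to make that idea concrete, here's a minimal sketch of the draw-a-line, rotate, repeat loop (a plain starburst, not one of the designs below):

#a minimal sketch of "draw a line, rotate, repeat" - just a starburst
rad = function(x) (x*pi)/180 #convert degrees to radians
plot(0, 0, ty='n', xlim=c(-1,1), ylim=c(-1,1)) #empty canvas
for (ang in seq(0, 350, 10)) { #move 10 degrees over each time
    lines(c(0, cos(rad(ang))), c(0, sin(rad(ang)))) #draw a line from the center outward
}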

Below are the first 5 pleasing things I've managed to produce along with the R code that draws them.  The first 4 were designs I made intentionally. It's a fun exercise. Find
or imagine something you want to draw, then think about the algorithm that would make it, and finally translate that into code. It's one of those tasks that uses a lot of your brain.

Of the images below, I'm most proud of the one called "angles", because I got the process from idea to algorithm to code on the first try! All the others took some trial and error, with interesting squiggles and blobs emerging along the way.       

These pictures raise a lot of questions that are beyond my grasp of geometry and trig.  For example, in the mystic rose below, why do all those internal circles arise? What is their radius? Why are there voids? There must be some way to derive answers to these things mathematically, and I'm sure if I meditated on it long enough I could probably figure it out. Unfortunately, for now I'm a little too busy to remedy my own ignorance.

Also, "mystic rose" isn't my name for it.  I did some googling and that seems to be the standard name for this object.  I'm a geneticist; I'd have called it the "full diallel". :) (And a hellacious one at that. 625 crosses. You'd need to be pretty ambitious or foolish.)

Finally, if you decide to run the R code below, it is best run in regular old vanilla R, because it draws one line at a time and you can watch the pattern emerge (and if you have an old, slow computer like mine it will emerge at a nice human pace). When I plotted these in RStudio the image wouldn't print to the screen until it was done, which kind of kills the effect (I recommend running some of the code; it's mesmerizing).

##mystic rose
rad = function(x) (x*pi)/180 #convert degrees to radians
plot(c(0,0), c(0,0), ty='n', xlim=c(-1,1), ylim=c(-1, 1)) #empty canvas

deg.rot = 15 #spacing between points on the circle, in degrees
for (i in seq(0, 360, deg.rot)) {
    for (j in seq(0, 360, deg.rot)) {
        #connect every point on the unit circle to every other point
        lines(c(cos(rad(i)), cos(rad(j))), c(sin(rad(i)), sin(rad(j))), lwd=0.5)
    }
}
mystic rose 
##circles
v = seq(0, 360, 45) #angles, every 45 degrees
rad = function(x) (x*pi)/180 #convert degrees to radians
plot(0, 0, ty='n', xlim=c(-20,20), ylim=c(-20,20)) #empty canvas
for (i in v) {
    for (j in seq(1, 10, 1)) {
        symbols(cos(rad(i))*j, sin(rad(i))*10, circles=1, add=T) #add one circle per step
    }
}

circles

##angles
rad = function(x) (x*pi)/180 #convert degrees to radians
v = seq(45, 360, 45) #8 spokes, every 45 degrees
steps = seq(0, 1, .025) #positions along each spoke
plot(c(0,0), c(0,0), ty='n', xlim=c(-1,1), ylim=c(-1, 1)) #empty canvas
for (i in v) {
    next_angle = i + 45 #the neighboring spoke
    for (j in steps) {
        #connect a point j of the way out this spoke to a point (1-j) of the way out the next one
        lines(c(cos(rad(i))*j, cos(rad(next_angle))*(1-j)), c(sin(rad(i))*j, sin(rad(next_angle))*(1-j)), lwd=0.5)
    }
}

angles 
##betas
v = seq(2, 1502, 15)**.75 #an increasing series of beta shape parameters
plot(0, 0, xlim=c(0,1), ylim=c(-20,20), ty='n') #empty canvas
for (i in v) {
    lines(seq(0,1,0.01), dbeta(seq(0,1,.01), shape1=i, shape2=i), lwd=.1)  #symmetric beta density
    lines(seq(0,1,0.01), -dbeta(seq(0,1,.01), shape1=i, shape2=i), lwd=.1) #and its mirror image below the axis
}
The "betas" design also tells a story. It's drawn using a beta distribution: it shows the uncertainty around a proportion estimated to be at 50% frequency as a function of an ever-increasing sample size. The distribution narrows really fast at first, but quickly hits diminishing returns; by the end, adding hundreds of samples does very little to increase statistical power.
betas
##trig functions
v = seq(0, pi*.25, pi/10)
v2 = seq(0, pi*0.5, pi/200) #points along each curve
plot(c(0,0), c(0,0), ty='n', xlim=c(-120,120), ylim=c(-150, 150)) #empty canvas
for (i in v) {
    for (cons1 in seq(0, 100, 10)) {
        for (cons2 in seq(0, 100, 5)) {
            #sums of cosines give x, sums of sines give y, scaled by the two constants
            xf = function(i) cons1*cos(i)+cos(3*i)+cos(2*i)+cos(4*i)
            yf = function(i) cons2*sin(i)+cons1*sin(3*i)
            lines(xf(v2), yf(v2), lwd=.05)
            lines(-xf(v2), -yf(v2), lwd=.05) #and the mirror image
        }
    }
}
trig functions

Tuesday, April 22, 2014

Hacking Google Scholar (a little bit)

This post is about 2 things I really like (the R statistical programming language and Google Scholar) and my (tiny) effort to bring them together.  I've written quite a lot about R.  I use it all the time at my job, and have frequently used it in my published work.  I'll admit, I don't usually cite R when I use it. In fact, I don't even acknowledge that I've used it.  Usually I just report some statistical stuff and move on, never clarifying the actual software used to get the result (or whether I did it with a slide rule and pencil, which reminds me of the ad image below that isn't at all related to this post but is probably the best thing I've seen in weeks).

However, you can cite R in your publications (and maybe I should).  In fact, R has a handy function called citation whose only purpose is to tell you how to cite R.  Here's what it says.


To cite R in publications use:

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 

IBM ad from the 1950s

there are so many things I love about this ad:
-white shirt, slick hair, and a tie - no exceptions
-no women (of course, these are engineers!)
-the phrase "routine repetitive figuring"
-the little atomic motif next to IBM's logo
-the earnest look on everyone's face (or maybe it's concern; they're all about to get laid off, I think)
-the enormity of the computer
-FREAKING SLIDE RULES!!!!!

also, wouldn't it be nice if we still talked about computer power in engineer equivalents, not unlike horsepower. imagine saying "a 2 giga-engineer processor". more tangible than a petaflop, whatever that is.
That's it.  In older versions of R it had been R Development Core Team, but they dropped the Development part.

What's interesting is that, inevitably, when people add this to their favorite citation manager (like Endnote) it gets mangled into the name R.C. Team (or, for older citations, R.D.C. Team).

I really love this. I see it all the time in papers.  So for giggles I made a Google Scholar account for R.D.C. Team.  R.D.C. is doing pretty well with almost 14,000 citations as of 4/14.  

Actually there is some interesting stuff that can be gleaned from this novelty Google Scholar account. First, notice the amazing growth of citations for R. 2013 saw almost as many citations as all the past years combined! This fits with my own experience. I learned R in 2004 or 2005, and back then it was unusual to meet R users, especially outside of stats departments or those folks (like me) who worked on microarrays.  Now it is practically a standard. Where I work we commonly ask interviewees if they use R, mainly because the primary alternative, SAS, costs a fortune.  

Also, the bulk of citations are to R.D.C. and not the more recent R.C. Team. I'm not exactly sure when the D was dropped, but I think it was in the last couple of years.  Because so many citations come from 2013, this suggests that people are either using an old version of R and finding the old citation there, or more likely, just copying an out-of-date citation from an existing paper. 

Another interesting thing is the myriad ways in which people screw up the citation. Here's an example of the old standby, RDC Team.  Here it's mangled to R Development CT.  And if you're not into that whole brevity thing, you can spell out RDC's full name; here's one that mangled it to Team R Development Core.

This leads me to my final thoughts. I don't feel bad for not citing R. I do cite specific R packages, but if I just compute a correlation or regression in R I'm not going to cite it.  In these cases I use R out of convenience, not out of necessity. It feels no different than citing Microsoft Excel for helping me organize my data, which seems ridiculous. That said, if you are going to the trouble of citing R, do it correctly!! Here's a nice page describing how.
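As a footnote, if you do decide to cite R, you don't even have to retype anything; R will hand you the citation directly, and can format it for a reference manager so it doesn't get mangled in the first place. A quick sketch (the package name is just an example):

citation()           #prints the official citation for base R
toBibtex(citation()) #the same citation as a BibTeX entry, ready for a reference manager
citation("MASS")     #individual packages can be cited the same way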

Saturday, December 7, 2013

Modeling Genetic Drift Part 3 - Exact Calculation

In two earlier posts I used a Monte Carlo simulation and a beta distribution to model genetic drift in a population. You can read about the background in the Monte Carlo post.  The basic setup is that we assume a population of diploids is allowed to mate randomly, and following mating 100 offspring are chosen to start the next population.  This process is repeated 4 times, and we want to know how much an allele in the final population could have drifted from its initial allele frequency in the founding population.  The previous posts noted above are both approximations (pretty darn good approximations).  Here I'll demonstrate how to make the exact calculation using a compound binomial distribution method.

Approach 3: Exact calculation using a compound binomial distribution

We are assuming that at each generation a binomial sample of 200 gametes is randomly drawn from the parents to make the next generation (often called the Fisher-Wright model). We can model this process by nesting together a bunch of binomial distributions - one for each generation - with each dependent on the one that came before it.  Below I've drawn a small cartoonish view of what this looks like.  In the figure the three levels represent the discrete probability mass distributions for each of the first three generations.  In our previous examples we've assumed the initial allele frequency is 50%. Below you can see that Gen. 1 has just one bar.  This is an allele at 50%, and its probability mass is 1 (meaning it's the only possibility). We use a binomial distribution to determine the distribution of probabilities in Gen. 2.  The allele frequency has started to diffuse, though the highest mass is still on 0.5.  Then in the third generation things get more interesting (and complex).  Each of the 3 allele frequencies represented in Gen. 2 has some probability of contributing to the probability mass of all three allele frequencies in Gen. 3.  I've tried to represent this by color coding the allele frequencies in Gen. 2 and showing how these 3 colors each contribute to the final probability mass for each allele frequency in Gen. 3.  Notice that the leftmost (orange) allele frequency contributes more to the leftmost allele frequency in the next generation, and vice versa for the rightmost (red) allele frequency.  Each of the arrows represents one allele frequency path through the generations.  The trick is to calculate the probability of all these paths and sum those probabilities to give the total probability!


Most population genetics textbooks describe this approach. They show the mechanics (typically for only a single generation), and then dismiss it because in many cases it is intractable.  In the toy example above there are only 9 paths, which isn't much to deal with.  In our real example, with a population of 100 diploid individuals and 5 generations, there are many more: 201 possible allele counts at each of 4 transitions, or roughly 1.6 billion paths. If we wanted to do this with a population of 1 million individuals over fifty thousand generations it would likely be an intractable problem to compute, at least in the way I've coded it below (I'm sure a talented programmer could make this happen given enough compute power).  Below is the R code I used to create an exact distribution of probability masses for allele frequencies in the fifth generation.


k = matrix(0, nc=5, nr=201)##k will store all allele freq prob. for all 5 generations
rownames(k) = 0:200 # all possible allele counts in 100 diploids
colnames(k) = c('gen1', 'gen2', 'gen3', 'gen4', 'gen5')
k[,'gen1'] = c(rep(0,100),1,rep(0,100)) #initialize generation 1 @ 50% allele freq (i.e. p=1 for 100/200)

prev_gen = 'gen1'

for (gen in c('gen2', 'gen3', 'gen4', 'gen5')) {
  for (prev_ct in 0:200){
    prev_prob = k[prev_ct+1,prev_gen] #previous prob.
    new_probs = prev_prob * dbinom(x=0:200, size=200, prob=prev_ct/200) # prob of all paths from prev prob weighted by prob. of going down that path in prev. gen.
    k[,gen] = k[,gen] + new_probs  #summation
  }
  prev_gen = gen #move to next generation
}


That's it.  The code traces all possible paths and sums up their probabilities in each generation. Below is the exact probability distribution we expect in the fifth generation. This is the discrete distribution we've previously approximated with Monte Carlo simulations and with a continuous beta distribution.
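If you want to look at that distribution yourself, a one-liner along these lines will plot the gen5 column (my own sketch; the original figure isn't reproduced here):

plot((0:200)/200, k[,'gen5'], ty='h', xlab='allele freq', ylab='probability mass') #exact gen5 distribution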
In the R code above, getting that bit inside both for loops right was tricky, and I'd like to make sure I haven't messed it up. Fortunately we've been working with this particular setup (5 generations, 100 diploids, 50% initial allele frequency) for a little while now, so we have some expectations.  For example, we know the mean allele frequency in the fifth generation should be the same as the initial frequency (i.e. 50%). We also know that the variance in allele frequency in the fifth generation should be about 0.00496.  So we can test those expectations to see if the above code checks out. Also, at each generation we've computed the exhaustive probability mass distribution. This distribution should sum to 1 for each generation.  Let's test that first.  Below is an equation simply saying that for each allele frequency (x), if we look up its probability (p(x)) and sum all these up, they are expected to sum to 1.

Σ p(x) = 1   (summing over all possible allele frequencies x)
We can test that in R:
apply(k, 2, sum) #sum each gen. all should sum to 1
gen1 gen2 gen3 gen4 gen5 
   1    1    1    1    1 
Yep, they all sum to 1.  That's a good start.  Now let's work out the mean and variance for the fifth generation.  Following the definitions for x and p(x) given above we can compute the expected mean allele frequency in the fifth generation as follows.

E[x] = Σ x p(x)
E[x] is often used to indicate an expected mean value from a distribution.  And once we have E[x] we can compute the expected variance (Var[x]) using the equation below.

Var[x] = Σ (x - E[x])^2 p(x)
So let's test these two expected values now, remembering that we expect E[x] to be 0.5 and Var[x] to be about 0.00496.

final_gen = as.vector(k[,'gen5'])# grab final gen data
exp_x = sum(final_gen * ((0:200)/200))
exp_x #should be 0.5
[1] 0.5

exp_var = sum(final_gen * ((0:200)/200 - exp_x)**2)
exp_var #should be about 0.00496, from previous analysis
[1] 0.004962625

Great!  Everything checks out.  Now one more thing.  In the past examples we've calculated the 95% confidence interval for the allele frequency in the fifth generation.  We know it should be between about 36% and 64%.  Here we have a distribution of probabilities, so how do we figure out the 95% confidence interval?  Here's one approach I adapted from John Kruschke's code on pg. 627 of Doing Bayesian Data Analysis.

final_gen_sort = sort(final_gen, decreasing=T) #reverse sort the prob. mass distribution
CIheightIdx = min(which(cumsum(final_gen_sort) >= 0.95)) #number of most-probable values needed to reach 95% prob. mass
lwr_allele_freq = (100 - CIheightIdx/2) / 200 #get freq at lower bound
upr_allele_freq = (100 + CIheightIdx/2) / 200 #and upper bound
c(lwr_allele_freq, upr_allele_freq) # conf. interval
[1] 0.36 0.64
sum(final_gen_sort[1:CIheightIdx]) #true coverage of this CI
[1] 0.9536558
sum(final_gen_sort[1:(CIheightIdx-2)]) #coverage of the next smallest CI
[1] 0.9450768

We got the expected confidence interval, 36%-64%. But it isn't quite the 95% confidence interval, it's the 95.36558% confidence interval!  Because the exact approach we're using here models a discrete process, the cumulative sum of those probabilities moves in discrete chunks.  In this case there is no chunk that ends right as it captures 95% of the total probability.  We can have 95.36558% or 94.50768% confidence intervals; those are the two chunks that flank 95%. Typically you take the first chunk over the nominal value you are looking for, so in our case we'd choose 36%-64% as our 95% confidence interval.

Wednesday, December 4, 2013

Modeling Genetic Drift Part 2 - Beta Distribution

In an earlier post I used a Monte Carlo simulation to model genetic drift in a population. You can read about the background in that post.  The basic setup is that we assume a population of diploids is allowed to mate randomly, and following mating 100 offspring are chosen to start the next population.  This process is repeated 4 times, and we want to know how much an allele in the final population could have drifted from its initial allele frequency in the founding population.  Here I'll demonstrate another way to answer this question with an approximation using the beta distribution.

Approach 2: Estimating Genetic Drift Using the Beta Distribution

The beta distribution is really cool and useful. Here I'll try to explain how to apply it to estimating genetic drift.  However, a quick word of caution: my description of the beta distribution is pretty incomplete.  Read the Wikipedia article if you want to know all the things you can do with it.

To understand the beta distribution we have to first talk about the binomial distribution. I used a binomial random sampler in my earlier Monte Carlo simulation of genetic drift, but I didn't give much of an explanation.  The binomial distribution is really useful for binary outcomes, like flipping a coin and getting a heads or a tails.  So, for example, if I flip a coin we can use the binomial distribution to estimate the probability of getting a certain number of heads after a number of flips.  If we use a fair coin (50% chance of heads and 50% chance of tails on any flip), we can use the binomial distribution to find the probability of getting exactly 3 heads from 10 flips.  The beta distribution is sort of the mirror image of the binomial; it lets you go in the opposite direction.  Imagine we find 10 flipped coins lying on the ground, with 3 on heads and 7 on tails, and we want to know whether they came from a fair coin factory. Given the observation of 3 heads and 7 tails, the beta distribution describes how plausible each possible value of the coin's underlying probability of heads is, 50% included (whew!).  So the beta distribution lets us model the cloud of uncertainty surrounding a probability given some discrete binary observations.
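To put rough numbers on that coin example, here's a little sketch (mine, not part of the drift calculation below): dbinom goes in the binomial direction, and one common way to get a beta going in the other direction is to use shape parameters of heads + 1 and tails + 1, which amounts to assuming a flat prior on the heads probability.

#binomial direction: probability of exactly 3 heads in 10 flips of a fair coin
dbinom(3, size=10, prob=0.5) #about 0.117

#beta direction: uncertainty about the heads probability after seeing 3 heads and 7 tails
#shape1 = heads + 1 and shape2 = tails + 1 assumes a flat prior on that probability
curve(dbeta(x, shape1=3+1, shape2=7+1), from=0, to=1, xlab='probability of heads', ylab='density')
qbeta(c(0.025, 0.975), shape1=4, shape2=8) #95% interval for the heads probability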

Random mating produces discrete binary observations (an allele either does or does not get passed on to the next generation), and allele frequencies are a lot like probabilities, so we can use the beta distribution to model allele frequencies given a characterization of the random matings we expect in a population of 100 individuals (fish in this case).  To do this we have to solve for two “shape” parameters that we’ll need to supply to characterize the shape of our beta distribution.  These two parameters are called alpha and beta (I’ll use a and b here).  If you know the expected mean (u) and variance (v) for the final allele frequencies you can use them to figure out what a and b should be.  We discussed the expected mean earlier: it won’t change, and since we are starting with an allele at a frequency of 0.5, we expect the final frequency to also be 0.5, so u = 0.5.  The expected variance is a bit trickier.  There is a handy equation in Falconer and Mackay.  It’s equation 3.2.  It’s written as:

v = p0 q0 [1 - (1 - 1/(2N))^t]
(read Falconer and Mackay Chapter 3 for more detail on this equation).

Here v is the expected variance we are looking for, and t is the number of generations it takes to go from the initial population to the final population while being maintained with N diploid individuals. The starting allele frequencies are p0 and q0 (both 0.5 in our case).  So in our fishy model we can supply all of the missing values and calculate v.

p0 = 0.5
q0 = 0.5
2N = 2 X 100 = 200
t = 4 (remember getting to the 5th generation requires only 4 rounds of mating)
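Plugging those numbers in is a one-liner in R (just the arithmetic):

0.5 * 0.5 * (1 - (1 - 1/200)**4) #v, Falconer and Mackay eqn 3.2
[1] 0.004962625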

So chugging through this I get v = 0.00496. That's the expected variance for our allele frequency in the final generation.

Great, now we have the expected mean (u) and variance (v) at the final generation.  We use these values to calculate a and b for our beta distribution using the equations below (see: http://en.wikipedia.org/wiki/Beta_distribution#Parameter_estimation).

a = u (u(1-u)/v - 1)
b = (1-u) (u(1-u)/v - 1)
Plugging and chugging again I get a = 24.688 and b = 24.688.  That’s it!  We can specify a beta distribution to get our estimate of allele frequency drift!  Below is the R code repeating these steps and estimating the 95% confidence intervals:

n = 100
p = 0.5
q = 1-p

u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))**t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #see eqns above
b = (1-u)*((u*(1-u))/v-1) #ditto, see eqns above

#estimated lower and upper conf. intervals
qbeta(0.025, shape1=a, shape2=b) #lower

 [1] 0.3625435
qbeta(1-0.025, shape1=a, shape2=b) #upper
 [1] 0.6374565


The R code above tells us that the estimated 95% confidence interval from our beta distribution estimation of allele frequency drift runs from about 36.3% to 63.7%.  If you go back to the Monte Carlo approach you'll see that the two estimates are really close!  Hooray!

Now for fun let's compare this to the Monte Carlo estimate. More R code:
#rerun our monte carlo (from before) and fit our beta est. to it
x = 0.5
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=100000, nc=4) #matrix to store results
for (i in 1:100000) {
  xf = x  #initial freq always 0.5
  for (j in 1:4){ #5 generations require 4 samples
    xf = rbinom(1,size=200, xf) / 200 #random sample, normalized by size
    k[i,j] = xf   #update matrix
  }
}
#plot monte carlo result as histogram

hist(k[,4], br=100, freq=F, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
#add beta as a red line

lines(seq(0,1,0.01), dbeta(seq(0,1, 0.01), shape1=a, shape2=b ), ty='l', col=2)


Wow, the red line (beta) is pretty close to the Monte Carlo result!

Now some caveats.  The process we are modeling is discrete, but the beta distribution is continuous.  This means that the beta distribution will be a good estimator of drift if the population size is large (because there are so many discrete possibilities they become virtually continuous), and it will get increasingly worse with smaller populations. Another watch-out: the beta distribution will start to perform poorly if you want to measure drift for a really rare or really common allele, especially in a small population.  You'll know this is going to happen if you notice that the a and b parameters are really different from one another (one will be big and one will be small), and the smaller one is a pretty small number. Under these kinds of scenarios you'll see that the beta distribution does a poor job of estimating the loss of the rare allele (i.e. fixation of the other allele due to drift). Below is an example of this problem; I tweaked the R code above to model an initial allele starting at a 4% frequency rather than 50%.
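The tweaked code isn't shown in full here, but it's just the beta recipe from above with the starting frequency changed, roughly like this (to redraw the comparison you'd also rerun the Monte Carlo above with x = 0.04):

#same beta approximation as above, but with the allele starting at 4% instead of 50%
n = 100
p = 0.04
q = 1-p
u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))**t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #shape parameters, as before
b = (1-u)*((u*(1-u))/v-1)
c(a, b) #roughly 1.98 and 47.4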

See how the beta estimation misses the Monte Carlo distribution a little on the left side.  In this case a = 1.98 and b = 47.4.  This is the sort of thing to watch out for if you use this approach to estimate genetic drift.