Saturday, December 7, 2013

Modeling Genetic Drift Part 3 - Exact Calculation

In two earlier posts I used a Monte Carlo simulation and a beta distribution to model genetic drift in a population. You can read about the background by reading the Monte Carlo post.  The basic setup is that we are assuming a population of diploids were allowed to randomly mate and following mating 100 offspring are chosen to start the next population.  This process was repeated 4 times, and we want to know how much an allele in the final population could have drifted from its initial allele frequency in the founding population.  The previous posts noted above are both approximations (pretty darn good approximations).  Here I'll demonstrate how to make the exact calculation using a compound binomial distribution method. 

Approach 3: Exact calculation using a compound binomial distribution

We are assuming that at each generation a binomial sample of 200 gametes is randomly drawn from the parents to make the next generation (often called the Fisher-Wright model). We can model this process by nesting together a bunch of binomial distributions - one for each generation - with each dependent on the one that came before it.  Below I've drawn a small cartoonish view of what this looks like.  In the figure the three levels represent the discrete probability mass distributions for each of the first three generations.  In our previous examples we've assumed the initial allele frequency is 50%. Below you can see that Gen. 1 has just one bar.  This is an allele at 50% and it's probability mass is 1 (meaning it's the only possibility). We use a binomial distribution to determine the distributions of probabilities in Gen. 2.  The allele frequency has started to diffuse, though the highest mass is still on 0.5.  Then in the third generation things get more interesting (and complex).  Each of the 3 allele frequencies represented in Gen. 2 have some probability of contributing to the probability mass of all three allele frequencies in Gen. 3.  I've tried to represent this by color coding the allele frequencies in Gen. 2 and showing how these 3 colors each contribute to the final probability mass for each allele frequency in Gen. 3.  Notice that the left most (orange) allele frequency contributes more to the left most allele frequency in the next generation, and visa-versa the right most (red) allele frequency.  Each of the arrows represents one allele frequency path through the generations.  The trick is to calculate the probability of all these paths and sum those probabilities to give the total probability!  


Most population genetics text books describe this approach. They show the mechanics (typically for only a single generation), and then dismiss it because in many cases it is intractable.  In the toy example above there are only 9 paths, which isn't much to deal with.  In our real example with a population of 100 diploid individuals and 5 generations there are many more paths! If we wanted to do this with a population of 1 million individuals over fifty-thousand generations this would likely be an intractable problem to compute, at least in the way I've coded it below (I'm sure a talented programmer could make this happen given enough compute power).  Below is R code I used to create an exact distribution of probability masses for allele frequencies in the fifth generation. 


k = matrix(0, nc=5, nr=201)##k will store all allele freq prob. for all 5 generations
rownames(k) = 0:200 # all possible allele counts in 100 diploids
colnames(k) = c('gen1', 'gen2', 'gen3', 'gen4', 'gen5')
k[,'gen1'] = c(rep(0,100),1,rep(0,100)) #initialize generation 1 @ 50% allele freq (i.e. p=1 for 100/200)

prev_gen = 'gen1'

for (gen in c('gen2', 'gen3', 'gen4', 'gen5')) {
  for (prev_ct in 0:200){
    prev_prob = k[prev_ct+1,prev_gen] #previous prob.
    new_probs = prev_prob * dbinom(x=0:200, size=200, prob=prev_ct/200) # prob of all paths from prev prob weighted by prob. of going down that path in prev. gen.
    k[,gen] = k[,gen] + new_probs  #summation
  }
  prev_gen = gen #move to next generation
}


That's it.  The code traces all possible paths and sums up their probabilities in each generation. Below is the exact probability distribution we expect in the fifth generation. This is the discrete distribution we've previously approximated with Monte Carlo simulations and with a continuous beta distribution.
In the R code above, coding that bit in inside both for loops was tricky.  I'd like to make sure I haven't messed it up. Fortunately we've been working with this particular setup (5 generations, 100 diploids, 50% initial allele frequency) for a little while now so we have some expectations.  For example we know the mean allele frequency in the fifth generation should be the same as the initial frequency (i.e. 50%). We also know that the variance in allele frequency in the fifth generation should be about 0.00496.  So we can test those expectations to see if the above code checks out. Also, at each generation we've computed the exhaustive probability mass distribution. This distribution should sum to 1 for each generation.  Let's test that first.  Below is an equation simply saying that for each allele frequency (x), if we look up it's probability (p(x)), and sum all these up, they are expected to sum to 1.


We can test that in R:
apply(k, 2, sum) #sum each gen. all should sum to 1
gen1 gen2 gen3 gen4 gen5 
   1    1    1    1    1 
Yep, they all sum to 1.  That's a good start.  Now let's work out the mean and variance for the fifth generation.  Following the definitions for x and p(x) given above we can compute the expected mean allele frequency in the fifth generation as follows.


E[x] is often used to indicate an expected mean value from a distribution.  And once we have E[x] we can compute the expected variance (Var[x]) using the equation below.


So let's test these two expected values now, remembering that we expect E[x] to be 0.5 and Var[x] to be about 0.00496.

final_gen = as.vector(k[,'gen5'])# grab final gen data
exp_x = sum(final_gen * ((0:200)/200))
exp_x #should be 0.5
[1] 0.5

exp_var = sum(final_gen * ((0:200)/200 - exp_x)**2)
exp_var #should be about 0.00496, from previous analysis
[1] 0.004962625

Great!  Everything checks out.  Now one more thing.  In the past examples we've calculated the 95% confidence interval for the allele frequency in the fifth generation.  We know it should be between about 36% and 64%.  Here we have a distribution of probabilities, so how do we figure out the 95% confidence interval?  Here's one approach I adapted from John Kruschke's code on pg. 627 of Doing Bayesian Data Analysis.

final_gen_sort = sort(final_gen, decreasing=T) #reverse sort prob mass distrib
CIheigthIdx = min(which(cumsum(final_gen_sort) >= 0.95))#count values with less than 95% prob. mass
lwr_allele_freq = (100 - CIheigthIdx/2) / 200 #get freqs at lower bound
upr_allele_freq = (100 + CIheigthIdx/2) / 200 #and upper bound
c(lwr_allele_freq, upr_allele_freq) # conf. interval
[1] 0.36 0.64
sum(final_gen_sort[0:CIheigthIdx]) #find true CI width
[1] 0.9536558
sum(final_gen_sort[0:(CIheigthIdx-2)]) #find next smallest CI width
[1] 0.9450768

We got the expected confidence interval, 36%-64%. But it isn't quite the 95% confidence interval, it's the 95.36558% confidence interval!  Because the exact approach we're using here models a discrete process, the cumulative sum of those probabilities moves in discrete chunks.  In this case there is no chunk that ends right as it captures 95% of the total probability.  We can have 95.36558% or 94.50768% confidence intervals; those are the two chunks that flank 95%. Typically you take the first chunk over the nominal value you are looking for, so in our case we'd choose 36%-64% as our 95% confidence interval.

Wednesday, December 4, 2013

Modeling Genetic Drift Part 2 - Beta Distribution

In an earlier post I used a Monte Carlo simulation to model genetic drift in a population. You can read about the background by reading that post.  The basic setup is that we are assuming a population of diploids were allowed to randomly mate and following mating 100 offspring are chosen to start the next population.  This process was repeated 4 times, and we want to know how much an allele in the final population could have drifted from its initial allele frequency in the founding population.  Here I'll demonstrate another way to answer this question with an approximation using the beta distribution. 

Approach 2: Estimating Genetic Drift Using the Beta Distribution

The beta distribution is really cool and useful. Here I’ll try to explain how to apply the beta distribution to estimating genetic drift.  However, a quick word of caution, my description of the beta distribution is pretty incomplete.  Read the wikipedia article if you want to know all the things you can do with the beta distribution.

To understand the beta distribution we have to first talk about the binomial distribution. I used a binomial random sampler in my earlier Monte Carlo simulation of genetic drift, but I didn't give much of an explanation.  The binomial distribution is really useful for binary outcomes, like flipping a coin and getting a heads or a tails.  So for example, if I flip a coin we can use the binomial distribution to estimate the probability of getting a certain number of heads after a number of flips.  If we use a fair coin (50% chance of heads and 50% chance of tails on any flip), we can use the binomial distribution to find the probability of getting exactly 3 heads from 10 flips.  The beta distribution is sort of the mirror image of the binomial; it lets you go in the opposite direction.  Imagine we find 10 flipped coins lying on the ground, with 3 on heads and 7 on tails.  We want to know if they came from a fair coin factory. Given the observation of 3 heads and 7 tails, the beta distribution can tell us the probability that the probability of heads is 50% for our coins (whew!).  So the beta distribution lets us model the cloud of uncertainty surrounding a probability given some discrete binary observations.

Random matings produces discrete binary observations (an allele either does or does not get passed on to the next generation), and allele frequencies are a lot like probabilities, so we can use the beta distribution to model allele frequencies given a characterization of the random matings we expect in a population of 100 individuals (fish in this case).  To do this we have to solve for two “shape” parameters that we’ll need to supply to characterize the shape of our beta distribution.  These two parameters are called alpha and beta (I’ll use a and b here).  If you know the expected mean (u) and variance (v) for the final allele frequencies you can use them to figure out what a and b should be.  We discussed the expected mean earlier, it won’t change, and since we are starting with an allele at a frequency of 0.5, we expect the final frequency to also be 0.5, so u = 0.5.  The expected variance is a bit trickier.  There is a handy equation in Falconer and Mackay.  It’s equation 3.2.  It’s written as:

(read Falconer and Mackay Chapter 3 for more detail on this equation).

Here v is the expected variance we are looking for, and t is the number of generations it takes to go from the initial population to the final population while being maintained with N diploid individuals. The starting allele frequencies are p0 and q0 (both 0.5 in our case).  So in our fishy model we can supply all of the missing values and calculate v.

p0 = 0.5
q0 = 0.5
2N = 2 X 100 = 200
t = 4 (remember getting to the 5th generation requires only 4 rounds of mating)

So chugging through this I get v = 0.00496. That's the expected variance for our allele frequency in the final generation.

Great, now we have the expected mean (u) and variance (v) at the final generation.  We use these values to calculate a and b for our beta distribution using the equations below (see: http://en.wikipedia.org/wiki/Beta_distribution#Parameter_estimation). 


Plugging and chugging again I get a = 24.688 and b = 24.688.  That’s it!  We can specify a beta distribution to get our estimate of allele frequency drift!  Below is the R code repeating these steps and estimating the 95% confidence intervals:

n = 100
p = 0.5
q = 1-p

u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))**t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #see eqns above
b = (1-u)*((
u*(1-u)
)/v-1) #ditto see above

#estimated lower and upper conf. intervals
qbeta(0.025, shape1=a, shape2=b) #lower

 [1] 0.3625435
qbeta(1-0.025, shape1=a, shape2=b) #upper
 [1] 0.6374565


The R code above tells us that the estimated 95% confidence intervals for our beta distribution estimation of allele frequency drift are about 36.3% and 63.7%.  If you go back to the MonteCarlo approach you see that both estimates are really close!  Hooray!  

Now for fun lets compare this to the Monte Carlo estimate. More R code:
#rerun our monte carlo (from before) and fit our beta est. to it
x = 0.5
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=100000, nc=4) #matrix to store results
for (i in 1:100000) {
  xf = x  #initial freq always 0.5
  for (j in 1:4){ #5 generations require 4 samples
    xf = rbinom(1,size=200, xf) / 200 #random sample, normalized by size
    k[i,j] = xf   #update matrix
  }
}
#plot monte carlo result as histogram

hist(k[,4], br=100, freq=F, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
#add beta as a red line

lines(seq(0,1,0.01), dbeta(seq(0,1, 0.01), shape1=a, shape2=b ), ty='l', col=2)


Wow, the red line (beta) is pretty close to the Monte Carlo result!

Now some caveats.  The process we are modeling is discrete, but the beta distribution in continuous.  This means that the beta distribution will be a good estimator of drift if the population size is large (because there are so many discrete possibilities they become virtually continuous), and will get increasingly worse with smaller populations. Another watch out, the beta distribution will start to perform poorly if you want to measure drift for a really rare or really common allele, especially in a small population.  You'll know when this is going to happen if you notice that the a and b parameters are really different from one another (one will be big and one will be small), and the smaller one is a pretty small number. Under these kinds of scenarios you'll see that the beta distribution does a poor job of estimating the loss of the rare allele (i.e. fixation of one allele due to drift). Below is an example of this problem, I tweaked the R code above to model an initial allele starting at a 4% frequency rather than 50%.    

See how the beta estimation misses the Monte Carlo distribution a little on the left side.  In this case a = 1.98 and b = 47.4.  This is the sort of thing to watch out for if you use this approach to estimate genetic drift.

Modeling Genetic Drift - Monte Carlo Simulation

EDIT: There are two follow up posts to this one here and here.

Here’s an interesting question in population genetics that my wonderful wife brought up recently.  She was working with a fish population.  The population started when 100 wild-caught fish were brought into the lab.  The fish were allowed to randomly mate.  After mating, 100 offspring were randomly sampled to produce the next generation.  The fish are now in their fifth generation of this procedure (i.e. 4 rounds of 100 offspring randomly selected).  
High Plains (Genetic) Drifter

She was thinking about genotyping the fifth generation of this population to estimate allele frequencies in the wild population from which they were sampled.  We talked this over and both started wondering how much change to expect in their allele frequencies when compared to the initial frequencies of the 100 wild-caught fish that founded the population.  Both selection and genetic drift can change allele frequencies, but for the moment we’re going to just focus on genetic drift. 

We’re not ignoring selection because we don’t think it occurred - it probably has - but we don’t know how much, so selection is tough to deal with.  Genetic drift on the other hand can be estimated (as we’ll do below), and moreover we know it will happen in a small population like the one we are dealing with.  So by estimating drift alone we are getting a best-case-scenario estimate of the amount of allele frequency change we can expect between the founders and their fifth generation offspring.  If we aren’t happy with the amount of change under the best-case-scenario we can rule out genotyping the fish as a viable option.

So the question is this, how do we model random genetic drift in a population that has been maintained at 100 diploid individuals through 5 generations. It turns out there are a bunch of approaches. In subsequent posts I’m going to demonstrate several. 

Also, at this point I haven’t specified the initial allele frequency of the allele we are going to model. For our purposes we’ll work with an allele at 50% frequency in the initial founding population, but each approach outlined in this series can be generalized to deal with (almost) any initial allele frequency.

Approach 1: Monte Carlo simulation of genetic drift

This is my favorite approach.  I'll get to some other approaches in future posts, but I believe this is the easiest way to get an estimate of drift in this situation.  Below is the R code to simulate genetic drift in our fish population.  What we are doing here is just creating 1000 random walks through 5 generations. The sample size (200 haploid genomes) influences the amount of allele frequency variation (drift) from one generation to the next.  The random selection is done by taking a random sample from a binomial distribution of size 200. This random selection always starts with a probability of sampling our allele at 50% in the first generation.  In each subsequent generation the probability is updated to reflect the realized frequency in from the random sample drawn from the previous generation.  By the time you get to the fifth generation, the final allele frequency has usually drifted a little bit away from 50% (in some cases it drifts away for a time but by the end drifts its way back!!).

x = 0.5 #intial allele freq
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=1000, nc=4) #matrix to store results

for (i in 1:1000) {
   xf = x  #initial freq always 0.5, going fwd xf will store our updated allele freq
   for (j in 1:4){ #5 generations require 4 samples
     xf = rbinom(1,size=200, xf) / 200 #random sample, pop size 200
     k[i,j] = xf   #update matrix
   }
}

k = cbind(rep(.5, 1000), k) #add first gen. (always 0.5)

##that’s it for sampling, code below is plotting

#Now plot all 1000 random walks
plot(1,1, xlim=c(1,5), ylim=c(0,1), ty='n', xlab='generation', ylab='allele freq')
for (i in 1:1000){
  lines(1:5, k[i,], lwd=.1)  #draw 1000 random walks as lines
  points(5.1, k[i,5], pch='-', col=adjustcolor(2, alpha.f=.1))  #add red rug on right
}

1000 Random Walks

MC_final = sort(k[,5]) #Monte Carlo result in final generation
hist(MC_final) #histogram
Histgram of 1000 Final Alelle Frequencies
  
mean(MC_final) #(should be very close to 0.5)
[1] 0.49883

#lower and upper 95% Confidence intervals
c(MC_final[25], MC_final[975])
[1] 0.355 0.640

In the code I added a few bells and whistles to draw plots and also compute the approximate expected mean and 95% confidence intervals.  The results above tell us that the expected mean allele frequency is almost 0.5 (i.e. the starting frequency).  This makes sense, we didn’t apply selection so we don’t expect any net directional change in allele frequency.  Also 95% of the time we expect an allele starting at 50% frequency to fall between about 36% and 64% in the fifth generation given our population size.


Since we are using a simulation these results are all approximate, but that’s ok.  We could do a larger number of simulations to improve accuracy.  But more importantly, you have to put things in context of the question, and I think these approximations are good enough to have a working sense of how much drift we should expect in our fifth generation fish.

UPDATE: I found another blogger who describes the same Monte Carlo approach but codes it a little differently in R.