Wednesday, December 4, 2013

Modeling Genetic Drift Part 2 - Beta Distribution

In an earlier post I used a Monte Carlo simulation to model genetic drift in a population. You can read about the background by reading that post.  The basic setup is that we are assuming a population of diploids were allowed to randomly mate and following mating 100 offspring are chosen to start the next population.  This process was repeated 4 times, and we want to know how much an allele in the final population could have drifted from its initial allele frequency in the founding population.  Here I'll demonstrate another way to answer this question with an approximation using the beta distribution. 

Approach 2: Estimating Genetic Drift Using the Beta Distribution

The beta distribution is really cool and useful. Here I’ll try to explain how to apply the beta distribution to estimating genetic drift.  However, a quick word of caution, my description of the beta distribution is pretty incomplete.  Read the wikipedia article if you want to know all the things you can do with the beta distribution.

To understand the beta distribution we have to first talk about the binomial distribution. I used a binomial random sampler in my earlier Monte Carlo simulation of genetic drift, but I didn't give much of an explanation.  The binomial distribution is really useful for binary outcomes, like flipping a coin and getting a heads or a tails.  So for example, if I flip a coin we can use the binomial distribution to estimate the probability of getting a certain number of heads after a number of flips.  If we use a fair coin (50% chance of heads and 50% chance of tails on any flip), we can use the binomial distribution to find the probability of getting exactly 3 heads from 10 flips.  The beta distribution is sort of the mirror image of the binomial; it lets you go in the opposite direction.  Imagine we find 10 flipped coins lying on the ground, with 3 on heads and 7 on tails.  We want to know if they came from a fair coin factory. Given the observation of 3 heads and 7 tails, the beta distribution can tell us the probability that the probability of heads is 50% for our coins (whew!).  So the beta distribution lets us model the cloud of uncertainty surrounding a probability given some discrete binary observations.

Random matings produces discrete binary observations (an allele either does or does not get passed on to the next generation), and allele frequencies are a lot like probabilities, so we can use the beta distribution to model allele frequencies given a characterization of the random matings we expect in a population of 100 individuals (fish in this case).  To do this we have to solve for two “shape” parameters that we’ll need to supply to characterize the shape of our beta distribution.  These two parameters are called alpha and beta (I’ll use a and b here).  If you know the expected mean (u) and variance (v) for the final allele frequencies you can use them to figure out what a and b should be.  We discussed the expected mean earlier, it won’t change, and since we are starting with an allele at a frequency of 0.5, we expect the final frequency to also be 0.5, so u = 0.5.  The expected variance is a bit trickier.  There is a handy equation in Falconer and Mackay.  It’s equation 3.2.  It’s written as:

(read Falconer and Mackay Chapter 3 for more detail on this equation).

Here v is the expected variance we are looking for, and t is the number of generations it takes to go from the initial population to the final population while being maintained with N diploid individuals. The starting allele frequencies are p0 and q0 (both 0.5 in our case).  So in our fishy model we can supply all of the missing values and calculate v.

p0 = 0.5
q0 = 0.5
2N = 2 X 100 = 200
t = 4 (remember getting to the 5th generation requires only 4 rounds of mating)

So chugging through this I get v = 0.00496. That's the expected variance for our allele frequency in the final generation.

Great, now we have the expected mean (u) and variance (v) at the final generation.  We use these values to calculate a and b for our beta distribution using the equations below (see: http://en.wikipedia.org/wiki/Beta_distribution#Parameter_estimation). 


Plugging and chugging again I get a = 24.688 and b = 24.688.  That’s it!  We can specify a beta distribution to get our estimate of allele frequency drift!  Below is the R code repeating these steps and estimating the 95% confidence intervals:

n = 100
p = 0.5
q = 1-p

u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))**t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #see eqns above
b = (1-u)*((
u*(1-u)
)/v-1) #ditto see above

#estimated lower and upper conf. intervals
qbeta(0.025, shape1=a, shape2=b) #lower

 [1] 0.3625435
qbeta(1-0.025, shape1=a, shape2=b) #upper
 [1] 0.6374565


The R code above tells us that the estimated 95% confidence intervals for our beta distribution estimation of allele frequency drift are about 36.3% and 63.7%.  If you go back to the MonteCarlo approach you see that both estimates are really close!  Hooray!  

Now for fun lets compare this to the Monte Carlo estimate. More R code:
#rerun our monte carlo (from before) and fit our beta est. to it
x = 0.5
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=100000, nc=4) #matrix to store results
for (i in 1:100000) {
  xf = x  #initial freq always 0.5
  for (j in 1:4){ #5 generations require 4 samples
    xf = rbinom(1,size=200, xf) / 200 #random sample, normalized by size
    k[i,j] = xf   #update matrix
  }
}
#plot monte carlo result as histogram

hist(k[,4], br=100, freq=F, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
#add beta as a red line

lines(seq(0,1,0.01), dbeta(seq(0,1, 0.01), shape1=a, shape2=b ), ty='l', col=2)


Wow, the red line (beta) is pretty close to the Monte Carlo result!

Now some caveats.  The process we are modeling is discrete, but the beta distribution in continuous.  This means that the beta distribution will be a good estimator of drift if the population size is large (because there are so many discrete possibilities they become virtually continuous), and will get increasingly worse with smaller populations. Another watch out, the beta distribution will start to perform poorly if you want to measure drift for a really rare or really common allele, especially in a small population.  You'll know when this is going to happen if you notice that the a and b parameters are really different from one another (one will be big and one will be small), and the smaller one is a pretty small number. Under these kinds of scenarios you'll see that the beta distribution does a poor job of estimating the loss of the rare allele (i.e. fixation of one allele due to drift). Below is an example of this problem, I tweaked the R code above to model an initial allele starting at a 4% frequency rather than 50%.    

See how the beta estimation misses the Monte Carlo distribution a little on the left side.  In this case a = 1.98 and b = 47.4.  This is the sort of thing to watch out for if you use this approach to estimate genetic drift.

No comments:

Post a Comment