In an earlier post I used a Monte Carlo simulation to model genetic drift in a population; see that post for the background. The basic setup is that a population of diploids is allowed to mate at random, and after mating 100 offspring are chosen to start the next generation. This process is repeated 4 times, and we want to know how far an allele in the final population could have drifted from its frequency in the founding population. Here I'll demonstrate another way to answer this question: an approximation using the beta distribution.
Approach 2: Estimating Genetic Drift Using the Beta Distribution
The beta distribution is really cool and useful. Here I'll try to explain how to apply it to estimating genetic drift. A quick word of caution, though: my description of the beta distribution is pretty incomplete. Read the Wikipedia article if you want to know everything you can do with it.
To understand the beta distribution we have to first talk about
the binomial distribution. I used a binomial random sampler in my earlier Monte Carlo simulation of genetic drift, but I didn't give much of an explanation. The binomial distribution is really useful for binary outcomes, like flipping a coin and getting heads or tails. For example, if we flip a coin, we can use the binomial distribution to find the probability of getting a certain number of heads after a given number of flips.
If we use a fair coin (50% chance of heads and 50% chance of tails on
any flip), we can use the binomial distribution to find the probability of
getting exactly 3 heads from 10 flips.
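For example, this probability is a quick one-liner in R with dbinom() (just an aside; it isn't needed for the drift calculation below):
#probability of exactly 3 heads in 10 flips of a fair coin
dbinom(3, size=10, prob=0.5) #equivalent to choose(10,3) * 0.5^10
[1] 0.1171875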
The beta distribution is sort of the mirror image of the binomial; it
lets you go in the opposite direction.
Imagine we find 10 flipped coins lying on the ground, with 3 on heads
and 7 on tails. We want to know if they
came from a fair coin factory. Given the observation of 3 heads and 7 tails, the beta distribution describes how plausible each possible value of the probability of heads is, including the fair value of 50% (whew!). So the beta distribution lets us model the cloud of uncertainty surrounding a probability, given some discrete binary observations.
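To make that concrete, here is a small sketch (it assumes a flat prior on the heads probability, so the two shape parameters are just the observed counts plus one; below we'll get the shape parameters a different way, from a mean and a variance):
#plausible values of the heads probability after seeing 3 heads and 7 tails
p.heads = seq(0, 1, 0.01)
plot(p.heads, dbeta(p.heads, shape1=3+1, shape2=7+1), type='l', xlab='prob. of heads', ylab='density')
abline(v=0.5, col=2) #a fair coin (50% heads) is still quite plausible given these ten coins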
Random mating produces discrete binary observations (an allele either
does or does not get passed on to the next generation), and allele frequencies
are a lot like probabilities, so we can use the beta distribution to model
allele frequencies given a characterization of the random matings we expect in
a population of 100 individuals (fish in this case). To do this we have to solve for the two "shape" parameters that characterize our beta distribution. These two parameters are called alpha and
beta (I’ll use a and b here). If you
know the expected mean (u) and variance (v) for the final allele frequencies
you can use them to figure out what a and b should be. We discussed the expected mean earlier: drift doesn't change it, and since we are starting with an allele at a frequency of 0.5, we expect the final frequency to also be 0.5, so u = 0.5. The expected variance is a bit trickier. There is a handy equation in Falconer and Mackay (equation 3.2). It's written as:

v = p0*q0*(1 - (1 - 1/(2N))^t)
Here v is the expected variance we are looking for, and t is the number of generations it takes to go from the initial population to the final population while being maintained with N diploid individuals. The starting allele frequencies are p0 and
q0 (both 0.5 in our case). So in our fishy model we can supply all
of the missing values and calculate v.
p0 = 0.5
q0 = 0.5
2N = 2 X 100 = 200
t = 4 (remember getting to the 5th generation requires only 4
rounds of mating)
So chugging through this I get v = 0.00496. That's the expected variance for our allele frequency in the final generation.
Great, now we have the expected mean (u) and variance (v) at the
final generation. We use these values to calculate a and b for our beta distribution using the equations below (see: http://en.wikipedia.org/wiki/Beta_distribution#Parameter_estimation):

a = u*((u*(1-u))/v - 1)
b = (1-u)*((u*(1-u))/v - 1)

Plugging and chugging again I get a = 24.688 and b = 24.688. That's it!
We can specify a beta distribution to get our estimate of allele
frequency drift! Below is the R code
repeating these steps and estimating the 95% confidence intervals:
n = 100
p = 0.5
q = 1-p
u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))^t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #see eqns above
b = (1-u)*((u*(1-u))/v-1) #ditto see above
#estimated lower and upper conf. intervals
qbeta(0.025, shape1=a, shape2=b) #lower
[1] 0.3625435
qbeta(1-0.025, shape1=a, shape2=b) #upper
[1] 0.6374565
The R code above tells us that the estimated 95% confidence interval for our beta distribution estimate of allele frequency drift runs from about 36.3% to 63.7%. If you go back to the Monte Carlo approach you'll see that the two estimates are really close! Hooray! Now for fun let's compare this to the Monte Carlo estimate directly. More R code:
#rerun our monte carlo (from before) and fit our beta est. to it
x = 0.5
sz = 100*2 #100 diploids -> 200 genomes
k = matrix(-999, nrow=100000, ncol=4) #matrix to store results
for (i in 1:100000) {
xf = x #initial freq always 0.5
for (j in 1:4){ #5 generations require 4 samples
xf = rbinom(1, size=sz, prob=xf) / sz #random sample, normalized by size
k[i,j] = xf #update matrix
}
}
#plot monte carlo result as histogram
hist(k[,4], breaks=100, freq=FALSE, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
#add beta as a red line
lines(seq(0,1,0.01), dbeta(seq(0,1,0.01), shape1=a, shape2=b), type='l', col=2)
Wow, the red line (beta) is pretty close to the Monte Carlo result!
Now some caveats. The process we are modeling is discrete, but the beta distribution is continuous. This means that the beta distribution will be a good estimator of drift if the population size is large (because there are so many discrete possibilities they become virtually continuous), and it will get increasingly worse with smaller populations. Another thing to watch out for: the beta distribution will start to perform poorly if you want to measure drift for a really rare or really common allele, especially in a small population. You'll know when this is likely to happen if the a and b parameters are really different from one another (one big and one small), and the smaller one is a pretty small number. Under these kinds of scenarios you'll see that the beta distribution does a poor job of estimating the loss of the rare allele (i.e. fixation of one allele due to drift). Below is an example of this problem; I tweaked the R code above to model an initial allele starting at a 4% frequency rather than 50%.
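In outline, the tweak looks something like this (a sketch, reusing the code from above with a 4% starting frequency):
#same steps as above, but with a rare starting allele
n = 100
p = 0.04 #initial allele frequency of 4%
q = 1-p
u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))^t) #Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #works out to about 1.98
b = (1-u)*((u*(1-u))/v-1) #works out to about 47.4
#Monte Carlo with the same rare starting allele
x = 0.04
sz = 100*2 #100 diploids -> 200 genomes
k = matrix(-999, nrow=100000, ncol=4) #matrix to store results
for (i in 1:100000) {
xf = x #initial freq always 0.04
for (j in 1:4){ #5 generations require 4 samples
xf = rbinom(1, size=sz, prob=xf) / sz #random sample, normalized by size
k[i,j] = xf #update matrix
}
}
hist(k[,4], breaks=100, freq=FALSE, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
lines(seq(0,1,0.01), dbeta(seq(0,1,0.01), shape1=a, shape2=b), type='l', col=2)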
See how the beta estimate misses the Monte Carlo distribution a little on the left side. In this case a = 1.98 and b = 47.4. This is the sort of thing to watch out for if you use this approach to estimate genetic drift.