Approach 3: Exact calculation using a compound binomial distribution
We are assuming that at each generation a binomial sample of 200 gametes is randomly drawn from the parents to make the next generation (often called the Fisher-Wright model). We can model this process by nesting together a bunch of binomial distributions - one for each generation - with each dependent on the one that came before it. Below I've drawn a small cartoonish view of what this looks like. In the figure the three levels represent the discrete probability mass distributions for each of the first three generations. In our previous examples we've assumed the initial allele frequency is 50%. Below you can see that Gen. 1 has just one bar. This is an allele at 50% and it's probability mass is 1 (meaning it's the only possibility). We use a binomial distribution to determine the distributions of probabilities in Gen. 2. The allele frequency has started to diffuse, though the highest mass is still on 0.5. Then in the third generation things get more interesting (and complex). Each of the 3 allele frequencies represented in Gen. 2 have some probability of contributing to the probability mass of all three allele frequencies in Gen. 3. I've tried to represent this by color coding the allele frequencies in Gen. 2 and showing how these 3 colors each contribute to the final probability mass for each allele frequency in Gen. 3. Notice that the left most (orange) allele frequency contributes more to the left most allele frequency in the next generation, and visa-versa the right most (red) allele frequency. Each of the arrows represents one allele frequency path through the generations. The trick is to calculate the probability of all these paths and sum those probabilities to give the total probability!
Most population genetics text books describe this approach. They show the mechanics (typically for only a single generation), and then dismiss it because in many cases it is intractable. In the toy example above there are only 9 paths, which isn't much to deal with. In our real example with a population of 100 diploid individuals and 5 generations there are many more paths! If we wanted to do this with a population of 1 million individuals over fifty-thousand generations this would likely be an intractable problem to compute, at least in the way I've coded it below (I'm sure a talented programmer could make this happen given enough compute power). Below is R code I used to create an exact distribution of probability masses for allele frequencies in the fifth generation.
rownames(k) = 0:200 # all possible allele counts in 100 diploids
colnames(k) = c('gen1', 'gen2', 'gen3', 'gen4', 'gen5')
k[,'gen1'] = c(rep(0,100),1,rep(0,100)) #initialize generation 1 @ 50% allele freq (i.e. p=1 for 100/200)
prev_gen = 'gen1'
for (gen in c('gen2', 'gen3', 'gen4', 'gen5')) {
for (prev_ct in 0:200){
prev_prob = k[prev_ct+1,prev_gen] #previous prob.
new_probs = prev_prob * dbinom(x=0:200, size=200, prob=prev_ct/200) # prob of all paths from prev prob weighted by prob. of going down that path in prev. gen.
k[,gen] = k[,gen] + new_probs #summation
}
prev_gen = gen #move to next generation
}
That's it. The code traces all possible paths and sums up their probabilities in each generation. Below is the exact probability distribution we expect in the fifth generation. This is the discrete distribution we've previously approximated with Monte Carlo simulations and with a continuous beta distribution.
In the R code above, coding that bit in inside both for loops was tricky. I'd like to make sure I haven't messed it up. Fortunately we've been working with this particular setup (5 generations, 100 diploids, 50% initial allele frequency) for a little while now so we have some expectations. For example we know the mean allele frequency in the fifth generation should be the same as the initial frequency (i.e. 50%). We also know that the variance in allele frequency in the fifth generation should be about 0.00496. So we can test those expectations to see if the above code checks out. Also, at each generation we've computed the exhaustive probability mass distribution. This distribution should sum to 1 for each generation. Let's test that first. Below is an equation simply saying that for each allele frequency (x), if we look up it's probability (p(x)), and sum all these up, they are expected to sum to 1.
We can test that in R:
apply(k, 2, sum) #sum each gen. all should sum to 1gen1 gen2 gen3 gen4 gen5
1 1 1 1 1
Yep, they all sum to 1. That's a good start. Now let's work out the mean and variance for the fifth generation. Following the definitions for x and p(x) given above we can compute the expected mean allele frequency in the fifth generation as follows.
E[x] is often used to indicate an expected mean value from a distribution. And once we have E[x] we can compute the expected variance (Var[x]) using the equation below.
So let's test these two expected values now, remembering that we expect E[x] to be 0.5 and Var[x] to be about 0.00496.
exp_x = sum(final_gen * ((0:200)/200))
exp_x #should be 0.5
[1] 0.5
exp_var = sum(final_gen * ((0:200)/200 - exp_x)**2)
exp_var #should be about 0.00496, from previous analysis
[1] 0.004962625
Great! Everything checks out. Now one more thing. In the past examples we've calculated the 95% confidence interval for the allele frequency in the fifth generation. We know it should be between about 36% and 64%. Here we have a distribution of probabilities, so how do we figure out the 95% confidence interval? Here's one approach I adapted from John Kruschke's code on pg. 627 of Doing Bayesian Data Analysis.
final_gen_sort = sort(final_gen, decreasing=T) #reverse sort prob mass distrib
CIheigthIdx = min(which(cumsum(final_gen_sort) >= 0.95))#count values with less than 95% prob. mass
lwr_allele_freq = (100 - CIheigthIdx/2) / 200 #get freqs at lower bound
upr_allele_freq = (100 + CIheigthIdx/2) / 200 #and upper bound
c(lwr_allele_freq, upr_allele_freq) # conf. interval
[1] 0.36 0.64
sum(final_gen_sort[0:CIheigthIdx]) #find true CI width
[1] 0.9536558
sum(final_gen_sort[0:(CIheigthIdx-2)]) #find next smallest CI width
[1] 0.9450768
We got the expected confidence interval, 36%-64%. But it isn't quite the 95% confidence interval, it's the 95.36558% confidence interval! Because the exact approach we're using here models a discrete process, the cumulative sum of those probabilities moves in discrete chunks. In this case there is no chunk that ends right as it captures 95% of the total probability. We can have 95.36558% or 94.50768% confidence intervals; those are the two chunks that flank 95%. Typically you take the first chunk over the nominal value you are looking for, so in our case we'd choose 36%-64% as our 95% confidence interval.