EDIT: There are two follow up posts to this one here and here.
Here’s an interesting question in population genetics that my wonderful wife brought up recently. She was working with a fish population. The population started when 100 wild-caught fish were brought into the lab. The fish were allowed to randomly mate. After mating, 100 offspring were randomly sampled to produce the next generation. The fish are now in their fifth generation of this procedure (i.e. 4 rounds of 100 offspring randomly selected).
Here’s an interesting question in population genetics that my wonderful wife brought up recently. She was working with a fish population. The population started when 100 wild-caught fish were brought into the lab. The fish were allowed to randomly mate. After mating, 100 offspring were randomly sampled to produce the next generation. The fish are now in their fifth generation of this procedure (i.e. 4 rounds of 100 offspring randomly selected).
High Plains (Genetic) Drifter |
She was thinking about genotyping the fifth generation of
this population to estimate allele frequencies in the wild population from
which they were sampled. We talked this
over and both started wondering how much change to expect in their allele
frequencies when compared to the initial frequencies of the 100 wild-caught
fish that founded the population. Both selection
and genetic drift can change allele frequencies, but for the moment we’re going
to just focus on genetic drift.
We’re not ignoring selection because we don’t think it occurred
- it probably has - but we don’t know how much, so selection is tough to deal
with. Genetic drift on the other hand
can be estimated (as we’ll do below), and moreover we know it will happen in a
small population like the one we are dealing with. So by estimating drift alone we are getting a
best-case-scenario estimate of the amount of allele frequency change we can
expect between the founders and their fifth generation offspring. If we aren’t happy with the amount of change
under the best-case-scenario we can rule out genotyping the fish as a viable
option.
So the question is this, how do we model random genetic
drift in a population that has been maintained at 100 diploid individuals
through 5 generations. It turns out there are a bunch of approaches. In subsequent posts I’m
going to demonstrate several.
Also, at this point I haven’t specified the initial allele
frequency of the allele we are going to model. For our purposes we’ll work with
an allele at 50% frequency in the initial founding population, but each
approach outlined in this series can be generalized to deal with (almost) any initial
allele frequency.
Approach 1: Monte
Carlo simulation of genetic drift
This is my favorite
approach. I'll get to some other approaches in future posts, but I believe this is the easiest way to get an estimate of drift in this situation. Below is the R code to
simulate genetic drift in our fish population.
What we are doing here is just creating 1000 random walks through 5
generations.
The sample size (200 haploid genomes) influences the amount of allele frequency
variation (drift) from one generation to the next. The random selection is done by taking a
random sample from a binomial distribution of size 200. This random selection always starts with a
probability of sampling our allele at 50% in the first generation. In each subsequent generation the probability
is updated to reflect the realized frequency in from the random sample drawn
from the previous generation. By the
time you get to the fifth generation, the final allele frequency has
usually drifted a little bit away from 50% (in some cases it drifts away for a
time but by the end drifts its way back!!).
x = 0.5 #intial allele freq
sz = 100*2 #100 diploids
-> 200 genomes
k = matrix(-999, nr=1000, nc=4) #matrix to store results
for (i in 1:1000) {
xf = x #initial freq always 0.5, going fwd xf will
store our updated allele freq
for (j in 1:4){ #5
generations require 4 samples
xf = rbinom(1,size=200, xf) / 200 #random
sample, pop size 200
k[i,j] = xf
#update matrix
}
}
k = cbind(rep(.5, 1000), k) #add first gen. (always 0.5)
##that’s it for sampling, code below is plotting
#Now plot all 1000 random walks
plot(1,1, xlim=c(1,5), ylim=c(0,1), ty='n', xlab='generation',
ylab='allele freq')
for (i in 1:1000){
lines(1:5, k[i,],
lwd=.1) #draw 1000 random walks as lines
points(5.1, k[i,5],
pch='-', col=adjustcolor(2, alpha.f=.1))
#add red rug on right
}
1000 Random Walks |
MC_final = sort(k[,5]) #Monte Carlo result in final generation
hist(MC_final) #histogram
Histgram of 1000 Final Alelle Frequencies |
mean(MC_final) #(should be very close to 0.5)
[1] 0.49883
#lower and upper 95% Confidence intervals
c(MC_final[25], MC_final[975])
[1] 0.355 0.640
In the code I added a few bells and whistles to draw plots and
also compute the approximate expected mean and 95% confidence intervals. The results above tell us that the expected mean
allele frequency is almost 0.5 (i.e. the starting frequency). This makes sense, we didn’t apply selection
so we don’t expect any net directional change in allele frequency. Also 95% of the time we expect an allele
starting at 50% frequency to fall between about 36% and 64% in the fifth generation given our population size.
Since we are using a simulation these results are all approximate,
but that’s ok. We could do a larger
number of simulations to improve accuracy.
But more importantly, you have to put things in context of the question,
and I think these approximations are good enough to have a working sense of how
much drift we should expect in our fifth generation fish.
UPDATE: I found another blogger who describes the same Monte Carlo approach but codes it a little differently in R.
UPDATE: I found another blogger who describes the same Monte Carlo approach but codes it a little differently in R.
No comments:
Post a Comment