Wednesday, December 4, 2013

Modeling Genetic Drift - Monte Carlo Simulation

EDIT: There are two follow up posts to this one here and here.

Here’s an interesting question in population genetics that my wonderful wife brought up recently.  She was working with a fish population.  The population started when 100 wild-caught fish were brought into the lab.  The fish were allowed to randomly mate.  After mating, 100 offspring were randomly sampled to produce the next generation.  The fish are now in their fifth generation of this procedure (i.e. 4 rounds of 100 offspring randomly selected).  
High Plains (Genetic) Drifter

She was thinking about genotyping the fifth generation of this population to estimate allele frequencies in the wild population from which they were sampled.  We talked this over and both started wondering how much change to expect in their allele frequencies when compared to the initial frequencies of the 100 wild-caught fish that founded the population.  Both selection and genetic drift can change allele frequencies, but for the moment we’re going to just focus on genetic drift. 

We’re not ignoring selection because we don’t think it occurred - it probably has - but we don’t know how much, so selection is tough to deal with.  Genetic drift on the other hand can be estimated (as we’ll do below), and moreover we know it will happen in a small population like the one we are dealing with.  So by estimating drift alone we are getting a best-case-scenario estimate of the amount of allele frequency change we can expect between the founders and their fifth generation offspring.  If we aren’t happy with the amount of change under the best-case-scenario we can rule out genotyping the fish as a viable option.

So the question is this, how do we model random genetic drift in a population that has been maintained at 100 diploid individuals through 5 generations. It turns out there are a bunch of approaches. In subsequent posts I’m going to demonstrate several. 

Also, at this point I haven’t specified the initial allele frequency of the allele we are going to model. For our purposes we’ll work with an allele at 50% frequency in the initial founding population, but each approach outlined in this series can be generalized to deal with (almost) any initial allele frequency.

Approach 1: Monte Carlo simulation of genetic drift

This is my favorite approach.  I'll get to some other approaches in future posts, but I believe this is the easiest way to get an estimate of drift in this situation.  Below is the R code to simulate genetic drift in our fish population.  What we are doing here is just creating 1000 random walks through 5 generations. The sample size (200 haploid genomes) influences the amount of allele frequency variation (drift) from one generation to the next.  The random selection is done by taking a random sample from a binomial distribution of size 200. This random selection always starts with a probability of sampling our allele at 50% in the first generation.  In each subsequent generation the probability is updated to reflect the realized frequency in from the random sample drawn from the previous generation.  By the time you get to the fifth generation, the final allele frequency has usually drifted a little bit away from 50% (in some cases it drifts away for a time but by the end drifts its way back!!).

x = 0.5 #intial allele freq
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=1000, nc=4) #matrix to store results

for (i in 1:1000) {
   xf = x  #initial freq always 0.5, going fwd xf will store our updated allele freq
   for (j in 1:4){ #5 generations require 4 samples
     xf = rbinom(1,size=200, xf) / 200 #random sample, pop size 200
     k[i,j] = xf   #update matrix
   }
}

k = cbind(rep(.5, 1000), k) #add first gen. (always 0.5)

##that’s it for sampling, code below is plotting

#Now plot all 1000 random walks
plot(1,1, xlim=c(1,5), ylim=c(0,1), ty='n', xlab='generation', ylab='allele freq')
for (i in 1:1000){
  lines(1:5, k[i,], lwd=.1)  #draw 1000 random walks as lines
  points(5.1, k[i,5], pch='-', col=adjustcolor(2, alpha.f=.1))  #add red rug on right
}

1000 Random Walks

MC_final = sort(k[,5]) #Monte Carlo result in final generation
hist(MC_final) #histogram
Histgram of 1000 Final Alelle Frequencies
  
mean(MC_final) #(should be very close to 0.5)
[1] 0.49883

#lower and upper 95% Confidence intervals
c(MC_final[25], MC_final[975])
[1] 0.355 0.640

In the code I added a few bells and whistles to draw plots and also compute the approximate expected mean and 95% confidence intervals.  The results above tell us that the expected mean allele frequency is almost 0.5 (i.e. the starting frequency).  This makes sense, we didn’t apply selection so we don’t expect any net directional change in allele frequency.  Also 95% of the time we expect an allele starting at 50% frequency to fall between about 36% and 64% in the fifth generation given our population size.


Since we are using a simulation these results are all approximate, but that’s ok.  We could do a larger number of simulations to improve accuracy.  But more importantly, you have to put things in context of the question, and I think these approximations are good enough to have a working sense of how much drift we should expect in our fifth generation fish.

UPDATE: I found another blogger who describes the same Monte Carlo approach but codes it a little differently in R.

No comments:

Post a Comment