
Wednesday, December 4, 2013

Modeling Genetic Drift Part 2 - Beta Distribution

In an earlier post I used a Monte Carlo simulation to model genetic drift in a population; you can read that post for the background. The basic setup is that we assume a population of diploids is allowed to mate randomly, and after mating 100 offspring are chosen to start the next population. This process is repeated 4 times, and we want to know how far an allele in the final population could have drifted from its initial frequency in the founding population. Here I'll demonstrate another way to answer this question, using an approximation based on the beta distribution.

Approach 2: Estimating Genetic Drift Using the Beta Distribution

The beta distribution is really cool and useful. Here I'll try to explain how to apply it to estimating genetic drift. A quick word of caution: my description of the beta distribution is pretty incomplete. Read the Wikipedia article if you want to know all the things you can do with it.

To understand the beta distribution we first have to talk about the binomial distribution. I used a binomial random sampler in my earlier Monte Carlo simulation of genetic drift, but I didn't give much of an explanation. The binomial distribution is really useful for binary outcomes, like flipping a coin and getting heads or tails. For example, if we flip a fair coin (50% chance of heads and 50% chance of tails on any flip), the binomial distribution gives us the probability of getting exactly 3 heads from 10 flips. The beta distribution is sort of the mirror image of the binomial; it lets you go in the opposite direction. Imagine we find 10 flipped coins lying on the ground, 3 showing heads and 7 showing tails, and we want to know if they came from a fair coin factory. Given the observation of 3 heads and 7 tails, the beta distribution tells us how plausible it is that the probability of heads is 50% for our coins (whew!). So the beta distribution lets us model the cloud of uncertainty surrounding a probability, given some discrete binary observations.
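
To make the coin example concrete, here's a quick R sketch using the built-in dbinom and dbeta functions (the beta shape parameters heads+1 and tails+1 assume a uniform prior; that choice is mine for illustration, not something we need later in the post):

#binomial: probability of exactly 3 heads in 10 flips of a fair coin
dbinom(3, size=10, prob=0.5) #about 0.117

#beta: plausibility of each possible heads-probability after seeing 3 heads and 7 tails
#shape1 = heads + 1, shape2 = tails + 1 (uniform prior assumed)
curve(dbeta(x, shape1=3+1, shape2=7+1), from=0, to=1, xlab='probability of heads', ylab='density')
dbeta(0.5, shape1=4, shape2=8) #density at 50%; well below the peak, which sits at 0.3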

Random mating produces discrete binary observations (an allele either does or does not get passed on to the next generation), and allele frequencies are a lot like probabilities, so we can use the beta distribution to model allele frequencies given a characterization of the random matings we expect in a population of 100 individuals (fish in this case). To do this we have to solve for the two "shape" parameters that characterize our beta distribution. These two parameters are called alpha and beta (I'll use a and b here). If you know the expected mean (u) and variance (v) of the final allele frequencies, you can use them to figure out what a and b should be. We discussed the expected mean earlier: it won't change, and since we are starting with an allele at a frequency of 0.5, we expect the final frequency to also be 0.5, so u = 0.5. The expected variance is a bit trickier. There is a handy equation in Falconer and Mackay (equation 3.2), written as:

v = p0q0[1 - (1 - 1/(2N))^t]

(read Falconer and Mackay Chapter 3 for more detail on this equation).

Here v is the expected variance we are looking for, and t is the number of generations it takes to go from the initial population to the final population while being maintained at N diploid individuals. The starting allele frequencies are p0 and q0 (both 0.5 in our case). So in our fishy model we can supply all of the missing values and calculate v.

p0 = 0.5
q0 = 0.5
2N = 2 × 100 = 200
t = 4 (remember getting to the 5th generation requires only 4 rounds of mating)

So chugging through this I get v = 0.00496. That's the expected variance for our allele frequency in the final generation.
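
Written out, the arithmetic is: v = 0.5 × 0.5 × (1 - (1 - 1/200)^4) = 0.25 × (1 - 0.995^4) ≈ 0.25 × 0.0199 ≈ 0.00496.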

Great, now we have the expected mean (u) and variance (v) at the final generation.  We use these values to calculate a and b for our beta distribution using the equations below (see: http://en.wikipedia.org/wiki/Beta_distribution#Parameter_estimation).

a = u(u(1 - u)/v - 1)
b = (1 - u)(u(1 - u)/v - 1)
Plugging and chugging again I get a = 0.5 × (0.25/0.0049626 - 1) ≈ 24.688, and by symmetry b = 24.688 as well. That's it! We can specify a beta distribution to get our estimate of allele frequency drift! Below is the R code repeating these steps and estimating the 95% confidence intervals:

n = 100
p = 0.5
q = 1-p

u = p
t = 4
v = p*q*(1-(1-(1/(2*n)))^t) #expected variance, Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #shape parameter a, see eqns above
b = (1-u)*((u*(1-u))/v-1) #shape parameter b, ditto

#estimated lower and upper conf. intervals
qbeta(0.025, shape1=a, shape2=b) #lower

 [1] 0.3625435
qbeta(1-0.025, shape1=a, shape2=b) #upper
 [1] 0.6374565


The R code above tells us that the estimated 95% confidence interval for our beta distribution estimate of allele frequency drift runs from about 36.3% to 63.7%.  If you go back to the Monte Carlo approach you'll see that the two estimates are really close!  Hooray!

Now for fun let's compare this to the Monte Carlo estimate. More R code:
#rerun our monte carlo (from before) and fit our beta est. to it
x = 0.5
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=100000, nc=4) #matrix to store results
for (i in 1:100000) {
  xf = x  #initial freq always 0.5
  for (j in 1:4){ #5 generations require 4 samples
    xf = rbinom(1, size=sz, prob=xf) / sz #random binomial sample, normalized back to a frequency
    k[i,j] = xf   #update matrix
  }
}
#plot monte carlo result as histogram

hist(k[,4], br=100, freq=F, main="Monte Carlo (bars) vs. Beta (red line)", xlab='allele freq')
#add beta as a red line

lines(seq(0,1,0.01), dbeta(seq(0,1, 0.01), shape1=a, shape2=b ), ty='l', col=2)


Wow, the red line (beta) is pretty close to the Monte Carlo result!

Now some caveats.  The process we are modeling is discrete, but the beta distribution is continuous.  This means the beta distribution will be a good estimator of drift when the population size is large (with so many discrete possibilities the distribution becomes virtually continuous), and it will get increasingly worse as populations get smaller. Another thing to watch out for: the beta distribution will start to perform poorly if you want to measure drift for a really rare or really common allele, especially in a small population. You'll know when this is going to happen if you notice that the a and b parameters are really different from one another (one big, one small), and the smaller one is a pretty small number. Under these kinds of scenarios the beta distribution does a poor job of estimating the loss of the rare allele (i.e. fixation of the other allele due to drift). Below is an example of this problem; I tweaked the R code above to model an initial allele starting at a 4% frequency rather than 50%.
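
For reference, here's the tweak (a minimal sketch; it reuses n and t from the code above and only changes the starting frequency):

p = 0.04 #rare allele instead of 0.5
q = 1-p
u = p
v = p*q*(1-(1-(1/(2*n)))^t) #same Falconer and Mackay eqn 3.2
a = u*((u*(1-u))/v-1) #a comes out small because u is near 0
b = (1-u)*((u*(1-u))/v-1)
#rerun the Monte Carlo above with x = 0.04 and overlay dbeta as before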

See how the beta estimation misses the Monte Carlo distribution a little on the left side.  In this case a = 1.98 and b = 47.4.  This is the sort of thing to watch out for if you use this approach to estimate genetic drift.

Modeling Genetic Drift - Monte Carlo Simulation

EDIT: There are two follow-up posts to this one, here and here.

Here’s an interesting question in population genetics that my wonderful wife brought up recently.  She was working with a fish population.  The population started when 100 wild-caught fish were brought into the lab.  The fish were allowed to randomly mate.  After mating, 100 offspring were randomly sampled to produce the next generation.  The fish are now in their fifth generation of this procedure (i.e. 4 rounds of 100 offspring randomly selected).  
High Plains (Genetic) Drifter

She was thinking about genotyping the fifth generation of this population to estimate allele frequencies in the wild population from which they were sampled.  We talked this over and both started wondering how much change to expect in their allele frequencies when compared to the initial frequencies of the 100 wild-caught fish that founded the population.  Both selection and genetic drift can change allele frequencies, but for the moment we’re going to just focus on genetic drift. 

We’re not ignoring selection because we don’t think it occurred - it probably has - but we don’t know how much, so selection is tough to deal with.  Genetic drift on the other hand can be estimated (as we’ll do below), and moreover we know it will happen in a small population like the one we are dealing with.  So by estimating drift alone we are getting a best-case-scenario estimate of the amount of allele frequency change we can expect between the founders and their fifth generation offspring.  If we aren’t happy with the amount of change under the best-case-scenario we can rule out genotyping the fish as a viable option.

So the question is this: how do we model random genetic drift in a population that has been maintained at 100 diploid individuals through 5 generations? It turns out there are a bunch of approaches. In subsequent posts I'm going to demonstrate several.

Also, at this point I haven’t specified the initial allele frequency of the allele we are going to model. For our purposes we’ll work with an allele at 50% frequency in the initial founding population, but each approach outlined in this series can be generalized to deal with (almost) any initial allele frequency.

Approach 1: Monte Carlo simulation of genetic drift

This is my favorite approach.  I'll get to some other approaches in future posts, but I believe this is the easiest way to get an estimate of drift in this situation.  Below is the R code to simulate genetic drift in our fish population.  What we are doing here is just creating 1000 random walks through 5 generations.  The sample size (200 haploid genomes) determines the amount of allele frequency variation (drift) from one generation to the next.  The random selection is done by drawing from a binomial distribution of size 200.  This random draw always starts with a probability of sampling our allele of 50% in the first generation.  In each subsequent generation the probability is updated to reflect the realized frequency of the random sample drawn in the previous generation.  By the time you get to the fifth generation, the final allele frequency has usually drifted a bit away from 50% (in some cases it drifts away for a time but by the end drifts its way back!!).

x = 0.5 #initial allele freq
sz = 100*2  #100 diploids -> 200 genomes
k = matrix(-999, nr=1000, nc=4) #matrix to store results

for (i in 1:1000) {
   xf = x  #initial freq always 0.5, going fwd xf will store our updated allele freq
   for (j in 1:4){ #5 generations require 4 samples
     xf = rbinom(1, size=sz, prob=xf) / sz #random sample of 200 genomes, normalized back to a frequency
     k[i,j] = xf   #update matrix
   }
}

k = cbind(rep(.5, 1000), k) #add first gen. (always 0.5)

##that's it for sampling, code below is plotting

#Now plot all 1000 random walks
plot(1,1, xlim=c(1,5), ylim=c(0,1), ty='n', xlab='generation', ylab='allele freq')
for (i in 1:1000){
  lines(1:5, k[i,], lwd=.1)  #draw 1000 random walks as lines
  points(5.1, k[i,5], pch='-', col=adjustcolor(2, alpha.f=.1))  #add red rug on right
}

1000 Random Walks

MC_final = sort(k[,5]) #Monte Carlo result in final generation
hist(MC_final) #histogram
Histogram of 1000 Final Allele Frequencies
  
mean(MC_final) #(should be very close to 0.5)
[1] 0.49883

#lower and upper 95% Confidence intervals
c(MC_final[25], MC_final[975])
[1] 0.355 0.640

In the code I added a few bells and whistles to draw plots and also compute the approximate expected mean and 95% confidence interval.  The results above tell us that the expected mean allele frequency is almost exactly 0.5 (i.e. the starting frequency).  This makes sense: we didn't apply selection, so we don't expect any net directional change in allele frequency.  Also, 95% of the time we expect an allele starting at 50% frequency to fall between about 36% and 64% in the fifth generation, given our population size.


Since we are using a simulation these results are all approximate, but that's ok.  We could run a larger number of simulations to improve accuracy.  More importantly, you have to put things in the context of the question, and I think these approximations are good enough to give a working sense of how much drift we should expect in our fifth generation fish.
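
If you want tighter estimates, a minimal sketch (reusing x and sz from the code above) is to run more random walks and read the interval off with R's quantile() function:

n_sims = 100000 #more random walks for a smoother estimate
k = matrix(-999, nr=n_sims, nc=4)
for (i in 1:n_sims) {
   xf = x
   for (j in 1:4){
     xf = rbinom(1, size=sz, prob=xf) / sz
     k[i,j] = xf
   }
}
quantile(k[,4], c(0.025, 0.975)) #approximate 95% confidence interval in the final generation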

UPDATE: I found another blogger who describes the same Monte Carlo approach but codes it a little differently in R.

Sunday, November 24, 2013

The Monty Hall Problem

The Monty Hall Problem is an interesting problem in probability theory.  It comes from the old TV show Let's Make a Deal, hosted by Monty Hall.  You can read more about the background here on Wikipedia.  The basic premise is that there are 3 doors; behind one is a prize (like a new car), and behind the other two is nothing.  As the contestant you get to pick a door and you win whatever is behind it.  The wrinkle is this: first you pick one of the three doors, then Monty picks and opens one of the 2 remaining doors, revealing a loser.  Now he asks you if you want to stay with your first pick, or switch to the other closed door.  What should you do?
Monty Hall
(I dig the doo!)

My first thought was: well, when you picked the door at the beginning, it had a 1 in 3 chance of being right. Now Monty has shown you one door that was a loser, so you have a 1 in 2 chance of being right.  From this you can reason that switching to the remaining door also gives a 1 in 2 chance of being right, so it's arbitrary: you have a 50% chance of winning either way.  Boy, this logic seems intuitive, but it turns out it is wrong!!


In fact, you should always switch from the door you first chose to the door that remains after Monty shows you a loser.  Switching doubles your chance of winning!  (You have a 1/3 chance of winning if you stay with your original door, and a 2/3 chance if you switch.)  The trick is basically doing some careful bookkeeping and tracking conditional probabilities.  When Monty opens a door showing you a loser he is giving you valuable information.  You can read all about conditional probabilities at the Wikipedia link given above.
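
Here's the bookkeeping in miniature: your first pick is the car with probability 1/3 and a goat with probability 2/3. If you first picked the car, switching loses. If you first picked a goat, Monty is forced to reveal the only other goat, so switching wins. Staying therefore wins 1/3 of the time, and switching wins 2/3 of the time.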


For me, doing is believing, so I have to play the Monty Hall game for this to really sink in.  One way would be to get a willing friend to help me play many rounds of the game, some where I switch and some where I don't, and track how often I win with each strategy.  Sounds tedious.  Another route is to have a computer play the game for me.  Below is how I played 10,000 rounds of the Monty Hall game in Python.


from random import *
from collections import defaultdict

def count_all(xlist, proportions=False):
    '''Count all the items in a list, return a dict
       with the item as key and counts as value'''
    out =  defaultdict(int)
    for i in xlist: out[i]+=1
    if proportions:
        out2 = {}
        tot_sz = float(sum(out.values()))
        for i in out: out2[i] = out[i] / tot_sz
        return out2
    else: return out

switch = []
no_switch = []

m = [0,0,1]  #a representation of Monty's doors, two losers (i.e. zeros) and one winner (i.e. one)

for i in xrange(10000): #ten thousand games
    shuffle(m) #randomly shuffle which door is the winner
    p = choice([0,1,2]) #contestant randomly picks a door
    x = [idx for idx, d in enumerate(m) if not d]  #find the two loser doors (avoid reusing i from the outer loop)
    if p in x: show = set(x) - set([p]) #if our pick (p) is a loser, Monty must show the other loser
    else: show = set([choice(x)]) #if our pick is the winner, randomly choose one of the two losers to show
    show = show.pop() #the door we will show
    all_s = set(range(3)) #setting up a set of all doors
    f = set([p, show]) # setting up a set with the orig pick and the loser we will show
    new_p = all_s.difference(f) #new_p is the set diff, meaning it's the door we'd switch to after Monty shows the loser
    new_p = new_p.pop()
    if m[p]: no_switch.append(1) #if the no switch strategy succeeded add 1 to list
    else: no_switch.append(0) # if no switch was a loser add 0 to list
    if m[new_p]: switch.append(1) #ditto above, but for switch strategy
    else: switch.append(0)

print 'switch strategy', count_all(switch).items()  #print the counts of the results
print 'no switch strategy', count_all(no_switch).items()


I run this code with Python and here is what I get:
switch strategy [(0, 3362), (1, 6638)]
no switch strategy [(0, 6638), (1, 3362)]
(here zero represents a loss and one is a win, both followed by the count among 10,000 games)

So out of 10,000 randomly generated Monty Hall trials switching won 6,638 (pretty close to the expected 2/3), and not switching only won 3,362 (about the expected 1/3).  It works!  My initial intuition that they should be about 50/50 is proved wrong.

As I mentioned above, you can get to this result without the simulations by using probability theory, but for me the result really becomes concrete and intuitive only after I code it up and see it for myself.

[Another thing worth pointing out is how simple it is to express this game in Python using sets, though I might have gotten a little carried away with them!]