Saturday, January 18, 2014

My paper on Indel Variation in Mimulus is published: Omissions and an Admission

Mimulus guttatus
Edit: Below I point out two relevant papers I wish I had discovered and cited in my recent publication.  Here is a third paper that is also highly relevant to our manuscript, and also escaped my notice until it was too late!

My latest paper just came out in Genome Biology and Evolution (here).  In it we describe insertion/deletion (indel) variation in a natural population of the wildflower Mimulus guttatus.  We sequenced 10 inbred lines derived from this population and found millions of SNPs and thousands of indels.  One of the more striking findings was the large number of indels associated with a certain class of plant disease resistance genes called NBS-LRRs.  We noted in the paper that this same observation had been made in soybeans and Arabidopsis and cited the appropriate papers.  Once our paper was in press I found 2 additional papers, one in barley and the other in musk melon, that had also found lots of indels associated with NBS-LRRs.  Unfortunately it was too late to cite these 2 papers.  My apologies to the authors for the omission.

Next, an admission.  In the paper I wanted to compare the spectrum of allele frequencies in the population (often called the site frequency spectrum or SFS) for indel mutations vs. various forms of SNP mutation (see figure 2 and table 2 in the paper).  I also wanted to compare these to the expected frequency distribution under the standard neutral model, as this serves as a kind of null expectation in population genetics.  Here's my admission: until very recently I did not know how to compute the SFS under the standard neutral model.  So in my paper I used coalescent simulations to estimate it.  From this estimate I found that the expected mean frequency of the rare (minor) allele under neutrality was about 0.222 (which means that the mean frequency of the common (major) allele is about 0.778).  Recently I found a paper that describes the exact calculation very clearly.  The relevant equation is #6.  So let's recompute the expected mean minor allele frequency using the exact calculation.  I'll do it in R below.
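For reference, here is the calculation that the R code below carries out, written as formulas.  This is only a sketch in my own notation: the symbols n (number of sampled lines, here 10), ξ (unfolded SFS) and η (folded SFS) are assumptions for illustration and may not match the notation used in the paper.

$$
\xi_i = \frac{1/i}{\sum_{j=1}^{n-1} 1/j}, \qquad i = 1, \dots, n-1
$$

$$
\eta_k = \xi_k + \xi_{n-k} \quad (k < n/2), \qquad \eta_{n/2} = \xi_{n/2}
$$

$$
E[\text{minor allele frequency}] = \sum_{k=1}^{n/2} \frac{k}{n}\,\eta_k = \frac{\sum_{i=1}^{n-1} \min(i,\,n-i)/i}{n \sum_{j=1}^{n-1} 1/j}
$$

For n = 10 the denominator sum is 7129/2520 and the numerator sum is 1627/252, so the expected mean works out to 1627/7129, or about 0.228.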

#first make a vector of possible allele counts
#remember we had 10 inbred lines, so at a variable site the derived allele can be present in 1 to 9 of them
ac = 1:9 # gives 1 2 3 4 5 6 7 8 9

#now implement eqn 6 from the paper noted above
unfolded_SFS = (1/ac) / (sum(1/ac))

#now fold the SFS: counts i and 10-i share the same minor allele count, so add them together
#(count 5 has no partner and stays as it is)
folded_SFS = c(unfolded_SFS[1:4] + unfolded_SFS[9:6], unfolded_SFS[5])

#finally figure out the expected mean

exp_mean = sum((1:5/10)*folded_SFS) # gives 0.2282228

So there you have it: the exact expected mean minor allele frequency is about 0.228.  My coalescent simulations were close at 0.222.  Now I know how to get the exact SFS (and it's pretty easy), but like the omissions noted above, it's a little too late to change my paper!
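If you want to repeat the calculation for a different number of lines, the steps above can be wrapped into a couple of small functions.  This is just a sketch: the function names neutral_folded_sfs and expected_mean_maf are made up here for illustration (they are not from the paper or from any package), and it assumes, as above, that each inbred line counts as one sampled chromosome.

```r
# Hypothetical helper: folded SFS expected under the standard neutral model
# for n sampled chromosomes (one per inbred line, as assumed in this post).
neutral_folded_sfs <- function(n) {
  stopifnot(n >= 2)
  ac <- 1:(n - 1)                      # possible derived allele counts
  unfolded <- (1 / ac) / sum(1 / ac)   # same formula as unfolded_SFS above
  minor <- pmin(ac, n - ac)            # minor allele count for each class
  # add together the classes that share a minor allele count
  as.vector(tapply(unfolded, minor, sum))
}

# Hypothetical helper: expected mean minor allele frequency.
expected_mean_maf <- function(n) {
  folded <- neutral_folded_sfs(n)
  sum(seq_along(folded) / n * folded)
}

expected_mean_maf(10)  # about 0.228, the same value as exp_mean above
```

Using pmin() and tapply() to do the folding avoids typing out each pair of classes by hand, so the same code should also work when the number of lines is odd.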