Sunday, November 24, 2013

Free Resources I'd Gladly Have Paid For

In this post I hope to compile a list of free resources I frequently use.  The range of topics mostly includes statistics, population and evolutionary genetics, and programming.  As the title says, to me these resources have been so useful that I'd have paid for them, but since they are free I guess I have received infinite value!

Stats

John McDonald's Handbook of Biological Statistics (free PDF)

-- I am a huge fan of this book.  If John McDonald has done one thing here, it's making sure that the easy stuff is EASY!  He explains the basics like Chi-Square, correlation, t-test, and regression with such clarity.  This is my go-to resource for making sure I'm doing the simple stuff right.  The other thing that is brilliant about this book is that for each statistical test he has a section about when to use it, a section about what the null hypothesis is, an intuitive section about how the test works, and some great (biologically relevant) examples using the test.  My only tiny gripe (if one can gripe about anything this good and free!) is that his code examples typically use SAS, and I'm now thoroughly entrenched in R.  That said, his explanations are so clear I have little trouble translating the concepts into R.

Christian Walck's Handbook on Statistical Distributions for Experimentalists (free PDF)

-- This is a pretty straightforward guide to a whole little bestiary of statistical distributions. For this type of information I refer to this handbook and Wikipedia about equally.  I find that I need examples to understand how distributions work.  For some distributions I more easily grasp Wikipedia's examples, and for others I like the examples in this handbook.

Course Notes from U. Wisc. Statistics 571 by Brets Hanlon and Larget (website)

-- There are some great examples in the course slides from the two Brets.  All the examples I've seen use R.  Hopefully this site will stay up if they stop teaching the course.  Here's a great example of how to do power analyses in R.  I love how they marry the intuitive images with the math and the R code so you can translate between all three ways of thinking.
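Their examples are in R, but the simulation flavor of a power analysis translates to any language.  Here's a minimal sketch of the same idea in Python (the language I use elsewhere on this blog): estimate power as the fraction of simulated experiments in which a two-sample t-test rejects the null.  All the settings below (effect size, sample size, alpha, number of simulations) are made-up numbers, just for illustration.

import numpy as np
from scipy import stats

def sim_power(n, effect_size, alpha=0.05, n_sims=2000):
    '''Estimate power by simulation: the fraction of simulated
       experiments in which a two-sample t-test rejects the null.'''
    rejections = 0
    for _ in xrange(n_sims):
        a = np.random.normal(0.0, 1.0, n)          #control group
        b = np.random.normal(effect_size, 1.0, n)  #treated group, shifted up by effect_size SDs
        t, p = stats.ttest_ind(a, b)
        if p < alpha: rejections += 1
    return rejections / float(n_sims)

print 'estimated power:', sim_power(n=20, effect_size=0.8)

With 20 samples per group and an effect of 0.8 standard deviations, this lands somewhere near 0.7, so if I wanted, say, 80% power I'd know to collect more samples.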

Allen Downey's Think Bayes and Think Stats (Bayes PDF, Stats PDF)

-- Both of these are useful, especially if you program in Python, as all the examples are given in Python code.  I also found that both are right at the boundary of where I can readily follow the math.  Downey teaches at a really good engineering school, so I think his typical student is probably very good at math!

Think Stats is OK, especially regarding intuitive descriptions of various distributions and their real world applications.  I'm more in favor of John McDonald's book (above) when it comes to understanding statistical tests and knowing which test to choose.  Regarding Think Bayes, it's a nice introduction to Bayesian thinking.  He avoids dragging the reader through the "why frequentist stats are wrong and need to be replaced by Bayesian stats" zealotry that many other Bayesian texts start with, and I appreciate that.  In the future I intend to blog more about my dabbles into Bayesian inference.  I'm currently working my way through John Kruschke's book, which is not free, but seems to me well worth the money (and does indulge in a little zealotry).  

UPDATE: Added the tutorial below on PCA by Lindsay Smith
Lindsay Smith's A Tutorial on Principal Components Analysis (free PDF)

-- Just like the title says, this is a nice gentle explanation of the inner workings of PCA. Ever wonder what's happening under the hood when you run a PCA?  Read this and you'll have a working understanding of what's going on.
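If you want to tinker with the nuts and bolts yourself, here's a bare-bones numpy sketch of the same basic recipe the tutorial walks through: center the data, compute the covariance matrix, and take its eigenvectors.  The toy data here is invented.

import numpy as np

np.random.seed(1)
X = np.random.randn(100, 3)    #toy data: 100 samples, 3 variables
X[:, 1] += 2 * X[:, 0]         #make two of the variables correlated

Xc = X - X.mean(axis=0)              #1. center each variable
C = np.cov(Xc, rowvar=0)             #2. covariance matrix among the variables
evals, evecs = np.linalg.eigh(C)     #3. eigenvalues/eigenvectors (eigh, since C is symmetric)
order = np.argsort(evals)[::-1]      #4. sort components, biggest variance first
evals, evecs = evals[order], evecs[:, order]
scores = np.dot(Xc, evecs)           #5. the data re-expressed on the new axes

print 'proportion of variance per component:', evals / evals.sum()

That's really all a basic PCA is: a rotation of the centered data onto new axes, ordered by how much variance they soak up.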

And one I don't use

-- A couple different folks with heavy-duty computational backgrounds have pointed me to David MacKay's Information Theory, Inference, and Learning Algorithms (free PDF). I've only spent a little time with it.  It's not my cup of tea, but I know people who think it's the bee's knees, so I'll list it here.  I think the problem is that I have no background and no current need for machine learning (or maybe I do have a need but know so little about the topic that I don't realize it).

Population Genetics - Four Awesome Free Resources for the Price of Nothing!

Kent Holsinger's Lecture Notes in Population Genetics (free PDF)

-- I met Kent once a few years ago when I interviewed for a position at UConn.  He was a super nice guy, and among the UConn grad students his population genetics course is something of a legend.  These are the course notes from that course.  Like John McDonald's stats handbook, Holsinger does a really nice job of explaining what you are trying to do and when you'd want to do it, and then walks you through some great examples.

Now with 3X Pop. Gen. Power per Page
Graham Coop's Notes on Population Genetics (free PDF) 
UPDATE: Per Graham's suggestion, I've linked to GitHub, where he keeps an up-to-date copy of his notes. You'll want the file called popgen_notes.pdf.

-- This is a super concentrated form of population genetics knowledge.  It's like that laundry detergent where you add a thimble full for a whole load of wash.  It has no Intro, no Table of Contents, and no Appendices or Indices, but you get 169 equations in 55 pages and just enough prose to connect all the equations!  I reference this often, especially when I think I know what I'm doing but want to make sure I'm right without reading much fluff in the process.

Joe Felsenstein's Theoretical Population Genetics (free PDF)

-- This baby will probably retail for big bucks if/when Felsenstein ever decides to publish it (so download your free copy today!).  It's a tome.  It has pretty much everything (though according to Felsenstein's website it is unfinished).  I read this when I want to make sure I really get something. Sometimes it helps, and sometimes it makes me realize that what I get is the tip of the iceberg!

Magnus Nordborg's Chapter on Coalescent Theory (free PDF)

-- Several years ago I was really confused about coalescent theory.  I think the problem was that I was reading a bunch of really sophisticated uses of it, and lacked the necessary background information.  Then I found this book chapter and realized that 1) probably the most daunting thing about coalescent theory is its fancy name, and 2) it's pretty intuitive to anyone who has done a fair bit of "tree thinking".

UPDATE: Population Genetics for Non-Model Taxa (free videos and content)

-- The American Genetics Association is hosting videos and other materials for a course they offered in the summer of 2013 on Population Genetics for Non-Model Taxa.  I really like the videos, especially the five videos by Alex Buerkle (about halfway down the page at the link above).  He does a great job of explaining Bayesian statistics and demonstrating how they can be useful for estimating allele frequencies and Fst from genomic data sets. There are also several other helpful videos detailing things like RAD and GBS methods and transcriptomics.  As the title of the course suggests, these methods are great for researchers working on non-model taxa (i.e. species with few existing genomic resources).
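To give a tiny taste of the Bayesian flavor in those videos, here's a toy sketch of estimating an allele frequency from a sample of allele counts.  This is just the textbook conjugate Beta-binomial update, not Buerkle's actual pipeline, and the counts are invented.

from scipy import stats

k, n = 14, 40      #made-up data: 14 copies of allele A among 40 sampled alleles
a, b = 1.0, 1.0    #Beta(1,1) prior, i.e. a flat prior on the allele frequency

post = stats.beta(a + k, b + n - k)  #posterior is Beta(a+k, b+n-k) by conjugacy
print 'posterior mean allele frequency:', post.mean()
print '95% credible interval:', post.interval(0.95)

Instead of a single point estimate you get a whole posterior distribution for the allele frequency, which is the sort of machinery the videos build up to for messier genomic data.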

Programming

Allen Downey's Think Python (free PDF)

-- I didn't use this book to learn Python. Instead I used Dive Into Python (also free).  I had come from programming in Perl and Java before finding Python, so the Dive In approach worked for me.  But Dive In pretty much assumes you have some programming background.  Rather than start with the classic "Hello World", the very first program is this one:
def buildConnectionString(params):
    """Build a connection string from a dictionary of parameters.
    Returns string."""
    return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
if __name__ == "__main__":
    myParams = {"server":"mpilgrim", \
                "database":"master", \
                "uid":"sa", \
                "pwd":"secret" \
                }
    print buildConnectionString(myParams)

Looking at that now, after using Python nearly every day for about 7 years, I still can't immediately tell you what it's doing!  The book should be called Dive Into the Deep End of Python.

Think Python, on the other hand, takes you through a much gentler route.  This is the book I now recommend to others, especially those who are new to programming.


R Reference Card (free PDF)
UPDATE: see the comment below by Mary M, which includes a link to a nice set of introductory R videos.


-- I've never found a good free tutorial for R.  Maybe somebody out there knows of one?  I used this book by Peter Dalgaard and a great course by Dan Nettleton to learn R.  I would generally recommend both, but you don't have to move to Iowa to get Dalgaard!  


-- I do occasionally refer to the R Reference Card (above), although I think I lost my printed copy in my most recent move.  Once you get a toe-hold with R, you can use its handy little built-in search to figure out stuff pretty easily. For example, ??binomial shows all the help materials that contain the term "binomial".  Based on the descriptions, I think I want to see this one: Binomial.  So then to get the documentation just type help(Binomial), and it gives a pretty good working description of what a Binomial is and some examples of how to use it in R.

Lex's Evolutionary Biology Jargon Page

Like every scientific field, evolutionary biology has its own special thicket of jargon.  While reading papers I am occasionally stymied by this jargon.  It seemed the only way to improve the situation was to either commit the terms to memory or to have a handy look-up guide.  Given the state of my memory, I’ve gone with the latter.  Below is a list of some common jargon in evolutionary biology.  This list is a reflection of the sub-fields that interest me, and as such, it is incomplete.  It is also a work in progress; I'll keep updating it as I get time.  I’ve also left out terms that I consider at least partially scrutable, for example “gene flow”.  Instead I’ve focused on terminology that I consider inaccessible, using myself as the standard by which this is judged.  (Well, not quite just myself: if you have any jargon suggestions, send them my way and I'll add them to the list at the bottom of this entry.  Likewise, if I messed something up, please point it out to me.)

Spandrels of evolution:
Artsy Spandrels
This term refers to an analogy put forward by Gould & Lewontin in their now-classic 1979 Proc. R. Soc. Lond. B paper entitled The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme.  Despite the opaque title, people read it.  The message of the paper is conveyed by analogy to an architectural feature known as a spandrel.  In architecture, a spandrel is a fitting that allows a curved shape, such as a dome or arch, to be connected to a rectangular structure.  Imagine setting a round dome on top of a square building: there are voids that need to be filled in to connect them, or else things are going to be a little drafty.  Spandrels are the things that fill in these voids.  Over the millennia spandrels became highly ornate, often hosting paintings, statues, or carvings.  This, however, was secondary to their initial purpose; first and foremost, spandrels are a necessary work-around to a geometric constraint.  Much in the way spandrels did not originate as sites for art, Gould & Lewontin imagined that many adaptive traits started as a work-around to some earlier constraint.  Sometime later the work-around became available for some new utility, in the same way that spandrels were eventually seen as convenient places to install art.  The “spandrel” message is two-fold.  First, some adaptations may have originated as the by-product of an earlier constraint, and second, because of this some adaptive traits may have an origin that is completely unrelated to their current utility.  In the spandrels paper, Gould & Lewontin use these points as a pretty stern critique of the “adaptationist programme”, that is, researchers who assert that a trait arose more-or-less de novo as a solution to the problem it currently solves.

A good example of a spandrel is the “horns” found on some dung beetles.  Armin Moczek and colleagues have shown that the horns of adult dung beetles are developmentally linked to an outgrowth that occurs on the head of the pupal dung beetles.  In the pupae of both horned and hornless dung beetles this outgrowth helps crack through the hard pupal head-case, allowing the adult beetle to emerge.  In hornless beetles this outgrowth is later reabsorbed; in horned species, however, it continues to grow.  At maturity these horns become weapons used in the fight for mates.  Thus we could say that the horns of adult beetles are spandrels born out of the beetle’s preexisting need to emerge from its pupal case. (read more about this cool story)


Dobzhansky-Muller Model, or Bateson-Dobzhansky-Muller Model, or DM-, or BDM-hybrid incompatibility

Some species, even some very closely related species, cannot form viable offspring when crossed.  One way this can happen is through harmful interactions between the parental genes found in the offspring.  More technically, we would call this a negative epistatic interaction between the parental genomes.  That is, Mom’s gene A and Dad’s gene B don’t play well together, and the poor offspring is unfit or sterile.  This, in a nutshell, is the BDM-model of hybrid incompatibility.

Interest in this model stems from two facts: it is commonly observed in nature, and it describes a plausible genetic route to speciation.  If we were going to “design” a speciation event, one of the first things we would want to do is cut off gene flow between the two nascent species.  One way to do this is to make all of the offspring from unwanted matings either sterile or too unhealthy to make it to reproductive age.  That is, we’d want negative epistatic interactions to consistently crop up in the hybrids.  Or, using the jargon, we’d want BDM-hybrid incompatibilities to genetically reinforce our attempt to cause speciation.
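To make this concrete, here's a haploid cartoon of the BDM model in Python; the fitness values are invented.  Each lineage fixes a new allele at a different locus, and either new allele is fine on its own genomic background, but the hybrid combines them and the negative epistatic interaction tanks its fitness.

def fitness(genotype):
    '''Fitness of a haploid two-locus genotype, e.g. ("A'", "B").'''
    if genotype == ("A'", "B'"):
        return 0.1   #the two derived alleles interact badly (negative epistasis)
    return 1.0       #every other combination gets along fine

lineage1 = ("A'", "B")    #lineage 1 fixed a new allele at locus A
lineage2 = ("A", "B'")    #lineage 2 fixed a new allele at locus B
hybrid = ("A'", "B'")     #a hybrid inherits both derived alleles

for g in (lineage1, lineage2, hybrid):
    print g, '-> fitness', fitness(g)

Notice that neither lineage ever passes through a low-fitness state on its own; the incompatibility only shows up when the two genomes meet in a hybrid.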
  
Selective Sweeps, Soft Sweeps, Genetic Hitchhiking, Background Selection and the Hill-Robertson Effect

Wow, that’s a lot of jargon for one entry.  Good news, all of these ideas are linked by fairly simple underlying concepts.  If you can grasp the concepts, then you’ll see that all of this jargon is just special situations or consequences that emerge from these concepts. 

So let's start with the concepts.  First let's assert some fairly general rules that apply to many species.  1) Species have finite population sizes.  2) Species have fewer chromosomes than genes, so many genes must be physically linked on the same chromosome.  3) Different alleles at the same gene can have different fitness outcomes.  And 4) recombination can swap the physical links between genes, creating new linked combinations.

All right, that’s a fairly manageable situation.  Let's now apply those rules to a very simple diploid species with one chromosome that has two linked genes, A and B. Got it?  Ok, let's start with the simple case where initially every individual in our species has the genotype ab for genes A and B, respectively.  Next let's imagine that a mutation occurs on one chromosome in one individual, and it produces a mutant form of a which we’ll call a’.  So now we have two genotypes to worry about, ab and a’b.  We’ll say that this new a’ is no more or less favored by selection than the original a allele, so we could call it a neutral mutation with respect to selection.  These ab and a’b genotypes drift (see the entry on drift for more detail) around in frequency for a while, until suddenly, one individual with the a’b genotype gets a mutation in b, which we’ll call b’. This new mutant is really superior to b, and individuals that get it tend to produce more and healthier offspring.  So now we have three genotypes: ab, a’b, and a’b’.  Initially a’b’ is rare, but, because b’ improves fitness, it is steadily becoming more common.  Eventually, in the absence of recombination between genotypes, we’re likely to end up with only a’b’ chromosomes.  The moment b’ arose it started increasing in frequency, little by little replacing b, and in so doing it dragged the a’ allele with it, replacing neutral diversity (the a allele) in the linked gene A.  A side effect of the strong advantage of b’ is that genetic diversity is being swept away, directly at gene B, but also at the linked gene A.  Using the jargon, we’d say the b’ allele is causing a selective sweep, or more specifically some might call this a hard selective sweep or hard sweep because it is being caused by the brand new b’ mutant.  So what about the lucky a’ allele?  We’d call it a genetic hitchhiker.  Remember, it was no better or worse than a, but it just happened to hook up with b’ and get swept along.
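If, like me, you believe things once you've simulated them, here's a quick-and-dirty Wright-Fisher sketch of that story: a haploid population of chromosomes, a neutral a' drifting around on some of them, then a favored b' arising on a single a'b chromosome.  All the parameter values below are made up for illustration.

from random import random, choice

N, s = 1000, 0.05                    #population size; selective advantage of b'
pop = ["a'b"] * 200 + ['ab'] * 800   #the neutral a' has drifted to 20% when...
pop[0] = "a'b'"                      #...b' mutates onto a single a'b chromosome

w = {'ab': 1.0, "a'b": 1.0, "a'b'": 1.0 + s}  #genotype fitnesses
gen = 0
while 0 < pop.count("a'b'") < N:     #run until b' is lost or fixed
    new_pop = []
    while len(new_pop) < N:          #each offspring copies one parent,
        parent = choice(pop)         #accepted in proportion to its fitness
        if random() < w[parent] / (1.0 + s):
            new_pop.append(parent)
    pop = new_pop
    gen += 1

print 'generations elapsed:', gen
print "final freq of b':", pop.count("a'b'") / float(N)
print "final freq of the hitchhiking a':", sum(1 for g in pop if g.startswith("a'")) / float(N)

Run it a few times: a brand-new beneficial mutant usually gets lost to drift (a nice lesson in itself), but when b' does take off, the hitchhiking a' rides along to fixation and the a allele gets swept out of the linked gene entirely.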

Ok, now let's replay the scenario, but this time, when b’ arises it too is neutral.  So we initially have ab, a’b, and a’b’.  They hang around for a while, and eventually in an individual recombination happens between ab and a’b’ that creates the fourth possible genotype, ab’.  Now our 4 genotypes drift along again for some time, and then suddenly the environment changes dramatically.  Let’s say a meteor hits the earth.  As a consequence of this sudden change, b’, which had been neutral, suddenly becomes strongly favored.  This starts off another sweep, only this time b’ has managed to hook up with both a and a’, so after the sweep we are left with ab’ and a’b’ genotypes (note both alleles still exist for gene A, but not for gene B).  This is a soft selective sweep.  It differs from the hard selective sweep in that the mutant b’ allele wasn’t preferred right away, so it managed to associate with the neutral genetic diversity found at gene A, creating not a single hitchhiker but multiple hitchhikers, in effect preserving diversity in gene A.  Basically the difference between a hard and soft sweep is the timing of when the selection kicks in and how varied and diverse the hitchhikers are (real hitchhikers are always varied and diverse).

So that’s sweeps and hitchhiking.  They involve beneficial mutations, but what about detrimental mutations?  It’s the same idea, just backward.  Let’s just change the sign of the selection coefficient on b’ and say that it is deleterious.  This will tend to cause selection to remove b’ from the population, and in so doing take any linked alleles with it unless they can escape by recombining onto a b allele background.  This type of erosion of genetic diversity is called background selection; it’s sort of the opposite of a selective sweep.  There’s a key distinction though.  In a selective sweep an allele goes from a low frequency to a high one, so it’s going to have a massive increase in its frequency, and as a consequence it can knock out a lot of diversity.  On the other hand, under background selection, we usually imagine that the deleterious allele never makes it to a very high frequency, so its trip is from a low frequency to a frequency of zero (it goes extinct).  This is a much shorter journey, so this little blip of a bad allele doesn’t tend to knock out much diversity, because it was never linked to much in the first place.  The only places where background selection can really make a dent are areas with really low recombination and fairly high rates of deleterious mutation.  Here you can imagine waves and waves of background selection constantly removing variation, leaving these regions with fairly low genetic diversity.

So, neutral genetic diversity can be removed from the population by not being linked to the good allele, or by being linked to a bad allele.  In both cases alleles at the gene under selection (gene B) are rapidly changing in frequency, and the only way to save linked diversity is for recombination to create new linkages.  Now imagine that from our ancestral ab genotype, an a’b and an ab’ genotype both arise, and the a’ and b’ alleles are both superior to the ancestral a and b alleles.  The unfortunate thing is that they are unlinked, because what you really want is the super-fit a’b’ genotype to emerge.  Without recombination you have to wait for a lucky mutation to either turn a’b into a’b’ or ab’ into a’b’, which may take a long time, or indeed never happen.  The key insight here is that recombination can speed this up.  With recombination, you can have an exchange between a’b and ab’ to create ab and the sought-after a’b’. This increase in the rate of evolution provided by recombination is called the Hill-Robertson effect, and some believe this may be the reason that recombination evolved in the first place.
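Here's a cartoon of that last point in code; no selection even, just watching how long it takes recombination to assemble the a'b' genotype from a mixture of a'b and ab' chromosomes.  All the parameters are invented, and this deliberately leaves out the interference part of the full Hill-Robertson story.

from random import random, choice

def gens_until_best(r, N=500, trials=20, cap=100000):
    '''Average generations until an a'b' chromosome first appears,
       given a recombination rate r per offspring.'''
    total = 0
    for _ in xrange(trials):
        pop = [("a'", 'b'), ('a', "b'")] * (N // 2)
        gen = 0
        while ("a'", "b'") not in pop and gen < cap:
            new_pop = []
            for _ in xrange(N):
                p1, p2 = choice(pop), choice(pop)
                if random() < r:
                    new_pop.append((p1[0], p2[1]))  #recombinant offspring
                else:
                    new_pop.append(p1)              #clonal copy of one parent
            pop = new_pop
            gen += 1
        total += gen
    return total / float(trials)

print 'free recombination (r=0.5):', gens_until_best(0.5), 'generations'
print 'scarce recombination (r=0.0005):', gens_until_best(0.0005), 'generations'

With free recombination the best genotype shows up almost instantly; choke recombination off and you sit around waiting for a lucky pairing, which is the essence of why recombination can speed adaptation.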

Linkage Disequilibrium
(to my mind this piece of jargon “would be the first against the wall when the revolution comes”, to quote the great Douglas Adams):

Linkage disequilibrium or LD is a complicated-sounding bit of jargon that describes a simple concept.  If you understand correlation, then you get LD.  Sadly, because the term is so awkward, you would never have guessed it was something you were already familiar with.

The concept LD aims to capture is the amount of nonrandom association between alleles at different loci.  It is usually studied within or between populations of a single species, and as such is a fundamental component of population genetics.  In the simplest case, imagine two loci, each with two alleles: we’ll call them A and a at locus #1 and B and b at locus #2.  If we sampled a haplotype from a population with these four alleles, there are four combinations of the alleles we could find, namely AB, Ab, aB, and ab.  If these loci are in strong LD we’d expect to largely observe only 2 of the 4 combinations, perhaps AB and ab, though these symbols are arbitrary, so it could just as well be Ab and aB.  On the other hand, if LD is very low or absent, the latter being termed linkage equilibrium, we might expect to find all four allelic combinations in proportions corresponding to the products of their allele frequencies.  So what does that mean exactly?  Let's play a game.  I’ll give you a million dollars if you can guess the alleles at one gene by only knowing the alleles at another gene.  I get to pick one random individual, and we’ll sequence her genome.  I know all her alleles at all genes.  I’ll give you the alleles for any gene of your choice in the genome, and you have to guess the alleles at any other gene in the genome.  It’s up to you to choose those two genes.  How would you go about choosing the best pair of genes so that you can win the big prize?  The best way would be to choose two genes with very high LD between them, because knowing the alleles at one gene that is in strong LD with another gene basically tells you what that other gene will be without even looking.  Genes with high LD are often physically linked, but there is no requirement that genes in high LD be linked; in fact, there are a few cases of LD between genes residing on different chromosomes.

There are several statistics used to quantify the amount of LD.  Two of the most popular, r2 and D’, yield values on a scale from 0 to 1, with 0 being complete linkage equilibrium, 1 being complete linkage disequilibrium, and intermediate values indicating partial LD.  As the notation suggests, r2 is derived from the statistical R2 (i.e. coefficient of determination), which is the square of the correlation coefficient.  So if you get correlation, then it’s simple to extend that understanding to LD.
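For the concrete-minded, here's a tiny Python sketch computing D, r2, and D' from a table of haplotype counts.  The counts are made up; with these numbers the two loci are in strong (but not complete) LD.

counts = {'AB': 45, 'Ab': 5, 'aB': 5, 'ab': 45}  #made-up haplotype counts
n = float(sum(counts.values()))

p_AB = counts['AB'] / n
p_A = (counts['AB'] + counts['Ab']) / n  #frequency of allele A at locus 1
p_B = (counts['AB'] + counts['aB']) / n  #frequency of allele B at locus 2

D = p_AB - p_A * p_B  #deviation of the AB haplotype freq from random pairing
r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

#D' rescales D by the largest value it could take given the allele frequencies
if D >= 0: D_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
else: D_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
D_prime = abs(D) / D_max

print 'D =', D, " r2 =", r2, " D' =", D_prime

Here D' comes out to 0.8 and r2 to 0.64, so knowing the allele at locus 1 tells you quite a lot (but not everything) about the allele at locus 2.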

Maybe not jargon (or perhaps better titled: some of my pet peeves):

Genetic Map:
This is a characterization of a genome using genetic markers.  Various approaches are used to estimate the linear order of these markers along the chromosomes.  Often the markers are laid out in the order of their estimated genetic linkage, which is related to the fraction of meioses that yield a recombination between the markers.  For this reason the distances on the resulting map are a function of the recombination rate between various points along the chromosome, and may not accurately represent the real physical distances between markers in base pairs.  Also, even the most complex genetic maps may contain only a few thousand markers, whereas most genomes contain millions or billions of nucleotides, so every genetic map is an incomplete characterization of the genome.
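As a small illustration of the map-distance vs. physical-distance point, here's one common mapping function (Haldane's) that converts an observed recombination fraction between two markers into an additive map distance.  The r values below are arbitrary; note how nonlinear the relationship gets as r approaches 0.5.

from math import log

def haldane_cM(r):
    '''Map distance in centiMorgans from recombination fraction r (r < 0.5),
       using Haldane's mapping function d = -0.5 * ln(1 - 2r).'''
    return 100 * -0.5 * log(1 - 2 * r)

for r in (0.01, 0.10, 0.25, 0.40):
    print 'recombination fraction %.2f  ->  %.1f cM' % (r, haldane_cM(r))

So a map distance is really a statement about recombination, not about base pairs, which is exactly why map positions and physical positions can disagree.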

Genome Sequence:
This is a representation of the genome created by identifying the DNA nucleotide sequence at single base pair resolution.  In a genome sequence the distance between loci is their physical distance, in base pairs.  To date, very few genome sequences are truly complete; they are often missing hard-to-assemble regions, like areas with lots of repeats.  That said, even a sparsely completed genome sequence will contain far more loci than a genetic map (which isn’t a knock on genetic maps; they have myriad uses).

Lex’s Map vs. Sequence Rant:
Often in the popular press you’ll find the terminology of genetic mapping and genome sequencing jumbled together.  For example, the headline might read, “Scientists have begun sequencing the genome of species X”.  Then in the article you’ll find a nugget of wisdom like, “scientists believe that one of the best ways to fully unlock the potential of species X is to map its DNA”.  To a geneticist this use of the term “map” is quite confusing, because it conjures the idea of a genetic map.  Instead, I think it’s intended as a crude way of equating a genome sequence with a traditional map, like a road map (remember those things that used to live in your glove box before we got iPhones?).  They do both tell you where things are in relation to one another, but unfortunately this use of the word “map” muddles important concepts, and moreover, the road map analogy is a stretch at best.  A genome sequence would be like a road map that contains the exact order of every molecule of the road!

Stuff for a later date!
genomic conflict or intragenomic conflict
Punc. Eq.
pre-adaptation and exaptation
genetic lesion / point mutation
Genetic accommodation 
Trivers-Willard
long-branch attraction
selfish gene
QTL
Wright's shifting balance theory
Hopeful monsters
genomic shock
neutral theory
inbreeding depression/heterosis/hybrid vigor
Mendelian trait
Quantitative trait
Hardy-Weinberg
Lamarckism
Lysenkoism
Dollo's law
Horizontal transfer
drift
balancing, purifying (negative), directional, disruptive, artificial, and stabilizing selection
mutation-selection balance
molecular clock
transition/transversion
handicap principle
effective population size
G X E
Batesian Mimicry
Mullerian Mimicry
Red Queen Hypothesis
Haldane's Rule
Bergmann's Rule
Baker's law
Fisher-Wright Model  
Lek Paradox
Haldane's Dilemma
Haldane’s Sieve
2-fold cost of sex
Price Equation

The Monty Hall Problem

The Monty Hall Problem is an interesting problem in probability theory.  The problem comes from the old TV show Let's Make a Deal, hosted by Monty Hall.  You can read more about the background here on Wikipedia.  The basic premise is that there are 3 doors; behind one is a prize (like a new car), and behind the other two is nothing.  As the contestant you get to pick a door, and you win whatever is behind it.  The wrinkle is this: first you pick one of the three doors, then Monty picks and opens one of the 2 remaining doors, revealing a losing door.  Now he asks you if you want to stay with your first pick, or switch to the other closed door.  What should you do?
Monty Hall
(I dig the 'do!)

My first thought was: well, when you picked the door at the beginning, it had a 1 in 3 chance of being right. Now Monty has shown you one door that was a loser, so you have a 1 in 2 chance of being right.  From this you can reason that switching to the remaining door also results in a 1 in 2 chance of being right, so it's arbitrary; you have a 50% chance of winning either way.  Boy, this logic seems intuitive, but it turns out it is wrong!!


In fact, you should always switch from the door you first chose to the door that remains after Monty shows you a loser door.  Switching doubles your chance of winning! (you have a 1/3 chance of winning if you stay with your original door, and a 2/3 chance if you switch)  The trick is basically doing some careful bookkeeping and tracking conditional probabilities.  Your first pick is right 1/3 of the time, and staying wins exactly when that first pick was right.  Switching wins exactly when your first pick was wrong, which happens 2/3 of the time, because Monty always removes a losing door from the other two.  When Monty opens a door showing you a loser he is giving you valuable information.  You can read all about conditional probabilities at the Wikipedia link given above.


For me, doing is believing, so I have to play the Monty Hall game for this to really sink in.  One way would be to get a willing friend to help me play many rounds of the game, some where I switch and others where I don't, and track how often I win with each strategy.  Sounds tedious.  Another route is to have a computer play the game for me.  Below is how I played 10,000 rounds of the Monty Hall game in Python.


from random import *
from collections import defaultdict

def count_all(xlist, proportions=False):
    '''Count all the items in a list, return a dict
       with the item as key and counts as value'''
    out =  defaultdict(int)
    for i in xlist: out[i]+=1
    if proportions:
        out2 = {}
        tot_sz = float(sum(out.values()))
        for i in out: out2[i] = out[i] / tot_sz
        return out2
    else: return out

switch = []
no_switch = []

m = [0,0,1]  #a representation of Monty's doors, two losers (i.e. zeros) and one winner (i.e. one)

for i in xrange(10000): #ten thousand games
    shuffle(m) #randomly shuffle which door is the winner
    p = choice([0,1,2]) #contestant randomly picks a door
    x = [idx for idx, door in enumerate(m) if not door]  #find the two loser doors
    if p in x: show = set(x) - set([p]) #if our pick (p) is a loser, Monty must show the other loser
    else: show = set([choice(x)]) #if our pick is the winner, Monty randomly shows one of the two losers
    show = show.pop() #the door we will show
    all_s = set(range(3)) #setting up a set of all doors
    f = set([p, show]) # setting up a set with the orig pick and the loser we will show
    new_p = all_s.difference(f) #new_p is the set diff, meaning it's the door we'd switch to after Monty shows the loser
    new_p = new_p.pop()
    if m[p]: no_switch.append(1) #if the no switch strategy succeeded add 1 to list
    else: no_switch.append(0) # if no switch was a loser add 0 to list
    if m[new_p]: switch.append(1) #ditto above, but for switch strategy
    else: switch.append(0)

print 'switch strategy', count_all(switch).items()  #print the counts of the results
print 'no switch strategy', count_all(no_switch).items()


I run this code with Python and here is what I get:
switch strategy [(0, 3362), (1, 6638)]
no switch strategy [(0, 6638), (1, 3362)]
(here zero represents a loss and one is a win, both followed by the count among 10,000 games)

So out of 10,000 randomly generated Monty Hall trials switching won 6,638 (pretty close to the expected 2/3), and not switching only won 3,362 (about the expected 1/3).  It works!  My initial intuition that they should be about 50/50 is proved wrong.

As I mentioned above, you can get to this result without the simulations by using probability theory, but for me the result really becomes concrete and intuitive only after I code it up and see it for myself.

[Another thing worth pointing out is how simple it is to express this game in Python using sets, though I might have gotten a little carried away with them!]
A Hopeful Monster?
This plant mutation is called fasciation.  It's pretty common.  If you keep your eyes open you can frequently find fasciated dandelions.
Thanks for finding this page!  This is the first post of my first blog.  The blog is titled A Hopeful Monster, in reference to a concept from evolutionary biology.  Some biologists feel that evolution largely proceeds at a slow and steady pace, with changes from one generation to the next being nearly imperceptible.  Another school of thought typically accepts the imperceptible change bit, but adds onto that the occasional radical mutant that upsets the whole apple cart.  A geneticist named Richard Goldschmidt called these radical mutants the Hopeful Monsters and thought that they might be really important in adaptation and the creation of new species.

I'm not sure my blog will be all that radical, or give rise to any Hopeful Monsters, but at any rate, it's a cute phrase!

I am an evolutionary geneticist, and for several years I have kept files and records of things I've learned. My memory is pretty spotty, so once I get a grasp of a concept I usually write up my understanding as a sort of blog post to myself.  I've amassed a decent collection of these materials, and in part to organize this stuff, and perhaps also to help out others who have wondered about the same questions, I'm planning to blog these things out.

In addition to that, I'm trying to learn new concepts (this year's task is Bayesian statistics).  I plan to keep tabs on what I've learned by blogging about this stuff as I understand it.