This post is a companion to our paper on bioRxiv. My hope is to add a little more detail around the idea of deep learning in population genetics.
Where did the idea of using neural networks to do pop gen come from?
The neural nets we built are very similar to
those commonly used on MNIST handwriting data, which is like basic neural
nets 101.
Machine learning, deep learning, deep neural nets, and artificial intelligence. What is all this stuff? And where does statistics fit in?
It’s confusing. Here’s my impression of how to structure some of the terms:
ML=Machine Learning; DL=Deep Learning; DNN=Deep Neural Nets
ML is very general, and encompasses many things, even
including standard stats methods like regression (depending on who you
ask). DL is a catch-all term that refers
to a large class of (mostly) neural network methods without calling out any
specific one. And a DNN is just any neural
network that has more than one internal layer (i.e. “deep” layers). There are many sub-types of DNNs too. For example, we made heavy use of
Convolutional DNNs.
And what about artificial intelligence, or AI? That term is
fuzzy and really used more for marketing. All this stuff is probably AI, or
none of it is, or who knows. You should sprinkle AI into the conversation when you are trying to impress people.
Finally, where does stats fit in? That’s also debated. I don’t think of neural nets as
a statistical method. But others may disagree. They certainly incorporate ideas from statistics, and if you know some stats that will probably make it easier to learn neural nets.
I’ve got this project I want to do, and it’s a really hard problem. Also there is no good way to get good training data, and even if there were, there’d be no signal in it. Do you think the neural nets can solve this problem?
Nope. There is no magic.
I’ve got another problem, and it can be easily solved by regression, but I was wondering if I should use a neural net instead?
Nope. Whenever possible stick to the explainable stuff. But bonus points if you do it both ways and learn something.
I've got an NP-hard problem (like phylogenetics), can neural nets do that?
Generally speaking neural nets are inappropriate for optimization problems. Sometimes you can recast optimization as classification and use a neural network, but this only works well if there is a pretty small number of possible classes (and usually in those cases you can just do an exhaustive search).
Do I have to be a math expert to learn neural nets?
No! Ever run a BLAST search? Do you really know what’s going on under the hood? Me neither, but we can both use them to get our work done. Same thing with neural nets.
Also, the math is not so bad, and I don’t think you need to
master it to be able to conceptualize what’s going on.
Right now is a magical time to learn this stuff. You can
learn about deep learning from the people who pioneered it. Many of them teach
online courses or have really great books. Most of their teaching is free.
Don’t miss the opportunity to learn from these people. Below are just a few
links.
https://goo.gl/ryQ1kD (really great visuals)
https://www.youtube.com/watch?v=vOppzHpvTiQ (love this guy)
http://neuralnetworksanddeeplearning.com/
What kind of pop gen theory are these deep neural nets based on?
None! Neural nets are completely naive to any relevant pop gen theory. They haven't even heard of Hardy & Weinberg.
Neural nets are just algorithms that learn to do
pattern recognition by computational brute force. And they work freaking well! And
on a whole bunch of different problems. Kinda neat huh!
Learn one technique and solve many problems.
The good news is that when we point the neural nets at data they
aren’t limited by our current state-of-the-art in pop gen theory. They just learn patterns and invent their own
theory (so to speak) for solving the problem at hand. So they may be doing things that we never
imagined! And I said there was good news, so that probably means there is bad
news too. Namely the innards of a neural
net are nearly impossible to interpret. So they are really good for detection
(i.e. this is a selective sweep and that’s not), but they are really bad for explanation
(i.e. I know this is a sweep because of X).
Oh, I see, so you jerks are putting theorists out of business!
Ah, you got us!
Actually, no. See above. Neural
nets are great for detection, but terrible for explanation. What they learn is unfortunately a big black box to us.
We imagine that a hybrid approach might be the best way to use
deep neural nets in pop gen. For
example, first test a neural net and other methods on simulated data and figure
out what works best for your question (and that might be some ensemble of
several methods). If the neural net
looks good, you should use it. It seems silly to do otherwise. Then to tell
your story use traditional pop gen techniques to try to explain what the neural
net found. Some things might not make
sense. Like the net found a selected
site, but no other method detects it. On
the one hand that presents a challenge, but on the other it’s probably worth
investigating because it could be something we’ve been missing and could lead
to new theory.
A second thing you should know. Big advances are being made in explaining what neural nets are learning.
And who is going to take the knowledge out of the neural net and
translate it to the rest of us?
Theoreticians.
You keep calling it a "black box", what the hell are you talking about?
Good question. Black box is kind of weaselly. Also, black box makes it sound like we can't look inside. That's not the case, we can look in, but I hope to convince you that what we see is just really hard to interpret. Let's build a neural net, crack it open, and see what's going on.
I made a neural net that predicts Tajima's D. The nice thing is we know the theory and math behind Tajima's D, so if we train a net to predict it and look at the inside, maybe we can find the components of theory and math in there somewhere?? That's the dream anyway.
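For reference, the theory and math behind Tajima's D fits in one little function. Here's a quick numpy sketch of the standard Tajima (1989) formula (my own illustration for this post, not anything the net computes):

import numpy as np

def tajimas_d(geno):
    # geno: (chromosomes x segregating sites) 0/1 matrix
    n, S = geno.shape
    counts = geno.sum(axis=0)                                  # derived allele count per site
    pi = np.sum(counts * (n - counts)) / (n * (n - 1) / 2.0)   # mean pairwise differences
    a1 = np.sum(1.0 / np.arange(1, n))                         # Watterson's theta = S / a1
    a2 = np.sum(1.0 / np.arange(1, n) ** 2)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))  # (pi - theta_W) / sd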
Below is the structure of the net. For each layer, the shape of its output matrix is given.
InputLayer (40, 20)
Conv1D (39, 128)
Conv1D (38, 128)
AveragePooling1D (19, 128)
Dropout (19, 128)
Flatten (2432)
Dense (128)
Dropout (128)
OutputLayer (1)
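If it helps to see that in code, here's a rough Keras sketch of the same stack (the kernel and pool sizes are inferred from the output shapes above; the dropout rates, activations, and optimizer are my guesses, not necessarily the exact settings we used):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(40, 20)),                           # 40 segregating sites x 20 chromosomes
    layers.Conv1D(128, kernel_size=2, activation="relu"),   # -> (39, 128)
    layers.Conv1D(128, kernel_size=2, activation="relu"),   # -> (38, 128)
    layers.AveragePooling1D(pool_size=2),                   # -> (19, 128)
    layers.Dropout(0.25),
    layers.Flatten(),                                       # -> 2432
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(1),                                        # Tajima's D as a continuous value
])
model.compile(optimizer="adam", loss="mean_squared_error")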
And you feed this thing images (20 chromosomes and 40 segregating sites) like the one below:
That's the "input layer" in the structure above. Then it goes through two 1D convolutional layers (among other things noted above), and a fully connected dense layer (each with 128 neurons), and finally a single neuron output layer which just predicts Tajima's D as a continuous value. Below is the quality of this network on unseen test data after training.
It's pretty darn good. So it really gets how to predict Tajima's D. So now let's look at how it does this, using the input sequence alignment given above as an example. Here's the output of the first convolutional layer:
White pixels indicate a strong signal and black means no signal. Now here's the next convolutional layer:
And finally below is the fully connected "dense" layer: (This thing is flattened out, so it's really just a vector. But I gave it a little height so you could see it.)
And that thing feeds to one linear neuron and it sums it up and outputs 1.53, which is really close to the true value of Tajima's D (1.56) for the sequence alignment above.
Like I said, we know what goes into Tajima's D. What you'd like to find in these images is something corresponding to Watterson's Theta and Pi, and also something about their variances. Or maybe some bits that appear to count the frequency spectrum or some kind of intermediate stuff that you might use to get to Watterson's Theta and Pi. But it's just not there. At least not in any way we know how to extract (and perhaps it is really just not there). This is what the "black box" term refers to. What's in there is just a bunch of signals (off, on a little, on a lot, etc), and they give rise to other signals, and finally the signals are summed up and the sum (almost magically) comes out really close to the correct value.
What's worse, there are a vast array of internal configurations of this network that can achieve high accuracy. So if we retrained this thing from a new random starting position, we would likely see a completely new pattern of signals flowing across the network for this same input. And yet, the thing would probably again guess a value very near to 1.56. It's a little like our brains. If I ask you to imagine ice cream, you can conjure up some mental image. But if I compare how your brain makes that image to how my brain does it, we surely have different patterns of neurons firing. So just like our brains, the neural net is not learning some canonical way to get Tajima's D out of a sequence alignment. Instead it's just finding a local optimum, and there are many many equivalent local optima out there. So there is nothing really special about the patterns above, other than the fact that they represent one of the many ways of getting a decent prediction of Tajima's D out of a neural net.
Now imagine we instead built a neural network that could predict something we don't know how to compute (like the signature of a soft sweep). We might get a network that does really well, and it'd be a dream to back out what the network learned into new theory. In practice, what this would entail is making sense and structure out of stuff like the patterns above. And perhaps worse, I'm not sure there is any certainty that there would be new theory in there. Like maybe you could toil at this for years, only to later learn that there was never anything there to find. I hate ending on a downer like that, but anyway, you hopefully understand why people call these things a black box.
Ok, I read the stuff above, and I still want to develop a neural net for my research. What does the process look like?
I’m so glad you are interested! Let me help you get started. This is a bit of a long answer. It’s a multistep thing.
First, maybe you have a bunch of genomes from your favorite
species where the exact genealogical history of every SNP is known. And you know every evolutionary force that has ever acted on each SNP too. Then all
you have to do is…
What’s that? You don’t have that kind of data? Or even a
kind of halfway approximate version of it. OK, well, first let me welcome you
to the club. In this club we do coalescent simulations to make up fake data sets that are sorta like those of the real critters we are studying. Then we feed these fake data sets into our
neural net to train it and hope that we’re close enough to reality that when we
point the thing at the real data it works.
So step one, you have to simulate training data for your net. And like I said, what that usually means is
building a customized coalescent simulation.
So you have to capture all the important complexity of the evolutionary
history of your sample of genomes, and then simulate conditional on that. You
also have to simulate the thing you want to detect, like a set of sims with the
selective sweep and a set without.
Probably 90% of all your mistakes are going to happen during this stage
of the process. But since everything is
super dependent on the exact nature of your sample and the critters you study,
all your mistakes will be new and different from all of ours, so I can’t give
you a lot of tips other than to talk to coalescent simulation experts. So
that’s my advice. Make mistakes, talk to people, fix those mistakes and repeat
until all the remaining mistakes are so deep that the average reviewer probably
will never notice.
I’ll assume you accomplished step one, and you have a lot of
coalescent sims in hand and they are all really really perfect for your
question. And hopefully you still have some of your motivation intact too.
Next you have to get your coalescent sims into python as
matrices (oh yeah, if you don’t already know python, go learn that too). We
have some code that you can use as an example. We used ‘ms’ formatted files to store our
coalescent sims. They’re unpleasant to
parse. I’d recommend using msprime, which gives you the output of the sim in a
nice python format off the bat.
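Something like this, for example (a minimal sketch assuming the msprime 1.x API; every parameter value here is a placeholder you'd tune for your own critter):

import msprime

# 10 diploid samples = 20 chromosomes, like the Tajima's D example above
ts = msprime.sim_ancestry(samples=10, ploidy=2, sequence_length=20_000,
                          recombination_rate=1e-8, population_size=10_000,
                          random_seed=42)
mts = msprime.sim_mutations(ts, rate=1e-8, random_seed=42)
geno = mts.genotype_matrix()        # numpy array: (segregating sites, chromosomes)
print(geno.shape)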
Now we are nearly ready to train our network. But first we have to build our compute
environment. I highly recommend using
Keras and Tensorflow. They are two great python packages for training neural
nets. Tensorflow is the math guts of
neural nets, and Keras is a package that sits on top and hides a lot of the
math from you. Tensorflow is a bear to install.
I recommend using a cloud service like FloydHub
where it’s already pre-installed (I’m not affiliated with FloydHub). Or
bribe someone to install it for you, but probably just use a cloud service.
They’re so cheap and your time is valuable.
Neural networks are computationally intensive, so make sure
you use a machine with a GPU, because it will make things way way faster. What’s a GPU you say? The truth is I don’t really know. The
Wikipedia page says things like “stream processing” and “compute kernels”,
which are all words that mean something I guess. Here’s what I do know. GPUs do matrix math really fast, and training a neural net involves tons of matrix math. So they make the whole process way faster.
OK, so you got Keras and Tensorflow up and running on a GPU either by bribing someone or using a cloud service. Now you are ready to train!!
Training is my favorite part, because it involves
baby-sitting a computer while listening to music and drinking beer. So first get some tunes going. I recommend something electronic because
that’s the mother-tongue of neural nets.
The goal is to comfort your net with songs in its language so it can
learn faster. The beer is for you, so that's your call.
OK, the first step in training (after taking a good pull of your
beer) is to split the coalescent sims into three groups: the training group, the validation group, and
the testing group. We’re going to train
the neural net with the training group, make parameter changes with the validation group, and then test its accuracy with the test group
(pretty creative names huh!). So the
training group should be big. We used
like 100K-200K individual coalescent sims usually, but even more would probably
have been better. And the validation and testing data
can be smaller, like 5K-20K each. But the
trick with the validation and testing data is that it needs to be big enough to have all the
important features of the training data (and vice-versa). So let's say we are trying to detect selective
sweeps. It’d be a huge mistake to put
all the hard sweeps in the training data and all the soft sweeps in the validation and testing
data. Instead, we need to randomly mix
them so both sets have similar distributions of hard and soft sweeps (and even
similar intensities of hard and soft sweepy-ness).
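In code the split is nothing fancy. Something like this (a sketch; X and y are stand-ins for your stacked sim matrices and their labels):

import numpy as np

# random stand-ins just so the snippet runs; use your real sims here
X = np.random.randint(0, 2, size=(200_000, 40, 20), dtype=np.int8)
y = np.random.randn(200_000)

# shuffle first so hard and soft sweeps (and everything else) end up in every split
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

n_val, n_test = 10_000, 10_000
X_test,  y_test  = X[:n_test],               y[:n_test]
X_val,   y_val   = X[n_test:n_test+n_val],   y[n_test:n_test+n_val]
X_train, y_train = X[n_test+n_val:],         y[n_test+n_val:]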
Once we’ve split the data, we pick a neural net architecture and we start training. Architectures that work pretty well are here, and if you read the code you’ll see you just specify the
architecture layer by layer with Keras.
Network architecture is a dense topic that I don’t completely
understand, but experts seem to agree that nobody else understands it either.
Each pass over our training samples is called an epoch. Generally speaking one epoch
takes about a quarter of a beer. So you fire off one epoch, and sit back, enjoy
the beer, and groove to tunes. Keras
provides a little progress bar and 2 numbers called loss and val_loss, and it
kinda looks like this at the end of the epoch.
Train on 143064 samples, validate on 7167 samples
Epoch 1/1
143064/143064 [==============================] - 325s - loss: 0.3728 - val_loss: 0.3117
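For the record, that output comes from a call that looks roughly like this (a sketch; model, X_train, and friends are from the snippets above, and the batch size is a guess):

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=1, batch_size=64)
# history.history["loss"] and history.history["val_loss"] hold those two numbers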
loss is the
current state of the loss function on the training data, and val_loss is the current state of the loss
function on the validation data. The loss function is just a measure of how
well the net is doing at getting the right answer. We want both loss and val_loss to be
small, but really we want val_loss to
be small, because that’s the true test. And
as the model trains we want to see both loss
and val_loss reduce with every epoch.
There’s something funny going on above. The value at loss is actually greater than the one at val_loss. What that means is
that the net does better on data it has never seen than on data it was trained
on. WTF is up with that? Is it psychic or something??
Relax, it’s cool. In fact, it’s really good. What’s going on
is that we are using a technique called dropout. By using dropout we handicap the net during
training, so it thinks it’s dumber than it really is and tries extra hard to
learn. Then when it applies itself to
the validation data, we take the handicap away and it does better than it ever thought
it could. This is kind of a shitty thing
to do to it of course. Our little net is definitely going to have trust issues down
the road, but it’s just a bunch of bits, so tough luck kid. That said, when the
computers take over we are all going to pay dearly for this dropout stuff.
Training only takes about ½ a brain and a couple beers. Maybe less of the former and more of the
latter. That’s why I like training. There is one thing you should do at some
point. Figure out the math that makes up
your loss function (it’s usually
mean squared error or cross-entropy, which are pretty easy equations), and
figure out some way to convert this into units you can contextualize (like RMSE
or accuracy (and Keras will do accuracy for you)). That way while this thing runs you can see
how it’s doing in units that make sense and you can feel good about life as it
gets smarter and you get drunker.
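For a regression net like the Tajima's D example, that conversion is about three lines (a sketch; you can also pass metrics=["mae"] to model.compile() if you want Keras to track something interpretable for you):

import numpy as np

preds = model.predict(X_val).flatten()
rmse = np.sqrt(np.mean((preds - y_val) ** 2))   # back in the units of the thing you predict
print("validation RMSE:", rmse)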
But don’t get too drunk.
There’s one other thing you have to do.
You have to watch out for overfitting. This is when the net starts predicting
your training data super well, but at the same time gets worse at predicting
your validation data. Overfitting is just a fancy way of saying that something
trained really hard to memorize a specific task, but didn’t learn to solve the problems in a general way.
Neural nets are really smart, and they love to overfit,
however overfitting is pretty easy to diagnose.
Any time loss is smaller than val_loss you are seeing
overfitting. That’s why dropout is
good. It keeps loss bigger than val_loss. Usually a little overfitting is ok, and
sometimes it corrects itself after a few epochs. But when it gets out of hand it’s pretty
ugly, and often it spirals until your model is so overfit that it’s pretty much
useless at predicting the validation data even though it’s crushing it on the
training data. If it overfits really
fast and well before it achieves any kind of accuracy you’d call acceptable,
then you might have to change the architecture, or make more training data, or use even
more dropout (you heartless jerk). Here’s
a cool blog post that talks you through your options. I recommend carefully going through it
step-by-step, and for now skipping all the steps you don’t understand.
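One easy knob worth knowing about is Keras's EarlyStopping callback, which quits when val_loss stops improving and hangs onto the best weights it saw (a sketch; the patience value is arbitrary):

from tensorflow.keras.callbacks import EarlyStopping

stopper = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=64, callbacks=[stopper])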
On the other hand, maybe your net just keeps trucking along
getting better and better. Lucky
you! But at some point you have to stop
drinking and call it good enough. Then
you evaluate your net one last time, this time on test data that were never used in training. This is important, because you tuned the model using the validation data, but now you need an independent assessment, and that's what the untouched test data is for. Then you
save the trained net and the accuracy on the test data and get ready to run it on your real data and do real biology!
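That last step is just a couple of lines (a sketch; the filename is made up):

test_loss = model.evaluate(X_test, y_test)      # one pass over the untouched test data
print("test loss:", test_loss)
model.save("my_trained_net.h5")
# later: model = keras.models.load_model("my_trained_net.h5")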
Have you ever learned about some flashy new computational
approach for solving your exact problem? Holy cow, exactly what you needed! Amazing,
right! Then you download it and realize that the inconsiderate bastards who
wrote this crappy software went out of their way to make it hard to use. You spend weeks reformatting your data and
just seething with contempt for these jerks. Important stuff is undocumented, and it even crashes on their stupid
example, and whitespace is f-ing important but of course that’s
not discussed and… ugh. Who do they think they are?! How could they be such assholes, wasting
countless hours of everyone’s time, just because they couldn’t be bothered to
use some standard input formats, or write halfway decent documentation, or
provide code that compiles. Oh do you
hate them, just a burning hatred!
OK, first you’ve got some issues, but don’t worry so do I, so no judgment here. But second - and this is the big payoff - guess what,
when you make your own neural net to solve your own problem you get to be your
own inconsiderate bastard who makes it a huge pain in the ass to use. How cool
is that!
Yeah, so I should have mentioned this earlier. Before you start, you need to think about
your real data a lot. Consider how you
want to run it through the net. Then you need to tailor your coalescent sims to
match whatever paradigm you choose. If
you have 20 individuals and you want to run 20 kb chunks of the genome through,
then you need to simulate 20 individuals and tune the sim parameters to
approximate diversity in those 20 kb chunks.
Also, are your data phased or unphased, and how much sequence error do
you think you have? And do the A,C,G,
and T bases matter, or can you code all your biallelic sites arbitrarily as 0
& 1? You gotta work through all this
stuff, and tune the sims accordingly. It
all matters, because the neural net will learn from what you give it, not what
you meant to give it. Also, you are probably going to tweak and rerun the
simulations several times before you get it right. Hence my comment above about 90% of your mistakes, it’s
this stage where they all come to light.
Sounds bad, I know, but just use the neural net that sits on
your shoulders and try to imagine what the artificial one in your computer is seeing
in the data, and then adjust the sims accordingly. Plus when you are your own
inconsiderate jerk who made the thing a pain to use, at least the feedback loop
usually is pretty short.
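To make just one of those choices concrete: if you decide the A/C/G/T identities don't matter, the 0/1 recoding might look something like this (a sketch; 'genotypes' is a hypothetical chromosomes-by-sites array of base calls):

import numpy as np

def encode_binary(genotypes):
    out = np.zeros(genotypes.shape, dtype=np.int8)
    for j in range(genotypes.shape[1]):
        col = genotypes[:, j]
        alleles, counts = np.unique(col, return_counts=True)
        minor = alleles[np.argmin(counts)]       # rarer allele gets coded as 1
        out[:, j] = (col == minor).astype(np.int8)
    return out

example = np.array([["A", "A", "T", "A"],
                    ["C", "C", "C", "G"]]).T     # 4 chromosomes x 2 sites
print(encode_binary(example))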
You’re almost done!
You made your training data, and trained your net, then you tested it
against unseen data, and finally you ran real data through. And guess what, it found stuff! Time to write
a paper! So now you have to explain what it found. One approach is to fall back
to non-neural net methods. Like apply
some pop gen summary stats just as you always have. It’s cool, go ahead, I won’t judge. Then maybe compare the patterns of the net
versus the traditional methods and figure out how to explain it to someone
using traditional summary stats. Or
maybe don’t. Instead you could just
point to the quality of your neural net on the test data and rest your
case. If a reviewer asks tell them “Stats! We don’t need no stinkin’ stats!”. I’m cool with that too. Whatever works for the story you want to
tell. It might even be helpful to read
some cool machine learning papers in biology and see how other people reported
discoveries made with a black box network.
Here’s my final pitch: we’ve been doing pattern recognition
in pop gen for years (though rarely calling it that). I (and other like-minded
folks) think it’s time we adopt more powerful pattern recognition
approaches. I’m excited about neural
nets and their potential in this regard, but they are different and will require
a little retooling on your part. So, if you decide to give it a whirl and get
stuck, reach out to me. I’m happy to
chat and offer any tips I can. The best
ways to reach me are on Twitter @flagelbagel or my email, which is a gmail account with the prefix flag0010.
OK, I’m getting it, and I see this whole thing as being terribly reliant on the quality of my coalescent sims. Seems like a house of cards. Right?
Yes, how well your coalescent sims match your real data really matters. I can’t stress that enough. You can’t train a neural net on sims that are appropriate for elephant demography and expect it to do well on fruit flies. In fact, you probably can’t even train on the demography of one species of fruit fly and hope it will work well on another. But hey, it’s not hard to get heaps of genomic data, so just estimate the demography and simulate accordingly. This is kinda like specifying your prior, and in my opinion that's a good thing.
You also should know that neural nets do well at
interpolation, but they struggle with extrapolation. So for example, if you train one to look for
sweeps when the effective population size ranges from 10K-50K, they will work
well on real data that falls in that same range, even for points in that range
you never actually simulated (interpolation). But they probably won’t work well
for real data that lies outside that range (extrapolation). So you have to be cognizant of the extremes in
your data and train all the way to those extremes. And bear in mind, the coalescent is quite stochastic. So things like the
effective population size are going to have a big range across different loci
in the genome and you need your sims to be similarly extreme.
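In practice that just means drawing the nuisance parameters from generously wide ranges when you simulate. Something like this (a sketch; the ranges and rates are made up):

import msprime
import numpy as np

sims = []
for _ in range(1_000):
    ne = np.random.uniform(10_000, 50_000)       # a different Ne for every simulated locus
    ts = msprime.sim_ancestry(samples=10, ploidy=2, sequence_length=20_000,
                              population_size=ne, recombination_rate=1e-8)
    mts = msprime.sim_mutations(ts, rate=1e-8)
    sims.append(mts.genotype_matrix())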
Finally, traditional summary stats based pop gen also tends
to rely heavily on coalescent simulations too. A common practice is to generate
oodles of summary stat values from the real data and then use coalescent sims
to determine the cutoff values for significance. So if you do that kind of pop
gen already, then either way you slice it, it’s turtles and coalescent sims all
the way down.
There’s so much hype around neural nets. What happens in 5 years when they become the next “big data” or “VR” or “wearables”?
Yes, there is a LOT of hype around deep learning in tech right now, and you are absolutely right to be a skeptic when you see this kind of hype. “Deep Learning for X” startups are launching daily, and many are selling pixie dust. They and their financiers will be screwed soon enough, but you should know that the core tech behind neural nets is real and it’s not going away. This is absolutely happening. If you see yourself being in a quantitative field in the future, it's probably time to add neural nets to your toolbox.