A lot of people seem to struggle with the idea that stochastic processes can produce deterministic outcomes on grand scales. A prime example is genetic drift and how its results (and it’s interactions with silent mutations) can be used to assess past population structures and sizes.

As I’m increasingly finding that a good way to better understand a concept is to build it into a little model where you can tweak the parameters at your will (and also as an exercise in try-except-constructions and classes), I’ve written some Python code that simulates neutral evolution(0). It doesn’t show anything new, but it may be didactical. The bottom line is the following graph, which shows how the diversity of a population (the number of different alleles present) starts to oscillate around a point of equilibrium between mutations introducing new variants and existent once drifting out of the population after a while, the level where that equilibrium is reached being function of the population size and irrespective of initial diversity.

What this little toy of mine can *not* measure is different *types* of diversity: In my model, all variants are created equal, and don’t stand in any special relation to each other, so when I assume a possibility space of 100000, if and when a carrier of variant #73489 undergoes a mutation, the result can be any of the remaining 99999 possible variants. In reality, variants form a network of possible transformations, with variants that are closer or more distant from each other. So my model (even if it allowed to change the size of a population over time, which would currently require some ugly hacks) is insufficient to distinguish a large population that is the result of a recent expansion from a medium-sized population from a large population that is the result of a not-so-recent expansion from a small population – both will show less diversity than is expected for their size, from which fact, absent a way to tell different types of diversity apart, we could only conclude (assuming we know the mutation rate and reproduction patterns) is that either of these things must have happened. In reality, one would have (say) hundred different variants most of which are close to each other (converging at just a few right before the expansion started), while the other might have the same number of variants, but those would be more distinct from each other. Another feature of real genomes absent in this simple model is that you can track the variants of multiple genetic loci individually. This allows to diagnose subdivisions of the population when the limits of the ranges of individual variants correlate, which they shouldn’t if sheer distance in an otherwise uniform population is all that’s at work.

These shortcomings notwithstanding, the model is sufficient to see the effect of population size on gross diversity.

**The baseline**

Let’s start with a *very* small population, i.e. 50 smurfs.(1) In other words, I initiate the population with the command `population = Pop([x for x in range(50)])`

. Those are lined up in a circle such that individual #49 is a direct neighbour of individual #0. Imagine theylive along the coast of a small island with circumference of 500 meters, each occupying its own small patch 10 metres wide.

Initially, we start out with 50 smurfs representing 50 different genotypes, like this (I’ve fiddled around with the colors a bit so that all and only the individuals whose descendants stay around until generation 13 when the variability has shrunken to ten different genotypes are identified with a unique color, and all those who’ve gone extinct by then default to turquoise):

At the end of each generation, the smurfs reproduce and die. Each dying individual is replaced by a clone of either itself or one of the nearby individuals, where the distance between parent and offspring is regulated by a normal distribution. For this demonstration, I used a standard deviation of 2.0, so over 90% of the time, parents will come from within a range of 3 neighbours to either side(2). For two marked individuals, I identified those ranges of likely parents of their successors as the area between the arcs.

Now that we’re set up with a maximally heterogeneous initial population, we can let the games begin (`population.run(1,2,limit=30)`

). here’s how the population developed over 30 generations – the result will be slightly different each time, but something along those lines. Each vertical column represents a single generation, each horizontal row a spot on the island’s shore, with the top and bottom rows connected imagine we’re mapping the surface of a tube (this map can be produced with `population.area_plot(v_stretch = True, h_stretch=True)`

, although I fiddled a bit with the colors: `for genotype in range(50): if genotype not in population.freqs[population.k]: population.colors[genotype] = "turquoise"`

should do the trick):

From generation 13 on, marked with a dashed vertical line, the total variability has been reduced to 10 different genotypes, now each identified with a unique color. By generation 30, we’re down to five, without any inherent advantages to any smurf in the original population – all because some smurfs and their successor happened to be in the right spot at the right time, and those things add up.

Let’s zoom in on generation 30 again:

Again, two individuals’ likely successors are marked with those same arcs. For the individual on the right, we see that it’s almost certainly going to remain red: the probability that its successor will be coming from within that solid block of reds of which it is a part is almost 99%.

If we continue the game, we’ll almost certainly achieve uniformity within a few hundred generations.

**Fiddling around with population sizes**

Now that we’ve established the basic principles at work, let’s try some more realistically sized populations, and compare the results for the same set of parameters except for varying population size (I’m using a slightly wider standard deviation of 10.0 for picking the parents than above with the minipopulation of 50 where i used 2.0 for the purposes of saving my battery – with a smaller local range, it takes longer for the same effects to become visible).

We see that the larger the population is, the longer it takes for it to achieve fixation, i.e. before one variant comes to totally dominate the genepool. With a population of 200, this happens within a few hundred generations, whether we start with a near homogeneous population (only two types initially), or a maximally heterogeneous one.

A population of 1000 may not achieve uniformity over 2000 generations, although we’re down to only two types before generation 1000. After that, the effect of local sampling, irrelevant in a population of 200, kicks in – the two subpopulations form local clusters, and their overall frequencies don’t change that much since only deaths and births in the area where there’s an overlap have a potential to effect them.

This is even more pronounced with a population of 5000, where at the end of 2000 generations, we still have 11 different genotypes out of the original 5000. They too trend towards homogeneity, but much slower, and the geographical separation slows the process down quite a bit.

We can directly compare the degree of diversity in each population over time:

**Introducing mutations**

We have seen that drift is a force that will always push towards uniformity and that is the stronger the smaller the population (and the less internally structured, as we would see by changing the size of the sampling space for parents, leaving all else equal). But real populations don’t always trend towards uniformity, and neither should we expect them to. There’s a force that pushes towards heterogeneity: Mutations, especially selectively neutral mutations (the majority of them all). Mutations don’t care about the population size, they occur with the same probability per reproductive event regardless. With a mutation rate of 1/5000 (which is what I used in the simulations below), that means that a mutant smurf, representing a novel type that wasn’t in the initial population, will be born on average once every 25 generations in a population of 200, and every generation in a population of 5000. In both cases, most of those will go extinct again – when you start with a headcount of one and randomly drift up and down, chances are you will hit 0 before you get very far on the up side. But if and when a mutant is lucky enough to stay around, its descendants can quickly become the dominant type in a small population, to the point where the entire population converges on them (at least, before new mutations arise).

In a large population, this doesn’t happen. While there are many more mutations and therefore also many more mutations that *don’t* immediately drift out of existence again, change in frequencies occurs so much slower that by the time a mutant supplants all initial types (if at all), many more recent mutations will have claimed their share. The result is that in a large population, heterogeneity **never** goes all the way down. Instead, the number of types in the population fluctuates around the point of equilibrium between the homogenising force of drift and mutations introducing variation – and this point is, on the long run, independent of initial conditions: A population of 5000 that is initially maximally diverse will reach the same level of heterogeneity as one which starts out with only one or two different types, given the same mutation rates and sampling ranges. Here is, repeated from the introduction, what we get in terms of total heterogeneity under the same six different conditions as above, only adding a mutation rate of 1/5000 (with a possibility space of 100000 different types, so mutations are relatively unlikely to produce an already existing type even for the largest population modeled):

And this is, in a nutshell, how variation in a population can be used as a diagnostic of past population sizes, although the real world offers much finer grained analyses based on the variants have different distances from each other.

Let’s look at what happens in the individual populations in some more detail:

(Sourcecode posted under this link for anyone interested. Still poorly documented and debugged, but mostly works.)

**Footnotes:**

(0) This is not by any means the simplest possible model for genetic drift – the requirement to have local sampling adds considerable sophistication. Although it’s not strictly required for the main point of this entry, I think that the possibility it adds of displaying how different types establish themselves in a subset of the species’ total range before invading other areas makes the whole thing more intuitive than just plotting total frequencies.

(1) I am aware that smurf reproduction is an open research topic, but for the purposes of this discussion, I will assume that they reproduce asexually, probably through budding. I don’t really want to imagine the alternatives. Also, the smurfs on TV are misrepresented as tribal animals. Real *smurfs* are solitary.

(2) Roughly 20% of the time, an individual is succeeded by a clone of itself (for sigma values between -0.5 and +0.5), roughly 35% of the time by one of its direct neighbours (-1.5 to -0.5 and 0.5 to 1.5), roughly 24% of the time by one of its second neighbours, and about 13% of the time by one of its third neighbours to either side. Technically, any individual’s clone can replace any other individual, but the actual probability becomes increasingly negligible with distance.

(3) This is only the case because we’re modeling asexually reproducing smurfs here. Let it be clear that inn humans or any other sexually reproducing species, the fact that an individual gene’s lineage coalesces at a particular point in the past in one individual in no way implies that this individual was the only one alive at the time to leave descendants!

(4) In an infinite population (and with an infinite space of possibilities), the number of individuals representative of any of the initial genotypes in generation *n*, given a mutation rate of *M*, should be (1-M)^n, thus the number of mutants 1-(1-M)^n. For a mutation rate of 0.0002 and 2000 generations as used here, this should result in a steady increase to nearly a third of the individuals (slightly less because we’re using a finite possibility space) – almost what we see in the purple and yellow lines for both the 5000-conditions.