Simulation setup

Population-based ancestry is only well-defined with reference to some particular time. For instance, suppose that my maternal grandmother belonged to Population A, and my mother migrated into Population B. I inherited some of my genome from these ancestors – but which population did it come from? The answer depends on a point in time – and in particular, on whether we are interested in ancestry one or two generations ago.

By default, msprime and SLiM do not retain information about the ancestry of individuals at each timepoint in the simulated history. The rest of this page explains how to rectify this.

msprime simulations

Let’s make this more concrete with an example.

Suppose there was an admixture event between two populations, RED and BLUE, that produces an admixed population ADMIX. The admixture happened 100 generations ago. These two populations split from a common ancestral population, ANC. The split was 1000 generations ago.

The deme diagram for this simulated example.

Here’s a msprime.Demography object describing this demographic scenario:

import msprime

pop_size = 500
demography = msprime.Demography()
demography.add_population(name="RED", initial_size=pop_size)
demography.add_population(name="BLUE", initial_size=pop_size)
demography.add_population(name="ADMIX", initial_size=pop_size)
demography.add_population(name="ANC", initial_size=pop_size)
demography.add_admixture(
    time=100, derived="ADMIX", ancestral=["RED", "BLUE"], proportions=[0.5, 0.5]
)
demography.add_population_split(
    time=1000, derived=["RED", "BLUE"], ancestral="ANC"
)

We’ll simulate a 100kb genomic region for two diploid individuals from the admixed population ADMIX:

ts = msprime.sim_ancestry(
    samples={"RED": 0, "BLUE": 0, "ADMIX" : 2},
    demography=demography,
    random_seed=1011,
    sequence_length=100000,
    recombination_rate=3e-8
)

Have a look at this msprime tutorial if you need a refresher on this syntax.

By default, the tree sequences generated by msprime will only show the ancestral haplotypes that happen to be coalescent ancestors. On their own, this is not sufficient to get information about population-based ancestry at all locations along the genome.

For instance, here is the first tree in the tree sequence generated by the simulation above. This tree describes the genealogical relationships between the samples on the leftmost part of the simulated genome:

# Note that this code will only work in a Jupyter notebook
from IPython.display import SVG

colour_map = {0:"red", 1:"blue", 2: "purple", 3: "gray"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
tree = ts.first()
SVG(tree.draw(node_colours=node_colours))
The first tree in the tree sequence simulated with the code above

From this, we see that samples 1, 2 and 3 have ancestry with the red population at this location in their genomes. However, we cannot be sure about the provenance of sample 0 based on the information displayed here. At this genomic location, sample 0 is very deeply diverged from the other samples. In fact, it is so deeply diverged that it’s most recent coalescence with the other samples (at node 12) pre-dates the ‘split’ between the red and blue ancestral populations. To see which population it has inherited from at this location, we’d need to ‘mark’ one of the more recent (non-coalescent) ancestors of sample 0 to retain.

The msprime.Demography.add_census() method (documented here) is a special demographic event that we added into msprime to do precisely this. More specifically, add_census records a node on all lineages that are extant at some user-specified time in the simulation. This is needed to simulate complete information about local ancestry.

The code below is the same that we specified above, but with a census event at time=100.001. Note that this time is just before the admixture event creating population ADMIX.

# Make the Demography object.
demography = msprime.Demography()
demography.add_population(name="RED", initial_size=pop_size)
demography.add_population(name="BLUE", initial_size=pop_size)
demography.add_population(name="ADMIX", initial_size=pop_size)
demography.add_population(name="ANC", initial_size=pop_size)
demography.add_admixture(
    time=100, derived="ADMIX", ancestral=["RED", "BLUE"], proportions=[0.5, 0.5]
)
demography.add_census(time=100.01) # Census is here!
demography.add_population_split(
    time=1000, derived=["RED", "BLUE"], ancestral="ANC"
)

# Simulate.
ts = msprime.sim_ancestry(
    samples={"RED": 0, "BLUE": 0, "ADMIX" : 2},
    demography=demography,
    random_seed=1011,
    sequence_length=100000,
    recombination_rate=3e-8
)

Here is a diagram of the first tree in the tree sequence returned by this simulation.

The first tree in the tree sequence simulated with the code above

Note that there is now a node on every branch in the trees at the time specified in our census event. (In the tree above, these are nodes 5, 6 and 7). This is the information required to extract full information about population-based ancestry at all genomic locations in all samples. For instance, we see here that sample 0 has local ancestry with the blue population, while the other samples have ancestry with the red population.

SLiM simulations

Use a treeSeqRememberIndividuals() call to select census individuals.

initialize() {
        initializeTreeSeq();
        initializeMutationRate(0);
        initializeMutationType("m1", 0.5, "f", 0.0);
        initializeGenomicElementType("g1", m1, 1.0);
        initializeGenomicElement(g1, 0, 99999);
        initializeRecombinationRate(3e-8);
}
1 early() {
        sim.addSubpop("p3", 500); // "ANC"
}
1000 early() {
        sim.addSubpop("p0", 500); // "RED"
        sim.addSubpop("p1", 500); // "BLUE"
        p0.setMigrationRates(p3, 1.0);
        p1.setMigrationRates(p3, 1.0);
        p3.setSubpopulationSize(0);
}
1899 late() {
        // The 'census' event:
        // note these individuals have time 101 in the output
        sim.treeSeqRememberIndividuals(sim.subpopulations.individuals);
}
1900 early() {
        sim.addSubpop("p2", 500); // "ADMIX"
        p2.setMigrationRates(c(p0, p1), c(0.5, 0.5));
}
1901 early() {
        // admixture happens in a single generation
        p2.setMigrationRates(c(p0, p1), c(0, 0));
}
2000 late() {
        sim.treeSeqOutput("slim.trees");
}

When should you add the census?

msprime simulations

You should specify the census event at a time when

  1. All of the relevant ancestral populations are active.

  2. It is unlikely that all samples have coalesced anywhere.

  3. There are no other coalescent nodes.

In the example above, condition 1 suggests that we should choose a census time between 100 and 1000 generations in the past. Before this time, the populations did not ‘exist’, and after this time, the ancestors of the sample were already admixed. To make condition 2 as likely as possible, we should choose a time closer to 100 and 1000. The chosen time of 100.01 satisfies both of these conditions so far. Since we are running a (default) coalescent simulation here, condition 3 is unlikely to be an issue.

Note

Condition 3 is most important when you are running a DTWF simulation. In this situation, you want to avoid placing the census nodes ‘on top’ of the existing ancestors that are generated at discrete times, so a non-integer time is most suitable here.

SLiM simulations

You’ll usually want to place the treeSeqRememberIndividuals() call in the generation before admixture begins.