Timepiece II: SMC++
From two sequences to many – demographic inference with the distinguished lineage.
The Mechanism at a Glance
SMC++ (Terhorst, Kamm & Song, 2017) extends PSMC from a single diploid genome to multiple unphased diploid genomes. Where PSMC reads population size history from two haplotypes – one simple watch with two hands – SMC++ adds more hands to the dial without requiring phased data or full ARG inference. The result is sharper resolution in the recent past, exactly where PSMC’s two-sequence approach runs out of steam.
The key insight is the distinguished lineage. Rather than tracking the full genealogy of all samples (which would require exponentially many states), SMC++ singles out one lineage and tracks how it relates to a demographic background of \(n - 1\) undistinguished lineages. The coalescence time of the distinguished lineage is the hidden variable; the number of undistinguished lineages still remaining at each point in time provides additional signal about population size. This trick keeps the state space manageable while extracting far more information than PSMC’s two-haplotype approach.
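To make the idea concrete, here is a minimal sketch of how a background of undistinguished lineages changes the distribution of the distinguished lineage's coalescence time. It assumes a simplified Kingman-style model that is not necessarily the exact system derived in the later chapters: with \(j\) undistinguished lineages present, pairs among them coalesce at rate \(\binom{j}{2}/\lambda(t)\), and the distinguished lineage coalesces with one of them at rate \(j/\lambda(t)\). The function name, the RK4 integrator, and the step count are illustrative choices.

```python
import math

def distinguished_survival(m, t_end, lam=lambda t: 1.0, steps=2000):
    """Return P(T > t_end) for one distinguished lineage and m = n - 1 others.

    q[j] = P(j undistinguished lineages remain AND the distinguished lineage
    has not coalesced yet). Rates are an assumption of this sketch:
      undistinguished pair coalesces: C(j, 2) / lam(t)   (j -> j - 1)
      distinguished lineage coalesces: j / lam(t)        (probability leaves q)
    """
    def deriv(t, q):
        dq = [0.0] * (m + 1)
        for j in range(1, m + 1):
            pair_rate = math.comb(j, 2) / lam(t)
            dist_rate = j / lam(t)
            dq[j] -= (pair_rate + dist_rate) * q[j]
            if j >= 2:
                dq[j - 1] += pair_rate * q[j]
        return dq

    q = [0.0] * (m + 1)
    q[m] = 1.0                      # start with all m undistinguished lineages
    h = t_end / steps
    t = 0.0
    for _ in range(steps):          # classical fourth-order Runge-Kutta steps
        k1 = deriv(t, q)
        k2 = deriv(t + h / 2, [x + h / 2 * k for x, k in zip(q, k1)])
        k3 = deriv(t + h / 2, [x + h / 2 * k for x, k in zip(q, k2)])
        k4 = deriv(t + h, [x + h * k for x, k in zip(q, k3)])
        q = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(q, k1, k2, k3, k4)]
        t += h
    return sum(q)

# With m = 1 the model collapses to the two-lineage (PSMC-like) case.
for m in (1, 4, 9):
    print(m, distinguished_survival(m, 0.5))
```

Under these assumptions, a larger background makes the distinguished lineage coalesce sooner on average, which is one way to see why additional samples concentrate information in the recent past.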
If PSMC is a two-hand watch, SMC++ is a chronograph – a complication that adds sub-dials tracking multiple time measurements simultaneously. Each additional sample genome is another sub-dial, providing independent readings of the same demographic history. The distinguished lineage is the central seconds hand, and the undistinguished lineages sweep around their own sub-dials, all driven by the same escapement (the coalescent process under variable population size).
Primary Reference: Terhorst, Kamm & Song (2017).
The four gears of SMC++:
The Distinguished Lineage (the escapement) – The setup: one lineage is singled out, and its coalescence time \(T\) is tracked as a hidden variable. The remaining \(n - 1\) lineages form a demographic background that modifies the coalescence rate. This is where PSMC’s two-lineage framework generalizes to many.
The ODE System (the gear train) – A system of ordinary differential equations that tracks the probability \(p_j(t)\) that \(j\) undistinguished lineages remain at time \(t\). The matrix exponential of the rate matrix gives exact transition probabilities. This replaces PSMC’s simple exponential coalescence with a richer model.
The Continuous HMM (the mainspring) – A modified transition matrix built from the ODE rates, combined via composite likelihood across pairs of sites. Numerical optimization (gradient-based L-BFGS, or EM) estimates the piecewise-constant population size function \(\lambda(t)\). This is the inference engine.
Population Splits (a complication) – Cross-population analysis: modified ODEs that track lineage counts before and after a population split, enabling joint estimation of \(N_A(t)\), \(N_B(t)\), and the split time \(T_{\text{split}}\).
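The gear-train idea can be sketched in a few lines. The example below assumes one simple reading of \(p_j(t)\): the undistinguished lineages follow a pure-death process in which \(j\) lineages drop to \(j - 1\) at rate \(\binom{j}{2}/\lambda\), with \(\lambda\) constant inside each time interval. The helper names and the small pure-Python `expm` are illustrative; in practice a library routine such as `scipy.linalg.expm` would normally be used instead.

```python
import math

def death_rate_matrix(m, lam):
    """Generator Q over states j = 0..m (number of undistinguished lineages)."""
    Q = [[0.0] * (m + 1) for _ in range(m + 1)]
    for j in range(2, m + 1):
        rate = math.comb(j, 2) / lam    # any pair among j lineages coalesces
        Q[j][j] = -rate
        Q[j][j - 1] = rate
    return Q

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, terms=20):
    """Matrix exponential via scaling-and-squaring with a Taylor series."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    s = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2 ** s for x in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(s):                  # undo the scaling by repeated squaring
        result = matmul(result, result)
    return result

def lineage_count_probs(m, segments):
    """Return [p_0, ..., p_m] after a list of (duration, lam) segments.

    Starts from m undistinguished lineages; times are in coalescent units.
    """
    p = [0.0] * (m + 1)
    p[m] = 1.0
    for duration, lam in segments:
        Q = death_rate_matrix(m, lam)
        P = expm([[q * duration for q in row] for row in Q])
        p = [sum(p[i] * P[i][j] for i in range(m + 1)) for j in range(m + 1)]
    return p

# Two intervals with different (hypothetical) relative population sizes.
print(lineage_count_probs(4, [(0.1, 1.0), (0.5, 0.2)]))
```

Because \(\lambda\) is constant within each segment, each matrix exponential is exact for that segment, and chaining the segments handles a piecewise-constant history without a numerical ODE solver.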
These gears mesh together into a complete inference machine:
Multiple unphased diploid genomes
                |
                v
   +-------------------------+
   |  CHOOSE DISTINGUISHED   |
   |  LINEAGE                |
   |                         |
   |  Pair it with each of   |
   |  the n-1 undistinguished|
   |  lineages               |
   +-------------------------+
                |
                v
   +-------> SOLVE ODE SYSTEM
   |         p_j(t): probability j
   |         undistinguished lineages
   |         remain at time t
   |            |
   |            v
   |         BUILD HMM
   |         States: discretized T
   |         Emissions: P(data | T)
   |         Transitions: from ODE
   |            |
   |            v
   |         COMPOSITE LIKELIHOOD
   |         across all pairs
   |            |
   |            v
   |         OPTIMIZE (L-BFGS)
   |         update lambda_k
   |            |
   |         Converged?
   |      NO ---+
                |
               YES
                |
                v
   Output: lambda_0, ..., lambda_n
                |
                v
   Scale to real units: N(t)
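The last step of the diagram, scaling to real units, can be sketched as follows. This assumes the usual diploid coalescent convention carried over from PSMC: each \(\lambda_k\) is a size relative to a reference size \(N_0\), and time is measured in units of \(2N_0\) generations. The function name, the `n0` and `gen_time` parameters, and the numbers in the usage example are hypothetical.

```python
def scale_to_real_units(lambdas, time_breaks, n0, gen_time=1.0):
    """Convert coalescent-scaled estimates to real units.

    Assumptions of this sketch:
      lambdas[k]     relative population size in interval k (N_k / N0)
      time_breaks[k] interval boundary in units of 2 * N0 generations
      gen_time       years per generation (1.0 keeps times in generations)
    """
    sizes = [lam * n0 for lam in lambdas]
    times = [2.0 * n0 * t * gen_time for t in time_breaks]
    return times, sizes

# Hypothetical values, for illustration only.
times, sizes = scale_to_real_units(
    lambdas=[1.0, 0.5, 2.0],
    time_breaks=[0.0, 0.25, 1.0],
    n0=10_000,
    gen_time=25,
)
for t, n in zip(times, sizes):
    print(f"from {t:,.0f} years ago: N = {n:,.0f}")
```

Keeping the optimizer in scaled units and converting only at the end means the choice of \(N_0\) and generation time can be changed without rerunning inference.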
Prerequisites for this Timepiece
SMC++ builds directly on PSMC. Before starting, you should have worked through:
PSMC – the transition density, discretization, and HMM inference for two sequences. SMC++ generalizes every gear in PSMC.
Coalescent Theory – coalescence rates with multiple lineages, and the relationship between population size and coalescence time.
The SMC – the sequential Markov coalescent approximation.
If you have built PSMC, you have most of the tools you need. SMC++ adds the multi-lineage generalization, but the underlying mathematical framework is the same.
Chapters
Each chapter derives the math, explains the intuition, implements the code, and verifies it works. By the end, you’ll have built a complete multi-sample demographic inference engine – and you’ll see how PSMC’s simple watch grows into a chronograph.