Algorithm
A deep dive into how ancify infers ancestral alleles. This page teaches you the algorithm from first principles, with worked examples at every step.
Overview
ancify infers the ancestral allele at every position in a focal species’ reference genome using a two-tier outgroup voting scheme:
INPUT PROCESSING OUTPUT
───── ────────── ──────
Net AXT files ┌──────────────────────┐
(one per outgroup │ Phase 1: PROJECT │ Projected FASTAs
species) ────▶│ Map outgroup bases │───▶ (one per species
│ to focal coords │ per chromosome)
└──────────────────────┘
│
▼
┌──────────────────────┐
│ Phase 2: CALL │ Ancestral FASTAs
────▶│ Two-tier voting │───▶ (one per chromosome,
│ + confidence encode │ confidence-encoded)
└──────────────────────┘
Phase 1: Coordinate Projection
Before ancestral states can be called, each outgroup’s pairwise alignment must be converted into a sequence in the focal species’ coordinate system. This is the most computationally expensive step.
Input: Net AXT alignments
The pipeline reads net AXT pairwise alignment files from UCSC. These represent best-in-genome one-to-one alignments between the focal and outgroup species.
Each alignment block in an AXT file consists of four lines:
Line 1: header 0 chr1 100 110 chrQ 500 510 + 1000
Line 2: target ACGTNNACGT ← focal species' sequence
Line 3: query ACGTAAACGT ← outgroup species' sequence
Line 4: ← blank separator
Why net alignments? Raw pairwise alignments can have overlapping blocks (e.g. from segmental duplications). The “netting” algorithm resolves these overlaps by keeping only the highest-scoring chain at each position, producing clean one-to-one mappings.
How projection works
For each alignment block, the algorithm walks through the focal sequence character by character:
Focal (target): A C G - T N N A C G T
Outgroup (query): A C G T T A A A C G T
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Projected: A C G T A A A C G T
Rules:
- Focal is a base → record the outgroup's base at that focal position
- Focal is a gap (-) → skip (insertion in outgroup, no focal coordinate)
- Position not covered by any block → fill with N
The result is a FASTA file for each (outgroup, chromosome) pair, with the same length as the focal chromosome. Every position either contains the outgroup’s aligned base or N (no alignment).
Worked example
Suppose the focal chromosome is 20 bases long and we have two alignment blocks:
Focal chromosome: positions 1-20 (20 bases)
Block A (positions 3-8):
Focal: A C G T A C
Outgroup: A C A T A C
Block B (positions 15-18):
Focal: T G A C
Outgroup: T G A T
Projected sequence:
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Result: N N A C A T A C N N N N N N T G A T N N
└── Block A ──┘ └─ Block B ┘
Positions 1-2, 9-14, and 19-20 have no alignment coverage and are filled with N.
Phase 2: Ancestral State Inference
This is the core algorithm. At every genomic position, it collects the projected bases from all outgroups, computes votes, and produces a confidence-encoded ancestral call.
Step 1: Majority vote
For a set of bases from multiple species, the majority vote selects the most frequent valid nucleotide (A, C, G, or T):
Example: inner outgroups at position chr1:1000
Bonobo: A
Chimp: A
Gorilla: G
Tally: A=2, G=1
Majority vote → A (frequency 2)
Tie-breaking: Ties are broken alphabetically (A > C > G > T). This is arbitrary but deterministic.
Filtering: Bases that are N, -, or other non-nucleotide characters are excluded from the vote. If no base reaches the minimum frequency threshold (min_inner_freq or min_outer_freq), the consensus is N (missing).
Example with min_inner_freq=2:
Bonobo: A
Chimp: N ← excluded (missing)
Gorilla: G
Tally: A=1, G=1
Neither reaches threshold of 2 → consensus is N
Step 2: Compare inner and outer
Once the inner consensus and outer consensus are computed, the algorithm compares them:
┌─────────┬─────────┬─────────────┬─────────────────────────────────┐
│ Inner │ Outer │ Output │ Interpretation │
├─────────┼─────────┼─────────────┼─────────────────────────────────┤
│ A │ A │ A (HIGH) │ Both tiers agree → confident │
│ A │ N │ a (low) │ Only inner has data │
│ N │ A │ a (low) │ Only outer has data │
│ A │ T │ n (none) │ Tiers disagree → unresolved │
│ N │ N │ N (none) │ No data at all │
└─────────┴─────────┴─────────────┴─────────────────────────────────┘
The complete algorithm as pseudocode
def call_ancestral_base(inner_bases, outer_bases,
min_inner_freq=1, min_outer_freq=1):
inner = majority_vote(inner_bases, min_freq=min_inner_freq)
outer = majority_vote(outer_bases, min_freq=min_outer_freq)
if inner != "N" and inner == outer:
return inner # HIGH confidence (uppercase)
if inner == "N" and outer != "N":
return outer.lower() # LOW confidence (lowercase)
if inner != "N" and outer == "N":
return inner.lower() # LOW confidence (lowercase)
if inner == "N" and outer == "N":
return "N" # Both missing
return "n" # Disagreement (unresolved)
Worked example: full pipeline for one position
Let us trace position chr1:50,000 through the entire algorithm:
Step 1: Collect projected bases at chr1:50,000
Bonobo (inner): T
Chimp (inner): T
Gorilla (inner): T
Macaque (outer): T
Step 2: Inner majority vote
Tally: T=3
min_inner_freq=1, so T qualifies
Inner consensus: T
Step 3: Outer majority vote
Tally: T=1
min_outer_freq=1, so T qualifies
Outer consensus: T
Step 4: Compare
Inner=T, Outer=T → they agree
Output: "T" (uppercase = high confidence)
Now consider a more interesting position, chr1:75,000:
Step 1: Collect projected bases at chr1:75,000
Bonobo (inner): A
Chimp (inner): A
Gorilla (inner): G
Macaque (outer): G
Step 2: Inner majority vote
Tally: A=2, G=1
Inner consensus: A (majority)
Step 3: Outer majority vote
Tally: G=1
Outer consensus: G
Step 4: Compare
Inner=A, Outer=G → DISAGREEMENT
Output: "n" (unresolved)
This pattern (inner says A, outer says G) is consistent with incomplete lineage sorting in the human-gorilla ancestor, where gorilla and macaque inherited one allele and bonobo+chimp inherited another.
Confidence encoding
The output FASTA encodes confidence via letter case. This is compatible with the Ensembl EPO ancestral sequence convention:
Character |
Confidence |
Condition |
Typical fraction (human) |
|---|---|---|---|
|
High |
Inner and outer outgroups agree |
~75% |
|
Low |
Only one outgroup tier has data |
~15% |
|
Unresolved |
Inner and outer disagree |
~0.1% |
|
Missing |
Both tiers lack data |
~10% |
This encoding means that a single FASTA file carries both the ancestral call and its reliability. No sidecar files or separate quality tracks are needed.
Biological rationale
Why two tiers instead of one big vote?
Consider what happens with a single flat vote across all outgroups:
Single-tier vote (all 4 species equally weighted):
Bonobo: A ┐
Chimp: A ├─ These 3 are NOT independent observations.
Gorilla: A ┘ Due to ILS, they may all inherit the same
Macaque: G wrong allele from the ancestral population.
Flat vote: A wins 3-to-1.
But what if A is the derived allele, shared through ILS?
The two-tier approach treats the inner outgroups as a single correlated observation and demands independent confirmation from a distant outgroup:
Two-tier vote:
Inner consensus: A (3/3)
Outer consensus: G (1/1)
→ DISAGREEMENT → flagged as "n"
This catches the ILS case!
Incomplete lineage sorting (ILS)
When ancestral populations were large and speciation events were close together, gene trees can disagree with the species tree:
Species tree: Gene tree at this locus:
┌── Bonobo ┌── Bonobo
┌──┤ ┌──┤
│ └── Chimp │ └── Gorilla ← ILS!
──┤ ──┤
│ ┌── Gorilla │ ┌── Chimp
└──┤ └──┤
└── Macaque └── Macaque
For the human-chimp-gorilla clade, ILS affects ~1-3% of the genome. The two-tier approach detects most of these cases because the outer outgroup (macaque) provides a phylogenetically independent anchor.
When the algorithm can still fail
No method is perfect. The algorithm can produce incorrect calls in rare cases:
Failure mode |
Frequency |
Mechanism |
|---|---|---|
Convergent substitution |
<0.01% |
Same mutation on both inner and outer branches |
Correlated ILS |
<0.1% |
All inner species AND the outer share the derived allele through deep coalescence |
Systematic alignment error |
Variable |
Paralog confusion in segmental duplications |
These contribute to the ~0.1% disagreement rate observed when comparing ancify’s human calls to the Ensembl EPO 13-primate reference.
The min_inner_freq parameter in depth
This parameter lets you trade coverage for accuracy:
3 inner species, varying min_inner_freq:
┌───────────────────┬───────────┬──────────┬──────────────────────┐
│ min_inner_freq │ Rule │ Coverage │ Accuracy │
├───────────────────┼───────────┼──────────┼──────────────────────┤
│ 1 │ Any 1 │ Highest │ Lowest (but still │
│ │ suffices │ │ >99% for primates) │
│ 2 │ 2 of 3 │ Moderate │ Good │
│ │ agree │ │ │
│ 3 │ All 3 │ Lowest │ Highest │
│ │ agree │ │ │
└───────────────────┴───────────┴──────────┴──────────────────────┘
Example: With min_inner_freq=2 and bases [A, N, A]:
Tally: A=2, N excluded
Threshold met (2 >= 2): consensus is A
With min_inner_freq=2 and bases [A, N, G]:
Tally: A=1, G=1
Neither reaches threshold: consensus is N (missing)
The same logic applies to min_outer_freq for the outer tier. Since there is typically only one outer outgroup species, min_outer_freq is usually left at 1.
Alignment quality and its effects
The pipeline’s accuracy depends fundamentally on the quality of the input alignments:
Good alignment region Poor alignment region
(unique sequence) (segmental duplication)
Focal: ACGTACGT Focal: ACGTACGT
Outgroup: ACGTACGT Outgroup: TGCATGCA ← paralog!
^^^^^^^^ ^^^^^^^^
Correct projected Incorrect projected
bases bases → bad ancestral call
Net alignments mitigate paralog confusion by keeping only the best one-to-one chain at each position, but regions with recent segmental duplications, transposable elements, or structural variants can still produce artifacts. These typically manifest as N (no alignment) or disagreements (n).
Alternative method: Fitch parsimony
Since version 1.2.0 ancify supports a second inference method: Fitch parsimony (Fitch, 1971). Instead of splitting outgroups into two hand-picked tiers and majority-voting within each, this method uses the actual phylogenetic tree of the outgroup species to reconstruct the most parsimonious ancestral allele at the root.
When to prefer parsimony
You have three or more outgroups at varying phylogenetic distances.
You want the tree topology to determine how species are weighted, rather than manually choosing inner/outer groups.
Resolving cases that the two-tier method marks as “unresolved” (
n) matters to you. In many of those cases, parsimony can find a unique most-parsimonious solution by leveraging the tree structure.
The Fitch algorithm
The algorithm runs on a rooted phylogenetic tree whose leaves are the outgroup species. At every genomic position it performs two passes:
Pass 1: Bottom-up (post-order traversal)
Starting at the leaves, work toward the root. Each leaf gets a set containing its observed allele (or {A,C,G,T} if missing). At each internal node, compute:
Intersection of children’s sets, if non-empty
Union of children’s sets, otherwise
The intersection case means all children are compatible; no mutation is needed on the branch. The union case means at least one mutation is required.
Example: (((bonobo,chimp),gorilla),macaque)
Observed: bonobo=G, chimp=G, gorilla=A, macaque=A
Bottom-up:
bonobo → {G} chimp → {G}
(bonobo,chimp) → {G} ∩ {G} = {G}
gorilla → {A}
((bc),gorilla) → {G} ∩ {A} = ∅ → {G} ∪ {A} = {A,G}
macaque → {A}
root → {A,G} ∩ {A} = {A}
Pass 2: Top-down (pre-order traversal)
Starting at the root, assign a concrete allele from the node’s set. Prefer the parent’s allele if it is in the child’s set (minimizes mutations). Ties are broken alphabetically for determinism.
Continuing the example:
root → {A} → picks A
(bc,gorilla) → {A,G} → parent is A, A ∈ set → picks A
gorilla → {A} → A
(bc) → {G} → parent is A, A ∉ set → picks G
bonobo → {G} → G
chimp → {G} → G
Result: A is ancestral at the root (1 mutation: A→G on the bonobo-chimp branch)
Handling missing data
When a leaf has no alignment data (N), it is assigned the full set {A,C,G,T}. This makes it compatible with any allele, so missing data never forces a mutation. If all leaves are missing, the root set remains {A,C,G,T} and the position is reported as N (missing).
Confidence encoding (parsimony)
The parsimony method maps to the same case-based confidence encoding as the voting method:
Character |
Confidence |
Condition |
|---|---|---|
|
High |
Root Fitch set has exactly one allele (unambiguous) |
|
Low |
Root Fitch set has multiple alleles (ambiguous; one is chosen) |
|
Missing |
All outgroup leaves are |
Note that there is no n (unresolved) category for parsimony — the algorithm always produces a call unless all data is missing.
Comparison: voting vs. parsimony
Position with: bonobo=G, chimp=G, gorilla=A, macaque=A
Two-tier voting:
Inner consensus (bonobo, chimp, gorilla) → majority = G
Outer consensus (macaque) → A
Inner ≠ Outer → "n" (UNRESOLVED)
Fitch parsimony on (((bonobo,chimp),gorilla),macaque):
Root = A (one mutation needed: A→G on the bonobo-chimp branch)
→ "A" (HIGH CONFIDENCE)
The tree structure lets parsimony recognize that the G allele is a shared derived change in the bonobo-chimp clade, while A is the ancestral state supported by both gorilla and macaque.
Configuration
To use parsimony, add two fields to your YAML config:
method: parsimony
tree: "(((bonobo,chimp),gorilla),macaque)"
The tree field accepts either an inline Newick string or a path to a .nwk file. Leaf names must match outgroup name fields in your config. See Configuration Reference for the complete reference.
Alternative method: Likelihood (Felsenstein pruning)
Since version 1.4.0 ancify supports a fourth inference method: likelihood-based ancestral reconstruction using Felsenstein’s pruning algorithm (Felsenstein, 1981). Instead of counting alleles (voting) or minimising mutations (parsimony), this method uses an explicit substitution model and branch lengths to compute the posterior probability of each ancestral base at the root of the phylogenetic tree.
When to prefer likelihood
You have a phylogenetic tree with branch lengths (in expected substitutions per site).
You want posterior probabilities rather than the binary confident/ambiguous split of parsimony.
You want the reconstruction to account for the fact that long branches accumulate more substitutions than short branches.
You want to use a specific substitution model that reflects known biology (e.g. transition/transversion bias via K80 or HKY85).
Substitution models
All models define a 4×4 instantaneous rate matrix Q, normalised so that one unit of branch length corresponds to one expected substitution:
┌───────────┬────────────────────────────────────────────────────────────────┐
│ Model │ Description │
├───────────┼────────────────────────────────────────────────────────────────┤
│ JC69 │ Jukes–Cantor (1969). All substitutions equally likely. │
│ │ No free parameters. │
├───────────┼────────────────────────────────────────────────────────────────┤
│ K80 │ Kimura (1980) two-parameter model. Distinguishes transitions │
│ │ (A↔G, C↔T) from transversions. One parameter: kappa (κ). │
├───────────┼────────────────────────────────────────────────────────────────┤
│ HKY85 │ Hasegawa–Kishino–Yano (1985). Like K80 but with non-uniform │
│ │ equilibrium base frequencies. Parameters: κ + π = [A, C, G, T].│
├───────────┼────────────────────────────────────────────────────────────────┤
│ GTR │ General time-reversible. Six exchangeability rates + base │
│ │ frequencies. The most flexible reversible model. │
└───────────┴────────────────────────────────────────────────────────────────┘
The transition probability matrix P(t) = expm(Q·t) gives the probability of observing base j at the end of a branch of length t, given base i at the start.
The Felsenstein pruning algorithm
At every genomic position the algorithm performs a single bottom-up pass over the tree:
Step 1: Initialise leaves
Each leaf gets a conditional likelihood vector of length 4. If the observed base is (say) A, the vector is [1, 0, 0, 0]. If the leaf is missing (N), the vector is [1, 1, 1, 1] (compatible with any base).
Step 2: Bottom-up (post-order traversal)
At each internal node with children c₁, c₂, …, compute:
L(node, i) = ∏_c [ Σ_j P_c(i,j) · L(c, j) ]
where P_c is the transition matrix for the branch leading to child c. This multiplies together each child’s contribution: the sum over all possible child states j of the probability of transitioning from parent state i to j, weighted by the child’s likelihood of state j.
Step 3: Root posterior
Multiply the root’s conditional likelihoods by the prior (equilibrium base frequencies π) and normalise:
posterior(i) = π_i · L(root, i) / Σ_j π_j · L(root, j)
Step 4: Call the ancestral base
The base with the highest posterior probability is selected. Confidence encoding:
┌────────────────────────────────────────┬──────────────┬───────────────────┐
│ Condition │ Output │ Interpretation │
├────────────────────────────────────────┼──────────────┼───────────────────┤
│ max posterior ≥ high_threshold (0.8) │ ACGT (upper) │ High confidence │
│ max posterior ≥ low_threshold (0.5) │ acgt (lower) │ Low confidence │
│ max posterior < low_threshold │ n │ Unresolved │
│ All leaves missing │ N │ No data │
└────────────────────────────────────────┴──────────────┴───────────────────┘
Worked example
Tree: (((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)
Model: JC69
Observed: bonobo=G, chimp=G, gorilla=A, macaque=A
Step 1: Leaf likelihoods
bonobo → [0, 0, 1, 0] (G)
chimp → [0, 0, 1, 0] (G)
gorilla → [1, 0, 0, 0] (A)
macaque → [1, 0, 0, 0] (A)
Step 2: Bottom-up
At (bonobo,chimp) node, branch lengths 0.008:
P(0.008) ≈ diag(0.9894) + off-diag(0.0035)
Both children are G → parent L is dominated by G but with some
probability of A, C, T through the short branches.
At ((bc),gorilla) node:
Left child favours G, right child (gorilla) favours A.
The branch to gorilla (0.009) is short → A signal is strong.
Combined: both A and G have substantial likelihood.
At root:
Left child favours A/G mix, right child (macaque, branch 0.038) favours A.
A has the highest likelihood.
Step 3: Posterior
π = [0.25, 0.25, 0.25, 0.25] (JC69 has uniform frequencies)
posterior(A) ≈ 0.84, posterior(G) ≈ 0.15, others ≈ 0.005
Step 4: max posterior = 0.84 ≥ 0.8 → "A" (high confidence)
Compare this to the parsimony result (also A) and the voting result (n, unresolved). Likelihood agrees with parsimony here, but additionally provides a calibrated probability (84%) that quantifies how confident we should be.
Comparison: voting vs. parsimony vs. likelihood vs. ML
Position with: bonobo=G, chimp=G, gorilla=A, macaque=A
Two-tier voting:
Inner consensus = G, Outer consensus = A → DISAGREEMENT → "n"
Fitch parsimony:
Root = A (one mutation: A→G on bonobo-chimp branch) → "A"
Likelihood (JC69):
Root posterior: A ≈ 0.84 → "A" (high confidence, with probability)
ML classifier:
Sees outgroup bases + local context → may also predict A with calibrated p
Configuration
method: likelihood
tree: "(((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)"
substitution_model: HKY85
model_kappa: 2.0
model_base_freqs: [0.3, 0.2, 0.2, 0.3]
likelihood_high_threshold: 0.8
likelihood_low_threshold: 0.5
Branch lengths in the Newick tree are critical for likelihood — they determine how much weight each leaf’s observation carries. Trees estimated by phylogenetic methods (e.g. RAxML, IQ-TREE) typically provide branch lengths in substitutions per site, which is exactly what the models expect.
See Configuration Reference for the complete parameter reference.
Alternative method: Machine learning classifier
Since version 2.0.0 ancify supports a third inference method: a LightGBM gradient-boosted classifier that learns substitution patterns from data rather than relying on hand-crafted rules.
When to prefer ML
You have an external reference ancestral sequence (e.g. Ensembl EPO) that can serve as high-quality training labels.
You want the model to learn substitution biases (transition/transversion asymmetry, CpG hypermutation) and local sequence context automatically.
You want calibrated probability scores rather than the binary high/low confidence split of voting, or the ambiguous/unambiguous split of parsimony.
Resolving the ~0.1% of sites that voting marks as “unresolved” (
n) matters to you.
Feature engineering
At every genomic position, the classifier receives a feature vector built from the projected outgroup sequences:
Feature Description Count
────────────────────── ────────────────────────────────────────────── ─────
Outgroup bases Integer-encoded base per outgroup (0-3, -1=N) K
Data count Number of outgroups with valid data 1
Agreement ratio Fraction of valid outgroups agreeing w/ majority 1
Inner consensus Majority base among inner outgroups (0-4) 1
Outer consensus Majority base among outer outgroups (0-4) 1
Voting agreement Whether inner == outer (binary) 1
GC content GC fraction in ±50 bp window 1
CpG flag Whether site is part of a CpG dinucleotide 1
Flanking bases Bases immediately upstream/downstream 2
──────────────────────────────────────────────────────────────────────────
Total K + 9
The first K features capture raw outgroup data. The next five features encode the same information the voting method uses, letting the model learn when voting is reliable. The final four context features let the model learn position-dependent substitution biases (e.g. elevated C→T rates at CpG sites).
Training workflow
Training is a separate step from inference:
# Train a model using high-confidence voting sites (self-supervised):
ancify train -c config.yaml -o model.lgb
# Or, with an external reference for training labels:
# (set ml_training_reference in config.yaml)
ancify train -c config.yaml -o model.lgb
Two label sources are supported:
Self-supervised (default): Positions where the voting method’s inner and outer tiers agree (uppercase output) serve as training labels. No external data is needed.
Reference-supervised: An external ancestral reference (e.g. Ensembl EPO) provides ground-truth labels via the
ml_training_referenceconfig field. This is preferred when available.
Training subsamples up to ~5 million sites per chromosome (stratified by allele class) to keep training fast (<5 minutes for a full genome).
Confidence calibration
The model’s predicted class probability maps to confidence levels:
┌────────────────────────────────┬──────────────┬─────────────────────────────┐
│ Condition │ Output │ Interpretation │
├────────────────────────────────┼──────────────┼─────────────────────────────┤
│ p >= high_threshold (def 0.8) │ ACGT (upper) │ High confidence │
│ p >= low_threshold (def 0.5) │ acgt (lower) │ Low confidence │
│ p < low_threshold │ n │ Unresolved │
│ All outgroups missing │ N │ No data │
└────────────────────────────────┴──────────────┴─────────────────────────────┘
This is more informative than the binary high/low split of voting or parsimony, because the probability thresholds can be tuned to trade precision for recall.
Comparison: voting vs. parsimony vs. ML
Position with: bonobo=G, chimp=G, gorilla=A, macaque=A
Two-tier voting:
Inner consensus = G, Outer consensus = A → DISAGREEMENT → "n"
Fitch parsimony:
Root = A (one mutation: A→G on bonobo-chimp branch) → "A"
ML classifier:
Sees outgroup bases + local context (CpG site, high GC region)
Predicts A with probability 0.87 → "A" (high confidence)
The ML method can resolve cases like this by learning that the G→A pattern at a CpG site in a GC-rich region is more consistent with ancestral A than derived A, using patterns it has learned from the training data.
Why LightGBM?
Tabular data: The per-position feature space is naturally tabular, not sequential. Gradient-boosted trees outperform neural networks on tabular data in most benchmarks.
Native missing-value handling: LightGBM handles
NaN/ missing features natively, which is critical for positions where some outgroups lack alignment data.Speed: Training on ~5M sites takes <5 minutes; inference on a full genome takes minutes.
Interpretable: Feature importance scores show which outgroups and context features drive predictions, which is valuable for a scientific tool.
Summary
Two-tier voting (default)
For each position in the focal genome:
1. Gather projected bases from all outgroup species
2. Compute inner consensus (majority vote among close relatives)
3. Compute outer consensus (majority vote among distant relatives)
4. Compare:
agree → UPPERCASE (high confidence)
one N → lowercase (low confidence)
differ → n (unresolved)
both N → N (missing)
5. Write to output FASTA
Fitch parsimony
For each position in the focal genome:
1. Gather projected bases from all outgroup species
2. Run Fitch bottom-up pass on the phylogenetic tree
3. Run Fitch top-down pass to assign the root allele
4. Encode confidence:
root set size = 1 → UPPERCASE (high confidence)
root set size > 1 → lowercase (low confidence)
all missing → N
5. Write to output FASTA
Likelihood (Felsenstein pruning)
For each position in the focal genome:
1. Gather projected bases from all outgroup species
2. Run Felsenstein bottom-up pass on the tree with branch lengths
using the chosen substitution model (JC69/K80/HKY85/GTR)
3. Compute root posterior: π · L(root) / Σ
4. Encode confidence:
max posterior >= high_threshold → UPPERCASE (high confidence)
max posterior >= low_threshold → lowercase (low confidence)
max posterior < low_threshold → n (unresolved)
all missing → N
5. Write to output FASTA
ML classifier
Prerequisite: train a model with `ancify train -c config.yaml`
For each position in the focal genome:
1. Gather projected bases from all outgroup species
2. Extract feature vector (outgroup bases, voting consensus,
agreement, GC content, CpG flag, flanking bases)
3. Run LightGBM inference → predicted class + probability
4. Encode confidence:
p >= high_threshold → UPPERCASE (high confidence)
p >= low_threshold → lowercase (low confidence)
p < low_threshold → n (unresolved)
all missing → N
5. Write to output FASTA
All four methods are fast (Phase 2 takes minutes for a full genome) and produce reliable ancestral calls with built-in confidence assessment. Voting is the simplest and best-tested; parsimony is more principled for complex outgroup configurations; likelihood adds calibrated posterior probabilities using substitution models; ML can learn substitution biases and local context when training data is available.