Algorithm

A deep dive into how ancify infers ancestral alleles. This page teaches you the algorithm from first principles, with worked examples at every step.

Overview

ancify infers the ancestral allele at every position in a focal species’ reference genome using a two-tier outgroup voting scheme:

  INPUT                           PROCESSING                        OUTPUT
  ─────                           ──────────                        ──────

  Net AXT files        ┌──────────────────────┐
  (one per outgroup    │  Phase 1: PROJECT    │    Projected FASTAs
   species)       ────▶│  Map outgroup bases  │───▶ (one per species
                       │  to focal coords     │     per chromosome)
                       └──────────────────────┘
                                                         │
                                                         ▼
                       ┌──────────────────────┐
                       │  Phase 2: CALL       │    Ancestral FASTAs
                  ────▶│  Two-tier voting     │───▶ (one per chromosome,
                       │  + confidence encode │     confidence-encoded)
                       └──────────────────────┘

Phase 1: Coordinate Projection

Before ancestral states can be called, each outgroup’s pairwise alignment must be converted into a sequence in the focal species’ coordinate system. This is the most computationally expensive step.

Input: Net AXT alignments

The pipeline reads net AXT pairwise alignment files from UCSC. These represent best-in-genome one-to-one alignments between the focal and outgroup species.

Each alignment block in an AXT file consists of four lines:

  Line 1: header    0 chr1 100 110 chrQ 500 510 + 1000
  Line 2: target    ACGTNNACGT       ← focal species' sequence
  Line 3: query     ACGTAAACGT       ← outgroup species' sequence
  Line 4:                            ← blank separator

Why net alignments? Raw pairwise alignments can have overlapping blocks (e.g. from segmental duplications). The “netting” algorithm resolves these overlaps by keeping only the highest-scoring chain at each position, producing clean one-to-one mappings.

How projection works

For each alignment block, the algorithm walks through the focal sequence character by character:

  Focal (target):   A  C  G  -  T  N  N  A  C  G  T
  Outgroup (query): A  C  G  T  T  A  A  A  C  G  T
                    ↓  ↓  ↓     ↓  ↓  ↓  ↓  ↓  ↓  ↓
  Projected:        A  C  G     T  A  A  A  C  G  T

  Rules:
    - Focal is a base → record the outgroup's base at that focal position
    - Focal is a gap (-) → skip (insertion in outgroup, no focal coordinate)
    - Position not covered by any block → fill with N

The result is a FASTA file for each (outgroup, chromosome) pair, with the same length as the focal chromosome. Every position either contains the outgroup’s aligned base or N (no alignment).

Worked example

Suppose the focal chromosome is 20 bases long and we have two alignment blocks:

  Focal chromosome:    positions 1-20 (20 bases)

  Block A (positions 3-8):
    Focal:    A C G T A C
    Outgroup: A C A T A C

  Block B (positions 15-18):
    Focal:    T G A C
    Outgroup: T G A T

  Projected sequence:
    Position:  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
    Result:    N  N  A  C  A  T  A  C  N  N  N  N  N  N  T  G  A  T  N  N
                     └── Block A ──┘                     └─ Block B ┘

Positions 1-2, 9-14, and 19-20 have no alignment coverage and are filled with N.

Phase 2: Ancestral State Inference

This is the core algorithm. At every genomic position, it collects the projected bases from all outgroups, computes votes, and produces a confidence-encoded ancestral call.

Step 1: Majority vote

For a set of bases from multiple species, the majority vote selects the most frequent valid nucleotide (A, C, G, or T):

  Example: inner outgroups at position chr1:1000

    Bonobo:   A
    Chimp:    A
    Gorilla:  G

    Tally:  A=2, G=1
    Majority vote → A (frequency 2)

Tie-breaking: Ties are broken alphabetically (A > C > G > T). This is arbitrary but deterministic.

Filtering: Bases that are N, -, or other non-nucleotide characters are excluded from the vote. If no base reaches the minimum frequency threshold (min_inner_freq or min_outer_freq), the consensus is N (missing).

  Example with min_inner_freq=2:

    Bonobo:   A
    Chimp:    N       ← excluded (missing)
    Gorilla:  G

    Tally:  A=1, G=1
    Neither reaches threshold of 2 → consensus is N

Step 2: Compare inner and outer

Once the inner consensus and outer consensus are computed, the algorithm compares them:

  ┌─────────┬─────────┬─────────────┬─────────────────────────────────┐
  │  Inner  │  Outer  │   Output    │  Interpretation                 │
  ├─────────┼─────────┼─────────────┼─────────────────────────────────┤
  │    A    │    A    │   A (HIGH)  │  Both tiers agree → confident   │
  │    A    │    N    │   a (low)   │  Only inner has data            │
  │    N    │    A    │   a (low)   │  Only outer has data            │
  │    A    │    T    │   n (none)  │  Tiers disagree → unresolved    │
  │    N    │    N    │   N (none)  │  No data at all                 │
  └─────────┴─────────┴─────────────┴─────────────────────────────────┘

The complete algorithm as pseudocode

def call_ancestral_base(inner_bases, outer_bases,
                        min_inner_freq=1, min_outer_freq=1):
    inner = majority_vote(inner_bases, min_freq=min_inner_freq)
    outer = majority_vote(outer_bases, min_freq=min_outer_freq)

    if inner != "N" and inner == outer:
        return inner              # HIGH confidence (uppercase)
    if inner == "N" and outer != "N":
        return outer.lower()      # LOW confidence (lowercase)
    if inner != "N" and outer == "N":
        return inner.lower()      # LOW confidence (lowercase)
    if inner == "N" and outer == "N":
        return "N"                # Both missing
    return "n"                    # Disagreement (unresolved)

Worked example: full pipeline for one position

Let us trace position chr1:50,000 through the entire algorithm:

  Step 1: Collect projected bases at chr1:50,000

    Bonobo (inner):    T
    Chimp (inner):     T
    Gorilla (inner):   T
    Macaque (outer):   T

  Step 2: Inner majority vote

    Tally: T=3
    min_inner_freq=1, so T qualifies
    Inner consensus: T

  Step 3: Outer majority vote

    Tally: T=1
    min_outer_freq=1, so T qualifies
    Outer consensus: T

  Step 4: Compare

    Inner=T, Outer=T → they agree
    Output: "T" (uppercase = high confidence)

Now consider a more interesting position, chr1:75,000:

  Step 1: Collect projected bases at chr1:75,000

    Bonobo (inner):    A
    Chimp (inner):     A
    Gorilla (inner):   G
    Macaque (outer):   G

  Step 2: Inner majority vote

    Tally: A=2, G=1
    Inner consensus: A (majority)

  Step 3: Outer majority vote

    Tally: G=1
    Outer consensus: G

  Step 4: Compare

    Inner=A, Outer=G → DISAGREEMENT
    Output: "n" (unresolved)

This pattern (inner says A, outer says G) is consistent with incomplete lineage sorting in the human-gorilla ancestor, where gorilla and macaque inherited one allele and bonobo+chimp inherited another.

Confidence encoding

The output FASTA encodes confidence via letter case. This is compatible with the Ensembl EPO ancestral sequence convention:

Character	Confidence	Condition	Typical fraction (human)
`ACGT`	High	Inner and outer outgroups agree	~75%
`acgt`	Low	Only one outgroup tier has data	~15%
`n`	Unresolved	Inner and outer disagree	~0.1%
`N`	Missing	Both tiers lack data	~10%

This encoding means that a single FASTA file carries both the ancestral call and its reliability. No sidecar files or separate quality tracks are needed.

Biological rationale

Why two tiers instead of one big vote?

Consider what happens with a single flat vote across all outgroups:

  Single-tier vote (all 4 species equally weighted):

    Bonobo:   A     ┐
    Chimp:    A     ├─ These 3 are NOT independent observations.
    Gorilla:  A     ┘  Due to ILS, they may all inherit the same
    Macaque:  G        wrong allele from the ancestral population.

    Flat vote: A wins 3-to-1.
    But what if A is the derived allele, shared through ILS?

The two-tier approach treats the inner outgroups as a single correlated observation and demands independent confirmation from a distant outgroup:

  Two-tier vote:
    Inner consensus: A (3/3)
    Outer consensus: G (1/1)
    → DISAGREEMENT → flagged as "n"

  This catches the ILS case!

Incomplete lineage sorting (ILS)

When ancestral populations were large and speciation events were close together, gene trees can disagree with the species tree:

  Species tree:                  Gene tree at this locus:

       ┌── Bonobo                     ┌── Bonobo
    ┌──┤                           ┌──┤
    │  └── Chimp                   │  └── Gorilla     ← ILS!
  ──┤                            ──┤
    │  ┌── Gorilla                 │  ┌── Chimp
    └──┤                           └──┤
       └── Macaque                    └── Macaque

For the human-chimp-gorilla clade, ILS affects ~1-3% of the genome. The two-tier approach detects most of these cases because the outer outgroup (macaque) provides a phylogenetically independent anchor.

When the algorithm can still fail

No method is perfect. The algorithm can produce incorrect calls in rare cases:

Failure mode	Frequency	Mechanism
Convergent substitution	<0.01%	Same mutation on both inner and outer branches
Correlated ILS	<0.1%	All inner species AND the outer share the derived allele through deep coalescence
Systematic alignment error	Variable	Paralog confusion in segmental duplications

These contribute to the ~0.1% disagreement rate observed when comparing ancify’s human calls to the Ensembl EPO 13-primate reference.

The `min_inner_freq` parameter in depth

This parameter lets you trade coverage for accuracy:

  3 inner species, varying min_inner_freq:

  ┌───────────────────┬───────────┬──────────┬──────────────────────┐
  │ min_inner_freq    │ Rule      │ Coverage │ Accuracy             │
  ├───────────────────┼───────────┼──────────┼──────────────────────┤
  │ 1                 │ Any 1     │ Highest  │ Lowest (but still    │
  │                   │ suffices  │          │ >99% for primates)   │
  │ 2                 │ 2 of 3    │ Moderate │ Good                 │
  │                   │ agree     │          │                      │
  │ 3                 │ All 3     │ Lowest   │ Highest              │
  │                   │ agree     │          │                      │
  └───────────────────┴───────────┴──────────┴──────────────────────┘

Example: With min_inner_freq=2 and bases [A, N, A]:

Tally: A=2, N excluded
Threshold met (2 >= 2): consensus is A

With min_inner_freq=2 and bases [A, N, G]:

Tally: A=1, G=1
Neither reaches threshold: consensus is N (missing)

The same logic applies to min_outer_freq for the outer tier. Since there is typically only one outer outgroup species, min_outer_freq is usually left at 1.

Alignment quality and its effects

The pipeline’s accuracy depends fundamentally on the quality of the input alignments:

  Good alignment region        Poor alignment region
  (unique sequence)            (segmental duplication)

  Focal:    ACGTACGT            Focal:    ACGTACGT
  Outgroup: ACGTACGT            Outgroup: TGCATGCA  ← paralog!
            ^^^^^^^^                      ^^^^^^^^
            Correct projected             Incorrect projected
            bases                         bases → bad ancestral call

Net alignments mitigate paralog confusion by keeping only the best one-to-one chain at each position, but regions with recent segmental duplications, transposable elements, or structural variants can still produce artifacts. These typically manifest as N (no alignment) or disagreements (n).

Alternative method: Fitch parsimony

Since version 1.2.0 ancify supports a second inference method: Fitch parsimony (Fitch, 1971). Instead of splitting outgroups into two hand-picked tiers and majority-voting within each, this method uses the actual phylogenetic tree of the outgroup species to reconstruct the most parsimonious ancestral allele at the root.

When to prefer parsimony

You have three or more outgroups at varying phylogenetic distances.
You want the tree topology to determine how species are weighted, rather than manually choosing inner/outer groups.
Resolving cases that the two-tier method marks as “unresolved” (n) matters to you. In many of those cases, parsimony can find a unique most-parsimonious solution by leveraging the tree structure.

The Fitch algorithm

The algorithm runs on a rooted phylogenetic tree whose leaves are the outgroup species. At every genomic position it performs two passes:

Pass 1: Bottom-up (post-order traversal)

Starting at the leaves, work toward the root. Each leaf gets a set containing its observed allele (or {A,C,G,T} if missing). At each internal node, compute:

Intersection of children’s sets, if non-empty
Union of children’s sets, otherwise

The intersection case means all children are compatible; no mutation is needed on the branch. The union case means at least one mutation is required.

  Example: (((bonobo,chimp),gorilla),macaque)
  Observed: bonobo=G, chimp=G, gorilla=A, macaque=A

  Bottom-up:
    bonobo  → {G}       chimp → {G}
    (bonobo,chimp)      → {G} ∩ {G} = {G}
    gorilla → {A}
    ((bc),gorilla)      → {G} ∩ {A} = ∅ → {G} ∪ {A} = {A,G}
    macaque → {A}
    root                → {A,G} ∩ {A} = {A}

Pass 2: Top-down (pre-order traversal)

Starting at the root, assign a concrete allele from the node’s set. Prefer the parent’s allele if it is in the child’s set (minimizes mutations). Ties are broken alphabetically for determinism.

  Continuing the example:
    root  → {A}    → picks A
    (bc,gorilla) → {A,G} → parent is A, A ∈ set → picks A
    gorilla      → {A}   → A
    (bc)         → {G}   → parent is A, A ∉ set → picks G
    bonobo       → {G}   → G
    chimp        → {G}   → G

  Result: A is ancestral at the root (1 mutation: A→G on the bonobo-chimp branch)

Handling missing data

When a leaf has no alignment data (N), it is assigned the full set {A,C,G,T}. This makes it compatible with any allele, so missing data never forces a mutation. If all leaves are missing, the root set remains {A,C,G,T} and the position is reported as N (missing).

Confidence encoding (parsimony)

The parsimony method maps to the same case-based confidence encoding as the voting method:

Character	Confidence	Condition
`ACGT`	High	Root Fitch set has exactly one allele (unambiguous)
`acgt`	Low	Root Fitch set has multiple alleles (ambiguous; one is chosen)
`N`	Missing	All outgroup leaves are `N`

Note that there is no n (unresolved) category for parsimony — the algorithm always produces a call unless all data is missing.

Comparison: voting vs. parsimony

  Position with: bonobo=G, chimp=G, gorilla=A, macaque=A

  Two-tier voting:
    Inner consensus (bonobo, chimp, gorilla) → majority = G
    Outer consensus (macaque)                → A
    Inner ≠ Outer → "n" (UNRESOLVED)

  Fitch parsimony on (((bonobo,chimp),gorilla),macaque):
    Root = A (one mutation needed: A→G on the bonobo-chimp branch)
    → "A" (HIGH CONFIDENCE)

The tree structure lets parsimony recognize that the G allele is a shared derived change in the bonobo-chimp clade, while A is the ancestral state supported by both gorilla and macaque.

Configuration

To use parsimony, add two fields to your YAML config:

method: parsimony
tree: "(((bonobo,chimp),gorilla),macaque)"

The tree field accepts either an inline Newick string or a path to a .nwk file. Leaf names must match outgroup name fields in your config. See Configuration Reference for the complete reference.

Alternative method: Likelihood (Felsenstein pruning)

Since version 1.4.0 ancify supports a fourth inference method: likelihood-based ancestral reconstruction using Felsenstein’s pruning algorithm (Felsenstein, 1981). Instead of counting alleles (voting) or minimising mutations (parsimony), this method uses an explicit substitution model and branch lengths to compute the posterior probability of each ancestral base at the root of the phylogenetic tree.

When to prefer likelihood

You have a phylogenetic tree with branch lengths (in expected substitutions per site).
You want posterior probabilities rather than the binary confident/ambiguous split of parsimony.
You want the reconstruction to account for the fact that long branches accumulate more substitutions than short branches.
You want to use a specific substitution model that reflects known biology (e.g. transition/transversion bias via K80 or HKY85).

Substitution models

All models define a 4×4 instantaneous rate matrix Q, normalised so that one unit of branch length corresponds to one expected substitution:

  ┌───────────┬────────────────────────────────────────────────────────────────┐
  │ Model     │ Description                                                    │
  ├───────────┼────────────────────────────────────────────────────────────────┤
  │ JC69      │ Jukes–Cantor (1969). All substitutions equally likely.         │
  │           │ No free parameters.                                            │
  ├───────────┼────────────────────────────────────────────────────────────────┤
  │ K80       │ Kimura (1980) two-parameter model. Distinguishes transitions   │
  │           │ (A↔G, C↔T) from transversions. One parameter: kappa (κ).       │
  ├───────────┼────────────────────────────────────────────────────────────────┤
  │ HKY85     │ Hasegawa–Kishino–Yano (1985). Like K80 but with non-uniform   │
  │           │ equilibrium base frequencies. Parameters: κ + π = [A, C, G, T].│
  ├───────────┼────────────────────────────────────────────────────────────────┤
  │ GTR       │ General time-reversible. Six exchangeability rates + base       │
  │           │ frequencies. The most flexible reversible model.               │
  └───────────┴────────────────────────────────────────────────────────────────┘

The transition probability matrix P(t) = expm(Q·t) gives the probability of observing base j at the end of a branch of length t, given base i at the start.

The Felsenstein pruning algorithm

At every genomic position the algorithm performs a single bottom-up pass over the tree:

Step 1: Initialise leaves

Each leaf gets a conditional likelihood vector of length 4. If the observed base is (say) A, the vector is [1, 0, 0, 0]. If the leaf is missing (N), the vector is [1, 1, 1, 1] (compatible with any base).

Step 2: Bottom-up (post-order traversal)

At each internal node with children c₁, c₂, …, compute:

  L(node, i) = ∏_c [ Σ_j P_c(i,j) · L(c, j) ]

where P_c is the transition matrix for the branch leading to child c. This multiplies together each child’s contribution: the sum over all possible child states j of the probability of transitioning from parent state i to j, weighted by the child’s likelihood of state j.

Step 3: Root posterior

Multiply the root’s conditional likelihoods by the prior (equilibrium base frequencies π) and normalise:

  posterior(i) = π_i · L(root, i) / Σ_j π_j · L(root, j)

Step 4: Call the ancestral base

The base with the highest posterior probability is selected. Confidence encoding:

  ┌────────────────────────────────────────┬──────────────┬───────────────────┐
  │ Condition                              │ Output       │ Interpretation    │
  ├────────────────────────────────────────┼──────────────┼───────────────────┤
  │ max posterior ≥ high_threshold (0.8)   │ ACGT (upper) │ High confidence   │
  │ max posterior ≥ low_threshold  (0.5)   │ acgt (lower) │ Low confidence    │
  │ max posterior <  low_threshold         │ n            │ Unresolved        │
  │ All leaves missing                     │ N            │ No data           │
  └────────────────────────────────────────┴──────────────┴───────────────────┘

Worked example

  Tree: (((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)
  Model: JC69
  Observed: bonobo=G, chimp=G, gorilla=A, macaque=A

  Step 1: Leaf likelihoods
    bonobo  → [0, 0, 1, 0]  (G)
    chimp   → [0, 0, 1, 0]  (G)
    gorilla → [1, 0, 0, 0]  (A)
    macaque → [1, 0, 0, 0]  (A)

  Step 2: Bottom-up
    At (bonobo,chimp) node, branch lengths 0.008:
      P(0.008) ≈ diag(0.9894) + off-diag(0.0035)
      Both children are G → parent L is dominated by G but with some
      probability of A, C, T through the short branches.

    At ((bc),gorilla) node:
      Left child favours G, right child (gorilla) favours A.
      The branch to gorilla (0.009) is short → A signal is strong.
      Combined: both A and G have substantial likelihood.

    At root:
      Left child favours A/G mix, right child (macaque, branch 0.038) favours A.
      A has the highest likelihood.

  Step 3: Posterior
    π = [0.25, 0.25, 0.25, 0.25]  (JC69 has uniform frequencies)
    posterior(A) ≈ 0.84, posterior(G) ≈ 0.15, others ≈ 0.005

  Step 4: max posterior = 0.84 ≥ 0.8 → "A" (high confidence)

Compare this to the parsimony result (also A) and the voting result (n, unresolved). Likelihood agrees with parsimony here, but additionally provides a calibrated probability (84%) that quantifies how confident we should be.

Comparison: voting vs. parsimony vs. likelihood vs. ML

  Position with: bonobo=G, chimp=G, gorilla=A, macaque=A

  Two-tier voting:
    Inner consensus = G, Outer consensus = A → DISAGREEMENT → "n"

  Fitch parsimony:
    Root = A (one mutation: A→G on bonobo-chimp branch) → "A"

  Likelihood (JC69):
    Root posterior: A ≈ 0.84 → "A" (high confidence, with probability)

  ML classifier:
    Sees outgroup bases + local context → may also predict A with calibrated p

Configuration

method: likelihood
tree: "(((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)"
substitution_model: HKY85
model_kappa: 2.0
model_base_freqs: [0.3, 0.2, 0.2, 0.3]
likelihood_high_threshold: 0.8
likelihood_low_threshold: 0.5

Branch lengths in the Newick tree are critical for likelihood — they determine how much weight each leaf’s observation carries. Trees estimated by phylogenetic methods (e.g. RAxML, IQ-TREE) typically provide branch lengths in substitutions per site, which is exactly what the models expect.

See Configuration Reference for the complete parameter reference.

Alternative method: Machine learning classifier

Since version 2.0.0 ancify supports a third inference method: a LightGBM gradient-boosted classifier that learns substitution patterns from data rather than relying on hand-crafted rules.

When to prefer ML

You have an external reference ancestral sequence (e.g. Ensembl EPO) that can serve as high-quality training labels.
You want the model to learn substitution biases (transition/transversion asymmetry, CpG hypermutation) and local sequence context automatically.
You want calibrated probability scores rather than the binary high/low confidence split of voting, or the ambiguous/unambiguous split of parsimony.
Resolving the ~0.1% of sites that voting marks as “unresolved” (n) matters to you.

Feature engineering

At every genomic position, the classifier receives a feature vector built from the projected outgroup sequences:

  Feature                 Description                                     Count
  ──────────────────────  ──────────────────────────────────────────────  ─────
  Outgroup bases          Integer-encoded base per outgroup (0-3, -1=N)  K
  Data count              Number of outgroups with valid data             1
  Agreement ratio         Fraction of valid outgroups agreeing w/ majority 1
  Inner consensus         Majority base among inner outgroups (0-4)      1
  Outer consensus         Majority base among outer outgroups (0-4)      1
  Voting agreement        Whether inner == outer (binary)                1
  GC content              GC fraction in ±50 bp window                   1
  CpG flag                Whether site is part of a CpG dinucleotide     1
  Flanking bases          Bases immediately upstream/downstream           2
  ──────────────────────────────────────────────────────────────────────────
  Total                                                                  K + 9

The first K features capture raw outgroup data. The next five features encode the same information the voting method uses, letting the model learn when voting is reliable. The final four context features let the model learn position-dependent substitution biases (e.g. elevated C→T rates at CpG sites).

Training workflow

Training is a separate step from inference:

# Train a model using high-confidence voting sites (self-supervised):
ancify train -c config.yaml -o model.lgb

# Or, with an external reference for training labels:
# (set ml_training_reference in config.yaml)
ancify train -c config.yaml -o model.lgb

Two label sources are supported:

Self-supervised (default): Positions where the voting method’s inner and outer tiers agree (uppercase output) serve as training labels. No external data is needed.
Reference-supervised: An external ancestral reference (e.g. Ensembl EPO) provides ground-truth labels via the ml_training_reference config field. This is preferred when available.

Training subsamples up to ~5 million sites per chromosome (stratified by allele class) to keep training fast (<5 minutes for a full genome).

Confidence calibration

The model’s predicted class probability maps to confidence levels:

  ┌────────────────────────────────┬──────────────┬─────────────────────────────┐
  │ Condition                      │ Output       │ Interpretation              │
  ├────────────────────────────────┼──────────────┼─────────────────────────────┤
  │ p >= high_threshold (def 0.8)  │ ACGT (upper) │ High confidence             │
  │ p >= low_threshold  (def 0.5)  │ acgt (lower) │ Low confidence              │
  │ p <  low_threshold             │ n            │ Unresolved                  │
  │ All outgroups missing          │ N            │ No data                     │
  └────────────────────────────────┴──────────────┴─────────────────────────────┘

This is more informative than the binary high/low split of voting or parsimony, because the probability thresholds can be tuned to trade precision for recall.

Comparison: voting vs. parsimony vs. ML

  Position with: bonobo=G, chimp=G, gorilla=A, macaque=A

  Two-tier voting:
    Inner consensus = G, Outer consensus = A → DISAGREEMENT → "n"

  Fitch parsimony:
    Root = A (one mutation: A→G on bonobo-chimp branch) → "A"

  ML classifier:
    Sees outgroup bases + local context (CpG site, high GC region)
    Predicts A with probability 0.87 → "A" (high confidence)

The ML method can resolve cases like this by learning that the G→A pattern at a CpG site in a GC-rich region is more consistent with ancestral A than derived A, using patterns it has learned from the training data.

Why LightGBM?

Tabular data: The per-position feature space is naturally tabular, not sequential. Gradient-boosted trees outperform neural networks on tabular data in most benchmarks.
Native missing-value handling: LightGBM handles NaN / missing features natively, which is critical for positions where some outgroups lack alignment data.
Speed: Training on ~5M sites takes <5 minutes; inference on a full genome takes minutes.
Interpretable: Feature importance scores show which outgroups and context features drive predictions, which is valuable for a scientific tool.

Summary

Two-tier voting (default)

  For each position in the focal genome:

  1. Gather projected bases from all outgroup species
  2. Compute inner consensus (majority vote among close relatives)
  3. Compute outer consensus (majority vote among distant relatives)
  4. Compare:
       agree  → UPPERCASE (high confidence)
       one N  → lowercase (low confidence)
       differ → n (unresolved)
       both N → N (missing)
  5. Write to output FASTA

Fitch parsimony

  For each position in the focal genome:

  1. Gather projected bases from all outgroup species
  2. Run Fitch bottom-up pass on the phylogenetic tree
  3. Run Fitch top-down pass to assign the root allele
  4. Encode confidence:
       root set size = 1 → UPPERCASE (high confidence)
       root set size > 1 → lowercase (low confidence)
       all missing       → N
  5. Write to output FASTA

Likelihood (Felsenstein pruning)

  For each position in the focal genome:

  1. Gather projected bases from all outgroup species
  2. Run Felsenstein bottom-up pass on the tree with branch lengths
     using the chosen substitution model (JC69/K80/HKY85/GTR)
  3. Compute root posterior: π · L(root) / Σ
  4. Encode confidence:
       max posterior >= high_threshold → UPPERCASE (high confidence)
       max posterior >= low_threshold  → lowercase (low confidence)
       max posterior <  low_threshold  → n (unresolved)
       all missing                    → N
  5. Write to output FASTA

ML classifier

  Prerequisite: train a model with `ancify train -c config.yaml`

  For each position in the focal genome:

  1. Gather projected bases from all outgroup species
  2. Extract feature vector (outgroup bases, voting consensus,
     agreement, GC content, CpG flag, flanking bases)
  3. Run LightGBM inference → predicted class + probability
  4. Encode confidence:
       p >= high_threshold → UPPERCASE (high confidence)
       p >= low_threshold  → lowercase (low confidence)
       p <  low_threshold  → n (unresolved)
       all missing         → N
  5. Write to output FASTA

All four methods are fast (Phase 2 takes minutes for a full genome) and produce reliable ancestral calls with built-in confidence assessment. Voting is the simplest and best-tested; parsimony is more principled for complex outgroup configurations; likelihood adds calibrated posterior probabilities using substitution models; ML can learn substitution biases and local context when training data is available.

Algorithm

Overview

Phase 1: Coordinate Projection

Input: Net AXT alignments

How projection works

Worked example

Phase 2: Ancestral State Inference

Step 1: Majority vote

Step 2: Compare inner and outer

The complete algorithm as pseudocode

Worked example: full pipeline for one position

Confidence encoding

Biological rationale

Why two tiers instead of one big vote?

Incomplete lineage sorting (ILS)

When the algorithm can still fail

The min_inner_freq parameter in depth

Alignment quality and its effects

Alternative method: Fitch parsimony

When to prefer parsimony

The Fitch algorithm

Handling missing data

Confidence encoding (parsimony)

Comparison: voting vs. parsimony

Configuration

Alternative method: Likelihood (Felsenstein pruning)

When to prefer likelihood

Substitution models

The Felsenstein pruning algorithm

Worked example

Comparison: voting vs. parsimony vs. likelihood vs. ML

Configuration

Alternative method: Machine learning classifier

When to prefer ML

Feature engineering

Training workflow

Confidence calibration

Comparison: voting vs. parsimony vs. ML

Why LightGBM?

Summary

Two-tier voting (default)

Fitch parsimony

Likelihood (Felsenstein pruning)

ML classifier

The `min_inner_freq` parameter in depth