ancify — Fast ancestral allele polarization from outgroup alignments
ancify is a config-driven Python pipeline that determines the ancestral state at every position in a reference genome by comparing pairwise alignments from multiple outgroup species. It supports four inference methods: two-tier voting, Fitch parsimony on a phylogenetic tree, likelihood (Felsenstein pruning with substitution models), and a machine-learning classifier (LightGBM) — all producing case-encoded confidence levels.
            ┌──────────┐
Inner:      │ Bonobo   │──┐
closely     ├──────────┤  │  majority   ┌──────────────┐
related     │ Chimp    │──┼────vote────▶│    Inner     │
species     ├──────────┤  │             │  consensus   │
            │ Gorilla  │──┘             └──────┬───────┘
                                               │ compare
            ┌──────────┐              ┌────────▼───────┐
Outer:      │ Macaque  │──────vote───▶│   Ancestral    │
distant     └──────────┘              │      call      │
species                               └────────────────┘

Agree?       → UPPERCASE (high confidence)
One tier?    → lowercase (low confidence)
Disagree?    → n (unresolved)
Both empty?  → N (missing)
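The decision table in the diagram can be sketched in a few lines of Python. This is an illustrative sketch, not ancify's actual implementation: the function names `majority` and `polarize`, the tie handling, and the reading of `min_inner_freq` as a minimum fraction of valid inner calls are all assumptions made for the example.

```python
from collections import Counter

BASES = frozenset("ACGT")

def majority(calls, min_freq=0.5):
    """Return the most common valid base if it reaches min_freq, else None."""
    valid = [c.upper() for c in calls if c and c.upper() in BASES]
    if not valid:
        return None
    (base, count), *rest = Counter(valid).most_common()
    # Assumption: a tie for first place counts as "no consensus".
    if rest and rest[0][1] == count:
        return None
    return base if count / len(valid) >= min_freq else None

def polarize(inner_calls, outer_calls, min_inner_freq=0.5):
    """Combine the inner and outer tiers into one case-encoded call."""
    inner = majority(inner_calls, min_inner_freq)
    outer = majority(outer_calls)
    if inner is None and outer is None:
        return "N"                        # both tiers empty: missing
    if inner is None or outer is None:
        return (inner or outer).lower()   # one tier only: low confidence
    return inner if inner == outer else "n"  # agree: uppercase, disagree: n

# Inner species are bonobo, chimp, gorilla; the outer species is macaque.
print(polarize(["A", "A", "G"], ["A"]))   # tiers agree
print(polarize(["A", "A", "G"], ["-"]))   # outer tier has no aligned base
print(polarize(["A", "A", "A"], ["G"]))   # tiers disagree
```

In this sketch, gaps and any non-ACGT characters are treated as missing calls, so a tier with no aligned bases simply drops out of the comparison.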
Four methods, one config field:
method: voting # default — two-tier majority vote (above diagram)
method: parsimony # Fitch algorithm on a Newick phylogenetic tree
method: likelihood # Felsenstein pruning + JC69/K80/HKY85/GTR (tree with branch lengths)
method: ml # LightGBM classifier trained on your alignment data
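To give a feel for what the parsimony option does, here is a minimal sketch of the bottom-up pass of the Fitch algorithm on a hard-coded binary tree. It is not ancify's code: the nested-tuple tree representation, the function name `fitch_root_set`, and the treatment of missing data as "any base" are assumptions for illustration, and Newick parsing is left out.

```python
BASES = frozenset("ACGT")

def fitch_root_set(tree, states):
    """Bottom-up Fitch pass: return the candidate state set at the root.

    `tree` is either a species name (leaf) or a 2-tuple of subtrees.
    `states` maps species name to the aligned base at one position.
    """
    if isinstance(tree, str):
        base = states.get(tree, "N").upper()
        # Assumption: gaps and unknown characters are compatible with any base.
        return {base} if base in BASES else set(BASES)
    left, right = (fitch_root_set(child, states) for child in tree)
    common = left & right
    # Intersection if the children share a state, otherwise union.
    return common if common else left | right

# Hypothetical topology matching the species in the diagram above.
tree = ((("bonobo", "chimp"), "gorilla"), "macaque")
site = {"bonobo": "A", "chimp": "A", "gorilla": "G", "macaque": "A"}
print(fitch_root_set(tree, site))
```

A root set with a single base is an unambiguous parsimony call; a set with several bases means the site cannot be resolved by this pass alone. The full Fitch algorithm adds a top-down pass to assign states to internal nodes, which is omitted here.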
Get started in five minutes:
pip install .
ancify init -o config.yaml
# edit config.yaml with your species, alignments, and paths
ancify run -c config.yaml
Why ancify?
Four inference methods — choose the approach that fits your data: two-tier voting (fast, transparent), Fitch parsimony (tree-aware, resolves ILS), likelihood (substitution models + branch lengths, posterior probabilities), or ML classifier (learns substitution biases from your data). See Algorithm.
GPU-accelerated — optional PyTorch backend turns 12-hour genome-wide runs into ~2 minutes on NVIDIA GPUs. See GPU Acceleration & Vectorization.
Species-agnostic — works with humans, mice, flies, fish, plants, or any species with outgroup alignments.
Educational — the docs teach you the population genetics behind polarization, not just the buttons to press.
Config-driven — one YAML file controls everything. No scripts to edit.
Transparent — confidence is encoded directly in the output (uppercase/lowercase). You always know how reliable each call is.
Validated — tested against the Ensembl EPO 13-primate ancestral reference with >99.9% agreement.
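Because confidence is carried by letter case, downstream code can summarize a result without any side files. The sketch below tallies the four classes for a sequence string. It assumes the output can be read as a plain string of case-encoded characters; the function name `confidence_summary` and the "other" bucket are hypothetical and not part of ancify.

```python
from collections import Counter

HIGH = frozenset("ACGT")   # both tiers agree
LOW = frozenset("acgt")    # only one tier had a call

def confidence_summary(seq):
    """Count high-confidence, low-confidence, unresolved and missing calls."""
    counts = Counter(high=0, low=0, unresolved=0, missing=0, other=0)
    for ch in seq:
        if ch in HIGH:
            counts["high"] += 1
        elif ch in LOW:
            counts["low"] += 1
        elif ch == "n":
            counts["unresolved"] += 1
        elif ch == "N":
            counts["missing"] += 1
        else:
            counts["other"] += 1   # anything outside the documented encoding
    return dict(counts)

print(confidence_summary("ACgtnNNA"))
```

A filter that keeps only uppercase A, C, G, T positions would follow the same pattern when only high-confidence calls are wanted.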
User Guide
Deep Dives
- Algorithm
  - Overview
  - Phase 1: Coordinate Projection
  - Phase 2: Ancestral State Inference
  - Confidence encoding
  - Biological rationale
  - The min_inner_freq parameter in depth
  - Alignment quality and its effects
  - Alternative method: Fitch parsimony
  - Alternative method: Likelihood (Felsenstein pruning)
  - Alternative method: Machine learning classifier
  - Summary
- Evaluation
- FAQ & Troubleshooting