Configuration Reference

Everything in ancify is controlled by a single YAML file. This page is the complete reference.

Generate a starter config

ancify init -o config.yaml

This creates a fully annotated template. You can also start from one of the example configs in example_configs/.


Complete annotated config

focal_species: human

chromosome_lengths: chromoLens.txt

# Optional: restrict to specific chromosomes.
# If omitted, every entry in the lengths file is processed.
chromosomes:
  - chr1
  - chr2
  - chrX

outgroups:
  inner:
    - name: bonobo
      alignment: hg38.panPan3.net.axt.gz
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
    - name: gorilla
      alignment: hg38.gorGor6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz

work_dir: .
output_dir: ./ancestral_calls

min_inner_freq: 1
min_outer_freq: 1

num_cpus: 24

# Inference method: "voting" (default), "parsimony", "likelihood", or "ml".
method: voting

# --- Tree (required when method: parsimony or likelihood) ---
# tree: "(((bonobo,chimp),gorilla),macaque)"   # inline Newick, or:
# tree: species_tree.nwk                        # path to .nwk file

# --- Likelihood options (only needed when method: likelihood) ---
# substitution_model: HKY85         # JC69 (default), K80, HKY85, or GTR
# model_kappa: 2.0                  # transition/transversion ratio (K80, HKY85)
# model_base_freqs: [0.3, 0.2, 0.2, 0.3]  # equilibrium freqs (HKY85, GTR)
# model_rates: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # exchangeability rates (GTR)
# likelihood_high_threshold: 0.8    # min posterior → uppercase (high conf)
# likelihood_low_threshold: 0.5     # min posterior → lowercase (low conf)

# --- ML options (only needed when method: ml) ---
# ml_model_path: model.lgb          # path to trained LightGBM model
# ml_training_reference: ./ensembl_ancestor/  # optional: supervised labels
# ml_high_threshold: 0.8            # min probability → uppercase (high conf)
# ml_low_threshold: 0.5             # min probability → lowercase (low conf)

# Performance backend: "auto", "cpu", or "gpu".
# "auto" uses GPU when PyTorch + CUDA is available, otherwise CPU.
backend: auto

# Optional: restrict to specific GPUs (default: use all).
# Example: gpu_devices: [0, 1, 2]

# Optional evaluation block (Phase 3).
evaluation:
  reference_dir: ./ensembl_ancestor/
  reference_pattern: "homo_sapiens_ancestor_{chrom_id}.fa"
  vcf_dir: ./vcf/
  vcf_pattern: "ALL.chr{chrom_id}.vcf.gz"

Field reference

Required fields

Field               Type    Description
──────────────────  ──────  ───────────
focal_species       string  Label for the focal species (used in log messages only)
chromosome_lengths  path    Tab-separated file with at least two columns: chromosome name and length
outgroups.inner     list    One or more closely related outgroup species
outgroups.outer     list    One or more distantly related outgroup species

Each outgroup entry requires:

Field      Type    Description
─────────  ──────  ───────────
name       string  Species identifier (used in directory names for projected files)
alignment  path    Path to the net AXT pairwise alignment file (may be gzipped)

Optional fields

Field                      Default                Description
─────────────────────────  ─────────────────────  ───────────
chromosomes                all from lengths file  List of chromosomes to process
work_dir                   .                      Working directory for intermediate projected sequences
output_dir                 ./ancestral_calls      Output directory for final ancestral FASTAs
min_inner_freq             1                      Minimum allele count for inner majority vote
min_outer_freq             1                      Minimum allele count for outer majority vote
num_cpus                   4                      Number of parallel worker processes
backend                    auto                   Compute backend: "auto", "cpu", or "gpu" (see GPU Acceleration & Vectorization)
gpu_devices                all available          List of GPU device IDs to use, e.g. [0, 1, 2]
method                     voting                 Ancestral inference method: "voting", "parsimony", "likelihood", or "ml" (see Algorithm)
tree                       none                   Newick tree of outgroup species (required when method: parsimony or likelihood). Inline string or path to .nwk file.
substitution_model         JC69                   Substitution model for likelihood method: "JC69", "K80", "HKY85", or "GTR".
model_kappa                2.0                    Transition/transversion ratio κ (used by K80 and HKY85).
model_base_freqs           uniform                Equilibrium base frequencies [π_A, π_C, π_G, π_T] (used by HKY85 and GTR). Must sum to 1.
model_rates                none                   Six GTR exchangeability rates [AC, AG, AT, CG, CT, GT] (required when substitution_model: GTR).
likelihood_high_threshold  0.8                    Minimum posterior probability for high (uppercase) confidence (likelihood method).
likelihood_low_threshold   0.5                    Minimum posterior probability for low (lowercase) confidence (likelihood method). Below this → n.
ml_model_path              none                   Path to a trained LightGBM model file (required when method: ml). Produced by ancify train.
ml_training_reference      none                   Directory of reference ancestral FASTAs for supervised training (optional; used by ancify train).
ml_high_threshold          0.8                    Minimum predicted probability for high (uppercase) confidence.
ml_low_threshold           0.5                    Minimum predicted probability for low (lowercase) confidence. Below this → n.
evaluation                 none                   Optional evaluation block (see below)

Evaluation fields

All evaluation fields are optional. If the evaluation block is omitted entirely, Phase 3 is skipped.

Field                         Default         Description
────────────────────────────  ──────────────  ───────────
evaluation.reference_dir      none            Directory with reference ancestral FASTA files
evaluation.reference_pattern  {chrom}.fa      Filename pattern for reference files
evaluation.vcf_dir            none            Directory with VCF files
evaluation.vcf_pattern        {chrom}.vcf.gz  Filename pattern for VCF files


Understanding key parameters

min_inner_freq: the stringency dial

This is the single most important tuning parameter. It controls how many inner outgroup species must agree before a consensus is accepted.

  With 3 inner species (e.g. bonobo, chimp, gorilla):

  min_inner_freq=1     min_inner_freq=2     min_inner_freq=3
  ──────────────────   ──────────────────   ──────────────────
  Any 1 species        At least 2 of 3      All 3 must agree
  suffices             must agree

  ✓ Maximum coverage   ✓ Balanced           ✓ Maximum stringency
  ✗ Lower accuracy     ✓ Good for most      ✗ Lower coverage
                         use cases          ✓ Highest accuracy

Recommendation: Start with the default (1) and increase if you see higher-than-expected disagreement rates in the evaluation.
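
To make the effect of the threshold concrete, here is a minimal sketch of a majority vote gated by min_inner_freq at a single site. The function name inner_consensus, the treatment of non-ACGT characters as missing data, and the rule that ties are unresolved are assumptions made for illustration; they are not taken from ancify's source.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}

def inner_consensus(alleles, min_inner_freq=1):
    """Return the majority base among inner outgroups, or None if unresolved.

    alleles: one character per inner outgroup at this site. Anything that
    is not A/C/G/T is treated as missing data (assumption of this sketch).
    """
    counts = Counter(a.upper() for a in alleles if a.upper() in VALID_BASES)
    if not counts:
        return None
    ranked = counts.most_common()
    top_base, top_count = ranked[0]
    # Assumption: a tie for first place is left unresolved.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    # The stringency dial: the winner needs at least min_inner_freq votes.
    return top_base if top_count >= min_inner_freq else None

# Two of three species agree, which passes min_inner_freq=2.
print(inner_consensus(["A", "A", "G"], min_inner_freq=2))
# Only one species is aligned here, which fails min_inner_freq=2.
print(inner_consensus(["A", "n", "n"], min_inner_freq=2))
```

Raising the threshold turns sites with sparse alignment coverage into unresolved calls, which is the coverage-for-accuracy trade shown in the table above.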

num_cpus: parallelism

Each chromosome is processed independently. num_cpus controls how many chromosomes are processed simultaneously.

  Memory usage ≈ num_cpus × (num_outgroups) × (avg_chrom_length)

  Example: 24 CPUs × 4 outgroups × 150 Mb avg = ~14.4 GB peak

If you are running on a machine with limited memory, reduce num_cpus. For Phase 1 (projection), memory is modest; the bottleneck is Phase 2 (calling), which loads all projected sequences for each active chromosome.
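
The rule of thumb above can be turned into a quick budget check. This is a sketch only: it assumes roughly one byte per base (which is what the 14.4 GB example implies), and the helper names estimate_peak_memory_gb and max_cpus_for_budget are hypothetical, not part of ancify.

```python
def estimate_peak_memory_gb(num_cpus, num_outgroups, avg_chrom_length_bp,
                            bytes_per_base=1):
    """Approximate peak memory in GB (10^9 bytes) for Phase 2.

    Mirrors: num_cpus x num_outgroups x avg_chrom_length.
    bytes_per_base=1 is an assumption of this sketch.
    """
    total_bytes = num_cpus * num_outgroups * avg_chrom_length_bp * bytes_per_base
    return total_bytes / 1e9

def max_cpus_for_budget(budget_gb, num_outgroups, avg_chrom_length_bp,
                        bytes_per_base=1):
    """Largest num_cpus whose estimate fits in budget_gb (never below 1)."""
    per_worker_gb = estimate_peak_memory_gb(
        1, num_outgroups, avg_chrom_length_bp, bytes_per_base
    )
    return max(1, int(budget_gb // per_worker_gb))

# The worked example from the text: 24 CPUs, 4 outgroups, 150 Mb average.
print(estimate_peak_memory_gb(24, 4, 150_000_000))
# How many workers fit in an 8 GB budget with the same outgroups?
print(max_cpus_for_budget(8, 4, 150_000_000))
```

Treat the result as a lower bound; real usage also includes the interpreter, output buffers, and any backend overhead.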

method: choosing your inference approach

ancify supports four methods for inferring the ancestral allele at each position. You switch between them with a single line in your YAML:

method: voting      # default

  ┌──────────────┬────────────────────────────────────────────────────────────┐
  │ method       │ Description                                                │
  ├──────────────┼────────────────────────────────────────────────────────────┤
  │ voting       │ Two-tier majority vote. Inner outgroups vote first; outer  │
  │ (default)    │ provides independent check. Fast, no extra deps.           │
  ├──────────────┼────────────────────────────────────────────────────────────┤
  │ parsimony    │ Fitch (1971) algorithm on a Newick phylogenetic tree.      │
  │              │ Resolves many "unresolved" (n) sites that voting misses.   │
  │              │ Requires: tree field.                                      │
  ├──────────────┼────────────────────────────────────────────────────────────┤
  │ likelihood   │ Felsenstein (1981) pruning with an explicit substitution   │
  │              │ model. Computes posterior probabilities at the root using   │
  │              │ branch lengths. Requires: tree with branch lengths, scipy. │
  ├──────────────┼────────────────────────────────────────────────────────────┤
  │ ml           │ LightGBM gradient-boosted classifier. Learns substitution  │
  │              │ biases (CpG, GC context, etc.) from data. Yields calibrated│
  │              │ probability scores rather than binary high/low.            │
  │              │ Requires: pip install 'ancify[ml]', a trained model.       │
  └──────────────┴────────────────────────────────────────────────────────────┘

Not sure which to use? Start with voting. It requires no extra configuration and gives >99.9% accuracy for most species. Switch to parsimony if you have a reliable tree and want to resolve more “unresolved” sites. Use likelihood if you have a tree with branch lengths and want calibrated posterior probabilities from an explicit substitution model. Use ml if you have an external reference sequence for training labels and want to learn context-dependent patterns. See Algorithm for a detailed comparison with worked examples.

Parsimony YAML

method: parsimony
tree: "(((bonobo,chimp),gorilla),macaque)"

Leaf names in the Newick string must exactly match the name fields of your outgroup entries. You can also point to a file:

tree: species_tree.nwk
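
Before a long run, it can help to check the leaf-name requirement yourself. The sketch below extracts leaf labels from an inline Newick string with a regular expression and compares them with the outgroup names. It assumes simple unquoted labels, and the helper names are hypothetical; ancify performs its own check at load time.

```python
import re

def newick_leaf_names(newick):
    """Return leaf labels from a Newick string, in order of appearance.

    A leaf label directly follows "(" or ","; labels that follow ")" are
    internal-node labels, and numbers after ":" are branch lengths.
    Assumes unquoted labels without spaces (assumption of this sketch).
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def compare_tree_to_outgroups(newick, outgroup_names):
    """Return (names missing from the tree, leaves with no outgroup entry)."""
    leaves = set(newick_leaf_names(newick))
    names = set(outgroup_names)
    return sorted(names - leaves), sorted(leaves - names)

tree = "(((bonobo,chimp),gorilla),macaque)"
# "macaq" is a deliberate typo to show what a mismatch looks like.
missing, extra = compare_tree_to_outgroups(
    tree, ["bonobo", "chimp", "gorilla", "macaq"]
)
print("not in tree:", missing)
print("not in config:", extra)
```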

Likelihood YAML

method: likelihood
tree: "(((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)"

# Substitution model (default: JC69)
substitution_model: HKY85
model_kappa: 2.0
model_base_freqs: [0.3, 0.2, 0.2, 0.3]

# Optional: tune confidence thresholds (defaults shown)
likelihood_high_threshold: 0.8   # posterior ≥ 0.8 → UPPERCASE (high conf)
likelihood_low_threshold: 0.5    # posterior ≥ 0.5 → lowercase (low conf)
                                 # below 0.5 → n (unresolved)

Branch lengths are required — they control how much influence each leaf has. Trees from RAxML, IQ-TREE, or similar tools provide lengths in substitutions per site, which is exactly what the models expect. For GTR, also supply six exchangeability rates:

substitution_model: GTR
model_rates: [1.0, 2.0, 1.0, 1.0, 2.0, 1.0]   # [AC, AG, AT, CG, CT, GT]
model_base_freqs: [0.3, 0.2, 0.2, 0.3]
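
The two confidence thresholds map a posterior probability to the case of the emitted base. A minimal sketch of that mapping, using the documented defaults, follows; the function name encode_call is hypothetical.

```python
def encode_call(base, posterior, high=0.8, low=0.5):
    """Map a called base and its posterior to the output character.

    posterior >= high  -> uppercase (high confidence)
    posterior >= low   -> lowercase (low confidence)
    otherwise          -> "n" (unresolved)
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if posterior >= high:
        return base.upper()
    if posterior >= low:
        return base.lower()
    return "n"

for base, posterior in [("A", 0.93), ("G", 0.62), ("T", 0.41)]:
    print(base, posterior, "->", encode_call(base, posterior))
```

Setting both thresholds to the same value removes the lowercase tier, so every site is either a high-confidence call or n.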

ML YAML (two-step workflow)

Step 1: Train the model (run once, before calling):

# Optional, in config.yaml: point to a reference for supervised labels.
ml_training_reference: /data/ensembl_ancestor/   # omit for self-supervised

Then run the training command:

ancify train -c config.yaml -o model.lgb

Step 2: Call with the trained model:

method: ml
ml_model_path: model.lgb

# Optional: tune confidence thresholds (defaults shown)
ml_high_threshold: 0.8   # predicted probability ≥ 0.8 → UPPERCASE (high conf)
ml_low_threshold: 0.5    # predicted probability ≥ 0.5 → lowercase (low conf)
                         # below 0.5 → n (unresolved)

Then run:

ancify call -c config.yaml

Install ML dependencies first if you haven’t:

pip install 'ancify[ml]'

backend: the compute engine

This field selects which execution backend ancify uses for the heavy computation in Phases 1 and 2. See GPU Acceleration & Vectorization for the full details.

  backend: "auto"             backend: "cpu"             backend: "gpu"
  ──────────────────          ──────────────             ──────────────
  Uses GPU if PyTorch         Forces CPU-only            Forces GPU mode
  + CUDA is detected,         (vectorized NumPy).        (requires PyTorch
  else falls back to          Still much faster than     + CUDA).
  vectorized CPU.             the unvectorized path.

When backend: "gpu" (or "auto" with CUDA detected) is active, chromosomes are distributed round-robin across all available GPUs. Use gpu_devices to restrict which ones are used:

# Use all available GPUs (default)
backend: auto

# Use specific GPUs only
backend: auto
gpu_devices: [0, 1, 2]   # three A100s

# Single GPU, explicit
backend: gpu
gpu_devices: [0]

# Force CPU-only even if a GPU is present
backend: cpu
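
Round-robin distribution means chromosome i goes to device i modulo the number of devices. The sketch below shows the idea; the function name assign_devices is hypothetical, and the order in which ancify actually walks the chromosome list is an assumption.

```python
def assign_devices(chromosomes, gpu_devices):
    """Assign each chromosome to a GPU ID in round-robin order."""
    if not gpu_devices:
        raise ValueError("gpu_devices must contain at least one device ID")
    return {
        chrom: gpu_devices[i % len(gpu_devices)]
        for i, chrom in enumerate(chromosomes)
    }

# Four chromosomes over gpu_devices: [0, 1, 2] wrap back to device 0.
plan = assign_devices(["chr1", "chr2", "chr3", "chr4"], [0, 1, 2])
for chrom, device in plan.items():
    print(chrom, "-> GPU", device)
```

Because chromosomes differ in length, an even count per device does not imply an even workload.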

Install PyTorch with CUDA to unlock the GPU path:

pip install torch --index-url https://download.pytorch.org/whl/cu128

See GPU Acceleration & Vectorization for supported hardware, memory budgets, and benchmarks.

Choosing chromosomes

By default, ancify processes every chromosome in the lengths file. This includes autosomes, sex chromosomes, and scaffolds. To process only specific chromosomes:

chromosomes:
  - chr1
  - chr2
  - chrX

Tip

For an initial test run, try processing just one small chromosome (e.g. chr22 for humans or chr19 for mouse) to verify everything works before committing to a full-genome run.


Pattern placeholders

Evaluation filename patterns support two placeholders:

Placeholder  Example input: chr1    Example input: 2L
───────────  ─────────────────────  ───────────────────────────
{chrom}      chr1                   2L
{chrom_id}   1 (strips chr prefix)  2L (no chr prefix to strip)

This handles the common case where your focal genome uses chr1 but reference files use 1:

reference_pattern: "homo_sapiens_ancestor_{chrom_id}.fa"
#                                          ↑ becomes "1" for chr1
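
The placeholder behaviour described above can be reproduced in a few lines, which is handy for checking that a pattern resolves to the filenames you actually have on disk. The function name expand_pattern is hypothetical; only the {chrom} and {chrom_id} semantics come from this page.

```python
def expand_pattern(pattern, chrom):
    """Fill {chrom} and {chrom_id} in an evaluation filename pattern.

    {chrom}    -> the chromosome name as given
    {chrom_id} -> the name with a leading "chr" removed, if present
    """
    chrom_id = chrom[3:] if chrom.startswith("chr") else chrom
    return pattern.format(chrom=chrom, chrom_id=chrom_id)

print(expand_pattern("homo_sapiens_ancestor_{chrom_id}.fa", "chr1"))
print(expand_pattern("ALL.chr{chrom_id}.vcf.gz", "chrX"))
print(expand_pattern("{chrom}.fa", "2L"))
```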

The chromosome lengths file

A simple tab-separated text file. Only the first two columns matter:

chr1    248956422
chr2    242193529
chrX    156040895

Additional columns (GenBank accession, RefSeq ID, etc.) are silently ignored.
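
A small reader for this format is sketched below, useful for sanity-checking a lengths file before a run. It reads the first two tab-separated columns and ignores the rest. Skipping blank lines and lines starting with "#" is an assumption of this sketch, and the function name read_chromosome_lengths is hypothetical.

```python
import io

def read_chromosome_lengths(handle, keep=None):
    """Parse a chromosome lengths file into {name: length}.

    handle: any iterable of text lines (an open file works).
    keep:   optional collection of chromosome names to retain.
    """
    lengths = {}
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: blank lines and "#" comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if keep is None or name in keep:
            lengths[name] = length
    return lengths

# Extra columns (here an accession) are ignored.
example = io.StringIO(
    "chr1\t248956422\tCM000663.2\n"
    "chr2\t242193529\tCM000664.2\n"
    "chrX\t156040895\tCM000685.2\n"
)
print(read_chromosome_lengths(example, keep={"chr1", "chrX"}))
```

A space-separated file fails here with an IndexError, which is a quick way to catch the most common formatting mistake.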

How to create one

From a FASTA index:

samtools faidx reference.fa
cut -f1,2 reference.fa.fai > chromoLens.txt

From UCSC MySQL:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N \
  -e "SELECT chrom, size FROM chromInfo" hg38 > chromoLens.txt

From an NCBI assembly report:

grep -v "^#" assembly_report.txt | awk -F'\t' -v OFS='\t' '{print $1, $9}' > chromoLens.txt

Validation

When a config is loaded, ancify validates:

  • At least one inner outgroup species is specified (voting, ml).

  • At least one outer outgroup species is specified (voting, ml).

  • At least one outgroup species is specified (parsimony, likelihood).

  • The chromosome lengths file exists and is readable.

  • All alignment files exist and are accessible.

  • If method is parsimony or likelihood: a tree field is present and leaf names match outgroup names.

  • If method is likelihood: the substitution model name is valid, GTR has 6 rates, base frequencies (if provided) have 4 entries summing to ~1, and thresholds satisfy 0 ≤ low ≤ high ≤ 1.

If any check fails, a clear error message is printed before any processing begins. This prevents wasting hours on Phase 1 only to discover a typo in a file path.
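
The likelihood-specific checks in the list above can be sketched as a function that collects error messages instead of stopping at the first one. This is an illustration, not ancify's validator: the function name, the error wording, and the 1e-6 tolerance used for "summing to ~1" are all assumptions.

```python
import math

VALID_MODELS = {"JC69", "K80", "HKY85", "GTR"}

def validate_likelihood_options(cfg):
    """Return a list of problems found in a config dict (empty if none)."""
    errors = []

    model = cfg.get("substitution_model", "JC69")
    if model not in VALID_MODELS:
        errors.append(f"unknown substitution_model: {model!r}")

    rates = cfg.get("model_rates")
    if model == "GTR" and (rates is None or len(rates) != 6):
        errors.append("GTR requires exactly 6 model_rates")

    freqs = cfg.get("model_base_freqs")
    if freqs is not None:
        # Tolerance is an assumption; the docs only say "~1".
        if len(freqs) != 4 or not math.isclose(sum(freqs), 1.0, abs_tol=1e-6):
            errors.append("model_base_freqs needs 4 entries summing to 1")

    low = cfg.get("likelihood_low_threshold", 0.5)
    high = cfg.get("likelihood_high_threshold", 0.8)
    if not 0.0 <= low <= high <= 1.0:
        errors.append("thresholds must satisfy 0 <= low <= high <= 1")

    return errors

cfg = {"substitution_model": "GTR", "model_base_freqs": [0.3, 0.3, 0.3, 0.3]}
for problem in validate_likelihood_options(cfg):
    print("config error:", problem)
```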


Config recipes

Minimal (2 species)

The simplest possible configuration — one inner, one outer:

focal_species: mouse
chromosome_lengths: mm39.chromLens.txt
outgroups:
  inner:
    - name: rat
      alignment: mm39.rn7.net.axt.gz
  outer:
    - name: rabbit
      alignment: mm39.oryCun2.net.axt.gz
output_dir: ./mouse_ancestral

Maximal (many species, strict settings, evaluation)

focal_species: human
chromosome_lengths: chromoLens.txt
chromosomes: [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9,
              chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
              chr18, chr19, chr20, chr21, chr22, chrX]
outgroups:
  inner:
    - name: bonobo
      alignment: hg38.panPan3.net.axt.gz
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
    - name: gorilla
      alignment: hg38.gorGor6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz
work_dir: /scratch/polarization
output_dir: /results/human_ancestral
min_inner_freq: 2
min_outer_freq: 1
num_cpus: 32
backend: auto
gpu_devices: [0, 1, 2]
evaluation:
  reference_dir: /data/ensembl_ancestor/
  reference_pattern: "homo_sapiens_ancestor_{chrom_id}.fa"
  vcf_dir: /data/1kg/
  vcf_pattern: "ALL.chr{chrom_id}.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz"

Fitch parsimony

The same outgroup configuration with the phylogenetic tree-based method:

focal_species: human
chromosome_lengths: chromoLens.txt
outgroups:
  inner:
    - name: bonobo
      alignment: hg38.panPan3.net.axt.gz
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
    - name: gorilla
      alignment: hg38.gorGor6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz
method: parsimony
tree: "(((bonobo,chimp),gorilla),macaque)"
output_dir: ./human_ancestral_parsimony

Leaf names in the Newick tree must match the name fields of your outgroup entries. When method is parsimony, the inner/outer distinction is ignored — all outgroups are placed on the tree and the Fitch algorithm determines their relative weights.

You can also point tree to an external file:

tree: species_tree.nwk
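
For intuition about what the tree buys you, here is a sketch of the bottom-up pass of the Fitch algorithm on the outgroup tree above, written over nested tuples rather than a parsed Newick string. It assumes a strictly binary tree, treats missing data as "any base", and reports "n" when the root set is ambiguous. How ancify attaches the focal lineage and breaks ambiguity is not described on this page and is not modelled here; the function names are hypothetical.

```python
BASES = frozenset("ACGT")

def fitch_root_set(tree, observed):
    """Bottom-up Fitch pass: return the set of candidate states at a node.

    tree:     a leaf name (str) or a 2-tuple of subtrees.
    observed: dict mapping leaf name -> observed base at this site.
    """
    if isinstance(tree, str):
        base = observed.get(tree, "n").upper()
        # Assumption: missing or non-ACGT data is compatible with any base.
        return {base} if base in BASES else set(BASES)
    left, right = (fitch_root_set(child, observed) for child in tree)
    common = left & right
    # Intersection if the children agree on something, otherwise union.
    return common if common else left | right

def fitch_call(tree, observed):
    """Return the root base if it is unique, else "n"."""
    states = fitch_root_set(tree, observed)
    return next(iter(states)) if len(states) == 1 else "n"

# Same topology as "(((bonobo,chimp),gorilla),macaque)".
tree = ((("bonobo", "chimp"), "gorilla"), "macaque")

site = {"bonobo": "A", "chimp": "A", "gorilla": "G", "macaque": "A"}
print(fitch_call(tree, site))
```

In this example gorilla disagrees, but macaque supports A deeper in the tree, so the root set collapses to a single base.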

Likelihood

The same outgroup configuration with the likelihood method and an HKY85 substitution model:

focal_species: human
chromosome_lengths: chromoLens.txt
outgroups:
  inner:
    - name: bonobo
      alignment: hg38.panPan3.net.axt.gz
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
    - name: gorilla
      alignment: hg38.gorGor6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz
method: likelihood
tree: "(((bonobo:0.008,chimp:0.008):0.002,gorilla:0.009):0.020,macaque:0.038)"
substitution_model: HKY85
model_kappa: 2.0
model_base_freqs: [0.3, 0.2, 0.2, 0.3]
output_dir: ./human_ancestral_likelihood

When method is likelihood, branch lengths in the tree are critical — they determine how much weight each leaf’s observation receives. The inner/outer distinction is ignored; all outgroups are placed on the tree.
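
To see why branch lengths matter, consider the default JC69 model, under which the probability that a base is unchanged along a branch of length t (substitutions per site) is 1/4 + (3/4)·exp(-4t/3), and the probability of each specific change is 1/4 - (1/4)·exp(-4t/3). The sketch below evaluates these standard formulas for two branch lengths from the example tree; the function name jc69_prob is hypothetical and this is not ancify's likelihood code.

```python
import math

def jc69_prob(same, branch_length):
    """JC69 transition probability over a branch.

    same=True  -> probability the base is unchanged
    same=False -> probability of one specific different base
    branch_length is in expected substitutions per site.
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

# Branch lengths taken from the example tree above.
for name, t in [("chimp", 0.008), ("macaque", 0.038)]:
    print(f"{name}: P(unchanged) = {jc69_prob(True, t):.4f}, "
          f"P(specific change) = {jc69_prob(False, t):.4f}")
```

A leaf on a short branch is more likely to still carry the ancestral base, so its observation shifts the posterior more than that of a leaf on a long branch.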

ML classifier

The ML method requires a two-step workflow: train first, then call.

Step 1: Train (uses high-confidence voting sites as labels by default):

focal_species: human
chromosome_lengths: chromoLens.txt
outgroups:
  inner:
    - name: bonobo
      alignment: hg38.panPan3.net.axt.gz
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
    - name: gorilla
      alignment: hg38.gorGor6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz
output_dir: ./human_ancestral_ml

Save this as config.yaml, then run:

ancify train -c config.yaml -o model.lgb

To use an external reference (e.g. Ensembl EPO) as training labels instead, add:

ml_training_reference: /data/ensembl_ancestor/

The reference directory should contain one FASTA per chromosome (e.g. chr1.fa).

Step 2: Call (add the trained model path and set method: ml):

method: ml
ml_model_path: model.lgb

# Optional: tune confidence thresholds (defaults shown)
ml_high_threshold: 0.8
ml_low_threshold: 0.5

Then run:

ancify call -c config.yaml

Quick test run (single chromosome)

focal_species: human
chromosome_lengths: chromoLens.txt
chromosomes:
  - chr22
outgroups:
  inner:
    - name: chimp
      alignment: hg38.panTro6.net.axt.gz
  outer:
    - name: macaque
      alignment: hg38.rheMac10.net.axt.gz
output_dir: ./test_run
num_cpus: 1