API Reference

ancify can be used as a Python library for programmatic access to the ancestral calling pipeline.

Quick example

from ancify.config import load_config
from ancify.project import run_projection
from ancify.ancestral import run_ancestral_calling, call_ancestral_base

# Run the full pipeline
cfg = load_config("config.yaml")
run_projection(cfg)
run_ancestral_calling(cfg)

# Or call the core function directly
base = call_ancestral_base(
    inner_bases=["A", "A", "G"],
    outer_bases=["A"],
)
# Returns "A" (high confidence)

ancify.utils

I/O utilities and common helpers for ancify.

ancify.utils.read_fasta(path)[source]

Read a single-record FASTA file and return (header, sequence).

Handles gzip-compressed files transparently. Multi-line sequences are concatenated into a single string.
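The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not ancify's implementation: the name read_fasta_sketch is hypothetical, and detecting gzip by the .gz extension is an assumption.

```python
import gzip
from pathlib import Path


def read_fasta_sketch(path):
    """Read a single-record FASTA file and return (header, sequence)."""
    path = Path(path)
    # Assumption: gzip compression is detected from the ".gz" extension.
    opener = gzip.open if path.suffix == ".gz" else open
    header = None
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here; stop reading
                header = line[1:]
            else:
                chunks.append(line)
    # Multi-line sequences are joined into one string.
    return header, "".join(chunks)
```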

ancify.utils.write_fasta(path, header, sequence)[source]

Write a single-record FASTA file.

ancify.utils.read_chromosome_lengths(path)[source]

Read chromosome lengths from a tab-separated file.

Expects at least two columns per line: chromosome_name <TAB> length. Additional columns (e.g. GenBank accession, RefSeq) are ignored. Returns a dict mapping chromosome name to integer length.
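As an illustration of the expected file layout, here is a small sketch of such a parser. The function name is hypothetical, and skipping blank or short lines is an assumption rather than documented behaviour.

```python
def read_chromosome_lengths_sketch(path):
    """Return {chromosome_name: length} from a tab-separated file."""
    lengths = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Assumption: blank or short lines are skipped silently.
            if len(fields) < 2 or not fields[0]:
                continue
            # Only the first two columns are used; extras are ignored.
            lengths[fields[0]] = int(fields[1])
    return lengths
```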

ancify.utils.majority_vote(bases, min_freq=1)[source]

Return the most frequent valid nucleotide among bases, or 'N'.

Only bases in {A, C, G, T} are considered. Ties are broken in alphabetical order (A > C > G > T priority) for reproducibility. If no allele reaches min_freq occurrences, returns 'N'.
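The voting rule can be sketched in a few lines. This is an illustrative reimplementation under the rules stated above, not the library source; the name majority_vote_sketch is hypothetical, and treating only uppercase characters as valid is an assumption.

```python
from collections import Counter

VALID_BASES = frozenset("ACGT")


def majority_vote_sketch(bases, min_freq=1):
    """Return the most frequent valid base, or 'N' if none qualifies."""
    counts = Counter(b for b in bases if b in VALID_BASES)
    if not counts:
        return "N"
    # Highest count wins; ties go to the alphabetically first base.
    base, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return base if count >= min_freq else "N"
```

For example, majority_vote_sketch(["G", "A"]) resolves the tie to "A", while majority_vote_sketch(["A", "C"], min_freq=2) yields "N" because no allele occurs twice.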

ancify.utils.chrom_id(chrom)[source]

Strip a leading chr prefix, if present.

Useful for mapping chromosome names between naming conventions, e.g. chr1 -> 1, chrX -> X, 3 -> 3.
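A one-line equivalent of this mapping, shown as a sketch with a hypothetical name:

```python
def chrom_id_sketch(chrom):
    """Strip a leading 'chr' prefix if present; otherwise return unchanged."""
    return chrom[3:] if chrom.startswith("chr") else chrom
```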

ancify.config

Configuration loading and validation for ancify.

class ancify.config.OutgroupSpec(name, alignment)[source]

Specification for a single outgroup species.

Parameters:

name (str)
alignment (str)

name: str

alignment: str

class ancify.config.EvaluationConfig(reference_dir=None, reference_pattern='{chrom}.fa', vcf_dir=None, vcf_pattern='{chrom}.vcf.gz')[source]

Optional evaluation settings.

Parameters:

reference_dir (str | None)
reference_pattern (str)
vcf_dir (str | None)
vcf_pattern (str)

reference_dir: str | None = None

reference_pattern: str = '{chrom}.fa'

vcf_dir: str | None = None

vcf_pattern: str = '{chrom}.vcf.gz'

class ancify.config.PipelineConfig(focal_species, chromosome_lengths, outgroups_inner, outgroups_outer, chromosomes=None, work_dir='.', output_dir='./ancestral_calls', min_inner_freq=1, min_outer_freq=1, num_cpus=4, backend='auto', gpu_devices=None, method='voting', tree=None, ml_model_path=None, ml_training_reference=None, ml_high_threshold=0.8, ml_low_threshold=0.5, substitution_model='JC69', model_kappa=2.0, model_base_freqs=None, model_rates=None, likelihood_high_threshold=0.8, likelihood_low_threshold=0.5, evaluation=None)[source]

Complete pipeline configuration.

Parameters:

focal_species (str)
chromosome_lengths (str)
outgroups_inner (List[OutgroupSpec])
outgroups_outer (List[OutgroupSpec])
chromosomes (List[str] | None)
work_dir (str)
output_dir (str)
min_inner_freq (int)
min_outer_freq (int)
num_cpus (int)
backend (str)
gpu_devices (List[int] | None)
method (str)
tree (str | None)
ml_model_path (str | None)
ml_training_reference (str | None)
ml_high_threshold (float)
ml_low_threshold (float)
substitution_model (str)
model_kappa (float)
model_base_freqs (List[float] | None)
model_rates (List[float] | None)
likelihood_high_threshold (float)
likelihood_low_threshold (float)
evaluation (EvaluationConfig | None)

focal_species: str

chromosome_lengths: str

outgroups_inner: List[OutgroupSpec]

outgroups_outer: List[OutgroupSpec]

chromosomes: List[str] | None = None

work_dir: str = '.'

output_dir: str = './ancestral_calls'

min_inner_freq: int = 1

min_outer_freq: int = 1

num_cpus: int = 4

backend: str = 'auto'

gpu_devices: List[int] | None = None

method: str = 'voting'

tree: str | None = None

ml_model_path: str | None = None

ml_training_reference: str | None = None

ml_high_threshold: float = 0.8

ml_low_threshold: float = 0.5

substitution_model: str = 'JC69'

model_kappa: float = 2.0

model_base_freqs: List[float] | None = None

model_rates: List[float] | None = None

likelihood_high_threshold: float = 0.8

likelihood_low_threshold: float = 0.5

evaluation: EvaluationConfig | None = None

resolve_chromosomes()[source]

Return the list of chromosomes to process.

If chromosomes was not set explicitly, reads all chromosome names from the chromosome-lengths file.

property all_outgroups: List[OutgroupSpec]

All outgroup species (inner + outer), deduplicated by name.
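Deduplication by name might look like the sketch below. OutgroupStub is a stand-in dataclass mirroring the documented name and alignment fields, and keeping the first occurrence in inner-then-outer order is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OutgroupStub:
    """Stand-in for OutgroupSpec, used only in this sketch."""
    name: str
    alignment: str


def all_outgroups_sketch(inner, outer):
    """Merge inner and outer outgroups, keeping the first entry per name."""
    seen = set()
    merged = []
    for spec in list(inner) + list(outer):
        if spec.name not in seen:
            seen.add(spec.name)
            merged.append(spec)
    return merged
```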

ancify.config.load_config(path)[source]

Load a YAML configuration file and return a PipelineConfig.

ancify.config.validate_config(cfg)[source]

Check that a PipelineConfig is self-consistent.

Raises ValueError with a descriptive message on any problem.

ancify.project

Phase 1: Project outgroup alignments onto focal-species coordinates.

Reads pairwise net-AXT alignment files and produces one FASTA file per chromosome, where each position corresponds to a position in the focal species reference. Unaligned positions are filled with N.

The implementation uses a two-pass strategy:

Parse – read the AXT file and collect block metadata for the target chromosome (fast sequential I/O, optionally isal-accelerated for gzip).
Scatter – fill the output array using vectorized NumPy operations (CPU) or batched PyTorch scatter on a CUDA device (GPU).
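The two passes can be illustrated on the CPU path with a short NumPy sketch. It assumes the standard AXT block layout (a summary line, the target sequence, the query sequence, then a blank line, with 1-based target coordinates). The function names are hypothetical, and writing N where the outgroup has a gap is an assumption, not documented behaviour.

```python
import numpy as np


def parse_axt_blocks(lines, target_chrom):
    """Pass 1: collect (target_start, target_seq, query_seq) per block."""
    blocks = []
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # summary line: id, tName, tStart, tEnd, ...
        target_seq = next(it).strip()
        query_seq = next(it).strip()
        if fields[1] == target_chrom:
            blocks.append((int(fields[2]), target_seq, query_seq))
    return blocks


def scatter_blocks(blocks, chrom_length):
    """Pass 2: write outgroup bases into an N-filled array."""
    out = np.full(chrom_length, ord("N"), dtype=np.uint8)
    for start, target_seq, query_seq in blocks:
        t = np.frombuffer(target_seq.encode("ascii"), dtype=np.uint8)
        q = np.frombuffer(query_seq.encode("ascii"), dtype=np.uint8)
        keep = t != ord("-")  # columns that consume a focal position
        bases = q[keep].copy()
        bases[bases == ord("-")] = ord("N")  # assumption: query gaps become N
        pos = (start - 1) + np.arange(bases.size)  # 0-based focal positions
        out[pos] = bases
    return out.tobytes().decode("ascii")
```

Columns where the target has a gap are insertions in the outgroup and have no focal coordinate, so they are dropped before scattering.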

ancify.project.project_alignment(axt_path, chrom_length, target_chrom, device_id=None)[source]

Project one chromosome from an AXT alignment onto focal coordinates.

Parameters:

axt_path (str or Path) – Path to the (optionally gzip-compressed) net-AXT alignment file.
chrom_length (int) – Length of target_chrom in the focal species reference.
target_chrom (str) – Chromosome name to extract (must match the target field in the AXT).
device_id (int or None) – CUDA device index for GPU-accelerated scatter (None = CPU).

Returns:

Outgroup sequence in focal-species coordinates (length == chrom_length). Positions not covered by any alignment block contain N.

Return type:

str

ancify.project.run_projection(config)[source]

Execute Phase 1 for all outgroup species and chromosomes.

Creates <work_dir>/projected/<species_name>/<chrom>.fa for every combination of outgroup species and chromosome.
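The output layout can be enumerated as follows. This sketch only builds paths according to the pattern above; the function name is hypothetical and the species and chromosome names in the usage example are illustrative.

```python
from itertools import product
from pathlib import PurePosixPath


def projected_paths_sketch(work_dir, species_names, chromosomes):
    """List <work_dir>/projected/<species_name>/<chrom>.fa for every pair."""
    return [
        PurePosixPath(work_dir) / "projected" / species / f"{chrom}.fa"
        for species, chrom in product(species_names, chromosomes)
    ]
```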

ancify.ancestral

Phase 2: Infer ancestral alleles from projected outgroup sequences.

Uses a two-tier outgroup voting scheme:

Inner outgroup – closely related species; the most frequent nucleotide among them forms the inner consensus.
Outer outgroup – more distantly related species; serves as an independent confirmation.

Confidence is encoded via letter case in the output FASTA:

Char	Confidence	Condition
`ACGT`	High	Inner and outer outgroups agree
`acgt`	Low	Only one tier has data
`n`	Unresolved	Inner and outer disagree
`N`	Missing	Both tiers lack data
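The table translates into a small decision function. The sketch below is an illustration of the rules as tabulated, not the library's code; the function names are hypothetical, and a tier is assumed to "have data" when its majority vote yields a valid base.

```python
from collections import Counter

VALID = frozenset("ACGT")


def tier_consensus(bases, min_freq=1):
    """Majority vote within one tier; 'N' when no base qualifies."""
    counts = Counter(b for b in bases if b in VALID)
    if not counts:
        return "N"
    base, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return base if n >= min_freq else "N"


def call_two_tier_sketch(inner_bases, outer_bases,
                         min_inner_freq=1, min_outer_freq=1):
    """Combine the inner and outer consensus into a case-encoded call."""
    inner = tier_consensus(inner_bases, min_inner_freq)
    outer = tier_consensus(outer_bases, min_outer_freq)
    if inner == "N" and outer == "N":
        return "N"  # missing: both tiers lack data
    if inner == "N" or outer == "N":
        # low confidence: only one tier has data
        return (outer if inner == "N" else inner).lower()
    if inner == outer:
        return inner  # high confidence: tiers agree
    return "n"  # unresolved: tiers disagree
```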

ancify.ancestral.call_ancestral_base(inner_bases, outer_bases, min_inner_freq=1, min_outer_freq=1)[source]

Infer the ancestral allele at a single position.

Parameters:

inner_bases (list of str) – Nucleotides from the inner (closely related) outgroup species.
outer_bases (list of str) – Nucleotides from the outer (distantly related) outgroup species.
min_inner_freq (int) – Minimum allele count to accept a majority-vote consensus among the inner outgroups.
min_outer_freq (int) – Minimum allele count to accept a majority-vote consensus among the outer outgroups.

Returns:

Single character with case-encoded confidence (see the confidence table in the module description above).

Return type:

str

ancify.ancestral.call_ancestral_base_parsimony(tree, species_bases)[source]

Infer the ancestral allele at a single position using Fitch parsimony.

Parameters:

tree (TreeNode) – Phylogenetic tree of outgroup species.
species_bases (dict) – Mapping of species name → observed nucleotide (single character).

Returns:

Single character with case-encoded confidence: uppercase = unambiguous, lowercase = ambiguous, N = all missing.

Return type:

str
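The case encoding of the parsimony result can be sketched as below, given a root allele and an ambiguity flag such as those returned by fitch_ancestral. The helper name is hypothetical, and "all missing" is assumed to mean that no species carries a base in {A, C, G, T}.

```python
VALID = frozenset("ACGT")


def encode_parsimony_call(root_allele, is_ambiguous, species_bases):
    """Map a Fitch root state to the documented case encoding."""
    # Assumption: a position is "all missing" when no leaf has a valid base.
    if not any(base in VALID for base in species_bases.values()):
        return "N"
    return root_allele.lower() if is_ambiguous else root_allele.upper()
```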

ancify.ancestral.run_ancestral_calling(config)[source]

Execute Phase 2: call ancestral alleles for every chromosome.

Reads projected FASTA files from <work_dir>/projected/<species>/ and writes ancestral FASTA files to <output_dir>/<chrom>.fa.

Supported methods: "voting" (default), "parsimony", "ml".

ancify.parsimony

Fitch parsimony for ancestral allele reconstruction on a phylogenetic tree.

Implements the Fitch (1971) algorithm:

Bottom-up (post-order): At each leaf, assign the observed allele as a singleton set. N is treated as {A, C, G, T} (compatible with everything). At each internal node, take the intersection of children’s sets if non-empty, otherwise their union.
Top-down (pre-order): Starting at the root, pick a concrete allele from the node’s set (preferring the parent’s assignment for determinism, breaking ties alphabetically). Propagate down.

The root allele is the inferred ancestral state. Ambiguity at the root (set size > 1 after the bottom-up pass) indicates multiple equally parsimonious reconstructions.
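A compact sketch of the bottom-up pass and the root decision follows. To stay self-contained it represents the tree as nested tuples of leaf names rather than TreeNode objects, and the function names are hypothetical. Folding more than two children pairwise is an assumption for non-binary nodes.

```python
ALL_BASES = frozenset("ACGT")


def fitch_set(node, leaf_alleles):
    """Bottom-up pass: return the Fitch set of a nested-tuple tree node."""
    if isinstance(node, str):  # a leaf, identified by its species name
        allele = leaf_alleles.get(node, "N")
        # N (or any non-ACGT character) is compatible with every base.
        return frozenset([allele]) if allele in ALL_BASES else ALL_BASES
    sets = [fitch_set(child, leaf_alleles) for child in node]
    result = sets[0]
    for other in sets[1:]:
        common = result & other
        # Intersection if non-empty, otherwise union.
        result = common if common else result | other
    return result


def fitch_root_sketch(tree, leaf_alleles):
    """Return (root_allele, is_ambiguous) for the given tree."""
    root_set = fitch_set(tree, leaf_alleles)
    # The root has no parent, so ties are broken alphabetically.
    return min(root_set), len(root_set) > 1
```

With the tree (("chimp", "gorilla"), "macaque") and alleles A, G, A, the inner clade yields {A, G}, the intersection with {A} gives {A}, and the root is the unambiguous "A".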

class ancify.parsimony.TreeNode(name=None, children=<factory>, branch_length=None)[source]

A node in a rooted or unrooted phylogenetic tree.

Leaves have a name matching an outgroup species identifier. Internal nodes have name=None. Branch lengths are parsed but not used by the Fitch algorithm.

Parameters:

name (str | None)
children (List[TreeNode])
branch_length (float | None)

name: str | None = None

children: List[TreeNode]

branch_length: float | None = None

property is_leaf: bool

leaf_names()[source]

Return all leaf names in pre-order.

Return type:

List[str]

ancify.parsimony.get_leaf_names(tree)[source]

Return all leaf names from tree (module-level convenience wrapper).

Parameters:

tree (TreeNode)

Return type:

List[str]

ancify.parsimony.parse_newick(text)[source]

Parse a Newick-format string into a TreeNode tree.

Supports optional branch lengths (name:length) and nested clades. Whitespace is ignored. The trailing semicolon is optional.

Examples:

>>> tree = parse_newick("((A,B),C);")
>>> tree.leaf_names()
['A', 'B', 'C']

Parameters:

text (str)

Return type:

TreeNode

ancify.parsimony.fitch_bottom_up(node, leaf_alleles)[source]

Post-order traversal: compute the Fitch set at every node.

Parameters:

node (TreeNode) – Root of the (sub)tree.
leaf_alleles (dict) – Mapping of leaf name to observed allele (single uppercase character). 'N' or any character not in {A, C, G, T} is treated as the full set (wildcard).

Returns:

Mapping of id(node) to a frozenset of possible alleles.

Return type:

dict

ancify.parsimony.fitch_top_down(node, node_sets, parent_allele=None)[source]

Pre-order traversal: assign a concrete allele at each node.

Deterministic tie-breaking: prefer the parent’s allele if it is in the node’s Fitch set; otherwise pick the lexicographically smallest.

Returns:

Mapping of id(node) to the assigned allele character.

Return type:

dict

Parameters:

node (TreeNode)
node_sets (Dict[int, Set[str]])
parent_allele (str | None)
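The tie-breaking rule for a single node reduces to a few lines. This sketch uses a hypothetical helper name and operates on one Fitch set at a time rather than on the whole tree.

```python
def pick_allele_sketch(fitch_set, parent_allele=None):
    """Choose one allele from a node's Fitch set, deterministically."""
    # Prefer the parent's assignment when the node's set allows it.
    if parent_allele is not None and parent_allele in fitch_set:
        return parent_allele
    # Otherwise fall back to the lexicographically smallest allele.
    return min(fitch_set)
```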

ancify.parsimony.fitch_ancestral(tree, leaf_alleles)[source]

Run the full Fitch algorithm and return the inferred root state.

Parameters:

tree (TreeNode) – The phylogenetic tree of outgroup species.
leaf_alleles (dict) – Mapping of leaf name to observed allele at this position.

Returns:

(root_allele, is_ambiguous) where is_ambiguous is True when the root’s Fitch set contained more than one allele.

Return type:

tuple of (str, bool)

ancify.evaluate

Phase 3: Evaluate ancestral calls against a reference and/or VCF variants.

All evaluation steps are optional and driven by the evaluation block in the pipeline configuration file:

Coverage statistics are produced whenever evaluation runs (no extra data needed).
Reference comparison requires a directory of reference ancestral FASTA files (e.g. Ensembl EPO) and a filename pattern.
VCF comparison requires a directory of VCF files and a filename pattern, plus the scikit-allel package.

ancify.evaluate.compute_coverage_stats(sequence)[source]

Compute coverage and confidence statistics for an ancestral sequence.
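The return structure of this function is not documented here. As a rough illustration of what such statistics could contain, the sketch below counts the four confidence classes used in the ancestral FASTA encoding; the function name and dictionary keys are hypothetical.

```python
def coverage_stats_sketch(sequence):
    """Count confidence classes in a case-encoded ancestral sequence."""
    total = len(sequence)
    high = sum(c in "ACGT" for c in sequence)  # uppercase calls
    low = sum(c in "acgt" for c in sequence)   # lowercase calls
    unresolved = sequence.count("n")
    missing = sequence.count("N")
    called = high + low
    return {
        "total": total,
        "high": high,
        "low": low,
        "unresolved": unresolved,
        "missing": missing,
        "called_fraction": called / total if total else 0.0,
    }
```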

ancify.evaluate.compare_to_reference(test_seq, ref_seq, positions=None)[source]

Compare two ancestral sequences, optionally at specific positions.

ancify.evaluate.compare_to_vcf(ancestral_seq, vcf_path)[source]

Compare ancestral calls against VCF REF/ALT alleles.

Requires scikit-allel (install with pip install scikit-allel).

ancify.evaluate.run_evaluation(config)[source]

Execute Phase 3: evaluate ancestral calls for every chromosome.

Writes per-chromosome summary files to <output_dir>/evaluation/.

ancify.cli

Command-line interface for ancify.

Usage:

ancify init [-o config.yaml]         # generate a template config
ancify project -c config.yaml        # Phase 1: project alignments
ancify call -c config.yaml           # Phase 2: call ancestral states
ancify evaluate -c config.yaml       # Phase 3: evaluate calls
ancify run -c config.yaml            # run all phases
ancify train -c config.yaml [-o model.lgb]  # train ML model

The package can also be invoked as python -m ancify <cmd> ....
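The subcommand layout shown in the usage block maps naturally onto argparse subparsers. The sketch below mirrors only the commands and short options listed above; the long option names and default values are assumptions, and this is not the actual parser in ancify.cli.

```python
import argparse


def build_parser_sketch():
    """Build a parser with the documented subcommands and short options."""
    parser = argparse.ArgumentParser(prog="ancify")
    sub = parser.add_subparsers(dest="command", required=True)

    init = sub.add_parser("init", help="generate a template config")
    # Assumption: the template is written to config.yaml by default.
    init.add_argument("-o", "--output", default="config.yaml")

    for name in ("project", "call", "evaluate", "run"):
        cmd = sub.add_parser(name)
        cmd.add_argument("-c", "--config", required=True)

    train = sub.add_parser("train", help="train ML model")
    train.add_argument("-c", "--config", required=True)
    # Assumption: model.lgb is the default output path.
    train.add_argument("-o", "--output", default="model.lgb")
    return parser
```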

ancify.cli.cmd_init(args)[source]

ancify.cli.cmd_project(args)[source]

ancify.cli.cmd_call(args)[source]

ancify.cli.cmd_evaluate(args)[source]

ancify.cli.cmd_train(args)[source]

ancify.cli.cmd_run(args)[source]

ancify.cli.main()[source]