API Reference
ancify can be used as a Python library for programmatic access to the ancestral calling pipeline.
Quick example
from ancify.config import load_config
from ancify.project import run_projection
from ancify.ancestral import run_ancestral_calling, call_ancestral_base
# Run the full pipeline
cfg = load_config("config.yaml")
run_projection(cfg)
run_ancestral_calling(cfg)
# Or call the core function directly
base = call_ancestral_base(
    inner_bases=["A", "A", "G"],
    outer_bases=["A"],
)
# Returns "A" (high confidence)
ancify.utils
I/O utilities and common helpers for ancify.
- ancify.utils.read_fasta(path)[source]
Read a single-record FASTA file and return (header, sequence).
Handles gzip-compressed files transparently. Multi-line sequences are concatenated into a single string.
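The behaviour described above can be sketched in a few lines of standard-library Python. This is not the ancify implementation: `read_fasta_sketch` is a hypothetical name, and detecting gzip by magic bytes, stripping the leading `>` from the header, and raising on a second record are all assumptions made for illustration.

```python
import gzip
from pathlib import Path


def read_fasta_sketch(path):
    """Hypothetical stand-in for read_fasta: return (header, sequence)."""
    path = Path(path)
    # Assumption: detect gzip by its two magic bytes, not by file extension.
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open

    header = None
    chunks = []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: a second header line is treated as an error.
                if header is not None:
                    raise ValueError("expected a single-record FASTA file")
                header = line[1:]
            else:
                # Multi-line sequences are collected and joined at the end.
                chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)
```

Joining the chunks once at the end avoids repeated string concatenation on chromosome-sized sequences.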
- ancify.utils.read_chromosome_lengths(path)[source]
Read chromosome lengths from a tab-separated file.
Expects at least two columns per line: chromosome_name <TAB> length. Additional columns (e.g. GenBank accession, RefSeq) are ignored. Returns a dict mapping chromosome name to integer length.
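As a rough illustration of the expected file layout, the parser below reads the first two tab-separated columns and ignores the rest. The function name `read_chromosome_lengths_sketch` is hypothetical, and skipping blank lines and rejecting short lines are assumptions rather than documented behaviour.

```python
def read_chromosome_lengths_sketch(path):
    """Hypothetical stand-in: map chromosome name -> integer length."""
    lengths = {}
    with open(path) as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # assumption: blank lines are skipped
            fields = line.split("\t")
            if len(fields) < 2:
                raise ValueError(f"line {line_no}: expected at least 2 columns")
            # Columns beyond the second (e.g. accessions) are ignored.
            lengths[fields[0]] = int(fields[1])
    return lengths
```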
- ancify.utils.majority_vote(bases, min_freq=1)[source]
Return the most frequent valid nucleotide among bases, or 'N'.
Only bases in {A, C, G, T} are considered. Ties are broken in alphabetical order (A > C > G > T priority) for reproducibility. If no allele reaches min_freq occurrences, returns 'N'.
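The voting rule can be sketched as follows. This is an illustrative reimplementation under the rules stated above, not the library source; `majority_vote_sketch` is a hypothetical name, and input bases are assumed to be upper case already.

```python
from collections import Counter

VALID = "ACGT"  # alphabetical order doubles as the tie-break priority


def majority_vote_sketch(bases, min_freq=1):
    """Most frequent valid base, or 'N' if none reaches min_freq."""
    counts = Counter(b for b in bases if b in VALID)
    if not counts:
        return "N"
    # max() returns the first maximal item, so iterating over "ACGT"
    # resolves ties in favour of the alphabetically earlier base.
    best = max(VALID, key=lambda b: counts[b])
    return best if counts[best] >= min_freq else "N"
```

For example, `majority_vote_sketch(["G", "A"])` yields `"A"` because both bases occur once and A has priority.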
ancify.config
Configuration loading and validation for ancify.
- class ancify.config.OutgroupSpec(name, alignment)[source]
Specification for a single outgroup species.
- class ancify.config.EvaluationConfig(reference_dir=None, reference_pattern='{chrom}.fa', vcf_dir=None, vcf_pattern='{chrom}.vcf.gz')[source]
Optional evaluation settings.
- Parameters:
- class ancify.config.PipelineConfig(focal_species, chromosome_lengths, outgroups_inner, outgroups_outer, chromosomes=None, work_dir='.', output_dir='./ancestral_calls', min_inner_freq=1, min_outer_freq=1, num_cpus=4, backend='auto', gpu_devices=None, method='voting', tree=None, ml_model_path=None, ml_training_reference=None, ml_high_threshold=0.8, ml_low_threshold=0.5, substitution_model='JC69', model_kappa=2.0, model_base_freqs=None, model_rates=None, likelihood_high_threshold=0.8, likelihood_low_threshold=0.5, evaluation=None)[source]
Complete pipeline configuration.
- Parameters:
focal_species (str)
chromosome_lengths (str)
outgroups_inner (List[OutgroupSpec])
outgroups_outer (List[OutgroupSpec])
work_dir (str)
output_dir (str)
min_inner_freq (int)
min_outer_freq (int)
num_cpus (int)
backend (str)
method (str)
tree (str | None)
ml_model_path (str | None)
ml_training_reference (str | None)
ml_high_threshold (float)
ml_low_threshold (float)
substitution_model (str)
model_kappa (float)
likelihood_high_threshold (float)
likelihood_low_threshold (float)
evaluation (EvaluationConfig | None)
- outgroups_inner: List[OutgroupSpec]
- outgroups_outer: List[OutgroupSpec]
- evaluation: EvaluationConfig | None = None
- resolve_chromosomes()[source]
Return the list of chromosomes to process.
If chromosomes was not set explicitly, reads all chromosome names from the chromosome-lengths file.
- property all_outgroups: List[OutgroupSpec]
All outgroup species (inner + outer), deduplicated by name.
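One way to deduplicate by name is sketched below. The `Outgroup` dataclass is a hypothetical stand-in for OutgroupSpec, and keeping the first occurrence while preserving inner-then-outer order is an assumption; the documentation above only states that duplicates are removed by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outgroup:
    """Hypothetical stand-in for OutgroupSpec(name, alignment)."""
    name: str
    alignment: str


def dedupe_by_name(inner, outer):
    """Merge inner and outer outgroups, keeping the first spec per name."""
    seen = {}
    for spec in list(inner) + list(outer):
        # dict preserves insertion order; setdefault keeps the first spec.
        seen.setdefault(spec.name, spec)
    return list(seen.values())
```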
ancify.project
Phase 1: Project outgroup alignments onto focal-species coordinates.
Reads pairwise net-AXT alignment files and produces one FASTA file per
chromosome, where each position corresponds to a position in the focal
species reference. Unaligned positions are filled with N.
The implementation uses a two-pass strategy:
Parse – read the AXT file and collect block metadata for the target chromosome (fast sequential I/O, optionally isal-accelerated for gzip).
Scatter – fill the output array using vectorized NumPy operations (CPU) or batched PyTorch scatter on a CUDA device (GPU).
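The CPU scatter step can be illustrated with a small NumPy sketch. It is not the ancify code: the function names are hypothetical, block starts are assumed to be already converted to 0-based focal coordinates, and writing query gaps as N is an assumption.

```python
import numpy as np


def scatter_block(out, target_start, target_aln, query_aln):
    """Write one alignment block into `out` (a uint8 array of ASCII codes).

    target_start is the 0-based focal position of the first non-gap
    target column (assumption: AXT coordinates were converted beforehand).
    """
    t = np.frombuffer(target_aln.encode("ascii"), dtype=np.uint8)
    q = np.frombuffer(query_aln.encode("ascii"), dtype=np.uint8)
    # Only columns where the target is not a gap consume a focal position.
    keep = t != ord("-")
    # Focal index of every column: running count of non-gap target columns.
    pos = target_start + np.cumsum(keep) - 1
    # Assumption: a gap in the query is recorded as N.
    q = np.where(q == ord("-"), ord("N"), q).astype(np.uint8)
    out[pos[keep]] = q[keep]


def project_blocks(chrom_length, blocks):
    """Fill an all-N array from (start, target_aln, query_aln) blocks."""
    out = np.full(chrom_length, ord("N"), dtype=np.uint8)
    for start, target_aln, query_aln in blocks:
        scatter_block(out, start, target_aln, query_aln)
    return out.tobytes().decode("ascii")
```

Computing all indices with a cumulative sum lets a whole block be written in one fancy-indexed assignment instead of a per-base Python loop.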
- ancify.project.project_alignment(axt_path, chrom_length, target_chrom, device_id=None)[source]
Project one chromosome from an AXT alignment onto focal coordinates.
- Parameters:
axt_path (str or Path) – Path to the (optionally gzip-compressed) net-AXT alignment file.
chrom_length (int) – Length of target_chrom in the focal species reference.
target_chrom (str) – Chromosome name to extract (must match the target field in the AXT).
device_id (int or None) – CUDA device index for GPU-accelerated scatter (None = CPU).
- Returns:
Outgroup sequence in focal-species coordinates (length == chrom_length). Positions not covered by any alignment block contain N.
- Return type:
ancify.ancestral
Phase 2: Infer ancestral alleles from projected outgroup sequences.
Uses a two-tier outgroup voting scheme:
Inner outgroup – closely related species; the most frequent nucleotide among them forms the inner consensus.
Outer outgroup – more distantly related species; serves as an independent confirmation.
Confidence is encoded via letter case in the output FASTA:
| Char | Confidence | Condition |
|---|---|---|
| | High | Inner and outer outgroups agree |
| | Low | Only one tier has data |
| | Unresolved | Inner and outer disagree |
| | Missing | Both tiers lack data |
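The decision logic in the table can be sketched as below. This is an illustration only: `two_tier_call` and `_vote` are hypothetical names, "has data" is assumed to mean that the tier's majority vote yields a base, and the sketch returns an explicit (base, confidence) pair instead of reproducing the single-character output encoding.

```python
from collections import Counter


def _vote(bases, min_freq):
    """Majority vote over A/C/G/T; None if no base reaches min_freq."""
    counts = Counter(b for b in bases if b in "ACGT")
    best = max("ACGT", key=lambda b: counts[b])  # ties: alphabetical
    return best if counts[best] >= max(min_freq, 1) else None


def two_tier_call(inner_bases, outer_bases, min_inner_freq=1, min_outer_freq=1):
    """Return (base or None, confidence label) following the table."""
    inner = _vote(inner_bases, min_inner_freq)
    outer = _vote(outer_bases, min_outer_freq)
    if inner is None and outer is None:
        return None, "missing"       # both tiers lack data
    if inner is None or outer is None:
        return inner or outer, "low"  # only one tier has data
    if inner == outer:
        return inner, "high"         # tiers agree
    return None, "unresolved"        # tiers disagree
```

With the inputs from the quick example, `two_tier_call(["A", "A", "G"], ["A"])` gives `("A", "high")`.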
- ancify.ancestral.call_ancestral_base(inner_bases, outer_bases, min_inner_freq=1, min_outer_freq=1)[source]
Infer the ancestral allele at a single position.
- Parameters:
inner_bases (list of str) – Nucleotides from the inner (closely related) outgroup species.
outer_bases (list of str) – Nucleotides from the outer (distantly related) outgroup species.
min_inner_freq (int) – Minimum allele count to accept a majority-vote consensus.
min_outer_freq (int) – Minimum allele count to accept a majority-vote consensus.
- Returns:
Single character with case-encoded confidence (see module docstring).
- Return type:
- ancify.ancestral.call_ancestral_base_parsimony(tree, species_bases)[source]
Infer the ancestral allele at a single position using Fitch parsimony.
- ancify.ancestral.run_ancestral_calling(config)[source]
Execute Phase 2: call ancestral alleles for every chromosome.
Reads projected FASTA files from <work_dir>/projected/<species>/ and writes ancestral FASTA files to <output_dir>/<chrom>.fa.
Supported methods: "voting" (default), "parsimony", "ml".
ancify.parsimony
Fitch parsimony for ancestral allele reconstruction on a phylogenetic tree.
Implements the Fitch (1971) algorithm:
Bottom-up (post-order): At each leaf, assign the observed allele as a singleton set. N is treated as {A, C, G, T} (compatible with everything). At each internal node, take the intersection of children’s sets if non-empty, otherwise their union.
Top-down (pre-order): Starting at the root, pick a concrete allele from the node’s set (preferring the parent’s assignment for determinism, breaking ties alphabetically). Propagate down.
The root allele is the inferred ancestral state. Ambiguity at the root (set size > 1 after the bottom-up pass) indicates multiple equally parsimonious reconstructions.
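The two passes can be sketched on a minimal tree type. This is an illustration of the Fitch procedure as described above, not the module's code: `Node`, `bottom_up` and `top_down` are hypothetical names, and leaves missing from the allele mapping are assumed to behave like N.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALL = frozenset("ACGT")


@dataclass
class Node:
    """Hypothetical minimal tree node (leaves have no children)."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def bottom_up(node, leaf_alleles, sets):
    """Post-order pass: store the Fitch set of every node in `sets`."""
    if not node.children:
        allele = leaf_alleles.get(node.name, "N")
        # N (or anything that is not A/C/G/T) is compatible with everything.
        sets[id(node)] = frozenset(allele) if allele in ALL else ALL
    else:
        child_sets = [bottom_up(c, leaf_alleles, sets) for c in node.children]
        common = frozenset.intersection(*child_sets)
        sets[id(node)] = common if common else frozenset.union(*child_sets)
    return sets[id(node)]


def top_down(node, sets, parent_allele=None, out=None):
    """Pre-order pass: pick one allele per node, preferring the parent's."""
    out = {} if out is None else out
    options = sets[id(node)]
    # min() over single letters picks the alphabetically smallest allele.
    allele = parent_allele if parent_allele in options else min(options)
    out[id(node)] = allele
    for child in node.children:
        top_down(child, sets, allele, out)
    return out
```

For the tree ((A,B),C) with alleles A, G, A, the inner node's set is {A, G} and the root's set is {A}, so the inferred ancestral state is A.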
- class ancify.parsimony.TreeNode(name=None, children=<factory>, branch_length=None)[source]
A node in an unrooted/rooted phylogenetic tree.
Leaves have a name matching an outgroup species identifier. Internal nodes have name=None. Branch lengths are parsed but not used by the Fitch algorithm.
- ancify.parsimony.get_leaf_names(tree)[source]
Return all leaf names from tree (module-level convenience wrapper).
- ancify.parsimony.parse_newick(text)[source]
Parse a Newick-format string into a TreeNode tree.
Supports optional branch lengths (name:length) and nested clades. Whitespace is ignored. The trailing semicolon is optional.
Examples:
>>> tree = parse_newick("((A,B),C);")
>>> tree.leaf_names()
['A', 'B', 'C']
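For readers curious how such a parser can work, here is a small recursive-descent sketch covering the features listed above (nested clades, optional branch lengths, ignored whitespace, optional trailing semicolon). It is not the library's parser: `Node`, `parse_newick_sketch` and `leaf_names` are hypothetical names, and unlike TreeNode this sketch would keep a label on an internal node if one were present.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal node; not the library's TreeNode."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    branch_length: Optional[float] = None


def parse_newick_sketch(text):
    s = "".join(text.split())  # drop all whitespace
    if s.endswith(";"):
        s = s[:-1]             # trailing semicolon is optional
    pos = 0

    def read_token():
        """Consume characters up to the next structural symbol."""
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        return s[start:pos]

    def parse_node():
        nonlocal pos
        node = Node()
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            node.children.append(parse_node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume "," and read the next sibling
                node.children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        label = read_token()
        if label:
            node.name = label
        if pos < len(s) and s[pos] == ":":
            pos += 1  # consume ":" and read the branch length
            node.branch_length = float(read_token())
        return node

    root = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters")
    return root


def leaf_names(node):
    """Leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]
```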
- ancify.parsimony.fitch_bottom_up(node, leaf_alleles)[source]
Post-order traversal: compute the Fitch set at every node.
- Parameters:
- Returns:
Mapping of id(node) to a frozenset of possible alleles.
- Return type:
- ancify.parsimony.fitch_top_down(node, node_sets, parent_allele=None)[source]
Pre-order traversal: assign a concrete allele at each node.
Deterministic tie-breaking: prefer the parent’s allele if it is in the node’s Fitch set; otherwise pick the lexicographically smallest.
ancify.evaluate
Phase 3: Evaluate ancestral calls against a reference and/or VCF variants.
All evaluation steps are optional and driven by the evaluation block
in the pipeline configuration file:
Coverage statistics are always produced (no extra data needed).
Reference comparison requires a directory of reference ancestral FASTA files (e.g. Ensembl EPO) and a filename pattern.
VCF comparison requires a directory of VCF files and a filename pattern, plus the scikit-allel package.
- ancify.evaluate.compute_coverage_stats(sequence)[source]
Compute coverage and confidence statistics for an ancestral sequence.
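A possible shape for such statistics is sketched below. The return keys and the function name are hypothetical, not the library's actual output. The sketch assumes upper-case A/C/G/T marks high-confidence calls (consistent with the quick example) and, as a further assumption, that lower-case a/c/g/t marks low-confidence calls; every other character counts as uncalled.

```python
def coverage_stats_sketch(sequence):
    """Hypothetical stand-in: count calls by confidence tier."""
    total = len(sequence)
    high = sum(1 for c in sequence if c in "ACGT")  # upper case
    low = sum(1 for c in sequence if c in "acgt")   # lower case (assumed)
    called = high + low
    return {
        "total": total,
        "high": high,
        "low": low,
        "uncalled": total - called,
        # Fraction of positions with any call; 0.0 for an empty sequence.
        "coverage": called / total if total else 0.0,
    }
```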
- ancify.evaluate.compare_to_reference(test_seq, ref_seq, positions=None)[source]
Compare two ancestral sequences, optionally at specific positions.
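The comparison might look roughly like the sketch below. The name `compare_sketch` and the returned keys are hypothetical, and three choices are assumptions rather than documented behaviour: case is ignored, positions are 0-based, and positions where either sequence lacks an A/C/G/T call are skipped.

```python
def compare_sketch(test_seq, ref_seq, positions=None):
    """Hypothetical stand-in: count matches between two sequences."""
    if len(test_seq) != len(ref_seq):
        raise ValueError("sequences must have equal length")
    # Compare everywhere unless specific 0-based positions are given.
    indices = range(len(test_seq)) if positions is None else positions
    compared = 0
    matches = 0
    for i in indices:
        t = test_seq[i].upper()
        r = ref_seq[i].upper()
        if t not in "ACGT" or r not in "ACGT":
            continue  # assumption: uncalled positions are not scored
        compared += 1
        if t == r:
            matches += 1
    return {
        "compared": compared,
        "matches": matches,
        "mismatches": compared - matches,
    }
```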
ancify.cli
Command-line interface for ancify.
Usage:
ancify init [-o config.yaml] # generate a template config
ancify project -c config.yaml # Phase 1: project alignments
ancify call -c config.yaml # Phase 2: call ancestral states
ancify evaluate -c config.yaml # Phase 3: evaluate calls
ancify run -c config.yaml # run all phases
ancify train -c config.yaml [-o model.lgb] # train ML model
The package can also be invoked as python -m ancify <cmd> ....