Changelog

1.5.0 (2026)

Documentation: tutorials for ancestral FASTAs and VCF polarization, plus a species catalogue.

Documentation

  • Tutorials. New Tutorial 4, Getting Ancestral FASTA Files: how to locate, verify, and read per-chromosome output, run sanity checks, and load all chromosomes, with a recap of the confidence encoding. New Tutorial 5, Polarizing VCF Variants: annotating REF/ALT as ancestral/derived with cyvcf2 and scikit-allel, a complete polarize_vcf.py script (AA/DAF INFO fields), an unfolded SFS example, and tips on confidence filtering.

  • Species guide. New Species catalogue with 115 commonly studied species: suggested inner and outer outgroups, UCSC assembly identifiers, and approximate divergence times. Grouped by clade (primates, rodents, carnivores, ungulates, birds, fish, Drosophila, plants, fungi, etc.). Includes a short section on generating your own net AXT alignments with lastz when UCSC data is unavailable.
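
As a rough illustration of the polarization idea covered in Tutorial 5, the sketch below decides which allele is derived and builds an unfolded SFS from plain tuples. It is not the tutorial's polarize_vcf.py script and does not use cyvcf2 or scikit-allel. The function names, the input tuple layout, and the rule that only uppercase ancestral calls are trusted are assumptions made for this example.

```python
from collections import Counter


def polarize(ref, alt, ancestral, alt_count, n_chrom, require_uppercase=True):
    """Return (derived_allele, derived_count), or None if the site cannot be polarized.

    Assumption: an uppercase ancestral base is a higher-confidence call than a
    lowercase one, so lowercase calls are skipped when require_uppercase is True.
    """
    if require_uppercase and not ancestral.isupper():
        return None
    aa = ancestral.upper()
    if aa == ref.upper():
        return alt, alt_count              # REF is ancestral, ALT is derived
    if aa == alt.upper():
        return ref, n_chrom - alt_count    # ALT is ancestral, REF is derived
    return None                            # N, or ancestral matches neither allele


def unfolded_sfs(sites, n_chrom):
    """Count segregating sites by derived-allele count (1 .. n_chrom - 1)."""
    counts = Counter()
    for ref, alt, ancestral, alt_count in sites:
        result = polarize(ref, alt, ancestral, alt_count, n_chrom)
        if result is None:
            continue
        _, derived_count = result
        if 0 < derived_count < n_chrom:
            counts[derived_count] += 1
    return [counts[k] for k in range(1, n_chrom)]


# Hypothetical sites: (REF, ALT, ancestral base, ALT allele count) in 6 chromosomes.
sites = [
    ("A", "G", "A", 2),   # REF ancestral: derived count 2
    ("C", "T", "T", 2),   # ALT ancestral: derived count 6 - 2 = 4
    ("G", "A", "g", 3),   # lowercase call, skipped by the confidence filter
    ("T", "C", "N", 1),   # no ancestral call
    ("A", "C", "A", 5),   # REF ancestral: derived count 5
]
sfs = unfolded_sfs(sites, n_chrom=6)
print(sfs)
```

The derived allele frequency for a polarized site is derived_count / n_chrom, which is the kind of value a DAF INFO field would carry.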


1.4.0 (2026)

Likelihood-based ancestral reconstruction and expanded installation docs.

Likelihood method (Felsenstein pruning)

  • New method: likelihood. Infers ancestral alleles using Felsenstein’s pruning algorithm on a user-supplied tree with branch lengths. Root posterior probabilities are computed under a continuous-time substitution model and mapped to the same case-encoded confidence scheme (uppercase / lowercase / n / N).

  • Substitution models. Four models are supported: JC69, K80, HKY85, and GTR. All use normalised rate matrices (one expected substitution per unit branch length). Transition probabilities are computed with scipy.linalg.expm, except for JC69, which has a closed-form shortcut.

  • New ancify.likelihood module. SubstitutionModel base class, JC69, K80, HKY85, GTR classes, felsenstein_pruning(), call_ancestral_base_likelihood(), _call_chromosome_likelihood() worker, and build_model() factory.

  • New config fields: substitution_model, model_kappa, model_base_freqs, model_rates, likelihood_high_threshold, likelihood_low_threshold. Validation requires a tree with leaf names matching outgroups; GTR requires six model_rates; base frequencies must sum to ~1.

  • Core dependency: added SciPy (>=1.7) for matrix exponentiation.
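
To make the likelihood method more concrete, here is a minimal sketch of Felsenstein pruning at a single site under JC69, using the closed-form transition probabilities. It is an illustration only and does not reproduce ancify.likelihood: the tree representation (nested lists of (child, branch_length) pairs), the function names, and the example species and branch lengths are all assumptions for this example.

```python
import math

BASES = "ACGT"


def jc69_prob(same, t):
    """JC69 transition probability over branch length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node, observed):
    """Post-order pass: entry x is P(data below node | node has state x).

    A node is either a leaf name (str) or a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        base = observed.get(node, "N").upper()
        if len(base) != 1 or base not in BASES:
            return [1.0] * 4               # missing data: every state is compatible
        return [1.0 if b == base else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child, observed)
        for i in range(4):
            result[i] *= sum(jc69_prob(i == j, t) * child_l[j] for j in range(4))
    return result


def root_posterior(tree, observed):
    """Posterior over root states, using the uniform JC69 equilibrium frequencies as prior."""
    weighted = [0.25 * l for l in partial_likelihoods(tree, observed)]
    total = sum(weighted)
    return {b: w / total for b, w in zip(BASES, weighted)}


# Hypothetical outgroup tree: ((chimp:0.01,gorilla:0.015):0.005,orangutan:0.03)
tree = [([("chimp", 0.01), ("gorilla", 0.015)], 0.005), ("orangutan", 0.03)]
post = root_posterior(tree, {"chimp": "A", "gorilla": "A", "orangutan": "G"})
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

In a pipeline like the one described above, the posterior of the best base would then be compared against likelihood_high_threshold and likelihood_low_threshold to choose the case encoding; that mapping is not shown here.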

Documentation and README

  • Algorithm docs. New section on the likelihood method: substitution models, Felsenstein pruning steps, worked example, and comparison with voting/parsimony/ML. Summary section updated for all four methods.

  • Configuration docs. Field reference and method table updated for likelihood; new subsection “Likelihood YAML” with branch-length requirement and GTR example; validation and config recipes updated.

  • Landing page (docs/index.rst). Likelihood added to the method list and “Why ancify?” bullet.

  • README. Major expansion: Installation now includes prerequisites, core install, optional extras table (evaluate, fast, ml, docs, dev, all), GPU acceleration (voting only), verify-install commands, and quick-reference table. README also updated for four methods throughout (intro, confidence encoding, key fields, How it works, CLI, project structure).

Tests

  • New tests/test_likelihood.py. Rate-matrix properties (row sums, detailed balance, normalisation), transition-probability properties (P(0)=I, rows sum to 1, equilibrium limit), Felsenstein pruning and posteriors, confidence encoding, agreement with parsimony on unambiguous cases.

  • tests/test_config.py. New TestValidateLikelihood for tree requirement, model name, GTR rates, base freqs, and thresholds.

  • tests/test_ancestral.py. New TestCallAncestralBaseLikelihood mirroring parsimony tests.
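
The rate-matrix and transition-probability properties named for tests/test_likelihood.py can be illustrated with a standalone check. The sketch below is not the project's test code. It builds a normalised HKY85 matrix with a hypothetical helper, hky85_rate_matrix, and verifies the listed properties with NumPy and scipy.linalg.expm. The base order (A, C, G, T), kappa, and the base frequencies are arbitrary choices for the example.

```python
import numpy as np
from scipy.linalg import expm


def hky85_rate_matrix(kappa, pi):
    """Normalised HKY85 rate matrix (hypothetical helper; base order A, C, G, T)."""
    pi = np.asarray(pi, dtype=float)
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}   # A<->G and C<->T
    q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i, j] = pi[j] * (kappa if (i, j) in transitions else 1.0)
    np.fill_diagonal(q, -q.sum(axis=1))              # rows must sum to zero
    scale = -np.dot(pi, np.diag(q))                  # expected substitutions per unit time
    return q / scale


pi = np.array([0.3, 0.2, 0.2, 0.3])
q = hky85_rate_matrix(kappa=2.0, pi=pi)

# Rate-matrix properties.
assert np.allclose(q.sum(axis=1), 0.0)               # row sums are zero
flux = pi[:, None] * q
assert np.allclose(flux, flux.T)                     # detailed balance: pi_i q_ij = pi_j q_ji
assert np.isclose(-np.dot(pi, np.diag(q)), 1.0)      # one substitution per unit branch length

# Transition-probability properties, with P(t) = expm(Q t).
assert np.allclose(expm(q * 0.0), np.eye(4))         # P(0) = I
p = expm(q * 0.1)
assert np.allclose(p.sum(axis=1), 1.0)               # each row is a probability distribution
assert np.allclose(expm(q * 100.0), np.tile(pi, (4, 1)))   # equilibrium limit
```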

Version and CLI

  • Version set to 1.4.0 in pyproject.toml.

  • CLI template (EXAMPLE_CONFIG in cli.py) updated with commented likelihood example and branch-length tree.


1.3.0 (2026)

Machine learning-based ancestral calling and documentation updates.

  • ML-based ancestral calling. The new method: ml option uses a LightGBM gradient-boosted classifier trained on per-position features (outgroup agreement, GC content, CpG flag, etc.) to predict ancestral alleles. Confidence is derived from the predicted class probabilities. Install with pip install 'ancify[ml]' (requires lightgbm and scikit-learn).

  • New ancify.ml module. Feature extraction (extract_features()), model loading, and vectorized prediction for full-chromosome runs. Integrates with the existing pipeline via config and CLI.

  • New config field: model_path. The method field now accepts "voting", "parsimony", and "ml". With method: ml, the optional model_path points to a trained LightGBM model; otherwise a bundled default is used when available.

  • CLI and config updated to pass method selection and ML options through to the calling phase.

  • Documentation: algorithm page and configuration reference updated for the ML method; GPU logo and conf tweaks.

  • Tests: tests/test_ml.py for feature extraction, prediction shape, and integration.

  • Lock file: uv.lock added for reproducible installs.
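
To give a feel for what per-position features of this kind might look like, here is a small sketch. It is not the extract_features() function from ancify.ml: the function name, the inputs, and the exact definition of each feature are assumptions made for illustration, and no model is trained or loaded.

```python
from collections import Counter

BASES = ("A", "C", "G", "T")


def extract_features_sketch(outgroup_bases, focal_window):
    """Compute illustrative features for one position.

    outgroup_bases: one base per outgroup at this position ("N" if missing).
    focal_window:   odd-length focal sequence centred on the position.
    """
    called = [b.upper() for b in outgroup_bases if b.upper() in BASES]
    top_count = Counter(called).most_common(1)[0][1] if called else 0
    window = focal_window.upper()
    mid = len(window) // 2
    gc = (window.count("G") + window.count("C")) / len(window) if window else 0.0
    # The centre base is part of a CpG if it is the C or the G of a "CG" dinucleotide.
    is_cpg = window[mid:mid + 2] == "CG" or window[max(mid - 1, 0):mid + 1] == "CG"
    return {
        "outgroup_agreement": top_count / len(called) if called else 0.0,
        "missing_fraction": 1 - len(called) / len(outgroup_bases) if outgroup_bases else 1.0,
        "gc_content": gc,
        "is_cpg": int(is_cpg),
    }


features = extract_features_sketch(["A", "A", "G", "N"], "TTACGGA")
print(features)
```

A classifier such as LightGBM would consume one such feature row per position and return class probabilities, from which a confidence level can be derived.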

1.2.0 (2026)

Fitch parsimony for tree-based ancestral inference.

  • Fitch parsimony method. The new method: parsimony option uses the Fitch (1971) algorithm on a user-supplied Newick phylogenetic tree to infer ancestral alleles. By using the tree topology, it resolves many positions that the two-tier voting method marks as “unresolved”.

  • Newick tree parser. Built-in recursive-descent parser for Newick-format trees (ancify.parsimony). Supports branch lengths, quoted labels, and multifurcations.

  • New config fields: method ("voting" / "parsimony") and tree (inline Newick string or path to .nwk file).

  • Config validation checks that tree leaf names match outgroup species names when parsimony is selected.

  • New call_ancestral_base_parsimony() function in ancify.ancestral for programmatic per-position Fitch calls.

  • Comprehensive test suite for the parsimony module: Newick parsing, Fitch bottom-up/top-down passes, full algorithm with ILS scenarios, missing data handling, and confidence encoding.

  • Documentation updates: algorithm page with Fitch walkthrough, configuration reference with parsimony examples, API reference for the new module, and updated README.
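
The following sketch shows how a recursive-descent Newick parser and the Fitch bottom-up pass fit together. It is a simplified illustration, not the code in ancify.parsimony: it ignores quoted labels, discards branch lengths, expects the string to end with ";", and applies the intersection-else-union rule to all children at once, which is a simplification for multifurcations. Only the bottom-up pass is shown, since that is enough to obtain the state set at the root. All names are hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names."""
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        if text[pos] == ":":               # branch length: read and discard
            pos += 1
            while text[pos] not in ",();":
                pos += 1
        return name

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            pos += 1                       # consume ")"
            read_label()                   # optional internal label and length
            return tuple(children)
        return read_label()

    return parse_node()


def fitch_bottom_up(node, observed):
    """Return the Fitch state set for a node (leaf name or tuple of children)."""
    if isinstance(node, str):
        base = observed.get(node, "N").upper()
        return {base} if base in BASES else set(BASES)   # missing data: any state
    child_sets = [fitch_bottom_up(child, observed) for child in node]
    common = set.intersection(*child_sets)
    return common if common else set.union(*child_sets)


tree = parse_newick("((chimp:0.01,gorilla:0.015):0.005,orangutan:0.03);")
resolved = fitch_bottom_up(tree, {"chimp": "A", "gorilla": "G", "orangutan": "A"})
ambiguous = fitch_bottom_up(tree, {"chimp": "A", "gorilla": "A", "orangutan": "G"})
print(tree, resolved, ambiguous)
```

In the first call the inner clade is ambiguous between A and G, but the orangutan's A leaves a single state at the root. In the second call the two sides of the root disagree, so the root set keeps both candidates.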

1.1.0 (2026)

GPU acceleration and a vectorized compute backend that speed up Phase 1 (projection) and Phase 2 (ancestral calling).

Performance

  • GPU-accelerated ancestral calling (Phase 2). Ancestral state inference runs as a small number of tensor operations on the GPU instead of per-position Python loops. On an NVIDIA A100, the full human genome completes in under 2 minutes (vs. hours on the original scalar path).

  • Vectorized coordinate projection (Phase 1). Net AXT projection uses NumPy vectorized scatter (CPU) or PyTorch scatter on CUDA. The per-character Python loop is removed, giving roughly 20–50× speedup on CPU.

  • Multi-GPU support. When using the GPU backend, chromosomes are distributed round-robin across available NVIDIA GPUs. Use the gpu_devices config field to restrict which devices are used.

  • Faster gzip decompression. Optional isal (Intel ISA-L) dependency provides 2–5× faster gzip decompression for large AXT files. Install with pip install 'ancify[fast]'.
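
Round-robin distribution of chromosomes across devices can be pictured with a few lines of Python. This is a sketch of the general technique only. The function name and return shape are hypothetical, and the real scheduling in ancify may differ in detail.

```python
def assign_round_robin(chromosomes, gpu_devices):
    """Map each device ID to the chromosomes it would process, in round-robin order."""
    if not gpu_devices:
        raise ValueError("at least one GPU device ID is required")
    assignment = {device: [] for device in gpu_devices}
    for index, chrom in enumerate(chromosomes):
        device = gpu_devices[index % len(gpu_devices)]
        assignment[device].append(chrom)
    return assignment


plan = assign_round_robin(["chr1", "chr2", "chr3", "chr4", "chr5"], gpu_devices=[0, 1])
print(plan)
```

Passing a shorter device list, as the gpu_devices config field allows, simply narrows the set of keys the chromosomes are spread over.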

New module and config

  • ancify.backend module. Central abstraction for CPU/GPU execution: detect_backend(), get_available_gpus(), open_gz(), and vectorized implementations of majority vote, ancestral calling, and block scatter for projection.

  • New config fields: backend ("auto" / "cpu" / "gpu") and gpu_devices (optional list of GPU IDs, e.g. [0, 1, 2]). With backend: auto, ancify uses the GPU when PyTorch and CUDA are available, otherwise the vectorized CPU path.
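
The backend: auto behaviour described above can be sketched as follows. This is not the detect_backend() implementation. The function name resolve_backend is hypothetical, and raising an error when "gpu" is requested without CUDA is an assumption about how such a case might be handled. The only PyTorch call used is torch.cuda.is_available().

```python
def resolve_backend(requested="auto"):
    """Return "gpu" or "cpu" for a requested backend of "auto", "cpu", or "gpu"."""
    if requested not in ("auto", "cpu", "gpu"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch                       # optional dependency
        cuda_ok = torch.cuda.is_available()
    except ImportError:
        cuda_ok = False
    if requested == "gpu" and not cuda_ok:
        # Assumption: fail loudly rather than silently falling back to the CPU.
        raise RuntimeError("backend 'gpu' requested but PyTorch with CUDA is not available")
    return "gpu" if cuda_ok else "cpu"


print(resolve_backend("auto"))
```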

Correctness and docs

  • Bit-identical output. Vectorized and GPU code paths produce the same results as the original scalar implementation. Tie-breaking, min_inner_freq / min_outer_freq behaviour, and case-encoded confidence are unchanged.

  • New documentation page: GPU Acceleration & Vectorization with GPU setup, supported hardware, architecture overview, and tuning tips.

1.0.0 (2026)

Initial release.

  • Config-driven YAML pipeline for any focal species.

  • Three-phase workflow: project, call, evaluate.

  • Two-tier inner/outer outgroup voting with case-encoded confidence.

  • Support for arbitrary numbers of inner and outer outgroup species.

  • Parallel execution via ProcessPoolExecutor.

  • Optional evaluation against reference ancestral sequences and VCF data.

  • CLI with subcommands: init, project, call, evaluate, run.

  • Python API for programmatic use.

  • 108 unit and integration tests.

  • Example configs for human, mouse, Drosophila, and Brassica rapa.

  • Comprehensive documentation with population genetics background, tutorials, and algorithm deep dives.

  • Installable with pip or uv.