FAQ & Troubleshooting
Common questions and solutions for issues you might encounter.
General Questions
What species can I use ancify with?
Any species for which you have:
A reference genome assembly
Pairwise net AXT alignments to at least two outgroup species (one inner, one outer)
The UCSC Genome Browser provides alignments for hundreds of species pairs. If your focal species has a UCSC assembly, you almost certainly have the data you need.
How accurate is ancify?
For the human genome (hg38) using the BCGM method (bonobo + chimp + gorilla + macaque):
>99.9% agreement with the Ensembl EPO 13-primate ancestral reference at high-confidence sites
~0.08% disagreement rate on autosomes
>99.5% of ancestral calls match either the REF or ALT allele at known variant sites
Accuracy depends on the divergence times and alignment quality for your specific species.
How long does it take?
| Phase | CPU (scalar) | CPU (vectorized) | GPU (NVIDIA A100) |
|---|---|---|---|
| Phase 1 (projection) | 2-8 hours | ~5 minutes | ~2 minutes |
| Phase 2 (calling) | 5-15 minutes | ~10 minutes | ~10 seconds |
| Phase 3 (evaluation) | 5-15 minutes | 5-15 minutes | 5-15 minutes |
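With the GPU backend enabled, projection and calling (Phases 1 and 2) for the full human genome complete in a little over 2 minutes; Phase 3 takes the same 5-15 minutes on every backend. Even without a GPU, the vectorized NumPy backend cuts Phase 1 from hours to minutes compared with the original scalar path. See GPU Acceleration & Vectorization for details.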
How much memory do I need?
Phase 2 is the memory-intensive step:
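memory ≈ num_cpus × num_outgroups × avg_chrom_length
For human chr1 (~249 Mb) with 4 outgroups and 24 CPUs: ~24 GB peak, which works out to about one byte per projected base. chr1 is the longest human chromosome, so this is a worst-case figure. Reduce num_cpus if needed.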
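The estimate above can be turned into a quick calculation. This is a minimal sketch that assumes one byte per position per outgroup, which is what the ~24 GB example implies; the function name is illustrative and not part of ancify.

```python
def estimate_phase2_memory_gb(num_cpus: int, num_outgroups: int, chrom_length: int) -> float:
    """Rough Phase 2 peak memory in GB (10^9 bytes).

    Assumption: one byte per position per outgroup, one chromosome per CPU.
    """
    bytes_needed = num_cpus * num_outgroups * chrom_length
    return bytes_needed / 1e9


# Human chr1 (~249 Mb), 4 outgroups, 24 CPUs
print(f"{estimate_phase2_memory_gb(24, 4, 249_000_000):.1f} GB")  # prints "23.9 GB"
```

Halving num_cpus halves the estimate, which is why reducing it is the first thing to try on a memory-constrained machine.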
Can I run phases independently?
Yes. Each phase has its own subcommand:
ancify project -c config.yaml # Phase 1 only
ancify call -c config.yaml # Phase 2 only (requires Phase 1 output)
ancify evaluate -c config.yaml # Phase 3 only (requires Phase 2 output)
This is useful when you want to re-run Phase 2 with different min_inner_freq settings without repeating the expensive Phase 1.
Data Questions
Where do I get net AXT alignment files?
From the UCSC Genome Browser downloads:
https://hgdownload.soe.ucsc.edu/goldenPath/<focal_assembly>/vs<Outgroup>/
For example, human vs. chimp:
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsPanTro6/hg38.panTro6.net.axt.gz
Browse all available alignments at UCSC Downloads.
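If you need several outgroups, the URL pattern above can be generated programmatically. The sketch below only builds the string; the helper name is hypothetical, and not every focal/outgroup pair exists on the UCSC server, so check each URL before downloading.

```python
UCSC_BASE = "https://hgdownload.soe.ucsc.edu/goldenPath"


def net_axt_url(focal: str, outgroup: str) -> str:
    """Build the expected UCSC URL for a focal-vs-outgroup net AXT file."""
    # The directory name capitalizes the first letter of the outgroup assembly,
    # e.g. panTro6 -> vsPanTro6
    vs_dir = "vs" + outgroup[0].upper() + outgroup[1:]
    return f"{UCSC_BASE}/{focal}/{vs_dir}/{focal}.{outgroup}.net.axt.gz"


print(net_axt_url("hg38", "panTro6"))
# prints https://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsPanTro6/hg38.panTro6.net.axt.gz
```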
What if UCSC does not have alignments for my species?
You can generate your own net AXT files using the UCSC Kent tools:
Run lastz or minimap2 to produce pairwise alignments
Convert to chain format with axtChain
Net with chainNet
Convert back to AXT with netToAxt
This is more involved but allows ancify to work with any pair of genome assemblies.
Can I use MAF files instead of AXT?
Not directly. ancify reads AXT format. You can convert MAF to AXT using UCSC tools:
mafToAxt input.maf focal_assembly outgroup_assembly output.axt
What chromosome naming convention should I use?
ancify does not enforce any naming convention. Use whatever names appear in your chromosome lengths file. Examples:
chr1, chr2, …, chrX (UCSC style)
1, 2, …, X (Ensembl style)
2L, 2R, 3L, 3R (Drosophila)
A01, A02 (Brassica)
The names just need to be consistent between your lengths file, your alignment files, and your config.
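A naming mismatch is easy to catch before a long run. The following sketch compares the names in a lengths file with the target chromosome names in an AXT file. It assumes the lengths file has one whitespace-separated name and length per line (as in a UCSC chrom.sizes file); the function names are illustrative and not part of ancify.

```python
def lengths_names(lengths_text: str) -> set:
    """Chromosome names from a lengths file (assumed: name, whitespace, length)."""
    names = set()
    for line in lengths_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split()[0])
    return names


def axt_target_names(axt_text: str) -> set:
    """Target (focal) chromosome names taken from AXT summary lines."""
    names = set()
    for line in axt_text.splitlines():
        fields = line.split()
        # An AXT summary line has 9 fields: index, tName, tStart, tEnd,
        # qName, qStart, qEnd, strand, score
        if len(fields) == 9 and fields[0].isdigit() and fields[7] in ("+", "-"):
            names.add(fields[1])
    return names


lengths = "chr1\t248956422\nchr2\t242193529\n"
axt = "0 1 100 104 chr1 200 204 + 500\nACGTA\nACGTA\n\n"
# Names used in the alignment but absent from the lengths file
print(axt_target_names(axt) - lengths_names(lengths))  # prints {'1'}
```

For real data you would read the first few thousand lines of the decompressed AXT rather than the whole file, since the set of names stabilizes quickly.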
Performance & Resource Issues
MemoryError during Phase 2
Each chromosome loads all projected sequences simultaneously. Solutions:
Reduce num_cpus: this limits how many chromosomes are loaded in parallel.
Process a subset of chromosomes: use the chromosomes config field to run in batches.
Use a machine with more RAM: Phase 2 for human chr1 with 4 outgroups needs ~1 GB per concurrent chromosome.
Phase 1 is very slow
Phase 1 streams through large compressed files. With the vectorized backend it is mostly I/O-bound; with the original scalar path, the per-character loop dominates:
Install isal: pip install 'ancify[fast]' provides 2–5× faster gzip decompression using Intel ISA-L
Enable the vectorized backend: set backend: auto or backend: cpu in your config. The vectorized NumPy scatter is 20–50× faster than the original character loop
Use an SSD for the alignment files
Consider splitting by species: run projection for each outgroup on a separate machine/job, since AXT files are independent
Phase 1 only needs to be run once; subsequent Phase 2 runs reuse the projected files
Do I need a GPU?
No. The vectorized NumPy backend (selected automatically when PyTorch is not installed, or with backend: cpu) is already much faster than the original scalar path. GPU acceleration is an additional speedup on top of that, primarily beneficial for large genomes (> 1 Gb). See GPU Acceleration & Vectorization for benchmarks.
How do I enable GPU acceleration?
Install PyTorch with CUDA support: pip install torch --index-url https://download.pytorch.org/whl/cu128
Set backend: auto (or backend: gpu) in your config
Optionally set gpu_devices: [0, 1, 2] to specify which GPUs to use
ancify will detect PyTorch + CUDA automatically and use the GPU backend. Check with ancify run -c config.yaml -v — the first log line reports the active backend.
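To check your environment independently of ancify, you can test for the same two conditions (PyTorch importable, CUDA usable) directly. This is a sketch of that check, not ancify's own detection code, and the function name is hypothetical.

```python
def expected_auto_backend() -> str:
    """Return "gpu" if PyTorch with a usable CUDA device is present, else "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: only the CPU path can be available
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"


print(expected_auto_backend())
```

If this prints cpu on a machine with a GPU, the usual cause is a CPU-only PyTorch build or a driver mismatch rather than anything in the ancify config.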
My GPU runs out of memory
Each chromosome needs roughly 6 GB of GPU memory for 4 outgroups and a 248M-position chromosome. Solutions:
Process fewer chromosomes in parallel: reduce num_cpus
Restrict to fewer GPUs: set gpu_devices to a list with fewer entries so memory is not split across too many concurrent tasks
Fall back to CPU: set backend: cpu for the vectorized NumPy path, which uses system RAM instead
Does the GPU backend change the output?
No. All three backends (scalar, vectorized CPU, GPU) produce bit-identical output. The tie-breaking rule, frequency thresholds, and confidence encoding are preserved exactly. You can verify this by running the same config with different backend settings and comparing the output FASTAs.
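One way to do that comparison is to hash both files. The sketch below uses only the standard library; the file paths are placeholders for wherever your two runs wrote their output. If the FASTAs are gzip-compressed, compare the decompressed content instead, since compressed bytes can differ even when the sequences are identical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large FASTAs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(path_a: str, path_b: str) -> bool:
    """True if the two files have byte-identical content."""
    return file_sha256(path_a) == file_sha256(path_b)


# Example with placeholder paths:
# outputs_identical("run_cpu/chr1.fa", "run_gpu/chr1.fa")
```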
Can I resume a failed run?
ancify does not have built-in checkpointing, but:
Phase 1 writes one file per (species, chromosome). If all files for a species already exist, you can skip that species by temporarily removing it from the config and adding it back once Phase 1 has finished.
Phase 2 is fast enough to re-run from scratch.
Phase 3 is also fast and can be re-run independently.
Output Questions
What does each character mean in the output FASTA?
| Character | Confidence | Condition |
|---|---|---|
| Uppercase (A, C, G, T) | High | Inner and outer outgroups agree |
| Lowercase (a, c, g, t) | Low | Only one outgroup tier has data |
| | Unresolved | Inner and outer disagree |
| | Missing | Both tiers lack data |
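The conditions in the table can be read as a small decision function. The sketch below returns the confidence class name rather than an output character, and it assumes each tier has already been reduced to a single allele call (or None when the tier has no data). How ancify performs that reduction is not shown here, and the function name is illustrative.

```python
from typing import Optional


def confidence_class(inner: Optional[str], outer: Optional[str]) -> str:
    """Map per-tier allele calls (None = no data) to a confidence class."""
    if inner is None and outer is None:
        return "missing"      # both tiers lack data
    if inner is None or outer is None:
        return "low"          # only one tier has data
    if inner.upper() == outer.upper():
        return "high"         # inner and outer agree
    return "unresolved"       # inner and outer disagree


print(confidence_class("A", "A"))    # prints high
print(confidence_class("A", None))   # prints low
print(confidence_class("A", "G"))    # prints unresolved
print(confidence_class(None, None))  # prints missing
```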
My output has a lot of Ns. Is something wrong?
Not necessarily. High N rates are expected in:
Repetitive regions (centromeres, telomeres, satellite DNA)
Lineage-specific insertions (transposable elements inserted in the focal species after divergence from outgroups)
Assembly gaps in the focal genome
chrY (very repetitive, poor alignment coverage)
Species with distant outgroups (less alignable sequence)
Compare your coverage rates against the expected values for your species pair’s divergence time.
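To get those rates from an output FASTA, you can count characters by class. This is a minimal sketch with a hand-rolled FASTA parser; it groups characters as uppercase bases, lowercase bases, N, and anything else, and makes no assumption about what the "other" characters mean. The function names are illustrative.

```python
def parse_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def base_class_fractions(sequence: str) -> dict:
    """Fraction of positions that are uppercase bases, lowercase bases, N, or other."""
    total = len(sequence)
    if total == 0:
        return {"uppercase": 0.0, "lowercase": 0.0, "N": 0.0, "other": 0.0}
    upper = sum(sequence.count(base) for base in "ACGT")
    lower = sum(sequence.count(base) for base in "acgt")
    n_count = sequence.count("N") + sequence.count("n")
    other = total - upper - lower - n_count
    return {
        "uppercase": upper / total,
        "lowercase": lower / total,
        "N": n_count / total,
        "other": other / total,
    }


for header, seq in parse_fasta(">chr_example\nACGTac\nNN--\n"):
    print(header, base_class_fractions(seq))
# prints chr_example {'uppercase': 0.4, 'lowercase': 0.2, 'N': 0.2, 'other': 0.2}
```

str.count runs in C, so this stays practical on whole chromosomes as long as the sequence fits in memory.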
Can I use the output to polarize a VCF?
Yes — this is the primary use case. See the Quickstart for code examples and the Tutorials for a complete walkthrough.
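The core of polarization is a three-way comparison per site. As a minimal standalone sketch (the Quickstart remains the reference for the project's own examples), the function below takes the REF and ALT alleles of a biallelic SNP and the ancestral character at that position. It ignores case, so filter on lowercase calls beforehand if you want high-confidence sites only. The function name is hypothetical.

```python
from typing import Optional


def derived_allele(ref: str, alt: str, ancestral: str) -> Optional[str]:
    """Return the derived allele of a biallelic SNP, or None if it cannot be polarized."""
    anc = ancestral.upper()
    ref, alt = ref.upper(), alt.upper()
    if anc == ref:
        return alt  # REF is ancestral, so ALT is derived
    if anc == alt:
        return ref  # ALT is ancestral, so REF is derived
    return None     # ancestral call is N/missing or matches neither allele


print(derived_allele("A", "G", "A"))  # prints G
print(derived_allele("A", "G", "g"))  # prints A
print(derived_allele("A", "G", "N"))  # prints None
```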
Is the confidence encoding compatible with Ensembl EPO?
Yes. The uppercase/lowercase/N convention matches the Ensembl EPO ancestral sequence format. Tools that parse Ensembl ancestral sequences should work with ancify output without modification.
Scientific Questions
How does this compare to the Ensembl EPO method?
The Ensembl EPO (Enredo-Pecan-Ortheus) method uses:
Multiple sequence alignment of 13+ primate genomes
A probabilistic model (Ortheus) that reconstructs ancestral sequences on a phylogenetic tree
An order of magnitude more species
ancify uses:
Pairwise alignments only (simpler input, widely available)
Deterministic majority voting (transparent, reproducible)
Typically 2-5 outgroup species
Trade-offs:
| Aspect | Ensembl EPO | ancify |
|---|---|---|
| Input complexity | High (MSA of 13+ genomes) | Low (2-5 pairwise AXTs) |
| Species coverage | Vertebrates only | Any species with AXT alignments |
| Accuracy | Slightly higher | Very close (>99.9% agreement) |
| Transparency | Probabilistic model | Simple, auditable rules |
| Customizability | Fixed | Full control via config |
For most applications, the practical difference in accuracy is negligible.
When should I NOT use parsimony-based polarization?
Parsimony can be unreliable when:
Divergence times are very large — back mutations accumulate, eroding the ancestral signal.
The focal species is in a rapid radiation — ILS is pervasive and multiple outgroups may all carry derived alleles.
Only one outgroup is available and it is distant — error rates increase without redundancy.
In these cases, consider probabilistic methods (like Ortheus or est-sfs) that explicitly model substitution rates along branches.
What min_inner_freq should I use?
It depends on your tolerance for errors vs. missing data:
min_inner_freq=1 (default): Maximize coverage. Best for demographic inference where the unfolded SFS shape matters more than individual-site accuracy.
min_inner_freq=2 (with 3 inner species): A good balance. Recommended if you are doing site-by-site analysis (e.g. McDonald-Kreitman tests).
min_inner_freq=N (unanimity): Maximum stringency. Use when even rare errors would be problematic.
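To see how the threshold trades coverage for stringency, here is an illustrative sketch that reads min_inner_freq as the minimum number of inner outgroups that must carry the most common allele. This is an interpretation of the settings above, not ancify's implementation; in particular, treating an exact tie as "no call" is an assumption, and ancify's tie-breaking rule may differ.

```python
from collections import Counter
from typing import Iterable, Optional


def inner_call(alleles: Iterable[Optional[str]], min_inner_freq: int = 1) -> Optional[str]:
    """Most common inner-outgroup allele if at least min_inner_freq species carry it."""
    counts = Counter(a.upper() for a in alleles if a is not None)
    if not counts:
        return None  # no inner outgroup has data at this site
    ranked = counts.most_common(2)
    top_allele, top_count = ranked[0]
    # Assumption: an exact tie yields no call (ancify's actual rule may differ)
    if len(ranked) == 2 and ranked[1][1] == top_count:
        return None
    return top_allele if top_count >= min_inner_freq else None


site = ["A", None, None]  # only one of three inner outgroups aligned
print(inner_call(site, min_inner_freq=1))  # prints A
print(inner_call(site, min_inner_freq=2))  # prints None
print(inner_call(["A", "A", "G"], min_inner_freq=2))  # prints A
```

The first two calls show the trade-off: the same site is called at min_inner_freq=1 and dropped at min_inner_freq=2.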
Still stuck?
Open an issue on GitHub with:
Your config file (with paths anonymized if needed)
The full error message
The output of ancify --help (to confirm installation)
Your Python version (python --version)