Evaluation

Phase 3 of the pipeline evaluates ancestral calls against external data. This page explains what is measured and how to interpret the results for your analysis.

Overview

Evaluation is optional and only runs if the evaluation block is present in your configuration file. It answers two questions:

How do my calls compare to a known reference? (e.g., Ensembl EPO ancestral sequence)
Do my calls make sense at known variant sites? (e.g., 1000 Genomes VCF)

What gets measured

Coverage statistics (computed whenever evaluation runs)

For each chromosome, ancify counts:

Metric	What it measures
High-confidence positions	Uppercase `ACGT` — both tiers agree
Low-confidence positions	Lowercase `acgt` — one tier has data
Unresolved positions	`n` — tiers disagree
Missing positions	`N` — no data
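
The four categories can be tallied directly from a sequence string. This is a minimal sketch rather than ancify's own code: the function name `count_call_classes` is hypothetical, and counting any unexpected character (for example a gap) as missing is an assumption.

```python
def count_call_classes(seq: str) -> dict:
    """Tally call classes using the case convention from the table above."""
    counts = {
        "high_confidence": 0,
        "low_confidence": 0,
        "unresolved": 0,
        "missing": 0,
    }
    for base in seq:
        if base in "ACGT":
            counts["high_confidence"] += 1
        elif base in "acgt":
            counts["low_confidence"] += 1
        elif base == "n":
            counts["unresolved"] += 1
        else:
            # 'N' and, by assumption, any other unexpected character
            counts["missing"] += 1
    return counts
```

Dividing each count by `len(seq)` gives proportions comparable to those in the coverage report.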

Reference comparison (optional)

If reference_dir and reference_pattern are configured, ancify compares its calls position-by-position against a reference ancestral sequence.

Metric	What it measures
Agreement rate	Of the positions called by both methods, the fraction where the calls agree
Disagreement rate	Of the positions called by both methods, the fraction where the calls differ
Complementarity	How often one method fills gaps left by the other
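
A position-by-position comparison of this kind might be sketched as follows. The function name `compare_to_reference` is hypothetical. Three further assumptions are not confirmed by ancify: comparison is case-insensitive, only `ACGT`/`acgt` count as a call, and complementarity is reported as the share of positions called by exactly one method.

```python
def compare_to_reference(test: str, ref: str) -> dict:
    """Compare two equal-length ancestral sequences position by position."""
    if len(test) != len(ref):
        raise ValueError("sequences must have equal length")
    called = set("ACGTacgt")  # assumption: everything else is 'no call'
    both = agree = test_only = ref_only = 0
    for t, r in zip(test, ref):
        t_ok, r_ok = t in called, r in called
        if t_ok and r_ok:
            both += 1
            if t.upper() == r.upper():  # assumption: case is ignored
                agree += 1
        elif t_ok:
            test_only += 1  # test fills a gap in the reference
        elif r_ok:
            ref_only += 1   # reference fills a gap in the test
    total = len(test)
    nan = float("nan")
    return {
        "both_nonmissing": both,
        "agreement_rate": agree / both if both else nan,
        "disagreement_rate": (both - agree) / both if both else nan,
        "test_only": test_only / total if total else nan,
        "ref_only": ref_only / total if total else nan,
    }
```

With this definition the agreement and disagreement rates always sum to 1, because both use the jointly called positions as the denominator.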

VCF comparison (optional)

If vcf_dir and vcf_pattern are configured, ancify checks how often its ancestral call matches the REF or ALT allele at known variant sites.

Metric	What it measures
Proportion non-missing	Fraction of variant sites with an ancestral call
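Matches REF	Fraction of called variant sites matching the VCF reference allele
Matches ALT	Fraction of called variant sites matching the VCF alternate allele
Matches either	Fraction of called variant sites matching REF or ALT; should be >99% for a correct pipeline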
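
The sketch below shows one way these four numbers could be derived from biallelic SNP records. It is illustrative only: `vcf_match_summary` is a hypothetical name, the input is a plain list of `(ref, alt, ancestral)` tuples rather than a parsed VCF, and the choice of called sites as the denominator for the match fractions is an assumption.

```python
def vcf_match_summary(sites) -> dict:
    """Summarize ancestral calls at variant sites.

    sites: iterable of (ref, alt, ancestral) single-base strings.
    """
    n_sites = n_called = n_ref = n_alt = 0
    for ref, alt, anc in sites:
        n_sites += 1
        anc = anc.upper()
        if anc not in "ACGT":
            continue  # 'n' and 'N' both count as no call here
        n_called += 1
        if anc == ref.upper():
            n_ref += 1
        elif anc == alt.upper():
            n_alt += 1
    nan = float("nan")
    return {
        "num_variants": n_sites,
        "prop_nonmissing": n_called / n_sites if n_sites else nan,
        # assumption: match fractions are relative to called sites only
        "matches_ref": n_ref / n_called if n_called else nan,
        "matches_alt": n_alt / n_called if n_called else nan,
        "matches_either": (n_ref + n_alt) / n_called if n_called else nan,
    }
```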

Output format

Results are written as per-chromosome text files in <output_dir>/evaluation/:

[coverage]
  total_positions: 248956422
  high_confidence: 197145832
  low_confidence: 25203651
  missing: 26606939
  prop_nonmissing: 0.8931
  prop_high_confidence: 0.7918

[reference_comparison]
  test_nonmissing: 0.7912
  ref_nonmissing: 0.9016
  both_nonmissing: 5234112
  agreement_rate: 0.9992
  disagreement_rate: 0.0008

[vcf_comparison]
  num_variants: 6102435
  prop_nonmissing: 0.8965
  matches_ref: 0.9453
  matches_alt: 0.0494
  matches_either: 0.9948
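
Because the report is plain text with bracketed section headers and indented `key: value` lines, it is easy to load for downstream checks. The parser below is a sketch based only on the sample shown above; `parse_evaluation_report` is a hypothetical helper, and it assumes every value is numeric.

```python
def parse_evaluation_report(text: str) -> dict:
    """Parse '[section]' headers and 'key: value' lines into nested dicts."""
    report = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            report[section] = {}
        elif ":" in line and section is not None:
            key, value = line.split(":", 1)
            value = value.strip()
            # assumption: values are plain integers or decimals
            number = float(value) if "." in value else int(value)
            report[section][key.strip()] = number
    return report
```

For example, `parse_evaluation_report(text)["vcf_comparison"]["matches_either"]` can then be compared against a threshold in a script.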

How to interpret the results

Coverage: “How much of my genome has an ancestral call?”

  ┌─────────────────────────────────────────────────────────┐
  │                    Total genome                          │
  │                                                         │
  │  ████████████████████████████  ░░░░░░░░  ▓▓▓▓  ████    │
  │  HIGH confidence (75%)         LOW (16%)  n(0.1) N(9%)  │
  │                                                         │
  └─────────────────────────────────────────────────────────┘

Typical values for human (autosomes):

prop_nonmissing ≈ 0.89-0.91 (89-91% of positions have some call)
prop_high_confidence ≈ 0.75-0.80 (75-80% are high confidence)

If your values are much lower:

Check that all alignment files were downloaded completely (gzip -t file.axt.gz)
Verify chromosome names match between your lengths file and alignment files
Keep in mind that more distant outgroups align to less of the genome, so lower coverage is expected with them
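
The download check can also be scripted when there are many alignment files. The following sketch approximates what `gzip -t` does by decompressing each file to the end with Python's standard library; `gzip_is_intact` is a hypothetical helper name, not part of ancify.

```python
import gzip
import zlib

def gzip_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the gzip file at `path` decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so truncation and CRC errors are detected.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Covers missing files, bad headers, truncated or corrupt streams.
        return False
    return True
```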

Agreement rate: “Do I agree with the gold standard?”

  agreement_rate:    0.9992    ← 99.92% of compared positions match
  disagreement_rate: 0.0008   ← 0.08% differ

What does 0.08% disagreement mean?

On a chromosome with 200 million callable positions, 0.08% is ~160,000 positions. This sounds like a lot, but its effect on the site frequency spectrum (SFS) is limited:

Most disagreements are at positions with rare alleles or in regions of poor alignment quality.
The SFS is a summary statistic — individual site errors are diluted across frequency bins.
Demographic inference tools are generally robust to polarization error rates well below 1%.
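
The count quoted above is simply the number of callable positions multiplied by the disagreement rate. A small sketch of the arithmetic (the function name is illustrative) lets you repeat it for your own chromosome sizes:

```python
def expected_disagreements(callable_positions: int, disagreement_rate: float) -> int:
    """Number of positions expected to disagree at a given rate."""
    return round(callable_positions * disagreement_rate)

# 200 million callable positions at a 0.08% disagreement rate
n_disagree = expected_disagreements(200_000_000, 0.0008)
```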

If your disagreement rate is much higher:

Check alignment quality for the most divergent outgroup
Consider increasing min_inner_freq for stricter consensus
Examine whether the disagreements are clustered in specific genomic regions (e.g., segmental duplications)
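
To look for clustering, disagreement positions can be binned into fixed-size windows and the densest windows inspected first. This sketch assumes you have already collected the 0-based positions of disagreeing sites yourself; the function name and the 1 Mb default window are illustrative choices, not ancify settings.

```python
from collections import Counter

def disagreements_per_window(positions, window_size: int = 1_000_000) -> dict:
    """Map each window start coordinate to its number of disagreeing sites."""
    counts = Counter((pos // window_size) * window_size for pos in positions)
    return dict(sorted(counts.items()))
```

Windows whose counts are far above the chromosome-wide average are candidates for closer inspection against annotation tracks.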

VCF comparison: “Does my ancestral allele match known variants?”

  matches_ref:    0.9453    ← 94.5% of variant sites: ancestral = REF
  matches_alt:    0.0494    ←  4.9% of variant sites: ancestral = ALT
  matches_either: 0.9948    ← 99.5% match one or the other

Interpreting these numbers:

matches_ref ≈ 95% is expected. At most variant sites the reference genome carries the ancestral allele, because the reference allele is usually the common allele and common alleles are more often ancestral.
matches_alt ≈ 5% means that at about 5% of variant sites the reference allele is derived. These are the sites where polarization matters most: the ALT frequency must be flipped to obtain the derived allele frequency.
matches_either > 99% is the key quality metric and should always be very high. Values below 99% suggest a problem with the ancestral calls or the VCF data.
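
Flipping the frequency at sites where the ancestral allele is ALT can be sketched as below. This illustrates the idea and is not ancify code: `derived_allele_frequency` is a hypothetical helper for biallelic sites, and treating low-confidence (lowercase) calls the same as high-confidence ones is an assumption you may want to change.

```python
from typing import Optional

def derived_allele_frequency(ref: str, alt: str, ancestral: str,
                             alt_freq: float) -> Optional[float]:
    """Polarize an ALT allele frequency using the ancestral call."""
    anc = ancestral.upper()  # assumption: case (confidence) is ignored
    if anc == ref.upper():
        return alt_freq        # REF is ancestral, so ALT is derived
    if anc == alt.upper():
        return 1.0 - alt_freq  # ALT is ancestral, so REF is derived
    return None                # no call, or matches neither allele
```

Sites that return `None` cannot be polarized and are normally excluded from an unfolded SFS.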

What about the ~0.5% that match neither?

These are positions where the ancestral call is a base that is neither the VCF REF nor ALT. Causes:

Triallelic sites (the VCF REF and ALT are both derived, the ancestral allele is a third base)
VCF errors or multi-allelic sites recorded as biallelic
Ancestral calling errors (rare)

Validation results: Human hg38 (BCGM)

The BCGM method (bonobo + chimp + gorilla + macaque) was validated against the Ensembl EPO 13-primate ancestral reference and 1000 Genomes Phase 3 VCFs:

Metric	chr1	chr22	chrX
Coverage (BCGM)	79.1%	74.6%	44.4%
Coverage (Ensembl EPO)	90.2%	81.2%	44.1%
Disagreement rate	0.08%	0.11%	3.51%
Matches REF/ALT	99.6%	99.5%	99.3%

Key observations:

Autosomes show excellent agreement (<0.12% disagreement).
chrX has much higher disagreement (3.51%). This likely reflects male hemizygosity and the different effective population size of chrX, which alters incomplete lineage sorting (ILS) rates relative to autosomes.
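Coverage is lower than Ensembl EPO because ancify uses 4 species vs. EPO’s 13. The BCGM coverage in the table counts high-confidence calls only; including low-confidence calls raises it from ~79% to ~89% on chr1 (prop_nonmissing: 0.8931 in the sample output above).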

Configuring evaluation

evaluation:
  reference_dir: ./ensembl_ancestor/
  reference_pattern: "homo_sapiens_ancestor_{chrom_id}.fa"
  vcf_dir: ./vcf/
  vcf_pattern: "ALL.chr{chrom_id}.vcf.gz"

You can include either, both, or neither of reference_dir and vcf_dir:

Reference only: Compare against a known ancestral sequence.
VCF only: Check that ancestral calls match known variants.
Both: Full validation suite.
Neither: Evaluation block present but empty — only coverage statistics are computed.
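
The four cases can be summarized as a small decision function over the parsed configuration. This is a sketch of the behavior described on this page, not ancify's implementation: `planned_checks` is a hypothetical name, and the config is assumed to be already loaded into a dictionary, with an empty evaluation block appearing as `None` or `{}`.

```python
def planned_checks(config: dict) -> list:
    """List the evaluation steps implied by a parsed configuration."""
    if "evaluation" not in config:
        return []  # block omitted: Phase 3 is skipped
    block = config["evaluation"] or {}  # an empty YAML block loads as None
    checks = ["coverage"]
    if block.get("reference_dir") and block.get("reference_pattern"):
        checks.append("reference_comparison")
    if block.get("vcf_dir") and block.get("vcf_pattern"):
        checks.append("vcf_comparison")
    return checks
```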

Pattern placeholders

Placeholder	Example (`chr1`)	Example (`2L`)
`{chrom}`	`chr1`	`2L`
`{chrom_id}`	`1`	`2L`
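
The table is consistent with `{chrom_id}` being the chromosome name with a leading `chr` removed. Under that assumption, which the table suggests but does not state, pattern expansion can be sketched as follows; `expand_pattern` is a hypothetical helper.

```python
def expand_pattern(pattern: str, chrom: str) -> str:
    """Fill {chrom} and {chrom_id} placeholders in a file name pattern."""
    # assumption: chrom_id is chrom without a leading 'chr' prefix
    chrom_id = chrom[3:] if chrom.startswith("chr") else chrom
    return pattern.format(chrom=chrom, chrom_id=chrom_id)
```

For instance, `expand_pattern("ALL.chr{chrom_id}.vcf.gz", "chr1")` yields `ALL.chr1.vcf.gz`, while a name such as `2L` passes through unchanged for both placeholders.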

Running evaluation standalone

If Phases 1 and 2 have already been completed:

ancify evaluate -c config.yaml

This only runs Phase 3, reading the existing ancestral FASTA files from output_dir.

When you do not need evaluation

Evaluation requires external reference data, which may not exist for non-model organisms. In that case:

Omit the evaluation block from your config entirely.
Phase 3 will be skipped.
You can still assess quality by inspecting coverage statistics manually (count uppercase, lowercase, n, and N in the output FASTAs).

The Tutorials page includes Python code for computing these statistics from the output files.