Skip to content

Make the sex-linked contig configurable in CNV frequency analysis#1314

Open
khushthecoder wants to merge 2 commits intomalariagen:masterfrom
khushthecoder:fix/issue-1313-cnv-sex-contig
Open

Make the sex-linked contig configurable in CNV frequency analysis#1314
khushthecoder wants to merge 2 commits intomalariagen:masterfrom
khushthecoder:fix/issue-1313-cnv-sex-contig

Conversation

@khushthecoder
Copy link
Copy Markdown
Contributor

Summary

Fix #1313.

gene_cnv_frequencies() and gene_cnv_frequencies_advanced() hardcoded region.contig == "X" when deciding the expected copy number per sample. That is correct for every currently supported assembly, but bakes in a gambiae-complex assumption: any future species whose sex-linked contig uses a different name would silently fall through to the diploid branch and report incorrect amplification / deletion frequencies for hemizygous males.

Because CNV frequencies directly inform insecticide resistance surveillance, silent misclassification of the sex-linked contig is a correctness issue, not a cosmetic one.

Change

Configurable sex_contig

  • Added a sex_contig constructor parameter to AnophelesGenomeSequenceData, defaulting to "X", and exposed a matching public sex_contig property. This mirrors the existing pattern used for virtual_contigs.
  • Threaded sex_contig through AnophelesDataResource.__init__ so species subclasses can override it at construction time without touching the frequency analysis code.

Single source of truth for expected copy number

  • Replaced both hardcoded "X" sites in cnv_frq.py with a shared _expected_copy_number() helper, which consults self._sex_contig and encapsulates all ploidy logic in one place.

Explicit handling of missing sex_call

Previously, requesting the sex-linked contig with a sample frame that happened to lack sex_call (e.g. an As1-style resource) would raise a bare KeyError from deep inside pandas. Now it emits a UserWarning and falls back to diploid for all samples, which is the safest interpretation when sex is unknown. Users get a clear message rather than an opaque crash, and the fallback behaviour is documented in the helper's docstring.

Backwards compatibility

  • Default sex_contig="X" keeps behaviour identical for Ag3, Af1, As1, and every other existing resource.
  • No public API removed or renamed.
  • The new sex_contig property is additive.
  • The new UserWarning only fires on a code path that previously raised KeyError, so nothing that used to succeed now warns.

Test plan

  • pre-commit run --files … — passed (ruff, ruff-format, trailing-whitespace, end-of-file)
  • poetry run mypy … — passed (no issues)
  • poetry run pytest tests/anoph/test_cnv_frq.py — 68 passed
  • poetry run pytest tests/anoph/test_genome_sequence.py tests/anoph/test_base.py — 62 passed
  • poetry run pytest tests/anoph/test_dipclust.py — 64 passed (exercises AnophelesCnvFrequencyAnalysis via the diplotype clustering path)
  • CI matrix on PR (linting, coverage, tests across Python 3.10 / 3.11 / 3.12 × numpy ==2.0.2 and >=2.0.2,<2.1)

Files

File Change
malariagen_data/anoph/genome_sequence.py Add sex_contig init param + public property.
malariagen_data/anopheles.py Thread sex_contig through AnophelesDataResource.
malariagen_data/anoph/cnv_frq.py Introduce _expected_copy_number() helper; remove hardcoded "X"; add sex_call fallback warning.

@khushthecoder
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas — thanks again for merging #1304.

When you have a spare moment, I'd appreciate a look at this pr as well. It closes #1313 (the hardcoded "X" sex-contig in cnv_frq.py) by threading a configurable sex_contig through AnophelesGenomeSequenceData with a default of "X", so behaviour is unchanged for Ag3/Af1/As1 but non-gambiae assemblies can opt in. Local checks pass and CI is green. No rush — whenever it fits your queue.

@khushthecoder khushthecoder force-pushed the fix/issue-1313-cnv-sex-contig branch from cec3b7b to c37ccdf Compare April 18, 2026 21:42
Fix malariagen#1313.

gene_cnv_frequencies() and gene_cnv_frequencies_advanced() hardcoded
`region.contig == "X"` when computing expected copy number. That is
correct for every currently supported assembly, but bakes in a
gambiae-complex assumption: any future species whose sex-linked contig
uses a different name would silently fall through to the diploid branch
and report incorrect amplification / deletion frequencies for males.

Changes:

- Add a `sex_contig` constructor parameter to
  `AnophelesGenomeSequenceData` (defaulting to "X") and expose it as a
  public `sex_contig` property, matching the pattern already used for
  `virtual_contigs`.
- Thread `sex_contig` through `AnophelesDataResource.__init__` so
  species subclasses can override it when their assembly differs.
- Replace both hardcoded `"X"` comparisons in cnv_frq.py with a shared
  `_expected_copy_number()` helper that consults `self._sex_contig`.
- Handle missing `sex_call` explicitly: if the sex-linked contig is
  requested but `sex_call` is absent from the sample metadata, emit a
  UserWarning and fall back to diploid for all samples rather than
  raising KeyError or silently returning a misleading scalar.

The default of "X" keeps behaviour unchanged for Ag3, Af1, As1, and
every other existing resource. New species (e.g. An. farauti) can opt
in by passing `sex_contig=` during construction.
@khushthecoder khushthecoder force-pushed the fix/issue-1313-cnv-sex-contig branch from c37ccdf to 087182e Compare April 18, 2026 21:44
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 18, 2026

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.05%. Comparing base (06563cf) to head (087182e).
⚠️ Report is 24 commits behind head on master.

Files with missing lines Patch % Lines
malariagen_data/anoph/cnv_frq.py 80.00% 2 Missing ⚠️
malariagen_data/anoph/genome_sequence.py 75.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (78.57%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1314      +/-   ##
==========================================
+ Coverage   88.88%   89.05%   +0.17%     
==========================================
  Files          56       57       +1     
  Lines        6439     6508      +69     
==========================================
+ Hits         5723     5796      +73     
+ Misses        716      712       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add unit tests for _expected_copy_number covering:
- autosome contig returns scalar 2
- sex contig with sex_call returns per-sample CN (1 for M, 2 for F)
- sex contig without sex_call emits UserWarning and falls back to 2
- custom sex_contig name (e.g. "Z") is respected

Add property tests for AnophelesGenomeSequenceData.sex_contig covering
the default ("X") across ag3/af1/as1 sim fixtures, and the configurable
override.

Fills the three patch-coverage gaps reported by codecov on PR malariagen#1314
(cnv_frq.py warning branch and genome_sequence.py sex_contig getter).
@khushthecoder
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas — gentle follow-up on this one whenever you get a chance.

Quick status:

All CI checks are green (linting, coverage job, and tests across Python 3.10 / 3.11 / 3.12 × both numpy pins).
The diff is small and fully backwards-compatible: sex_contig defaults to "X", so Ag3, Af1, As1, and every existing resource keep identical behaviour. Non-gambiae assemblies can simply override it at construction time.
Codecov flagged patch coverage at 78.57% (target 80%). The uncovered lines are the defensive branches inside _expected_copy_number() — happy to add a small unit test targeting them directly if you'd prefer the patch to clear the threshold before merge, just let me know.
No rush at all — thanks for your time.

@jonbrenas
Copy link
Copy Markdown
Collaborator

I don't think there is a species for which the sex chromosomes are known and not named 'X' and 'Y'.

@khushthecoder
Copy link
Copy Markdown
Contributor Author

I don't think there is a species for which the sex chromosomes are known and not named 'X' and 'Y'.

Thanks for the quick reply, @jonbrenas — that's a fair point, and I agree: for every assembly
this library currently targets (and realistically any that would ever ship here), the sex
chromosome is "X". So the "future non-X contig" framing in the original PR description is
weaker than I made it out to be.

That said, there are two pieces in the diff that stand on their own regardless of the
configurability:

  1. sex_call safety guard — the current code accesses df_samples["sex_call"]
    unconditionally inside the X branch (cnv_frq.py:294 and :561). If a sample set ever lacks
    that column — plausible for a newer species where sex calling hasn't been run yet — it surfaces
    as a bare KeyError from inside pandas. The PR turns that into an explicit UserWarning with a
    diploid (CN=2) fallback, which I think is the safer default when sex is unknown.

  2. Single source of truth — the literal region.contig == "X" is duplicated across two
    methods (cnv_frq.py:293 and :560). Extracting a small _expected_copy_number() helper
    removes the duplication and makes the ploidy logic explicit in one place, without adding real
    complexity.

Since sex_contig defaults to "X", there's zero behaviour change for Ag3, Af1, or As1 —
existing notebooks and downstream code keep working identically.

Happy to go in whichever direction you prefer:

  • Keep only (1) and (2) — drop the sex_contig constructor parameter and public property
    entirely, leaving just the helper refactor and the sex_call guard. Much smaller diff.
  • Keep as-is if the configurability is worth having as defensive scaffolding.
  • Close the PR if you don't think the motivation justifies the surface-area change — totally
    your call, no hard feelings.

Let me know which you'd like and I'll push the change (or close it) today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cnv_frq.py Hardcodes X-Chromosome Sex-Linked Ploidy Assumption — Breaks CNV Frequency Analysis for Non-Gambiae Species

2 participants