Make the sex-linked contig configurable in CNV frequency analysis#1314
Make the sex-linked contig configurable in CNV frequency analysis#1314khushthecoder wants to merge 2 commits intomalariagen:masterfrom
Conversation
|
Hi @jonbrenas — thanks again for merging #1304. When you have a spare moment, I'd appreciate a look at this pr as well. It closes #1313 (the hardcoded "X" sex-contig in cnv_frq.py) by threading a configurable sex_contig through AnophelesGenomeSequenceData with a default of "X", so behaviour is unchanged for Ag3/Af1/As1 but non-gambiae assemblies can opt in. Local checks pass and CI is green. No rush — whenever it fits your queue. |
cec3b7b to
c37ccdf
Compare
Fix malariagen#1313. gene_cnv_frequencies() and gene_cnv_frequencies_advanced() hardcoded `region.contig == "X"` when computing expected copy number. That is correct for every currently supported assembly, but bakes in a gambiae-complex assumption: any future species whose sex-linked contig uses a different name would silently fall through to the diploid branch and report incorrect amplification / deletion frequencies for males. Changes: - Add a `sex_contig` constructor parameter to `AnophelesGenomeSequenceData` (defaulting to "X") and expose it as a public `sex_contig` property, matching the pattern already used for `virtual_contigs`. - Thread `sex_contig` through `AnophelesDataResource.__init__` so species subclasses can override it when their assembly differs. - Replace both hardcoded `"X"` comparisons in cnv_frq.py with a shared `_expected_copy_number()` helper that consults `self._sex_contig`. - Handle missing `sex_call` explicitly: if the sex-linked contig is requested but `sex_call` is absent from the sample metadata, emit a UserWarning and fall back to diploid for all samples rather than raising KeyError or silently returning a misleading scalar. The default of "X" keeps behaviour unchanged for Ag3, Af1, As1, and every other existing resource. New species (e.g. An. farauti) can opt in by passing `sex_contig=` during construction.
c37ccdf to
087182e
Compare
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (78.57%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## master #1314 +/- ##
==========================================
+ Coverage 88.88% 89.05% +0.17%
==========================================
Files 56 57 +1
Lines 6439 6508 +69
==========================================
+ Hits 5723 5796 +73
+ Misses 716 712 -4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add unit tests for _expected_copy_number covering:
- autosome contig returns scalar 2
- sex contig with sex_call returns per-sample CN (1 for M, 2 for F)
- sex contig without sex_call emits UserWarning and falls back to 2
- custom sex_contig name (e.g. "Z") is respected
Add property tests for AnophelesGenomeSequenceData.sex_contig covering
the default ("X") across ag3/af1/as1 sim fixtures, and the configurable
override.
Fills the three patch-coverage gaps reported by codecov on PR malariagen#1314
(cnv_frq.py warning branch and genome_sequence.py sex_contig getter).
|
Hi @jonbrenas — gentle follow-up on this one whenever you get a chance. Quick status: All CI checks are green (linting, coverage job, and tests across Python 3.10 / 3.11 / 3.12 × both numpy pins). |
|
I don't think there is a species for which the sex chromosomes are known and not named 'X' and 'Y'. |
Thanks for the quick reply, @jonbrenas — that's a fair point, and I agree: for every assembly That said, there are two pieces in the diff that stand on their own regardless of the
Since Happy to go in whichever direction you prefer:
Let me know which you'd like and I'll push the change (or close it) today. |
Summary
Fix #1313.
gene_cnv_frequencies()andgene_cnv_frequencies_advanced()hardcodedregion.contig == "X"when deciding the expected copy number per sample. That is correct for every currently supported assembly, but bakes in a gambiae-complex assumption: any future species whose sex-linked contig uses a different name would silently fall through to the diploid branch and report incorrect amplification / deletion frequencies for hemizygous males.Because CNV frequencies directly inform insecticide resistance surveillance, silent misclassification of the sex-linked contig is a correctness issue, not a cosmetic one.
Change
Configurable
sex_contigsex_contigconstructor parameter toAnophelesGenomeSequenceData, defaulting to"X", and exposed a matching publicsex_contigproperty. This mirrors the existing pattern used forvirtual_contigs.sex_contigthroughAnophelesDataResource.__init__so species subclasses can override it at construction time without touching the frequency analysis code.Single source of truth for expected copy number
"X"sites incnv_frq.pywith a shared_expected_copy_number()helper, which consultsself._sex_contigand encapsulates all ploidy logic in one place.Explicit handling of missing
sex_callPreviously, requesting the sex-linked contig with a sample frame that happened to lack
sex_call(e.g. an As1-style resource) would raise a bareKeyErrorfrom deep inside pandas. Now it emits aUserWarningand falls back to diploid for all samples, which is the safest interpretation when sex is unknown. Users get a clear message rather than an opaque crash, and the fallback behaviour is documented in the helper's docstring.Backwards compatibility
sex_contig="X"keeps behaviour identical for Ag3, Af1, As1, and every other existing resource.sex_contigproperty is additive.UserWarningonly fires on a code path that previously raisedKeyError, so nothing that used to succeed now warns.Test plan
pre-commit run --files …— passed (ruff, ruff-format, trailing-whitespace, end-of-file)poetry run mypy …— passed (no issues)poetry run pytest tests/anoph/test_cnv_frq.py— 68 passedpoetry run pytest tests/anoph/test_genome_sequence.py tests/anoph/test_base.py— 62 passedpoetry run pytest tests/anoph/test_dipclust.py— 64 passed (exercisesAnophelesCnvFrequencyAnalysisvia the diplotype clustering path)==2.0.2and>=2.0.2,<2.1)Files
malariagen_data/anoph/genome_sequence.pysex_contiginit param + public property.malariagen_data/anopheles.pysex_contigthroughAnophelesDataResource.malariagen_data/anoph/cnv_frq.py_expected_copy_number()helper; remove hardcoded"X"; addsex_callfallback warning.