Description
The CNV frequency analysis in gene_cnv_frequencies() and gene_cnv_frequencies_advanced() contains a hardcoded assumption at cnv_frq.py:293 and cnv_frq.py:560:
    if region.contig == "X":
        is_male = (df_samples["sex_call"] == "M").values
        expected_cn = np.where(is_male, 1, 2)[np.newaxis, :]
    else:
        expected_cn = 2
This assumes:
- The sex-linked chromosome is always named "X". This is true for An. gambiae (Ag3) but not guaranteed for other species assemblies.
- Male mosquitoes are always hemizygous (CN=1) on the sex chromosome. This is true for the gambiae complex, but An. farauti and other Pacific vectors may use different chromosome naming or sex-determination architecture.
- The sex_call column is always present. As1 samples may not have sex_call populated.
As MalariaGEN expands to Pacific vectors (An. farauti) for GSoC Project 1, this hardcoding will silently produce incorrect amplification/deletion frequencies for any species whose genome assembly uses different contig naming conventions.
Why It Is Important
- CNV frequencies directly inform insecticide resistance surveillance — incorrect values lead to wrong public health decisions.
- GSoC Project 1 explicitly requires building cloud-native analysis functions for An. farauti that are interoperable with existing resources.
- This is a correctness bug, not a style issue.
Steps to Reproduce
- Examine malariagen_data/anoph/cnv_frq.py lines 293 and 560.
- Note the hardcoded "X" string comparison.
- Attempt to run gene_cnv_frequencies() on a species where the sex chromosome is not named "X": the expected copy number will default to 2 for all samples regardless of sex.
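The effect of the mismatch can be illustrated with a small, self-contained sketch. It is not the library's frequency code: the copy-number matrix, the sex calls, the helper `deletion_frequency`, and the contig name "chrX" are all hypothetical, and a deletion is simplified to "copy number below expectation".

```python
import numpy as np

# Hypothetical copy-number calls: rows are genes, columns are samples.
cn = np.array([[1, 2, 1, 2],
               [2, 3, 1, 2]])
# Hypothetical sex calls for the four samples.
sex_call = np.array(["M", "F", "M", "F"])


def deletion_frequency(cn, sex_call, contig, sex_contig):
    """Fraction of samples per gene with copy number below expectation."""
    if contig == sex_contig:
        # Males are expected to carry one copy, females two.
        is_male = sex_call == "M"
        expected_cn = np.where(is_male, 1, 2)[np.newaxis, :]
    else:
        # Falls through to the diploid expectation for every sample.
        expected_cn = 2
    return (cn < expected_cn).mean(axis=1)


# Contig name matches: hemizygous males are not counted as deletions.
freq_ok = deletion_frequency(cn, sex_call, contig="X", sex_contig="X")
# Sex chromosome named "chrX" but compared against "X": males look deleted.
freq_bad = deletion_frequency(cn, sex_call, contig="chrX", sex_contig="X")
```

In this toy data the same calls yield no deletions when the contig name matches, and non-zero deletion frequencies when it does not, with no error or warning in either case.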
Proposed Approach
1. Add a sex_contig property to the species configuration.
Each species class defines which contig is sex-linked:
    # e.g., in the species-specific resource class
    @property
    def _sex_contig(self) -> str:
        return "X"  # override per species as needed
2. Replace the hardcoded string comparison.
    # Before (current behaviour)
    if region.contig == "X":
        ...
    # After (proposed fix)
    if region.contig == self._sex_contig:
        ...
3. Add a safe fallback when sex_call is missing.
Rather than crashing or silently computing incorrect values, emit a warning and default to diploid:
    if "sex_call" not in df_samples.columns:
        warnings.warn(
            "sex_call column not found; defaulting to diploid (CN=2) for all samples.",
            UserWarning,
        )
        expected_cn = 2
    else:
        is_male = (df_samples["sex_call"] == "M").values
        expected_cn = np.where(is_male, 1, 2)[np.newaxis, :]
4. Pass sex_contig through AnophelesDataResource.__init__() alongside existing species-specific params like virtual_contigs:
    def __init__(
        self,
        ...,
        sex_contig: str = "X",  # new param, defaults to "X" for backward compatibility
        virtual_contigs: Optional[Mapping] = None,
        ...
    ):
        self._sex_contig = sex_contig
        ...
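Steps 2 to 4 could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the actual malariagen_data implementation: `SpeciesResource` and its `expected_cn` method are hypothetical stand-ins, and "chrX" is an invented contig name for a species that does not use "X".

```python
import warnings

import numpy as np
import pandas as pd


class SpeciesResource:
    """Hypothetical stand-in for a species resource class (not the real API)."""

    def __init__(self, sex_contig: str = "X"):
        # Plain attribute; the "X" default keeps existing behaviour unchanged.
        self._sex_contig = sex_contig

    def expected_cn(self, contig: str, df_samples: pd.DataFrame):
        """Return expected copy number: scalar 2, or a (1, n_samples) array."""
        if contig != self._sex_contig:
            return 2
        if "sex_call" not in df_samples.columns:
            warnings.warn(
                "sex_call column not found; defaulting to diploid (CN=2) "
                "for all samples.",
                UserWarning,
            )
            return 2
        is_male = (df_samples["sex_call"] == "M").values
        return np.where(is_male, 1, 2)[np.newaxis, :]


samples = pd.DataFrame({"sex_call": ["M", "F", "M"]})
default_res = SpeciesResource()                  # sex contig "X"
other_res = SpeciesResource(sex_contig="chrX")   # hypothetical naming

cn_default = default_res.expected_cn("X", samples)
cn_autosome = default_res.expected_cn("2L", samples)
cn_other = other_res.expected_cn("chrX", samples)

# Missing sex_call: a warning is recorded and the diploid default is used.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cn_missing = other_res.expected_cn("chrX", samples.drop(columns="sex_call"))
```

One design point to settle: a read-only `_sex_contig` property (step 1) and an assignment to `self._sex_contig` in `__init__` (step 4) cannot share the same name on one class, because assigning to a property that has no setter raises `AttributeError`. The sketch uses the constructor attribute only.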
Expected Outcome After Fix
- CNV frequency analysis produces correct sex-adjusted copy numbers for any Anopheles species, not just gambiae.
- New species (An. farauti, An. darlingi) can be onboarded by simply specifying their sex contig name in the constructor.
- No behavior change for existing Ag3 / Af1 / As1 users.
- Missing sex_call data raises an explicit warning rather than silently returning incorrect frequencies.