cnv_frq.py Hardcodes X-Chromosome Sex-Linked Ploidy Assumption — Breaks CNV Frequency Analysis for Non-Gambiae Species #1313

@khushthecoder

Description

The CNV frequency analysis in gene_cnv_frequencies() and gene_cnv_frequencies_advanced() contains a hardcoded assumption at cnv_frq.py:293 and cnv_frq.py:560:

if region.contig == "X":
    is_male = (df_samples["sex_call"] == "M").values
    expected_cn = np.where(is_male, 1, 2)[np.newaxis, :]
else:
    expected_cn = 2

This assumes:

  • The sex-linked chromosome is always named "X". This holds for An. gambiae (Ag3) but is not guaranteed for other species assemblies.
  • Male mosquitoes are always hemizygous (CN=1) on the sex chromosome. This holds for the gambiae complex, but An. farauti and other Pacific vectors may use different chromosome naming or a different sex-determination architecture.
  • The sex_call column is always present and populated. As1 samples may not have sex_call populated.

As MalariaGEN expands to Pacific vectors (An. farauti) for GSoC Project 1, this hardcoding will silently produce incorrect amplification/deletion frequencies for any species whose genome assembly names its sex-linked contig something other than "X".

Why It Is Important

  • CNV frequencies directly inform insecticide resistance surveillance — incorrect values lead to wrong public health decisions.
  • GSoC Project 1 explicitly requires building cloud-native analysis functions for An. farauti that are interoperable with existing resources.
  • This is a correctness bug, not a style issue.

Steps to Reproduce

  1. Examine malariagen_data/anoph/cnv_frq.py lines 293 and 560.
  2. Note the hardcoded "X" string comparison.
  3. Attempt to run gene_cnv_frequencies() on a species where the sex chromosome is not named "X" — the expected copy number will default to 2 for all samples regardless of sex.
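
The failure mode in step 3 can be illustrated without loading any data. The sketch below is a simplified stand-in for the branch quoted in the Description, not the library code itself; the function name expected_cn_hardcoded and the contig name "chrX" are hypothetical and chosen only for illustration.

```python
import numpy as np


def expected_cn_hardcoded(contig, sex_call):
    """Simplified stand-in for the hardcoded branch in cnv_frq.py."""
    if contig == "X":
        is_male = sex_call == "M"
        # One row, one column per sample, so it broadcasts against (genes, samples).
        return np.where(is_male, 1, 2)[np.newaxis, :]
    # Any other contig name falls through to diploid for every sample.
    return 2


sex_call = np.array(["M", "F", "M"])

# Ag3-style naming: males get an expected copy number of 1.
cn_x = expected_cn_hardcoded("X", sex_call)
print(cn_x)

# Hypothetical assembly whose sex-linked contig is called "chrX":
# the string comparison fails and all samples are treated as diploid.
cn_other = expected_cn_hardcoded("chrX", sex_call)
print(cn_other)
```

In the second call a male with a normal single copy would be compared against an expected copy number of 2, so it would be counted as a deletion without any error or warning.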

Proposed Approach

1. Add a _sex_contig property to the species configuration.

Each species class defines which contig is sex-linked:

# e.g., in the species-specific resource class
@property
def _sex_contig(self) -> str:
    return "X"  # override per species as needed

2. Replace the hardcoded string comparison.

# Before (current behaviour)
if region.contig == "X":
    ...
 
# After (proposed fix)
if region.contig == self._sex_contig:
    ...

3. Add a safe fallback when sex_call is missing.

Inside the sex-contig branch, rather than crashing with a KeyError or silently computing incorrect values, emit a warning and default to diploid:

if "sex_call" not in df_samples.columns:
    warnings.warn(
        "sex_call column not found; defaulting to diploid (CN=2) for all samples.",
        UserWarning,
    )
    expected_cn = 2
else:
    is_male = (df_samples["sex_call"] == "M").values
    expected_cn = np.where(is_male, 1, 2)[np.newaxis, :]
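
The check above only covers a missing column. A column that exists but is unpopulated for some samples would still pass through without a warning. A minimal sketch of one way to cover both cases follows; the helper name expected_cn_on_sex_contig is hypothetical, and treating any value other than "M" or "F" as unpopulated is an assumption rather than something the metadata schema is known to guarantee.

```python
import warnings

import numpy as np
import pandas as pd


def expected_cn_on_sex_contig(df_samples: pd.DataFrame) -> np.ndarray:
    """Expected copy number per sample on the sex-linked contig (hypothetical helper)."""
    n_samples = len(df_samples)
    if "sex_call" not in df_samples.columns:
        warnings.warn(
            "sex_call column not found; defaulting to diploid (CN=2) for all samples.",
            UserWarning,
        )
        return np.full((1, n_samples), 2)

    sex_call = df_samples["sex_call"]
    # Assumption: anything other than "M" or "F" counts as unpopulated.
    is_unknown = ~sex_call.isin(["M", "F"])
    if is_unknown.any():
        warnings.warn(
            f"{int(is_unknown.sum())} sample(s) have no usable sex_call; "
            "treating them as diploid (CN=2).",
            UserWarning,
        )
    is_male = (sex_call == "M").values
    return np.where(is_male, 1, 2)[np.newaxis, :]


# Usage: the third sample has no sex call, so a warning is emitted.
df = pd.DataFrame({"sex_call": ["M", "F", None]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cn = expected_cn_on_sex_contig(df)
print(cn)
```

Unlike the snippet above, this sketch always returns an array of shape (1, n_samples), which keeps the return type uniform; whether that is preferable to the scalar 2 depends on how the calling code broadcasts the value.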

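4. Pass sex_contig through AnophelesDataResource.__init__() alongside existing species-specific parameters such as virtual_contigs. Storing it as self._sex_contig means _sex_contig must be a plain attribute here, not the read-only property from step 1, because assigning to a property without a setter raises AttributeError: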

def __init__(
    self,
    ...,
    sex_contig: str = "X",  # new param, defaults to "X" for backward compatibility
    virtual_contigs: Optional[Mapping] = None,
    ...
):
    self._sex_contig = sex_contig
    ...
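
To show how the constructor parameter and the comparison from step 2 would fit together, here is a self-contained sketch. ResourceSketch and is_sex_linked are hypothetical names standing in for AnophelesDataResource and its internals, and "chrX" is an invented contig name; none of these are taken from the library.

```python
from typing import Mapping, Optional


class ResourceSketch:
    """Hypothetical stand-in for AnophelesDataResource; not the real class."""

    def __init__(
        self,
        sex_contig: str = "X",  # defaults to "X" for backward compatibility
        virtual_contigs: Optional[Mapping] = None,
    ):
        self._sex_contig = sex_contig
        self._virtual_contigs = virtual_contigs

    def is_sex_linked(self, contig: str) -> bool:
        # Replaces the hardcoded `contig == "X"` comparison.
        return contig == self._sex_contig


# Existing users who pass nothing keep the current behaviour.
ag3_like = ResourceSketch()
print(ag3_like.is_sex_linked("X"), ag3_like.is_sex_linked("2L"))

# A new species would only need to name its sex-linked contig.
other = ResourceSketch(sex_contig="chrX")
print(other.is_sex_linked("chrX"), other.is_sex_linked("X"))
```

Keeping "X" as the default means callers that never pass sex_contig see no change, which is the backward-compatibility property the proposal relies on.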

Expected Outcome After Fix

  • CNV frequency analysis produces correct sex-adjusted copy numbers for any Anopheles species, not just gambiae.
  • New species (An. farauti, An. darlingi) can be onboarded by simply specifying their sex contig name in the constructor.
  • No behavior change for existing Ag3 / Af1 / As1 users.
  • Missing sex_call data raises an explicit warning rather than silently returning incorrect frequencies.
