Refactor: Move hardcoded species-specific parameters to a unified configuration schema #1194

@khushthecoder

Description

Several modules in malariagen_data/anoph/ contain hardcoded or constructor-passed parameters that differ between species but are not managed through a centralized configuration. Two existing TODO comments from maintainers explicitly call this out:

genome_features.py:37-38

# TODO Consider moving these parameters to configuration, as they could
# change if the GFF ever changed.
self._gff_gene_type = gff_gene_type
self._gff_gene_name_attribute = gff_gene_name_attribute
self._gff_default_attributes = gff_default_attributes

aim_data.py:44

# TODO Consider moving this to data resource configuration.
self._aim_ids = aim_ids
self._aim_palettes = aim_palettes

Current State

AnophelesDataResource.__init__() (anopheles.py:105-143) currently accepts 36 constructor parameters, of which at least 15 are species-specific configuration values that vary between resources but rarely change at runtime:

| Parameter | Ag3 | Af1 | Adir1 | Amin1 |
| --- | --- | --- | --- | --- |
| gff_gene_type | "gene" | "protein_coding_gene" | "protein_coding_gene" | "protein_coding_gene" |
| gff_gene_name_attribute | "Name" | "Note" | "Note" | "Note" |
| default_site_mask | "gamb_colu_arab" | "funestus" | "dirus" | "minimus" |
| default_phasing_analysis | "gamb_colu_arab" | "funestus" | "dirus" | None |
| default_coverage_calls_analysis | "gamb_colu" | "funestus" | "dirus" | "minimus_noneyet" |
| aim_analysis | "20220528" | None | None | None |
| cohorts_analysis | "20230516" | "20221129" | "20250116" | "20250116" |
| _roh_hmm_cache_name | "ag3_roh_hmm_v1" | "af1_roh_hmm_v1" | ... | ... |
| _xpehh_gwss_cache_name | "ag3_xpehh_gwss_v1" | "af1_xpehh_gwss_v1" | ... | ... |
| _ihs_gwss_cache_name | "ag3_ihs_gwss_v1" | "af1_ihs_gwss_v1" | ... | ... |

Each species class (Ag3, Af1, Adir1, Amin1, Adar1) re-declares most of these values as module-level constants and passes them to super().__init__(). When adding a new species, a developer must:

  1. Study an existing species file to identify which params are needed
  2. Copy 15+ constant declarations
  3. Know the correct value for each param (e.g., "gene" vs "protein_coding_gene")
  4. Set cache name class attributes in the right format
  5. Set optional features to None manually (e.g., aim_analysis=None)

There is no validation, no documentation of valid values, and no single reference showing what a "complete" species configuration looks like.


Impact

  • Adding new species is fragile: When integrating a new species (as seen with Adir1 in PR #795, "Adir1.x", and the upcoming Adar1 in #837, "Add support for Anopheles darlingi (Adar1.x)"), there is no schema to validate that all required params are set correctly.
  • Cache names are inconsistent: Issue #1151 ("Error accessing roh_hmm") happened partly because cache name params are managed through scattered class attributes and abstract methods rather than a unified config.
  • Maintenance burden: Changes to config structure (e.g., adding a new species-specific parameter) require updating 5+ species files.
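
As an illustration of the cache-name point, here is a minimal sketch that derives every name from one species prefix. It assumes all names follow the `<species>_<analysis>_v<version>` pattern visible in the table under Current State; `make_cache_name` is a hypothetical helper, not an existing function in malariagen_data.

```python
def make_cache_name(species_prefix: str, analysis: str, version: int = 1) -> str:
    """Build a cache name such as 'ag3_roh_hmm_v1' (hypothetical helper)."""
    if not species_prefix or not analysis:
        raise ValueError("species_prefix and analysis must be non-empty")
    return f"{species_prefix.lower()}_{analysis}_v{version}"


# The three analyses whose cache names appear in the table.
ANALYSES = ("roh_hmm", "xpehh_gwss", "ihs_gwss")

# One prefix yields all three names, so they cannot drift apart.
ag3_cache_names = {a: make_cache_name("ag3", a) for a in ANALYSES}
print(ag3_cache_names["roh_hmm"])
```

With names built in one place, a typo or a missing attribute in a single species file is no longer possible for this group of parameters.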

Proposed Solution

Create a SpeciesConfig dataclass that bundles all species-specific parameters into a single, validated object:

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class SpeciesConfig:
    """Configuration for a species data resource."""
    # GFF features
    gff_gene_type: str = "protein_coding_gene"
    gff_gene_name_attribute: str = "Note"
    gff_default_attributes: Tuple[str, ...] = ("ID", "Parent", "Note", "description")

    # Analysis defaults
    default_site_mask: Optional[str] = None
    default_phasing_analysis: Optional[str] = None
    default_coverage_calls_analysis: Optional[str] = None

    # AIM configuration
    aim_analysis: Optional[str] = None
    aim_metadata_dtype: Optional[Dict[str, str]] = None

    # Cache names
    roh_hmm_cache_name: str = "roh_hmm_v1"
    xpehh_gwss_cache_name: str = "xpehh_gwss_v1"
    ihs_gwss_cache_name: str = "ihs_gwss_v1"

    # Display
    taxon_colors: Optional[Dict[str, str]] = None
    virtual_contigs: Optional[Dict[str, List[str]]] = None
    gene_names: Optional[Dict[str, str]] = None
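
To make the effect of the defaults and of frozen=True concrete, here is a small sketch using a trimmed copy of the dataclass (three fields only, so that the snippet runs on its own). The variable names `minimal` and `variant` are made up for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SpeciesConfig:
    """Trimmed copy of the proposed schema, for illustration only."""

    gff_gene_type: str = "protein_coding_gene"
    default_site_mask: Optional[str] = None
    aim_analysis: Optional[str] = None


# Only the values that differ from the defaults need to be spelled out.
minimal = SpeciesConfig(default_site_mask="minimus")
assert minimal.gff_gene_type == "protein_coding_gene"
assert minimal.aim_analysis is None

# Assigning to a field on a frozen dataclass raises FrozenInstanceError.
try:
    minimal.default_site_mask = "dirus"
except FrozenInstanceError:
    print("config is immutable")

# dataclasses.replace() returns a modified copy and leaves the original alone.
variant = replace(minimal, default_site_mask="dirus")
```

If a test or notebook needs a slightly different configuration, `dataclasses.replace` gives a copy without touching the shared module-level object.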

Each species class would then simply define one config object:

AG3_CONFIG = SpeciesConfig(
    gff_gene_type="gene",
    gff_gene_name_attribute="Name",
    default_site_mask="gamb_colu_arab",
    ...
)
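
How a resource class might consume such an object is sketched below. `ExampleResource`, its `config` parameter, and the bucket URL are hypothetical stand-ins, not the actual AnophelesDataResource signature; the point is only that one argument could replace many. The field values are taken from the table under Current State.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeciesConfig:
    """Trimmed copy of the proposed schema, for illustration only."""

    gff_gene_type: str = "protein_coding_gene"
    gff_gene_name_attribute: str = "Note"
    default_site_mask: Optional[str] = None
    aim_analysis: Optional[str] = None


class ExampleResource:
    """Hypothetical stand-in for a species data resource class."""

    def __init__(self, url: str, config: SpeciesConfig):
        self._url = url
        self._config = config

    @property
    def supports_aim(self) -> bool:
        # Optional features are switched off by leaving the field as None.
        return self._config.aim_analysis is not None

    def gene_filter(self) -> str:
        # Placeholder for wherever the GFF gene type is actually used.
        return f"type == '{self._config.gff_gene_type}'"


AG3_LIKE = SpeciesConfig(
    gff_gene_type="gene",
    gff_gene_name_attribute="Name",
    default_site_mask="gamb_colu_arab",
    aim_analysis="20220528",
)
AF1_LIKE = SpeciesConfig(default_site_mask="funestus")

ag3 = ExampleResource(url="gs://example-bucket/", config=AG3_LIKE)
af1 = ExampleResource(url="gs://example-bucket/", config=AF1_LIKE)
```

In this sketch the second resource inherits the GFF defaults and has AIM disabled without any explicit None arguments.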

Benefits

  • Self-documenting: A new contributor can read SpeciesConfig to see every knob available.
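  • Validated: frozen=True prevents accidental reassignment of fields, and the type hints let a static checker such as mypy catch wrong value types early. (Dataclasses do not enforce type hints at runtime, and dict-valued fields can still be mutated in place.)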
  • Sensible defaults: Species that lack optional features (AIM, phasing, etc.) just use the defaults without explicit None declarations.
  • Adding new species becomes trivial: Define one config object instead of 15+ scattered constants.
  • Addresses both TODOs in genome_features.py:37 and aim_data.py:44.
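
If checks at construction time are wanted in addition to static typing, one option is a `__post_init__` hook. The sketch below uses a trimmed copy of the dataclass, and the specific rules it enforces are illustrative assumptions, not requirements stated by the maintainers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeciesConfig:
    """Trimmed copy of the proposed schema with illustrative validation."""

    gff_gene_type: str = "protein_coding_gene"
    default_site_mask: Optional[str] = None
    roh_hmm_cache_name: str = "roh_hmm_v1"

    def __post_init__(self) -> None:
        # Required fields must be non-empty strings.
        for name in ("gff_gene_type", "roh_hmm_cache_name"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value:
                raise ValueError(f"{name} must be a non-empty string, got {value!r}")
        # Optional fields may be None, but not some other type.
        mask = self.default_site_mask
        if mask is not None and not isinstance(mask, str):
            raise ValueError(f"default_site_mask must be a string or None, got {mask!r}")


# A bad value now fails when the config is defined, not deep inside an analysis.
try:
    SpeciesConfig(gff_gene_type="")
except ValueError as err:
    print(err)
```

Reading fields inside `__post_init__` works on a frozen dataclass; only assignment is blocked.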
