Several modules in malariagen_data/anoph/ contain hardcoded or constructor-passed parameters that differ between species but are not managed through a centralized configuration. Two existing TODO comments from maintainers explicitly call this out:
genome_features.py:37-38
    # TODO Consider moving these parameters to configuration, as they could
    # change if the GFF ever changed.
    self._gff_gene_type = gff_gene_type
    self._gff_gene_name_attribute = gff_gene_name_attribute
    self._gff_default_attributes = gff_default_attributes
aim_data.py:44
    # TODO Consider moving this to data resource configuration.
    self._aim_ids = aim_ids
    self._aim_palettes = aim_palettes
Current State
AnophelesDataResource.__init__() (anopheles.py:105-143) currently accepts 36 constructor parameters, of which at least 15 are species-specific configuration values that vary between resources but rarely change at runtime:
| Parameter | Ag3 | Af1 | Adir1 | Amin1 |
|---|---|---|---|---|
| `gff_gene_type` | `"gene"` | `"protein_coding_gene"` | `"protein_coding_gene"` | `"protein_coding_gene"` |
| `gff_gene_name_attribute` | `"Name"` | `"Note"` | `"Note"` | `"Note"` |
| `default_site_mask` | `"gamb_colu_arab"` | `"funestus"` | `"dirus"` | `"minimus"` |
| `default_phasing_analysis` | `"gamb_colu_arab"` | `"funestus"` | `"dirus"` | `None` |
| `default_coverage_calls_analysis` | `"gamb_colu"` | `"funestus"` | `"dirus"` | `"minimus_noneyet"` |
| `aim_analysis` | `"20220528"` | `None` | `None` | `None` |
| `cohorts_analysis` | `"20230516"` | `"20221129"` | `"20250116"` | `"20250116"` |
| `_roh_hmm_cache_name` | `"ag3_roh_hmm_v1"` | `"af1_roh_hmm_v1"` | ... | ... |
| `_xpehh_gwss_cache_name` | `"ag3_xpehh_gwss_v1"` | `"af1_xpehh_gwss_v1"` | ... | ... |
| `_ihs_gwss_cache_name` | `"ag3_ihs_gwss_v1"` | `"af1_ihs_gwss_v1"` | ... | ... |
Each species class (Ag3, Af1, Adir1, Amin1, Adar1) re-declares most of these values as module-level constants and passes them to super().__init__(). When adding a new species, a developer must:
1. Study an existing species file to identify which params are needed
2. Copy 15+ constant declarations
3. Know the correct value for each param (e.g., "gene" vs "protein_coding_gene")
4. Set cache name class attributes in the right format
5. Set optional features to None manually (e.g., aim_analysis=None)
There is no validation, no documentation of valid values, and no single reference showing what a "complete" species configuration looks like.
Impact
- Adding new species is fragile: When integrating a new species (as seen with Adir1 via PR #795, "Adir1.x", and the upcoming Adar1 in #837, "Add support for Anopheles darlingi (Adar1.x)"), there is no schema to validate that all required params are set correctly.
- Cache names are inconsistent: Issue #1151 ("Error accessing roh_hmm") happened partly because cache name params are managed through scattered class attributes and abstract methods rather than a unified config.
- Maintenance burden: Changes to config structure (e.g., adding a new species-specific parameter) require updating 5+ species files.
Proposed Solution
Create a SpeciesConfig dataclass that bundles all species-specific parameters into a single, validated object:
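A minimal sketch of what such a dataclass might look like, using only the standard `dataclasses` module. The field names mirror a subset of the parameters listed under Current State. The `resource_prefix` field, the `cache_name()` helper, and the validation rule in `__post_init__` are hypothetical illustrations and do not exist in `malariagen_data` today.

```python
from dataclasses import dataclass
from typing import Optional

# Names of fields that every species is assumed to require (illustrative).
_REQUIRED_FIELDS = (
    "resource_prefix",
    "gff_gene_type",
    "gff_gene_name_attribute",
    "default_site_mask",
    "cohorts_analysis",
)


@dataclass(frozen=True)
class SpeciesConfig:
    """Hypothetical bundle of species-specific settings (sketch only)."""

    # Required: every species must state these explicitly.
    resource_prefix: str  # hypothetical field, e.g. "ag3"
    gff_gene_type: str
    gff_gene_name_attribute: str
    default_site_mask: str
    cohorts_analysis: str

    # Optional: features a species does not have yet default to None.
    default_phasing_analysis: Optional[str] = None
    default_coverage_calls_analysis: Optional[str] = None
    aim_analysis: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed validation rule: required fields must be non-empty strings.
        for name in _REQUIRED_FIELDS:
            value = getattr(self, name)
            if not isinstance(value, str) or not value:
                raise ValueError(f"{name} must be a non-empty string, got {value!r}")

    def cache_name(self, analysis: str, version: int = 1) -> str:
        """Build a cache name such as 'ag3_roh_hmm_v1' from a single pattern."""
        return f"{self.resource_prefix}_{analysis}_v{version}"
```

Deriving cache names from one pattern, rather than declaring each as a separate class attribute, is one possible way to keep them consistent across species; whether the real names should be derived or declared is a design decision for the maintainers.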
Each species class would then simply define one config object:
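For illustration, the Ag3 and Af1 values listed under Current State could be written as follows. The trimmed `SpeciesConfig` here is a hypothetical stand-in so that the block runs on its own, and `AG3_CONFIG` and `AF1_CONFIG` are assumed names, not existing constants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeciesConfig:
    # Trimmed, hypothetical stand-in covering a subset of the fields.
    gff_gene_type: str
    gff_gene_name_attribute: str
    default_site_mask: str
    cohorts_analysis: str
    default_phasing_analysis: Optional[str] = None
    default_coverage_calls_analysis: Optional[str] = None
    aim_analysis: Optional[str] = None


# Values copied from the Ag3 column of the parameter table.
AG3_CONFIG = SpeciesConfig(
    gff_gene_type="gene",
    gff_gene_name_attribute="Name",
    default_site_mask="gamb_colu_arab",
    cohorts_analysis="20230516",
    default_phasing_analysis="gamb_colu_arab",
    default_coverage_calls_analysis="gamb_colu",
    aim_analysis="20220528",
)

# Values copied from the Af1 column; aim_analysis is omitted and stays None.
AF1_CONFIG = SpeciesConfig(
    gff_gene_type="protein_coding_gene",
    gff_gene_name_attribute="Note",
    default_site_mask="funestus",
    cohorts_analysis="20221129",
    default_phasing_analysis="funestus",
    default_coverage_calls_analysis="funestus",
)
```

How the object reaches `AnophelesDataResource.__init__()` (for example as a single `config` argument) is left open here.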
Benefits
- Look at `SpeciesConfig` to see every knob available.
- `frozen=True` prevents accidental mutation; type hints catch errors early.
- No manual `None` declarations.
- Addresses the TODOs at `genome_features.py:37` and `aim_data.py:44`.
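The mutation guarantee can be shown with a small sketch, again using a hypothetical two-field stand-in for `SpeciesConfig`. Note that dataclasses do not enforce type hints at runtime, so catching type errors early would rely on a static checker such as mypy.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SpeciesConfig:
    # Two-field hypothetical stand-in, just enough to show the behaviour.
    gff_gene_type: str
    default_site_mask: str


config = SpeciesConfig(gff_gene_type="gene", default_site_mask="gamb_colu_arab")

try:
    config.gff_gene_type = "protein_coding_gene"  # accidental mutation
    mutated = True
except FrozenInstanceError:
    mutated = False  # assignment on a frozen dataclass is rejected

# Deliberate changes go through replace(), which returns a new object
# and leaves the original untouched.
variant = replace(config, default_site_mask="funestus")
```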