Summary
_setup_cohort_queries() still drops sample_query_options when it re-applies cohort queries during cohort-size validation, so parameterised sample filters like country in @countries_list fail in any API that relies on shared cohort normalisation.
Why this matters
This is the same correctness family as #995 / #999, but at a deeper layer: the recent caller-level forwarding fixes in #996 and #1001 do not cover the helper that actually rebuilds and rechecks cohort queries.
As a result, higher-level analyses that accept sample_query_options still break for valid pandas query context such as:
- local_dict
- global_dict
- resolvers
That matters for multi-cohort analysis surfaces because it makes query behaviour inconsistent depending on which code path is used, even when the public signature advertises sample_query_options support.
Reproduction
Using the simulated-data APIs locally, both of these currently raise pandas.errors.UndefinedVariableError: local variable 'countries_list' is not defined:
countries_list = ["Angola", "Mali"]
api.pairwise_average_fst(
    region="2L",
    cohorts="country",
    sample_query="country in @countries_list",
    sample_query_options={"local_dict": {"countries_list": countries_list}},
    min_cohort_size=1,
)
api.plot_h12_gwss_multi_overlay(
    contig="2L",
    cohorts="country",
    window_size=200,
    sample_query="country in @countries_list",
    sample_query_options={"local_dict": {"countries_list": countries_list}},
    min_cohort_size=1,
    show=False,
)
The failure happens inside malariagen_data/anoph/sample_metadata.py when _setup_cohort_queries() loops over derived cohort queries and calls:
self.sample_metadata(sample_sets=sample_sets, sample_query=cohort_query)
without passing sample_query_options.
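The root cause can be reproduced with plain pandas, independent of malariagen_data. The sketch below is illustrative only: the toy metadata frame and both helper functions are hypothetical stand-ins, not the library's code. It shows that an `@variable` in a query resolves only when `local_dict` actually reaches `DataFrame.query()`.

```python
import pandas as pd

# Toy stand-in for sample metadata; the columns are illustrative only.
metadata = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4"],
        "country": ["Angola", "Mali", "Kenya", "Mali"],
    }
)


def count_dropping_options(df, cohort_query, query_options=None):
    """Hypothetical helper that ignores query_options, as in the bug."""
    return len(df.query(cohort_query, engine="python"))


def count_forwarding_options(df, cohort_query, query_options=None):
    """Hypothetical helper that forwards query_options to DataFrame.query()."""
    options = {"engine": "python", **(query_options or {})}
    return len(df.query(cohort_query, **options))


def demo():
    query = "country in @countries_list"
    # The list lives only inside local_dict, as it would for a library caller.
    options = {"local_dict": {"countries_list": ["Angola", "Mali"]}}
    try:
        count_dropping_options(metadata, query, options)
        dropped_ok = True
    except NameError:
        # pandas raises UndefinedVariableError, a NameError subclass.
        dropped_ok = False
    forwarded_count = count_forwarding_options(metadata, query, options)
    return dropped_ok, forwarded_count
```

Calling `demo()` exercises both helpers: the one that drops the options cannot see `countries_list`, because the name is defined neither in its own frame nor in its module globals, while the forwarding helper resolves it from `local_dict`.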
Additional mismatch in the same query contract
Once sample_query_options is forwarded into the shared helper, sample_metadata() also needs to treat engine="python" as a default that callers can override, rather than always passing it as its own keyword argument alongside the forwarded options. The current implementation documents engine as a supported query option, but paths that now correctly forward options containing an engine key, such as sample_query_options={"engine": "python"}, can trip:
TypeError: DataFrame.query() got multiple values for keyword argument 'engine'
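The duplicate-keyword error is ordinary Python call semantics, so it can be shown without pandas. In the sketch below, `query_stub` is a hypothetical stand-in for `DataFrame.query()`, and `normalise_query_options` is an assumed helper name rather than anything that exists in the library. It illustrates one way to keep `engine="python"` as a default while letting an explicit value win.

```python
def query_stub(expr, *, engine=None, **kwargs):
    """Stand-in for DataFrame.query(); returns the engine it received."""
    return engine


def call_with_forced_engine(expr, query_options=None):
    # Mirrors the failure mode: engine is hard-coded and also unpacked
    # from the options, so an "engine" key collides with the keyword.
    return query_stub(expr, engine="python", **(query_options or {}))


def normalise_query_options(query_options=None):
    """Hypothetical helper: copy the options and fill in defaults."""
    options = dict(query_options or {})  # copy, so the caller's dict is untouched
    options.setdefault("engine", "python")
    return options


def call_with_default_engine(expr, query_options=None):
    # engine arrives exactly once, from the normalised options.
    return query_stub(expr, **normalise_query_options(query_options))
```

With this shape, `call_with_forced_engine("x", {"engine": "python"})` raises `TypeError`, whereas `call_with_default_engine` accepts the same options and also falls back to `"python"` when no engine is given.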
Proposed fix direction
- Pass sample_query_options through the cohort-size validation calls inside _setup_cohort_queries().
- Normalize query options in one shared place so engine="python" remains the default, but an explicitly supplied engine does not become a duplicate keyword.
- Add regression tests at both the metadata layer and at least one multi-cohort public API surface (e.g. H12/Fst) using local_dict.
Impacted files
- malariagen_data/anoph/sample_metadata.py
- malariagen_data/anoph/base.py
This looks like a good small-but-deep correctness fix because one shared abstraction affects multiple analysis entry points.