Preserve sample_query_options during cohort query normalisation #1100

@kshirajahere


Summary

_setup_cohort_queries() still drops sample_query_options when it re-applies cohort queries during cohort-size validation, so parameterised sample filters such as "country in @countries_list" fail in any API that relies on shared cohort normalisation.

Why this matters

This is the same correctness family as #995 / #999, but at a deeper layer: the recent caller-level forwarding fixes in #996 and #1001 do not cover the helper that actually rebuilds and rechecks cohort queries.

As a result, higher-level analyses that accept sample_query_options still break for valid pandas query context such as:

  • local_dict
  • global_dict
  • resolvers

That matters for multi-cohort analysis surfaces because it makes query behaviour inconsistent depending on which code path is used, even when the public signature advertises sample_query_options support.

Reproduction

Using the simulated-data APIs locally, both of these currently raise pandas.errors.UndefinedVariableError: local variable 'countries_list' is not defined:

countries_list = ["Angola", "Mali"]

api.pairwise_average_fst(
    region="2L",
    cohorts="country",
    sample_query="country in @countries_list",
    sample_query_options={"local_dict": {"countries_list": countries_list}},
    min_cohort_size=1,
)

api.plot_h12_gwss_multi_overlay(
    contig="2L",
    cohorts="country",
    window_size=200,
    sample_query="country in @countries_list",
    sample_query_options={"local_dict": {"countries_list": countries_list}},
    min_cohort_size=1,
    show=False,
)

The failure happens inside malariagen_data/anoph/sample_metadata.py when _setup_cohort_queries() loops over derived cohort queries and calls:

self.sample_metadata(sample_sets=sample_sets, sample_query=cohort_query)

without passing sample_query_options.
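The same failure mode can be shown with pandas alone. The following is a minimal sketch, not the library's code: `apply_query` is a hypothetical stand-in for a helper that applies a sample query, and the small DataFrame is made-up illustration data. It shows that the `@countries_list` reference resolves only when `local_dict` is forwarded into `DataFrame.query()`.

```python
import pandas as pd

# Made-up illustration data, not real sample metadata.
df = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "country": ["Angola", "Mali", "Kenya"],
    }
)


def apply_query(frame, query, options=None):
    """Hypothetical stand-in for a helper that applies a sample query."""
    options = options or {}
    return frame.query(query, **options)


query = "country in @countries_list"
options = {"local_dict": {"countries_list": ["Angola", "Mali"]}}

# Options forwarded: the @-variable is resolved from local_dict.
kept = apply_query(df, query, options)

# Options dropped: the @-variable cannot be resolved.
# pandas' UndefinedVariableError is a subclass of NameError.
try:
    apply_query(df, query)
    dropped_error = None
except NameError as exc:
    dropped_error = exc
```

Here `kept` holds the Angola and Mali rows, while the second call fails because nothing named `countries_list` is visible from inside the helper.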

Additional mismatch in the same query contract

Once sample_query_options is forwarded into the shared helper, sample_metadata() also needs to treat engine="python" as a default rather than always passing it as an explicit keyword argument alongside the forwarded options. The current implementation documents engine as a supported query option, but paths that now correctly forward sample_query_options={"engine": "python"} can raise:

TypeError: DataFrame.query() got multiple values for keyword argument 'engine'
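The mechanism behind this error is plain Python keyword handling and does not depend on pandas. A minimal sketch, in which `fake_query` is a hypothetical stand-in for `DataFrame.query()`: passing `engine` explicitly while also unpacking options that contain `engine` raises at the call site.

```python
def fake_query(expr, **kwargs):
    """Hypothetical stand-in for DataFrame.query(); returns the kwargs it received."""
    return kwargs


# Options as a caller might legitimately supply them.
options = {"engine": "python"}

try:
    # The hard-coded engine collides with the one unpacked from options.
    fake_query("country == 'Mali'", engine="python", **options)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The collision happens even though both values are identical, so any caller that sets `engine` explicitly would hit it.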

Proposed fix direction

  1. Pass sample_query_options through the cohort-size validation calls inside _setup_cohort_queries().
  2. Normalise query options in one shared place so engine="python" remains the default, but an explicitly supplied engine does not become a duplicate keyword.
  3. Add regression tests at both the metadata layer and at least one multi-cohort public API surface (e.g. H12/Fst) using local_dict.
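Step 2 could look roughly like the sketch below. The helper name `normalise_query_options` is an assumption made for illustration and does not exist in the codebase. The idea is to copy the caller's options and fill in `engine` only when it is absent, so the result can be unpacked into `DataFrame.query()` without a duplicate keyword.

```python
from typing import Any, Dict, Mapping, Optional


def normalise_query_options(
    sample_query_options: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the options with a default engine."""
    # Copy so the caller's mapping is never mutated.
    options = dict(sample_query_options or {})
    # Only fill in the engine when the caller did not choose one.
    options.setdefault("engine", "python")
    return options


# No options supplied: the default engine is applied.
default_options = normalise_query_options()

# Context such as local_dict is preserved alongside the default engine.
with_context = normalise_query_options(
    {"local_dict": {"countries_list": ["Angola", "Mali"]}}
)

# An explicitly supplied engine is kept as-is.
explicit_engine = normalise_query_options({"engine": "numexpr"})
```

A call site would then use something like `df.query(sample_query, **normalise_query_options(sample_query_options))`, with no separate `engine=` argument.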

Impacted files

  • malariagen_data/anoph/sample_metadata.py
  • malariagen_data/anoph/base.py

This looks like a good small-but-deep correctness fix because one shared abstraction affects multiple analysis entry points.
