Skip to content

Fix: Sanitize user input passed to DataFrame.eval()/query() to prevent code injection#1293

Merged
jonbrenas merged 6 commits intomalariagen:masterfrom
kunal-10-cloud:fix/issue-1292-sanitize-eval-query
Apr 17, 2026
Merged

Fix: Sanitize user input passed to DataFrame.eval()/query() to prevent code injection#1293
jonbrenas merged 6 commits intomalariagen:masterfrom
kunal-10-cloud:fix/issue-1292-sanitize-eval-query

Conversation

@kunal-10-cloud
Copy link
Copy Markdown
Contributor

Summary

Fix #1292

Multiple methods across the codebase pass user-supplied query strings (e.g. sample_query, variant_query, snp_query) directly to pandas.DataFrame.eval() and DataFrame.query() without any sanitization. In several locations engine="python" was explicitly set, giving the expression evaluator access to the full Python runtime — enabling arbitrary code execution via crafted inputs like __import__('os').system('echo PWNED').

Additionally, several locations used f-string interpolation to build query strings (e.g. df.query(f"contig == '{contig}'")), which is vulnerable to expression injection through quote-breaking.

This PR eliminates these vulnerabilities through two complementary strategies:

1. AST-based query validation (safe_query.py)

A new validate_query() function parses query strings into a Python AST and validates every node against a strict allowlist:

  • Allowed: boolean operators (and/or/not), comparisons (==, !=, <, <=, >, >=, in, not in, is), arithmetic, constants (strings, numbers, booleans, None), column names, parenthesized expressions, tuples, and lists.
  • Rejected: function calls (__import__(), exec(), eval(), len(), etc.), attribute access (os.system), subscripting (x[0]), lambdas, comprehensions, f-strings, walrus operators, dict literals, generator expressions, starred expressions, and any identifier containing __ (dunder names).

All user-facing query parameters are validated through validate_query() before reaching any eval/query sink.

2. Safe boolean indexing for internal filters

All internal f-string query interpolations have been replaced with direct boolean indexing:

Before (vulnerable) After (safe)
df.query(f"contig == '{contig}'") df[df["contig"] == contig]
df.query(f"ID == '{region}'") df.loc[df["ID"] == region]
df.query(f"inversion == '{inversion}'") df.loc[df["inversion"] == inversion]
df.query(f"type == '{gene_type}'") df.loc[df["type"] == gene_type]
df.query(f"size >= {min_cohort_size}") df.loc[df["size"] >= min_cohort_size]

3. Removal of engine="python"

All engine="python" arguments have been removed from DataFrame.eval() and DataFrame.query() calls. The default numexpr engine is significantly more restrictive and does not allow arbitrary Python code execution.

Files Changed

File Change
malariagen_data/anoph/safe_query.py New — AST-based query validator module
malariagen_data/anoph/base.py Add validate_query() before eval(), remove engine="python"
malariagen_data/anoph/sample_metadata.py Add validate_query() before query()/eval(), remove engine="python"
malariagen_data/anoph/hap_data.py Add validate_query() before eval()
malariagen_data/anoph/hapclust.py Add validate_query() before query()
malariagen_data/anoph/snp_frq.py Add validate_query() before eval()/query()
malariagen_data/anoph/cnv_frq.py Add validate_query() before eval()
malariagen_data/anopheles.py Add validate_query() before eval()
malariagen_data/anoph/genome_features.py Replace f-string queries with boolean indexing
malariagen_data/util.py Replace f-string query with boolean indexing
malariagen_data/anoph/frq_base.py Replace f-string query with boolean indexing
malariagen_data/anoph/karyotype.py Replace f-string query with boolean indexing
tests/anoph/test_safe_query.py New — 43 security tests

Test Plan

  • 43 new unit tests in test_safe_query.py covering:
    • Safe expressions accepted: simple equality, numeric comparisons, boolean chains, in/not in, parenthesized expressions, is/is not, boolean literals, arithmetic, allowlisted column names
    • Malicious payloads rejected: __import__(), function calls, attribute access, subscripting, lambdas, comprehensions, dunder identifiers, exec()/eval(), dict literals, generator expressions, f-strings, walrus operators
    • Edge cases: empty strings, non-string input, syntax errors, multiple statements, quote-breaking attacks
  • All tests pass (pytest tests/anoph/test_safe_query.py — 43 passed)
  • ruff check and ruff format pass with no issues

Backwards Compatibility

All legitimate query expressions (comparisons, boolean logic, in membership, arithmetic) continue to work exactly as before. Only expressions containing unsafe constructs (function calls, attribute access, dunder names, etc.) are now rejected with a descriptive UnsafeQueryError.

…t code injection (malariagenGH-1292)

- Add safe_query.py module with AST-based validation that restricts query
  expressions to comparisons, boolean logic, constants, and column names
- Validate all sample_query, variant_query, and snp_query strings before
  passing them to pandas eval()/query()
- Remove engine="python" from all DataFrame.eval()/query() calls to
  prevent arbitrary Python execution
- Replace f-string interpolation in DataFrame.query() calls with safe
  boolean indexing (genome_features.py, util.py, frq_base.py, karyotype.py)
- Add comprehensive test suite (43 tests) covering safe expressions,
  malicious payloads, and edge cases

Closes malariagen#1292
Copilot AI review requested due to automatic review settings April 14, 2026 13:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses GH-1292 by preventing code injection via user-supplied expressions passed into pandas.DataFrame.eval() / DataFrame.query() across the Anopheles API, primarily by introducing an AST-based query validator and replacing f-string query construction with safe boolean indexing.

Changes:

  • Add validate_query() / UnsafeQueryError in a new safe_query.py module and validate user-facing query strings before eval()/query() sinks.
  • Remove explicit engine="python" usage and replace several interpolated .query(f"...") filters with boolean indexing.
  • Add a dedicated security-focused test suite for the query validator.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
malariagen_data/anoph/safe_query.py New AST allowlist validator for pandas query/eval expressions.
tests/anoph/test_safe_query.py New unit tests covering safe vs. malicious query validation behavior.
malariagen_data/anoph/base.py Validate sample_query before df_samples.eval() in sample filtering.
malariagen_data/anoph/sample_metadata.py Validate query strings before .query()/.eval() in multiple paths.
malariagen_data/anoph/hap_data.py Validate sample_query_prepped before .eval() when filtering phased samples.
malariagen_data/anoph/hapclust.py Validate snp_query before applying it to SNP effects dataframe.
malariagen_data/anoph/snp_frq.py Validate variant_query / snp_query before .eval()/.query().
malariagen_data/anoph/cnv_frq.py Validate variant_query before .eval() in CNV frequency calculations.
malariagen_data/anopheles.py Validate per-partition eval expressions used in haplotype network plotting.
malariagen_data/anoph/genome_features.py Replace interpolated .query() filters with boolean indexing.
malariagen_data/anoph/frq_base.py Replace interpolated .query() filter with boolean indexing for cohort size.
malariagen_data/anoph/karyotype.py Replace interpolated .query() filter with boolean indexing for inversion tags.
malariagen_data/util.py Replace interpolated .query() on gene annotations with safe boolean indexing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

try:
validate_query(query)
loc_coh = data.eval(query).values
except (KeyError, NameError, SyntaxError, TypeError, AttributeError) as e:
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_query(query) can raise UnsafeQueryError (a ValueError), but the surrounding try/except only catches KeyError, NameError, SyntaxError, TypeError, AttributeError. As a result, unsafe queries will bypass this wrapper and won’t be re-raised with the cohort context (Invalid query for cohort ...). Consider catching UnsafeQueryError here (or broadening to include ValueError) so users still get the cohort-specific error message.

Suggested change
except (KeyError, NameError, SyntaxError, TypeError, AttributeError) as e:
except (
KeyError,
NameError,
SyntaxError,
TypeError,
AttributeError,
ValueError,
) as e:

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +90
f"in query strings. Only comparisons, boolean logic, and constants "
f"are permitted."
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic error message for disallowed AST nodes says "Only comparisons, boolean logic, and constants are permitted." but this validator also explicitly allows arithmetic (ast.BinOp with +, -, *, etc.) and unary operators. Updating the message to reflect the actual allowlist would avoid confusing users when they hit a validation error.

Suggested change
f"in query strings. Only comparisons, boolean logic, and constants "
f"are permitted."
f"in query strings. Only comparisons, boolean logic, arithmetic, "
f"unary operations, names, and constants are permitted."

Copilot uses AI. Check for mistakes.
…ilures

- Preprocess @variable references (e.g., `taxon in @taxon_list`) by
  replacing them with safe placeholder identifiers before AST parsing,
  since @ is pandas-specific syntax not valid in Python AST
- Fix test_non_string_input to accept both UnsafeQueryError and
  TypeError (typeguard may intercept the type check first)
- Add tests for @variable reference patterns

Fixes 22 CI test failures caused by sample_query_options using @var syntax
@kunal-10-cloud kunal-10-cloud force-pushed the fix/issue-1292-sanitize-eval-query branch 2 times, most recently from 2b58edc to cbb0dcc Compare April 14, 2026 14:47
The CI uses typeguard which enforces type annotations at runtime,
raising TypeCheckError instead of TypeError when a non-str argument
is passed to validate_query(). Add TypeCheckError to the expected
exceptions tuple.
@kunal-10-cloud kunal-10-cloud force-pushed the fix/issue-1292-sanitize-eval-query branch from cbb0dcc to 90c58aa Compare April 14, 2026 14:48
@kunal-10-cloud
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas , after your approval all test cases have passed. The code is now ready to merge—please review it once.

@kunal-10-cloud
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas, just a quick follow-up—since everything is approved and checks are passing, would you mind merging when you get a moment? Thanks!

@jonbrenas jonbrenas merged commit 9c1bb56 into malariagen:master Apr 17, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unsanitized User Input Passed to DataFrame.eval() Enables Arbitrary Code Execution

3 participants