Fix: Sanitize user input passed to DataFrame.eval()/query() to prevent code injection#1293
Conversation
…t code injection (malariagenGH-1292) - Add safe_query.py module with AST-based validation that restricts query expressions to comparisons, boolean logic, constants, and column names - Validate all sample_query, variant_query, and snp_query strings before passing them to pandas eval()/query() - Remove engine="python" from all DataFrame.eval()/query() calls to prevent arbitrary Python execution - Replace f-string interpolation in DataFrame.query() calls with safe boolean indexing (genome_features.py, util.py, frq_base.py, karyotype.py) - Add comprehensive test suite (43 tests) covering safe expressions, malicious payloads, and edge cases Closes malariagen#1292
There was a problem hiding this comment.
Pull request overview
This PR addresses GH-1292 by preventing code injection via user-supplied expressions passed into pandas.DataFrame.eval() / DataFrame.query() across the Anopheles API, primarily by introducing an AST-based query validator and replacing f-string query construction with safe boolean indexing.
Changes:
- Add
validate_query()/UnsafeQueryErrorin a newsafe_query.pymodule and validate user-facing query strings beforeeval()/query()sinks. - Remove explicit
engine="python"usage and replace several interpolated.query(f"...")filters with boolean indexing. - Add a dedicated security-focused test suite for the query validator.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
malariagen_data/anoph/safe_query.py |
New AST allowlist validator for pandas query/eval expressions. |
tests/anoph/test_safe_query.py |
New unit tests covering safe vs. malicious query validation behavior. |
malariagen_data/anoph/base.py |
Validate sample_query before df_samples.eval() in sample filtering. |
malariagen_data/anoph/sample_metadata.py |
Validate query strings before .query()/.eval() in multiple paths. |
malariagen_data/anoph/hap_data.py |
Validate sample_query_prepped before .eval() when filtering phased samples. |
malariagen_data/anoph/hapclust.py |
Validate snp_query before applying it to SNP effects dataframe. |
malariagen_data/anoph/snp_frq.py |
Validate variant_query / snp_query before .eval()/.query(). |
malariagen_data/anoph/cnv_frq.py |
Validate variant_query before .eval() in CNV frequency calculations. |
malariagen_data/anopheles.py |
Validate per-partition eval expressions used in haplotype network plotting. |
malariagen_data/anoph/genome_features.py |
Replace interpolated .query() filters with boolean indexing. |
malariagen_data/anoph/frq_base.py |
Replace interpolated .query() filter with boolean indexing for cohort size. |
malariagen_data/anoph/karyotype.py |
Replace interpolated .query() filter with boolean indexing for inversion tags. |
malariagen_data/util.py |
Replace interpolated .query() on gene annotations with safe boolean indexing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| validate_query(query) | ||
| loc_coh = data.eval(query).values | ||
| except (KeyError, NameError, SyntaxError, TypeError, AttributeError) as e: |
There was a problem hiding this comment.
validate_query(query) can raise UnsafeQueryError (a ValueError), but the surrounding try/except only catches KeyError, NameError, SyntaxError, TypeError, AttributeError. As a result, unsafe queries will bypass this wrapper and won’t be re-raised with the cohort context (Invalid query for cohort ...). Consider catching UnsafeQueryError here (or broadening to include ValueError) so users still get the cohort-specific error message.
| except (KeyError, NameError, SyntaxError, TypeError, AttributeError) as e: | |
| except ( | |
| KeyError, | |
| NameError, | |
| SyntaxError, | |
| TypeError, | |
| AttributeError, | |
| ValueError, | |
| ) as e: |
| f"in query strings. Only comparisons, boolean logic, and constants " | ||
| f"are permitted." |
There was a problem hiding this comment.
The generic error message for disallowed AST nodes says "Only comparisons, boolean logic, and constants are permitted." but this validator also explicitly allows arithmetic (ast.BinOp with +, -, *, etc.) and unary operators. Updating the message to reflect the actual allowlist would avoid confusing users when they hit a validation error.
| f"in query strings. Only comparisons, boolean logic, and constants " | |
| f"are permitted." | |
| f"in query strings. Only comparisons, boolean logic, arithmetic, " | |
| f"unary operations, names, and constants are permitted." |
…ilures - Preprocess @variable references (e.g., `taxon in @taxon_list`) by replacing them with safe placeholder identifiers before AST parsing, since @ is pandas-specific syntax not valid in Python AST - Fix test_non_string_input to accept both UnsafeQueryError and TypeError (typeguard may intercept the type check first) - Add tests for @variable reference patterns Fixes 22 CI test failures caused by sample_query_options using @var syntax
2b58edc to
cbb0dcc
Compare
The CI uses typeguard which enforces type annotations at runtime, raising TypeCheckError instead of TypeError when a non-str argument is passed to validate_query(). Add TypeCheckError to the expected exceptions tuple.
cbb0dcc to
90c58aa
Compare
|
Hi @jonbrenas , after your approval all test cases have passed. The code is now ready to merge—please review it once. |
|
Hi @jonbrenas, just a quick follow-up—since everything is approved and checks are passing, would you mind merging when you get a moment? Thanks! |
Summary
Fix #1292
Multiple methods across the codebase pass user-supplied query strings (e.g.
sample_query,variant_query,snp_query) directly topandas.DataFrame.eval()andDataFrame.query()without any sanitization. In several locationsengine="python"was explicitly set, giving the expression evaluator access to the full Python runtime — enabling arbitrary code execution via crafted inputs like__import__('os').system('echo PWNED').Additionally, several locations used f-string interpolation to build query strings (e.g.
df.query(f"contig == '{contig}'")), which is vulnerable to expression injection through quote-breaking.This PR eliminates these vulnerabilities through two complementary strategies:
1. AST-based query validation (
safe_query.py)A new
validate_query()function parses query strings into a Python AST and validates every node against a strict allowlist:and/or/not), comparisons (==,!=,<,<=,>,>=,in,not in,is), arithmetic, constants (strings, numbers, booleans,None), column names, parenthesized expressions, tuples, and lists.__import__(),exec(),eval(),len(), etc.), attribute access (os.system), subscripting (x[0]), lambdas, comprehensions, f-strings, walrus operators, dict literals, generator expressions, starred expressions, and any identifier containing__(dunder names).All user-facing query parameters are validated through
validate_query()before reaching any eval/query sink.2. Safe boolean indexing for internal filters
All internal f-string query interpolations have been replaced with direct boolean indexing:
df.query(f"contig == '{contig}'")df[df["contig"] == contig]df.query(f"ID == '{region}'")df.loc[df["ID"] == region]df.query(f"inversion == '{inversion}'")df.loc[df["inversion"] == inversion]df.query(f"type == '{gene_type}'")df.loc[df["type"] == gene_type]df.query(f"size >= {min_cohort_size}")df.loc[df["size"] >= min_cohort_size]3. Removal of
engine="python"All
engine="python"arguments have been removed fromDataFrame.eval()andDataFrame.query()calls. The defaultnumexprengine is significantly more restrictive and does not allow arbitrary Python code execution.Files Changed
malariagen_data/anoph/safe_query.pymalariagen_data/anoph/base.pyvalidate_query()beforeeval(), removeengine="python"malariagen_data/anoph/sample_metadata.pyvalidate_query()beforequery()/eval(), removeengine="python"malariagen_data/anoph/hap_data.pyvalidate_query()beforeeval()malariagen_data/anoph/hapclust.pyvalidate_query()beforequery()malariagen_data/anoph/snp_frq.pyvalidate_query()beforeeval()/query()malariagen_data/anoph/cnv_frq.pyvalidate_query()beforeeval()malariagen_data/anopheles.pyvalidate_query()beforeeval()malariagen_data/anoph/genome_features.pymalariagen_data/util.pymalariagen_data/anoph/frq_base.pymalariagen_data/anoph/karyotype.pytests/anoph/test_safe_query.pyTest Plan
test_safe_query.pycovering:in/not in, parenthesized expressions,is/is not, boolean literals, arithmetic, allowlisted column names__import__(), function calls, attribute access, subscripting, lambdas, comprehensions, dunder identifiers,exec()/eval(), dict literals, generator expressions, f-strings, walrus operatorspytest tests/anoph/test_safe_query.py— 43 passed)ruff checkandruff formatpass with no issuesBackwards Compatibility
All legitimate query expressions (comparisons, boolean logic,
inmembership, arithmetic) continue to work exactly as before. Only expressions containing unsafe constructs (function calls, attribute access, dunder names, etc.) are now rejected with a descriptiveUnsafeQueryError.