Unsanitized User Input Passed to `DataFrame.eval()` Enables Arbitrary Code Execution

## Description
 
Multiple methods pass user-supplied `sample_query` strings directly to `pandas.DataFrame.eval()` with `engine="python"`, which can lead to arbitrary Python code execution. In `base.py:984`, `hap_data.py:423`, and `genome_features.py:121`, user-controlled strings reach `.query()` / `.eval()` without any sanitization or allowlisting.
 
---
 
## How to Reproduce
 
```python
import malariagen_data
ag3 = malariagen_data.Ag3()
 
# Benign query — works as expected:
ag3.sample_metadata(sample_query="country == 'Ghana'")
 
# Malicious query — executes arbitrary Python via pandas eval(engine="python"):
ag3.sample_metadata(sample_query="__import__('os').system('echo PWNED') or country == 'Ghana'")
```
 
The vulnerability exists because `base.py:984` calls:
 
```python
loc_samples = df_samples.eval(sample_query, **sample_query_options, engine="python")
```
 
Because `engine="python"` is passed, the expression is evaluated with ordinary Python semantics rather than by the more restricted `numexpr` engine.
 
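Similarly, `genome_features.py:121` interpolates the `contig` parameter directly into the query string, so a value containing a single quote can terminate the string literal and append arbitrary expression text: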
 
```python
df = df.query(f"contig == '{contig}'")
```
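
To illustrate why the f-string pattern above is fragile, here is a minimal stdlib-only sketch. `build_contig_query`, `is_single_comparison`, `validate_contig`, and `KNOWN_CONTIGS` are hypothetical names introduced for this example and are not part of `malariagen_data`; the contig set is illustrative only.

```python
import ast

# Illustrative set of contig names; a real check would use the
# contigs the library already knows about (assumption).
KNOWN_CONTIGS = {"2L", "2R", "3L", "3R", "X"}


def build_contig_query(contig):
    # Mirrors the f-string pattern reported at genome_features.py:121.
    return f"contig == '{contig}'"


def is_single_comparison(query):
    """Return True if the query parses as one comparison against a literal."""
    body = ast.parse(query, mode="eval").body
    return (
        isinstance(body, ast.Compare)
        and len(body.comparators) == 1
        and isinstance(body.comparators[0], ast.Constant)
    )


def validate_contig(contig, known_contigs=KNOWN_CONTIGS):
    """Reject any contig value that is not an exact known name."""
    if contig not in known_contigs:
        raise ValueError(f"unknown contig: {contig!r}")
    return contig


# A normal value produces the intended single comparison.
print(is_single_comparison(build_contig_query("2L")))

# A value containing a quote changes the structure of the expression.
print(is_single_comparison(build_contig_query("2L' or 'a' == 'a")))
```

Checking the value against a fixed set before it reaches the query string is one option. Boolean indexing such as `df[df["contig"] == contig]` avoids building an expression string at all.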
 
---
 
## Why It Is Important
 
This is a **code injection vulnerability** — any function accepting a `sample_query` parameter is an attack surface.
 
- The library is used in **shared Jupyter notebook environments** (Google Colab, institutional JupyterHubs) where queries may come from URL parameters, config files, or shared notebooks.
- While pandas documents this risk, a scientific library should not expose it to researchers who may not be security-aware.
- [OWASP](https://owasp.org/www-project-top-ten/) classifies injection as a **top-10 vulnerability** category.
 
---
 
## Affected Locations
 
| File | Line | Sink |
|------|------|------|
| `base.py` | 984 | `df_samples.eval(sample_query, engine="python")` |
| `hap_data.py` | 423 | `DataFrame.eval()` with user input |
| `genome_features.py` | 121 | `df.query(f"contig == '{contig}'")` |
 
---
 
## Expected Impact After Resolution
 
- User inputs are validated against an **allowlist of safe column names and operators** before reaching `eval()`, or the `numexpr` engine is used where possible.
- Researchers using the library in shared environments are protected from injection via query strings.
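
One possible shape for the allowlist check described above is sketched below using Python's `ast` module. This is an assumption about how such validation could look, not the project's implementation: `validate_sample_query`, `ALLOWED_COLUMNS`, and the chosen set of permitted node types are all hypothetical.

```python
import ast

# Hypothetical allowlist; a real one would be derived from the
# columns of the sample metadata DataFrame.
ALLOWED_COLUMNS = {"country", "year", "taxon"}

# Only comparisons, boolean logic, names, and literals are permitted.
# Calls, attribute access, subscripts, lambdas, etc. are rejected.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or,
    ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.In, ast.NotIn,
    ast.Name, ast.Load, ast.Constant, ast.List, ast.Tuple,
)


def validate_sample_query(query, allowed_columns=ALLOWED_COLUMNS):
    """Return the query unchanged if it only uses allowlisted constructs."""
    try:
        tree = ast.parse(query, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid query syntax: {query!r}") from exc
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(
                f"disallowed expression element: {type(node).__name__}"
            )
        if isinstance(node, ast.Name) and node.id not in allowed_columns:
            raise ValueError(f"unknown column: {node.id!r}")
    return query


# The benign query from the reproduction passes validation.
validate_sample_query("country == 'Ghana'")

# The malicious query is rejected before it could reach eval().
try:
    validate_sample_query(
        "__import__('os').system('echo PWNED') or country == 'Ghana'"
    )
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because this sketch parses the string as plain Python, it would also reject pandas-specific query syntax such as backtick-quoted column names and `@variable` references; a real implementation would need to decide whether to support those.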
 
