snp_calls_to_vcf() — Python-level per-sample loop creates severe performance bottleneck for large cohorts #1280

@khushthecoder

Description


The snp_calls_to_vcf() method in malariagen_data/anoph/to_vcf.py constructs every VCF record line using a pure-Python nested loop (lines 185–244). For each variant row, it iterates over every sample with a for k in range(n_samples) loop to format genotype strings one call at a time.

For a typical Ag3 export with ~3,000 samples and ~10 million variants, this results in 30+ billion Python-level string operations — making what could be a minutes-long export take hours.

# Current code (lines 203–238): O(variants × samples) Python loop
sample_fields = np.empty(n_samples, dtype=object)
for k in range(n_samples):
    parts = []
    a0 = gt_row[k, 0]
    a1 = gt_row[k, 1]
    if a0 < 0 or a1 < 0:
        parts.append("./.")
    else:
        parts.append(f"{a0}/{a1}")
    # ... GQ, AD, MQ formatting ...
    sample_fields[k] = ":".join(parts)
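For reference, the pattern above can be reproduced as a self-contained sketch on synthetic data (simplified to the GT field only; the array shape and names are illustrative, not taken from the library):

```python
import numpy as np

# Synthetic genotype chunk: shape (n_variants, n_samples, ploidy=2), -1 = missing call.
gt_chunk = np.array(
    [[[0, 0], [0, 1], [-1, -1]],
     [[1, 1], [-1, 0], [2, 2]]],
    dtype=np.int8,
)

def format_row_loop(gt_row):
    """Format one variant row one sample at a time (the O(n_samples) inner loop)."""
    n_samples = gt_row.shape[0]
    sample_fields = np.empty(n_samples, dtype=object)
    for k in range(n_samples):
        a0, a1 = gt_row[k, 0], gt_row[k, 1]
        sample_fields[k] = "./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}"
    return sample_fields

# One joined record line per variant; every genotype call passes through Python code.
lines = ["\t".join(format_row_loop(gt_chunk[j])) for j in range(gt_chunk.shape[0])]
```

Because the inner loop runs once per (variant, sample) pair, total work scales as the product of the two dimensions, which is what makes large cohorts slow.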

Impact on the Project

  • User-facing performance: VCF export is one of the primary data export functions. Researchers exporting whole-genome data for downstream tools (bcftools, GATK, etc.) face unreasonably long wait times.
  • Scalability: The problem worsens linearly with both sample count and variant count — exactly the dimensions that grow as new releases add data.
  • CPU-bound bottleneck: memory is not the problem (the chunked .compute() approach is fine); the cost is pure-Python string formatting on the CPU.

Steps to Reproduce / Identify

  1. Open to_vcf.py, lines 185–244.
  2. Observe the nested for j (variants) → for k (samples) loop.
  3. Run any VCF export with >500 samples and profile with cProfile — the inner loop dominates runtime.
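Step 3 can be sketched with a minimal cProfile harness on synthetic data (the function names here are illustrative stand-ins for the real export code, not the library's API):

```python
import cProfile
import io
import pstats

import numpy as np

def format_sample(a0, a1):
    # One genotype call formatted at a time, mirroring the current inner loop.
    return "./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}"

def export_chunk(gt_chunk):
    lines = []
    for j in range(gt_chunk.shape[0]):  # variants
        row = [format_sample(*gt_chunk[j, k]) for k in range(gt_chunk.shape[1])]  # samples
        lines.append("\t".join(row))
    return lines

# 200 variants x 500 samples -> format_sample is called 100,000 times.
gt_chunk = np.random.default_rng(0).integers(-1, 3, size=(200, 500, 2), dtype=np.int8)

profiler = cProfile.Profile()
profiler.enable()
export_chunk(gt_chunk)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In the printed stats, the per-call formatting function dominates the call count, which is the signature of the O(variants × samples) bottleneck described above.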

Proposed Solution

Replace the per-sample Python loop with vectorized NumPy string operations that operate on entire rows at once:

# Vectorized GT formatting for an entire chunk
gt_chunk = gt_chunk.reshape(gt_chunk.shape[0], gt_chunk.shape[1], 2)  # (n_variants, n_samples, ploidy)
a0 = gt_chunk[:, :, 0]  # shape: (n_variants, n_samples)
a1 = gt_chunk[:, :, 1]
missing = (a0 < 0) | (a1 < 0)
gt_strings = np.where(
    missing,
    "./.",
    np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str))
)
# Join fields across samples: one join per variant, no per-sample formatting loop
lines = np.array(["\t".join(row) for row in gt_strings])

This approach:

  • Eliminates O(n_samples) Python iterations per variant
  • Uses NumPy's np.char module for bulk string operations
  • Can yield 10–50× speedup for typical dataset sizes
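As a sanity check, the vectorized formatting can be verified against the per-sample loop on a small synthetic chunk (a sketch under the same shape assumptions as above; GT field only):

```python
import numpy as np

gt_chunk = np.array(
    [[[0, 0], [0, 1], [-1, -1]],
     [[1, 1], [-1, 0], [2, 2]]],
    dtype=np.int8,
)  # shape: (n_variants=2, n_samples=3, ploidy=2)

def lines_loop(gt):
    """Baseline: per-sample Python loop."""
    out = []
    for row in gt:
        fields = ["./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}" for a0, a1 in row]
        out.append("\t".join(fields))
    return out

def lines_vectorized(gt):
    """Vectorized: format all (variant, sample) pairs in bulk with np.char."""
    a0, a1 = gt[:, :, 0], gt[:, :, 1]
    missing = (a0 < 0) | (a1 < 0)
    gt_strings = np.where(
        missing,
        "./.",
        np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str)),
    )
    return ["\t".join(row) for row in gt_strings]

# Both implementations must produce identical record lines.
assert lines_loop(gt_chunk) == lines_vectorized(gt_chunk)
```

Extending this to the remaining FORMAT subfields (GQ, AD, MQ) would follow the same pattern: build one string array per subfield, then combine them with np.char.add using ":" separators.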

Impact After Resolution

  • VCF exports that currently take hours will complete in minutes.
  • The function becomes practical for full-genome exports with thousands of samples.
  • Aligns with the project's existing pattern of using vectorized NumPy/Dask operations for data processing.
