Description
The snp_calls_to_vcf() method in malariagen_data/anoph/to_vcf.py constructs every VCF record line using a pure-Python nested loop (lines 185–244). For each variant row, it iterates over every sample with a for k in range(n_samples) loop to format genotype strings one call at a time.
For a typical Ag3 export with ~3,000 samples and ~10 million variants, this results in 30+ billion Python-level string operations — making what could be a minutes-long export take hours.
```python
# Current code (lines 203–238): O(variants × samples) Python loop
sample_fields = np.empty(n_samples, dtype=object)
for k in range(n_samples):
    parts = []
    a0 = gt_row[k, 0]
    a1 = gt_row[k, 1]
    if a0 < 0 or a1 < 0:
        parts.append("./.")
    else:
        parts.append(f"{a0}/{a1}")
    # ... GQ, AD, MQ formatting ...
    sample_fields[k] = ":".join(parts)
```
Impact on the Project
- User-facing performance: VCF export is one of the primary data export functions. Researchers exporting whole-genome data for downstream tools (`bcftools`, GATK, etc.) face unreasonably long wait times.
- Scalability: The problem worsens linearly with both sample count and variant count — exactly the dimensions that grow as new releases add data.
- CPU-bound bottleneck: Memory is not the issue; the chunked `.compute()` approach is fine. The bottleneck is purely CPU-bound Python string formatting.
Steps to Reproduce / Identify
- Open `to_vcf.py`, lines 185–244.
- Observe the nested `for j` (variants) → `for k` (samples) loop.
- Run any VCF export with >500 samples and profile with `cProfile` — the inner loop dominates runtime.
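The profiling step can be sketched with a minimal, self-contained reproduction of the per-sample loop pattern. The function name `format_calls_python_loop` is hypothetical and only mimics the structure described above, not the actual code in `to_vcf.py`:

```python
import cProfile
import io
import pstats

import numpy as np

def format_calls_python_loop(gt, n_samples):
    # Hypothetical stand-in for the per-sample inner loop described above.
    fields = np.empty(n_samples, dtype=object)
    for k in range(n_samples):
        a0, a1 = gt[k, 0], gt[k, 1]
        fields[k] = "./." if (a0 < 0 or a1 < 0) else f"{a0}/{a1}"
    return "\t".join(fields)

rng = np.random.default_rng(0)
gt = rng.integers(-1, 2, size=(2000, 2))  # 2000 "samples", one variant row

profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):  # stands in for the per-variant outer loop
    format_calls_python_loop(gt, gt.shape[0])
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Even at this toy scale, the string-formatting function dominates the cumulative-time listing, which is the signature to look for in a real export run.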
Proposed Solution
Replace the per-sample Python loop with vectorized NumPy string operations that operate on entire rows at once:
```python
# Vectorized GT formatting for an entire chunk
gt_chunk_2d = gt_chunk.reshape(gt_chunk.shape[0], gt_chunk.shape[1], 2)
a0 = gt_chunk_2d[:, :, 0]  # shape: (n_variants, n_samples)
a1 = gt_chunk_2d[:, :, 1]
missing = (a0 < 0) | (a1 < 0)
gt_strings = np.where(
    missing,
    "./.",
    np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str)),
)

# Join across samples per variant — only O(n_variants) Python iterations remain
lines = np.array(["\t".join(row) for row in gt_strings])
```
This approach:
- Eliminates O(n_samples) Python iterations per variant
- Uses NumPy's `np.char` module for bulk string operations
- Can yield a 10–50× speedup for typical dataset sizes
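The approach above can be verified end to end on a small array. This is an illustrative sketch, not the actual patch; the function name `format_gt_vectorized` is hypothetical, and the input is assumed to already have shape `(n_variants, n_samples, 2)`:

```python
import numpy as np

def format_gt_vectorized(gt_chunk):
    # gt_chunk: int array of shape (n_variants, n_samples, 2);
    # negative alleles encode missing calls.
    a0 = gt_chunk[:, :, 0]
    a1 = gt_chunk[:, :, 1]
    missing = (a0 < 0) | (a1 < 0)
    gt_strings = np.where(
        missing,
        "./.",
        np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str)),
    )
    # One tab-joined GT string per variant row.
    return ["\t".join(row) for row in gt_strings]

gt = np.array([
    [[0, 0], [0, 1], [-1, -1]],
    [[1, 1], [-1, 0], [0, 0]],
])
lines = format_gt_vectorized(gt)
print(lines)  # ['0/0\t0/1\t./.', '1/1\t./.\t0/0']
```

Note that `np.where` evaluates both branches, so missing calls are briefly formatted as `-1/-1` before being discarded; this is harmless but worth a comment in the real patch.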
Impact After Resolution
- VCF exports that currently take hours will complete in minutes.
- The function becomes practical for full-genome exports with thousands of samples.
- Aligns with the project's existing pattern of using vectorized NumPy/Dask operations for data processing.