snp_calls_to_vcf() — Python-level per-sample loop creates severe performance bottleneck for large cohorts #1280

@khushthecoder

Description


The snp_calls_to_vcf() method in malariagen_data/anoph/to_vcf.py constructs every VCF record line using a pure-Python nested loop (lines 185–244). For each variant row, it iterates over every sample with a for k in range(n_samples) loop to format genotype strings one call at a time.

For a typical Ag3 export with ~3,000 samples and ~10 million variants, this results in 30+ billion Python-level string operations — making what could be a minutes-long export take hours.

# Current code (lines 203–238): O(variants × samples) Python loop
sample_fields = np.empty(n_samples, dtype=object)
for k in range(n_samples):
    parts = []
    a0 = gt_row[k, 0]
    a1 = gt_row[k, 1]
    if a0 < 0 or a1 < 0:
        parts.append("./.")
    else:
        parts.append(f"{a0}/{a1}")
    # ... GQ, AD, MQ formatting ...
    sample_fields[k] = ":".join(parts)
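For reference, the pattern above can be reproduced as a self-contained sketch on synthetic data (simplified to the GT field only; the array shape and names are illustrative, not taken from the library):

```python
import numpy as np

# Synthetic genotype chunk: shape (n_variants, n_samples, ploidy=2), -1 = missing call.
gt_chunk = np.array(
    [[[0, 0], [0, 1], [-1, -1]],
     [[1, 1], [-1, 0], [2, 2]]],
    dtype=np.int8,
)

def format_row_loop(gt_row):
    """Format one variant row one sample at a time (the O(n_samples) inner loop)."""
    n_samples = gt_row.shape[0]
    sample_fields = np.empty(n_samples, dtype=object)
    for k in range(n_samples):
        a0, a1 = gt_row[k, 0], gt_row[k, 1]
        sample_fields[k] = "./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}"
    return sample_fields

# One joined record line per variant; every genotype call passes through Python code.
lines = ["\t".join(format_row_loop(gt_chunk[j])) for j in range(gt_chunk.shape[0])]
```

Because the inner loop runs once per (variant, sample) pair, total work scales as the product of the two dimensions, which is what makes large cohorts slow.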

Impact on the Project

  • User-facing performance: VCF export is one of the primary data export functions. Researchers exporting whole-genome data for downstream tools (bcftools, GATK, etc.) face unreasonably long wait times.
  • Scalability: The problem worsens linearly with both sample count and variant count — exactly the dimensions that grow as new releases add data.
  • CPU-bound bottleneck: memory is not the problem (the chunked .compute() approach is fine); the cost is pure-Python string formatting on the CPU.

Steps to Reproduce / Identify

  1. Open to_vcf.py, lines 185–244.
  2. Observe the nested for j (variants) → for k (samples) loop.
  3. Run any VCF export with >500 samples and profile with cProfile — the inner loop dominates runtime.
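Step 3 can be sketched with a minimal cProfile harness on synthetic data (the function names here are illustrative stand-ins for the real export code, not the library's API):

```python
import cProfile
import io
import pstats

import numpy as np

def format_sample(a0, a1):
    # One genotype call formatted at a time, mirroring the current inner loop.
    return "./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}"

def export_chunk(gt_chunk):
    lines = []
    for j in range(gt_chunk.shape[0]):  # variants
        row = [format_sample(*gt_chunk[j, k]) for k in range(gt_chunk.shape[1])]  # samples
        lines.append("\t".join(row))
    return lines

# 200 variants x 500 samples -> format_sample is called 100,000 times.
gt_chunk = np.random.default_rng(0).integers(-1, 3, size=(200, 500, 2), dtype=np.int8)

profiler = cProfile.Profile()
profiler.enable()
export_chunk(gt_chunk)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In the printed stats, the per-call formatting function dominates the call count, which is the signature of the O(variants × samples) bottleneck described above.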

Proposed Solution

Replace the per-sample Python loop with vectorized NumPy string operations that operate on entire rows at once:

# Vectorized GT formatting for an entire chunk
gt_chunk = gt_chunk.reshape(gt_chunk.shape[0], gt_chunk.shape[1], 2)  # (n_variants, n_samples, ploidy)
a0 = gt_chunk[:, :, 0]  # shape: (n_variants, n_samples)
a1 = gt_chunk[:, :, 1]
missing = (a0 < 0) | (a1 < 0)
gt_strings = np.where(
    missing,
    "./.",
    np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str))
)
# Join fields across samples: one join per variant, no per-sample formatting loop
lines = np.array(["\t".join(row) for row in gt_strings])

This approach:

  • Eliminates O(n_samples) Python iterations per variant
  • Uses NumPy's np.char module for bulk string operations
  • Can yield 10–50× speedup for typical dataset sizes
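As a sanity check, the vectorized formatting can be verified against the per-sample loop on a small synthetic chunk (a sketch under the same shape assumptions as above; GT field only):

```python
import numpy as np

gt_chunk = np.array(
    [[[0, 0], [0, 1], [-1, -1]],
     [[1, 1], [-1, 0], [2, 2]]],
    dtype=np.int8,
)  # shape: (n_variants=2, n_samples=3, ploidy=2)

def lines_loop(gt):
    """Baseline: per-sample Python loop."""
    out = []
    for row in gt:
        fields = ["./." if a0 < 0 or a1 < 0 else f"{a0}/{a1}" for a0, a1 in row]
        out.append("\t".join(fields))
    return out

def lines_vectorized(gt):
    """Vectorized: format all (variant, sample) pairs in bulk with np.char."""
    a0, a1 = gt[:, :, 0], gt[:, :, 1]
    missing = (a0 < 0) | (a1 < 0)
    gt_strings = np.where(
        missing,
        "./.",
        np.char.add(np.char.add(a0.astype(str), "/"), a1.astype(str)),
    )
    return ["\t".join(row) for row in gt_strings]

# Both implementations must produce identical record lines.
assert lines_loop(gt_chunk) == lines_vectorized(gt_chunk)
```

Extending this to the remaining FORMAT subfields (GQ, AD, MQ) would follow the same pattern: build one string array per subfield, then combine them with np.char.add using ":" separators.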

Impact After Resolution

  • VCF exports that currently take hours will complete in minutes.
  • The function becomes practical for full-genome exports with thousands of samples.
  • Aligns with the project's existing pattern of using vectorized NumPy/Dask operations for data processing.
