snp_calls_to_vcf() silently drops requested FORMAT fields when underlying data arrays don't exist #1283

@Yashsingh045

Description

When a user requests optional FORMAT fields like GQ, AD, or MQ via the fields parameter of snp_calls_to_vcf(), the method silently falls back to writing "." (missing) for every sample if the corresponding data array doesn't exist in the dataset. There is no warning, log message, or exception to inform the user that their requested field is absent from the underlying data.

Impact

  • Data integrity: Users may unknowingly base analyses on "empty" FORMAT fields, leading to incorrect quality filtering or allele depth-based conclusions.
  • Misleading VCF headers: The output file's header declares fields that contain no real data. The file is still syntactically valid VCF, since "." is a legal missing value, but the header implies that the fields were populated from the source data.
  • Debugging difficulty: When a downstream tool (e.g., bcftools filter -e 'GQ<20') behaves unexpectedly because all GQ values are missing, the user has no indication that the field was silently absent from the source data.

Steps to Reproduce

  1. Open malariagen_data/anoph/to_vcf.py, lines 136–150 and 169–183.
  2. Note the except KeyError: pass pattern, which swallows the missing-array error without emitting any warning.
  3. Attempt to export VCF with fields=("GT", "GQ", "AD") for a sample set that doesn't contain call_GQ or call_AD arrays.
  4. Inspect the output VCF: header declares GQ and AD, but all values are ".".

Proposed Solution

Add warnings.warn() calls when a requested field's data array is not found.
Additionally, ensure the VCF header only declares fields that actually have data available in the dataset.
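
A minimal sketch of what this could look like, assuming the dataset supports the "in" operator for array names (as a dict or an xarray Dataset does). The helper name split_available_fields and the OPTIONAL_FIELD_ARRAYS mapping are hypothetical and are not the existing code in to_vcf.py. The array names call_GQ and call_AD come from the reproduction steps above; MQ is left out because its array name is not stated here.

```python
import warnings

# Illustrative mapping from optional FORMAT field to its backing array name.
OPTIONAL_FIELD_ARRAYS = {"GQ": "call_GQ", "AD": "call_AD"}

def split_available_fields(dataset, fields):
    """Return the requested fields that have data, warning about the rest."""
    available = []
    for field in fields:
        array_name = OPTIONAL_FIELD_ARRAYS.get(field)
        if array_name is None or array_name in dataset:
            # Either not an optional field tracked here, or its array exists.
            available.append(field)
        else:
            warnings.warn(
                f"Requested FORMAT field {field!r} is not available: "
                f"array {array_name!r} was not found in the dataset; "
                "omitting it from the VCF header and records.",
                UserWarning,
                stacklevel=2,
            )
    return tuple(available)

# Stand-in dataset: a plain dict that has call_GQ but no call_AD.
example_dataset = {"call_genotype": object(), "call_GQ": object()}
fields_to_write = split_available_fields(example_dataset, ("GT", "GQ", "AD"))
print(fields_to_write)
```

Building both the header and the per-record FORMAT column from the returned tuple, rather than from the originally requested fields, would keep the header consistent with the data actually written.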

Impact After Resolution

  • Users receive clear feedback when requested FORMAT fields are absent, enabling informed decision-making.
  • VCF output integrity improves — headers accurately reflect available data.
  • Consistent with the project's existing warnings.warn() pattern for missing metadata files.
  • Prevents silent downstream analysis errors.
