Description
When a user requests optional FORMAT fields like GQ, AD, or MQ via the fields parameter of snp_calls_to_vcf(), the method silently falls back to writing "." (missing) for every sample if the corresponding data array doesn't exist in the dataset. There is no warning, log message, or exception to inform the user that their requested field is absent from the underlying data.
Impact
- Data integrity: Users may unknowingly base analyses on "empty" FORMAT fields, leading to incorrect quality filtering or allele depth-based conclusions.
- Misleading VCF headers: The output file's header declares fields that contain no real data. All-missing values are still syntactically valid VCF, but readers and downstream tools reasonably take a declared FORMAT field as a sign that the data is present.
- Debugging difficulty: When a downstream tool (e.g., bcftools filter -e 'GQ<20') behaves unexpectedly because all GQ values are missing, the user has no indication that the field was silently absent from the source data.
Steps to Reproduce
- Open malariagen_data/anoph/to_vcf.py, lines 136–150 and 169–183.
- Note the except KeyError: pass pattern, which emits no warning.
- Attempt to export VCF with fields=("GT", "GQ", "AD") for a sample set that doesn't contain call_GQ or call_AD arrays.
- Inspect the output VCF: header declares GQ and AD, but all values are ".".
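
The behaviour described in these steps can be illustrated without the library. The sketch below is not the code from to_vcf.py. It is a minimal stand-in, assuming a plain dict in place of the dataset and a hypothetical helper named format_values_silent, that shows how an except KeyError: pass block produces "." for every sample while emitting no warning.

```python
import warnings


def format_values_silent(dataset, field, n_samples):
    """Hypothetical stand-in for the silent-fallback pattern (not the real code)."""
    try:
        data = dataset[f"call_{field}"]
    except KeyError:
        pass  # the missing array is swallowed: no warning, no log, no exception
    else:
        return [str(value) for value in data]
    # Fallback: every sample gets the VCF missing-value marker.
    return ["."] * n_samples


# Assumed toy dataset: genotypes are present, call_GQ is not.
dataset = {"call_GT": ["0/1", "1/1", "0/0"]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    gq_values = format_values_silent(dataset, "GQ", n_samples=3)

print(gq_values)    # ['.', '.', '.']
print(len(caught))  # 0, i.e. nothing told the caller that GQ was absent
```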
Proposed Solution
Add warnings.warn() calls when a requested field's data array is not found.
Additionally, ensure the VCF header only declares fields that actually have data available in the dataset.
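
One possible shape for this change is sketched below. It is an illustration under assumptions, not a patch against to_vcf.py: the helper names resolve_format_fields and format_header_lines are hypothetical, a plain dict stands in for the dataset, and the header descriptions are placeholders rather than the project's actual header text. The idea is to check availability once, warn for each missing field, and build the header only from the fields that remain.

```python
import warnings


def resolve_format_fields(dataset, fields):
    """Return the requested fields that have a call_<FIELD> array, warning about the rest.

    Hypothetical helper: `dataset` only needs to support the `in` operator.
    """
    available = []
    for field in fields:
        if f"call_{field}" in dataset:
            available.append(field)
        else:
            warnings.warn(
                f"Requested FORMAT field {field!r} not found in dataset "
                f"(no 'call_{field}' array); omitting it from the VCF output.",
                UserWarning,
                stacklevel=2,
            )
    return available


def format_header_lines(fields):
    """Build ##FORMAT header lines for the given fields only (placeholder metadata)."""
    meta = {
        "GT": ("1", "String", "Genotype"),
        "GQ": ("1", "Integer", "Genotype quality"),
        "AD": ("R", "Integer", "Allele depths"),
        "MQ": ("1", "Integer", "Mapping quality"),
    }
    lines = []
    for field in fields:
        number, vcf_type, description = meta[field]
        lines.append(
            f'##FORMAT=<ID={field},Number={number},Type={vcf_type},'
            f'Description="{description}">'
        )
    return lines


# Assumed toy dataset: call_GT and call_AD exist, call_GQ does not.
dataset = {"call_GT": [], "call_AD": []}
fields = resolve_format_fields(dataset, ("GT", "GQ", "AD"))  # warns about GQ
header = format_header_lines(fields)
```

Dropping the missing field entirely keeps the header and the per-sample columns consistent. An alternative would be to keep writing "." and only add the warning, which preserves the requested column layout at the cost of a header that still declares an empty field.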
Impact After Resolution
- Users receive clear feedback when requested FORMAT fields are absent, enabling informed decision-making.
- VCF output integrity improves — headers accurately reflect available data.
- Consistent with the project's existing warnings.warn() pattern for missing metadata files.
- Prevents silent downstream analysis errors.