Summary
In malariagen_data/veff.py, the function _get_within_cds_effect() has an unfinished branch for in-frame complex variants: variants that are both a multi-nucleotide polymorphism (MNP) and an indel, where the net length change is a multiple of 3 (so no frameshift).
Current broken behaviour
Lines 447-449 currently contain:
    effect = base_effect._replace(
        effect="TODO in-frame complex variation (MNP + INDEL)",
        impact="UNKNOWN"
    )
This means a researcher who calls snp_effects() on such a variant receives a DataFrame where:
- The effect column contains the literal string "TODO in-frame complex variation (MNP + INDEL)"
- The impact column contains "UNKNOWN"
Why this is serious
- "UNKNOWN" is not a valid impact level anywhere else in this codebase. All other values are "HIGH", "MODERATE", "LOW", or "MODIFIER". This silently breaks any downstream filtering such as df[df["impact"] == "HIGH"].
- The TODO string is a developer note leaking directly into scientific output with no warning or error raised.
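A minimal sketch of the failure mode, using plain dictionaries instead of the real DataFrame returned by snp_effects(). The rows and the VALID_IMPACTS name are made up for illustration and are not taken from the codebase:

```python
# Impact levels this issue lists as the only ones used elsewhere in the codebase.
VALID_IMPACTS = {"HIGH", "MODERATE", "LOW", "MODIFIER"}

# Hypothetical rows standing in for snp_effects() output.
rows = [
    {"effect": "CODON_CHANGE", "impact": "MODERATE"},
    {"effect": "TODO in-frame complex variation (MNP + INDEL)", "impact": "UNKNOWN"},
]

# Filtering on every valid level in turn still never selects the placeholder row.
selected = [r for level in sorted(VALID_IMPACTS) for r in rows if r["impact"] == level]

# Only an explicit membership check reveals the rows that fell through.
unexpected = [r for r in rows if r["impact"] not in VALID_IMPACTS]
```

Here selected holds only the CODON_CHANGE row, while unexpected holds the placeholder row that no impact-based filter would ever return.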
How to reproduce
Call snp_effects() on any variant where:
- len(ref) > 1 and len(alt) > 1 (not a simple insertion or deletion)
- len(ref) != len(alt) (not a pure MNP)
- (len(alt) - len(ref)) % 3 == 0 (in-frame, not a frameshift)
Example: ref="GCC", alt="GCCATG" at a CDS position.
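The three conditions above can be written as a small predicate. This is only a sketch: is_in_frame_complex is a hypothetical helper name, not a function in veff.py, and it checks allele lengths only.

```python
def is_in_frame_complex(ref: str, alt: str) -> bool:
    """Hypothetical helper mirroring the three conditions listed in this issue."""
    multi_base = len(ref) > 1 and len(alt) > 1    # not a simple insertion or deletion
    length_changes = len(ref) != len(alt)         # not a pure MNP
    in_frame = (len(alt) - len(ref)) % 3 == 0     # net change is a whole number of codons
    return multi_base and length_changes and in_frame


# The example from this issue satisfies all three conditions.
example_hits_branch = is_in_frame_complex("GCC", "GCCATG")
# A one-base net insertion is a frameshift, so it does not qualify.
frameshift_hits_branch = is_in_frame_complex("GCC", "GCCA")
# Equal-length alleles are a pure MNP, so they do not qualify either.
mnp_hits_branch = is_in_frame_complex("GCC", "ATG")
```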
Proposed fix
Replace the TODO branch with:
    effect = base_effect._replace(effect="CODON_CHANGE", impact="MODERATE")
This is consistent with how pure MNPs are already handled in the elif branch directly above. Both cases represent in-frame changes to one or more codons with no frameshift.
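A rough sketch of what a regression check for this branch might assert. The Effect namedtuple here is a two-field stand-in (the use of _replace suggests a namedtuple-like record, but the real one in veff.py is assumed to carry more fields), and check_effect is a hypothetical helper:

```python
from collections import namedtuple

# Stand-in for the effect record; assumed shape, not the real definition.
Effect = namedtuple("Effect", ["effect", "impact"])
base_effect = Effect(effect=None, impact=None)

# Current behaviour of the branch, as quoted in this issue.
current = base_effect._replace(
    effect="TODO in-frame complex variation (MNP + INDEL)",
    impact="UNKNOWN",
)

# Behaviour after the proposed fix.
proposed = base_effect._replace(effect="CODON_CHANGE", impact="MODERATE")

VALID_IMPACTS = {"HIGH", "MODERATE", "LOW", "MODIFIER"}


def check_effect(e: Effect) -> bool:
    """Pass only if no placeholder text leaks out and the impact level is known."""
    return "TODO" not in e.effect and e.impact in VALID_IMPACTS
```

With these definitions, check_effect(current) fails and check_effect(proposed) passes, which is the behaviour a regression test on the real function would need to pin down.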
Related
This is related to issue #1180. I will submit a PR with the fix and a regression test.