Summary
PlasmodiumDataResource._subset_genome_sequence_region() uses truthiness checks on region.start and region.end to decide whether to apply each slice bound to the genome sequence. Because Python evaluates 0 as falsy, a bound of 0 is silently treated as if it had not been specified: region.start = 0 leaves the start unbounded, and with no end bound the entire contig is returned instead of the requested sub-region.
This is the same class of bug as #940, but in a completely separate code path (plasmodium.py) that is not covered by that issue.
Affected Location
File: malariagen_data/plasmodium.py, lines 241–248
def _subset_genome_sequence_region(
    self, genome, region, inline_array=True, chunks="native"
):
    """Sebset reference genome sequence."""
    region = self._resolve_region(region)
    z = genome[region.contig]
    d = _da_from_zarr(z, inline_array=inline_array, chunks=chunks)
    if region.start:  # ← BUG: 0 is falsy
        slice_start = region.start - 1
    else:
        slice_start = None
    if region.end:  # ← BUG: 0 is falsy
        slice_stop = region.end
    else:
        slice_stop = None
    loc_region = slice(slice_start, slice_stop)
    return d[loc_region]
Expected Behaviour
When region.start = 0 is specified, data should be sliced from position 0. The value 0 is a valid genomic coordinate.
Actual Behaviour
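The start bound is dropped. slice_start is set to None, so the slice becomes d[None:slice_stop] and begins at the start of the contig as if no start had been specified; when no end is given either, the entire contig is returned. No error or warning is raised.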
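A minimal reproduction sketch of the behaviour described above, with no dependency on malariagen_data. The helper name subset_with_truthiness is hypothetical, a plain list stands in for the dask array, and SimpleNamespace stands in for the resolved region object; only the truthiness logic is copied from the function quoted earlier.

```python
from types import SimpleNamespace


def subset_with_truthiness(seq, region):
    """Mirror the truthiness-based bound logic on a plain sequence (hypothetical helper)."""
    # A start of 0 is falsy, so it falls through to None just like "not specified".
    slice_start = region.start - 1 if region.start else None
    slice_stop = region.end if region.end else None
    return seq[slice(slice_start, slice_stop)]


contig = list("ACGTACGTAC")  # stand-in for a 10-base contig

whole = subset_with_truthiness(contig, SimpleNamespace(start=None, end=None))
from_zero = subset_with_truthiness(contig, SimpleNamespace(start=0, end=None))
from_one = subset_with_truthiness(contig, SimpleNamespace(start=1, end=4))

print(len(whole), len(from_zero), len(from_one))  # prints: 10 10 4
```

In this sketch, start=0 is indistinguishable from start=None: both return all 10 elements, and nothing signals that the start value was ignored.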
Why This Is Hard to Catch
- The failure is silent — no exception, no warning
- Most real-world queries don't start at position 0
- The returned data is valid and well-formed, just the wrong amount of it
- Downstream analysis runs to completion on the inflated dataset
Proposed Fix
Replace the truthiness checks with explicit None checks:
    if region.start is not None:
        slice_start = region.start - 1
    else:
        slice_start = None
    if region.end is not None:
        slice_stop = region.end
    else:
        slice_stop = None
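The effect of the explicit None checks can be sketched in isolation. The helper name region_to_slice is hypothetical and SimpleNamespace again stands in for the resolved region object; this illustrates the check itself, not the library's actual implementation.

```python
from types import SimpleNamespace


def region_to_slice(region):
    """Build a slice from a region, treating only None as 'not specified' (hypothetical helper)."""
    slice_start = region.start - 1 if region.start is not None else None
    slice_stop = region.end if region.end is not None else None
    return slice(slice_start, slice_stop)


unbounded = region_to_slice(SimpleNamespace(start=None, end=None))
bounded = region_to_slice(SimpleNamespace(start=1, end=4))
zero_start = region_to_slice(SimpleNamespace(start=0, end=4))

print(unbounded)   # slice(None, None, None)
print(bounded)     # slice(0, 4, None)
print(zero_start)  # slice(-1, 4, None)
```

With the explicit check, a start of 0 is no longer silently ignored. Because the function subtracts 1 from the start, however, 0 becomes -1, which Python slicing interprets relative to the end of the sequence. Whether a start of 0 should be validated or clamped before this point depends on how _resolve_region treats coordinates, which is not shown here.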
Additional fix in the same function
The docstring on line 235 has a typo: "Sebset" → "Subset".
Relationship to #940
Issue #940 covers 7 instances of the same truthiness bug in snp_data.py and hap_data.py. This issue covers the separate occurrence in plasmodium.py, which uses a different code pattern (separate if blocks instead of if ... or ...) and is in a completely different class (PlasmodiumDataResource vs. the Anopheles data classes).
Happy to submit a PR for this!