Plasmodium _subset_genome_sequence_region has the same truthiness bug #1108

@Gopisokk

Summary

PlasmodiumDataResource._subset_genome_sequence_region() uses truthiness checks on region.start and region.end to decide whether to slice the genome sequence. Because Python evaluates 0 as falsy, specifying region.start = 0 silently skips slicing and returns the entire contig instead of the requested sub-region.

This is the same class of bug as #940, but in a completely separate code path (plasmodium.py) that is not covered by that issue.

Affected Location

File: malariagen_data/plasmodium.py, lines 241–248

def _subset_genome_sequence_region(
    self, genome, region, inline_array=True, chunks="native"
):
    """Sebset reference genome sequence."""
    region = self._resolve_region(region)
    z = genome[region.contig]

    d = _da_from_zarr(z, inline_array=inline_array, chunks=chunks)

    if region.start:           # ← BUG: 0 is falsy
        slice_start = region.start - 1
    else:
        slice_start = None
    if region.end:             # ← BUG: 0 is falsy
        slice_stop = region.end
    else:
        slice_stop = None
    loc_region = slice(slice_start, slice_stop)

    return d[loc_region]

Expected Behaviour

When region.start = 0 is specified, data should be sliced from position 0. The value 0 is a valid genomic coordinate.

Actual Behaviour

The start bound is dropped. slice_start is set to None, so the region is treated exactly as if no start had been specified; when no end is given either, d[None:None] returns the entire contig. No error or warning is raised.
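The behaviour can be reproduced without the library. The sketch below copies only the slicing logic from the function; the `Region` namedtuple, the `buggy_slice` helper and the contig name are hypothetical stand-ins, not part of malariagen_data.

```python
from collections import namedtuple

# Hypothetical stand-in for the resolved region object; only the
# attributes used by the slicing logic are modelled.
Region = namedtuple("Region", ["contig", "start", "end"])


def buggy_slice(region):
    """Mirror the truthiness-based checks from the affected function."""
    if region.start:  # 0 is falsy, so it takes the else branch
        slice_start = region.start - 1
    else:
        slice_start = None
    if region.end:
        slice_stop = region.end
    else:
        slice_stop = None
    return slice(slice_start, slice_stop)


# An explicit start of 0 is indistinguishable from "no start given".
with_zero = buggy_slice(Region("contig_1", 0, None))
without_start = buggy_slice(Region("contig_1", None, None))
print(with_zero == without_start)  # True
```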

Why This Is Hard to Catch

  • The failure is silent — no exception, no warning
  • Most real-world queries don't start at position 0
  • The returned data is valid and well-formed, just the wrong amount of it
  • Downstream analysis runs to completion on the inflated dataset

Proposed Fix

Replace the truthiness checks with explicit None checks:

if region.start is not None:
    slice_start = region.start - 1
else:
    slice_start = None
if region.end is not None:
    slice_stop = region.end
else:
    slice_stop = None
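As a rough check of the proposed change, the same logic can be exercised in isolation. This is a sketch only: `Region`, `region_to_slice`, the contig name and the toy sequence are hypothetical, and the real function slices a dask array rather than a list.

```python
from collections import namedtuple

# Hypothetical stand-in for the resolved region object.
Region = namedtuple("Region", ["contig", "start", "end"])


def region_to_slice(region):
    """Build the slice using explicit None checks, as proposed."""
    slice_start = region.start - 1 if region.start is not None else None
    slice_stop = region.end if region.end is not None else None
    return slice(slice_start, slice_stop)


sequence = list("ACGTACGTAC")  # toy 10-base contig

# Ordinary region: positions 2 to 5 map to indices 1 to 4.
print(sequence[region_to_slice(Region("contig_1", 2, 5))])  # ['C', 'G', 'T', 'A']

# No bounds given: the whole contig is still returned.
print(region_to_slice(Region("contig_1", None, None)))  # slice(None, None, None)

# Explicit start of 0: no longer ignored, but the "- 1" offset yields -1.
print(region_to_slice(Region("contig_1", 0, 5)))  # slice(-1, 5, None)
```

One point a PR may want to settle: because of the `- 1` offset, a start of 0 becomes -1, which Python indexing counts from the end of the array, so `sequence[-1:5]` in this toy example is empty. Whether 0 should be clamped, rejected, or handled elsewhere (for example in `_resolve_region`) is a design decision this sketch does not make.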

Additional fix in the same function

The docstring on line 235 has a typo: "Sebset" should be "Subset".

Relationship to #940

Issue #940 covers 7 instances of the same truthiness bug in snp_data.py and hap_data.py. This issue covers the separate occurrence in plasmodium.py, which uses a different code pattern (separate if blocks instead of if ... or ...) and is in a completely different class (PlasmodiumDataResource vs. the Anopheles data classes).

Happy to submit a PR for this!
