veff.py caches genome data in an unbounded dict — memory leak for multi-contig analyses #1189

@kunal-10-cloud

Description

In malariagen_data/veff.py, the Annotator class caches genome sequence data in a plain, unbounded dictionary:

self._genome_cache = dict()

Whenever a new contig's reference sequence is requested via get_ref_seq(), the entire contig byte array is loaded from the Zarr store and added to this dictionary:

seq = self._genome[chrom][:]
self._genome_cache[chrom] = seq

The Problem:
There is no eviction mechanism, no size limit, and no clear_cache() method. Because the Annotator is instantiated as a @cached_property on the AnophelesSnpFrequencyAnalysis API class (and similarly in other classes), it persists for the lifetime of the session.

For workflows that perform analyses across many contigs (e.g., iterating through all chromosomes in a batch script or a long-running Jupyter notebook), this unbounded cache continuously consumes memory until the session crashes or the system runs out of RAM.

Expected Behavior

The genome sequence cache should have a reasonable default maximum size and evict least recently used (LRU) contigs when full, preventing unbounded memory growth. It should also expose a public method to manually clear the cache if a user needs to aggressively reclaim memory.
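The fix described above could be sketched with a small LRU wrapper around the existing dict. This is a minimal illustration, not a patch: the class name `BoundedGenomeCache`, the `max_contigs` default, and the `loader` callback are all assumptions, standing in for the actual load from the Zarr store in `get_ref_seq()`.

```python
from collections import OrderedDict


class BoundedGenomeCache:
    """Sketch of an LRU-bounded replacement for self._genome_cache.

    Names and the default limit are illustrative assumptions, not the
    actual malariagen_data API.
    """

    def __init__(self, max_contigs=2):
        self._max_contigs = max_contigs
        self._cache = OrderedDict()

    def get(self, chrom, loader):
        # Cache hit: mark the contig as most recently used and return it.
        if chrom in self._cache:
            self._cache.move_to_end(chrom)
            return self._cache[chrom]
        # Cache miss: load (e.g. seq = self._genome[chrom][:] in veff.py),
        # store, then evict least recently used contigs beyond the limit.
        seq = loader(chrom)
        self._cache[chrom] = seq
        while len(self._cache) > self._max_contigs:
            self._cache.popitem(last=False)
        return seq

    def clear(self):
        # Public method so users can aggressively reclaim memory.
        self._cache.clear()
```

With `max_contigs=2`, iterating over all chromosomes would hold at most two contig arrays in memory at a time instead of accumulating every contig for the lifetime of the session.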
