Description
In malariagen_data/veff.py, the Annotator class caches genome sequence data in a plain, unbounded dictionary:
```python
self._genome_cache = dict()
```
Whenever a new contig's reference sequence is requested via get_ref_seq(), the entire contig byte array is loaded from the Zarr store and added to this dictionary:
```python
seq = self._genome[chrom][:]
self._genome_cache[chrom] = seq
```
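Pieced together, the caching pattern looks roughly like this. This is a simplified sketch for illustration; apart from the attribute and method names quoted above, the surrounding structure is an assumption, not the actual malariagen_data implementation:

```python
class Annotator:
    def __init__(self, genome):
        # genome: a Zarr group mapping contig name -> sequence array (assumed shape)
        self._genome = genome
        self._genome_cache = dict()  # unbounded: entries are never evicted

    def get_ref_seq(self, chrom):
        try:
            seq = self._genome_cache[chrom]
        except KeyError:
            # Loads the *entire* contig byte array into memory,
            # then keeps it cached for the lifetime of the Annotator.
            seq = self._genome[chrom][:]
            self._genome_cache[chrom] = seq
        return seq
```

Every distinct `chrom` requested adds a full contig to `self._genome_cache`, and nothing ever removes it.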
The Problem:
There is no eviction mechanism, no size limit, and no clear_cache() method. Because the Annotator is instantiated as a @cached_property on the AnophelesSnpFrequencyAnalysis API class (and similarly in other classes), it persists for the lifetime of the session.
For workflows that perform analyses across many contigs (e.g., iterating through all chromosomes in a batch script or a long-running Jupyter notebook), this unbounded cache continuously consumes memory until the session crashes or the system runs out of RAM.
Expected Behavior
The genome sequence cache should have a reasonable default maximum size and evict least recently used (LRU) contigs when full, preventing unbounded memory growth. It should also expose a public method to manually clear the cache if a user needs to aggressively reclaim memory.
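One way to achieve this is a small LRU wrapper built on `collections.OrderedDict`. The class name, the `max_size` default, and the `loader` callback below are hypothetical choices for this sketch, not part of the existing API:

```python
from collections import OrderedDict


class LRUGenomeCache:
    """Bounded LRU cache for contig reference sequences (illustrative sketch)."""

    def __init__(self, max_size=4):
        self._max_size = max_size
        self._cache = OrderedDict()

    def get(self, chrom, loader):
        # Cache hit: mark the contig as most recently used and return it.
        if chrom in self._cache:
            self._cache.move_to_end(chrom)
            return self._cache[chrom]
        # Cache miss: load the sequence, then evict the least recently
        # used contig if the cache has grown past its limit.
        seq = loader(chrom)
        self._cache[chrom] = seq
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)
        return seq

    def clear_cache(self):
        """Drop all cached sequences to reclaim memory immediately."""
        self._cache.clear()
```

`Annotator.get_ref_seq()` could then delegate to `cache.get(chrom, lambda c: self._genome[c][:])`, and a public `Annotator.clear_cache()` would simply forward to `clear_cache()` here. A `max_size` counted in contigs is the simplest policy; counting bytes instead would bound memory more precisely, since contig sizes vary widely.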