Skip to content

_cache_locate_site_class can grow without bound despite _LOCATE_SITE_CLASS_CACHE_MAXSIZE constant #1276

@khushthecoder

Description

@khushthecoder

Description

In malariagen_data/anoph/snp_data.py, there is a constant _LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64 defined at line 45, and a cache dictionary self._cache_locate_site_class initialized at line 78. However, the maxsize constant is never actually enforced — the dictionary grows without any eviction logic.

# Line 45: Constant defined but never referenced in eviction logic
_LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64
 
# Line 78: Plain dict with no size limit
self._cache_locate_site_class: Dict[Tuple[Any, ...], np.ndarray] = dict()

The _locate_site_class method (lines 787+) caches large NumPy boolean arrays (one per contig × site_mask × site_class combination). Each array can be tens of millions of elements. Without eviction, a user iterating over many contigs, site masks, and site classes accumulates hundreds of megabytes of cached arrays that are never freed.


Impact on the Project

  • Memory leak in long-running sessions: Users working in Jupyter notebooks (the primary use case) often maintain a single API client instance for extended sessions. Each unique (region, site_mask, site_class) combination permanently adds a large array to memory.
  • The constant creates a false sense of safety: A developer reading the code sees _LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64 and assumes the cache is bounded — but it is not.
  • Contrast with related fix: Issue veff.py caches genome data in an unbounded dict — memory leak for multi-contig analyses #1189 and the corresponding PR fixed an unbounded dict cache in veff.py by converting it to an lru_cache. The same pattern should be applied here for consistency.

Steps to Reproduce / Identify

  1. Open snp_data.py, line 45 — see the unused constant.
  2. Search for _LOCATE_SITE_CLASS_CACHE_MAXSIZE in the codebase — it is defined but never referenced in any conditional or eviction logic.
  3. Inspect the _locate_site_class method — it only does self._cache_locate_site_class[cache_key] = loc_ann with no size check.
# Verify the constant is unused:
grep -rn "_LOCATE_SITE_CLASS_CACHE_MAXSIZE" malariagen_data/
# Expected output: only the definition line, no usage

Proposed Solution

Convert the plain dict to a bounded cache, consistent with the lru_cache pattern already used for _cached_snp_calls (line 86) in the same class.

Option A — Use functools.lru_cache (preferred for consistency)

from functools import lru_cache
 
# In __init__:
self._cached_locate_site_class = lru_cache(maxsize=_LOCATE_SITE_CLASS_CACHE_MAXSIZE)(
    self._locate_site_class_impl
)

Option B — Add manual eviction to the existing dict

# In _locate_site_class, before inserting:
if len(self._cache_locate_site_class) >= _LOCATE_SITE_CLASS_CACHE_MAXSIZE:
    # Evict oldest entry (FIFO)
    oldest_key = next(iter(self._cache_locate_site_class))
    del self._cache_locate_site_class[oldest_key]
self._cache_locate_site_class[cache_key] = loc_ann

Impact After Resolution

  • Memory usage in long-running sessions becomes predictable and bounded.
  • The existing _LOCATE_SITE_CLASS_CACHE_MAXSIZE constant is actually used, removing dead code.
  • Consistent with the lru_cache pattern already established for _cached_snp_calls in the same class.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions