You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In malariagen_data/anoph/snp_data.py, there is a constant _LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64 defined at line 45, and a cache dictionary self._cache_locate_site_class initialized at line 78. However, the maxsize constant is never actually enforced — the dictionary grows without any eviction logic.
# Line 45: Constant defined but never referenced in eviction logic_LOCATE_SITE_CLASS_CACHE_MAXSIZE=64# Line 78: Plain dict with no size limitself._cache_locate_site_class: Dict[Tuple[Any, ...], np.ndarray] =dict()
The _locate_site_class method (lines 787+) caches large NumPy boolean arrays (one per contig × site_mask × site_class combination). Each array can be tens of millions of elements. Without eviction, a user iterating over many contigs, site masks, and site classes accumulates hundreds of megabytes of cached arrays that are never freed.
Impact on the Project
Memory leak in long-running sessions: Users working in Jupyter notebooks (the primary use case) often maintain a single API client instance for extended sessions. Each unique (region, site_mask, site_class) combination permanently adds a large array to memory.
The constant creates a false sense of safety: A developer reading the code sees _LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64 and assumes the cache is bounded — but it is not.
Open snp_data.py, line 45 — see the unused constant.
Search for _LOCATE_SITE_CLASS_CACHE_MAXSIZE in the codebase — it is defined but never referenced in any conditional or eviction logic.
Inspect the _locate_site_class method — it only does self._cache_locate_site_class[cache_key] = loc_ann with no size check.
# Verify the constant is unused:
grep -rn "_LOCATE_SITE_CLASS_CACHE_MAXSIZE" malariagen_data/
# Expected output: only the definition line, no usage
Proposed Solution
Convert the plain dict to a bounded cache, consistent with the lru_cache pattern already used for _cached_snp_calls (line 86) in the same class.
Option A — Use functools.lru_cache (preferred for consistency)
fromfunctoolsimportlru_cache# In __init__:self._cached_locate_site_class=lru_cache(maxsize=_LOCATE_SITE_CLASS_CACHE_MAXSIZE)(
self._locate_site_class_impl
)
Option B — Add manual eviction to the existing dict
# In _locate_site_class, before inserting:iflen(self._cache_locate_site_class) >=_LOCATE_SITE_CLASS_CACHE_MAXSIZE:
# Evict oldest entry (FIFO)oldest_key=next(iter(self._cache_locate_site_class))
delself._cache_locate_site_class[oldest_key]
self._cache_locate_site_class[cache_key] =loc_ann
Impact After Resolution
Memory usage in long-running sessions becomes predictable and bounded.
The existing _LOCATE_SITE_CLASS_CACHE_MAXSIZE constant is actually used, removing dead code.
Consistent with the lru_cache pattern already established for _cached_snp_calls in the same class.
Description
In
malariagen_data/anoph/snp_data.py, there is a constant_LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64defined at line 45, and a cache dictionaryself._cache_locate_site_classinitialized at line 78. However, the maxsize constant is never actually enforced — the dictionary grows without any eviction logic.The
_locate_site_classmethod (lines 787+) caches large NumPy boolean arrays (one per contig × site_mask × site_class combination). Each array can be tens of millions of elements. Without eviction, a user iterating over many contigs, site masks, and site classes accumulates hundreds of megabytes of cached arrays that are never freed.Impact on the Project
(region, site_mask, site_class)combination permanently adds a large array to memory._LOCATE_SITE_CLASS_CACHE_MAXSIZE = 64and assumes the cache is bounded — but it is not.veff.pyby converting it to anlru_cache. The same pattern should be applied here for consistency.Steps to Reproduce / Identify
snp_data.py, line 45 — see the unused constant._LOCATE_SITE_CLASS_CACHE_MAXSIZEin the codebase — it is defined but never referenced in any conditional or eviction logic._locate_site_classmethod — it only doesself._cache_locate_site_class[cache_key] = loc_annwith no size check.Proposed Solution
Convert the plain dict to a bounded cache, consistent with the
lru_cachepattern already used for_cached_snp_calls(line 86) in the same class.Option A — Use
functools.lru_cache(preferred for consistency)Option B — Add manual eviction to the existing dict
Impact After Resolution
_LOCATE_SITE_CLASS_CACHE_MAXSIZEconstant is actually used, removing dead code.lru_cachepattern already established for_cached_snp_callsin the same class.