
No Public API to Clear or Inspect In-Memory Caches — Users Cannot Reclaim Memory in Long-Running Sessions #1289

@khushthecoder


Severity: High — Silent Memory Pressure

This is not a feature request. Every researcher running a multi-contig or multi-sample-set analysis in a Jupyter notebook will accumulate hundreds of MB to several GB of irrecoverable cached data with no way to release it. In Google Colab (12 GB RAM free tier), this routinely leads to kernel OOM crashes mid-analysis.


Environment

| Field | Value |
| --- | --- |
| Library version | 15.6.0 (also reproducible on earlier versions) |
| Python | 3.10 / 3.11 / 3.12 |
| OS | Linux (Colab), macOS, Windows (any platform) |
| Affected classes | Ag3, Af1, Adir1, Amin1, Adar1, and all Plasmodium resources (Pf7, Pf8, Pf9, Pv4) |

Description

The library maintains 14 separate in-memory dict caches across 7 modules, but exposes no public method to clear, inspect, or monitor any of them. Each unique combination of parameters (region, sample_set, analysis) creates a new dict entry that persists for the object's lifetime. None of these 14 caches has eviction or a size limit, and there is no public clear_cache() method. (The only bounded caches live in snp_data: _cache_locate_site_class correctly uses an OrderedDict with LRU eviction at maxsize=64, and _cached_snp_calls uses lru_cache(maxsize=2).)

The only existing clear_* method is clear_extra_metadata() (sample_metadata.py:709), which clears user-added metadata — not the internal data caches.
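For context, the bounded pattern already used by snp_data._cache_locate_site_class can be sketched in a few lines. This is a generic OrderedDict-based LRU cache, not the library's actual code; applying the same pattern to the other 14 caches would cap their growth:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry past maxsize."""

    def __init__(self, maxsize=64):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get(self, key):
        value = self._data[key]        # raises KeyError on a miss
        self._data.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the oldest entry
```

Even a generous maxsize would bound worst-case memory to a predictable ceiling instead of growing without limit.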

All 14 Unbounded Caches

| Module | Cache attribute | What it stores | Approximate size per entry |
| --- | --- | --- | --- |
| base.py:178 | `_cache_sample_sets` | DataFrames | ~KB |
| base.py:179 | `_cache_available_sample_sets` | DataFrames | ~KB |
| base.py:184 | `_cache_files` | Raw bytes from GCS | Variable, up to MB |
| sample_metadata.py:86 | `_cache_sample_metadata` | DataFrames | ~MB |
| sample_metadata.py:87 | `_cache_cohorts` | DataFrames | ~KB |
| sample_metadata.py:88 | `_cache_cohort_geometries` | GeoJSON dicts | ~KB |
| **hap_data.py:46** | **`_cache_haplotypes`** | **xarray Datasets** | **~100–500 MB** |
| **hap_data.py:47** | **`_cache_haplotype_sites`** | **xarray Datasets** | **~50–200 MB** |
| **cnv_data.py:50** | **`_cache_cnv_hmm`** | **xarray Datasets** | **~50–200 MB** |
| **cnv_data.py:51** | **`_cache_cnv_coverage_calls`** | **xarray Datasets** | **~50–200 MB** |
| **cnv_data.py:52** | **`_cache_cnv_discordant_read_calls`** | **xarray Datasets** | **~50–200 MB** |
| aim_data.py:63 | `_cache_aim_variants` | xarray Datasets | ~10–50 MB |
| genome_features.py:63 | `_cache_genome_features` | DataFrames | ~MB |
| plasmodium.py:41 | `_cache_genome_features` | DataFrames | ~MB |

The bolded entries are the most dangerous — a handful of cached haplotype or CNV datasets can consume multiple GB.


How to Reproduce

```python
import malariagen_data
import sys

ag3 = malariagen_data.Ag3()

# Simulate a typical multi-region analysis:
for contig in ["2R", "2L", "3R", "3L", "X"]:
    ag3.haplotypes(region=contig, sample_sets="AG1000G-BF-A", analysis="gamb_colu")

# Check how many entries are cached (must reach into private attrs):
print(f"Haplotype cache entries: {len(ag3._cache_haplotypes)}")       # 5
print(f"Haplotype sites entries: {len(ag3._cache_haplotype_sites)}")  # 5

# Estimate memory (rough — see discussion below on measurement):
total = sum(sys.getsizeof(v) for v in ag3._cache_haplotypes.values())
print(f"Shallow haplotype cache size: {total / 1e6:.0f} MB")

# There is no way to free this memory:
# ag3.clear_cache()   — does not exist
# ag3.cache_info()    — does not exist
```

Current Workaround (fragile, undocumented)

Until this is fixed, users can manually clear individual caches by reaching into private attributes. This is unsupported and may break in any release:

```python
# Clear the largest caches to reclaim memory:
ag3._cache_haplotypes.clear()
ag3._cache_haplotype_sites.clear()
ag3._cache_cnv_hmm.clear()

# To clear everything, you must know all 14 cache names:
for attr in dir(ag3):
    if attr.startswith("_cache_") and isinstance(getattr(ag3, attr), dict):
        getattr(ag3, attr).clear()

# Or destroy and recreate (slow — re-authenticates GCS, re-fetches config):
del ag3
ag3 = malariagen_data.Ag3()
```


Proposed API

```python
# Clear all caches:
ag3.clear_cache()

# Clear a specific category:
ag3.clear_cache("haplotypes")      # clears _cache_haplotypes + _cache_haplotype_sites
ag3.clear_cache("cnv")             # clears _cache_cnv_hmm + _cache_cnv_coverage_calls + ...
ag3.clear_cache("sample_metadata") # clears _cache_sample_metadata + _cache_cohorts + ...

# An invalid category should raise ValueError listing the valid options:
ag3.clear_cache("nonexistent")
# ValueError: Unknown cache category 'nonexistent'. Valid options:
#   'all', 'haplotypes', 'cnv', 'snp', 'sample_metadata', 'aim', 'genome_features'
```
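A minimal sketch of how such a method could be implemented. The attribute names come from the table above; the category mapping is trimmed to three categories for brevity, and the function itself is a proposal, not existing library code:

```python
# Hypothetical category -> cache-attribute mapping (subset, for illustration):
_CACHE_CATEGORIES = {
    "haplotypes": ["_cache_haplotypes", "_cache_haplotype_sites"],
    "cnv": ["_cache_cnv_hmm", "_cache_cnv_coverage_calls",
            "_cache_cnv_discordant_read_calls"],
    "sample_metadata": ["_cache_sample_metadata", "_cache_cohorts",
                        "_cache_cohort_geometries"],
}

def clear_cache(resource, category="all"):
    """Clear the named cache category on a data resource, or all of them."""
    if category == "all":
        names = [n for group in _CACHE_CATEGORIES.values() for n in group]
    elif category in _CACHE_CATEGORIES:
        names = _CACHE_CATEGORIES[category]
    else:
        valid = ", ".join(repr(c) for c in ["all", *sorted(_CACHE_CATEGORIES)])
        raise ValueError(
            f"Unknown cache category {category!r}. Valid options: {valid}"
        )
    for name in names:
        cache = getattr(resource, name, None)
        if isinstance(cache, dict):
            cache.clear()  # drop all entries; GC then reclaims the datasets
```

The getattr guard means subclasses that lack a given cache attribute are handled gracefully, which matters since the attributes are spread across several mixin modules.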

On cache_info() Return Format

The useful metric is number of entries and deep memory size. For xarray Datasets backed by in-memory NumPy arrays (the common case after .compute()), Dataset.nbytes gives the true size. For Dask-backed lazy arrays, only the task graph metadata is in memory — cache_info() should report this distinction:

```python
ag3.cache_info()
# Returns:
# {
#   "haplotypes":      {"entries": 5, "nbytes_mb": 487.2, "note": "computed arrays"},
#   "cnv_hmm":         {"entries": 0, "nbytes_mb": 0.0},
#   "sample_metadata": {"entries": 3, "nbytes_mb": 1.4},
#   ...
# }
```

Using xr.Dataset.nbytes (which sums array.nbytes over each variable) is accurate for NumPy-backed data. For Dask arrays, nbytes reports the size the data would occupy once materialized, not current memory use; cache_info() should flag this distinction, but the figure is still useful as an upper bound.
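As a rough, library-agnostic sketch of that measurement (deep_cache_size is a hypothetical helper, not part of malariagen_data), one can prefer .nbytes where an object exposes it and fall back to a shallow sys.getsizeof otherwise:

```python
import sys

def deep_cache_size(cache: dict) -> int:
    """Approximate total bytes held by a cache dict.

    Prefers the .nbytes attribute exposed by NumPy arrays and
    xarray Datasets; falls back to a shallow sys.getsizeof for
    everything else (an underestimate for nested containers).
    """
    total = 0
    for value in cache.values():
        nbytes = getattr(value, "nbytes", None)
        total += nbytes if isinstance(nbytes, int) else sys.getsizeof(value)
    return total

# Stand-in for an xarray Dataset / NumPy array exposing .nbytes:
class FakeArray:
    nbytes = 1_000_000

cache = {"2R": FakeArray(), "2L": FakeArray()}
print(f"{deep_cache_size(cache) / 1e6:.1f} MB")  # 2.0 MB
```

This duck-typed approach avoids a hard dependency on xarray or dask in the reporting code itself.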


Expected Impact After Resolution

Researchers can manage memory in long-running sessions without kernel restarts. OOM crashes in Colab become diagnosable. The API is consistent with the existing clear_extra_metadata() pattern and integrates naturally into notebook workflows.
