Severity: High — Silent Memory Pressure
This is not a feature request. Every researcher running a multi-contig or multi-sample-set analysis in a Jupyter notebook will accumulate hundreds of MB to several GB of irrecoverable cached data with no way to release it. In Google Colab (12 GB RAM free tier), this routinely leads to kernel OOM crashes mid-analysis.
Environment
| Field | Value |
|---|---|
| Library version | 15.6.0 (also reproducible on earlier versions) |
| Python | 3.10 / 3.11 / 3.12 |
| OS | Linux (Colab), macOS, Windows (any platform) |
| Affected classes | `Ag3`, `Af1`, `Adir1`, `Amin1`, `Adar1`, and all Plasmodium resources (`Pf7`, `Pf8`, `Pf9`, `Pv4`) |
Description
The library maintains 14 separate in-memory dict caches across 7 modules, but exposes no public method to clear, inspect, or monitor any of them. Each unique combination of parameters (`region`, `sample_set`, `analysis`) creates a new dict entry that persists for the object's lifetime. There is no eviction, no size limit, and no public `clear_cache()` method; the only exceptions are `snp_data._cache_locate_site_class`, which correctly uses an `OrderedDict` with LRU eviction at `maxsize=64`, and `_cached_snp_calls`, which uses `lru_cache(maxsize=2)`.
The only existing `clear_*` method is `clear_extra_metadata()` (sample_metadata.py:709), which clears user-added metadata, not the internal data caches.
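For contrast, the bounded pattern that `_cache_locate_site_class` reportedly follows can be sketched in a few lines. `lru_put` is an illustrative helper written for this issue, not library code:

```python
from collections import OrderedDict

# 64 mirrors the reported _LOCATE_SITE_CLASS_CACHE_MAXSIZE constant.
_MAXSIZE = 64

def lru_put(cache: OrderedDict, key, value, maxsize: int = _MAXSIZE) -> None:
    """Insert into an OrderedDict cache, evicting the least recently used entry."""
    if key in cache:
        cache.move_to_end(key)  # refresh recency on re-insert
    cache[key] = value
    while len(cache) > maxsize:
        cache.popitem(last=False)  # drop the oldest (least recently used) entry
```

The other 13 caches have no equivalent bound: every new parameter combination grows them forever.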
All 14 Unbounded Caches
| Module | Cache attribute | What it stores | Approx. size per entry |
|---|---|---|---|
| base.py:178 | `_cache_sample_sets` | DataFrames | ~KB |
| base.py:179 | `_cache_available_sample_sets` | DataFrames | ~KB |
| base.py:184 | `_cache_files` | Raw bytes from GCS | Variable, up to MB |
| sample_metadata.py:86 | `_cache_sample_metadata` | DataFrames | ~MB |
| sample_metadata.py:87 | `_cache_cohorts` | DataFrames | ~KB |
| sample_metadata.py:88 | `_cache_cohort_geometries` | GeoJSON dicts | ~KB |
| hap_data.py:46 | **`_cache_haplotypes`** | xarray Datasets | ~100–500 MB |
| hap_data.py:47 | **`_cache_haplotype_sites`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:50 | **`_cache_cnv_hmm`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:51 | **`_cache_cnv_coverage_calls`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:52 | **`_cache_cnv_discordant_read_calls`** | xarray Datasets | ~50–200 MB |
| aim_data.py:63 | `_cache_aim_variants` | xarray Datasets | ~10–50 MB |
| genome_features.py:63 | `_cache_genome_features` | DataFrames | ~MB |
| plasmodium.py:41 | `_cache_genome_features` | DataFrames | ~MB |
The bolded entries are the most dangerous — a handful of cached haplotype or CNV datasets can consume multiple GB.
How to Reproduce
```python
import sys

import malariagen_data

ag3 = malariagen_data.Ag3()

# Simulate a typical multi-region analysis:
for contig in ["2R", "2L", "3R", "3L", "X"]:
    ag3.haplotypes(region=contig, sample_sets="AG1000G-BF-A", analysis="gamb_colu")

# Check how many entries are cached (must reach into private attrs):
print(f"Haplotype cache entries: {len(ag3._cache_haplotypes)}")        # 5
print(f"Haplotype sites entries: {len(ag3._cache_haplotype_sites)}")   # 5

# Estimate memory (rough; see discussion below on measurement):
total = sum(sys.getsizeof(v) for v in ag3._cache_haplotypes.values())
print(f"Shallow haplotype cache size: {total / 1e6:.0f} MB")

# There is no way to free this memory:
# ag3.clear_cache()  — does not exist
# ag3.cache_info()   — does not exist
```
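Note that `sys.getsizeof` only measures the container object, so the figure above badly understates the real footprint. A deeper estimate can prefer each cached value's `nbytes` attribute, which NumPy arrays and NumPy-backed xarray Datasets expose. `deep_cache_size` below is a hypothetical helper written for this issue, not a library function:

```python
import sys

def deep_cache_size(cache: dict) -> int:
    """Best-effort materialized size of cached values, in bytes.

    Uses the value's own ``nbytes`` when available (NumPy arrays and
    NumPy-backed xarray Datasets expose it) and falls back to a shallow
    ``sys.getsizeof`` otherwise.
    """
    total = 0
    for value in cache.values():
        nbytes = getattr(value, "nbytes", None)
        total += nbytes if isinstance(nbytes, int) else sys.getsizeof(value)
    return total
```

Usage: `print(f"{deep_cache_size(ag3._cache_haplotypes) / 1e6:.0f} MB")`.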
Current Workaround (fragile, undocumented)
Until this is fixed, users can manually clear individual caches by reaching into private attributes. This is unsupported and may break in any release:
```python
# Clear the largest caches to reclaim memory:
ag3._cache_haplotypes.clear()
ag3._cache_haplotype_sites.clear()
ag3._cache_cnv_hmm.clear()

# To clear everything, you must know all 14 cache names:
for attr in dir(ag3):
    if attr.startswith("_cache_") and isinstance(getattr(ag3, attr), dict):
        getattr(ag3, attr).clear()

# Or destroy and recreate (slow: re-authenticates GCS, re-fetches config):
del ag3
ag3 = malariagen_data.Ag3()
```
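If you must rely on this workaround, wrapping it in a context manager at least keeps the cleanup in one place. This sketch depends on the same undocumented `_cache_*` naming convention, so it inherits all of the fragility above:

```python
from contextlib import contextmanager

@contextmanager
def transient_caches(resource):
    """Clear every private ``_cache_*`` dict on exit (fragile workaround sketch)."""
    try:
        yield resource
    finally:
        for attr in dir(resource):
            value = getattr(resource, attr, None)
            if attr.startswith("_cache_") and isinstance(value, dict):
                value.clear()
```

Usage: `with transient_caches(ag3): ...` empties all dict caches when the block exits, even on error.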
Why It Is Important
- **Directly causes OOM crashes:** Researchers doing genome-wide analyses across 5 contigs with multiple sample sets can accumulate 2–5 GB in haplotype/CNV caches alone. On Colab free tier (12 GB), this competes with the genotype arrays themselves and leads to kernel crashes.
- **No observability:** When a notebook hits OOM, there is no way to diagnose how much memory the caches hold. Users blame their data loading code, not invisible internal caches.
- **Workaround is fragile:** The `_cache_*` naming convention is an implementation detail. Any refactor (e.g., the ongoing Anopheles refactor in #366, or the cache pattern changes in merged PRs #1190 and #1277) can rename or restructure these without notice.
- **Precedent in the codebase:** `clear_extra_metadata()` already exists as a public method for user metadata. The same pattern should apply to data caches.
- **Distinct from #1276 / PR #1277** (which fixed unbounded growth in `_cache_locate_site_class` by adding LRU eviction): even with bounded caches, users need the ability to proactively free memory when switching between analyses or nearing memory limits.
Proposed API
```python
# Clear all caches:
ag3.clear_cache()

# Clear a specific category:
ag3.clear_cache("haplotypes")       # clears _cache_haplotypes + _cache_haplotype_sites
ag3.clear_cache("cnv")              # clears _cache_cnv_hmm + _cache_cnv_coverage_calls + ...
ag3.clear_cache("sample_metadata")  # clears _cache_sample_metadata + _cache_cohorts + ...

# An invalid category should raise ValueError listing the valid options:
ag3.clear_cache("nonexistent")
# ValueError: Unknown cache category 'nonexistent'. Valid options:
# 'all', 'haplotypes', 'cnv', 'snp', 'sample_metadata', 'aim', 'genome_features'
```
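A minimal sketch of how `clear_cache` could be implemented, written here as a free function over a category-to-attribute map. The attribute names come from the cache table above; the grouping and the `CACHE_CATEGORIES` name are assumptions for illustration:

```python
# Hypothetical grouping; attribute names taken from the cache table above.
CACHE_CATEGORIES = {
    "haplotypes": ("_cache_haplotypes", "_cache_haplotype_sites"),
    "cnv": (
        "_cache_cnv_hmm",
        "_cache_cnv_coverage_calls",
        "_cache_cnv_discordant_read_calls",
    ),
    "sample_metadata": (
        "_cache_sample_metadata",
        "_cache_cohorts",
        "_cache_cohort_geometries",
    ),
    "aim": ("_cache_aim_variants",),
    "genome_features": ("_cache_genome_features",),
}

def clear_cache(resource, category: str = "all") -> None:
    """Clear one category of caches, or all of them."""
    if category == "all":
        names = [name for group in CACHE_CATEGORIES.values() for name in group]
    elif category in CACHE_CATEGORIES:
        names = list(CACHE_CATEGORIES[category])
    else:
        valid = ", ".join(repr(c) for c in ["all", *CACHE_CATEGORIES])
        raise ValueError(
            f"Unknown cache category {category!r}. Valid options: {valid}"
        )
    for name in names:
        cache = getattr(resource, name, None)
        if isinstance(cache, dict):
            cache.clear()
```

As a method on the resource classes, the same logic would simply drop the `resource` parameter in favor of `self`.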
On cache_info() Return Format
The useful metric is number of entries and deep memory size. For xarray Datasets backed by in-memory NumPy arrays (the common case after `.compute()`), `Dataset.nbytes` gives the true size. For Dask-backed lazy arrays, only the task graph metadata is in memory; `cache_info()` should report this distinction:
```python
ag3.cache_info()
# Returns:
# {
#     "haplotypes": {"entries": 5, "nbytes_mb": 487.2, "note": "computed arrays"},
#     "cnv_hmm": {"entries": 0, "nbytes_mb": 0.0},
#     "sample_metadata": {"entries": 3, "nbytes_mb": 1.4},
#     ...
# }
```
Using `xr.Dataset.nbytes` (which sums `array.nbytes` for each variable) is accurate for NumPy-backed data. For dask arrays, `dask.array.Array.nbytes` reports the would-be materialized size, which should be noted but is still useful as an upper bound.
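Put together, a `cache_info()` along these lines could be very small. This is again a hypothetical sketch relying on the `_cache_*` convention and on cached values exposing `nbytes`:

```python
def cache_info(resource) -> dict:
    """Per-cache entry counts and best-effort materialized sizes in MB (sketch)."""
    info = {}
    for attr in dir(resource):
        cache = getattr(resource, attr, None)
        if attr.startswith("_cache_") and isinstance(cache, dict):
            # Values without nbytes (e.g. plain dicts) count as 0 here;
            # a real implementation might fall back to a shallow estimate.
            nbytes = sum(getattr(v, "nbytes", 0) for v in cache.values())
            info[attr.removeprefix("_cache_")] = {
                "entries": len(cache),
                "nbytes_mb": round(nbytes / 1e6, 1),
            }
    return info
```

Distinguishing computed from lazy entries (the `"note"` field above) would additionally require inspecting whether each Dataset's variables are Dask-backed.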
Expected Impact After Resolution
Researchers can manage memory in long-running sessions without kernel restarts. OOM crashes in Colab become diagnosable. The API is consistent with the existing `clear_extra_metadata()` pattern and integrates naturally into notebook workflows.