Severity: High — Silent Memory Pressure
This is not a feature request. Every researcher running a multi-contig or multi-sample-set analysis in a Jupyter notebook will accumulate hundreds of MB to several GB of irrecoverable cached data with no way to release it. In Google Colab (12 GB RAM free tier), this routinely leads to kernel OOM crashes mid-analysis.
Environment
| Field | Value |
|---|---|
| Library version | 15.6.0 (also reproducible on earlier versions) |
| Python | 3.10 / 3.11 / 3.12 |
| OS | Linux (Colab), macOS, Windows (any platform) |
| Affected classes | `Ag3`, `Af1`, `Adir1`, `Amin1`, `Adar1`, and all Plasmodium resources (`Pf7`, `Pf8`, `Pf9`, `Pv4`) |
Description
The library maintains 14 separate in-memory dict caches across 7 modules, but exposes no public method to clear, inspect, or monitor any of them. Each unique combination of parameters (`region`, `sample_set`, `analysis`) creates a new dict entry that persists for the object's lifetime. There is no eviction, no size limit, and no public `clear_cache()` method; the only exceptions are `snp_data._cache_locate_site_class`, which correctly uses an `OrderedDict` with LRU eviction at `maxsize=64`, and `_cached_snp_calls`, which uses `lru_cache(maxsize=2)`.
The only existing `clear_*` method is `clear_extra_metadata()` (sample_metadata.py:709), which clears user-added metadata, not the internal data caches.
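For contrast, the bounded pattern that `_cache_locate_site_class` reportedly follows can be sketched in a few lines. `lru_put` is an illustrative helper written for this issue, not library code:

```python
from collections import OrderedDict

# 64 mirrors the reported _LOCATE_SITE_CLASS_CACHE_MAXSIZE constant.
_MAXSIZE = 64

def lru_put(cache: OrderedDict, key, value, maxsize: int = _MAXSIZE) -> None:
    """Insert into an OrderedDict cache, evicting the least recently used entry."""
    if key in cache:
        cache.move_to_end(key)  # refresh recency on re-insert
    cache[key] = value
    while len(cache) > maxsize:
        cache.popitem(last=False)  # drop the oldest (least recently used) entry
```

The other 13 caches have no equivalent bound: every new parameter combination grows them forever.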
All 14 Unbounded Caches
| Module | Cache attribute | What it stores | Approx. size per entry |
|---|---|---|---|
| base.py:178 | `_cache_sample_sets` | DataFrames | ~KB |
| base.py:179 | `_cache_available_sample_sets` | DataFrames | ~KB |
| base.py:184 | `_cache_files` | Raw bytes from GCS | Variable, up to MB |
| sample_metadata.py:86 | `_cache_sample_metadata` | DataFrames | ~MB |
| sample_metadata.py:87 | `_cache_cohorts` | DataFrames | ~KB |
| sample_metadata.py:88 | `_cache_cohort_geometries` | GeoJSON dicts | ~KB |
| hap_data.py:46 | **`_cache_haplotypes`** | xarray Datasets | ~100–500 MB |
| hap_data.py:47 | **`_cache_haplotype_sites`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:50 | **`_cache_cnv_hmm`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:51 | **`_cache_cnv_coverage_calls`** | xarray Datasets | ~50–200 MB |
| cnv_data.py:52 | **`_cache_cnv_discordant_read_calls`** | xarray Datasets | ~50–200 MB |
| aim_data.py:63 | `_cache_aim_variants` | xarray Datasets | ~10–50 MB |
| genome_features.py:63 | `_cache_genome_features` | DataFrames | ~MB |
| plasmodium.py:41 | `_cache_genome_features` | DataFrames | ~MB |
The bolded entries are the most dangerous — a handful of cached haplotype or CNV datasets can consume multiple GB.
How to Reproduce
```python
import sys

import malariagen_data

ag3 = malariagen_data.Ag3()

# Simulate a typical multi-region analysis:
for contig in ["2R", "2L", "3R", "3L", "X"]:
    ag3.haplotypes(region=contig, sample_sets="AG1000G-BF-A", analysis="gamb_colu")

# Check how many entries are cached (must reach into private attrs):
print(f"Haplotype cache entries: {len(ag3._cache_haplotypes)}")        # 5
print(f"Haplotype sites entries: {len(ag3._cache_haplotype_sites)}")   # 5

# Estimate memory (rough; see discussion below on measurement):
total = sum(sys.getsizeof(v) for v in ag3._cache_haplotypes.values())
print(f"Shallow haplotype cache size: {total / 1e6:.0f} MB")

# There is no way to free this memory:
# ag3.clear_cache()  — does not exist
# ag3.cache_info()   — does not exist
```
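Note that `sys.getsizeof` only measures the container object, so the figure above badly understates the real footprint. A deeper estimate can prefer each cached value's `nbytes` attribute, which NumPy arrays and NumPy-backed xarray Datasets expose. `deep_cache_size` below is a hypothetical helper written for this issue, not a library function:

```python
import sys

def deep_cache_size(cache: dict) -> int:
    """Best-effort materialized size of cached values, in bytes.

    Uses the value's own ``nbytes`` when available (NumPy arrays and
    NumPy-backed xarray Datasets expose it) and falls back to a shallow
    ``sys.getsizeof`` otherwise.
    """
    total = 0
    for value in cache.values():
        nbytes = getattr(value, "nbytes", None)
        total += nbytes if isinstance(nbytes, int) else sys.getsizeof(value)
    return total
```

Usage: `print(f"{deep_cache_size(ag3._cache_haplotypes) / 1e6:.0f} MB")`.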
Current Workaround (fragile, undocumented)
Until this is fixed, users can manually clear individual caches by reaching into private attributes. This is unsupported and may break in any release:
```python
# Clear the largest caches to reclaim memory:
ag3._cache_haplotypes.clear()
ag3._cache_haplotype_sites.clear()
ag3._cache_cnv_hmm.clear()

# To clear everything, you must know all 14 cache names:
for attr in dir(ag3):
    if attr.startswith("_cache_") and isinstance(getattr(ag3, attr), dict):
        getattr(ag3, attr).clear()

# Or destroy and recreate (slow: re-authenticates GCS, re-fetches config):
del ag3
ag3 = malariagen_data.Ag3()
```
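If you must rely on this workaround, wrapping it in a context manager at least keeps the cleanup in one place. This sketch depends on the same undocumented `_cache_*` naming convention, so it inherits all of the fragility above:

```python
from contextlib import contextmanager

@contextmanager
def transient_caches(resource):
    """Clear every private ``_cache_*`` dict on exit (fragile workaround sketch)."""
    try:
        yield resource
    finally:
        for attr in dir(resource):
            value = getattr(resource, attr, None)
            if attr.startswith("_cache_") and isinstance(value, dict):
                value.clear()
```

Usage: `with transient_caches(ag3): ...` empties all dict caches when the block exits, even on error.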
Why It Is Important
- **Directly causes OOM crashes:** Researchers doing genome-wide analyses across 5 contigs with multiple sample sets can accumulate 2–5 GB in haplotype/CNV caches alone. On Colab free tier (12 GB), this competes with the genotype arrays themselves and leads to kernel crashes.
- **No observability:** When a notebook hits OOM, there is no way to diagnose how much memory the caches hold. Users blame their data loading code, not invisible internal caches.
- **Workaround is fragile:** The `_cache_*` naming convention is an implementation detail. Any refactor (e.g., the ongoing Anopheles refactor in #366, or the cache pattern changes in merged PRs #1190 and #1277) can rename or restructure these without notice.
- **Precedent in the codebase:** `clear_extra_metadata()` already exists as a public method for user metadata. The same pattern should apply to data caches.
- **Distinct from #1276 / PR #1277** (which fixed unbounded growth in `_cache_locate_site_class` by adding LRU eviction): even with bounded caches, users need the ability to proactively free memory when switching between analyses or nearing memory limits.
Proposed API
```python
# Clear all caches:
ag3.clear_cache()

# Clear a specific category:
ag3.clear_cache("haplotypes")       # clears _cache_haplotypes + _cache_haplotype_sites
ag3.clear_cache("cnv")              # clears _cache_cnv_hmm + _cache_cnv_coverage_calls + ...
ag3.clear_cache("sample_metadata")  # clears _cache_sample_metadata + _cache_cohorts + ...

# An invalid category should raise ValueError listing the valid options:
ag3.clear_cache("nonexistent")
# ValueError: Unknown cache category 'nonexistent'. Valid options:
# 'all', 'haplotypes', 'cnv', 'snp', 'sample_metadata', 'aim', 'genome_features'
```
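A minimal sketch of how `clear_cache` could be implemented, written here as a free function over a category-to-attribute map. The attribute names come from the cache table above; the grouping and the `CACHE_CATEGORIES` name are assumptions for illustration:

```python
# Hypothetical grouping; attribute names taken from the cache table above.
CACHE_CATEGORIES = {
    "haplotypes": ("_cache_haplotypes", "_cache_haplotype_sites"),
    "cnv": (
        "_cache_cnv_hmm",
        "_cache_cnv_coverage_calls",
        "_cache_cnv_discordant_read_calls",
    ),
    "sample_metadata": (
        "_cache_sample_metadata",
        "_cache_cohorts",
        "_cache_cohort_geometries",
    ),
    "aim": ("_cache_aim_variants",),
    "genome_features": ("_cache_genome_features",),
}

def clear_cache(resource, category: str = "all") -> None:
    """Clear one category of caches, or all of them."""
    if category == "all":
        names = [name for group in CACHE_CATEGORIES.values() for name in group]
    elif category in CACHE_CATEGORIES:
        names = list(CACHE_CATEGORIES[category])
    else:
        valid = ", ".join(repr(c) for c in ["all", *CACHE_CATEGORIES])
        raise ValueError(
            f"Unknown cache category {category!r}. Valid options: {valid}"
        )
    for name in names:
        cache = getattr(resource, name, None)
        if isinstance(cache, dict):
            cache.clear()
```

As a method on the resource classes, the same logic would simply drop the `resource` parameter in favor of `self`.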
On cache_info() Return Format
The useful metric is number of entries and deep memory size. For xarray Datasets backed by in-memory NumPy arrays (the common case after `.compute()`), `Dataset.nbytes` gives the true size. For Dask-backed lazy arrays, only the task graph metadata is in memory; `cache_info()` should report this distinction:
```python
ag3.cache_info()
# Returns:
# {
#     "haplotypes": {"entries": 5, "nbytes_mb": 487.2, "note": "computed arrays"},
#     "cnv_hmm": {"entries": 0, "nbytes_mb": 0.0},
#     "sample_metadata": {"entries": 3, "nbytes_mb": 1.4},
#     ...
# }
```
Using `xr.Dataset.nbytes` (which sums `array.nbytes` for each variable) is accurate for NumPy-backed data. For dask arrays, `dask.array.Array.nbytes` reports the would-be materialized size, which should be noted but is still useful as an upper bound.
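Put together, a `cache_info()` along these lines could be very small. This is again a hypothetical sketch relying on the `_cache_*` convention and on cached values exposing `nbytes`:

```python
def cache_info(resource) -> dict:
    """Per-cache entry counts and best-effort materialized sizes in MB (sketch)."""
    info = {}
    for attr in dir(resource):
        cache = getattr(resource, attr, None)
        if attr.startswith("_cache_") and isinstance(cache, dict):
            # Values without nbytes (e.g. plain dicts) count as 0 here;
            # a real implementation might fall back to a shallow estimate.
            nbytes = sum(getattr(v, "nbytes", 0) for v in cache.values())
            info[attr.removeprefix("_cache_")] = {
                "entries": len(cache),
                "nbytes_mb": round(nbytes / 1e6, 1),
            }
    return info
```

Distinguishing computed from lazy entries (the `"note"` field above) would additionally require inspecting whether each Dataset's variables are Dask-backed.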
Expected Impact After Resolution
Researchers can manage memory in long-running sessions without kernel restarts. OOM crashes in Colab become diagnosable. The API is consistent with the existing `clear_extra_metadata()` pattern and integrates naturally into notebook workflows.