feat: add public cache_info() and clear_cache() API (fixes #1289) by khushthecoder · Pull Request #1295 · malariagen/malariagen-data-python

khushthecoder · 2026-04-14T13:55:16Z

Summary

Adds two public methods to AnophelesBase so that all resource classes (Ag3, Af1, Adir1, Amin1, Adar1, and all Plasmodium resources) inherit a stable, documented API for inspecting and releasing in-memory caches.

This directly addresses silent memory pressure in long-running Jupyter / Google Colab sessions, where cached haplotype, CNV, and SNP datasets accumulate across multi-contig or multi-sample-set analyses — often consuming multiple GB — with no public mechanism to reclaim that memory.

Problem

The library maintains 14+ separate unbounded dict caches across 7 modules (plus lru_cache and OrderedDict-based caches in snp_data). Each unique combination of region, sample_set, analysis, etc. creates a new entry that persists for the object's lifetime. There is:

No eviction policy on the largest caches (_cache_haplotypes, _cache_cnv_hmm, etc.)
No public method to clear caches
No way to inspect how much memory caches consume

The only existing clear_* method is clear_extra_metadata() which clears user-added metadata — not internal data caches. Users who hit OOM must either reach into private _cache_* attributes (fragile, undocumented) or destroy and recreate the API object (slow, re-authenticates GCS).

Solution

`cache_info() -> dict`

Returns a dictionary keyed by cache attribute name. Each entry includes:

| Key | Type | Description |
| --- | --- | --- |
| `entries` | `int` | Number of cached items |
| `nbytes` | `int` | Best-effort deep size estimate, in bytes |
| `kind` | `str` | `"dict"`, `"lru_cache"`, or `"other"` |
| `note` | `str` | Estimation method (`xarray.nbytes`, `numpy.nbytes`, `dask upper bound`, `bytes length`, `sys.getsizeof shallow`, etc.) |
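
A best-effort estimate of this kind could be sketched as follows. This is an illustration only, not the code in this PR: `estimate_nbytes`, `describe_dict_cache`, and `FakeArray` are hypothetical names, and the sketch relies on duck-typing an integer `nbytes` attribute (which numpy arrays and xarray objects expose) rather than importing those libraries.

```python
import sys


def estimate_nbytes(value):
    """Return (nbytes, note) for one cached value, best effort."""
    if isinstance(value, (bytes, bytearray)):
        return len(value), "bytes length"
    nbytes = getattr(value, "nbytes", None)
    if isinstance(nbytes, int):
        # Array-like objects report their own buffer size.
        return nbytes, "nbytes attribute"
    # Fallback: shallow size only, does not follow references.
    return sys.getsizeof(value), "sys.getsizeof shallow"


def describe_dict_cache(cache):
    """Build an entry shaped like the table above for a plain dict cache."""
    estimates = [estimate_nbytes(v) for v in cache.values()]
    return {
        "entries": len(cache),
        "nbytes": sum(n for n, _ in estimates),
        "kind": "dict",
        "note": ", ".join(sorted({note for _, note in estimates})) or "empty",
    }


class FakeArray:
    """Hypothetical stand-in for an array-like object exposing .nbytes."""

    nbytes = 800


info = describe_dict_cache({"raw": b"abcd", "arr": FakeArray()})
print(info)
```

The `note` field matters because the fallback path is shallow: a reader comparing two caches should know whether a small number reflects little data or simply an object whose contents could not be measured.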

`clear_cache(category="all") -> None`

Clears all caches or a specific category. Supported categories:

| Category | Caches cleared |
| --- | --- |
| `"all"` | Every cache on the instance |
| `"base"` | `_cache_releases`, `_cache_sample_sets`, `_cache_files`, etc. |
| `"sample_metadata"` | `_cache_sample_metadata`, `_cache_cohorts`, `_cache_cohort_geometries` |
| `"genome_features"` | `_cache_genome_features` |
| `"genome_sequence"` | `_cache_genome` |
| `"snp"` | `_cache_snp_sites`, `_cache_snp_genotypes`, `_cache_site_filters`, `_cache_locate_site_class`, `_cached_snp_calls`, etc. |
| `"haplotypes"` | `_cache_haplotypes`, `_cache_haplotype_sites` |
| `"cnv"` | `_cache_cnv_hmm`, `_cache_cnv_coverage_calls`, `_cache_cnv_discordant_read_calls` |
| `"aim"` | `_cache_aim_variants` |

An unknown category raises ValueError listing valid options.

Caches repopulate on demand after clearing, so clear_cache() is always safe to call mid-session.
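
The "repopulate on demand" behaviour can be illustrated with a toy class. `ToyResource` and its `load_count` counter are hypothetical and stand in for the real resource classes; the point is only that a cleared cache costs a reload on next access and never changes the result.

```python
class ToyResource:
    """Hypothetical stand-in for a resource class; not the library's code."""

    def __init__(self):
        self._cache_haplotypes = {}
        self.load_count = 0  # counts simulated expensive loads

    def haplotypes(self, region):
        if region not in self._cache_haplotypes:
            # Cache miss: simulate loading the data from storage.
            self.load_count += 1
            self._cache_haplotypes[region] = f"data for {region}"
        return self._cache_haplotypes[region]

    def clear_cache(self):
        self._cache_haplotypes.clear()


res = ToyResource()
first = res.haplotypes("2R")   # miss: loads
res.haplotypes("2R")           # hit: no new load
res.clear_cache()
second = res.haplotypes("2R")  # miss again: reloaded on demand
print(first == second, res.load_count)
```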

Design decisions

Placed on AnophelesBase: All subclasses inherit automatically via MRO. No per-class boilerplate needed.
Category-to-attribute mapping via _CACHE_CATEGORIES class dict: Explicit and easy to extend when new caches are added. Uses getattr(..., None) so categories gracefully skip attributes that don't exist on a given subclass.
Handles all cache types: plain dict, OrderedDict, lru_cache wrappers, and single-value Optional caches (e.g., zarr groups).
Consistent with existing patterns: follows the clear_extra_metadata() precedent and uses @doc() numpydoc decorators matching existing style.
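
A minimal sketch of how these decisions might fit together is shown below. It is not the implementation in this PR: `ToyBase`, `_clear_attr`, the two-category mapping, and the attribute `_cache_not_on_this_class` are assumptions made for illustration. It shows the `getattr(..., None)` skip and one way to treat the three cache kinds differently.

```python
from collections import OrderedDict
from functools import lru_cache


class ToyBase:
    # Hypothetical mapping; the real _CACHE_CATEGORIES covers more attributes.
    _CACHE_CATEGORIES = {
        "genome_sequence": ("_cache_genome",),
        "snp": ("_cache_snp_sites", "_cached_snp_calls", "_cache_not_on_this_class"),
    }

    def __init__(self):
        self._cache_genome = object()  # single-value Optional cache
        self._cache_snp_sites = OrderedDict(a=1)
        self._cached_snp_calls = lru_cache(maxsize=2)(lambda region: region.upper())

    def _clear_attr(self, name):
        cache = getattr(self, name, None)
        if cache is None:
            return  # attribute missing on this subclass, or already empty
        if hasattr(cache, "cache_clear"):
            cache.cache_clear()  # functools.lru_cache wrapper
        elif isinstance(cache, dict):
            cache.clear()  # plain dict or OrderedDict
        else:
            setattr(self, name, None)  # single-value cache: drop the reference

    def clear_cache(self, category="all"):
        if category == "all":
            names = [n for group in self._CACHE_CATEGORIES.values() for n in group]
        elif category in self._CACHE_CATEGORIES:
            names = self._CACHE_CATEGORIES[category]
        else:
            valid = ["all", *self._CACHE_CATEGORIES]
            raise ValueError(f"Unknown category {category!r}; valid options: {valid}")
        for name in names:
            self._clear_attr(name)


obj = ToyBase()
obj._cached_snp_calls("2r")
obj.clear_cache("snp")  # skips the attribute that does not exist
```

Keeping the mapping as data rather than as per-category methods means adding a new cache is a one-line change, at the cost of the mapping silently going stale if an attribute is renamed.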

Usage

import malariagen_data

ag3 = malariagen_data.Ag3()

# Run analysis across multiple contigs...
for contig in ["2R", "2L", "3R", "3L", "X"]:
    ag3.haplotypes(region=contig, sample_sets="AG1000G-BF-A", analysis="gamb_colu")

# Inspect cache state:
ag3.cache_info()
# {'_cache_haplotypes': {'entries': 5, 'nbytes': 487200000, 'kind': 'dict', 'note': 'xarray.nbytes'}, ...}

# Free memory when switching analyses:
ag3.clear_cache("haplotypes")

# Or clear everything:
ag3.clear_cache()
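
In a long-running session, the two methods could be combined into a simple memory budget check. The helpers below are hypothetical (they are not part of this PR) and assume only the documented shapes: `cache_info()` returns a dict of entries with an `nbytes` key, and `clear_cache()` accepts a category string.

```python
def total_cached_bytes(info):
    """Sum the best-effort 'nbytes' estimates across all caches."""
    return sum(entry["nbytes"] for entry in info.values())


def clear_if_over_budget(resource, budget_bytes, category="all"):
    """Clear caches when the estimated total exceeds budget_bytes.

    Returns True if a clear was triggered, False otherwise.
    """
    if total_cached_bytes(resource.cache_info()) > budget_bytes:
        resource.clear_cache(category)
        return True
    return False
```

For example, calling `clear_if_over_budget(ag3, 2 * 1024**3)` at the end of each loop iteration would release the caches once the estimate passes 2 GiB. Because the sizes are estimates, the budget is best treated as a rough threshold rather than a hard limit.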

Test plan

Add two public methods to AnophelesBase so that all resource classes (Ag3, Af1, Adir1, Amin1, Adar1, and Plasmodium resources) inherit a stable, documented way to inspect and release in-memory caches. This addresses silent memory pressure in long-running Jupyter/Colab sessions where cached haplotype, CNV, and SNP datasets accumulate without any public mechanism to reclaim the memory.

- cache_info() returns a dict keyed by cache attribute name with entry count, estimated byte size, cache kind, and a note on the estimation method used (xarray.nbytes, numpy.nbytes, dask upper bound, bytes length, or sys.getsizeof shallow).
- clear_cache(category="all") clears all or a specific category of caches. Supported categories: all, base, sample_metadata, genome_features, genome_sequence, snp, haplotypes, cnv, aim. Unknown categories raise ValueError listing valid options. Caches repopulate on demand after clearing, so calling clear_cache() is always safe mid-session.

Closes malariagen#1289

codecov · 2026-04-14T15:15:38Z

Codecov Report

❌ Patch coverage is 93.22034% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.87%. Comparing base (c2c4f73) to head (430e972).
⚠️ Report is 50 commits behind head on master.

Files with missing lines	Patch %	Lines
malariagen_data/anoph/base.py	93.22%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1295      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%     
==========================================
  Files          55       55              
  Lines        6257     6371     +114     
==========================================
+ Hits         5564     5662      +98     
- Misses        693      709      +16

khushthecoder · 2026-04-17T11:56:01Z

Hi @jonbrenas, just a gentle reminder: whenever you get a chance to review the PR, I would really appreciate your feedback.

khushthecoder added 3 commits April 13, 2026 23:25

style: apply ruff-format to test_cache_api.py

56365e0

style: fix ruff-format for CI (match ruff v0.1.8 assert formatting)

053dfe9

khushthecoder mentioned this pull request Apr 14, 2026

No Public API to Clear or Inspect In-Memory Caches — Users Cannot Reclaim Memory in Long-Running Sessions #1289

Open

ci: retrigger CI for flaky test

c244222

khushthecoder force-pushed the fix/issue-1289-public-cache-api branch from d6faa20 to c244222 Compare April 14, 2026 14:33

Merge branch 'master' into fix/issue-1289-public-cache-api

430e972

test: add coverage for lru_cache and single-value cache branches

a2a7ac4

Cover the previously untested lru_cache and "other" (single-value) code paths in cache_info() and clear_cache() to fix the codecov/project coverage drop.

khushthecoder force-pushed the fix/issue-1289-public-cache-api branch from 6318247 to a2a7ac4 Compare April 14, 2026 15:45

khushthecoder added 3 commits April 14, 2026 21:15

Merge branch 'master' into fix/issue-1289-public-cache-api

81fde8d

Merge branch 'master' into fix/issue-1289-public-cache-api

3215df4

Merge branch 'master' into fix/issue-1289-public-cache-api

c74052f

Merge branch 'master' into fix/issue-1289-public-cache-api

e0a5adb

khushthecoder wants to merge 10 commits into malariagen:master from khushthecoder:fix/issue-1289-public-cache-api
