feat: add public cache_info() and clear_cache() API (fixes #1289) #1295

Open
khushthecoder wants to merge 10 commits into malariagen:master from khushthecoder:fix/issue-1289-public-cache-api
@khushthecoder
Contributor

Fix: #1289

Summary

Adds two public methods to AnophelesBase so that all resource classes (Ag3, Af1, Adir1, Amin1, Adar1, and all Plasmodium resources) inherit a stable, documented API for inspecting and releasing in-memory caches.

This directly addresses silent memory pressure in long-running Jupyter / Google Colab sessions, where cached haplotype, CNV, and SNP datasets accumulate across multi-contig or multi-sample-set analyses — often consuming multiple GB — with no public mechanism to reclaim that memory.

Problem

The library maintains 14+ separate unbounded dict caches across 7 modules (plus lru_cache and OrderedDict-based caches in snp_data). Each unique combination of region, sample_set, analysis, etc. creates a new entry that persists for the object's lifetime. There is:

  • No eviction policy on the largest caches (_cache_haplotypes, _cache_cnv_hmm, etc.)
  • No public method to clear caches
  • No way to inspect how much memory caches consume

The only existing clear_* method is clear_extra_metadata(), which clears user-added metadata, not internal data caches. Users who hit OOM must either reach into private _cache_* attributes (fragile, undocumented) or destroy and recreate the API object (slow, and it re-authenticates with GCS).

Solution

cache_info() -> dict

Returns a dictionary keyed by cache attribute name. Each entry includes:

| Key | Type | Description |
| --- | --- | --- |
| entries | int | Number of cached items |
| nbytes | int | Best-effort deep size estimate, in bytes |
| kind | str | "dict", "lru_cache", or "other" |
| note | str | Estimation method (xarray.nbytes, numpy.nbytes, dask upper bound, bytes length, sys.getsizeof shallow, etc.) |
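As a rough illustration of how a best-effort size estimate and its note could be produced, here is a minimal sketch. It is not the code in this PR: `estimate_nbytes` and `describe_dict_cache` are hypothetical names, and the sketch relies only on duck typing (an integer `nbytes` attribute, as numpy arrays and xarray objects expose) rather than importing those libraries. The dask upper-bound case is not modeled.

```python
import sys
from typing import Any, Dict, Tuple


def estimate_nbytes(value: Any) -> Tuple[int, str]:
    """Return (estimated size in bytes, note naming the method used)."""
    # Raw byte strings: the length is the payload size.
    if isinstance(value, (bytes, bytearray)):
        return len(value), "bytes length"
    # numpy arrays and xarray objects expose an integer ``nbytes`` attribute.
    nbytes = getattr(value, "nbytes", None)
    if isinstance(nbytes, int):
        module = type(value).__module__.split(".")[0]
        return nbytes, f"{module}.nbytes"
    # Fallback: shallow size only, which undercounts nested containers.
    return sys.getsizeof(value), "sys.getsizeof shallow"


def describe_dict_cache(cache: Dict[Any, Any]) -> Dict[str, Any]:
    """Build one cache_info()-style entry for a plain dict cache."""
    sizes = [estimate_nbytes(v) for v in cache.values()]
    notes = sorted({note for _, note in sizes})
    return {
        "entries": len(cache),
        "nbytes": sum(n for n, _ in sizes),
        "kind": "dict",
        "note": ", ".join(notes),
    }
```

The shallow fallback is the reason the estimate is described as best-effort: a cached Python list of large objects would be reported at only the size of the list itself.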

clear_cache(category="all") -> None

Clears all caches or a specific category. Supported categories:

| Category | Caches cleared |
| --- | --- |
| "all" | Every cache on the instance |
| "base" | _cache_releases, _cache_sample_sets, _cache_files, etc. |
| "sample_metadata" | _cache_sample_metadata, _cache_cohorts, _cache_cohort_geometries |
| "genome_features" | _cache_genome_features |
| "genome_sequence" | _cache_genome |
| "snp" | _cache_snp_sites, _cache_snp_genotypes, _cache_site_filters, _cache_locate_site_class, _cached_snp_calls, etc. |
| "haplotypes" | _cache_haplotypes, _cache_haplotype_sites |
| "cnv" | _cache_cnv_hmm, _cache_cnv_coverage_calls, _cache_cnv_discordant_read_calls |
| "aim" | _cache_aim_variants |

An unknown category raises ValueError listing valid options.

Caches repopulate on demand after clearing, so clear_cache() is always safe to call mid-session.
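The category dispatch described above can be sketched roughly as follows. This is a toy stand-in, not the implementation in this PR: `CacheOwner` is a hypothetical class, its `_CACHE_CATEGORIES` mapping is abbreviated, and only plain dict caches are handled.

```python
class CacheOwner:
    """Toy stand-in for an object holding several private dict caches."""

    # Abbreviated, illustrative mapping from category to attribute names.
    _CACHE_CATEGORIES = {
        "base": ("_cache_releases",),
        "haplotypes": ("_cache_haplotypes", "_cache_haplotype_sites"),
        "cnv": ("_cache_cnv_hmm",),
    }

    def __init__(self):
        self._cache_releases = {}
        self._cache_haplotypes = {}
        self._cache_haplotype_sites = {}
        # _cache_cnv_hmm is deliberately absent to show the getattr fallback.

    def clear_cache(self, category="all"):
        if category == "all":
            names = [n for group in self._CACHE_CATEGORIES.values() for n in group]
        elif category in self._CACHE_CATEGORIES:
            names = self._CACHE_CATEGORIES[category]
        else:
            valid = ", ".join(["all", *sorted(self._CACHE_CATEGORIES)])
            raise ValueError(
                f"Unknown cache category {category!r}; valid options: {valid}"
            )
        for name in names:
            # Skip attributes that do not exist on this object.
            cache = getattr(self, name, None)
            if isinstance(cache, dict):
                cache.clear()
```

Clearing in place with `dict.clear()` keeps any existing references to the cache object valid, which is one reason repopulation on demand continues to work after a clear.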

Design decisions

  • Placed on AnophelesBase: All subclasses inherit automatically via MRO. No per-class boilerplate needed.
  • Category-to-attribute mapping via _CACHE_CATEGORIES class dict: Explicit and easy to extend when new caches are added. Uses getattr(..., None) so categories gracefully skip attributes that don't exist on a given subclass.
  • Handles all cache types: plain dict, OrderedDict, lru_cache wrappers, and single-value Optional caches (e.g., zarr groups).
  • Consistent with existing patterns: follows the clear_extra_metadata() precedent and uses @doc() numpydoc decorators matching existing style.

Usage

import malariagen_data

ag3 = malariagen_data.Ag3()

# Run analysis across multiple contigs...
for contig in ["2R", "2L", "3R", "3L", "X"]:
    ag3.haplotypes(region=contig, sample_sets="AG1000G-BF-A", analysis="gamb_colu")

# Inspect cache state:
ag3.cache_info()
# {'_cache_haplotypes': {'entries': 5, 'nbytes': 487200000, 'kind': 'dict', 'note': 'xarray.nbytes'}, ...}

# Free memory when switching analyses:
ag3.clear_cache("haplotypes")

# Or clear everything:
ag3.clear_cache()
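Because cache_info() returns a plain dictionary, it is easy to post-process. The helper below is a hypothetical convenience, not part of this PR, that summarizes a result of the shape shown above. The second entry in the sample data is made up for illustration.

```python
def summarize_cache_info(info, top=3):
    """Return human-readable lines for the largest caches in a cache_info() dict."""
    total = sum(entry["nbytes"] for entry in info.values())
    largest = sorted(info.items(), key=lambda kv: kv[1]["nbytes"], reverse=True)[:top]
    lines = [f"total: {total / 1e6:.1f} MB across {len(info)} caches"]
    for name, entry in largest:
        lines.append(
            f"{name}: {entry['nbytes'] / 1e6:.1f} MB "
            f"in {entry['entries']} entries ({entry['note']})"
        )
    return lines


# Sample input shaped like the cache_info() output above (values illustrative).
sample = {
    "_cache_haplotypes": {
        "entries": 5, "nbytes": 487200000, "kind": "dict", "note": "xarray.nbytes",
    },
    "_cache_genome_features": {
        "entries": 1, "nbytes": 1200000, "kind": "dict", "note": "sys.getsizeof shallow",
    },
}
summary = summarize_cache_info(sample)
```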

Test plan

  • test_cache_info_returns_dict — verifies output structure and types for all entries
  • test_cache_info_after_population — verifies entry counts increase after populating caches
  • test_clear_cache_all — verifies all dict caches are emptied
  • test_clear_cache_specific_category — verifies only the targeted category is cleared
  • test_clear_cache_invalid_category — verifies ValueError with helpful message
  • test_clear_cache_repopulates_on_demand — verifies caches repopulate after clearing
  • test_cache_info_size_estimation — unit tests for numpy, xarray, bytes, and fallback size estimation
  • test_clear_cache_direct_dict_manipulation — end-to-end test with directly populated caches verifying selective and full clearing
  • All 14 new tests pass; all 37 existing test_base.py tests pass (no regressions)
  • ruff check passes clean
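For readers who want a feel for the repopulation test, here is a self-contained sketch of the idea against a toy class. `ToyResource` and its `loads` counter are hypothetical and stand in for the real fixtures used in the PR's test suite.

```python
class ToyResource:
    """Toy object with one memoized loader backed by a dict cache."""

    def __init__(self):
        self._cache_haplotypes = {}
        self.loads = 0  # counts how often the "expensive" load actually runs

    def haplotypes(self, region):
        if region not in self._cache_haplotypes:
            self.loads += 1
            self._cache_haplotypes[region] = f"data for {region}"
        return self._cache_haplotypes[region]

    def clear_cache(self):
        self._cache_haplotypes.clear()


def test_clear_cache_repopulates_on_demand():
    res = ToyResource()
    first = res.haplotypes("2R")
    res.haplotypes("2R")
    assert res.loads == 1  # second call was served from the cache

    res.clear_cache()
    assert res._cache_haplotypes == {}

    # The next access reloads the data and returns an equal result.
    assert res.haplotypes("2R") == first
    assert res.loads == 2
```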

Add two public methods to AnophelesBase so that all resource classes
(Ag3, Af1, Adir1, Amin1, Adar1, and Plasmodium resources) inherit
a stable, documented way to inspect and release in-memory caches.

This addresses silent memory pressure in long-running Jupyter/Colab
sessions where cached haplotype, CNV, and SNP datasets accumulate
without any public mechanism to reclaim the memory.

- cache_info() returns a dict keyed by cache attribute name with
  entry count, estimated byte size, cache kind, and a note on the
  estimation method used (xarray.nbytes, numpy.nbytes, dask upper
  bound, bytes length, or sys.getsizeof shallow).

- clear_cache(category="all") clears all or a specific category of
  caches. Supported categories: all, base, sample_metadata,
  genome_features, genome_sequence, snp, haplotypes, cnv, aim.
  Unknown categories raise ValueError listing valid options.

Caches repopulate on demand after clearing, so calling clear_cache()
is always safe mid-session.

Closes malariagen#1289
khushthecoder force-pushed the fix/issue-1289-public-cache-api branch from d6faa20 to c244222 on April 14, 2026 14:33

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 93.22034% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.87%. Comparing base (c2c4f73) to head (430e972).
⚠️ Report is 50 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| malariagen_data/anoph/base.py | 93.22% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1295      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%     
==========================================
  Files          55       55              
  Lines        6257     6371     +114     
==========================================
+ Hits         5564     5662      +98     
- Misses        693      709      +16     

Cover the previously untested lru_cache and "other" (single-value)
code paths in cache_info() and clear_cache() to fix the codecov/project
coverage drop.
khushthecoder force-pushed the fix/issue-1289-public-cache-api branch from 6318247 to a2a7ac4 on April 14, 2026 15:45
@khushthecoder
Contributor Author

Hi @jonbrenas, just a gentle reminder: whenever you get a chance to review the PR, I would really appreciate your feedback.

Development

Successfully merging this pull request may close these issues.

No Public API to Clear or Inspect In-Memory Caches — Users Cannot Reclaim Memory in Long-Running Sessions