feat: add public cache_info() and clear_cache() API (fixes #1289) #1295
Open
khushthecoder wants to merge 10 commits into malariagen:master
Add two public methods to AnophelesBase so that all resource classes (Ag3, Af1, Adir1, Amin1, Adar1, and Plasmodium resources) inherit a stable, documented way to inspect and release in-memory caches. This addresses silent memory pressure in long-running Jupyter/Colab sessions where cached haplotype, CNV, and SNP datasets accumulate without any public mechanism to reclaim the memory.

- cache_info() returns a dict keyed by cache attribute name with entry count, estimated byte size, cache kind, and a note on the estimation method used (xarray.nbytes, numpy.nbytes, dask upper bound, bytes length, or sys.getsizeof shallow).
- clear_cache(category="all") clears all or a specific category of caches. Supported categories: all, base, sample_metadata, genome_features, genome_sequence, snp, haplotypes, cnv, aim. Unknown categories raise ValueError listing valid options.

Caches repopulate on demand after clearing, so calling clear_cache() is always safe mid-session.

Closes malariagen#1289
d6faa20 to c244222
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1295      +/-   ##
==========================================
- Coverage   88.92%   88.87%   -0.06%
==========================================
  Files          55       55
  Lines        6257     6371     +114
==========================================
+ Hits         5564     5662      +98
- Misses        693      709      +16
Cover the previously untested lru_cache and "other" (single-value) code paths in cache_info() and clear_cache() to fix the codecov/project coverage drop.
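
As a rough illustration of the two code paths this commit mentions, the sketch below uses only the standard library. `load_dataset`, `describe_lru`, and `describe_single` are hypothetical names chosen for the example and are not taken from the PR. The one real API relied on is `functools.lru_cache`, whose wrappers expose `cache_info()` and `cache_clear()`.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_dataset(contig: str) -> bytes:
    # Hypothetical stand-in for an expensive, cached data access call.
    return contig.encode() * 1000


def describe_lru(func) -> dict:
    # lru_cache wrappers report their current entry count as `currsize`.
    info = func.cache_info()
    return {"entries": info.currsize, "kind": "lru_cache"}


def describe_single(value) -> dict:
    # "Other" path (assumed shape): a single optional value counts as 0 or 1 entries.
    return {"entries": 0 if value is None else 1, "kind": "other"}


load_dataset("2L")
load_dataset("3R")
load_dataset("2L")  # served from the cache, so no new entry is created
before = describe_lru(load_dataset)

load_dataset.cache_clear()  # standard way to empty an lru_cache wrapper
after = describe_lru(load_dataset)

single_empty = describe_single(None)
single_set = describe_single(object())
```

A test for these paths would then populate the cache, check the reported entry count, clear it, and check that the count drops to zero.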
6318247 to a2a7ac4
Hi @jonbrenas, just a gentle reminder: whenever you get a chance to review the PR, I would really appreciate your feedback.
Fix: #1289
Summary
Adds two public methods to `AnophelesBase` so that all resource classes (`Ag3`, `Af1`, `Adir1`, `Amin1`, `Adar1`, and all Plasmodium resources) inherit a stable, documented API for inspecting and releasing in-memory caches.

This directly addresses silent memory pressure in long-running Jupyter / Google Colab sessions, where cached haplotype, CNV, and SNP datasets accumulate across multi-contig or multi-sample-set analyses, often consuming multiple GB, with no public mechanism to reclaim that memory.
Problem
The library maintains 14+ separate unbounded `dict` caches across 7 modules (plus `lru_cache` and `OrderedDict`-based caches in `snp_data`). Each unique combination of `region`, `sample_set`, `analysis`, etc. creates a new entry that persists for the object's lifetime. There is no public mechanism to inspect or release these caches (`_cache_haplotypes`, `_cache_cnv_hmm`, etc.).

The only existing `clear_*` method is `clear_extra_metadata()`, which clears user-added metadata, not internal data caches. Users who hit OOM must either reach into private `_cache_*` attributes (fragile, undocumented) or destroy and recreate the API object (slow, re-authenticates with GCS).

Solution
`cache_info() -> dict`

Returns a dictionary keyed by cache attribute name. Each entry includes:

| Key | Type | Description |
| --- | --- | --- |
| `entries` | `int` | Entry count |
| `nbytes` | `int` | Estimated byte size |
| `kind` | `str` | Cache kind: `"dict"`, `"lru_cache"`, or `"other"` |
| `note` | `str` | Estimation method used (`xarray.nbytes`, `numpy.nbytes`, `dask upper bound`, `bytes length`, `sys.getsizeof shallow`, etc.) |

`clear_cache(category="all") -> None`

Clears all caches or a specific category. Supported categories:

| Category | Caches cleared |
| --- | --- |
| `"all"` | All caches |
| `"base"` | `_cache_releases`, `_cache_sample_sets`, `_cache_files`, etc. |
| `"sample_metadata"` | `_cache_sample_metadata`, `_cache_cohorts`, `_cache_cohort_geometries` |
| `"genome_features"` | `_cache_genome_features` |
| `"genome_sequence"` | `_cache_genome` |
| `"snp"` | `_cache_snp_sites`, `_cache_snp_genotypes`, `_cache_site_filters`, `_cache_locate_site_class`, `_cached_snp_calls`, etc. |
| `"haplotypes"` | `_cache_haplotypes`, `_cache_haplotype_sites` |
| `"cnv"` | `_cache_cnv_hmm`, `_cache_cnv_coverage_calls`, `_cache_cnv_discordant_read_calls` |
| `"aim"` | `_cache_aim_variants` |

An unknown category raises `ValueError` listing valid options.

Caches repopulate on demand after clearing, so `clear_cache()` is always safe to call mid-session.

Design decisions
- Methods live on `AnophelesBase`: All subclasses inherit automatically via MRO. No per-class boilerplate needed.
- `_CACHE_CATEGORIES` class dict: Explicit and easy to extend when new caches are added. Uses `getattr(..., None)` so categories gracefully skip attributes that don't exist on a given subclass.
- Handles `dict`, `OrderedDict`, `lru_cache` wrappers, and single-value `Optional` caches (e.g., zarr groups).
- Follows the `clear_extra_metadata()` precedent and uses `@doc()` numpydoc decorators matching existing style.

Usage
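
The call pattern described in this PR can be sketched with a small, self-contained stand-in. `CacheDemo`, `_iter_caches`, and the `"sample_set_a"` key are hypothetical and exist only for this example; this is not the PR's implementation and it omits the `nbytes` and `note` fields. It only shows a category mapping, the `getattr(..., None)` skip, and the `ValueError` for unknown categories.

```python
class CacheDemo:
    """Hypothetical stand-in for a resource class; not the PR's implementation."""

    # Category -> cache attribute names (a subset of the mapping described above).
    _CACHE_CATEGORIES = {
        "haplotypes": ("_cache_haplotypes", "_cache_haplotype_sites"),
        "cnv": ("_cache_cnv_hmm",),
    }

    def __init__(self):
        self._cache_haplotypes = {}
        self._cache_haplotype_sites = {}
        # _cache_cnv_hmm is deliberately absent to show the getattr(..., None) skip.

    def _iter_caches(self, category="all"):
        valid = ["all", *self._CACHE_CATEGORIES]
        if category not in valid:
            raise ValueError(f"Unknown category {category!r}; valid options: {valid}")
        if category == "all":
            names = [n for group in self._CACHE_CATEGORIES.values() for n in group]
        else:
            names = self._CACHE_CATEGORIES[category]
        for name in names:
            cache = getattr(self, name, None)  # skip attributes this class lacks
            if cache is not None:
                yield name, cache

    def cache_info(self):
        return {
            name: {"entries": len(cache), "kind": "dict"}
            for name, cache in self._iter_caches()
        }

    def clear_cache(self, category="all"):
        for _, cache in self._iter_caches(category):
            cache.clear()


api = CacheDemo()
api._cache_haplotypes[("2L", "sample_set_a")] = b"x" * 10  # simulate a cached result
api._cache_haplotype_sites["2L"] = b"y" * 10

info_before = api.cache_info()
api.clear_cache("haplotypes")  # clear one category
info_after = api.cache_info()

try:
    api.clear_cache("bogus")
except ValueError as exc:
    error_message = str(exc)
```

With a real resource object, the same three calls would apply: `cache_info()` to see what is held, `clear_cache("haplotypes")` to release one category, and `clear_cache()` to release everything.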
Test plan
- `test_cache_info_returns_dict`: verifies output structure and types for all entries
- `test_cache_info_after_population`: verifies entry counts increase after populating caches
- `test_clear_cache_all`: verifies all dict caches are emptied
- `test_clear_cache_specific_category`: verifies only the targeted category is cleared
- `test_clear_cache_invalid_category`: verifies `ValueError` with helpful message
- `test_clear_cache_repopulates_on_demand`: verifies caches repopulate after clearing
- `test_cache_info_size_estimation`: unit tests for numpy, xarray, bytes, and fallback size estimation
- `test_clear_cache_direct_dict_manipulation`: end-to-end test with directly populated caches verifying selective and full clearing
- `test_base.py` tests pass (no regressions)
- `ruff check` passes clean
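
The size-estimation paths named in `test_cache_info_size_estimation` might be exercised roughly as follows. `estimate_nbytes` is a hypothetical helper written for this sketch, not the function under test in the PR. It assumes that array-like values expose an integer `nbytes` attribute, as numpy arrays and xarray objects do; a standard-library `memoryview` stands in for them here so the example has no third-party dependencies.

```python
import sys


def estimate_nbytes(value):
    """Return (nbytes, note) for a cached value. Illustrative only."""
    if isinstance(value, (bytes, bytearray)):
        return len(value), "bytes length"
    nbytes = getattr(value, "nbytes", None)
    if isinstance(nbytes, int):
        # Duck-typed path: numpy / xarray style objects report their own size.
        return nbytes, "nbytes attribute"
    # Fallback: shallow size only, nested contents are not counted.
    return sys.getsizeof(value), "sys.getsizeof shallow"


raw = b"ACGT" * 256                 # 1024 bytes
view = memoryview(bytearray(2048))  # stdlib object that exposes .nbytes
results = [
    estimate_nbytes(raw),
    estimate_nbytes(view),
    estimate_nbytes({"a": 1}),
]
```

The fallback figure is a lower bound for containers, which is why the description attaches a `note` to each estimate.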