Description
Currently, AnophelesSnpData (and related classes via sample_metadata.py) allows users to add custom metadata using add_extra_metadata(). However, the current implementation has a few UX and state-management pitfalls that can lead to silent errors, particularly in notebook environments.
The Problem
1. Silent Accumulation in Notebooks
add_extra_metadata() simply appends the provided DataFrame to the self._extra_metadata list. If a user re-runs a Jupyter notebook cell containing this call, the same extra metadata is appended multiple times. When sample_metadata() is called later, the repeated .merge() operations will result in duplicated columns (e.g., feature_x, feature_y), silently breaking downstream queries referencing the original column name.
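The suffixing can be reproduced with pandas alone. The sketch below uses a stand-in for the sample metadata table (the frames and the column name `feature` are illustrative, not taken from the library) and applies the same left merge twice, as a re-run notebook cell would:

```python
import pandas as pd

# Stand-in for the base sample metadata (illustrative values only).
df_samples = pd.DataFrame({"sample_id": ["S1", "S2", "S3"]})

# Custom metadata that ends up registered twice after a cell re-run.
df_extra = pd.DataFrame({"sample_id": ["S1", "S2"], "feature": [True, False]})
registered = [df_extra, df_extra]

# Merge each registered frame in turn, as a simple append-then-merge
# implementation would do.
df_merged = df_samples
for df in registered:
    df_merged = df_merged.merge(df, how="left", on="sample_id")

merged_columns = list(df_merged.columns)
print(merged_columns)  # ['sample_id', 'feature_x', 'feature_y']
```

After the second merge the original `feature` column no longer exists, so any query or column lookup that refers to it will fail or silently miss.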
2. No Visibility
There is currently no list_extra_metadata() or has_extra_metadata() method. Users have no way to inspect the internal API state to see what extra metadata is currently active.
3. Stale Cache Risk
sample_metadata() merges the extra metadata after the initial _cache_sample_metadata hit, and merge() returns a new DataFrame, so the cached frames should normally stay clean. However, clear_extra_metadata() only resets self._extra_metadata = [], which leaves a theoretical risk if a cached DataFrame were ever mutated in place by an edge-case operation.
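As a quick check of the claim that merge() leaves its inputs alone, here is a minimal sketch using a hypothetical dict-based cache (the real cache structure is an assumption here):

```python
import pandas as pd

# Hypothetical cache keyed by sample set; not the library's actual structure.
cache = {"set-a": pd.DataFrame({"sample_id": ["S1", "S2"]})}
df_extra = pd.DataFrame({"sample_id": ["S1"], "trait": [1]})

cached_before = cache["set-a"].copy()
df_out = cache["set-a"].merge(df_extra, how="left", on="sample_id")

# merge() returns a new object; the cached frame keeps its original columns.
cache_untouched = cache["set-a"].equals(cached_before)
print(cache_untouched, list(df_out.columns))  # True ['sample_id', 'trait']
```

The risk described above therefore only arises if some other code path modifies a cached frame in place.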
Steps to Reproduce the Risk
import malariagen_data
import pandas as pd
ag3 = malariagen_data.Ag3()
# User creates some custom metadata
df_custom = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "my_custom_trait": [True, False],
})
# User accidentally re-runs this notebook cell twice
ag3.add_extra_metadata(df_custom, on="sample_id")
ag3.add_extra_metadata(df_custom, on="sample_id")
# No way to check what is active:
# ag3.list_extra_metadata() <-- DOES NOT EXIST
# The resulting DataFrame now has 'my_custom_trait_x' and 'my_custom_trait_y'
df_samples = ag3.sample_metadata(sample_sets="AG1000G-AO")
print([c for c in df_samples.columns if "my_custom_trait" in c])
# Output: ['my_custom_trait_x', 'my_custom_trait_y']
Proposed Solution
1. Add a list_extra_metadata() method
Add a list_extra_metadata() method (or a .extra_metadata_info property) that returns the names of the columns currently active in the extra metadata list, or the shapes of the registered extra DataFrames, so users can inspect current state at any time.
ag3.list_extra_metadata()
# e.g. returns: [{'columns': ['my_custom_trait'], 'shape': (2, 2)}]
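A minimal sketch of how such a method might look. `ExtraMetadataRegistry` is a hypothetical stand-in for the relevant part of the API class, and storing each entry as a `(on, DataFrame)` pair is an assumption about the internal layout:

```python
import pandas as pd


class ExtraMetadataRegistry:
    """Hypothetical stand-in for the relevant part of the API class."""

    def __init__(self):
        # Assumption: each entry is a (join key, DataFrame) pair.
        self._extra_metadata = []

    def add_extra_metadata(self, data, on="sample_id"):
        self._extra_metadata.append((on, data.copy()))

    def list_extra_metadata(self):
        # Report the non-key columns and the shape of each registered frame.
        return [
            {"columns": [c for c in df.columns if c != on], "shape": df.shape}
            for on, df in self._extra_metadata
        ]


registry = ExtraMetadataRegistry()
registry.add_extra_metadata(
    pd.DataFrame(
        {"sample_id": ["AR0001-C", "AR0002-C"], "my_custom_trait": [True, False]}
    )
)
info = registry.list_extra_metadata()
print(info)  # [{'columns': ['my_custom_trait'], 'shape': (2, 2)}]
```

Returning plain dicts keeps the output easy to read in a notebook; a small DataFrame summary would work equally well.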
2. Improve add_extra_metadata() safety
Check if the columns being added already exist in the currently registered extra metadata. Either warn the user or automatically overwrite instead of blindly appending:
# Option A: Warn on duplicate columns
import warnings

warnings.warn(
    "Column 'my_custom_trait' is already registered in extra metadata. "
    "Call clear_extra_metadata() first, or use overwrite=True.",
    UserWarning,
    stacklevel=2,
)
# Option B: Support an explicit overwrite flag
ag3.add_extra_metadata(df_custom, on="sample_id", overwrite=True)
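The two options can be combined. The sketch below is one possible shape, not the library's implementation: `ExtraMetadataRegistry` is a hypothetical stand-in, the `(on, DataFrame)` storage layout is assumed, and warning and then skipping the duplicate registration is just one of the behaviours the proposal allows.

```python
import warnings

import pandas as pd


class ExtraMetadataRegistry:
    """Hypothetical stand-in; not the library's actual class."""

    def __init__(self):
        self._extra_metadata = []  # assumed: list of (on, DataFrame) pairs

    def _registered_columns(self):
        return {c for on, df in self._extra_metadata for c in df.columns if c != on}

    def add_extra_metadata(self, data, on="sample_id", overwrite=False):
        new_cols = [c for c in data.columns if c != on]
        clashes = sorted(self._registered_columns().intersection(new_cols))
        if clashes and not overwrite:
            warnings.warn(
                f"Columns {clashes} are already registered in extra metadata; "
                "ignoring this call. Call clear_extra_metadata() first, "
                "or use overwrite=True.",
                UserWarning,
                stacklevel=2,
            )
            return
        if clashes:
            # Remove clashing columns from earlier entries and drop any
            # entry that is left with nothing but its join key.
            kept = []
            for prev_on, prev_df in self._extra_metadata:
                to_drop = [c for c in clashes if c in prev_df.columns and c != prev_on]
                prev_df = prev_df.drop(columns=to_drop)
                if len(prev_df.columns) > 1:
                    kept.append((prev_on, prev_df))
            self._extra_metadata = kept
        self._extra_metadata.append((on, data.copy()))


df_custom = pd.DataFrame(
    {"sample_id": ["AR0001-C", "AR0002-C"], "my_custom_trait": [True, False]}
)
registry = ExtraMetadataRegistry()
registry.add_extra_metadata(df_custom)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    registry.add_extra_metadata(df_custom)  # simulated cell re-run
n_after_rerun = len(registry._extra_metadata)

registry.add_extra_metadata(df_custom, overwrite=True)
n_after_overwrite = len(registry._extra_metadata)
print(len(caught), n_after_rerun, n_after_overwrite)  # 1 1 1
```

With this behaviour, re-running a notebook cell becomes idempotent: the registry holds a single entry however many times the cell executes.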
3. Update clear_extra_metadata()
Have it also clear self._cache_sample_metadata, either unconditionally or through an opt-in flag, to guarantee a pristine state when the user wants to reset their environment:
ag3.clear_extra_metadata(clear_cache=True)
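A minimal sketch of the opt-in variant. The class is a hypothetical stand-in, the attribute names follow the text above, and treating the cache as a plain dict is an assumption:

```python
class ExtraMetadataRegistry:
    """Hypothetical stand-in; attribute names follow the issue text."""

    def __init__(self):
        self._extra_metadata = []
        # Assumption: the sample metadata cache is a dict.
        self._cache_sample_metadata = {}

    def clear_extra_metadata(self, clear_cache=False):
        self._extra_metadata = []
        if clear_cache:
            # Empty the cache so the next call rebuilds metadata from source.
            self._cache_sample_metadata.clear()


registry = ExtraMetadataRegistry()
registry._extra_metadata.append(("sample_id", "placeholder frame"))
registry._cache_sample_metadata["AG1000G-AO"] = "placeholder frame"

registry.clear_extra_metadata()
cache_size_default = len(registry._cache_sample_metadata)

registry.clear_extra_metadata(clear_cache=True)
cache_size_cleared = len(registry._cache_sample_metadata)
print(cache_size_default, cache_size_cleared)  # 1 0
```

Defaulting `clear_cache` to `False` preserves the current behaviour and avoids forcing a reload for users who only want to drop their custom columns.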
Expected Behavior
Users should be able to:
- Inspect their active custom metadata via list_extra_metadata() or an equivalent property.
- Be warned (or blocked) when attempting to register duplicate columns.
- Fully reset API state via clear_extra_metadata() without risk of stale cached data persisting.