
Add list_extra_metadata() and improve clear_extra_metadata() to prevent silent metadata accumulation #1168

@khushthecoder

Description

Currently, AnophelesSnpData (and related classes via sample_metadata.py) allows users to add custom metadata using add_extra_metadata(). However, the current implementation has a few UX and state-management pitfalls that can lead to silent errors, particularly in notebook environments.


The Problem

1. Silent Accumulation in Notebooks

add_extra_metadata() simply appends the provided DataFrame to the self._extra_metadata list. If a user re-runs a Jupyter notebook cell containing this call, the same DataFrame is registered again. When sample_metadata() is called later, each registered DataFrame is merged in turn, and pandas disambiguates the overlapping column names with its default _x and _y suffixes (e.g., after two registrations, feature becomes feature_x and feature_y). No warning is raised at registration time, and downstream queries referencing the original column name break.
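The suffixing can be reproduced with pandas alone, without loading any real data. The following is a minimal sketch: the merge loop only approximates what sample_metadata() is described as doing, and the how="left" argument, the (data, on) tuple layout and the example_col column are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the base sample metadata (values are illustrative only).
df_samples = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "example_col": ["a", "b"],
})

df_custom = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "my_custom_trait": [True, False],
})

# Assumed shape of the internal list after the cell has been run twice.
extra_metadata = [(df_custom, "sample_id"), (df_custom, "sample_id")]

# Merge each registered frame in turn, as the issue describes.
df_merged = df_samples
for data, on in extra_metadata:
    df_merged = df_merged.merge(data, how="left", on=on)

merged_columns = list(df_merged.columns)
print(merged_columns)
```

The second merge finds my_custom_trait on both sides, so pandas applies its default suffixes and the unsuffixed column name disappears from the result.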

2. No Visibility

There is currently no list_extra_metadata() or has_extra_metadata() method. Users have no way to inspect the internal API state to see what extra metadata is currently active.

3. Stale Cache Risk

sample_metadata() merges the extra metadata only after the initial _cache_sample_metadata hit, and merge() returns a new DataFrame, so cached frames are not normally modified. However, clear_extra_metadata() only resets self._extra_metadata = [] and leaves the cache untouched. This leaves a theoretical risk of stale data if a cached DataFrame were ever mutated in place by an edge-case operation.
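The difference between the safe path and the risky path can be illustrated with a plain dict standing in for the cache. This is a sketch only: the dict layout and its key are assumptions, not the library's actual cache structure.

```python
import pandas as pd

# Hypothetical stand-in for the sample metadata cache.
cache = {"AG1000G-AO": pd.DataFrame({"sample_id": ["AR0001-C", "AR0002-C"]})}

df_custom = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "my_custom_trait": [True, False],
})

# Safe path: merge() returns a new DataFrame, the cached one is unchanged.
df_merged = cache["AG1000G-AO"].merge(df_custom, how="left", on="sample_id")
cols_after_merge = list(cache["AG1000G-AO"].columns)

# Risky path: assigning a column on the cached object mutates it in place,
# so resetting the extra metadata list alone would not undo the change.
cached = cache["AG1000G-AO"]
cached["my_custom_trait"] = [True, False]
cols_after_mutation = list(cache["AG1000G-AO"].columns)

print(cols_after_merge)
print(cols_after_mutation)
```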


Steps to Reproduce the Risk

import malariagen_data
import pandas as pd
 
ag3 = malariagen_data.Ag3()
 
# User creates some custom metadata
df_custom = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "my_custom_trait": [True, False]
})
 
# User accidentally re-runs this notebook cell twice
ag3.add_extra_metadata(df_custom, on="sample_id")
ag3.add_extra_metadata(df_custom, on="sample_id")
 
# No way to check what is active:
# ag3.list_extra_metadata()  <-- DOES NOT EXIST
 
# The resulting DataFrame now has 'my_custom_trait_x' and 'my_custom_trait_y'
df_samples = ag3.sample_metadata(sample_sets="AG1000G-AO")
print([c for c in df_samples.columns if "my_custom_trait" in c])
# Output: ['my_custom_trait_x', 'my_custom_trait_y']

Proposed Solution

1. Add a list_extra_metadata() method

Add a list_extra_metadata() method (or a .extra_metadata_info property) that returns the names of the columns currently active in the extra metadata list, or the shapes of the registered extra DataFrames, so users can inspect current state at any time.

ag3.list_extra_metadata()
# e.g. returns: [{'columns': ['my_custom_trait'], 'shape': (2, 2)}]
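One possible shape for this method, sketched on a small stand-in class rather than the real API classes. ExtraMetadataRegistry is a hypothetical name, and storing (data, on) tuples in self._extra_metadata is an assumption about the internal layout.

```python
import pandas as pd


class ExtraMetadataRegistry:
    """Hypothetical stand-in for the part of the API that tracks extra metadata."""

    def __init__(self):
        # Assumed layout: a list of (DataFrame, join column) pairs.
        self._extra_metadata = []

    def add_extra_metadata(self, data, on="sample_id"):
        self._extra_metadata.append((data, on))

    def list_extra_metadata(self):
        # Report the non-key columns and the shape of each registered frame.
        return [
            {
                "columns": [c for c in data.columns if c != on],
                "shape": data.shape,
            }
            for data, on in self._extra_metadata
        ]


registry = ExtraMetadataRegistry()
registry.add_extra_metadata(
    pd.DataFrame({
        "sample_id": ["AR0001-C", "AR0002-C"],
        "my_custom_trait": [True, False],
    }),
    on="sample_id",
)
info = registry.list_extra_metadata()
print(info)
```

Returning plain dicts keeps the output easy to read in a notebook; a DataFrame summary would be an equally reasonable choice.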

2. Improve add_extra_metadata() safety

Check if the columns being added already exist in the currently registered extra metadata. Either warn the user or automatically overwrite instead of blindly appending:

import warnings

# Option A: Warn on duplicate columns
warnings.warn(
    "Column 'my_custom_trait' is already registered in extra metadata. "
    "Call clear_extra_metadata() first, or use overwrite=True."
)
 
# Option B: Support an explicit overwrite flag
ag3.add_extra_metadata(df_custom, on="sample_id", overwrite=True)
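Both options could be combined in a single check. The sketch below uses a free function and a plain list so it runs on its own. register_extra_metadata, its return value and the (data, on) tuple layout are hypothetical, and skipping the registration after warning is one possible reading of Option A.

```python
import warnings

import pandas as pd


def register_extra_metadata(registry, data, on="sample_id", overwrite=False):
    """Append (data, on) to registry unless its columns are already registered.

    Returns True if the frame was registered, False if it was skipped.
    """
    new_cols = set(data.columns) - {on}
    # Indices of existing entries that share at least one non-key column.
    clashing = [
        i for i, (old, old_on) in enumerate(registry)
        if new_cols & (set(old.columns) - {old_on})
    ]
    if clashing and not overwrite:
        warnings.warn(
            f"Columns {sorted(new_cols)} overlap with registered extra metadata. "
            "Call clear_extra_metadata() first, or use overwrite=True."
        )
        return False
    if clashing:
        # Simplification: drop every clashing entry whole, even if it also
        # carried columns that the new frame does not replace.
        registry[:] = [e for i, e in enumerate(registry) if i not in clashing]
    registry.append((data, on))
    return True


df_custom = pd.DataFrame({
    "sample_id": ["AR0001-C", "AR0002-C"],
    "my_custom_trait": [True, False],
})

registry = []
first = register_extra_metadata(registry, df_custom)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    second = register_extra_metadata(registry, df_custom)
third = register_extra_metadata(registry, df_custom, overwrite=True)
```

Re-running the cell then leaves exactly one registered frame in either mode, which is the property the duplicated _x/_y columns currently violate.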

3. Update clear_extra_metadata()

Add an option for clear_extra_metadata() to also clear self._cache_sample_metadata, guaranteeing a pristine state when the user wants to reset their environment:

ag3.clear_extra_metadata(clear_cache=True)
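A minimal sketch of how the flag might behave, again on a stand-in class. MetadataState is a hypothetical name, and treating _cache_sample_metadata as a dict is an assumption; the strings below are placeholders for real DataFrames.

```python
class MetadataState:
    """Hypothetical stand-in holding the two pieces of state discussed above."""

    def __init__(self):
        self._extra_metadata = []
        self._cache_sample_metadata = {}  # assumed to be a dict

    def clear_extra_metadata(self, clear_cache=False):
        # Always forget the registered extra metadata.
        self._extra_metadata = []
        # Optionally drop cached sample metadata as well.
        if clear_cache:
            self._cache_sample_metadata = {}


state = MetadataState()
state._extra_metadata.append(("placeholder frame", "sample_id"))
state._cache_sample_metadata["AG1000G-AO"] = "placeholder cached frame"

state.clear_extra_metadata()
cache_size_default = len(state._cache_sample_metadata)

state.clear_extra_metadata(clear_cache=True)
cache_size_cleared = len(state._cache_sample_metadata)
```

Defaulting clear_cache to False keeps the current behaviour for existing callers, at the cost of the full reset being opt-in.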

Expected Behavior

Users should be able to:

  • Inspect their active custom metadata via list_extra_metadata() or an equivalent property.
  • Be warned (or blocked) when attempting to register duplicate columns.
  • Fully reset API state via clear_extra_metadata() without risk of stale cached data persisting.
