Merge branch 'master' into fix-colab-tpu-runtime

joshitha1808 · web-flow · commit 4b5c320f03be · 2026-03-06T22:53:50.000+05:30
diff --git a/LINUX_SETUP.md b/LINUX_SETUP.md
@@ -0,0 +1,67 @@
+# Developer setup (Linux)
+
+To get set up for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
+
+## 1. Fork and clone this repo
+```bash
+git clone git@github.com:[username]/malariagen-data-python.git
+cd malariagen-data-python
+```
+
+## 2. Install Python
+```bash
+sudo add-apt-repository ppa:deadsnakes/ppa
+sudo apt install python3.10 python3.10-venv
+```
+
+## 3. Install pipx and poetry
+```bash
+python3.10 -m pip install --user pipx
+python3.10 -m pipx ensurepath
+pipx install poetry
+```
+
+## 4. Create and activate development environment
+```bash
+poetry install
+poetry shell
+```
+
+## 5. Install pre-commit hooks
+```bash
+pipx install pre-commit
+pre-commit install
+```
+
+Run pre-commit checks manually:
+```bash
+pre-commit run --all-files
+```
+
+## 6. Run tests
+
+Run fast unit tests using simulated data:
+```bash
+poetry run pytest -v tests/anoph
+```
+
+## 7. Google Cloud authentication (for legacy tests)
+
+To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install):
+```bash
+./install_gcloud.sh
+```
+
+Then obtain application-default credentials:
+```bash
+./google-cloud-sdk/bin/gcloud auth application-default login
+```
+
+Once authenticated, run legacy tests:
+```bash
+poetry run pytest --ignore=tests/anoph -v tests
+```
+
+Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.
diff --git a/MACOS_SETUP.md b/MACOS_SETUP.md
@@ -0,0 +1,77 @@
+# Developer setup (macOS)
+
+The Linux setup guide is available in [LINUX_SETUP.md](LINUX_SETUP.md). If you are on macOS, follow these steps instead.
+
+## 1. Install Miniconda
+
+Download and install Miniconda for macOS from https://docs.conda.io/en/latest/miniconda.html.
+Choose the Apple Silicon installer if you have an Apple Silicon Mac, or the Intel installer otherwise. You can check with:
+```bash
+uname -m
+# arm64 = Apple Silicon, x86_64 = Intel
+```
+
+After installation, close and reopen your terminal for conda to be available.
+
+## 2. Create a conda environment
+
+The package requires Python `>=3.10, <3.13`. Python 3.13+ is not currently supported.
+```bash
+conda create -n malariagen python=3.11
+conda activate malariagen
+```
+
+## 3. Fork and clone this repo
+
+Fork the repository on GitHub, then clone your fork:
+```bash
+git clone git@github.com:[username]/malariagen-data-python.git
+cd malariagen-data-python
+pip install -e ".[dev]"
+```
+
+## 4. Install pre-commit hooks
+```bash
+pre-commit install
+```
+
+Run pre-commit checks manually:
+```bash
+pre-commit run --all-files
+```
+
+## 5. Run tests
+
+Run fast unit tests using simulated data:
+```bash
+pytest -v tests/anoph
+```
+
+## 6. Google Cloud authentication (for legacy tests)
+
+To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+Once access has been granted, install the Google Cloud CLI:
+```bash
+brew install google-cloud-sdk
+```
+
+Then authenticate:
+```bash
+gcloud auth application-default login
+```
+
+This opens a browser. Log in with the Google account for which access was granted.
+
+Once authenticated, run legacy tests:
+```bash
+pytest --ignore=tests/anoph -v tests
+```
+
+Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.
+
+## 7. VS Code terminal integration
+
+To use the `code` command from the terminal:
+
+Open VS Code → `Cmd + Shift + P` → type `Shell Command: Install 'code' command in PATH` → press Enter.
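Note on the macOS guide: it states that the package requires Python `>=3.10, <3.13`. A minimal sketch of how a contributor might confirm that the active interpreter falls in that range before installing is shown here. The helper name `is_supported` is hypothetical and is not part of the repository.

```python
import sys

# Supported range as stated in the macOS guide: >=3.10, <3.13.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 13)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is in the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if is_supported() else "NOT supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

Running this inside the activated environment gives a quick check that `conda activate` or `poetry shell` selected the intended interpreter.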
diff --git a/README.md b/README.md
@@ -44,8 +44,11 @@ for release notes.
 
 ## Developer setup
 
-To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
+To get set up for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A).
 
+For platform-specific setup instructions, see the [Linux setup guide](LINUX_SETUP.md)
+or the [macOS setup guide](MACOS_SETUP.md).
+
 Detailed instructions can be found in the [Contributors guide](https://github.com/malariagen/malariagen-data-python/blob/master/CONTRIBUTING.md).
 
 ## AI use policy and guidelines
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -92,7 +92,7 @@ Some data from MalariaGEN are subject to **terms of use** which may include an e
 public communication of any analysis results without permission from data owners. If you
 have any questions about terms of use please email support@malariagen.net.
 
-By default, this sofware package accesses data directly from the **MalariaGEN cloud data repository**
+By default, this software package accesses data directly from the **MalariaGEN cloud data repository**
 hosted in Google Cloud Storage in the US. Note that data access will be more efficient if your
 computations are also run within the same region. Google Colab provides a convenient and free
 service which you can use to explore data and run computations. If you have any questions about
diff --git a/malariagen_data/anoph/frq_base.py b/malariagen_data/anoph/frq_base.py
@@ -396,39 +396,50 @@ def plot_frequencies_time_series(
         # Extract variant labels.
         variant_labels = ds["variant_label"].values
 
+        # Check if CI variables are available.
+        has_ci = "event_frequency_ci_low" in ds
+
         # Build a long-form dataframe from the dataset.
         dfs = []
         for cohort in df_cohorts.itertuples():
             ds_cohort = ds.isel(cohorts=cohort.Index)
-            df = pd.DataFrame(
-                {
-                    "taxon": cohort.taxon,
-                    "area": cohort.area,
-                    "date": cohort.period_start,
-                    "period": str(
-                        cohort.period
-                    ),  # use string representation for hover label
-                    "sample_size": cohort.size,
-                    "variant": variant_labels,
-                    "count": ds_cohort["event_count"].values,
-                    "nobs": ds_cohort["event_nobs"].values,
-                    "frequency": ds_cohort["event_frequency"].values,
-                    "frequency_ci_low": ds_cohort["event_frequency_ci_low"].values,
-                    "frequency_ci_upp": ds_cohort["event_frequency_ci_upp"].values,
-                }
-            )
+            cohort_data = {
+                "taxon": cohort.taxon,
+                "area": cohort.area,
+                "date": cohort.period_start,
+                "period": str(
+                    cohort.period
+                ),  # use string representation for hover label
+                "sample_size": cohort.size,
+                "variant": variant_labels,
+                "count": ds_cohort["event_count"].values,
+                "nobs": ds_cohort["event_nobs"].values,
+                "frequency": ds_cohort["event_frequency"].values,
+            }
+            if has_ci:
+                cohort_data["frequency_ci_low"] = ds_cohort[
+                    "event_frequency_ci_low"
+                ].values
+                cohort_data["frequency_ci_upp"] = ds_cohort[
+                    "event_frequency_ci_upp"
+                ].values
+            df = pd.DataFrame(cohort_data)
             dfs.append(df)
         df_events = pd.concat(dfs, axis=0).reset_index(drop=True)
 
         # Remove events with no observations.
         df_events = df_events.query("nobs > 0").copy()
 
-        # Calculate error bars.
-        frq = df_events["frequency"]
-        frq_ci_low = df_events["frequency_ci_low"]
-        frq_ci_upp = df_events["frequency_ci_upp"]
-        df_events["frequency_error"] = frq_ci_upp - frq
-        df_events["frequency_error_minus"] = frq - frq_ci_low
+        # Calculate error bars if CI data is available.
+        error_y_args = {}
+        if has_ci:
+            frq = df_events["frequency"]
+            frq_ci_low = df_events["frequency_ci_low"]
+            frq_ci_upp = df_events["frequency_ci_upp"]
+            df_events["frequency_error"] = frq_ci_upp - frq
+            df_events["frequency_error_minus"] = frq - frq_ci_low
+            error_y_args["error_y"] = "frequency_error"
+            error_y_args["error_y_minus"] = "frequency_error_minus"
 
         # Make a plot.
         fig = px.line(
@@ -437,8 +448,7 @@ def plot_frequencies_time_series(
             facet_row="area",
             x="date",
             y="frequency",
-            error_y="frequency_error",
-            error_y_minus="frequency_error_minus",
+            **error_y_args,
             color="variant",
             markers=True,
             hover_name="variant",
@@ -518,19 +528,19 @@ def plot_frequencies_map_markers(
             variant_label = variant
 
         # Convert to a dataframe for convenience.
-        df_markers = ds_variant[
-            [
-                "cohort_taxon",
-                "cohort_area",
-                "cohort_period",
-                "cohort_lat_mean",
-                "cohort_lon_mean",
-                "cohort_size",
-                "event_frequency",
-                "event_frequency_ci_low",
-                "event_frequency_ci_upp",
-            ]
-        ].to_dataframe()
+        cols = [
+            "cohort_taxon",
+            "cohort_area",
+            "cohort_period",
+            "cohort_lat_mean",
+            "cohort_lon_mean",
+            "cohort_size",
+            "event_frequency",
+        ]
+        has_ci = "event_frequency_ci_low" in ds
+        if has_ci:
+            cols += ["event_frequency_ci_low", "event_frequency_ci_upp"]
+        df_markers = ds_variant[cols].to_dataframe()
 
         # Select data matching taxon and period parameters.
         df_markers = df_markers.loc[
@@ -560,8 +570,11 @@ def plot_frequencies_map_markers(
                 Area: {x.cohort_area} <br/>
                 Period: {x.cohort_period} <br/>
                 Sample size: {x.cohort_size} <br/>
-                Frequency: {x.event_frequency:.0%}
-                (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})
+                Frequency: {x.event_frequency:.0%}"""
+            if has_ci:
+                popup_html += f"""
+                (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})"""
+            popup_html += """
             """
             marker.popup = ipyleaflet.Popup(
                 child=ipywidgets.HTML(popup_html),
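Note on the `frq_base.py` hunks: they make the confidence-interval columns optional by checking once whether they exist, adding them to the record only when present, and forwarding the error-bar arguments through a keyword dictionary. The pattern can be sketched in isolation as follows. This is a simplified illustration, not the library code: `build_cohort_record`, `build_error_kwargs` and `plot` are hypothetical stand-ins, and plain dictionaries replace the xarray dataset and the Plotly call.

```python
def build_cohort_record(values, has_ci):
    """Build a per-cohort record, adding CI columns only when they exist.

    `values` is a hypothetical stand-in for the per-cohort dataset.
    """
    record = {"frequency": values["event_frequency"]}
    if has_ci:
        record["frequency_ci_low"] = values["event_frequency_ci_low"]
        record["frequency_ci_upp"] = values["event_frequency_ci_upp"]
    return record


def build_error_kwargs(has_ci):
    """Return the keyword arguments to forward to the plotting call."""
    if not has_ci:
        return {}
    return {"error_y": "frequency_error", "error_y_minus": "frequency_error_minus"}


def plot(y, **kwargs):
    """Hypothetical stand-in for a plotting call; reports what it received."""
    return {"y": y, **kwargs}


with_ci = {
    "event_frequency": 0.4,
    "event_frequency_ci_low": 0.3,
    "event_frequency_ci_upp": 0.5,
}
without_ci = {"event_frequency": 0.4}

for values in (with_ci, without_ci):
    has_ci = "event_frequency_ci_low" in values
    record = build_cohort_record(values, has_ci)
    call = plot("frequency", **build_error_kwargs(has_ci))
    print(sorted(record), sorted(call))
```

Unpacking an empty dictionary passes no extra arguments, so the call site stays the same whether or not error bars are drawn.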
diff --git a/malariagen_data/anoph/snp_frq.py b/malariagen_data/anoph/snp_frq.py
@@ -125,15 +125,15 @@ def snp_effects(
             A dataframe of SNP allele frequencies, one row per variant allele. The variant alleles are indexed by
             their contig, their position, the reference allele, the alternate allele and the associated amino acid change.
             The columns are split into three categories: there is one column for each taxon filter (e.g., pass_funestus, pass_gamb_colu, ...) containing whether the site of the variant allele passes the filter;
-            there is then 1 column for each cohort containing the frequency of the variant allele within the cohort, additionally there is a column `max_af` containing the maximum allele frequency of the variant allele accross all cohorts;
+            there is then 1 column for each cohort containing the frequency of the variant allele within the cohort, additionally there is a column `max_af` containing the maximum allele frequency of the variant allele across all cohorts;
             finally, there are 9 columns describing the variant allele: `transcript` contains the gene transcript used for this analysis,
             `effect` is the effect of the allele change,
             `impact`is the impact of the allele change,
             `ref_codon` is the reference codon,
             `alt_codon` is the altered codon with the variant allele,
             `aa_pos` is the position of the amino acid,
             `ref_aa` is the reference amino acid,
-            `alt_aa` is the altered amino acid with the varaint allele,
+            `alt_aa` is the altered amino acid with the variant allele,
             and `label` is the label of the variant allele.
         """,
         notes="""
@@ -296,15 +296,15 @@ def snp_allele_frequencies(
         returns="""
             A dataframe of amino acid allele frequencies, one row per variant. The variants are indexed by
             their amino acid change, their contig, their position.
-            The columns are split into two categories: there is 1 column for each cohort containing the frequency of the amino acid change within the cohort, additionally there is a column `max_af` containing the maximum frequency of the amino acide change accross all cohorts;
+            The columns are split into two categories: there is 1 column for each cohort containing the frequency of the amino acid change within the cohort, additionally there is a column `max_af` containing the maximum frequency of the amino acid change across all cohorts;
             finally, there are 9 columns describing the variant allele: `transcript` contains the gene transcript used for this analysis,
             `effect` is the effect of the allele change,
             `impact`is the impact of the allele change,
-            `ref_allele` is the reference allel,
+            `ref_allele` is the reference allele,
             `alt_allele` is the alternate allele,
             `aa_pos` is the position of the amino acid,
             `ref_aa` is the reference amino acid,
-            `alt_aa` is the altered amino acid with the varaint allele,
+            `alt_aa` is the altered amino acid with the variant allele,
             and `label` is the label of the variant allele.
         """,
         notes="""
diff --git a/malariagen_data/anopheles.py b/malariagen_data/anopheles.py
@@ -727,7 +727,7 @@ def ihs_gwss(
     ) -> Tuple[np.ndarray, np.ndarray]:
         # change this name if you ever change the behaviour of this function, to
         # invalidate any previously cached data
-        name = "roh"
+        name = self._ihs_gwss_cache_name
 
         params = dict(
             contig=contig,
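Note on the `anopheles.py` hunk: it replaces the hard-coded cache name `"roh"` in `ihs_gwss`. The diff does not show how the library builds its cache keys, so the sketch below only illustrates the general hazard, under the assumption that a key is derived from the name plus the parameters. All names here (`_cache`, `cache_key`, `cached`, the `"ihs_gwss"` string and the parameter values) are hypothetical.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory stand-in for a results cache


def cache_key(name, params):
    """Derive a key from the analysis name and its parameters."""
    payload = json.dumps(params, sort_keys=True).encode()
    return f"{name}_{hashlib.sha256(payload).hexdigest()}"


def cached(name, params, compute):
    """Return the cached result for (name, params), computing it on a miss."""
    key = cache_key(name, params)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]


params = {"contig": "2L", "window_size": 100}  # illustrative values only

# Two different analyses sharing one name collide on identical parameters:
# the second call returns the first analysis's result.
roh = cached("roh", params, lambda: "roh result")
ihs_wrong = cached("roh", params, lambda: "ihs result")

# A distinct name keeps the results apart.
ihs_right = cached("ihs_gwss", params, lambda: "ihs result")

print(ihs_wrong, "|", ihs_right)
```

Under this assumption, a name shared between two functions can silently return the wrong cached result whenever their parameters coincide.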
diff --git a/malariagen_data/util.py b/malariagen_data/util.py
@@ -899,6 +899,12 @@ def __init__(
             handler = logging.FileHandler(out)
         self._handler = handler
 
+        # Remove any pre-existing handlers from the singleton logger to prevent
+        # accumulation (and FileHandler FD leaks) on repeated instantiation.
+        for existing_handler in logger.handlers[:]:
+            logger.removeHandler(existing_handler)
+            existing_handler.close()
+
         # configure handler
         if handler is not None:
             if debug:
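Note on the `util.py` hunk: `logging.getLogger(name)` returns the same logger object for the same name, so a class that adds a handler on every instantiation accumulates handlers unless old ones are removed first. A minimal, self-contained sketch of the cleanup pattern from the hunk is shown here. `LoggingHelper` and the logger name `"sketch"` are hypothetical and simplified from the real class.

```python
import logging


class LoggingHelper:
    """Hypothetical helper that attaches one handler to a shared logger."""

    def __init__(self, name="sketch", handler=None):
        logger = logging.getLogger(name)  # same object for the same name
        # Drop and close handlers left over from earlier instances.
        # Iterate over a copy because removeHandler mutates the list.
        for existing_handler in logger.handlers[:]:
            logger.removeHandler(existing_handler)
            existing_handler.close()
        if handler is not None:
            logger.addHandler(handler)
        self._logger = logger
        self._handler = handler


first = LoggingHelper(handler=logging.NullHandler())
second = LoggingHelper(handler=logging.NullHandler())
shared = logging.getLogger("sketch")
print(len(shared.handlers))  # 1
```

Without the removal loop the second instantiation would leave two handlers attached, and with `FileHandler` instances each leftover handler would also keep a file descriptor open.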
diff --git a/tests/anoph/test_base.py b/tests/anoph/test_base.py
diff --git a/tests/anoph/test_frq_base.py b/tests/anoph/test_frq_base.py
diff --git a/tests/anoph/test_hap_frq.py b/tests/anoph/test_hap_frq.py