
Commit 3ac97f1

Merge branch 'master' into fix/replace-print-with-warnings-warn
2 parents 1f53b51 + 3014709 commit 3ac97f1

29 files changed

Lines changed: 1213 additions & 375 deletions

.github/actions/setup-python/action.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,4 +19,4 @@ runs:
     shell: bash
     run: |
       poetry env use ${{ inputs.python-version }}
-      poetry install --extras dev
+      poetry install --with dev,test,docs

CONTRIBUTING.md

Lines changed: 50 additions & 12 deletions
@@ -12,9 +12,10 @@ This package provides Python tools for accessing and analyzing genomic data from
 
 You'll need:
 
-- Python 3.10.x (CI-tested version)
-- [Poetry](https://python-poetry.org/) for dependency management
-- [Git](https://git-scm.com/) for version control
+- [pipx](https://pipx.pypa.io/) for installing Python tools
+- [git](https://git-scm.com/) for version control
+
+Both of these can be installed using your distribution's package manager or [Homebrew](https://brew.sh/) on Mac.
 
 ### Initial setup
 
@@ -33,18 +34,31 @@ You'll need:
    git remote add upstream https://github.com/malariagen/malariagen-data-python.git
    ```
 
-3. **Install Poetry** (if not already installed)
+3. **Install Poetry**
 
    ```bash
    pipx install poetry
    ```
 
-4. **Install the project and its dependencies**
+4. **Install Python 3.12**
+
+   Python 3.12 is tested in the CI-system and is the recommended version to use.
+
+   ```bash
+   poetry python install 3.12
+   ```
+
+5. **Install the project and its dependencies**
 
    ```bash
-   poetry install
+   poetry env use 3.12
+   poetry install --with dev,test,docs
    ```
 
+   This installs the runtime dependencies along with the `dev`, `test`, and `docs`
+   [dependency groups](https://python-poetry.org/docs/managing-dependencies/#dependency-groups).
+   If you only need to run tests, `poetry install --with test` is sufficient.
+
   **Recommended**: Use `poetry run` to run commands inside the virtual environment:
 
    ```bash
@@ -71,7 +85,7 @@ You'll need:
    python script.py
    ```
 
-5. **Install pre-commit hooks**
+6. **Install pre-commit hooks**
 
    ```bash
    pipx install pre-commit
@@ -107,16 +121,40 @@ You'll need:
 
 4. **Run tests locally**
 
-   Fast unit tests (no external data access):
+   Fast unit tests using simulated data (no external data access):
 
    ```bash
-   poetry run pytest -v tests/anoph
+   poetry run pytest -v tests --ignore tests/integration
    ```
 
-   All unit tests (requires setting up credentials for legacy tests):
+   To run integration tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+   Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
 
    ```bash
-   poetry run pytest -v tests --ignore tests/integration
+   ./install_gcloud.sh
+   ```
+
+   You'll then need to obtain application-default credentials, e.g.:
+
+   ```bash
+   ./google-cloud-sdk/bin/gcloud auth application-default login
+   ```
+
+   Once this is done, you can run integration tests:
+
+   ```bash
+   poetry run pytest -v tests/integration
+   ```
+
+   Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.
+
+6. **Run typechecking**
+
+   Run static typechecking with mypy:
+
+   ```bash
+   poetry run mypy malariagen_data tests --ignore-missing-imports
    ```
 
 5. **Check code quality**
@@ -150,7 +188,7 @@ ruff format .
 - **Fast tests**: Unit tests should use simulated data when possible (see `tests/anoph/`)
 - **Integration tests**: Tests requiring GCS data access are slower and run separately
 
-Run type checking with:
+Run dynamic type checking with:
 
 ```bash
 poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.anoph

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ Some data from MalariaGEN are subject to **terms of use** which may include an e
 public communication of any analysis results without permission from data owners. If you
 have any questions about terms of use please email support@malariagen.net.
 
-By default, this sofware package accesses data directly from the **MalariaGEN cloud data repository**
+By default, this software package accesses data directly from the **MalariaGEN cloud data repository**
 hosted in Google Cloud Storage in the US. Note that data access will be more efficient if your
 computations are also run within the same region. Google Colab provides a convenient and free
 service which you can use to explore data and run computations. If you have any questions about

malariagen_data/anoph/base_params.py

Lines changed: 4 additions & 1 deletion
@@ -69,7 +69,10 @@
     str,
     """
     A pandas query string to be evaluated against the sample metadata, to
-    select samples to be included in the returned data.
+    select samples to be included in the returned data. E.g.,
+    "country == 'Uganda'". If the query returns zero results, a warning
+    will be emitted with fuzzy-match suggestions for possible typos or
+    case mismatches.
     """,
 ]
 
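The fuzzy-match behaviour described in this docstring could be sketched as follows. `query_samples` is a hypothetical standalone helper for illustration only (not the package's actual implementation), using `difflib.get_close_matches` to suggest near-miss metadata values:

```python
import difflib
import re
import warnings

import pandas as pd


def query_samples(df_samples: pd.DataFrame, sample_query: str) -> pd.DataFrame:
    """Evaluate a pandas query string against sample metadata, warning with
    fuzzy-match suggestions when the query selects zero samples."""
    df_selected = df_samples.query(sample_query)
    if len(df_selected) == 0:
        # Gather all string values present in the metadata.
        vocabulary = {
            str(v)
            for col in df_samples.select_dtypes(include="object")
            for v in df_samples[col].dropna().unique()
        }
        # Extract quoted literals from the query and look for close matches,
        # to catch typos or case mismatches like 'uganda' vs 'Uganda'.
        literals = [
            a or b for a, b in re.findall(r"'([^']*)'|\"([^\"]*)\"", sample_query)
        ]
        suggestions = sorted(
            {s for token in literals for s in difflib.get_close_matches(token, vocabulary)}
        )
        message = f"Query {sample_query!r} returned no samples."
        if suggestions:
            message += f" Did you mean one of: {suggestions}?"
        warnings.warn(message)
    return df_selected
```

A query like `"country == 'uganda'"` against metadata containing `"Uganda"` would then return an empty frame but warn with `"Uganda"` as a suggestion.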

malariagen_data/anoph/distance.py

Lines changed: 37 additions & 18 deletions
@@ -365,24 +365,43 @@ def _njt(
         from scipy.spatial.distance import squareform  # type: ignore
 
         # Compute pairwise distances.
-        dist, samples, n_snps = self.biallelic_diplotype_pairwise_distances(
-            region=region,
-            n_snps=n_snps,
-            metric=metric,
-            sample_sets=sample_sets,
-            sample_indices=sample_indices,
-            site_mask=site_mask,
-            site_class=site_class,
-            inline_array=inline_array,
-            chunks=chunks,
-            cohort_size=cohort_size,
-            min_cohort_size=min_cohort_size,
-            max_cohort_size=max_cohort_size,
-            random_seed=random_seed,
-            max_missing_an=max_missing_an,
-            min_minor_ac=min_minor_ac,
-            thin_offset=thin_offset,
-        )
+        try:
+            dist, samples, n_snps_used = self.biallelic_diplotype_pairwise_distances(
+                region=region,
+                n_snps=n_snps,
+                metric=metric,
+                sample_sets=sample_sets,
+                sample_indices=sample_indices,
+                site_mask=site_mask,
+                site_class=site_class,
+                inline_array=inline_array,
+                chunks=chunks,
+                cohort_size=cohort_size,
+                min_cohort_size=min_cohort_size,
+                max_cohort_size=max_cohort_size,
+                random_seed=random_seed,
+                max_missing_an=max_missing_an,
+                min_minor_ac=min_minor_ac,
+                thin_offset=thin_offset,
+            )
+
+        except ValueError as e:
+            raise ValueError(
+                f"Unable to construct neighbour-joining tree. {e} "
+                f"This could be because the selected region does not "
+                f"contain enough polymorphic SNPs for the given sample "
+                f"sets and query parameters."
+            ) from e
+
+        # Validate enough samples for a tree.
+        n_samples = len(samples)
+        if n_samples < 3:
+            raise ValueError(
+                f"Not enough samples to construct a neighbour-joining tree. "
+                f"A minimum of 3 samples is required, but only {n_samples} "
+                f"were found for the given region and sample sets."
+            )
+
         D = squareform(dist)
 
         # anjl supports passing in a progress bar function to get progress on the
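The sample-count check added here guards the `squareform` call, which expects a condensed pairwise distance vector of length n*(n-1)/2. A minimal standalone sketch of that validation (a hypothetical helper, not the package's actual method; it performs the same upper-triangle expansion that scipy's `squareform` does, using numpy directly):

```python
import numpy as np


def njt_distance_matrix(dist: np.ndarray, samples: list) -> np.ndarray:
    """Validate inputs and expand a condensed pairwise distance vector into
    a square matrix, mirroring the checks added to _njt above."""
    n_samples = len(samples)
    if n_samples < 3:
        raise ValueError(
            f"Not enough samples to construct a neighbour-joining tree. "
            f"A minimum of 3 samples is required, but only {n_samples} "
            f"were found for the given region and sample sets."
        )
    # A condensed distance vector over n samples has n*(n-1)/2 entries.
    n_expected = n_samples * (n_samples - 1) // 2
    if dist.shape[0] != n_expected:
        raise ValueError(
            f"Distance vector has {dist.shape[0]} entries; expected {n_expected}."
        )
    # Expand the condensed vector into a symmetric square matrix.
    D = np.zeros((n_samples, n_samples), dtype=dist.dtype)
    i_upper, j_upper = np.triu_indices(n_samples, k=1)
    D[i_upper, j_upper] = dist
    D[j_upper, i_upper] = dist
    return D
```

Failing early like this gives a clearer error than letting the expansion or the tree construction fail with a shape mismatch downstream.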

malariagen_data/anoph/frq_base.py

Lines changed: 84 additions & 43 deletions
@@ -29,6 +29,13 @@ def _prep_samples_for_cohort_grouping(
     # Users can explicitly override with True/False.
     filter_unassigned = taxon_by == "taxon"
 
+    # Validate taxon_by.
+    if taxon_by not in df_samples.columns:
+        raise ValueError(
+            f"Invalid value for `taxon_by`: {taxon_by!r}. "
+            f"Must be the name of an existing column in the sample metadata."
+        )
+
     if filter_unassigned:
         # Remove samples with "intermediate" or "unassigned" taxon values,
         # as we only want cohorts with clean taxon calls.
@@ -78,6 +85,13 @@ def _prep_samples_for_cohort_grouping(
     # Apply the matching period_by function to create a new "period" column.
     df_samples["period"] = df_samples.apply(period_by_func, axis="columns")
 
+    # Validate area_by.
+    if area_by not in df_samples.columns:
+        raise ValueError(
+            f"Invalid value for `area_by`: {area_by!r}. "
+            f"Must be the name of an existing column in the sample metadata."
+        )
+
     # Copy the specified area_by column to a new "area" column.
     df_samples["area"] = df_samples[area_by]
 
@@ -396,39 +410,50 @@ def plot_frequencies_time_series(
         # Extract variant labels.
         variant_labels = ds["variant_label"].values
 
+        # Check if CI variables are available.
+        has_ci = "event_frequency_ci_low" in ds
+
         # Build a long-form dataframe from the dataset.
         dfs = []
         for cohort in df_cohorts.itertuples():
             ds_cohort = ds.isel(cohorts=cohort.Index)
-            df = pd.DataFrame(
-                {
-                    "taxon": cohort.taxon,
-                    "area": cohort.area,
-                    "date": cohort.period_start,
-                    "period": str(
-                        cohort.period
-                    ),  # use string representation for hover label
-                    "sample_size": cohort.size,
-                    "variant": variant_labels,
-                    "count": ds_cohort["event_count"].values,
-                    "nobs": ds_cohort["event_nobs"].values,
-                    "frequency": ds_cohort["event_frequency"].values,
-                    "frequency_ci_low": ds_cohort["event_frequency_ci_low"].values,
-                    "frequency_ci_upp": ds_cohort["event_frequency_ci_upp"].values,
-                }
-            )
+            cohort_data = {
+                "taxon": cohort.taxon,
+                "area": cohort.area,
+                "date": cohort.period_start,
+                "period": str(
+                    cohort.period
+                ),  # use string representation for hover label
+                "sample_size": cohort.size,
+                "variant": variant_labels,
+                "count": ds_cohort["event_count"].values,
+                "nobs": ds_cohort["event_nobs"].values,
+                "frequency": ds_cohort["event_frequency"].values,
+            }
+            if has_ci:
+                cohort_data["frequency_ci_low"] = ds_cohort[
+                    "event_frequency_ci_low"
+                ].values
+                cohort_data["frequency_ci_upp"] = ds_cohort[
+                    "event_frequency_ci_upp"
+                ].values
+            df = pd.DataFrame(cohort_data)
             dfs.append(df)
         df_events = pd.concat(dfs, axis=0).reset_index(drop=True)
 
         # Remove events with no observations.
         df_events = df_events.query("nobs > 0").copy()
 
-        # Calculate error bars.
-        frq = df_events["frequency"]
-        frq_ci_low = df_events["frequency_ci_low"]
-        frq_ci_upp = df_events["frequency_ci_upp"]
-        df_events["frequency_error"] = frq_ci_upp - frq
-        df_events["frequency_error_minus"] = frq - frq_ci_low
+        # Calculate error bars if CI data is available.
+        error_y_args = {}
+        if has_ci:
+            frq = df_events["frequency"]
+            frq_ci_low = df_events["frequency_ci_low"]
+            frq_ci_upp = df_events["frequency_ci_upp"]
+            df_events["frequency_error"] = frq_ci_upp - frq
+            df_events["frequency_error_minus"] = frq - frq_ci_low
+            error_y_args["error_y"] = "frequency_error"
+            error_y_args["error_y_minus"] = "frequency_error_minus"
 
         # Make a plot.
         fig = px.line(
@@ -437,8 +462,7 @@ def plot_frequencies_time_series(
             facet_row="area",
             x="date",
             y="frequency",
-            error_y="frequency_error",
-            error_y_minus="frequency_error_minus",
+            **error_y_args,
             color="variant",
             markers=True,
             hover_name="variant",
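Building an `error_y_args` dict and splatting it lets a single plotting call work with or without confidence intervals. A minimal sketch of the conditional-kwargs pattern (plotly not required; `frequency_plot_kwargs` is a hypothetical helper for illustration):

```python
import pandas as pd


def frequency_plot_kwargs(df_events: pd.DataFrame) -> dict:
    """Build keyword arguments for a frequency line plot, adding error-bar
    arguments only when CI columns are present (mirrors the has_ci logic)."""
    kwargs = {"x": "date", "y": "frequency", "color": "variant", "markers": True}
    has_ci = {"frequency_ci_low", "frequency_ci_upp"} <= set(df_events.columns)
    if has_ci:
        frq = df_events["frequency"]
        # Plotly expects error bar sizes as offsets from the central value.
        df_events["frequency_error"] = df_events["frequency_ci_upp"] - frq
        df_events["frequency_error_minus"] = frq - df_events["frequency_ci_low"]
        kwargs["error_y"] = "frequency_error"
        kwargs["error_y_minus"] = "frequency_error_minus"
    return kwargs
```

The result would be passed as `px.line(df_events, **frequency_plot_kwargs(df_events))`, so datasets without CI variables simply produce a plot with no error bars rather than raising a `KeyError`.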
@@ -518,19 +542,19 @@ def plot_frequencies_map_markers(
             variant_label = variant
 
         # Convert to a dataframe for convenience.
-        df_markers = ds_variant[
-            [
-                "cohort_taxon",
-                "cohort_area",
-                "cohort_period",
-                "cohort_lat_mean",
-                "cohort_lon_mean",
-                "cohort_size",
-                "event_frequency",
-                "event_frequency_ci_low",
-                "event_frequency_ci_upp",
-            ]
-        ].to_dataframe()
+        cols = [
+            "cohort_taxon",
+            "cohort_area",
+            "cohort_period",
+            "cohort_lat_mean",
+            "cohort_lon_mean",
+            "cohort_size",
+            "event_frequency",
+        ]
+        has_ci = "event_frequency_ci_low" in ds
+        if has_ci:
+            cols += ["event_frequency_ci_low", "event_frequency_ci_upp"]
+        df_markers = ds_variant[cols].to_dataframe()
 
         # Select data matching taxon and period parameters.
         df_markers = df_markers.loc[
@@ -560,8 +584,11 @@ def plot_frequencies_map_markers(
             Area: {x.cohort_area} <br/>
             Period: {x.cohort_period} <br/>
             Sample size: {x.cohort_size} <br/>
-            Frequency: {x.event_frequency:.0%}
-            (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})
+            Frequency: {x.event_frequency:.0%}"""
+            if has_ci:
+                popup_html += f"""
+            (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})"""
+            popup_html += """
             """
             marker.popup = ipyleaflet.Popup(
                 child=ipywidgets.HTML(popup_html),
@@ -609,13 +636,27 @@ def plot_frequencies_interactive_map(
         variants = ds["variant_label"].values
         taxa = ds["cohort_taxon"].to_pandas().dropna().unique()  # type: ignore
         periods = ds["cohort_period"].to_pandas().dropna().unique()  # type: ignore
+
+        if len(variants) == 0:
+            raise ValueError("No variants available in dataset.")
+        if len(taxa) == 0:
+            raise ValueError("No taxons available in dataset.")
+        if len(periods) == 0:
+            raise ValueError("No periods available in dataset.")
+
         controls = ipywidgets.interactive(
             self.plot_frequencies_map_markers,
             m=ipywidgets.fixed(freq_map),
             ds=ipywidgets.fixed(ds),
-            variant=ipywidgets.Dropdown(options=variants, description="Variant: "),
-            taxon=ipywidgets.Dropdown(options=taxa, description="Taxon: "),
-            period=ipywidgets.Dropdown(options=periods, description="Period: "),
+            variant=ipywidgets.Dropdown(
+                options=variants, value=variants[0], description="Variant: "
+            ),
+            taxon=ipywidgets.Dropdown(
+                options=taxa, value=taxa[0], description="Taxon: "
+            ),
+            period=ipywidgets.Dropdown(
+                options=periods, value=periods[0], description="Period: "
+            ),
             clear=ipywidgets.fixed(True),
         )
 
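Passing an explicit `value=...` to each dropdown assumes the option list is non-empty, which is why the length checks precede the widget construction. A small sketch of extracting and validating selector options (`selector_options` is a hypothetical helper; ipywidgets not required):

```python
import pandas as pd


def selector_options(values: pd.Series, name: str) -> list:
    """Return unique non-null values for a dropdown selector, raising a
    clear error when the dataset offers nothing to select (mirrors the
    checks added above)."""
    options = values.dropna().unique().tolist()
    if len(options) == 0:
        raise ValueError(f"No {name}s available in dataset.")
    return options
```

With empty inputs rejected up front, `options[0]` can then be used safely as the dropdown's default value.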
