
Commit 3ac97f1

Merge branch 'master' into fix/replace-print-with-warnings-warn
2 parents 1f53b51 + 3014709 commit 3ac97f1

29 files changed

Lines changed: 1213 additions & 375 deletions

.github/actions/setup-python/action.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,4 +19,4 @@ runs:
     shell: bash
     run: |
       poetry env use ${{ inputs.python-version }}
-      poetry install --extras dev
+      poetry install --with dev,test,docs

CONTRIBUTING.md

Lines changed: 50 additions & 12 deletions
@@ -12,9 +12,10 @@ This package provides Python tools for accessing and analyzing genomic data from
 
 You'll need:
 
-- Python 3.10.x (CI-tested version)
-- [Poetry](https://python-poetry.org/) for dependency management
-- [Git](https://git-scm.com/) for version control
+- [pipx](https://pipx.pypa.io/) for installing Python tools
+- [git](https://git-scm.com/) for version control
+
+Both of these can be installed using your distribution's package manager or [Homebrew](https://brew.sh/) on Mac.
 
 ### Initial setup
 
@@ -33,18 +34,31 @@ You'll need:
    git remote add upstream https://github.com/malariagen/malariagen-data-python.git
    ```
 
-3. **Install Poetry** (if not already installed)
+3. **Install Poetry**
 
    ```bash
    pipx install poetry
    ```
 
-4. **Install the project and its dependencies**
+4. **Install Python 3.12**
+
+   Python 3.12 is tested in the CI-system and is the recommended version to use.
+
+   ```bash
+   poetry python install 3.12
+   ```
+
+5. **Install the project and its dependencies**
 
    ```bash
-   poetry install
+   poetry env use 3.12
+   poetry install --with dev,test,docs
    ```
 
+   This installs the runtime dependencies along with the `dev`, `test`, and `docs`
+   [dependency groups](https://python-poetry.org/docs/managing-dependencies/#dependency-groups).
+   If you only need to run tests, `poetry install --with test` is sufficient.
+
   **Recommended**: Use `poetry run` to run commands inside the virtual environment:
 
    ```bash
@@ -71,7 +85,7 @@ You'll need:
    python script.py
    ```
 
-5. **Install pre-commit hooks**
+6. **Install pre-commit hooks**
 
    ```bash
    pipx install pre-commit
@@ -107,16 +121,40 @@ You'll need:
 
 4. **Run tests locally**
 
-   Fast unit tests (no external data access):
+   Fast unit tests using simulated data (no external data access):
 
    ```bash
-   poetry run pytest -v tests/anoph
+   poetry run pytest -v tests --ignore tests/integration
    ```
 
-   All unit tests (requires setting up credentials for legacy tests):
+   To run integration tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+   Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
 
    ```bash
-   poetry run pytest -v tests --ignore tests/integration
+   ./install_gcloud.sh
+   ```
+
+   You'll then need to obtain application-default credentials, e.g.:
+
+   ```bash
+   ./google-cloud-sdk/bin/gcloud auth application-default login
+   ```
+
+   Once this is done, you can run integration tests:
+
+   ```bash
+   poetry run pytest -v tests/integration
+   ```
+
+   Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.
+
+6. **Run typechecking**
+
+   Run static typechecking with mypy:
+
+   ```bash
+   poetry run mypy malariagen_data tests --ignore-missing-imports
    ```
 
 5. **Check code quality**
@@ -150,7 +188,7 @@ ruff format .
 - **Fast tests**: Unit tests should use simulated data when possible (see `tests/anoph/`)
 - **Integration tests**: Tests requiring GCS data access are slower and run separately
 
-Run type checking with:
+Run dynamic type checking with:
 
 ```bash
 poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.anoph

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ Some data from MalariaGEN are subject to **terms of use** which may include an e
 public communication of any analysis results without permission from data owners. If you
 have any questions about terms of use please email support@malariagen.net.
 
-By default, this sofware package accesses data directly from the **MalariaGEN cloud data repository**
+By default, this software package accesses data directly from the **MalariaGEN cloud data repository**
 hosted in Google Cloud Storage in the US. Note that data access will be more efficient if your
 computations are also run within the same region. Google Colab provides a convenient and free
 service which you can use to explore data and run computations. If you have any questions about

malariagen_data/anoph/base_params.py

Lines changed: 4 additions & 1 deletion
@@ -69,7 +69,10 @@
     str,
     """
     A pandas query string to be evaluated against the sample metadata, to
-    select samples to be included in the returned data.
+    select samples to be included in the returned data. E.g.,
+    "country == 'Uganda'". If the query returns zero results, a warning
+    will be emitted with fuzzy-match suggestions for possible typos or
+    case mismatches.
     """,
 ]
 
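The fuzzy-match behaviour described in this docstring could be sketched as follows. `query_samples` is a hypothetical standalone helper for illustration only (not the package's actual implementation), using `difflib.get_close_matches` to suggest near-miss metadata values:

```python
import difflib
import re
import warnings

import pandas as pd


def query_samples(df_samples: pd.DataFrame, sample_query: str) -> pd.DataFrame:
    """Evaluate a pandas query string against sample metadata, warning with
    fuzzy-match suggestions when the query selects zero samples."""
    df_selected = df_samples.query(sample_query)
    if len(df_selected) == 0:
        # Gather all string values present in the metadata.
        vocabulary = {
            str(v)
            for col in df_samples.select_dtypes(include="object")
            for v in df_samples[col].dropna().unique()
        }
        # Extract quoted literals from the query and look for close matches,
        # to catch typos or case mismatches like 'uganda' vs 'Uganda'.
        literals = [
            a or b for a, b in re.findall(r"'([^']*)'|\"([^\"]*)\"", sample_query)
        ]
        suggestions = sorted(
            {s for token in literals for s in difflib.get_close_matches(token, vocabulary)}
        )
        message = f"Query {sample_query!r} returned no samples."
        if suggestions:
            message += f" Did you mean one of: {suggestions}?"
        warnings.warn(message)
    return df_selected
```

A query like `"country == 'uganda'"` against metadata containing `"Uganda"` would then return an empty frame but warn with `"Uganda"` as a suggestion.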

malariagen_data/anoph/distance.py

Lines changed: 37 additions & 18 deletions
@@ -365,24 +365,43 @@ def _njt(
         from scipy.spatial.distance import squareform  # type: ignore
 
         # Compute pairwise distances.
-        dist, samples, n_snps = self.biallelic_diplotype_pairwise_distances(
-            region=region,
-            n_snps=n_snps,
-            metric=metric,
-            sample_sets=sample_sets,
-            sample_indices=sample_indices,
-            site_mask=site_mask,
-            site_class=site_class,
-            inline_array=inline_array,
-            chunks=chunks,
-            cohort_size=cohort_size,
-            min_cohort_size=min_cohort_size,
-            max_cohort_size=max_cohort_size,
-            random_seed=random_seed,
-            max_missing_an=max_missing_an,
-            min_minor_ac=min_minor_ac,
-            thin_offset=thin_offset,
-        )
+        try:
+            dist, samples, n_snps_used = self.biallelic_diplotype_pairwise_distances(
+                region=region,
+                n_snps=n_snps,
+                metric=metric,
+                sample_sets=sample_sets,
+                sample_indices=sample_indices,
+                site_mask=site_mask,
+                site_class=site_class,
+                inline_array=inline_array,
+                chunks=chunks,
+                cohort_size=cohort_size,
+                min_cohort_size=min_cohort_size,
+                max_cohort_size=max_cohort_size,
+                random_seed=random_seed,
+                max_missing_an=max_missing_an,
+                min_minor_ac=min_minor_ac,
+                thin_offset=thin_offset,
+            )
+
+        except ValueError as e:
+            raise ValueError(
+                f"Unable to construct neighbour-joining tree. {e} "
+                f"This could be because the selected region does not "
+                f"contain enough polymorphic SNPs for the given sample "
+                f"sets and query parameters."
+            ) from e
+
+        # Validate enough samples for a tree.
+        n_samples = len(samples)
+        if n_samples < 3:
+            raise ValueError(
+                f"Not enough samples to construct a neighbour-joining tree. "
+                f"A minimum of 3 samples is required, but only {n_samples} "
+                f"were found for the given region and sample sets."
+            )
+
         D = squareform(dist)
 
         # anjl supports passing in a progress bar function to get progress on the
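The sample-count check added here guards the `squareform` call, which expects a condensed pairwise distance vector of length n*(n-1)/2. A minimal standalone sketch of that validation (a hypothetical helper, not the package's actual method; it performs the same upper-triangle expansion that scipy's `squareform` does, using numpy directly):

```python
import numpy as np


def njt_distance_matrix(dist: np.ndarray, samples: list) -> np.ndarray:
    """Validate inputs and expand a condensed pairwise distance vector into
    a square matrix, mirroring the checks added to _njt above."""
    n_samples = len(samples)
    if n_samples < 3:
        raise ValueError(
            f"Not enough samples to construct a neighbour-joining tree. "
            f"A minimum of 3 samples is required, but only {n_samples} "
            f"were found for the given region and sample sets."
        )
    # A condensed distance vector over n samples has n*(n-1)/2 entries.
    n_expected = n_samples * (n_samples - 1) // 2
    if dist.shape[0] != n_expected:
        raise ValueError(
            f"Distance vector has {dist.shape[0]} entries; expected {n_expected}."
        )
    # Expand the condensed vector into a symmetric square matrix.
    D = np.zeros((n_samples, n_samples), dtype=dist.dtype)
    i_upper, j_upper = np.triu_indices(n_samples, k=1)
    D[i_upper, j_upper] = dist
    D[j_upper, i_upper] = dist
    return D
```

Failing early like this gives a clearer error than letting the expansion or the tree construction fail with a shape mismatch downstream.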

malariagen_data/anoph/frq_base.py

Lines changed: 84 additions & 43 deletions
@@ -29,6 +29,13 @@ def _prep_samples_for_cohort_grouping(
     # Users can explicitly override with True/False.
     filter_unassigned = taxon_by == "taxon"
 
+    # Validate taxon_by.
+    if taxon_by not in df_samples.columns:
+        raise ValueError(
+            f"Invalid value for `taxon_by`: {taxon_by!r}. "
+            f"Must be the name of an existing column in the sample metadata."
+        )
+
     if filter_unassigned:
         # Remove samples with "intermediate" or "unassigned" taxon values,
         # as we only want cohorts with clean taxon calls.
@@ -78,6 +85,13 @@ def _prep_samples_for_cohort_grouping(
     # Apply the matching period_by function to create a new "period" column.
     df_samples["period"] = df_samples.apply(period_by_func, axis="columns")
 
+    # Validate area_by.
+    if area_by not in df_samples.columns:
+        raise ValueError(
+            f"Invalid value for `area_by`: {area_by!r}. "
+            f"Must be the name of an existing column in the sample metadata."
+        )
+
     # Copy the specified area_by column to a new "area" column.
     df_samples["area"] = df_samples[area_by]
 
@@ -396,39 +410,50 @@ def plot_frequencies_time_series(
         # Extract variant labels.
         variant_labels = ds["variant_label"].values
 
+        # Check if CI variables are available.
+        has_ci = "event_frequency_ci_low" in ds
+
         # Build a long-form dataframe from the dataset.
         dfs = []
         for cohort in df_cohorts.itertuples():
             ds_cohort = ds.isel(cohorts=cohort.Index)
-            df = pd.DataFrame(
-                {
-                    "taxon": cohort.taxon,
-                    "area": cohort.area,
-                    "date": cohort.period_start,
-                    "period": str(
-                        cohort.period
-                    ),  # use string representation for hover label
-                    "sample_size": cohort.size,
-                    "variant": variant_labels,
-                    "count": ds_cohort["event_count"].values,
-                    "nobs": ds_cohort["event_nobs"].values,
-                    "frequency": ds_cohort["event_frequency"].values,
-                    "frequency_ci_low": ds_cohort["event_frequency_ci_low"].values,
-                    "frequency_ci_upp": ds_cohort["event_frequency_ci_upp"].values,
-                }
-            )
+            cohort_data = {
+                "taxon": cohort.taxon,
+                "area": cohort.area,
+                "date": cohort.period_start,
+                "period": str(
+                    cohort.period
+                ),  # use string representation for hover label
+                "sample_size": cohort.size,
+                "variant": variant_labels,
+                "count": ds_cohort["event_count"].values,
+                "nobs": ds_cohort["event_nobs"].values,
+                "frequency": ds_cohort["event_frequency"].values,
+            }
+            if has_ci:
+                cohort_data["frequency_ci_low"] = ds_cohort[
+                    "event_frequency_ci_low"
+                ].values
+                cohort_data["frequency_ci_upp"] = ds_cohort[
+                    "event_frequency_ci_upp"
+                ].values
+            df = pd.DataFrame(cohort_data)
             dfs.append(df)
         df_events = pd.concat(dfs, axis=0).reset_index(drop=True)
 
         # Remove events with no observations.
         df_events = df_events.query("nobs > 0").copy()
 
-        # Calculate error bars.
-        frq = df_events["frequency"]
-        frq_ci_low = df_events["frequency_ci_low"]
-        frq_ci_upp = df_events["frequency_ci_upp"]
-        df_events["frequency_error"] = frq_ci_upp - frq
-        df_events["frequency_error_minus"] = frq - frq_ci_low
+        # Calculate error bars if CI data is available.
+        error_y_args = {}
+        if has_ci:
+            frq = df_events["frequency"]
+            frq_ci_low = df_events["frequency_ci_low"]
+            frq_ci_upp = df_events["frequency_ci_upp"]
+            df_events["frequency_error"] = frq_ci_upp - frq
+            df_events["frequency_error_minus"] = frq - frq_ci_low
+            error_y_args["error_y"] = "frequency_error"
+            error_y_args["error_y_minus"] = "frequency_error_minus"
 
         # Make a plot.
         fig = px.line(
@@ -437,8 +462,7 @@ def plot_frequencies_time_series(
             facet_row="area",
             x="date",
             y="frequency",
-            error_y="frequency_error",
-            error_y_minus="frequency_error_minus",
+            **error_y_args,
             color="variant",
             markers=True,
             hover_name="variant",
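Building an `error_y_args` dict and splatting it lets a single plotting call work with or without confidence intervals. A minimal sketch of the conditional-kwargs pattern (plotly not required; `frequency_plot_kwargs` is a hypothetical helper for illustration):

```python
import pandas as pd


def frequency_plot_kwargs(df_events: pd.DataFrame) -> dict:
    """Build keyword arguments for a frequency line plot, adding error-bar
    arguments only when CI columns are present (mirrors the has_ci logic)."""
    kwargs = {"x": "date", "y": "frequency", "color": "variant", "markers": True}
    has_ci = {"frequency_ci_low", "frequency_ci_upp"} <= set(df_events.columns)
    if has_ci:
        frq = df_events["frequency"]
        # Plotly expects error bar sizes as offsets from the central value.
        df_events["frequency_error"] = df_events["frequency_ci_upp"] - frq
        df_events["frequency_error_minus"] = frq - df_events["frequency_ci_low"]
        kwargs["error_y"] = "frequency_error"
        kwargs["error_y_minus"] = "frequency_error_minus"
    return kwargs
```

The result would be passed as `px.line(df_events, **frequency_plot_kwargs(df_events))`, so datasets without CI variables simply produce a plot with no error bars rather than raising a `KeyError`.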
@@ -518,19 +542,19 @@ def plot_frequencies_map_markers(
             variant_label = variant
 
         # Convert to a dataframe for convenience.
-        df_markers = ds_variant[
-            [
-                "cohort_taxon",
-                "cohort_area",
-                "cohort_period",
-                "cohort_lat_mean",
-                "cohort_lon_mean",
-                "cohort_size",
-                "event_frequency",
-                "event_frequency_ci_low",
-                "event_frequency_ci_upp",
-            ]
-        ].to_dataframe()
+        cols = [
+            "cohort_taxon",
+            "cohort_area",
+            "cohort_period",
+            "cohort_lat_mean",
+            "cohort_lon_mean",
+            "cohort_size",
+            "event_frequency",
+        ]
+        has_ci = "event_frequency_ci_low" in ds
+        if has_ci:
+            cols += ["event_frequency_ci_low", "event_frequency_ci_upp"]
+        df_markers = ds_variant[cols].to_dataframe()
 
         # Select data matching taxon and period parameters.
         df_markers = df_markers.loc[
@@ -560,8 +584,11 @@ def plot_frequencies_map_markers(
             Area: {x.cohort_area} <br/>
             Period: {x.cohort_period} <br/>
             Sample size: {x.cohort_size} <br/>
-            Frequency: {x.event_frequency:.0%}
-            (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})
+            Frequency: {x.event_frequency:.0%}"""
+            if has_ci:
+                popup_html += f"""
+            (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})"""
+            popup_html += """
             """
             marker.popup = ipyleaflet.Popup(
                 child=ipywidgets.HTML(popup_html),
@@ -609,13 +636,27 @@ def plot_frequencies_interactive_map(
         variants = ds["variant_label"].values
         taxa = ds["cohort_taxon"].to_pandas().dropna().unique()  # type: ignore
         periods = ds["cohort_period"].to_pandas().dropna().unique()  # type: ignore
+
+        if len(variants) == 0:
+            raise ValueError("No variants available in dataset.")
+        if len(taxa) == 0:
+            raise ValueError("No taxons available in dataset.")
+        if len(periods) == 0:
+            raise ValueError("No periods available in dataset.")
+
         controls = ipywidgets.interactive(
             self.plot_frequencies_map_markers,
             m=ipywidgets.fixed(freq_map),
             ds=ipywidgets.fixed(ds),
-            variant=ipywidgets.Dropdown(options=variants, description="Variant: "),
-            taxon=ipywidgets.Dropdown(options=taxa, description="Taxon: "),
-            period=ipywidgets.Dropdown(options=periods, description="Period: "),
+            variant=ipywidgets.Dropdown(
+                options=variants, value=variants[0], description="Variant: "
+            ),
+            taxon=ipywidgets.Dropdown(
+                options=taxa, value=taxa[0], description="Taxon: "
+            ),
+            period=ipywidgets.Dropdown(
+                options=periods, value=periods[0], description="Period: "
+            ),
             clear=ipywidgets.fixed(True),
         )
 
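Passing an explicit `value=...` to each dropdown assumes the option list is non-empty, which is why the length checks precede the widget construction. A small sketch of extracting and validating selector options (`selector_options` is a hypothetical helper; ipywidgets not required):

```python
import pandas as pd


def selector_options(values: pd.Series, name: str) -> list:
    """Return unique non-null values for a dropdown selector, raising a
    clear error when the dataset offers nothing to select (mirrors the
    checks added above)."""
    options = values.dropna().unique().tolist()
    if len(options) == 0:
        raise ValueError(f"No {name}s available in dataset.")
    return options
```

With empty inputs rejected up front, `options[0]` can then be used safely as the dropdown's default value.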
