Commit 4b5c320

Merge branch 'master' into fix-colab-tpu-runtime
2 parents e1677fe + dbe12c2 commit 4b5c320

11 files changed

Lines changed: 333 additions & 49 deletions

LINUX_SETUP.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+# Developer setup (Linux)
+
+To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
+
+## 1. Fork and clone this repo
+```bash
+git clone git@github.com:[username]/malariagen-data-python.git
+cd malariagen-data-python
+```
+
+## 2. Install Python
+```bash
+sudo add-apt-repository ppa:deadsnakes/ppa
+sudo apt install python3.10 python3.10-venv
+```
+
+## 3. Install pipx and poetry
+```bash
+python3.10 -m pip install --user pipx
+python3.10 -m pipx ensurepath
+pipx install poetry
+```
+
+## 4. Create and activate development environment
+```bash
+poetry install
+poetry shell
+```
+
+## 5. Install pre-commit hooks
+```bash
+pipx install pre-commit
+pre-commit install
+```
+
+Run pre-commit checks manually:
+```bash
+pre-commit run --all-files
+```
+
+## 6. Run tests
+
+Run fast unit tests using simulated data:
+```bash
+poetry run pytest -v tests/anoph
+```
+
+## 7. Google Cloud authentication (for legacy tests)
+
+To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install):
+```bash
+./install_gcloud.sh
+```
+
+Then obtain application-default credentials:
+```bash
+./google-cloud-sdk/bin/gcloud auth application-default login
+```
+
+Once authenticated, run legacy tests:
+```bash
+poetry run pytest --ignore=tests/anoph -v tests
+```
+
+Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.
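
Before running the legacy tests described in step 7 of the Linux guide, it can help to confirm that application-default credentials are actually present. The following is a minimal sketch, not part of the commit. The helper name `find_adc_file` is hypothetical, and the default path assumes a Linux or macOS machine where `CLOUDSDK_CONFIG` has not been overridden.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the application-default credentials file, if one exists.

    Hypothetical helper for illustration only.
    """
    # An explicit GOOGLE_APPLICATION_CREDENTIALS path takes priority.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)
    # Default location written by `gcloud auth application-default login`
    # on Linux and macOS (assumes CLOUDSDK_CONFIG is not set).
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None


if __name__ == "__main__":
    found = find_adc_file(os.environ, Path.home())
    print(found if found else "No application-default credentials found.")
```

If the helper returns `None`, re-running the `gcloud auth application-default login` step is the likely fix.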

MACOS_SETUP.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+# Developer setup (macOS)
+
+The Linux setup guide is available in [LINUX_SETUP.md](LINUX_SETUP.md). If you are on macOS, follow these steps instead.
+
+## 1. Install Miniconda
+
+Download and install Miniconda for macOS from https://docs.conda.io/en/latest/miniconda.html.
+Choose the Apple Silicon installer if you have an Apple Silicon Mac, or the Intel installer otherwise. You can check with:
+```bash
+uname -m
+# arm64 = Apple Silicon, x86_64 = Intel
+```
+
+After installation, close and reopen your terminal for conda to be available.
+
+## 2. Create a conda environment
+
+The package requires Python `>=3.10, <3.13`. Python 3.13+ is not currently supported.
+```bash
+conda create -n malariagen python=3.11
+conda activate malariagen
+```
+
+## 3. Fork and clone this repo
+
+Fork the repository on GitHub, then clone your fork:
+```bash
+git clone git@github.com:[username]/malariagen-data-python.git
+cd malariagen-data-python
+pip install -e ".[dev]"
+```
+
+## 4. Install pre-commit hooks
+```bash
+pre-commit install
+```
+
+Run pre-commit checks manually:
+```bash
+pre-commit run --all-files
+```
+
+## 5. Run tests
+
+Run fast unit tests using simulated data:
+```bash
+pytest -v tests/anoph
+```
+
+## 6. Google Cloud authentication (for legacy tests)
+
+To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+Once access has been granted, install the Google Cloud CLI:
+```bash
+brew install google-cloud-sdk
+```
+
+Then authenticate:
+```bash
+gcloud auth application-default login
+```
+
+This opens a browser — log in with any Google account.
+
+Once authenticated, run legacy tests:
+```bash
+pytest --ignore=tests/anoph -v tests
+```
+
+Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.
+
+## 7. VS Code terminal integration
+
+To use the `code` command from the terminal:
+
+Open VS Code → `Cmd + Shift + P` → type `Shell Command: Install 'code' command in PATH` → press Enter.
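
The macOS guide states that the package requires Python `>=3.10, <3.13` and suggests `uname -m` to identify the architecture. As an illustration only, both checks can be made from inside the activated environment. The names `MIN_VERSION`, `MAX_VERSION` and `is_supported` below are hypothetical, and the bounds are copied from the guide rather than read from the package metadata.

```python
import platform
import sys
from typing import Tuple

# Bounds copied from the setup guide; assumed, not read from pyproject.toml.
MIN_VERSION = (3, 10)  # inclusive lower bound
MAX_VERSION = (3, 13)  # exclusive upper bound


def is_supported(version: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) version falls within the stated range."""
    return MIN_VERSION <= version < MAX_VERSION


if __name__ == "__main__":
    current = sys.version_info[:2]
    # platform.machine() reports the same string as `uname -m` on macOS and Linux.
    print(f"Python {current[0]}.{current[1]} on {platform.machine()}")
    print("supported" if is_supported(current) else "not supported")
```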

README.md

Lines changed: 4 additions & 1 deletion
@@ -44,8 +44,11 @@ for release notes.
 
 ## Developer setup
 
-To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
+To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A).
 
+For detailed setup instructions, see:
+- [Linux setup guide](LINUX_SETUP.md)
+- [macOS setup guide](MACOS_SETUP.md)
 Detailed instructions can be found in the [Contributors guide](https://github.com/malariagen/malariagen-data-python/blob/master/CONTRIBUTING.md).
 
 ## AI use policy and guidelines

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ Some data from MalariaGEN are subject to **terms of use** which may include an e
 public communication of any analysis results without permission from data owners. If you
 have any questions about terms of use please email support@malariagen.net.
 
-By default, this sofware package accesses data directly from the **MalariaGEN cloud data repository**
+By default, this software package accesses data directly from the **MalariaGEN cloud data repository**
 hosted in Google Cloud Storage in the US. Note that data access will be more efficient if your
 computations are also run within the same region. Google Colab provides a convenient and free
 service which you can use to explore data and run computations. If you have any questions about

malariagen_data/anoph/frq_base.py

Lines changed: 53 additions & 40 deletions
@@ -396,39 +396,50 @@ def plot_frequencies_time_series(
         # Extract variant labels.
         variant_labels = ds["variant_label"].values
 
+        # Check if CI variables are available.
+        has_ci = "event_frequency_ci_low" in ds
+
         # Build a long-form dataframe from the dataset.
         dfs = []
         for cohort in df_cohorts.itertuples():
             ds_cohort = ds.isel(cohorts=cohort.Index)
-            df = pd.DataFrame(
-                {
-                    "taxon": cohort.taxon,
-                    "area": cohort.area,
-                    "date": cohort.period_start,
-                    "period": str(
-                        cohort.period
-                    ),  # use string representation for hover label
-                    "sample_size": cohort.size,
-                    "variant": variant_labels,
-                    "count": ds_cohort["event_count"].values,
-                    "nobs": ds_cohort["event_nobs"].values,
-                    "frequency": ds_cohort["event_frequency"].values,
-                    "frequency_ci_low": ds_cohort["event_frequency_ci_low"].values,
-                    "frequency_ci_upp": ds_cohort["event_frequency_ci_upp"].values,
-                }
-            )
+            cohort_data = {
+                "taxon": cohort.taxon,
+                "area": cohort.area,
+                "date": cohort.period_start,
+                "period": str(
+                    cohort.period
+                ),  # use string representation for hover label
+                "sample_size": cohort.size,
+                "variant": variant_labels,
+                "count": ds_cohort["event_count"].values,
+                "nobs": ds_cohort["event_nobs"].values,
+                "frequency": ds_cohort["event_frequency"].values,
+            }
+            if has_ci:
+                cohort_data["frequency_ci_low"] = ds_cohort[
+                    "event_frequency_ci_low"
+                ].values
+                cohort_data["frequency_ci_upp"] = ds_cohort[
+                    "event_frequency_ci_upp"
+                ].values
+            df = pd.DataFrame(cohort_data)
             dfs.append(df)
         df_events = pd.concat(dfs, axis=0).reset_index(drop=True)
 
         # Remove events with no observations.
         df_events = df_events.query("nobs > 0").copy()
 
-        # Calculate error bars.
-        frq = df_events["frequency"]
-        frq_ci_low = df_events["frequency_ci_low"]
-        frq_ci_upp = df_events["frequency_ci_upp"]
-        df_events["frequency_error"] = frq_ci_upp - frq
-        df_events["frequency_error_minus"] = frq - frq_ci_low
+        # Calculate error bars if CI data is available.
+        error_y_args = {}
+        if has_ci:
+            frq = df_events["frequency"]
+            frq_ci_low = df_events["frequency_ci_low"]
+            frq_ci_upp = df_events["frequency_ci_upp"]
+            df_events["frequency_error"] = frq_ci_upp - frq
+            df_events["frequency_error_minus"] = frq - frq_ci_low
+            error_y_args["error_y"] = "frequency_error"
+            error_y_args["error_y_minus"] = "frequency_error_minus"
 
         # Make a plot.
         fig = px.line(
@@ -437,8 +448,7 @@ def plot_frequencies_time_series(
             facet_row="area",
             x="date",
             y="frequency",
-            error_y="frequency_error",
-            error_y_minus="frequency_error_minus",
+            **error_y_args,
             color="variant",
             markers=True,
             hover_name="variant",
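
The two hunks above replace hard-coded `error_y` arguments with a dictionary that is populated only when confidence-interval data exists, then unpacked into the plotting call. A minimal sketch of that pattern in isolation follows. `error_bar_kwargs` and `fake_line` are hypothetical names, and `fake_line` is a stand-in that only echoes its arguments, not the real `px.line`. Unlike the diff, which tests for the low bound only, this sketch requires both bounds.

```python
from typing import Dict, Iterable


def error_bar_kwargs(columns: Iterable[str]) -> Dict[str, str]:
    """Return error-bar keyword arguments only when both CI columns exist."""
    cols = set(columns)
    if {"frequency_ci_low", "frequency_ci_upp"} <= cols:
        return {
            "error_y": "frequency_error",
            "error_y_minus": "frequency_error_minus",
        }
    # An empty dict unpacks to no keyword arguments at all.
    return {}


def fake_line(x: str, y: str, **kwargs: str) -> Dict[str, str]:
    """Hypothetical stand-in for a plotting call; it echoes what it received."""
    return {"x": x, "y": y, **kwargs}


with_ci = fake_line(
    x="date",
    y="frequency",
    **error_bar_kwargs(["frequency", "frequency_ci_low", "frequency_ci_upp"]),
)
without_ci = fake_line(x="date", y="frequency", **error_bar_kwargs(["frequency"]))
```

Unpacking an empty dictionary means the call site needs no branching: the optional arguments are simply absent when the data is.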
@@ -518,19 +528,19 @@ def plot_frequencies_map_markers(
             variant_label = variant
 
         # Convert to a dataframe for convenience.
-        df_markers = ds_variant[
-            [
-                "cohort_taxon",
-                "cohort_area",
-                "cohort_period",
-                "cohort_lat_mean",
-                "cohort_lon_mean",
-                "cohort_size",
-                "event_frequency",
-                "event_frequency_ci_low",
-                "event_frequency_ci_upp",
-            ]
-        ].to_dataframe()
+        cols = [
+            "cohort_taxon",
+            "cohort_area",
+            "cohort_period",
+            "cohort_lat_mean",
+            "cohort_lon_mean",
+            "cohort_size",
+            "event_frequency",
+        ]
+        has_ci = "event_frequency_ci_low" in ds
+        if has_ci:
+            cols += ["event_frequency_ci_low", "event_frequency_ci_upp"]
+        df_markers = ds_variant[cols].to_dataframe()
 
         # Select data matching taxon and period parameters.
         df_markers = df_markers.loc[
@@ -560,8 +570,11 @@ def plot_frequencies_map_markers(
                 Area: {x.cohort_area} <br/>
                 Period: {x.cohort_period} <br/>
                 Sample size: {x.cohort_size} <br/>
-                Frequency: {x.event_frequency:.0%}
-                (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})
+                Frequency: {x.event_frequency:.0%}"""
+            if has_ci:
+                popup_html += f"""
+                (95% CI: {x.event_frequency_ci_low:.0%} - {x.event_frequency_ci_upp:.0%})"""
+            popup_html += """
             """
             marker.popup = ipyleaflet.Popup(
                 child=ipywidgets.HTML(popup_html),
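
The popup change above appends the confidence-interval text only when the interval is available. A reduced sketch of the same idea follows. The function name `popup_text` is hypothetical and the HTML markup is omitted for brevity.

```python
from typing import Optional, Tuple


def popup_text(frequency: float, ci: Optional[Tuple[float, float]] = None) -> str:
    """Format a frequency, appending the 95% CI only when one is supplied."""
    text = f"Frequency: {frequency:.0%}"
    if ci is not None:
        low, upp = ci
        text += f" (95% CI: {low:.0%} - {upp:.0%})"
    return text


print(popup_text(0.25))              # Frequency: 25%
print(popup_text(0.25, (0.1, 0.4)))  # Frequency: 25% (95% CI: 10% - 40%)
```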

malariagen_data/anoph/snp_frq.py

Lines changed: 5 additions & 5 deletions
@@ -125,15 +125,15 @@ def snp_effects(
             A dataframe of SNP allele frequencies, one row per variant allele. The variant alleles are indexed by
             their contig, their position, the reference allele, the alternate allele and the associated amino acid change.
             The columns are split into three categories: there is one column for each taxon filter (e.g., pass_funestus, pass_gamb_colu, ...) containing whether the site of the variant allele passes the filter;
-            there is then 1 column for each cohort containing the frequency of the variant allele within the cohort, additionally there is a column `max_af` containing the maximum allele frequency of the variant allele accross all cohorts;
+            there is then 1 column for each cohort containing the frequency of the variant allele within the cohort, additionally there is a column `max_af` containing the maximum allele frequency of the variant allele across all cohorts;
             finally, there are 9 columns describing the variant allele: `transcript` contains the gene transcript used for this analysis,
             `effect` is the effect of the allele change,
             `impact`is the impact of the allele change,
             `ref_codon` is the reference codon,
             `alt_codon` is the altered codon with the variant allele,
             `aa_pos` is the position of the amino acid,
             `ref_aa` is the reference amino acid,
-            `alt_aa` is the altered amino acid with the varaint allele,
+            `alt_aa` is the altered amino acid with the variant allele,
             and `label` is the label of the variant allele.
         """,
         notes="""
@@ -296,15 +296,15 @@ def snp_allele_frequencies(
         returns="""
             A dataframe of amino acid allele frequencies, one row per variant. The variants are indexed by
             their amino acid change, their contig, their position.
-            The columns are split into two categories: there is 1 column for each cohort containing the frequency of the amino acid change within the cohort, additionally there is a column `max_af` containing the maximum frequency of the amino acide change accross all cohorts;
+            The columns are split into two categories: there is 1 column for each cohort containing the frequency of the amino acid change within the cohort, additionally there is a column `max_af` containing the maximum frequency of the amino acid change across all cohorts;
             finally, there are 9 columns describing the variant allele: `transcript` contains the gene transcript used for this analysis,
             `effect` is the effect of the allele change,
             `impact`is the impact of the allele change,
-            `ref_allele` is the reference allel,
+            `ref_allele` is the reference allele,
             `alt_allele` is the alternate allele,
             `aa_pos` is the position of the amino acid,
             `ref_aa` is the reference amino acid,
-            `alt_aa` is the altered amino acid with the varaint allele,
+            `alt_aa` is the altered amino acid with the variant allele,
             and `label` is the label of the variant allele.
         """,
         notes="""

malariagen_data/anopheles.py

Lines changed: 1 addition & 1 deletion
@@ -727,7 +727,7 @@ def ihs_gwss(
     ) -> Tuple[np.ndarray, np.ndarray]:
         # change this name if you ever change the behaviour of this function, to
         # invalidate any previously cached data
-        name = "roh"
+        name = self._ihs_gwss_cache_name
 
         params = dict(
             contig=contig,
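
The change above swaps the hard-coded cache name `"roh"` for a dedicated attribute. The diff does not state the motivation, but one plausible reading is that a name shared with another analysis could return the wrong cached result. The toy cache below is a sketch of that failure mode under that assumption. `ResultsCache` and `get_or_compute` are hypothetical and do not reflect how the package actually stores results.

```python
from typing import Any, Callable, Dict, Tuple

CacheKey = Tuple[str, Tuple[Tuple[str, Any], ...]]


class ResultsCache:
    """Toy in-memory cache keyed by (name, sorted parameters)."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, Any] = {}

    def get_or_compute(
        self, name: str, params: Dict[str, Any], compute: Callable[[], Any]
    ) -> Any:
        key = (name, tuple(sorted(params.items())))
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


cache = ResultsCache()
params = {"contig": "3L"}

roh = cache.get_or_compute("roh", params, lambda: "roh result")
# Reusing the name "roh" for a different analysis hits the existing entry.
collided = cache.get_or_compute("roh", params, lambda: "ihs result")
# A distinct name keeps the two analyses apart.
ihs = cache.get_or_compute("ihs_gwss", params, lambda: "ihs result")
```

Here `collided` holds the stale `"roh result"`, while `ihs` holds the value that was actually requested.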

malariagen_data/util.py

Lines changed: 6 additions & 0 deletions
@@ -899,6 +899,12 @@ def __init__(
             handler = logging.FileHandler(out)
         self._handler = handler
 
+        # Remove any pre-existing handlers from the singleton logger to prevent
+        # accumulation (and FileHandler FD leaks) on repeated instantiation.
+        for existing_handler in logger.handlers[:]:
+            logger.removeHandler(existing_handler)
+            existing_handler.close()
+
         # configure handler
         if handler is not None:
             if debug:
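
The handler cleanup above relies on `logging.getLogger(name)` returning the same logger object for a given name, so handlers added on each instantiation accumulate unless they are removed. The sketch below reproduces that behaviour with the standard library only. The function `attach_handler` and the logger names are hypothetical, and `StringIO` streams stand in for the files used in the real code.

```python
import io
import logging


def attach_handler(
    name: str, stream: io.StringIO, clear_existing: bool
) -> logging.Logger:
    """Attach a stream handler to a named logger, optionally clearing old ones."""
    logger = logging.getLogger(name)  # same object on every call with this name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    if clear_existing:
        # Iterate over a copy, because removeHandler mutates logger.handlers.
        for existing in logger.handlers[:]:
            logger.removeHandler(existing)
            existing.close()
    logger.addHandler(logging.StreamHandler(stream))
    return logger


# Without cleanup, each call leaves another handler on the shared logger.
for _ in range(3):
    leaky = attach_handler("sketch.leaky", io.StringIO(), clear_existing=False)

# With cleanup, only the most recent handler remains.
for _ in range(3):
    clean = attach_handler("sketch.clean", io.StringIO(), clear_existing=True)

print(len(leaky.handlers), len(clean.handlers))  # 3 1
```

With a `FileHandler`, each leftover handler would also hold an open file descriptor, which is the leak the commit comment refers to.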
