Commit fa8d2a7

Merge branch 'master' into GH391_add_params_to_allele_frequencies_advanced
2 parents 57c7cd3 + a39a3c1 commit fa8d2a7

39 files changed, with 5593 additions and 2140 deletions

README.md

Lines changed: 21 additions & 9 deletions
@@ -35,8 +35,7 @@ for release notes.
 
 ## Developer setup
 
-To get setup for development, see [this
-video](https://youtu.be/QniQi-Hoo9A) and the instructions below.
+To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
 
 Fork and clone this repo:
 
@@ -48,27 +47,27 @@ Install Python, e.g.:
 
 ```bash
 sudo add-apt-repository ppa:deadsnakes/ppa
-sudo apt install python3.9 python3.9-venv
+sudo apt install python3.10 python3.10-venv
 ```
 
 Install pipx, e.g.:
 
 ```bash
-python3.9 -m pip install --user pipx
-python3.9 -m pipx ensurepath
+python3.10 -m pip install --user pipx
+python3.10 -m pipx ensurepath
 ```
 
 Install [poetry](https://python-poetry.org/docs/#installation), e.g.:
 
 ```bash
-pipx install poetry==1.8.2 --python=/usr/bin/python3.9
+pipx install poetry
 ```
 
 Create development environment:
 
 ```bash
 cd malariagen-data-python
-poetry use 3.9
+poetry use 3.10
 poetry install
 ```
 
@@ -81,7 +80,7 @@ poetry shell
 Install pre-commit and pre-commit hooks:
 
 ```bash
-pipx install pre-commit --python=/usr/bin/python3.9
+pipx install pre-commit
 pre-commit install
 ```
 
@@ -97,7 +96,9 @@ Run fast unit tests using simulated data:
 poetry run pytest -v tests/anoph
 ```
 
-To run legacy tests which read data from GCS, you'll need to [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
+To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
+
+Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
 
 ```bash
 ./install_gcloud.sh
@@ -128,6 +129,17 @@ the documentation via GitHub Actions.
 The version switcher for the documentation can then be updated by
 modifying the `docs/source/_static/switcher.json` file accordingly.
 
+## Citation
+
+If you use the `malariagen_data` package in a publication
+or include any of its functions or code in other materials (_e.g._ training resources),
+please cite: [doi.org/10.5281/zenodo.11173411](doi.org/10.5281/zenodo.11173411)
+
+Some functions may require additional citations to acknowledge specific contributions. These are indicated in the description for each relevant function.
+
+For any questions, please feel free to contact us at: [support@malariagen.net](mailto:support@malariagen.net)
+
+
 ## Sponsorship
 
 This project is currently supported by the following grants:

docs/source/_static/switcher.json

Lines changed: 46 additions & 11 deletions
@@ -1,47 +1,82 @@
 [
     {
-        "name": "13.0.0",
-        "version": "v13.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v13.0.0/",
+        "name": "15.0.1",
+        "version": "v15.0.1",
+        "url": "https://malariagen.github.io/malariagen-data-python/v15.0.1/",
         "preferred": true
     },
+    {
+        "name": "14.0.0",
+        "version": "v14.0.0",
+        "url": "https://malariagen.github.io/malariagen-data-python/v14.0.0/"
+    },
+    {
+        "name": "13.5.0",
+        "version": "v13.5.0",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.5.0/"
+    },
+    {
+        "name": "13.4.0",
+        "version": "v13.4.0",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.4.0/"
+    },
+    {
+        "name": "13.3.0",
+        "version": "v13.3.0",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.3.0/"
+    },
+    {
+        "name": "13.2.1",
+        "version": "v13.2.1",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.2.0/"
+    },
+    {
+        "name": "13.1.0",
+        "version": "v13.1.0",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.1.0/"
+    },
+    {
+        "name": "13.0.4",
+        "version": "v13.0.4",
+        "url": "https://malariagen.github.io/malariagen-data-python/v13.0.4/"
+    },
     {
         "name": "12.0.0",
         "version": "v12.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v12.0.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v12.0.0/"
     },
     {
         "name": "11.0.0",
         "version": "v11.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v11.0.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v11.0.0/"
     },
     {
         "name": "10.0.0",
         "version": "v10.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v10.0.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v10.0.0/"
     },
     {
         "name": "9.0.0",
         "version": "v9.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v9.0.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v9.0.0/"
     },
     {
         "name": "8.0.0",
         "version": "v8.0.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v8.0.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v8.0.0/"
     },
     {
         "name": "7.15.0",
         "version": "v7.15.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v7.15.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v7.15.0/"
     },
     {
         "name": "7.14.0",
         "version": "v7.14.0",
-        "url": "https:///malariagen.github.io/malariagen-data-python/v7.14.0/"
+        "url": "https://malariagen.github.io/malariagen-data-python/v7.14.0/"
     },
     {
         "version": "dev",
-        "url": "https:///malariagen.github.io/malariagen-data-python/latest/"
+        "url": "https://malariagen.github.io/malariagen-data-python/latest/"
     }
 ]
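
The released entries in this file appear to follow one pattern: `version` is `name` prefixed with `v`, and `url` ends with that version. Below is a minimal sketch of a consistency check built on that assumption. The helper name `check_switcher_entries` is hypothetical and not part of the repository, and the sample data is a three-entry excerpt rather than the full file.

```python
import json


def check_switcher_entries(entries):
    """Return a list of human-readable problems found in switcher entries."""
    problems = []
    for entry in entries:
        version = entry["version"]
        url = entry["url"]
        name = entry.get("name")
        # Assumption: a released entry's version is its name prefixed with "v".
        if name is not None and version != f"v{name}":
            problems.append(f"{name}: version is {version}")
        # A scheme followed by three slashes is the typo this diff removes.
        if ":///" in url:
            problems.append(f"{version}: malformed URL {url}")
        # Assumption: a released entry's URL path ends with its version.
        if name is not None and not url.rstrip("/").endswith(f"/{version}"):
            problems.append(f"{version}: URL points to {url}")
    return problems


sample = json.loads("""
[
    {"name": "13.3.0", "version": "v13.3.0",
     "url": "https://malariagen.github.io/malariagen-data-python/v13.3.0/"},
    {"name": "13.2.1", "version": "v13.2.1",
     "url": "https://malariagen.github.io/malariagen-data-python/v13.2.0/"},
    {"version": "dev",
     "url": "https://malariagen.github.io/malariagen-data-python/latest/"}
]
""")
problems = check_switcher_entries(sample)
for problem in problems:
    print(problem)
```

Under these assumptions the check reports the `13.2.1` entry, whose URL points at `v13.2.0`. The commit does not say whether that is intentional.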

malariagen_data/af1.py

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,7 @@
 MAJOR_VERSION_PATH = "v1.0"
 CONFIG_PATH = "v1.0-config.json"
 GCS_DEFAULT_URL = "gs://vo_afun_release_master_us_central1/"
+GCS_DEFAULT_PUBLIC_URL = "gs://vo_anoph_temp_us_central1/vo_afun_release/"
 GCS_REGION_URLS = {
     "us-central1": "gs://vo_afun_release_master_us_central1",
 }
@@ -78,6 +79,7 @@ class Af1(AnophelesDataResource):
     def __init__(
         self,
         url=None,
+        public_url=GCS_DEFAULT_PUBLIC_URL,
         bokeh_output_notebook=True,
         results_cache=None,
         log=sys.stdout,
@@ -93,6 +95,7 @@ def __init__(
     ):
         super().__init__(
             url=url,
+            public_url=public_url,
             config_path=CONFIG_PATH,
             cohorts_analysis=cohorts_analysis,
             aim_analysis=None,

malariagen_data/ag3.py

Lines changed: 3 additions & 0 deletions
@@ -15,6 +15,7 @@
 MAJOR_VERSION_PATH = "v3"
 CONFIG_PATH = "v3-config.json"
 GCS_DEFAULT_URL = "gs://vo_agam_release_master_us_central1/"
+GCS_DEFAULT_PUBLIC_URL = "gs://vo_anoph_temp_us_central1/vo_agam_release/"
 GCS_REGION_URLS = {
     "us-central1": "gs://vo_agam_release_master_us_central1",
 }
@@ -138,6 +139,7 @@ class Ag3(AnophelesDataResource):
     def __init__(
         self,
         url=None,
+        public_url=GCS_DEFAULT_PUBLIC_URL,
         bokeh_output_notebook=True,
         results_cache=None,
         log=sys.stdout,
@@ -154,6 +156,7 @@ def __init__(
     ):
         super().__init__(
             url=url,
+            public_url=public_url,
             config_path=CONFIG_PATH,
             cohorts_analysis=cohorts_analysis,
             aim_analysis=aim_analysis,

malariagen_data/anoph/base.py

Lines changed: 3 additions & 0 deletions
@@ -47,6 +47,7 @@ def __init__(
         self,
         *,
         url: str,
+        public_url: str,
         config_path: str,
         pre: bool,
         major_version_number: int,
@@ -117,6 +118,8 @@ def __init__(
             raise ValueError("A value for the `url` parameter must be provided.")
         del url
 
+        self._public_url = public_url
+
         # Set up fsspec filesystem. N.B., we use fsspec here to allow for
         # accessing different types of storage - fsspec will automatically
         # detect which type of storage to use based on the URL provided.
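
The constructor now keeps `public_url` on the instance as `self._public_url`, and the `igv.py` changes in this commit concatenate it directly with relative paths in f-strings. Below is a small sketch of that composition with slash normalisation added. The function name `build_public_url` and the example path are hypothetical; the base URL is the default added to `ag3.py`.

```python
def build_public_url(public_url, path):
    """Join a base URL and a relative path with exactly one slash between them."""
    # Normalising both sides makes the result independent of whether the
    # caller's base URL carries a trailing slash.
    return public_url.rstrip("/") + "/" + path.lstrip("/")


base = "gs://vo_anoph_temp_us_central1/vo_agam_release/"
# "reference/genome/example.fa" is an illustrative path, not a real one.
print(build_public_url(base, "reference/genome/example.fa"))
```

The defaults in `af1.py` and `ag3.py` end with a slash, so plain concatenation works for them. If the path attributes do not begin with a slash, a custom `public_url` without the trailing slash would produce a fused path.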

malariagen_data/anoph/dipclust.py

Lines changed: 10 additions & 2 deletions
@@ -44,7 +44,11 @@ def __init__(
 
     @check_types
     @doc(
-        summary="Hierarchically cluster diplotypes in region and produce an interactive plot.",
+        summary=""""
+        Hierarchically cluster diplotypes in region and produce an interactive plot.
+
+        If you use this function in a publication, please cite both this package and the original manuscript: doi.org/10.1093/molbev/msae140
+        """,
     )
     def plot_diplotype_clustering(
         self,
@@ -591,7 +595,11 @@ def _insert_dipclust_snp_trace(
         return figures, subplot_heights
 
     @doc(
-        summary="Perform diplotype clustering, annotated with heterozygosity, gene copy number and amino acid variants.",
+        summary=""""
+        Perform diplotype clustering, annotated with heterozygosity, gene copy number and amino acid variants.
+
+        If you use this function in a publication, please cite both this package and the original manuscript: doi.org/10.1093/molbev/msae140
+        """,
         parameters=dict(
             heterozygosity="Plot heterozygosity track.",
             snp_transcript="Plot amino acid variants for these transcripts.",
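
Both new `summary` values in this file open with four double-quote characters. In Python the first three start a triple-quoted string and the fourth becomes part of the text. Below is a small sketch of the effect, using a shortened stand-in string rather than the real summary.

```python
# Four quotes: a triple-quoted string whose first character is a literal ".
with_stray_quote = """"
Hierarchically cluster diplotypes.
"""

# Three quotes: the string starts with the newline instead.
without_stray_quote = """
Hierarchically cluster diplotypes.
"""

print(repr(with_stray_quote[:2]))
print(repr(without_stray_quote[:1]))
```

If the extra quote is unintended, it would appear at the start of the rendered summary. The commit does not say either way.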

malariagen_data/anoph/frq_base.py

Lines changed: 1 addition & 1 deletion
@@ -235,7 +235,7 @@ def plot_frequencies_heatmap(
 
         # Indexing.
         if index is None:
-            index = list(df.index.names)
+            index = [str(name) for name in df.index.names]
         df = df.reset_index().copy()
         if isinstance(index, list):
             index_col = (

malariagen_data/anoph/fst.py

Lines changed: 5 additions & 2 deletions
@@ -539,8 +539,11 @@ def plot_pairwise_average_fst(
             if annotation == "standard error":
                 fig_df.loc[cohort1, cohort2] = se
             elif annotation == "Z score":
-                zs = fst / se
-                fig_df.loc[cohort1, cohort2] = zs
+                try:
+                    zs = fst / se
+                    fig_df.loc[cohort1, cohort2] = zs
+                except ZeroDivisionError:
+                    fig_df.loc[cohort1, cohort2] = np.nan
             else:
                 fig_df.loc[cohort1, cohort2] = fst
 
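
The new `try`/`except` guards the Z score against a zero standard error. Below is a minimal sketch of the same idea for plain Python floats; the helper name `z_score_or_nan` is hypothetical. One caveat, which depends on how `fst` and `se` are typed upstream (not shown in this diff): if they are NumPy floating-point scalars, dividing by zero typically yields `inf` or `nan` with a warning rather than raising `ZeroDivisionError`. An explicit check on `se` covers both cases.

```python
import math


def z_score_or_nan(fst, se):
    """Return fst / se, or NaN when the standard error is zero or not finite."""
    # An explicit check avoids relying on ZeroDivisionError being raised.
    if se == 0 or not math.isfinite(se):
        return math.nan
    return fst / se


print(z_score_or_nan(0.12, 0.03))
print(z_score_or_nan(0.12, 0.0))
```

Returning NaN keeps the heatmap cell empty, in line with the `np.nan` assignment in the diff.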

malariagen_data/anoph/igv.py

Lines changed: 5 additions & 4 deletions
@@ -31,20 +31,21 @@ def _igv_config(
             "reference": {
                 "id": self._genome_ref_id,
                 "name": self._genome_ref_name,
-                "fastaURL": f"{self._url}{self._genome_fasta_path}",
-                "indexURL": f"{self._url}{self._genome_fai_path}",
+                "fastaURL": f"{self._public_url}{self._genome_fasta_path}",
+                "indexURL": f"{self._public_url}{self._genome_fai_path}",
                 "tracks": [
                     {
                         "name": "Genes",
                         "type": "annotation",
                         "format": "gff3",
-                        "url": f"{self._url}{self._geneset_gff3_path}",
+                        "url": f"{self._public_url}{self._geneset_gff3_path}",
                         "indexed": False,
                     }
                 ],
             },
             "locus": str(region),
         }
+
         if tracks:
             config["tracks"] = tracks
 
@@ -58,7 +59,7 @@ def _igv_site_filters_tracks(
     ):
         tracks = []
         for site_mask in self.site_mask_ids:
-            site_filters_vcf_url = f"{self._url}{self._major_version_path}/site_filters/{self._site_filters_analysis}/vcf/{site_mask}/{contig}_sitefilters.vcf.gz" # noqa
+            site_filters_vcf_url = f"{self._public_url}{self._major_version_path}/site_filters/{self._site_filters_analysis}/vcf/{site_mask}/{contig}_sitefilters.vcf.gz" # f"{self._url}{self._major_version_path}/site_filters/{self._site_filters_analysis}/vcf/{site_mask}/{contig}_sitefilters.vcf.gz" # noqa
             track_config = {
                 "name": f"Filters - {site_mask}",
                 "url": site_filters_vcf_url,

malariagen_data/anoph/sample_metadata.py

Lines changed: 33 additions & 4 deletions
@@ -1,6 +1,18 @@
 import io
 from itertools import cycle
-from typing import Any, Callable, Dict, List, Mapping, Optional, Sequence, Tuple, Union
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    List,
+    Mapping,
+    Optional,
+    Sequence,
+    Tuple,
+    Union,
+    Hashable,
+    cast,
+)
 
 import ipyleaflet  # type: ignore
 import numpy as np
@@ -39,7 +51,7 @@ def __init__(
         # data resources, and so column names and dtype need to be
         # passed in as parameters.
         self._aim_metadata_columns: Optional[List[str]] = None
-        self._aim_metadata_dtype: Dict[str, Any] = dict()
+        self._aim_metadata_dtype: Dict[str, Union[str, type, np.dtype]] = dict()
         if isinstance(aim_metadata_dtype, Mapping):
             self._aim_metadata_columns = list(aim_metadata_dtype.keys())
             self._aim_metadata_dtype.update(aim_metadata_dtype)
@@ -140,7 +152,19 @@ def _parse_general_metadata(
                 "longitude": "float64",
                 "sex_call": "object",
             }
-            df = pd.read_csv(io.BytesIO(data), dtype=dtype, na_values="")
+            # Mapping of string dtypes to actual dtypes
+            dtype_map = {
+                "object": str,
+                "int64": np.int64,
+                "float64": np.float64,
+            }
+
+            # Convert string dtypes to actual dtypes
+            dtype_fixed: Mapping[Hashable, Union[str, np.dtype, type]] = {
+                col: dtype_map.get(dtype[col], str) for col in dtype
+            }
+
+            df = pd.read_csv(io.BytesIO(data), dtype=dtype_fixed, na_values="")
 
             # Ensure all column names are lower case.
             df.columns = [c.lower() for c in df.columns]  # type: ignore
@@ -460,7 +484,12 @@ def _parse_aim_metadata(
         if isinstance(data, bytes):
             # Parse CSV data.
             df = pd.read_csv(
-                io.BytesIO(data), dtype=self._aim_metadata_dtype, na_values=""
+                io.BytesIO(data),
+                dtype=cast(
+                    Mapping[Hashable, Union[str, type, np.dtype]],
+                    self._aim_metadata_dtype,
+                ),
+                na_values="",
             )
 
             # Ensure all column names are lower case.
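
The `_parse_general_metadata` change in this file translates dtype names into type objects before calling `pd.read_csv`, falling back to `str` for any name it does not recognise. Below is a dependency-free sketch of that lookup. The built-in `int` and `float` stand in for `np.int64` and `np.float64`, purely to keep the example free of NumPy, and the `year` and `notes` columns are hypothetical.

```python
# Stand-in mapping; the diff maps to np.int64 and np.float64 instead.
dtype_map = {
    "object": str,
    "int64": int,
    "float64": float,
}

declared = {
    "longitude": "float64",
    "sex_call": "object",
    "year": "int64",  # hypothetical column
    "notes": "string",  # hypothetical column with a name missing from dtype_map
}

# Unknown dtype names fall back to str, mirroring dtype_map.get(..., str).
resolved = {col: dtype_map.get(name, str) for col, name in declared.items()}
print(resolved)
```

One consequence of the fallback is that a misspelled or unsupported dtype name is silently read as text rather than raising an error.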
