Skip to content

Commit 90a69d3

Browse files
authored
Merge branch 'master' into fix-fst-y-range-721
2 parents 8d18b59 + 509ad40 commit 90a69d3

10 files changed

Lines changed: 113 additions & 97 deletions

File tree

AI-POLICY.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# AI use policy and guidelines
2+
3+
The goal of the MalariaGEN data API is to make access, use, and interpretation of the genomic data collected by our partners as easy and intuitive as possible. Maintainers have limited time and attention to focus on reviews, which means that each review request has to be for code that you can be proud of.
4+
5+
Any tool that can help produce better code and understand better the existing codebase, including AI tools, can be used. The only key questions are: “Is this an improvement?” and “Why is the code better now?”.
6+
7+
NEVER submit an AI-generated PR if you are not able to understand and explain the changes and why they matter. Maintainers WILL close PRs without reviewing them if they feel like they are a waste of time.
8+
9+
## Using AI as a coding assistant
10+
11+
1. Understanding and familiarising yourself with the codebase is key. No matter how good the AI code assistant, it will return useless code if you do not provide a smart and accurate enough prompt.
12+
2. Always check that your changes make sense. LLMs are terrible at saying no to a prompt and will lie and make false claims if they can’t do otherwise. It is particularly true if they lack key information.
13+
3. Each commit should be its own piece of coherent change. LLMs like to do everything at once but digestible change is easier to understand and process.
14+
4. Commenting your code is important, but LLMs really like to listen to themselves talking and will be very verbose. A small comment explaining why you made a choice is better than a paragraph explaining how a loop iterates through a list.
15+
16+
## Using AI for communication
17+
18+
As noted above, maintainers have a limited amount of time to spend on malariaGEN data API maintenance and do not want to waste it going through long, sloppy PR descriptions of simple issue. We strongly prefer clear and concise communication, even if it means we have to ask questions when more details are needed.
19+
20+
You are responsible for your own PRs and comments. Even if you use an LLM to write a PR description or comment, you are expected to read through everything and make sure that it accurately and concisely reflects your opinions, ideas and contributions. If reading your own PRs and comments is too much work for you, it is going to be the same for everyone else.
21+
Here are some concrete guidelines for using AI as part of your communication toolbox.
22+
23+
1. In general, the question that needs answering is why not what. Maintainers can see the files and lines of codes that were modified, what they will want to know is the reasoning behind the choices. Sadly, LLMs are not great at explaining their reasoning so you probably will have to chip in.
24+
2. In the same way, if you are responding to a comment or a review, you will need to justify your choice and explain how you made the decision.
25+
3. Make sure that the description of your work is accurate. Errors can happen but it is fairly obvious when an LLM claims more than it delivers.
26+
4. We are aware that English is not everyone’s first language. The grammar of your communications isn’t as important as the quality of your contribution. Feel free to use AI to improve your writing style but make sure that you still understand the message, that its content is conserved and that it doesn’t turn into an epic poem.
27+
5. Maintainers are more interested in your ideas and thoughts than in the standard answer provided by an LLM. We work with genomic data, and contributors are not expected to be experts in computer science, software engineering, genomics, entomology, … You are allowed not to know or not to be sure and it is miles better to say so than it is to regurgitate an answer that you do not understand.

README.md

Lines changed: 3 additions & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -46,88 +46,11 @@ for release notes.
4646

4747
To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
4848

49-
Fork and clone this repo:
49+
Detailed instructions can be found in the [Contributors guide](https://github.com/malariagen/malariagen-data-python/blob/master/CONTRIBUTING.md).
5050

51-
```bash
52-
git clone git@github.com:[username]/malariagen-data-python.git
53-
```
54-
55-
Install Python, e.g.:
56-
57-
```bash
58-
sudo add-apt-repository ppa:deadsnakes/ppa
59-
sudo apt install python3.10 python3.10-venv
60-
```
61-
62-
Install pipx, e.g.:
63-
64-
```bash
65-
python3.10 -m pip install --user pipx
66-
python3.10 -m pipx ensurepath
67-
```
68-
69-
Install [poetry](https://python-poetry.org/docs/#installation), e.g.:
70-
71-
```bash
72-
pipx install poetry
73-
```
74-
75-
Create development environment:
76-
77-
```bash
78-
cd malariagen-data-python
79-
poetry use 3.10
80-
poetry install
81-
```
82-
83-
Activate development environment:
84-
85-
```bash
86-
poetry shell
87-
```
88-
89-
Install pre-commit and pre-commit hooks:
90-
91-
```bash
92-
pipx install pre-commit
93-
pre-commit install
94-
```
95-
96-
Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:
97-
98-
```bash
99-
pre-commit run --all-files
100-
```
101-
102-
Run fast unit tests using simulated data:
103-
104-
```bash
105-
poetry run pytest -v tests/anoph
106-
```
107-
108-
To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
109-
110-
Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
111-
112-
```bash
113-
./install_gcloud.sh
114-
```
115-
116-
You'll then need to obtain application-default credentials, e.g.:
117-
118-
```bash
119-
./google-cloud-sdk/bin/gcloud auth application-default login
120-
```
121-
122-
Once this is done, you can run legacy tests:
123-
124-
```bash
125-
poetry run pytest --ignore=tests/anoph -v tests
126-
```
51+
## AI use policy and guidelines
12752

128-
Tests will run slowly the first time, as data required for testing
129-
will be read from GCS. Subsequent runs will be faster as data will be
130-
cached locally in the "gcs_cache" folder.
53+
See [AI use policy and guidelines](https://github.com/malariagen/malariagen-data-python/blob/master/AI-POLICY.md) for more details.
13154

13255
## Release process
13356

malariagen_data/amin1.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -105,7 +105,7 @@ def __init__(
105105
site_filters_analysis=site_filters_analysis,
106106
discordant_read_calls_analysis=discordant_read_calls_analysis,
107107
default_site_mask="minimus",
108-
default_phasing_analysis="minimus_noneyet",
108+
default_phasing_analysis=None,
109109
default_coverage_calls_analysis="minimus_noneyet",
110110
bokeh_output_notebook=bokeh_output_notebook,
111111
results_cache=results_cache,

malariagen_data/anoph/frq_base.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -276,7 +276,7 @@ def plot_frequencies_heatmap(
276276
heatmap_df.set_index(index_col, inplace=True)
277277

278278
# Clean column names.
279-
heatmap_df.columns = heatmap_df.columns.str.lstrip("frq_")
279+
heatmap_df.columns = heatmap_df.columns.str.removeprefix("frq_")
280280

281281
# Deal with width and height.
282282
if width is None:
@@ -385,8 +385,8 @@ def plot_frequencies_time_series(
385385

386386
# Build a long-form dataframe from the dataset.
387387
dfs = []
388-
for cohort_index, cohort in enumerate(df_cohorts.itertuples()):
389-
ds_cohort = ds.isel(cohorts=cohort_index)
388+
for cohort in df_cohorts.itertuples():
389+
ds_cohort = ds.isel(cohorts=cohort.Index)
390390
df = pd.DataFrame(
391391
{
392392
"taxon": cohort.taxon,

malariagen_data/anoph/g123.py

Lines changed: 10 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -153,7 +153,7 @@ def g123_gwss(
153153
self,
154154
contig: base_params.contig,
155155
window_size: g123_params.window_size,
156-
sites: g123_params.sites = base_params.DEFAULT,
156+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
157157
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
158158
sample_sets: Optional[base_params.sample_sets] = None,
159159
sample_query: Optional[base_params.sample_query] = None,
@@ -172,9 +172,6 @@ def g123_gwss(
172172
# invalidate any previously cached data.
173173
name = "g123_gwss_v1"
174174

175-
if sites == base_params.DEFAULT:
176-
assert self._default_phasing_analysis is not None
177-
sites = self._default_phasing_analysis
178175
valid_sites = self.phasing_analysis_ids + ("all", "segregating")
179176
if sites not in valid_sites:
180177
raise ValueError(
@@ -259,7 +256,7 @@ def _g123_calibration(
259256
def g123_calibration(
260257
self,
261258
contig: base_params.contig,
262-
sites: g123_params.sites = base_params.DEFAULT,
259+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
263260
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
264261
sample_query: Optional[base_params.sample_query] = None,
265262
sample_query_options: Optional[base_params.sample_query_options] = None,
@@ -279,6 +276,12 @@ def g123_calibration(
279276
# invalidate any previously cached data.
280277
name = "g123_calibration_v1"
281278

279+
valid_sites = self.phasing_analysis_ids + ("all", "segregating")
280+
if sites not in valid_sites:
281+
raise ValueError(
282+
f"Invalid value for `sites` parameter, must be one of {valid_sites}."
283+
)
284+
282285
params = dict(
283286
contig=contig,
284287
sites=sites,
@@ -314,7 +317,7 @@ def plot_g123_gwss_track(
314317
self,
315318
contig: base_params.contig,
316319
window_size: g123_params.window_size,
317-
sites: g123_params.sites = base_params.DEFAULT,
320+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
318321
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
319322
sample_sets: Optional[base_params.sample_sets] = None,
320323
sample_query: Optional[base_params.sample_query] = None,
@@ -417,7 +420,7 @@ def plot_g123_gwss(
417420
self,
418421
contig: base_params.contig,
419422
window_size: g123_params.window_size,
420-
sites: g123_params.sites = base_params.DEFAULT,
423+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
421424
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
422425
sample_sets: Optional[base_params.sample_sets] = None,
423426
sample_query: Optional[base_params.sample_query] = None,

malariagen_data/anoph/g123_params.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,8 @@
3434
""",
3535
]
3636

37+
DEFAULT_SITE_PARAMETER: sites = "segregating"
38+
3739
min_cohort_size_default: base_params.min_cohort_size = 20
3840

3941
max_cohort_size_default: base_params.max_cohort_size = 50

malariagen_data/anoph/genome_features.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -552,5 +552,5 @@ def _transcript_to_parent_name(self, transcript):
552552
return self._gene_name_overrides[parent_id]
553553
except KeyError:
554554
rec_parent = df_genome_features.loc[parent_id]
555-
# Try to access "Name" attribute, fall back to "ID" if not present.
556-
return rec_parent.get("Name", parent_id)
555+
# Try to access gene name attribute, fall back to "ID" if not present.
556+
return rec_parent.get(self._gff_gene_name_attribute, parent_id)

tests/anoph/conftest.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2464,7 +2464,7 @@ def init_config(self):
24642464
"DEFAULT_COHORTS_ANALYSIS": "20250710",
24652465
"DEFAULT_DISCORDANT_READ_CALLS_ANALYSIS": "",
24662466
"SITE_MASK_IDS": ["dirus"],
2467-
"PHASING_ANALYSIS_IDS": ["dirus_noneyet"],
2467+
"PHASING_ANALYSIS_IDS": [],
24682468
}
24692469
config_path = self.bucket_path / "v1.0-config.json"
24702470
with config_path.open(mode="w") as f:
@@ -2786,7 +2786,7 @@ def init_config(self):
27862786
"DEFAULT_COHORTS_ANALYSIS": "20251019",
27872787
"DEFAULT_DISCORDANT_READ_CALLS_ANALYSIS": "",
27882788
"SITE_MASK_IDS": ["minimus"],
2789-
"PHASING_ANALYSIS_IDS": ["minimus_noneyet"],
2789+
"PHASING_ANALYSIS_IDS": [],
27902790
}
27912791
config_path = self.bucket_path / "v1.0-config.json"
27922792
with config_path.open(mode="w") as f:

tests/anoph/test_g123.py

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,8 @@
66

77
from malariagen_data import af1 as _af1
88
from malariagen_data import ag3 as _ag3
9+
from malariagen_data import adir1 as _adir1
10+
from malariagen_data import amin1 as _amin1
911
from malariagen_data.anoph.g123 import AnophelesG123Analysis
1012

1113

@@ -56,6 +58,44 @@ def af1_sim_api(af1_sim_fixture):
5658
)
5759

5860

61+
@pytest.fixture
62+
def adir1_sim_api(adir1_sim_fixture):
63+
return AnophelesG123Analysis(
64+
url=adir1_sim_fixture.url,
65+
public_url=adir1_sim_fixture.url,
66+
config_path=_adir1.CONFIG_PATH,
67+
major_version_number=_adir1.MAJOR_VERSION_NUMBER,
68+
major_version_path=_adir1.MAJOR_VERSION_PATH,
69+
pre=False,
70+
gff_gene_type="protein_coding_gene",
71+
gff_gene_name_attribute="Note",
72+
gff_default_attributes=("ID", "Parent", "Note", "description"),
73+
default_site_mask="dirus",
74+
results_cache=adir1_sim_fixture.results_cache_path.as_posix(),
75+
taxon_colors=_adir1.TAXON_COLORS,
76+
default_phasing_analysis=None,
77+
)
78+
79+
80+
@pytest.fixture
81+
def amin1_sim_api(amin1_sim_fixture):
82+
return AnophelesG123Analysis(
83+
url=amin1_sim_fixture.url,
84+
public_url=amin1_sim_fixture.url,
85+
config_path=_amin1.CONFIG_PATH,
86+
major_version_number=_amin1.MAJOR_VERSION_NUMBER,
87+
major_version_path=_amin1.MAJOR_VERSION_PATH,
88+
pre=False,
89+
gff_gene_type="protein_coding_gene",
90+
gff_gene_name_attribute="Note",
91+
gff_default_attributes=("ID", "Parent", "Note", "description"),
92+
default_site_mask="minimus",
93+
results_cache=amin1_sim_fixture.results_cache_path.as_posix(),
94+
taxon_colors=_amin1.TAXON_COLORS,
95+
default_phasing_analysis=None,
96+
)
97+
98+
5999
# N.B., here we use pytest_cases to parametrize tests. Each
60100
# function whose name begins with "case_" defines a set of
61101
# inputs to the test functions. See the documentation for
@@ -76,6 +116,14 @@ def case_af1_sim(af1_sim_fixture, af1_sim_api):
76116
return af1_sim_fixture, af1_sim_api
77117

78118

119+
def case_adir1_sim(adir1_sim_fixture, adir1_sim_api):
120+
return adir1_sim_fixture, adir1_sim_api
121+
122+
123+
def case_amin1_sim(amin1_sim_fixture, amin1_sim_api):
124+
return amin1_sim_fixture, amin1_sim_api
125+
126+
79127
def check_g123_gwss(*, api, g123_params):
80128
# Run main gwss function under test.
81129
x, g123 = api.g123_gwss(**g123_params)
@@ -115,6 +163,10 @@ def test_g123_gwss_with_default_sites(fixture, api: AnophelesG123Analysis):
115163

116164
@parametrize_with_cases("fixture,api", cases=".")
117165
def test_g123_gwss_with_phased_sites(fixture, api: AnophelesG123Analysis):
166+
# Skip if this dataset has no phasing analyses (e.g., Adir1, Amin1).
167+
if not api.phasing_analysis_ids:
168+
pytest.skip("No phasing analyses available for this dataset.")
169+
118170
# Set up test parameters.
119171
all_sample_sets = api.sample_sets()["sample_set"].to_list()
120172
g123_params = dict(
@@ -182,6 +234,10 @@ def test_g123_gwss_with_bad_sites(fixture, api: AnophelesG123Analysis):
182234

183235
@parametrize_with_cases("fixture,api", cases=".")
184236
def test_g123_calibration(fixture, api: AnophelesG123Analysis):
237+
# Skip if this dataset has no phasing analyses (e.g., Adir1, Amin1).
238+
if not api.phasing_analysis_ids:
239+
pytest.skip("No phasing analyses available for this dataset.")
240+
185241
# Set up test parameters.
186242
all_sample_sets = api.sample_sets()["sample_set"].to_list()
187243
window_sizes = np.random.randint(100, 500, size=random.randint(2, 5)).tolist()

tests/anoph/test_genome_features.py

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -169,7 +169,9 @@ def test_plot_genes_with_gene_labels(fixture, api: AnophelesGenomeFeaturesData):
169169
# For each contig in the fixture...
170170
for contig in fixture.contigs:
171171
# Get the genes for this contig.
172-
genes_df = api.genome_features(region=contig).query("type == 'gene'")
172+
genes_df = api.genome_features(region=contig).query(
173+
f"type == '{api._gff_gene_type}'"
174+
)
173175

174176
# If there are no genes, we cannot label them.
175177
if not genes_df.empty:
@@ -181,7 +183,10 @@ def test_plot_genes_with_gene_labels(fixture, api: AnophelesGenomeFeaturesData):
181183

182184
# Put the random gene "ID" and its "Name" in a dictionary.
183185
random_gene_labels = dict(
184-
zip(random_sample_genes_df["ID"], random_sample_genes_df["Name"])
186+
zip(
187+
random_sample_genes_df["ID"],
188+
random_sample_genes_df[api._gff_gene_name_attribute],
189+
)
185190
)
186191

187192
# Check that we get a Bokeh figure from plot_genes() with these gene_labels.

0 commit comments

Comments
 (0)