Skip to content

Commit f699ad2

Browse files
Jon BrenasJon Brenas
authored andcommitted
Merge branch 'master' into GH837-adar1
2 parents 3ceb185 + 147329a commit f699ad2

20 files changed

Lines changed: 462 additions & 187 deletions

AI-POLICY.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# AI use policy and guidelines
2+
3+
The goal of the MalariaGEN data API is to make access, use, and interpretation of the genomic data collected by our partners as easy and intuitive as possible. Maintainers have limited time and attention to focus on reviews, which means that each review request has to be for code that you can be proud of.
4+
5+
Any tool that can help produce better code and understand better the existing codebase, including AI tools, can be used. The only key questions are: “Is this an improvement?” and “Why is the code better now?”.
6+
7+
NEVER submit an AI-generated PR if you are not able to understand and explain the changes and why they matter. Maintainers WILL close PRs without reviewing them if they feel like they are a waste of time.
8+
9+
## Using AI as a coding assistant
10+
11+
1. Understanding and familiarising yourself with the codebase is key. No matter how good the AI code assistant, it will return useless code if you do not provide a smart and accurate enough prompt.
12+
2. Always check that your changes make sense. LLMs are terrible at saying no to a prompt and will lie and make false claims if they can’t do otherwise. It is particularly true if they lack key information.
13+
3. Each commit should be its own piece of coherent change. LLMs like to do everything at once but digestible change is easier to understand and process.
14+
4. Commenting your code is important, but LLMs really like to listen to themselves talking and will be very verbose. A small comment explaining why you made a choice is better than a paragraph explaining how a loop iterates through a list.
15+
16+
## Using AI for communication
17+
18+
As noted above, maintainers have a limited amount of time to spend on malariaGEN data API maintenance and do not want to waste it going through long, sloppy PR descriptions of simple issue. We strongly prefer clear and concise communication, even if it means we have to ask questions when more details are needed.
19+
20+
You are responsible for your own PRs and comments. Even if you use an LLM to write a PR description or comment, you are expected to read through everything and make sure that it accurately and concisely reflects your opinions, ideas and contributions. If reading your own PRs and comments is too much work for you, it is going to be the same for everyone else.
21+
Here are some concrete guidelines for using AI as part of your communication toolbox.
22+
23+
1. In general, the question that needs answering is why not what. Maintainers can see the files and lines of codes that were modified, what they will want to know is the reasoning behind the choices. Sadly, LLMs are not great at explaining their reasoning so you probably will have to chip in.
24+
2. In the same way, if you are responding to a comment or a review, you will need to justify your choice and explain how you made the decision.
25+
3. Make sure that the description of your work is accurate. Errors can happen but it is fairly obvious when an LLM claims more than it delivers.
26+
4. We are aware that English is not everyone’s first language. The grammar of your communications isn’t as important as the quality of your contribution. Feel free to use AI to improve your writing style but make sure that you still understand the message, that its content is conserved and that it doesn’t turn into an epic poem.
27+
5. Maintainers are more interested in your ideas and thoughts than in the standard answer provided by an LLM. We work with genomic data, and contributors are not expected to be experts in computer science, software engineering, genomics, entomology, … You are allowed not to know or not to be sure and it is miles better to say so than it is to regurgitate an answer that you do not understand.

README.md

Lines changed: 3 additions & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -46,88 +46,11 @@ for release notes.
4646

4747
To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
4848

49-
Fork and clone this repo:
49+
Detailed instructions can be found in the [Contributors guide](https://github.com/malariagen/malariagen-data-python/blob/master/CONTRIBUTING.md).
5050

51-
```bash
52-
git clone git@github.com:[username]/malariagen-data-python.git
53-
```
54-
55-
Install Python, e.g.:
56-
57-
```bash
58-
sudo add-apt-repository ppa:deadsnakes/ppa
59-
sudo apt install python3.10 python3.10-venv
60-
```
61-
62-
Install pipx, e.g.:
63-
64-
```bash
65-
python3.10 -m pip install --user pipx
66-
python3.10 -m pipx ensurepath
67-
```
68-
69-
Install [poetry](https://python-poetry.org/docs/#installation), e.g.:
70-
71-
```bash
72-
pipx install poetry
73-
```
74-
75-
Create development environment:
76-
77-
```bash
78-
cd malariagen-data-python
79-
poetry use 3.10
80-
poetry install
81-
```
82-
83-
Activate development environment:
84-
85-
```bash
86-
poetry shell
87-
```
88-
89-
Install pre-commit and pre-commit hooks:
90-
91-
```bash
92-
pipx install pre-commit
93-
pre-commit install
94-
```
95-
96-
Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:
97-
98-
```bash
99-
pre-commit run --all-files
100-
```
101-
102-
Run fast unit tests using simulated data:
103-
104-
```bash
105-
poetry run pytest -v tests/anoph
106-
```
107-
108-
To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
109-
110-
Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
111-
112-
```bash
113-
./install_gcloud.sh
114-
```
115-
116-
You'll then need to obtain application-default credentials, e.g.:
117-
118-
```bash
119-
./google-cloud-sdk/bin/gcloud auth application-default login
120-
```
121-
122-
Once this is done, you can run legacy tests:
123-
124-
```bash
125-
poetry run pytest --ignore=tests/anoph -v tests
126-
```
51+
## AI use policy and guidelines
12752

128-
Tests will run slowly the first time, as data required for testing
129-
will be read from GCS. Subsequent runs will be faster as data will be
130-
cached locally in the "gcs_cache" folder.
53+
See [AI use policy and guidelines](https://github.com/malariagen/malariagen-data-python/blob/master/AI-POLICY.md) for more details.
13154

13255
## Release process
13356

malariagen_data/amin1.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -105,7 +105,7 @@ def __init__(
105105
site_filters_analysis=site_filters_analysis,
106106
discordant_read_calls_analysis=discordant_read_calls_analysis,
107107
default_site_mask="minimus",
108-
default_phasing_analysis="minimus_noneyet",
108+
default_phasing_analysis=None,
109109
default_coverage_calls_analysis="minimus_noneyet",
110110
bokeh_output_notebook=bokeh_output_notebook,
111111
results_cache=results_cache,

malariagen_data/anoph/describe.py

Lines changed: 115 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,115 @@
1+
import inspect
2+
from typing import Optional
3+
4+
import pandas as pd
5+
from numpydoc_decorator import doc # type: ignore
6+
7+
from .base import AnophelesBase
8+
9+
10+
class AnophelesDescribe(AnophelesBase):
11+
"""Mixin class providing API introspection and discovery functionality."""
12+
13+
@doc(
14+
summary="""
15+
List all available public API methods with their descriptions.
16+
""",
17+
returns="""
18+
A dataframe with one row per public method, containing the method
19+
name, a short summary description, and its category (data access,
20+
analysis, or plotting).
21+
""",
22+
parameters=dict(
23+
category="""
24+
Optional filter to show only methods of a given category.
25+
Supported values are "data", "analysis", "plot", or None to
26+
show all methods.
27+
""",
28+
),
29+
)
30+
def describe_api(
31+
self,
32+
category: Optional[str] = None,
33+
) -> pd.DataFrame:
34+
methods_info = []
35+
36+
# Walk through all public methods on this instance.
37+
for name in sorted(dir(self)):
38+
# Skip private/dunder methods.
39+
if name.startswith("_"):
40+
continue
41+
42+
attr = getattr(type(self), name, None)
43+
if attr is None:
44+
continue
45+
46+
# Only include callable methods and non-property attributes.
47+
if isinstance(attr, property):
48+
continue
49+
if not callable(attr):
50+
continue
51+
52+
# Extract the docstring summary.
53+
summary = self._extract_summary(attr)
54+
55+
# Determine category.
56+
method_category = self._categorize_method(name)
57+
58+
methods_info.append(
59+
{
60+
"method": name,
61+
"summary": summary,
62+
"category": method_category,
63+
}
64+
)
65+
66+
df = pd.DataFrame(methods_info)
67+
68+
# Apply category filter if specified.
69+
if category is not None:
70+
valid_categories = {"data", "analysis", "plot"}
71+
if category not in valid_categories:
72+
raise ValueError(
73+
f"Invalid category: {category!r}. "
74+
f"Must be one of {valid_categories}."
75+
)
76+
df = df[df["category"] == category].reset_index(drop=True)
77+
78+
return df
79+
80+
@staticmethod
81+
def _extract_summary(method) -> str:
82+
"""Extract the first line of the docstring as a summary."""
83+
docstring = inspect.getdoc(method)
84+
if not docstring:
85+
return ""
86+
# Take the first non-empty line as the summary.
87+
for line in docstring.strip().splitlines():
88+
line = line.strip()
89+
if line:
90+
return line
91+
return ""
92+
93+
@staticmethod
94+
def _categorize_method(name: str) -> str:
95+
"""Categorize a method based on its name."""
96+
if name.startswith("plot_"):
97+
return "plot"
98+
data_prefixes = (
99+
"sample_",
100+
"snp_",
101+
"hap_",
102+
"cnv_",
103+
"genome_",
104+
"open_",
105+
"lookup_",
106+
"read_",
107+
"general_",
108+
"sequence_",
109+
"cohorts_",
110+
"aim_",
111+
"gene_",
112+
)
113+
if name.startswith(data_prefixes):
114+
return "data"
115+
return "analysis"

malariagen_data/anoph/dipclust.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -542,10 +542,10 @@ def _dipclust_concat_subplots(
542542
if isinstance(figure, go.Figure):
543543
# This is a figure, access the traces within it.
544544
for trace in range(len(figure["data"])):
545-
fig.append_trace(figure["data"][trace], row=i + 1, col=1)
545+
fig.add_trace(figure["data"][trace], row=i + 1, col=1)
546546
else:
547547
# Assume this is a trace, add directly.
548-
fig.append_trace(figure, row=i + 1, col=1)
548+
fig.add_trace(figure, row=i + 1, col=1)
549549

550550
fig.update_xaxes(visible=False)
551551
fig.update_layout(

malariagen_data/anoph/frq_base.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -276,7 +276,7 @@ def plot_frequencies_heatmap(
276276
heatmap_df.set_index(index_col, inplace=True)
277277

278278
# Clean column names.
279-
heatmap_df.columns = heatmap_df.columns.str.lstrip("frq_")
279+
heatmap_df.columns = heatmap_df.columns.str.removeprefix("frq_")
280280

281281
# Deal with width and height.
282282
if width is None:
@@ -385,8 +385,8 @@ def plot_frequencies_time_series(
385385

386386
# Build a long-form dataframe from the dataset.
387387
dfs = []
388-
for cohort_index, cohort in enumerate(df_cohorts.itertuples()):
389-
ds_cohort = ds.isel(cohorts=cohort_index)
388+
for cohort in df_cohorts.itertuples():
389+
ds_cohort = ds.isel(cohorts=cohort.Index)
390390
df = pd.DataFrame(
391391
{
392392
"taxon": cohort.taxon,

malariagen_data/anoph/g123.py

Lines changed: 10 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -153,7 +153,7 @@ def g123_gwss(
153153
self,
154154
contig: base_params.contig,
155155
window_size: g123_params.window_size,
156-
sites: g123_params.sites = base_params.DEFAULT,
156+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
157157
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
158158
sample_sets: Optional[base_params.sample_sets] = None,
159159
sample_query: Optional[base_params.sample_query] = None,
@@ -172,9 +172,6 @@ def g123_gwss(
172172
# invalidate any previously cached data.
173173
name = "g123_gwss_v1"
174174

175-
if sites == base_params.DEFAULT:
176-
assert self._default_phasing_analysis is not None
177-
sites = self._default_phasing_analysis
178175
valid_sites = self.phasing_analysis_ids + ("all", "segregating")
179176
if sites not in valid_sites:
180177
raise ValueError(
@@ -259,7 +256,7 @@ def _g123_calibration(
259256
def g123_calibration(
260257
self,
261258
contig: base_params.contig,
262-
sites: g123_params.sites = base_params.DEFAULT,
259+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
263260
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
264261
sample_query: Optional[base_params.sample_query] = None,
265262
sample_query_options: Optional[base_params.sample_query_options] = None,
@@ -279,6 +276,12 @@ def g123_calibration(
279276
# invalidate any previously cached data.
280277
name = "g123_calibration_v1"
281278

279+
valid_sites = self.phasing_analysis_ids + ("all", "segregating")
280+
if sites not in valid_sites:
281+
raise ValueError(
282+
f"Invalid value for `sites` parameter, must be one of {valid_sites}."
283+
)
284+
282285
params = dict(
283286
contig=contig,
284287
sites=sites,
@@ -314,7 +317,7 @@ def plot_g123_gwss_track(
314317
self,
315318
contig: base_params.contig,
316319
window_size: g123_params.window_size,
317-
sites: g123_params.sites = base_params.DEFAULT,
320+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
318321
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
319322
sample_sets: Optional[base_params.sample_sets] = None,
320323
sample_query: Optional[base_params.sample_query] = None,
@@ -417,7 +420,7 @@ def plot_g123_gwss(
417420
self,
418421
contig: base_params.contig,
419422
window_size: g123_params.window_size,
420-
sites: g123_params.sites = base_params.DEFAULT,
423+
sites: g123_params.sites = g123_params.DEFAULT_SITE_PARAMETER,
421424
site_mask: Optional[base_params.site_mask] = base_params.DEFAULT,
422425
sample_sets: Optional[base_params.sample_sets] = None,
423426
sample_query: Optional[base_params.sample_query] = None,

malariagen_data/anoph/g123_params.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,8 @@
3434
""",
3535
]
3636

37+
DEFAULT_SITE_PARAMETER: sites = "segregating"
38+
3739
min_cohort_size_default: base_params.min_cohort_size = 20
3840

3941
max_cohort_size_default: base_params.max_cohort_size = 50

malariagen_data/anoph/genome_features.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -552,5 +552,5 @@ def _transcript_to_parent_name(self, transcript):
552552
return self._gene_name_overrides[parent_id]
553553
except KeyError:
554554
rec_parent = df_genome_features.loc[parent_id]
555-
# Try to access "Name" attribute, fall back to "ID" if not present.
556-
return rec_parent.get("Name", parent_id)
555+
# Try to access gene name attribute, fall back to "ID" if not present.
556+
return rec_parent.get(self._gff_gene_name_attribute, parent_id)

malariagen_data/anoph/sample_metadata.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1569,11 +1569,11 @@ def plot_sample_location_mapbox(
15691569
# Sort by `color` column by default, which can be overridden via category_orders.
15701570
df_locations = df_samples[location_columns].drop_duplicates().sort_values(color)
15711571

1572-
fig = px.scatter_mapbox(
1572+
fig = px.scatter_map(
15731573
df_locations,
15741574
lat="latitude",
15751575
lon="longitude",
1576-
mapbox_style="open-street-map",
1576+
map_style="open-street-map",
15771577
zoom=zoom,
15781578
color=color,
15791579
category_orders=category_orders,

0 commit comments

Comments
 (0)