
Commit af2e64c

Merge branch 'master' into GH907-n-jack-validation
2 parents 7285f4c + d078091 commit af2e64c

33 files changed

Lines changed: 1247 additions & 333 deletions

.github/workflows/integration_tests.yml

Lines changed: 0 additions & 3 deletions
@@ -3,9 +3,6 @@ on:
   push:
     branches:
       - master
-  pull_request:
-    branches:
-      - master
 jobs:
   integration_tests:
     strategy:

.github/workflows/latest_docs.yml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
         run: poetry run sphinx-build -b html docs/source docs/build/html

       - name: Deploy HTML to GitHub Pages 🚀
-        uses: peaceiris/actions-gh-pages@v3.9.3
+        uses: peaceiris/actions-gh-pages@v4
         with:
           publish_branch: gh-pages
           github_token: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/notebooks.yml

Lines changed: 0 additions & 3 deletions
@@ -3,9 +3,6 @@ on:
   push:
     branches:
       - master
-  pull_request:
-    branches:
-      - master
 jobs:
   notebooks:
     strategy:

AI-POLICY.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# AI use policy and guidelines
+
+The goal of the MalariaGEN data API is to make access, use, and interpretation of the genomic data collected by our partners as easy and intuitive as possible. Maintainers have limited time and attention to focus on reviews, which means that each review request has to be for code that you can be proud of.
+
+Any tool that can help produce better code and better understand the existing codebase, including AI tools, can be used. The only key questions are: “Is this an improvement?” and “Why is the code better now?”.
+
+NEVER submit an AI-generated PR if you are not able to understand and explain the changes and why they matter. Maintainers WILL close PRs without reviewing them if they feel like they are a waste of time.
+
+## Using AI as a coding assistant
+
+1. Understanding and familiarising yourself with the codebase is key. No matter how good the AI code assistant, it will return useless code if you do not provide a smart and accurate enough prompt.
+2. Always check that your changes make sense. LLMs are terrible at saying no to a prompt and will lie and make false claims if they can’t do otherwise. This is particularly true if they lack key information.
+3. Each commit should be its own piece of coherent change. LLMs like to do everything at once but digestible change is easier to understand and process.
+4. Commenting your code is important, but LLMs really like to listen to themselves talking and will be very verbose. A small comment explaining why you made a choice is better than a paragraph explaining how a loop iterates through a list.
+
+## Using AI for communication
+
+As noted above, maintainers have a limited amount of time to spend on MalariaGEN data API maintenance and do not want to waste it going through long, sloppy PR descriptions of simple issues. We strongly prefer clear and concise communication, even if it means we have to ask questions when more details are needed.
+
+You are responsible for your own PRs and comments. Even if you use an LLM to write a PR description or comment, you are expected to read through everything and make sure that it accurately and concisely reflects your opinions, ideas and contributions. If reading your own PRs and comments is too much work for you, it is going to be the same for everyone else.
+Here are some concrete guidelines for using AI as part of your communication toolbox.
+
+1. In general, the question that needs answering is why, not what. Maintainers can see the files and lines of code that were modified; what they will want to know is the reasoning behind the choices. Sadly, LLMs are not great at explaining their reasoning so you probably will have to chip in.
+2. In the same way, if you are responding to a comment or a review, you will need to justify your choice and explain how you made the decision.
+3. Make sure that the description of your work is accurate. Errors can happen but it is fairly obvious when an LLM claims more than it delivers.
+4. We are aware that English is not everyone’s first language. The grammar of your communications isn’t as important as the quality of your contribution. Feel free to use AI to improve your writing style but make sure that you still understand the message, that its content is conserved and that it doesn’t turn into an epic poem.
+5. Maintainers are more interested in your ideas and thoughts than in the standard answer provided by an LLM. We work with genomic data, and contributors are not expected to be experts in computer science, software engineering, genomics, entomology, … You are allowed not to know or not to be sure and it is miles better to say so than it is to regurgitate an answer that you do not understand.
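
Guideline 4 under "Using AI as a coding assistant" contrasts a short comment that explains why a choice was made with a long one that narrates how the code runs. A minimal Python sketch of that contrast follows; the function, its data layout, and the threshold are hypothetical and are not taken from the codebase.

```python
def drop_low_coverage(samples, min_coverage=10.0):
    """Return the samples whose mean coverage reaches min_coverage.

    `samples` is a list of (sample_id, mean_coverage) tuples. The name,
    data layout, and default threshold are illustrative assumptions.
    """
    # Verbose "how" comment, which only repeats what the code already says:
    #   loop over every tuple, compare element 1 with the threshold, and
    #   keep the tuple in the new list when the comparison is true.
    #
    # Short "why" comment, which records a decision a reviewer cannot infer:
    # Use >= so a sample sitting exactly at the threshold is kept.
    return [sample for sample in samples if sample[1] >= min_coverage]


kept = drop_low_coverage([("S1", 32.5), ("S2", 10.0), ("S3", 4.2)])
print(kept)
```

The second comment is the kind a reviewer can act on: it states the intent behind the boundary condition, which the code alone does not reveal.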

CONTRIBUTING.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
+# Contributing to malariagen-data-python
+
+Thanks for your interest in contributing to this project! This guide will help you get started.
+
+## About the project
+
+This package provides Python tools for accessing and analyzing genomic data from [MalariaGEN](https://www.malariagen.net/), a global research network studying the genomic epidemiology of malaria and its vectors. It provides access to data on _Anopheles_ mosquito species and _Plasmodium_ malaria parasites, with functionality for variant analysis, haplotype clustering, population genetics, and visualization.
+
+## Setting up your development environment
+
+### Prerequisites
+
+You'll need:
+
+- Python 3.10.x (CI-tested version)
+- [Poetry](https://python-poetry.org/) for dependency management
+- [Git](https://git-scm.com/) for version control
+
+### Initial setup
+
+1. **Fork and clone the repository**
+
+Fork the repository on GitHub, then clone your fork:
+
+```bash
+git clone git@github.com:[your-username]/malariagen-data-python.git
+cd malariagen-data-python
+```
+
+2. **Add the upstream remote**
+
+```bash
+git remote add upstream https://github.com/malariagen/malariagen-data-python.git
+```
+
+3. **Install Poetry** (if not already installed)
+
+```bash
+pipx install poetry
+```
+
+4. **Install the project and its dependencies**
+
+```bash
+poetry install
+```
+
+**Recommended**: Use `poetry run` to run commands inside the virtual environment:
+
+```bash
+poetry run pytest
+poetry run python script.py
+```
+
+**Optional**: If you prefer an interactive shell session, install the shell plugin first:
+
+```bash
+poetry self add poetry-plugin-shell
+```
+
+Then activate the environment with:
+
+```bash
+poetry shell
+```
+
+After activation, commands run directly inside the virtual environment:
+
+```bash
+pytest
+python script.py
+```
+
+5. **Install pre-commit hooks**
+
+```bash
+pipx install pre-commit
+pre-commit install
+```
+
+Pre-commit hooks will automatically run `ruff` (linter and formatter) on your changes before each commit.
+
+## Development workflow
+
+### Creating a new feature or fix
+
+1. **Sync with upstream**
+
+```bash
+git checkout master
+git pull upstream master
+```
+
+2. **Create a feature branch**
+
+If an issue does not already exist for your change, [create one](https://github.com/malariagen/malariagen-data-python/issues/new) first. Then create a branch using the convention `GH{issue number}-{short description}`:
+
+```bash
+git checkout -b GH123-fix-broken-filter
+# or
+git checkout -b GH456-add-new-analysis
+```
+
+3. **Make your changes**
+
+Write your code, add tests, update documentation as needed.
+
+4. **Run tests locally**
+
+Fast unit tests (no external data access):
+
+```bash
+poetry run pytest -v tests/anoph
+```
+
+All unit tests (requires setting up credentials for legacy tests):
+
+```bash
+poetry run pytest -v tests --ignore tests/integration
+```
+
+5. **Check code quality**
+
+The pre-commit hooks will run automatically, but you can also run them manually:
+
+```bash
+pre-commit run --all-files
+```
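
The branch convention `GH{issue number}-{short description}` from step 2 can be checked mechanically. The sketch below is one possible reading of it: the regular expression, the rule that the description is made of hyphen-separated alphanumeric words, and the function name are all assumptions, not project rules or existing tooling.

```python
import re

# Assumed reading of the convention: "GH", an issue number, a hyphen, then a
# short description made of alphanumeric words joined by single hyphens.
BRANCH_PATTERN = re.compile(
    r"GH(?P<issue>\d+)-(?P<description>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)"
)


def parse_branch_name(name):
    """Return (issue_number, description) if name fits the pattern, else None."""
    match = BRANCH_PATTERN.fullmatch(name)
    if match is None:
        return None
    return int(match.group("issue")), match.group("description")


for candidate in ["GH123-fix-broken-filter", "GH907-n-jack-validation", "fix-broken-filter"]:
    print(candidate, "->", parse_branch_name(candidate))
```

The branch merged by this commit, `GH907-n-jack-validation`, fits this reading; a name without the `GH` prefix and issue number does not.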
+
+### Code style
+
+We use `ruff` for both linting and formatting. The configuration is in `pyproject.toml`. Key points:
+
+- Line length: 88 characters (black default)
+- Follow PEP 8 conventions
+- Use type hints where appropriate
+- Write clear docstrings (we use numpydoc format)
+
+The pre-commit hooks will handle most formatting automatically. If you want to run ruff manually:
+
+```bash
+ruff check .
+ruff format .
+```
+
+### Testing
+
+- **Write tests for new functionality**: Add unit tests in the `tests/` directory
+- **Test coverage**: Aim to maintain or improve test coverage
+- **Fast tests**: Unit tests should use simulated data when possible (see `tests/anoph/`)
+- **Integration tests**: Tests requiring GCS data access are slower and run separately
+
+Run type checking with:
+
+```bash
+poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.anoph
+```
+
+### Documentation
+
+- Update docstrings if you modify public APIs
+- Documentation is built using Sphinx with the pydata theme
+- API docs are auto-generated from docstrings
+- Follow the [numpydoc](https://numpydoc.readthedocs.io/) style guide
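
To illustrate the type-hint and numpydoc points from the Code style and Documentation sections, here is a small sketch. The function is hypothetical and is not part of `malariagen_data`; only the docstring layout (Parameters, Returns, and Raises sections with underlined headings) follows the numpydoc style guide.

```python
from typing import Sequence


def allele_frequency(alt_count: int, allele_counts: Sequence[int]) -> float:
    """Compute the frequency of one allele among all observed alleles.

    Parameters
    ----------
    alt_count : int
        Number of observations of the allele of interest.
    allele_counts : sequence of int
        Observation counts for every allele at the site, including the
        allele of interest.

    Returns
    -------
    float
        ``alt_count`` divided by the sum of ``allele_counts``.

    Raises
    ------
    ValueError
        If no alleles were observed, so the frequency is undefined.
    """
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("no alleles observed; frequency is undefined")
    return alt_count / total


print(allele_frequency(3, [9, 3]))  # 0.25
```

Because the API docs are generated from docstrings, a docstring in this shape is what ends up on the rendered reference page.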
+
+## Submitting your contribution
+
+### Before opening a pull request
+
+- [ ] Tests pass locally
+- [ ] Pre-commit hooks pass (or run `pre-commit run --all-files`)
+- [ ] Code is well-documented
+- [ ] Commit messages are clear and descriptive
+
+### Opening a pull request
+
+1. **Push your branch**
+
+```bash
+git push origin your-branch-name
+```
+
+2. **Create the pull request**
+- Go to the [repository on GitHub](https://github.com/malariagen/malariagen-data-python)
+- Click "Pull requests" → "New pull request"
+- Select your fork and branch
+- Write a clear title and description
+
+3. **Pull request description should include:**
+- What problem does this solve?
+- How does it solve it?
+- Any relevant issue numbers (e.g., "Fixes #123")
+- Testing done
+- Any breaking changes or migration notes
+
+### Review process
+
+- PRs require approval from a project maintainer
+- CI tests must pass (pytest on Python 3.10 with NumPy 1.26.4)
+- Address review feedback by pushing new commits to your branch
+- Once approved, a maintainer will merge your PR
+
+## Communication
+
+- **Issues**: Use [GitHub Issues](https://github.com/malariagen/malariagen-data-python/issues) for bug reports and feature requests
+- **Discussions**: For questions and general discussion, use [GitHub Discussions](https://github.com/malariagen/malariagen-data-python/discussions)
+- **Pull requests**: Use PR comments for code review discussions
+- **Email**: For data access questions, contact [support@malariagen.net](mailto:support@malariagen.net)
+
+## Finding something to work on
+
+- Look for issues labeled [`good first issue`](https://github.com/malariagen/malariagen-data-python/labels/good%20first%20issue)
+- Check for issues labeled [`help wanted`](https://github.com/malariagen/malariagen-data-python/labels/help%20wanted)
+- Improve documentation or add examples
+- Increase test coverage
+
+## Questions?
+
+If you're unsure about anything, feel free to:
+
+- Open an issue to ask
+- Start a discussion on GitHub Discussions
+- Ask in your pull request
+
+We appreciate your contributions and will do our best to help you succeed!
+
+## License
+
+By contributing to this project, you agree that your contributions will be licensed under the [MIT License](LICENSE).

README.md

Lines changed: 4 additions & 81 deletions
@@ -46,88 +46,11 @@ for release notes.

 To get set up for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.

-Fork and clone this repo:
+Detailed instructions can be found in the [Contributors guide](https://github.com/malariagen/malariagen-data-python/blob/master/CONTRIBUTING.md).

-```bash
-git clone git@github.com:[username]/malariagen-data-python.git
-```
-
-Install Python, e.g.:
-
-```bash
-sudo add-apt-repository ppa:deadsnakes/ppa
-sudo apt install python3.10 python3.10-venv
-```
-
-Install pipx, e.g.:
-
-```bash
-python3.10 -m pip install --user pipx
-python3.10 -m pipx ensurepath
-```
-
-Install [poetry](https://python-poetry.org/docs/#installation), e.g.:
-
-```bash
-pipx install poetry
-```
-
-Create development environment:
-
-```bash
-cd malariagen-data-python
-poetry use 3.10
-poetry install
-```
-
-Activate development environment:
-
-```bash
-poetry shell
-```
-
-Install pre-commit and pre-commit hooks:
-
-```bash
-pipx install pre-commit
-pre-commit install
-```
-
-Run pre-commit checks (isort, black, blackdoc, flake8, ...) manually:
-
-```bash
-pre-commit run --all-files
-```
-
-Run fast unit tests using simulated data:
-
-```bash
-poetry run pytest -v tests/anoph
-```
-
-To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
-
-Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). E.g., if on Linux:
-
-```bash
-./install_gcloud.sh
-```
-
-You'll then need to obtain application-default credentials, e.g.:
-
-```bash
-./google-cloud-sdk/bin/gcloud auth application-default login
-```
-
-Once this is done, you can run legacy tests:
-
-```bash
-poetry run pytest --ignore=tests/anoph -v tests
-```
+## AI use policy and guidelines

-Tests will run slowly the first time, as data required for testing
-will be read from GCS. Subsequent runs will be faster as data will be
-cached locally in the "gcs_cache" folder.
+See [AI use policy and guidelines](https://github.com/malariagen/malariagen-data-python/blob/master/AI-POLICY.md) for more details.

 ## Release process

@@ -142,7 +65,7 @@ modifying the `docs/source/_static/switcher.json` file accordingly.

 If you use the `malariagen_data` package in a publication
 or include any of its functions or code in other materials (_e.g._ training resources),
-please cite: [doi.org/10.5281/zenodo.11173411](doi.org/10.5281/zenodo.11173411)
+please cite: [doi.org/10.5281/zenodo.11173411](https://doi.org/10.5281/zenodo.11173411)

 Some functions may require additional citations to acknowledge specific contributions. These are indicated in the description for each relevant function.
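
The last README hunk adds `https://` to the DOI link target. A link target without a scheme is resolved as a path relative to the page that contains it, which is presumably why the old link did not reach doi.org. The sketch below shows that resolution with the standard library; the README page URL used as the base is an assumption about where the file is rendered.

```python
from urllib.parse import urljoin

# Assumed location of the rendered README; any page URL behaves the same way.
README_URL = "https://github.com/malariagen/malariagen-data-python/blob/master/README.md"

# Old link target: no scheme, so it is treated as a relative path.
old_target = urljoin(README_URL, "doi.org/10.5281/zenodo.11173411")

# New link target: an absolute URL, so the base page is ignored.
new_target = urljoin(README_URL, "https://doi.org/10.5281/zenodo.11173411")

print(old_target)
print(new_target)
```

The first result stays on the github.com host, under the `blob/master/` path, while the second points at doi.org as intended.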

docs/source/Ag3.rst

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ All the functions below can then be accessed as methods on the ``ag3`` object. E
 df_samples = ag3.sample_metadata()

 For more information about the data and terms of use, please see the
-`MalariaGEN Anopheles gambiae genomic surveillance project <https://www.malariagen.net/anopheles-gambiae-genomic-surveillance-project>`_
+`MalariaGEN Anopheles gambiae genomic surveillance project <https://www.malariagen.net/project/anopheles-gambiae-genomic-surveillance-project>`_
 home page.

 .. currentmodule:: malariagen_data.ag3.Ag3
