ML-Based Taxon Classification for MalariaGEN Data


### Early Contribution and GSoC Interest: ML-Based Taxon Classification for MalariaGEN Data

---

### Issue  

Hello maintainers,

I have recently submitted a pull request (linked below) and would like to express my interest in contributing further to the project, particularly in the context of Google Summer of Code.

PR: https://github.com/malariagen/malariagen-data-python/pull/1164

---

### Summary of Current Contribution  

The submitted PR introduces utility methods to improve taxon and species discovery within sample metadata:

- `list_taxons()`  
- `list_species_calls()`  

These methods provide a simple, read-only interface for extracting unique taxonomic and species-level information, reducing the need for manual aggregation and improving usability during exploratory analysis.

The implementation includes:

- Robust handling of missing columns and varying data formats  
- Numpydoc-compliant documentation with usage examples  
- Comprehensive test coverage across multiple datasets (`ag3`, `af1`, `adir1`, `amin1`)  
- 20 additional test cases, all passing  
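
To illustrate the behaviour these helpers aim for, here is a minimal sketch in plain Python. It is not the code from the PR: the `unique_values` function, the `sample_metadata` rows, and the column name `taxon` are hypothetical stand-ins, and plain dictionaries are used in place of the library's metadata table. It shows only the idea of returning sorted unique values while tolerating missing columns and missing entries.

```python
from typing import Iterable, List, Mapping, Optional


def unique_values(
    rows: Iterable[Mapping[str, Optional[str]]], column: str
) -> List[str]:
    """Return the sorted unique non-missing values found in `column`.

    Rows that lack the column, or hold None or an empty string in it,
    are skipped, so an absent column yields an empty list instead of
    raising an error.
    """
    seen = set()
    for row in rows:
        value = row.get(column)
        if value:  # skips None and ""
            seen.add(value)
    return sorted(seen)


# Hypothetical sample metadata; identifiers and values are illustrative only.
sample_metadata = [
    {"sample_id": "S1", "taxon": "gambiae"},
    {"sample_id": "S2", "taxon": "coluzzii"},
    {"sample_id": "S3", "taxon": "gambiae"},
    {"sample_id": "S4", "taxon": None},  # missing value
    {"sample_id": "S5"},                 # missing column
]

taxa = unique_values(sample_metadata, "taxon")
missing = unique_values(sample_metadata, "no_such_column")
```

Returning an empty list for an absent column is one possible design choice; raising a descriptive error would be another, and I am happy to follow whichever the maintainers prefer.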

---

### Motivation  

This contribution is intended as an initial step toward improving data discovery workflows within the library. It addresses a common friction point when working with sample metadata and aims to make the API more intuitive and efficient for users.

---

### Interest in GSoC and Proposed Direction  

I am interested in contributing to MalariaGEN through Google Summer of Code and have been developing a project proposal aligned with the ecosystem.

#### Proposed Project:  
**Machine Learning-Based Taxon Classifier for Genomic Data**

The core idea is to build a lightweight classification system that can identify major *Anopheles* taxonomic groups directly from raw FASTQ sequencing reads.

#### Key Objectives:

- Develop a feature extraction pipeline from raw sequencing data (e.g., k-mer based representations)  
- Train robust classification models (e.g., Random Forest, Gradient Boosting, or neural networks)  
- Avoid dependency on variant-calling pipelines to reduce computational overhead  
- Provide a scalable inference system via API and cloud integration  
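
As a rough sketch of the first objective, the snippet below turns FASTQ reads into a normalised k-mer frequency vector using only the standard library. All function names here are hypothetical, and the sketch assumes uncompressed FASTQ with exactly four lines per record; a real pipeline would also need to handle gzip input, canonical (reverse-complement) k-mers, and much larger values of k.

```python
import io
from collections import Counter
from itertools import product
from typing import Iterable, Iterator, List, TextIO


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each four-line FASTQ record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # second line of every record
            yield line.strip().upper()


def kmer_counts(sequences: Iterable[str], k: int) -> Counter:
    """Count k-mers across reads, skipping any k-mer with a non-ACGT base."""
    valid_bases = set("ACGT")
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= valid_bases:
                counts[kmer] += 1
    return counts


def kmer_frequency_vector(counts: Counter, k: int) -> List[float]:
    """Return relative k-mer frequencies in a fixed lexicographic order."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Toy input: two short reads, one containing an ambiguous base (N).
toy_fastq = io.StringIO(
    "@read1\nACGTAC\n+\nIIIIII\n"
    "@read2\nACNTGA\n+\nIIIIII\n"
)

counts = kmer_counts(read_fastq_sequences(toy_fastq), k=2)
vector = kmer_frequency_vector(counts, k=2)
```

The resulting fixed-length vector (4^k entries) is the kind of input a Random Forest or gradient boosting model could be trained on, with no variant calling involved.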

#### Expected Impact:

- Simplifies taxonomic classification workflows  
- Reduces computational requirements  
- Improves accessibility for users working with raw sequencing data  
- Complements existing tools in the ecosystem by enabling earlier-stage classification  

---

### Request for Feedback  

I would greatly appreciate feedback on:

- The submitted pull request and its design decisions  
- Whether the proposed project direction aligns with current priorities  
- Suggestions for additional areas where I can contribute effectively  

Thank you for your time and consideration.

ML-Based Taxon Classification for MalariaGEN Data #1165
