Early Contribution and GSoC Interest: ML-Based Taxon Classification for MalariaGEN Data
Issue
Hello maintainers,
I have recently submitted a pull request (linked below) and would like to express my interest in contributing further to the project, particularly in the context of Google Summer of Code.
PR: #1164
Summary of Current Contribution
The submitted PR introduces utility methods to improve taxon and species discovery within sample metadata:
- list_taxons()
- list_species_calls()
These methods provide a simple, read-only interface for extracting unique taxonomic and species-level information, reducing the need for manual aggregation and improving usability during exploratory analysis.
The implementation includes:
- Robust handling of missing columns and varying data formats
- Numpydoc-compliant documentation with usage examples
- Comprehensive test coverage across multiple datasets (ag3, af1, adir1, amin1)
- 20 additional test cases, all passing successfully
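To make the intended behaviour concrete, here is a minimal sketch in plain Python of the kind of aggregation these methods replace. It is not the code in the PR: the helper name unique_column_values, the rows-as-dicts representation, and the "taxon" column name are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Mapping, Optional


def unique_column_values(
    rows: Iterable[Mapping[str, Optional[str]]], column: str
) -> List[str]:
    """Return the sorted unique non-missing values found in `column`.

    Hypothetical helper: rows that lack the column, or that hold None or
    an empty string there, are skipped, so a dataset without the column
    simply yields an empty list instead of raising.
    """
    values = set()
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        value = str(value).strip()
        if value:
            values.add(value)
    return sorted(values)


# Made-up sample metadata, standing in for a real metadata table.
sample_metadata = [
    {"sample_id": "S1", "taxon": "gambiae"},
    {"sample_id": "S2", "taxon": "coluzzii"},
    {"sample_id": "S3", "taxon": None},   # missing value
    {"sample_id": "S4"},                  # column absent for this row
    {"sample_id": "S5", "taxon": "gambiae"},
]

print(unique_column_values(sample_metadata, "taxon"))
```

The read-only, return-a-sorted-list shape is the design point here: callers get a stable result they can iterate over without touching the underlying metadata.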
Motivation
This contribution is intended as an initial step toward improving data discovery workflows within the library. It addresses a common friction point when working with sample metadata and aims to make the API more intuitive and efficient for users.
Interest in GSoC and Proposed Direction
I am interested in contributing to MalariaGEN through Google Summer of Code and have been developing a project proposal aligned with the ecosystem.
Proposed Project:
Machine Learning-Based Taxon Classifier for Genomic Data
The core idea is to build a lightweight classification system that can identify major Anopheles taxonomic groups directly from raw FASTQ sequencing reads.
Key Objectives:
- Develop a feature extraction pipeline from raw sequencing data (e.g., k-mer based representations)
- Train robust classification models (e.g., Random Forest, Gradient Boosting, or neural networks)
- Avoid dependency on variant-calling pipelines to reduce computational overhead
- Provide a scalable inference system via API and cloud integration
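As a rough illustration of the first objective, the sketch below turns FASTQ reads into a normalised canonical k-mer frequency vector using only the standard library. The function names, the choice of canonical k-mers, and the assumption of strict four-line FASTQ records are mine and would need validating against real data; a production pipeline would likely use a dedicated k-mer counter.

```python
import io
from collections import Counter
from itertools import product
from typing import Iterable, Iterator, List, TextIO

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip().upper()


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_frequencies(sequences: Iterable[str], k: int) -> List[float]:
    """Return canonical k-mer frequencies in a fixed, sorted vocabulary order.

    K-mers containing anything other than A, C, G, T (e.g. N) are skipped.
    """
    counts: Counter = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    vocabulary = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Two made-up reads, purely for demonstration.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nACNT\n+\nIIII\n"
features = kmer_frequencies(read_fastq_sequences(io.StringIO(fastq_text)), k=2)
print(len(features), features)
```

A fixed-length vector like this could then be passed to any of the model families listed above; which value of k and which model work best is an open question for the project itself.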
Expected Impact:
- Simplifies taxonomic classification workflows
- Reduces computational requirements
- Improves accessibility for users working with raw sequencing data
- Complements existing tools in the ecosystem by enabling earlier-stage classification
Request for Feedback
I would greatly appreciate feedback on:
- The submitted pull request and its design decisions
- Whether the proposed project direction aligns with current priorities
- Suggestions for additional areas where I can contribute effectively
Thank you for your time and consideration.