ML-Based Taxon Classification for MalariaGEN Data #1165

@Aadik1ng

Early Contribution and GSoC Interest: ML-Based Taxon Classification for MalariaGEN Data


Issue

Hello maintainers,

I have recently submitted a pull request (linked below) and would like to express my interest in contributing further to the project, particularly in the context of Google Summer of Code.

PR: #1164


Summary of Current Contribution

The submitted PR introduces utility methods to improve taxon and species discovery within sample metadata:

  • list_taxons()
  • list_species_calls()

These methods provide a simple, read-only interface for extracting unique taxonomic and species-level information, reducing the need for manual aggregation and improving usability during exploratory analysis.

The implementation includes:

  • Robust handling of missing columns and varying data formats
  • Numpydoc-compliant documentation with usage examples
  • Comprehensive test coverage across multiple datasets (ag3, af1, adir1, amin1)
  • 20 additional test cases, all passing successfully
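
To make the behaviour described above concrete, here is a minimal sketch of the kind of read-only helper involved. It is not the code from the PR: the function name `unique_column_values`, the `taxon` column name, and the sample rows are all assumptions made for illustration, and metadata is modelled as plain dictionaries rather than the library's own data structures.

```python
import math


def unique_column_values(rows, column):
    """Return the sorted unique, non-missing values found in `column`.

    `rows` is an iterable of dict-like records (a hypothetical stand-in
    for sample metadata). A column that is absent from every row yields
    an empty list instead of raising an error.
    """
    values = set()
    for row in rows:
        value = row.get(column)
        # Treat None, NaN, and blank strings as missing.
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            values.add(text)
    return sorted(values)


# Made-up example rows, not real MalariaGEN metadata.
sample_rows = [
    {"sample_id": "S1", "taxon": "gambiae"},
    {"sample_id": "S2", "taxon": "coluzzii"},
    {"sample_id": "S3", "taxon": None},
    {"sample_id": "S4", "taxon": "gambiae"},
]

print(unique_column_values(sample_rows, "taxon"))
print(unique_column_values(sample_rows, "no_such_column"))
```

Returning an empty list for an absent column is one possible design choice; raising a descriptive error would be equally defensible, and I am happy to follow whichever convention the maintainers prefer.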

Motivation

This contribution is intended as an initial step toward improving data discovery workflows within the library. It addresses a common friction point when working with sample metadata, namely finding out which taxa and species calls are present without aggregating them by hand, and aims to make the API more intuitive and efficient for users.


Interest in GSoC and Proposed Direction

I am interested in contributing to MalariaGEN through Google Summer of Code and have been developing a project proposal that builds on the existing tooling.

Proposed Project:

Machine Learning-Based Taxon Classifier for Genomic Data

The core idea is to build a lightweight classification system that can identify major Anopheles taxonomic groups directly from raw FASTQ sequencing reads.

Key Objectives:

  • Develop a feature extraction pipeline from raw sequencing data (e.g., k-mer based representations)
  • Train robust classification models (e.g., Random Forest, Gradient Boosting, or neural networks)
  • Avoid dependency on variant-calling pipelines to reduce computational overhead
  • Provide a scalable inference system via API and cloud integration
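
As a rough illustration of the first objective, the sketch below turns FASTQ reads into a fixed-length vector of k-mer frequencies using only the Python standard library. It is a toy under stated assumptions, not a proposed implementation: the function names are hypothetical, the reads are made up, each FASTQ record is assumed to span exactly four lines, and k-mers containing anything other than A, C, G, or T are skipped.

```python
from collections import Counter
from itertools import product
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4-line FASTQ records."""
    for index, line in enumerate(handle):
        if index % 4 == 1:
            yield line.strip().upper()


def kmer_frequencies(sequences, k=3):
    """Return normalised k-mer frequencies in a fixed lexicographic order."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    # All 4**k possible k-mers, so every sample gets the same vector length.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Two made-up reads in FASTQ format, for illustration only.
example_fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGNACGT\n+\nIIIIIIII\n"
)
features = kmer_frequencies(read_fastq_sequences(example_fastq), k=3)
print(len(features))
```

A vector like `features` could then be passed to any of the model families listed above. A realistic pipeline would also need to handle reverse complements, larger values of k, and subsampling of reads, none of which this sketch attempts.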

Expected Impact:

  • Simplifies taxonomic classification workflows
  • Reduces computational requirements
  • Improves accessibility for users working with raw sequencing data
  • Complements existing tools in the ecosystem by enabling earlier-stage classification

Request for Feedback

I would greatly appreciate feedback on:

  • The submitted pull request and its design decisions
  • Whether the proposed project direction aligns with current priorities
  • Suggestions for additional areas where I can contribute effectively

Thank you for your time and consideration.
