Antibody Non-Specificity Prediction Pipeline using ESM

"⏰ Time's up, let's do this." - Leeroy Jenkins

CI Pipeline Docker CI codecov Python 3.12 License: Apache 2.0 Code style: ruff


This repository provides a machine learning pipeline to predict the non-specificity of antibodies using embeddings from the ESM-1v Protein Language Model (PLM). The project implements the methods described in the paper "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters" by Sakhnini et al.


Project Description

Non-specific binding of therapeutic antibodies can lead to faster clearance from the body and other unwanted side effects, compromising their effectiveness and safety. Predicting this property, also known as polyreactivity, from an antibody's amino acid sequence is a critical step in drug development.

This project offers a computational pipeline to tackle this challenge. It leverages ESM-1v, a state-of-the-art PLM, to convert antibody amino acid sequences into meaningful numerical representations (embeddings). These embeddings capture complex biophysical and evolutionary information, which is then used to train a machine learning classifier to predict non-specificity. The pipeline is designed to be modular, allowing for easy adaptation to different datasets and models.


Model Architecture

The model's architecture is a two-stage process designed for both power and interpretability:

  1. Sequence Embedding with ESM-1v: The amino acid sequence of an antibody's Variable Heavy (VH) domain is fed into the pre-trained ESM-1v model. ESM-1v, trained on millions of diverse protein sequences, generates a high-dimensional vector (embedding) for the antibody. This vector represents the learned structural and functional properties of the sequence.
  2. Classification: The generated embedding vector is then used as input for a simpler, classical machine learning model. The original paper found that a Logistic Regression classifier performed best, achieving up to 71% accuracy in 10-fold cross-validation. This second stage learns to map the sequence features captured by ESM-1v to a binary outcome: specific or non-specific.

This hybrid approach combines the deep contextual understanding of a PLM with the efficiency and interpretability of a linear classifier.
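A minimal sketch of this two-stage design, assuming scikit-learn and NumPy are available; random vectors stand in for the real ESM-1v embeddings, and none of the names below belong to the pipeline's actual API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1 (stand-in): in the real pipeline, each VH sequence is passed
# through ESM-1v to obtain a fixed-length embedding vector. Here two
# separable Gaussian clusters fake the "specific" / "non-specific" classes.
n, dim = 200, 32
specific = rng.normal(loc=-1.0, size=(n, dim))
non_specific = rng.normal(loc=1.0, size=(n, dim))
X = np.vstack([specific, non_specific])
y = np.array([0] * n + [1] * n)  # 0 = specific, 1 = non-specific

# Stage 2: a linear classifier maps embeddings to a binary label
# (the paper found logistic regression performed best).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

The heavy lifting happens once, in the embedding stage; the linear classifier is cheap to retrain and its coefficients can be inspected directly.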


Features

Implemented

  • Data Processing: Scripts to load, clean, and process antibody datasets, including the Boughter et al. (2020) dataset used for training.

  • Sequence Annotation: Annotation of Complementarity-Determining Regions (CDRs) and extraction of the VH domain from full antibody sequences.

  • ESM-1v Embedding: A module to generate embeddings for antibody sequences using the ESM-1v model.

  • Model Training: A complete training pipeline for a Logistic Regression classifier on the generated embeddings.

  • Model Evaluation: Standard evaluation metrics, including k-fold cross-validation, accuracy, sensitivity, and specificity, are implemented to assess model performance.

  • Prediction CLI: Get predictions for new antibody sequences from trained models.

    Obtain a pretrained classifier in one of the supported formats:

      • Development: .pkl (Pickle)
      • Production: .npz (NumPy arrays) + _config.json (metadata)
    # Option A: Using Pickle (Development)
    uv run antibody-predict \
        input_file=data/test.csv \
        output_file=predictions.csv \
        classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
    
    # Option B: Using NPZ (Production/Secure)
    uv run antibody-predict \
        input_file=data/test.csv \
        output_file=predictions.csv \
        classifier.path=experiments/checkpoints/esm1v/logreg/model.npz \
        classifier.config_path=experiments/checkpoints/esm1v/logreg/model_config.json
  • Web Application Interface: A simple Gradio-based frontend for interactive prediction.

    # Launch the web UI
    uv run antibody-app classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl

    On macOS, the app automatically forces CPU, single-threaded execution to prevent crashes.
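The sensitivity and specificity metrics listed above reduce to simple confusion-matrix ratios. A dependency-free sketch (the function name is illustrative, not the pipeline's API):

```python
def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Label convention: 1 = non-specific (positive class), 0 = specific.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 1])
print(sens, spec)  # 0.5 0.5
```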

To-Be Implemented

  • Biophysical Descriptor Module: A feature to calculate and incorporate key biophysical parameters, such as the isoelectric point (pI), which was identified as a major driver of non-specificity.

  • Support for Other PLMs: Integration of other antibody-specific language models like AbLang or AntiBERTy for performance comparison.
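To give a sense of what the planned pI module involves: a rough isoelectric point can be estimated by bisecting for the pH at which the sequence's Henderson-Hasselbalch net charge crosses zero. The pKa values below are one common textbook set, chosen here for illustration; the eventual module may use a different scale.

```python
# Approximate side-chain and terminal pKa values (one common textbook set).
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}

def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge of a protein at a given pH."""
    pos = sum(1 / (1 + 10 ** (ph - PKA_POS[aa])) for aa in seq if aa in PKA_POS)
    pos += 1 / (1 + 10 ** (ph - PKA_POS["N_TERM"]))
    neg = sum(1 / (1 + 10 ** (PKA_NEG[aa] - ph)) for aa in seq if aa in PKA_NEG)
    neg += 1 / (1 + 10 ** (PKA_NEG["C_TERM"] - ph))
    return pos - neg

def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisect for the pH where the (monotonically decreasing) charge hits zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged -> pI is higher
        else:
            hi = mid
    return (lo + hi) / 2
```

Lysine-rich sequences land at a high pI and aspartate-rich sequences at a low one, which is the signal the paper links to non-specificity.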


Installation & Setup

To get started, clone the repository and set up the Python environment.

  1. Clone the Repository
git clone https://github.com/The-Obstacle-Is-The-Way/antibody_training_pipeline_ESM.git
cd antibody_training_pipeline_ESM
  2. Create the Environment

This project uses uv for fast Python package management with virtual environments.

Install uv if you don't have it:

  • For Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
  • For Windows (use pip)
pip install uv
  3. Set up the project

Recommended (all platforms):

# This runs 'uv sync --all-extras' to install ALL dependencies (including dev tools)
make install

Manual setup:

  • On Linux/macOS
uv venv
source .venv/bin/activate
uv sync --all-extras  # Install all dependencies
  • On Windows
uv venv
.venv\Scripts\activate
uv sync --all-extras  # Install all dependencies

Important: Always use make install or uv sync --all-extras to ensure dev dependencies (pytest, mypy, etc.) are installed. Plain uv sync will skip them.


Developer Workflow

This project uses modern Python tooling for a streamlined development experience. All common tasks are available through simple make commands.

Quick Start

# Install dependencies
make install

# Run full quality pipeline (format, lint, typecheck, test)
make all

Available Commands

Command          Description
make install     Install all project dependencies with uv
make format      Auto-format code with ruff
make lint        Check code quality with ruff linting
make typecheck   Run static type checking with mypy
make hooks       Run pre-commit hooks on all files
make test        Fast suite (unit + integration; skips e2e, slow, gpu)
make test-e2e    End-to-end suite (honors env flags like RUN_NOVO_E2E)
make test-all    Full pytest suite (env-gated e2e may still skip)
make all         Run complete quality pipeline (format → lint → typecheck → fast tests)
make train       Run the ML training pipeline
make clean       Remove cache directories and temporary files
make help        Show all available commands

Training with Hydra

The pipeline uses Hydra for flexible configuration management. Default config is in src/antibody_training_esm/conf/config.yaml:

# Train with default Hydra config (src/antibody_training_esm/conf/config.yaml)
make train
# OR
uv run antibody-train

# Override any parameter from CLI
uv run antibody-train hardware.device=cuda training.batch_size=32
uv run antibody-train classifier.C=0.5 classifier.penalty=l1

# Run hyperparameter sweeps (multi-run mode)
uv run antibody-train --multirun classifier.C=0.1,1.0,10.0
# Creates 3 separate runs with different C values

# Sweep multiple parameters (Cartesian product)
uv run antibody-train --multirun classifier.C=0.1,1.0 classifier.penalty=l1,l2
# Creates 4 runs: (0.1,l1), (0.1,l2), (1.0,l1), (1.0,l2)
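Hydra's multi-run expansion is simply a Cartesian product over the comma-separated values, so the run count can be sanity-checked with stdlib itertools:

```python
from itertools import product

# Mirrors: uv run antibody-train --multirun classifier.C=0.1,1.0 classifier.penalty=l1,l2
sweep = {"classifier.C": [0.1, 1.0], "classifier.penalty": ["l1", "l2"]}
runs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
for run in runs:
    print(run)
# 4 runs: (0.1, l1), (0.1, l2), (1.0, l1), (1.0, l2)
```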

Output structure:

experiments/runs/
└── {experiment.name}/
    └── {timestamp}/
        ├── .hydra/config.yaml   # Full resolved config
        ├── logs/training.log    # Training logs
        └── {model}.pkl          # Trained model artifact

Why Hydra?

  • No file editing: Override any config value from CLI
  • Experiment tracking: Every run saves its full configuration automatically
  • Hyperparameter sweeps: Multi-run mode for systematic parameter exploration
  • Reproducibility: Complete config provenance for every experiment

See Training Guide for comprehensive Hydra usage.


Code Quality Standards

This repository maintains 100% type safety and enforces quality through pre-commit hooks:

  • Ruff: Fast linting and formatting (replaces black, isort, flake8)
  • mypy: Static type checking with strict configuration
  • pytest: Comprehensive test coverage

Pre-commit Hooks

Pre-commit hooks run automatically before each commit to ensure code quality:

# Install hooks (one-time setup)
uv run pre-commit install

# Run hooks manually on all files
make hooks

The hooks will automatically:

  • Format code with ruff
  • Check linting rules
  • Verify type safety with mypy

If any check fails, the commit is blocked until issues are resolved.

Development Best Practices

  1. Before committing: Run make all to ensure everything passes
  2. When adding new code: Include type annotations from the start
  3. If pre-commit blocks: Review the errors and fix them - the hooks ensure quality
  4. For quick checks: Use individual commands like make lint or make typecheck

Security

Pickle Usage

This codebase uses Python's pickle module for:

  • Trained ML models: Saving/loading BinaryClassifier models (.pkl files)
  • Embedding caches: Caching expensive ESM embeddings for performance
  • Preprocessed datasets: Storing locally processed data

Threat Model: All pickle files are generated and consumed locally by trusted code. There is no internet-exposed API and no loading of untrusted pickle files.

For Production Deployment: If deploying this pipeline to a production environment with external access, consider migrating to JSON + NPZ format for artifact serialization. See SECURITY_REMEDIATION_PLAN.md for details.
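A minimal sketch of the NPZ + JSON pattern, assuming NumPy; the file names and keys here are illustrative rather than the pipeline's actual schema. Weights live in plain arrays and metadata in JSON, so loading never deserializes executable objects the way pickle does:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Illustrative weights for a fitted logistic regression (coef + intercept).
coef = np.array([[0.5, -1.2, 0.3]])
intercept = np.array([-0.1])

out = Path(tempfile.mkdtemp())
np.savez(out / "model.npz", coef=coef, intercept=intercept)
(out / "model_config.json").write_text(
    json.dumps({"model_type": "logistic_regression", "classes": [0, 1]})
)

# Loading touches only arrays and JSON -- no arbitrary code execution,
# unlike pickle.load() on an untrusted file.
arrays = np.load(out / "model.npz")
config = json.loads((out / "model_config.json").read_text())
restored_coef = arrays["coef"]
```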


Docker Support (Frictionless)

This project uses a Makefile workflow that automatically detects GPU availability and launches the correct Docker configuration.

Works out-of-the-box on:

  • macOS (Apple Silicon/Intel) β†’ Launches in CPU mode
  • Linux/Windows (with NVIDIA GPU) β†’ Launches with CUDA support
  • Linux/Windows (CPU only) β†’ Launches in CPU mode

Commands

1. Development Environment

Includes all dev tools, tests, and hot-reloading source code.

# Auto-detects GPU/CPU and starts the dev shell
make docker-dev

2. Production Environment

Optimized, secure image with pre-cached model weights (~650MB).

# Auto-detects GPU/CPU and runs the training pipeline
make docker-prod

Manual Usage (Optional)

If you prefer using docker compose directly:

  • CPU (macOS/Linux): docker compose up
  • GPU (Linux/Windows): docker compose -f docker-compose.yml -f docker-compose.gpu.yml up

Documentation

📚 Project Documentation

🆕 New to the project? Start with the System Overview to understand what this pipeline does and how it works.

For Users

For Developers

For Researchers

Dataset Documentation

See Datasets section below for dataset-specific preprocessing and validation docs.


Datasets

This pipeline uses four antibody datasets for training and evaluation:

Boughter Dataset (Training)

Source: Boughter et al. (2020)
Size: 914 VH sequences
Assay: ELISA polyreactivity assay
Usage: Primary training dataset

Documentation: See docs/datasets/boughter/ for preprocessing steps and data sources.


Jain Dataset (Test - Clinical Antibodies)

Source: Jain et al. (2017)
Size: 86 clinical antibodies
Assay: Per-antigen ELISA (Adimab dataset)
Usage: Primary test dataset; benchmark for the Novo Nordisk comparison (68.60% accuracy, exact parity with the published result)

Documentation: See docs/datasets/jain/ for preprocessing steps and data sources.


Harvey Dataset (Test - Nanobodies)

Source: Harvey et al. (2022)
Size: 141,021 nanobody sequences
Assay: PSR (polyspecific reagent) assay
Usage: Large-scale nanobody test set

Documentation: See docs/datasets/harvey/ for preprocessing steps and data sources.


Shehata Dataset (Test - PSR Cross-Validation)

Source: Shehata et al. (2019)
Size: 398 human antibodies
Assay: PSR (polyspecific reagent) assay
Usage: Cross-assay validation (PSR vs ELISA)

Documentation: See docs/datasets/shehata/ for preprocessing steps and data sources.


Citation

This work implements the methodology from:

Sakhnini et al. (2025) - Novo Nordisk & University of Cambridge

Sakhnini, L.I., Beltrame, L., Fulle, S., Sormanni, P., Henriksen, A., Lorenzen, N., Vendruscolo, M., & Granata, D. (2025). Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv. https://doi.org/10.1101/2025.04.28.650927

@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  month={May},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}

Dataset Attributions

This repository uses training and test datasets from multiple published studies:

  • Training: Boughter et al. 2020 (914 VH sequences, ELISA polyreactivity)
  • Test: Jain et al. 2017 (86 clinical antibodies, per-antigen ELISA from Adimab)
  • Test: Harvey et al. 2022 (141k nanobodies, PSR assay)
  • Test: Shehata et al. 2019 (398 antibodies, PSR cross-assay validation)

For complete citations, BibTeX entries, and data attribution details, see CITATIONS.md.
