"β° Times up, let's do this." - Leeroy Jenkins
This repository provides a machine learning pipeline to predict the non-specificity of antibodies using embeddings from the ESM-1v Protein Language Model (PLM). The project is an implementation of the methods described in the paper "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters" by Sakhnini et al.
Non-specific binding of therapeutic antibodies can lead to faster clearance from the body and other unwanted side effects, compromising their effectiveness and safety. Predicting this property, also known as polyreactivity, from an antibody's amino acid sequence is a critical step in drug development.
This project offers a computational pipeline to tackle this challenge. It leverages ESM-1v, a state-of-the-art PLM, to convert antibody amino acid sequences into meaningful numerical representations (embeddings). These embeddings capture complex biophysical and evolutionary information, which is then used to train a machine learning classifier to predict non-specificity. The pipeline is designed to be modular, allowing for easy adaptation to different datasets and models.
The model's architecture is a two-stage process designed for both power and interpretability:
- Sequence Embedding with ESM-1v: The amino acid sequence of an antibody's Variable Heavy (VH) domain is fed into the pre-trained ESM-1v model. ESM-1v, trained on millions of diverse protein sequences, generates a high-dimensional vector (embedding) for the antibody. This vector represents the learned structural and functional properties of the sequence.
- Classification: The generated embedding vector is then used as input for a simpler, classical machine learning model. The original paper found that a Logistic Regression classifier performed best, achieving up to 71% accuracy in 10-fold cross-validation. This second stage learns to map the sequence features captured by ESM-1v to a binary outcome: specific or non-specific.
This hybrid approach combines the deep contextual understanding of a PLM with the efficiency and interpretability of a linear classifier.
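The two-stage flow can be sketched end to end. Everything below is illustrative: random arrays stand in for real ESM-1v per-residue embeddings (which are 1280-dimensional), and the labels are placeholders, not data from any of the datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for ESM-1v output: per-residue embeddings for 50 hypothetical
# VH sequences of varying length (real embeddings are 1280-dimensional)
per_residue = [rng.normal(size=(rng.integers(110, 130), 1280)) for _ in range(50)]

# Stage 1 (sketch): mean-pool the residue embeddings into one vector per antibody
X = np.stack([emb.mean(axis=0) for emb in per_residue])
y = np.tile([0, 1], 25)  # placeholder specific/non-specific labels

# Stage 2: a simple linear classifier on top of the pooled embeddings
clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(non-specific) per antibody
```

The pooling step is an assumption for illustration; the actual embedding extraction used by the pipeline is implemented in its ESM-1v module.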
- Data Processing: Scripts to load, clean, and process antibody datasets, including the Boughter et al. (2020) dataset used for training.
- Sequence Annotation: Annotation of Complementarity-Determining Regions (CDRs) and extraction of the VH domain from full antibody sequences.
- ESM-1v Embedding: A module to generate embeddings for antibody sequences using the ESM-1v model.
- Model Training: A complete training pipeline for a Logistic Regression classifier on the generated embeddings.
- Model Evaluation: Standard evaluation metrics, including k-fold cross-validation, accuracy, sensitivity, and specificity, are implemented to assess model performance.
- Prediction CLI: Get predictions for new antibody sequences from trained models.
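The evaluation metrics listed above all reduce to confusion-matrix counts. A minimal reference implementation (a hypothetical helper for illustration, not the repo's code):

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }

# Toy example: 2 TP, 1 TN, 1 FP, 1 FN
m = confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In k-fold cross-validation these metrics are computed per held-out fold and averaged.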
- Obtain a pretrained classifier (one of):
  - Run `make train` (see docs/user-guide/training.md).
  - Download a published checkpoint.
- Supported formats:
  - Development: `.pkl` (Pickle)
  - Production: `.npz` (NumPy arrays) + `_config.json` (metadata)
- Run:

```bash
# Option A: Using Pickle (Development)
uv run antibody-predict \
  input_file=data/test.csv \
  output_file=predictions.csv \
  classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl

# Option B: Using NPZ (Production/Secure)
uv run antibody-predict \
  input_file=data/test.csv \
  output_file=predictions.csv \
  classifier.path=experiments/checkpoints/esm1v/logreg/model.npz \
  classifier.config_path=experiments/checkpoints/esm1v/logreg/model_config.json
```
- Web Application Interface: A simple Gradio-based frontend for interactive prediction.

```bash
# Launch the web UI
uv run antibody-app classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
```

Automatically handles macOS optimization (forces CPU/single-thread) to prevent crashes.
- Biophysical Descriptor Module: A feature to calculate and incorporate key biophysical parameters, such as the isoelectric point (pI), which was identified as a major driver of non-specificity.
- Support for Other PLMs: Integration of other antibody-specific language models like AbLang or AntiBERTy for performance comparison.
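As a rough illustration of what a sequence-based pI descriptor involves, here is a self-contained sketch using simplified textbook side-chain pKa values and bisection. The constants and method are assumptions for illustration, not the exact descriptor computation the paper or the planned module uses.

```python
# Simplified, assumed pKa constants (textbook values)
PKA_POSITIVE = {"K": 10.53, "R": 12.48, "H": 6.00}            # basic side chains
PKA_NEGATIVE = {"D": 3.65, "E": 4.25, "C": 8.30, "Y": 10.07}  # acidic side chains
PKA_NTERM, PKA_CTERM = 9.00, 2.00

def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH via the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))  # free C-terminus
    for aa in seq:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge

def isoelectric_point(seq: str) -> float:
    """Bisect for the pH where the net charge crosses zero
    (net charge decreases monotonically with pH)."""
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A lysine-rich sequence lands at a high pI and an aspartate-rich one at a low pI, which is the kind of charge asymmetry the paper links to non-specific binding.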
To get started, clone the repository and set up the Python environment.
- Clone the Repository

```bash
git clone https://github.com/The-Obstacle-Is-The-Way/antibody_training_pipeline_ESM.git
cd antibody_training_pipeline_ESM
```

- Create the Environment

This project uses uv for fast Python package management with virtual environments.
Install uv if you don't have it:
```bash
# For Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

# For Windows (use pip)
pip install uv
```

- Set up the project
Recommended (all platforms):
```bash
# This runs 'uv sync --all-extras' to install ALL dependencies (including dev tools)
make install
```

Manual setup:

```bash
# On Linux/macOS
uv venv
source .venv/bin/activate
uv sync --all-extras  # Install all dependencies

# On Windows
uv venv
.venv\Scripts\activate
uv sync --all-extras  # Install all dependencies
```

Important: Always use `make install` or `uv sync --all-extras` to ensure dev dependencies (pytest, mypy, etc.) are installed. Plain `uv sync` will skip them.
This project uses modern Python tooling for a streamlined development experience. All common tasks are available through simple make commands.
```bash
# Install dependencies
make install

# Run full quality pipeline (format, lint, typecheck, test)
make all
```

| Command | Description |
|---|---|
| `make install` | Install all project dependencies with uv |
| `make format` | Auto-format code with ruff |
| `make lint` | Check code quality with ruff linting |
| `make typecheck` | Run static type checking with mypy |
| `make hooks` | Run pre-commit hooks on all files |
| `make test` | Fast suite (unit + integration; skips e2e, slow, gpu) |
| `make test-e2e` | End-to-end suite (honors env flags like RUN_NOVO_E2E) |
| `make test-all` | Full pytest suite (env-gated e2e may still skip) |
| `make all` | Run complete quality pipeline (format → lint → typecheck → fast tests) |
| `make train` | Run the ML training pipeline |
| `make clean` | Remove cache directories and temporary files |
| `make help` | Show all available commands |
The pipeline uses Hydra for flexible configuration management. Default config is in src/antibody_training_esm/conf/config.yaml:
```bash
# Train with default Hydra config (src/antibody_training_esm/conf/config.yaml)
make train
# OR
uv run antibody-train

# Override any parameter from CLI
uv run antibody-train hardware.device=cuda training.batch_size=32
uv run antibody-train classifier.C=0.5 classifier.penalty=l1

# Run hyperparameter sweeps (multi-run mode)
uv run antibody-train --multirun classifier.C=0.1,1.0,10.0
# Creates 3 separate runs with different C values

# Sweep multiple parameters (Cartesian product)
uv run antibody-train --multirun classifier.C=0.1,1.0 classifier.penalty=l1,l2
# Creates 4 runs: (0.1,l1), (0.1,l2), (1.0,l1), (1.0,l2)
```

Output structure:

```
experiments/runs/
└── {experiment.name}/
    └── {timestamp}/
        ├── .hydra/config.yaml  # Full resolved config
        ├── logs/training.log   # Training logs
        └── {model}.pkl         # Trained model artifact
```
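The Cartesian-product expansion that `--multirun` performs can be reproduced in plain Python. The dict keys below mirror the override names used in the commands above:

```python
from itertools import product

# Hydra's --multirun expands comma-separated values into a Cartesian product
Cs = [0.1, 1.0]
penalties = ["l1", "l2"]
runs = [{"classifier.C": c, "classifier.penalty": p} for c, p in product(Cs, penalties)]
# Four runs, matching the (0.1,l1), (0.1,l2), (1.0,l1), (1.0,l2) example above
```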
Why Hydra?
- No file editing: Override any config value from CLI
- Experiment tracking: Every run saves its full configuration automatically
- Hyperparameter sweeps: Multi-run mode for systematic parameter exploration
- Reproducibility: Complete config provenance for every experiment
See Training Guide for comprehensive Hydra usage.
This repository maintains 100% type safety and enforces quality through pre-commit hooks:
- Ruff: Fast linting and formatting (replaces black, isort, flake8)
- mypy: Static type checking with strict configuration
- pytest: Comprehensive test coverage
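A hypothetical helper (not taken from the repo) illustrating the fully annotated style that strict mypy checking enforces:

```python
def label_from_probability(prob: float, threshold: float = 0.5) -> int:
    """Map a predicted non-specificity probability to a binary label.

    Illustrative helper, not repo code: every parameter and the return
    value carry annotations, as required under strict type checking.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"probability out of range: {prob}")
    return int(prob >= threshold)
```

Under `mypy --strict`, an unannotated version of this function would be rejected.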
Pre-commit hooks run automatically before each commit to ensure code quality:
```bash
# Install hooks (one-time setup)
uv run pre-commit install

# Run hooks manually on all files
make hooks
```

The hooks will automatically:
- Format code with ruff
- Check linting rules
- Verify type safety with mypy
If any check fails, the commit is blocked until issues are resolved.
- Before committing: Run `make all` to ensure everything passes
- When adding new code: Include type annotations from the start
- If pre-commit blocks: Review the errors and fix them - the hooks ensure quality
- For quick checks: Use individual commands like `make lint` or `make typecheck`
This codebase uses Python's pickle module for:
- Trained ML models: Saving/loading BinaryClassifier models (`.pkl` files)
- Embedding caches: Caching expensive ESM embeddings for performance
- Preprocessed datasets: Storing locally processed data
Threat Model: All pickle files are generated and consumed locally by trusted code. There is no internet-exposed API and no loading of untrusted pickle files.
For Production Deployment: If deploying this pipeline to a production environment with external access, consider migrating to JSON + NPZ format for artifact serialization. See SECURITY_REMEDIATION_PLAN.md for details.
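A minimal sketch of what the JSON + NPZ format could look like for a linear classifier. The file names, array keys, and config fields here are illustrative assumptions, not the repo's actual schema:

```python
import json
import os
import tempfile

import numpy as np

outdir = tempfile.mkdtemp()
npz_path = os.path.join(outdir, "model.npz")
cfg_path = os.path.join(outdir, "model_config.json")

# Save: weights as plain arrays, metadata as JSON (keys are illustrative)
np.savez(npz_path, coef=np.array([[0.4, -1.2, 0.7]]), intercept=np.array([-0.1]))
with open(cfg_path, "w") as f:
    json.dump({"classes": [0, 1], "embedding_dim": 3}, f)

# Load: touches only arrays and JSON, so no arbitrary code execution,
# unlike unpickling an untrusted file
arrays = np.load(npz_path)
with open(cfg_path) as f:
    config = json.load(f)

def predict_proba(x: np.ndarray) -> np.ndarray:
    """Logistic-regression forward pass from the restored parameters."""
    z = x @ arrays["coef"].T + arrays["intercept"]
    return 1.0 / (1.0 + np.exp(-z))

p = predict_proba(np.array([[1.0, 0.0, 1.0]]))
```

This is the essential trade-off: NPZ/JSON can only carry data, so loading it cannot run attacker-controlled code the way `pickle.load` can.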
This project uses an intelligent Makefile workflow to automatically detect GPU availability and launch the correct Docker configuration.
Works out-of-the-box on:
- macOS (Apple Silicon/Intel) → Launches in CPU mode
- Linux/Windows (with NVIDIA GPU) → Launches with CUDA support
- Linux/Windows (CPU only) → Launches in CPU mode
Includes all dev tools, tests, and hot-reloading source code.
```bash
# Auto-detects GPU/CPU and starts the dev shell
make docker-dev
```

Optimized, secure image with pre-cached model weights (~650MB).

```bash
# Auto-detects GPU/CPU and runs the training pipeline
make docker-prod
```

If you prefer using docker compose directly:
- CPU (macOS/Linux): `docker compose up`
- GPU (Linux/Windows): `docker compose -f docker-compose.yml -f docker-compose.gpu.yml up`
New to the project? Start with the System Overview to understand what this pipeline does and how it works.
- Quick Start - Inference: INFERENCE_GUIDE.md (root - fast reference)
- Comprehensive Inference Guide: docs/user-guide/inference.md
- Installation & Setup: See Installation above
- Training Models: docs/user-guide/training.md
- Testing Models: docs/user-guide/testing.md
- All User Guides: docs/user-guide/ (installation, getting-started, preprocessing, troubleshooting)
- Architecture: docs/developer-guide/architecture.md
- Development Workflow: docs/developer-guide/development-workflow.md
- Testing Strategy: docs/developer-guide/testing-strategy.md
- CI/CD: docs/developer-guide/ci-cd.md
- Type Checking: docs/developer-guide/type-checking.md
- Security: docs/developer-guide/security.md
- Preprocessing Internals: docs/developer-guide/preprocessing-internals.md
- Docker: docs/developer-guide/docker.md
- Novo Parity Analysis: docs/research/novo-parity.md
- Methodology & Divergences: docs/research/methodology.md
- Assay Thresholds: docs/research/assay-thresholds.md
- Benchmark Results: docs/research/benchmark-results.md
- Model Zoo Roadmap: docs/research/model-zoo-roadmap.md
See Datasets section below for dataset-specific preprocessing and validation docs.
This pipeline uses four antibody datasets for training and evaluation:
- Source: Boughter et al. (2020)
- Size: 914 VH sequences
- Assay: ELISA polyreactivity assay
- Usage: Primary training dataset
Documentation: See docs/datasets/boughter/ for preprocessing steps and data sources.
- Source: Jain et al. (2017)
- Size: 86 clinical antibodies
- Assay: Per-antigen ELISA (Adimab dataset)
- Usage: Primary test dataset, benchmark for Novo comparison (68.60% - exact Novo parity)
Documentation: See docs/datasets/jain/ for preprocessing steps and data sources.
- Source: Harvey et al. (2022)
- Size: 141,021 nanobody sequences
- Assay: PSR (polyspecific reagent) assay
- Usage: Large-scale nanobody test set
Documentation: See docs/datasets/harvey/ for preprocessing steps and data sources.
- Source: Shehata et al. (2019)
- Size: 398 human antibodies
- Assay: PSR (polyspecific reagent) assay
- Usage: Cross-assay validation (PSR vs ELISA)
Documentation: See docs/datasets/shehata/ for preprocessing steps and data sources.
This work implements the methodology from:
Sakhnini et al. (2025) - Novo Nordisk & University of Cambridge
Sakhnini, L.I., Beltrame, L., Fulle, S., Sormanni, P., Henriksen, A., Lorenzen, N., Vendruscolo, M., & Granata, D. (2025). Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv. https://doi.org/10.1101/2025.04.28.650927
```bibtex
@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  month={May},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}
```

This repository uses training and test datasets from multiple published studies:
- Training: Boughter et al. 2020 (914 VH sequences, ELISA polyreactivity)
- Test: Jain et al. 2017 (86 clinical antibodies, per-antigen ELISA from Adimab)
- Test: Harvey et al. 2022 (141k nanobodies, PSR assay)
- Test: Shehata et al. 2019 (398 antibodies, PSR cross-assay validation)
For complete citations, BibTeX entries, and data attribution details, see CITATIONS.md.