Antibody Non-Specificity Prediction Pipeline using ESM

"⏰ Time's up, let's do this." - Leeroy Jenkins

CI Pipeline Docker CI codecov Python 3.12 License: Apache 2.0 Code style: ruff


This repository provides a machine learning pipeline to predict the non-specificity of antibodies using embeddings from the ESM-1v Protein Language Model (PLM). The project implements the methods described in the paper "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters" by Sakhnini et al.


Project Description

Non-specific binding of therapeutic antibodies can lead to faster clearance from the body and other unwanted side effects, compromising their effectiveness and safety. Predicting this property, also known as polyreactivity, from an antibody's amino acid sequence is a critical step in drug development.

This project offers a computational pipeline to tackle this challenge. It leverages ESM-1v, a state-of-the-art PLM, to convert antibody amino acid sequences into meaningful numerical representations (embeddings). These embeddings capture complex biophysical and evolutionary information, which is then used to train a machine learning classifier to predict non-specificity. The pipeline is designed to be modular, allowing for easy adaptation to different datasets and models.


Model Architecture

The model's architecture is a two-stage process designed for both power and interpretability:

  1. Sequence Embedding with ESM-1v: The amino acid sequence of an antibody's Variable Heavy (VH) domain is fed into the pre-trained ESM-1v model. ESM-1v, trained on millions of diverse protein sequences, generates a high-dimensional vector (embedding) for the antibody. This vector represents the learned structural and functional properties of the sequence.
  2. Classification: The generated embedding vector is then used as input for a simpler, classical machine learning model. The original paper found that a Logistic Regression classifier performed best, achieving up to 71% accuracy in 10-fold cross-validation. This second stage learns to map the sequence features captured by ESM-1v to a binary outcome: specific or non-specific.

This hybrid approach combines the deep contextual understanding of a PLM with the efficiency and interpretability of a linear classifier.
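A minimal sketch of this two-stage design, assuming scikit-learn and NumPy are available; random vectors stand in for the real ESM-1v embeddings, and none of the names below belong to the pipeline's actual API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1 (stand-in): in the real pipeline, each VH sequence is passed
# through ESM-1v to obtain a fixed-length embedding vector. Here two
# separable Gaussian clusters fake the "specific" / "non-specific" classes.
n, dim = 200, 32
specific = rng.normal(loc=-1.0, size=(n, dim))
non_specific = rng.normal(loc=1.0, size=(n, dim))
X = np.vstack([specific, non_specific])
y = np.array([0] * n + [1] * n)  # 0 = specific, 1 = non-specific

# Stage 2: a linear classifier maps embeddings to a binary label
# (the paper found logistic regression performed best).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

The heavy lifting happens once, in the embedding stage; the linear classifier is cheap to retrain and its coefficients can be inspected directly.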


Features

Implemented

  • Data Processing: Scripts to load, clean, and process antibody datasets, including the Boughter et al. (2020) dataset used for training.

  • Sequence Annotation: Annotation of Complementarity-Determining Regions (CDRs) and extraction of the VH domain from full antibody sequences.

  • ESM-1v Embedding: A module to generate embeddings for antibody sequences using the ESM-1v model.

  • Model Training: A complete training pipeline for a Logistic Regression classifier on the generated embeddings.

  • Model Evaluation: Standard evaluation metrics, including k-fold cross-validation, accuracy, sensitivity, and specificity, are implemented to assess model performance.

  • Prediction CLI: Get predictions for new antibody sequences from trained models.

    Obtain a pretrained classifier in one of the supported formats:

      • Development: .pkl (Pickle)
      • Production: .npz (NumPy arrays) + _config.json (metadata)
    # Option A: Using Pickle (Development)
    uv run antibody-predict \
        input_file=data/test.csv \
        output_file=predictions.csv \
        classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
    
    # Option B: Using NPZ (Production/Secure)
    uv run antibody-predict \
        input_file=data/test.csv \
        output_file=predictions.csv \
        classifier.path=experiments/checkpoints/esm1v/logreg/model.npz \
        classifier.config_path=experiments/checkpoints/esm1v/logreg/model_config.json
  • Web Application Interface: A simple Gradio-based frontend for interactive prediction.

    # Launch the web UI
    uv run antibody-app classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl

    On macOS, the app automatically forces CPU, single-threaded execution to prevent crashes.
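The sensitivity and specificity metrics listed above reduce to simple confusion-matrix ratios. A dependency-free sketch (the function name is illustrative, not the pipeline's API):

```python
def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Label convention: 1 = non-specific (positive class), 0 = specific.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 1])
print(sens, spec)  # 0.5 0.5
```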

To-Be Implemented

  • Biophysical Descriptor Module: A feature to calculate and incorporate key biophysical parameters, such as the isoelectric point (pI), which was identified as a major driver of non-specificity.

  • Support for Other PLMs: Integration of other antibody-specific language models like AbLang or AntiBERTy for performance comparison.
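To give a sense of what the planned pI module involves: a rough isoelectric point can be estimated by bisecting for the pH at which the sequence's Henderson-Hasselbalch net charge crosses zero. The pKa values below are one common textbook set, chosen here for illustration; the eventual module may use a different scale.

```python
# Approximate side-chain and terminal pKa values (one common textbook set).
PKA_POS = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}

def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge of a protein at a given pH."""
    pos = sum(1 / (1 + 10 ** (ph - PKA_POS[aa])) for aa in seq if aa in PKA_POS)
    pos += 1 / (1 + 10 ** (ph - PKA_POS["N_TERM"]))
    neg = sum(1 / (1 + 10 ** (PKA_NEG[aa] - ph)) for aa in seq if aa in PKA_NEG)
    neg += 1 / (1 + 10 ** (PKA_NEG["C_TERM"] - ph))
    return pos - neg

def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """Bisect for the pH where the (monotonically decreasing) charge hits zero."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged -> pI is higher
        else:
            hi = mid
    return (lo + hi) / 2
```

Lysine-rich sequences land at a high pI and aspartate-rich sequences at a low one, which is the signal the paper links to non-specificity.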


Installation & Setup

To get started, clone the repository and set up the Python environment.

  1. Clone the Repository
git clone https://github.com/The-Obstacle-Is-The-Way/antibody_training_pipeline_ESM.git
cd antibody_training_pipeline_ESM
  2. Create the Environment

This project uses uv for fast Python package management with virtual environments.

Install uv if you don't have it:

  • For Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
  • For Windows (use pip)
pip install uv
  3. Set up the project

Recommended (all platforms):

# This runs 'uv sync --all-extras' to install ALL dependencies (including dev tools)
make install

Manual setup:

  • On Linux/macOS
uv venv
source .venv/bin/activate
uv sync --all-extras  # Install all dependencies
  • On Windows
uv venv
.venv\Scripts\activate
uv sync --all-extras  # Install all dependencies

Important: Always use make install or uv sync --all-extras to ensure dev dependencies (pytest, mypy, etc.) are installed. Plain uv sync will skip them.


Developer Workflow

This project uses modern Python tooling for a streamlined development experience. All common tasks are available through simple make commands.

Quick Start

# Install dependencies
make install

# Run full quality pipeline (format, lint, typecheck, test)
make all

Available Commands

Command          Description
make install     Install all project dependencies with uv
make format      Auto-format code with ruff
make lint        Check code quality with ruff linting
make typecheck   Run static type checking with mypy
make hooks       Run pre-commit hooks on all files
make test        Fast suite (unit + integration; skips e2e, slow, gpu)
make test-e2e    End-to-end suite (honors env flags like RUN_NOVO_E2E)
make test-all    Full pytest suite (env-gated e2e may still skip)
make all         Run complete quality pipeline (format → lint → typecheck → fast tests)
make train       Run the ML training pipeline
make clean       Remove cache directories and temporary files
make help        Show all available commands

Training with Hydra

The pipeline uses Hydra for flexible configuration management. Default config is in src/antibody_training_esm/conf/config.yaml:

# Train with default Hydra config (src/antibody_training_esm/conf/config.yaml)
make train
# OR
uv run antibody-train

# Override any parameter from CLI
uv run antibody-train hardware.device=cuda training.batch_size=32
uv run antibody-train classifier.C=0.5 classifier.penalty=l1

# Run hyperparameter sweeps (multi-run mode)
uv run antibody-train --multirun classifier.C=0.1,1.0,10.0
# Creates 3 separate runs with different C values

# Sweep multiple parameters (Cartesian product)
uv run antibody-train --multirun classifier.C=0.1,1.0 classifier.penalty=l1,l2
# Creates 4 runs: (0.1,l1), (0.1,l2), (1.0,l1), (1.0,l2)
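Hydra's multi-run expansion is simply a Cartesian product over the comma-separated values, so the run count can be sanity-checked with stdlib itertools:

```python
from itertools import product

# Mirrors: uv run antibody-train --multirun classifier.C=0.1,1.0 classifier.penalty=l1,l2
sweep = {"classifier.C": [0.1, 1.0], "classifier.penalty": ["l1", "l2"]}
runs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
for run in runs:
    print(run)
# 4 runs: (0.1, l1), (0.1, l2), (1.0, l1), (1.0, l2)
```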

Output structure:

experiments/runs/
└── {experiment.name}/
    └── {timestamp}/
        ├── .hydra/config.yaml   # Full resolved config
        ├── logs/training.log    # Training logs
        └── {model}.pkl          # Trained model artifact

Why Hydra?

  • No file editing: Override any config value from CLI
  • Experiment tracking: Every run saves its full configuration automatically
  • Hyperparameter sweeps: Multi-run mode for systematic parameter exploration
  • Reproducibility: Complete config provenance for every experiment

See Training Guide for comprehensive Hydra usage.


Code Quality Standards

This repository maintains 100% type safety and enforces quality through pre-commit hooks:

  • Ruff: Fast linting and formatting (replaces black, isort, flake8)
  • mypy: Static type checking with strict configuration
  • pytest: Comprehensive test coverage

Pre-commit Hooks

Pre-commit hooks run automatically before each commit to ensure code quality:

# Install hooks (one-time setup)
uv run pre-commit install

# Run hooks manually on all files
make hooks

The hooks will automatically:

  • Format code with ruff
  • Check linting rules
  • Verify type safety with mypy

If any check fails, the commit is blocked until issues are resolved.

Development Best Practices

  1. Before committing: Run make all to ensure everything passes
  2. When adding new code: Include type annotations from the start
  3. If pre-commit blocks: Review the errors and fix them - the hooks ensure quality
  4. For quick checks: Use individual commands like make lint or make typecheck

Security

Pickle Usage

This codebase uses Python's pickle module for:

  • Trained ML models: Saving/loading BinaryClassifier models (.pkl files)
  • Embedding caches: Caching expensive ESM embeddings for performance
  • Preprocessed datasets: Storing locally processed data

Threat Model: All pickle files are generated and consumed locally by trusted code. There is no internet-exposed API and no loading of untrusted pickle files.

For Production Deployment: If deploying this pipeline to a production environment with external access, consider migrating to JSON + NPZ format for artifact serialization. See SECURITY_REMEDIATION_PLAN.md for details.
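A minimal sketch of the NPZ + JSON pattern, assuming NumPy; the file names and keys here are illustrative rather than the pipeline's actual schema. Weights live in plain arrays and metadata in JSON, so loading never deserializes executable objects the way pickle does:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Illustrative weights for a fitted logistic regression (coef + intercept).
coef = np.array([[0.5, -1.2, 0.3]])
intercept = np.array([-0.1])

out = Path(tempfile.mkdtemp())
np.savez(out / "model.npz", coef=coef, intercept=intercept)
(out / "model_config.json").write_text(
    json.dumps({"model_type": "logistic_regression", "classes": [0, 1]})
)

# Loading touches only arrays and JSON -- no arbitrary code execution,
# unlike pickle.load() on an untrusted file.
arrays = np.load(out / "model.npz")
config = json.loads((out / "model_config.json").read_text())
restored_coef = arrays["coef"]
```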


Docker Support (Frictionless)

This project uses a Makefile workflow that automatically detects GPU availability and launches the correct Docker configuration.

Works out-of-the-box on:

  • macOS (Apple Silicon/Intel) β†’ Launches in CPU mode
  • Linux/Windows (with NVIDIA GPU) β†’ Launches with CUDA support
  • Linux/Windows (CPU only) β†’ Launches in CPU mode

Commands

1. Development Environment

Includes all dev tools, tests, and hot-reloading source code.

# Auto-detects GPU/CPU and starts the dev shell
make docker-dev

2. Production Environment

Optimized, secure image with pre-cached model weights (~650MB).

# Auto-detects GPU/CPU and runs the training pipeline
make docker-prod

Manual Usage (Optional)

If you prefer using docker compose directly:

  • CPU (macOS/Linux): docker compose up
  • GPU (Linux/Windows): docker compose -f docker-compose.yml -f docker-compose.gpu.yml up

Documentation

📚 Project Documentation

🆕 New to the project? Start with the System Overview to understand what this pipeline does and how it works.

For Users

For Developers

For Researchers

Dataset Documentation

See Datasets section below for dataset-specific preprocessing and validation docs.


Datasets

This pipeline uses four antibody datasets for training and evaluation:

Boughter Dataset (Training)

Source: Boughter et al. (2020)
Size: 914 VH sequences
Assay: ELISA polyreactivity assay
Usage: Primary training dataset

Documentation: See docs/datasets/boughter/ for preprocessing steps and data sources.


Jain Dataset (Test - Clinical Antibodies)

Source: Jain et al. (2017)
Size: 86 clinical antibodies
Assay: Per-antigen ELISA (Adimab dataset)
Usage: Primary test dataset; benchmark for the Novo Nordisk comparison (68.60% accuracy, exact parity with the published result)

Documentation: See docs/datasets/jain/ for preprocessing steps and data sources.


Harvey Dataset (Test - Nanobodies)

Source: Harvey et al. (2022)
Size: 141,021 nanobody sequences
Assay: PSR (polyspecific reagent) assay
Usage: Large-scale nanobody test set

Documentation: See docs/datasets/harvey/ for preprocessing steps and data sources.


Shehata Dataset (Test - PSR Cross-Validation)

Source: Shehata et al. (2019)
Size: 398 human antibodies
Assay: PSR (polyspecific reagent) assay
Usage: Cross-assay validation (PSR vs ELISA)

Documentation: See docs/datasets/shehata/ for preprocessing steps and data sources.


Citation

This work implements the methodology from:

Sakhnini et al. (2025) - Novo Nordisk & University of Cambridge

Sakhnini, L.I., Beltrame, L., Fulle, S., Sormanni, P., Henriksen, A., Lorenzen, N., Vendruscolo, M., & Granata, D. (2025). Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv. https://doi.org/10.1101/2025.04.28.650927

@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  month={May},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}

Dataset Attributions

This repository uses training and test datasets from multiple published studies:

  • Training: Boughter et al. 2020 (914 VH sequences, ELISA polyreactivity)
  • Test: Jain et al. 2017 (86 clinical antibodies, per-antigen ELISA from Adimab)
  • Test: Harvey et al. 2022 (141k nanobodies, PSR assay)
  • Test: Shehata et al. 2019 (398 antibodies, PSR cross-assay validation)

For complete citations, BibTeX entries, and data attribution details, see CITATIONS.md.
