
Add Megatron-Bridge LoRA support for GRPO actor training #1865

Open
taivu1998 wants to merge 1 commit into THUDM:main from taivu1998:tdv/issue-1202-lora-grpo

Conversation

@taivu1998

Summary

Addresses #1202.

This PR adds an initial supported Megatron-Bridge LoRA path for dense GRPO actor training in slime. It introduces LoRA CLI flags, validates the initially supported configuration at startup, applies Megatron-Bridge PEFT LoRA to the actor model only, and exports effective actor weights to SGLang by temporarily merging the adapters into the live model during bridge-based HF weight conversion.

Why

Issue #1202 asks for LoRA support for GRPO training, along with examples. The discussion also calls out known Megatron-Bridge LoRA risks around MoE and checkpointing paths, so this implementation intentionally starts with a narrow, guarded dense-model path rather than silently enabling unsupported combinations.

Changes

  • Added --enable-lora, --lora-target-modules, --lora-rank, --lora-alpha, and --lora-dropout (see the flag sketch after this list).
  • Added validation for the first supported LoRA slice:
    • Megatron backend.
    • Megatron-Bridge HF export mode.
    • GRPO actor training.
    • Colocated rollout (required except in --debug-train-only runs).
    • Dense models only.
    • Default weight backuper enabled.
  • Added a Megatron LoRA helper module (sketched after this list) that:
    • lazily imports Megatron-Bridge PEFT LoRA.
    • builds the LoRA config from slime args.
    • applies LoRA only to the actor provider path.
    • logs local trainable and total parameter counts.
    • restores TensorBackuper-style weight backups around the temporary merge used for export.
  • Updated bridge weight export so that LoRA runs export the effective (base + adapter) actor weights to SGLang and then restore the unmerged training weights (see the merge/restore sketch below).
  • Added focused unit tests for validation, config mapping, actor-only application, merge traversal, and backup-restore safety (see the round-trip test sketch below).
  • Added an English advanced usage page with a concrete GRPO LoRA flag example and linked it from the docs index.
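
For concreteness, here is a minimal sketch of the new CLI surface, assuming slime registers its arguments through argparse. The flag names are the ones added by this PR; the defaults and the target-module names are illustrative assumptions, not the shipped values.

```python
import argparse


def add_lora_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register the LoRA flags added by this PR (defaults are illustrative)."""
    group = parser.add_argument_group("lora")
    group.add_argument("--enable-lora", action="store_true",
                       help="Apply Megatron-Bridge PEFT LoRA to the actor model.")
    group.add_argument("--lora-target-modules", nargs="+",
                       default=["linear_qkv", "linear_proj"],  # assumed defaults
                       help="Module names to wrap with LoRA adapters.")
    group.add_argument("--lora-rank", type=int, default=16,
                       help="Low-rank dimension of each adapter.")
    group.add_argument("--lora-alpha", type=float, default=32.0,
                       help="Scaling numerator; the effective scale is alpha / rank.")
    group.add_argument("--lora-dropout", type=float, default=0.0,
                       help="Dropout applied on the adapter path.")
    return parser
```

A training invocation would then carry something like `--enable-lora --lora-rank 16 --lora-alpha 32` on top of the usual GRPO flags; the docs page added here shows the concrete example.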
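The helper module looks roughly like the sketch below. This is not the exact Megatron-Bridge API: the import path, the `LoRA` field names, and applying the transform by calling it on the model are assumptions about Megatron-Bridge's PEFT design.

```python
def apply_lora_to_actor(actor_model, args):
    """Sketch: build a LoRA config from slime args and apply it to the actor only."""
    # Lazy import so non-LoRA runs never touch Megatron-Bridge PEFT.
    from megatron.bridge.peft.lora import LoRA  # assumed import path

    lora = LoRA(  # field names are assumptions
        target_modules=list(args.lora_target_modules),
        dim=args.lora_rank,
        alpha=args.lora_alpha,
        dropout=args.lora_dropout,
    )
    actor_model = lora(actor_model)  # assumed: the PEFT transform is callable on the model

    # Log local (per-rank) trainable vs. total parameter counts.
    trainable = sum(p.numel() for p in actor_model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in actor_model.parameters())
    print(f"[lora] local trainable/total params: {trainable:,} / {total:,}")
    return actor_model
```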
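The merge-export-restore flow around bridge-based HF conversion can be pictured as the self-contained torch sketch below; the `lora_a` / `lora_b` / `lora_alpha` / `lora_rank` attribute names are illustrative, not slime's or Megatron-Bridge's real ones.

```python
import contextlib

import torch


def _merge_lora_weights(model: torch.nn.Module) -> None:
    """Fold adapter deltas into base weights in place (attribute names illustrative)."""
    with torch.no_grad():
        for m in model.modules():
            if hasattr(m, "lora_a") and hasattr(m, "lora_b"):
                scale = m.lora_alpha / m.lora_rank
                m.weight += scale * (m.lora_b.weight @ m.lora_a.weight)


@contextlib.contextmanager
def merged_for_export(model: torch.nn.Module):
    """Temporarily merge adapters so HF conversion sees effective actor weights."""
    # TensorBackuper-style snapshot of the unmerged training weights.
    backup = {name: p.detach().clone() for name, p in model.named_parameters()}
    try:
        _merge_lora_weights(model)
        yield model  # bridge-based HF weight conversion runs inside this block
    finally:
        # Restore the unmerged weights so training continues from the same state.
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.copy_(backup[name])
```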
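The backup-restore safety check can then be as small as the following pytest-style sketch, reusing the hypothetical `merged_for_export` from the sketch above.

```python
import torch


def test_merge_restore_roundtrip():
    # Tiny adapted layer following the attribute convention assumed above.
    lin = torch.nn.Linear(4, 4, bias=False)
    lin.lora_a = torch.nn.Linear(4, 2, bias=False)
    lin.lora_b = torch.nn.Linear(2, 4, bias=False)
    lin.lora_alpha, lin.lora_rank = 32.0, 2

    before = lin.weight.detach().clone()
    with merged_for_export(lin):
        assert not torch.allclose(lin.weight, before)  # adapters folded in
    assert torch.allclose(lin.weight, before)  # unmerged weights restored
```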

Guardrails

The PR rejects combinations that need separate parity work before they can be supported (see the validation sketch after this list):

  • MoE models.
  • PPO or critic-based training.
  • Decoupled rollout mode outside --debug-train-only.
  • Custom model providers.
  • --only-train-params-name-list and --freeze-params-name-list.
  • On-policy distillation.
  • Reference model update intervals.
  • --disable-weights-backuper.
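
A sketch of how the startup check might read; the `args` attribute names are guesses derived from the flags above, not slime's actual internals.

```python
def validate_lora_args(args) -> None:
    """Reject LoRA combinations outside the first supported slice."""
    if not args.enable_lora:
        return
    if getattr(args, "num_experts", None):  # MoE marker; attribute name assumed
        raise ValueError("LoRA: MoE models are not supported yet.")
    if getattr(args, "use_critic", False):  # PPO/critic path; attribute name assumed
        raise ValueError("LoRA: PPO or critic-based training is not supported.")
    if not getattr(args, "colocate", True) and not args.debug_train_only:
        raise ValueError("LoRA: decoupled rollout is only allowed with --debug-train-only.")
    if getattr(args, "disable_weights_backuper", False):
        raise ValueError("LoRA export requires the default weight backuper.")
    # ...plus the remaining guardrails from the list above (custom providers,
    # param freeze lists, on-policy distillation, ref-model update intervals).
```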

Validation

  • `env UV_CACHE_DIR=/tmp/uv-cache PYTHONPATH=. uv run pytest tests/test_lora_support.py` → 22 passed
  • `python3 -m py_compile slime/backends/megatron_utils/peft.py slime/backends/megatron_utils/model_provider.py slime/backends/megatron_utils/model.py slime/backends/megatron_utils/update_weight/hf_weight_iterator_bridge.py slime/utils/arguments.py tests/test_lora_support.py`
  • `git diff --check`

`uv run ruff check ...` was attempted locally but could not run because this worktree environment does not have a ruff executable installed.

taivu1998 marked this pull request as ready for review April 26, 2026 23:05