
fix: guard DP-imbalance empty micro-batches under dynamic batching #1860

Open
leofan-lab wants to merge 1 commit into THUDM:main from leofan-lab:fix/dp-empty-microbatch

Conversation

@leofan-lab
Contributor

Summary

With --use-dynamic-batch-size and variable-length rollouts (common in multi-turn RL with tool calls), DP ranks can need different numbers of micro-batches. The all_reduce(MAX, num_microbatches) in get_data_iterator forces ranks with fewer real samples to iterate empty micro-batches through the pipeline scheduler, previously producing two crash signatures deep inside post-forward aggregation:

  • torch.cat(): expected a non-empty list of Tensors (from get_batch on empty tokens)
  • AttributeError: 'NoneType' object has no attribute 'device' (from compute_advantages_and_returns on trailing Nones in the reorder output)

Both trace back to the same root cause.
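
The mechanics are easy to reproduce without a GPU. Here is a minimal pure-Python sketch of the failure mode, with max() standing in for the all_reduce(MAX, num_microbatches) collective and illustrative per-rank counts:

```python
# Pure-Python sketch of the failure mode; no process group is needed to
# show the effect, so max() stands in for all_reduce(MAX, num_microbatches).
def capped_microbatch_counts(per_rank_counts):
    global_max = max(per_rank_counts)  # stand-in for the MAX all-reduce
    return [(real, global_max) for real in per_rank_counts]

# DP=3 with imbalanced rollouts: rank 2 packed only 1 real micro-batch.
for rank, (real, total) in enumerate(capped_microbatch_counts([4, 4, 1])):
    for step in range(total):
        tokens = ["<sample>"] if step < real else []  # empty past `real`
        status = "real" if tokens else "EMPTY -> torch.cat([]) crash pre-fix"
        print(f"rank {rank} micro-batch {step}: {status}")
```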

When this fires

Observed on async-retool training runs with --use-dynamic-batch-size and DP > 1, once sample-length variance grew high enough that some ranks finished their real micro-batches before the collective cap. It does not fire with a fixed batch size or with DP=1.

Fix

Two layered guards for the same class of bug:

  1. get_batch: fabricate a 1-token placeholder when tokens=[], with a declared placeholder_for_key schema and a post-fill invariant assert, so a future schema addition fails loudly at the placeholder site instead of 400 lines downstream. response_length=0 makes the placeholder a no-op under CP chunking and sum_of_sample_mean (0-size tensors, split_sizes=[0]). See the first sketch after this list.
  2. forward_only: size the reordered output to len(origin_indices) (not len(values)), dropping trailing placeholder outputs. Add local asserts for the first-fit bin-packer invariant (real samples at [0, N)) and for output-count >= real-sample-count, so a future change to the partitioner or forward path fails locally rather than at some downstream .device access. See the second sketch after this list.
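
For guard 1, a minimal sketch of the placeholder fill, assuming a dict-of-lists batch layout; the key set and the fill_empty_microbatch_placeholder name are illustrative stand-ins for the repo's actual schema and its _fill_empty_microbatch_placeholder helper:

```python
import torch

# Illustrative schema, not the repo's actual get_batch keys.
PLACEHOLDER_FOR_KEY = {
    "tokens": lambda: torch.zeros(1, dtype=torch.long),  # 1-token placeholder
    "response_length": lambda: 0,  # 0 => no-op under CP chunking / sum_of_sample_mean
    "loss_mask": lambda: torch.zeros(1, dtype=torch.long),
}

def fill_empty_microbatch_placeholder(batch):
    """Append one fabricated sample to an otherwise-empty micro-batch."""
    for key, make in PLACEHOLDER_FOR_KEY.items():
        batch.setdefault(key, []).append(make())
    # Post-fill invariant: every batch key has a declared placeholder. A key
    # added without a schema entry fails here, not 400 lines downstream.
    unfilled = [k for k in batch if k not in PLACEHOLDER_FOR_KEY]
    assert not unfilled, f"no placeholder declared for keys: {unfilled}"
    return batch

print(fill_empty_microbatch_placeholder({}))
```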
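
For guard 2, a sketch of the reorder sizing under the stated invariants; reorder_outputs is a hypothetical stand-in for the forward_only reorder path, and the reading of origin_indices as a packed-position-to-original-position map is my assumption, not confirmed by the code:

```python
# Assumed: origin_indices[i] is the original position of packed output i, and
# the first-fit packer keeps real samples at positions [0, N) (per the PR).
def reorder_outputs(values, origin_indices):
    n_real = len(origin_indices)
    # Guard: the forward path must return at least one output per real sample.
    assert len(values) >= n_real, (len(values), n_real)
    out = [None] * n_real  # sized to origin_indices, not len(values)
    for value, origin in zip(values[:n_real], origin_indices):
        out[origin] = value  # trailing placeholder outputs are dropped
    assert all(v is not None for v in out), "a real sample lost its output"
    return out

print(reorder_outputs(["a", "b", "<placeholder>"], [1, 0]))  # -> ['b', 'a']
```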

Tests

  • tests/test_dp_imbalance_reorder.py — 5 tests on the forward_only reorder, including a test_fix_vs_pre_fix_behavior_divergence case that asserts the exact pre-fix trailing-None shape (a condensed version is sketched below).
  • tests/test_dp_imbalance_placeholder.py — 4 tests on the get_batch placeholder schema: known-key fill, per-sample-count invariant, unknown-key default, invariant-assert firing on schema drift. Imports and exercises the real _fill_empty_microbatch_placeholder helper (mutation-tested: removing the invariant-assert block breaks test_placeholder_invariant_fires_on_unfilled_key).

9/9 pass on a CPU-only dev pod.
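
For reference, a condensed, self-contained version of the divergence case; both reorder variants here are simplified stand-ins for the real forward_only path, not the repo's actual test code:

```python
# Condensed sketch of the described divergence test (pytest style).
def pre_fix_reorder(values, origin_indices):
    out = [None] * len(values)  # pre-fix: sized to len(values)
    for v, origin in zip(values, origin_indices):
        out[origin] = v
    return out  # leaves one trailing None per placeholder output

def fixed_reorder(values, origin_indices):
    out = [None] * len(origin_indices)  # fix: sized to origin_indices
    for v, origin in zip(values, origin_indices):
        out[origin] = v
    return out

def test_fix_vs_pre_fix_behavior_divergence():
    values = ["a", "b", "<placeholder>"]  # two real outputs + one placeholder
    origin_indices = [1, 0]
    # Exact pre-fix shape: the trailing None that later hit .device downstream.
    assert pre_fix_reorder(values, origin_indices) == ["b", "a", None]
    assert fixed_reorder(values, origin_indices) == ["b", "a"]
```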

Commit message

With --use-dynamic-batch-size and variable-length rollouts (common in
multi-turn RL), DP ranks can need different micro-batch counts. The
all_reduce(MAX, num_microbatches) forces lagging ranks to iterate empty
micro-batches, which previously died inside torch.cat / torch.split with
no useful context.

Two layered fixes for the same root cause:

1. get_batch: fabricate a 1-token placeholder when `tokens=[]`, with a
   declared placeholder-for-key schema and a post-fill invariant assert
   so a future schema addition fails loudly here instead of 400 lines
   downstream. response_length=0 makes the placeholder a no-op under CP
   chunking and in sum_of_sample_mean.

2. forward_only: size the reordered output to len(origin_indices), not
   len(values), so trailing placeholder outputs are dropped. Add
   invariant asserts for the first-fit bin-packer assumption (real
   samples at positions [0, N)) and for output-count >= origin count,
   so future changes that interleave empties or drop real outputs fail
   locally rather than at some downstream .device access.

Tests: tests/test_dp_imbalance_reorder.py (forward_only reorder),
tests/test_dp_imbalance_placeholder.py (get_batch placeholder schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
