
fix: guard DP-imbalance empty micro-batches under dynamic batching #1860

Open
leofan-lab wants to merge 1 commit into THUDM:main from leofan-lab:fix/dp-empty-microbatch

Conversation

@leofan-lab
Contributor

Summary

With --use-dynamic-batch-size and variable-length rollouts (common in multi-turn RL with tool calls), DP ranks can need different numbers of micro-batches. The all_reduce(MAX, num_microbatches) in get_data_iterator forces ranks with fewer real samples to iterate empty micro-batches through the pipeline scheduler, previously producing two crash signatures deep inside post-forward aggregation:

  • torch.cat(): expected a non-empty list of Tensors (from get_batch on empty tokens)
  • AttributeError: 'NoneType' object has no attribute 'device' (from compute_advantages_and_returns on trailing Nones in the reorder output)

Both trace back to the same root cause.
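
The mechanics are easy to reproduce without a GPU. Here is a minimal pure-Python sketch of the failure mode, with max() standing in for the all_reduce(MAX, num_microbatches) collective and illustrative per-rank counts:

```python
# Pure-Python sketch of the failure mode; no process group is needed to
# show the effect, so max() stands in for all_reduce(MAX, num_microbatches).
def capped_microbatch_counts(per_rank_counts):
    global_max = max(per_rank_counts)  # stand-in for the MAX all-reduce
    return [(real, global_max) for real in per_rank_counts]

# DP=3 with imbalanced rollouts: rank 2 packed only 1 real micro-batch.
for rank, (real, total) in enumerate(capped_microbatch_counts([4, 4, 1])):
    for step in range(total):
        tokens = ["<sample>"] if step < real else []  # empty past `real`
        status = "real" if tokens else "EMPTY -> torch.cat([]) crash pre-fix"
        print(f"rank {rank} micro-batch {step}: {status}")
```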

When this fires

Observed on async-retool training runs with --use-dynamic-batch-size and DP > 1, once sample-length variance grew high enough that some ranks finished their real micro-batches before the collective cap. It does not fire with a fixed batch size or with DP=1.

Fix

Two layered guards for the same class of bug:

  1. get_batch: fabricate a 1-token placeholder when tokens=[], with a declared placeholder_for_key schema and a post-fill invariant assert, so a future schema addition fails loudly at the placeholder site instead of 400 lines downstream. response_length=0 makes the placeholder a no-op under CP chunking and sum_of_sample_mean (0-size tensors, split_sizes=[0]). See the first sketch after this list.
  2. forward_only: size the reordered output to len(origin_indices) (not len(values)), dropping trailing placeholder outputs. Add local asserts for the first-fit bin-packer invariant (real samples at [0, N)) and for output-count >= real-sample-count, so a future change to the partitioner or forward path fails locally rather than at some downstream .device access. See the second sketch after this list.
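
For guard 1, a minimal sketch of the placeholder fill, assuming a dict-of-lists batch layout; the key set and the fill_empty_microbatch_placeholder name are illustrative stand-ins for the repo's actual schema and its _fill_empty_microbatch_placeholder helper:

```python
import torch

# Illustrative schema, not the repo's actual get_batch keys.
PLACEHOLDER_FOR_KEY = {
    "tokens": lambda: torch.zeros(1, dtype=torch.long),  # 1-token placeholder
    "response_length": lambda: 0,  # 0 => no-op under CP chunking / sum_of_sample_mean
    "loss_mask": lambda: torch.zeros(1, dtype=torch.long),
}

def fill_empty_microbatch_placeholder(batch):
    """Append one fabricated sample to an otherwise-empty micro-batch."""
    for key, make in PLACEHOLDER_FOR_KEY.items():
        batch.setdefault(key, []).append(make())
    # Post-fill invariant: every batch key has a declared placeholder. A key
    # added without a schema entry fails here, not 400 lines downstream.
    unfilled = [k for k in batch if k not in PLACEHOLDER_FOR_KEY]
    assert not unfilled, f"no placeholder declared for keys: {unfilled}"
    return batch

print(fill_empty_microbatch_placeholder({}))
```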
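
For guard 2, a sketch of the reorder sizing under the stated invariants; reorder_outputs is a hypothetical stand-in for the forward_only reorder path, and the reading of origin_indices as a packed-position-to-original-position map is my assumption, not confirmed by the code:

```python
# Assumed: origin_indices[i] is the original position of packed output i, and
# the first-fit packer keeps real samples at positions [0, N) (per the PR).
def reorder_outputs(values, origin_indices):
    n_real = len(origin_indices)
    # Guard: the forward path must return at least one output per real sample.
    assert len(values) >= n_real, (len(values), n_real)
    out = [None] * n_real  # sized to origin_indices, not len(values)
    for value, origin in zip(values[:n_real], origin_indices):
        out[origin] = value  # trailing placeholder outputs are dropped
    assert all(v is not None for v in out), "a real sample lost its output"
    return out

print(reorder_outputs(["a", "b", "<placeholder>"], [1, 0]))  # -> ['b', 'a']
```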

Tests

  • tests/test_dp_imbalance_reorder.py — 5 tests on the forward_only reorder, including a test_fix_vs_pre_fix_behavior_divergence case that asserts the exact pre-fix trailing-None shape (a condensed version is sketched below).
  • tests/test_dp_imbalance_placeholder.py — 4 tests on the get_batch placeholder schema: known-key fill, per-sample-count invariant, unknown-key default, invariant-assert firing on schema drift. Imports and exercises the real _fill_empty_microbatch_placeholder helper (mutation-tested: removing the invariant-assert block breaks test_placeholder_invariant_fires_on_unfilled_key).

9/9 pass on a CPU-only dev pod.
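
For reference, a condensed, self-contained version of the divergence case; both reorder variants here are simplified stand-ins for the real forward_only path, not the repo's actual test code:

```python
# Condensed sketch of the described divergence test (pytest style).
def pre_fix_reorder(values, origin_indices):
    out = [None] * len(values)  # pre-fix: sized to len(values)
    for v, origin in zip(values, origin_indices):
        out[origin] = v
    return out  # leaves one trailing None per placeholder output

def fixed_reorder(values, origin_indices):
    out = [None] * len(origin_indices)  # fix: sized to origin_indices
    for v, origin in zip(values, origin_indices):
        out[origin] = v
    return out

def test_fix_vs_pre_fix_behavior_divergence():
    values = ["a", "b", "<placeholder>"]  # two real outputs + one placeholder
    origin_indices = [1, 0]
    # Exact pre-fix shape: the trailing None that later hit .device downstream.
    assert pre_fix_reorder(values, origin_indices) == ["b", "a", None]
    assert fixed_reorder(values, origin_indices) == ["b", "a"]
```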

Commit message

With --use-dynamic-batch-size and variable-length rollouts (common in
multi-turn RL), DP ranks can need different micro-batch counts. The
all_reduce(MAX, num_microbatches) forces lagging ranks to iterate empty
micro-batches, which previously died inside torch.cat / torch.split with
no useful context.

Two layered fixes for the same root cause:

1. get_batch: fabricate a 1-token placeholder when `tokens=[]`, with a
   declared placeholder-for-key schema and a post-fill invariant assert
   so a future schema addition fails loudly here instead of 400 lines
   downstream. response_length=0 makes the placeholder a no-op under CP
   chunking and in sum_of_sample_mean.

2. forward_only: size the reordered output to len(origin_indices), not
   len(values), so trailing placeholder outputs are dropped. Add
   invariant asserts for the first-fit bin-packer assumption (real
   samples at positions [0, N)) and for output-count >= origin count,
   so future changes that interleave empties or drop real outputs fail
   locally rather than at some downstream .device access.

Tests: tests/test_dp_imbalance_reorder.py (forward_only reorder),
tests/test_dp_imbalance_placeholder.py (get_batch placeholder schema).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
