Bug fixes and performance optimizations for FLUX training #187

Merged
coolkp merged 2 commits into AI-Hypercomputer:main from hx89:haixinl/sharding_fix
Jul 16, 2025
Contributor

@hx89 hx89 commented Jun 18, 2025

Bug fixes:

  1. Fix the text sequence length configuration issue
  2. Fix the race condition when cleaning up the encoder cache in multi-process runs
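
The description does not say how the cache cleanup race was fixed. One common pattern, sketched here purely as an illustration, is to let a single process delete the directory while the others wait at a barrier. The function name `cleanup_encoder_cache`, its parameters, and the no-op default barrier are all hypothetical; in a real multi-host JAX job the barrier could be something like `jax.experimental.multihost_utils.sync_global_devices`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def cleanup_encoder_cache(
    cache_dir: Path,
    process_index: int,
    barrier: Callable[[], None] = lambda: None,  # hypothetical hook; no-op here
) -> bool:
    """Delete cache_dir from exactly one process; return True if this process deleted it."""
    barrier()  # wait until every process has finished reading the cache
    removed = False
    if process_index == 0:
        # ignore_errors keeps the call idempotent if the directory is already gone
        shutil.rmtree(cache_dir, ignore_errors=True)
        removed = True
    barrier()  # nobody continues until the deletion has completed
    return removed


# Single-machine demonstration: simulate three process indices in sequence.
demo_dir = Path(tempfile.mkdtemp()) / "encoder_cache"
demo_dir.mkdir()
(demo_dir / "emb.bin").write_bytes(b"\x00")
results = [cleanup_encoder_cache(demo_dir, idx) for idx in (1, 0, 2)]
```

Gating on one process index avoids several processes racing to remove the same files; the two barriers are what prevent a fast process from deleting data another process is still reading.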

Performance optimizations:

  1. Fix the FSDP sharding in fc2
  2. Remove sharding of the bias to avoid all-gather
  3. Remove the sharding of rope embedding to avoid all-gather
  4. Fix the sharding of time step embedding to avoid all-to-all

Also added nsys profiler support

@coolkp coolkp self-requested a review June 18, 2025 17:58
Collaborator

coolkp commented Jul 15, 2025

Can you git rebase instead of merge to preserve the commit history?

Comment thread src/maxdiffusion/max_utils.py Outdated

from google.cloud import storage

libcudart = cdll.LoadLibrary("libcudart.so")
Collaborator

This is breaking pytest: the pytest workflow currently supports only TPU, so the dependencies are not installed. https://github.com/AI-Hypercomputer/maxdiffusion/blob/eaef265428ef200f8260e4edb017f0b59e839f1b/.github/workflows/UnitTests.yml
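
For context, an unconditional `cdll.LoadLibrary("libcudart.so")` at import time raises `OSError` on machines without the CUDA runtime. The author resolved this by removing the nsys support; the following is only a sketch of an alternative, guarding the load so the module still imports on TPU-only runners. The helper name `load_optional_library` and the flag `profiler_available` are hypothetical.

```python
from ctypes import cdll


def load_optional_library(name: str):
    """Return a handle to the shared library, or None if it cannot be loaded."""
    try:
        return cdll.LoadLibrary(name)
    except OSError:
        # Library is missing or unloadable (for example, no CUDA runtime on a TPU host).
        return None


# Import no longer fails when libcudart.so is absent.
libcudart = load_optional_library("libcudart.so")
profiler_available = libcudart is not None
```

Callers would then check `profiler_available` before touching any CUDA profiler entry points.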

Contributor Author

I have removed nsys and rebased to main.

@hx89 hx89 force-pushed the haixinl/sharding_fix branch from 3516f7b to 282f741 Compare July 15, 2025 23:09
@hx89 hx89 force-pushed the haixinl/sharding_fix branch from 282f741 to 105b8fe Compare July 15, 2025 23:22
Collaborator

@coolkp coolkp left a comment

Thanks for the contribution!

@coolkp coolkp merged commit c69c95e into AI-Hypercomputer:main Jul 16, 2025
2 checks passed
2 participants