
Training added on top of flux_impl #147

Merged: entrpn merged 9 commits into AI-Hypercomputer:main from ksikiric:kris/flux-impl-training on Apr 16, 2025

Conversation

ksikiric (Contributor) commented Feb 12, 2025

Linked to #146

I've added the training code on top of https://github.com/AI-Hypercomputer/maxdiffusion/tree/flux_impl. This PR is meant to be merged after #146.

With the training code, I have also added a pipeline for flux, which can be used for inference as well.

google-cla bot commented Feb 12, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ksikiric ksikiric closed this Feb 12, 2025
@ksikiric ksikiric reopened this Feb 12, 2025
@entrpn entrpn mentioned this pull request Feb 13, 2025
@ksikiric force-pushed the kris/flux-impl-training branch from ba2d028 to f56234e on February 13, 2025 12:11
ksikiric (Contributor, Author) commented Feb 13, 2025

Background in #148

@entrpn, I've rebased on flux_lora now and aligned the pipeline with the changes you made to generate_flux.py. Inference is working as expected, but I am a bit suspicious about the training. Please try it out and let's discuss how to move forward with this.

In the meantime, I will prepare another PR adding flash attention (FA) for GPUs, similar to how it is done in maxtext.

ksikiric (Contributor, Author) commented:

Hi @entrpn, have you had a chance to test this PR? I think we can try to merge this soon if you think it looks alright

entrpn (Collaborator) commented Feb 19, 2025

> Hi @entrpn, have you had a chance to test this PR? I think we can try to merge this soon if you think it looks alright

I started to take a look at it. The data pipeline fails for me due to memory constraints in my environment: during text encoding, I get an OOM. The code will need to be refactored so that it can run on CPU, or at least within 32 GB of accelerator memory (preferably 16), since the t5 encoder cannot be sharded at the moment. I remember doing something similar before by batching the captions in the data pipeline. I can take a look at it next week and try to get that part working.
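The batching idea entrpn mentions can be sketched roughly as follows: run the text encoder over captions in small batches so only one batch of activations is resident at a time, which bounds peak memory. This is a hypothetical illustration; `encode_in_batches` and `encode_fn` are illustrative names, not functions from the maxdiffusion codebase, and `encode_fn` stands in for the (currently unsharded) t5 encoder call.

```python
def encode_in_batches(captions, encode_fn, batch_size=32):
    """Run encode_fn over captions batch-by-batch and collect results,
    so peak memory scales with batch_size rather than the full dataset."""
    embeddings = []
    for start in range(0, len(captions), batch_size):
        embeddings.extend(encode_fn(captions[start:start + batch_size]))
    return embeddings

if __name__ == "__main__":
    # Toy stand-in encoder: maps each caption to its character length.
    fake_encode = lambda batch: [len(c) for c in batch]
    print(encode_in_batches(["a cat", "a dog on a mat"], fake_encode, batch_size=1))
```

In the real pipeline the per-batch outputs would typically be written to disk (or fed onward) rather than accumulated in a Python list, but the memory-bounding shape is the same.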

@ksikiric force-pushed the kris/flux-impl-training branch from f56234e to bffc7dc on March 5, 2025 12:10
ksikiric (Contributor, Author) commented Mar 5, 2025

@entrpn I rebased on main and cleaned up the commit logs. Will keep this PR up to date with main to make the merge easier when we are ready.

ksikiric (Contributor, Author) commented:

@entrpn I have merged your branch into mine, added the orbax checkpointing, and a new file for inference that utilizes the flux pipeline. It was easier getting the orbax loading working that way, plus I think it is a bit cleaner. Tell me what you think.

I did a small training run of 100 steps, and the resulting images do look OK. Please verify on your side and let me know if and when we can merge this. Thanks.

entrpn (Collaborator) commented Apr 3, 2025

@ksikiric apologies for the late response and thank you for adding this functionality. Let me take a look at this the week after next as I will be traveling until then. Will get back to you right after I'm back.

Comment thread src/maxdiffusion/schedulers/scheduling_euler_discrete_flax.py Outdated
Comment thread src/maxdiffusion/configs/base_flux_dev.yml Outdated
entrpn (Collaborator) commented Apr 15, 2025

@ksikiric can you share the commands you use to run the training job and save the checkpoint and then the command you use to run inference on the saved checkpoint?

ksikiric (Contributor, Author) commented:

```shell
python src/maxdiffusion/train_flux.py src/maxdiffusion/configs/base_flux_dev.yml save_final_checkpoint=True run_name=flux
```

followed by

```shell
python src/maxdiffusion/generate_flux_pipeline.py src/maxdiffusion/configs/base_flux_dev.yml run_name=flux
```

should do the trick :) @entrpn

Comment thread src/maxdiffusion/configs/base_flux_dev.yml Outdated
Comment thread src/maxdiffusion/configs/base_flux_schnell.yml Outdated
Comment thread src/maxdiffusion/configs/base_flux_schnell.yml Outdated
Comment thread src/maxdiffusion/checkpointing/flux_checkpointer.py Outdated
Comment thread src/maxdiffusion/checkpointing/flux_checkpointer.py Outdated
Comment thread src/maxdiffusion/pipelines/flux/flux_pipeline.py Outdated
entrpn (Collaborator) commented Apr 15, 2025

@ksikiric please take a look at the comments. Afterwards, can you rebase with main and run the linter? If everything passes, we can merge it.

@ksikiric force-pushed the kris/flux-impl-training branch from 58b1bdc to 5453b3c on April 16, 2025 08:58
ksikiric (Contributor, Author) commented Apr 16, 2025

@entrpn @coolkp all comments have now been addressed, and ruff + code_style.sh have been applied.

@ksikiric ksikiric marked this pull request as ready for review April 16, 2025 10:29
@entrpn entrpn merged commit b951454 into AI-Hypercomputer:main Apr 16, 2025
2 checks passed