Commit 39b38cf

update readme to include training on GPUs. Revert max_utils jitting of state.
1 parent e0a538f commit 39b38cf

2 files changed: 24 additions & 14 deletions

File tree

README.md

Lines changed: 11 additions & 3 deletions
```diff
@@ -53,7 +53,7 @@ MaxDiffusion supports
 - [Dreambooth](#dreambooth)
 - [Inference](#inference)
 - [Flux](#flux)
-- [Flash Attention for GPU:](#flash-attention-for-gpu)
+- [Fused Attention for GPU:](#fused-attention-for-gpu)
 - [Hyper SDXL LoRA](#hyper-sdxl-lora)
 - [Load Multiple LoRA](#load-multiple-lora)
 - [SDXL Lightning](#sdxl-lightning)
```
````diff
@@ -83,6 +83,14 @@ After installation completes, run the training script.
 python -m src.maxdiffusion.train_sdxl src/maxdiffusion/configs/base_xl.yml run_name="my_xl_run" output_dir="gs://your-bucket/" per_device_batch_size=1
 ```
 
+On GPUs with Fused Attention:
+
+First install Transformer Engine by following the [instructions here](#fused-attention-for-gpu).
+
+```bash
+NVTE_FUSED_ATTN=1 python -m src.maxdiffusion.train_sdxl src/maxdiffusion/configs/base_xl.yml hardware=gpu run_name='test-sdxl-train' output_dir=/tmp/ train_text_encoder=false cache_latents_text_encoder_outputs=true max_train_steps=200 weights_dtype=float16 activations_dtype=float16 per_device_batch_size=1 attention="cudnn_flash_te"
+```
+
 To generate images with a trained checkpoint, run:
 
 ```bash
````
````diff
@@ -176,8 +184,8 @@ To generate images, run the following command:
 python src/maxdiffusion/generate_flux.py src/maxdiffusion/configs/base_flux_schnell.yml jax_cache_dir=/tmp/cache_dir run_name=flux_test output_dir=/tmp/ prompt="photograph of an electronics chip in the shape of a race car with trillium written on its side" per_device_batch_size=1 ici_data_parallelism=1 ici_fsdp_parallelism=-1 offload_encoders=False
 ```
 
-## Flash Attention for GPU:
-Flash Attention for GPU is supported via TransformerEngine. Installation instructions:
+## Fused Attention for GPU:
+Fused Attention for GPU is supported via TransformerEngine. Installation instructions:
 
 ```bash
 cd maxdiffusion
````
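The `NVTE_FUSED_ATTN=1` prefix on the training command above sets an environment variable for that one process only, which is how Transformer Engine's fused-attention backend is toggled per run. A minimal shell sketch of that mechanism (no MaxDiffusion or GPU required):

```shell
# A VAR=value prefix applies only to the environment of that single command.
NVTE_FUSED_ATTN=1 sh -c 'echo "in command: $NVTE_FUSED_ATTN"'   # prints "in command: 1"

# The surrounding shell never sees the variable.
sh -c 'echo "after: ${NVTE_FUSED_ATTN:-unset}"'                 # prints "after: unset"
```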

src/maxdiffusion/max_utils.py

Lines changed: 13 additions & 11 deletions
```diff
@@ -404,18 +404,20 @@ def setup_initial_state(
     state = state[checkpoint_item]
     if not state:
       max_logging.log(f"Could not find the item in orbax, creating state...")
-      state = init_train_state(
-        model=model,
-        tx=tx,
-        weights_init_fn=weights_init_fn,
-        params=model_params,
-        training=training,
-        eval_only=False
+      init_train_state_partial = functools.partial(
+        init_train_state,
+        model=model,
+        tx=tx,
+        weights_init_fn=weights_init_fn,
+        params=model_params,
+        training=training,
+        eval_only=False,
       )
-      if model_params:
-        state = state.replace(params=model_params)
-
-      state = jax.device_put(state, state_mesh_shardings)
+      state = jax.jit(
+        init_train_state_partial,
+        in_shardings=None,
+        out_shardings=state_mesh_shardings,
+      )()
 
     state = unbox_logicallypartioned_trainstate(state)
```

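The `max_utils.py` change binds the initializer's arguments with `functools.partial` and then jits the resulting zero-argument function, so the train state can be created under `jax.jit` (with `out_shardings` describing its target layout) instead of being built first and moved with `jax.device_put` afterwards. A minimal single-device sketch of the pattern, where `init_train_state` is a toy stand-in for the real one in `max_utils.py` and `out_shardings` is omitted so it runs without a device mesh:

```python
import functools

import jax
import jax.numpy as jnp


def init_train_state(rng, dim):
    # Toy stand-in for max_utils.init_train_state: returns a state pytree.
    return {
        "params": jax.random.normal(rng, (dim, dim)),
        "step": jnp.zeros((), jnp.int32),
    }


# Bind all arguments up front, mirroring the functools.partial in the diff...
init_partial = functools.partial(init_train_state, jax.random.PRNGKey(0), 4)

# ...then jit the zero-argument initializer. With out_shardings set to a mesh
# sharding, XLA would allocate each output buffer directly in its final
# layout; here it is left out so the sketch works on a single device.
state = jax.jit(init_partial)()

print(state["params"].shape)  # (4, 4)
print(int(state["step"]))     # 0
```

The practical benefit of jit-initializing with `out_shardings` is that a large state never has to be fully materialized on one device before being resharded.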
0 commit comments
