Commit d5805b2

Skip Tokamax RaggedDotGroupSizes for FP8
## Description

The FP8 path still uses Tokamax internal backend APIs. The new `RaggedDotGroupSizes` type was introduced ([pull3330](#3330)) for the Tokamax public APIs on the bf16 path, which broke FP8.

## Tests

Benchmarks were run internally.

## Checklist

Before submitting this PR, please make sure (put X in square brackets):

- [X] I have performed a self-review of my code. For an optional AI review, add the `gemini-review` label.
- [X] I have necessary comments in my code, particularly in hard-to-understand areas.
- [X] I have run end-to-end tests and provided workload links above if applicable.
- [X] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in [our documentation](https://maxtext.readthedocs.io/en/latest/development.html#adding-new-documentation-files).
1 parent 5d9e57f commit d5805b2

1 file changed

Lines changed: 1 addition & 1 deletion

File tree

src/maxtext/layers/moe.py

```diff
@@ -897,7 +897,7 @@ def gmm(
       inputs, kernel, tiling, group_sizes, expert_assignments, weight_gather_axes, input_buffer_count, combine_scopes
   ):
     # TODO (b/491979205) pipeline fsdp ag per repeat fails tokamax gmm
-    if self.config.using_pipeline_parallelism and self.config.pipeline_fsdp_ag_per_repeat:
+    if self.config.use_qwix_quantization or (self.config.using_pipeline_parallelism and self.config.pipeline_fsdp_ag_per_repeat):
       tokamax_group_sizes = group_sizes
     else:
       tokamax_group_sizes = tokamax.RaggedDotGroupSizes(
```
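The guard added above can be sketched in isolation. This is a minimal, hypothetical model of the branch (the `Config` dataclass, `select_group_sizes` helper, and `wrap` callable are illustrative stand-ins, not MaxText or Tokamax APIs): when the FP8/qwix path or the pipeline-FSDP-per-repeat path is active, the raw `group_sizes` array is passed through unchanged; only otherwise is it wrapped in the public-API group-sizes type.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the relevant MaxText config flags."""
    use_qwix_quantization: bool = False
    using_pipeline_parallelism: bool = False
    pipeline_fsdp_ag_per_repeat: bool = False


def select_group_sizes(config, group_sizes, wrap):
    """Return raw group_sizes on the FP8 (qwix) or pipeline-FSDP paths,
    otherwise wrap it in the public-API form (e.g. RaggedDotGroupSizes)."""
    if config.use_qwix_quantization or (
        config.using_pipeline_parallelism and config.pipeline_fsdp_ag_per_repeat
    ):
        # FP8 still goes through internal backend APIs, which expect the
        # plain array; wrapping it here is what broke the FP8 path.
        return group_sizes
    return wrap(group_sizes)
```

The key design choice is that the new condition short-circuits before wrapping, so the FP8 path never sees the public-API type it cannot handle, while the bf16 path keeps the `RaggedDotGroupSizes` wrapper introduced in pull3330.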
