add debug statements
Conversion script ran without failing
test verify orbax hf tensors
Add unscanned conversion script for qwen3 next
Move gating op to after sharding optimizations
added zero centered rmsnorm
Add layer by layer comparison script
Remove debug files
Remove zero centered rms norm logic
Remove changes from forward pass logit checker
Remove sow debug line
Fix qkvz split in gated delta net and fix normalization after decoder layers
Run linter and modify ckpt conversion config
remove scanned script since it is not working yet
move qwen3 next unscanned conversion script to utils folder
Remove rms norm after decoder block for qwen3 next
Add scanned conversion script for qwen3 next
Added qwen3 next conversion test script
Resolved gemini review comments
Ran pyink for indentation errors
Added readme for qwen3 next
typo in qwen3 next readme
Reformatted unscanned script
Formatted scripts again
Undo changes in decoders.py
Formatted function with long line length
fix linter issues
Revise gemini-review comment
Add back change to pyconfig after rebase
Resolved pr comments
Added moe strategies section to qwen3 next readme
resolved comments in scripts
Dynamically get batch_size and seq_len
Add logic to decouple tuple when using scanned
Resolve pr comments
Add train compile test for qwen3-next
Update train_compile test for qwen3-next
Moved checks to types.py from pyconfig_deprecated.py
Resolved comment for qwen3 next readme
Ran pyink formatter
Remove sparse_matmul test
Qwen3-Next is Alibaba's 80B-parameter Mixture-of-Experts (MoE) model (activating only 3B parameters per token) that features a novel **hybrid attention** architecture combining Gated DeltaNet (linear attention) and Gated Attention (full attention) for massive context scaling. This documentation covers the integration of **Qwen3-Next-80B-A3B** into MaxText:
For more details on the architecture, see the [Qwen3 Technical Blog](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list).
* * * * *
Checkpoint Conversion
---------------------
To get started, you first need a MaxText-compatible checkpoint.
1. **Download the Model**: Download the official model from Hugging Face. You can use a tool like `hf_transfer` for a fast download.
2. **Convert the Checkpoint**: Run the `convert_qwen3_next_scanned.py` script to convert the downloaded Hugging Face weights into the Orbax format required by MaxText. Example commands for both steps are sketched after this list.
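The following is a rough sketch of both steps. The Hugging Face repo id, the local and bucket paths, and the conversion script's flag names (`--base_model_path`, `--maxtext_model_path`) are illustrative assumptions; check the script's `--help` output for the exact interface.

```bash
# Step 1 (sketch): download the weights. Repo id and local path are assumptions.
pip install -U "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir /tmp/qwen3-next-80b-a3b

# Step 2 (sketch): convert the Hugging Face weights to a scanned Orbax checkpoint.
# The flag names below are assumptions; see the script's --help for the real interface.
python3 convert_qwen3_next_scanned.py \
  --base_model_path /tmp/qwen3-next-80b-a3b \
  --maxtext_model_path gs://<your-bucket>/qwen3-next-80b-a3b/scanned
```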
After converting the checkpoint, you can use it for fine-tuning or start a pre-training run from scratch. The command below is an example for fine-tuning on a v5p-512 slice. To pre-train, simply remove the `load_parameters_path` argument.
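A hedged sketch of such a fine-tuning command is shown below; the `model_name` value, run name, and bucket paths are placeholders, while the overrides used (`load_parameters_path`, `per_device_batch_size`, `steps`) are standard MaxText config flags.

```bash
# Sketch of a fine-tuning run from the converted checkpoint.
# model_name and all paths are placeholders / assumptions.
python3 -m MaxText.train MaxText/configs/base.yml \
  model_name=qwen3-next-80b-a3b \
  run_name=qwen3-next-finetune \
  base_output_directory=gs://<your-bucket>/qwen3-next-runs \
  load_parameters_path=gs://<your-bucket>/qwen3-next-80b-a3b/scanned/0/items \
  per_device_batch_size=1 \
  steps=100
# To pre-train from scratch, simply drop the load_parameters_path argument.
```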
To verify that the MaxText implementation is numerically equivalent to the original Hugging Face model, you can run the end-to-end test scripts. These scripts automate the logit comparison test for each model.
Before running, you must set the `MAXTEXT_CHECKPOINT_PATH` environment variable. You can also optionally set `HF_MODEL_PATH` to point to a local copy of the Hugging Face model.
### Qwen3-Next-80B-A3B
```bash
# Set the required path to your converted MaxText checkpoint (placeholder path).
export MAXTEXT_CHECKPOINT_PATH=gs://<your-bucket>/qwen3-next-80b-a3b/scanned/0/items
# Optional: point to a local copy of the Hugging Face model.
export HF_MODEL_PATH=/tmp/qwen3-next-80b-a3b
```
* * * * *

MoE Strategies
--------------

This model implementation supports both **Token Dropping** and **Dropless** strategies for Mixture-of-Experts routing. See the MaxText [documentation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/reference/core_concepts/moe_configuration.md) on MoE configuration for the flags to set for each strategy.
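As a rough illustration of the two strategies (the flag values below are assumptions; the linked documentation is authoritative), a dropless run keeps `sparse_matmul` enabled, while a token-dropping run disables it and sets a capacity factor:

```bash
# Dropless routing (illustrative overrides).
python3 -m MaxText.train MaxText/configs/base.yml \
  model_name=qwen3-next-80b-a3b \
  sparse_matmul=True

# Token-dropping routing (illustrative overrides).
python3 -m MaxText.train MaxText/configs/base.yml \
  model_name=qwen3-next-80b-a3b \
  sparse_matmul=False \
  capacity_factor=1.25
```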