
Commit 4a20e4a

Merge pull request #3348 from AI-Hypercomputer:shuningjin-fix-link

PiperOrigin-RevId: 881094139
Parents: b67b88b, 58e1cc6

4 files changed: 24 additions & 23 deletions


docs/guides/checkpointing_solutions/convert_checkpoint.md

Lines changed: 15 additions & 14 deletions
@@ -22,13 +22,13 @@ The following models are supported:
 
 - Hugging Face requires PyTorch.
 - Hugging Face model checkpoints require local disk space.
-- The model files are always downloaded to a disk cache first before being loaded into memory (for more info, please consult Hugging Face [docs](https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference)). The default local storage path for Hugging Face models is \$HOME/.cache/huggingface/hub
+- The model files are always downloaded to a disk cache first before being loaded into memory (for more info, please consult Hugging Face [docs](https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference)). The default local storage path for Hugging Face models is `$HOME/.cache/huggingface/hub`
 
 ## Hugging Face to MaxText
 
 Use the `to_maxtext.py` script to convert a Hugging Face model into a MaxText checkpoint. The script automatically downloads the specified model from the Hugging Face Hub, performs the conversion, and saves the converted checkpoint to the given output directory.
 
-\*\**For a complete example, see the test script at [`end_to_end/tpu/qwen3/4b/test_qwen3_to_mt.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/qwen3/4b/test_qwen3_to_mt.sh) and [`end_to_end/tpu/gemma3/4b/test_gemma3_to_mt.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/gemma3/4b/test_gemma3_to_mt.sh).*
+\*\**For complete examples, see the test scripts at [`tests/end_to_end/tpu/qwen3/4b/test_qwen3_to_mt.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/end_to_end/tpu/qwen3/4b/test_qwen3_to_mt.sh) and [`tests/end_to_end/tpu/gemma3/4b/test_gemma3_to_mt.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/end_to_end/tpu/gemma3/4b/test_gemma3_to_mt.sh).*
 
 ### Usage
 
@@ -41,7 +41,7 @@ uv venv --python 3.12 --seed ${VENV_NAME?}
 source ${VENV_NAME?}/bin/activate
 ```
 
-Second, ensure you have the necessary dependencies installed (PyTorch for the conversion script).
+Second, ensure you have the necessary dependencies installed (e.g., PyTorch for checkpoint conversion and the logit check).
 
 ```bash
 python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
@@ -88,17 +88,17 @@ python3 -m maxtext.checkpoint_conversion.to_maxtext maxtext/configs/base.yml \
 - `hf_access_token`: Your Hugging Face token.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local. If not set, the default output directory is `Maxtext/tmp`.
 - `hardware=cpu`: run the conversion script on a CPU machine.
-- `checkpoint_storage_use_zarr3`: # Set to True to use zarr3 format (recommended for McJAX); set to False for Pathways.
-- `checkpoint_storage_use_ocdbt`: # Set to True to use OCDBT format (recommended for McJAX); set to False for Pathways.
-- `--lazy_load_tensors` (optional): If `true`, loads Hugging Face weights on-demand to minimize RAM usage. For large models, it is recommended to use the `--lazy_load_tensors=true` flag to reduce memory usage during conversion. For example, converting a Llama3.1-70B model with `--lazy_load_tensors=true` uses around 200GB of RAM and completes in ~10 minutes.
-- `--hf_model_path` (optional): Specifies a local or remote directory containing the model weights. If unspecified, we use the [default Hugging Face repository ID](https://github.com/AI-Hypercomputer/maxtext/blob/0d909c44391539db4e8cc2a33de9d77a891beb31/src/MaxText/checkpoint_conversion/utils/utils.py#L58-L85) (e.g., openai/gpt-oss-20b). This is necessary for locally dequantized models like GPT-OSS or DeepSeek.
+- `checkpoint_storage_use_zarr3`: Set to True to use zarr3 format (recommended for McJAX); set to False for Pathways.
+- `checkpoint_storage_use_ocdbt`: Set to True to use OCDBT format (recommended for McJAX); set to False for Pathways.
+- `--lazy_load_tensors` (optional): If `true`, loads Hugging Face weights on demand to minimize RAM usage. When memory is constrained, it is recommended to use the `--lazy_load_tensors=true` flag to reduce memory usage during conversion. For example, converting a Llama3.1-70B model with `--lazy_load_tensors=true` uses around 200GB of RAM and completes in ~10 minutes.
+- `--hf_model_path` (optional): Specifies a local or remote directory containing the model weights. If unspecified, we use the [default Hugging Face repository ID](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/utils.py#L59-L91) (e.g., openai/gpt-oss-20b). This is necessary for locally dequantized models like GPT-OSS or DeepSeek.
 
-Above command will download the Hugging Face model to local machine, convert it to the MaxText format and save it to `${MODEL_CHECKPOINT_DIRECTORY}/0/items`.
+The above command downloads the Hugging Face model to the local machine if `hf_model_path` is unspecified, or reuses the checkpoint in `hf_model_path`. It then converts the checkpoint to the MaxText format and saves it to `${MODEL_CHECKPOINT_DIRECTORY}/0/items`.
 
 ## MaxText to Hugging Face
 
 Use the `to_huggingface.py` script to convert a MaxText checkpoint into the Hugging Face format. This is useful for sharing your models or integrating them with the Hugging Face ecosystem.
-\*\**For a complete example, see the test script at [`end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh).*
+\*\**For a complete example, see the test script at [`tests/end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh`](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/end_to_end/tpu/qwen3/4b/test_qwen3_to_hf.sh).*
 
 ### Usage
 
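Putting the pieces of this hunk together: the sketch below assembles the documented flags into one `to_maxtext` invocation. It is illustrative only; the model name, token, and output directory are placeholders, and the exact flag set depends on your model.

```bash
# A sketch combining the flags documented above; placeholders throughout.
export HF_TOKEN=<your-hugging-face-token>
export MODEL_CHECKPOINT_DIRECTORY=gs://<your-bucket>/qwen3-4b-maxtext

python3 -m maxtext.checkpoint_conversion.to_maxtext maxtext/configs/base.yml \
  model_name=qwen3-4b \
  hf_access_token=${HF_TOKEN} \
  base_output_directory=${MODEL_CHECKPOINT_DIRECTORY} \
  hardware=cpu \
  checkpoint_storage_use_zarr3=True \
  checkpoint_storage_use_ocdbt=True \
  --lazy_load_tensors=true
```

If it succeeds, the converted checkpoint lands in `${MODEL_CHECKPOINT_DIRECTORY}/0/items`.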
@@ -127,13 +127,13 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface src/maxtext/configs/base
 
 ## Verifying conversion correctness
 
-To ensure the conversion was successful, you can use the `tests/utils/forward_pass_logit_checker.py` script. It runs a forward pass on both the original and converted models and compares the output logits to verify conversion. It is used to verify the bidirectional conversion.
+To ensure the conversion was successful, you can use the [`tests/utils/forward_pass_logit_checker.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py) script. It runs a forward pass on both the original and converted models and compares the output logits; it verifies conversion in both directions.
 
 ### Usage
 
 ```bash
 python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
-  tokenizer_path=assets/<tokenizer> \
+  tokenizer_path=<tokenizer> \
   load_parameters_path=<path-to-maxtext-checkpoint> \
   model_name=<MODEL_NAME> \
   scan_layers=false \
@@ -151,8 +151,9 @@ python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
 - `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
 - `scan_layers`: Indicates if the output checkpoint is scanned (scan_layers=true) or unscanned (scan_layers=false).
 - `use_multimodal`: Indicates if multimodality is used.
-- `--run_hf_model`: Indicates if loading Hugging Face model from the hf_model_path. If not set, it will compare the maxtext logits with pre-saved golden logits.
-- `--hf_model_path`: The path to the Hugging Face checkpoint.
+- `--run_hf_model` (optional): Indicates whether to load the Hugging Face model from `hf_model_path`. If not set, the script compares the MaxText logits with pre-saved golden logits.
+- `--hf_model_path` (optional): The path to the Hugging Face checkpoint (used if `--run_hf_model=True`).
+- `--golden_logits_path` (optional): The path to the pre-saved golden logits (used if `--run_hf_model` is not set).
 - `--max_kl_div`: Max KL divergence tolerance during comparisons.
 
 **Example successful conversion verification:**
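For illustration, a verification run against the original Hugging Face model could look like the sketch below; the checkpoint path, Hugging Face repo ID, and tolerance are placeholders to adapt.

```bash
# Hypothetical verification run: compares MaxText logits against the original
# Hugging Face model instead of pre-saved golden logits.
python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
  tokenizer_path=<tokenizer> \
  load_parameters_path=gs://<your-bucket>/qwen3-4b-maxtext/0/items \
  model_name=qwen3-4b \
  scan_layers=false \
  --run_hf_model=True \
  --hf_model_path=Qwen/Qwen3-4B \
  --max_kl_div=0.015
```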
@@ -218,7 +219,7 @@ To extend conversion support to a new model architecture, you must define its sp
 
 2. **Add Hugging Face weights shape**: In [`utils/hf_shape.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/hf_shape.py), define the tensor shapes of the Hugging Face format (`def {MODEL}_HF_WEIGHTS_TO_SHAPE`). This is used to ensure the tensor shapes match after the to_huggingface conversion.
 3. **Register model key**: In [`utils/utils.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/utils.py), add the new model key in `HF_IDS`.
-4. **Add transformer config**: In [`utils/hf_model_configs.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/hf_model_configs.py), add the `transformers.Config` object, describing the Hugging Face model configuration (defined in ['src/maxtext/configs/models'](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/models)). **Note**: This configuration must precisely match the MaxText model's architecture.
+4. **Add transformer config**: In [`utils/hf_model_configs.py`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/checkpoint_conversion/utils/hf_model_configs.py), add the `transformers.Config` object describing the Hugging Face model configuration (defined in [`src/maxtext/configs/models`](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/models)). **Note**: This configuration must precisely match the MaxText model's architecture.
 
 Here is an example [PR to add support for gemma3 multi-modal model](https://github.com/AI-Hypercomputer/maxtext/pull/1983).
 
docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md

Lines changed: 4 additions & 4 deletions
@@ -35,15 +35,15 @@ The first section of the log details the configuration of your run. This is cruc
 
 MaxText builds its configuration in layers.
 
-- It starts with the **default configuration** from a YAML file. In our example, the file is [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/28e5097ac467ed8b1d17676d68aa5acc50f9d60d/src/maxtext/configs/base.yml).
+- It starts with the **default configuration** from a YAML file. In our example, the file is [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml).
 
 - Then, it overwrites any of these values with the arguments you provide in the **command line**.
 
 ```none
 Updating keys from env and command line: ['run_name', 'model_name', 'enable_checkpointing', 'base_output_directory', 'per_device_batch_size', 'dataset_type', 'steps', 'max_target_length']
 ```
 
-- It updates keys based on the **model-specific configuration** file. When you specify a model, like `deepseek2-16b`, MaxText reads the corresponding parameters from the [deepseek2-16b.yml](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/maxtext/configs/models/deepseek2-16b.yml) file.
+- It updates keys based on the **model-specific configuration** file. When you specify a model, like `deepseek2-16b`, MaxText reads the corresponding parameters from the [deepseek2-16b.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/models/deepseek2-16b.yml) file.
 
 ```none
 Running Model: deepseek2-16b
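To make the layering concrete, a run of the following shape would produce the log lines quoted in this hunk. This is a sketch: the entry-point module is an assumption to adjust to your checkout, and the override values are illustrative.

```bash
# base.yml supplies defaults, model_name=deepseek2-16b pulls in the
# model-specific YAML, and the remaining key=value pairs are the
# command-line overrides echoed in the "Updating keys" log line.
python3 -m MaxText.train src/maxtext/configs/base.yml \
  run_name=logging-demo \
  model_name=deepseek2-16b \
  enable_checkpointing=false \
  base_output_directory=gs://<your-bucket>/runs \
  per_device_batch_size=24 \
  dataset_type=synthetic \
  steps=10 \
  max_target_length=2048
```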
@@ -212,7 +212,7 @@ In this example, given `model=deepseek2-16b`, `per_device_batch_size=24`, `max_t
 - 94.54% of the TFLOPs are attributed to learnable weights and 5.46% are attributed to attention.
 - As you will see next, this number is important for calculating performance metrics, such as TFLOP/s/device and Model FLOPs Utilization (MFU).
 
-You can find more information about model FLOPs and MFU in the [Performance Metrics](performance-metrics) topic.
+You can find more information about model FLOPs and MFU in the [Performance Metrics](../../reference/performance_metrics.md#performance-metrics) topic.
 
 ## 4. Training metrics
 
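As a quick sanity check on the numbers above: MFU is the logged TFLOP/s/device divided by the accelerator's peak. The peak below assumes TPU v4 (275 bf16 TFLOP/s per chip); substitute your hardware's figure.

```bash
# MFU = achieved TFLOP/s/device / peak TFLOP/s/device (hardware-dependent).
python3 -c "print(f'MFU = {134.856 / 275:.1%}')"  # -> MFU = 49.0%
```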
@@ -283,4 +283,4 @@ $$\text{number of tokens per device} = \text{per device batch size} \times \text
 completed step: 8, seconds: 5.670, TFLOP/s/device: 134.856, Tokens/s/device: 8668.393, total_weights: 163259, loss: 9.596
 completed step: 9, seconds: 5.669, TFLOP/s/device: 134.884, Tokens/s/device: 8670.184, total_weights: 155934, loss: 9.580
 ```
-- For better convergence, we want to have large total weights. Towards this end, MaxText supports [packing](https://github.com/AI-Hypercomputer/maxtext/blob/28e5097ac467ed8b1d17676d68aa5acc50f9d60d/src/MaxText/sequence_packing.py#L37) multiple short sequences into one. This is enabled by default with `packing=True` in [base.yml](https://github.com/AI-Hypercomputer/maxtext/blob/28e5097ac467ed8b1d17676d68aa5acc50f9d60d/src/maxtext/configs/base.yml#L465).
+- For better convergence, we want to have large total weights. Towards this end, MaxText supports [packing](https://github.com/AI-Hypercomputer/maxtext/blob/28e5097ac467ed8b1d17676d68aa5acc50f9d60d/src/MaxText/sequence_packing.py#L37) multiple short sequences into one. This is enabled by default with `packing=True` in [base.yml](https://github.com/AI-Hypercomputer/maxtext/blob/ccd91f48454ed887c1ba2fe27d5c6214cff2817c/src/maxtext/configs/base.yml#L598).
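The logged throughput can be reproduced from the formula in this hunk's header. With `per_device_batch_size=24` and `max_target_length=2048`, as in this guide's example run:

```bash
# 24 * 2048 = 49152 tokens per device per step; at 5.670 s per step this is
# ~8669 Tokens/s/device, matching the logged 8668.393 for step 8.
python3 -c "print(24 * 2048 / 5.670)"
```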

docs/tutorials/pretraining.md

Lines changed: 4 additions & 4 deletions
@@ -18,7 +18,7 @@
 
 # Pre-training
 
-In this tutorial, we introduce how to run pretraining with real datasets. While synthetic data is commonly used for benchmarking, we rely on real datasets to obtain meaningful weights. Currently, MaxText supports three dataset input pipelines: HuggingFace, Grain, and TensorFlow Datasets (TFDS). We will walk you through: setting up dataset, modifying the [dataset configs](https://github.com/AI-Hypercomputer/maxtext/blob/08d9f20329ab55b9b928543fedd28ad173e1cd97/src/maxtext/configs/base.yml#L486-L514) and [tokenizer configs](https://github.com/AI-Hypercomputer/maxtext/blob/08d9f20329ab55b9b928543fedd28ad173e1cd97/src/maxtext/configs/base.yml#L452-L455) for training, and optionally enabling evaluation.
+In this tutorial, we introduce how to run pretraining with real datasets. While synthetic data is commonly used for benchmarking, we rely on real datasets to obtain meaningful weights. Currently, MaxText supports three dataset input pipelines: HuggingFace, Grain, and TensorFlow Datasets (TFDS). We will walk you through setting up the dataset, modifying the [dataset configs](https://github.com/AI-Hypercomputer/maxtext/blob/f11f5507c987fdb57272c090ebd2cbdbbadbd36c/src/maxtext/configs/base.yml#L631-L675) and [tokenizer configs](https://github.com/AI-Hypercomputer/maxtext/blob/f11f5507c987fdb57272c090ebd2cbdbbadbd36c/src/maxtext/configs/base.yml#L566) for training, and optionally enabling evaluation.
 
 To start with, we focus on HuggingFace datasets for convenience.
 
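As a concrete starting point for the HuggingFace pipeline described above, a run typically overrides a handful of dataset and tokenizer keys. The sketch below uses assumed key names (`dataset_type`, `hf_path`, `tokenizer_path`) and placeholder dataset/tokenizer IDs; verify them against base.yml in your checkout.

```bash
# Minimal sketch of a HuggingFace-pipeline pretraining run; the entry-point
# module and key names are assumptions to check against your checkout.
python3 -m MaxText.train src/maxtext/configs/base.yml \
  run_name=hf-pretrain-demo \
  dataset_type=hf \
  hf_path=allenai/c4 \
  tokenizer_path=google-t5/t5-base \
  steps=10
```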
@@ -57,7 +57,7 @@ completed step: 1, seconds: 0.287, TFLOP/s/device: 110.951, Tokens/s/device: 713
 completed step: 9, seconds: 1.010, TFLOP/s/device: 31.541, Tokens/s/device: 2027.424, total_weights: 7979, loss: 9.436
 ```
 
-The total weights is the number of real tokens processed in each step. More explanation can be found in [Understand Logs and Metrics](understand-logs-and-metrics) page.
+The total weights value is the number of real tokens processed in each step. More explanation can be found on the [Understand Logs and Metrics](../guides/monitoring_and_debugging/understand_logs_and_metrics.md#understand-logs-and-metrics) page.
 
 **Evaluation config (optional)**:
 
@@ -87,7 +87,7 @@ eval metrics after step: 9, loss=9.420, total_weights=75264.0
 
 Grain is a library for reading data for training and evaluating JAX models. It is the recommended input pipeline for determinism and resilience! It supports data formats like ArrayRecord and Parquet. You can check [Grain pipeline](../guides/data_input_pipeline/data_input_grain.md) for more details.
 
-**Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage Fuse with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/0baff00ac27bb7996c62057f235cc1d2f43d734e/setup_gcsfuse.sh#L18).
+**Data preparation**: You need to download data to a Cloud Storage bucket and read it via Cloud Storage FUSE with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/setup/setup_gcsfuse.sh).
 
 - For example, we can mount the bucket gs://maxtext-dataset on the local path /tmp/gcsfuse before training
 ```bash
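For reference, the relocated helper is typically invoked with the bucket and mount point as arguments; the argument names below are assumptions to verify against the script itself.

```bash
# Hypothetical invocation: mount gs://maxtext-dataset at /tmp/gcsfuse
# before training (argument names assumed, check the script).
bash tools/setup/setup_gcsfuse.sh DATASET_GCS_BUCKET=gs://maxtext-dataset MOUNT_PATH=/tmp/gcsfuse
```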
@@ -133,7 +133,7 @@ The TensorFlow Datasets (TFDS) pipeline uses dataset in the TFRecord format. You
 
 **Data preparation**: You need to download data to a [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets), and the pipeline streams data from the bucket.
 
-- To download the AllenAI C4 dataset to your bucket, you can use [download_dataset.sh](https://github.com/AI-Hypercomputer/maxtext/blob/08d9f20329ab55b9b928543fedd28ad173e1cd97/download_dataset.sh#L19): `bash download_dataset.sh <GCS_PROJECT> <GCS_BUCKET_FOR_DATASET>`
+- To download the AllenAI C4 dataset to your bucket, you can use [download_dataset.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/data_generation/download_dataset.sh): `bash download_dataset.sh <GCS_PROJECT> <GCS_BUCKET_FOR_DATASET>`
 
 This **command** shows pretraining with TFDS pipeline, along with evaluation:
 
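To round out the TFDS section, a pipeline run after the C4 download might look like the sketch below; the key names (`dataset_type`, `dataset_path`, `dataset_name`, `eval_interval`) and values are assumptions to verify against base.yml.

```bash
# Minimal sketch of a TFDS-pipeline run with evaluation enabled; entry-point
# module, key names, and values are assumptions to check against your checkout.
python3 -m MaxText.train src/maxtext/configs/base.yml \
  run_name=tfds-pretrain-demo \
  dataset_type=tfds \
  dataset_path=gs://<GCS_BUCKET_FOR_DATASET> \
  dataset_name=c4/en:3.0.1 \
  eval_interval=5 \
  steps=10
```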
Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 # Checkpoint conversion utilities
 
-This guide provides instructions for using the scripts that convert model checkpoints bidirectionally between Hugging Face and MaxText formats. For more information, please see the [convert_checkpoint](../../../../docs/guides/checkpointing_solutions/convert_checkpoint.md) document.
+This guide provides instructions for using the scripts that convert model checkpoints bidirectionally between Hugging Face and MaxText formats. For more information, please see the [convert_checkpoint](../../../docs/guides/checkpointing_solutions/convert_checkpoint.md) document.
