
Commit d5ea751

Merge pull request #2814 from AI-Hypercomputer:hengtaoguo-links
PiperOrigin-RevId: 844535808
2 parents 377bf56 + 94a232e commit d5ea751

9 files changed: 16 additions & 17 deletions

docs/guides/optimization/benchmark_and_performance.md

Lines changed: 2 additions & 3 deletions
@@ -22,7 +22,7 @@ Arithmetic intensity is calculated as the ratio of floating-point operations (FL
 
 This metric helps determine whether a computation is MXU-bound (high arithmetic intensity) or memory-bound/communication-bound (low arithmetic intensity).
 
-[This sharding document](sharding_on_TPUs) illustrates various sharding strategies and their roofline analysis, through AI analysis.
+[This sharding document](sharding.md) illustrates various sharding strategies and their roofline analysis, through AI analysis.
 
 ## Metrics for benchmark analysis
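The arithmetic-intensity definition quoted in the hunk above (FLOPs divided by bytes moved) can be made concrete with a short sketch. This is illustrative only, not part of this commit or of MaxText's code:

```python
# Illustrative sketch: arithmetic intensity of a bf16 matmul,
# i.e. floating-point operations per byte moved to and from memory.
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    flops = 2 * m * k * n  # one multiply and one add per (m, k, n) triple
    # read A (m*k), read B (k*n), write C (m*n)
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large square matmuls have high intensity and tend to be MXU-bound;
# skinny ones have low intensity and tend to be memory-bound.
print(matmul_arithmetic_intensity(4096, 4096, 4096))
print(matmul_arithmetic_intensity(8, 4096, 4096))
```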

@@ -74,8 +74,7 @@ See [](quantization).
 ### Choose sharding strategy
 
 Sharding is crucial for optimizing model performance. MaxText offers various sharding strategies and hybrid options, including FSDP, TP, EP, CP, and PP, which can be configured through your MaxText settings.
-[This document](sharding_on_TPUs) illustrates in detail how sharding works in maxtext and chooses the right sharding config for your workload.
-[This document](sharding_on_TPUs) illustrates in detail how sharding works in maxtext and chooses the write sharding config for your workload.
+[This document](sharding.md) illustrates in detail how sharding works in maxtext and chooses the right sharding config for your workload.
 
 ### Performance tuning on custom Pallas call
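As an aside on the sharding hunk above: the hybrid strategies it names (FSDP, TP, EP, CP, PP) are typically combined through the ICI parallelism settings in a MaxText config. A hedged sketch follows; the flag names are assumed from base.yml and the values are illustrative, not part of this commit:

```yaml
# Hypothetical sketch of a hybrid FSDP x TP layout on 8 chips in one slice.
ici_fsdp_parallelism: 4      # FSDP across 4 chips
ici_tensor_parallelism: 2    # TP across 2 chips
ici_pipeline_parallelism: 1  # PP disabled
```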

docs/reference/architecture/architecture_overview.md

Lines changed: 2 additions & 2 deletions
@@ -91,7 +91,7 @@ The modularity of this design is clearly demonstrated by third-party extensions.
 
 ### Data ingestion (`input_pipeline.py`)
 
-[The data ingestion pipeline](data-input-pipeline) is a critical component for performance at scale. In MaxText, the main training loop interfaces with the data pipeline through the create\_data\_iterator function, which is called from train.py. This function acts as a facade, abstracting the specific data loading implementation from the rest of the training logic.
+[The data ingestion pipeline](../../guides/data_input_pipeline.md) is a critical component for performance at scale. In MaxText, the main training loop interfaces with the data pipeline through the create\_data\_iterator function, which is called from train.py. This function acts as a facade, abstracting the specific data loading implementation from the rest of the training logic.
 
 MaxText supports three primary data loading backends:
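The "facade" dispatch described in the hunk above can be sketched as follows. Only the names create\_data\_iterator and dataset\_type come from the text; everything else is illustrative and is not MaxText's actual implementation:

```python
# Hedged sketch of a facade over several data-loading backends: the caller
# gets an iterator without knowing which backend produced it.
def create_data_iterator(config):
    backends = {
        "tfds": lambda c: iter(["tfds-batch"]),
        "grain": lambda c: iter(["grain-batch"]),
        "hf": lambda c: iter(["hf-batch"]),
    }
    backend = backends.get(config["dataset_type"])
    if backend is None:
        raise ValueError(f"unknown dataset_type: {config['dataset_type']}")
    return backend(config)

print(next(create_data_iterator({"dataset_type": "tfds"})))
```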

@@ -153,7 +153,7 @@ This logical mesh abstraction enables the implementation of the standard paralle
 
 In MaxText, these strategies are implemented by annotating the model's PyTrees (the nested Python structures of arrays that hold the parameters and state) with sharding specifications. This is done using Flax's partitioning utilities, such as nn\_partitioning. These annotations provide requirements and hints to the compiler, telling it how each tensor should be distributed across the axes of the device mesh. The compiler then generates the appropriate collective communication operations (e.g., all-reduce, all-gather) needed to execute the parallel computation correctly and efficiently.
 
-For more information on sharding see [our sharding documentation](sharding_on_TPUs).
+For more information on sharding see [our sharding documentation](../../guides/optimization/sharding.md).
 
 ### Hardware abstraction and performance via XLA
docs/reference/models/supported_models_and_architectures.md

Lines changed: 4 additions & 4 deletions
@@ -70,7 +70,7 @@ MaxText supports a wide range of parallelism strategies for scaling training and
 
 The following summarizes observed runtime efficiency and scaling behaviors of MaxText across different hardware and model types, based on published benchmarks and large-scale runs.
 
-* **High MFU**: MaxText targets high Model FLOPs Utilization across scales; exact numbers vary by model, hardware and config. See [**Performance Metrics → MFU**](performance-metrics) for the definition and how we calculate it.
+* **High MFU**: MaxText targets high Model FLOPs Utilization across scales; exact numbers vary by model, hardware and config. See [**Performance Metrics → MFU**](../performance_metrics.md#performance-metrics) for the definition and how we calculate it.
 * **Quantization**: MaxText supports quantization via both the AQT and Qwix libraries. Qwix is the recommended approach, providing a non-intrusive way to apply various quantization techniques, including Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
 * **MoE**: The Mixture-of-Experts implementation features dropless routing with Megablox and `jax.lax.ragged_dot` kernels for enhanced performance.
 * **Multi-Token Prediction (MTP)**: This feature improves training efficiency on DeepSeek-style models by adding an auxiliary loss based on predicting multiple future tokens.
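The MFU metric referenced in the first bullet (achieved model FLOP/s as a fraction of the hardware's peak FLOP/s) can be sketched as follows. This is an illustrative definition, not MaxText's actual metric code, and the example numbers are made up:

```python
# Hedged sketch: Model FLOPs Utilization = achieved FLOP/s / peak FLOP/s.
def mfu(model_flops_per_step, step_time_s, peak_flops_per_s):
    achieved = model_flops_per_step / step_time_s  # FLOP/s actually delivered
    return achieved / peak_flops_per_s

# e.g. 2.2e15 model FLOPs per step, 4 s per step, on hardware peaking at 1e15 FLOP/s
print(mfu(2.2e15, 4.0, 1e15))  # 0.55
```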
@@ -91,6 +91,6 @@ The following summarizes observed runtime efficiency and scaling behaviors of Ma
 
 
 * **Technical Explanations:**
-  * [Parallelism & Sharding](sharding_on_TPUs)
-  * [Quantization Documentation](quantization)
-  * [AOT Compilation Instructions](aot-compilation)
+  * [Parallelism & Sharding](../../guides/optimization/sharding.md)
+  * [Quantization Documentation](../core_concepts/quantization.md)
+  * [AOT Compilation Instructions](../../guides/monitoring_and_debugging/features_and_diagnostics.md#ahead-of-time-compilation-aot)

docs/tutorials/first_run.md

Lines changed: 2 additions & 2 deletions
@@ -48,7 +48,7 @@ python3 -m MaxText.train src/MaxText/configs/base.yml \
   dataset_type=synthetic \
   steps=10
 ```
-Optional: If you want to try training on a Hugging Face dataset, see [Data Input Pipeline](data-input-pipeline) for data input options.
+Optional: If you want to try training on a Hugging Face dataset, see [Data Input Pipeline](../guides/data_input_pipeline.md) for data input options.
 
 5. To demonstrate model output, run the following command:
 ```sh
@@ -92,7 +92,7 @@ Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.a
 
 ## Multihost development
 
-Google Kubernetes Engine (GKE) is the recommended way to run MaxText on multiple hosts. It provides a managed environment for deploying and scaling containerized applications, including those that require TPUs or GPUs. See [Running Maxtext with XPK](run-xpk) for details.
+Google Kubernetes Engine (GKE) is the recommended way to run MaxText on multiple hosts. It provides a managed environment for deploying and scaling containerized applications, including those that require TPUs or GPUs. See [Running Maxtext with XPK](../run_maxtext/run_maxtext_via_xpk.md) for details.
 
 ## Next steps: preflight optimizations

docs/tutorials/posttraining/full_finetuning.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ These scripts can provide a reference point for various scripts.
 
 ### MaxText checkpoint to Hugging Face
 
-Post finetuning or pre-training, MaxText also provides scripts to convert MaxText format weights back to [Hugging Face](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/llama_mistral_mixtral_orbax_to_hf.py).
+Post finetuning or pre-training, MaxText also provides scripts to convert MaxText format weights back to [Hugging Face](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_scripts/llama_mistral_mixtral_orbax_to_hf.py).
 
 #### Dataset

end_to_end/tpu/gemma/Run_Gemma.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 
 Following the instructions at [kaggle](https://www.kaggle.com/models/google/gemma/frameworks/maxText) will let you download Gemma model weights. You will have to consent to license for Gemma using your kaggle account's [API credentials](https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials).
 
-After downloading the weights run [convert_gemma_chkpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/convert_gemma_chkpt.py), which converts the checkpoint to be compatible with MaxText and uploads them to a GCS bucket. You can run decode and finetuning using instructions mentioned in the test scripts at [end_to_end/tpu/gemma](https://github.com/AI-Hypercomputer/maxtext/tree/main/end_to_end/tpu/gemma).
+After downloading the weights run [convert_gemma_chkpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_scripts/convert_gemma_chkpt.py), which converts the checkpoint to be compatible with MaxText and uploads them to a GCS bucket. You can run decode and finetuning using instructions mentioned in the test scripts at [end_to_end/tpu/gemma](https://github.com/AI-Hypercomputer/maxtext/tree/main/end_to_end/tpu/gemma).
 
 ## MaxText supports pretraining and finetuning with high performance

end_to_end/tpu/gemma3/Run_Gemma3.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ python3 -m MaxText.train src/MaxText/configs/base.yml model_name=gemma3-4b base_
 ```
 
 ## Checkpoint Conversion
-To obtain the Gemma3 model weights, follow the instructions provided on [Kaggle](https://www.kaggle.com/models/google/gemma-3/flax/). You will need to accept the Gemma3 license through your Kaggle account and utilize your Kaggle [API credentials](https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials) for authentication. Once the weights are downloaded to your GCS bucket, use the [convert_gemma3_chkpt.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/convert_gemma3_chkpt.py) script to transform the checkpoint into a format compatible with MaxText. This script will also upload the converted checkpoints to a Google Cloud Storage (GCS) bucket.
+To obtain the Gemma3 model weights, follow the instructions provided on [Kaggle](https://www.kaggle.com/models/google/gemma-3/flax/). You will need to accept the Gemma3 license through your Kaggle account and utilize your Kaggle [API credentials](https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials) for authentication. Once the weights are downloaded to your GCS bucket, use the [checkpoint conversion utils](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/MaxText/utils/ckpt_conversion#usage) to transform the checkpoint into a format compatible with MaxText. This script will also upload the converted checkpoints to a Google Cloud Storage (GCS) bucket.
 
 ## Fine-tuning
 After the conversion, you will have a MaxText compatible checkpoint which allows you to fine-tune it with different datasets. One example command to fine-tune a Gemma3-4B model is as follows:

src/MaxText/checkpointing.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ def save(
       self,
       directory: epath.Path,
       # `item` is for backwards compatibility with older Orbax API, see
-      # https://orbax.readthedocs.io/en/latest/api_refactor.html.
+      # https://orbax.readthedocs.io/en/latest/guides/checkpoint/api_refactor.html.
       item: Optional[Any] = None,
       args: Any = None,
   ):

src/MaxText/configs/base.yml

Lines changed: 2 additions & 2 deletions
@@ -537,7 +537,7 @@ per_device_batch_size: 12.0
 # When expansion_factor_real_data is set to > 1, total_hosts//expansion_factor_real_data will load data.
 # Each data-loading host will load per_device_batch_size * expansion_factor_real_data.
 # When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job.
-# Details in https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/data_input_grain.md#using-grain
+# Details in https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/data_input_pipeline/data_input_grain.md#using-grain
 expansion_factor_real_data: -1.0
 eval_per_device_batch_size: 0.0
 max_corpus_chars: 10_000_000
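The expansion\_factor\_real\_data arithmetic described in the comments above can be worked through with a small sketch. This only restates the quoted comments for the factor > 1 case; it is not MaxText's implementation:

```python
# Worked example of the expansion_factor_real_data comments above, for factor > 1:
# total_hosts // factor hosts load data, each at per_device_batch_size * factor.
def data_loading_plan(total_hosts, per_device_batch_size, expansion_factor):
    num_loading_hosts = total_hosts // expansion_factor
    per_host_batch = per_device_batch_size * expansion_factor
    return num_loading_hosts, per_host_batch

# With 8 hosts, per_device_batch_size=12.0 and a factor of 2,
# 4 hosts load data at an effective batch of 24.0 each.
print(data_loading_plan(8, 12.0, 2))  # (4, 24.0)
```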
@@ -578,7 +578,7 @@ use_sft: False
 sft_train_on_completion_only: False
 
 # dataset_type must be synthetic, hf, grain, tfds
-# details in: https://github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md
+# details in: https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/data_input_pipeline.md
 dataset_type: tfds
 # for TFDS input pipeline (dataset_type=tfds)
 dataset_path: "" # your path given as argument in download_dataset.sh, e.g. "gs://my-maxtext-dataset/"
