Currently MaxText has three data input pipelines:

|
::::{grid} 1 2 2 2
:gutter: 2

:::{grid-item-card} 🌾 Grain (Recommended)
:link: data_input_pipeline/data_input_grain
:link-type: doc

**Features**: With ArrayRecord: fully deterministic, resilient to preemption, and global shuffle. With Parquet: performant, fully deterministic, resilient to preemption, and hierarchical shuffle.

**Formats**: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (sequential access).
:::

:::{grid-item-card} 🤗 Hugging Face
:link: data_input_pipeline/data_input_hf
:link-type: doc

**Features**: No download needed; supports multiple formats.

**Formats**: datasets on the [Hugging Face Hub](https://huggingface.co/datasets); local/Cloud Storage datasets in json, parquet, arrow, csv, and txt (sequential access).

**Limitations**: limited scalability when using the Hugging Face Hub (no limit with Cloud Storage); non-deterministic after preemption (deterministic without preemption).
:::

:::{grid-item-card} 💾 TFDS
:link: data_input_pipeline/data_input_tfds
:link-type: doc

**Features**: performant.

**Formats**: TFRecord (sequential access), available through [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview).

**Limitations**: only supports TFRecord; non-deterministic after preemption (deterministic without preemption).
:::

:::{grid-item-card} ⚡ Optimizing Performance
:link: data_input_pipeline/data_pipeline_perf
:link-type: doc

A guide to setting and verifying performance goals that maximize accelerator utilization.
:::

::::
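The pipeline is selected through the training config. As a rough sketch, assuming the `dataset_type` and `grain_*` keys from MaxText's `configs/base.yml` (key names and accepted values may differ across MaxText versions, and the bucket path is a placeholder), a Grain/ArrayRecord run might be launched like this:

```shell
# Hypothetical invocation: the data path is a placeholder, and the
# grain_* key names should be checked against your checkout's base.yml.
python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=grain-demo \
  dataset_type=grain \
  grain_file_type=arrayrecord \
  grain_train_files='gs://my-bucket/data/*.array_record' \
  grain_worker_count=2
```

Switching to another pipeline is a matter of changing `dataset_type` (and the matching per-pipeline keys) rather than changing any training code.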
|
## Multihost dataloading best practice

Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:
|