Commit 8bca12e

Adding cards/introductions to index pages and replacing ToC trees and resolving cl/843765069
1 parent cb462f5 commit 8bca12e

12 files changed

Lines changed: 388 additions & 10 deletions

docs/guides.md

Lines changed: 42 additions & 0 deletions
@@ -16,7 +16,49 @@
 
 # How-to guides
 
+Explore our how-to guides for optimizing, debugging, and managing your MaxText workloads.
+
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} ⚡ Optimization
+:link: guides/optimization
+:link-type: doc
+
+Techniques for maximizing performance, including sharding strategies, Pallas kernels, and benchmarking.
+:::
+
+:::{grid-item-card} 💾 Data Pipelines
+:link: guides/data_input_pipeline
+:link-type: doc
+
+Configure input pipelines using **Grain** (recommended for determinism), **HuggingFace**, or **TFDS**.
+:::
+
+:::{grid-item-card} 🔄 Checkpointing
+:link: guides/checkpointing_solutions
+:link-type: doc
+
+Manage GCS checkpoints, handle preemption with emergency checkpointing, and configure multi-tier storage.
+:::
+
+:::{grid-item-card} 🔍 Monitoring & Debugging
+:link: guides/monitoring_and_debugging
+:link-type: doc
+
+Tools for observability: goodput monitoring, hung job debugging, and Vertex AI TensorBoard integration.
+:::
+
+:::{grid-item-card} 🐍 Python Notebooks
+:link: guides/run_python_notebook
+:link-type: doc
+
+Interactive development guides for running MaxText on Google Colab or local JupyterLab environments.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
 
 guides/optimization.md

docs/guides/checkpointing_solutions.md

Lines changed: 26 additions & 0 deletions
@@ -1,7 +1,33 @@
 (checkpointing_solutions)=
 # Checkpointing
 
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 💾 GCS Checkpointing
+:link: checkpointing_solutions/gcs_checkpointing
+:link-type: doc
+
+Standard checkpointing to Google Cloud Storage.
+:::
+
+:::{grid-item-card} 🚑 Emergency Checkpointing
+:link: checkpointing_solutions/emergency_checkpointing
+:link-type: doc
+
+Handle preemption and recover training progress.
+:::
+
+:::{grid-item-card} 🗄️ Multi-tier checkpointing
+:link: checkpointing_solutions/multi_tier_checkpointing
+:link-type: doc
+
+Optimize storage costs and performance with multi-tier usage.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
 
 checkpointing_solutions/gcs_checkpointing.md

docs/guides/data_input_pipeline.md

Lines changed: 34 additions & 5 deletions
@@ -19,11 +19,40 @@
 
 Currently MaxText has three data input pipelines:
 
-| Pipeline | Dataset formats | Features | Limitations |
-| -------- | --------------- | -------- | ----------- |
-| **[Grain](data_input_pipeline/data_input_grain.md)** (recommended)| [ArrayRecord](https://github.com/google/array_record) (random access, available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview), or [conversion](https://github.com/google/array_record/tree/main/beam))<br>[Parquet](https://arrow.apache.org/docs/python/parquet.html) (sequential access) | With arrayrecord: fully deterministic, resilient to preemption; global shuffle <br>With parquet: performant; fully deterministic, resilient to preemption; hierarchical shuffle | |
-| **[Hugging Face](data_input_pipeline/data_input_hf.md)** | datasets in [Hugging Face Hub](https://huggingface.co/datasets)<br>local/Cloud Storage datasets in json, parquet, arrow, csv, txt (sequential access) | no download needed, convenience; <br>multiple formats | limit scalability using the Hugging Face Hub (no limit using Cloud Storage); <br>non-deterministic with preemption<br>(deterministic without preemption)<br> |
-| **[TFDS](data_input_pipeline/data_input_hf.md)** | TFRecord (sequential access), available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) | performant | only supports TFRecords; <br>non-deterministic with preemption<br>(deterministic without preemption) |
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 🌾 Grain (Recommended)
+:link: data_input_pipeline/data_input_grain
+:link-type: doc
+
+**Features**: With arrayrecord: fully deterministic, resilient to preemption; global shuffle. With parquet: performant; fully deterministic, resilient to preemption; hierarchical shuffle.
+
+**Formats**: ArrayRecord, Parquet.
+:::
+
+:::{grid-item-card} 🤗 Hugging Face
+:link: data_input_pipeline/data_input_hf
+:link-type: doc
+
+**Formats**: Hugging Face Hub, local/GCS (json, parquet, arrow, csv, txt).
+:::
+
+:::{grid-item-card} 💾 TFDS
+:link: data_input_pipeline/data_input_hf
+:link-type: doc
+
+**Formats**: TFRecord (via Tensorflow Datasets).
+:::
+
+:::{grid-item-card} ⚡ Optimizing Performance
+:link: data_input_pipeline/data_pipeline_perf
+:link-type: doc
+
+Guide to setting and verifying performance goals to maximize accelerator utilization.
+:::
+
+::::
 
 ## Multihost dataloading best practice
 Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:

docs/guides/monitoring_and_debugging.md

Lines changed: 55 additions & 0 deletions
@@ -17,8 +17,63 @@
 
 # Monitoring and debugging
 
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 🕵️ Features & Diagnostics
+:link: monitoring_and_debugging/features_and_diagnostics
+:link-type: doc
+
+Diagnostic tools and features for monitoring MaxText.
+:::
+
+:::{grid-item-card} ☁️ GCP Observability
+:link: monitoring_and_debugging/gcp_workload_observability
+:link-type: doc
+
+Observability for workloads running on Google Cloud Platform.
+:::
+
+:::{grid-item-card} 🚫 Hang Playbook
+:link: monitoring_and_debugging/megascale_hang_playbook
+:link-type: doc
+
+Troubleshooting guide for training hangs at megascale.
+:::
+
+:::{grid-item-card} 📈 Goodput
+:link: monitoring_and_debugging/monitor_goodput
+:link-type: doc
+
+Monitoring efficient training time (Goodput).
+:::
+
+:::{grid-item-card} 📊 Logs & Metrics
+:link: monitoring_and_debugging/understand_logs_and_metrics
+:link-type: doc
+
+Understanding MaxText logs and performance metrics.
+:::
+
+:::{grid-item-card} 📉 TensorBoard
+:link: monitoring_and_debugging/use_vertex_ai_tensorboard
+:link-type: doc
+
+Using Vertex AI TensorBoard for visualization.
+:::
+
+:::{grid-item-card} ⏱️ XProf
+:link: monitoring_and_debugging/xprof_user_guide
+:link-type: doc
+
+Profiling performance with XProf.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
+
 monitoring_and_debugging/features_and_diagnostics.md
 monitoring_and_debugging/gcp_workload_observability.md
 monitoring_and_debugging/megascale_hang_playbook.md

docs/guides/optimization.md

Lines changed: 36 additions & 0 deletions
@@ -16,8 +16,44 @@
 
 # Optimization
 
+Explore techniques for maximizing performance, including model customization, sharding strategies, Pallas kernels, and benchmarking.
+
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 🛠️ Customizing Model Configs
+:link: optimization/custom_model
+:link-type: doc
+
+Optimize and customize your LLM model configurations for higher performance (MFU) on TPUs.
+:::
+
+:::{grid-item-card} 🥞 Sharding Strategies
+:link: optimization/sharding
+:link-type: doc
+
+Choose efficient sharding strategies (FSDP, TP, EP, PP) using Roofline Analysis and understand arithmetic intensity.
+:::
+
+:::{grid-item-card} ⚡ Pallas Kernels
+:link: optimization/pallas_kernels_performance
+:link-type: doc
+
+Optimize with Pallas kernels for fine-grained control.
+:::
+
+:::{grid-item-card} 📈 Benchmarking & Tuning
+:link: optimization/benchmark_and_performance
+:link-type: doc
+
+Guide to setting up benchmarks, performing performance tuning, and analyzing metrics.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
+
 optimization/custom_model.md
 optimization/sharding.md
 optimization/pallas_kernels_performance.md

docs/guides/run_python_notebook.md

Lines changed: 5 additions & 4 deletions
@@ -6,8 +6,7 @@ This guide provides clear, step-by-step instructions for getting started with py
 
 - [Prerequisites](#prerequisites)
 - [Method 1: Google Colab with TPU](#method-1-google-colab-with-tpu)
-- [Method 2: Local Jupyter Lab with TPU](#method-2-local-jupyter-lab-with-tpu)
-- [Available Examples](#available-examples)
+- [Method 2: Local Jupyter Lab with TPU (Recommended)](#method-2-local-jupyter-lab-with-tpu-recommended)
 - [Common Pitfalls & Debugging](#common-pitfalls--debugging)
 - [Support & Resources](#support-and-resources)
 - [Contributing](#contributing)
@@ -32,6 +31,8 @@ This is the fastest way to run MaxText python notebooks without managing infrast
 **⚠️ IMPORTANT NOTE ⚠️**
 The free tier of Google Colab provides access to `v5e-1 TPU`, but this access is not guaranteed and is subject to availability and usage limits.
 
+Currently, this method only supports the **`sft_qwen3_demo.ipynb`** notebook, which demonstrates Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k). If you want to run other notebooks, please use the local Jupyter Lab setup method.
+
 Before proceeding, please verify that the specific notebook you are running works reliably on the free-tier TPU resources. If you encounter frequent disconnections or resource limitations, you may need to:
 
 * Upgrade to a Colab Pro or Pro+ subscription for more stable and powerful TPU access.
@@ -63,9 +64,9 @@ Before proceeding, please verify that the specific notebook you are running work
 ### Step 4: Run the Notebook
 Follow the instructions within the notebook cells to install dependencies and run the training/inference.
 
-## Method 2: Local Jupyter Lab with TPU
+## Method 2: Local Jupyter Lab with TPU (Recommended)
 
-You can run Python notebooks on a local JupyterLab environment, giving you full control over your computing resources.
+You can run all of our Python notebooks on a local JupyterLab environment, giving you full control over your computing resources.
 
 ### Step 1: Set Up TPU VM
 

docs/reference.md

Lines changed: 35 additions & 0 deletions
@@ -16,7 +16,42 @@
 
 # Reference documentation
 
+Deep dive into MaxText architecture, models, and core concepts.
+
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 📊 Performance Metrics
+:link: reference/performance_metrics
+:link-type: doc
+
+Understanding Model Flops Utilization (MFU), calculation methods, and why it matters for performance optimization.
+:::
+
+:::{grid-item-card} 🤖 Models
+:link: reference/models
+:link-type: doc
+
+Supported models and architectures, including Llama, Qwen, and Mixtral. Details on tiering and new additions.
+:::
+
+:::{grid-item-card} 🏗️ Architecture
+:link: reference/architecture
+:link-type: doc
+
+High-level overview of MaxText design, JAX/XLA choices, and how components interact.
+:::
+
+:::{grid-item-card} 💡 Core Concepts
+:link: reference/core_concepts
+:link-type: doc
+
+Key concepts including checkpointing strategies, quantization, tiling, and Mixture of Experts (MoE) configuration.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
 
 reference/performance_metrics.md

docs/reference/architecture.md

Lines changed: 19 additions & 0 deletions
@@ -1,6 +1,25 @@
 # Architecture
 
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 🗺️ Overview
+:link: architecture/architecture_overview
+:link-type: doc
+
+High-level overview of MaxText design and components.
+:::
+
+:::{grid-item-card} 📚 JAX/AI Libraries
+:link: architecture/jax_ai_libraries_chosen
+:link-type: doc
+
+Deep dive into the JAX and AI libraries used in MaxText.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
 
 architecture/architecture_overview.md

docs/reference/core_concepts.md

Lines changed: 47 additions & 0 deletions
@@ -16,7 +16,54 @@
 
 # Core concepts
 
+::::{grid} 1 2 2 2
+:gutter: 2
+
+:::{grid-item-card} 💾 Checkpoints
+:link: core_concepts/checkpoints
+:link-type: doc
+
+Understanding checkpoint formats and strategies.
+:::
+
+:::{grid-item-card} ⚖️ Alternatives
+:link: core_concepts/alternatives
+:link-type: doc
+
+Comparison with other frameworks like Megatron-LM.
+:::
+
+:::{grid-item-card} 📉 Quantization
+:link: core_concepts/quantization
+:link-type: doc
+
+Techniques for reducing model size and improving performance.
+:::
+
+:::{grid-item-card} 🧱 Tiling
+:link: core_concepts/tiling
+:link-type: doc
+
+Understanding tiling strategies for partitioning logic.
+:::
+
+:::{grid-item-card} ⚡ JAX/XLA/Pallas
+:link: core_concepts/jax_xla_and_pallas
+:link-type: doc
+
+How MaxText leverages JAX, XLA, and Pallas for efficiency.
+:::
+
+:::{grid-item-card} 🧠 MoE Configuration
+:link: core_concepts/moe_configuration
+:link-type: doc
+
+Configuring Mixture of Experts (MoE) models.
+:::
+::::
+
 ```{toctree}
+:hidden:
 :maxdepth: 1
 
 core_concepts/checkpoints.md
