**docs/tutorials/posttraining/sft.md** (+7, -1)
# SFT on single-host TPUs
Supervised fine-tuning (SFT) is a process where a pre-trained large language model is fine-tuned on a labeled dataset to adapt the model to perform better on specific tasks.
This tutorial provides step-by-step instructions for setting up the environment and then training the model on a Hugging Face dataset using SFT.
This section explains how to prepare your model checkpoint for use with MaxText. You have two options: using an existing MaxText checkpoint or converting a Hugging Face checkpoint.
### Option 1: Using an existing MaxText checkpoint
If you already have a MaxText-compatible model checkpoint, simply set the following environment variable and move on to the next section.
```sh
export PRE_TRAINED_MODEL_CKPT_PATH=<gcs path for MaxText checkpoint> # e.g., gs://my-bucket/my-model-checkpoint/0/items
```
### Option 2: Converting a Hugging Face checkpoint
If your model checkpoint is from Hugging Face, you need to run a conversion script to make it MaxText-compatible.
1. **Set the Output Path:** First, define where the converted MaxText checkpoint will be saved. For example:
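   A minimal sketch follows; the variable name and bucket path are placeholders, not names fixed by this tutorial:

   ```sh
   # Placeholder: GCS location where the converted MaxText checkpoint will be written.
   export CONVERTED_CKPT_PATH=gs://my-bucket/converted-maxtext-checkpoint
   ```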
**docs/tutorials/posttraining/sft_on_multi_host.md** (+27, -5)
# SFT on multi-host TPUs
Supervised fine-tuning (SFT) is a process where a pre-trained large language model is fine-tuned on a labeled dataset to adapt the model to perform better on specific tasks.
This tutorial provides step-by-step instructions for setting up the multi-host TPU environment and then training the model on a Hugging Face dataset using SFT. In this tutorial, we use a multi-host TPU such as `v6e-256`.
We use [Tunix](https://github.com/google/tunix), a JAX-based library designed for post-training.
Let's get started!
## 1. Build and upload MaxText Docker image
This section guides you through cloning the MaxText repository, building the MaxText Docker image with its dependencies, and uploading the Docker image to your project's Artifact Registry.
### 1.1. Clone the MaxText repository
```bash
git clone https://github.com/google/maxtext.git
cd maxtext
```
### 1.2. Build MaxText Docker image
Before building the Docker image, authenticate to [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) so that you have permission to push your images and any other required access.
```bash
# Authenticate your user account for gcloud CLI access
gcloud auth login
```
### 1.3. Upload the Docker image to Artifact Registry
> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access".
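If you do hold sufficient IAM permissions yourself, the same grant can also be made from the command line; a sketch with placeholder project and account values:

```bash
# Grant the Artifact Registry Writer role (placeholder project ID and user email).
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="user:you@example.com" \
  --role="roles/artifactregistry.writer"
```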
The `docker_upload_runner.sh` script uploads your Docker image to Artifact Registry.
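Its exact arguments may vary between MaxText versions; a typical invocation looks like the following, where `CLOUD_IMAGE_NAME` is assumed to name the image to push (check the script for the arguments it actually expects):

```bash
# Sketch: upload the locally built runner image under a name of your choice.
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
```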
## 2. Install XPK
Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md).
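As a sketch, assuming XPK is distributed on PyPI (the linked documentation is authoritative on prerequisites and the supported installation path):

```bash
# Assumption: XPK is installable from PyPI; defer to the official docs if this differs.
pip install xpk
```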
## 3. Create GKE cluster
Use a Pathways-ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster).
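The linked guide covers cluster creation in detail. As a rough sketch, a Pathways-ready cluster can be created with XPK roughly as follows; the cluster name, project, zone, and flags are placeholders and may differ across XPK versions:

```bash
# Sketch only: create a Pathways-ready GKE cluster with XPK (all values are placeholders).
xpk cluster create-pathways \
  --cluster=my-sft-cluster \
  --tpu-type=v6e-256 \
  --num-slices=1 \
  --project=my-gcp-project \
  --zone=us-east5-b
```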
This section explains how to prepare your model checkpoint for use with MaxText. You have two options: using an existing MaxText checkpoint or converting a Hugging Face checkpoint.
### Option 1: Using an existing MaxText checkpoint
If you already have a MaxText-compatible model checkpoint, simply set the following environment variable and move on to the next section.
```bash
export MODEL_CHECKPOINT_PATH=<gcs path for MaxText checkpoint> # e.g., gs://my-bucket/my-model-checkpoint/0/items
```
**Note:** Make sure that the checkpoints at `MODEL_CHECKPOINT_PATH` were created with the correct storage flags:
- **For SFT with McJAX:** `checkpoint_storage_use_zarr3=True` and `checkpoint_storage_use_ocdbt=True`.
- **For SFT with Pathways:** `checkpoint_storage_use_zarr3=False` and `checkpoint_storage_use_ocdbt=False`.
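As a sketch of where these flags go: MaxText reads them as `key=value` config overrides appended to the command line. The module and config paths below are placeholders, not the exact command used elsewhere in this tutorial:

```bash
# Sketch only: storage flags for a Pathways-compatible checkpoint, passed as config overrides.
# Replace the module, config path, and remaining arguments with your actual
# checkpoint-generation or training command.
python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=sft-ckpt-prep \
  checkpoint_storage_use_zarr3=False \
  checkpoint_storage_use_ocdbt=False
```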
### Option 2: Converting a Hugging Face checkpoint
If your model checkpoint is from Hugging Face, you need to run a conversion script to make it MaxText-compatible.
1. **Set the Output Path:** First, define where the converted MaxText checkpoint will be saved. For example:
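   As in the single-host tutorial, a minimal sketch with placeholder values:

   ```bash
   # Placeholder: GCS location where the converted MaxText checkpoint will be written.
   export CONVERTED_CKPT_PATH=gs://my-bucket/converted-maxtext-checkpoint
   ```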