Commit 9be4a6a

Move src/MaxText/sft to src/maxtext/trainers/post_train/sft
1 parent 8204907 commit 9be4a6a

13 files changed

Lines changed: 284 additions & 208 deletions
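
For callers, the move is a pure rename of the module path, in both imports and `python -m` invocations. A compact summary drawn from the diffs below; the trailing `...` stands for the config overrides shown in each file and is not filled in here:

```python
# Before this commit:
#   import:  from MaxText.sft import sft_trainer
#   CLI:     python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml ...
#
# After this commit:
#   import:  from maxtext.trainers.post_train.sft import train_sft
#   CLI:     python3 -m maxtext.trainers.post_train.sft.train_sft src/MaxText/configs/sft.yml ...
from maxtext.trainers.post_train.sft import train_sft  # new module location
```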

codecov.yml

Lines changed: 2 additions & 1 deletion
```diff
@@ -39,7 +39,8 @@ ignore:
   - "src/MaxText/inference"
   - "src/MaxText/inference_mlperf"
   - "src/MaxText/scratch_code"
-  - "src/MaxText/distillation" # code moved to src/MaxText/trainers/post_train/distillation
+  - "src/MaxText/distillation" # code moved to src/maxtext/trainers/post_train/distillation
+  - "src/MaxText/sft" # code moved to src/maxtext/trainers/post_train/sft
 
 
 flags:
```

docs/tutorials/posttraining/sft.md

Lines changed: 7 additions & 1 deletion
````diff
@@ -15,6 +15,7 @@
 -->
 
 # SFT on single-host TPUs
+
 Supervised fine-tuning (SFT) is a process where a pre-trained large language model is fine-tuned on a labeled dataset to adapt the model to perform better on specific tasks.
 
 This tutorial demonstrates step-by-step instructions for setting up the environment and then training the model on a Hugging Face dataset using SFT.
@@ -64,16 +65,19 @@ export TRAIN_DATA_COLUMNS=<data columns to train on> # e.g., ['messages']
 ```
 
 ## Get your model checkpoint
+
 This section explains how to prepare your model checkpoint for use with MaxText. You have two options: using an existing MaxText checkpoint or converting a Hugging Face checkpoint.
 
 ### Option 1: Using an existing MaxText checkpoint
+
 If you already have a MaxText-compatible model checkpoint, simply set the following environment variable and move on to the next section.
 
 ```sh
 export PRE_TRAINED_MODEL_CKPT_PATH=<gcs path for MaxText checkpoint> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
 ### Option 2: Converting a Hugging Face checkpoint
+
 If your model checkpoint is from Hugging Face, you need to run a conversion script to make it MaxText-compatible.
 
 1. **Set the Output Path:** First, define where the converted MaxText checkpoint will be saved. For example:
@@ -101,10 +105,11 @@ export PRE_TRAINED_MODEL_CKPT_PATH=${PRE_TRAINED_MODEL_CKPT_DIRECTORY}/0/items
 ```
 
 ## Run SFT on Hugging Face Dataset
+
 Now you are ready to run SFT using the following command:
 
 ```sh
-python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml \
+python3 -m maxtext.trainers.post_train.sft.train_sft src/MaxText/configs/sft.yml \
   run_name=${RUN_NAME} \
   base_output_directory=${BASE_OUTPUT_DIRECTORY} \
   model_name=${PRE_TRAINED_MODEL} \
@@ -118,4 +123,5 @@ python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml \
   train_data_columns=${TRAIN_DATA_COLUMNS} \
   profiler=xplane
 ```
+
 Your fine-tuned model checkpoints will be saved here: `$BASE_OUTPUT_DIRECTORY/$RUN_NAME/checkpoints`.
````
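
The tutorial's entry point changes from `MaxText.sft.sft_trainer` to `maxtext.trainers.post_train.sft.train_sft`. A hedged sanity check, assuming the maxtext package from this commit is installed, that resolves the new module without executing it:

```python
# Locate the renamed entry point without importing (and thus running) it.
import importlib.util

spec = importlib.util.find_spec("maxtext.trainers.post_train.sft.train_sft")
print(spec.origin if spec else "module not found")
# Expected to end with: maxtext/trainers/post_train/sft/train_sft.py
```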

docs/tutorials/posttraining/sft_on_multi_host.md

Lines changed: 27 additions & 5 deletions
````diff
@@ -15,6 +15,7 @@
 -->
 
 # SFT on multi-host TPUs
+
 Supervised fine-tuning (SFT) is a process where a pre-trained large language model is fine-tuned on a labeled dataset to adapt the model to perform better on specific tasks.
 
 This tutorial demonstrates step-by-step instructions for setting up the multi-host TPU environment and then training the model on the Hugging Face dataset using SFT. In this tutorial we use a multi-host TPU such as `v6e-256`.
@@ -24,16 +25,20 @@ We use [Tunix](https://github.com/google/tunix), a JAX-based library designed fo
 Let's get started!
 
 ## 1. Build and upload MaxText Docker image
+
 This section guides you through cloning the MaxText repository, building MaxText Docker image with dependencies, and uploading the docker image to your project's Artifact Registry.
 
 ### 1.1. Clone the MaxText repository
+
 ```bash
 git clone https://github.com/google/maxtext.git
 cd maxtext
 ```
 
 ### 1.2. Build MaxText Docker image
+
 Before building the Docker image, authenticate to [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) for permission to push your images and other access.
+
 ```bash
 # Authenticate your user account for gcloud CLI access
 gcloud auth login
@@ -43,26 +48,34 @@ gcloud auth application-default login
 gcloud auth configure-docker
 docker run hello-world
 ```
+
 Then run the following command to create a local Docker image named `maxtext_base_image`. This build process takes approximately 10 to 15 minutes.
+
 ```bash
 bash dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training
 ```
 
 ### 1.3. Upload the Docker image to Artifact Registry
+
 > **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access".
+
 ```bash
 export DOCKER_IMAGE_NAME=<Docker Image Name>
 bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=$DOCKER_IMAGE_NAME
 ```
+
 The `docker_upload_runner.sh` script uploads your Docker image to Artifact Registry.
 
 ## 2. Install XPK
-Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md).
+
+Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md).
 
 ## 3. Create GKE cluster
+
 Use a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster).
 
 ## 4. Environment configuration
+
 ```bash
 # -- Google Cloud Configuration --
 export PROJECT=<Google Cloud Project ID>
@@ -91,19 +104,24 @@ export TRAIN_DATA_COLUMNS=<Data Columns to Train on> # e.g., ['messages']
 ```
 
 ## 5. Get MaxText model checkpoint
+
 This section explains how to prepare your model checkpoint for use with MaxText. You have two options: using an existing MaxText checkpoint or converting a Hugging Face checkpoint.
 
 ### Option 1: Using an existing MaxText checkpoint
+
 If you already have a MaxText-compatible model checkpoint, simply set the following environment variable and move on to the next section.
 
 ```bash
 export MODEL_CHECKPOINT_PATH=<gcs path for MaxText checkpoint> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
+
 **Note:** Make sure that `MODEL_CHECKPOINT_PATH` has the checkpoints created using the correct storage flags:
-* **For SFT with McJAX:** `checkpoint_storage_use_zarr3=True` and `checkpoint_storage_use_ocdbt=True`.
-* **For SFT with Pathways:** `checkpoint_storage_use_zarr3=False` and `checkpoint_storage_use_ocdbt=False`.
+
+- **For SFT with McJAX:** `checkpoint_storage_use_zarr3=True` and `checkpoint_storage_use_ocdbt=True`.
+- **For SFT with Pathways:** `checkpoint_storage_use_zarr3=False` and `checkpoint_storage_use_ocdbt=False`.
 
 ### Option 2: Converting a Hugging Face checkpoint
+
 If your model checkpoint is from Hugging Face, you need to run a conversion script to make it MaxText-compatible.
 
 1. **Set the Output Path:** First, define where the converted MaxText checkpoint will be saved. For example:
@@ -137,9 +155,11 @@ export MODEL_CHECKPOINT_PATH=${MODEL_CHECKPOINT_DIRECTORY}/0/items
 ```
 
 ## 6. Submit workload on GKE cluster
+
 This section provides the command to run SFT on a GKE cluster.
 
 ### 6.1. SFT with Multi-Controller JAX (McJAX)
+
 ```bash
 xpk workload create \
   --cluster=${CLUSTER_NAME} \
@@ -149,11 +169,13 @@ xpk workload create \
   --workload=${WORKLOAD_NAME} \
   --tpu-type=${TPU_TYPE} \
   --num-slices=${TPU_SLICE} \
-  --command "python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml run_name=$WORKLOAD_NAME base_output_directory=$OUTPUT_PATH model_name=$MODEL_NAME load_parameters_path=$MODEL_CHECKPOINT_PATH hf_access_token=$HF_TOKEN tokenizer_path=$TOKENIZER_PATH per_device_batch_size=1 steps=$STEPS profiler=xplane hf_path=$DATASET_NAME train_split=$TRAIN_SPLIT train_data_columns=$TRAIN_DATA_COLUMNS"
+  --command "python3 -m maxtext.trainers.post_train.sft.train_sft src/MaxText/configs/sft.yml run_name=$WORKLOAD_NAME base_output_directory=$OUTPUT_PATH model_name=$MODEL_NAME load_parameters_path=$MODEL_CHECKPOINT_PATH hf_access_token=$HF_TOKEN tokenizer_path=$TOKENIZER_PATH per_device_batch_size=1 steps=$STEPS profiler=xplane hf_path=$DATASET_NAME train_split=$TRAIN_SPLIT train_data_columns=$TRAIN_DATA_COLUMNS"
 ```
+
 Once the fine-tuning is completed, you can access your model checkpoints at `$OUTPUT_PATH/$WORKLOAD_NAME/checkpoints`.
 
 ### 6.2. SFT with Pathways
+
 ```bash
 xpk workload create-pathways \
   --cluster=${CLUSTER_NAME} \
@@ -163,7 +185,7 @@ xpk workload create-pathways \
   --workload=${WORKLOAD_NAME} \
   --tpu-type=${TPU_TYPE} \
   --num-slices=${TPU_SLICE} \
-  --command="JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE=1 python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml run_name=$WORKLOAD_NAME base_output_directory=$OUTPUT_PATH model_name=$MODEL_NAME load_parameters_path=$MODEL_CHECKPOINT_PATH hf_access_token=$HF_TOKEN tokenizer_path=$TOKENIZER_PATH per_device_batch_size=1 steps=$STEPS profiler=xplane checkpoint_storage_use_zarr3=False checkpoint_storage_use_ocdbt=False enable_single_controller=True"
+  --command="JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE=1 python3 -m maxtext.trainers.post_train.sft.train_sft src/MaxText/configs/sft.yml run_name=$WORKLOAD_NAME base_output_directory=$OUTPUT_PATH model_name=$MODEL_NAME load_parameters_path=$MODEL_CHECKPOINT_PATH hf_access_token=$HF_TOKEN tokenizer_path=$TOKENIZER_PATH per_device_batch_size=1 steps=$STEPS profiler=xplane checkpoint_storage_use_zarr3=False checkpoint_storage_use_ocdbt=False enable_single_controller=True"
 ```
 
 Once the fine-tuning is completed, you can access your model checkpoints at `$OUTPUT_PATH/$WORKLOAD_NAME/checkpoints`.
````

end_to_end/tpu/llama3.1/8b/run_sft.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -57,7 +57,7 @@ fi
 echo "Running fine-tuning on checkpoint: ${PRE_TRAINED_MODEL_CKPT_PATH}"
 
 # Run Supervised Fine-Tuning on MaxText checkpoint using HuggingFaceH4/ultrachat_200k dataset
-python3 -m MaxText.sft.sft_trainer "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}"/configs/sft.yml \
+python3 -m maxtext.trainers.post_train.sft.train_sft "${MAXTEXT_PKG_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/MaxText}"/configs/sft.yml \
   run_name=${RUN_NAME} base_output_directory=${BASE_OUTPUT_DIRECTORY}/${PRE_TRAINED_MODEL} \
   model_name=${PRE_TRAINED_MODEL} load_parameters_path=${PRE_TRAINED_MODEL_CKPT_PATH} \
   hf_access_token=$HF_TOKEN tokenizer_path=${PRE_TRAINED_MODEL_TOKENIZER} \
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@ Repository = "https://github.com/AI-Hypercomputer/maxtext.git"
 allow-direct-references = true
 
 [tool.hatch.build.targets.wheel]
-packages = ["src/MaxText", "src/install_maxtext_extra_deps"]
+packages = ["src/MaxText", "src/maxtext", "src/install_maxtext_extra_deps"]
 
 [tool.hatch.build.targets.wheel.hooks.custom]
 path = "build_hooks.py"
```
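
During the transition the wheel ships both the legacy `src/MaxText` tree and the new `src/maxtext` tree. A minimal sketch, assuming a wheel built from this `pyproject.toml` is installed, to confirm both top-level packages resolve:

```python
# Check that both package roots listed in [tool.hatch.build.targets.wheel]
# are present in the installed environment.
import importlib.util

for pkg in ("MaxText", "maxtext"):
    print(pkg, "->", "found" if importlib.util.find_spec(pkg) else "missing")
```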

src/MaxText/examples/sft_llama3_demo.ipynb

Lines changed: 3 additions & 2 deletions
```diff
@@ -149,7 +149,7 @@
 "import sys\n",
 "import MaxText\n",
 "from MaxText import pyconfig\n",
-"from MaxText.sft.sft_trainer import train as sft_train\n",
+"from maxtext.trainers.post_train.sft import train_sft\n",
 "import jax\n",
 "from huggingface_hub import login\n",
 "\n",
@@ -173,6 +173,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
 "if IN_COLAB:\n",
 "    HF_TOKEN = userdata.get(\"HF_TOKEN\")\n",
@@ -312,7 +313,7 @@
 "print(\"=\" * 60)\n",
 "\n",
 "try:\n",
-"    trainer, mesh = sft_train(config)\n",
+"    trainer, mesh = train_sft.train(config)\n",
 "\n",
 "    print(\"\\n\" + \"=\" * 60)\n",
 "    print(\"✅ Training Completed Successfully!\")\n",
```

src/MaxText/examples/sft_qwen3_demo.ipynb

Lines changed: 3 additions & 3 deletions
```diff
@@ -201,7 +201,7 @@
 "from MaxText import pyconfig\n",
 "from MaxText.examples.sft_train_and_evaluate import evaluate_model, get_test_dataset\n",
 "from MaxText.integration.tunix.tunix_adapter import TunixMaxTextAdapter\n",
-"from MaxText.sft import sft_trainer\n",
+"from maxtext.trainers.post_train.sft import train_sft\n",
 "\n",
 "# Suppress vLLM logging with a severity level below ERROR\n",
 "os.environ[\"VLLM_LOGGING_LEVEL\"] = \"ERROR\"\n",
@@ -451,7 +451,7 @@
 },
 "outputs": [],
 "source": [
-"trainer, mesh = sft_trainer.setup_trainer_state(config)"
+"trainer, mesh = train_sft.setup_trainer_state(config)"
 ]
 },
 {
@@ -545,7 +545,7 @@
 "outputs": [],
 "source": [
 "print(\"Starting SFT Training...\")\n",
-"trainer = sft_trainer.train_model(config, trainer, mesh)\n",
+"trainer = train_sft.train_model(config, trainer, mesh)\n",
 "print(\"SFT Training Complete!\")"
 ]
 },
```

src/MaxText/examples/sft_train_and_evaluate.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -92,7 +92,7 @@
 from MaxText import pyconfig
 from MaxText.input_pipeline import instruction_data_processing
 from MaxText.integration.tunix.tunix_adapter import TunixMaxTextAdapter
-from MaxText.sft import sft_trainer
+from maxtext.trainers.post_train.sft import train_sft
 
 # Suppress vLLM logging with a severity level below ERROR
 os.environ["VLLM_LOGGING_LEVEL"] = "ERROR"
@@ -330,7 +330,7 @@ def train_and_evaluate(config):
   test_dataset = get_test_dataset(config, tokenizer)
   test_dataset = test_dataset[:NUM_TEST_SAMPLES]
   test_dataset = test_dataset.to_iter_dataset().batch(BATCH_SIZE, drop_remainder=True)
-  trainer, mesh = sft_trainer.setup_trainer_state(config)
+  trainer, mesh = train_sft.setup_trainer_state(config)
   vllm_rollout = create_vllm_rollout(config, trainer.model, mesh, tokenizer)
 
   # 1. Pre-SFT Evaluation
@@ -340,7 +340,7 @@ def train_and_evaluate(config):
 
   # 2. SFT Training
   max_logging.log("Starting SFT training...")
-  trainer = sft_trainer.train_model(config, trainer, mesh)
+  trainer = train_sft.train_model(config, trainer, mesh)
 
   # 3. Post-SFT Evaluation
   max_logging.log("Running Post-SFT evaluation...")
```
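
The updated call sites show two ways to drive the relocated trainer: the llama3 notebook uses the one-shot `train_sft.train(config)`, while the qwen3 notebook and this script split setup from training so evaluation can run before and after. A side-by-side sketch (construction of `config` via `pyconfig` elided):

```python
from maxtext.trainers.post_train.sft import train_sft

# One-shot path, as in sft_llama3_demo.ipynb: build state and train in one call.
trainer, mesh = train_sft.train(config)  # config: a pyconfig-built config (elided)

# Two-step path, as in sft_qwen3_demo.ipynb and sft_train_and_evaluate.py:
# the trainer and mesh are available for pre-SFT evaluation or vLLM rollout
# setup before the training loop runs.
trainer, mesh = train_sft.setup_trainer_state(config)
trainer = train_sft.train_model(config, trainer, mesh)
```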
