
Commit 31d0b8c

Merge pull request #2960 from AI-Hypercomputer:jackyf/docs/distillation
PiperOrigin-RevId: 863276718
2 parents af14e43 + 4b59129 commit 31d0b8c

2 files changed: 325 additions & 97 deletions

docs/tutorials/posttraining/knowledge_distillation.md: 128 additions & 97 deletions
# Knowledge distillation
## Overview

Knowledge Distillation is a compression technique that transfers knowledge from a larger (teacher) model to a smaller (student) model. This allows the smaller model to achieve performance levels closer to the larger one, but with significantly fewer parameters and computational resources.

This tutorial focuses on **response-based knowledge distillation**, a technique where the student model is trained to replicate the outputs and behaviors of the teacher model. Within response-based knowledge distillation, two primary methods are often employed:

1. **Offline Distillation (Dataset Generation):**

   - The pre-trained teacher model (running in vLLM) generates a new dataset of input-output pairs.
   - The student model is then trained on this teacher-generated dataset using standard fine-tuning techniques in MaxText.

1. **Online Distillation (Logit Matching):**

   - During the training process, both the teacher model (which is typically frozen) and the student model process the same input data simultaneously.
   - The student model is trained by minimizing a loss function that encourages its output logits to match the logits produced by the teacher model for the same inputs (a minimal sketch of such a loss follows below).
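For intuition, here is a minimal, self-contained sketch of such a logit-matching loss in JAX. It is illustrative only (not MaxText's implementation), and the array shapes and temperature are assumptions:

```python
# Illustrative sketch of online distillation's logit-matching loss (not MaxText code).
# Assumes logits of shape [batch, seq_len, vocab] from a frozen teacher and a student.
import jax
import jax.numpy as jnp


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, averaged over tokens."""
    t_log_probs = jax.nn.log_softmax(teacher_logits / temperature, axis=-1)
    s_log_probs = jax.nn.log_softmax(student_logits / temperature, axis=-1)
    # KL divergence per token, summed over the vocabulary.
    kl = jnp.sum(jnp.exp(t_log_probs) * (t_log_probs - s_log_probs), axis=-1)
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * jnp.mean(kl)


# Example with random logits: batch=2, seq_len=4, vocab=8.
teacher = jax.random.normal(jax.random.PRNGKey(0), (2, 4, 8))
student = jax.random.normal(jax.random.PRNGKey(1), (2, 4, 8))
print(distillation_loss(student, teacher))
```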
## Running Offline Distillation with MaxText

The following recipe demonstrates the process of offline distillation using **Qwen/Qwen3-32B** as the teacher model and **Llama-3.1-8B** as the student model. Since this recipe fine-tunes the student model using Supervised Fine-Tuning (SFT), it's crucial to use the conversational variant for both the teacher and student models. Here's a step-by-step tutorial:

### Prerequisites

#### a. Setup environment variables

```bash
export HF_TOKEN=<your-hf-token>    # e.g., hf_BA6...
export RUN_NAME=<your-run-name>    # e.g., distill-20260115
```
#### b. Install dependencies

To install MaxText and its dependencies for post-training (including vLLM for the teacher), run the following:

1. Follow the [MaxText installation instructions](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#install-maxtext).

1. Install the additional dependencies for post-training:

```bash
bash tools/setup/setup_post_training_requirements.sh
```

#### c. Setup storage with Hyperdisk

To store large models and datasets, attach a Hyperdisk to your TPU VM. Refer to the [Google Cloud Hyperdisk documentation](https://cloud.google.com/compute/docs/disks/add-hyperdisk) and [TPU VM management](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm) for detailed instructions.

First, create a Hyperdisk:

```bash
export ZONE=<your-tpu-zone>            # e.g., us-central1-a
export TPU_VM_NAME=<your-tpu-vm-name>
export DISK_NAME=<your-disk-name>      # e.g., my-hyperdisk
export DISK_SIZE=<disk-size>           # e.g., 500GB

gcloud compute disks create ${DISK_NAME} \
  --size=${DISK_SIZE} \
  --type=hyperdisk-balanced \
  --zone=${ZONE}
```

Then, attach the disk to your TPU VM:

```bash
gcloud compute instances attach-disk ${TPU_VM_NAME} \
  --disk=${DISK_NAME} \
  --zone=${ZONE}
```

Inside the TPU VM, format and mount the disk (if not already mounted):

```bash
# Assuming the disk is /dev/sdb, check with lsblk
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/hyperdisk
sudo mount /dev/sdb /mnt/hyperdisk
```
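Before downloading multi-gigabyte checkpoints, it can help to confirm the mount has enough headroom (the Qwen3-32B weights alone are roughly 60-70 GB in bf16). A small Python check, assuming the `/mnt/hyperdisk` mount point used above and a rough space budget of your choosing:

```python
# Quick sanity check of free space on the mounted Hyperdisk.
import shutil

MOUNT_POINT = "/mnt/hyperdisk"  # mount point assumed from the step above
# Rough working estimate for this tutorial: teacher + student weights,
# the converted checkpoint, and the generated dataset (an assumption, not a hard requirement).
REQUIRED_GB = 150

usage = shutil.disk_usage(MOUNT_POINT)
free_gb = usage.free / 1024**3
print(f"{MOUNT_POINT}: {free_gb:.0f} GB free of {usage.total / 1024**3:.0f} GB")
if free_gb < REQUIRED_GB:
    print(f"Warning: less than {REQUIRED_GB} GB free; consider a larger disk.")
```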
Update the BASE_DIRECTORY to point to the mounted disk and create the directory:

```bash
export BASE_NAME=<your-base-directory>    # e.g., knowledge-distillation
export BASE_DIRECTORY=/mnt/hyperdisk/${BASE_NAME}
mkdir -p ${BASE_DIRECTORY}
```

> **Note:** This tutorial uses a mounted Hyperdisk for performance and reproducibility, because writing large model files and many small I/O operations directly to `gs://` can be significantly slower.

### Obtain and prepare the teacher model

For the teacher model, we will use **vLLM** to run inference. vLLM can load Hugging Face checkpoints directly, so **no conversion to MaxText format is needed** for the teacher. Ensure the teacher model is supported on TPU vLLM (refer to the [vLLM TPU recommended models](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/#text-only-models) for the latest list).

You can simply download the model from Hugging Face to your local directory:

```bash
huggingface-cli login --token $HF_TOKEN
huggingface-cli download Qwen/Qwen3-32B --repo-type model --local-dir ${BASE_DIRECTORY}/qwen3-32b
```
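As a quick smoke test that vLLM can serve the downloaded teacher, you can load it and sample a couple of completions directly from Python. This is an illustrative sketch, not part of the recipe; it assumes a recent vLLM release (with `LLM` and `SamplingParams`), an 8-chip slice for tensor parallelism, and a throwaway prompt:

```python
# Minimal vLLM smoke test for the teacher model (illustrative only).
import os
from vllm import LLM, SamplingParams

teacher_path = os.path.join(os.environ["BASE_DIRECTORY"], "qwen3-32b")

# tensor_parallel_size=8 assumes an 8-chip TPU slice (e.g., v6e-8) for the 32B teacher.
llm = LLM(model=teacher_path, tensor_parallel_size=8)

# n=2 mirrors the --num-generations 2 setting used later for dataset generation.
params = SamplingParams(temperature=0.7, max_tokens=128, n=2)

outputs = llm.generate(["Explain knowledge distillation in one sentence."], params)
for completion in outputs[0].outputs:
    print(completion.text)
```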
### Obtain and prepare the student model

The student model will be trained in MaxText, which uses the Orbax checkpointing format. You will download the Hugging Face weights to your mounted disk and convert them for training.

#### Convert checkpoint to MaxText format

The following command downloads the Hugging Face weights and converts them to the MaxText format.

**Note:** This conversion script requires PyTorch.

```bash
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
```

```bash
# Set the checkpoint directory
export PRE_TRAINED_MODEL_CKPT_DIRECTORY=${BASE_DIRECTORY}/llama3.1-8b-ckpt

# Convert to MaxText format
python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml \
  model_name=llama3.1-8b \
  hf_access_token=${HF_TOKEN} \
  base_output_directory=${PRE_TRAINED_MODEL_CKPT_DIRECTORY} \
  scan_layers=True skip_jax_distributed_system=True
```
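The SFT command below loads the converted parameters from `${PRE_TRAINED_MODEL_CKPT_DIRECTORY}/0/items`. A short check that this path exists before launching training can save a failed run; this is a sketch, assuming only the directory layout implied by the `load_parameters_path` used later:

```python
# Verify the converted Orbax checkpoint layout before starting SFT.
import os
from pathlib import Path

ckpt_dir = Path(os.environ["PRE_TRAINED_MODEL_CKPT_DIRECTORY"])
items = ckpt_dir / "0" / "items"  # path used as load_parameters_path below

if items.is_dir():
    size_gb = sum(f.stat().st_size for f in items.rglob("*") if f.is_file()) / 1024**3
    print(f"Found converted checkpoint at {items} (~{size_gb:.1f} GB)")
else:
    raise FileNotFoundError(f"Expected converted checkpoint at {items}; re-run the conversion step.")
```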
### Generate dataset using vLLM (Teacher Step)

Use the provided script `generate_distillation_data_vllm.py` to generate the dataset from the teacher model. This script writes a Parquet dataset compatible with MaxText SFT.

Run the generation script:

```bash
export OUTPUT_DATASET=${BASE_DIRECTORY}/datasets/distillation_data.parquet

python3 -m tools.data_generation.generate_distillation_data_vllm \
  --dataset-path HuggingFaceH4/ultrachat_200k \
  --data-split train_sft \
  --data-columns messages \
  --hf-access-token $HF_TOKEN \
  --teacher-model ${BASE_DIRECTORY}/qwen3-32b \
  --use-chat-template \
  --num-prompts 5120 \
  --num-generations 2 \
  --output-file ${OUTPUT_DATASET}
```
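With `--num-prompts 5120` and `--num-generations 2`, the run produces roughly 5120 × 2 = 10,240 prompt-completion records. Before fine-tuning, it is worth spot-checking that the Parquet file contains the conversational `messages` column the SFT trainer expects. A small sketch using pandas; the exact column layout written by the script is an assumption here:

```python
# Spot-check the teacher-generated Parquet dataset (illustrative; column layout assumed).
import os
import pandas as pd

dataset_path = os.environ["OUTPUT_DATASET"]
df = pd.read_parquet(dataset_path)

print(f"{len(df)} records, columns: {list(df.columns)}")
# Expect a conversational format: each row's 'messages' holds a list of
# {"role": ..., "content": ...} turns ending with the teacher's completion.
example = df.iloc[0]["messages"]
for turn in example:
    print(turn["role"], "->", str(turn["content"])[:80])
```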
### Fine-tune the student model using Supervised Fine Tuning (SFT)

You can now fine-tune your smaller student model using supervised fine-tuning in MaxText.

#### Fine-tune the student model using the generated dataset

Example command to run fine-tuning on a TPU v6e-8:

```bash
python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml \
  run_name=${RUN_NAME} \
  base_output_directory=${BASE_DIRECTORY}/distillation/qwen3-32b-distill-llama3.1-8b \
  tokenizer_path=meta-llama/Llama-3.1-8B-Instruct tokenizer_type=huggingface \
  dataset_type=hf \
  hf_path=parquet \
  hf_train_files=${OUTPUT_DATASET} \
  train_split='train' \
  train_data_columns=['messages'] \
  load_parameters_path=${PRE_TRAINED_MODEL_CKPT_DIRECTORY}/0/items \
  model_name=llama3.1-8b \
  per_device_batch_size=2 \
  steps=200 \
  ici_expert_parallelism=-1 ici_fsdp_parallelism=4 \
  max_target_length=2048 \
  hf_access_token=$HF_TOKEN \
  profiler=xplane
```
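To see how far these settings get through the generated data: on a v6e-8 (8 chips), `per_device_batch_size=2` gives a global batch of 16 sequences per step, so 200 steps process about 3,200 sequences against the ~10,240 generated records (ignoring any example packing); increase `steps` or the batch size to cover the full dataset. The arithmetic, under the assumption of 8 devices and one example per sequence:

```python
# Back-of-the-envelope sizing for the SFT run above (8 devices assumed for v6e-8).
num_devices = 8
per_device_batch_size = 2
steps = 200
max_target_length = 2048
generated_examples = 5120 * 2  # --num-prompts * --num-generations

global_batch = per_device_batch_size * num_devices   # 16 sequences per step
examples_seen = global_batch * steps                 # 3,200 sequences
tokens_per_step = global_batch * max_target_length   # 32,768 tokens (upper bound)
epochs = examples_seen / generated_examples          # ~0.31 passes over the data

print(global_batch, examples_seen, tokens_per_step, round(epochs, 2))
```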
#### **[OPTIONAL]** Fine-tune the student model using the original dataset

The checkpoint from the student model's fine-tuning (on the teacher-generated dataset) can be used for a subsequent fine-tuning stage. In this step, the student model is fine-tuned on the original dataset that was initially provided to the teacher model for generating the dataset.

```bash
# Get the latest checkpoint for fine-tuned student model
CHECKPOINTS_PATH=${BASE_DIRECTORY}/distillation/qwen3-32b-distill-llama3.1-8b/${RUN_NAME}/checkpoints
checkpoints=$(ls $CHECKPOINTS_PATH)
integer_dirs=()
for dir in $checkpoints; do
  dir_name=$(basename "$dir")
  # (assumed) keep only numerically named checkpoint step directories
  if [[ "$dir_name" =~ ^[0-9]+$ ]]; then
    integer_dirs+=("$dir_name")
  fi
done
sorted_dirs=($(printf '%s\n' "${integer_dirs[@]}" | sort -n))
largest_dir="${sorted_dirs[-1]}"
FINE_TUNED_MODEL_CKPT_PATH=${CHECKPOINTS_PATH}/${largest_dir}/model_params

# Fine-tune student model on original dataset
python3 -m MaxText.sft.sft_trainer src/MaxText/configs/sft.yml \
  run_name=${RUN_NAME}_stage2 \
  base_output_directory=${BASE_DIRECTORY}/distillation/qwen3-32b-distill-llama3.1-8b \
  tokenizer_path=meta-llama/Llama-3.1-8B-Instruct tokenizer_type=huggingface \
  dataset_type=hf \
  hf_path='HuggingFaceH4/ultrachat_200k' \
  train_split='train_sft' \
  train_data_columns=['messages'] \
  load_parameters_path=${FINE_TUNED_MODEL_CKPT_PATH} \
  model_name=llama3.1-8b \
  per_device_batch_size=2 \
  steps=200 \
  ici_expert_parallelism=-1 ici_fsdp_parallelism=4 \
  max_target_length=2048 \
  hf_access_token=$HF_TOKEN \
  profiler=xplane
```
