
Commit 1c12384

Merge branch 'main' into landing-page
2 parents b8dbf70 + b279b99 commit 1c12384

78 files changed: 2259 additions & 988 deletions


.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion

@@ -33,4 +33,4 @@ src/MaxText/inference_mlperf @vipannalla @mitalisi @gpolovets1 @mailvijayasingh
 .github/workflows @gobbleturk @khatwanimohit @shralex @parambole @bvandermoon @richjames0

 # Benchmarking/Recipes
-benchmarks @SujeethJinesh @bvandermoon @richjames0 @shralex @vipannalla @mitalisi @RissyRan @shauryagup @NuojCheng @gobbleturk @khatwanimohit @Obliviour
+benchmarks @SujeethJinesh @bvandermoon @richjames0 @shralex @vipannalla @mitalisi @RissyRan @shauryagup @NuojCheng @gobbleturk @khatwanimohit @Obliviour @notabee

.github/workflows/run_pathways_tests_internal.yml

Lines changed: 1 addition & 1 deletion

@@ -75,7 +75,7 @@ jobs:
 python3 -m pip install -e . --no-dependencies &&
 python3 -m pip uninstall -y libtpu &&
 # TODO(b/454659463): Enable test_default_hlo_match after volume mount is supported.
-python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest" --durations=0
+python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" -k "not AotHloIdenticalTest and not CompileThenLoad" --durations=0

 services:
   resource_manager:

.github/workflows/run_tests_internal.yml

Lines changed: 1 addition & 1 deletion

@@ -81,4 +81,4 @@ jobs:
 python3 -m pip install -e . --no-dependencies
 [ "${{ inputs.total_workers }}" -gt 1 ] && python3 -m pip install --quiet pytest-split && SPLIT_ARGS="--splits ${{ inputs.total_workers }} --group ${{ inputs.worker_group }}" || SPLIT_ARGS=""
 export LIBTPU_INIT_ARGS='--xla_tpu_scoped_vmem_limit_kib=65536'
-python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" --durations=0 $SPLIT_ARGS
+python3 -m pytest ${{ inputs.pytest_addopts }} -v -m "${FINAL_PYTEST_MARKER}" --durations=0 --deselect "tests/aot_hlo_identical_test.py::AotHloIdenticalTest::test_default_hlo_match" $SPLIT_ARGS
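The SPLIT_ARGS line in this workflow uses the bash `cmd && a || b` idiom as a compact if/else. A minimal standalone sketch of how the split arguments get built, with plain shell variables standing in for the `${{ inputs.* }}` workflow expressions:

```shell
#!/bin/bash
# Stand-ins for ${{ inputs.total_workers }} and ${{ inputs.worker_group }}.
total_workers=3
worker_group=2

# `&& ... || ...` behaves like if/else here, with one caveat: if any command in
# the && chain fails (e.g. the pip install in the real workflow), the || branch
# runs and SPLIT_ARGS silently falls back to the empty string.
[ "$total_workers" -gt 1 ] && SPLIT_ARGS="--splits $total_workers --group $worker_group" || SPLIT_ARGS=""
echo "SPLIT_ARGS=$SPLIT_ARGS"
```

With more than one worker this prints `SPLIT_ARGS=--splits 3 --group 2`; with a single worker, SPLIT_ARGS stays empty and pytest runs the full suite unsplit.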

README.md

Lines changed: 4 additions & 3 deletions

@@ -22,7 +22,7 @@
 MaxText is a high performance, highly scalable, open-source LLM library and reference implementation written in pure Python/[JAX](https://docs.jax.dev/en/latest/jax-101.html) and targeting Google Cloud TPUs and GPUs for training.

-MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO, a type of Reinforcement Learning).
+MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO, a type of Reinforcement Learning), and Group Sequence Policy Optimization (GSPO, another Reinforcement Learning technique).

 MaxText achieves high Model FLOPs Utilization (MFU) and tokens/second from single host to very large clusters while staying simple and largely "optimization-free" thanks to the power of JAX and the XLA compiler.

@@ -74,9 +74,10 @@ Check out these getting started guides:
 * Supervised Fine Tuning (SFT)
   * [SFT on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/sft.html)
   * [SFT on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/sft_on_multi_host.html)
-* Group Relative Policy Optimization (GRPO)
+* Group Relative & Group Sequence Policy Optimization (GRPO & GSPO)
   * [GRPO on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html)
-  * [GRPO on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/grpo_with_pathways.html)
+  * [GRPO on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/grpo_with_pathways.html)
+  * [GSPO](https://maxtext.readthedocs.io/en/latest/tutorials/grpo.html#run-gspo) (pass `loss_algo=gspo-token` to run GSPO)

 ### Model library

benchmarks/maxtest/getting_started.md

Lines changed: 15 additions & 0 deletions

@@ -44,6 +44,21 @@ EXIT_CODE=0
 - maxtest.sh will generate a YAML file in the directory that is passed to kubectl. This file can be modified and reused by running `kubectl apply -f maxtest.yaml`

+### Passing custom libtpu or XLA flags ###
+
+Custom libtpu or XLA flags can be passed by specifying `--libtpu_args`.
+
+#### Setting flags for SDC checking ####
+
+This is useful for checking for silent data corruption (SDC) on TPU hardware.
+
+```
+bash maxtest.sh --project $TPU_PROJECT --cluster $CLUSTER --region $REGION --nodepool $NODEPOOL_NAME --num_workers $NUM_WORKERS --libtpu_args '--xla_tpu_enable_sdc_checker'
+```
+
 ### Debugging common job errors ###

 If the job does not exit with `EXIT_CODE=0`, there is a failure among one of

benchmarks/maxtest/maxtest.sh

Lines changed: 5 additions & 3 deletions

@@ -1,4 +1,4 @@
-#!/bin/bash
+#!bin/bash

 function usage() {
   echo "error: $1"

@@ -15,6 +15,7 @@ while [[ "$#" > 0 ]]; do case $1 in
   -r|--region) GKE_REGION="$2";shift;shift;;
   --nodepool) NODEPOOL="$2";shift;shift;;
   --num_workers) NUM_WORKERS="$2";shift;shift;;
+  --libtpu_args) LIBTPU_ARGS="$2";shift;shift;;
   *) usage "Unknown parameter passed: $1"; shift; shift;;
 esac; done

@@ -32,19 +33,20 @@ if [ -z "$TPU_ACCELERATOR" ]; then exit; fi;

 UUID=$(uuidgen)
 export JOB_NAME="${UUID:0:5}-maxtest"
-export DOCKER_IMAGE="gcr.io/cloud-tpu-images-public/tpu/healthscan"
+export DOCKER_IMAGE="us-docker.pkg.dev/cloud-tpu-images-public/tpu/healthscan:latest"
 export NODEPOOL
 export TPU_TOPOLOGY
 export TPU_ACCELERATOR
 export GKE_PROJECT
 export GKE_REGION
 export GKE_CLUSTER
+export LIBTPU_ARGS

 export MEMORY_PER_HOST="407Gi"
 export TPU_CHIPS_PER_HOST=4
 export COMPLETIONS=$NUM_WORKERS # Number of VMs in the nodepool (v6e -> 2 VMs for v6e-8, v5p -> 1 VM for a v5p-8)

-YAML_VARS='$JOB_NAME $DOCKER_IMAGE $NODEPOOL $TPU_TOPOLOGY $TPU_ACCELERATOR $COMPLETIONS $MEMORY_PER_HOST $TPU_CHIPS_PER_HOST $GKE_PROJECT $GKE_REGION $GKE_CLUSTER'
+YAML_VARS='$JOB_NAME $DOCKER_IMAGE $NODEPOOL $TPU_TOPOLOGY $TPU_ACCELERATOR $COMPLETIONS $MEMORY_PER_HOST $TPU_CHIPS_PER_HOST $GKE_PROJECT $GKE_REGION $GKE_CLUSTER $LIBTPU_ARGS'

 envsubst "${YAML_VARS}" < maxtest.yaml.template > maxtest.yaml
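The script derives a unique Kubernetes job name from the first five characters of a fresh UUID using bash substring expansion (`${var:offset:length}`). A standalone sketch of that naming scheme, with a fixed UUID in place of `uuidgen` so the result is deterministic:

```shell
#!/bin/bash
# maxtest.sh uses UUID=$(uuidgen); a fixed value is used here purely so the
# illustration is reproducible.
UUID="1c12384a-0b2c-4d5e-8f90-abcdef012345"
JOB_NAME="${UUID:0:5}-maxtest"   # ${var:offset:length} -> first 5 chars
echo "$JOB_NAME"                 # -> 1c123-maxtest
```

Five hex characters keep the job name short while making collisions between concurrent runs unlikely.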

benchmarks/maxtest/maxtest.yaml.template

Lines changed: 1 addition & 1 deletion

@@ -42,7 +42,7 @@ spec:
 _sigterm() (kill -SIGTERM $! 2>/dev/null;);
 trap _sigterm SIGTERM;

-(export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && python3 -m benchmarks.benchmark_runner healthscan --device_type=$TPU_ACCELERATOR_TYPE --base_output_directory=gke-healthscan-output --num_steps=5) & PID=$1;
+(export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && echo LIBTPU_INIT_ARGS='$LIBTPU_ARGS' && export LIBTPU_INIT_ARGS='$LIBTPU_ARGS' && python3 -m benchmarks.benchmark_runner healthscan --device_type=$TPU_ACCELERATOR_TYPE --base_output_directory=gke-healthscan-output --num_steps=5) & PID=$1;

 while kill -0 $PID 2>/dev/null;
 do sleep 5;
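The template launches the healthscan workload in a background subshell and then polls it: record the child's PID, test liveness with `kill -0` (which delivers no signal, only checks existence), and loop until the child is gone. A minimal sketch of that supervision pattern; note it captures the PID with `$!`, the shell's last-background-job PID (the template's `PID=$1` reads a positional parameter, so `$!` here is an assumption about the intent):

```shell
#!/bin/bash
# Run a stand-in workload in the background and record its PID via $!.
(sleep 0.2) & PID=$!

# kill -0 sends no signal; it only reports whether the process still exists.
while kill -0 "$PID" 2>/dev/null; do
  sleep 0.1
done
echo "workload exited"
```

This keeps the container's main process alive exactly as long as the workload, while the `trap _sigterm SIGTERM` line above lets Kubernetes terminations propagate to the child.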

dependencies/dockerfiles/maxtext_jax_ai_image.Dockerfile

Lines changed: 6 additions & 0 deletions

@@ -52,6 +52,12 @@ RUN if [ "$DEVICE" = "tpu" ]; then \
     python3 -m pip install 'google-tunix>=0.1.2'; \
     fi

+# Temporarily pin JAX to 0.8.1 for GPU images
+RUN if [ "$DEVICE" = "gpu" ]; then \
+    python3 -m pip install -U "jax[cuda12]==0.8.1"; \
+    python3 -m pip install -U "transformer-engine-cu12" "transformer-engine-jax" "transformer-engine"; \
+    fi

 # Now copy the remaining code (source files that may change frequently)
 COPY . .
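Both Dockerfile branches hinge on the `DEVICE` build argument (supplied via `docker build --build-arg DEVICE=...`); the conditional itself is plain POSIX shell, sketched here outside Docker with `echo` standing in for the pip installs:

```shell
#!/bin/sh
# DEVICE normally arrives as a Docker build argument (ARG DEVICE).
DEVICE="gpu"
if [ "$DEVICE" = "tpu" ]; then
  echo "install TPU extras (google-tunix)"
elif [ "$DEVICE" = "gpu" ]; then
  echo "install jax[cuda12] and Transformer Engine"
fi
```

Because each `RUN if ...` is a single shell invocation, an unmatched `DEVICE` simply makes the step a no-op rather than failing the build.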

dependencies/dockerfiles/maxtext_post_training_dependencies.Dockerfile

Lines changed: 2 additions & 0 deletions

@@ -30,6 +30,8 @@ RUN pip install numba==0.61.2
 # Install vLLM for Jax and TPUs
 RUN pip install vllm-tpu

+RUN pip install --no-deps qwix==0.1.4
+
 RUN if [ "$MODE" = "post-training-experimental" ]; then \
     pip uninstall -y jax jaxlib libtpu && \
     pip install --pre -U jax jaxlib -i https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ && \

dependencies/dockerfiles/maxtext_post_training_local_dependencies.Dockerfile

Lines changed: 4 additions & 13 deletions

@@ -28,27 +28,18 @@ RUN pip install keyring keyrings.google-artifactregistry-auth
 RUN pip install numba==0.61.2

 COPY tunix /tunix
+RUN pip uninstall -y google-tunix
 RUN pip install -e /tunix --no-cache-dir


 COPY vllm /vllm
-RUN VLLM_TARGET_DEVICE="tpu" pip install -e /vllm --no-cache-dir --pre \
-    --extra-index-url https://pypi.org/simple/ \
-    --extra-index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ \
-    --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
-    --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html \
-    --find-links https://storage.googleapis.com/libtpu-wheels/index.html \
-    --find-links https://storage.googleapis.com/libtpu-releases/index.html \
-    --find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html \
-    --find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
+RUN VLLM_TARGET_DEVICE="tpu" pip install -e /vllm --no-cache-dir


 COPY tpu-inference /tpu-inference
-RUN pip install -e /tpu-inference --no-cache-dir --pre \
-    --extra-index-url https://pypi.org/simple/ \
-    --extra-index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/ \
-    --find-links https://storage.googleapis.com/jax-releases/libtpu_releases.html
+RUN pip install -e /tpu-inference --no-cache-dir

+RUN pip install --no-deps qwix==0.1.4

 RUN if [ "$MODE" = "post-training-experimental" ]; then \
     echo "MODE=post-training-experimental: Re-installing JAX/libtpu"; \
