Commit 1ee782f

adriangb and claude authored
Migrate Python usage to uv workspace (#20414)
I was having trouble getting benchmarks to gen data.

## Summary

- Replace three independent `requirements.txt` files with a uv workspace (`benchmarks`, `dev`, `docs` projects)
- Single `uv.lock` lockfile for reproducible dependency resolution
- Simplify `bench.sh` by removing all ad-hoc venv/pip logic in favor of `uv run`

## Test plan

- [ ] `uv sync` resolves all deps from repo root
- [ ] `uv run --project benchmarks python3 benchmarks/compare.py` works
- [ ] `uv run --project docs sphinx-build docs/source docs/build` builds docs
- [ ] Run a benchmark from `bench.sh` that uses Python (e.g., h2o data gen or compare flow)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ace9cd4 commit 1ee782f

18 files changed

Lines changed: 1199 additions & 224 deletions
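
The test plan in the commit message can be walked through from the repository root with a small wrapper. The following is a minimal sketch, not part of the commit: `run`, `test_plan`, and `DRY_RUN` are hypothetical names introduced here, while the three `uv` commands are copied from the test plan as written. By default it only prints the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch of the commit's test plan; `run`, `test_plan`, and DRY_RUN are
# hypothetical names for illustration, not part of the repository.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# The three automatable checks from the test plan, in order.
test_plan() {
    run uv sync || return 1
    run uv run --project benchmarks python3 benchmarks/compare.py || return 1
    run uv run --project docs sphinx-build docs/source docs/build || return 1
}
```

Calling `test_plan` prints the commands; `DRY_RUN=0 test_plan` runs them and stops at the first failure. The fourth test-plan item (running a Python-backed benchmark through `bench.sh`) is left out because it depends on which benchmark is chosen.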

.github/workflows/docs.yaml

Lines changed: 4 additions & 11 deletions
@@ -40,17 +40,11 @@ jobs:
           ref: asf-site
           path: asf-site
 
-      - name: Setup Python
-        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
-        with:
-          python-version: "3.12"
+      - name: Setup uv
+        uses: astral-sh/setup-uv@f0ec1fc3b38f5e7cd731bb6ce540c5af426746bb # v6.1.0
 
       - name: Install dependencies
-        run: |
-          set -x
-          python3 -m venv venv
-          source venv/bin/activate
-          pip install -r docs/requirements.txt
+        run: uv sync --package datafusion-docs
       - name: Install dependency graph tooling
         run: |
           set -x
@@ -61,9 +55,8 @@ jobs:
       - name: Build docs
         run: |
           set -x
-          source venv/bin/activate
           cd docs
-          ./build.sh
+          uv run --package datafusion-docs ./build.sh
 
       - name: Copy & push the generated HTML
         run: |

.github/workflows/docs_pr.yaml

Lines changed: 4 additions & 11 deletions
@@ -44,16 +44,10 @@ jobs:
         with:
           submodules: true
           fetch-depth: 1
-      - name: Setup Python
-        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
-        with:
-          python-version: "3.12"
+      - name: Setup uv
+        uses: astral-sh/setup-uv@f0ec1fc3b38f5e7cd731bb6ce540c5af426746bb # v6.1.0
       - name: Install doc dependencies
-        run: |
-          set -x
-          python3 -m venv venv
-          source venv/bin/activate
-          pip install -r docs/requirements.txt
+        run: uv sync --package datafusion-docs
       - name: Install dependency graph tooling
         run: |
           set -x
@@ -63,6 +57,5 @@ jobs:
       - name: Build docs html and check for warnings
         run: |
           set -x
-          source venv/bin/activate
           cd docs
-          ./build.sh # fails on errors
+          uv run --package datafusion-docs ./build.sh # fails on errors
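
The two docs workflows above now share the same pair of `uv` commands, so the same steps can be approximated locally. This is a minimal sketch under stated assumptions: `build_docs_locally` is a hypothetical function name, `uv` is assumed to be installed already, and the "Install dependency graph tooling" step is skipped because its body is not shown in the diff. The two `uv` invocations are copied from the workflow steps.

```shell
#!/usr/bin/env bash
# Hypothetical local equivalent of the CI docs steps; not part of the commit.
# Usage: build_docs_locally [path-to-repo-root]   (defaults to the current dir)
build_docs_locally() {
    local repo_root="${1:-.}"
    (
        # Run in a subshell so the caller's working directory is untouched.
        cd "${repo_root}" || exit 1
        # Same command as the "Install dependencies" workflow step.
        uv sync --package datafusion-docs || exit 1
        # Same commands as the "Build docs" workflow step.
        cd docs || exit 1
        uv run --package datafusion-docs ./build.sh
    )
}
```

The subshell mirrors how each workflow step starts from the checkout root, which is why `cd docs` happens only after the sync.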

benchmarks/bench.sh

Lines changed: 4 additions & 139 deletions
@@ -42,7 +42,6 @@ DATAFUSION_DIR=${DATAFUSION_DIR:-$SCRIPT_DIR/..}
 DATA_DIR=${DATA_DIR:-$SCRIPT_DIR/data}
 CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --release"}
 PREFER_HASH_JOIN=${PREFER_HASH_JOIN:-true}
-VIRTUAL_ENV=${VIRTUAL_ENV:-$SCRIPT_DIR/venv}
 
 usage() {
     echo "
@@ -53,7 +52,6 @@ $0 data [benchmark]
 $0 run [benchmark] [query]
 $0 compare <branch1> <branch2>
 $0 compare_detail <branch1> <branch2>
-$0 venv
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Examples:
@@ -71,7 +69,6 @@ data: Generates or downloads data needed for benchmarking
 run: Runs the named benchmark
 compare: Compares fastest results from benchmark runs
 compare_detail: Compares minimum, average (±stddev), and maximum results from benchmark runs
-venv: Creates new venv (unless already exists) and installs compare's requirements into it
 
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Benchmarks
@@ -144,7 +141,6 @@ CARGO_COMMAND command that runs the benchmark binary
 DATAFUSION_DIR directory to use (default $DATAFUSION_DIR)
 RESULTS_NAME folder where the benchmark files are stored
 PREFER_HASH_JOIN Prefer hash join algorithm (default true)
-VENV_PATH Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
 DATAFUSION_* Set the given datafusion configuration
 "
     exit 1
@@ -542,9 +538,6 @@ main() {
         compare_detail)
             compare_benchmarks "$ARG2" "$ARG3" "--detailed"
             ;;
-        venv)
-            setup_venv
-            ;;
         "")
             usage
             ;;
@@ -708,7 +701,7 @@ run_compile_profile() {
     local data_path="${DATA_DIR}/tpch_sf1"
 
     echo "Running compile profile benchmark..."
-    local cmd=(python3 "${runner}" --data "${data_path}")
+    local cmd=(uv run python3 "${runner}" --data "${data_path}")
     if [ ${#profiles[@]} -gt 0 ]; then
         cmd+=(--profiles "${profiles[@]}")
     fi
@@ -923,151 +916,27 @@ data_h2o() {
     SIZE=${1:-"SMALL"}
     DATA_FORMAT=${2:-"CSV"}
 
-    # Function to compare Python versions
-    version_ge() {
-        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
-    }
-
-    export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
-
-    # Find the highest available Python version (3.10 or higher)
-    REQUIRED_VERSION="3.10"
-    PYTHON_CMD=$(command -v python3 || true)
-
-    if [ -n "$PYTHON_CMD" ]; then
-        PYTHON_VERSION=$($PYTHON_CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
-        if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
-            echo "Found Python version $PYTHON_VERSION, which is suitable."
-        else
-            echo "Python version $PYTHON_VERSION found, but version $REQUIRED_VERSION or higher is required."
-            PYTHON_CMD=""
-        fi
-    fi
-
-    # Search for suitable Python versions if the default is unsuitable
-    if [ -z "$PYTHON_CMD" ]; then
-        # Loop through all available Python3 commands on the system
-        for CMD in $(compgen -c | grep -E '^python3(\.[0-9]+)?$'); do
-            if command -v "$CMD" &> /dev/null; then
-                PYTHON_VERSION=$($CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
-                if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
-                    PYTHON_CMD="$CMD"
-                    echo "Found suitable Python version: $PYTHON_VERSION ($CMD)"
-                    break
-                fi
-            fi
-        done
-    fi
-
-    # If no suitable Python version found, exit with an error
-    if [ -z "$PYTHON_CMD" ]; then
-        echo "Python 3.10 or higher is required. Please install it."
-        return 1
-    fi
-
-    echo "Using Python command: $PYTHON_CMD"
-
-    # Install falsa and other dependencies
-    echo "Installing falsa..."
-
-    # Set virtual environment directory
-    VIRTUAL_ENV="${PWD}/venv"
-
-    # Create a virtual environment using the detected Python command
-    $PYTHON_CMD -m venv "$VIRTUAL_ENV"
-
-    # Activate the virtual environment and install dependencies
-    source "$VIRTUAL_ENV/bin/activate"
-
-    # Ensure 'falsa' is installed (avoid unnecessary reinstall)
-    pip install --quiet --upgrade falsa
-
     # Create directory if it doesn't exist
     H2O_DIR="${DATA_DIR}/h2o"
     mkdir -p "${H2O_DIR}"
 
     # Generate h2o test data
     echo "Generating h2o test data in ${H2O_DIR} with size=${SIZE} and format=${DATA_FORMAT}"
-    falsa groupby --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
-
-    # Deactivate virtual environment after completion
-    deactivate
+    uv run falsa groupby --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
 }
 
 data_h2o_join() {
     # Default values for size and data format
     SIZE=${1:-"SMALL"}
     DATA_FORMAT=${2:-"CSV"}
 
-    # Function to compare Python versions
-    version_ge() {
-        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
-    }
-
-    export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
-
-    # Find the highest available Python version (3.10 or higher)
-    REQUIRED_VERSION="3.10"
-    PYTHON_CMD=$(command -v python3 || true)
-
-    if [ -n "$PYTHON_CMD" ]; then
-        PYTHON_VERSION=$($PYTHON_CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
-        if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
-            echo "Found Python version $PYTHON_VERSION, which is suitable."
-        else
-            echo "Python version $PYTHON_VERSION found, but version $REQUIRED_VERSION or higher is required."
-            PYTHON_CMD=""
-        fi
-    fi
-
-    # Search for suitable Python versions if the default is unsuitable
-    if [ -z "$PYTHON_CMD" ]; then
-        # Loop through all available Python3 commands on the system
-        for CMD in $(compgen -c | grep -E '^python3(\.[0-9]+)?$'); do
-            if command -v "$CMD" &> /dev/null; then
-                PYTHON_VERSION=$($CMD -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")
-                if version_ge "$PYTHON_VERSION" "$REQUIRED_VERSION"; then
-                    PYTHON_CMD="$CMD"
-                    echo "Found suitable Python version: $PYTHON_VERSION ($CMD)"
-                    break
-                fi
-            fi
-        done
-    fi
-
-    # If no suitable Python version found, exit with an error
-    if [ -z "$PYTHON_CMD" ]; then
-        echo "Python 3.10 or higher is required. Please install it."
-        return 1
-    fi
-
-    echo "Using Python command: $PYTHON_CMD"
-
-    # Install falsa and other dependencies
-    echo "Installing falsa..."
-
-    # Set virtual environment directory
-    VIRTUAL_ENV="${PWD}/venv"
-
-    # Create a virtual environment using the detected Python command
-    $PYTHON_CMD -m venv "$VIRTUAL_ENV"
-
-    # Activate the virtual environment and install dependencies
-    source "$VIRTUAL_ENV/bin/activate"
-
-    # Ensure 'falsa' is installed (avoid unnecessary reinstall)
-    pip install --quiet --upgrade falsa
-
     # Create directory if it doesn't exist
     H2O_DIR="${DATA_DIR}/h2o"
     mkdir -p "${H2O_DIR}"
 
     # Generate h2o test data
     echo "Generating h2o test data in ${H2O_DIR} with size=${SIZE} and format=${DATA_FORMAT}"
-    falsa join --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
-
-    # Deactivate virtual environment after completion
-    deactivate
+    uv run falsa join --path-prefix="${H2O_DIR}" --size "${SIZE}" --data-format "${DATA_FORMAT}"
 }
 
 # Runner for h2o groupby benchmark
@@ -1269,7 +1138,7 @@ compare_benchmarks() {
             echo "--------------------"
             echo "Benchmark ${BENCH}"
             echo "--------------------"
-            PATH=$VIRTUAL_ENV/bin:$PATH python3 "${SCRIPT_DIR}"/compare.py $OPTS "${RESULTS_FILE1}" "${RESULTS_FILE2}"
+            uv run python3 "${SCRIPT_DIR}"/compare.py $OPTS "${RESULTS_FILE1}" "${RESULTS_FILE2}"
         else
             echo "Note: Skipping ${RESULTS_FILE1} as ${RESULTS_FILE2} does not exist"
         fi
@@ -1384,10 +1253,6 @@ run_clickbench_sorted() {
         ${QUERY_ARG}
 }
 
-setup_venv() {
-    python3 -m venv "$VIRTUAL_ENV"
-    PATH=$VIRTUAL_ENV/bin:$PATH python3 -m pip install -r requirements.txt
-}
 
 # And start the process up
 main
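
With the Python detection and venv logic removed, the Python-backed paths in `bench.sh` (`data_h2o`, `data_h2o_join`, `run_compile_profile`, `compare_benchmarks`) rely on `uv` being on `PATH`. The commit itself adds no check for this; the guard below is a hypothetical sketch of what such a check might look like, with `require_uv` being a name invented here.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not part of the commit: fail early with a clear message
# when `uv` is missing, instead of failing midway through a benchmark.
require_uv() {
    if ! command -v uv >/dev/null 2>&1; then
        echo "error: 'uv' not found on PATH; it is needed for the Python steps in bench.sh" >&2
        return 1
    fi
}
```

A caller could run `require_uv || exit 1` before any `uv run ...` line, so a missing tool is reported before data generation starts.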

benchmarks/pyproject.toml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+[project]
+name = "datafusion-benchmarks"
+version = "0.1.0"
+requires-python = ">=3.11"
+# typing_extensions is an undeclared dependency of falsa
+dependencies = ["rich", "falsa", "typing_extensions"]

benchmarks/requirements.txt

Lines changed: 0 additions & 18 deletions
This file was deleted.

dev/pyproject.toml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
[project]
2+
name = "datafusion-dev"
3+
version = "0.1.0"
4+
requires-python = ">=3.11"
5+
dependencies = ["tomlkit", "PyGithub", "requests"]

dev/release/README.md

Lines changed: 3 additions & 3 deletions
@@ -178,10 +178,10 @@ We maintain a [changelog] so our users know what has been changed between releas
 
 The changelog is generated using a Python script.
 
-To run the script, you will need a GitHub Personal Access Token (described in the prerequisites section) and the `PyGitHub` library. First install the `PyGitHub` dependency via `pip`:
+To run the script, you will need a GitHub Personal Access Token (described in the prerequisites section) and the `PyGitHub` library. First install the dev dependencies via `uv`:
 
 ```shell
-pip3 install PyGitHub
+uv sync
 ```
 
 To generate the changelog, set the `GITHUB_TOKEN` environment variable and then run `./dev/release/generate-changelog.py`
@@ -199,7 +199,7 @@ to generate a change log of all changes between the `50.3.0` tag and `branch-51`
 
 ```shell
 export GITHUB_TOKEN=<your-token-here>
-./dev/release/generate-changelog.py 50.3.0 branch-51 51.0.0 > dev/changelog/51.0.0.md
+uv run ./dev/release/generate-changelog.py 50.3.0 branch-51 51.0.0 > dev/changelog/51.0.0.md
 ```
 
 This script creates a changelog from GitHub PRs based on the labels associated with them as well as looking for

dev/requirements.txt

Lines changed: 0 additions & 2 deletions
This file was deleted.

dev/update_arrow_deps.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 # Script that updates the arrow dependencies in datafusion locally
 #
 # installation:
-# pip install tomlkit requests
+# uv sync
 #
 # pin all arrow crates deps to a specific version:
 #

dev/update_datafusion_versions.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 # Script that updates versions for datafusion crates, locally
 #
 # dependencies:
-# pip install tomlkit
+# uv sync
 
 import re
 import argparse
