
fix(runtimed): add wall-clock launch deadline and retry for kernel port collisions#1772

Closed
quillaid wants to merge 3 commits into nteract:main from quillaid:fix/kernel-launch-retry

Conversation

@quillaid
Collaborator

Summary

  • Adds a 45-second hard wall-clock deadline around the shell-connect + kernel_info_reply exchange in JupyterKernel::launch(), preventing indefinite hangs when a kernel is alive but unresponsive due to partial ZMQ bind failure.
  • Adds launch_kernel_with_retry() in the runtime agent with a 60-second per-attempt timeout and up to 3 retry attempts, each with fresh port allocation. Both LaunchKernel and RestartKernel handlers use this wrapper.
  • Derives Clone on KernelLaunchConfig and KernelSharedRefs to support retry cloning.

Root cause

Discovered by the gremlin suite during v2.2.0 testing: 19 concurrent pixi notebook launches caused one kernel to hang for 20+ minutes, blocking 10/19 gremlins.

Two interacting bugs:

  1. TOCTOU port race -- peek_ports_with_listeners() reserves 5 ports via TCP listeners, but drop(listeners) releases them before the kernel process binds its ZMQ sockets. Under high concurrency, the OS can reassign a port to another process during this window.

  2. Missing timeout recovery -- When a kernel partially binds (e.g., 4/5 ZMQ ports succeed), it stays alive but broken. The existing select! racing shell.read() (30s timeout) vs died_rx (process death) never fires because neither condition is met: the kernel is alive but unresponsive on the shell channel due to the stolen port.

The specific incident: port 33425 (stdin_port) was stolen during concurrent launch. The kernel started (4/5 ports bound) but the runtime agent hung in the kernel_info_reply check for 20+ minutes until the gremlin suite timed out.

What changed

  • crates/runtimed/src/jupyter_kernel.rs: Wrap shell-connect + kernel_info_reply in tokio::time::timeout(45s). The inner select! still handles the fast path (kernel dies or responds within 30s); the outer timeout catches alive-but-broken kernels.
  • crates/runtimed/src/runtime_agent.rs: New launch_kernel_with_retry(): 60s per-attempt timeout, 3 attempts max, 500ms backoff between retries. Fresh KernelLaunchConfig clone per attempt so port allocation restarts.
  • crates/runtimed/src/kernel_connection.rs: #[derive(Clone)] on KernelLaunchConfig and KernelSharedRefs.
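The two layers compose as a "short inner wait inside a hard outer deadline" pattern. A minimal std-only sketch with hypothetical names (the real code uses tokio::select! and tokio::time::timeout on async ZMQ channels, not a blocking mpsc receiver):

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

// Sketch: inner short wait handles the fast path (reply or channel death),
// while the outer wall-clock deadline bounds the whole exchange even when
// the kernel stays alive but never answers on the shell channel.
fn wait_for_kernel_info(
    reply_rx: &mpsc::Receiver<String>,
    inner_timeout: Duration,
    outer_deadline: Duration,
) -> Result<String, String> {
    let start = Instant::now();
    loop {
        // Outer hard wall-clock deadline: give up no matter what.
        let remaining = match outer_deadline.checked_sub(start.elapsed()) {
            Some(r) => r,
            None => return Err("launch deadline exceeded".into()),
        };
        // Inner wait: fast path when the kernel replies promptly.
        match reply_rx.recv_timeout(inner_timeout.min(remaining)) {
            Ok(reply) => return Ok(reply),
            Err(RecvTimeoutError::Timeout) => continue, // loop until deadline
            Err(RecvTimeoutError::Disconnected) => {
                return Err("kernel died before responding".into())
            }
        }
    }
}
```

The key property is that the outer deadline fires even in the alive-but-broken case, where neither the reply nor the death signal ever arrives.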

Test plan

  • cargo clippy -p runtimed passes with zero warnings
  • Stress test: 10 concurrent pixi notebooks during pool warmup (4/10 pixi envs ready) -- 10/10 passed, 2.3s each
  • Stress test: 15 concurrent pixi notebooks (exceeds pool of 10) -- 15/15 passed
  • Stress test: 10 concurrent pixi notebooks after daemon restart during warmup -- 10/10 passed
  • Zero ZMQ port collisions or retry messages in daemon log across all runs (normal path unaffected)
  • Full gremlin suite pass (pending)

…rt collisions

The gremlin suite discovered a kernel startup hang during v2.2.0 testing:
19 concurrent pixi notebook launches caused one kernel to hang for 20+
minutes, blocking 10/19 gremlins.

Root cause: two interacting bugs in kernel launch:

1. TOCTOU port race — peek_ports_with_listeners() reserves 5 ports via TCP
   listeners, but drop(listeners) releases them before the kernel process
   binds its ZMQ sockets. Under high concurrency another process can steal
   a port during this window.

2. Missing timeout recovery — when a kernel partially binds (e.g., 4/5 ZMQ
   ports succeed), it stays alive but broken. The existing select! racing
   shell.read() (30s timeout) vs died_rx (process death) never fires
   because neither condition is met: the kernel is alive but unresponsive
   on the shell channel due to the stolen port.

Fix (two layers):

Inner (jupyter_kernel.rs): wrap the entire shell-connect + kernel_info_reply
exchange in a 45-second hard wall-clock deadline via tokio::time::timeout.
The inner select! still handles the fast path, but the outer timeout
guarantees we never hang indefinitely when the kernel is alive-but-broken.

Outer (runtime_agent.rs): add launch_kernel_with_retry() with a 60-second
per-attempt timeout and up to 3 retry attempts. Each retry clones a fresh
KernelLaunchConfig so port allocation starts from scratch. Both LaunchKernel
and RestartKernel handlers now use this wrapper.

Supporting (kernel_connection.rs): derive Clone on KernelLaunchConfig and
KernelSharedRefs to support retry cloning.

Stress-tested with 10, 15, and 10 concurrent pixi notebook launches against
the nightly daemon — all passed with zero port collisions and zero retries,
confirming the normal path is unaffected while the safety net is in place.
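The retry wrapper's shape (fresh config clone per attempt, bounded attempts, fixed backoff) can be sketched with hypothetical names; the real wrapper also applies a 60s tokio timeout around each attempt, omitted here for brevity:

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for KernelLaunchConfig; Clone is what the PR derives
// so each attempt can restart port allocation from scratch.
#[derive(Clone)]
struct Config {
    attempt_tag: u32,
}

// Sketch of launch_kernel_with_retry: up to `max_attempts` tries, a fresh
// config clone per attempt, and a short backoff between failures.
fn launch_with_retry<E: std::fmt::Debug>(
    base: &Config,
    max_attempts: u32,
    backoff: Duration,
    mut launch_once: impl FnMut(Config) -> Result<u32, E>,
) -> Result<u32, E> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        // Fresh clone per attempt so port allocation starts over.
        let mut cfg = base.clone();
        cfg.attempt_tag = attempt;
        match launch_once(cfg) {
            Ok(kernel_id) => return Ok(kernel_id),
            Err(e) => {
                last_err = Some(e);
                if attempt < max_attempts {
                    sleep(backoff); // 500ms in the PR
                }
            }
        }
    }
    Err(last_err.expect("max_attempts must be >= 1"))
}
```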
…unch

Code review finding: launch_kernel_with_retry() could leave orphan kernel
processes and detached tokio tasks when a launch attempt failed or timed
out.  When JupyterKernel::launch() errors after spawning the child process
and background tasks, only process_watcher_task was aborted — iopub_task
and stderr_task were leaked.  Worse, when the outer 60s timeout drops the
launch() future entirely, none of the spawned tasks or the child process
are cleaned up because tokio::spawn tasks are detached from the future
that spawned them.

Fix: introduce LaunchGuard, an RAII guard created immediately after the
child process is spawned.  It holds the kernel PID (for direct SIGKILL on
Unix) and AbortHandles for every background task spawned during launch().

- On error return: the guard is dropped, killing the process and aborting
  all tracked tasks.
- On future cancellation (outer timeout): same — Drop fires automatically.
- On success: launch_guard.disarm() is called just before constructing the
  JupyterKernel struct, which takes ownership via its own Drop impl.

The guard uses SIGKILL via PID rather than relying on task abort + Drop
propagation because tokio task abort is asynchronous and the Child
handle's kill_on_drop may not fire synchronously when the owning task is
aborted.

Tracked tasks: stderr reader, process watcher, iopub listener, shell
reader, heartbeat monitor, comm coalesce writer.
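The RAII shape described above can be sketched with std only. In the real guard the cleanup action is SIGKILL via PID plus aborting tokio AbortHandles; here it is simulated with a shared flag so the arm/disarm semantics are visible:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch of LaunchGuard: created immediately after the child is spawned.
// Drop fires on error return AND on future cancellation, because dropping
// the future drops the guard it owns.
struct LaunchGuard {
    cleaned_up: Arc<AtomicBool>, // stands in for kill(pid) + task aborts
    armed: bool,
}

impl LaunchGuard {
    fn new(cleaned_up: Arc<AtomicBool>) -> Self {
        Self { cleaned_up, armed: true }
    }

    // Called on success, just before ownership of process + tasks moves to
    // the JupyterKernel struct (which has its own Drop impl).
    fn disarm(mut self) {
        self.armed = false;
        // `self` drops here with armed == false: no cleanup fires.
    }
}

impl Drop for LaunchGuard {
    fn drop(&mut self) {
        if self.armed {
            // Error path or cancellation: kill process, abort tracked tasks.
            self.cleaned_up.store(true, Ordering::SeqCst);
        }
    }
}
```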
… retry

Only retry on transient errors (timeouts, ZMQ connection failures, port
collisions).  Fail fast on terminal errors that will not succeed on retry:
missing environments, broken env configs, unsupported kernel types, and
early process crashes.

Without this, a kernel that fails because the binary is missing or the env
is broken would retry 3 times over up to 180s before surfacing the error
— a UX regression for common failures.

Terminal error patterns (fail fast, no retry):
- "requires a prepared environment" (uv/conda env not created)
- "has no python binary" (conda env broken)
- "could not resolve conda environment prefix" (env_yml broken)
- "Unsupported kernel type" (config error)
- "Kernel process exited immediately" (binary/env broken)
- "Kernel process died before responding" (kernel crashed)

Everything else defaults to retryable, including the port-collision and
timeout scenarios the retry loop was designed for.
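The classification boils down to a substring match against the terminal patterns listed above, with everything else treated as retryable. A sketch (function name is hypothetical):

```rust
// Terminal error patterns from the commit message: failures that will not
// succeed on retry, so the retry loop should fail fast instead.
const TERMINAL_PATTERNS: &[&str] = &[
    "requires a prepared environment",
    "has no python binary",
    "could not resolve conda environment prefix",
    "Unsupported kernel type",
    "Kernel process exited immediately",
    "Kernel process died before responding",
];

// Default-retryable: anything not matching a terminal pattern (timeouts,
// ZMQ connection failures, port collisions) goes through the retry loop.
fn is_retryable(err_msg: &str) -> bool {
    !TERMINAL_PATTERNS.iter().any(|p| err_msg.contains(p))
}
```

Defaulting to retryable keeps the port-collision and timeout cases covered even if a new transient error message appears later.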

Also downgrades LaunchGuard::Drop log messages from warn! to debug! since
they fire on every expected error bubble-up path and were too loud.
@rgbkrk
Member

rgbkrk commented Apr 15, 2026

Looked at this from a different angle. The 45s deadline + 60s retry + zombie SIGKILL machinery all exist because we're doing port reservation the wrong way: bind 5 TCP listeners, read the port numbers, drop the listeners, then hope the kernel binds before anything else grabs those ports. That's an inherent TOCTOU race — we can make it rarer with retries and tighter timeouts, but we can't close it while we're pre-reserving ports from a different process than the one that ultimately binds.

The deeper fix is to stop using TCP for daemon↔kernel transport. Jupyter's connection file format natively supports "transport": "ipc" with filesystem socket paths. We'd:

  • mkdir -p ~/.cache/runt/kernels/{kernel_id}/ with 0700
  • Fill ConnectionInfo with transport: IPC, ip: <that dir>, and stdin_port=1, control_port=2, … as socket-path suffixes
  • Spawn the kernel — it binds to filesystem sockets we own, no OS port allocation happens at all
  • rm -rf the dir on teardown

No race window, no retries, no deadlines, no zombie sweep. peek_ports_with_listeners goes away. The launch_kernel_with_retry wrapper, LaunchGuard, and outer 45s timeout from this PR become unnecessary. We'd delete more code than we'd add.

Side wins: permission-isolated via fs perms, no firewall surface (not listening on 127.0.0.1), slightly faster than TCP over loopback, and the daemon↔kernel link is always colocated so we lose nothing by dropping cross-machine support.

Support caveat: PyZMQ has supported IPC everywhere for years on Unix; Windows Named Pipes support landed in PyZMQ 22.x (2021). If our bundled Windows build has any issue, the fallback is option 2: write the connection file with port=0 and let the kernel rewrite it after binding (standard Jupyter idiom, adds a "wait for file rewrite" step but still TOCTOU-free).
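For concreteness, a sketch of the endpoint shape this implies. Jupyter clients (e.g. jupyter_client's URL construction) form IPC endpoints as "ipc://{ip}-{port}", so setting ip to a per-kernel path and using small integers as ports yields distinct filesystem sockets with no OS port allocation; the directory path below is illustrative:

```rust
// With "transport": "ipc", the connection file's ip becomes a socket-path
// base and each "port" becomes a suffix: ipc://{ip}-{port}.
fn ipc_endpoint(ip: &str, port: u32) -> String {
    format!("ipc://{ip}-{port}")
}
```

Teardown is then just removing the per-kernel directory, which closes the race window that port pre-reservation leaves open.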

Closing this PR in favor of pursuing the IPC switch. The kernel-hang bug it addresses is real; the fix framing just isn't deep enough.

@rgbkrk rgbkrk closed this Apr 15, 2026
