fix(runtimed): add wall-clock launch deadline and retry for kernel port collisions #1772
quillaid wants to merge 3 commits into nteract:main
Conversation
…rt collisions

The gremlin suite discovered a kernel startup hang during v2.2.0 testing: 19 concurrent pixi notebook launches caused one kernel to hang for 20+ minutes, blocking 10/19 gremlins.

Root cause: two interacting bugs in kernel launch:

1. TOCTOU port race -- peek_ports_with_listeners() reserves 5 ports via TCP listeners, but drop(listeners) releases them before the kernel process binds its ZMQ sockets. Under high concurrency another process can steal a port during this window.
2. Missing timeout recovery -- when a kernel partially binds (e.g., 4/5 ZMQ ports succeed), it stays alive but broken. The existing select! racing shell.read() (30s timeout) vs died_rx (process death) never fires because neither condition is met: the kernel is alive but unresponsive on the shell channel due to the stolen port.

Fix (two layers):

- Inner (jupyter_kernel.rs): wrap the entire shell-connect + kernel_info_reply exchange in a 45-second hard wall-clock deadline via tokio::time::timeout. The inner select! still handles the fast path, but the outer timeout guarantees we never hang indefinitely when the kernel is alive but broken.
- Outer (runtime_agent.rs): add launch_kernel_with_retry() with a 60-second per-attempt timeout and up to 3 retry attempts. Each retry clones a fresh KernelLaunchConfig so port allocation starts from scratch. Both LaunchKernel and RestartKernel handlers now use this wrapper.
- Supporting (kernel_connection.rs): derive Clone on KernelLaunchConfig and KernelSharedRefs to support retry cloning.

Stress-tested with 10, 15, and 10 concurrent pixi notebook launches against the nightly daemon -- all passed with zero port collisions and zero retries, confirming the normal path is unaffected while the safety net is in place.
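The inner fix is a hard wall-clock deadline around the whole handshake. A minimal std-only sketch of the same guarantee (the real code uses `tokio::time::timeout` around an async `select!`; the function name and channel model here are illustrative assumptions):

```rust
use std::sync::mpsc;
use std::time::Duration;

// Sketch, not runtimed code: recv_timeout models an outer wall-clock
// deadline that fires even when neither inner condition (reply arrives /
// process dies) ever does -- the alive-but-broken kernel case.
fn wait_for_kernel_info(
    shell_rx: &mpsc::Receiver<String>,
    deadline: Duration,
) -> Result<String, String> {
    match shell_rx.recv_timeout(deadline) {
        // Fast path: kernel responded on the shell channel.
        Ok(reply) => Ok(reply),
        // Alive-but-broken kernel: nothing arrives, hard deadline fires.
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err("kernel alive but unresponsive: launch deadline exceeded".into())
        }
        // Process death: sender side dropped.
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err("kernel process died before responding".into())
        }
    }
}

fn main() {
    // Alive-but-broken: a sender exists but never sends.
    let (_tx, rx) = mpsc::channel::<String>();
    assert!(wait_for_kernel_info(&rx, Duration::from_millis(50)).is_err());

    // Healthy: reply already queued, fast path wins.
    let (tx, rx) = mpsc::channel();
    tx.send("kernel_info_reply".to_string()).unwrap();
    assert!(wait_for_kernel_info(&rx, Duration::from_millis(50)).is_ok());
}
```

The key property is that the deadline is independent of any event the kernel must produce, so it cannot be starved by a half-bound kernel.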
…unch

Code review finding: launch_kernel_with_retry() could leave orphan kernel processes and detached tokio tasks when a launch attempt failed or timed out. When JupyterKernel::launch() errors after spawning the child process and background tasks, only process_watcher_task was aborted -- iopub_task and stderr_task were leaked. Worse, when the outer 60s timeout drops the launch() future entirely, none of the spawned tasks or the child process are cleaned up, because tokio::spawn tasks are detached from the future that spawned them.

Fix: introduce LaunchGuard, an RAII guard created immediately after the child process is spawned. It holds the kernel PID (for direct SIGKILL on Unix) and AbortHandles for every background task spawned during launch().

- On error return: the guard is dropped, killing the process and aborting all tracked tasks.
- On future cancellation (outer timeout): same -- Drop fires automatically.
- On success: launch_guard.disarm() is called just before constructing the JupyterKernel struct, which takes ownership via its own Drop impl.

The guard uses SIGKILL via PID rather than relying on task abort + Drop propagation because tokio task abort is asynchronous, and the Child handle's kill_on_drop may not fire synchronously when the owning task is aborted.

Tracked tasks: stderr reader, process watcher, iopub listener, shell reader, heartbeat monitor, comm coalesce writer.
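The shape of the guard can be sketched in std-only form. This is a simplified model, not the runtimed implementation: cleanup actions stand in for the real PID SIGKILL and tokio AbortHandles, and all names are assumptions:

```rust
// RAII guard model: runs its cleanup actions on Drop unless disarmed
// on the success path. Drop fires on both early error returns and on
// future cancellation, which is the whole point.
struct LaunchGuard {
    cleanups: Vec<Box<dyn FnOnce()>>,
    armed: bool,
}

impl LaunchGuard {
    fn new() -> Self {
        LaunchGuard { cleanups: Vec::new(), armed: true }
    }

    /// Register a cleanup action (stand-in for "SIGKILL the kernel PID"
    /// or "abort this background task").
    fn track(&mut self, cleanup: impl FnOnce() + 'static) {
        self.cleanups.push(Box::new(cleanup));
    }

    /// Success path: disable cleanup just before handing ownership of the
    /// process and tasks to the constructed kernel struct's own Drop impl.
    fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for LaunchGuard {
    fn drop(&mut self) {
        if self.armed {
            for cleanup in self.cleanups.drain(..) {
                cleanup();
            }
        }
    }
}

fn main() {
    use std::cell::Cell;
    use std::rc::Rc;

    // Error/cancellation path: guard dropped while armed -> cleanup fires.
    let killed = Rc::new(Cell::new(false));
    {
        let mut guard = LaunchGuard::new();
        let k = killed.clone();
        guard.track(move || k.set(true));
    }
    assert!(killed.get());

    // Success path: disarm() suppresses cleanup.
    let killed = Rc::new(Cell::new(false));
    let mut guard = LaunchGuard::new();
    let k = killed.clone();
    guard.track(move || k.set(true));
    guard.disarm();
    assert!(!killed.get());
}
```

Because Drop runs when the owning future is dropped, the same mechanism covers both the explicit error return and the outer-timeout cancellation without any extra bookkeeping at the call sites.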
… retry

Only retry on transient errors (timeouts, ZMQ connection failures, port collisions). Fail fast on terminal errors that will not succeed on retry: missing environments, broken env configs, unsupported kernel types, and early process crashes. Without this, a kernel that fails because the binary is missing or the env is broken would retry 3 times over up to 180s before surfacing the error -- a UX regression for common failures.

Terminal error patterns (fail fast, no retry):

- "requires a prepared environment" (uv/conda env not created)
- "has no python binary" (conda env broken)
- "could not resolve conda environment prefix" (env_yml broken)
- "Unsupported kernel type" (config error)
- "Kernel process exited immediately" (binary/env broken)
- "Kernel process died before responding" (kernel crashed)

Everything else defaults to retryable, including the port-collision and timeout scenarios the retry loop was designed for.

Also downgrades LaunchGuard::Drop log messages from warn! to debug!, since they fire on every expected error bubble-up path and were too loud.
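The policy above can be sketched as a substring classifier plus a retry loop that short-circuits on terminal errors. Function names, the backoff constant, and the synchronous shape are assumptions for illustration (the real wrapper is async with a per-attempt timeout):

```rust
use std::time::Duration;

// Terminal patterns from the commit message; anything not matching one
// of these defaults to retryable.
fn is_terminal_error(msg: &str) -> bool {
    const TERMINAL_PATTERNS: &[&str] = &[
        "requires a prepared environment",
        "has no python binary",
        "could not resolve conda environment prefix",
        "Unsupported kernel type",
        "Kernel process exited immediately",
        "Kernel process died before responding",
    ];
    TERMINAL_PATTERNS.iter().any(|p| msg.contains(p))
}

// Model of the retry wrapper: up to 3 attempts with backoff, but
// terminal errors surface immediately instead of burning up to 180s.
fn launch_with_retry<F>(mut launch: F) -> Result<u32, String>
where
    F: FnMut() -> Result<u32, String>,
{
    let mut last_err = String::new();
    for attempt in 1..=3 {
        match launch() {
            Ok(kernel_id) => return Ok(kernel_id),
            Err(e) if is_terminal_error(&e) => return Err(e), // fail fast
            Err(e) => {
                last_err = e;
                if attempt < 3 {
                    std::thread::sleep(Duration::from_millis(50)); // 500ms in the description above
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    // Transient failure: retried until it succeeds.
    let mut calls = 0;
    let res = launch_with_retry(|| {
        calls += 1;
        if calls < 3 { Err("zmq connect failed".into()) } else { Ok(7) }
    });
    assert_eq!(res, Ok(7));
    assert_eq!(calls, 3);

    // Terminal failure: exactly one attempt, error surfaces immediately.
    let mut calls = 0;
    let res = launch_with_retry(|| {
        calls += 1;
        Err("Unsupported kernel type: deno".into())
    });
    assert!(res.is_err());
    assert_eq!(calls, 1);
}
```

Defaulting unknown errors to retryable is the conservative choice here: a wrongly retried terminal error costs latency, while a wrongly terminal-classified transient error would reintroduce the original hang-to-failure behavior.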
Looked at this from a different angle. The 45s deadline + 60s retry + zombie SIGKILL machinery all exist because we're doing port reservation the wrong way: bind 5 TCP listeners, read the port numbers, drop the listeners, then hope the kernel binds before anything else grabs those ports. That's an inherent TOCTOU race -- retries and tighter timeouts make it rarer, but we can't close it while the ports are pre-reserved by a different process than the one that ultimately binds them. The deeper fix is to stop using TCP for daemon↔kernel transport: Jupyter's connection file format natively supports an `ipc` transport alongside `tcp`.
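The reserve-then-release window is easy to demonstrate with std networking alone. A small sketch (illustrative, not runtimed code):

```rust
use std::net::TcpListener;

// Bind to learn a free port, then drop the listener: the "reservation"
// ends at the drop, which is exactly the window between
// peek_ports_with_listeners() and the kernel's ZMQ bind.
fn reserve_then_drop() -> u16 {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let port = listener.local_addr().expect("addr").port();
    drop(listener); // the "reserved" port is now up for grabs
    port
}

fn main() {
    let port = reserve_then_drop();
    // A second bind (standing in for another concurrent launch) succeeds:
    // nothing about the reservation survives the drop.
    let thief = TcpListener::bind(("127.0.0.1", port));
    assert!(thief.is_ok());
}
```

Under 19 concurrent launches, each launch is running this reserve/drop dance for 5 ports at once, so a steal in the window is a matter of time, not bad luck.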
No race window, no retries, no deadlines, no zombie sweep. Side wins: permission isolation via filesystem perms, no firewall surface (nothing listening on 127.0.0.1), slightly faster than TCP over loopback, and the daemon↔kernel link is always colocated, so we lose nothing by dropping cross-machine support.

Support caveat: PyZMQ has supported IPC on Unix for years; Windows Named Pipes support landed in PyZMQ 22.x (2021). If our bundled Windows build has any issue, the fallback is option 2: write the connection file with …

Closing this PR in favor of pursuing the IPC switch. The kernel-hang bug it addresses is real; the fix framing just isn't deep enough.
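For concreteness, endpoint construction under the proposed switch might look like this. The helper name and path layout are assumptions, not runtimed code; the URL shapes follow the Jupyter connection-file convention, where jupyter_client forms IPC URLs as `ipc://{ip}-{port}` with `ip` holding a base path and `port` remaining a numeric suffix:

```rust
// With "transport": "ipc" in the connection file, ZMQ endpoints are
// filesystem paths rather than host:port pairs, so there is no port
// for a concurrent launch to steal.
fn zmq_endpoint(transport: &str, base: &str, port: u32) -> String {
    match transport {
        "tcp" => format!("tcp://{base}:{port}"),
        "ipc" => format!("ipc://{base}-{port}"),
        other => panic!("unsupported transport: {other}"),
    }
}

fn main() {
    assert_eq!(
        zmq_endpoint("ipc", "/run/runtimed/kernel-ab12", 1),
        "ipc:///run/runtimed/kernel-ab12-1"
    );
    assert_eq!(
        zmq_endpoint("tcp", "127.0.0.1", 53001),
        "tcp://127.0.0.1:53001"
    );
}
```

The daemon creates the socket directory before spawning the kernel, so the only process that ever touches an endpoint is the one that binds it.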
Summary
- Wraps the shell-connect + `kernel_info_reply` exchange in `JupyterKernel::launch()` in a 45-second wall-clock deadline, preventing indefinite hangs when a kernel is alive but unresponsive due to partial ZMQ bind failure.
- Adds `launch_kernel_with_retry()` in the runtime agent, with a 60-second per-attempt timeout and up to 3 retry attempts, each with fresh port allocation. Both `LaunchKernel` and `RestartKernel` handlers use this wrapper.
- Derives `Clone` on `KernelLaunchConfig` and `KernelSharedRefs` to support retry cloning.

Root cause
Discovered by the gremlin suite during v2.2.0 testing: 19 concurrent pixi notebook launches caused one kernel to hang for 20+ minutes, blocking 10/19 gremlins.
Two interacting bugs:
1. TOCTOU port race -- `peek_ports_with_listeners()` reserves 5 ports via TCP listeners, but `drop(listeners)` releases them before the kernel process binds its ZMQ sockets. Under high concurrency, the OS can reassign a port to another process during this window.
2. Missing timeout recovery -- when a kernel partially binds (e.g., 4/5 ZMQ ports succeed), it stays alive but broken. The existing `select!` racing `shell.read()` (30s timeout) vs `died_rx` (process death) never fires because neither condition is met: the kernel is alive but unresponsive on the shell channel due to the stolen port.

The specific incident: port 33425 (`stdin_port`) was stolen during a concurrent launch. The kernel started (4/5 ports bound) but the runtime agent hung in the `kernel_info_reply` check for 20+ minutes until the gremlin suite timed out.

What changed
- `crates/runtimed/src/jupyter_kernel.rs` -- wraps the shell-connect + `kernel_info_reply` exchange in `tokio::time::timeout` (45s). The inner `select!` still handles the fast path (kernel dies or responds within 30s); the outer timeout catches alive-but-broken kernels.
- `crates/runtimed/src/runtime_agent.rs` -- adds `launch_kernel_with_retry()`: 60s per-attempt timeout, 3 attempts max, 500ms backoff between retries. A fresh `KernelLaunchConfig` clone per attempt so port allocation restarts.
- `crates/runtimed/src/kernel_connection.rs` -- adds `#[derive(Clone)]` on `KernelLaunchConfig` and `KernelSharedRefs`.

Test plan
- `cargo clippy -p runtimed` passes with zero warnings