
fix(runtimed): add wall-clock launch deadline and retry for kernel port collisions#1772

Closed
quillaid wants to merge 3 commits into nteract:main from quillaid:fix/kernel-launch-retry

Conversation

@quillaid
Collaborator

Summary

  • Adds a 45-second hard wall-clock deadline around the shell-connect + kernel_info_reply exchange in JupyterKernel::launch(), preventing indefinite hangs when a kernel is alive but unresponsive due to partial ZMQ bind failure.
  • Adds launch_kernel_with_retry() in the runtime agent with a 60-second per-attempt timeout and up to 3 retry attempts, each with fresh port allocation. Both LaunchKernel and RestartKernel handlers use this wrapper.
  • Derives Clone on KernelLaunchConfig and KernelSharedRefs to support retry cloning.

Root cause

Discovered by the gremlin suite during v2.2.0 testing: 19 concurrent pixi notebook launches caused one kernel to hang for 20+ minutes, blocking 10/19 gremlins.

Two interacting bugs:

  1. TOCTOU port race -- peek_ports_with_listeners() reserves 5 ports via TCP listeners, but drop(listeners) releases them before the kernel process binds its ZMQ sockets. Under high concurrency, the OS can reassign a port to another process during this window.

  2. Missing timeout recovery -- When a kernel partially binds (e.g., 4/5 ZMQ ports succeed), it stays alive but broken. The existing select! racing shell.read() (30s timeout) vs died_rx (process death) never fires because neither condition is met: the kernel is alive but unresponsive on the shell channel due to the stolen port.

The specific incident: port 33425 (stdin_port) was stolen during concurrent launch. The kernel started (4/5 ports bound) but the runtime agent hung in the kernel_info_reply check for 20+ minutes until the gremlin suite timed out.

What changed

  • crates/runtimed/src/jupyter_kernel.rs: Wrap shell-connect + kernel_info_reply in tokio::time::timeout(45s). The inner select! still handles the fast path (kernel dies or responds within 30s); the outer timeout catches alive-but-broken kernels.
  • crates/runtimed/src/runtime_agent.rs: New launch_kernel_with_retry(): 60s per-attempt timeout, 3 attempts max, 500ms backoff between retries. Fresh KernelLaunchConfig clone per attempt so port allocation restarts.
  • crates/runtimed/src/kernel_connection.rs: #[derive(Clone)] on KernelLaunchConfig and KernelSharedRefs.
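The two layers compose as a "short inner wait inside a hard outer deadline" pattern. A minimal std-only sketch with hypothetical names (the real code uses tokio::select! and tokio::time::timeout on async ZMQ channels, not a blocking mpsc receiver):

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

// Sketch: inner short wait handles the fast path (reply or channel death),
// while the outer wall-clock deadline bounds the whole exchange even when
// the kernel stays alive but never answers on the shell channel.
fn wait_for_kernel_info(
    reply_rx: &mpsc::Receiver<String>,
    inner_timeout: Duration,
    outer_deadline: Duration,
) -> Result<String, String> {
    let start = Instant::now();
    loop {
        // Outer hard wall-clock deadline: give up no matter what.
        let remaining = match outer_deadline.checked_sub(start.elapsed()) {
            Some(r) => r,
            None => return Err("launch deadline exceeded".into()),
        };
        // Inner wait: fast path when the kernel replies promptly.
        match reply_rx.recv_timeout(inner_timeout.min(remaining)) {
            Ok(reply) => return Ok(reply),
            Err(RecvTimeoutError::Timeout) => continue, // loop until deadline
            Err(RecvTimeoutError::Disconnected) => {
                return Err("kernel died before responding".into())
            }
        }
    }
}
```

The key property is that the outer deadline fires even in the alive-but-broken case, where neither the reply nor the death signal ever arrives.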

Test plan

  • cargo clippy -p runtimed passes with zero warnings
  • Stress test: 10 concurrent pixi notebooks during pool warmup (4/10 pixi envs ready) -- 10/10 passed, 2.3s each
  • Stress test: 15 concurrent pixi notebooks (exceeds pool of 10) -- 15/15 passed
  • Stress test: 10 concurrent pixi notebooks after daemon restart during warmup -- 10/10 passed
  • Zero ZMQ port collisions or retry messages in daemon log across all runs (normal path unaffected)
  • Full gremlin suite pass (pending)

…rt collisions

The gremlin suite discovered a kernel startup hang during v2.2.0 testing:
19 concurrent pixi notebook launches caused one kernel to hang for 20+
minutes, blocking 10/19 gremlins.

Root cause: two interacting bugs in kernel launch:

1. TOCTOU port race — peek_ports_with_listeners() reserves 5 ports via TCP
   listeners, but drop(listeners) releases them before the kernel process
   binds its ZMQ sockets. Under high concurrency another process can steal
   a port during this window.

2. Missing timeout recovery — when a kernel partially binds (e.g., 4/5 ZMQ
   ports succeed), it stays alive but broken. The existing select! racing
   shell.read() (30s timeout) vs died_rx (process death) never fires
   because neither condition is met: the kernel is alive but unresponsive
   on the shell channel due to the stolen port.

Fix (two layers):

Inner (jupyter_kernel.rs): wrap the entire shell-connect + kernel_info_reply
exchange in a 45-second hard wall-clock deadline via tokio::time::timeout.
The inner select! still handles the fast path, but the outer timeout
guarantees we never hang indefinitely when the kernel is alive-but-broken.

Outer (runtime_agent.rs): add launch_kernel_with_retry() with a 60-second
per-attempt timeout and up to 3 retry attempts. Each retry clones a fresh
KernelLaunchConfig so port allocation starts from scratch. Both LaunchKernel
and RestartKernel handlers now use this wrapper.

Supporting (kernel_connection.rs): derive Clone on KernelLaunchConfig and
KernelSharedRefs to support retry cloning.

Stress-tested with 10, 15, and 10 concurrent pixi notebook launches against
the nightly daemon — all passed with zero port collisions and zero retries,
confirming the normal path is unaffected while the safety net is in place.
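The retry wrapper's shape (fresh config clone per attempt, bounded attempts, fixed backoff) can be sketched with hypothetical names; the real wrapper also applies a 60s tokio timeout around each attempt, omitted here for brevity:

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for KernelLaunchConfig; Clone is what the PR derives
// so each attempt can restart port allocation from scratch.
#[derive(Clone)]
struct Config {
    attempt_tag: u32,
}

// Sketch of launch_kernel_with_retry: up to `max_attempts` tries, a fresh
// config clone per attempt, and a short backoff between failures.
fn launch_with_retry<E: std::fmt::Debug>(
    base: &Config,
    max_attempts: u32,
    backoff: Duration,
    mut launch_once: impl FnMut(Config) -> Result<u32, E>,
) -> Result<u32, E> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        // Fresh clone per attempt so port allocation starts over.
        let mut cfg = base.clone();
        cfg.attempt_tag = attempt;
        match launch_once(cfg) {
            Ok(kernel_id) => return Ok(kernel_id),
            Err(e) => {
                last_err = Some(e);
                if attempt < max_attempts {
                    sleep(backoff); // 500ms in the PR
                }
            }
        }
    }
    Err(last_err.expect("max_attempts must be >= 1"))
}
```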
…unch

Code review finding: launch_kernel_with_retry() could leave orphan kernel
processes and detached tokio tasks when a launch attempt failed or timed
out.  When JupyterKernel::launch() errors after spawning the child process
and background tasks, only process_watcher_task was aborted — iopub_task
and stderr_task were leaked.  Worse, when the outer 60s timeout drops the
launch() future entirely, none of the spawned tasks or the child process
are cleaned up because tokio::spawn tasks are detached from the future
that spawned them.

Fix: introduce LaunchGuard, an RAII guard created immediately after the
child process is spawned.  It holds the kernel PID (for direct SIGKILL on
Unix) and AbortHandles for every background task spawned during launch().

- On error return: the guard is dropped, killing the process and aborting
  all tracked tasks.
- On future cancellation (outer timeout): same — Drop fires automatically.
- On success: launch_guard.disarm() is called just before constructing the
  JupyterKernel struct, which takes ownership via its own Drop impl.

The guard uses SIGKILL via PID rather than relying on task abort + Drop
propagation because tokio task abort is asynchronous and the Child
handle's kill_on_drop may not fire synchronously when the owning task is
aborted.

Tracked tasks: stderr reader, process watcher, iopub listener, shell
reader, heartbeat monitor, comm coalesce writer.
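The RAII shape described above can be sketched with std only. In the real guard the cleanup action is SIGKILL via PID plus aborting tokio AbortHandles; here it is simulated with a shared flag so the arm/disarm semantics are visible:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch of LaunchGuard: created immediately after the child is spawned.
// Drop fires on error return AND on future cancellation, because dropping
// the future drops the guard it owns.
struct LaunchGuard {
    cleaned_up: Arc<AtomicBool>, // stands in for kill(pid) + task aborts
    armed: bool,
}

impl LaunchGuard {
    fn new(cleaned_up: Arc<AtomicBool>) -> Self {
        Self { cleaned_up, armed: true }
    }

    // Called on success, just before ownership of process + tasks moves to
    // the JupyterKernel struct (which has its own Drop impl).
    fn disarm(mut self) {
        self.armed = false;
        // `self` drops here with armed == false: no cleanup fires.
    }
}

impl Drop for LaunchGuard {
    fn drop(&mut self) {
        if self.armed {
            // Error path or cancellation: kill process, abort tracked tasks.
            self.cleaned_up.store(true, Ordering::SeqCst);
        }
    }
}
```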
… retry

Only retry on transient errors (timeouts, ZMQ connection failures, port
collisions).  Fail fast on terminal errors that will not succeed on retry:
missing environments, broken env configs, unsupported kernel types, and
early process crashes.

Without this, a kernel that fails because the binary is missing or the env
is broken would retry 3 times over up to 180s before surfacing the error
— a UX regression for common failures.

Terminal error patterns (fail fast, no retry):
- "requires a prepared environment" (uv/conda env not created)
- "has no python binary" (conda env broken)
- "could not resolve conda environment prefix" (env_yml broken)
- "Unsupported kernel type" (config error)
- "Kernel process exited immediately" (binary/env broken)
- "Kernel process died before responding" (kernel crashed)

Everything else defaults to retryable, including the port-collision and
timeout scenarios the retry loop was designed for.
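The classification boils down to a substring match against the terminal patterns listed above, with everything else treated as retryable. A sketch (function name is hypothetical):

```rust
// Terminal error patterns from the commit message: failures that will not
// succeed on retry, so the retry loop should fail fast instead.
const TERMINAL_PATTERNS: &[&str] = &[
    "requires a prepared environment",
    "has no python binary",
    "could not resolve conda environment prefix",
    "Unsupported kernel type",
    "Kernel process exited immediately",
    "Kernel process died before responding",
];

// Default-retryable: anything not matching a terminal pattern (timeouts,
// ZMQ connection failures, port collisions) goes through the retry loop.
fn is_retryable(err_msg: &str) -> bool {
    !TERMINAL_PATTERNS.iter().any(|p| err_msg.contains(p))
}
```

Defaulting to retryable keeps the port-collision and timeout cases covered even if a new transient error message appears later.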

Also downgrades LaunchGuard::Drop log messages from warn! to debug! since
they fire on every expected error bubble-up path and were too loud.
@rgbkrk
Member

rgbkrk commented Apr 15, 2026

Looked at this from a different angle. The 45s deadline + 60s retry + zombie SIGKILL machinery all exist because we're doing port reservation the wrong way: bind 5 TCP listeners, read the port numbers, drop the listeners, then hope the kernel binds before anything else grabs those ports. That's an inherent TOCTOU race — we can make it rarer with retries and tighter timeouts, but we can't close it while we're pre-reserving ports from a different process than the one that ultimately binds.

The deeper fix is to stop using TCP for daemon↔kernel transport. Jupyter's connection file format natively supports "transport": "ipc" with filesystem socket paths. We'd:

  • mkdir -p ~/.cache/runt/kernels/{kernel_id}/ with 0700
  • Fill ConnectionInfo with transport: IPC, ip: <that dir>, and stdin_port=1, control_port=2, … as socket-path suffixes
  • Spawn the kernel — it binds to filesystem sockets we own, no OS port allocation happens at all
  • rm -rf the dir on teardown

No race window, no retries, no deadlines, no zombie sweep. peek_ports_with_listeners goes away. The launch_kernel_with_retry wrapper, LaunchGuard, and outer 45s timeout from this PR become unnecessary. We'd delete more code than we'd add.

Side wins: permission-isolated via fs perms, no firewall surface (not listening on 127.0.0.1), slightly faster than TCP over loopback, and the daemon↔kernel link is always colocated so we lose nothing by dropping cross-machine support.

Support caveat: PyZMQ has supported IPC everywhere for years on Unix; Windows Named Pipes support landed in PyZMQ 22.x (2021). If our bundled Windows build has any issue, the fallback is option 2: write the connection file with port=0 and let the kernel rewrite it after binding (standard Jupyter idiom, adds a "wait for file rewrite" step but still TOCTOU-free).
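For concreteness, a sketch of the endpoint shape this implies. Jupyter clients (e.g. jupyter_client's URL construction) form IPC endpoints as "ipc://{ip}-{port}", so setting ip to a per-kernel path and using small integers as ports yields distinct filesystem sockets with no OS port allocation; the directory path below is illustrative:

```rust
// With "transport": "ipc", the connection file's ip becomes a socket-path
// base and each "port" becomes a suffix: ipc://{ip}-{port}.
fn ipc_endpoint(ip: &str, port: u32) -> String {
    format!("ipc://{ip}-{port}")
}
```

Teardown is then just removing the per-kernel directory, which closes the race window that port pre-reservation leaves open.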

Closing this PR in favor of pursuing the IPC switch. The kernel-hang bug it addresses is real; the fix framing just isn't deep enough.

@rgbkrk rgbkrk closed this Apr 15, 2026
