Skip to content
This repository was archived by the owner on Dec 15, 2025. It is now read-only.

Commit cbce123

Browse files
committed
...
1 parent d9961d5 commit cbce123

1 file changed

Lines changed: 366 additions & 0 deletions

File tree

docs/shutdown_improvement_plan.md

Lines changed: 366 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,366 @@
1+
# Zinit Shutdown Functionality Improvement Plan
2+
3+
## Current Issues
4+
5+
1. **Incomplete Child Process Termination**: When services are stopped, child processes may remain running.
6+
2. **Lack of Verification**: There's no verification that all processes are actually terminated.
7+
3. **Improper Graceful Shutdown**: Zinit doesn't wait for all processes to terminate before exiting.
8+
9+
## Solution Overview
10+
11+
We'll implement a robust shutdown mechanism that:
12+
1. Uses our stats functionality to detect all child processes
13+
2. Properly manages process groups
14+
3. Verifies all processes are terminated before Zinit exits
15+
16+
## Implementation Plan
17+
18+
```mermaid
19+
flowchart TD
20+
A[Enhance stop method] --> B[Improve kill_process_tree]
21+
B --> C[Add process verification]
22+
C --> D[Implement graceful shutdown]
23+
24+
A1[Use stats to detect child processes] --> A
25+
A2[Send signals to all processes] --> A
26+
A3[Implement cascading termination] --> A
27+
28+
B1[Ensure proper process group handling] --> B
29+
B2[Add timeout and escalation logic] --> B
30+
31+
C1[Create verification mechanism] --> C
32+
C2[Add polling for process existence] --> C
33+
34+
D1[Wait for all processes to terminate] --> D
35+
D2[Add cleanup of resources] --> D
36+
D3[Implement clean exit] --> D
37+
```
38+
39+
## Detailed Implementation Steps
40+
41+
### 1. Enhance the `stop` Method in `LifecycleManager`
42+
43+
```rust
44+
pub async fn stop<S: AsRef<str>>(&self, name: S) -> Result<()> {
45+
// Get service information
46+
let table = self.services.read().await;
47+
let service = table.get(name.as_ref())
48+
.ok_or_else(|| ZInitError::unknown_service(name.as_ref()))?;
49+
50+
let mut service = service.write().await;
51+
service.set_target(Target::Down);
52+
53+
// Get the main process PID
54+
let pid = service.pid;
55+
if pid.as_raw() == 0 {
56+
return Ok(());
57+
}
58+
59+
// Get the signal to use
60+
let signal = signal::Signal::from_str(&service.service.signal.stop.to_uppercase())
61+
.map_err(|err| anyhow::anyhow!("unknown stop signal: {}", err))?;
62+
63+
// Release the lock before potentially long-running operations
64+
drop(service);
65+
drop(table);
66+
67+
// Get all child processes using our stats functionality
68+
let children = self.get_child_process_stats(pid.as_raw()).await?;
69+
70+
// First try to stop the process group
71+
let _ = self.pm.signal(pid, signal);
72+
73+
// Wait a short time for processes to terminate gracefully
74+
sleep(std::time::Duration::from_millis(500)).await;
75+
76+
// Check if processes are still running and use SIGKILL if needed
77+
self.ensure_processes_terminated(pid.as_raw(), &children).await?;
78+
79+
Ok(())
80+
}
81+
```
82+
83+
### 2. Add a New `ensure_processes_terminated` Method
84+
85+
```rust
86+
async fn ensure_processes_terminated(&self, parent_pid: i32, children: &[ProcessStats]) -> Result<()> {
87+
// Check if parent is still running
88+
let parent_running = self.is_process_running(parent_pid).await?;
89+
90+
// If parent is still running, send SIGKILL
91+
if parent_running {
92+
debug!("Process {} still running after SIGTERM, sending SIGKILL", parent_pid);
93+
let _ = self.pm.signal(Pid::from_raw(parent_pid), signal::Signal::SIGKILL);
94+
}
95+
96+
// Check and kill any remaining child processes
97+
for child in children {
98+
if self.is_process_running(child.pid).await? {
99+
debug!("Child process {} still running, sending SIGKILL", child.pid);
100+
let _ = signal::kill(Pid::from_raw(child.pid), signal::Signal::SIGKILL);
101+
}
102+
}
103+
104+
// Verify all processes are gone
105+
let mut retries = 5;
106+
while retries > 0 {
107+
let mut all_terminated = true;
108+
109+
// Check parent
110+
if self.is_process_running(parent_pid).await? {
111+
all_terminated = false;
112+
}
113+
114+
// Check children
115+
for child in children {
116+
if self.is_process_running(child.pid).await? {
117+
all_terminated = false;
118+
break;
119+
}
120+
}
121+
122+
if all_terminated {
123+
return Ok(());
124+
}
125+
126+
// Wait before retrying
127+
sleep(std::time::Duration::from_millis(100)).await;
128+
retries -= 1;
129+
}
130+
131+
// If we get here, some processes might still be running
132+
warn!("Some processes may still be running after shutdown attempts");
133+
Ok(())
134+
}
135+
```
136+
137+
### 3. Add a Helper Method to Check if a Process is Running
138+
139+
```rust
140+
async fn is_process_running(&self, pid: i32) -> Result<bool> {
141+
// Use sysinfo to check if process exists
142+
let mut system = System::new();
143+
let sys_pid = sysinfo::Pid::from(pid as usize);
144+
system.refresh_process(sys_pid);
145+
146+
Ok(system.process(sys_pid).is_some())
147+
}
148+
```
149+
150+
### 4. Improve the `kill_process_tree` Method
151+
152+
```rust
153+
#[cfg(target_os = "linux")]
154+
async fn kill_process_tree(
155+
&self,
156+
mut dag: ProcessDAG,
157+
mut state_channels: HashMap<String, Watcher<State>>,
158+
mut shutdown_timeouts: HashMap<String, u64>,
159+
) -> Result<()> {
160+
let (tx, mut rx) = mpsc::unbounded_channel();
161+
tx.send(DUMMY_ROOT.into())?;
162+
163+
let mut count = dag.count;
164+
while let Some(name) = rx.recv().await {
165+
debug!("{} has been killed (or was inactive) adding its children", name);
166+
167+
for child in dag.adj.get(&name).unwrap_or(&Vec::new()) {
168+
let child_indegree: &mut u32 = dag.indegree.entry(child.clone()).or_default();
169+
*child_indegree -= 1;
170+
171+
debug!("decrementing child {} indegree to {}", child, child_indegree);
172+
173+
if *child_indegree == 0 {
174+
let watcher = state_channels.remove(child);
175+
if watcher.is_none() {
176+
// not an active service
177+
tx.send(child.to_string())?;
178+
continue;
179+
}
180+
181+
let shutdown_timeout = shutdown_timeouts.remove(child);
182+
let lifecycle = self.clone_lifecycle();
183+
184+
// Spawn a task to kill the service and wait for it to terminate
185+
let kill_task = tokio::spawn(Self::kill_wait_enhanced(
186+
lifecycle,
187+
child.to_string(),
188+
tx.clone(),
189+
watcher.unwrap(),
190+
shutdown_timeout.unwrap_or(config::DEFAULT_SHUTDOWN_TIMEOUT),
191+
));
192+
193+
// Add a timeout to ensure we don't wait forever
194+
let _ = tokio::time::timeout(
195+
std::time::Duration::from_secs(shutdown_timeout.unwrap_or(config::DEFAULT_SHUTDOWN_TIMEOUT) + 2),
196+
kill_task
197+
).await;
198+
}
199+
}
200+
201+
count -= 1;
202+
if count == 0 {
203+
break;
204+
}
205+
}
206+
207+
// Final verification that all processes are gone
208+
self.verify_all_processes_terminated().await?;
209+
210+
Ok(())
211+
}
212+
```
213+
214+
### 5. Add an Enhanced `kill_wait` Method
215+
216+
```rust
217+
#[cfg(target_os = "linux")]
218+
async fn kill_wait_enhanced(
219+
self,
220+
name: String,
221+
ch: mpsc::UnboundedSender<String>,
222+
mut rx: Watcher<State>,
223+
shutdown_timeout: u64,
224+
) {
225+
debug!("kill_wait {}", name);
226+
227+
// Try to stop the service gracefully
228+
let stop_result = self.stop(name.clone()).await;
229+
230+
// Wait for the service to become inactive or timeout
231+
let fut = timeout(
232+
std::time::Duration::from_secs(shutdown_timeout),
233+
async move {
234+
while let Some(state) = rx.next().await {
235+
if !state.is_active() {
236+
return;
237+
}
238+
}
239+
},
240+
);
241+
242+
match stop_result {
243+
Ok(_) => {
244+
let _ = fut.await;
245+
}
246+
Err(e) => error!("couldn't stop service {}: {}", name.clone(), e),
247+
}
248+
249+
// Verify the service is actually stopped
250+
if let Ok(status) = self.status(&name).await {
251+
if status.pid != 0 {
252+
// Service is still running, try to kill it
253+
let _ = self.kill(&name, signal::Signal::SIGKILL).await;
254+
}
255+
}
256+
257+
debug!("sending to the death channel {}", name.clone());
258+
if let Err(e) = ch.send(name.clone()) {
259+
error!(
260+
"error: couldn't send the service {} to the shutdown loop: {}",
261+
name, e
262+
);
263+
}
264+
}
265+
```
266+
267+
### 6. Add a Method to Verify All Processes are Terminated
268+
269+
```rust
270+
async fn verify_all_processes_terminated(&self) -> Result<()> {
271+
// Get all services
272+
let table = self.services.read().await;
273+
274+
// Check each service
275+
for (name, service) in table.iter() {
276+
let service = service.read().await;
277+
let pid = service.pid.as_raw();
278+
279+
// Skip services with no PID
280+
if pid == 0 {
281+
continue;
282+
}
283+
284+
// Check if the main process is still running
285+
if self.is_process_running(pid).await? {
286+
warn!("Service {} (PID {}) is still running after shutdown", name, pid);
287+
288+
// Try to kill it with SIGKILL
289+
let _ = signal::kill(Pid::from_raw(pid), signal::Signal::SIGKILL);
290+
}
291+
292+
// Check for child processes
293+
if let Ok(children) = self.get_child_process_stats(pid).await {
294+
for child in children {
295+
if self.is_process_running(child.pid).await? {
296+
warn!("Child process {} of service {} is still running after shutdown",
297+
child.pid, name);
298+
299+
// Try to kill it with SIGKILL
300+
let _ = signal::kill(Pid::from_raw(child.pid), signal::Signal::SIGKILL);
301+
}
302+
}
303+
}
304+
}
305+
306+
Ok(())
307+
}
308+
```
309+
310+
### 7. Update the `shutdown` and `reboot` Methods
311+
312+
```rust
313+
pub async fn shutdown(&self) -> Result<()> {
314+
info!("shutting down");
315+
316+
// Set the shutdown flag
317+
*self.shutdown.write().await = true;
318+
319+
#[cfg(target_os = "linux")]
320+
{
321+
// Power off using our enhanced method
322+
let result = self.power(RebootMode::RB_POWER_OFF).await;
323+
324+
// Final verification before exit
325+
self.verify_all_processes_terminated().await?;
326+
327+
return result;
328+
}
329+
330+
#[cfg(not(target_os = "linux"))]
331+
{
332+
// Stop all services
333+
let services = self.list().await?;
334+
for service in services {
335+
let _ = self.stop(&service).await;
336+
}
337+
338+
// Verify all processes are terminated
339+
self.verify_all_processes_terminated().await?;
340+
341+
if self.container {
342+
std::process::exit(0);
343+
} else {
344+
info!("System shutdown not supported on this platform");
345+
std::process::exit(0);
346+
}
347+
}
348+
}
349+
```
350+
351+
## Testing Plan
352+
353+
1. **Basic Service Termination**: Test that a simple service is properly terminated
354+
2. **Child Process Termination**: Test that a service with child processes has all processes terminated
355+
3. **Graceful Shutdown**: Test that Zinit exits cleanly after all services are stopped
356+
4. **Edge Cases**:
357+
- Test with services that spawn many child processes
358+
- Test with services that spawn child processes that change their process group
359+
- Test with services that ignore SIGTERM
360+
361+
## Implementation Timeline
362+
363+
1. **Phase 1**: Enhance the `stop` method and add the helper methods (1-2 hours)
364+
2. **Phase 2**: Improve the `kill_process_tree` and `kill_wait` methods (1-2 hours)
365+
3. **Phase 3**: Update the `shutdown` and `reboot` methods (1 hour)
366+
4. **Phase 4**: Testing and debugging (2-3 hours)

0 commit comments

Comments
 (0)