tests/zedagent: add zedagent integration test suite by eriknordmark · Pull Request #1152 · lf-edge/eden

eriknordmark · 2026-05-04T22:15:38Z

Summary

Adds an Eden testscript suite (tests/zedagent/) that exercises the
zedagent microservice end-to-end against a live EVE instance.

Six test scenarios:

- device_info_completeness: verifies ZInfoDevice contains
  hardware inventory, network adapters, EVE version, and
  data-security-at-rest info; also verifies that a config-item change
  appears in the next publish.
- config_items_and_status: exercises the config-item round-trip
  through parseConfigItems/handleGlobalConfigImpl; deploys and
  deletes an app to cover parseAppInstanceConfig.
- maintenance_mode: sets maintenance.mode=enabled, confirms
  ZInfoDevice.state transitions to ZDEVICE_STATE_MAINTENANCE_MODE,
  then restores normal operation.
- app_metrics_detail: deploys an app with a persistent volume,
  verifies per-app metrics (ZMetricMsg.am) and per-device disk
  metrics (MetricContent.dm.disk), and confirms ZiApp.state:RUNNING
  is published.
- network_instance_info_metrics: creates a local NI, deploys two
  apps on it, verifies NI info (ZiNetworkInstance.networkID) and NI
  metrics (ZMetricMsg.nm.networkID).
- attest_flow: verifies the remote attestation FSM reaches
  ATTEST_STATE_COMPLETE, PCR status is published, and the integrity
  token is persisted. Requires eve.tpm=true; skipped otherwise.

zedagent_test.go provides TestInfo, TestMetric, and TestFlowLog
helpers that the testscripts invoke via the test command.

Every test guards against unexpected device reboots using the standard
watchdog pattern:

! test eden.reboot.test -test.v -timewait=<N>m -reboot=0 -count=1 &

with timewait sized to exceed the worst-case foreground duration of
each test. A pkill cancels the watchdog as soon as the real work is
done so that wait returns promptly. Hard crashes that trigger a reboot
(kernel panic, OOM reboot, watchdog timeout) are caught by the reboot
guard. Soft crashes (process restart without reboot) are caught
indirectly: foreground TestInfo/TestApp steps time out if zedagent
misses a publish after restarting.
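
The control flow of the watchdog pattern can be sketched in plain shell. This is an analogy under stated assumptions, not the suite's testscript code: `run_guarded` is a hypothetical helper, `sleep 60` stands in for `eden.reboot.test`, and the watchdog is cancelled by PID, whereas the testscripts cancel it by name because testscript does not expose the background PID.

```shell
#!/bin/sh
# run_guarded COMMAND [ARGS...]
# Starts a background "watchdog", runs COMMAND in the foreground, then
# cancels the watchdog so that wait returns promptly instead of blocking
# until the watchdog's ceiling. Returns COMMAND's exit status.
run_guarded() {
    # Stand-in watchdog (assumption): its ceiling exceeds the worst-case
    # duration of the foreground work.
    sleep 60 &
    watchdog_pid=$!

    # Foreground work; record its status without aborting under `set -e`.
    work_status=0
    "$@" || work_status=$?

    # Cancel the watchdog and reap it.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    return "$work_status"
}
```

For example, `run_guarded true` returns 0 almost immediately, even though the stand-in watchdog would otherwise keep `wait` blocked for a full minute.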

The suite was validated against a QEMU-based coverage-instrumented EVE
instance. The Eden e2e run achieves 50.6% statement coverage on
cmd/zedagent, versus 10.4% from the existing unit tests.

Test plan

Run against a local QEMU EVE instance:
```
cd tests/zedagent && make test
```
Confirm all six TestEdenScripts/* subtests pass (attest_flow
requires a TPM-enabled device; it self-skips on plain QEMU).

🤖 Generated with Claude Code

Adds an Eden testscript suite (tests/zedagent/) that exercises the zedagent microservice end-to-end against a live EVE instance.

Six test scenarios:

- device_info_completeness: verifies ZInfoDevice contains hardware inventory, network adapters, EVE version, and data-security-at-rest info, and that config-item changes appear in the next publish.
- config_items_and_status: exercises the config-item round-trip through parseConfigItems/handleGlobalConfigImpl; deploys and deletes an app to cover parseAppInstanceConfig.
- maintenance_mode: sets maintenance.mode=enabled, confirms ZInfoDevice.state transitions to ZDEVICE_STATE_MAINTENANCE_MODE, then restores normal operation.
- app_metrics_detail: deploys an app with a persistent volume, verifies per-app metrics (ZMetricMsg.am) and per-device disk metrics (MetricContent.dm.disk), and confirms ZiApp.state:RUNNING is published.
- network_instance_info_metrics: creates a local NI, deploys two apps on it, verifies NI info (ZiNetworkInstance.networkID) and NI metrics (ZMetricMsg.nm.networkID).
- attest_flow: verifies the remote attestation FSM reaches ATTEST_STATE_COMPLETE, PCR status is published, and the integrity token is persisted (requires eve.tpm=true; skipped otherwise).

zedagent_test.go provides TestInfo, TestMetric, and TestFlowLog helpers that the testscripts invoke via the test command.

Every test guards against unexpected device reboots using the standard watchdog pattern:

    ! test eden.reboot.test -test.v -timewait=<N>m -reboot=0 -count=1 &

with timewait sized to exceed the worst-case foreground duration of each test. A pkill cancels the watchdog immediately after the real work is done so that wait returns promptly rather than blocking until the ceiling. Hard crashes that trigger a reboot (kernel panic, OOM reboot, watchdog timeout) are caught by the reboot guard. Soft crashes (process restart without reboot) are caught indirectly: foreground TestInfo/TestApp steps will time out if zedagent misses a publish after restarting.

The suite was validated against a QEMU-based coverage-instrumented EVE instance. The Eden e2e run achieves 50.6% statement coverage on cmd/zedagent, versus 10.4% from the existing unit tests.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace inline `pkill -f` with an embedded kill_watchdog.sh script in all six test files. The old `exec sh -c 'pkill -f ...'` pattern caused pkill to match and kill its own parent sh process before `|| true` could suppress the error, resulting in [signal: terminated]. The embedded script avoids the self-match: the `sh kill_watchdog.sh` process has no `eden.reboot.test` in its cmdline, and pgrep excludes itself.

Also fix the attest_flow skip message: testscript `skip` accepts a single bare-word argument, not a quoted string with spaces.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
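
The self-match described in this commit can be illustrated without touching real processes. The sketch below is a simulation, not the PR's kill_watchdog.sh: `match_pids` is a hypothetical helper that mimics `pkill -f` style matching against a hand-written process table. The PIDs, the `-timewait=20m` value, and the exact command lines are made up for illustration, and a plain substring match stands in for pkill's regular-expression match.

```shell
#!/bin/sh
# match_pids PATTERN SELF_PID
# Reads "PID CMDLINE" lines on stdin and prints the PIDs whose command line
# contains PATTERN, skipping SELF_PID (pkill and pgrep never match themselves).
match_pids() {
    pattern=$1
    self_pid=$2
    while read -r pid cmdline; do
        [ "$pid" = "$self_pid" ] && continue
        case $cmdline in
            *"$pattern"*) printf '%s\n' "$pid" ;;
        esac
    done
}

# Old pattern: the wrapping sh carries the pattern in its own command line.
old_table='100 eden.reboot.test -test.v -timewait=20m -reboot=0 -count=1
200 sh -c pkill -f eden.reboot.test || true
201 pkill -f eden.reboot.test'

# New pattern: the script's command line does not contain the pattern.
new_table='100 eden.reboot.test -test.v -timewait=20m -reboot=0 -count=1
300 sh kill_watchdog.sh
301 pgrep -f eden.reboot.test'

old_hits=$(printf '%s\n' "$old_table" | match_pids eden.reboot.test 201 | paste -s -d ' ' -)
new_hits=$(printf '%s\n' "$new_table" | match_pids eden.reboot.test 301 | paste -s -d ' ' -)

echo "old: $old_hits"   # prints "old: 100 200": watchdog and the parent sh
echo "new: $new_hits"   # prints "new: 100": only the watchdog
```

In the old table the parent `sh -c` process (PID 200) is in the match set, which is why it was signalled before `|| true` could run; in the new table only the watchdog matches.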

Three fixes for the zedagent integration tests:

device_info_completeness: merge the swList.shortVersion check into the same TestInfo call as machineArch/systemAdapter/dataSecAtRestInfo. The original split into two sequential TestInfo calls had a race where the epoch-bump ZInfoDevice arrived before the second subscriber started, causing a 5m timeout.

maintenance_mode: increase the exit-maintenance-mode TestInfo timeout from 5m to 10m. EVE can take longer than 5m to re-populate systemAdapter fields after clearing maintenance mode.

config_items_and_status: add pre-test cleanup (pre_cleanup.sh) to remove any zagent-t1/zagent-n1 resources left by a previous failed run, wait for them to be fully absent, then proceed. Also increase the AppInfo TestInfo timeout from 5m to 10m for the same epoch-race reason. Remove the brittle `stdout 'deploy network ... request sent'` check that breaks when the network already exists.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
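
The "wait for them to be fully absent" step in the pre-test cleanup amounts to polling with a ceiling. A minimal sketch of that idea follows; it is not the contents of pre_cleanup.sh. `wait_until` and `resource_absent` are hypothetical names, and a marker file stands in for the real check against the controller.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds or roughly
# TIMEOUT_SECONDS of sleeping have accumulated (the time spent inside
# COMMAND itself is not counted). Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Hypothetical absence check (assumption): succeeds once the path is gone.
# A real cleanup would query the controller for the app or network instead.
resource_absent() {
    [ ! -e "$1" ]
}
```

A call such as `wait_until 600 10 resource_absent /tmp/zagent-t1.marker` would poll every 10 seconds for up to about 10 minutes; the path is illustrative only.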

Testscript does not support shell-style backslash line continuations. Remove them from device_info_completeness and attest_flow, putting each command on a single line.

For the maintenance_mode exit check, replace the systemAdapter.status.ports.ifname filter with machineArch. After exiting maintenance mode EVE consistently re-populates machineArch (a static field present in every ZInfoDevice) before it re-populates systemAdapter, so this filter reliably catches the first ZInfoDevice published on exit. Also extend the timeout from 10m to 15m.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
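
Since backslash continuations fail only at run time, a small lint can catch them before a test run. The helper below is a sketch and not part of the PR: `find_continuations` is a hypothetical name, and it simply reports lines that end in a backslash (optionally followed by whitespace).

```shell
#!/bin/sh
# find_continuations FILE
# Prints "LINE:content" for every line of FILE that ends in a backslash,
# ignoring trailing whitespace. Exit status is 0 if any such line exists,
# 1 if there are none (grep's usual convention).
find_continuations() {
    grep -n '\\[[:space:]]*$' "$1"
}
```

Running it over each testscript file and treating exit status 0 as a failure would flag any continuation that slipped back in.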

Pod deletion in EVE can take more than 3m (container teardown, volume cleanup). Increase the pre-cleanup wait for zagent-t1 absence from 3m to 10m so that a stale pod from a previous failed run is fully removed before this test re-creates it.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

device_info_completeness: TestInfo with multiple -out flags only outputs the last field's value. Use a single -out for swList.shortVersion (the meaningful stdout check) and verify machineArch via the filter predicate only. Increase timeout to 15m since this consistently takes ~10m.

maintenance_mode: a previous failed run can leave EVE stuck in maintenance mode. Add a pre-cleanup step that resets maintenance.mode to none and waits for a confirming ZInfoDevice before starting the actual test, ensuring we always begin from a known ONLINE state. Bump watchdog to 35m to cover the extra pre-cleanup window.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eriknordmark requested a review from uncleDecart as a code owner May 4, 2026 22:15

eriknordmark force-pushed the zedagent-integration-tests branch from 645a898 to 63c40a8 Compare May 5, 2026 10:41

eriknordmark force-pushed the zedagent-integration-tests branch from 32a6659 to 57f210d Compare May 5, 2026 15:49

eriknordmark and others added 5 commits May 5, 2026 22:23

eriknordmark wants to merge 6 commits into lf-edge:master from eriknordmark:zedagent-integration-tests
