Skip to content

tests/zedagent: add zedagent integration test suite#1152

Open
eriknordmark wants to merge 6 commits intolf-edge:masterfrom
eriknordmark:zedagent-integration-tests
Open

tests/zedagent: add zedagent integration test suite#1152
eriknordmark wants to merge 6 commits intolf-edge:masterfrom
eriknordmark:zedagent-integration-tests

Conversation

@eriknordmark
Copy link
Copy Markdown
Contributor

@eriknordmark eriknordmark commented May 4, 2026

Summary

Adds an Eden testscript suite (tests/zedagent/) that exercises the
zedagent microservice end-to-end against a live EVE instance.

Six test scenarios:

  • device_info_completeness – verifies ZInfoDevice contains
    hardware inventory, network adapters, EVE version, and
    data-security-at-rest info; also verifies that a config-item change
    appears in the next publish.
  • config_items_and_status – exercises the config-item round-trip
    through parseConfigItems/handleGlobalConfigImpl; deploys and
    deletes an app to cover parseAppInstanceConfig.
  • maintenance_mode – sets maintenance.mode=enabled, confirms
    ZInfoDevice.state transitions to ZDEVICE_STATE_MAINTENANCE_MODE,
    then restores normal operation.
  • app_metrics_detail – deploys an app with a persistent volume,
    verifies per-app metrics (ZMetricMsg.am) and per-device disk
    metrics (MetricContent.dm.disk), and confirms ZiApp.state:RUNNING
    is published.
  • network_instance_info_metrics – creates a local NI, deploys two
    apps on it, verifies NI info (ZiNetworkInstance.networkID) and NI
    metrics (ZMetricMsg.nm.networkID).
  • attest_flow – verifies the remote attestation FSM reaches
    ATTEST_STATE_COMPLETE, PCR status is published, and the integrity
    token is persisted. Requires eve.tpm=true; skipped otherwise.

zedagent_test.go provides TestInfo, TestMetric, and TestFlowLog
helpers that the testscripts invoke via the test command.

Every test guards against unexpected device reboots using the standard
watchdog pattern:

! test eden.reboot.test -test.v -timewait=<N>m -reboot=0 -count=1 &

with timewait sized to exceed the worst-case foreground duration of
each test. A pkill cancels the watchdog as soon as the real work is
done so that wait returns promptly. Hard crashes that trigger a reboot
(kernel panic, OOM reboot, watchdog timeout) are caught by the reboot
guard. Soft crashes (process restart without reboot) are caught
indirectly: foreground TestInfo/TestApp steps time out if zedagent
misses a publish after restarting.

The suite was validated against a QEMU-based coverage-instrumented EVE
instance. The Eden e2e run achieves 50.6% statement coverage on
cmd/zedagent, versus 10.4% from the existing unit tests.

Test plan

  • Run against a local QEMU EVE instance:
    cd tests/zedagent && make test
    
  • Confirm all six TestEdenScripts/* subtests pass (attest_flow
    requires a TPM-enabled device; it self-skips on plain QEMU).

🤖 Generated with Claude Code

@eriknordmark eriknordmark requested a review from uncleDecart as a code owner May 4, 2026 22:15
@eriknordmark eriknordmark force-pushed the zedagent-integration-tests branch from 645a898 to 63c40a8 Compare May 5, 2026 10:41
Adds an Eden testscript suite (tests/zedagent/) that exercises the
zedagent microservice end-to-end against a live EVE instance.

Six test scenarios:

- device_info_completeness: verifies ZInfoDevice contains hardware
  inventory, network adapters, EVE version, and data-security-at-rest
  info, and that config-item changes appear in the next publish.
- config_items_and_status: exercises the config-item round-trip through
  parseConfigItems/handleGlobalConfigImpl; deploys and deletes an app
  to cover parseAppInstanceConfig.
- maintenance_mode: sets maintenance.mode=enabled, confirms
  ZInfoDevice.state transitions to ZDEVICE_STATE_MAINTENANCE_MODE,
  then restores normal operation.
- app_metrics_detail: deploys an app with a persistent volume, verifies
  per-app metrics (ZMetricMsg.am) and per-device disk metrics
  (MetricContent.dm.disk), and confirms ZiApp.state:RUNNING is
  published.
- network_instance_info_metrics: creates a local NI, deploys two apps
  on it, verifies NI info (ZiNetworkInstance.networkID) and NI metrics
  (ZMetricMsg.nm.networkID).
- attest_flow: verifies the remote attestation FSM reaches
  ATTEST_STATE_COMPLETE, PCR status is published, and the integrity
  token is persisted (requires eve.tpm=true; skipped otherwise).

zedagent_test.go provides TestInfo, TestMetric, and TestFlowLog helpers
that the testscripts invoke via the test command.

Every test guards against unexpected device reboots using the standard
watchdog pattern:

  ! test eden.reboot.test -test.v -timewait=<N>m -reboot=0 -count=1 &

with timewait sized to exceed the worst-case foreground duration of each
test. A pkill cancels the watchdog immediately after the real work is
done so that wait returns promptly rather than blocking until the ceiling.

Hard crashes that trigger a reboot (kernel panic, OOM reboot, watchdog
timeout) are caught by the reboot guard. Soft crashes (process restart
without reboot) are caught indirectly: foreground TestInfo/TestApp steps
will time out if zedagent misses a publish after restarting.

The suite was validated against a QEMU-based coverage-instrumented EVE
instance. The Eden e2e run achieves 50.6% statement coverage on
cmd/zedagent, versus 10.4% from the existing unit tests.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eriknordmark eriknordmark force-pushed the zedagent-integration-tests branch from 32a6659 to 57f210d Compare May 5, 2026 15:49
eriknordmark and others added 5 commits May 5, 2026 22:23
Replace inline `pkill -f` with an embedded kill_watchdog.sh script
in all six test files. The old `exec sh -c 'pkill -f ...'` pattern
caused pkill to match and kill its own parent sh process before
`|| true` could suppress the error, resulting in [signal: terminated].
The embedded script avoids the self-match: the `sh kill_watchdog.sh`
process has no `eden.reboot.test` in its cmdline, and pgrep excludes
itself.

Also fix the attest_flow skip message: testscript `skip` accepts a
single bare-word argument, not a quoted string with spaces.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes for the zedagent integration tests:

device_info_completeness: merge the swList.shortVersion check into
the same TestInfo call as machineArch/systemAdapter/dataSecAtRestInfo.
The original split into two sequential TestInfo calls had a race where
the epoch-bump ZInfoDevice arrived before the second subscriber started,
causing a 5m timeout.

maintenance_mode: increase the exit-maintenance-mode TestInfo timeout
from 5m to 10m. EVE can take longer than 5m to re-populate systemAdapter
fields after clearing maintenance mode.

config_items_and_status: add pre-test cleanup (pre_cleanup.sh) to
remove any zagent-t1/zagent-n1 resources left by a previous failed run,
wait for them to be fully absent, then proceed. Also increase the AppInfo
TestInfo timeout from 5m to 10m for the same epoch-race reason. Remove
the brittle `stdout 'deploy network ... request sent'` check that breaks
when the network already exists.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Testscript does not support shell-style backslash line continuations.
Remove them from device_info_completeness and attest_flow, putting each
command on a single line.

For the maintenance_mode exit check, replace the
systemAdapter.status.ports.ifname filter with machineArch. After exiting
maintenance mode EVE consistently re-populates machineArch (a static
field present in every ZInfoDevice) before it re-populates systemAdapter,
so this filter reliably catches the first ZInfoDevice published on exit.
Also extend the timeout from 10m to 15m.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pod deletion in EVE can take more than 3m (container teardown, volume
cleanup). Increase the pre-cleanup wait for zagent-t1 absence from 3m
to 10m so that a stale pod from a previous failed run is fully removed
before this test re-creates it.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
device_info_completeness: TestInfo with multiple -out flags only outputs
the last field's value. Use a single -out for swList.shortVersion (the
meaningful stdout check) and verify machineArch via the filter predicate
only. Increase timeout to 15m since this consistently takes ~10m.

maintenance_mode: a previous failed run can leave EVE stuck in
maintenance mode. Add a pre-cleanup step that resets maintenance.mode
to none and waits for a confirming ZInfoDevice before starting the
actual test, ensuring we always begin from a known ONLINE state.
Bump watchdog to 35m to cover the extra pre-cleanup window.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant