Skip to content

Retry pending report_status instead of forcing re-download#251

Open
MichaelUray wants to merge 1 commit intoopenwisp:masterfrom
MichaelUray:fix/retry-pending-report-status
Open

Retry pending report_status instead of forcing re-download#251
MichaelUray wants to merge 1 commit intoopenwisp:masterfrom
MichaelUray:fix/retry-pending-report-status

Conversation

@MichaelUray
Copy link
Copy Markdown

@MichaelUray MichaelUray commented Apr 6, 2026

Summary

When report_status fails due to a transient server error (e.g., HTTP 502 during server maintenance), the agent deletes its local checksums, forcing a full re-download and re-apply on the next cycle. This is wasteful since the configuration was already applied successfully.

Problem

  1. Agent downloads config, applies it successfully
  2. report_status("applied") fails (server temporarily unavailable)
  3. Agent deletes checksums to force retry
  4. Next cycle: re-downloads the same config, re-applies, re-reports
  5. If the server is still down, this repeats — wasting bandwidth and CPU

Worse: if the server recovers but the agent has already re-downloaded and the checksums now match, it skips the report entirely, leaving the controller stuck in "modified" status indefinitely.

Fix

Instead of deleting checksums on report failure, save the pending status to a marker file (/tmp/openwisp/pending_report). On the next polling cycle, if no configuration change is detected, retry the pending report without re-downloading.

Backward Compatibility

Fully backward compatible:

  • Older controllers receive the delayed report_status and process it normally
  • The marker file is in /tmp (cleared on reboot) — no persistent state changes
  • If a real config change occurs before the report is retried, the normal update_configuration flow takes precedence

Related Issues

Related #172 — The agent may be stopped and started while the configuration is applied

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a persistent pending-report mechanism via a new PENDING_REPORT file to store the last report_status payload before reporting. update_configuration() sets a status_to_report ("applied" or "error"), writes it to PENDING_REPORT, then calls report_status with retry; on success it removes PENDING_REPORT, on failure the marker remains. Introduces retry_pending_report() which validates ("applied" or "error"), retries reporting the stored status, and deletes PENDING_REPORT only on success. The main loop calls retry_pending_report() when no configuration update occurs but PENDING_REPORT exists.

Sequence Diagram

sequenceDiagram
    actor Agent
    participant Config as Config System
    participant Report as Report Service
    participant File as PENDING_REPORT File

    loop Main Loop
        Agent->>Config: check for updates / update_configuration()
        alt Configuration applied or error
            Config-->>Agent: status (applied / error)
            Agent->>File: write status_to_report (persist)
            Agent->>Report: report_status(status_to_report) with retry
            alt Report succeeds
                Report-->>Agent: success
                Agent->>File: delete PENDING_REPORT
            else Report fails
                Report-->>Agent: failure
                Note over Agent,File: keep PENDING_REPORT for retry
            end
        else No update available
            Agent->>File: check PENDING_REPORT exists?
            alt Exists
                Agent->>File: read pending status
                Agent->>Report: retry report_status(pending status) with retry
                alt Report succeeds
                    Report-->>Agent: success
                    Agent->>File: delete PENDING_REPORT
                else Report fails
                    Report-->>Agent: failure
                    Note over Agent,File: keep PENDING_REPORT for next loop
                end
            end
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Bug Fixes ❌ Error PR adds 51 lines implementing retry_pending_report() and marker file tracking, but includes zero regression tests to verify the fix works or prevent future breakage. Add regression tests covering: PENDING_REPORT marker creation on report_status failure, checksum preservation, retry without re-download, and marker cleanup on success.
Title check ⚠️ Warning The title lacks the required [type] prefix (e.g., [fix], [change]) specified in the repository requirements. Update the title to follow the format: [fix] Retry pending report_status instead of forcing re-download
Description check ❓ Inconclusive The PR description is comprehensive, clearly structured, and addresses the problem, solution, and backward compatibility. However, it does not follow the repository's required template format with explicit checklist items. Consider following the repository's PR template by including the checklist items (Contributing Guidelines, manual testing, test cases, documentation updates) and explicitly confirming each. While the content is excellent, template compliance ensures consistency and completeness of required items.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kilo-code-bot
Copy link
Copy Markdown

kilo-code-bot Bot commented Apr 6, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The implementation has been significantly simplified from the previous approach:

Changes in this revision:

  • Removed the pending marker file (PENDING_REPORT) mechanism entirely
  • Removed retry_pending_report function
  • Simplified to: if report_status fails (after retry_with_backoff), delete checksums to force re-download on next cycle
  • CHANGELOG temporarily marked as "WIP." (pending final description)

Key behavior:

  • On report_status failure: checksums are deleted (rm -f $CONFIGURATION_CHECKSUM $PERSISTENT_CHECKSUM)
  • Next cycle detects missing checksums and re-downloads/re-applies configuration
  • This matches the original file's pattern (consistent with line 846 rm $APPLYING_CONF)

The simplified approach trades the "retry without re-download" optimization for code clarity and maintainability.

Files Reviewed (2 files)
  • openwisp-config/files/openwisp.agent - Simplified retry logic
  • CHANGELOG.rst - Temporarily marked as WIP

Reviewed by kimi-k2.5-0127 · 107,055 tokens

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openwisp-config/files/openwisp.agent (1)

829-853: ⚠️ Potential issue | 🟠 Major

Persist PENDING_REPORT before the first retry attempt.

Line 837 commits the new checksum before this branch writes PENDING_REPORT. If the daemon is restarted or crashes anywhere inside retry_with_backoff report_status ..., the delayed status is lost and the controller can stay stale until the next config change. Write the marker first, then delete it only after report_status succeeds.

Suggested fix
 	if [ "$result" -eq "0" ]; then
 		logger "Configuration applied successfully" \
 			-t openwisp \
 			-p daemon.info
 		# send config-applied hotplug event
 		env -i ACTION="config-applied" /sbin/hotplug-call openwisp
 		# store the new checksum as last known checksum
 		cp "$CONFIGURATION_CHECKSUM" "$PERSISTENT_CHECKSUM"
 		status_to_report="applied"
-		retry_with_backoff report_status "applied"
 	else
 		status_to_report="error"
-		retry_with_backoff report_status "error"
 	fi
+	printf '%s\n' "$status_to_report" >"$PENDING_REPORT"
+	retry_with_backoff report_status "$status_to_report"
 	# if reporting of the status fails, save the pending status for retry
 	# in the next cycle without forcing a full re-download
 	# shellcheck disable=SC2181
 	if [ "$?" -ne "0" ]; then
-		echo "$status_to_report" >"$PENDING_REPORT"
 		logger -s "report_status failed, will retry in the next cycle" \
 			-t openwisp \
 			-p daemon.warning
 	else
 		rm -f "$PENDING_REPORT"
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openwisp-config/files/openwisp.agent` around lines 829 - 853, Move
persistence of the pending status before any retry attempt and only commit the
new checksum after a successful report: in the block using status_to_report,
write the status into PENDING_REPORT (echo "$status_to_report"
>"$PENDING_REPORT") before calling retry_with_backoff report_status ... (for
both the "applied" and "error" branches), then call retry_with_backoff
report_status "$status_to_report"; if it succeeds, rm -f "$PENDING_REPORT" and
only then cp "$CONFIGURATION_CHECKSUM" "$PERSISTENT_CHECKSUM" (for the "applied"
path), otherwise leave PENDING_REPORT in place for the next cycle. Ensure you
reference PENDING_REPORT, retry_with_backoff, report_status,
CONFIGURATION_CHECKSUM and PERSISTENT_CHECKSUM when making changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 1004-1005: The script uses "local pending_status" at top-level
inside the main "while true" loop, which causes "local: not in a function" in
ash; change it to a regular variable assignment by removing the "local" keyword
where "pending_status" is set from "$PENDING_REPORT" (the assignment that
currently reads "local pending_status" and "pending_status=$(cat
\"$PENDING_REPORT\")") so the variable is assigned normally in the loop scope.

---

Outside diff comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 829-853: Move persistence of the pending status before any retry
attempt and only commit the new checksum after a successful report: in the block
using status_to_report, write the status into PENDING_REPORT (echo
"$status_to_report" >"$PENDING_REPORT") before calling retry_with_backoff
report_status ... (for both the "applied" and "error" branches), then call
retry_with_backoff report_status "$status_to_report"; if it succeeds, rm -f
"$PENDING_REPORT" and only then cp "$CONFIGURATION_CHECKSUM"
"$PERSISTENT_CHECKSUM" (for the "applied" path), otherwise leave PENDING_REPORT
in place for the next cycle. Ensure you reference PENDING_REPORT,
retry_with_backoff, report_status, CONFIGURATION_CHECKSUM and
PERSISTENT_CHECKSUM when making changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebefb09a-1e67-4c5b-877a-5b4a46d98822

📥 Commits

Reviewing files that changed from the base of the PR and between d3b6ae1 and 261dffe.

📒 Files selected for processing (1)
  • openwisp-config/files/openwisp.agent
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Kilo Code Review
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-27T07:54:22.974Z
Learnt from: Viscous106
Repo: openwisp/openwisp-config PR: 248
File: openwisp-config/files/openwisp.init:95-96
Timestamp: 2026-02-27T07:54:22.974Z
Learning: In openwisp-config/files/openwisp.init, the wait_applying_config() function intentionally removes the CONTROL_FILE (/tmp/openwisp/applying_conf) after a 30-second timeout via `rm -f "$CONTROL_FILE"`. This timeout serves as an escape hatch to prevent permanent deadlock when the agent crashes and leaves a stale lock file behind. The design assumes configuration apply operations complete within 30 seconds under normal conditions.

Applied to files:

  • openwisp-config/files/openwisp.agent
🔇 Additional comments (1)
openwisp-config/files/openwisp.agent (1)

203-203: Good separation of pending-report state.

Using a dedicated marker file here keeps status-delivery retries independent from checksum drift detection.

Comment thread openwisp-config/files/openwisp.agent Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 990-1000: The retry_pending_report() function currently reads
PENDING_REPORT into pending_status and calls retry_with_backoff report_status
"$pending_status" without validation; update retry_pending_report to first
verify PENDING_REPORT exists and that pending_status is non-empty and matches
the expected format/value (e.g., a known status token or numeric ID) after
trimming whitespace, log a clear error and remove or archive the malformed
PENDING_REPORT instead of retrying when invalid, and only call
retry_with_backoff report_status "$pending_status" when the validation passes;
reference the retry_pending_report function, PENDING_REPORT variable,
pending_status variable, and the retry_with_backoff/report_status call when
making the changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a58ee842-5f98-4e8c-bd39-3c8da73957d9

📥 Commits

Reviewing files that changed from the base of the PR and between 261dffe and 60c2ea2.

📒 Files selected for processing (1)
  • openwisp-config/files/openwisp.agent
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Kilo Code Review
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-27T07:54:22.974Z
Learnt from: Viscous106
Repo: openwisp/openwisp-config PR: 248
File: openwisp-config/files/openwisp.init:95-96
Timestamp: 2026-02-27T07:54:22.974Z
Learning: In openwisp-config/files/openwisp.init, the wait_applying_config() function intentionally removes the CONTROL_FILE (/tmp/openwisp/applying_conf) after a 30-second timeout via `rm -f "$CONTROL_FILE"`. This timeout serves as an escape hatch to prevent permanent deadlock when the agent crashes and leaves a stale lock file behind. The design assumes configuration apply operations complete within 30 seconds under normal conditions.

Applied to files:

  • openwisp-config/files/openwisp.agent
🔇 Additional comments (3)
openwisp-config/files/openwisp.agent (3)

203-203: Good addition of a dedicated pending-report marker path.

This cleanly introduces the state file without changing existing path conventions.


829-854: Failure-path redesign is correct and aligns with the PR goal.

Persisting pending status instead of removing checksums is the right behavior for transient report_status failures.


1013-1017: Main-loop integration is sound.

Retrying pending status only when no update is triggered keeps update precedence intact and avoids unnecessary re-downloads.

Comment thread openwisp-config/files/openwisp.agent
MichaelUray pushed a commit to MichaelUray/openwisp-config that referenced this pull request Apr 7, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error/running) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 60c2ea2 to 46e3c54 Compare April 7, 2026 08:28
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 1000-1009: The case validation for pending_status currently
accepts "running" but update_configuration() only writes "applied" or "error";
remove "running" from the allowed list in the case pattern (change
applied|error|running to applied|error) so validation matches actual usage in
update_configuration(); if "running" must be supported for
controller-side/forward-compatibility reasons, instead leave it and add a
one-line comment above the case explaining why "running" is valid and where it
may be produced (reference update_configuration() and the controller) so future
maintainers understand the intent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a193c14c-73f9-41f9-9156-16eecd9ef685

📥 Commits

Reviewing files that changed from the base of the PR and between 60c2ea2 and 46e3c54.

📒 Files selected for processing (1)
  • openwisp-config/files/openwisp.agent
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: QA-Checks and Tests
  • GitHub Check: Kilo Code Review
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-27T07:54:22.974Z
Learnt from: Viscous106
Repo: openwisp/openwisp-config PR: 248
File: openwisp-config/files/openwisp.init:95-96
Timestamp: 2026-02-27T07:54:22.974Z
Learning: In openwisp-config/files/openwisp.init, the wait_applying_config() function intentionally removes the CONTROL_FILE (/tmp/openwisp/applying_conf) after a 30-second timeout via `rm -f "$CONTROL_FILE"`. This timeout serves as an escape hatch to prevent permanent deadlock when the agent crashes and leaves a stale lock file behind. The design assumes configuration apply operations complete within 30 seconds under normal conditions.

Applied to files:

  • openwisp-config/files/openwisp.agent
🔇 Additional comments (3)
openwisp-config/files/openwisp.agent (3)

203-203: LGTM!

The new PENDING_REPORT marker file is appropriately placed in the volatile /tmp/openwisp directory, ensuring it's cleared on reboot and doesn't introduce persistent state. This aligns well with the existing working file conventions.


829-856: Well-designed crash-safe implementation.

The approach of writing the pending status to PENDING_REPORT before attempting report_status ensures that if the agent crashes mid-retry, the marker persists and the status can be retried on the next cycle. Removing the marker only on success is correct.

This properly addresses the original problem by no longer deleting checksum files on report_status failure.


1030-1034: LGTM!

The main loop integration correctly prioritizes configuration updates over pending report retries. The inline comments clearly explain the retry scenario, making the code self-documenting.

Comment thread openwisp-config/files/openwisp.agent
MichaelUray pushed a commit to MichaelUray/openwisp-config that referenced this pull request Apr 8, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error/running) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 46e3c54 to 6863fc6 Compare April 8, 2026 05:43
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openwisp-config/files/openwisp.agent (2)

1027-1035: ⚠️ Potential issue | 🟠 Major

Only retry a pending report after a confirmed no-change result.

This elif runs for every non-1 return from configuration_changed(), including checksum fetch failures. In that path the agent has not proved the controller still wants the same config, so replaying a delayed applied/error can acknowledge stale state and adds another full backoff loop during outages. Gate retry_pending_report() on exit code 0 only.

Suggested fix
 	configuration_changed
+	config_change_status=$?
 
-	if [ "$?" -eq "1" ]; then
+	if [ "$config_change_status" -eq "1" ]; then
 		update_configuration
-	elif [ -f "$PENDING_REPORT" ]; then
+	elif [ "$config_change_status" -eq "0" ] && [ -f "$PENDING_REPORT" ]; then
 		# Retry a previously failed report_status without re-downloading.
 		# This handles the case where the config was applied successfully
 		# but report_status failed due to a transient server error.
 		retry_pending_report
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openwisp-config/files/openwisp.agent` around lines 1027 - 1035, The script
currently uses "$?" to decide branches after calling configuration_changed(),
which can be overwritten and treats any non-1 result as eligible to retry a
pending report; capture the exit code immediately (e.g.,
cfg_rc=$(configuration_changed; echo $?) or store $? into a variable right after
calling configuration_changed()) and change the elif to only call
retry_pending_report when that captured code equals 0 and the pending file
exists (use the symbols configuration_changed, update_configuration,
retry_pending_report, and PENDING_REPORT to locate the logic); leave the
update_configuration branch for cfg_rc == 1.

829-856: ⚠️ Potential issue | 🟠 Major

Fail closed if PENDING_REPORT cannot be persisted.

This retry path only works if the marker is actually saved. Right now a tmpfs/full-disk or redirection failure is ignored, so a later report_status failure silently drops the only retry state and reintroduces the stuck-controller case this change is meant to avoid. Persist the marker atomically and fall back to the old checksum invalidation path when that write cannot be saved.

Suggested fix
-	local status_to_report
+	local status_to_report pending_report_saved=0 pending_report_tmp
 	if [ "$result" -eq "0" ]; then
 		logger "Configuration applied successfully" \
 			-t openwisp \
@@
-	printf '%s\n' "$status_to_report" >"$PENDING_REPORT"
+	pending_report_tmp="${PENDING_REPORT}.tmp"
+	if printf '%s\n' "$status_to_report" >"$pending_report_tmp" \
+		&& mv -f "$pending_report_tmp" "$PENDING_REPORT"; then
+		pending_report_saved=1
+	else
+		rm -f "$pending_report_tmp"
+		logger -s "Failed to persist pending report marker" \
+			-t openwisp \
+			-p daemon.err
+	fi
 
 	retry_with_backoff report_status "$status_to_report"
 	# shellcheck disable=SC2181
 	if [ "$?" -eq "0" ]; then
-		rm -f "$PENDING_REPORT"
+		[ "$pending_report_saved" -eq "1" ] && rm -f "$PENDING_REPORT"
 	else
+		if [ "$pending_report_saved" -ne "1" ]; then
+			rm -f "$CONFIGURATION_CHECKSUM" "$PERSISTENT_CHECKSUM"
+		fi
 		logger -s "report_status failed, will retry in the next cycle" \
 			-t openwisp \
 			-p daemon.warning
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openwisp-config/files/openwisp.agent` around lines 829 - 856, The script
currently writes the retry marker directly to $PENDING_REPORT without
guaranteeing persistence; if that write fails the retry path is lost. Change the
write to be atomic: write the marker to a temporary file (use mktemp or
"$PENDING_REPORT.tmp"), fsync/ensure the write succeeded, then mv the temp into
place to replace $PENDING_REPORT and check the mv exit status; if any of those
steps fail, log the failure (use logger -t openwisp -p daemon.err) and "fail
closed" by removing/invalidating the stored checksum (remove or unlink
$PERSISTENT_CHECKSUM or otherwise revert the copy from $CONFIGURATION_CHECKSUM)
so the old checksum invalidation path will trigger, then skip the
retry_with_backoff report_status path (or treat it as a permanent failure)
instead of silently proceeding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 1027-1035: The script currently uses "$?" to decide branches after
calling configuration_changed(), which can be overwritten and treats any non-1
result as eligible to retry a pending report; capture the exit code immediately
(e.g., cfg_rc=$(configuration_changed; echo $?) or store $? into a variable
right after calling configuration_changed()) and change the elif to only call
retry_pending_report when that captured code equals 0 and the pending file
exists (use the symbols configuration_changed, update_configuration,
retry_pending_report, and PENDING_REPORT to locate the logic); leave the
update_configuration branch for cfg_rc == 1.
- Around line 829-856: The script currently writes the retry marker directly to
$PENDING_REPORT without guaranteeing persistence; if that write fails the retry
path is lost. Change the write to be atomic: write the marker to a temporary
file (use mktemp or "$PENDING_REPORT.tmp"), fsync/ensure the write succeeded,
then mv the temp into place to replace $PENDING_REPORT and check the mv exit
status; if any of those steps fail, log the failure (use logger -t openwisp -p
daemon.err) and "fail closed" by removing/invalidating the stored checksum
(remove or unlink $PERSISTENT_CHECKSUM or otherwise revert the copy from
$CONFIGURATION_CHECKSUM) so the old checksum invalidation path will trigger,
then skip the retry_with_backoff report_status path (or treat it as a permanent
failure) instead of silently proceeding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 24ff03ed-fee7-4ead-aa12-c4b63b1d3f1f

📥 Commits

Reviewing files that changed from the base of the PR and between 46e3c54 and 6863fc6.

📒 Files selected for processing (1)
  • openwisp-config/files/openwisp.agent
📜 Review details
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-27T07:54:22.974Z
Learnt from: Viscous106
Repo: openwisp/openwisp-config PR: 248
File: openwisp-config/files/openwisp.init:95-96
Timestamp: 2026-02-27T07:54:22.974Z
Learning: In openwisp-config/files/openwisp.init, the wait_applying_config() function intentionally removes the CONTROL_FILE (/tmp/openwisp/applying_conf) after a 30-second timeout via `rm -f "$CONTROL_FILE"`. This timeout serves as an escape hatch to prevent permanent deadlock when the agent crashes and leaves a stale lock file behind. The design assumes configuration apply operations complete within 30 seconds under normal conditions.

Applied to files:

  • openwisp-config/files/openwisp.agent

coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 8, 2026
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 9, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 6863fc6 to d1eeb7f Compare April 9, 2026 07:19
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openwisp-config/files/openwisp.agent (1)

829-848: ⚠️ Potential issue | 🟠 Major

Persist the original error payload, not just "error".

Lines 843-848 only freeze the status token. On the "error" path, report_status() rebuilds error_reason from current logs on every call, so a deferred retry can send a different or empty payload after log rotation or later-cycle noise. Please persist the original error_reason snapshot together with the status and reuse that in retry_pending_report().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openwisp-config/files/openwisp.agent` around lines 829 - 848, When the apply
path fails, capture the current error payload (the computed error_reason string
that report_status would send) and persist it together with the status token
instead of writing only "error" to PENDING_REPORT; update the failure branch
that sets status_to_report="error" to also write the serialized error_reason
snapshot to PENDING_REPORT (or a companion file) and ensure
retry_pending_report() / report_status() read and reuse that stored snapshot
rather than reconstructing error_reason from current logs; keep the existing
behavior to remove the persisted marker only after report_status succeeds and
ensure retry_with_backoff still invokes report_status with the preserved
payload.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.rst`:
- Around line 10-12: The CHANGELOG.rst bullet "Fixed retry pending
``report_status`` instead of forcing re-download after a transient HTTP error."
is miswrapped and reads awkwardly; rewrap this entry to match the surrounding
RST bullet/link style and rephrase to clarify the action (for example: "Mark
retry as pending in ``report_status`` instead of forcing a re-download after
transient HTTP errors. `#251
<https://github.com/openwisp/openwisp-config/pull/251>`_"). Ensure line wrapping
follows the file's existing RST width/formatting and that the inline code token
``report_status`` and the PR link use the same style as other entries.

---

Outside diff comments:
In `@openwisp-config/files/openwisp.agent`:
- Around line 829-848: When the apply path fails, capture the current error
payload (the computed error_reason string that report_status would send) and
persist it together with the status token instead of writing only "error" to
PENDING_REPORT; update the failure branch that sets status_to_report="error" to
also write the serialized error_reason snapshot to PENDING_REPORT (or a
companion file) and ensure retry_pending_report() / report_status() read and
reuse that stored snapshot rather than reconstructing error_reason from current
logs; keep the existing behavior to remove the persisted marker only after
report_status succeeds and ensure retry_with_backoff still invokes report_status
with the preserved payload.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 490bde50-eb13-4e22-8086-bc3cf748c5ff

📥 Commits

Reviewing files that changed from the base of the PR and between 6863fc6 and d1eeb7f.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • openwisp-config/files/openwisp.agent
📜 Review details
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-02-27T07:54:22.974Z
Learnt from: Viscous106
Repo: openwisp/openwisp-config PR: 248
File: openwisp-config/files/openwisp.init:95-96
Timestamp: 2026-02-27T07:54:22.974Z
Learning: In openwisp-config/files/openwisp.init, the wait_applying_config() function intentionally removes the CONTROL_FILE (/tmp/openwisp/applying_conf) after a 30-second timeout via `rm -f "$CONTROL_FILE"`. This timeout serves as an escape hatch to prevent permanent deadlock when the agent crashes and leaves a stale lock file behind. The design assumes configuration apply operations complete within 30 seconds under normal conditions.

Applied to files:

  • openwisp-config/files/openwisp.agent
🪛 GitHub Actions: OpenWISP Config CI Build
CHANGELOG.rst

[error] 1-1: ReStructuredText check failed: File '/home/runner/work/openwisp-config/openwisp-config/CHANGELOG.rst' could be reformatted (1 out of 13 files could be reformatted).

🔇 Additional comments (3)
openwisp-config/files/openwisp.agent (3)

203-203: Good placement for transient retry state.

Keeping PENDING_REPORT under $WORKING_DIR preserves the existing /tmp/openwisp lifecycle and avoids introducing new persistent state.


992-1019: Nice defensive handling of malformed retry markers.

Validating pending_status and discarding bad content prevents the agent from spinning forever on a corrupt marker.


1031-1035: Good precedence in the polling loop.

Only retrying PENDING_REPORT when no new configuration is available keeps real config updates ahead of stale status cleanup.

Comment thread CHANGELOG.rst Outdated
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 9, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from d1eeb7f to 40564c7 Compare April 9, 2026 13:27
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 13, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 40564c7 to bb95a91 Compare April 13, 2026 04:26
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 13, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from bb95a91 to 151b73c Compare April 13, 2026 05:29
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
- Write PENDING_REPORT atomically via tmp file + rename so a partial
  write (e.g. full disk) cannot leave a truncated marker behind.
- Fall back to invalidating the persistent checksum when the pending
  marker cannot be saved, preserving the previous force re-download
  behaviour as a safety net.
- Only retry a pending report after configuration_changed() returns 0
  (confirmed no-change). Non-0/non-1 exit codes indicate checksum fetch
  failures and the cached applied/error status may no longer reflect the
  controller's current desired configuration.
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 151b73c to f28f67f Compare April 17, 2026 16:17
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
- Write PENDING_REPORT atomically via tmp file + rename so a partial
  write (e.g. full disk) cannot leave a truncated marker behind.
- Fall back to invalidating the persistent checksum when the pending
  marker cannot be saved, preserving the previous force re-download
  behaviour as a safety net.
- Only retry a pending report after configuration_changed() returns 0
  (confirmed no-change). Non-0/non-1 exit codes indicate checksum fetch
  failures and the cached applied/error status may no longer reflect the
  controller's current desired configuration.
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from f28f67f to 7ea134d Compare April 17, 2026 17:58
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details addressed from the review:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Persist the marker BEFORE calling retry_with_backoff report_status
  so a crash or reboot between the retry attempt and the
  PENDING_REPORT write does not lose the status.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than
  retrying forever.
- Only remove the marker after a successful report_status.

Closes openwisp#251

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MichaelUray added a commit to MichaelUray/openwisp-config that referenced this pull request Apr 17, 2026
- Write PENDING_REPORT atomically via tmp file + rename so a partial
  write (e.g. full disk) cannot leave a truncated marker behind.
- Fall back to invalidating the persistent checksum when the pending
  marker cannot be saved, preserving the previous force re-download
  behaviour as a safety net.
- Only retry a pending report after configuration_changed() returns 0
  (confirmed no-change). Non-0/non-1 exit codes indicate checksum fetch
  failures and the cached applied/error status may no longer reflect the
  controller's current desired configuration.
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 7ea134d to 089cdcd Compare April 17, 2026 17:58
…wisp#251

When report_status fails after a configuration apply (e.g. due to a
transient HTTP 502 from the controller), the agent previously deleted
the local checksum files to trigger a full re-download on the next
cycle. That worked but wasted bandwidth and caused the controller to
stay stuck on "modified" until the next real config change.

Instead persist the unreported status in a pending_report marker file
and retry just report_status on the next polling cycle.

Key details:

- Define retry_pending_report as a shell function so "local" is valid
  on BusyBox ash, where "local" outside a function is a parse error
  that silently crashes the script mid-loop.
- Write the PENDING_REPORT marker atomically via a tmp file + rename so
  a partial write (e.g. full disk) cannot leave a truncated marker
  behind.
- Persist the marker BEFORE calling retry_with_backoff report_status so
  a crash or reboot between the retry attempt and the PENDING_REPORT
  write does not lose the status.
- Fall back to invalidating the persistent checksum when the pending
  marker cannot be saved, preserving the previous force re-download
  behaviour as a safety net.
- Validate the marker contents against the known-good status set
  (applied/error) and discard a malformed marker rather than retrying
  forever.
- Only retry a pending report after configuration_changed() returns 0
  (confirmed no-change). Non-0/non-1 exit codes indicate checksum fetch
  failures and the cached applied/error status may no longer reflect
  the controller's current desired configuration.
- Only remove the marker after a successful report_status.

Closes openwisp#251
@MichaelUray MichaelUray force-pushed the fix/retry-pending-report-status branch from 089cdcd to 34e35fa Compare April 18, 2026 04:01
@MichaelUray
Copy link
Copy Markdown
Author

@coderabbitai review

All previously requested changes have been addressed in commits 60c2ea2, 46e3c54, 6863fc6, and 40564c7 (see the ✅ Addressed in commit markers on your earlier review comments). CI is currently green across Build, QA-Checks and Tests. Kindly re-review so the stale CHANGES_REQUESTED state can clear and a human maintainer can evaluate the PR.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant