Skip to content

Make broadcast notification workflow crash-resumable #568

@chrispaskvan

Description

@chrispaskvan

Summary

The broadcast notification workflow in NotificationController.create() is not crash-resumable. If the process dies mid-broadcast, some subscribers are notified and others are silently lost with no mechanism to detect or recover from the gap.

Problem

The broadcast flow in notifications/notification.controller.js performs these steps:

  1. Create a ClaimCheck (Redis hash)
  2. Fetch all subscribed users from Cosmos DB via UserService.getSubscribedUsers()
  3. Throttle-enqueue each user as a BullMQ job via Publisher.sendNotification() + add phone to claim check

Critical issues:

  • The throttle() call on line ~113 is fire-and-forget — the create() method returns the claimCheckNumber before all users are enqueued
  • If the process crashes mid-throttle, users not yet enqueued are permanently lost
  • There is no record of which users were targeted vs. which were successfully enqueued
  • The claim check only tracks users whose phone numbers were added via claimCheck.addPhoneNumber(), so it cannot be used to detect missing users
  • On restart, there is no mechanism to discover or resume incomplete broadcasts
  • For Xur notifications (the primary broadcast use case), this means some subscribers get the weekly notification and others do not

Proposed Solution

Option A: Checkpoint-Based Recovery (Simpler)

  1. Persist broadcast intent in Cosmos DB before starting:
    7A11-F0A1
    {
    id: claimCheckNumber,
    type: "broadcast",
    notificationType: subscription,
    status: "in_progress",
    targetPhoneNumbers: [...all subscribed user phone numbers...],
    enqueuedPhoneNumbers: [],
    createdAt: timestamp,
    completedAt: null
    }
  2. Track enqueue progress: After each successful publisher.sendNotification() + claimCheck.addPhoneNumber(), add the phone number to enqueuedPhoneNumbers in the broadcast document
  3. Await the throttle: Change create() to await throttle(...) so the method returns only after all users are enqueued (or the caller knows about partial failure)
  4. Mark completion: Set status: "completed" and completedAt when all users are enqueued
  5. Recovery on startup: Query Cosmos DB for broadcasts with status: "in_progress". For each, compute targetPhoneNumbers - enqueuedPhoneNumbers and re-enqueue the difference
  6. Admin endpoint: POST /notifications/broadcasts/:id/resume to manually trigger recovery

Option B: BullMQ FlowProducer (More Elegant)

  1. Use BullMQ FlowProducer to create a parent "broadcast" job with child jobs for each user
  2. BullMQ natively tracks parent-child relationships and can report overall broadcast status
  3. If the process crashes, BullMQ retains the queued child jobs and the parent tracks completion
  4. This requires restructuring the publisher to use FlowProducer.add() instead of individual Queue.add() calls

Recommendation: Option A for lower risk. Option B if adopting BullMQ Flows is desirable for other workflows too.

Files to Modify

  • notifications/notification.controller.jscreate() method: persist intent, await throttle, track progress
  • helpers/claim-check.js — Add broadcast document persistence (or create a new BroadcastRecord helper)
  • helpers/publisher.js — Optionally restructure for FlowProducer
  • helpers/throttle.js — Return results so caller knows which tasks succeeded/failed
  • loaders/index.js — Add startup recovery scan for incomplete broadcasts
  • notifications/notification.routes.js — Add resume endpoint (optional)

Acceptance Criteria

  • Broadcast intent (target user list, notification type, claim check number) is persisted before enqueuing begins
  • Per-user enqueue progress is tracked durably
  • create() awaits completion of all enqueues (or returns with a clear partial-failure indication)
  • Incomplete broadcasts are detected and resumed on service restart
  • Integration test: start a broadcast with N users, interrupt after N/2, restart, verify all N users receive notifications
  • Claim check accurately reflects final delivery state for all targeted users

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions