Summary
The broadcast notification workflow in NotificationController.create() is not crash-resumable. If the process dies mid-broadcast, some subscribers are notified and others are silently lost with no mechanism to detect or recover from the gap.
Problem
The broadcast flow in notifications/notification.controller.js performs these steps:
- Create a
ClaimCheck (Redis hash)
- Fetch all subscribed users from Cosmos DB via
UserService.getSubscribedUsers()
- Throttle-enqueue each user as a BullMQ job via
Publisher.sendNotification() + add phone to claim check
Critical issues:
- The
throttle() call on line ~113 is fire-and-forget — the create() method returns the claimCheckNumber before all users are enqueued
- If the process crashes mid-throttle, users not yet enqueued are permanently lost
- There is no record of which users were targeted vs. which were successfully enqueued
- The claim check only tracks users whose phone numbers were added via
claimCheck.addPhoneNumber(), so it cannot be used to detect missing users
- On restart, there is no mechanism to discover or resume incomplete broadcasts
- For Xur notifications (the primary broadcast use case), this means some subscribers get the weekly notification and others do not
Proposed Solution
Option A: Checkpoint-Based Recovery (Simpler)
- Persist broadcast intent in Cosmos DB before starting:
7A11-F0A1
{
id: claimCheckNumber,
type: "broadcast",
notificationType: subscription,
status: "in_progress",
targetPhoneNumbers: [...all subscribed user phone numbers...],
enqueuedPhoneNumbers: [],
createdAt: timestamp,
completedAt: null
}
- Track enqueue progress: After each successful
publisher.sendNotification() + claimCheck.addPhoneNumber(), add the phone number to enqueuedPhoneNumbers in the broadcast document
- Await the throttle: Change
create() to await throttle(...) so the method returns only after all users are enqueued (or the caller knows about partial failure)
- Mark completion: Set
status: "completed" and completedAt when all users are enqueued
- Recovery on startup: Query Cosmos DB for broadcasts with
status: "in_progress". For each, compute targetPhoneNumbers - enqueuedPhoneNumbers and re-enqueue the difference
- Admin endpoint:
POST /notifications/broadcasts/:id/resume to manually trigger recovery
Option B: BullMQ FlowProducer (More Elegant)
- Use BullMQ
FlowProducer to create a parent "broadcast" job with child jobs for each user
- BullMQ natively tracks parent-child relationships and can report overall broadcast status
- If the process crashes, BullMQ retains the queued child jobs and the parent tracks completion
- This requires restructuring the publisher to use
FlowProducer.add() instead of individual Queue.add() calls
Recommendation: Option A for lower risk. Option B if adopting BullMQ Flows is desirable for other workflows too.
Files to Modify
notifications/notification.controller.js — create() method: persist intent, await throttle, track progress
helpers/claim-check.js — Add broadcast document persistence (or create a new BroadcastRecord helper)
helpers/publisher.js — Optionally restructure for FlowProducer
helpers/throttle.js — Return results so caller knows which tasks succeeded/failed
loaders/index.js — Add startup recovery scan for incomplete broadcasts
notifications/notification.routes.js — Add resume endpoint (optional)
Acceptance Criteria
Summary
The broadcast notification workflow in
NotificationController.create()is not crash-resumable. If the process dies mid-broadcast, some subscribers are notified and others are silently lost with no mechanism to detect or recover from the gap.Problem
The broadcast flow in
notifications/notification.controller.jsperforms these steps:ClaimCheck(Redis hash)UserService.getSubscribedUsers()Publisher.sendNotification()+ add phone to claim checkCritical issues:
throttle()call on line ~113 is fire-and-forget — thecreate()method returns theclaimCheckNumberbefore all users are enqueuedclaimCheck.addPhoneNumber(), so it cannot be used to detect missing usersProposed Solution
Option A: Checkpoint-Based Recovery (Simpler)
7A11-F0A1
{
id: claimCheckNumber,
type: "broadcast",
notificationType: subscription,
status: "in_progress",
targetPhoneNumbers: [...all subscribed user phone numbers...],
enqueuedPhoneNumbers: [],
createdAt: timestamp,
completedAt: null
}
publisher.sendNotification()+claimCheck.addPhoneNumber(), add the phone number toenqueuedPhoneNumbersin the broadcast documentcreate()toawait throttle(...)so the method returns only after all users are enqueued (or the caller knows about partial failure)status: "completed"andcompletedAtwhen all users are enqueuedstatus: "in_progress". For each, computetargetPhoneNumbers - enqueuedPhoneNumbersand re-enqueue the differencePOST /notifications/broadcasts/:id/resumeto manually trigger recoveryOption B: BullMQ FlowProducer (More Elegant)
FlowProducerto create a parent "broadcast" job with child jobs for each userFlowProducer.add()instead of individualQueue.add()callsRecommendation: Option A for lower risk. Option B if adopting BullMQ Flows is desirable for other workflows too.
Files to Modify
notifications/notification.controller.js—create()method: persist intent, await throttle, track progresshelpers/claim-check.js— Add broadcast document persistence (or create a newBroadcastRecordhelper)helpers/publisher.js— Optionally restructure for FlowProducerhelpers/throttle.js— Return results so caller knows which tasks succeeded/failedloaders/index.js— Add startup recovery scan for incomplete broadcastsnotifications/notification.routes.js— Add resume endpoint (optional)Acceptance Criteria
create()awaits completion of all enqueues (or returns with a clear partial-failure indication)