
fix(plugin-multi-tenant): forward req in afterTenantDelete cascade #16505

Open
philipdaveby wants to merge 1 commit into payloadcms:main from philipdaveby:fix/multi-tenant-cascade-forward-req

Conversation

@philipdaveby

What?

@payloadcms/plugin-multi-tenant's afterTenantDelete hook (packages/plugin-multi-tenant/src/hooks/afterTenantDelete.ts) receives req from the caller but does not forward it into the three internal Payload calls inside the hook:

  • req.payload.delete({ collection: slug, where: {...} }) — per tenant-scoped collection (cascade)
  • req.payload.find({ collection: usersSlug, ... }) — users-with-tenant lookup
  • req.payload.update({ collection: usersSlug, ... }) — per affected user

Each call therefore grabs a fresh pool connection and runs outside the caller's transaction.
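To make the mechanism concrete, here is a minimal self-contained sketch (mock names, not Payload's real internals) of why omitting req detaches an inner call from the caller's transaction:

```typescript
// Mock sketch: a stand-in for how a local-API operation decides which
// connection to run on. `connectionFor` is illustrative, not Payload code.
type Req = { transactionID?: string }

// If req.transactionID is present, the operation joins the caller's tx;
// otherwise it acquires a fresh pool connection (a sibling transaction).
function connectionFor(req?: Req): string {
  return req?.transactionID ?? 'fresh-pool-connection'
}

const parentReq: Req = { transactionID: 'tx-123' }

// Without forwarding req (current behaviour): sibling transaction.
console.log(connectionFor(undefined)) // "fresh-pool-connection"

// With the fix: the inner call shares the parent's transactionID.
console.log(connectionFor(parentReq)) // "tx-123"
```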

Why?

When the caller (e.g. a Kafka consumer that opens a transaction to delete a tenant) holds row locks via its open tx, the cascade's sibling tx tries to acquire the same locks and blocks. The parent awaits the cascade's promise; the cascade waits on the parent's locks; Postgres can't detect the cycle because the parent connection is at ClientRead (waiting on Node, not on a Lock) — producing multi-minute idle in transaction leaks visible in pg_stat_activity.

The leak shape we observed:

```
pid X (idle in tx, ClientRead) — last query:
DELETE FROM payload_preferences WHERE key IN ($1)
pid Y (active, blocked on transactionid Lock) — query:
DELETE FROM static_assets WHERE id = $1
```

pid X is the parent (its last visible statement is Payload's deleteUserPreferences cleanup at the end of the deleteByID flow, right before the multi-tenant afterDelete hook fires). pid Y is one of the cascade's sibling txs. The cycle persists until something external (idle_in_transaction_session_timeout, NAT eviction, or a process-level watchdog) terminates a connection.
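The wait cycle can be reproduced in isolation with a toy lock (illustrative only; RowLock stands in for a Postgres row lock, not Payload or pg code). The parent tx holds the lock and awaits a cascade running as a sibling tx that needs the same lock, so neither side can proceed:

```typescript
// A toy re-entrant lock: the same tx re-acquires instantly; any other tx
// blocks forever, like pid Y blocked on a transactionid Lock.
class RowLock {
  private holder?: string
  acquire(tx: string): Promise<void> {
    if (this.holder === undefined || this.holder === tx) {
      this.holder = tx
      return Promise.resolve() // re-entrant for the same transaction
    }
    return new Promise(() => {}) // a different tx blocks indefinitely
  }
}

async function simulate(cascadeTx: string): Promise<string> {
  const lock = new RowLock()
  await lock.acquire('parent-tx') // the parent's open tx touched the row
  const cascade = lock.acquire(cascadeTx).then(() => 'done')
  const timeout = new Promise<string>((r) => setTimeout(() => r('hung'), 50))
  return Promise.race([cascade, timeout])
}

// Sibling tx (current behaviour): the cascade never acquires the lock.
simulate('sibling-tx').then(console.log) // "hung"
// Shared tx (with the fix): the lock is re-entrant, the cascade completes.
simulate('parent-tx').then(console.log) // "done"
```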

Beyond fixing the deadlock, sharing the tx also restores transactional consistency: a half-failed cascade no longer commits "tenant deleted, child rows orphaned" or "child rows deleted, tenant still present".

How?

Pass the caller's req into all three Payload calls so the cascade and the users-cleanup share the parent's transactionID. Three single-line additions:

```diff
 cleanupPromises.push(
   req.payload.delete({
     collection: slug,
+    req,
     where: { [tenantFieldName]: { in: [id] } },
   }),
 )
```

```diff
 const usersWithTenant = (await req.payload.find({
   collection: usersSlug,
   depth: 0,
   limit: 0,
+  req,
   where: { ... },
 })) as PaginatedDocs
```

```diff
 cleanupPromises.push(
   req.payload.update({
     id: user.id,
     collection: usersSlug,
     data: { ... },
+    req,
   }),
 )
```

Reproduction notes

  • The bug has been present in every shipped 3.x version. Verified by extracting dist/hooks/afterTenantDelete.js from npm tarballs for 3.30.0, 3.50.0, 3.64.0 (current popular pin), and 3.84.1 (latest at time of writing) — all identical.
  • Triggers when:
    1. The tenant collection is deleted from a code path that has its own open transaction (Kafka consumer, custom endpoint, etc.).
    2. The tenant has child records in any tenant-scoped collection.
    3. The cascade's payload.delete touches a row that the parent's tx has already touched (or vice versa).
  • Effectively never observed via the admin UI delete flow because that path doesn't open a custom outer tx, so the cascade's sibling tx doesn't race itself. We hit it through a Kafka consumer that calls payload.delete(tenantsSlug, id, req) after payload.db.beginTransaction().

Test plan

  • Existing plugin-multi-tenant tests continue to pass.
  • Suggest adding a regression test: invoke addTenantCleanup with a req carrying a transactionID, assert the inner payload.delete/find/update calls receive the same transactionID (e.g. by spying on the payload mock).

I'm happy to add the regression test if helpful — let me know which test file you'd prefer it in (packages/plugin-multi-tenant/test/ looks like the right place, but I haven't audited the existing test layout).
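A mock-based sketch of that regression test (names and the hook body are simplified stand-ins; a real test would spy on the actual Payload instance and invoke the real hook):

```typescript
// MockPayload records every local-API call and the req it received.
type Req = { transactionID?: string; payload: MockPayload }
type Call = { op: string; req?: Req }

class MockPayload {
  calls: Call[] = []
  async delete(args: { collection: string; req?: Req }) {
    this.calls.push({ op: 'delete', req: args.req })
  }
  async find(args: { collection: string; req?: Req }) {
    this.calls.push({ op: 'find', req: args.req })
    return { docs: [] }
  }
  async update(args: { collection: string; req?: Req }) {
    this.calls.push({ op: 'update', req: args.req })
  }
}

// Simplified stand-in for the fixed hook: every inner call forwards req.
async function afterTenantDeleteFixed(req: Req) {
  await req.payload.delete({ collection: 'posts', req })
  await req.payload.find({ collection: 'users', req })
  await req.payload.update({ collection: 'users', req })
}

async function main() {
  const payload = new MockPayload()
  const req: Req = { transactionID: 'tx-abc', payload }
  await afterTenantDeleteFixed(req)
  // Every inner call must carry the caller's transactionID.
  console.log(payload.calls.map((c) => c.req?.transactionID).join(',')) // "tx-abc,tx-abc,tx-abc"
}
main()
```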

Risk

Minimal. The change only threads the existing req through. Behaviour is unchanged when the caller doesn't open a transaction (req.transactionID is undefined and Payload's adapter treats it as a plain pool acquisition).

