feat: enforce retrieval relevance thresholds#2
Conversation
Add explicit score semantics and pre-packaging relevance filtering for memory search, including deterministic regression coverage for noisy direct-fact retrieval with integration-heavy memory sets.
|
Thanks for the substantial work on this — the threshold/relevance separation is the right shape and the recall-preservation bypasses are well-motivated. A few issues from a careful read of the diff that I think need to land before merge, plus some smaller items. Blockers1.
|
- preserve explicit zero relevance thresholds - split legacy score from gated ranking_score across retrieval paths - add regression coverage for SQL score semantics and recall bypass policy
|
Thanks for the careful read. Fixed these in What was missed:
What changed:
Regression coverage added/updated:
Validation:
|
The previous flow ran TLL chain augmentation inside executeSearchStep and tagged hydrated rows with `similarity: 0.5` so they could pass through applySearchRelevanceFilter. That magic constant either filtered chain rows out (when the threshold was higher) or polluted ranking with a meaningless score. Move TLL augmentation AFTER postProcessResults / applySearchRelevanceFilter. Hydrated rows now carry `similarity: null` and `retrieval_signal: 'tll-chain'`, so they ride around the similarity gate entirely — chain-membership is a structurally different retrieval signal than semantic similarity. The trace adds a `tll-augmentation` stage with the ids that were appended. Replace the `slice(0, 10)` magic number with a named constant (TLL_SEED_CANDIDATE_COUNT).
Squashed across four review rounds on PR #18, all of which surfaced after the initial wave of fixes (#1–#11). Each item below maps to a finding in the PR's review threads. Search path - hydrateChainMemories now returns fully-shaped SearchResults via SELECT * + normalizeMemoryRow. The previous projection set similarity: null and omitted source_site / score / summary / observed_at, crashing the buildInjection formatter (`memory.similarity.toFixed(2)`) the first time TLL augmentation actually fired against a populated chain. The `as unknown as SearchResult` cast was hiding it from tsc. (review v2 #1) - Hydration query now uses `unnest($2::uuid[]) WITH ORDINALITY ... ORDER BY req.ord` so the chronological order chainsFor returns through the augmentation pipeline survives. (review v4 #2) - Workspace isolation: hydrateChainMemories filters `m.workspace_id IS NULL` to match the gate behavior performSearch's postProcessResults already applies. Without it, a workspace memory chained from a global memory's entity could surface in a global response. (review v4 #1) - Defensive `relevance: 1.0` on hydrated rows locks in the chain-membership bypass invariant against future filter drift. The augmented rows are appended after applySearchRelevanceFilter today, but `similarity: 0` + `score: 0` would make them load-bearing on `relevance` if any future filter past appendTllAugmentation checked `memory.relevance >= threshold`. Regression test drives performSearch with a high retrievalOptions.relevanceThreshold and confirms the augmented row survives. (review v5 #2) Repository - TLL chain reads (chain, chainEventsForEntities) now derive chronological position and predecessor via `ROW_NUMBER()` and `LAG()` window functions ordered by observation_date ASC (with stored position_in_chain as a deterministic tiebreaker for events sharing an observation_date). The stored predecessor_memory_id and position_in_chain columns become insertion-order audit metadata; the API surface returns chronological ordering. Backfilled out-of-order events surface in their true position with chronologically-correct predecessors. (review v3 #1) - chainEventsForEntities adds `m.workspace_id IS NULL` for the same reason as the search-path fix above — the global event-chains HTTP endpoint must not surface workspace memories. (review v4 #1) Schema - FirstMentionsExtractBodySchema validates memory_ids_by_turn_id values as UUIDs so a non-UUID returns 400 (schema layer) instead of leaking a Postgres "invalid input syntax for type uuid" as 500 from the route. (review v3 #2) - New SearchResult.retrieval_signal optional field tags chain-augmented rows so observability and any future ranker can distinguish them from similarity-ranked candidates. (review v2 #1, plumbed through v4) Refactor - Extracted maybeExpandViaTLL, hydrateChainMemories, appendTllAugmentation out of memory-search.ts into a sibling tll-augmentation.ts module. The search file dropped from 551 → 385 LOC, back under the 400-LOC project cap. Shared internal types (PostProcessedSearch, RelevanceFilterSummary) pulled into memory-search-types.ts so the two consumers don't duplicate. (review v5 #4) Test coverage - New integration test (services/__tests__/tll-augmentation-integration.test.ts) drives performSearch end-to-end through appendTllAugmentation, with cases for: rendering augmented rows through buildInjection without crashing, the SQL contract (unnest ORDINALITY + ORDER BY req.ord), workspace-leak prevention, the relevance-1.0 bypass invariant, and no-augmentation for non-TLL queries. - New repository tests for backfill chronological ordering of chain() and chainEventsForEntities() and for chainEventsForEntities workspace isolation. - New route test asserts 400 (not 500) on non-UUID memory_ids_by_turn_id. Verification: `npx tsc --noEmit` clean · 1286/1286 vitest pass against the test DB · `npx fallow audit --no-cache` exit 0. Deferrals (parked durably in the research repo's tech-debt log at Atomicmemory-research/docs/core-repo/tech-debt.md): - predecessor_memory_id ON DELETE CASCADE vs SET NULL — design call, contested between two reviewers; current default kept. - process.env.ALLOWED_ORIGINS direct read in src/routes/memories.ts — pre-existing CLAUDE.md violation, out of scope for this PR. - shouldUseTLL non-adjacent "after did" — low operational risk; tightening the regex was the explicit goal of review #9 and re-broadening risks reintroducing false positives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
scorebackward-compatibleMigration / Compatibility
thresholdrequest field is opt-in and always takes precedence when supplied.similarityThresholdas the default normalized relevance floor.rankingMinSimilarity; simple/medium direct-query candidates below that semantic floor are filtered before injection candidate selection, and importance/recency ranking boosts only apply above the same floor. SQL and in-memory scorers clamp this value to[0,1].similarityThresholdis not a universal post-retrieval floor; temporal/current-state phrasing intentionally keeps recency/current-state recall intact.ranking_scoreis a composite ranking/debug value and is not normalized;relevanceis normalized to[0,1]for filtering.Threshold Semantics
threshold/ servicerelevanceThreshold: per-request normalized injection relevance floor. It wins over config defaults and applies unless omitted.similarityThreshold: fallback normalized relevance floor for unscoped simple/medium non-temporal searches when no request threshold is supplied. It is bypassed for recall-preserving query/source/as-of cases.rankingMinSimilarity: retrieval-profile semantic floor for ranking eligibility. It prevents importance/recency boosts below the floor in SQL/in-memory scoring and filters sub-floor simple/medium direct-query candidates before final selection.Validation
npx tsc --noEmitnpm run buildnpm run generate:openapigit diff --checknpm testnpm test -- dual-write-representationsnpm test -- retrieval-policy retrieval-relevance-regression pgvector-smoke dual-write-representationsnpm test -- memory-search-runtime-config retrieval-policy retrieval-relevance-regression pgvector-smoke dual-write-representationsnpm test -- retrieval-policy retrieval-relevance-regression current-state-ranking search-pipeline-runtime-confignpx fallow audit --health-baseline=.fallow/health-baseline.json --dupes-baseline=.fallow/dupes-baseline.json --base=origin/mainnpm test -- retrieval-relevance-regression retrieval-trace memory-search-runtime-config scoped-dispatch search-pipeline-runtime-config namespace-retrieval response-schema-coverage memory-route-config-overrideLinear Scope
Delivered or ready for review in this repo:
precisionAtK >= 0.8) plus exact fixture exclusion checkssemantic_similarity,ranking_score, andrelevancePaired GTM-1113 SDK PRs:
Not closed by this PR: