Commit 96e731b

reakaleek and claude authored
ContentDateEnrichment: Filter _update_by_query to only unresolved documents (#3118)
* ContentDateEnrichment: Filter _update_by_query to only unresolved documents

The _update_by_query in ResolveContentDatesAsync was re-indexing every document in both the lexical and semantic indices. On the semantic index, this triggered ML inference for all 6 semantic_text fields on every document, causing the deploy workflow to hang for 3+ hours.

After HashedBulkUpdate, unchanged documents (noop) retain their resolved content_last_updated from the previous run. Only new/changed documents have the field at the default DateTimeOffset.MinValue (0001-01-01). The filter restricts _update_by_query to only these unresolved documents, reducing the typical deploy from hundreds of thousands of documents to just the changed ones.

Also enhances integration tests to use real HashedBulkUpdate-style scripted upserts with full DocumentationDocument serialization, and adds tests proving the filter behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Simplify: Clean up review findings

- Extract query JsonObject to static readonly string (avoid re-allocation per call)
- Remove debug output.WriteLine from discovery test
- Fix double JsonNode.Parse: reuse parsed node for params.doc
- Use const for hash field name
- Fix misleading comment on doc1 in FilteredResolve test
- Remove unnecessary null-forgiving operators

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Tests: Assert bulk response has no item-level errors

The existing check only verified HTTP status code, which can be 200 even when individual bulk items fail. Parse the response body and assert "errors": false.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Tests: Fail explicitly on missing or malformed bulk errors field

Assert bulkResult and its "errors" property exist before checking the boolean value, rather than defaulting to false on missing/malformed JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
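
The test changes themselves are not shown on this page. As a rough illustration of the last two commit-message items, here is a minimal sketch of a strict check on a bulk response body. The class name BulkResponseCheck and the method AssertNoItemErrors are hypothetical and are not taken from the repository; only System.Text.Json.Nodes calls are used.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;

// Hypothetical helper, not the repository's test code.
public static class BulkResponseCheck
{
    // Throws unless the body is a JSON object whose "errors" property is the boolean false.
    public static void AssertNoItemErrors(string responseBody)
    {
        JsonNode? bulkResult;
        try
        {
            bulkResult = JsonNode.Parse(responseBody);
        }
        catch (JsonException e)
        {
            throw new InvalidOperationException("Bulk response is not valid JSON.", e);
        }

        // A missing or non-object body is a failure, not an implicit "no errors".
        if (bulkResult is not JsonObject obj)
            throw new InvalidOperationException("Bulk response is not a JSON object.");

        // The property must exist and must not be JSON null.
        if (!obj.TryGetPropertyValue("errors", out var errorsNode) || errorsNode is null)
            throw new InvalidOperationException("Bulk response has no \"errors\" property.");

        // The property must be a real boolean, not a string or number.
        if (errorsNode is not JsonValue value || !value.TryGetValue<bool>(out var hasErrors))
            throw new InvalidOperationException("Bulk \"errors\" property is not a boolean.");

        if (hasErrors)
            throw new InvalidOperationException("Bulk response reported item-level errors.");
    }
}
```

A response such as {"errors":false,"items":[]} passes, while {"items":[]} or {"errors":true,"items":[]} throws even though the HTTP status for both would typically be 200.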
1 parent 5c018e0 commit 96e731b

2 files changed

Lines changed: 271 additions & 65 deletions

src/Elastic.Markdown/Exporters/Elasticsearch/ContentDateEnrichment.cs

Lines changed: 25 additions & 1 deletion
@@ -77,11 +77,35 @@ public async Task SyncLookupIndexAsync(string lexicalAlias, Cancel ct)
 	/// the pipeline compares each document's content_hash against the lookup from
 	/// the previous run and either preserves the old date or stamps a new one.
 	/// </summary>
+	// Only process documents that don't already have a valid content_last_updated.
+	// After HashedBulkUpdate: unchanged docs (noop) retain their resolved date from the
+	// previous run; new/changed docs have the field missing or at DateTimeOffset.MinValue
+	// (0001-01-01). This filter avoids re-indexing the entire index, which on the semantic
+	// index would trigger expensive semantic_text inference for every document.
+	private static readonly string UnresolvedContentDatesQuery = new JsonObject
+	{
+		["query"] = new JsonObject
+		{
+			["bool"] = new JsonObject
+			{
+				["must_not"] = new JsonArray(
+					new JsonObject
+					{
+						["range"] = new JsonObject
+						{
+							["content_last_updated"] = new JsonObject { ["gt"] = "1970-01-01T00:00:00Z" }
+						}
+					}
+				)
+			}
+		}
+	}.ToJsonString();
+
 	public async Task ResolveContentDatesAsync(string indexAlias, Cancel ct)
 	{
 		logger.LogInformation("Resolving content dates in {Index} via pipeline {Pipeline}", indexAlias, PipelineName);
 
-		await operations.UpdateByQueryAsync(indexAlias, PostData.Empty, PipelineName, ct);
+		await operations.UpdateByQueryAsync(indexAlias, PostData.String(UnresolvedContentDatesQuery), PipelineName, ct);
 
 		logger.LogInformation("Content date resolution complete for {Index}", indexAlias);
 	}
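
To make the filter's behavior concrete, the following is a small self-contained sketch, not code from the repository. BuildQuery rebuilds the same request body as the diff, and IsUnresolved is a hypothetical in-memory predicate that mirrors how a must_not range query with gt set to the Unix epoch selects documents: a document matches when the field is missing or its value is at or before 1970-01-01T00:00:00Z. The class and method names are assumptions made for illustration.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical illustration of the filter semantics; names are not from the repository.
public static class UnresolvedFilterSketch
{
    // Same body as UnresolvedContentDatesQuery in the diff, serialized compactly.
    public static string BuildQuery() => new JsonObject
    {
        ["query"] = new JsonObject
        {
            ["bool"] = new JsonObject
            {
                ["must_not"] = new JsonArray(
                    new JsonObject
                    {
                        ["range"] = new JsonObject
                        {
                            ["content_last_updated"] = new JsonObject { ["gt"] = "1970-01-01T00:00:00Z" }
                        }
                    }
                )
            }
        }
    }.ToJsonString();

    // In-memory equivalent of the filter: a range query never matches a missing field,
    // so must_not selects documents with no value or a value at or before the epoch.
    public static bool IsUnresolved(DateTimeOffset? contentLastUpdated) =>
        contentLastUpdated is null || contentLastUpdated.Value <= DateTimeOffset.UnixEpoch;

    public static void Main()
    {
        Console.WriteLine(BuildQuery());
        Console.WriteLine(IsUnresolved(null));                     // missing field: selected
        Console.WriteLine(IsUnresolved(DateTimeOffset.MinValue));  // default 0001-01-01: selected
        Console.WriteLine(IsUnresolved(new DateTimeOffset(2024, 1, 1, 0, 0, 0, TimeSpan.Zero))); // resolved: skipped
    }
}
```

Using the epoch rather than 0001-01-01 as the threshold means any placeholder date at or before 1970 is treated as unresolved, while every realistic resolved date is skipped and therefore not re-indexed.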