Conversation
Pull request overview
This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.
Changes:
- Introduces a `MarkdownChunk` model and updates chunking/output/tests to use `Heading` + `ChunkText`.
- Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
- Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| EssentialCSharp.Chat/Program.cs | Updates chunk stats/output to use MarkdownChunk fields. |
| EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs | Updates assertions for the new chunk model (ChunkText). |
| EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs | Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances. |
| EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs | Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly. |
| EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs | Implements batch embedding + staging-then-swap upload strategy using Npgsql. |
| EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs | Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex. |
| EssentialCSharp.Chat.Shared/Services/AISearchService.cs | Adds heading-based deduplication of vector search results. |
| EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs | Adds ChunkIndex as stored metadata for chunks. |
```csharp
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;
```
The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).
Suggested change:

```diff
- // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
- // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
- // The single-string overload used previously is a convenience extension method
- // that wraps one item and calls this same method.
- var chunkList = bookContents.ToList();
- var texts = chunkList.Select(c => c.ChunkText).ToList();
- GeneratedEmbeddings<Embedding<float>> embeddings =
-     await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);
- if (embeddings.Count != chunkList.Count)
-     throw new InvalidOperationException(
-         $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");
- for (int i = 0; i < chunkList.Count; i++)
- {
-     chunkList[i].TextEmbedding = embeddings[i].Vector;
+ // ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
+ // Azure OpenAI embeddings impose an input-count limit per request.
+ // Process the texts in batches and merge the returned vectors back into the
+ // original chunk list to preserve ordering.
+ var chunkList = bookContents.ToList();
+ var texts = chunkList.Select(c => c.ChunkText).ToList();
+ const int maxEmbeddingBatchSize = 2048;
+ for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
+ {
+     int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
+     List<string> batchTexts = texts.GetRange(batchStart, batchSize);
+     GeneratedEmbeddings<Embedding<float>> embeddings =
+         await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);
+     if (embeddings.Count != batchSize)
+         throw new InvalidOperationException(
+             $"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");
+     for (int i = 0; i < batchSize; i++)
+     {
+         chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
+     }
```
```csharp
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);
```
These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.
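A minimal sketch of the allow-list approach, assuming collection names only ever need lowercase letters, digits, and underscores; the `PgIdentifier`, `IsSafeIdentifier`, and `QuoteIdentifier` names are illustrative, not part of the PR:

```csharp
using System;
using System.Text.RegularExpressions;

static class PgIdentifier
{
    // Accept only lowercase letters, digits, and underscores, starting with a
    // letter or underscore, within PostgreSQL's 63-byte identifier limit.
    private static readonly Regex Safe = new(@"^[a-z_][a-z0-9_]*$");

    public static bool IsSafeIdentifier(string name) =>
        name.Length is > 0 and <= 63 && Safe.IsMatch(name);

    // Reject anything outside the allow-list, then double-quote the identifier.
    // Doubling embedded double quotes is PostgreSQL's quoted-identifier escape
    // rule; it is redundant here (safe names contain no quotes) but kept as
    // defense in depth.
    public static string QuoteIdentifier(string name)
    {
        if (!IsSafeIdentifier(name))
            throw new ArgumentException($"Unsafe collection name: '{name}'", nameof(name));
        return $"\"{name.Replace("\"", "\"\"")}\"";
    }
}
```

With this in place, the swap code would compose its DDL from `PgIdentifier.QuoteIdentifier(collectionName)` and the derived staging/old names rather than interpolating raw strings.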
```csharp
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.
```
The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.
Suggested change:

```diff
- /// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
- ///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
- ///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
- /// 5. Drop the old live backup table.
+ /// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
+ ///    (live → old, staging → live). PostgreSQL ALTER TABLE acquires
+ ///    AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
+ ///    transaction ensures no reader sees an intermediate state.
+ /// 5. Drop the old live backup table with DROP TABLE.
```
```diff
  public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
      IEnumerable<BookContentChunk> bookContents,
      CancellationToken cancellationToken,
      string? collectionName = null)
  {
      collectionName ??= CollectionName;
+     string stagingName = $"{collectionName}_staging";
+     string oldName = $"{collectionName}_old";

-     var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
-     await collection.EnsureCollectionDeletedAsync(cancellationToken);
-     await collection.EnsureCollectionExistsAsync(cancellationToken);
+     if (dataSource is null)
+         throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

-     int uploadedCount = 0;
+     // ── Step 1: Prepare staging collection ────────────────────────────────────────
+     var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
+     await staging.EnsureCollectionDeletedAsync(cancellationToken);
+     await staging.EnsureCollectionExistsAsync(cancellationToken);

-     foreach (var chunk in bookContents)
+     // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
+     // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
+     // The single-string overload used previously is a convenience extension method
+     // that wraps one item and calls this same method.
+     var chunkList = bookContents.ToList();
+     var texts = chunkList.Select(c => c.ChunkText).ToList();

+     GeneratedEmbeddings<Embedding<float>> embeddings =
+         await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

+     if (embeddings.Count != chunkList.Count)
+         throw new InvalidOperationException(
+             $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

+     for (int i = 0; i < chunkList.Count; i++)
+     {
+         chunkList[i].TextEmbedding = embeddings[i].Vector;
+     }

+     // ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
+     await staging.UpsertAsync(chunkList, cancellationToken);
+     Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");
```
GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.
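One way to isolate the swap step, as the comment suggests, is to hide the rename/drop DDL behind a small interface so the embedding and staging logic can be exercised against a test double; the `ICollectionSwapper` and `RecordingSwapper` names and shapes below are illustrative, not part of the PR:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Abstraction over the staging → live swap so tests need no live database.
public interface ICollectionSwapper
{
    Task SwapAsync(string stagingName, string liveName, string oldName,
        CancellationToken cancellationToken);
}

// Test double that records each swap request instead of issuing DDL.
public sealed class RecordingSwapper : ICollectionSwapper
{
    public List<(string Staging, string Live, string Old)> Swaps { get; } = new();

    public Task SwapAsync(string stagingName, string liveName, string oldName,
        CancellationToken cancellationToken)
    {
        Swaps.Add((stagingName, liveName, oldName));
        return Task.CompletedTask;
    }
}
```

EmbeddingService would then take an `ICollectionSwapper` instead of using `NpgsqlDataSource` directly, and a unit test could assert that, after a successful upsert, exactly one swap was requested with the expected table names.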
```csharp
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
    results.Add(result);
    candidates_list.Add(result);
}
```
Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.
Description
Describe your changes here.
Fixes #Issue_Number (if available)
Ensure that your pull request has followed all the steps below: