
feat: Update embedding logic to bulk #1037

Open
BenjaminMichaelis wants to merge 1 commit into main from
bmichaelis/EmbeddingUpdates

Conversation

@BenjaminMichaelis
Member

Description

Describe your changes here.

Fixes #Issue_Number (if available)

Ensure that your pull request has followed all the steps below:

  • Code compilation
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Copilot AI review requested due to automatic review settings April 26, 2026 15:21
Comment on lines +69 to +76
foreach (var raw in fileContent)
{
    var line = raw.Trim();
    var isBlank = string.IsNullOrWhiteSpace(line);
    if (!isBlank || !lastWasBlank)
        normalizedLines.Add(line);
    lastWasBlank = isBlank;
}
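The quoted loop collapses runs of blank lines into a single separator so paragraph boundaries survive preprocessing. A standalone sketch of its effect, using a hypothetical input:

```csharp
using System;
using System.Collections.Generic;

// Standalone sketch of the blank-line normalization quoted above.
// The sample input is hypothetical, chosen to show the collapsing behavior.
var fileContent = new[] { "# Heading", "", "", "", "Paragraph one.", "", "Paragraph two." };
var normalizedLines = new List<string>();
bool lastWasBlank = false;
foreach (var raw in fileContent)
{
    var line = raw.Trim();
    var isBlank = string.IsNullOrWhiteSpace(line);
    if (!isBlank || !lastWasBlank)
        normalizedLines.Add(line);
    lastWasBlank = isBlank;
}
// Three consecutive blank lines collapse to one, so the output keeps exactly
// one empty separator between paragraphs: 5 lines instead of 7.
Console.WriteLine(normalizedLines.Count); // 5
```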
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.

Changes:

  • Introduces a MarkdownChunk model and updates chunking/output/tests to use Heading + ChunkText.
  • Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
  • Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
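The batching idea behind the bulk-embedding change can be sketched independently of the concrete services. The helper below is illustrative (not the PR's actual code) and assumes a hypothetical per-request input cap of 2048:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hedged sketch: split N inputs into provider-safe embedding batches.
// The 2048 cap and the Batch helper are illustrative assumptions.
static IEnumerable<List<T>> Batch<T>(IReadOnlyList<T> items, int maxBatchSize)
{
    for (int start = 0; start < items.Count; start += maxBatchSize)
        yield return items.Skip(start).Take(maxBatchSize).ToList();
}

var texts = Enumerable.Range(0, 5000).Select(i => $"chunk {i}").ToList();
var batches = Batch(texts, 2048).ToList();
Console.WriteLine($"{batches.Count} batches: {string.Join(", ", batches.Select(b => b.Count))}");
// 3 batches: 2048, 2048, 904
```

Because each batch preserves input order, the returned vectors can be merged back into the original chunk list by offset, which is the property the swap workflow relies on.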

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

  • EssentialCSharp.Chat/Program.cs: Updates chunk stats/output to use MarkdownChunk fields.
  • EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs: Updates assertions for the new chunk model (ChunkText).
  • EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs: Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances.
  • EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs: Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly.
  • EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs: Implements batch embedding + staging-then-swap upload strategy using Npgsql.
  • EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs: Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex.
  • EssentialCSharp.Chat.Shared/Services/AISearchService.cs: Adds heading-based deduplication of vector search results.
  • EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs: Adds ChunkIndex as stored metadata for chunks.

Comment on lines +62 to +78
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;

Copilot AI Apr 26, 2026


The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).

Suggested change

Current:

// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;

Suggested:

// ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
// Azure OpenAI embeddings impose an input-count limit per request.
// Process the texts in batches and merge the returned vectors back into the
// original chunk list to preserve ordering.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

const int maxEmbeddingBatchSize = 2048;
for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
{
    int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
    List<string> batchTexts = texts.GetRange(batchStart, batchSize);

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);

    if (embeddings.Count != batchSize)
        throw new InvalidOperationException(
            $"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");

    for (int i = 0; i < batchSize; i++)
    {
        chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
    }

Comment on lines +96 to +107
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

Copilot AI Apr 26, 2026


These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.
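A minimal sketch of the validation half of this suggestion, with an illustrative helper name and a conservative pattern (the 63-character cap mirrors PostgreSQL's default identifier length limit):

```csharp
using System;
using System.Text.RegularExpressions;

// Hedged sketch: restrict a collection name to a conservative identifier
// pattern before it is ever interpolated into DDL. The helper name is
// illustrative; real code could additionally quote the validated name with
// an identifier-quoting helper for consistency.
static string RequireSafeIdentifier(string name)
{
    // Letters, digits, underscore; must not start with a digit; <= 63 chars.
    if (!Regex.IsMatch(name, @"^[A-Za-z_][A-Za-z0-9_]{0,62}$"))
        throw new ArgumentException($"'{name}' is not a safe SQL identifier.");
    return name;
}

Console.WriteLine(RequireSafeIdentifier("book_chunks")); // book_chunks
// RequireSafeIdentifier("x\"; DROP TABLE y;--") would throw ArgumentException.
```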

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +40
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
/// PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
/// LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.

Copilot AI Apr 26, 2026


The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.

Suggested change

Current:

/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.

Suggested:

/// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
///    (live → old, staging → live). PostgreSQL ALTER TABLE acquires
///    AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
///    transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table with DROP TABLE.

Comment on lines +45 to +83
public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
    IEnumerable<BookContentChunk> bookContents,
    CancellationToken cancellationToken,
    string? collectionName = null)
{
    collectionName ??= CollectionName;
    string stagingName = $"{collectionName}_staging";
    string oldName = $"{collectionName}_old";

    var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
    await collection.EnsureCollectionDeletedAsync(cancellationToken);
    await collection.EnsureCollectionExistsAsync(cancellationToken);
    if (dataSource is null)
        throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

    // ── Step 1: Prepare staging collection ────────────────────────────────────────
    var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
    await staging.EnsureCollectionDeletedAsync(cancellationToken);
    await staging.EnsureCollectionExistsAsync(cancellationToken);

    // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
    // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
    // The single-string overload used previously is a convenience extension method
    // that wraps one item and calls this same method.
    var chunkList = bookContents.ToList();
    var texts = chunkList.Select(c => c.ChunkText).ToList();

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

    if (embeddings.Count != chunkList.Count)
        throw new InvalidOperationException(
            $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

    for (int i = 0; i < chunkList.Count; i++)
    {
        chunkList[i].TextEmbedding = embeddings[i].Vector;
    }

    // ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
    await staging.UpsertAsync(chunkList, cancellationToken);
    Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");

Copilot AI Apr 26, 2026


GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.
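One way to build the seam this comment asks for is to inject the swap step as a dependency. The sketch below uses a recording delegate for brevity; the review suggests an interface, and all names here are illustrative rather than the PR's API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical testability seam: the embedding pipeline receives the
// staging→live swap as an injected async callback, so a unit test can
// substitute a fake and assert on the recorded calls without PostgreSQL.
var recordedSwaps = new List<(string Staging, string Live)>();
Func<string, string, CancellationToken, Task> swapAsync =
    (staging, live, ct) =>
    {
        recordedSwaps.Add((staging, live));
        return Task.CompletedTask;
    };

// In a test, the pipeline under test would invoke the injected callback;
// here we call it directly to show what the assertion surface looks like.
await swapAsync("book_chunks_staging", "book_chunks", CancellationToken.None);
Console.WriteLine(recordedSwaps.Count); // 1
```

The same shape works with an interface plus a recording fake; the delegate form just keeps the sketch short.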

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +40 to 44
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
    results.Add(result);
    candidates_list.Add(result);
}

Copilot AI Apr 26, 2026


Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.
