
feat: Update embedding logic to bulk #1037

Open
BenjaminMichaelis wants to merge 1 commit into main from
bmichaelis/EmbeddingUpdates

Conversation

@BenjaminMichaelis
Member

Description

Describe your changes here.

Fixes #Issue_Number (if available)

Ensure that your pull request has followed all the steps below:

  • Code compilation
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Copilot AI review requested due to automatic review settings April 26, 2026 15:21
Comment on lines +69 to +76
foreach (var raw in fileContent)
{
    var line = raw.Trim();
    var isBlank = string.IsNullOrWhiteSpace(line);
    if (!isBlank || !lastWasBlank)
        normalizedLines.Add(line);
    lastWasBlank = isBlank;
}
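The quoted loop collapses runs of blank lines into a single separator so paragraph boundaries survive preprocessing. A standalone sketch of its effect, using a hypothetical input:

```csharp
using System;
using System.Collections.Generic;

// Standalone sketch of the blank-line normalization quoted above.
// The sample input is hypothetical, chosen to show the collapsing behavior.
var fileContent = new[] { "# Heading", "", "", "", "Paragraph one.", "", "Paragraph two." };
var normalizedLines = new List<string>();
bool lastWasBlank = false;
foreach (var raw in fileContent)
{
    var line = raw.Trim();
    var isBlank = string.IsNullOrWhiteSpace(line);
    if (!isBlank || !lastWasBlank)
        normalizedLines.Add(line);
    lastWasBlank = isBlank;
}
// Three consecutive blank lines collapse to one, so the output keeps exactly
// one empty separator between paragraphs: 5 lines instead of 7.
Console.WriteLine(normalizedLines.Count); // 5
```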
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors markdown chunk handling and embedding generation to support bulk embedding and a staging-table swap workflow intended to keep the live vector collection available during rebuilds.

Changes:

  • Introduces a MarkdownChunk model and updates chunking/output/tests to use Heading + ChunkText.
  • Adjusts markdown preprocessing to preserve paragraph separators (blank lines) for paragraph-aware chunking.
  • Updates embedding upload to batch-generate embeddings and load into a staging collection before swapping it into place in PostgreSQL.
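The batching idea behind the bulk-embedding change can be sketched independently of the concrete services. The helper below is illustrative (not the PR's actual code) and assumes a hypothetical per-request input cap of 2048:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hedged sketch: split N inputs into provider-safe embedding batches.
// The 2048 cap and the Batch helper are illustrative assumptions.
static IEnumerable<List<T>> Batch<T>(IReadOnlyList<T> items, int maxBatchSize)
{
    for (int start = 0; start < items.Count; start += maxBatchSize)
        yield return items.Skip(start).Take(maxBatchSize).ToList();
}

var texts = Enumerable.Range(0, 5000).Select(i => $"chunk {i}").ToList();
var batches = Batch(texts, 2048).ToList();
Console.WriteLine($"{batches.Count} batches: {string.Join(", ", batches.Select(b => b.Count))}");
// 3 batches: 2048, 2048, 904
```

Because each batch preserves input order, the returned vectors can be merged back into the original chunk list by offset, which is the property the swap workflow relies on.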

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

  • EssentialCSharp.Chat/Program.cs: Updates chunk stats/output to use MarkdownChunk fields.
  • EssentialCSharp.Chat.Tests/MarkdownChunkingServiceTests.cs: Updates assertions for the new chunk model (ChunkText).
  • EssentialCSharp.Chat.Shared/Services/MarkdownChunkingService.cs: Preserves blank lines for paragraph-aware splitting and emits MarkdownChunk instances.
  • EssentialCSharp.Chat.Shared/Services/FileChunkingResult.cs: Adds MarkdownChunk record and changes FileChunkingResult.Chunks type accordingly.
  • EssentialCSharp.Chat.Shared/Services/EmbeddingService.cs: Implements batch embedding + staging-then-swap upload strategy using Npgsql.
  • EssentialCSharp.Chat.Shared/Services/ChunkingResultExtensions.cs: Updates conversion to BookContentChunk using MarkdownChunk and adds deterministic IDs + ChunkIndex.
  • EssentialCSharp.Chat.Shared/Services/AISearchService.cs: Adds heading-based deduplication of vector search results.
  • EssentialCSharp.Chat.Shared/Models/BookContentChunk.cs: Adds ChunkIndex as stored metadata for chunks.

Comment on lines +62 to +78
// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;

Copilot AI Apr 26, 2026


The embedding generator call batches all chunks into a single request, but Azure OpenAI embeddings have an input-count limit (noted in the comment as 2048). If the book produces more than that many chunks, this will fail at runtime. Consider chunking texts into batches (<= provider limit) and merging the returned vectors back into chunkList (or throw a clear exception when the limit is exceeded).

Suggested change

Current:

// ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
// IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
// The single-string overload used previously is a convenience extension method
// that wraps one item and calls this same method.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

GeneratedEmbeddings<Embedding<float>> embeddings =
    await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

if (embeddings.Count != chunkList.Count)
    throw new InvalidOperationException(
        $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

for (int i = 0; i < chunkList.Count; i++)
{
    chunkList[i].TextEmbedding = embeddings[i].Vector;

Suggested:

// ── Step 2: Batch-embed all chunks in provider-safe API calls ─────────────────
// Azure OpenAI embeddings impose an input-count limit per request.
// Process the texts in batches and merge the returned vectors back into the
// original chunk list to preserve ordering.
var chunkList = bookContents.ToList();
var texts = chunkList.Select(c => c.ChunkText).ToList();

const int maxEmbeddingBatchSize = 2048;
for (int batchStart = 0; batchStart < texts.Count; batchStart += maxEmbeddingBatchSize)
{
    int batchSize = Math.Min(maxEmbeddingBatchSize, texts.Count - batchStart);
    List<string> batchTexts = texts.GetRange(batchStart, batchSize);

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(batchTexts, cancellationToken: cancellationToken);

    if (embeddings.Count != batchSize)
        throw new InvalidOperationException(
            $"Embedding count mismatch for batch starting at index {batchStart}: expected {batchSize}, got {embeddings.Count}.");

    for (int i = 0; i < batchSize; i++)
    {
        chunkList[batchStart + i].TextEmbedding = embeddings[i].Vector;
    }

Comment on lines +96 to +107
// Drop any leftover backup from a previous run
cmd.CommandText = $"DROP TABLE IF EXISTS \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename live → old. IF EXISTS is a no-op on first run when no live table exists.
// Using ALTER TABLE IF EXISTS avoids PL/pgSQL string interpolation entirely.
cmd.CommandText = $"ALTER TABLE IF EXISTS \"{collectionName}\" RENAME TO \"{oldName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

// Rename staging → live
cmd.CommandText = $"ALTER TABLE \"{stagingName}\" RENAME TO \"{collectionName}\"";
await cmd.ExecuteNonQueryAsync(cancellationToken);

Copilot AI Apr 26, 2026


These SQL statements interpolate collectionName/derived names directly into identifier-quoted SQL. If collectionName can be influenced outside trusted code, this becomes identifier-injection (quotes can be escaped/broken). Consider restricting collectionName to a safe identifier regex (e.g., letters/digits/underscore) before composing SQL, and use Npgsql's identifier-quoting helper to build the final identifiers consistently.
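A minimal sketch of the validation half of this suggestion, with an illustrative helper name and a conservative pattern (the 63-character cap mirrors PostgreSQL's default identifier length limit):

```csharp
using System;
using System.Text.RegularExpressions;

// Hedged sketch: restrict a collection name to a conservative identifier
// pattern before it is ever interpolated into DDL. The helper name is
// illustrative; real code could additionally quote the validated name with
// an identifier-quoting helper for consistency.
static string RequireSafeIdentifier(string name)
{
    // Letters, digits, underscore; must not start with a digit; <= 63 chars.
    if (!Regex.IsMatch(name, @"^[A-Za-z_][A-Za-z0-9_]{0,62}$"))
        throw new ArgumentException($"'{name}' is not a safe SQL identifier.");
    return name;
}

Console.WriteLine(RequireSafeIdentifier("book_chunks")); // book_chunks
// RequireSafeIdentifier("x\"; DROP TABLE y;--") would throw ArgumentException.
```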

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +40
/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
/// PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
/// LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.

Copilot AI Apr 26, 2026


The method documentation says the swap uses "three SQL RENAMEs", but the implementation performs two RENAME operations (live→old, staging→live) plus DROP TABLE statements. Update the comments so they accurately describe the actual DDL being executed.

Suggested change

Current:

/// 4. Atomically swap staging → live via three SQL RENAMEs in a single transaction.
///    PostgreSQL ALTER TABLE acquires AccessExclusiveLock automatically; no explicit
///    LOCK TABLE is needed. The transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table.

Suggested:

/// 4. Atomically swap tables in a single transaction using two SQL RENAME operations
///    (live → old, staging → live). PostgreSQL ALTER TABLE acquires
///    AccessExclusiveLock automatically; no explicit LOCK TABLE is needed. The
///    transaction ensures no reader sees an intermediate state.
/// 5. Drop the old live backup table with DROP TABLE.

Comment on lines +45 to +83
public async Task GenerateBookContentEmbeddingsAndUploadToVectorStore(
    IEnumerable<BookContentChunk> bookContents,
    CancellationToken cancellationToken,
    string? collectionName = null)
{
    collectionName ??= CollectionName;
    string stagingName = $"{collectionName}_staging";
    string oldName = $"{collectionName}_old";

    var collection = vectorStore.GetCollection<string, BookContentChunk>(collectionName);
    await collection.EnsureCollectionDeletedAsync(cancellationToken);
    await collection.EnsureCollectionExistsAsync(cancellationToken);
    if (dataSource is null)
        throw new InvalidOperationException("NpgsqlDataSource is required for the staging swap. Ensure it is registered in DI.");

    // ── Step 1: Prepare staging collection ────────────────────────────────────────
    var staging = vectorStore.GetCollection<string, BookContentChunk>(stagingName);
    await staging.EnsureCollectionDeletedAsync(cancellationToken);
    await staging.EnsureCollectionExistsAsync(cancellationToken);

    // ── Step 2: Batch-embed all chunks in a single API call ───────────────────────
    // IEmbeddingGenerator.GenerateAsync natively accepts IEnumerable<string>.
    // The single-string overload used previously is a convenience extension method
    // that wraps one item and calls this same method.
    var chunkList = bookContents.ToList();
    var texts = chunkList.Select(c => c.ChunkText).ToList();

    GeneratedEmbeddings<Embedding<float>> embeddings =
        await embeddingGenerator.GenerateAsync(texts, cancellationToken: cancellationToken);

    if (embeddings.Count != chunkList.Count)
        throw new InvalidOperationException(
            $"Embedding count mismatch: expected {chunkList.Count}, got {embeddings.Count}.");

    for (int i = 0; i < chunkList.Count; i++)
    {
        chunkList[i].TextEmbedding = embeddings[i].Vector;
    }

    // ── Step 3: Batch-upsert all chunks into staging ──────────────────────────────
    await staging.UpsertAsync(chunkList, cancellationToken);
    Console.WriteLine($"Uploaded {chunkList.Count} chunks to staging collection '{stagingName}'.");

Copilot AI Apr 26, 2026


GenerateBookContentEmbeddingsAndUploadToVectorStore now contains significant new behavior (batch embedding + staging/swap workflow) but there are no unit/integration tests covering it in the Chat test suite. Adding tests that at least verify batching semantics (e.g., embedding generator called with N inputs and staging collection UpsertAsync invoked once with embeddings assigned) would help prevent regressions; the swap step can be isolated behind an interface to make it testable without a live database.
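One way to build the seam this comment asks for is to inject the swap step as a dependency. The sketch below uses a recording delegate for brevity; the review suggests an interface, and all names here are illustrative rather than the PR's API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical testability seam: the embedding pipeline receives the
// staging→live swap as an injected async callback, so a unit test can
// substitute a fake and assert on the recorded calls without PostgreSQL.
var recordedSwaps = new List<(string Staging, string Live)>();
Func<string, string, CancellationToken, Task> swapAsync =
    (staging, live, ct) =>
    {
        recordedSwaps.Add((staging, live));
        return Task.CompletedTask;
    };

// In a test, the pipeline under test would invoke the injected callback;
// here we call it directly to show what the assertion surface looks like.
await swapAsync("book_chunks_staging", "book_chunks", CancellationToken.None);
Console.WriteLine(recordedSwaps.Count); // 1
```

The same shape works with an interface plus a recording fake; the delegate form just keeps the sketch short.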

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +40 to 44
var candidates_list = new List<VectorSearchResult<BookContentChunk>>();
await foreach (var result in collection.SearchAsync(searchVector, options: vectorSearchOptions, top: candidates, cancellationToken: cancellationToken))
{
    results.Add(result);
    candidates_list.Add(result);
}

Copilot AI Apr 26, 2026


Local variable name candidates_list uses snake_case, which is inconsistent with the surrounding camelCase naming in this method (e.g., vectorSearchOptions, searchVector). Rename to candidatesList (or similar) for consistency and readability.
