
feat(infra): LiteLLM unified gateway for multi-provider LLM routing #19

Draft
moralespanitz wants to merge 17 commits into main from feat/litellm-unified-gateway
Conversation

@moralespanitz (Contributor)

Summary

Adds an opt-in LiteLLM proxy sidecar so AtomicMemory can route LLM calls to Anthropic, OpenAI, Microsoft Foundry / Azure, AWS Bedrock, or Google Gemini through a single OpenAI-compatible endpoint. Provider swap is config-only — no new code path in src/services/llm.ts.

The existing LLM_PROVIDER=openai-compatible lane already accepts LLM_API_URL + LLM_API_KEY (see config.ts → services/llm.ts → OpenAICompatibleLLM). Pointing it at a LiteLLM proxy reuses that seam and keeps cost telemetry, AUDN-timeout, retry, and per-request config_override behavior unchanged.
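For context, the seam looks roughly like this (a sketch assuming the adapter wraps the OpenAI Node SDK pointed at the proxy, which is the JS path LiteLLM recommends; the real code is `OpenAICompatibleLLM` in src/services/llm.ts):

```typescript
import OpenAI from "openai";

// Sketch of the existing openai-compatible lane; env-var names follow
// the Activation section below, adapter internals are simplified.
const client = new OpenAI({
  baseURL: process.env.LLM_API_URL, // e.g. the LiteLLM proxy at http://localhost:4000
  apiKey: process.env.LLM_API_KEY,  // the LiteLLM master key
});

const completion = await client.chat.completions.create({
  model: process.env.LLM_MODEL ?? "anthropic-haiku-4-5", // any model_name alias
  messages: [{ role: "user", content: "ping" }],
});
console.log(completion.choices[0]?.message.content);
```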

Why Pattern A (proxy + existing adapter)

| Pattern | Outcome |
| --- | --- |
| A. Proxy + openai-compatible adapter | Chosen. LiteLLM has no maintained TypeScript SDK; the recommended JS path is the OpenAI Node SDK pointed at the proxy URL. Our adapter already does that. |
| B. Direct LiteLLM SDK | Rejected. No TS SDK. |
| C. Hybrid | Not warranted. Cost-telemetry parity is a small follow-up, not a blocker. |

What ships (no src/ changes)

| Path | Purpose |
| --- | --- |
| docker/litellm/litellm-config.yaml | model_list for Anthropic (Haiku 4.5, Sonnet 4.6), OpenAI (gpt-5-chat, gpt-4o-mini), Foundry (gpt-5-chat via azure/), Bedrock (Claude Sonnet), Gemini (1.5-pro). Provider keys via os.environ/VAR_NAME. Example entry below. |
| docker/litellm/docker-compose.litellm.yml | Pinned ghcr.io/berriai/litellm:main-stable sidecar on port 4000. Explicit name: atomicmemory-litellm so the project never collides with another litellm/-named compose stack. |
| docker/litellm/README.md | Quick start, env-var table per provider, switching providers at runtime, cost-telemetry caveats. |
| docker/litellm/.env.example | Credential template. |
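For reference, one model_list entry in LiteLLM's config format (the alias matches the activation example below; the provider-prefixed model id is illustrative, and `os.environ/VAR_NAME` is LiteLLM's env-var indirection):

```yaml
model_list:
  - model_name: anthropic-haiku-4-5          # alias the core app sets as LLM_MODEL
    litellm_params:
      model: anthropic/claude-haiku-4-5      # provider-prefixed id (illustrative)
      api_key: os.environ/ANTHROPIC_API_KEY  # resolved at request time
```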

Activation

```bash
# atomicmemory-core/.env
LLM_PROVIDER=openai-compatible
LLM_API_URL=http://localhost:4000
LLM_API_KEY=$LITELLM_MASTER_KEY
LLM_MODEL=anthropic-haiku-4-5    # or any model_name from litellm-config.yaml

# Proxy creds
cp docker/litellm/.env.example docker/litellm/.env  # fill in keys you have
docker compose -f docker/litellm/docker-compose.litellm.yml up -d
```

Smoke results

| Provider | Status | Notes |
| --- | --- | --- |
| Anthropic Haiku 4.5 | PASS | 1.95s, 23 in / 14 out tokens, ~$0.00009. Coherent answer. |
| OpenAI / Foundry / Bedrock / Gemini | Config-load PASS | No live calls (no creds in this environment); proxy startup log lists all 7 aliases under `Set models:`. |

Total smoke spend: ~$0.00009 of the $0.05 cap.
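For reference, the smoke call is plain OpenAI wire format against the proxy (a sketch: alias from litellm-config.yaml, standard Bearer auth with the master key):

```typescript
// Minimal smoke call through the proxy (model alias from the config).
const res = await fetch("http://localhost:4000/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.LITELLM_MASTER_KEY}`,
  },
  body: JSON.stringify({
    model: "anthropic-haiku-4-5",
    messages: [{ role: "user", content: "Reply with one word: pong" }],
  }),
});
const data = await res.json();
console.log(res.status, data.choices?.[0]?.message?.content);
```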

Pre-commit checks

  • npx tsc --noEmit — pass
  • fallow audit (husky pre-commit, against origin/main) — pass, no new findings
  • npm test — skipped, no source changes

Known limitations / follow-up

  • Cost telemetry: per-provider rates aren't perfectly mirrored across providers behind LiteLLM. The proxy emits x-litellm-response-cost; wiring cost-telemetry.ts to read it is a one-file follow-up.
  • Foundry + Entra ID: LiteLLM azure/ requires a static API key. For Entra-only deployments, keep using the direct path (atomicmemory-benchmarks/data/exp-cr-mini/foundry-client.ts) and route the rest through LiteLLM.
  • Streaming: proxy supports it; core call sites don't stream today, so irrelevant.

Test plan

  • docker compose -f docker/litellm/docker-compose.litellm.yml up -d brings up healthy on port 4000
  • curl http://localhost:4000/health/liveliness returns 200
  • With LLM_PROVIDER=openai-compatible + LLM_API_URL=http://localhost:4000 + LLM_MODEL=anthropic-haiku-4-5, a smoke ingest call succeeds end-to-end
  • Switching LLM_MODEL=anthropic-sonnet-4-6 and restarting requires no other config changes
  • docker compose -f docker/litellm/docker-compose.litellm.yml down cleans up

Adds the TLL primitive — a per-entity sparse graph of event nodes with
predecessor/successor edges. Each new memory referencing an entity
appends an event node to that entity's chain; the predecessor pointer
allows traversal of the chain backward at query time.

Targets the abilities that current SOTA architectures admit they don't
crack at scale: temporal reasoning (TR), event ordering (EO), and
multi-session reasoning (MSR). These require higher-order
representations of how events relate across time — fact-level and
entity-level matching are insufficient on their own.

Implementation:
  - schema.sql: new table temporal_linkage_list with composite uniqueness
    on (user_id, entity_id, memory_id), predecessor pointer,
    position_in_chain, and supporting indexes on (user_id, entity_id,
    position_in_chain) and (memory_id).
  - db/repository-tll.ts: TllRepository with append() (idempotent batch
    insert with predecessor wiring) and chain()/chainsFor() readers.
  - services/memory-storage.ts: append after entity link in
    resolveAndLinkEntities (best-effort, fire-and-forget — keeps the
    ingest hot path fast).
  - services/tll-retrieval.ts: shouldUseTLL() regex gate over ordering/
    temporal query phrasing, plus expandViaTLL() helper that takes the
    top-N retrieval candidates, finds their linked entities, and pulls
    the entities' chain memory_ids.
  - services/memory-search.ts: when the gate fires, expand the candidate
    set with chain members (deterministic chain-traversal augmentation,
    not a similarity search). Fail-open — chain expansion errors don't
    block primary retrieval.
  - app/runtime-container.ts: instantiate TllRepository when entity
    graph is enabled.
  - services/memory-service-types.ts + memory-service.ts: thread the
    repository through MemoryServiceDeps as an optional null-able field.

Read-only retrieval augmentation; AUDN/ingest behaviour unchanged.
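
A shape sketch of the retrieval-side helper described above (names
from this message; the repository calls are injected here because the
real implementations live in db/repository-tll.ts):

```typescript
// Shape sketch only — the real helper is expandViaTLL in
// services/tll-retrieval.ts; dependencies are injected for illustration.
type ChainEvent = { memoryId: string; positionInChain: number };

const TLL_SEED_CANDIDATE_COUNT = 10; // named constant from a later commit

async function expandViaTLL(
  userId: string,
  candidateIds: string[],
  entitiesForMemories: (userId: string, ids: string[]) => Promise<string[]>,
  chainsFor: (userId: string, entityIds: string[]) => Promise<ChainEvent[][]>,
): Promise<string[]> {
  const seeds = candidateIds.slice(0, TLL_SEED_CANDIDATE_COUNT); // top-N seeds
  const entityIds = await entitiesForMemories(userId, seeds);    // linked entities
  const chains = await chainsFor(userId, entityIds);             // per-entity chains
  return chains.flat().map((event) => event.memoryId);           // chain members
}
```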
…rimitive

Introduces a new memory primitive — first-mention events — distinct from
both atomic facts (claims) and memories (ingested chunks). For a given
conversation, captures the first turn at which each topic is brought up.
The grain matches event-ordering rubrics that ask "in what order did the
user bring up these aspects."

Caller-driven extraction (no in-core ingest hook): the in-core ingest
pipeline does not retain turn structure (it extracts atomic facts from
chunks, not turns). External callers that know the turn structure
supply a turn-id-to-memory-id mapping via the public API. This keeps
extraction explicit and the core ingest pipeline unchanged. An
automatic post-write hook is deferred until a core-side notion of "turn"
exists.

Implementation:
  - schema.sql: new table first_mention_events with `(user_id, memory_id)`
    unique constraint for idempotent re-extraction. Indexed on
    `(user_id, position_in_conversation)` and on `topic` via GIN.
  - db/repository-first-mentions.ts: FirstMentionRepository with
    `store()`, `getByMemoryId()`, `list()`. Mirrors the TllRepository
    pattern.
  - services/first-mention-service.ts: FirstMentionService with
    `extractAndStore(userId, conversationText, sourceSite,
    memoryIdsByTurnId)`. One LLM call via an injected ChatFn; salvage
    parser tolerates truncated JSON; loose LLM output is mapped to the
    strict FirstMentionEvent schema.
  - app/runtime-container.ts: instantiate the repository + service. The
    ChatFn adapter wraps the configured `llm.chat` singleton (per-call
    cost is tracked inside `llm.chat`).
  - services/memory-service-types.ts + memory-service.ts: thread the
    service through MemoryServiceDeps as an optional null-able field.

The HTTP endpoint that exposes this primitive is added in a follow-up
commit alongside the TLL read endpoint.
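
The salvage step can be sketched like this (illustrative only; the
real parser lives in services/first-mention-service.ts and maps the
loose output through the strict FirstMentionEvent schema):

```typescript
// Illustrative salvage shape. If the model's JSON array is truncated
// mid-object, trim back to the last complete element and close the array.
function salvageJsonArray(raw: string): unknown[] {
  const start = raw.indexOf("[");
  if (start === -1) return []; // garbage text: fail open with no events
  const text = raw.slice(start);
  try {
    return JSON.parse(text) as unknown[];
  } catch {
    const lastBrace = text.lastIndexOf("}");
    if (lastBrace === -1) return [];
    try {
      // Close the array after the last complete object.
      return JSON.parse(text.slice(0, lastBrace + 1) + "]") as unknown[];
    } catch {
      return [];
    }
  }
}
```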
Exposes the first-mention-events and Temporal Linkage List primitives
through public HTTP endpoints. Both primitives existed inside the runtime
container (previous commits) but had no callable surface; this commit
adds the routes, schemas, public service methods, and tests.

## TLL EO read endpoint

  - `GET /v1/memories/event-chains?user_id=X&entity_ids=Y,Z` returns
    `{ chains: [{ entity_id, events: [...] }] }` (example call after
    this list). `entity_ids` is comma-separated, trimmed, deduplicated,
    and validated as UUIDs.
  - `db/repository-tll.ts` adds `chainEventsForEntities(userId,
    entityIds)` — enriched events joined with memory content
    (memoryId, content, observationDate, positionInChain,
    predecessorMemoryId). Soft-deleted memories are filtered out;
    entities with no events are dropped.
  - `services/memory-service.ts` adds public `getEventChains()` wrapper.
  - `schemas/memories.ts`: `EventChainsQuerySchema`.
  - `schemas/responses.ts`: `EventChainsResponseSchema`.
  - `routes/response-schema-map.ts`: corresponding entry.
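
A hypothetical call against this endpoint (UUIDs truncated):

```typescript
const qs = new URLSearchParams({
  user_id: "3f8e…",
  entity_ids: "a1b2…,c3d4…", // comma-separated; server trims + dedupes
});
const body = await fetch(`/v1/memories/event-chains?${qs}`).then((r) => r.json());
// body.chains: [{ entity_id, events: [{ memoryId, content, observationDate,
//   positionInChain, predecessorMemoryId }, ...] }]
```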

## First-mention extract endpoint

  - `POST /v1/memories/first-mentions/extract` body:
    `{ user_id, conversation_text, source_site, memory_ids_by_turn_id }`
    where `memory_ids_by_turn_id` is `{ "0": "uuid", "5": "uuid", ... }`
    (object form because JSON has no Map; example call after this
    list). Returns `{ events: [...] }`.
  - `services/memory-service.ts` adds public `extractFirstMentions()`
    wrapper around `FirstMentionService.extractAndStore()`.
  - `schemas/memories.ts`: `FirstMentionsExtractBodySchema` (transforms
    the object into a `Map<number, string>`).
  - `schemas/responses.ts`: `FirstMentionsExtractResponseSchema`.
  - `routes/response-schema-map.ts`: corresponding entry.
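
A hypothetical call against this endpoint (ids are placeholders):

```typescript
const res = await fetch("/v1/memories/first-mentions/extract", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    user_id: "3f8e…",
    conversation_text: "turn 0: …\nturn 5: …",
    source_site: "example-site",
    memory_ids_by_turn_id: { "0": "uuid-a…", "5": "uuid-b…" }, // object, not Map
  }),
});
// 200 with { events: [...] }, or 400 on validation failure
```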

Both endpoints fail closed (404/400 on validation) and are additive —
no existing behaviour changes.

## Tests

  - `services/__tests__/tll-retrieval.test.ts` — 26 cases covering
    `shouldUseTLL` regex behaviour (positive + negative + case
    insensitivity), `entitiesForMemories` SQL-shape verification, and
    `expandViaTLL` call ordering / 10-id slice / userId pass-through.
  - `db/__tests__/repository-tll.test.ts` — 13 integration tests
    against test Postgres covering `append` idempotency + predecessor
    wiring, `chain` and `chainsFor` ordering, and
    `chainEventsForEntities` enriched-join + soft-delete filtering.
  - `services/__tests__/first-mention-service.test.ts` — 9 unit tests
    covering happy path, salvage of truncated JSON, garbage-text
    fallback, non-array JSON, chatFn throw, missing `memoryId` mapping
    drop, schema validation drop, anchor_date parsing
    (valid/invalid/null), ascending sort. No DB required.

The previously-needed fallow suppression on `entitiesForMemories` is
removed now that the unit test consumes it.
…eview #1)

Replace the read-then-insert pattern in TllRepository.append() with an
atomic INSERT...SELECT serialized by pg_advisory_xact_lock keyed on
(user_id, entity_id). Concurrent appends to the same entity chain now
serialize on the lock and compute predecessor + position from the latest
committed row, eliminating the TOCTOU race where two parallel callers
read the same MAX(position_in_chain) and both wrote at tip+1.

Add a UNIQUE (user_id, entity_id, position_in_chain) index as
defense-in-depth so any future code path that bypasses the lock fails
loudly at the DB layer rather than silently producing duplicate
positions.
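
A sketch of that shape, assuming node-postgres; the SQL is
illustrative (the real statement is TllRepository.append()), and
hashtextextended is one way to fold the (user_id, entity_id) pair
into the advisory lock's bigint key:

```typescript
import { Pool } from "pg";

// Sketch of the serialized append. The advisory lock is
// transaction-scoped, so it releases automatically at COMMIT/ROLLBACK.
async function appendSerialized(
  pool: Pool,
  userId: string,
  entityId: string,
  memoryId: string,
): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Serialize concurrent appends to the same (user_id, entity_id) chain.
    await client.query(
      "SELECT pg_advisory_xact_lock(hashtextextended($1 || ':' || $2, 0))",
      [userId, entityId],
    );
    // Atomic INSERT...SELECT: predecessor + position come from the latest
    // committed tip, read under the lock, so no read-then-insert race.
    await client.query(
      `INSERT INTO temporal_linkage_list
         (user_id, entity_id, memory_id, predecessor_memory_id, position_in_chain)
       SELECT $1, $2, $3, tip.memory_id, COALESCE(tip.position_in_chain + 1, 0)
       FROM (VALUES (1)) AS seed(n)
       LEFT JOIN LATERAL (
         SELECT memory_id, position_in_chain
         FROM temporal_linkage_list
         WHERE user_id = $1 AND entity_id = $2
         ORDER BY position_in_chain DESC
         LIMIT 1
       ) AS tip ON true
       ON CONFLICT (user_id, entity_id, memory_id) DO NOTHING`,
      [userId, entityId, memoryId],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```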

Add an integration test that fires three concurrent appends to the same
chain via Promise.all and asserts positions 0,1,2 with correctly wired
predecessor pointers.
The previous flow ran TLL chain augmentation inside executeSearchStep
and tagged hydrated rows with `similarity: 0.5` so they could pass
through applySearchRelevanceFilter. That magic constant either filtered
chain rows out (when the threshold was higher) or polluted ranking with
a meaningless score.

Move TLL augmentation AFTER postProcessResults / applySearchRelevanceFilter.
Hydrated rows now carry `similarity: null` and `retrieval_signal:
'tll-chain'`, so they ride around the similarity gate entirely —
chain-membership is a structurally different retrieval signal than
semantic similarity. The trace adds a `tll-augmentation` stage with the
ids that were appended.

Replace the `slice(0, 10)` magic number with a named constant
(TLL_SEED_CANDIDATE_COUNT).
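
Sketched with reduced types (the real merge happens in
memory-search.ts after postProcessResults / applySearchRelevanceFilter):

```typescript
type SearchResult = {
  memory_id: string;
  similarity: number | null;
  retrieval_signal?: "tll-chain";
};

function appendTllAugmentation(
  filtered: SearchResult[],
  chainMemoryIds: string[],
): SearchResult[] {
  const seen = new Set(filtered.map((r) => r.memory_id));
  const hydrated = chainMemoryIds
    .filter((id) => !seen.has(id)) // don't duplicate similarity-ranked rows
    .map((id): SearchResult => ({
      memory_id: id,
      similarity: null,            // never compared against the threshold
      retrieval_signal: "tll-chain",
    }));
  return [...filtered, ...hydrated];
}
```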
…eview #3)

The TLL append in resolveAndLinkEntities was passing `new Date()` as
observation_date, which is the ingest-arrival time. Chains order by
observation_date ASC, so out-of-order or backfilled conversations would
chain by ingest order — destroying conversation chronology that EO and
MSR queries rely on.

Thread the caller-supplied logicalTimestamp through from
storeCanonicalFact to the new maybeAppendTll helper. When a logical
timestamp isn't supplied, look up the just-stored memory's observed_at
column rather than fabricating one. Last-resort new Date() fallback
only fires if the row lookup fails — keeps the append from silently
dropping.
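
The fallback order, sketched (the real logic lives in maybeAppendTll
in services/memory-storage.ts; the lookup is injected for illustration):

```typescript
async function resolveObservationDate(
  logicalTimestamp: Date | undefined,
  memoryId: string,
  lookupObservedAt: (memoryId: string) => Promise<Date | null>,
): Promise<Date> {
  if (logicalTimestamp) return logicalTimestamp;       // caller-supplied
  const observedAt = await lookupObservedAt(memoryId); // just-stored row
  return observedAt ?? new Date();                     // last resort only
}
```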
Hard-deletes (e.g. resetBySource in repository-write) silently nulled
out predecessor_memory_id pointers, breaking backward chain traversal
and leaving half-broken chains where some events still reference a
deleted ancestor.

Change the predecessor FK to ON DELETE CASCADE so the dependent chain
node gets removed cleanly when its predecessor goes. Matches the
memory_id FK policy on the same table.

Schema migration runs an idempotent DROP CONSTRAINT / ADD CONSTRAINT in
a DO block because CREATE TABLE IF NOT EXISTS won't update column-level
FK definitions on an existing table.
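
An illustrative shape for that migration (constraint name and
referenced table are assumptions; the point is the drop-then-add
inside DO, which re-runs cleanly):

```sql
-- Sketch only: constraint name and referenced table/column are assumed.
DO $$
BEGIN
  ALTER TABLE temporal_linkage_list
    DROP CONSTRAINT IF EXISTS temporal_linkage_list_predecessor_memory_id_fkey;
  ALTER TABLE temporal_linkage_list
    ADD CONSTRAINT temporal_linkage_list_predecessor_memory_id_fkey
      FOREIGN KEY (predecessor_memory_id)
      REFERENCES memories (id)   -- assumed target; matches memory_id FK policy
      ON DELETE CASCADE;
END
$$;
```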
The /v1/memories/event-chains endpoint fans out per entity. Without an
upper bound on entity_ids, a single caller could pull tens of thousands
of chain rows in one request — straightforward amplification target.

Add a MAX_ENTITY_IDS_PER_REQUEST = 100 named constant and refine the
schema to reject larger lists with a 400 response. Cap chosen to match
the existing MAX_SEARCH_LIMIT ceiling.
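
A sketch of the cap with zod (names from this commit; the real schema
already parses the comma-separated list):

```typescript
import { z } from "zod";

// Sketch only; the production schema lives in schemas/memories.ts.
export const MAX_ENTITY_IDS_PER_REQUEST = 100; // matches MAX_SEARCH_LIMIT

const EntityIdsSchema = z
  .string()
  .transform((raw) =>
    [...new Set(raw.split(",").map((t) => t.trim()).filter(Boolean))],
  )
  .pipe(
    z
      .array(z.string().uuid())
      .min(1, "entity_ids must contain at least one UUID")
      .max(MAX_ENTITY_IDS_PER_REQUEST, "too many entity_ids"), // rejected as 400
  );
```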
Same root cause as benchmarks repo `327326a`: git sets GIT_INDEX_FILE
before invoking pre-commit hooks; fallow's `git worktree add` for
base-ref scanning performs a checkout that writes to GIT_INDEX_FILE,
corrupting the main worktree's index by replacing it with the base
ref's tree (silently deletes files the feature branch added that
don't exist on base, then commits those deletions).

Reproduced during the PR #18 review-response work — 7 unrelated
files briefly marked deleted in the index after a fallow run; had to
recover with git reset --soft + git read-tree HEAD.

Fix: unset both vars at the top of the hook so nested git invocations
run against the worktree's own default index.
…iew #11)

The `chatFn` adapter wired in `runtime-container.ts` returned hardcoded
zero token counts for every call. `LLMProvider.chat()` returns
`Promise<string>` (no usage), so threading real counts here would require
widening that interface across every adapter. Nothing in the
`FirstMentionService` path actually consumed the fields — they only
existed to satisfy the local `ChatResult` shape — so dropping them is
strictly safer than leaving misleading zeros in place. Per-call cost
telemetry continues to flow from `LLMProvider.chat` -> `writeCostEvent`
unchanged.

Updated:
  - `ChatResult` in `first-mention-service.ts` -> `{ text: string }` only,
    with a comment documenting the deliberate decision.
  - `runtime-container.ts` adapter no longer fabricates zero usage.
  - `first-mention-service.test.ts` fixture updated to match.

The same magic 10 lived in two places: `tll-retrieval.ts:expandViaTLL`
sliced its input ids before entity lookup, and `memory-search.ts`
re-declared a private `TLL_SEED_CANDIDATE_COUNT = 10` for the same
purpose. Defined the constant once in `tll-retrieval.ts` and re-used it
from both call sites so a tuning change can't drift between them.

Updated the unit test to reference the exported constant directly
instead of asserting against the literal 10.
…#9)

The original gate was a single alternation regex that fired on any
single occurrence of `first|last|before|after|then|later|track|...`.
That over-fired on plain factual queries that incidentally contained
one of those tokens — `what is my first name`, `the model used before
GPT-4`, `track my spending` — pulling in unrelated TLL chain memories
on the augmented retrieval path.

Replaced the gate with a two-tier check (sketched below):

  1. ORDERING_TERMS_RE — a curated set of single-token signals
     (first/last/before/after/then/later/earlier/previous/next/prior).
     Only fires TLL when TWO co-occur, e.g. "what aspects did I
     discuss BEFORE and AFTER X".
  2. SEQUENCE_PATTERNS — phrase-level structural signals
     (`in (chronological/reverse/the) order`, `when did`, `since when`,
     `over time`, `evolution of`, `history|timeline of`,
     `originally`/`initially`, `progression of`,
     `how X evolved/shifted/changed`, `brought up`). Single phrase
     hit is enough.

Removed `track`, `sequence`, and bare `order` from the gate — they
were the largest false-positive contributors.
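
A sketch of the gate (pattern lists abridged from above; the real gate
is shouldUseTLL in src/services/tll-retrieval.ts):

```typescript
// Sketch; pattern lists are abridged from the commit message above and
// the exact co-occurrence rule (distinct terms vs. raw hits) is assumed.
const ORDERING_TERMS_RE =
  /\b(first|last|before|after|then|later|earlier|previous|next|prior)\b/gi;

const SEQUENCE_PATTERNS: RegExp[] = [
  /\bin (chronological|reverse|the) order\b/i,
  /\bwhen did\b/i,
  /\bsince when\b/i,
  /\bover time\b/i,
  /\b(evolution|history|timeline|progression) of\b/i,
  /\b(originally|initially)\b/i,
  /\bbrought up\b/i,
];

function shouldUseTLL(query: string): boolean {
  // Tier 2: one phrase-level structural hit is enough.
  if (SEQUENCE_PATTERNS.some((re) => re.test(query))) return true;
  // Tier 1: single-token ordering terms only fire when two co-occur.
  const hits = query.match(ORDERING_TERMS_RE) ?? [];
  return new Set(hits.map((h) => h.toLowerCase())).size >= 2;
}
```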

Updated `src/services/__tests__/tll-retrieval.test.ts`:
  - Positive list rewritten to canonical EO/MSR/TR shapes that hit
    one of the structural patterns or co-occurring ordering terms.
  - Negative list now includes the false-positive shapes the loose
    regex used to match (the three reviewer-cited ones plus a handful
    of single-ordering-term factual queries).

41/41 unit tests pass against the updated gate.
…n paths (review #8)

Three deliberate fail-open sites were swallowing errors with weak or
no log signal, hiding production failures behind ephemeral
`console.error('[tll]', ...)` lines and `process.stderr.write` calls
that no log scraper greps for. Behaviour stays fail-open by design
(CLAUDE.md's "no fallback modes" rule applies to mutations, not to
augmentation paths) — only the observability changes.

Changes:

  1. `memory-storage.ts:maybeAppendTll` — fire-and-forget TLL append now
     logs `[tll-append-failed]` with the message and stringified
     fallback for non-Error throws. Comment documents the deliberate
     fire-and-forget choice (ingest hot path can't block on chain
     bookkeeping).

  2. `memory-search.ts:maybeExpandViaTLL` — return type widened from
     `SearchResult[]` to `{ memories, failed, errorMessage? }` so the
     caller can surface the failure on the retrieval trace (shape
     sketched below). On the catch path: log `[tll-expansion-failed]`
     and propagate `{ failed: true }`. Added comment marking the
     fail-open as deliberate.

  3. `appendTllAugmentation` — emits a `tll_expansion_failed` event on
     the active retrieval `TraceCollector` whenever
     `maybeExpandViaTLL` reports `failed: true`, so the trace artifacts
     written to disk capture the failure instead of dropping it.

  4. `first-mention-service.ts:invokeLlm` and helpers — replaced ad-hoc
     `process.stderr.write` lines with structured
     `[first-mention-llm-failed]` / `[first-mention-llm-salvaged]` /
     `[first-mention-mapping]` prefixes routed through `console.error`
     / `console.warn`. JSDoc on `invokeLlm` now documents the
     deliberate fail-open (the EO read path treats no-events as
     "no signal", not an error).

Existing first-mention and tll-retrieval unit tests pass unchanged.
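
The widened result shape from item 2, sketched with reduced types:

```typescript
// Reduced types; the real signature takes the actual search inputs,
// not a thunk.
type SearchResult = { memory_id: string };
type TllExpansionResult = {
  memories: SearchResult[];
  failed: boolean;
  errorMessage?: string;
};

async function maybeExpandViaTLL(
  expand: () => Promise<SearchResult[]>,
): Promise<TllExpansionResult> {
  try {
    return { memories: await expand(), failed: false };
  } catch (err) {
    const errorMessage = err instanceof Error ? err.message : String(err);
    console.error("[tll-expansion-failed]", errorMessage); // greppable prefix
    return { memories: [], failed: true, errorMessage };   // deliberate fail-open
  }
}
```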
…view #7)

`positionInConversation` was set directly to `turn_id`. That looked
correct for a single extraction, but the (user_id, memory_id) UNIQUE
on `first_mention_events` means a re-run of `extractAndStore` for the
same conversation silently keeps the FIRST inserted row — including
its position. If the LLM's turn_id assignment drifted between runs
(which it does in practice — non-deterministic decoding even at
temperature=0 plus prompt-cache-state variation), readers would see
position values that depend on which run happened to write first,
breaking deterministic chronological ordering.

Fix: `positionInConversation` is now the 0-based index in the FINAL
turn-id-sorted output, NOT `turn_id` itself. Sort first, then enumerate.
Re-runs produce identical (position, topic) tuples regardless of any
turn_id drift, so the post-write read is stable.

Updated `mapToEvents` in `src/services/first-mention-service.ts`
(sketch after this list):
  - Build candidates first (without position).
  - Sort by `turnId` ASC.
  - Assign `positionInConversation = index` during the final map.
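
A sketch of that mapping (reduced types):

```typescript
type Candidate = { turnId: number; topic: string; memoryId: string };

function assignPositions(candidates: Candidate[]) {
  return [...candidates]
    .sort((a, b) => a.turnId - b.turnId) // sort by turn first...
    .map((c, index) => ({ ...c, positionInConversation: index })); // ...then enumerate
}
```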

Tests:

  - `src/services/__tests__/first-mention-service.test.ts`:
    * existing happy-path / sort tests now assert position 0/1/...
      instead of position == turn_id.
    * new `produces stable positionInConversation across re-runs even
      when LLM turn_id drifts` test runs `extractAndStore` twice with
      drifted turn_ids and confirms both runs produce the same
      `[0, 1]` position sequence.

  - `src/db/__tests__/repository-first-mentions.test.ts` (new):
    integration test seeding a memory + running `store()` twice with
    drifted turn_ids — asserts only 2 rows survive, position sequence
    is `[0, 1]`, and the first-write turn_id (5) is what the read-back
    returns (ON CONFLICT DO NOTHING semantics).
…xtract (review #5)

The two PR #18 read endpoints had no HTTP-level tests — only the
underlying repository / service unit tests existed, which left the
schema-validation middleware and route-level wiring uncovered. A
schema rename or a route-handler regression could ship green.

New file `src/routes/__tests__/event-chains-and-first-mentions.test.ts`
mirrors the route-test pattern from `src/__tests__/route-validation.test.ts`:
spin up an Express app on `app.listen(0, ...)`, wire `createMemoryRouter`
against a real `MemoryService` backed by the test DB, drive endpoints
via `fetch`. The MemoryService gets a real `FirstMentionRepository`
plus a stubbed `chatFn` so the LLM call returns a deterministic JSON
array.
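
Reduced harness skeleton (router mounting elided; the real test wires
createMemoryRouter against a real MemoryService backed by test Postgres):

```typescript
import express from "express";
import type { AddressInfo } from "node:net";

const app = express();
app.use(express.json());
// app.use(createMemoryRouter(memoryService)); // real wiring in the test file

const server = app.listen(0, async () => {
  const { port } = server.address() as AddressInfo;
  const res = await fetch(
    `http://127.0.0.1:${port}/v1/memories/event-chains`, // user_id omitted
  );
  console.log(res.status); // 400 once the router is mounted (schema rejects)
  server.close();
});
```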

Coverage:

  GET /v1/memories/event-chains
    - 400 when `user_id` is missing
    - 400 when `entity_ids` is missing
    - 400 when `entity_ids` contains an invalid UUID
    - 400 when `entity_ids` exceeds the 100-entry cap (review #6)
    - 400 when `entity_ids` is present but holds only empty tokens
    - happy path: seed memory + entity + TLL row, hit the route, parse
      the response with `EventChainsResponseSchema`

  POST /v1/memories/first-mentions/extract
    - 400 when `user_id` is missing
    - 400 when `conversation_text` is empty
    - 400 when `conversation_text` exceeds MAX_CONVERSATION_LENGTH
      (100_000 chars)
    - 400 when `memory_ids_by_turn_id` is missing entirely
    - 400 when `source_site` is missing
    - happy path: stub LLM returns 2 events, route stores+returns
      them, response parsed with `FirstMentionsExtractResponseSchema`,
      `position_in_conversation` is the post-sorted [0, 1] sequence
      from review #7.

12/12 new tests pass.

Adds an opt-in LiteLLM proxy sidecar under docker/litellm/ so AtomicMemory
can route LLM calls to Anthropic, OpenAI, Microsoft Foundry / Azure, AWS
Bedrock, or Google Gemini through a single OpenAI-compatible endpoint.
Provider swap is config-only — no new code path in src/services/llm.ts.

Why
- Today llm.ts already supports `LLM_PROVIDER=openai-compatible` with
  `LLM_API_URL` + `LLM_API_KEY`. Pointing that lane at a LiteLLM proxy
  reuses the existing seam and keeps cost-telemetry, AUDN-timeout, and
  retry behavior unchanged.
- A single config.yaml replaces per-provider client wiring across the
  research harness and any future deployment, so we add a provider by
  appending one model_list entry instead of touching TypeScript.

What ships
- docker/litellm/litellm-config.yaml — model_list entries for Anthropic
  (Haiku 4.5, Sonnet 4.6), OpenAI (gpt-5-chat, gpt-4o-mini), Foundry
  (gpt-5-chat via azure/), Bedrock (Claude Sonnet), Gemini (1.5-pro).
  Provider keys resolved via os.environ/VAR_NAME at request time.
- docker/litellm/docker-compose.litellm.yml — pinned compose service on
  port 4000 with explicit `name: atomicmemory-litellm` so the project
  never collides with another `litellm/`-named compose stack.
- docker/litellm/README.md — quick start, env-var table per provider,
  cost-telemetry caveats.
- docker/litellm/.env.example — credential template.

No src/ changes; the existing openai-compatible lane already accepts
LLM_API_URL + LLM_API_KEY (config.ts → llm.ts → OpenAICompatibleLLM).

Smoke
- Anthropic Haiku 4.5 via the proxy: 200 OK, 1.95s, 23 in / 14 out
  tokens, ~$0.00009. Output coherent.
- Foundry / Bedrock / Gemini / OpenAI: model aliases load cleanly at
  proxy startup (`Set models:` lists all 7); no live calls without
  credentials.
…family

Two integration fixes discovered during smoke validation against real provider
keys (Anthropic, OpenAI, Foundry, Gemini):

1. Foundry — Azure-deployments path doesn't exist on Project Inference API.
   The previous config used `azure/gpt-5-chat` with the Azure deployments
   URL pattern (`/openai/deployments/<name>/chat/completions?api-version=X`).
   Foundry's Project Inference API exposes an OpenAI-compatible endpoint at
   `${FOUNDRY_API_BASE}/openai/v1/chat/completions` with no api-version
   parameter. Switched to LiteLLM's `openai/` provider with a custom
   api_base, mirroring how any OpenAI-compatible third party is routed.
   Adds FOUNDRY_API_BASE_OPENAI env var (= FOUNDRY_API_BASE + "/openai/v1");
   config sketch after this list.

2. Gemini — 1.5 series isn't available on current generative-language API
   keys; 2.0 Flash is no longer offered to new accounts. Swapped to the 2.5
   family: gemini-2-5-flash and gemini-2-5-pro aliases. The 1.5 entry is
   removed.

   Also flagged a Gemini 2.5 caveat in .env.example: 2.5 generates reasoning
   tokens that count against max_tokens. Callers must send max_tokens >= 500
   (Flash) or >= 1500 (Pro) or responses truncate to empty content.
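
An illustrative litellm-config.yaml entry for the Foundry switch (the
api_key env-var name is an assumption):

```yaml
- model_name: foundry-gpt-5-chat
  litellm_params:
    model: openai/gpt-5-chat                       # generic OpenAI-compatible routing
    api_base: os.environ/FOUNDRY_API_BASE_OPENAI   # = FOUNDRY_API_BASE + "/openai/v1"
    api_key: os.environ/FOUNDRY_API_KEY            # env-var name assumed
```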

Smoke results (all 200 OK via the proxy at localhost:4000):
  Anthropic Haiku 4.5     — 1s
  OpenAI GPT-4o-mini      — 3s
  Foundry GPT-5-chat      — 1s
  Gemini 2.5 Flash        — 4s, max_tokens=500
  Gemini 2.5 Pro          — 9s, max_tokens=1500

Total smoke spend: ~$0.001.

Files:
  docker/litellm/litellm-config.yaml  - foundry-gpt-5-chat switches to openai/, gemini entries replaced
  docker/litellm/docker-compose.litellm.yml - passes FOUNDRY_API_BASE_OPENAI through to the proxy
  docker/litellm/.env.example         - documents new FOUNDRY_API_BASE_OPENAI + Gemini 2.5 caveats