Streaming parser hangs indefinitely on Gemini 3.x preview models (Vertex AI) #511

@harmoney-franck

Description

gen.forward(llm, input) (and llm.chat(...)) hangs forever when using any Gemini 3 family model on Vertex AI in the default streaming mode. The Vertex endpoint responds with HTTP 200 and sends bytes, but axllm's streaming parser never completes, so the returned promise neither resolves nor rejects. Disabling streaming via config: { stream: false } fixes it immediately.

Gemini 2.x models on the same Vertex provider, project, region, auth, and axllm instance work fine with streaming.

Environment

  • @ax-llm/ax: 19.0.45
  • Runtime: Node.js 24
  • Provider: ai({ name: 'google-gemini', projectId, region: 'global', apiKey: () => googleAuthToken })
  • Auth: google-auth-library via ADC, scope https://www.googleapis.com/auth/cloud-platform
  • Affected models (all hang):
    • AxAIGoogleGeminiModel.Gemini3Flash ("gemini-3-flash-preview")
    • AxAIGoogleGeminiModel.Gemini3FlashLite ("gemini-3.1-flash-lite-preview")
    • AxAIGoogleGeminiModel.Gemini3Pro / Gemini31Pro ("gemini-3.1-pro-preview")
  • Working controls (same auth, same code path): gemini-2.0-flash-001, gemini-2.5-pro, gemini-2.5-flash

Reproduction

  import { ai, ax } from '@ax-llm/ax';

  const llm = ai({
    name: 'google-gemini',
    apiKey: async () => (await googleAuthClient.getAccessToken()).token!,
    projectId: process.env.VERTEX_PROJECT!,
    region: 'global',
    config: { model: 'gemini-3-flash-preview' },
    // options: { debug: true, timeout: 30_000 },  // forces a visible timeout
  });

  const gen = ax('question:string -> answer:string');
  const result = await gen.forward(llm, { question: 'Say hi in one word.' });
  console.log(result);
  // → hangs forever; never resolves, never throws

With 'gemini-2.5-pro' in place of 'gemini-3-flash-preview' (everything else identical), the same call returns within seconds.

Workaround

  config: { model: 'gemini-3-flash-preview', stream: false },

Works immediately. Output shape is identical; only streaming is disabled.

What's not the issue

All of the following were ruled out independently of axllm:

  • Not auth / credentials. Same GoogleAuth token, same service account works with 2.x models in streaming mode.
  • Not model availability. Direct curl to
    https://aiplatform.googleapis.com/v1/projects/$PROJECT/locations/global/publishers/google/models/gemini-3-flash-preview:generateContent returns 200 with a valid response. Same for the v1beta1 path.
  • Not region routing. global region works for 2.x on axllm; same configuration hangs for 3.x.
  • Not a network/socket hang at the connection level. With debug: true + timeout: 30_000, the request does send and receive bytes; the timeout fires inside the streaming parse loop, not at the HTTP layer.
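To compare the raw wire formats directly, a streaming curl against a 2.x and a 3.x model should show whether the final SSE frames differ. This is a diagnostic sketch only: streamGenerateContent with ?alt=sse is Vertex AI's SSE streaming method, $PROJECT is a placeholder for the project id, and it requires gcloud credentials.

```shell
# Dump the tail of the raw SSE stream from each model family and diff them.
# $PROJECT is a placeholder; requires ADC credentials via gcloud.
TOKEN=$(gcloud auth print-access-token)
for MODEL in gemini-2.5-flash gemini-3-flash-preview; do
  curl -sN \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"contents":[{"role":"user","parts":[{"text":"Say hi in one word."}]}]}' \
    "https://aiplatform.googleapis.com/v1/projects/$PROJECT/locations/global/publishers/google/models/$MODEL:streamGenerateContent?alt=sse" \
    | tail -c 400 > "/tmp/sse-tail-$MODEL.txt"
done
diff /tmp/sse-tail-gemini-2.5-flash.txt /tmp/sse-tail-gemini-3-flash-preview.txt
```

If the 3.x tail is missing a terminator frame that the 2.x tail has, that would confirm the framing mismatch suspected below.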

Suspected cause

SSE terminator or chunk-framing mismatch between axllm's streaming response parser and the Gemini 3 preview endpoint's SSE format. The streaming parser likely awaits a delimiter (e.g. data: [DONE], a specific content-block-stop event, or a final empty chunk) that the 3.x preview endpoint either omits or formats differently from 2.x.

Suggested areas to check in the Google Gemini provider implementation:

  • src/ax/ai/google-gemini/ — streaming decoder / chunk reassembly
  • Any hardcoded end-of-stream sentinel specific to 2.x response shape
  • Handling of new Gemini 3 "thinking" blocks (3.x introduced extended thinking content parts); if the parser waits for a thinking-close that the API doesn't emit in stream mode, the promise never resolves
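The sentinel theory can be sketched as follows. This is illustrative only, not axllm's actual parser; the data: [DONE] sentinel and the frame contents are invented for the example.

```typescript
// Sketch of how a reader that resolves only on a hardcoded `data: [DONE]`
// sentinel hangs when the server instead just closes the stream.
// NOT axllm's real code; frames and sentinel are invented for illustration.

// Fragile variant: resolves only when the sentinel frame arrives.
function collectUntilSentinel(frames: string[]): string[] | 'never-resolves' {
  const chunks: string[] = [];
  for (const frame of frames) {
    const data = frame.replace(/^data: /, '').trim();
    if (data === '[DONE]') return chunks; // 2.x-style terminator
    chunks.push(data);
  }
  // Stream ended without the sentinel: a real async reader would keep
  // awaiting the next chunk here, i.e. hang forever.
  return 'never-resolves';
}

// Robust variant: end-of-stream is always terminal, sentinel or not.
function collectUntilEnd(frames: string[]): string[] {
  const chunks: string[] = [];
  for (const frame of frames) {
    const data = frame.replace(/^data: /, '').trim();
    if (data === '[DONE]') break;
    chunks.push(data);
  }
  return chunks;
}

const v2Style = ['data: {"text":"Hi"}', 'data: [DONE]'];
const v3Style = ['data: {"text":"Hi"}']; // hypothetical: no sentinel, just EOF

console.log(collectUntilSentinel(v2Style)); // [ '{"text":"Hi"}' ]
console.log(collectUntilSentinel(v3Style)); // 'never-resolves' → the hang
console.log(collectUntilEnd(v3Style));      // [ '{"text":"Hi"}' ]
```

If the parser treats end-of-stream (reader done) as terminal in addition to any sentinel, the hang disappears regardless of which terminator the 3.x endpoint emits.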

Happy to help debug, test a fix, or capture raw SSE traces.
