Streaming with AI Gateway

Uses: Kong Gateway, AI Gateway

What is request streaming?

In an LLM (Large Language Model) inference request, Kong Gateway uses the upstream provider's REST API to generate the next chat message for the caller. Normally, the LLM processes the request and completely buffers the response before sending it back to Kong Gateway, which then returns it to the caller as a single large JSON block. This can be time-consuming, depending on max_tokens, other request parameters, and the complexity of the request sent to the LLM model.
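
For contrast, here is a sketch of a standard (non-streaming) request made with the OpenAI Python SDK, pointed at a hypothetical Kong Gateway route; the SDK call blocks until the complete response has been generated and buffered:

from openai import OpenAI

# Hypothetical Kong Gateway route; adjust the base_url to your deployment.
client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none"
)

# Without stream=True, the call returns only after the LLM has produced
# and buffered the entire response.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
)
print(response.choices[0].message.content)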

To avoid making the user stare at a loading animation while they wait for their chat response, most models can stream each word (or set of words or tokens) back to the client as it is generated. This allows the chat response to be rendered in real time.

For example, a client could set up their streaming request using the OpenAI Python SDK like this:

from openai import OpenAI

# Point the SDK at the Kong Gateway route instead of the upstream provider.
client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none"
)

# stream=True asks for the response to be delivered as it is generated.
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
    stream=True,
)

print('>')
for chunk in stream:
    # Each chunk carries a small delta of the response text; print it as it arrives.
    print(chunk.choices[0].delta.content or "", end="", flush=True)

The client won’t have to wait for the entire response. Instead, tokens will appear as they come in.

How AI Proxy streaming works

In streaming mode, a client can set "stream": true in their request, and the LLM server streams each part of the response text (usually token by token) back as server-sent events. Kong Gateway captures each batch of events and translates it into the Kong Gateway inference format. This ensures that all providers remain compatible with the same framework, including OpenAI-compatible SDKs and similar clients.
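
For example, with an OpenAI-compatible chat route, the translated stream is delivered as data: events inside HTTP chunks. The events below are abridged and purely illustrative; exact fields vary by provider and route type:

data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Kong"}, "finish_reason": null}]}

data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": " Inc."}, "finish_reason": null}]}

data: [DONE]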

In a standard LLM transaction, requests proxied directly to the LLM look like this:

 
The client sends a request to Kong Gateway (running the AI Proxy Advanced plugin), Kong Gateway sends the proxied request information to the cloud LLM, and the cloud LLM sends the complete response back to the client in a single chunk.

When streaming is requested, requests proxied directly to the LLM look like this:

 
The client's request is proxied by Kong Gateway (running the AI Proxy Advanced plugin) to the cloud LLM. As the LLM streams its response back, Kong Gateway passes each response frame through a transform step and a read step, then forwards the translated chunk on to the client.

The streaming framework captures each event, sends the chunk back to the client, and then exits early.

It also estimates token counts for LLM services that don't stream back token usage when the message is complete.
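
Kong Gateway's own estimation logic isn't reproduced here. As a rough illustration of the idea only, a naive estimate might assume an average of about four characters per token:

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Crude heuristic for illustration; real tokenizers (and Kong Gateway's
    # own estimate) will produce different counts.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Kong Inc. was founded in San Francisco."))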

Streaming limitations

Keep the following limitations in mind when you configure streaming for the AI Gateway plugin:

  • Don't expect multiple AI features to be applied and work simultaneously.
  • You can't use the Response Transformer plugin, or any other response-phase plugin, when streaming is configured.
  • The AI Request Transformer plugin will work, but the AI Response Transformer plugin will not, because Kong Gateway can't check every single response token against a separate system.
  • Streaming currently doesn't work with the HTTP/2 protocol. You must disable HTTP/2 in your proxy_listen configuration (see the example after this list).
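
For example, assuming a standard kong.conf, removing the http2 flag from your proxy listeners keeps them on HTTP/1.1; your listener addresses and other flags may differ:

proxy_listen = 0.0.0.0:8000, 0.0.0.0:8443 ssl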

Configuration

The AI Proxy and AI Proxy Advanced plugins already support request streaming; all you have to do is ask Kong Gateway to stream the response tokens back to you.

The following is an example llm/v1/completions route streaming request:

{
  "prompt": "What is the theory of relativity?",
  "stream": true
}

You should receive each batch of tokens as HTTP chunks, each containing one or many server-sent events.
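
If you're not using an SDK, you can consume the stream directly. The following is a minimal sketch using the Python requests library; the route URL is hypothetical, and the exact event fields depend on the provider and the route type:

import json
import requests

# Hypothetical Kong Gateway route configured for llm/v1/completions; adjust to your setup.
url = "http://127.0.0.1:8000/my-completions-route"
payload = {"prompt": "What is the theory of relativity?", "stream": True}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for raw_line in resp.iter_lines():
        if not raw_line:
            continue  # blank lines separate server-sent events
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # common end-of-stream sentinel in OpenAI-compatible streams
        event = json.loads(data)
        # For a completions route the text typically sits in choices[0]["text"];
        # chat routes carry it in choices[0]["delta"]["content"] instead.
        print(event["choices"][0].get("text", ""), end="", flush=True)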

Response streaming configuration parameters

In the AI Proxy and AI Proxy Advanced plugin configuration, you can set an optional field config.response_streaming to one of three values:

  • allow: Allows the caller to optionally specify a streaming response in their request (the default is a non-streaming response).
  • deny: Prevents the caller from setting stream=true in their request.
  • always: Always returns streaming responses, even if the caller hasn't specified it in their request.
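
For example, in a declarative configuration the field sits under the plugin's config block. The fragment below is an abridged sketch; the other required AI Proxy settings (such as the route type, model, and provider auth) are omitted:

plugins:
  - name: ai-proxy
    config:
      response_streaming: always
      # ... other required AI Proxy configuration omitted ...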