In an LLM (Large Language Model) inference request, Kong Gateway calls the upstream provider’s REST API to generate the next chat message on behalf of the caller.
Normally, this request is processed and completely buffered by the LLM before being sent back to Kong Gateway, and then to the caller, as a single large JSON block. This process can be time-consuming, depending on the max_tokens value, other request parameters, and the complexity of the request sent to the LLM model.
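For comparison, a buffered (non-streaming) request through the gateway could look like the following minimal sketch with the OpenAI Python SDK. The route prefix, placeholder API key, and max_tokens value are illustrative and depend on your own gateway configuration.

from openai import OpenAI

# Point the SDK at a Kong Gateway route instead of the provider directly.
# The base_url and api_key below are assumptions about your setup.
client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none",
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
    max_tokens=512,  # larger values can mean a noticeably longer wait
)

# Nothing is printed until the entire completion has been generated and buffered
print(response.choices[0].message.content)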
To avoid making the user wait behind a loading animation for their chat response, most models can stream each word (or set of words and tokens) back to the client. This allows the chat response to be rendered in real time.
For example, a client could set up their streaming request using the OpenAI Python SDK like this:
from openai import OpenAI

# Point the SDK at the Kong Gateway route for the OpenAI provider
# instead of calling the provider directly.
client = OpenAI(
    base_url="http://127.0.0.1:8000/12/openai",
    api_key="none",  # the SDK requires a key; the gateway route handles provider auth
)

# Request a streamed response instead of a single buffered JSON block
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me the history of Kong Inc."}],
    stream=True,
)

print('>')

# Print each token delta as soon as it arrives
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
The client won’t have to wait for the entire response. Instead, tokens will appear as they come in.
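Under the hood, the streamed response is relayed as server-sent events, which the SDK parses for you. As a rough sketch of what that looks like, you could read the raw stream yourself; the /chat/completions path under the same route and the exact chunk shape are assumptions about an OpenAI-compatible upstream.

import json
import requests

# Call the gateway route directly to see the raw server-sent events.
# The URL path and payload shape assume an OpenAI-compatible API.
resp = requests.post(
    "http://127.0.0.1:8000/12/openai/chat/completions",
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Tell me the history of Kong Inc."}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue  # skip blank lines between events
    event = line.decode("utf-8")
    if not event.startswith("data: "):
        continue
    data = event[len("data: "):]
    if data == "[DONE]":
        break  # the provider signals the end of the stream
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"].get("content") or ""
    print(delta, end="", flush=True)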