The AI Proxy plugin lets you transform and proxy requests to a number of AI providers and models.
The AI Proxy plugin accepts requests in one of several standardized OpenAI-compatible formats, translates them to the configured target format, and then transforms the response back into a standard format.
v3.10+ To use AI Proxy with a provider's native (non-OpenAI) format, without conversion, see the section below for more details.
The AI Proxy plugin will mediate the following for you:
Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
The following service request coordinates (unless the model is self-hosted):
Protocol
Host name
Port
Path
HTTP method
Authentication on behalf of the Kong API consumer
Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
Fulfillment of requests to self-hosted models, based on select supported format transformations
Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to offer Kong Gateway Consumers a choice of LLMs, using consistent request and response formats, regardless of the backend provider or model.
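For illustration, a minimal plugin configuration for proxying chat traffic to OpenAI might look like the sketch below. The API key, model name, and option values are placeholders, and the field paths follow the config.targets[].* schema referenced on this page; confirm them against your Kong Gateway version.

```yaml
# Hypothetical ai-proxy configuration sketch (values are placeholders)
plugins:
  - name: ai-proxy
    config:
      targets:
        - route_type: llm/v1/chat            # must match the upstream capability
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
          model:
            provider: openai
            name: gpt-4o
            options:                          # decorates each request, as described above
              max_tokens: 512
              temperature: 1.0
```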
v3.11+ AI Proxy supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.
The plugin’s route_type should be set according to the target upstream endpoint and model, as shown in this capability matrix:
The following requirements are enforced by upstream providers:
For Azure Responses API, set config.azure_api_version to "preview".
For OpenAI and Azure Assistant APIs, include the header OpenAI-Beta: assistants=v2.
For requests with large payloads (e.g., image edits, audio transcription/translation), consider increasing config.max_request_body_size to three times the raw binary size.
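As a sketch, the Azure- and payload-related settings above could be set as follows, using the parameter paths as written on this page (the body-size value is an illustrative placeholder):

```yaml
config:
  azure_api_version: preview        # required for the Azure Responses API
  max_request_body_size: 25165824   # ~3x the raw binary size of an 8 MB upload
```

Note that OpenAI-Beta: assistants=v2 is a request header, so it is typically supplied by the client on each Assistants API request rather than set in the plugin configuration.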
| Provider | Upstream URL |
|----------|--------------|
| Mistral | As defined in config.targets[].model.options.upstream_url |
| Llama2 | As defined in config.targets[].model.options.upstream_url |
| Amazon Bedrock | https://bedrock-runtime.{region}.amazonaws.com |
| Gemini | https://generativelanguage.googleapis.com |
| Hugging Face | https://api-inference.huggingface.co |
While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers.
For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.targets[].model.options.upstream_url plugin option.
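For example, a minimal sketch pointing the OpenAI chat format at a self-hosted, OpenAI-compatible server (the URL and model name below are placeholders):

```yaml
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: openai                 # any OpenAI-compatible server
        name: my-local-model             # placeholder
        options:
          upstream_url: http://localhost:11434/v1/chat/completions  # placeholder
```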
v3.10+ If you are using each provider’s native SDK, Kong Gateway allows you to transparently proxy the request without any transformation and return the response unmodified. This can be done by setting config.llm_format to a value other than openai, such as gemini or bedrock. See the section below for more details.
In this mode, Kong Gateway will still provide useful analytics, logging, and cost calculation.
v3.10+ By default, Kong Gateway uses the OpenAI format, but you can customize this using config.llm_format. If llm_format is not set to openai, the plugin will not transform the request when sending it upstream and will leave it as-is.
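For instance, a sketch that passes native Gemini SDK traffic through without transformation; only the llm_format value is shown, and the rest of the configuration is unchanged:

```yaml
config:
  llm_format: gemini   # accept and forward native Gemini requests as-is
```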
The Kong Gateway AI Proxy accepts the following input formats, standardized across all providers. The config.targets[].route_type must be configured to match the required request and response formats shown in the examples below.
The following examples show standardized text-based request formats for each supported llm/v1/* route. These formats are normalized across providers to help simplify downstream parsing and integration.
{"messages":[{"role":"system","content":"You are a scientist."},{"role":"user","content":"What is the Boltzmann equation?"}]}
v3.9+ With Amazon Bedrock, you can include your guardrail configuration in the request:
{"messages":[{"role":"system","content":"You are a scientist."},{"role":"user","content":"What is the Boltzmann equation?"}],"guardrailConfig":{"guardrailIdentifier":"$GUARDRAIL-IDENTIFIER","guardrailVersion":"1","trace":"enabled"}}
{"prompt":"You are a scientist. What is quantum entanglement?"}
Supported in: v3.11+
{"input":"Quantum computing is expected to revolutionize cryptography.","model":"text-embedding-ada-002","encoding_format":"float"}
Supported in: v3.11+
This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.
{"instructions":"You are a frontend mentor. When asked a question, write and explain JavaScript code to help the user understand key concepts.","name":"Frontend Mentor","tools":[{"type":"code_interpreter"}],"model":"gpt-4o"}
Supported in: v3.11+
This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.
The following examples show standardized audio and image request formats for each supported route. These formats are normalized across providers to help simplify downstream parsing and integration.
Supported in: v3.11+
curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "In the heart of the city, the rain whispered secrets to the streets.",
    "voice": "serene"
  }' \
  --output speech.mp3
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"The Boltzmann equation is a fundamental equation in statistical mechanics that describes how the distribution function f(x, v, t) of particles in a gas evolves over time. It accounts for the effects of particle collisions and external forces, providing a bridge between microscopic particle dynamics and macroscopic thermodynamic behavior. The general form of the equation is: ∂f/∂t + v · ∇f + F · ∇_v f = (∂f/∂t)_collision, where f is the distribution function, v is velocity, F is an external force, and the right-hand side represents the change in f due to collisions.","role":"assistant"}}],"created":1707769597,"id":"chatcmpl-ID","model":"gpt-4-0613","object":"chat.completion","usage":{"completion_tokens":94,"prompt_tokens":26,"total_tokens":120}}
{"choices":[{"finish_reason":"stop","index":0,"text":"Quantum entanglement is a phenomenon where particles become interconnected such that the state of one instantly influences the state of another, regardless of distance."}],"created":1707769597,"id":"cmpl-ID","model":"gpt-3.5-turbo-instruct","object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":9,"total_tokens":35}}
{"id":"asst_def456","object":"assistant","created_at":1698984975,"name":"Frontend Mentor","description":null,"model":"gpt-4o","instructions":"You are a frontend mentor. When asked a question, write and explain JavaScript code to help the user understand key concepts.","tools":[{"type":"code_interpreter"}],"metadata":{},"top_p":1.0,"temperature":1.0,"response_format":"auto"}
Supported in: v3.11+
{"id":"resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b","object":"response","created_at":1741476542,"status":"completed","error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"model":"gpt-4.1-2025-04-14","output":[{"type":"message","id":"msg_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b","status":"completed","role":"assistant","content":[{"type":"output_text","text":"HTTP/1.1 uses a single connection per request-response cycle, leading to inefficiencies, especially with multiple resources. In contrast, HTTP/2 supports multiplexing, allowing multiple streams over one connection, which reduces latency. HTTP/2 also introduces binary framing and header compression for improved performance.","annotations":[]}]}],"parallel_tool_calls":true,"previous_response_id":null,"reasoning":{"effort":null,"summary":null},"store":true,"temperature":1.0,"text":{"format":{"type":"text"}},"tool_choice":"auto","tools":[],"top_p":1.0,"truncation":"disabled","usage":{"input_tokens":36,"input_tokens_details":{"cached_tokens":0},"output_tokens":60,"output_tokens_details":{"reasoning_tokens":0},"total_tokens":96},"user":null,"metadata":{}}
The following examples show standardized response formats returned by supported audio/ and image/ routes. These formats are normalized across providers to support consistent multimodal output parsing.
Supported in: v3.11+
The response contains the audio file content of speech.mp3.
Supported in: v3.11+
{"text":"Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100 or a 1,000 times bigger. This is a place where you can get to do that.","usage":{"type":"tokens","input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
Supported in: v3.11+
{"text":"Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"}
To deploy Kong AI Gateway with Azure OpenAI, configure a header capture so that the requested model name is inserted directly into the plugin configuration as a string substitution.
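As a hedged sketch, assuming the plugin's string-substitution syntax can read a request header (the header name is a placeholder; verify the exact substitution syntax against your Kong Gateway version):

```yaml
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: azure
        name: $(headers.x-model-name)   # assumed substitution syntax, not confirmed here
```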
Yes. If Kong Gateway is running on Azure, AI Proxy can detect the designated Managed Identity or User-Assigned Identity of that Azure compute resource and use it accordingly.
In your AI Proxy configuration, set the following parameters:
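The parameter names below are a sketch based on the Azure authentication options in the plugin's auth block; confirm them against your plugin version:

```yaml
config:
  auth:
    azure_use_managed_identity: true
    azure_client_id: <CLIENT_ID>   # only needed for a User-Assigned Identity
```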
No. The model name must match the one configured in config.targets[].model.name. If a different model is specified in the request, the plugin returns a 400 error.
Yes, but only if config.auth.allow_override is set to true in the plugin configuration.
When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.
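For example, a hedged sketch with a static fallback credential that consumers can override per request:

```yaml
config:
  auth:
    allow_override: true                     # request-level auth may override the static value
    header_name: Authorization
    header_value: Bearer <DEFAULT_API_KEY>   # placeholder; used when the client sends no credential
```

With this in place, a consumer can send their own Authorization header and the plugin forwards it in place of the static value.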