AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.
The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time. This lets you set up load balancing between targets.
The AI Proxy Advanced plugin accepts requests in one of a few defined and standardized OpenAI formats, translates them to the configured target format, and then transforms the response back into a standard format.
v3.10+ To use AI Proxy Advanced with a provider's native format, without conversion to the OpenAI format, see the section below for more details.
AI Gateway supports proxying requests to the following AI providers. Each provider page documents supported capabilities, configuration requirements, and provider-specific details.
For detailed capability support, configuration requirements, and provider-specific limitations, see the individual provider reference pages.
The AI Proxy Advanced plugin will mediate the following for you:
Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
The following service request coordinates (unless the model is self-hosted):
Protocol
Host name
Port
Path
HTTP method
Authentication on behalf of the Kong API consumer
Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
Fulfillment of requests to self-hosted models, based on select supported format transformations
Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to provide a choice of LLMs to the Kong Gateway Consumers, using consistent request and response formats, regardless of the backend provider or model.
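For example, a minimal declarative configuration that balances the same chat route across two providers might look like the following sketch (the model names, weights, and balancer algorithm shown are illustrative, not required values):

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    balancer:
      algorithm: round-robin        # illustrative; other algorithms are available
    targets:
    - route_type: llm/v1/chat
      weight: 50
      model:
        provider: openai
        name: gpt-4o                # illustrative model name
        options:
          max_tokens: 512
      auth:
        header_name: Authorization
        header_value: Bearer <OPENAI_API_KEY>
    - route_type: llm/v1/chat
      weight: 50
      model:
        provider: mistral
        name: mistral-large-latest  # illustrative model name
      auth:
        header_name: Authorization
        header_value: Bearer <MISTRAL_API_KEY>
```

Because both targets accept and return the standardized OpenAI format, clients see consistent request and response shapes regardless of which target serves the call.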
v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.
The following table maps each route type to its OpenAI API reference and generative AI category. See the AI provider reference pages for provider-specific details.
Provider-specific parameters can be passed using the extra_body field in your request. See the sample OpenAPI specification for detailed format examples.
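As an illustration, a chat request carrying a provider-specific parameter inside extra_body might look like the following (the safety_settings field shown is an illustrative provider-specific parameter; consult your provider's reference for the fields it actually accepts):

```json
{
  "messages": [
    { "role": "user", "content": "Summarize this document." }
  ],
  "extra_body": {
    "safety_settings": [
      { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH" }
    ]
  }
}
```

Everything outside extra_body follows the standardized OpenAI format; everything inside it is passed through to the configured provider.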
If you use a provider’s native SDK, AI Gateway v3.10+ can proxy the request and return the upstream response without payload format conversion. Set config.llm_format to a value other than openai to preserve the provider’s native request and response formats.
In this mode, AI Gateway will still provide analytics, logging, and cost calculation.
When config.llm_format is set to a native format, only the corresponding provider is supported with its specific APIs.
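As a sketch, routing native Bedrock SDK traffic without payload conversion could be configured as follows (this assumes bedrock is the config.llm_format value matching your provider; the model ID is illustrative):

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    llm_format: bedrock   # accept and return the provider's native payloads
    targets:
    - route_type: llm/v1/chat
      model:
        provider: bedrock
        name: anthropic.claude-3-sonnet-20240229-v1:0   # illustrative model ID
```

With this configuration, requests produced by the provider's own SDK pass through unmodified, while AI Gateway still records analytics, logs, and cost data.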
The AI load balancer supports configurable retries, timeouts, and failover to different models when a target is unavailable.
v3.10+ Fallback works across targets with any supported format. You can mix providers freely, for example OpenAI and Mistral. Earlier versions require compatible formats between fallback targets. For configuration details, see Retry and fallback configuration.
Client errors don’t trigger failover.
To failover on additional error types, set config.balancer.failover_criteria to include HTTP codes like http_429 or http_502, and non_idempotent for POST requests.
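A sketch of that failover configuration (the error and timeout entries are shown as assumed baseline criteria; verify the defaults for your version):

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    balancer:
      failover_criteria:
      - error            # assumed baseline criterion
      - timeout          # assumed baseline criterion
      - http_429         # fail over on rate-limit responses
      - http_502         # fail over on bad-gateway responses
      - non_idempotent   # also retry non-idempotent (POST) requests on another target
```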
You can reference the following request values in the plugin configuration:
$(headers.header_name): The value of a specific request header.
$(uri_captures.path_parameter_name): The value of a captured URI path parameter.
$(query_params.query_parameter_name): The value of a query string parameter.
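For example, the model name could be selected per request from a header (a sketch; the x-model-name header is an illustrative name, not a reserved one):

```yaml
plugins:
- name: ai-proxy-advanced
  config:
    targets:
    - route_type: llm/v1/chat
      model:
        provider: openai
        name: $(headers.x-model-name)   # resolved from the request at proxy time
      auth:
        header_name: Authorization
        header_value: Bearer <OPENAI_API_KEY>
```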
You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case.
A vector database stores vector embeddings: numerical representations of data items. For example, a response can be converted to a numerical representation and stored in the vector database, so that new requests can be compared against the stored vectors to find relevant cached items.
The AI Proxy Advanced plugin supports the following vector databases:
Using config.vectordb.strategy: redis and parameters in config.vectordb.redis:
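A sketch of the config.vectordb block backed by Redis (all values are illustrative; the dimensions value must match the output size of whatever embedding model you configure):

```yaml
vectordb:
  strategy: redis
  dimensions: 1536          # must match the embedding model's output size
  distance_metric: cosine
  threshold: 0.75           # similarity cutoff for a match
  redis:
    host: redis.example.com
    port: 6379
```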
If your plugin uses a Redis datastore, you can authenticate to it with a cloud Redis provider.
This allows you to seamlessly rotate credentials without relying on static passwords.
The following providers are supported:
AWS ElastiCache
Azure Managed Redis
Google Cloud Memorystore (with or without Valkey)
You need:
A running AWS ElastiCache instance: either ElastiCache for Valkey 7.2 or later, or ElastiCache for Redis OSS 7.0 or later
No. The model name must match the one configured in config.targets[].model.name. If a different model is specified in the request, the plugin returns a 400 error.
Yes, but only if config.targets[].auth.allow_override is set to true in the plugin configuration.
When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.
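For example, with override enabled, a caller's own Authorization header takes precedence over the statically configured credential (a sketch; the fallback key is a placeholder):

```yaml
targets:
- route_type: llm/v1/chat
  model:
    provider: openai
    name: gpt-4o
  auth:
    header_name: Authorization
    header_value: Bearer <FALLBACK_API_KEY>  # used when the request carries no credential
    allow_override: true                     # request-level credentials win when present
```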
It uses Kong’s built-in load balancing mechanism with the EWMA (Exponentially Weighted Moving Average) algorithm to dynamically route traffic to the backend with the lowest observed latency.
There’s no fixed time window. EWMA continuously updates with every response, giving more weight to recent observations. Older latencies decay over time, but still contribute in smaller proportions.
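The decay behavior described above corresponds to the standard EWMA recurrence, shown here for illustration (the actual smoothing factor is internal to Kong's balancer and not configurable through this plugin):

```latex
\text{score}_t = \alpha \cdot \text{latency}_t + (1 - \alpha) \cdot \text{score}_{t-1},
\qquad 0 < \alpha < 1
```

A larger \(\alpha\) weights recent responses more heavily; the contribution of an observation from \(k\) responses ago decays geometrically as \((1-\alpha)^k\), which is why old latencies never fully disappear but matter less over time.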
The fastest model gets a majority of traffic, but Kong never sends 100% to a single target unless it’s the only one available. In practice, the dominant target may receive ~90–99% of traffic, depending on how much better its EWMA score is.
Yes. EWMA ensures all targets continue to receive a small amount of traffic. This ongoing probing lets the system adapt if a previously slower model becomes faster later.
While exact percentages vary with latency gaps, less performant targets typically get between 0.1%–5% of traffic, just enough to keep updating their EWMA score for comparison.
failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)
This means that the MemoryDB instance's hardcoded index limit (10 per instance) has been reached.
To resolve this, create more MemoryDB instances to handle multiple AI Proxy Advanced plugin instances.