AI Proxy Advanced

AI License Required
Made by: Kong Inc.
Supported Gateway Topologies: hybrid, db-less, traditional
Supported Konnect Deployments: hybrid, cloud-gateways, serverless
Compatible Protocols: grpc, grpcs, http, https, ws, wss
Minimum Version: Kong Gateway 3.8
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time, so you can load balance requests across targets.

The AI Proxy Advanced plugin accepts requests in one of several standardized OpenAI formats, translates them to the format of the configured target provider, and then transforms the response back into the standard OpenAI format.

v3.10+ To use AI Proxy Advanced with a provider's native (non-OpenAI) format without conversion, see Supported native LLM formats below.

Overview of capabilities

The AI Proxy Advanced plugin supports capabilities across batch processing, multimodal embeddings, agents, audio, image, streaming, and more, spanning multiple providers.

For Kong Gateway versions 3.6 or earlier, the following capabilities are supported:

  • Chat Completions APIs: Multi-turn conversations with system/user/assistant roles.

  • Completions API: Generates free-form text from a prompt.

    OpenAI has marked this endpoint as legacy and recommends using the Chat Completions API for developing new applications.

See the following table for capabilities supported in AI Gateway:

API capability | Description
Chat completions | Generates conversational responses from a sequence of messages using supported LLM providers.
Embeddings | Converts text to vector representations for semantic search and similarity matching.
Function calling | Allows models to invoke external tools and APIs based on conversation context.
Assistants and responses | Powers persistent tool-using agents and exposes metadata for debugging and evaluation.
Batches and files | Supports asynchronous bulk LLM requests and file uploads for long documents and structured input.
Audio | Enables speech-to-text, text-to-speech, and translation for voice applications.
Image generation and editing | Generates or modifies images from text prompts.
Video generation | Generates videos from text prompts.
Realtime | Bidirectional WebSocket streaming for low-latency, interactive voice and text applications.
AWS Bedrock native APIs | Enables advanced orchestration and real-time RAG via the Converse and RetrieveAndGenerate endpoints. Available only when using the native LLM format for Bedrock.
Hugging Face native APIs | Provides text generation and streaming using Hugging Face models. Available only when using the native LLM format for Hugging Face.
Rerank | Reorders documents by relevance for RAG pipelines using the Bedrock or Cohere rerank APIs. Available only when using the native LLM format for Bedrock and Cohere.

The following providers are supported by the legacy Completions API:

  • OpenAI
  • Azure OpenAI
  • Cohere
  • Llama2
  • Amazon Bedrock
  • Gemini
  • Hugging Face

Supported AI providers

AI Gateway supports proxying requests to the following AI providers. For detailed capability support, configuration requirements, and provider-specific limitations, see the individual provider reference pages.

Provider | Description
OpenAI | GPT-5, GPT-4, GPT-4o, GPT-3.5, DALL-E, Whisper, Sora, and text embedding models.
Azure OpenAI | Microsoft-hosted OpenAI models with Azure enterprise integration.
Amazon Bedrock | AWS-managed foundation models including Claude, Titan, Llama, and Stable Diffusion.
Anthropic | Claude model family for chat, completions, and function calling.
Gemini | Google’s Gemini models via the Generative Language API.
Vertex AI | Google Cloud-hosted Gemini models with enterprise features.
Cohere | Command models for chat, completions, embeddings, and reranking.
Mistral | Mistral AI models in cloud, self-hosted, or OLLAMA formats.
Hugging Face | Open-source models via the Hugging Face Inference API.
Llama | Meta’s Llama 2 and Llama 3 models in raw, OLLAMA, or OpenAI formats.
xAI | Grok models for chat, function calling, and image generation.
Alibaba Cloud DashScope | Qwen models for chat, embeddings, and image generation.
Cerebras | High-performance inference for Llama models via Cerebras Cloud.

How it works

The AI Proxy Advanced plugin will mediate the following for you:

  • Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
  • The following service request coordinates (unless the model is self-hosted):
    • Protocol
    • Host name
    • Port
    • Path
    • HTTP method
  • Authentication on behalf of the Kong API consumer
  • Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
  • Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
  • Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
  • Fulfillment of requests to self-hosted models, based on select supported format transformations

Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to offer Kong Gateway Consumers a choice of LLMs with consistent request and response formats, regardless of the backend provider or model.
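To make these moving parts concrete, the following is a minimal declarative configuration sketch with two chat targets. The provider, route_type, model.name, and model.options fields are described on this page; the weight and auth.header_name/auth.header_value fields, the model names, and the traffic split are illustrative assumptions, so verify exact field names against the plugin's configuration reference.

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      targets:
        # Primary target: OpenAI chat completions
        - route_type: llm/v1/chat
          weight: 80                              # assumed field: relative share of traffic
          model:
            provider: openai
            name: gpt-4o                          # illustrative model name
            options:
              max_tokens: 512                     # decorated onto the upstream request
              temperature: 0.7
          auth:
            header_name: Authorization            # assumed auth fields
            header_value: "Bearer <OPENAI_API_KEY>"
        # Secondary target: Mistral, reached through the same OpenAI-format route
        - route_type: llm/v1/chat
          weight: 20
          model:
            provider: mistral
            name: mistral-large-latest            # illustrative model name
          auth:
            header_name: Authorization
            header_value: "Bearer <MISTRAL_API_KEY>"
```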

v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.

Request and response formats

AI Gateway transforms requests and responses according to the configured config.targets[].model.provider and config.targets[].route_type, using the OpenAI format by default. v3.10+ To use a provider’s native format instead, set config.llm_format to a value other than openai. The plugin then passes requests upstream without transformation. See Supported native LLM formats for available options.

The following table maps each route type to its OpenAI API reference and generative AI category. See the AI provider reference pages for provider-specific details.

Route type | OpenAI API reference | Gen AI category | Min version
llm/v1/chat | Chat completions | text/generation | 3.6
llm/v1/completions | Completions | text/generation | 3.6
llm/v1/embeddings | Embeddings | text/embeddings | 3.11
llm/v1/files | Files | N/A | 3.11
llm/v1/batches | Batch | N/A | 3.11
llm/v1/assistants | Assistants | text/generation | 3.11
llm/v1/responses | Responses | text/generation | 3.11
realtime/v1/realtime | Realtime | realtime/generation | 3.11
audio/v1/audio/speech | Create speech | audio/speech | 3.11
audio/v1/audio/transcriptions | Create transcription | audio/transcription | 3.11
audio/v1/audio/translations | Create translation | audio/transcription | 3.11
image/v1/images/generations | Create image | image/generation | 3.11
image/v1/images/edits | Create image edit | image/generation | 3.11
video/v1/videos/generations | Create video | video/generation | 3.13

Provider-specific parameters can be passed using the extra_body field in your request. See the sample OpenAPI specification for detailed format examples.

Supported native LLM formats v3.10+

If you use a provider’s native SDK, AI Gateway v3.10+ can proxy the request and return the upstream response without payload format conversion. Set config.llm_format to a value other than openai to preserve the provider’s native request and response formats.

In this mode, AI Gateway will still provide analytics, logging, and cost calculation. When config.llm_format is set to a native format, only the corresponding provider is supported with its specific APIs.

Provider | LLM format | Native capabilities
Anthropic | anthropic | Messages, batch processing
Amazon Bedrock | bedrock | Converse, RAG (RetrieveAndGenerate), reranking, async invocation
Cohere | cohere | Reranking
Gemini | gemini | Content generation, embeddings, batches, file uploads
Vertex AI | gemini | Content generation, embeddings, batches, reranking, long-running predictions
Hugging Face | huggingface | Text generation, streaming
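For example, here is a sketch of proxying Gemini's native format. The config.llm_format parameter and the gemini value come from the table above; the target fields are illustrative, and the exact route_type to pair with a native format should be checked against the Gemini provider reference.

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      llm_format: gemini            # accept and return Gemini-native payloads, no OpenAI conversion
      targets:
        - route_type: llm/v1/chat   # assumed pairing for content generation
          model:
            provider: gemini
            name: gemini-2.0-flash  # illustrative model name
```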

Load balancing

AI Proxy Advanced supports several load balancing algorithms for distributing requests across AI models.

For detailed algorithm descriptions and selection guidance, see Load balancing algorithms.

For load balancing across Gateway Upstreams and Targets instead of LLMs, see load balancing with Kong Gateway.
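As an illustration, the algorithm is selected in the config.balancer block. The algorithm field name and the lowest-latency value shown here are assumptions based on the EWMA behavior described in the FAQs below; check the configuration reference for the full list of supported algorithms.

```yaml
# Balancer block only; targets are configured as in the earlier sketch.
config:
  balancer:
    algorithm: lowest-latency   # assumed field and value; EWMA-based routing to the fastest target
```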

Retry and fallback

The AI load balancer supports configurable retries, timeouts, and failover to different models when a target is unavailable.

v3.10+ Fallback works across targets with any supported format. You can mix providers freely, for example OpenAI and Mistral. Earlier versions require compatible formats between fallback targets. For configuration details, see Retry and fallback configuration.

Client errors don’t trigger failover. To fail over on additional error types, set config.balancer.failover_criteria to include HTTP codes like http_429 or http_502, and non_idempotent for POST requests.
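A hedged sketch of the failover criteria described above: the http_429, http_502, and non_idempotent values come from this page, while the error and timeout entries and the retries field are assumptions about the defaults.

```yaml
# Balancer block only; targets omitted for brevity.
config:
  balancer:
    retries: 3                  # assumed field for the per-request retry count
    failover_criteria:
      - error                   # assumed default criteria
      - timeout
      - http_429                # also fail over on rate-limiting responses ...
      - http_502                # ... and bad gateway responses
      - non_idempotent          # retry POST requests on another target
```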

Health check and circuit breaker v3.13+

The AI load balancer supports circuit breakers to improve reliability. If a target reaches the failure threshold defined by config.balancer.max_fails, the load balancer stops routing requests to it until the timeout period (config.balancer.fail_timeout) elapses.

For configuration details and behavior examples, see Circuit breaker.
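A minimal sketch of the two thresholds named above; the numeric values are illustrative and the fail_timeout unit (seconds) is an assumption.

```yaml
# Balancer block only; targets omitted for brevity.
config:
  balancer:
    max_fails: 3        # stop routing to a target after 3 failed requests
    fail_timeout: 30    # assumed to be seconds; how long the target stays out of rotation
```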

Templating v3.7+

The plugin allows you to substitute values in config.targets[].model.name and any parameter under config.targets[].model.options with specific placeholders, similar to those in the Request Transformer Advanced plugin (see the sketch after the following list).

The following templated parameters are available:

  • $(headers.header_name): The value of a specific request header.
  • $(uri_captures.path_parameter_name): The value of a captured URI path parameter.
  • $(query_params.query_parameter_name): The value of a query string parameter.
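Here is a hedged sketch of header-driven model selection, assuming a client sends a hypothetical x-model-name header and a max_tokens query parameter; both names are illustrative, not fixed by the plugin.

```yaml
# Targets block only; auth and balancer settings omitted for brevity.
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: openai
        name: $(headers.x-model-name)              # hypothetical request header selects the model
        options:
          max_tokens: $(query_params.max_tokens)   # hypothetical query parameter sets a model option
```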

You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:

Action | Description
Select different models dynamically on one provider | Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider.
Use one chat route with dynamic Azure OpenAI deployments | Configure a dynamic route to target multiple Azure OpenAI model deployments.
Use multiple routes to map multiple Azure deployments | Use separate Routes to map Azure OpenAI SDK requests to specific deployments of GPT-3.5 and GPT-4.

Vector databases

A vector database stores vector embeddings, or numerical representations, of data items. For example, a response can be converted to a numerical representation and stored in the vector database, so that new requests can be compared against the stored vectors to find relevant cached items.

The AI Proxy Advanced plugin supports the following vector databases:

  • Redis: set config.vectordb.strategy to redis and configure the connection parameters under config.vectordb.redis.
  • PGVector: set config.vectordb.strategy to pgvector and configure the connection parameters under config.vectordb.pgvector.

To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.
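Here is a partial sketch of the Redis strategy, using the config.vectordb.strategy and config.vectordb.redis parameters named above. The connection fields shown are assumptions, and a working setup typically also needs embedding and vector dimension settings that are omitted here; see the configuration reference for the full set of parameters.

```yaml
# Vector database block only.
config:
  vectordb:
    strategy: redis                    # or pgvector, with parameters under config.vectordb.pgvector
    redis:
      host: redis.example.internal     # assumed connection fields
      port: 6379
```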

Using cloud authentication with Redis v3.13+

Starting in Kong Gateway 3.13, you can authenticate with a cloud Redis provider for your Redis strategy. This allows you to seamlessly rotate credentials without relying on static passwords.

The following providers are supported:

  • AWS ElastiCache
  • Azure Managed Redis
  • Google Cloud Memorystore (with or without Valkey)

Each provider supports both instance and cluster configurations.

Important: Kong Gateway open source plugins do not support any Redis cloud provider cluster configurations.

To configure cloud authentication with Redis, add the cloud provider authentication parameters to your plugin's Redis configuration.

FAQs

Can a request use a different model than the one configured in the plugin?
No. The model name must match the one configured in config.model.name. If a different model is specified in the request, the plugin returns a 400 error.

Do request parameters override the model options set in the plugin configuration?
Yes. The values for temperature, top_p, and top_k in the request take precedence over those set in config.targets.model.options.

Can clients supply their own authentication credentials in the request?
Yes, but only if config.targets.auth.allow_override is set to true in the plugin configuration. When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.

How does latency-based load balancing choose a target?
It uses Kong’s built-in load balancing mechanism with the EWMA (Exponentially Weighted Moving Average) algorithm to dynamically route traffic to the backend with the lowest observed latency.

Over what time window is latency measured?
There’s no fixed time window. EWMA continuously updates with every response, giving more weight to recent observations. Older latencies decay over time, but still contribute in smaller proportions.
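As a rough illustration, an exponentially weighted moving average is typically updated on every response as score = α × latest_latency + (1 − α) × previous_score, where α is a smoothing factor (the value Kong uses internally isn't documented here). Recent responses therefore dominate the score while older ones decay gradually.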

How much traffic does the fastest target receive?
The fastest model gets a majority of traffic, but Kong never sends 100% to a single target unless it’s the only one available. In practice, the dominant target may receive ~90–99% of traffic, depending on how much better its EWMA score is.

Do slower targets keep receiving traffic?
Yes. EWMA ensures all targets continue to receive a small amount of traffic. This ongoing probing lets the system adapt if a previously slower model becomes faster later.

How much traffic do the less performant targets receive?
While exact percentages vary with latency gaps, less performant targets typically get between 0.1% and 5% of traffic, just enough to keep updating their EWMA score for comparison.

What does the "Number of indexes exceeds the limit" error mean?
If you see the following error in the logs:

failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)

it means that the hardcoded limit on indexes per MemoryDB instance has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Proxy Advanced plugin instances.
