AI Proxy Advanced

AI License Required
Made by: Kong Inc.
Supported Gateway Topologies: hybrid, db-less, traditional
Supported Konnect Deployments: hybrid, cloud-gateways, serverless
Compatible Protocols: grpc, grpcs, http, https, ws, wss
Minimum Version: Kong Gateway 3.8
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time, so you can load balance requests across targets.

The AI Proxy Advanced plugin accepts requests in one of several standardized OpenAI formats, translates them to the format of the configured target provider, and then transforms the response back into the standard OpenAI format.

v3.10+ To use AI Proxy Advanced with a provider's native (non-OpenAI) format without conversion, see Supported native LLM formats below.

Overview of capabilities

The AI Proxy Advanced plugin supports capabilities across batch processing, multimodal embeddings, agents, audio, image, streaming, and more, spanning multiple providers.

For Kong Gateway versions 3.6 or earlier, the following capabilities are supported:

  • Chat Completions APIs: Multi-turn conversations with system/user/assistant roles.

  • Completions API: Generates free-form text from a prompt.

    OpenAI has marked this endpoint as legacy and recommends using the Chat Completions API for developing new applications.

See the following table for capabilities supported in AI Gateway:

API capability | Description
Chat completions | Generates conversational responses from a sequence of messages using supported LLM providers.
Embeddings | Converts text to vector representations for semantic search and similarity matching.
Function calling | Allows models to invoke external tools and APIs based on conversation context.
Assistants and responses | Powers persistent tool-using agents and exposes metadata for debugging and evaluation.
Batches and files | Supports asynchronous bulk LLM requests and file uploads for long documents and structured input.
Audio | Enables speech-to-text, text-to-speech, and translation for voice applications.
Image generation and editing | Generates or modifies images from text prompts.
Video generation | Generates videos from text prompts.
Realtime | Bidirectional WebSocket streaming for low-latency, interactive voice and text applications.
AWS Bedrock native APIs | Enables advanced orchestration and real-time RAG via the Converse and RetrieveAndGenerate endpoints. Available only when using the native LLM format for Bedrock.
Hugging Face native APIs | Provides text generation and streaming using Hugging Face models. Available only when using the native LLM format for Hugging Face.
Rerank | Reorders documents by relevance for RAG pipelines using the Bedrock or Cohere rerank APIs. Available only when using the native LLM format for Bedrock and Cohere.

The following providers are supported by the legacy Completions API:

  • OpenAI
  • Azure OpenAI
  • Cohere
  • Llama2
  • Amazon Bedrock
  • Gemini
  • Hugging Face

Supported AI providers

AI Gateway supports proxying requests to the following AI providers. For detailed capability support, configuration requirements, and provider-specific limitations, see the individual provider reference pages.

Provider | Description
OpenAI | GPT-5, GPT-4, GPT-4o, GPT-3.5, DALL-E, Whisper, Sora, and text embedding models.
Azure OpenAI | Microsoft-hosted OpenAI models with Azure enterprise integration.
Amazon Bedrock | AWS-managed foundation models including Claude, Titan, Llama, and Stable Diffusion.
Anthropic | Claude model family for chat, completions, and function calling.
Gemini | Google’s Gemini models via the Generative Language API.
Vertex AI | Google Cloud-hosted Gemini models with enterprise features.
Cohere | Command models for chat, completions, embeddings, and reranking.
Mistral | Mistral AI models in cloud, self-hosted, or OLLAMA formats.
Hugging Face | Open-source models via the Hugging Face Inference API.
Llama | Meta’s Llama 2 and Llama 3 models in raw, OLLAMA, or OpenAI formats.
xAI | Grok models for chat, function calling, and image generation.
Alibaba Cloud DashScope | Qwen models for chat, embeddings, and image generation.
Cerebras | High-performance inference for Llama models via Cerebras Cloud.

How it works

The AI Proxy Advanced plugin will mediate the following for you:

  • Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
  • The following service request coordinates (unless the model is self-hosted):
    • Protocol
    • Host name
    • Port
    • Path
    • HTTP method
  • Authentication on behalf of the Kong API consumer
  • Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
  • Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
  • Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
  • Fulfillment of requests to self-hosted models, based on select supported format transformations

Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to offer Kong Gateway Consumers a choice of LLMs with consistent request and response formats, regardless of the backend provider or model.
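To make these moving parts concrete, the following is a minimal declarative configuration sketch with two chat targets. The provider, route_type, model.name, and model.options fields are described on this page; the weight and auth.header_name/auth.header_value fields, the model names, and the traffic split are illustrative assumptions, so verify exact field names against the plugin's configuration reference.

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      targets:
        # Primary target: OpenAI chat completions
        - route_type: llm/v1/chat
          weight: 80                              # assumed field: relative share of traffic
          model:
            provider: openai
            name: gpt-4o                          # illustrative model name
            options:
              max_tokens: 512                     # decorated onto the upstream request
              temperature: 0.7
          auth:
            header_name: Authorization            # assumed auth fields
            header_value: "Bearer <OPENAI_API_KEY>"
        # Secondary target: Mistral, reached through the same OpenAI-format route
        - route_type: llm/v1/chat
          weight: 20
          model:
            provider: mistral
            name: mistral-large-latest            # illustrative model name
          auth:
            header_name: Authorization
            header_value: "Bearer <MISTRAL_API_KEY>"
```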

v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.

Request and response formats

AI Gateway transforms requests and responses according to the configured config.targets[].model.provider and config.targets[].route_type, using the OpenAI format by default. v3.10+ To use a provider’s native format instead, set config.llm_format to a value other than openai. The plugin then passes requests upstream without transformation. See Supported native LLM formats for available options.

The following table maps each route type to its OpenAI API reference and generative AI category. See the AI provider reference pages for provider-specific details.

Route type | OpenAI API reference | Gen AI category | Min version
llm/v1/chat | Chat completions | text/generation | 3.6
llm/v1/completions | Completions | text/generation | 3.6
llm/v1/embeddings | Embeddings | text/embeddings | 3.11
llm/v1/files | Files | N/A | 3.11
llm/v1/batches | Batch | N/A | 3.11
llm/v1/assistants | Assistants | text/generation | 3.11
llm/v1/responses | Responses | text/generation | 3.11
realtime/v1/realtime | Realtime | realtime/generation | 3.11
audio/v1/audio/speech | Create speech | audio/speech | 3.11
audio/v1/audio/transcriptions | Create transcription | audio/transcription | 3.11
audio/v1/audio/translations | Create translation | audio/transcription | 3.11
image/v1/images/generations | Create image | image/generation | 3.11
image/v1/images/edits | Create image edit | image/generation | 3.11
video/v1/videos/generations | Create video | video/generation | 3.13

Provider-specific parameters can be passed using the extra_body field in your request. See the sample OpenAPI specification for detailed format examples.

Supported native LLM formats v3.10+

If you use a provider’s native SDK, AI Gateway v3.10+ can proxy the request and return the upstream response without payload format conversion. Set config.llm_format to a value other than openai to preserve the provider’s native request and response formats.

In this mode, AI Gateway will still provide analytics, logging, and cost calculation. When config.llm_format is set to a native format, only the corresponding provider is supported with its specific APIs.

Provider | LLM format | Native capabilities
Anthropic | anthropic | Messages, batch processing
Amazon Bedrock | bedrock | Converse, RAG (RetrieveAndGenerate), reranking, async invocation
Cohere | cohere | Reranking
Gemini | gemini | Content generation, embeddings, batches, file uploads
Vertex AI | gemini | Content generation, embeddings, batches, reranking, long-running predictions
Hugging Face | huggingface | Text generation, streaming
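For example, here is a sketch of proxying Gemini's native format. The config.llm_format parameter and the gemini value come from the table above; the target fields are illustrative, and the exact route_type to pair with a native format should be checked against the Gemini provider reference.

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      llm_format: gemini            # accept and return Gemini-native payloads, no OpenAI conversion
      targets:
        - route_type: llm/v1/chat   # assumed pairing for content generation
          model:
            provider: gemini
            name: gemini-2.0-flash  # illustrative model name
```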

Load balancing

AI Proxy Advanced supports several load balancing algorithms for distributing requests across AI models.

For detailed algorithm descriptions and selection guidance, see Load balancing algorithms.

For load balancing across Gateway Upstreams and Targets instead of LLMs, see load balancing with Kong Gateway.
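As an illustration, the algorithm is selected in the config.balancer block. The algorithm field name and the lowest-latency value shown here are assumptions based on the EWMA behavior described in the FAQs below; check the configuration reference for the full list of supported algorithms.

```yaml
# Balancer block only; targets are configured as in the earlier sketch.
config:
  balancer:
    algorithm: lowest-latency   # assumed field and value; EWMA-based routing to the fastest target
```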

Retry and fallback

The AI load balancer supports configurable retries, timeouts, and failover to different models when a target is unavailable.

v3.10+ Fallback works across targets with any supported format. You can mix providers freely, for example OpenAI and Mistral. Earlier versions require compatible formats between fallback targets. For configuration details, see Retry and fallback configuration.

Client errors don’t trigger failover. To fail over on additional error types, set config.balancer.failover_criteria to include HTTP codes like http_429 or http_502, and non_idempotent for POST requests.
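A hedged sketch of the failover criteria described above: the http_429, http_502, and non_idempotent values come from this page, while the error and timeout entries and the retries field are assumptions about the defaults.

```yaml
# Balancer block only; targets omitted for brevity.
config:
  balancer:
    retries: 3                  # assumed field for the per-request retry count
    failover_criteria:
      - error                   # assumed default criteria
      - timeout
      - http_429                # also fail over on rate-limiting responses ...
      - http_502                # ... and bad gateway responses
      - non_idempotent          # retry POST requests on another target
```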

Health check and circuit breaker v3.13+

The AI load balancer supports circuit breakers to improve reliability. If a target reaches the failure threshold defined by config.balancer.max_fails, the load balancer stops routing requests to it until the timeout period (config.balancer.fail_timeout) elapses.

For configuration details and behavior examples, see Circuit breaker.
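A minimal sketch of the two thresholds named above; the numeric values are illustrative and the fail_timeout unit (seconds) is an assumption.

```yaml
# Balancer block only; targets omitted for brevity.
config:
  balancer:
    max_fails: 3        # stop routing to a target after 3 failed requests
    fail_timeout: 30    # assumed to be seconds; how long the target stays out of rotation
```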

Templating v3.7+

The plugin allows you to substitute values in config.targets[].model.name and any parameter under config.targets[].model.options with specific placeholders, similar to those in the Request Transformer Advanced plugin (see the sketch after the following list).

The following templated parameters are available:

  • $(headers.header_name): The value of a specific request header.
  • $(uri_captures.path_parameter_name): The value of a captured URI path parameter.
  • $(query_params.query_parameter_name): The value of a query string parameter.
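Here is a hedged sketch of header-driven model selection, assuming a client sends a hypothetical x-model-name header and a max_tokens query parameter; both names are illustrative, not fixed by the plugin.

```yaml
# Targets block only; auth and balancer settings omitted for brevity.
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: openai
        name: $(headers.x-model-name)              # hypothetical request header selects the model
        options:
          max_tokens: $(query_params.max_tokens)   # hypothetical query parameter sets a model option
```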

You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:

Action | Description
Select different models dynamically on one provider | Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider.
Use one chat route with dynamic Azure OpenAI deployments | Configure a dynamic route to target multiple Azure OpenAI model deployments.
Use multiple routes to map multiple Azure deployments | Use separate Routes to map Azure OpenAI SDK requests to specific deployments of GPT-3.5 and GPT-4.

Vector databases

A vector database stores vector embeddings, or numerical representations, of data items. For example, a response can be converted to a numerical representation and stored in the vector database, so that new requests can be compared against the stored vectors to find relevant cached items.

The AI Proxy Advanced plugin supports the following vector databases:

  • Redis: set config.vectordb.strategy to redis and configure the connection parameters under config.vectordb.redis.
  • PGVector: set config.vectordb.strategy to pgvector and configure the connection parameters under config.vectordb.pgvector.

To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.
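Here is a partial sketch of the Redis strategy, using the config.vectordb.strategy and config.vectordb.redis parameters named above. The connection fields shown are assumptions, and a working setup typically also needs embedding and vector dimension settings that are omitted here; see the configuration reference for the full set of parameters.

```yaml
# Vector database block only.
config:
  vectordb:
    strategy: redis                    # or pgvector, with parameters under config.vectordb.pgvector
    redis:
      host: redis.example.internal     # assumed connection fields
      port: 6379
```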

Using cloud authentication with Redis v3.13+

Starting in Kong Gateway 3.13, you can authenticate with a cloud Redis provider for your Redis strategy. This allows you to seamlessly rotate credentials without relying on static passwords.

The following providers are supported:

  • AWS ElastiCache
  • Azure Managed Redis
  • Google Cloud Memorystore (with or without Valkey)

Each provider supports both instance and cluster configurations.

Important: Kong Gateway open source plugins do not support any Redis cloud provider cluster configurations.

To configure cloud authentication with Redis, add the cloud provider authentication parameters to your plugin's Redis configuration.

FAQs

Can a request use a different model than the one configured in the plugin?
No. The model name must match the one configured in config.model.name. If a different model is specified in the request, the plugin returns a 400 error.

Do request parameters override the model options set in the plugin configuration?
Yes. The values for temperature, top_p, and top_k in the request take precedence over those set in config.targets.model.options.

Can clients supply their own authentication credentials in the request?
Yes, but only if config.targets.auth.allow_override is set to true in the plugin configuration. When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.

How does latency-based load balancing choose a target?
It uses Kong’s built-in load balancing mechanism with the EWMA (Exponentially Weighted Moving Average) algorithm to dynamically route traffic to the backend with the lowest observed latency.

Over what time window is latency measured?
There’s no fixed time window. EWMA continuously updates with every response, giving more weight to recent observations. Older latencies decay over time, but still contribute in smaller proportions.
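As a rough illustration, an exponentially weighted moving average is typically updated on every response as score = α × latest_latency + (1 − α) × previous_score, where α is a smoothing factor (the value Kong uses internally isn't documented here). Recent responses therefore dominate the score while older ones decay gradually.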

How much traffic does the fastest target receive?
The fastest model gets a majority of traffic, but Kong never sends 100% to a single target unless it’s the only one available. In practice, the dominant target may receive ~90–99% of traffic, depending on how much better its EWMA score is.

Do slower targets keep receiving traffic?
Yes. EWMA ensures all targets continue to receive a small amount of traffic. This ongoing probing lets the system adapt if a previously slower model becomes faster later.

How much traffic do the less performant targets receive?
While exact percentages vary with latency gaps, less performant targets typically get between 0.1% and 5% of traffic, just enough to keep updating their EWMA score for comparison.

What does the "Number of indexes exceeds the limit" error mean?
If you see the following error in the logs:

failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)

it means that the hardcoded limit on indexes per MemoryDB instance has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Proxy Advanced plugin instances.
