The AI Proxy plugin lets you transform and proxy requests to a number of AI providers and models.
The AI Proxy plugin accepts requests in one of several standardized OpenAI-compatible formats, translates them to the configured target format, and then transforms the response back into a standard format.
v3.10+ To use AI Proxy with a provider's native (non-OpenAI) format, without conversion, see the section below for more details.
The AI Proxy plugin will mediate the following for you:
Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
The following service request coordinates (unless the model is self-hosted):
Protocol
Host name
Port
Path
HTTP method
Authentication on behalf of the Kong API consumer
Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
Fulfillment of requests to self-hosted models, based on select supported format transformations
Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to offer Kong Gateway Consumers a choice of LLMs, using consistent request and response formats, regardless of the backend provider or model.
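For illustration, a minimal plugin configuration for proxying chat traffic to OpenAI might look like the sketch below. The API key, model name, and option values are placeholders, and the field paths follow the config.targets[].* schema referenced on this page; confirm them against your Kong Gateway version.

```yaml
# Hypothetical ai-proxy configuration sketch (values are placeholders)
plugins:
  - name: ai-proxy
    config:
      targets:
        - route_type: llm/v1/chat            # must match the upstream capability
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
          model:
            provider: openai
            name: gpt-4o
            options:                          # decorates each request, as described above
              max_tokens: 512
              temperature: 1.0
```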
v3.11+ AI Proxy supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.
The plugin’s route_type should be set according to the target upstream endpoint and model, as shown in this capability matrix:
The following requirements are enforced by upstream providers:
For Azure Responses API, set config.azure_api_version to "preview".
For OpenAI and Azure Assistant APIs, include the header OpenAI-Beta: assistants=v2.
For requests with large payloads (e.g., image edits, audio transcription/translation), consider increasing config.max_request_body_size to three times the raw binary size.
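As a sketch, the Azure- and payload-related settings above could be set as follows, using the parameter paths as written on this page (the body-size value is an illustrative placeholder):

```yaml
config:
  azure_api_version: preview        # required for the Azure Responses API
  max_request_body_size: 25165824   # ~3x the raw binary size of an 8 MB upload
```

Note that OpenAI-Beta: assistants=v2 is a request header, so it is typically supplied by the client on each Assistants API request rather than set in the plugin configuration.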
| Provider | Upstream URL |
|----------|--------------|
| Mistral | As defined in config.targets[].model.options.upstream_url |
| Llama2 | As defined in config.targets[].model.options.upstream_url |
| Amazon Bedrock | https://bedrock-runtime.{region}.amazonaws.com |
| Gemini | https://generativelanguage.googleapis.com |
| Hugging Face | https://api-inference.huggingface.co |
While only the Llama2 and Mistral models are classed as self-hosted, the target URL can be overridden for any of the supported providers.
For example, a self-hosted or otherwise OpenAI-compatible endpoint can be called by setting the same config.targets[].model.options.upstream_url plugin option.
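For example, a minimal sketch pointing the OpenAI chat format at a self-hosted, OpenAI-compatible server (the URL and model name below are placeholders):

```yaml
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: openai                 # any OpenAI-compatible server
        name: my-local-model             # placeholder
        options:
          upstream_url: http://localhost:11434/v1/chat/completions  # placeholder
```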
v3.10+ If you are using each provider’s native SDK, Kong Gateway allows you to transparently proxy the request without any transformation and return the response unmodified. This can be done by setting config.llm_format to a value other than openai, such as gemini or bedrock. See the section below for more details.
In this mode, Kong Gateway will still provide useful analytics, logging, and cost calculation.
v3.10+ By default, Kong Gateway uses the OpenAI format, but you can customize this using config.llm_format. If llm_format is not set to openai, the plugin will not transform the request when sending it upstream and will leave it as-is.
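For instance, a sketch that passes native Gemini SDK traffic through without transformation; only the llm_format value is shown, and the rest of the configuration is unchanged:

```yaml
config:
  llm_format: gemini   # accept and forward native Gemini requests as-is
```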
The Kong Gateway AI Proxy accepts the following input formats, standardized across all providers. The config.targets[].route_type must be configured to match the required request and response formats shown in the examples below.
The following examples show standardized text-based request formats for each supported llm/v1/* route. These formats are normalized across providers to help simplify downstream parsing and integration.
{"messages":[{"role":"system","content":"You are a scientist."},{"role":"user","content":"What is the Boltzmann equation?"}]}
v3.9+ With Amazon Bedrock, you can include your guardrail configuration in the request:
{"messages":[{"role":"system","content":"You are a scientist."},{"role":"user","content":"What is the Boltzmann equation?"}],"guardrailConfig":{"guardrailIdentifier":"$GUARDRAIL-IDENTIFIER","guardrailVersion":"1","trace":"enabled"}}
{"prompt":"You are a scientist. What is quantum entanglement?"}
Supported in: v3.11+
{"input":"Quantum computing is expected to revolutionize cryptography.","model":"text-embedding-ada-002","encoding_format":"float"}
Supported in: v3.11+
This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.
{"instructions":"You are a frontend mentor. When asked a question, write and explain JavaScript code to help the user understand key concepts.","name":"Frontend Mentor","tools":[{"type":"code_interpreter"}],"model":"gpt-4o"}
Supported in: v3.11+
This is a RESTful endpoint that supports all CRUD operations, but this preview example demonstrates only a POST request.
The following examples show standardized audio and image request formats for each supported route. These formats are normalized across providers to help simplify downstream parsing and integration.
Supported in: v3.11+
curl http://localhost:8000 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "In the heart of the city, the rain whispered secrets to the streets.",
    "voice": "serene"
  }' \
  --output speech.mp3
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"The Boltzmann equation is a fundamental equation in statistical mechanics that describes how the distribution function f(x, v, t) of particles in a gas evolves over time. It accounts for the effects of particle collisions and external forces, providing a bridge between microscopic particle dynamics and macroscopic thermodynamic behavior. The general form of the equation is: ∂f/∂t + v · ∇f + F · ∇_v f = (∂f/∂t)_collision, where f is the distribution function, v is velocity, F is an external force, and the right-hand side represents the change in f due to collisions.","role":"assistant"}}],"created":1707769597,"id":"chatcmpl-ID","model":"gpt-4-0613","object":"chat.completion","usage":{"completion_tokens":94,"prompt_tokens":26,"total_tokens":120}}
{"choices":[{"finish_reason":"stop","index":0,"text":"Quantum entanglement is a phenomenon where particles become interconnected such that the state of one instantly influences the state of another, regardless of distance."}],"created":1707769597,"id":"cmpl-ID","model":"gpt-3.5-turbo-instruct","object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":9,"total_tokens":35}}
{"id":"asst_def456","object":"assistant","created_at":1698984975,"name":"Frontend Mentor","description":null,"model":"gpt-4o","instructions":"You are a frontend mentor. When asked a question, write and explain JavaScript code to help the user understand key concepts.","tools":[{"type":"code_interpreter"}],"metadata":{},"top_p":1.0,"temperature":1.0,"response_format":"auto"}
Supported in: v3.11+
{"id":"resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b","object":"response","created_at":1741476542,"status":"completed","error":null,"incomplete_details":null,"instructions":null,"max_output_tokens":null,"model":"gpt-4.1-2025-04-14","output":[{"type":"message","id":"msg_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b","status":"completed","role":"assistant","content":[{"type":"output_text","text":"HTTP/1.1 uses a single connection per request-response cycle, leading to inefficiencies, especially with multiple resources. In contrast, HTTP/2 supports multiplexing, allowing multiple streams over one connection, which reduces latency. HTTP/2 also introduces binary framing and header compression for improved performance.","annotations":[]}]}],"parallel_tool_calls":true,"previous_response_id":null,"reasoning":{"effort":null,"summary":null},"store":true,"temperature":1.0,"text":{"format":{"type":"text"}},"tool_choice":"auto","tools":[],"top_p":1.0,"truncation":"disabled","usage":{"input_tokens":36,"input_tokens_details":{"cached_tokens":0},"output_tokens":60,"output_tokens_details":{"reasoning_tokens":0},"total_tokens":96},"user":null,"metadata":{}}
The following examples show standardized response formats returned by supported audio/ and image/ routes. These formats are normalized across providers to support consistent multimodal output parsing.
Supported in: v3.11+
The response contains the audio file content of speech.mp3.
Supported in: v3.11+
{"text":"Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100 or a 1,000 times bigger. This is a place where you can get to do that.","usage":{"type":"tokens","input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
Supported in: v3.11+
{"text":"Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"}
To deploy Kong AI Gateway with Azure OpenAI, configure a header capture so that the requested model name is inserted directly into the plugin configuration as a string substitution.
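As a hedged sketch, assuming the plugin's string-substitution syntax can read a request header (the header name is a placeholder; verify the exact substitution syntax against your Kong Gateway version):

```yaml
config:
  targets:
    - route_type: llm/v1/chat
      model:
        provider: azure
        name: $(headers.x-model-name)   # assumed substitution syntax, not confirmed here
```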
Yes. If Kong Gateway is running on Azure, AI Proxy can detect the designated Managed Identity or User-Assigned Identity of that Azure compute resource and use it accordingly.
In your AI Proxy configuration, set the following parameters:
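The parameter names below are a sketch based on the Azure authentication options in the plugin's auth block; confirm them against your plugin version:

```yaml
config:
  auth:
    azure_use_managed_identity: true
    azure_client_id: <CLIENT_ID>   # only needed for a User-Assigned Identity
```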
No. The model name must match the one configured in config.targets[].model.name. If a different model is specified in the request, the plugin returns a 400 error.
Yes, but only if config.auth.allow_override is set to true in the plugin configuration.
When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.
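For example, a hedged sketch with a static fallback credential that consumers can override per request:

```yaml
config:
  auth:
    allow_override: true                     # request-level auth may override the static value
    header_name: Authorization
    header_value: Bearer <DEFAULT_API_KEY>   # placeholder; used when the client sends no credential
```

With this in place, a consumer can send their own Authorization header and the plugin forwards it in place of the static value.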