Kong AI Gateway gives you advanced load balancing capabilities to efficiently distribute requests across multiple LLM models. This helps you ensure fault tolerance, optimize resource utilization, and balance traffic across your AI systems.
With the AI Proxy Advanced plugin, you can select from several load balancing algorithms, similar to those used for Kong upstreams but extended for AI model routing. You configure load balancing directly in the plugin configuration, giving you the flexibility to fine-tune how requests are routed to various AI providers and LLM models.
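For example, a declarative snippet for the AI Proxy Advanced plugin might look like the sketch below. Only the `weight` parameter and the algorithm names come from this page; the surrounding field layout, the model names, and the omission of per-target auth are illustrative assumptions that may differ by plugin version.

```yaml
# Sketch: AI Proxy Advanced plugin balancing traffic 70/30 across two models.
# Field nesting and required fields may differ by Kong Gateway version;
# per-target auth and provider options are omitted for brevity.
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        algorithm: round-robin        # weighted round-robin
      targets:
        - model:
            provider: openai
            name: gpt-4o              # placeholder model name
          weight: 70                  # ~70% of requests
        - model:
            provider: openai
            name: gpt-4o-mini         # placeholder model name
          weight: 30                  # ~30% of requests
```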
Load balancing algorithms
Kong AI Gateway supports multiple load balancing strategies to optimize traffic distribution across AI models. Each algorithm is suited for different performance goals such as balancing load, improving cache-hit ratios, reducing latency, or ensuring failover reliability.
The table below provides a detailed overview of the available algorithms, along with considerations to keep in mind when selecting the best option for your use case.
| Algorithm | Description |
|---|---|
| Round-robin (weighted) | Distributes requests across models in a circular pattern with weight-based allocation. The `weight` parameter (for example, `weight: 70`) controls the proportion of traffic sent to each model. By default, all models have the same weight and receive the same percentage of requests. |
| Consistent-hashing | Routes requests based on a hash of a configurable client input, such as a header or user ID. The `hash_on_header` setting (for example, `X-Hashing-Header`) defines the source of the hash and drives all routing decisions. By default, the header is `X-Kong-LLM-Request-ID`. |
| Lowest-usage | Routes requests to the least-utilized models based on resource usage metrics. The `tokens_count_strategy` setting (for example, `prompt-tokens`) defines how usage is measured, such as by prompt tokens or other resource indicators. |
| Lowest-latency | Routes requests to the models with the lowest observed latency. The `latency_strategy` setting (for example, `latency_strategy: e2e`) defines how latency is measured. By default, latency is calculated from the time the model takes to generate each token (`tpot`). The algorithm is based on a peak EWMA (Exponentially Weighted Moving Average), so the balancer selects the backend with the lowest latency over the full request cycle, from TCP connect to body response time. Because it is a moving average, the metric decays over time. |
| Semantic | Routes requests based on semantic similarity between the prompt and each model's description. Embeddings are generated with a configured model (for example, `text-embedding-3-small`), and similarity is calculated using vector search. |
| Priority | Routes requests to models based on assigned priority groups and weights. Models are grouped by priority and can have individual weights (for example, `weight: 70` for GPT-4), allowing proportional load distribution within each priority tier. By default, all models have the same priority. The balancer always chooses a target from the group with the highest priority first; if all targets in that group are down, it chooses a target from the next highest priority group. |
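Each algorithm is selected and tuned through the plugin's balancer settings. The snippets below are illustrative sketches: the parameter names (`hash_on_header`, `tokens_count_strategy`, `latency_strategy`) come from the table above, while the surrounding `balancer` block structure is an assumption and may differ by plugin version.

```yaml
# Sticky routing: hash on a client-supplied header.
balancer:
  algorithm: consistent-hashing
  hash_on_header: X-Hashing-Header     # defaults to X-Kong-LLM-Request-ID
---
# Prefer the least-utilized model, measured by prompt tokens.
balancer:
  algorithm: lowest-usage
  tokens_count_strategy: prompt-tokens
---
# Prefer the fastest model, measured end to end (default is per-token time, tpot).
balancer:
  algorithm: lowest-latency
  latency_strategy: e2e
```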
Retry and fallback
The load balancer includes built-in support for retries and fallbacks. When a request fails, the balancer can automatically retry the same target or redirect the request to a different upstream target.
How retry and fallback works
- The client sends a request.
- The load balancer selects a target based on the configured algorithm (round-robin, lowest-latency, and so on).
- If the target fails (based on the defined `failover_criteria`), the balancer:
  - Retries the same or another target.
  - Falls back to another available target.
- If retries are exhausted without success, the load balancer returns a failure to the client.
```mermaid
flowchart LR
    Client(((Application))) --> LBLB
    subgraph AIGateway
        LBLB[/Load Balancer/]
    end
    LBLB -->|Request| AIProvider1(AI Provider 1)
    AIProvider1 --> Decision1{Is Success?}
    Decision1 -->|Yes| Client
    Decision1 -->|No| AIProvider2(AI Provider 2)
    subgraph Retry
        AIProvider2 --> Decision2{Is Success?}
    end
    Decision2 ------>|Yes| Client
```
Figure 1: A simplified diagram of fallback and retry processing in AI Gateway’s load balancer.
Retry and fallback configuration
The AI Gateway load balancer offers several configuration options to fine-tune request retries, timeouts, and failover behavior:
| Setting | Description |
|---|---|
| `retries` | Defines how many times to retry a failed request before reporting failure to the client. Increase it for better resilience to transient errors; decrease it if you need lower latency and faster failure. |
| `failover_criteria` | Specifies which types of failures (for example, `http_429`, `http_500`) trigger a failover to a different target. Customize it based on your tolerance for specific errors and the failover behavior you want. |
| `connect_timeout` | Sets the maximum time allowed to establish a TCP connection with a target. Lower it for faster detection of unreachable servers; raise it if some servers may respond slowly under load. |
| `read_timeout` | Defines the maximum time to wait for a server response after sending a request. Lower it for real-time applications that need quick responses; increase it for long-running operations. |
| `write_timeout` | Sets the maximum time allowed to send the request payload to the server. Increase it if large request bodies are common; keep it short for small, fast payloads. |
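Taken together, a retry and failover configuration might look like the sketch below. The setting names match the table above; the values, the `balancer` nesting, and the assumption that timeouts are expressed in milliseconds are illustrative only.

```yaml
# Illustrative retry and failover tuning. Values, nesting, and units
# (assumed to be milliseconds) are examples, not recommended defaults.
balancer:
  retries: 3                 # retry up to 3 times before reporting failure
  failover_criteria:         # failures that send the request to another target
    - http_429
    - http_500
  connect_timeout: 1000      # max time to establish a TCP connection
  read_timeout: 5000         # max time to wait for the model's response
  write_timeout: 1000        # max time to send the request payload
```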
Retry and fallback scenarios
You can customize the AI Gateway load balancer to fit different application needs, such as minimizing latency, enabling sticky sessions, or optimizing for cost. The table below maps common scenarios to the key configuration options that control load balancing behavior:
| Scenario | Action | Description |
|---|---|---|
| Requests must not hang longer than 3 seconds | Adjust `connect_timeout`, `read_timeout`, and `write_timeout` | Shorten these timeouts to fail quickly when a server is slow or unresponsive, ensuring faster error handling and responsiveness. |
| Prioritize the lowest-latency target | Set `latency_strategy` to `e2e` | Optimize routing based on full end-to-end response time, selecting the target that minimizes total latency. |
| Need predictable fallback for the same user | Use `hash_on_header` | Route the same user consistently to the same target, enabling sticky sessions and reliable fallback behavior. |
| Models have different costs | Set `tokens_count_strategy` to `cost` | Route requests intelligently by considering cost, balancing model performance with budget optimization. |
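For example, the first two scenarios could translate into a balancer block like the following sketch (assuming timeouts are in milliseconds; the exact field layout may differ by plugin version):

```yaml
# Sketch: fail fast so no single attempt hangs much past 3 seconds, and
# route to the target with the lowest end-to-end latency. Units assumed ms.
balancer:
  algorithm: lowest-latency
  latency_strategy: e2e
  connect_timeout: 1000
  read_timeout: 3000
  write_timeout: 1000
  retries: 1        # note: each retry adds to the worst-case total time
```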
Version compatibility for fallbacks
Kong Gateway v3.10 and later:
- Full fallback support across targets, even with different API formats.
- You can mix models from different providers if needed (for example, OpenAI and Mistral).
Before v3.10:
- Fallbacks are only allowed between targets that use the same API format.
- For example, OpenAI-to-OpenAI fallback is supported; OpenAI-to-Ollama is not.
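On v3.10 or later, a fallback chain can therefore mix providers. The sketch below pairs an OpenAI target with a Mistral fallback; the model names, weights, and field nesting are illustrative assumptions.

```yaml
# Sketch: cross-provider fallback (requires Kong Gateway 3.10+).
# Per-target auth omitted; model names are placeholders.
balancer:
  algorithm: round-robin
  retries: 2
  failover_criteria:
    - http_429
    - http_500
targets:
  - model:
      provider: openai
      name: gpt-4o
    weight: 70
  - model:
      provider: mistral
      name: mistral-large-latest
    weight: 30
```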