Load balancing with AI Proxy Advanced

Minimum version: Kong Gateway 3.10

AI Gateway provides load balancing capabilities to distribute requests across multiple LLM models. You can use these features to improve fault tolerance, optimize resource utilization, and balance traffic across your AI systems.

The AI Proxy Advanced plugin supports several load balancing algorithms, similar to those used for Kong upstreams and extended for AI model routing. You configure load balancing in the plugin configuration, where the config.targets list defines the candidate models and the config.balancer settings control how requests are routed across AI providers and models.
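As a quick orientation, the following is a minimal decK-style sketch of the weighted round-robin setup described below. It is trimmed to the load-balancing fields: a real target also needs provider authentication and model options, and the exact field names should be verified against the plugin's schema.

```yaml
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        algorithm: round-robin        # weighted round-robin across targets
      targets:
        - model:
            provider: openai
            name: gpt-4
          weight: 70                  # receives ~70% of traffic
        - model:
            provider: openai
            name: gpt-4o-mini
          weight: 25                  # receives ~25% of traffic
        - model:
            provider: openai
            name: gpt-3               # model names mirror the example below
          weight: 5                   # receives ~5% of traffic
```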

Load balancing algorithms

AI Gateway supports multiple load balancing strategies for distributing traffic across AI models. Each algorithm addresses different goals: balancing load, improving cache-hit ratios, reducing latency, or providing failover reliability.

The following entries describe the available algorithms and the considerations for selecting one.

Round-robin (weighted)
Distributes requests across models based on their assigned weights. For example, if the models gpt-4, gpt-4o-mini, and gpt-3 have weights of 70, 25, and 5, they receive approximately 70%, 25%, and 5% of traffic respectively. Requests are distributed proportionally, independent of usage or latency metrics. Considerations:
  • Traffic is routed proportionally based on weights.
  • Requests follow a circular sequence adjusted by weight.
  • Does not account for cache-hit ratios, latency, or current load.

Consistent-hashing
Routes requests based on a hash of a configurable header value. Requests with the same header value are routed to the same model, enabling sticky sessions that maintain context across user interactions. The hash_on_header setting defines the header to hash; the default is X-Kong-LLM-Request-ID (see the sketch below). Considerations:
  • Effective with consistent keys like user IDs.
  • Requires diverse hash inputs for balanced distribution.
  • Useful for session persistence and cache-hit optimization.
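For example, a sketch of sticky sessions keyed on a client-supplied user ID (X-User-ID is a hypothetical header name):

```yaml
balancer:
  algorithm: consistent-hashing
  hash_on_header: X-User-ID   # hypothetical header; requests with the same value route to the same model
```
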
Least-connections v3.13+
Tracks the number of in-flight requests for each backend and routes new requests to the backend with the highest spare capacity. The weight parameter is used to calculate connection capacity. Considerations:
  • Dynamically adapts to backend response times.
  • Routes away from slower backends as they accumulate open connections.
  • Does not account for cache-hit ratios.

Lowest-usage
Routes requests to the models with the lowest measured resource usage. The tokens_count_strategy parameter defines how usage is measured: prompt token counts, response token counts, or cost v3.10+ (see the sketch below). Considerations:
  • Balances load based on actual consumption metrics.
  • Useful for cost optimization and for avoiding overload of individual models.
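A sketch selecting cost as the usage metric (the exact enum values accepted by tokens_count_strategy should be checked against the plugin schema):

```yaml
balancer:
  algorithm: lowest-usage
  tokens_count_strategy: cost   # v3.10+; routes to the target with the lowest accumulated cost
```
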
Lowest-latency
Routes requests to the model with the lowest observed latency. The latency_strategy parameter defines how latency is measured: the default, tpot, uses time per output token, while e2e uses end-to-end response time. The algorithm uses a peak EWMA (Exponentially Weighted Moving Average) to track latency from TCP connect through body response, and the metrics decay over time. A sketch follows this entry. Considerations:
  • Prioritizes models with the fastest response times.
  • Suited for latency-sensitive applications.
  • Less suitable for long-lived connections like WebSockets.
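A sketch switching latency measurement to end-to-end response time:

```yaml
balancer:
  algorithm: lowest-latency
  latency_strategy: e2e   # measure full response time; the default, tpot, measures time per output token
```
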
Semantic
Routes requests based on semantic similarity between the prompt and the target model descriptions. Embeddings are generated using a specified model (for example, text-embedding-3-small), and similarity is calculated using vector search. v3.13+ Multiple targets can share identical descriptions; when they do, the balancer performs a round-robin fallback among them if the primary target fails, with weights affecting the fallback order. A sketch follows this entry. Considerations:
  • Requires a vector database (for example, Redis) for similarity matching.
  • The distance_metric and threshold settings control matching sensitivity.
  • Best for routing prompts to domain-specialized models.
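A sketch of semantic routing, assuming embeddings and vectordb configuration blocks alongside the balancer (the Redis connection details, threshold, dimensions, and target descriptions are illustrative):

```yaml
config:
  balancer:
    algorithm: semantic
  embeddings:
    model:
      provider: openai
      name: text-embedding-3-small   # embeds prompts and target descriptions
  vectordb:
    strategy: redis                  # vector database used for similarity search
    distance_metric: cosine          # how similarity is scored
    threshold: 0.75                  # illustrative minimum similarity to match a target
    dimensions: 1536                 # must match the embedding model's output size
    redis:
      host: redis.example.com        # illustrative host
      port: 6379
  targets:
    - model:
        provider: openai
        name: gpt-4o
      description: "Code generation and debugging"    # matched against the prompt embedding
    - model:
        provider: mistral
        name: mistral-large
      description: "General chat and summarization"   # illustrative specialization
```
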
Priority v3.10+
Routes requests to models based on assigned priority groups. The balancer always selects from the highest-priority group first; if all targets in that group are unavailable, it falls back to the next group. Within each group, the weight parameter controls traffic distribution. A sketch follows this entry. Considerations:
  • Higher-priority groups receive all traffic until they fail.
  • Lower-priority groups serve as fallback only.
  • Useful for cost-aware routing and controlled failover.
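A sketch of priority groups, assuming a per-target priority field (the field name, and whether lower or higher values are selected first, are assumptions; check the plugin schema):

```yaml
balancer:
  algorithm: priority   # v3.10+
targets:
  - model:
      provider: openai
      name: gpt-4o
    priority: 1           # assumed field: primary group serves all traffic while healthy
    weight: 100
  - model:
      provider: mistral
      name: mistral-large
    priority: 0           # assumed field: fallback group, used only if the primary group is unavailable
    weight: 100
```
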

Retry and fallback

The load balancer includes built-in support for retries and fallbacks. When a request fails, the balancer can automatically retry the same target or redirect the request to a different upstream target.

How retry and fallback work

  1. Client sends a request.
  2. The load balancer selects a target based on the configured algorithm (round-robin, lowest-latency, etc.).
  3. If the target fails (based on the configured failover_criteria), the balancer:

    • Retries the same target or tries another target.
    • Falls back to another available target.
  4. If retries are exhausted without success, the load balancer returns a failure to the client.
 
```mermaid
flowchart LR
    Client(((Application))) --> LB
    subgraph AIGateway
        LB[/Load Balancer/]
    end
    LB -->|Request| AIProvider1(AI Provider 1)
    AIProvider1 --> Decision1{Is Success?}
    Decision1 -->|Yes| Client
    Decision1 -->|No| AIProvider2(AI Provider 2)
    subgraph Retry
        AIProvider2 --> Decision2{Is Success?}
    end
    Decision2 ------>|Yes| Client
```

Figure 1: A simplified diagram of fallback and retry processing in AI Gateway’s load balancer.

Retry and fallback configuration

The AI Gateway load balancer supports fine-grained control over failover behavior. Use failover_criteria to define when a request should be retried on the next upstream target. By default, retries occur on error and timeout: an error means a failure occurred while connecting to the server, forwarding the request, or reading the response header, and a timeout means that any of those stages exceeded the allowed time.

You can add more criteria to adjust retry behavior as needed:

  • retries: Defines how many times to retry a failed request before reporting failure to the client. Increase it for better resilience to transient errors; decrease it if you need lower latency and faster failure.
  • failover_criteria: Specifies which types of failures (for example, http_429 or http_500) should trigger a failover to a different target. Customize it based on your tolerance for specific errors and the failover behavior you want (see the sketch after this list).
  • connect_timeout: Sets the maximum time allowed to establish a TCP connection with a target. Lower it for faster detection of unreachable servers; raise it if some servers may respond slowly under load.
  • read_timeout: Defines the maximum time to wait for a server response after sending a request. Lower it for real-time applications that need quick responses; increase it for long-running operations.
  • write_timeout: Sets the maximum time allowed to send the request payload to the server. Increase it if large request bodies are common; keep it short for small, fast payloads.
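A sketch combining these settings (the timeout values are assumed to be in milliseconds, as with other Kong timeouts; confirm the units against the plugin schema):

```yaml
balancer:
  retries: 3               # retry up to 3 times before failing the request
  failover_criteria:
    - error                # default: connect, forward, or header-read failure
    - timeout              # default: any of those stages exceeded its allowed time
    - http_429             # also fail over when a target rate-limits
    - http_500             # and when it returns a server error
  connect_timeout: 3000    # assumed milliseconds
  read_timeout: 30000      # allow longer generation times
  write_timeout: 3000
```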

Retry and fallback scenarios

You can customize the AI Gateway load balancer to fit different application needs, such as minimizing latency, enabling sticky sessions, or optimizing for cost. The list below maps common scenarios to the key configuration options that control load balancing behavior:

  • Requests must not hang longer than 3 seconds: Adjust connect_timeout, read_timeout, and write_timeout. Shorten these timeouts to fail quickly when a server is slow or unresponsive, ensuring faster error handling and responsiveness.
  • Prioritize the lowest-latency target: Set latency_strategy to e2e. Routing is then optimized on full end-to-end response time, selecting the target that minimizes total latency.
  • Need predictable fallback for the same user: Use hash_on_header. The same user then consistently routes to the same target, enabling sticky sessions and reliable fallback behavior.
  • Models have different costs: Set tokens_count_strategy to cost. Requests are then routed with cost in mind, balancing model performance with budget optimization.

Version compatibility for fallbacks

In v3.10 and later:

  • Full fallback support across targets, even with different API formats.
  • Models from different providers can be mixed if needed (for example, OpenAI and Mistral).

Before 3.10:

  • Fallbacks are only allowed between targets using the same API format. For example, OpenAI-to-OpenAI fallback is supported; OpenAI-to-Ollama is not.

Health check and circuit breaker v3.13+

The load balancer supports health checks and circuit breakers to improve reliability. If the number of unsuccessful attempts to a target reaches config.balancer.max_fails, the load balancer stops sending requests to that target, and reconsiders it only after the period defined by config.balancer.fail_timeout has elapsed. The example below illustrates this behavior:

Circuit breaker

Consider an example where config.balancer.max_fails is 3 and config.balancer.fail_timeout is 10 seconds. When failed requests to a target reach 3, the target is marked unhealthy and the load balancer stops sending requests to it. After 10 seconds, the target is reconsidered: if the next request to it still fails, the target remains unhealthy and the load balancer continues to exclude it; if the request succeeds, the target is marked healthy again and exits the circuit breaker.

The failure counter tracks total failures, not consecutive failures. If a target receives 2 failed requests, then 1 successful request within the timeout window, the counter remains at 2. The counter resets only when a successful request occurs after config.balancer.fail_timeout has elapsed since the last failed request.

If all targets become unhealthy simultaneously, requests fail with HTTP 500.
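A sketch of the circuit breaker settings from the example above (fail_timeout is assumed to be in seconds, matching the example; confirm against the plugin schema):

```yaml
balancer:
  max_fails: 3       # mark a target unhealthy after 3 failed requests
  fail_timeout: 10   # assumed seconds; reconsider the target after this period
```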
