Made by: Kong Inc.
Supported Gateway Topologies: hybrid, db-less, traditional
Supported Konnect Deployments: hybrid, cloud-gateways, serverless
Compatible Protocols: grpc, grpcs, http, https
Minimum Version: Kong Gateway 3.7
AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI Rate Limiting Advanced plugin provides rate limiting for the LLM providers used by any AI plugins. It extends the Rate Limiting Advanced plugin.

This plugin uses the token data returned by the LLM provider to calculate the cost of each query. Because providers count tokens differently, the same HTTP request can vary greatly in cost.

A common pattern to protect your AI API is to analyze and assign costs to incoming queries, then rate limit the consumer’s cost for a given time window and provider. You can also create a generic prompt rate limit using the request prompt provider.

Kong also provides multiple specialized rate limiting plugins, including rate limiting for service protection and on GraphQL queries. See Rate Limiting in Kong Gateway to choose the plugin that is most useful in your use case.

Strategies

The AI Rate Limiting Advanced plugin supports three rate limiting strategies: local, cluster, and redis.

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| local | Counters are stored in-memory on the node. | Minimal performance impact. | Less accurate. Unless there's a consistent-hashing load balancer in front of Kong Gateway, it diverges when scaling the number of nodes. |
| cluster | Counters are stored in the Kong Gateway data store and shared across nodes. | Accurate[1], no extra components to support. | Each request forces a read and a write on the data store, so this strategy has the biggest relative performance impact. Not supported in hybrid mode or Konnect deployments. |
| redis | Counters are stored on a Redis server and shared across nodes. | Accurate[1], less performance impact than the cluster strategy. | Needs a Redis installation. Bigger performance impact than the local strategy. |

[1]: Only when the config.sync_rate option is set to 0 (synchronous behavior).
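
For example, a minimal declarative configuration sketch for the redis strategy could look like the following. The provider name, limits, and Redis connection details are placeholders, and the field layout (llm_providers, limit, window_size, redis) should be checked against the plugin's configuration reference:

plugins:
  - name: ai-rate-limiting-advanced
    config:
      strategy: redis              # local, cluster, or redis
      sync_rate: 1                 # sync counters with Redis once per second
      llm_providers:
        - name: openai             # provider whose token usage is limited
          limit: 1000              # allowed token cost per window
          window_size: 60          # window size in seconds
      redis:
        host: redis.example.com    # placeholder Redis host
        port: 6379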

Two common use cases for rate limiting are:

  1. Every transaction counts: The highest level of accuracy is needed. An example is a transaction with financial consequences.
  2. Backend protection: Accuracy is not as relevant. The requirement is only to protect backend services from overloading that’s caused either by specific users or by attacks.

Every transaction counts

In this scenario, because accuracy is important, the local policy is not an option. Consider the support effort you might need for Redis, and then choose either cluster or redis.

You could start with the cluster policy, and move to redis if performance drops significantly.

If using a very high sync frequency, use redis. Very high sync frequencies with cluster mode are not scalable and not recommended. The sync frequency becomes higher when the sync_rate setting is a lower number - for example, a sync_rate of 0.1 is a much higher sync frequency (10 counter syncs per second) than a sync_rate of 1 (1 counter sync per second).

You can calculate what is considered a very high sync rate in your environment based on your topology, number of plugins, their sync rates, and tolerance for loose rate limits.

Together, the interaction between sync rate and window size affects how accurately the plugin can determine cluster-wide traffic. For example, the following table represents the worst-case scenario where a full sync interval’s worth of data hasn’t yet propagated across nodes:

| Property | Formula or config location | Value |
|---|---|---|
| Window size in seconds | Value set in config.window_size | 5 |
| Limit (in window) | Value set in config.limit | 1000 |
| Sync rate (interval) | Value set in config.sync_rate | 0.5 |
| Number of nodes (>1) | | 10 |
| Estimated load balanced requests-per-second (RPS) to a node | Limit / Window size / Number of nodes | 1000 / 5 / 10 = 20 |
| Max potential lag in cluster count for a given node/s | Estimated load balanced RPS * Sync rate | 20 * 0.5 = 10 |
| Cluster-wide max potential overage/s | Max potential lag * Number of nodes | 10 * 10 = 100 |
| Cluster-wide max potential overage/s as a percentage | Cluster-wide max potential overage / Limit | 100 / 1000 = 10% |
| Effective worst-case cluster-wide requests allowed at window size | Limit + Cluster-wide max potential overage | 1000 + 100 = 1100 |

If you choose to switch strategies, note that you can’t port the existing usage metrics from the Kong Gateway data store to Redis. This might not be a problem with short-lived metrics (for example, seconds or minutes) but if you use metrics with a longer time frame (for example, months), plan your switch carefully.

Backend protection

If accuracy is less important, choose the local policy. You might need to experiment a little before you get a setting that works for your scenario. As the cluster scales to more nodes, more user requests are handled. When the cluster scales down, the probability of false negatives increases. Make sure to adjust your rate limits when scaling.

For example, if a user can make 100 requests every second, and you have an equally balanced 5-node Kong Gateway cluster, you can set the local limit to 30 requests every second. If you see too many false negatives, increase the limit.

To minimize inaccuracies, consider using a consistent-hashing load balancer in front of Kong Gateway. The load balancer ensures that a user is always directed to the same Kong Gateway node, which reduces inaccuracies and prevents scaling problems.
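
For example, a sketch of the per-node configuration for the 5-node scenario above, using the local strategy (provider name and field layout are assumptions to verify against the configuration reference):

plugins:
  - name: ai-rate-limiting-advanced
    config:
      strategy: local        # counters kept in-memory on each node
      llm_providers:
        - name: openai       # placeholder provider
          limit: 30          # per-node limit from the example above
          window_size: 1     # one-second window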

Headers sent to the client

When this plugin is enabled, Kong Gateway sends some additional headers back to the client, indicating the allowed limits, how many requests are available, and how long it will take until the quota is restored. For each provider, it also sends the limit for the time frame and the remaining allowance within it.

For example:

X-AI-RateLimit-Reset: 47
X-AI-RateLimit-Retry-After: 47
X-AI-RateLimit-Limit-30-azure: 1000
X-AI-RateLimit-Remaining-30-azure: 950

You can optionally hide the limit and remaining headers with the config.hide_client_headers option.
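
For example, a sketch of the relevant fragment inside the plugin's config block:

config:
  hide_client_headers: true   # suppress the X-AI-RateLimit-* headers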

If more than one limit is set, the plugin returns multiple time limit headers. For example:

X-AI-RateLimit-Limit-30-azure: 1000
X-AI-RateLimit-Remaining-30-azure: 950
X-AI-RateLimit-Limit-40-cohere: 2000
X-AI-RateLimit-Remaining-40-cohere: 1150

If any of the limits are reached, the plugin returns an HTTP/1.1 429 status code to the client with the following JSON body:

{ "message": "API rate limit exceeded for provider azure, cohere" }

For each provider, the plugin also indicates how long it will take until the quota is restored:

X-AI-RateLimit-Retry-After-30-azure: 1500
X-AI-RateLimit-Reset-30-azure: 1500

If using the request prompt provider, the plugin will send the query cost:

X-AI-RateLimit-Query-Cost: 100

The Retry-After headers are present on 429 responses to indicate how long the service is expected to be unavailable to the client. When window_type is set to sliding, the RateLimit-Reset and Retry-After values may increase because of the rate calculation for the sliding window.

The headers RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset are based on the Internet-Draft RateLimit Header Fields for HTTP and may change in the future to respect specification updates.

Token count strategies

The plugin supports the following strategies to calculate the number of tokens, configurable using the tokens_count_strategy parameter:

| Strategy | Description |
|---|---|
| total_tokens | The total number of tokens used in the request, including both the prompt and the generated completion. |
| prompt_tokens | The tokens provided by the user as input to the LLM, typically defining the context or task. |
| completion_tokens | The tokens generated by the LLM in response to the prompt, that is, the completed output or continuation of the task. |
| cost (v3.8+) | The financial or computational cost incurred based on the tokens used by the LLM during the request. Using this strategy can help you limit API usage based on the actual costs of processing the request, ensuring that expensive requests (in terms of token usage) are managed more carefully. |

This cost is the sum of the number of prompt tokens multiplied by the cost per prompt token (input cost) and the number of completion tokens multiplied by the cost per completion token (output cost): cost = prompt_tokens * input_cost + completion_tokens * output_cost.

To use this strategy, you must define the config.input_cost and config.output_cost in either the AI Proxy or AI Proxy Advanced plugin.
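
For example, with input_cost set to 10.0 and output_cost set to 30.0, a request that uses 100 prompt tokens and 50 completion tokens has a cost of 100 * 10.0 + 50 * 30.0 = 2500. The sketch below pairs the two plugins; the exact placement of input_cost and output_cost in the AI Proxy configuration is an assumption, so verify it against the AI Proxy reference:

plugins:
  - name: ai-proxy
    config:
      model:
        provider: openai
        name: gpt-4o               # placeholder model
        options:
          input_cost: 10.0         # assumed location for cost per prompt token
          output_cost: 30.0        # assumed location for cost per completion token
  - name: ai-rate-limiting-advanced
    config:
      tokens_count_strategy: cost
      llm_providers:
        - name: openai
          limit: 5000              # maximum cost allowed per window
          window_size: 60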

Request prompt function

You can use a custom function to count the tokens for a request. To configure it, set config.llm_providers.name to requestPrompt and specify the function in config.request_prompt_count_function.

When the request prompt provider is used, the plugin calls this function to get the token count at the request level and applies the limit to it.

See the following example configuration for more detail.
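
The sketch below shows one possible shape for this configuration. The Lua snippet and the exact contract of request_prompt_count_function (assumed here to be a chunk that returns a numeric cost for the current request) are illustrative only; verify them against the configuration reference:

plugins:
  - name: ai-rate-limiting-advanced
    config:
      llm_providers:
        - name: requestPrompt      # generic, provider-agnostic prompt limit
          limit: 1000
          window_size: 60
      request_prompt_count_function: |
        -- Rough sketch: count one unit per whitespace-separated word in the body.
        -- Assumes the snippet runs per request and must return a number.
        local body = kong.request.get_raw_body() or ""
        local count = 0
        for _ in body:gmatch("%S+") do
          count = count + 1
        end
        return count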

Known limitations of AI Rate Limiting Advanced

The token cost reported by the AI Proxy or AI Proxy Advanced plugin is only reflected in the rate limit counters on the next request.

For example, if a request is made and the AI Proxy plugin returns a token cost of 100 for the OpenAI provider:

  • The request is made to the OpenAI provider and the response is returned to the user
  • If the rate limit is reached, the next request will be blocked

Additionally, config.disable_penalty only works for the requestPrompt function.
