Monitor AI LLM metrics

Kong AI Gateway calls LLM-based services according to the settings of the AI Proxy and AI Proxy Advanced plugins. You can aggregate the LLM provider responses to count the number of tokens used by the AI plugins. If you have defined input and output costs for the models, you can also aggregate costs. The metrics also show whether requests were served from the Kong Gateway cache, which saves the cost of contacting the LLM providers and improves performance.

v3.12+ In addition to LLM usage, Kong AI Gateway also tracks MCP server traffic. MCP metrics provide visibility into latency, response sizes, and error rates when AI plugins invoke external MCP tools and servers.

Kong AI Gateway exposes metrics related to Kong and proxied upstream services in Prometheus exposition format, which can be scraped by a Prometheus server.

The metrics are available on both the Admin API and the Status API at the http://{host}:{port}/metrics endpoint. Note that the URL to those APIs is specific to your installation. See Accessing the metrics for more information.

The Prometheus plugin records and exposes metrics at the node level. Your Prometheus server will need to discover all Kong nodes via a service discovery mechanism, and consume data from each node’s configured /metrics endpoint.
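Because metrics are recorded per node, a cluster-wide figure is the sum of the same series across every node. A Prometheus server would normally do this at query time; the Python sketch below only illustrates the idea on raw exposition text. The two node outputs are hypothetical and abbreviated, and the parser is a simplification that does not handle escaped quotes inside label values.

```python
import re
from collections import defaultdict

# Matches one labeled sample line such as: name{label="value",...} 123
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for every labeled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels)), float(value)


def sum_across_nodes(scrapes, metric, group_by):
    """Sum one counter across several node scrapes, grouped by the given labels."""
    totals = defaultdict(float)
    for text in scrapes:
        for name, labels, value in parse_samples(text):
            if name == metric:
                key = tuple(labels.get(label, "") for label in group_by)
                totals[key] += value
    return dict(totals)


# Hypothetical, abbreviated /metrics output from two Kong nodes.
NODE_A = 'ai_llm_requests_total{ai_provider="provider1",ai_model="model1"} 100'
NODE_B = 'ai_llm_requests_total{ai_provider="provider1",ai_model="model1"} 40'

totals = sum_across_nodes(
    [NODE_A, NODE_B], "ai_llm_requests_total", ["ai_provider", "ai_model"]
)
print(totals)
```

Grouping by an explicit label list keeps the result independent of which other labels each node happens to emit.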

AI metrics exported by the plugin can be graphed in Grafana using a drop-in dashboard.

Available metrics

The following sections describe the AI metrics that are available.

LLM traffic metrics

When the config.ai_metrics parameter is set to true in the Prometheus plugin, you can get the following AI LLM metrics:

  • AI requests: AI requests sent to LLM providers.
  • AI cost: AI cost charged by LLM providers.
  • AI tokens: AI tokens counted by LLM providers.
  • AI LLM latency: v3.8+ Time taken by LLM providers to return a response.
  • AI cache fetch latency: v3.8+ Time taken to return a response from the cache.
  • AI cache embeddings latency: v3.8+ Time taken to generate embeddings for the cache.

These metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and Workspace. The AI Tokens metrics are also available per token type.

Note: Starting with v3.11, AI metrics include the consumer label. This lets you attribute AI usage and token counts to individual Consumers, which helps you measure cost, performance, and client-specific behavior.

Starting with v3.12, AI metrics (except kong_ai_llm_tokens_total) include the request_mode label. This label shows how the request was processed:

  • oneshot: A single response was returned.
  • stream: The response was delivered as a stream of tokens.
  • realtime: The request was handled as a real-time session.
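Since the request counter is split by cache status, a cache hit ratio can be derived from it. The following Python sketch shows one way to do that from raw exposition text. It assumes that the cache label is named cache_status, that cached responses carry the value "hit", and that every other value (including an empty one) means the request went to the LLM provider. The sample data is hypothetical and uses an abbreviated label set.

```python
import re

LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def cache_hit_ratio(text, metric="ai_llm_requests_total"):
    """Return cache hits divided by total requests, or None without requests."""
    hits = 0.0
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(metric + "{"):
            continue  # ignore comments and other metrics
        raw_labels, _, value = line[len(metric) + 1:].rpartition("} ")
        labels = dict(LABEL_RE.findall(raw_labels))
        count = float(value)
        total += count
        if labels.get("cache_status") == "hit":
            hits += count
    return hits / total if total else None


# Hypothetical sample: 100 requests answered from the cache, 300 sent upstream.
SAMPLE = """
ai_llm_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit"} 100
ai_llm_requests_total{ai_provider="provider1",ai_model="model1",cache_status=""} 300
"""

ratio = cache_hit_ratio(SAMPLE)
print(ratio)
```

Counters only ever increase, so a ratio computed this way covers the whole lifetime of the node; for a recent window, compare two scrapes and use the differences instead.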

MCP traffic metrics v3.12+

When the config.ai_metrics parameter is set to true, the following MCP-specific metrics are also available:

  • MCP response body size: Histogram of response body sizes (in bytes) returned by MCP servers.
  • MCP latency: Histogram of request latencies (in milliseconds) for MCP server calls.
  • MCP error total: Counter of total MCP server errors, labeled by error type.

These metrics are labeled with service, route, method, workspace, and tool_name. The MCP error total metric also includes the type label.

Overview

AI metrics are disabled by default because they can create high-cardinality metrics and may cause performance issues. To enable them, set the config.ai_metrics parameter to true in the Prometheus plugin configuration.
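Enabling AI metrics comes down to setting config.ai_metrics to true on the Prometheus plugin. As a minimal sketch, the Python snippet below builds a request that would create the plugin globally through the Admin API. It assumes the Admin API is reachable at http://localhost:8001 and that no Prometheus plugin instance exists yet; both are assumptions to adapt to your installation. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: the Admin API listens on localhost:8001; adjust for your installation.
ADMIN_API_URL = "http://localhost:8001"


def build_prometheus_plugin_payload():
    """Return the request body that enables AI metrics in the Prometheus plugin."""
    return {"name": "prometheus", "config": {"ai_metrics": True}}


def build_enable_request(admin_url=ADMIN_API_URL):
    """Build (but do not send) a POST /plugins request for the Admin API."""
    body = json.dumps(build_prometheus_plugin_payload()).encode("utf-8")
    return urllib.request.Request(
        url=f"{admin_url}/plugins",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_enable_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To apply the change, pass the request to urllib.request.urlopen(). If the Prometheus plugin is already configured, update the existing plugin instance rather than creating a second one.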

LLM traffic metrics overview

Here is an example of output you could expect from the /metrics endpoint for LLM traffic:

# HELP ai_llm_requests_total AI requests total per ai_provider in Kong
# TYPE ai_llm_requests_total counter
ai_llm_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",request_mode="oneshot",workspace="workspace1",consumer="consumer1"} 100

# HELP ai_llm_cost_total AI requests cost per ai_provider/cache in Kong
# TYPE ai_llm_cost_total counter
ai_llm_cost_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",request_mode="oneshot",workspace="workspace1",consumer="consumer1"} 50

# HELP ai_llm_provider_latency_ms AI latencies per ai_provider in Kong
# TYPE ai_llm_provider_latency_ms histogram
ai_llm_provider_latency_ms_bucket{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",request_mode="oneshot",workspace="workspace1",le="+Inf",consumer="consumer1"} 2

# HELP ai_llm_tokens_total AI tokens total per ai_provider/cache in Kong
# TYPE ai_llm_tokens_total counter
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="prompt_tokens",workspace="workspace1",consumer="consumer1"} 1000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="completion_tokens",workspace="workspace1",consumer="consumer1"} 2000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="total_tokens",workspace="workspace1",consumer="consumer1"} 3000

# HELP ai_cache_fetch_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_fetch_latency histogram
ai_cache_fetch_latency_bucket{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",request_mode="oneshot",workspace="workspace1",le="+Inf",consumer="consumer1"} 2

# HELP ai_cache_embeddings_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_embeddings_latency histogram
ai_cache_embeddings_latency_bucket{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",request_mode="oneshot",workspace="workspace1",le="+Inf",consumer="consumer1"} 2

# HELP ai_llm_provider_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_llm_provider_latency histogram
ai_llm_provider_latency_bucket{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",request_mode="oneshot",workspace="workspace1",le="+Inf",consumer="consumer1"} 2

Note: If you don’t use any cache plugins, then cache_status, vector_db, embeddings_provider, and embeddings_model values will be empty.

To expose the ai_llm_cost_total metric, you must define the model.options.input_cost and model.options.output_cost parameters. See the AI Proxy and AI Proxy Advanced configuration references for more details.
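The token counter lends itself to per-Consumer usage reports. The Python sketch below sums ai_llm_tokens_total per consumer and token type from exposition text. It is an illustration under stated assumptions: the sample lines are hypothetical, use an abbreviated label set, and the label parser does not handle escaped quotes.

```python
import re
from collections import defaultdict

LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def tokens_by_consumer(text, metric="ai_llm_tokens_total"):
    """Sum token counters per (consumer, token_type) from exposition text."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(metric + "{"):
            continue  # ignore comments and other metrics
        raw_labels, _, value = line[len(metric) + 1:].rpartition("} ")
        labels = dict(LABEL_RE.findall(raw_labels))
        key = (labels.get("consumer", ""), labels.get("token_type", ""))
        totals[key] += float(value)
    return dict(totals)


# Hypothetical sample covering two providers and two consumers.
SAMPLE = """
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",token_type="prompt_tokens",consumer="consumer1"} 1000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",token_type="completion_tokens",consumer="consumer1"} 2000
ai_llm_tokens_total{ai_provider="provider2",ai_model="model2",token_type="prompt_tokens",consumer="consumer1"} 500
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",token_type="prompt_tokens",consumer="consumer2"} 250
"""

usage = tokens_by_consumer(SAMPLE)
for (consumer, token_type), count in sorted(usage.items()):
    print(consumer, token_type, count)
```

The result is keyed by token type on purpose: the example output above includes a total_tokens series next to prompt_tokens and completion_tokens, so adding all token types together would count the same tokens twice.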

MCP traffic metrics overview

Here is an example of output you could expect from the /metrics endpoint for MCP traffic:

# HELP kong_ai_mcp_response_body_size_bytes MCP server response body sizes in bytes
# TYPE kong_ai_mcp_response_body_size_bytes histogram
kong_ai_mcp_response_body_size_bytes_bucket{service="svc1",route="route1",method="tools/call",workspace="workspace1",tool_name="tool1",le="+Inf"} 1

# HELP kong_ai_mcp_latency_ms MCP server latencies in milliseconds
# TYPE kong_ai_mcp_latency_ms histogram
kong_ai_mcp_latency_ms_bucket{service="svc1",route="route1",method="tools/call",workspace="workspace1",tool_name="tool1",le="+Inf"} 1

# HELP kong_ai_mcp_error_total Total MCP server errors by type
# TYPE kong_ai_mcp_error_total counter
kong_ai_mcp_error_total{service="svc1",route="route1",type="Invalid Request",method="tools/call",workspace="workspace1",tool_name=""} 3
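Histogram metrics such as kong_ai_mcp_latency_ms expose cumulative bucket counts rather than individual observations, so percentiles have to be estimated. The Python sketch below shows the usual linear-interpolation approach, similar in spirit to PromQL's histogram_quantile(). The bucket boundaries and counts are hypothetical; the example output above only shows the +Inf bucket, and the real boundaries depend on your installation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and must
    include the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite boundary instead of infinity.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Interpolate linearly inside the matching bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets for MCP latency, in milliseconds.
LATENCY_BUCKETS = [(50, 10), (100, 60), (250, 90), (math.inf, 100)]

p50 = estimate_quantile(0.5, LATENCY_BUCKETS)
p95 = estimate_quantile(0.95, LATENCY_BUCKETS)
print(p50, p95)
```

The estimate is only as precise as the bucket layout: a quantile that lands in the +Inf bucket can only be reported as the highest finite boundary.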

Accessing the metrics

In most configurations, the Kong Admin API sits behind a firewall or requires authentication. Use one of the following options to give Prometheus access to the /metrics endpoint:

  • If the Status API is enabled with the status_listen parameter in the Kong Gateway configuration, then its /metrics endpoint can be used. This is the preferred method, and this is also the only method compatible with Konnect, since Data Planes can’t use the Admin API.

  • The /metrics endpoint is also available on the Admin API, which can be used if the Status API is not enabled. Note that this endpoint is unavailable when RBAC is enabled on the Admin API, as Prometheus doesn’t support key authentication to pass the RBAC token.
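For a quick manual check of what a node exposes, the endpoint can also be read directly. The Python sketch below builds the /metrics URL, fetches it, and keeps only the AI-related sample lines. It assumes the Status API is listening on 127.0.0.1:8007; the actual host and port come from the status_listen setting of your installation. The prefixes used for filtering are taken from the example output on this page.

```python
import urllib.request

# Assumption: status_listen exposes the Status API on 127.0.0.1:8007.
STATUS_HOST = "127.0.0.1"
STATUS_PORT = 8007


def metrics_url(host=STATUS_HOST, port=STATUS_PORT):
    """Build the endpoint URL in the http://{host}:{port}/metrics form."""
    return f"http://{host}:{port}/metrics"


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw Prometheus exposition text from a Kong node."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def ai_metric_lines(text):
    """Keep only sample lines for AI metrics (LLM, cache, and MCP series)."""
    prefixes = ("ai_llm_", "ai_cache_", "kong_ai_")
    return [line for line in text.splitlines() if line.startswith(prefixes)]


print(metrics_url())
# Against a running node:
#   print("\n".join(ai_metric_lines(fetch_metrics(metrics_url()))))
```

This is meant for troubleshooting only; for continuous monitoring, let the Prometheus server scrape the endpoint.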
