AI LLM as Judge

Made by: Kong Inc.
License: AI License Required
Supported gateway topologies: hybrid, db-less, traditional
Supported Konnect deployments: hybrid, cloud-gateways, serverless
Compatible protocols: grpc, grpcs, http, https
Minimum version: Kong Gateway 3.12
Tags: #ai

AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI LLM as Judge plugin enables automated evaluation of prompt-response pairs using a dedicated LLM. The plugin assigns each LLM response a numerical score from 1 to 100, where:

  • 1: Completely incorrect or irrelevant response
  • 100: Perfect or ideal response

This plugin is part of the AI plugin suite, making it easy to integrate LLM-based evaluation workflows into your API pipelines.

Prerequisites

This plugin requires the AI Proxy Advanced plugin with config.balancer.tokens_count_strategy set to llm-accuracy. The balancer compares responses from at least two LLM models. When you enable AI LLM as Judge on a service or route, it evaluates all LLM requests handled by that service or route.
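For example, a minimal AI Proxy Advanced configuration satisfying this prerequisite might look like the sketch below. The providers, model names, and auth values are placeholders, and the target layout is an assumption; only config.balancer.tokens_count_strategy set to llm-accuracy is required by this plugin. Check the AI Proxy Advanced configuration reference for the exact schema.

```yaml
# Hypothetical sketch: AI Proxy Advanced balancing two candidate models whose
# responses the judge will score. Providers, model names, and auth values are
# placeholders; only tokens_count_strategy: llm-accuracy is required here.
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        tokens_count_strategy: llm-accuracy
      targets:
        - model:
            provider: openai
            name: gpt-4o
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
        - model:
            provider: anthropic
            name: claude-3-5-sonnet
          auth:
            header_name: Authorization
            header_value: Bearer <ANTHROPIC_API_KEY>
```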

Features

The AI LLM as Judge plugin offers several configurable features that control how the LLM evaluates prompts and responses:

Feature | Description
------- | -----------
Configurable system prompt | Instructs the LLM to act as a strict evaluator.
Numerical scoring | Assigns a score from 1–100 to assess response quality.
History depth | Includes previous chat messages for context when scoring.
Ignore prompts | Options to ignore system, assistant, or tool prompts.
Sampling rate | Controls probabilistic request volume for judging.
Native LLM schema | Leverages Kong’s LLM schema for seamless integration.
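A plugin configuration exercising these features might resemble the sketch below. The field names are assumptions inferred from the feature list, not a verbatim schema; verify them against the plugin's configuration reference before use.

```yaml
# Hypothetical sketch of an AI LLM as Judge configuration. Field names are
# assumptions inferred from the feature list above; verify them against the
# plugin's configuration reference.
plugins:
  - name: ai-llm-as-judge
    config:
      prompt: >-
        You are a strict evaluator. Score the assistant's response to the
        user's prompt from 1 (completely incorrect or irrelevant) to 100
        (perfect or ideal). Reply with the number only.
      sampling_rate: 0.5          # assumed 0-1 probability of judging a request
      message_countback: 3        # history depth: prior messages given as context
      ignore_system_prompts: true # one of the assumed "ignore" options
      llm:                        # judge model, using Kong's native LLM schema
        model:
          provider: openai
          name: gpt-4o
        auth:
          header_name: Authorization
          header_value: Bearer <OPENAI_API_KEY>
```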

How it works

  1. The plugin sends the user prompt and response to the configured LLM as a judge.
  2. The LLM evaluates the response and returns a numeric score from 1 (completely incorrect or irrelevant) to 100 (ideal).
  3. This score can be used in downstream workflows, such as automated grading, feedback systems, or learning pipelines.

The following sequence diagram illustrates this simplified flow:

 
```mermaid
sequenceDiagram
    actor Client
    participant AIP as AI Proxy Advanced
    participant LLM as LLM Model (A or B)
    participant Judge as AI LLM as Judge
    participant JudgeLLM as Judge LLM

    Client->>AIP: Send prompt
    AIP->>LLM: Forward prompt (balancer selects model)
    LLM-->>AIP: Response
    AIP->>Judge: Prompt + response
    Judge->>JudgeLLM: Evaluate response
    JudgeLLM-->>Judge: Score (1–100)
    Judge-->>AIP: Evaluation result
    AIP-->>Client: Response
```
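Concretely, the evaluation the plugin submits to the judge resembles an ordinary chat completion. The sketch below is illustrative only; the message wording is an assumption, not the plugin's built-in evaluator prompt.

```yaml
# Hypothetical judge exchange, shown as YAML for readability. The evaluator
# wording is illustrative, not the plugin's actual system prompt.
messages:
  - role: system
    content: >-
      You are a strict evaluator. Score the assistant's response to the
      user's prompt from 1 (completely incorrect or irrelevant) to 100
      (perfect or ideal). Reply with the number only.
  - role: user
    content: What is the capital of France?
  - role: assistant
    content: The capital of France is Paris.
# Expected judge reply: a single number such as "100"
```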

To ensure concise, consistent scoring, configure the LLM that acts as the judge with these values:

Setting | Recommended value | Description
------- | ----------------- | -----------
temperature | 2 | Controls randomness. A lower value leads to a more deterministic output.
max_tokens | 5 | Maximum tokens for the LLM response.
top_p | 1 | Nucleus sampling probability; limits token selection.

These settings produce short, precise numeric scores without extra text or verbosity.
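Within the judge's LLM configuration, these recommendations map to model options along the following lines. This is a sketch: the options block follows Kong's common LLM schema, and the surrounding structure is an assumption.

```yaml
# Hypothetical fragment: judge model options tuned for terse numeric output.
llm:
  model:
    options:
      temperature: 2   # recommended value from the table above
      max_tokens: 5    # enough room for a bare score such as "87"
      top_p: 1
```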

Known issues

  • The LLM as judge approach can lead to preference leakage issues when the same family of models is used as both the judge and the source.
  • Scores generated by the judge LLM require alignment with human preferences and should not be treated as ground truth.