AI LLM as Judge

Made by: Kong Inc.
License: AI License Required
Supported gateway topologies: hybrid, db-less, traditional
Supported Konnect deployments: hybrid, cloud-gateways, serverless
Compatible protocols: grpc, grpcs, http, https
Minimum version: Kong Gateway 3.12
Tags: #ai

AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.

The AI LLM as Judge plugin enables automated evaluation of prompt-response pairs using a dedicated LLM. The plugin assigns each LLM response a numerical score from 1 to 100, where:

  • 1: Completely incorrect or irrelevant response
  • 100: Perfect or ideal response

This plugin is part of the AI plugin suite, making it easy to integrate LLM-based evaluation workflows into your API pipelines.

Prerequisites

This plugin requires the AI Proxy Advanced plugin with config.balancer.tokens_count_strategy set to llm-accuracy. The balancer compares responses from at least two LLM models. When you enable AI LLM as Judge on a service or route, it evaluates all LLM requests handled by that service or route.
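For example, a minimal AI Proxy Advanced configuration satisfying this prerequisite might look like the sketch below. The providers, model names, and auth values are placeholders, and the target layout is an assumption; only config.balancer.tokens_count_strategy set to llm-accuracy is required by this plugin. Check the AI Proxy Advanced configuration reference for the exact schema.

```yaml
# Hypothetical sketch: AI Proxy Advanced balancing two candidate models whose
# responses the judge will score. Providers, model names, and auth values are
# placeholders; only tokens_count_strategy: llm-accuracy is required here.
plugins:
  - name: ai-proxy-advanced
    config:
      balancer:
        tokens_count_strategy: llm-accuracy
      targets:
        - model:
            provider: openai
            name: gpt-4o
          auth:
            header_name: Authorization
            header_value: Bearer <OPENAI_API_KEY>
        - model:
            provider: anthropic
            name: claude-3-5-sonnet
          auth:
            header_name: Authorization
            header_value: Bearer <ANTHROPIC_API_KEY>
```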

Features

The AI LLM as Judge plugin offers several configurable features that control how the LLM evaluates prompts and responses:

Feature | Description
------- | -----------
Configurable system prompt | Instructs the LLM to act as a strict evaluator.
Numerical scoring | Assigns a score from 1–100 to assess response quality.
History depth | Includes previous chat messages for context when scoring.
Ignore prompts | Options to ignore system, assistant, or tool prompts.
Sampling rate | Controls probabilistic request volume for judging.
Native LLM schema | Leverages Kong’s LLM schema for seamless integration.
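A plugin configuration exercising these features might resemble the sketch below. The field names are assumptions inferred from the feature list, not a verbatim schema; verify them against the plugin's configuration reference before use.

```yaml
# Hypothetical sketch of an AI LLM as Judge configuration. Field names are
# assumptions inferred from the feature list above; verify them against the
# plugin's configuration reference.
plugins:
  - name: ai-llm-as-judge
    config:
      prompt: >-
        You are a strict evaluator. Score the assistant's response to the
        user's prompt from 1 (completely incorrect or irrelevant) to 100
        (perfect or ideal). Reply with the number only.
      sampling_rate: 0.5          # assumed 0-1 probability of judging a request
      message_countback: 3        # history depth: prior messages given as context
      ignore_system_prompts: true # one of the assumed "ignore" options
      llm:                        # judge model, using Kong's native LLM schema
        model:
          provider: openai
          name: gpt-4o
        auth:
          header_name: Authorization
          header_value: Bearer <OPENAI_API_KEY>
```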

How it works

  1. The plugin sends the user prompt and response to the configured LLM as a judge.
  2. The LLM evaluates the response and returns a numeric score from 1 (completely incorrect or irrelevant) to 100 (ideal).
  3. This score can be used in downstream workflows, such as automated grading, feedback systems, or learning pipelines.

The following sequence diagram illustrates this simplified flow:

 
```mermaid
sequenceDiagram
    actor Client
    participant AIP as AI Proxy Advanced
    participant LLM as LLM Model (A or B)
    participant Judge as AI LLM as Judge
    participant JudgeLLM as Judge LLM

    Client->>AIP: Send prompt
    AIP->>LLM: Forward prompt (balancer selects model)
    LLM-->>AIP: Response
    AIP->>Judge: Prompt + response
    Judge->>JudgeLLM: Evaluate response
    JudgeLLM-->>Judge: Score (1–100)
    Judge-->>AIP: Evaluation result
    AIP-->>Client: Response
```
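Concretely, the evaluation the plugin submits to the judge resembles an ordinary chat completion. The sketch below is illustrative only; the message wording is an assumption, not the plugin's built-in evaluator prompt.

```yaml
# Hypothetical judge exchange, shown as YAML for readability. The evaluator
# wording is illustrative, not the plugin's actual system prompt.
messages:
  - role: system
    content: >-
      You are a strict evaluator. Score the assistant's response to the
      user's prompt from 1 (completely incorrect or irrelevant) to 100
      (perfect or ideal). Reply with the number only.
  - role: user
    content: What is the capital of France?
  - role: assistant
    content: The capital of France is Paris.
# Expected judge reply: a single number such as "100"
```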

To ensure concise, consistent scoring, configure the LLM that acts as the judge with these values:

Setting | Recommended value | Description
------- | ----------------- | -----------
temperature | 2 | Controls randomness. A lower value leads to a more deterministic output.
max_tokens | 5 | Maximum tokens for the LLM response.
top_p | 1 | Nucleus sampling probability; limits token selection.

These settings produce short, precise numeric scores without extra text or verbosity.
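Within the judge's LLM configuration, these recommendations map to model options along the following lines. This is a sketch: the options block follows Kong's common LLM schema, and the surrounding structure is an assumption.

```yaml
# Hypothetical fragment: judge model options tuned for terse numeric output.
llm:
  model:
    options:
      temperature: 2   # recommended value from the table above
      max_tokens: 5    # enough room for a bare score such as "87"
      top_p: 1
```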

Known issues

  • The LLM as judge approach can lead to preference leakage issues when the same family of models is used as both the judge and the source.
  • Scores generated by the judge LLM require alignment with human preferences and should not be treated as ground truth.