Retrieval-Augmented Generation (RAG) is a technique that improves the accuracy and relevance of language model responses by enriching prompts with external data at runtime. Instead of relying solely on what the model was trained on, RAG retrieves contextually relevant information—such as documents, support articles, or internal knowledge—from connected data sources like vector databases.
This retrieved context is then automatically injected into the prompt before the model generates a response. RAG is a critical safeguard in specialized or high-stakes applications where factual accuracy matters. LLMs are prone to hallucinations: plausible-sounding but factually incorrect or fabricated responses. RAG helps mitigate this by grounding the model’s output in real, verifiable data.
The following table describes common use cases for RAG by industry:
| Industry | Use case |
|----------|----------|
| Healthcare | RAG can help surface up-to-date clinical guidelines or patient records in a timely manner, critical when treatment decisions depend on the most current information. |
| Legal | Lawyers can use RAG-powered assistants to instantly retrieve relevant case law, legal precedents, or compliance documentation during client consultations. |
| Finance | In fast-moving markets, RAG enables models to deliver financial insights based on current data, avoiding outdated or misleading responses driven by stale training snapshots. |
The AI RAG Injector plugin automates the retrieval and injection of contextual data for RAG pipelines, removing the need for manual prompt engineering or custom retrieval logic. Integrated at the gateway level, it handles embedding generation, vector search, and context injection transparently for each request.
- **Simplifies RAG workflows**: Automatically embeds prompts, queries the vector database, and injects relevant context without custom retrieval logic.
- **Platform-level control**: Shifts RAG logic from application code to infrastructure, allowing platform teams to enforce global policies, update configurations centrally, and reduce developer overhead.
- **Improved security**: Vector database access is limited to the AI Gateway, eliminating the need to expose it to individual dev teams or AI agents.
- **Enables RAG in restricted environments**: Supports RAG even where direct access to the vector database is not possible, such as external-facing or isolated services.
- **Developer productivity**: Developers can focus on building AI features without needing to manage embeddings, similarity search, or context handling.
- **Save LLM costs** (v3.11+): When using the AI RAG Injector plugin with the AI Prompt Compressor plugin, you can wrap specific prompt parts in `<LLMLINGUA>` tags within your template to target only those sections for compression, preserving the rest of the prompt unchanged (see the sketch after this list).
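For example, the sketch below shows a fragment of the plugin configuration whose injection template wraps only the retrieved context in `<LLMLINGUA>` tags, so the compressor targets that section and leaves the user's question untouched. The `inject_template` field name and the `<CONTEXT>`/`<PROMPT>` placeholders are assumptions about the plugin's template syntax; check the plugin's configuration reference for the exact names.

```json
{
  "name": "ai-rag-injector",
  "config": {
    "inject_template": "Answer using the context below.\n<LLMLINGUA>\n<CONTEXT>\n</LLMLINGUA>\nUser question:\n<PROMPT>"
  }
}
```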
When a user sends a prompt, the RAG Injector plugin queries a configured vector database for relevant context and injects that information into the request before passing it to the language model.
1. You configure the AI RAG Injector plugin via the Admin API or decK, setting up the RAG content to send to the vector database.
2. When a request reaches the AI Gateway, the plugin generates embeddings for the request prompt, then queries the vector database for the top-k most similar embeddings.
3. The plugin injects the retrieved content from the vector search result into the request body, and forwards the enriched request to the upstream service.
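As a sketch of what step 1 can look like, the following Admin API call enables the plugin on a service with a Redis-backed vector database. Only `config.vectordb.strategy` and `config.vectordb.redis` are documented on this page; the service name, embedding model fields, dimension, and distance metric values are illustrative assumptions, so check the plugin's configuration reference for the exact parameter names and defaults.

```sh
# Hypothetical example: enable the AI RAG Injector plugin on a service.
# Field names other than config.vectordb.strategy and config.vectordb.redis are assumptions.
curl -X POST http://localhost:8001/services/my-llm-service/plugins \
  --header "Content-Type: application/json" \
  --data '{
    "name": "ai-rag-injector",
    "config": {
      "embeddings": {
        "model": { "provider": "openai", "name": "text-embedding-3-small" }
      },
      "vectordb": {
        "strategy": "redis",
        "redis": { "host": "redis.example.com", "port": 6379 },
        "dimensions": 1536,
        "distance_metric": "cosine"
      }
    }
  }'
```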
The following diagram is a simplified overview of how the plugin works. See the following section for a more detailed description.
```mermaid
sequenceDiagram
    participant User
    participant AIGateway as AI Gateway (RAG Injector Plugin)
    participant VectorDB as Vector DB (Data Source)
    participant Upstream as Upstream Service
    User->>AIGateway: Send request with prompt
    AIGateway->>VectorDB: Query for similar embeddings
    VectorDB-->>AIGateway: Return relevant context
    AIGateway->>Upstream: Inject context and forward enriched request
    Upstream-->>User: Return response
```
The retrieval and generation phase runs in real time, taking user input and producing a context-aware response using the indexed data.
Step breakdown:
1. The user’s query is converted into an embedding using the same model used during data preparation.
2. A semantic similarity search locates the most relevant content chunks in the vector database.
3. The system builds a custom prompt by combining the retrieved chunks with the original query.
4. The LLM generates a contextually accurate response using both the retrieved context and its own internal knowledge.
The diagram below shows how data flows through both phases of the RAG pipeline, from ingestion and embedding to real-time query handling and response generation:
```mermaid
sequenceDiagram
    autonumber
    actor User
    participant RawData as Raw Data
    participant EmbeddingModel as Embedding Model
    participant VectorDB as Vector Database
    participant LLM
    par Data Preparation Phase
        activate RawData
        RawData->>EmbeddingModel: Load and chunk documents, generate embeddings
        deactivate RawData
        activate EmbeddingModel
        EmbeddingModel->>VectorDB: Store embeddings
        deactivate EmbeddingModel
        activate VectorDB
        deactivate VectorDB
    end
    par Retrieval & Generation Phase
        activate User
        User->>EmbeddingModel: (1) Submit query and generate query embedding
        activate EmbeddingModel
        EmbeddingModel->>VectorDB: (2) Search vector DB
        deactivate EmbeddingModel
        activate VectorDB
        VectorDB-->>EmbeddingModel: Return relevant chunks
        deactivate VectorDB
        activate EmbeddingModel
        EmbeddingModel->>LLM: (3) Assemble prompt and send
        deactivate EmbeddingModel
        activate LLM
        LLM-->>User: (4) Generate and return response
        deactivate LLM
        deactivate User
    end
```
Rather than guessing from memory, the LLM paired with the RAG pipeline now has the ability to look up the information it needs in real time, which reduces hallucinations and increases the accuracy of the AI output.
A vector database stores vector embeddings, which are numerical representations of data items. For example, an ingested document chunk is converted into a numerical representation and stored in the vector database so that the embeddings of new requests can be compared against the stored vectors to find the most relevant content.
The AI RAG Injector plugin supports the following vector databases:
- Redis: set `config.vectordb.strategy: redis` and configure the connection parameters in `config.vectordb.redis`.
Once you’ve configured your vector database and ingested content, you can control which Consumers access specific knowledge base articles and refine query results using metadata filters.
A collection is a logical grouping of knowledge base articles with independent access control rules. When you ingest content via the Admin API, assign it to a collection using the collection field in the metadata.
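For example, an ingestion call might look like the following sketch. Only the `collection` metadata field comes from this page; the `/ai-rag-injector/{plugin_id}/ingest_chunk` path, the `content` field, and the extra `jurisdiction` metadata key are assumptions about the plugin's ingestion Admin API, so check the plugin reference for the exact endpoint and body shape.

```sh
# Hypothetical example: ingest an article into the "legal-docs" collection.
# The endpoint path and body fields other than metadata.collection are assumptions.
curl -X POST http://localhost:8001/ai-rag-injector/{plugin_id}/ingest_chunk \
  --header "Content-Type: application/json" \
  --data '{
    "content": "Contract termination requires 30 days of written notice...",
    "metadata": {
      "collection": "legal-docs",
      "jurisdiction": "US"
    }
  }'
```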
Collections that define their own ACL in `collection_acl_config` ignore `global_acl_config` entirely; they must explicitly list all allowed subjects.
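As an illustration of that precedence rule, a minimal sketch of the ACL portion of the plugin configuration might look like the following. Only the `global_acl_config` and `collection_acl_config` names and the notion of allowed subjects come from this page; the nesting and the `allowed_subjects` and `collection` keys are assumptions, so check the how-to guide for the real schema.

```json
{
  "global_acl_config": { "allowed_subjects": ["team-support", "team-engineering"] },
  "collection_acl_config": [
    { "collection": "legal-docs", "allowed_subjects": ["team-legal"] }
  ]
}
```

In this sketch, the `legal-docs` collection ignores the global list entirely, so `team-support` and `team-engineering` cannot query it unless they are added to its own list.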
Check the how-to guide for details about how ACLs work in the AI RAG Injector plugin.
LLM clients can refine search results by specifying filter criteria in the query request. Filters apply within collections. The AI RAG Injector plugin uses a Bedrock-compatible filter grammar with the following operators (see the example after this list):
- `equals`: Exact match
- `greaterThan`: Greater than (>)
- `greaterThanOrEquals`: Greater than or equal to (>=)
- `lessThan`: Less than (<)
- `lessThanOrEquals`: Less than or equal to (<=)
- `in`: Match any value in an array
- `andAll`: Combine multiple filter clauses
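For example, a filter that combines two clauses with `andAll` might look like the sketch below. The clause structure follows the Bedrock-style `{"operator": {"key": ..., "value": ...}}` grammar; the `status` and `published_year` metadata keys are made-up examples.

```json
{
  "andAll": [
    { "equals": { "key": "status", "value": "published" } },
    { "greaterThanOrEquals": { "key": "published_year", "value": 2023 } }
  ]
}
```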
Review the how-to guide for details about how metadata filtering works.
| Parameter | Description |
|-----------|-------------|
| `filter` | JSON object with filter clauses using the grammar above. |
| `filter_mode` | Controls how chunks with no metadata are handled: `"compatible"` includes chunks that match the filter or have no metadata; `"strict"` includes only chunks that match the filter. |
| `stop_on_filter_error` | Whether to fail the query on a filter parse error (default: `false`). |
You can include filters in the `ai_rag_injector` parameter of your request.
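For example, a chat request that restricts retrieval to published chunks could look like the sketch below. The exact nesting of the `ai_rag_injector` object alongside the chat payload, and the `status` metadata key, are assumptions; check the how-to guide for the precise request shape.

```json
{
  "messages": [
    { "role": "user", "content": "What is our contract termination policy?" }
  ],
  "ai_rag_injector": {
    "filter": { "equals": { "key": "status", "value": "published" } },
    "filter_mode": "strict"
  }
}
```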
Starting in Kong Gateway 3.13, you can authenticate with a cloud Redis provider for your Redis strategy. This allows you to seamlessly rotate credentials without relying on static passwords.
The following providers are supported:
- AWS ElastiCache
- Azure Managed Redis
- Google Cloud Memorystore (with or without Valkey)
Each provider supports both instance and cluster configurations.
Important: Kong Gateway open source plugins do not support any Redis cloud provider cluster configurations.
To configure cloud authentication with Redis, add the provider-specific authentication parameters to your plugin configuration.
You need:
- A running AWS ElastiCache instance: either ElastiCache for Valkey 7.2 or later, or ElastiCache for Redis OSS 7.0 or later
The embedding dimension you use depends on your model and use case. More dimensions can improve accuracy but increase storage and compute cost. For example, 1536 is a balanced default: it is the native dimension of OpenAI's text-embedding-3-small model, while text-embedding-3-large produces 3072-dimensional embeddings by default.
```
failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)
```
This means that the MemoryDB limit on the number of indexes per instance has been reached.
To resolve this, create additional MemoryDB instances to handle multiple AI RAG Injector plugin instances.
No. The AI RAG Injector plugin does not support GCP Memorystore Redis clusters: the Redis JSON module required for vector operations is not available in GCP’s managed Redis service.
Attempting to ingest chunks with GCP Redis results in the following error: