When a user sends a prompt, the RAG Injector plugin queries a configured vector database for relevant context and injects that information into the request before passing it to the language model.
- You configure the AI RAG Injector plugin via the Admin API or decK, and ingest the RAG content into the vector database.
- When a request reaches the AI Gateway, the plugin generates embeddings for request prompts, then queries the vector database for the top-k most similar embeddings.
- The plugin injects the content returned by the vector search into the request body and forwards the enriched request to the upstream service.
The following diagram is a simplified overview of how the plugin works; the sections below describe each phase in more detail.
sequenceDiagram
participant User
participant AIGateway as AI Gateway (RAG Injector Plugin)
participant VectorDB as Vector DB (Data Source)
participant Upstream as Upstream Service
User->>AIGateway: Send request with prompt
AIGateway->>VectorDB: Query for similar embeddings
VectorDB-->>AIGateway: Return relevant context
AIGateway->>Upstream: Inject context and forward enriched request
Upstream-->>User: Return response
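To make this flow concrete, here is a minimal Python sketch of the same request-time steps, assuming an OpenAI-style `messages` request body. The hash-based `embed` function and the in-memory index are toy stand-ins used only for illustration; the plugin itself delegates embedding and retrieval to the configured embedding model and vector database.

```python
import math

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy deterministic embedding; a real deployment calls an embedding model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        vec[hash(token) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query_vec: list[float], index: list[tuple[list[float], str]], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), text) for vec, text in index]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

# Content previously ingested into the vector database (in-memory stand-in).
index = [(embed(t), t) for t in [
    "Refunds are processed within 5 business days.",
    "Support is available 24/7 via chat.",
    "Enterprise plans include a dedicated account manager.",
]]

# Incoming chat request; an OpenAI-style body is assumed for illustration.
request_body = {"messages": [{"role": "user", "content": "How long do refunds take?"}]}

# Embed the prompt, retrieve the top-k chunks, and inject them ahead of the
# user message before the enriched request is forwarded upstream.
prompt = request_body["messages"][-1]["content"]
context = "\n".join(top_k(embed(prompt), index))
request_body["messages"].insert(0, {"role": "system", "content": f"Context:\n{context}"})
print(request_body)
```

In production, the similarity search runs inside the vector database itself, which is what keeps retrieval fast as the indexed content grows.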
The RAG workflow consists of two critical phases:
- Data preparation: Processes and embeds unstructured data into a vector index for efficient semantic search.
- Retrieval and generation: Uses similarity search to dynamically assemble contextual prompts that guide the language model’s output.
The data preparation phase sets up the foundation for semantic retrieval by converting raw data into a format that can be indexed and searched efficiently.
Step breakdown:
- A document loader pulls content from various sources, such as PDFs, websites, emails, or internal systems.
- The system breaks the unstructured data into smaller, semantically meaningful chunks to support precise retrieval.
- Each chunk is transformed into a vector embedding (a numeric representation that captures its semantic content).
- These embeddings are saved to a vector database, enabling a fast, similarity-based search during query time (see the sketch after this list).
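As a concrete illustration of these four steps, the sketch below builds a tiny in-memory index. The fixed-size chunking rule, the hash-based `embed` function, and the plain Python list used as a vector store are assumptions standing in for a real document loader, embedding model, and vector database.

```python
import math

def embed(text: str, dims: int = 64) -> list[float]:
    """Toy bag-of-words embedding; a real pipeline calls an embedding model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        vec[hash(token) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(document: str, max_words: int = 40) -> list[str]:
    """Split a document into small, roughly fixed-size chunks."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# 1. Document loader output (normally PDFs, websites, emails, internal systems).
documents = [
    "Refunds are processed within 5 business days. Contact support for exceptions.",
    "Enterprise plans include a dedicated account manager and 24/7 chat support.",
]

# 2-4. Chunk each document, embed each chunk, and store (embedding, text) pairs.
vector_store: list[tuple[list[float], str]] = []
for doc in documents:
    for piece in chunk(doc):
        vector_store.append((embed(piece), piece))

print(f"Indexed {len(vector_store)} chunks")
```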
The retrieval and generation phase runs in real time, taking user input and producing a context-aware response using the indexed data.
Step breakdown:
- The user’s query is converted into an embedding using the same model used during data preparation.
- A semantic similarity search locates the most relevant content chunks in the vector database.
- The system builds a custom prompt by combining the retrieved chunks with the original query.
- The LLM generates a contextually accurate response using both the retrieved context and its own internal knowledge (see the sketch after this list).
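The sketch below mirrors those four steps, continuing from the data preparation example above: it reuses that example's `embed` helper and `vector_store` index so that queries and chunks share the same embedding space. The dot-product similarity search and the prompt template are illustrative choices, and the final LLM call is left as a placeholder rather than tied to a specific client library.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Steps 1-2: embed the query with the same model and find the nearest chunks."""
    q = embed(query)  # same embedding function used during data preparation
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for vec, text in vector_store]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 3: combine the retrieved chunks with the original query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

query = "How long do refunds take?"
prompt = build_prompt(query, retrieve(query))

# Step 4: the assembled prompt would now be sent to the LLM; the call itself is
# omitted here so the sketch is not tied to any particular provider or client.
print(prompt)
```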
The diagram below shows how data flows through both phases of the RAG pipeline, from ingestion and embedding to real-time query handling and response generation:
sequenceDiagram
actor User
participant RawData as Raw Data
participant EmbeddingModel as Embedding Model
participant VectorDB as Vector Database
participant LLM
par Data Preparation Phase
activate RawData
RawData->>EmbeddingModel: Load and chunk documents, generate embeddings
deactivate RawData
activate EmbeddingModel
EmbeddingModel->>VectorDB: Store embeddings
deactivate EmbeddingModel
activate VectorDB
deactivate VectorDB
end
par Retrieval & Generation Phase
activate User
User->>EmbeddingModel: (1) Submit query and generate query embedding
activate EmbeddingModel
EmbeddingModel->>VectorDB: (2) Search vector DB
deactivate EmbeddingModel
activate VectorDB
VectorDB-->>EmbeddingModel: Return relevant chunks
deactivate VectorDB
activate EmbeddingModel
EmbeddingModel->>LLM: (3) Assemble prompt and send
deactivate EmbeddingModel
activate LLM
LLM-->>User: (4) Generate and return response
deactivate LLM
deactivate User
end
Rather than relying on its internal knowledge alone, an LLM paired with a RAG pipeline can look up the information it needs in real time, which significantly reduces hallucinations and improves the accuracy of the AI output.