What Is RAG (Retrieval-Augmented Generation)?
RAG (Retrieval-Augmented Generation) is an architecture that augments LLM generation with context retrieved from external knowledge stores at query time. Instead of relying only on training data, the model searches a document collection, retrieves relevant passages, and uses them to produce grounded, up-to-date responses.
How RAG Works
A RAG pipeline has three stages. Ingestion: documents are chunked into passages and converted to vector embeddings using an embedding model. The embeddings are stored in a vector database alongside the original text. Retrieval: when a query arrives, it is embedded using the same model, and the vector database performs nearest-neighbor search to find the most semantically relevant passages. Generation: the retrieved passages are injected into the LLM's prompt as context, and the model generates a response grounded in the retrieved information.
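The three stages can be sketched end-to-end in a few dozen lines. This is a toy illustration only: the "embedding model" here is a word-hashing trick standing in for a real trained model, and the store is a plain Python list rather than a vector database.

```python
import hashlib
import math

def embed(text, dims=64):
    """Toy embedding: hash each word into a fixed-size vector.
    A real pipeline would use a trained embedding model instead."""
    vec = [0.0] * dims
    for word in text.lower().split():
        word = word.strip(".,?!:;")
        if not word:
            continue
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Ingestion: convert passages to embeddings, stored alongside the text.
passages = [
    "NixOS uses declarative configuration files.",
    "Vector databases support nearest-neighbor search.",
    "RAG grounds generation in retrieved documents.",
]
store = [(embed(p), p) for p in passages]

# Retrieval: embed the query with the same model, rank by similarity.
query = "How does nearest-neighbor search work in vector databases?"
q_vec = embed(query)
ranked = sorted(store, key=lambda item: cosine(q_vec, item[0]), reverse=True)
context = ranked[0][1]

# Generation: inject the retrieved passage into the LLM's prompt.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The key invariant is that ingestion and retrieval use the same embedding function, so query and passages live in the same vector space.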
This architecture addresses two fundamental LLM limitations: knowledge cutoff (the model only knows what it was trained on) and hallucination (the model can generate plausible but incorrect information). By grounding generation in retrieved documents, RAG produces responses that are both current and verifiable against their sources.
Vector Stores and Embedding Models
The vector store is the core data structure in a RAG pipeline. It stores high-dimensional vectors (typically 768 to 3072 dimensions) and supports efficient approximate nearest-neighbor search. When a query vector arrives, the store returns the k most similar document vectors, which correspond to the most relevant passages.
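A minimal in-memory version of this interface makes the top-k behavior concrete. This sketch does exact search over normalized vectors; production stores such as pgvector, Chroma, and Qdrant use approximate indexes (e.g. HNSW) to keep search fast at scale.

```python
import heapq
import math

class InMemoryVectorStore:
    """Minimal exact-search vector store: add vectors, query top-k
    by cosine similarity. Illustrative only, not production code."""

    def __init__(self):
        self._items = []  # list of (normalized vector, payload) pairs

    def add(self, vector, payload):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self._items.append(([x / norm for x in vector], payload))

    def search(self, query, k=3):
        """Return the k most similar (score, payload) pairs."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = ((sum(a * b for a, b in zip(vec, q)), payload)
                  for vec, payload in self._items)
        return heapq.nlargest(k, scored, key=lambda s: s[0])

store = InMemoryVectorStore()
store.add([1.0, 0.0, 0.0], "passage about NixOS")
store.add([0.0, 1.0, 0.0], "passage about vector search")
store.add([0.9, 0.1, 0.0], "passage about declarative config")
top = store.search([1.0, 0.05, 0.0], k=2)
```

Normalizing at insert time means each query costs one dot product per stored vector; the approximate indexes in real databases avoid even that linear scan.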
Common vector databases include pgvector (a PostgreSQL extension, ideal for teams already using PostgreSQL), Chroma (lightweight, embedded), and Qdrant (purpose-built, high-performance). All of these can run on osModa servers since every plan provides full root SSH access on NixOS. The declarative NixOS configuration keeps your vector database setup reproducible and lets you roll back atomically if a configuration change causes issues.
Running RAG Pipelines on osModa
osModa provides dedicated infrastructure for persistent RAG workloads. Unlike serverless platforms where cold starts and execution limits make RAG pipelines unreliable, osModa servers run continuously with the osmoda-watch daemon supervising all processes.
Storage Tools
osModa's 83-tool catalog includes storage tools for persistent key-value stores and structured data management. These tools support agent memory patterns where context from previous interactions is stored and retrieved for future use.
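The agent memory pattern itself is straightforward to illustrate. The sketch below is not osModa's actual storage tool, just a minimal stand-in built on SQLite to show the idea: values written in one interaction persist on disk and can be recalled in a later one.

```python
import json
import sqlite3

class AgentMemory:
    """Sketch of persistent agent memory: a key-value store backed by
    SQLite so remembered context survives process restarts."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" for real persistence.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)")

    def remember(self, key, value):
        # JSON-encode so structured values round-trip cleanly.
        self.db.execute(
            "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)",
            (key, json.dumps(value)))
        self.db.commit()

    def recall(self, key, default=None):
        row = self.db.execute(
            "SELECT value FROM memory WHERE key = ?", (key,)).fetchone()
        return json.loads(row[0]) if row else default

memory = AgentMemory()
memory.remember("last_ingested_doc", {"path": "docs/intro.md", "chunks": 12})
state = memory.recall("last_ingested_doc")
```

The key name and document fields here are hypothetical; the point is the pattern of storing interaction context under stable keys for later retrieval.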
File Management Tools
File management tools handle document ingestion: reading, parsing, chunking, and processing documents on the server filesystem. Agents can watch directories for new documents and automatically ingest them into the RAG pipeline.
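Chunking is the part of ingestion most worth seeing concretely. A minimal sketch, using fixed-size character windows with overlap (real pipelines often chunk on sentence or token boundaries instead):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows for ingestion.
    Overlap preserves context that straddles chunk boundaries, so a
    sentence cut at one chunk's end reappears at the next chunk's start."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A 450-character document with 200-char chunks and 50-char overlap
# yields chunks starting at offsets 0, 150, and 300.
doc = "x" * 450
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

Larger chunks give the generator more context per passage; smaller chunks make retrieval more precise. The right balance depends on the documents and the embedding model's input limit.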
Dedicated Resources
RAG pipelines benefit from dedicated CPU and memory. Plans range from Solo (2 CPU / 4 GB / 40 GB at $14.99/mo) to Scale (16 CPU / 32 GB / 320 GB at $125.99/mo), with no resource contention from other tenants.
Multi-Model Support
The osModa dashboard supports Claude Opus, Sonnet, Haiku, GPT-4o, and o3-mini. Different models can be used for different RAG stages: a smaller model for classification, a larger model for generation.
RAG and Tool Use Working Together
RAG and tool use are complementary patterns. RAG retrieves information to improve generation quality. Tool use executes actions to interact with external systems. In practice, an agentic AI system uses both: it invokes a retrieval tool (tool use) to search a vector store (RAG), incorporates the results into its reasoning, and then takes action based on the grounded context.
On osModa, this pattern is implemented through MCP. A custom MCP server can expose a "search_knowledge_base" tool that queries a vector database and returns relevant documents. The agent invokes this tool through osmoda-mcpd, receives the results, and uses them in its next reasoning step. Every retrieval and every action is logged in the audit ledger.
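The shape of this pattern can be sketched without the real infrastructure. Everything below is hypothetical: the tool registry, the tiny substring-match "knowledge base," and the print-based audit log are stand-ins for osmoda-mcpd's actual MCP dispatch, a real vector store, and the audit ledger.

```python
# Hypothetical tool registry, standing in for an MCP server's dispatch.
TOOLS = {}

def tool(name):
    """Decorator that registers a function as a named tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

# Stand-in for a vector store: a tiny corpus with substring matching.
KNOWLEDGE_BASE = {
    "nixos": "NixOS configurations are declarative and support atomic rollback.",
    "rag": "RAG grounds LLM generation in retrieved documents.",
}

@tool("search_knowledge_base")
def search_knowledge_base(query: str) -> list:
    """Toy retrieval: return passages whose topic key appears in the
    query. A real implementation would query a vector database."""
    q = query.lower()
    return [text for key, text in KNOWLEDGE_BASE.items() if key in q]

def invoke(tool_name, **kwargs):
    """Dispatch a tool call and record it, mirroring an audit ledger."""
    result = TOOLS[tool_name](**kwargs)
    print(f"audit: {tool_name}({kwargs}) -> {len(result)} result(s)")
    return result

docs = invoke("search_knowledge_base", query="How does RAG work on NixOS?")
```

The agent-facing contract is the important part: the agent names a tool and arguments, the server performs retrieval, and every invocation leaves an auditable record.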
Frequently Asked Questions
What does RAG stand for?
RAG stands for Retrieval-Augmented Generation. It is an architecture where a language model's generation is augmented with information retrieved from external data sources at query time. Instead of relying solely on the model's training data, the system searches a knowledge store, retrieves relevant documents, and includes them in the prompt context before the model generates a response.
Why is RAG better than fine-tuning?
Fine-tuning bakes knowledge into the model's weights, which means updating the knowledge requires retraining. RAG keeps knowledge external, in a searchable store, so updates are immediate -- add a document to the store and the next query can use it. RAG also provides attribution: you can trace every piece of generated content back to the source document it was retrieved from.
What is a vector store?
A vector store is a database optimized for storing and searching high-dimensional vectors (embeddings). Text documents are converted into numerical vectors using an embedding model. When a query arrives, it is also converted to a vector, and the store performs nearest-neighbor search to find the most semantically similar documents. Common vector stores include pgvector (PostgreSQL extension), Chroma, and Qdrant.
How does osModa support RAG?
osModa provides the infrastructure for running RAG pipelines on dedicated servers. The storage tools in osModa's 83-tool catalog handle persistent data management. The file management tools enable document ingestion and processing. You can run vector databases (pgvector, Chroma, Qdrant) directly on your NixOS server with full root access. The watchdog daemon ensures your RAG pipeline stays running 24/7.
Can I run a vector database on osModa?
Yes. Every osModa server provides full root SSH access on NixOS, so you can install and run any vector database. NixOS declarative configuration means your vector database setup is reproducible and version-controlled. If the database process crashes, osmoda-watch restarts it automatically. Plans range from Solo (2 CPU / 4 GB / 40 GB storage at $14.99/mo) to Scale (16 CPU / 32 GB / 320 GB storage at $125.99/mo).
What is the difference between RAG and tool use?
RAG retrieves information to augment generation -- the agent gets better context before producing output. Tool use executes actions -- the agent changes something in the external world. In practice, RAG and tool use work together: an agent might use a retrieval tool (tool use) to search a vector store (RAG), then use the retrieved context to generate a response, and finally use a communication tool (tool use) to deliver the result.
Run RAG Pipelines on Dedicated Servers
Full root access for vector databases, persistent storage, and watchdog-supervised ingestion pipelines. Plans from $14.99/month.
Spawn Server