OC · RAG Forge
Design and build RAG systems — vector DB choice, embeddings, chunking, hybrid search, retrieval eval. Tri-agent.
/oc-rag · /oc-rag eval · /oc-rag bench · build + ai-native · runs as a tri-agent loop (Designer / Builder / Evaluator)
Drop the bundle into .claude/skills/ and Claude Code auto-discovers it on the next session — or point Codex / any MCP agent at the hosted opchain.dev/mcp endpoint.
How you'll use it
Retrieval-augmented generation harness with a Designer/Builder/Evaluator loop. Owns vector DB selection (pgvector, Turbopuffer, Pinecone, Supabase Vectors), embedding-model choice, chunking strategy, hybrid search, and retrieval evaluation. Use for /oc-rag, "RAG", "vector database", "embeddings", "chunking", "semantic search", "hybrid search", "retrieval eval", "reranking", "knowledge base". Trigger liberally on retrieval / RAG work.
Trigger with natural language or a slash command:
/oc-rag, /oc-rag eval, /oc-rag bench
RAG Forge
On first invocation, read references/orchestrator.md and follow its welcome protocol (if present; otherwise fall back to the shared skills/orchestrator.md).
Tri-agent retrieval harness: the Designer picks the retrieval architecture (vector DB, embedding model, chunking strategy, search mode) → the Builder materialises the ingestion + retrieval pipeline and indexes a corpus → the Evaluator scores retrieval quality against a labelled set with isolated context and gates the system on recall/MRR/nDCG/faithfulness thresholds.
RAG is not “embed some docs and hope.” Every default — chunk size, k, the
embedding model, whether you rerank — moves a measurable metric, and the only
way to know which way is to evaluate. This skill exists to make retrieval an
evaluated artifact, not a vibe.
This is the retrieval-layer counterpart to oc-claude-api (which owns the
generation model + prompt caching) and oc-stack-forge (which owns the
vector-DB infra packs). RAG Forge owns the part in between: turning a corpus
into a retrieval index that demonstrably surfaces the right context.
/oc-rag — Command Reference
RAG FORGE COMMANDS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TRI-AGENT HARNESS
  /oc-rag           Design a RAG system end-to-end (Designer → Builder → Evaluator)
  /oc-rag design    Pick vector DB, embedding model, chunking, search mode (Designer)
  /oc-rag build     Materialise ingest + retrieval pipeline, index corpus (Builder)
  /oc-rag eval      Score retrieval against a labelled set (Evaluator)
RETRIEVAL DESIGN
  /oc-rag chunk     Choose / tune a chunking strategy for a corpus
  /oc-rag embed     Choose / swap the embedding model
  /oc-rag hybrid    Add BM25 + dense fusion and a reranker
EVALUATION
  /oc-rag goldset   Build or extend the labelled query→relevant-doc set
  /oc-rag bench     Benchmark vector-DB / embedding / chunking choices head-to-head
  /oc-rag regress   Re-run the goldset and gate on metric regression
UTILITIES
  /oc-rag inspect   Dump retrieved chunks for a query (debug retrieval)
  /checkpoint       Show checkpoint status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Type any command to begin. /oc-rag to see this again.
Tri-Agent Architecture
           CORPUS + RETRIEVAL INTENT
(from oc-app-architect: "this app needs a knowledge base / semantic search")
                       │
                       ▼
              ┌──────────────────┐
              │       RAG        │  Picks the retrieval architecture: vector DB,
              │     DESIGNER     │  embedding model, chunking strategy, search mode
              │                  │  (dense / hybrid / + reranker). Declares targets.
              └────────┬─────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│     RETRIEVAL LOOP (per corpus / config)     │
│                                              │
│  ┌────────────┐    config    ┌─────────────┐ │
│  │ RAG        │◄─negotiate──►│ RAG         │ │
│  │ BUILDER    │              │ EVALUATOR   │ │
│  │            │──index──────►│             │ │
│  │ Ingests +  │              │ Runs the    │ │
│  │ chunks +   │              │ goldset,    │ │
│  │ embeds +   │◄──failures───│ scores      │ │
│  │ indexes    │              │ recall/MRR  │ │
│  └────────────┘              └─────────────┘ │
│        │                           │         │
│        │   Metrics ≥ thresholds?   │         │
│        └───────────────────────────┘         │
└──────────────────────────────────────────────┘
                       │
                       └──► Regression gating (ongoing, in CI)
Why Three Agents for Retrieval?
- Self-graded retrieval is fiction. Whoever builds the pipeline picks the chunk size, the k, the embedding model, and then eyeballs three queries that happen to work. The Evaluator runs a labelled set (query → known-relevant docs) with isolated context and reports recall@k, MRR, and nDCG. “It looks like it’s finding the right stuff” is not a measurement.
- The generation hides retrieval failure. A strong model (Claude) will produce a fluent, confident answer even when the retrieved context is wrong or empty: it falls back on parametric knowledge or hallucinates. Faithfulness (is the answer grounded in retrieved context?) and context recall (did we even retrieve the supporting passage?) catch this; answer quality alone does not.
- Every knob is a tradeoff with no obvious default. Smaller chunks raise precision but fragment context; larger k raises recall but costs tokens and adds noise; a reranker fixes ordering but adds latency. The only honest way to set these is to move one knob, re-run the goldset, and keep the change if the metric improved. That is the Builder ↔ Evaluator loop.
Phase 1: RAG Designer (/oc-rag design)
Designer Persona
The Designer is a retrieval engineer who has shipped production RAG and has the scars to prove that defaults matter. Key behaviors:
- Read the corpus before choosing anything. Volume (1K docs vs 50M chunks), modality (clean markdown vs scanned PDF vs code vs chat logs), update cadence (static snapshot vs streaming), and query shape (keyword-y lookups vs fuzzy natural-language questions) drive every other decision. Don’t pick pgvector vs Pinecone in a vacuum.
- Size the index first. Estimated chunk count × embedding dimensions × 4 bytes is your raw vector footprint. 50M chunks × 1536 dims is ~300GB before index overhead; that rules out “just use pgvector on the app’s Postgres” and points at Turbopuffer or Pinecone. See references/vector-db-decision.md.
- Default to hybrid, not pure dense. Pure vector search silently fails on exact terms: product SKUs, error codes, proper nouns, acronyms. Dense + BM25 fusion (then rerank the union) is the strong default for mixed corpora. Reserve pure dense for short, paraphrase-heavy semantic matching.
- Pick the embedding model on the eval, not the leaderboard. MTEB rank is a starting prior, not an answer. Domain (legal, code, multilingual) flips the ranking. The Designer commits to a candidate set (e.g. text-embedding-3-large, Cohere embed-v4, Voyage voyage-3, open bge-large) for the Builder to bench.
- Declare the metric targets up front. “recall@10 ≥ 0.90, MRR ≥ 0.75, faithfulness ≥ 0.95” before building. Targets the Designer can’t justify from the product’s tolerance for a missed answer aren’t real targets.
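The sizing rule above (chunk count × dimensions × 4 bytes) is easy to check before any vector DB is chosen. A minimal sketch, assuming float32 vectors and ignoring index overhead; `raw_vector_bytes` is a hypothetical helper name, not part of the skill:

```python
def raw_vector_bytes(chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector footprint in bytes, before any ANN index overhead.

    bytes_per_value=4 assumes float32 storage (an assumption; quantised
    indexes store fewer bytes per dimension).
    """
    return chunks * dims * bytes_per_value


# The worked example from the text: 50M chunks at 1536 dimensions.
footprint = raw_vector_bytes(50_000_000, 1536)
print(f"{footprint / 1e9:.1f} GB")  # 307.2 GB
```

Run the same arithmetic at the projected corpus size, not only today's, since the pgvector-vs-Turbopuffer/Pinecone decision hinges on where the footprint ends up.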
Designer Workflow
- Read upstream context: corpus description, oc-app-architect 02-architecture.md (is this a knowledge base? semantic search? agentic retrieval?), the chosen generation model from oc-claude-api (it bounds the context budget you can spend on retrieved chunks).
- Walk the vector-DB decision tree → references/vector-db-decision.md.
- Walk the embedding-model decision tree → references/embedding-models.md.
- Pick a chunking strategy → references/chunking-strategies.md.
- Choose search mode: dense-only or hybrid (dense + BM25), and whether to add a cross-encoder reranker.
- Declare metric targets and write the Retrieval Design doc for approval.
Vector-DB Decision Tree
These map 1:1 to oc-stack-forge packs with kind: vector-db. The Designer
recommends; oc-stack-forge provisions.
Already on Postgres + < ~1M chunks + ops-simplicity wins?
  └─► pgvector (HNSW) → oc-stack-forge pack: pgvector
On Supabase specifically (RLS, edge functions, one platform)?
  └─► Supabase Vectors → oc-stack-forge pack: supabase-vectors
      (pgvector under the hood + Supabase tooling/auth)
Need serverless scale-to-millions, object-storage economics, low ops?
  └─► Turbopuffer → oc-stack-forge pack: turbopuffer
      (cheap at rest, native hybrid, great $/vector at scale)
Need a managed, battle-tested vector DB with rich metadata filtering
and you'll pay for hands-off ops?
  └─► Pinecone → oc-stack-forge pack: pinecone
Full tradeoff table (scale / latency / ops / cost) in
references/vector-db-decision.md.
Embedding-Model Decision Tree
Multilingual corpus or non-English queries?
  └─► Cohere embed-v4 / Voyage multilingual / bge-m3
Code / mixed code+prose retrieval?
  └─► Voyage voyage-code / OpenAI text-embedding-3-large
General English, want the easy default + good price/quality?
  └─► OpenAI text-embedding-3-small (1536d) → -large (3072d) if recall is short
Self-host / no-egress / cost-at-volume mandatory?
  └─► open weights: bge-large-en-v1.5, e5-large, nomic-embed
Storage / latency pressure at huge scale?
  └─► Matryoshka truncation (3-large → 256/512d) or a 384d open model
Dimension vs recall tradeoffs, fine-tuning triggers, and the full table live in
references/embedding-models.md.
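Matryoshka truncation in the last branch amounts to keeping a prefix of the vector and re-normalising it. A minimal sketch, assuming the embedding model was trained so that prefixes remain meaningful (truncating an arbitrary model's vectors this way generally hurts recall); `truncate_embedding` is a hypothetical helper:

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalise to unit length.

    Only valid for Matryoshka-style models (assumption); cosine similarity
    needs the shortened vector re-normalised.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalise
    return [x / norm for x in head]


# Toy 3-d vector cut to 2-d; real use would be e.g. 3072 → 256 or 512.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Whether the storage saving is worth it is still an eval question: bench the truncated dimension on the goldset like any other candidate.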
Chunking Strategy (quick map)
| Corpus | Default strategy | Notes |
|---|---|---|
| Clean markdown / docs | Structural (by heading) → recursive fallback | Respect doc structure |
| Long prose / books | Recursive char split, 512–1024 tok, 10–20% overlap | Overlap preserves context |
| Heterogeneous / mixed quality | Semantic chunking (embedding-distance splits) | Costlier, higher coherence |
| Q&A / support tickets | One chunk per item | Natural unit already |
| Code | Structural by symbol (function/class) | AST-aware, never mid-function |
| Tables / scanned PDF | Parse → structural; keep tables whole | Don’t split a table row-wise |
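The size and overlap numbers in the long-prose row come down to simple window arithmetic. The sketch below is a plain sliding window, not the recursive splitter the table names, and it exists only to show how overlap percentage turns into a step size. `chunk_tokens` is a hypothetical helper, and the whitespace split stands in for a real tokenizer:

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap_pct: float = 0.15) -> list[list[str]]:
    """Split `tokens` into windows of `size`, each overlapping the previous by ~overlap_pct."""
    if size <= 0 or not 0.0 <= overlap_pct < 1.0:
        raise ValueError("size must be positive and overlap_pct in [0, 1)")
    step = max(1, size - int(size * overlap_pct))  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


# Whitespace split is an assumption; use the embedding model's tokenizer in practice.
words = "one two three four five six seven eight nine ten".split()
windows = chunk_tokens(words, size=4, overlap_pct=0.25)
```

With `size=4` and 25% overlap the step is 3, so each window repeats the last token of the previous one.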
Use parent-document retrieval when you need precise matching and full
context: embed small child chunks for the match, return the larger parent for the
LLM. Full heuristics in references/chunking-strategies.md.
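Parent-document retrieval can be sketched as a post-processing step over child hits. This is an illustration under assumptions: `ChildHit`, `parents_for_hits`, and the in-memory `parents` dict are hypothetical, and a real pipeline would read parents from the document store:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildHit:
    child_id: str
    parent_id: str
    score: float  # higher is better (assumption about the retriever's scoring)


def parents_for_hits(hits: list[ChildHit], parents: dict[str, str], top_n: int) -> list[str]:
    """Map ranked child hits to parent texts, de-duplicated, best-scoring child first."""
    seen: set[str] = set()
    out: list[str] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.parent_id in seen:
            continue  # a better child of this parent was already used
        seen.add(hit.parent_id)
        out.append(parents[hit.parent_id])
        if len(out) == top_n:
            break
    return out


parents = {"p1": "Full section on refunds...", "p2": "Full section on shipping..."}
hits = [
    ChildHit("p1#0", "p1", 0.71),
    ChildHit("p2#3", "p2", 0.88),
    ChildHit("p1#2", "p1", 0.80),
]
context = parents_for_hits(hits, parents, top_n=2)
```

De-duplicating on the parent matters for the context budget: two matching children from the same section should cost one parent's worth of tokens, not two.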
Gate: Retrieval Design Approval
Present the design: vector DB (+ stack-forge pack), embedding model (+ candidate
set to bench), chunking strategy, search mode, and the declared metric targets.
Confirm with the user. Write checkpoint: phase designed.
Phase 2: Build Loop (/oc-rag build)
Step 1: Build Contract
Builder proposes:
## RAG Build Contract
### Deliverables
- Ingestion pipeline: load → clean → chunk → embed → upsert (idempotent on doc id)
- Vector index provisioned via oc-stack-forge pack [pgvector|turbopuffer|pinecone|supabase-vectors]
- Retrieval function: query → embed → (dense ∪ BM25) → rerank → top-k chunks
- Metadata filter support (tenant, source, date, doc_type)
- A labelled goldset (≥ 50 query→relevant-doc pairs) for the Evaluator
- Re-index / incremental-update path (no full rebuild for one changed doc)
### Testable Criteria
1. Re-running ingest on an unchanged corpus is a no-op (idempotent upserts)
2. Retrieval returns chunks with stable ids + source metadata + scores
3. Metadata filters actually restrict the candidate set (tenant isolation holds)
4. Hybrid fusion + rerank are wired and toggleable for the bench
5. The goldset exists and every query has ≥ 1 labelled relevant doc
### Metric Targets (from Designer)
- recall@10 ≥ 0.90
- MRR ≥ 0.75
- nDCG@10 ≥ 0.80
- faithfulness ≥ 0.95 (end-to-end, on the generation)
Evaluator reviews and pushes back if:
- There’s no goldset (you cannot evaluate retrieval without labels)
- Upserts aren’t idempotent (re-ingest duplicates the corpus and inflates recall)
- Metadata/tenant filtering isn’t testable (a multi-tenant RAG leak is a breach)
- Targets are absent or unjustified
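Testable criterion 1 (re-ingest is a no-op) is usually met by comparing a content hash per doc id before touching the index. A minimal sketch with an in-memory dict standing in for the vector DB's stored metadata; `plan_upserts` and `apply_ingest` are hypothetical names:

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_upserts(docs: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """Return ids of docs that are new or whose content changed.

    `indexed` maps doc id -> hash stored at last ingest (assumed layout).
    """
    return sorted(
        doc_id for doc_id, text in docs.items()
        if indexed.get(doc_id) != content_hash(text)
    )


def apply_ingest(docs: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """Upsert only changed docs; returns the ids that were touched."""
    changed = plan_upserts(docs, indexed)
    for doc_id in changed:
        # Real pipeline: clean → chunk → embed → upsert under ids derived from doc_id.
        indexed[doc_id] = content_hash(docs[doc_id])
    return changed


index_state: dict[str, str] = {}
corpus = {"faq-1": "How do refunds work?", "faq-2": "Shipping takes 3 days."}
first_run = apply_ingest(corpus, index_state)
second_run = apply_ingest(corpus, index_state)  # unchanged corpus
```

The second run touching nothing is exactly what the Evaluator checks; the same hash comparison also gives the incremental-update path for a single changed doc.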
Step 2: Builder Implements
Builder reads the chosen vector-DB pack from the oc-stack-forge checkpoint and
uses that pack’s client + index config rather than reinventing it:
| Vector DB (stack-forge pack) | Index | Hybrid support | Client pattern |
|---|---|---|---|
| pgvector | HNSW (vector_cosine_ops) | Lexical ranking via Postgres FTS (tsvector; built-in ts_rank is not true BM25) + app-side fusion | SQL + pgvector driver |
| supabase-vectors | HNSW (pgvector) | FTS + RPC fusion function | Supabase client / RPC |
| turbopuffer | native ANN | native BM25 + vector in one query | Turbopuffer SDK |
| pinecone | native ANN | sparse-dense vectors or integrated reranker | Pinecone SDK |
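The table leaves the fusion method open. One common choice for app-side fusion is reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores. A minimal sketch, not necessarily what any pack ships; `rrf_fuse` and the chunk ids are hypothetical, and `k=60` is the constant commonly used with RRF rather than a value from this skill:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties on id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


dense = ["c1", "c2", "c3"]   # ids ranked by vector similarity
sparse = ["c3", "c1", "c4"]  # ids ranked by lexical score
fused = rrf_fuse([dense, sparse])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks found by both retrievers (`c1`, `c3`) rise above chunks found by only one, which is the behaviour that rescues exact-term queries. A weighted variant would expose the "hybrid weight" knob the Evaluator tunes.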
Embedding generation: batch the corpus through the chosen model (respect rate
limits, retry with backoff, cache by content hash so re-runs are cheap). If the
generation model is Claude via oc-claude-api, the retrieved chunks become cached
prefix content — coordinate chunk ordering with prompt-caching boundaries.
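The batching advice above (cache by content hash, retry with backoff) can be sketched independently of any provider. Everything here is illustrative: `embed_batch` is a stand-in callable for whichever embedding client is chosen, the dict cache would be a persistent store in practice, and catching bare `Exception` is a simplification where real code should catch only the client's rate-limit and transient errors:

```python
import hashlib
import time
from typing import Callable

Vector = list[float]


def _call_with_backoff(fn, batch, max_retries: int, base_delay: float, sleep) -> list[Vector]:
    for attempt in range(max_retries):
        try:
            return fn(batch)
        except Exception:  # simplification: narrow this to transient errors
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise ValueError("max_retries must be >= 1")


def embed_with_cache(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    cache: dict[str, Vector],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[Vector]:
    """Embed `texts`, calling the model only for content not already cached."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing: dict[str, str] = {}  # hash -> text, de-duplicated
    for key, text in zip(keys, texts):
        if key not in cache:
            missing.setdefault(key, text)
    if missing:
        vectors = _call_with_backoff(embed_batch, list(missing.values()), max_retries, base_delay, sleep)
        for key, vec in zip(missing, vectors):
            cache[key] = vec
    return [cache[key] for key in keys]
```

Passing `sleep` in as a parameter keeps the retry path testable without real delays. Note that the cache key should also include the model name once more than one embedding model is being benched.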
Step 3: Evaluator
The Evaluator runs with isolated context — it sees the goldset and the live retrieval function, not the Builder’s tuning rationale.
Evaluator Persona. A retrieval QA engineer who trusts numbers over demos. Key behaviors:
- Run the whole goldset, not three cherry-picked queries. Report aggregate recall@k, MRR, and nDCG@10 with per-query breakdown for the failures.
- Separate retrieval failure from generation failure. If recall@k is high but answers are wrong, the bug is in generation/prompting, not retrieval. If recall is low, no amount of prompt tuning saves it. Name which layer failed.
- Measure faithfulness and answer relevance on the end-to-end output: is every claim grounded in a retrieved chunk, and does the answer address the question.
- Probe the known failure modes: exact-term queries (does hybrid catch the SKU/error-code?), out-of-corpus queries (does it correctly retrieve nothing / abstain?), and adversarial near-duplicates.
- Gate on regression, not just absolutes. A change that lifts recall@10 from 0.91 to 0.93 but drops MRR from 0.80 to 0.62 made retrieval worse for the top result. Report deltas vs the last passing config.
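The regression behaviour in the last bullet can be expressed as a delta check against the last passing config. A minimal sketch; `regressions` is a hypothetical helper and the 0.01 tolerance is an assumed noise floor, not a value defined by the skill:

```python
def regressions(current: dict[str, float], baseline: dict[str, float], tolerance: float = 0.01) -> dict[str, float]:
    """Return metric -> delta for every metric that dropped by more than `tolerance`."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# The example from the text: recall improves, MRR collapses.
baseline = {"recall_at_10": 0.91, "mrr": 0.80}
candidate = {"recall_at_10": 0.93, "mrr": 0.62}
print(regressions(candidate, baseline))  # {'mrr': -0.18}
```

A non-empty result fails the gate even when every absolute target is still met.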
Evaluator Report saved to rag/eval-round-M.md:
## RAG Evaluation — Round [M]
### Config Under Test
- Vector DB: turbopuffer | Embedding: text-embedding-3-large (3072d)
- Chunking: recursive 768 tok / 15% overlap | Search: hybrid + Cohere rerank-v3.5
- k (retrieve): 40 → rerank → 10
### Retrieval Metrics (goldset: 80 queries)
| Metric | Value | Target | Pass? |
|---|---|---|---|
| recall@10 | 0.93 | ≥ 0.90 | PASS |
| MRR | 0.78 | ≥ 0.75 | PASS |
| nDCG@10 | 0.81 | ≥ 0.80 | PASS |
### Generation Metrics
| Metric | Value | Target | Pass? |
|---|---|---|---|
| faithfulness | 0.96 | ≥ 0.95 | PASS |
| answer relevance | 0.88 | ≥ 0.85 | PASS |
### Failure Analysis
- 5 queries below recall@10: 4 are exact-SKU lookups (hybrid weight too low on sparse)
- 1 out-of-corpus query: correctly retrieved nothing, model abstained ✓
### Delta vs Round [M-1]
- recall@10 +0.04, MRR +0.06 (reranker added). No regressions.
### Verdict: PASS / FAIL
[If FAIL: which metric, which queries, which knob to move]
Scoring rubric details (how each metric is computed, faithfulness judging) live in
references/retrieval-eval.md.
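For orientation, the three retrieval metrics have standard textbook definitions, sketched below with binary relevance labels. This is not the rubric from references/retrieval-eval.md, which may differ (graded relevance, for instance); the function names are hypothetical, and retrieved ids are assumed unique:

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labelled relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit; 0 if none. MRR is the mean over queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain over the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# One goldset query: two labelled relevant docs, one of them retrieved at rank 2.
retrieved = ["a", "b", "c"]
relevant = {"b", "d"}
```

Here recall@3 is 0.5 (one of two relevant docs found) and the reciprocal rank is 0.5 (first hit at rank 2). The Evaluator averages these per-query values across the goldset and keeps the per-query numbers for the failure analysis.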
Step 4: Iterate or Advance
- PASS: Retrieval is shippable. Freeze the config, commit the goldset, register the regression suite with CI, hand off to oc-claude-api for the generation layer and oc-deploy-ops for the pipeline.
- FAIL + rounds remaining: Feed the failure analysis to the Builder. Move one knob (chunk size, k, hybrid weight, reranker, embedding model), re-run.
- FAIL + max rounds: Escalate to user with all round reports and the metric trajectory.
Max iterations: 3. Each iteration moves one variable so the delta is attributable.
Benchmark Mode (/oc-rag bench)
bench is the Evaluator running a grid, not a single config — it sweeps the
candidate set the Designer declared (embedding models × chunk sizes × search modes)
and reports the full metric matrix so the winning config is chosen by data.
RAG BENCH — corpus: support-kb (12,400 chunks) — goldset: 80 queries
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
embedding               chunk  search         recall@10  MRR   $/1M tok  p95 ms
text-embedding-3-small  512    dense          0.84       0.69  0.02      31
text-embedding-3-small  512    hybrid         0.89       0.74  0.02      44
text-embedding-3-large  768    hybrid         0.91       0.77  0.13      46
text-embedding-3-large  768    hybrid+rerank  0.93       0.81  0.13      88
cohere embed-v4         768    hybrid+rerank  0.94       0.82  0.10      91
bge-large (self-host)   768    hybrid+rerank  0.90       0.76  0.00      120
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommend: cohere embed-v4 + 768/hybrid+rerank (best recall/MRR; reranker
adds ~45ms — acceptable for this product's latency budget).
The bench output is the evidence behind the Designer’s final config choice. Always include cost and p95 latency, not just quality — the best-recall config that blows the latency budget is not the answer.
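Choosing a winner from the matrix is a constrained maximisation: filter on the budgets first, then rank on quality. A minimal sketch using the rows from the sample bench above; `BenchRow`, `pick_config`, and the budget values passed in are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchRow:
    embedding: str
    chunk: int
    search: str
    recall_at_10: float
    mrr: float
    p95_ms: int


# Rows copied from the sample bench output.
ROWS = [
    BenchRow("text-embedding-3-small", 512, "dense", 0.84, 0.69, 31),
    BenchRow("text-embedding-3-small", 512, "hybrid", 0.89, 0.74, 44),
    BenchRow("text-embedding-3-large", 768, "hybrid", 0.91, 0.77, 46),
    BenchRow("text-embedding-3-large", 768, "hybrid+rerank", 0.93, 0.81, 88),
    BenchRow("cohere embed-v4", 768, "hybrid+rerank", 0.94, 0.82, 91),
    BenchRow("bge-large (self-host)", 768, "hybrid+rerank", 0.90, 0.76, 120),
]


def pick_config(rows: list[BenchRow], p95_budget_ms: int, min_recall: float) -> Optional[BenchRow]:
    """Best recall (then MRR) among rows that fit the latency budget and recall floor."""
    eligible = [r for r in rows if r.p95_ms <= p95_budget_ms and r.recall_at_10 >= min_recall]
    return max(eligible, key=lambda r: (r.recall_at_10, r.mrr), default=None)


loose = pick_config(ROWS, p95_budget_ms=100, min_recall=0.90)  # cohere embed-v4 row
tight = pick_config(ROWS, p95_budget_ms=50, min_recall=0.90)   # 3-large hybrid, no reranker
```

With a 100 ms budget the recommendation matches the sample output; tighten the budget to 50 ms and the reranked configs drop out. Cost could be added as a third filter in the same way.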
Boundaries (what oc-rag-forge does NOT own)
| Concern | Owner | Why |
|---|---|---|
| The generation model, prompt caching of retrieved context, citations | oc-claude-api | RAG Forge surfaces chunks; oc-claude-api turns them into answers |
| Provisioning the vector DB (Postgres, Turbopuffer, Pinecone, Supabase) | oc-stack-forge | Owns the kind: vector-db packs; RAG Forge selects, stack-forge provisions |
| The app/data model the corpus lives in | oc-app-architect | RAG Forge is invoked by it for the retrieval surface |
| Prompt versioning / eval datasets for the generation prompt | oc-prompt-ops | Prompt-side evals; RAG Forge owns retrieval evals |
| Embedding batch jobs at scale, capacity math | oc-scale-ops | RAG Forge declares throughput needs; scale-ops sizes |
| Multi-tenant isolation threat model | oc-security-auditor | RAG Forge enforces metadata filters; security-auditor reviews the boundary |
| Deploying + monitoring the live retrieval service | oc-deploy-ops / oc-monitoring-ops | RAG Forge hands off a frozen, evaluated config |
RAG Forge’s output is an evaluated retrieval config plus a regression goldset. Sibling skills provision, generate, deploy, and monitor around it.
Cross-Skill Integration
| Skill | How it connects |
|---|---|
| oc-app-architect | Auto-invokes RAG Forge when 02-architecture.md calls for a knowledge base, semantic search, or agentic retrieval. RAG Forge returns the retrieval design into the architecture spec. |
| oc-stack-forge | Vector-DB choice maps to a kind: vector-db pack (pgvector, turbopuffer, pinecone, supabase-vectors). RAG Forge selects; stack-forge provisions + emits the client config. |
| oc-claude-api | Owns the generation model. RAG Forge pulls the model’s context budget (how many chunks fit) and coordinates chunk ordering with prompt-caching boundaries. Citations/grounding live there. |
| oc-prompt-ops | Owns generation-prompt versioning + evals. RAG Forge’s goldset is the retrieval analogue; the two regression suites run side by side. |
| oc-deploy-ops | Receives the frozen retrieval config + ingestion pipeline; gates prod on the regression goldset passing. |
| oc-monitoring-ops | Watches live retrieval: recall drift as the corpus grows, embedding-model deprecations, p95 latency. |
| oc-security-auditor | Reviews multi-tenant metadata-filter isolation and PII-in-embeddings exposure. |
| oc-scale-ops | Sizes the embedding batch jobs and the index at projected corpus volume. |
Checkpoint Integration
Checkpoint Location
{project-dir}/.checkpoints/oc-rag-forge.checkpoint.json
When to Write
| Event | What to Save |
|---|---|
| Design approved | Vector DB + pack, embedding model + candidate set, chunking, search mode, targets |
| Build contract negotiated | Deliverables, criteria, goldset size |
| Builder completes | Corpus size, chunk count, index name, ingest pipeline path |
| Evaluator runs | Per-metric values vs targets, failure analysis, round number |
| Bench runs | Full config matrix, recommended config |
| Regression gate | Pass/fail + delta vs last frozen config |
skill_state
{
  "vector_db": "turbopuffer",
  "stack_forge_pack": "turbopuffer",
  "embedding": { "model": "cohere-embed-v4", "dims": 1024 },
  "chunking": { "strategy": "recursive", "size_tokens": 768, "overlap_pct": 15 },
  "search": { "mode": "hybrid", "reranker": "cohere-rerank-v3.5", "retrieve_k": 40, "final_k": 10 },
  "corpus": { "docs": 4200, "chunks": 12400 },
  "goldset": { "path": "rag/goldset.jsonl", "queries": 80 },
  "targets": { "recall_at_10": 0.90, "mrr": 0.75, "ndcg_at_10": 0.80, "faithfulness": 0.95 },
  "last_eval": {
    "round": 3,
    "verdict": "PASS",
    "metrics": { "recall_at_10": 0.94, "mrr": 0.82, "ndcg_at_10": 0.81, "faithfulness": 0.96 }
  }
}
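A downstream skill reading this state can re-derive the verdict rather than trusting the stored string. A minimal sketch, assuming the `skill_state` object is available as parsed JSON (how it is nested inside the checkpoint file is not specified here); `failing_metrics` is a hypothetical helper:

```python
import json

# Subset of the skill_state example above.
STATE_JSON = """
{
  "targets": { "recall_at_10": 0.90, "mrr": 0.75, "ndcg_at_10": 0.80, "faithfulness": 0.95 },
  "last_eval": {
    "round": 3,
    "verdict": "PASS",
    "metrics": { "recall_at_10": 0.94, "mrr": 0.82, "ndcg_at_10": 0.81, "faithfulness": 0.96 }
  }
}
"""


def failing_metrics(state: dict) -> list[str]:
    """Names of targets the last eval did not meet; a missing metric counts as failing."""
    metrics = state["last_eval"]["metrics"]
    return [
        name for name, target in state["targets"].items()
        if name not in metrics or metrics[name] < target
    ]


state = json.loads(STATE_JSON)
failed = failing_metrics(state)
verdict_consistent = (not failed) == (state["last_eval"]["verdict"] == "PASS")
```

Treating a missing metric as a failure keeps a partially written checkpoint from passing the deploy gate by omission.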
Cross-Skill Reads
| Reads from | Why |
|---|---|
| oc-app-architect | 02-architecture.md retrieval intent + corpus description |
| oc-stack-forge | Chosen vector-DB pack + provisioned index config |
| oc-claude-api | Generation model + context budget for retrieved chunks |
| Read by | Why |
|---|---|
| oc-claude-api | Retrieved chunks feed the generation prompt (+ caching boundaries) |
| oc-deploy-ops | Frozen config + regression goldset as a deploy gate |
| oc-monitoring-ops | Retrieval metrics + corpus-growth drift to watch |
| oc-security-auditor | Tenant-isolation filter contract as a posture input |
Principles
- Retrieval is an evaluated artifact, not a vibe. Every config ships with a goldset and a recall@k / MRR / nDCG number. “Looks right” is not a measurement.
- The generation hides retrieval failure. Measure faithfulness and context recall separately — a fluent wrong answer is a retrieval bug wearing a disguise.
- Default to hybrid. Pure dense search silently misses exact terms. Dense + BM25 + rerank is the strong default; reserve pure dense for paraphrase matching.
- Pick the embedding model on the eval, not the leaderboard. MTEB is a prior. Your domain flips the ranking. Bench the candidate set on your goldset.
- Move one knob per iteration. Chunk size, k, hybrid weight, reranker, embedding model: change one, re-run, attribute the delta. Bundled changes hide which one helped.
- Gate on regression, not just absolutes. A change that lifts recall but tanks MRR made the top result worse. Always report deltas vs the last passing config.
- Cost and latency are first-class metrics. The best-recall config that blows the latency or token budget is not the answer. Bench reports $ and p95 too.
- Idempotent ingest, isolated tenants. Re-ingest must be a no-op on unchanged docs, and metadata filters must actually isolate tenants; a cross-tenant retrieval leak is a breach, not a bug.
- Select, don’t reinvent. Vector DB comes from an oc-stack-forge pack; the generation model from oc-claude-api. RAG Forge owns the layer in between.
- Freeze and regress. Once a config passes, freeze it and run the goldset in CI. Corpus growth and model deprecations cause silent drift.