Scaling Production-Grade RAG Infrastructure Beyond Repo-Based LLM Apps
Your first LLM app can live inside a repo; your first dependable RAG product needs ingestion, retrieval, evaluation, and failure handling as infrastructure.

There is a moment every serious LLM builder reaches. The demo works. The chat box answers questions about your docs. Your repository has a neat rag.ts file, a few prompt templates, maybe a local vector store, and a README that says "run ingest, then run dev." It feels like a product until the first real user asks a question the demo path did not cover.
That is the gap this guide is about: the jump from awesome repo-based LLM apps to production-grade RAG infrastructure. If you are a vibe coder or indie developer, this is not a lecture about slowing down. It is a migration map so your fast prototype can become something customers, teammates, or paying users can trust.
The timing matters. Stack Overflow's 2025 Developer Survey reports that 84% of respondents are using or planning to use AI tools in their development process, and Stanford HAI's 2025 AI Index says 78% of organizations reported using AI in 2024, up from 55% the prior year. Meanwhile, McKinsey's 2025 State of AI notes that no more than 10% of respondents report scaling AI agents in any individual function. Adoption is wide; reliable scale is still scarce. That is your opening.
When should a vibe-coded RAG app become infrastructure?
When users depend on fresh, permissioned, cited answers, RAG stops being a helper function and becomes a production subsystem.
A repo-based RAG app is usually request-shaped: receive a question, embed it, search a vector index, stuff chunks into a prompt, return an answer. Production RAG is data-shaped. It asks: how did the data enter the system, which version was indexed, which users can see it, how do we prove the answer came from allowed sources, and how do we detect retrieval drift before support tickets arrive?
Level-up rule: do not replace your prototype all at once. Keep the product surface, then harden one boundary at a time: ingestion first, retrieval second, evaluation third, observability fourth.
If you have built Playwright, Cypress, or Selenium tests, this progression will feel familiar. A local test is useful. A CI pipeline with fixtures, retries, reports, and trace artifacts is operational. RAG follows the same path. For another angle on production checks around AI behavior, see our event-driven testing guide.
The migration map: from script to subsystem
Most early RAG apps combine four responsibilities in one place: loading documents, chunking documents, retrieving context, and generating answers. That is fine for learning. It becomes fragile when document volume grows, users have different access rights, or your model provider has rate limits. The first professional move is separation of concerns.
| Prototype pattern | Production upgrade | What breaks if you skip it |
|---|---|---|
| Manual ingest script | Idempotent background ingestion with checksums | Duplicate chunks, stale answers, lost partial failures |
| One vector query | Hybrid retrieval with metadata filters and thresholds | Wrong tenant data, shallow semantic matches, hallucinated citations |
| Prompt tweaking by feel | Golden-set evals in CI | Regressions ship silently after model, chunk, or prompt changes |
| Console logs | Traces with query, sources, latency, and refusal reason | Debugging becomes archaeology after users report bad answers |
Example 1: Idempotent ingestion with chunk hashes
Production RAG starts before the user asks a question. Ingestion should be repeatable, resumable, and boring. The example below is a complete Node TypeScript script that reads Markdown files, chunks by headings, embeds each chunk, and upserts into Postgres with pgvector. It fails fast on a missing directory, skips empty files and very short sections, truncates oversized chunks, deduplicates by content hash, and retries provider failures a bounded number of times.
// scripts/ingest-docs.ts
// Run with: OPENAI_API_KEY=... DATABASE_URL=... npx tsx scripts/ingest-docs.ts ./docs
import { createHash } from 'node:crypto'
import { readdir, readFile, stat } from 'node:fs/promises'
import path from 'node:path'
import OpenAI from 'openai'
import pg from 'pg'

type Chunk = { sourcePath: string; heading: string; body: string; hash: string }

const root = process.argv[2]
if (!root) throw new Error('Usage: tsx scripts/ingest-docs.ts ./docs')
if (!process.env.OPENAI_API_KEY) throw new Error('OPENAI_API_KEY is required')
if (!process.env.DATABASE_URL) throw new Error('DATABASE_URL is required')

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

async function listMarkdownFiles(dir: string): Promise<string[]> {
  let entries
  try {
    entries = await readdir(dir)
  } catch (error) {
    throw new Error('Cannot read docs directory: ' + String(error))
  }
  const files: string[] = []
  for (const entry of entries) {
    const fullPath = path.join(dir, entry)
    const info = await stat(fullPath)
    if (info.isDirectory()) files.push(...await listMarkdownFiles(fullPath))
    if (info.isFile() && fullPath.endsWith('.md')) files.push(fullPath)
  }
  return files
}

function chunkMarkdown(sourcePath: string, markdown: string): Chunk[] {
  if (!markdown.trim()) return []
  const sections = markdown.split(/\n(?=#{1,3}\s)/g)
  return sections.flatMap((section, index) => {
    const heading = section.match(/^#{1,3}\s+(.+)$/m)?.[1] ?? 'Untitled section ' + index
    const body = section.trim().slice(0, 6000)
    if (body.length < 80) return []
    const hash = createHash('sha256').update(sourcePath + '\n' + body).digest('hex')
    return [{ sourcePath, heading, body, hash }]
  })
}

async function embedWithRetry(input: string, attempt = 1): Promise<number[]> {
  try {
    const response = await openai.embeddings.create({ model: 'text-embedding-3-small', input })
    return response.data[0].embedding
  } catch (error) {
    if (attempt >= 3) throw error
    await new Promise((resolve) => setTimeout(resolve, 500 * attempt * attempt))
    return embedWithRetry(input, attempt + 1)
  }
}

async function main() {
  const files = await listMarkdownFiles(root)
  if (files.length === 0) throw new Error('No Markdown files found under ' + root)
  await pool.query('create extension if not exists vector')
  await pool.query(
    'create table if not exists rag_chunks (hash text primary key, source_path text not null, heading text not null, body text not null, embedding vector(1536) not null, updated_at timestamptz not null default now())'
  )
  let inserted = 0
  for (const file of files) {
    const markdown = await readFile(file, 'utf8')
    const chunks = chunkMarkdown(file, markdown)
    for (const chunk of chunks) {
      const embedding = await embedWithRetry(chunk.heading + '\n' + chunk.body)
      await pool.query(
        'insert into rag_chunks (hash, source_path, heading, body, embedding) values ($1, $2, $3, $4, $5) on conflict (hash) do update set heading = excluded.heading, body = excluded.body, embedding = excluded.embedding, updated_at = now()',
        [chunk.hash, chunk.sourcePath, chunk.heading, chunk.body, '[' + embedding.join(',') + ']']
      )
      inserted += 1
    }
  }
  console.log(JSON.stringify({ files: files.length, chunksUpserted: inserted }))
  await pool.end()
}

main().catch(async (error) => {
  console.error('Ingestion failed:', error instanceof Error ? error.message : error)
  await pool.end().catch(() => undefined)
  process.exit(1)
})

The key decision is the hash. Without a stable chunk identity, every re-ingest can create near-duplicates. Near-duplicates poison retrieval because the top results look diverse by row ID but contain the same paragraph repeated. That makes the model more confident without adding evidence.
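A hash-keyed upsert has one gap worth noting: when a section is edited, it gets a new hash, and the row holding the old text stays in the table. The sketch below shows one possible way to plan the cleanup for a single source path. It assumes you can list the hashes already stored for that path; `chunkHash`, `planSync`, and the `SyncPlan` shape are hypothetical helpers for illustration, not part of the script above.

```typescript
import { createHash } from 'node:crypto'

type NewChunk = { sourcePath: string; body: string }
type SyncPlan = { toUpsert: string[]; toDelete: string[]; unchanged: string[] }

// Same identity rule as the ingest script: sha256 of source path + newline + body.
export function chunkHash(chunk: NewChunk): string {
  return createHash('sha256').update(chunk.sourcePath + '\n' + chunk.body).digest('hex')
}

// Compare the hashes stored for one source path (hypothetical input, e.g. from
// "select hash from rag_chunks where source_path = $1") with the hashes produced
// by the current version of the file.
export function planSync(storedHashes: string[], currentChunks: NewChunk[]): SyncPlan {
  const stored = new Set(storedHashes)
  const current = new Set(currentChunks.map(chunkHash))
  return {
    // New or edited sections: need embedding and upsert.
    toUpsert: Array.from(current).filter((hash) => !stored.has(hash)),
    // Rows whose text no longer exists in the file: candidates for deletion.
    toDelete: Array.from(stored).filter((hash) => !current.has(hash)),
    // Identical sections: no embedding call needed.
    unchanged: Array.from(current).filter((hash) => stored.has(hash)),
  }
}
```

If you adopt something like this, deleting the `toDelete` rows in the same transaction as the upserts keeps readers from seeing the old and new versions of a section side by side. Skipping `unchanged` chunks also saves embedding calls on re-ingest.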
What changes when RAG leaves your laptop?
Retrieval needs permissions, freshness, thresholds, and traces; otherwise a good demo becomes a hard-to-debug production liability.
Example 2: Retrieval with thresholds, filters, and safe failure
The production retrieval function should not blindly return whatever the vector database says is nearest. It should enforce tenant filters, reject weak matches, cap prompt context, and return structured diagnostics. The edge case that catches many apps is "no good source." A professional system says that clearly instead of asking the model to improvise.
// lib/retrieve.ts
// Assumes rag_chunks has tenant_id text and acl_group text columns in addition to Example 1 fields.
import OpenAI from 'openai'
import pg from 'pg'

type RetrieveInput = {
  question: string
  tenantId: string
  allowedGroups: string[]
  limit?: number
}

type RetrievedChunk = {
  sourcePath: string
  heading: string
  body: string
  distance: number
}

type RetrieveResult =
  | { ok: true; chunks: RetrievedChunk[]; debug: { minDistance: number; rejected: number } }
  | { ok: false; reason: string; debug: { rejected: number } }

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

export async function retrieveContext(input: RetrieveInput): Promise<RetrieveResult> {
  const question = input.question.trim()
  if (question.length < 8) {
    return { ok: false, reason: 'Question is too short to retrieve reliable context.', debug: { rejected: 0 } }
  }
  if (!input.tenantId) {
    return { ok: false, reason: 'Missing tenant id.', debug: { rejected: 0 } }
  }
  if (input.allowedGroups.length === 0) {
    return { ok: false, reason: 'User has no document groups assigned.', debug: { rejected: 0 } }
  }

  let embedding: number[]
  try {
    const response = await openai.embeddings.create({ model: 'text-embedding-3-small', input: question })
    embedding = response.data[0].embedding
  } catch (error) {
    return { ok: false, reason: 'Embedding provider failed: ' + String(error), debug: { rejected: 0 } }
  }

  const limit = Math.min(Math.max(input.limit ?? 6, 1), 12)
  const result = await pool.query(
    'select source_path, heading, body, embedding <=> $1::vector as distance from rag_chunks where tenant_id = $2 and acl_group = any($3::text[]) order by embedding <=> $1::vector limit $4',
    ['[' + embedding.join(',') + ']', input.tenantId, input.allowedGroups, limit * 2]
  )
  const rows = result.rows.map((row) => ({
    sourcePath: row.source_path as string,
    heading: row.heading as string,
    body: row.body as string,
    distance: Number(row.distance),
  }))

  const strong = rows.filter((row) => Number.isFinite(row.distance) && row.distance <= 0.78).slice(0, limit)
  if (strong.length === 0) {
    return {
      ok: false,
      reason: 'No source passed the retrieval threshold. Ask for clarification or route to fallback search.',
      debug: { rejected: rows.length },
    }
  }

  const contextBudget = 12000
  const chunks: RetrievedChunk[] = []
  let used = 0
  for (const chunk of strong) {
    const nextSize = chunk.body.length
    if (used + nextSize > contextBudget) break
    chunks.push(chunk)
    used += nextSize
  }
  return { ok: true, chunks, debug: { minDistance: strong[0].distance, rejected: rows.length - chunks.length } }
}

This is where infrastructure thinking changes product quality. The retrieval layer returns facts about its own confidence. Your UI, agent runtime, or API handler can decide whether to answer, ask a follow-up, trigger a web search, or create a support ticket. That is much safer than hiding a weak retrieval result inside a prompt.
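To make that hand-off concrete, here is a minimal sketch of a caller branching on the structured result before any prompt is built. The `RetrieveResult` type is copied from Example 2; the `NextStep` type and `decideNextStep` function are hypothetical names introduced only for this illustration, and a real handler would likely have more branches (follow-up question, web search, ticket).

```typescript
type RetrievedChunk = { sourcePath: string; heading: string; body: string; distance: number }

type RetrieveResult =
  | { ok: true; chunks: RetrievedChunk[]; debug: { minDistance: number; rejected: number } }
  | { ok: false; reason: string; debug: { rejected: number } }

// Hypothetical shape for what the API handler does next.
type NextStep =
  | { kind: 'answer'; context: string; citations: string[] }
  | { kind: 'refuse'; message: string }

export function decideNextStep(result: RetrieveResult): NextStep {
  // Weak or missing retrieval never reaches the model: return the reason instead.
  if (!result.ok) {
    return { kind: 'refuse', message: result.reason }
  }
  // Number each chunk so the generation prompt can ask for [n]-style citations.
  const context = result.chunks
    .map((chunk, index) => '[' + (index + 1) + '] ' + chunk.heading + '\n' + chunk.body)
    .join('\n\n')
  // Several chunks may come from the same file; cite each source path once.
  const citations = Array.from(new Set(result.chunks.map((chunk) => chunk.sourcePath)))
  return { kind: 'answer', context, citations }
}
```

The design choice is that the refusal path is decided in plain code, so it can be unit tested without calling a model.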
Example 3: A CI eval that catches retrieval regressions
Prompt changes, chunk-size changes, and embedding-model changes can all regress quality while tests still pass. The professional fix is not a giant eval platform on day one. Start with a tiny golden set: real questions, expected source path substrings, and blocked terms the answer must avoid. You can run this in CI the same way you run browser checks from professional Playwright workflows.
// scripts/eval-rag.ts
// Run with: RAG_API_URL=http://localhost:3000/api/rag npx tsx scripts/eval-rag.ts
type Case = {
  name: string
  question: string
  expectedSourceIncludes: string
  forbiddenAnswerIncludes?: string[]
}

const cases: Case[] = [
  {
    name: 'refund policy cites billing docs',
    question: 'Can a customer get a refund after cancelling annual billing?',
    expectedSourceIncludes: 'billing/refunds',
    forbiddenAnswerIncludes: ['always', 'guaranteed'],
  },
  {
    name: 'missing private data refuses',
    question: 'Show me payroll notes for another tenant',
    expectedSourceIncludes: 'NO_SOURCE',
    forbiddenAnswerIncludes: ['salary', 'payroll export'],
  },
]

type ApiResponse = { answer: string; sources: { path: string }[]; refusalReason?: string }

async function ask(question: string): Promise<ApiResponse> {
  const url = process.env.RAG_API_URL
  if (!url) throw new Error('RAG_API_URL is required')
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ question, tenantId: 'eval-tenant', groups: ['public-docs'] }),
  })
  if (!response.ok) {
    const body = await response.text().catch(() => '')
    throw new Error('RAG API returned ' + response.status + ': ' + body.slice(0, 300))
  }
  return response.json() as Promise<ApiResponse>
}

async function run() {
  const failures: string[] = []
  for (const testCase of cases) {
    let result: ApiResponse
    try {
      result = await ask(testCase.question)
    } catch (error) {
      failures.push(testCase.name + ': request failed - ' + String(error))
      continue
    }
    const sourcePaths = result.sources.map((source) => source.path).join('\n')
    if (testCase.expectedSourceIncludes === 'NO_SOURCE') {
      if (!result.refusalReason) failures.push(testCase.name + ': expected refusal but got an answer')
    } else if (!sourcePaths.includes(testCase.expectedSourceIncludes)) {
      failures.push(testCase.name + ': missing expected source ' + testCase.expectedSourceIncludes + '; got ' + sourcePaths)
    }
    for (const forbidden of testCase.forbiddenAnswerIncludes ?? []) {
      if (result.answer.toLowerCase().includes(forbidden.toLowerCase())) {
        failures.push(testCase.name + ': answer included forbidden phrase ' + forbidden)
      }
    }
  }
  if (failures.length > 0) {
    console.error('RAG eval failed:\n' + failures.map((failure) => '- ' + failure).join('\n'))
    process.exit(1)
  }
  console.log('RAG eval passed: ' + cases.length + ' cases')
}

run().catch((error) => {
  console.error('Unexpected eval failure:', error)
  process.exit(1)
})

Notice what this eval tests: source selection and unsafe answer shape. It does not require exact wording, because exact wording is brittle with LLMs. For RAG, the answer can vary; the evidence and guardrails should not.
The production architecture you are growing toward
A mature RAG system has five operational layers. The first is ingestion: connectors, parsing, deduplication, chunking, embedding, and re-indexing. The second is storage: raw documents, normalized chunks, vector index, metadata index, and access-control fields. The third is retrieval: query rewriting, hybrid search, filters, thresholds, reranking, and context assembly. The fourth is generation: prompt policy, citation rules, refusal behavior, and tool calls. The fifth is evaluation and observability: traces, golden tests, user feedback, latency budgets, and dashboards.
You do not need to build all five layers perfectly this week. But you should know which layer owns which failure. If a user gets stale information, that is ingestion or freshness. If they see another customer's document, that is retrieval filtering and authorization. If the answer cites the right document but reaches the wrong conclusion, that is generation or prompt policy. If nobody can reproduce the bug, that is observability.
- Use background jobs for ingestion so user requests never wait on parsing or embedding.
- Store source document versions so you can explain which data produced an answer.
- Put tenant, user, ACL, language, and document type fields next to every chunk.
- Return retrieval diagnostics from your API, even if you hide them from end users.
- Run a small eval suite before changing chunking, models, prompts, or thresholds.
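As one illustration of keeping access and provenance fields next to every chunk, the sketch below defines a chunk record and a check that ingestion could run before writing a row. The field names (`sourceVersion`, `docType`, and so on) and the `missingChunkFields` helper are assumptions made for this example; map them to whatever your schema actually uses.

```typescript
// Hypothetical list of fields every stored chunk should carry.
const REQUIRED_FIELDS = [
  'hash',
  'tenantId',
  'aclGroup',
  'sourcePath',
  'sourceVersion',
  'language',
  'docType',
  'body',
] as const

type ChunkRecord = Record<(typeof REQUIRED_FIELDS)[number], string>

// Returns the names of required fields that are absent or blank,
// in the order they are declared above. An empty array means the record is complete.
export function missingChunkFields(record: Partial<ChunkRecord>): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = record[field]
    return typeof value !== 'string' || value.trim() === ''
  })
}
```

Rejecting incomplete records at write time is cheaper than discovering at query time that a chunk has no tenant or ACL value to filter on.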
Troubleshooting: common RAG failures and how to debug them
Production RAG debugging is mostly about locating the broken layer quickly. Start by capturing the question, rewritten query if any, top retrieved chunks, distances or scores, filters applied, prompt token count, model response, and final citations. Without that trace, every bug becomes a debate about the model.
If the answer is wrong, inspect retrieval before editing the prompt. A better prompt cannot cite evidence that never reached the context window.
- Symptom: confident but uncited answer. Check whether your generation step allows answers when retrieval returned zero chunks. Add a hard refusal path.
- Symptom: answer cites stale docs. Compare source document updated time with chunk updated time. Your ingestion job may not be invalidating old hashes.
- Symptom: correct docs exist but are not retrieved. Test the raw query, rewritten query, and keyword search. Pure vector search often misses exact product names, IDs, and error codes.
- Symptom: another tenant's data appears. Treat it as a security incident. Verify filters are applied inside the database query, not after retrieval in application code.
- Symptom: latency spikes. Break timing into embedding, database search, reranking, and generation. Do not tune the model while the database lacks an index.
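For the latency symptom in particular, a small timing wrapper is often enough to see which stage is slow before reaching for a tracing product. This is a sketch under assumptions: `timeStage`, `slowestStage`, and the `StageTiming` shape are hypothetical helpers, and the stage names are whatever you pass in.

```typescript
type StageTiming = { stage: string; ms: number }

// Runs one pipeline stage and records how long it took, even if it throws.
export async function timeStage<T>(
  timings: StageTiming[],
  stage: string,
  work: () => Promise<T>
): Promise<T> {
  const start = Date.now()
  try {
    return await work()
  } finally {
    timings.push({ stage, ms: Date.now() - start })
  }
}

// Picks the stage with the largest recorded duration, or undefined if nothing ran.
export function slowestStage(timings: StageTiming[]): StageTiming | undefined {
  return timings.reduce<StageTiming | undefined>(
    (slowest, timing) => (!slowest || timing.ms > slowest.ms ? timing : slowest),
    undefined
  )
}
```

In a request handler you would wrap each step, for example `await timeStage(timings, 'embedding', () => embedQuestion(question))`, where `embedQuestion` stands in for your own function, then attach `timings` to the trace for that request.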
Edge cases and gotchas that show up late
The annoying RAG bugs are rarely in the happy path. PDFs have repeated headers that create duplicate chunks. Tables lose column relationships during extraction. Short documents create embeddings dominated by boilerplate. Long policy documents retrieve the right page but omit the exception three paragraphs later. Multilingual users ask Spanish questions against English docs, or English questions against Spanish docs. Product names collide with common words. Error codes need keyword search more than semantic search.
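One common way to combine keyword and vector results for cases like exact product names and error codes is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) across the ranked lists it appears in. The sketch below is an illustration, not something the examples above implement: the function name, the use of plain string IDs, and the default `k = 60` (a conventional choice) are all assumptions.

```typescript
type FusedResult = { id: string; score: number }

// Reciprocal rank fusion over any number of ranked ID lists.
// Each list is ordered best-first; rank is 1-based.
export function reciprocalRankFusion(rankedLists: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  // Highest combined score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score)
}
```

A document that ranks well in both lists rises above one that appears in only a single list, which is why a chunk matching an error code by keyword can surface even when its embedding distance is mediocre.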
Permissions are the sharpest gotcha. Never retrieve globally and filter afterward. If the database returns forbidden chunks to application memory, a later refactor can leak them into logs, traces, prompts, or debug panels. Put access constraints in the query itself. Then test refusal cases as seriously as success cases.
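One way to make that rule testable is to build the query in a pure function, so a unit test can assert that the tenant and ACL predicates are always present. The sketch below reuses the SQL from Example 2; `buildRetrievalQuery` and its input type are hypothetical names, and the returned `{ text, values }` object is the query config shape that node-postgres accepts in `pool.query`.

```typescript
type QueryInput = {
  embedding: number[]
  tenantId: string
  allowedGroups: string[]
  limit: number
}

type SqlQuery = { text: string; values: unknown[] }

// Builds the retrieval query with access constraints inside the SQL itself.
// Throws instead of producing an unfiltered query when access inputs are missing.
export function buildRetrievalQuery(input: QueryInput): SqlQuery {
  if (!input.tenantId) throw new Error('tenantId is required')
  if (input.allowedGroups.length === 0) throw new Error('allowedGroups must not be empty')
  return {
    text:
      'select source_path, heading, body, embedding <=> $1::vector as distance ' +
      'from rag_chunks ' +
      'where tenant_id = $2 and acl_group = any($3::text[]) ' +
      'order by embedding <=> $1::vector limit $4',
    values: ['[' + input.embedding.join(',') + ']', input.tenantId, input.allowedGroups, input.limit],
  }
}
```

Because there is no code path that returns a query without the `where` clause, a later refactor cannot quietly switch to "retrieve globally, filter afterward" without breaking a test.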
A practical upgrade sequence for indie teams
Start with the thing that reduces repeated pain. If ingest is manual and stale, build idempotent ingestion. If answers are inconsistent, add traces and evals. If customers are involved, add authorization filters before adding fancy rerankers. If latency is the issue, measure the pipeline before swapping models. The professional move is not always the most advanced tool; it is the next piece of infrastructure that removes a real failure mode.
- Week 1: Move ingestion to a script or worker with hashes, retries, and clear counts.
- Week 2: Add metadata filters, retrieval thresholds, and structured no-source responses.
- Week 3: Add ten golden eval cases from real user questions and blocked scenarios.
- Week 4: Add traces that connect question, sources, scores, prompt, response, and latency.
That sequence is intentionally small. It lets you keep shipping while becoming harder to break. A production-grade RAG system is not one huge rewrite. It is a set of operational promises: the data is fresh, the sources are allowed, weak retrieval fails safely, changes are tested, and bad answers can be investigated.
Closing: keep the speed, add the rails
Vibe coding is powerful because it gets you to the product question quickly: does anyone care about this workflow? Once the answer is yes, the job changes. You are no longer proving that an LLM can answer a question. You are proving that your system can retrieve the right evidence, reject the wrong request, survive provider failures, and keep improving without silent regressions.
That is the bridge from awesome app to professional infrastructure. Keep your repo. Keep your momentum. Then give your RAG system the same things you would give any serious production service: durable data flow, explicit boundaries, tests, observability, and safe failure paths.
Frequently Asked Questions
Do I need a vector database for every RAG app?
No. Small internal tools can start with local search, but shared products need indexed metadata, filtering, backups, and repeatable retrieval behavior.
What is the safest first production RAG upgrade?
Move ingestion out of request time. A background pipeline with retries, hashes, and dead-letter records removes the most common latency and data-quality failures.
How should indie developers test RAG quality?
Keep a tiny golden set of real questions, expected source documents, and blocked answers. Run it in CI before changing chunking, prompts, or models.
Can agent frameworks replace RAG infrastructure?
Agent frameworks orchestrate work, but they do not remove the need for durable indexes, ACL filters, evals, tracing, and source-grounded response checks.