Scaling Beyond the Repo: Building Production-Grade Test Infrastructure for 100+ AI Agents and RAG Apps
Your repo test folder is enough for a prototype; production AI systems need shared harnesses, replayable traces, and release gates that scale with every new agent.

The first version of an AI agent usually lives inside one repo: a prompt file, a route handler, a few tool functions, a vector search helper, and a small test folder that proves the happy path. That is a reasonable way to start. It is also the point where many teams accidentally build a ceiling over their own product.
Once you have 10 agents, the repo test folder starts to creak. Once you have 100 agents, multiple RAG apps, and customer-specific tool permissions, it collapses. Every agent has a slightly different fixture format. Every retrieval test uses a different seed corpus. Every failed run leaves a Slack screenshot instead of a replayable trace. The codebase still has tests, but the organization no longer has test infrastructure.
The industry data explains why this moment matters. The Stack Overflow 2025 Developer Survey reports that 84% of respondents are using or planning to use AI tools in development, up from 76% the prior year. GitHub's Octoverse coverage says nearly 80% of new developers on GitHub use Copilot within their first week. Adoption is no longer the hard part. Proving that AI-assisted systems behave correctly is.
This is a Level Up guide for the developer moving from "my repo has tests" to "my AI product has a release system." You will keep the fast local loop that made vibe coding fun, but add the pieces professionals rely on: a test-case registry, replayable agent runs, retrieval contracts, tool sandboxes, CI risk tiers, and observability that makes the next failure diagnosable. For adjacent migration work, pair this with our RAG infrastructure guide and our agent runtime architecture deep dive.
What changes when AI tests move beyond one repo?
The unit of quality shifts from functions to runs: inputs, retrieved evidence, tool calls, model output, side effects, and trace context.
Traditional app tests usually verify deterministic code paths. Given the same database row and HTTP request, the app should return the same result. AI systems add probabilistic model behavior, changing retrieval corpora, external tools, asynchronous jobs, and policy layers. A passing test that only checks the final text is too thin because the answer may be right for the wrong reason, right with forbidden evidence, or right while calling a tool it should never touch.
Production-grade AI testing treats each run as a structured object. A run contains the prompt version, model, parameters, fixture identity, user role, retrieved chunk IDs, tool call arguments, output, evaluator scores, latency, cost, and trace links. That object can be replayed, compared, quarantined, and promoted. Your repo tests can still exercise functions, but your release gates need to reason about runs.
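As a rough illustration of what "a run as a structured object" could look like, here is a minimal sketch using standard-library dataclasses. The class names `AgentRun` and `ToolCall`, the helper `diff_runs`, and every field name are assumptions made for this example, not a prescribed schema; the fields simply mirror the list in the paragraph above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class AgentRun:
    # Hypothetical field names; adapt them to your own trace schema.
    case_id: str
    prompt_version: str
    model: str
    parameters: dict
    user_role: str
    retrieved_chunk_ids: list
    tool_calls: list
    output: str
    evaluator_scores: dict = field(default_factory=dict)
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    trace_url: Optional[str] = None

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable so stored runs diff cleanly.
        return json.dumps(asdict(self), sort_keys=True)


def diff_runs(before: AgentRun, after: AgentRun) -> dict:
    """Return {field: (old, new)} for every field that differs between two runs."""
    old, new = asdict(before), asdict(after)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

Comparing two runs of the same case then becomes a data operation: if only `prompt_version` and `output` differ, the prompt change is the likely cause; if `retrieved_chunk_ids` differ as well, look at the retriever first.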
Comparison: repo-local checks are fast, but infrastructure checks are reusable across every agent.
| Need | Repo-only test | Production infrastructure |
|---|---|---|
| Prompt coverage | Inline examples | Versioned case registry |
| RAG correctness | Snapshot final answer | Assert source IDs and citations |
| Agent tools | Mock one function | Sandbox, allowlist, trace calls |
| Failure debugging | Read CI logs | Replay the exact run with captured evidence |
Can one CI pipeline test 100+ AI agents?
Yes, if every agent plugs into one typed harness and CI runs risk-based slices instead of every expensive scenario on every commit.
The mistake is trying to make each agent special. You do not want 100 bespoke Jest files that each know how to seed data, stub tools, call the model, parse output, and decide pass or fail. You want one runner that accepts typed cases. The cases describe what matters: fixture data, user role, task, allowed tools, expected evidence, forbidden behavior, and risk tier. The runner owns the mechanics.
```python
# tests/ai/case_registry.py
import json
from pathlib import Path

from pydantic import BaseModel, Field, ValidationError


class AgentCase(BaseModel):
    id: str = Field(min_length=1)
    agent: str
    risk: str
    user_role: str
    input: str = Field(min_length=5)
    allowed_tools: list[str] = []
    expected_sources: list[str] = []
    forbidden_phrases: list[str] = []


def load_agent_cases(path: str, risk: str) -> list[AgentCase]:
    try:
        raw = json.loads(Path(path).read_text(encoding="utf-8"))
        cases = [AgentCase.model_validate(item) for item in raw]
    except (OSError, json.JSONDecodeError, ValidationError) as exc:
        raise RuntimeError(f"Invalid AI test registry: {exc}") from exc

    # Check IDs across the whole registry, not only the selected tier, so a
    # duplicate in another tier cannot corrupt dashboards or quarantine lists.
    seen = set()
    for case in cases:
        if case.id in seen:
            raise RuntimeError(f"Duplicate AI test case id: {case.id}")
        seen.add(case.id)

    selected = [case for case in cases if case.risk == risk]
    if not selected:
        raise RuntimeError(f"No AI test cases found for risk tier {risk}")
    return selected
```

This registry does three important things. First, it rejects malformed tests before they produce confusing model failures. Second, it forces every case to declare risk. Third, it makes duplicate IDs impossible, which matters when quarantine lists, dashboards, and historical pass rates all key off the case ID. The edge case is empty selection: if CI asks for release tests and no release cases exist, the harness fails loudly instead of publishing a false green build.
Risk tiers keep cost sane. Run smoke cases on each pull request. Run release cases before deploy. Run broad regression, adversarial, and language-variant suites nightly. That shape is especially important for indie developers because your CI bill, model bill, and attention span are all finite. Professional does not mean testing everything every minute; it means the right test has the authority to block the right change.
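One way to wire that schedule into CI is a small lookup from trigger to tiers. This is a sketch under assumptions: the trigger names (`pull_request`, `pre_deploy`, `nightly`), the tier labels, and the helper `select_cases` are hypothetical, and cases are plain dicts here to keep the example self-contained.

```python
# Hypothetical trigger and tier names; rename them to match your CI system.
TIERS_BY_TRIGGER = {
    "pull_request": ["smoke"],
    "pre_deploy": ["release"],
    "nightly": ["regression", "adversarial", "language_variant"],
}


def tiers_for_trigger(trigger: str) -> list:
    if trigger not in TIERS_BY_TRIGGER:
        raise ValueError(f"Unknown CI trigger: {trigger!r}")
    return list(TIERS_BY_TRIGGER[trigger])


def select_cases(cases: list, trigger: str) -> list:
    """Keep only the cases whose risk tier is scheduled for this trigger."""
    tiers = set(tiers_for_trigger(trigger))
    selected = [case for case in cases if case["risk"] in tiers]
    if not selected:
        # Same policy as the registry: an empty slice must not look green.
        raise RuntimeError(f"No cases scheduled for trigger {trigger!r}")
    return selected
```

Keeping the mapping in one place means adding a tier is a one-line change, and an unknown trigger fails the job instead of silently running nothing.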
The agent harness: sandbox tools and capture every decision
Agent tests are not just prompt tests. Agents can browse, call APIs, write records, send messages, trigger jobs, and mutate state. The harness must intercept those tool calls with the same seriousness a web test gives network requests. A good agent harness gives each run a sandbox, a tool allowlist, deterministic fixtures, and a trace sink.
```python
# tests/ai/run_agent_case.py
class ToolSandbox:
    """Exposes only allowlisted tools and records every call for the trace."""

    def __init__(self, allowed: list[str]):
        self.allowed = set(allowed)
        self.trace = []

    def search_docs(self, query: str):
        if "search_docs" not in self.allowed:
            raise PermissionError("search_docs is not allowed")
        if len(query.strip()) < 3:
            raise ValueError("search_docs query is too short")
        self.trace.append(("search_docs", query))
        # Deterministic fixture data instead of a live index.
        return [("docs/install", "Install with pnpm and set API keys.")]

    def issue_refund(self, order_id: str, amount_cents: int):
        if "issue_refund" not in self.allowed:
            raise PermissionError("issue_refund is not allowed")
        if amount_cents <= 0:
            raise ValueError("Refund amount must be positive")
        self.trace.append(("issue_refund", order_id, amount_cents))
        # Dry run: the call is recorded, but no refund is issued.
        return "dry-run-refund"


def run_agent_case(case, agent_entrypoint):
    sandbox = ToolSandbox(case.allowed_tools)
    result = agent_entrypoint(case.input, sandbox)

    # The final output must not contain any forbidden phrase.
    for phrase in case.forbidden_phrases:
        if phrase.lower() in result.output.lower():
            raise AssertionError(
                f"Forbidden phrase leaked in {case.id}: {phrase}"
            )

    # The agent must have used every expected source.
    for source_id in case.expected_sources:
        if source_id not in result.sources:
            raise AssertionError(
                f"Missing expected source {source_id} in {case.id}"
            )
    return result, sandbox.trace
```

Notice the boundary: the agent can only use tools passed by the harness. The refund tool is dry-run only. The docs tool validates input before returning fixture data. The test asserts both final output and intermediate evidence. If an agent produces a polished answer while ignoring the expected source, the test fails because the reasoning path matters.
The key gotcha is tool drift. A product engineer adds a new production tool and forgets to expose it in the test harness. That should fail in the harness contract, not silently skip behavior. Keep a generated inventory of production tools and compare it against sandbox tools in CI. Missing tool, mismatched schema, or unhandled side effect should block release.
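The inventory comparison could be sketched roughly as follows. It assumes the production inventory is available as a mapping from tool name to parameter names (how you generate that is up to your stack), and it derives the sandbox side by introspecting the sandbox class with the standard `inspect` module. The function names `sandbox_inventory` and `find_tool_drift` are hypothetical.

```python
import inspect


def sandbox_inventory(sandbox_cls) -> dict:
    """Map each public sandbox method to its parameter names (without self)."""
    inventory = {}
    for name, func in inspect.getmembers(sandbox_cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip __init__ and private helpers
        inventory[name] = set(inspect.signature(func).parameters) - {"self"}
    return inventory


def find_tool_drift(production_tools: dict, sandbox_tools: dict) -> list:
    """Return human-readable drift problems; an empty list means in sync."""
    problems = []
    for name in sorted(production_tools.keys() - sandbox_tools.keys()):
        problems.append(f"missing sandbox tool: {name}")
    for name in sorted(sandbox_tools.keys() - production_tools.keys()):
        problems.append(f"sandbox tool not in production: {name}")
    for name in sorted(production_tools.keys() & sandbox_tools.keys()):
        if set(production_tools[name]) != set(sandbox_tools[name]):
            problems.append(
                f"schema mismatch for {name}: "
                f"production={sorted(production_tools[name])} "
                f"sandbox={sorted(sandbox_tools[name])}"
            )
    return problems
```

In CI, a non-empty result from `find_tool_drift` would fail the job. Comparing parameter names is a deliberately shallow schema check; a stricter version could compare types or full JSON schemas.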
RAG tests should assert evidence, not vibes
RAG applications fail in ways normal chat tests miss. The final answer can look plausible while the retriever used stale chunks, crossed tenant boundaries, ignored permissions, or cited a source that does not support the claim. Snapshotting only the answer rewards fluent guessing. Production RAG tests need to inspect the retrieval layer.
```python
# tests/rag/assert_rag_contract.py
from datetime import datetime, timedelta, timezone


def assert_rag_contract(case_id, tenant_id, chunks, answer, expected_chunk_ids):
    if not chunks:
        raise AssertionError(f"{case_id}: retriever returned no chunks")

    # Chunks last updated before this cutoff count as stale.
    # The comparison requires chunk.updated_at to be timezone-aware.
    stale_cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for chunk in chunks:
        if chunk.tenant_id != tenant_id:
            raise AssertionError(
                f"{case_id}: cross-tenant chunk leaked: {chunk.id}"
            )
        if chunk.updated_at < stale_cutoff:
            raise AssertionError(f"{case_id}: stale chunk used: {chunk.id}")

    returned_ids = {chunk.id for chunk in chunks}
    for expected in expected_chunk_ids:
        if expected not in returned_ids:
            raise AssertionError(
                f"{case_id}: missing expected evidence {expected}"
            )

    # Heuristic: a claim of docs support needs at least one "[...]" marker.
    if "according to the docs" in answer.lower() and "[" not in answer:
        raise AssertionError(
            f"{case_id}: answer claims docs support without citations"
        )
```

This code makes hidden RAG assumptions executable. Cross-tenant leakage is an immediate failure. Stale evidence is visible. Expected evidence is explicit. Citation claims must be backed by citation markers. You can adapt the rules: legal and healthcare apps may require stricter freshness, while internal knowledge bases may tolerate older policy pages with manual approval. The important move is turning retrieval expectations into code.
In our experience, the highest-value RAG cases are not broad trivia prompts. They are product-specific decisions: "Can this user see this document?", "Should the assistant abstain when only old evidence exists?", and "Does the answer cite the same document a human support engineer would use?" Those cases catch production risk before a model-quality score does.
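The abstention question can be turned into a check of its own. The sketch below is one possible policy for products that allow old chunks to be retrieved but not answered from, which differs from a contract that rejects stale chunks outright. The `Chunk` class, the `ABSTAIN_MARKERS` phrases, and the 90-day default are assumptions; phrase matching is a crude stand-in for however your product signals abstention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Chunk:
    id: str
    updated_at: datetime  # assumed timezone-aware, like the `now` argument


# Hypothetical phrases; replace with your product's abstention signal.
ABSTAIN_MARKERS = ("i don't have current information", "i can't confirm")


def only_stale_evidence(chunks, now, max_age_days=90) -> bool:
    """True when chunks were retrieved but every one is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return bool(chunks) and all(chunk.updated_at < cutoff for chunk in chunks)


def assert_abstains_on_stale_evidence(case_id, chunks, answer, now):
    if not only_stale_evidence(chunks, now):
        return  # fresh evidence exists, so answering is allowed
    lowered = answer.lower()
    if not any(marker in lowered for marker in ABSTAIN_MARKERS):
        raise AssertionError(
            f"{case_id}: answered from stale evidence instead of abstaining"
        )
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the case deterministic when it is replayed later.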
CI gates: block on contracts, trend on model quality
Not every AI metric should block a pull request. Deterministic contracts should block: forbidden tool use, schema violations, permission leaks, missing expected evidence, invalid JSON, and broken browser flows. Subjective model-quality scores should usually trend over time unless the product area is high risk. That distinction keeps CI useful instead of turning it into a slot machine.
```python
# tests/ai/ci_gate.py
def decide_ai_gate(outcomes):
    contract_failures = [item for item in outcomes if not item.contract_passed]
    if contract_failures:
        return (
            "fail",
            "contract failures",
            [item.trace_url or "missing-trace" for item in contract_failures],
        )

    release_scores = [
        item.quality_score for item in outcomes if item.risk == "release"
    ]
    if not release_scores:
        return "fail", "no release cases configured", []

    average = sum(release_scores) / len(release_scores)
    if average < 0.82:
        return (
            "warn",
            f"release quality score below trend threshold: {average:.2f}",
            [],
        )
    return "pass", "contracts passed", []
```

This gate blocks on crisp failures and warns on aggregate quality drift. The edge case is missing release coverage: that is a failure because an empty suite should never masquerade as a pass. Another edge case is a missing trace URL. The gate still fails, but it labels the debugging debt so the team fixes instrumentation instead of debating the model answer in the pull request.
The 2024 DORA research is useful context here: DORA reported that AI adoption can improve individual productivity and flow while creating tradeoffs for delivery stability and throughput. That is exactly why gates need to be boring. AI makes it easier to create more code, more prompts, more agents, and more variants. The test infrastructure has to preserve release discipline as output accelerates.
Troubleshooting: how to debug failing AI infrastructure
When an AI test fails, resist the urge to rerun immediately. First classify the failure. Infrastructure failures mean the harness, fixture, queue, sandbox, or trace sink broke. Product failures mean the agent, retriever, permission model, or UI broke. Model variation means the run stayed inside the contract but produced a weaker answer. Each class needs a different response.
- Missing trace: fail the gate and fix instrumentation. A non-replayable AI failure is not actionable at scale.
- Wrong source IDs: inspect index freshness, embedding version, tenant filters, and chunking. Do not tune the prompt first.
- Forbidden tool call: compare production tool schemas to sandbox allowlists and verify user-role policy before blaming the model.
- Quality score dip: replay the exact case with captured evidence, then decide whether to update prompt, evaluator, or fixture.
- CI-only failure: check concurrency, rate limits, clock-dependent fixtures, and shared vector indexes. Local determinism can hide shared service contention.
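The classification step described above can be made mechanical so that triage starts from the same question every time. This is a minimal sketch: the enum, the two boolean inputs, and the `NEXT_STEP` wording are assumptions, and a real harness would derive `harness_error` from its own exception types.

```python
from enum import Enum


class FailureClass(Enum):
    INFRASTRUCTURE = "infrastructure"
    PRODUCT = "product"
    MODEL_VARIATION = "model_variation"


# Hypothetical first responses, paraphrasing the guidance in this section.
NEXT_STEP = {
    FailureClass.INFRASTRUCTURE: "fix the harness, fixture, sandbox, or trace sink",
    FailureClass.PRODUCT: "inspect the agent, retriever, or permission model",
    FailureClass.MODEL_VARIATION: "replay the case with captured evidence",
}


def classify_failure(harness_error: bool, contract_passed: bool) -> FailureClass:
    """Classify a failed run before anyone reruns it."""
    if harness_error:
        # The test machinery itself broke, so the result says nothing
        # about the product yet.
        return FailureClass.INFRASTRUCTURE
    if not contract_passed:
        # The harness ran cleanly, but the run violated a contract.
        return FailureClass.PRODUCT
    # The run stayed inside the contract and only the quality score dipped.
    return FailureClass.MODEL_VARIATION
```

For example, `NEXT_STEP[classify_failure(harness_error=False, contract_passed=False)]` points at the product rather than at the prompt.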
The professional habit is to attach a trace to every failure. A trace should answer: which case ran, which prompt version executed, which chunks were retrieved, which tools were called, which evaluator judged the output, and what changed since the last pass. If your test output cannot answer those questions, your next infrastructure task is not adding more tests. It is making the existing tests explain themselves.
Migration plan: from repo tests to platform tests
You do not need to build the whole platform in one sprint. Start by extracting test cases from individual repos into a shared JSON or YAML registry. Add a runner that can execute one agent case locally. Capture traces to a directory before you wire up dashboards. Then add CI tiers. Finally, standardize the contract between product repos and the harness.
- Week 1: define the case schema, risk tiers, and minimum trace shape.
- Week 2: move the top 20 smoke cases into the registry and run them in CI.
- Week 3: add RAG evidence assertions for source IDs, tenant filters, and freshness.
- Week 4: sandbox agent tools and block release on forbidden side effects.
- Week 5: add nightly suites, trend reports, quarantine policy, and ownership labels.
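For the "minimum trace shape" in Week 1 and the "capture traces to a directory" step, something as small as the following may be enough to start. The required keys, the one-file-per-case layout, and the function name `write_trace` are assumptions for illustration; the sketch also assumes case IDs are safe to use as file names.

```python
import json
from pathlib import Path

# Hypothetical minimum trace shape; extend it as your dashboards need more.
REQUIRED_TRACE_KEYS = {
    "case_id",
    "prompt_version",
    "retrieved_chunk_ids",
    "tool_calls",
    "evaluator",
    "output",
}


def write_trace(trace_dir: str, trace: dict) -> Path:
    """Validate the minimum shape, then write one JSON file per case."""
    missing = REQUIRED_TRACE_KEYS - trace.keys()
    if missing:
        raise ValueError(f"Trace is missing required keys: {sorted(missing)}")
    directory = Path(trace_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumes case IDs contain no path separators.
    path = directory / f"{trace['case_id']}.json"
    path.write_text(json.dumps(trace, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Rejecting incomplete traces at write time keeps the directory trustworthy: every file in it can answer which case ran, with which prompt version, chunks, tools, and evaluator.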
The end state is not a giant QA machine. It is a small, composable platform that lets every new agent inherit mature testing from day one. That is the real level up: you keep shipping quickly, but every prompt, retriever, and tool call now passes through a system designed for production instead of a folder designed for a prototype.
Frequently Asked Questions
When should a vibe-coded AI app get dedicated test infrastructure?
Add it once prompts, retrieval, or agent tools affect real users. Keep repo tests, then introduce shared fixtures, eval gates, traces, and replayable failures.
Do LLM evals replace Playwright or API tests?
No. Evals judge model behavior. Playwright and API tests prove product flows, permissions, network boundaries, and UI contracts still work around that behavior.
How do you test 100+ AI agents without a huge QA team?
Standardize every agent behind the same harness: typed cases, sandboxed tools, deterministic fixtures, trace capture, risk tags, and small blocking CI gates.
What is the biggest gotcha in RAG test suites?
Teams often snapshot final answers while ignoring retrieval evidence. Assert source IDs, chunk freshness, permissions, citations, and abstention behavior first.