June 18, 2026

Scaling Beyond the Repo: Production Test Infrastructure for 100+ AI Agents and RAG Apps

Your first agent demo can live in one repo; your hundredth needs queues, fixtures, evals, traces, and failure budgets.

A single AI agent is forgiving. You can run it from your laptop, paste a few prompts into a local script, check the output by eye, and ship a small demo. That is a reasonable place to start. The problem begins when the demo becomes a product: one agent becomes twelve, a RAG prototype gets three indexes, your browser automation touches payment and account settings, and every small prompt change can quietly break a workflow that used to pass.

This is the moment where beginner tooling starts to feel unfair. The issue is not that vibe coding was bad. It helped you move quickly enough to discover a useful workflow. The issue is that production AI systems are distributed systems with probabilistic components. They need the same professional habits that mature web teams already use: stable fixtures, test environments, traceable runs, CI gates, ownership, and fast debugging loops.

The adoption pressure is real. The Stack Overflow 2025 Developer Survey reports that 84% of respondents use or plan to use AI tools in their development process, and 51% of professional developers use AI tools daily. GitHub also reported in Octoverse 2024 that Python overtook JavaScript as the most popular language on GitHub, with AI and data science work as major drivers. More people are building AI systems; fewer can rely on manual inspection as their quality strategy.

This guide is for the indie developer or small team ready to level up. We will build a practical test infrastructure model for 100+ agents and RAG apps, using patterns you can adopt incrementally. If you are already writing Playwright, Cypress, or simple Jest tests, this is the bridge from repo-local confidence to production-grade AI verification. For the browser side of the stack, pair this with our Playwright and Cypress flake deep dive when you want to harden end-to-end checks.

What changes when one repo becomes 100 agents?

Production AI test infrastructure is the shared layer that turns local agent demos into repeatable, observable, CI-gated systems.

At small scale, the repo is the boundary. You run tests near the code, inspect failures near the commit, and understand most dependencies from memory. At agent scale, the runtime graph is the boundary. One user request might invoke a planner, a retriever, a browser agent, a billing tool, a summarizer, a notification worker, and a human approval checkpoint. The repo no longer tells you whether the system works.

The upgrade is mental as much as technical: stop asking, "Does this prompt work?" Start asking, "Can this agent complete its contract with known inputs, bounded tools, traceable state, and diagnosable failure?"

A professional AI testing setup separates four concerns. Fixtures define the world the agent sees. Manifests define the agent contract. Runners execute many contracts consistently. Evidence stores traces, artifacts, and metrics so failures can be debugged later. Once those layers exist, you can add parallelism, queues, dashboards, and policy gates without rewriting every test.

| Local demo pattern | Production infrastructure pattern | Why it scales |
| --- | --- | --- |
| Hard-coded prompt script | Versioned agent manifest | Contracts can be reviewed, sharded, and compared across releases. |
| Manual RAG spot checks | Retrieval fixtures plus answer assertions | Index, chunking, and model changes fail before users find them. |
| One CI job | Shard-aware runner with retries and quarantine | Large suites stay fast without hiding persistent failures. |
| Screenshots in chat | Trace, log, and artifact storage | Failures remain debuggable after the CI container disappears. |

Layer 1: Use manifests instead of scattered prompt scripts

A manifest is a boring file with a powerful job: it turns an agent from an informal script into a testable unit. It names the agent, the tools it can call, the fixtures it needs, the environment variables it requires, the timeout it must fit inside, and the checks that define success. This matters because large suites fail in administrative ways before they fail in interesting AI ways: missing credentials, stale test users, tool permissions, accidental live writes, and unbounded retries.

The runner below is deliberately small, but it demonstrates production behaviors: schema validation without a framework, missing environment detection, shard selection for CI, quarantine that does not hide the failure, timeouts, artifact paths, and edge-case handling for empty suites.

// agent-test-runner.ts
// Run with: npx tsx agent-test-runner.ts ./agent-tests.json
import { readFile, mkdir, writeFile } from 'node:fs/promises'
import { createHash } from 'node:crypto'

type AgentCase = {
  id: string
  agent: string
  prompt: string
  requiredEnv: string[]
  allowedTools: string[]
  timeoutMs: number
  quarantined?: boolean
}

type Result = {
  id: string
  status: 'passed' | 'failed' | 'quarantined'
  durationMs: number
  error?: string
  artifactPath: string
}

function assertCase(value: unknown): AgentCase {
  const item = value as Partial<AgentCase>
  if (!item.id || !item.agent || !item.prompt) {
    throw new Error('Invalid case: id, agent, and prompt are required')
  }
  if (!Array.isArray(item.requiredEnv) || !Array.isArray(item.allowedTools)) {
    throw new Error('Invalid case ' + item.id + ': requiredEnv and allowedTools must be arrays')
  }
  if (typeof item.timeoutMs !== 'number' || !Number.isInteger(item.timeoutMs) || item.timeoutMs < 1000) {
    throw new Error('Invalid case ' + item.id + ': timeoutMs must be an integer of at least 1000')
  }
  return item as AgentCase
}

async function runAgent(testCase: AgentCase, signal: AbortSignal): Promise<string> {
  for (const key of testCase.requiredEnv) {
    if (!process.env[key]) {
      throw new Error('Missing required environment variable: ' + key)
    }
  }

  // Replace this adapter with your real agent SDK call.
  await new Promise((resolve, reject) => {
    const timer = setTimeout(resolve, 100)
    signal.addEventListener('abort', () => {
      clearTimeout(timer)
      reject(new Error('Agent timed out after ' + testCase.timeoutMs + 'ms'))
    })
  })

  if (testCase.prompt.includes('DROP TABLE')) {
    throw new Error('Safety edge case failed: destructive prompt was not blocked')
  }

  return 'ok:' + createHash('sha256').update(testCase.prompt).digest('hex').slice(0, 12)
}

function pickShard<T>(items: T[]): T[] {
  const total = Number(process.env.CI_NODE_TOTAL || '1')
  const index = Number(process.env.CI_NODE_INDEX || '0')
  if (!Number.isInteger(total) || !Number.isInteger(index) || total < 1 || index < 0 || index >= total) {
    throw new Error('Invalid shard config: CI_NODE_INDEX must be within CI_NODE_TOTAL')
  }
  return items.filter((_, position) => position % total === index)
}

async function main() {
  const manifestPath = process.argv[2]
  if (!manifestPath) throw new Error('Usage: npx tsx agent-test-runner.ts ./agent-tests.json')

  const raw = JSON.parse(await readFile(manifestPath, 'utf8')) as unknown
  if (!Array.isArray(raw)) throw new Error('Manifest root must be an array')

  const cases = pickShard(raw.map(assertCase))
  if (cases.length === 0) {
    console.warn('No test cases selected for this shard; check CI_NODE_TOTAL and manifest size')
    return
  }

  await mkdir('.artifacts/agents', { recursive: true })
  const results: Result[] = []

  for (const testCase of cases) {
    const started = Date.now()
    const controller = new AbortController()
    const timer = setTimeout(() => controller.abort(), testCase.timeoutMs)
    const artifactPath = '.artifacts/agents/' + testCase.id + '.json'
    let status: Result['status']
    let output: string | undefined
    let error: string | undefined

    try {
      output = await runAgent(testCase, controller.signal)
      status = testCase.quarantined ? 'quarantined' : 'passed'
    } catch (caught) {
      error = caught instanceof Error ? caught.message : String(caught)
      // Quarantined cases still run and keep their error; they just do not block CI.
      status = testCase.quarantined ? 'quarantined' : 'failed'
    } finally {
      clearTimeout(timer)
    }

    // Record the result exactly once, then persist evidence outside the try block
    // so an artifact write error cannot add a second result for the same case.
    results.push({
      id: testCase.id,
      status,
      durationMs: Date.now() - started,
      ...(error ? { error } : {}),
      artifactPath,
    })
    await writeFile(artifactPath, JSON.stringify({ testCase, output, error }, null, 2))
  }

  const blockingFailures = results.filter((result) => result.status === 'failed')
  console.table(results)
  if (blockingFailures.length > 0) {
    process.exitCode = 1
  }
}

main().catch((error) => {
  console.error(error instanceof Error ? error.stack : error)
  process.exit(1)
})

The important detail is not the runner itself but the fact that each test has a contract independent of the implementation. You can swap your local agent function for a cloud worker, a queue job, or a hosted browser agent without changing what the suite means.
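To make the contract concrete, here is a minimal sketch of what two manifest entries might look like, following the AgentCase shape from the runner. The agent names, tool names, and environment variable names are hypothetical placeholders. The duplicate-ID check is an addition the runner does not perform; it is worth considering because artifact paths are keyed by id, so two cases with the same id would overwrite each other's evidence.

```typescript
// manifest-example.ts
// Hypothetical manifest entries; agent, tool, and env names are illustrative only.
type AgentCase = {
  id: string
  agent: string
  prompt: string
  requiredEnv: string[]
  allowedTools: string[]
  timeoutMs: number
  quarantined?: boolean
}

const exampleManifest: AgentCase[] = [
  {
    id: 'support-refund-policy',
    agent: 'support-triage',
    prompt: 'What is the refund policy?',
    requiredEnv: ['SUPPORT_AGENT_API_KEY'],
    allowedTools: ['none'],
    timeoutMs: 15000,
  },
  {
    id: 'billing-invoice-lookup',
    agent: 'billing-helper',
    prompt: 'Find the latest invoice for the test account.',
    requiredEnv: ['BILLING_SANDBOX_TOKEN'],
    allowedTools: ['billing.lookup_invoice'],
    timeoutMs: 20000,
    quarantined: true,
  },
]

// Returns every id that appears more than once in the manifest.
function findDuplicateIds(cases: AgentCase[]): string[] {
  const seen = new Set<string>()
  const duplicates = new Set<string>()
  for (const item of cases) {
    if (seen.has(item.id)) duplicates.add(item.id)
    seen.add(item.id)
  }
  return Array.from(duplicates)
}

const duplicateIds = findDuplicateIds(exampleManifest)
if (duplicateIds.length > 0) {
  throw new Error('Duplicate manifest ids: ' + duplicateIds.join(', '))
}

// Serialize to the JSON array shape the runner reads as its manifest file.
const manifestJson = JSON.stringify(exampleManifest, null, 2)
```

Writing manifestJson to a file such as agent-tests.json would give the runner something to load. Keeping the manifest in a typed file and generating the JSON is one option; hand-editing the JSON directly works just as well for small suites.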

Layer 2: Treat RAG as two systems, not one answer box

RAG failures are often misdiagnosed because the answer is the visible output. In practice, RAG has at least two separable contracts: retrieval and generation. Retrieval asks, "Did the system bring the right evidence into context?" Generation asks, "Did the model use that evidence correctly?" If you only assert the final answer, you cannot tell whether the retriever missed the document, the ranker buried it, the prompt ignored it, or the model hallucinated around it.

The next example tests retrieval before generation. It handles empty corpora, duplicate IDs, missing expected documents, embedding API failures, and low-similarity matches. In a real stack you would replace the in-memory cosine search with pgvector, Pinecone, Weaviate, OpenSearch, or your existing vector store, but the assertions stay the same.

// rag-retrieval-eval.ts
// Run with: OPENAI_API_KEY=... npx tsx rag-retrieval-eval.ts
type Document = { id: string; text: string }
type Fixture = { query: string; expectedDocIds: string[]; minScore: number }

const docs: Document[] = [
  { id: 'billing-refunds', text: 'Refunds are available within 14 days for unused subscription time.' },
  { id: 'security-sso', text: 'Enterprise accounts can require SAML SSO and SCIM provisioning.' },
  { id: 'browser-traces', text: 'Playwright traces include DOM snapshots, network requests, console logs, and screenshots.' },
]

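// minScore thresholds are model-specific: calibrate them against real scores from
// your embedding model before using them as a CI gate.
const fixtures: Fixture[] = [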
  { query: 'How long can a customer request a refund?', expectedDocIds: ['billing-refunds'], minScore: 0.72 },
  { query: 'What evidence does a browser trace save?', expectedDocIds: ['browser-traces'], minScore: 0.72 },
]

function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding length mismatch: ' + a.length + ' vs ' + b.length)
  return a.reduce((sum, value, index) => sum + value * b[index], 0)
}

function norm(a: number[]): number {
  return Math.sqrt(a.reduce((sum, value) => sum + value * value, 0))
}

function cosine(a: number[], b: number[]): number {
  const denominator = norm(a) * norm(b)
  if (denominator === 0) throw new Error('Cannot compare zero-vector embedding')
  return dot(a, b) / denominator
}

async function embed(input: string): Promise<number[]> {
  if (!process.env.OPENAI_API_KEY) throw new Error('OPENAI_API_KEY is required')
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      authorization: 'Bearer ' + process.env.OPENAI_API_KEY,
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input }),
  })

  if (!response.ok) {
    const body = await response.text()
    throw new Error('Embedding request failed with ' + response.status + ': ' + body.slice(0, 300))
  }

  const json = (await response.json()) as { data?: Array<{ embedding?: number[] }> }
  const vector = json.data?.[0]?.embedding
  if (!vector || vector.length === 0) throw new Error('Embedding response did not include a vector')
  return vector
}

async function main() {
  if (docs.length === 0) throw new Error('Document corpus is empty; retrieval eval would be meaningless')

  const seen = new Set<string>()
  for (const doc of docs) {
    if (seen.has(doc.id)) throw new Error('Duplicate document id: ' + doc.id)
    seen.add(doc.id)
  }

  const docVectors = new Map<string, number[]>()
  for (const doc of docs) {
    docVectors.set(doc.id, await embed(doc.text))
  }

  const failures: string[] = []
  for (const fixture of fixtures) {
    for (const expected of fixture.expectedDocIds) {
      if (!seen.has(expected)) failures.push('Fixture expects missing document: ' + expected)
    }

    const queryVector = await embed(fixture.query)
    const ranked = docs
      .map((doc) => ({ id: doc.id, score: cosine(queryVector, docVectors.get(doc.id) || []) }))
      .sort((a, b) => b.score - a.score)

    const top = ranked[0]
    if (!top) failures.push('No retrieval result for query: ' + fixture.query)
    if (top && !fixture.expectedDocIds.includes(top.id)) {
      failures.push('Wrong top doc for "' + fixture.query + '": got ' + top.id)
    }
    if (top && top.score < fixture.minScore) {
      failures.push('Low similarity for "' + fixture.query + '": ' + top.score.toFixed(3))
    }
  }

  if (failures.length > 0) {
    console.error(failures.join('\n'))
    process.exit(1)
  }

  console.log('RAG retrieval eval passed for ' + fixtures.length + ' fixtures')
}

main().catch((error) => {
  console.error(error instanceof Error ? error.stack : error)
  process.exit(1)
})

This gives you a clean debugging fork. If retrieval fails, inspect chunking, metadata filters, embedding model changes, index freshness, and ranker behavior. If retrieval passes but the answer fails, inspect prompt format, context ordering, citation requirements, and model settings. That separation is the difference between a two-minute diagnosis and a week of prompt guessing.
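The fork can also be encoded so that every failing run is labeled with the layer to inspect first. The sketch below assumes you record, for each run, which documents were expected, which were retrieved, and which the answer cited. The RagRun shape and the triageRagRun name are hypothetical, not part of any library.

```typescript
// rag-failure-triage.ts
// Hypothetical shape for one recorded RAG run; field names are illustrative.
type RagRun = {
  expectedDocIds: string[] // documents the fixture says must be retrieved
  retrievedDocIds: string[] // documents the retriever actually returned
  citedDocIds: string[] // documents the generated answer cites
}

type RagVerdict = 'pass' | 'retrieval_failure' | 'generation_failure'

function triageRagRun(run: RagRun): RagVerdict {
  const retrieved = new Set(run.retrievedDocIds)
  const missing = run.expectedDocIds.filter((id) => !retrieved.has(id))
  // If the evidence never reached the context, stop here: the model had no chance.
  if (missing.length > 0) return 'retrieval_failure'

  const cited = new Set(run.citedDocIds)
  const uncitedExpected = run.expectedDocIds.filter((id) => !cited.has(id))
  const citedButNotRetrieved = run.citedDocIds.filter((id) => !retrieved.has(id))
  // Evidence was available, so a missing or invented citation is a generation problem.
  if (uncitedExpected.length > 0 || citedButNotRetrieved.length > 0) return 'generation_failure'

  return 'pass'
}

// The right document was retrieved but the answer cited nothing.
const verdict = triageRagRun({
  expectedDocIds: ['billing-refunds'],
  retrievedDocIds: ['security-sso', 'billing-refunds'],
  citedDocIds: [],
})
console.log(verdict) // generation_failure
```

Checking retrieval first is a deliberate ordering: a generation verdict is only meaningful once you know the evidence was in context.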

How do you keep AI tests deterministic enough for CI?

You do not remove all uncertainty; you isolate it behind stable fixtures, bounded outputs, retries, and evidence-rich failure reports.

CI needs a binary result, but AI systems often produce a distribution of acceptable outputs. The trick is to make the contract deterministic even when the wording varies. Instead of comparing an entire paragraph, validate structure, citations, tool calls, state transitions, and safety boundaries. That is why schemas are so useful: they let your agent speak naturally inside a typed envelope.

The example below validates a support triage agent. It rejects malformed JSON, unknown tools, unsupported priorities, policy answers without citations, clarifying questions that also call a tool, and urgent tickets with no blocking-impact signal. It also includes an edge case for ambiguous user requests, where asking for clarification is the correct behavior.

// agent-contract.test.ts
// Run with: npm i -D vitest && npx vitest run agent-contract.test.ts
import { describe, expect, it } from 'vitest'

type AgentDecision = {
  action: 'answer' | 'ask_clarifying_question' | 'create_ticket'
  priority: 'low' | 'normal' | 'urgent'
  toolName: 'none' | 'zendesk.create_ticket' | 'billing.lookup_invoice'
  citations: string[]
  message: string
}

const allowedTools = new Set(['none', 'zendesk.create_ticket', 'billing.lookup_invoice'])

function parseDecision(raw: string): AgentDecision {
  let value: unknown
  try {
    value = JSON.parse(raw)
  } catch (error) {
    throw new Error('Agent returned invalid JSON: ' + (error instanceof Error ? error.message : String(error)))
  }

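  if (typeof value !== 'object' || value === null) {
    throw new Error('Agent output must be a JSON object')
  }
  const item = value as Partial<AgentDecision>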
  if (!['answer', 'ask_clarifying_question', 'create_ticket'].includes(String(item.action))) {
    throw new Error('Invalid action: ' + String(item.action))
  }
  if (!['low', 'normal', 'urgent'].includes(String(item.priority))) {
    throw new Error('Invalid priority: ' + String(item.priority))
  }
  if (!allowedTools.has(String(item.toolName))) {
    throw new Error('Tool is not allowed in this environment: ' + String(item.toolName))
  }
  if (!Array.isArray(item.citations)) throw new Error('citations must be an array')
  if (typeof item.message !== 'string' || item.message.length < 12) throw new Error('message is too short to be useful')
  return item as AgentDecision
}

function validateDecision(decision: AgentDecision, userText: string) {
  if (decision.action === 'answer' && userText.toLowerCase().includes('policy') && decision.citations.length === 0) {
    throw new Error('Policy answers require at least one citation')
  }
  if (decision.action === 'ask_clarifying_question' && decision.toolName !== 'none') {
    throw new Error('Clarifying questions must not call external tools')
  }
  if (decision.action === 'create_ticket' && decision.priority === 'urgent' && !userText.toLowerCase().includes('blocked')) {
    throw new Error('Urgent tickets require a blocking user impact signal')
  }
}

async function fakeSupportAgent(userText: string): Promise<string> {
  if (userText.includes('maybe refund or invoice')) {
    return JSON.stringify({
      action: 'ask_clarifying_question',
      priority: 'normal',
      toolName: 'none',
      citations: [],
      message: 'Do you want help with a refund request or with finding an invoice?',
    })
  }
  if (userText.includes('policy')) {
    return JSON.stringify({
      action: 'answer',
      priority: 'low',
      toolName: 'none',
      citations: ['billing-refunds'],
      message: 'The refund policy allows requests within the documented eligibility window.',
    })
  }
  return JSON.stringify({
    action: 'create_ticket',
    priority: 'normal',
    toolName: 'zendesk.create_ticket',
    citations: [],
    message: 'I created a ticket so support can investigate the account-specific issue.',
  })
}

describe('support triage agent contract', () => {
  it('answers policy questions with citations', async () => {
    const userText = 'What is the refund policy?'
    const decision = parseDecision(await fakeSupportAgent(userText))
    validateDecision(decision, userText)
    expect(decision.action).toBe('answer')
  })

  it('asks before using tools for ambiguous billing intent', async () => {
    const userText = 'I need maybe refund or invoice help'
    const decision = parseDecision(await fakeSupportAgent(userText))
    validateDecision(decision, userText)
    expect(decision.action).toBe('ask_clarifying_question')
    expect(decision.toolName).toBe('none')
  })

  it('fails loudly when the agent calls an unknown tool', () => {
    expect(() =>
      parseDecision(JSON.stringify({
        action: 'create_ticket',
        priority: 'normal',
        toolName: 'stripe.refund_everything',
        citations: [],
        message: 'Trying an unsafe tool call.',
      })),
    ).toThrow(/not allowed/)
  })
})

Notice what this test does not do: it does not demand one golden sentence. Golden strings are brittle for language models. Contract tests are stronger because they assert the behavior the product depends on. For more practical migration steps from prototype checks to professional QA, see the test run debugging guide.

Layer 3: Add browser evidence for workflows that touch the product

Agents and RAG apps rarely live only in APIs. They log into dashboards, read tables, click buttons, update settings, and explain results to users. Browser tests are where AI quality meets product quality. When a planner chooses a tool, the UI still has to render the right state. When a RAG answer cites a document, the citation link still has to open. When an autonomous workflow edits a record, the audit log still has to show who did it.

Playwright traces are valuable here because they preserve DOM snapshots, network activity, console output, screenshots, and timing. The trace is not just a video; it is a failure packet. At 100+ agents, you need failure packets because the person debugging tomorrow may not be the person who wrote the agent today.

// playwright-agent-flow.spec.ts
// Run with: BASE_URL=http://localhost:3000 npx playwright test playwright-agent-flow.spec.ts
import { expect, test } from '@playwright/test'

test('RAG assistant cites retrieved source and preserves audit trail', async ({ page }, testInfo) => {
  const baseUrl = process.env.BASE_URL
  if (!baseUrl) throw new Error('BASE_URL is required for browser agent tests')

  // Register the console listener before any navigation so errors from every step are captured.
  const consoleErrors: string[] = []
  page.on('console', (message) => {
    if (message.type() === 'error') consoleErrors.push(message.text())
  })

  await page.goto(baseUrl + '/app/support-assistant')
  await expect(page.getByRole('heading', { name: /support assistant/i })).toBeVisible()

  await page.getByLabel(/customer question/i).fill('Can enterprise accounts require SSO?')
  await page.getByRole('button', { name: /ask assistant/i }).click()

  const answer = page.getByTestId('assistant-answer')
  await expect(answer).toContainText(/SSO|SAML/i, { timeout: 15000 })

  const citation = page.getByRole('link', { name: /security-sso/i })
  await expect(citation).toBeVisible()
  await expect(citation).toHaveAttribute('href', /security-sso/)

  await page.getByRole('button', { name: /create follow-up task/i }).click()
  await expect(page.getByText(/task created/i)).toBeVisible()

  await page.goto(baseUrl + '/app/audit-log')
  const latest = page.getByTestId('audit-row').first()
  await expect(latest).toContainText(/support assistant/i)
  await expect(latest).toContainText(/security-sso/i)

  // Fail at the end so the trace still covers the full workflow.
  if (consoleErrors.length > 0) {
    await testInfo.attach('console-errors', {
      body: consoleErrors.join('\n'),
      contentType: 'text/plain',
    })
    throw new Error('Unexpected browser console errors: ' + consoleErrors.length)
  }
})

The edge case in this test is not exotic: the assistant can answer correctly while the product fails to preserve the citation in the audit trail. That is a production bug. Users and compliance teams do not care that the model was right if the system cannot prove where the answer came from.

Layer 4: Scale with queues, shards, and failure budgets

Once the suite grows, the bottleneck becomes orchestration. Running 800 agent checks serially is slow. Retrying every flaky test hides bugs. Running every expensive RAG eval on every pull request burns money and patience. Mature teams split the suite into tiers.

  • Pull request smoke: fast contract tests, permission checks, critical RAG fixtures, and one or two browser paths.
  • Nightly regression: wider RAG coverage, larger browser matrix, prompt comparison, and tool behavior checks.
  • Pre-release gate: full corpus evals, migration checks, production-like permissions, and human review for risky deltas.
  • Post-deploy monitor: synthetic workflows against production-safe accounts, with strict write boundaries.
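One way to express these tiers is as tags on each test case, with a small selector deciding what runs where. The sketch below is an assumption about how you might model it: the TieredCase shape, the tier names, and the writesData flag are all hypothetical. Excluding every writing case from the post-deploy tier is a deliberately conservative reading of "strict write boundaries"; your boundary may allow scoped writes to dedicated test accounts.

```typescript
// tier-selection.ts
type Tier = 'pr-smoke' | 'nightly' | 'pre-release' | 'post-deploy'

// Hypothetical case metadata; writesData marks cases that mutate state.
type TieredCase = { id: string; tiers: Tier[]; writesData: boolean }

function selectForTier(cases: TieredCase[], tier: Tier): TieredCase[] {
  return cases.filter((item) => {
    if (!item.tiers.includes(tier)) return false
    // Post-deploy monitors run against production, so this sketch skips anything that writes.
    if (tier === 'post-deploy' && item.writesData) return false
    return true
  })
}

const tieredCases: TieredCase[] = [
  { id: 'contract-support-triage', tiers: ['pr-smoke', 'nightly'], writesData: false },
  { id: 'rag-wide-coverage', tiers: ['nightly', 'pre-release'], writesData: false },
  { id: 'browser-create-task', tiers: ['nightly', 'post-deploy'], writesData: true },
  { id: 'browser-read-dashboard', tiers: ['pr-smoke', 'post-deploy'], writesData: false },
]

const smokeIds = selectForTier(tieredCases, 'pr-smoke').map((item) => item.id)
const monitorIds = selectForTier(tieredCases, 'post-deploy').map((item) => item.id)
```

In this example smokeIds contains the two fast read-only checks, and monitorIds drops browser-create-task because it writes data.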

Sharding should be deterministic. Hash the test ID rather than relying on file or array order (the small runner above shards by array position for brevity), so that renaming a file or inserting a case does not reshuffle the whole suite. Retries should be narrow. Retry transient network failures, provider 429s, and browser startup issues; do not retry policy violations, schema failures, or missing citations. Quarantine should create visibility, not silence. A quarantined test should still run, report, and carry an owner.
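Both rules fit in a few lines. The sketch below assigns shards by hashing the test ID with FNV-1a, chosen here only because it needs no dependencies; any stable hash would do. The FailureKind labels and the shouldRetry helper are hypothetical names, and how you classify a raw error into one of those kinds is left to your runner.

```typescript
// stable-shard.ts
// FNV-1a (32-bit) over UTF-16 code units: simple, dependency-free, and stable across runs.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// The shard depends only on the id and the shard count, never on list position.
function shardOf(testId: string, totalShards: number): number {
  if (!Number.isInteger(totalShards) || totalShards < 1) {
    throw new Error('totalShards must be a positive integer')
  }
  return fnv1a(testId) % totalShards
}

function pickShardById<T extends { id: string }>(items: T[], index: number, total: number): T[] {
  return items.filter((item) => shardOf(item.id, total) === index)
}

// Hypothetical failure labels; only plausibly transient kinds are retryable.
type FailureKind =
  | 'network'
  | 'rate_limited'
  | 'browser_startup'
  | 'policy_violation'
  | 'schema'
  | 'missing_citation'

const retryableKinds: FailureKind[] = ['network', 'rate_limited', 'browser_startup']

// attempt is 1-based: the first run is attempt 1.
function shouldRetry(kind: FailureKind, attempt: number, maxAttempts = 2): boolean {
  return attempt < maxAttempts && retryableKinds.includes(kind)
}
```

The trade-off is balance: hashing spreads cases roughly evenly rather than exactly evenly, so a shard may end up slightly larger than its neighbors. For most suites that is a fair price for assignments that survive reordering.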

Troubleshooting failure modes

AI test failures feel noisy until you categorize them. Use the failure shape to decide where to look first.

  • Wrong retrieved document: check index freshness, chunk boundaries, metadata filters, embedding model versions, and whether test fixtures reference deleted documents.
  • Right document, wrong answer: inspect prompt assembly, context ordering, truncation, model temperature, and citation rules.
  • Tool called unexpectedly: verify tool allowlists, planner examples, environment mode, and whether ambiguous user input should force clarification.
  • CI-only failures: compare environment variables, network access, test user state, clock/timezone assumptions, and browser dependencies.
  • Flaky browser checks: prefer role selectors, wait for user-visible state, attach traces, and remove sleeps that guess at async completion.
  • Slow suite growth: split smoke and regression tiers, shard by stable IDs, cache immutable fixtures, and cap expensive model calls.

A useful debugging habit: every failure should include the manifest ID, prompt version, model name, tool list, fixture version, trace path, and artifact path. Without those, you are debugging a memory of a run instead of the run itself.
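A lightweight way to enforce that habit is to treat the evidence list as a type and refuse to report a failure that is missing any of it. The sketch below is one possible shape; the FailureEvidence field names mirror the list above but are otherwise hypothetical.

```typescript
// failure-evidence.ts
type FailureEvidence = {
  manifestId: string
  promptVersion: string
  modelName: string
  toolList: string[]
  fixtureVersion: string
  tracePath: string
  artifactPath: string
}

// Returns the names of required fields that are absent or blank.
function missingEvidenceFields(evidence: Partial<FailureEvidence>): string[] {
  const required: Array<keyof FailureEvidence> = [
    'manifestId',
    'promptVersion',
    'modelName',
    'toolList',
    'fixtureVersion',
    'tracePath',
    'artifactPath',
  ]
  return required.filter((key) => {
    const value = evidence[key]
    // An empty tool list is still a recorded fact, so any array counts as present.
    if (Array.isArray(value)) return false
    return typeof value !== 'string' || value.trim() === ''
  })
}

// Example: a report that forgot the model name and has a blank trace path.
const gaps = missingEvidenceFields({
  manifestId: 'support-refund-policy',
  promptVersion: 'v12',
  toolList: [],
  fixtureVersion: '2026-06-01',
  tracePath: '',
  artifactPath: '.artifacts/agents/support-refund-policy.json',
})
console.log(gaps) // [ 'modelName', 'tracePath' ]
```

A runner could call this before writing its report and fail the reporting step itself when gaps is non-empty, which keeps incomplete failure packets from reaching the dashboard.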

Gotchas that catch teams during the migration

The first gotcha is fixture drift. Your tests pass against old documents while production retrieval uses a new index. Stamp fixtures with corpus versions and fail when the expected document no longer exists. The second is permission drift. Local agents often run with broad developer credentials; CI should run with the narrowest role that can complete the workflow.
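Fixture drift can be checked mechanically before any retrieval runs. The sketch below assumes each fixture records the corpus version it was written against and that your index can report its own version and document IDs. The VersionedFixture and CorpusSnapshot shapes are hypothetical.

```typescript
// fixture-drift-check.ts
type VersionedFixture = { query: string; expectedDocIds: string[]; corpusVersion: string }
type CorpusSnapshot = { version: string; docIds: string[] }

// Returns one message per problem; an empty array means no drift was detected.
function findFixtureDrift(fixtures: VersionedFixture[], corpus: CorpusSnapshot): string[] {
  const known = new Set(corpus.docIds)
  const problems: string[] = []
  for (const fixture of fixtures) {
    if (fixture.corpusVersion !== corpus.version) {
      problems.push(
        'Fixture "' + fixture.query + '" targets corpus ' + fixture.corpusVersion +
          ' but the index is ' + corpus.version,
      )
    }
    for (const id of fixture.expectedDocIds) {
      if (!known.has(id)) {
        problems.push('Fixture "' + fixture.query + '" expects missing document: ' + id)
      }
    }
  }
  return problems
}

const driftProblems = findFixtureDrift(
  [
    { query: 'How long can a customer request a refund?', expectedDocIds: ['billing-refunds'], corpusVersion: 'v3' },
    { query: 'Can enterprise accounts require SSO?', expectedDocIds: ['security-sso'], corpusVersion: 'v4' },
  ],
  { version: 'v4', docIds: ['security-sso', 'browser-traces'] },
)
```

Here the first fixture produces two problems (a stale corpus version and a deleted document) while the second produces none. Running this as the first step of a retrieval eval turns a confusing ranking failure into an explicit maintenance task.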

The third gotcha is hidden state. Agents remember, queues retry, browsers keep sessions, and vector stores cache. Make state explicit in the manifest. Create test users per run or reset them before each scenario. The fourth is overusing model-judged evals. LLM judges can be useful, but they should complement deterministic assertions, not replace them. If a citation is required, assert the citation directly.

Finally, watch cost and latency. A suite that calls a hosted model thousands of times on every commit will be disabled by the first frustrated developer. Keep PR checks small, cache immutable embeddings, and move broad quality sweeps to scheduled runs. Professional infrastructure is not the biggest possible suite; it is the smallest suite that catches the failures that matter.
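Caching immutable embeddings can start as a thin wrapper around whatever embed function you already have. The sketch below is an in-memory version under that assumption; withEmbeddingCache and EmbedFn are hypothetical names, and a real suite would likely persist the cache between CI runs. The model name is part of the key so that changing the embedding model never serves stale vectors.

```typescript
// embedding-cache.ts
type EmbedFn = (text: string) => Promise<number[]>

function withEmbeddingCache(model: string, embed: EmbedFn): EmbedFn {
  // Caching the promise (not the resolved vector) also deduplicates concurrent requests.
  const cache = new Map<string, Promise<number[]>>()
  return (text: string) => {
    const key = model + '\u0000' + text
    const cached = cache.get(key)
    if (cached) return cached
    const pending = embed(text).catch((error) => {
      // Do not cache failures: the next call should try the provider again.
      cache.delete(key)
      throw error
    })
    cache.set(key, pending)
    return pending
  }
}

// Usage with a stand-in embed function that counts provider calls.
let providerCalls = 0
const fakeEmbed: EmbedFn = (text) => {
  providerCalls += 1
  return Promise.resolve([text.length])
}
const cachedEmbed = withEmbeddingCache('example-model', fakeEmbed)
const first = cachedEmbed('Refunds are available within 14 days.')
const second = cachedEmbed('Refunds are available within 14 days.')
const third = cachedEmbed('Enterprise accounts can require SAML SSO.')
```

After these three calls the stand-in provider has been invoked twice, because the repeated text is served from the cache.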

A practical 30-day migration path

Week one: write manifests for your top ten workflows and record the fixtures they need. Do not automate everything yet. Make the contracts reviewable. Week two: add a small runner, artifact output, and CI smoke checks. Week three: split RAG retrieval from answer validation and add browser traces for the highest-risk path. Week four: shard the suite, add nightly regression, and create a quarantine policy with owners and expiry dates.
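The week-four quarantine policy can be enforced in CI rather than by convention. The sketch below assumes a quarantine list where each entry names an owner and an ISO expiry date; the QuarantineEntry shape and the findQuarantineViolations helper are hypothetical.

```typescript
// quarantine-policy.ts
// expiresOn is an ISO calendar date such as 2026-07-01, interpreted as UTC midnight.
type QuarantineEntry = { testId: string; owner: string; reason: string; expiresOn: string }

const isoDatePattern = /^\d{4}-\d{2}-\d{2}$/

// Returns one message per violation; an empty array means the policy holds.
function findQuarantineViolations(entries: QuarantineEntry[], now: Date): string[] {
  const problems: string[] = []
  for (const entry of entries) {
    if (entry.owner.trim() === '') {
      problems.push(entry.testId + ': quarantine has no owner')
    }
    const expires = isoDatePattern.test(entry.expiresOn) ? Date.parse(entry.expiresOn) : NaN
    if (Number.isNaN(expires)) {
      problems.push(entry.testId + ': expiresOn is not a valid YYYY-MM-DD date')
    } else if (expires < now.getTime()) {
      problems.push(entry.testId + ': quarantine expired on ' + entry.expiresOn)
    }
  }
  return problems
}

const violations = findQuarantineViolations(
  [
    { testId: 'billing-invoice-lookup', owner: 'billing-team', reason: 'sandbox outage', expiresOn: '2026-07-01' },
    { testId: 'browser-create-task', owner: '', reason: 'flaky selector', expiresOn: '2026-06-01' },
  ],
  new Date('2026-06-18T00:00:00Z'),
)
```

With the fixed date used here, the first entry passes and the second produces two violations: it has no owner and its expiry has passed. Failing the build on a non-empty result keeps quarantine a temporary state with a named person responsible for ending it.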

That path is intentionally modest. You do not need to build a platform team before you build quality. You need enough structure that every new agent inherits good defaults: bounded tools, known fixtures, repeatable CI, clear evidence, and failure modes that point to the right layer. That is how a vibe-coded prototype grows into a system you can trust.

Frequently Asked Questions

Do I need Kubernetes before testing AI agents?

No. Start with stable fixtures, per-agent manifests, CI shards, and trace storage. Kubernetes helps scale runners later, but it will not fix weak test contracts.

How often should RAG evaluations run?

Run smoke evals on every pull request, broader retrieval and answer-quality suites nightly, and full corpus regression checks before major index or model changes.

What should I test first in a multi-agent system?

Test tool permissions, state transitions, idempotency, and handoff contracts first. These failures create the most confusing bugs once many agents run in parallel.

Can vibe-coded apps use this without a platform team?

Yes. Use a manifest file, a small runner, Playwright traces, and a few deterministic eval fixtures. Add queues and dashboards only when the suite outgrows one CI job.