Frequently Asked Questions

Is prompt engineering still useful when agents are tested deterministically?

Yes. Prompts still shape behavior, but deterministic tests turn them into versioned contracts with fixtures, schemas, traces, and CI checks.

Do I need Playwright if my agent does not use a browser?

Not always. Use Playwright for browser-facing workflows. For API-only agents, keep the same contract approach with request mocks and JSON evidence.

How should indie developers start without building a huge QA platform?

Start with one golden path, one adversarial fixture, one schema validator, and one CI gate. Add traces only where failures are expensive.

What is the most common mistake in AI agent testing?

Testing the happy prompt only. Reliable agents need tests for malformed output, tool timeouts, ambiguous goals, retries, and unsafe partial success.

The AI Skills Playbook: Deterministic Agent Testing with Playwright and TypeScript

Vibe coding is a real unlock. You describe the feature, the model drafts the UI, and a working prototype appears before your coffee gets cold. The trap is that the same loop that feels magical in a solo session can become slippery in production. The agent succeeds once, then fails on a slightly different account state. It returns JSON today and prose tomorrow. It clicks the right button in Chrome but stalls in WebKit.

This playbook is for builders who already know how to get useful work out of AI tools and now want the next professional skill: turning agent behavior into something testable. The move is not from creativity to bureaucracy. The move is from prompt-only confidence to deterministic evidence. A test does not make the system deterministic. It creates a deterministic contract around the parts you control.

Two market signals explain why this matters now. Stack Overflow's 2025 Developer Survey reports that 84% of respondents use or plan to use AI tools, while 46% distrust the accuracy of AI-tool output. GitHub Octoverse 2025 reports more than 1.1 million public repositories using an LLM SDK and 518.7 million pull requests merged. AI code is no longer a side experiment; the review and verification layer is becoming the bottleneck.

From Prompt Engineering to Agent Testing

Prompt engineering optimizes a request. Agent testing verifies a workflow. That distinction matters because agents are not just text generators. They choose tools, read state, mutate systems, retry, summarize, and sometimes stop early. A good prompt can reduce variance, but it cannot prove that the workflow handled a timeout, preserved user data, or refused a dangerous action.

Agent testing wraps non-deterministic model calls with deterministic contracts: schemas, fixtures, mocked tools, traces, assertions, and CI gates.

A professional agent test suite usually has four layers. First, input fixtures describe realistic user goals and messy edge cases. Second, tool mocks make external systems predictable. Third, output validators reject malformed, unsafe, or incomplete results. Fourth, trace assertions check the path the agent took, not only the final text.
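As a rough sketch, the four layers can live in one typed fixture. Every name below (`AgentFixture`, `AgentRun`, `checkFixture`, the `lookup_order` tool) is a hypothetical illustration rather than part of any library, and the sketch assumes the agent run has already been recorded with the mocks wired in, so the check only compares the recorded output and tool path against the fixture.

```typescript
// All names here (AgentFixture, checkFixture, lookup_order) are hypothetical.
type ToolCall = { toolName: string; args: Record<string, unknown> };

// What one recorded agent run exposes to the test harness.
type AgentRun = { output: unknown; toolCalls: ToolCall[] };

type AgentFixture = {
  name: string;
  userGoal: string; // Layer 1: input fixture
  toolMocks: Record<string, (args: Record<string, unknown>) => unknown>; // Layer 2: canned tool responses
  validateOutput: (output: unknown) => string[]; // Layer 3: empty list means valid
  expectedToolCalls: string[]; // Layer 4: the tool path, in order
};

// Returns every contract violation found in a run; an empty list means pass.
function checkFixture(fixture: AgentFixture, run: AgentRun): string[] {
  const problems = [...fixture.validateOutput(run.output)];
  const called = run.toolCalls.map((call) => call.toolName);

  for (const toolName of called) {
    if (!Object.keys(fixture.toolMocks).includes(toolName)) {
      problems.push(`unmocked tool called: ${toolName}`);
    }
  }
  if (called.join(',') !== fixture.expectedToolCalls.join(',')) {
    problems.push(
      `expected tools [${fixture.expectedToolCalls.join(', ')}] but saw [${called.join(', ')}]`,
    );
  }
  return problems;
}

const orderLookup: AgentFixture = {
  name: 'order lookup golden path',
  userGoal: 'Where is order 1001?',
  toolMocks: { lookup_order: () => ({ orderId: '1001', status: 'shipped' }) },
  validateOutput: (output) =>
    typeof output === 'object' &&
    output !== null &&
    typeof (output as { summary?: unknown }).summary === 'string'
      ? []
      : ['output must be an object with a string summary'],
  expectedToolCalls: ['lookup_order'],
};

const goodRun: AgentRun = {
  output: { summary: 'Order 1001 has shipped.' },
  toolCalls: [{ toolName: 'lookup_order', args: { orderId: '1001' } }],
};
const badRun: AgentRun = {
  output: 'Your order shipped!', // prose instead of structured output
  toolCalls: [
    { toolName: 'lookup_order', args: { orderId: '1001' } },
    { toolName: 'send_email', args: {} }, // not mocked, not expected
  ],
};

const goodProblems = checkFixture(orderLookup, goodRun); // no violations
const badProblems = checkFixture(orderLookup, badRun); // output, unmocked tool, wrong path
```

Returning a list of problems instead of throwing on the first one is a design choice: a failed fixture then shows every broken layer at once, which tends to make the failure easier to triage.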

If you want a broader quality-engineering frame for this shift, pair this article with our agentic testing rails guide. If you are still choosing browser automation tooling, the migration notes in our selector rot scanner map well to the agent examples below.

Beginner habit	Professional replacement	Why it becomes a moat
Keep editing the prompt until the demo passes	Version prompts with fixtures and expected contracts	You can detect regressions when the model, data, or tool changes
Inspect the final response manually	Validate JSON, citations, actions, and trace events in CI	Failures become reproducible instead of vibes
Trust the agent because it worked once	Replay golden paths, adversarial paths, and tool failures	You build confidence across real user states

How do you create a deterministic testing moat around non-deterministic agents?

You do not freeze the model. You freeze the contract around it: inputs, tool responses, allowed actions, output shape, and observable trace events.

The word "moat" is deliberate. Anyone can copy a prompt. Fewer teams can copy the accumulated fixtures that encode your product edge cases, the failure traces that show exactly how your agent behaves under pressure, and the CI gates that stop quiet regressions. Over time, those assets become more valuable than the first prompt.

Example 1: Validate agent output before it touches product state

Start with the cheapest reliability win: never let raw model output mutate your system. The following TypeScript script accepts an agent response, parses it, validates a strict action contract, rejects output that echoes a known prompt-injection phrase, and handles edge cases like empty output, malformed JSON, duplicate actions, a summary too short to explain the result, and email actions that skip human approval.

// validate-agent-output.ts
// Run: npx tsx validate-agent-output.ts ./fixtures/agent-output.json
import { readFile } from 'node:fs/promises';

type AgentAction = {
  type: 'create_issue' | 'send_email' | 'no_op';
  target: string;
  payload: Record<string, unknown>;
  requiresHumanApproval: boolean;
};

type AgentResult = {
  summary: string;
  actions: AgentAction[];
  traceId: string;
};

function fail(message: string): never {
  throw new Error(`AGENT_CONTRACT_FAILED: ${message}`);
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function parseAgentResult(raw: string): AgentResult {
  if (raw.trim().length === 0) fail('empty model output');
  if (/ignore previous instructions/i.test(raw)) {
    fail('possible prompt-injection text leaked into output');
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (error) {
    fail(`model returned non-JSON output: ${(error as Error).message}`);
  }

  if (!isRecord(parsed)) fail('top-level output must be an object');
  if (typeof parsed.summary !== 'string' || parsed.summary.length < 20) {
    fail('summary must explain the result in at least 20 characters');
  }
  if (typeof parsed.traceId !== 'string' || !/^trace_[a-z0-9-]+$/.test(parsed.traceId)) {
    fail('traceId must use the trace_<id> format');
  }
  if (!Array.isArray(parsed.actions)) fail('actions must be an array');
  if (parsed.actions.length > 3) fail('agent attempted too many actions for one request');

  const seen = new Set<string>();
  const actions = parsed.actions.map((action, index): AgentAction => {
    if (!isRecord(action)) fail(`action[${index}] must be an object`);
    if (!['create_issue', 'send_email', 'no_op'].includes(String(action.type))) {
      fail(`action[${index}] has unsupported type`);
    }
    if (typeof action.target !== 'string' || action.target.trim() === '') {
      fail(`action[${index}] target is required`);
    }
    if (!isRecord(action.payload)) fail(`action[${index}] payload must be an object`);
    if (typeof action.requiresHumanApproval !== 'boolean') {
      fail(`action[${index}] requiresHumanApproval must be boolean`);
    }

    const dedupeKey = `${action.type}:${action.target}`;
    if (seen.has(dedupeKey)) fail(`duplicate action: ${dedupeKey}`);
    seen.add(dedupeKey);

    if (action.type === 'send_email' && action.requiresHumanApproval !== true) {
      fail('send_email actions require human approval');
    }

    return action as AgentAction;
  });

  return { summary: parsed.summary, actions, traceId: parsed.traceId };
}

async function main() {
  const file = process.argv[2];
  if (!file) fail('usage: npx tsx validate-agent-output.ts <json-file>');
  const raw = await readFile(file, 'utf8').catch((error) => {
    fail(`could not read ${file}: ${(error as Error).message}`);
  });
  const result = parseAgentResult(raw);
  console.log(JSON.stringify({ ok: true, traceId: result.traceId, actions: result.actions.length }));
}

main().catch((error) => {
  console.error((error as Error).message);
  process.exit(1);
});

The important design choice is that the validator fails closed. If the model returns something surprising, the script exits non-zero. That is exactly what you want in CI and exactly what you want before an agent sends an email, updates billing, or files a support ticket. The agent is allowed to be probabilistic; the boundary is not.

Example 2: Test the browser workflow with mocked agent responses

Browser-facing agents fail in a different way. They may produce valid JSON but still drive the interface incorrectly. Playwright is useful because its locator engine waits for elements to be actionable, and its route mocking lets you control the agent API response without hitting a live model.

// tests/agent-checkout.spec.ts
// Run: npx playwright test tests/agent-checkout.spec.ts
import { expect, test } from '@playwright/test';

test.describe('checkout assistant agent', () => {
  test('applies a safe discount recommendation and blocks invalid carts', async ({ page }) => {
    const calls: string[] = [];

    await page.route('**/api/agent/checkout', async (route) => {
      const request = route.request();
      calls.push(request.postData() ?? '');

      try {
        const body = JSON.parse(request.postData() || '{}');
        if (body.cartTotal <= 0) {
          return route.fulfill({
            status: 422,
            contentType: 'application/json',
            body: JSON.stringify({ error: 'cartTotal must be positive' }),
          });
        }

        return route.fulfill({
          status: 200,
          contentType: 'application/json',
          body: JSON.stringify({
            summary: 'Applied loyalty discount because the cart is eligible.',
            actions: [
              {
                type: 'apply_discount',
                target: 'cart',
                payload: { code: 'LOYALTY10', percent: 10 },
                requiresHumanApproval: false,
              },
            ],
            traceId: 'trace_checkout-001',
          }),
        });
      } catch {
        return route.fulfill({
          status: 400,
          contentType: 'application/json',
          body: JSON.stringify({ error: 'invalid request JSON' }),
        });
      }
    });

    await page.goto('/checkout?fixture=loyalty-user');
    await page.getByRole('button', { name: 'Ask checkout assistant' }).click();

    await expect(page.getByText('LOYALTY10')).toBeVisible();
    await expect(page.getByText('trace_checkout-001')).toBeVisible();
    await expect(page.getByRole('button', { name: 'Place order' })).toBeEnabled();

    await page.goto('/checkout?fixture=empty-cart');
    await page.getByRole('button', { name: 'Ask checkout assistant' }).click();
    await expect(page.getByText('cartTotal must be positive')).toBeVisible();
    await expect(page.getByRole('button', { name: 'Place order' })).toBeDisabled();

    expect(calls.length).toBe(2);
  });
});

This is not a toy assertion about whether a button exists. It verifies the user-facing contract end to end: the page calls the agent endpoint once per scenario, the mocked agent service returns structured evidence, the UI renders the action and trace ID, and the unsafe empty-cart edge case blocks checkout.

Example 3: Gate agent traces in CI, not just final answers

The final answer can lie by omission. "I created the issue" is not evidence that the issue was created. A trace gate checks the steps the agent actually took: which tool was called, whether it retried, whether it asked for approval, and whether the output was persisted.

// ci/check-agent-trace.ts
// Run: npx tsx ci/check-agent-trace.ts artifacts/agent-trace.jsonl
import { readFile } from 'node:fs/promises';

type TraceEvent = {
  traceId: string;
  type: 'model_output' | 'tool_call' | 'tool_result' | 'approval_requested' | 'workflow_completed';
  toolName?: string;
  status?: 'ok' | 'error';
  timestamp: string;
};

const REQUIRED_SEQUENCE: TraceEvent['type'][] = [
  'model_output',
  'tool_call',
  'tool_result',
  'workflow_completed',
];

function die(message: string): never {
  throw new Error(`TRACE_GATE_FAILED: ${message}`);
}

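function parseLines(raw: string): TraceEvent[] {
  // Accept LF or CRLF line endings and skip whitespace-only lines.
  const lines = raw.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) die('trace file is empty');

  return lines.map((line, index) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch (error) {
      die(`line ${index + 1} is not valid JSON: ${(error as Error).message}`);
    }

    // Validate outside the try block: a die() thrown inside it would be caught
    // above and misreported as invalid JSON.
    const event = parsed as Partial<TraceEvent> | null;
    if (!event?.traceId || !event.type || !event.timestamp) {
      die(`line ${index + 1} missing traceId, type, or timestamp`);
    }
    return event as TraceEvent;
  });
}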

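function assertTrace(events: TraceEvent[]) {
  const types = events.map((event) => event.type);
  // Require the events in order (other events may sit between them), so a
  // trace that reports completion before its tool call does not pass.
  let cursor = 0;
  for (const required of REQUIRED_SEQUENCE) {
    const found = types.indexOf(required, cursor);
    if (found === -1) die(`missing or out-of-order required event: ${required}`);
    cursor = found + 1;
  }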

  const failedTools = events.filter((event) => event.type === 'tool_result' && event.status === 'error');
  if (failedTools.length > 0) {
    die(`tool errors present: ${failedTools.map((event) => event.toolName ?? 'unknown').join(', ')}`);
  }

  const dangerousCalls = events.filter(
    (event) => event.type === 'tool_call' && ['send_email', 'charge_card', 'delete_record'].includes(event.toolName ?? ''),
  );
  const approvals = events.filter((event) => event.type === 'approval_requested');
  if (dangerousCalls.length > 0 && approvals.length === 0) {
    die('dangerous tool call occurred without approval request');
  }
}

async function main() {
  const file = process.argv[2];
  if (!file) die('usage: npx tsx ci/check-agent-trace.ts <trace.jsonl>');
  const raw = await readFile(file, 'utf8').catch((error) => {
    die(`unable to read trace file: ${(error as Error).message}`);
  });
  const events = parseLines(raw);
  assertTrace(events);
  console.log(JSON.stringify({ ok: true, traceId: events[0].traceId, eventCount: events.length }));
}

main().catch((error) => {
  console.error((error as Error).message);
  process.exit(1);
});

This is where deterministic testing becomes a moat. Your trace fixtures encode product judgment. Maybe a billing action always needs approval. Maybe a support draft can be created automatically but not sent. Maybe a retry is acceptable for a read-only lookup but unacceptable for a payment action. The generic model does not know those rules. Your tests do.
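The retry rule is one place where product judgment turns into a few lines of code. The sketch below is illustrative only: `findForbiddenRetries` and the tool names are hypothetical, and it assumes a workflow should call each side-effecting tool at most once, which may not hold for your product.

```typescript
// Minimal event shape, assumed to resemble the tool_call events in a trace file.
type ToolCallEvent = { type: string; toolName?: string };

// Placeholder names for side-effecting tools; substitute your own.
const NON_RETRYABLE_TOOLS = new Set(['charge_card', 'send_email']);

// Returns the non-retryable tools that were called more than once in a trace.
function findForbiddenRetries(events: ToolCallEvent[]): string[] {
  const callCounts = new Map<string, number>();
  for (const event of events) {
    if (event.type !== 'tool_call' || !event.toolName) continue;
    callCounts.set(event.toolName, (callCounts.get(event.toolName) ?? 0) + 1);
  }
  return Array.from(callCounts.entries())
    .filter(([toolName, count]) => count > 1 && NON_RETRYABLE_TOOLS.has(toolName))
    .map(([toolName]) => toolName);
}

const trace: ToolCallEvent[] = [
  { type: 'tool_call', toolName: 'lookup_order' },
  { type: 'tool_call', toolName: 'lookup_order' }, // read-only retry: allowed
  { type: 'tool_call', toolName: 'charge_card' },
  { type: 'tool_result', toolName: 'charge_card' },
  { type: 'tool_call', toolName: 'charge_card' }, // payment retry: flagged
];

const forbidden = findForbiddenRetries(trace); // ['charge_card']
```

A gate like this would fail the build when `forbidden` is non-empty, the same way the trace gate above fails on a missing approval.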

What should you test first when your AI app starts getting real users?

Test the highest-consequence path first: the workflow where a plausible agent mistake would cost money, trust, data integrity, or user time.

A practical first suite has five fixtures. The golden path proves the intended workflow. The ambiguous-goal fixture checks that the agent asks a clarifying question instead of guessing. The malformed-output fixture proves your validator catches non-JSON or incomplete JSON. The tool failure fixture proves the agent reports partial failure honestly. The permission fixture proves the agent refuses an action it should not perform.

Golden path: the agent completes the workflow and emits the expected trace events.
Ambiguous input: the agent asks for missing information before calling tools.
Malformed output: validators reject prose, partial JSON, duplicate actions, and unsafe action types.
Tool timeout: the agent records the failed dependency and avoids claiming success.
Permission edge case: dangerous actions require approval even when the prompt asks the agent to skip it.
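One way to keep that first suite honest is a small coverage check that fails when a fixture kind is missing. This is a sketch under assumptions: the kind names, the manifest shape, and the file paths are all hypothetical.

```typescript
// Hypothetical kind names mirroring the five fixtures described above.
const REQUIRED_KINDS = [
  'golden_path',
  'ambiguous_input',
  'malformed_output',
  'tool_timeout',
  'permission_edge',
] as const;

type FixtureKind = (typeof REQUIRED_KINDS)[number];
type FixtureEntry = { file: string; kind: FixtureKind };

// Returns the required kinds that no fixture in the manifest covers.
function missingFixtureKinds(manifest: FixtureEntry[]): FixtureKind[] {
  const present = new Set(manifest.map((entry) => entry.kind));
  return REQUIRED_KINDS.filter((kind) => !present.has(kind));
}

// Illustrative manifest; the paths are placeholders.
const manifest: FixtureEntry[] = [
  { file: 'fixtures/golden-path.json', kind: 'golden_path' },
  { file: 'fixtures/ambiguous-goal.json', kind: 'ambiguous_input' },
  { file: 'fixtures/prose-instead-of-json.json', kind: 'malformed_output' },
];

const missing = missingFixtureKinds(manifest); // ['tool_timeout', 'permission_edge']
```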

Troubleshooting: common agent test failures and how to debug them

When an agent test fails, resist the urge to immediately tweak the prompt. First identify which boundary failed. Prompt edits are useful only after you know whether the issue is input ambiguity, tool state, output validation, browser timing, or missing evidence.

Flaky pass/fail: check whether the test hits a live model or live dependency. If yes, replace it with a recorded fixture or route mock for CI.
Valid-looking answer, wrong side effect: inspect trace events. Final text is not enough; assert the tool call and result that prove the action happened.
JSON parser failures: log the first 500 characters of raw output, then fail closed. Do not silently coerce prose into a partial action.
Browser timeout: prefer role-based locators and user-visible states. Avoid fixed sleeps because they hide race conditions instead of diagnosing them.
CI-only failures: compare environment flags, mocked clock, locale, viewport, permissions, and network routes. Agents often expose hidden state assumptions.
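The JSON parser advice above can be sketched as a parse helper that never coerces. `parseOrFailClosed` and `ParseOutcome` are hypothetical names; the helper returns a truncated preview for the caller to log instead of logging itself, which is one possible design among several.

```typescript
type ParseOutcome =
  | { ok: true; value: unknown }
  | { ok: false; reason: string; rawPreview: string };

// Parses model output as JSON. On failure it reports the parser error and a
// truncated preview of the raw text, and makes no attempt to salvage prose.
function parseOrFailClosed(raw: string, previewLimit = 500): ParseOutcome {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch (error) {
    return {
      ok: false,
      reason: (error as Error).message,
      rawPreview: raw.slice(0, previewLimit),
    };
  }
}

const valid = parseOrFailClosed('{"actions":[]}'); // ok: true
const prose = parseOrFailClosed('Sure! I created the issue for you.'); // ok: false
const oversized = parseOrFailClosed('x'.repeat(1200)); // ok: false, preview capped at 500 chars
```

Capping the preview keeps CI logs readable and limits how much potentially sensitive model output ends up in build artifacts.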

A useful debugging habit is to store three artifacts for every failed agent test: the prompt fixture, the raw model output, and the trace timeline. With those, you can usually tell whether the model chose the wrong plan, the tool returned an unexpected response, or the UI rendered the correct state too late for the assertion.

Edge cases and gotchas that separate demos from production systems

Agent reliability work is mostly edge-case work. The common gotcha is assuming that better prompts remove the need for test boundaries. They do not. Prompts are instructions; tests are executable expectations.

Model upgrades can change formatting, tool choice, refusal style, and verbosity.
Streaming responses can expose partial JSON before the final message is valid.
Retry logic can duplicate side effects unless tool calls are idempotent.
Locale and timezone differences can break scheduling agents that passed locally.
RAG answers can cite stale context unless source timestamps are in fixtures.
Browser agents can pass in Chromium and fail in WebKit because focus differs.
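The locale and timezone item is easy to reproduce. The sketch below uses the standard `Intl.DateTimeFormat` API with an explicit IANA time zone; `localDate` is a hypothetical helper, and the point is only that the same instant lands on different calendar dates depending on the zone a fixture assumes.

```typescript
// Formats an instant as YYYY-MM-DD in an explicit IANA time zone, so the
// result does not depend on the machine's local zone.
function localDate(instant: Date, timeZone: string): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
  }).formatToParts(instant);
  const get = (type: string) => parts.find((part) => part.type === type)?.value ?? '';
  return `${get('year')}-${get('month')}-${get('day')}`;
}

const instant = new Date('2025-03-01T02:30:00Z');

const inLosAngeles = localDate(instant, 'America/Los_Angeles'); // '2025-02-28' (UTC-8)
const inTokyo = localDate(instant, 'Asia/Tokyo'); // '2025-03-01' (UTC+9)
```

A scheduling fixture that records only "March 1" is therefore ambiguous; recording the instant plus the user's zone is not.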

The professional move is to make these gotchas explicit. Add an idempotency key to write tools. Include timezone in fixtures. Validate citation URLs and retrieved document dates. Capture trace IDs in the UI. Separate "drafted" from "sent" and "queued" from "completed." These are small engineering habits, but they compound into trust.
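As a minimal sketch of the idempotency habit, here is an in-memory write tool that replays the original result when it sees a repeated key. `createIssueTool` and the `traceId:action` key format are assumptions made for illustration; a real tool would persist keys durably and scope them per user or tenant.

```typescript
type Issue = { id: string; title: string };
type WriteResult = { issue: Issue; replayed: boolean };

// Hypothetical in-memory write tool, for illustration only.
function createIssueTool() {
  const issuesByKey = new Map<string, Issue>();
  return {
    createIssue(idempotencyKey: string, title: string): WriteResult {
      const existing = issuesByKey.get(idempotencyKey);
      // A retry with the same key returns the original issue and writes nothing.
      if (existing) return { issue: existing, replayed: true };
      const issue: Issue = { id: `issue_${issuesByKey.size + 1}`, title };
      issuesByKey.set(idempotencyKey, issue);
      return { issue, replayed: false };
    },
    issueCount: () => issuesByKey.size,
  };
}

const tool = createIssueTool();
// Assumed key format: one key per trace and action.
const first = tool.createIssue('trace_checkout-001:create_issue', 'Refund request');
const retry = tool.createIssue('trace_checkout-001:create_issue', 'Refund request');
// first.replayed is false, retry.replayed is true, and only one issue exists.
```

In a test, the assertion worth writing is on `issueCount()` after a forced retry, since that is the side effect the user would actually notice.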

A 30-day leveling plan

You do not need to build a giant evaluation platform this month. Start with the agent workflow that matters most, then add determinism at the boundaries. Week one: write five fixtures and one output validator. Week two: mock the agent API in one Playwright test. Week three: emit JSONL traces and gate them in CI. Week four: add a small regression set from real failures.

That is enough to change how you build. You will still use prompts. You will still move quickly. But you will stop relying on memory, screenshots, and "it worked on my machine" confidence. The skill that levels you up is not knowing the perfect incantation. It is building a harness where every important agent behavior leaves evidence.

Sources: Stack Overflow 2025 Developer Survey AI section for AI adoption and trust statistics; GitHub Octoverse 2025 for public LLM SDK repository and pull request activity. Treat benchmark claims without fixtures as anecdotes until you can replay them in your own app.

The next moat for AI builders is not a better prompt; it is the ability to prove an agent behaves correctly when the model is uncertain.
