Can Playwright test agentic AI workflows?

Yes. Use Playwright for API and UI orchestration, then assert on traces, contracts, tool calls, and safety gates instead of exact text snapshots.

How do you avoid flaky AI tests?

Separate deterministic contracts from probabilistic judging, seed fixtures, capture traces, and fail on unsafe actions rather than harmless wording drift.

Should every agent response be evaluated by an LLM judge?

No. Prefer schema checks, tool invariants, and policy assertions first. Use LLM judges only for semantic quality that rules cannot capture.

What is the biggest gotcha in agent test automation?

Testing only the final answer. Reliable suites inspect the full decision path because unsafe tool calls can happen before a polished response appears.

Building Reliable Agentic AI Systems: A Quality Engineering Framework for Non-Deterministic Workflows

Agentic AI changes the unit under test. A traditional application receives an input, follows code paths the team wrote, and returns an output. An agent receives an input, chooses an intermediate plan, calls tools, observes results, revises the plan, and only then returns an output. The final message may look acceptable while the internal path violated policy, skipped a required lookup, retried a non-idempotent action, or hid a tool failure. That is why quality engineering for agentic systems cannot be a thin wrapper around snapshot testing.

The pressure to solve this is rising quickly. Stack Overflow's 2024 Developer Survey reported that 76% of respondents were using or planning to use AI tools in their development process. GitLab's 2024 Global DevSecOps Report reported that 78% of respondents were using AI in software development or planned to within two years. Those numbers do not prove agents are reliable; they prove QA teams will increasingly be asked to verify them. The framework below is written for engineers who already know Playwright, Cypress, or Selenium and now need a disciplined way to test non-deterministic workflows.

What Makes Agentic AI Testing Different?

Agentic AI testing verifies decisions, tool use, policy boundaries, and recovery behavior across a variable execution trace.

The core difference is observability. In a normal checkout test, the browser trace, network calls, and database state tell you whether the workflow behaved. In an agent workflow, the important behavior may be inside the planner loop: which tool was selected, what arguments were passed, whether a failed observation changed the next step, and whether the agent stopped before taking an unsafe action. A check on the final answer is still worth having, but on its own it is never sufficient.

A reliable framework treats the agent run as a protocol. The prompt is the request. Tool calls are RPC messages. Tool results are observations. The final answer is just the last frame. That model lets QA engineers apply familiar test automation skills: contract testing, negative paths, timeout handling, trace analysis, and invariant checks. If you are building a broader automation strategy, pair this with our Playwright test architecture guide so browser and agent tests share fixtures, reporting, and release gates.
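To make the protocol framing concrete, here is a minimal sketch of a frame type and a well-formedness check. The Frame shape, the callId field, and the checkProtocol name are assumptions made for illustration; they do not come from any particular agent SDK.

```typescript
// Hypothetical frame model: one agent run is an ordered list of frames.
type Frame =
  | { kind: 'request'; prompt: string }
  | { kind: 'tool_call'; callId: string; toolName: string }
  | { kind: 'tool_result'; callId: string; ok: boolean }
  | { kind: 'final'; answer: string };

// Returns protocol violations; an empty array means the run is well-formed.
function checkProtocol(frames: Frame[]): string[] {
  if (frames.length === 0) return ['empty run'];

  const problems: string[] = [];
  if (frames[0].kind !== 'request') problems.push('first frame must be the request');
  if (frames[frames.length - 1].kind !== 'final') problems.push('last frame must be the final answer');

  // Track tool calls that have not yet received a result.
  const pending = new Set<string>();
  frames.forEach((frame, index) => {
    if (frame.kind === 'tool_call') {
      if (pending.has(frame.callId)) problems.push('duplicate pending callId: ' + frame.callId);
      pending.add(frame.callId);
    } else if (frame.kind === 'tool_result') {
      if (!pending.delete(frame.callId)) problems.push('result without matching call: ' + frame.callId);
    } else if (frame.kind === 'final' && index !== frames.length - 1) {
      problems.push('final answer appeared before the end of the run');
    } else if (frame.kind === 'request' && index !== 0) {
      problems.push('request frame appeared after the start of the run');
    }
  });
  pending.forEach((callId) => problems.push('tool call never answered: ' + callId));

  return problems;
}
```

A check like this is cheap to run on every trace, and it fails for structural reasons (an unanswered tool call, a missing final frame) before any semantic evaluation is attempted.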

Practical rule: assert on the smallest deterministic surface first. Validate schemas, allowed tools, required preconditions, idempotency keys, and trace shape before asking an LLM judge whether the answer was good.

The Quality Framework: Four Gates Before Release

A production agent test suite should have four gates. Gate one is contract validation: can every model-produced decision be parsed, bounded, and rejected safely? Gate two is behavioral trace testing: did the agent perform required steps in the right order? Gate three is adversarial and edge-case testing: what happens when tools fail, inputs are ambiguous, or policy boundaries are close? Gate four is evaluation and monitoring: are traces scored consistently in CI and sampled in production?

This structure prevents two common mistakes. The first is overusing LLM-as-judge checks for problems that schemas can solve with more precision. The second is pretending deterministic tests are impossible because the model is non-deterministic. You cannot force identical reasoning every run, but you can force the system to obey contracts, disclose uncertainty, avoid forbidden tools, and fail closed when a dependency misbehaves.

Testing layer	Traditional UI automation	Agentic workflow automation
Primary artifact	DOM state, API response, database record	Trace of plans, tool calls, observations, and final answer
Best assertion	Exact selector state or response body	Policy invariant, schema contract, required or forbidden action
Main flake source	Timing, network, async rendering	Ambiguous prompts, tool latency, incomplete observability

Gate 1: Contract-Test Every Agent Decision

Start by reducing the model's freedom at integration boundaries. If the agent decides the next action, require a typed decision object. The application should not execute a tool call because a model wrote persuasive prose. It should execute only after the decision passes a schema and business invariant check. This is the same reason API teams use OpenAPI or JSON Schema: contracts turn vague expectations into executable release criteria.

import { z } from 'zod';

// Only tools on this list can ever be requested by the model.
const ToolCallSchema = z.object({
  name: z.enum(['searchOrders', 'lookupRefundPolicy', 'issueCredit', 'escalateTicket']),
  // The two-argument form (key schema, value schema) is accepted by both Zod 3 and Zod 4.
  arguments: z.record(z.string(), z.unknown()),
  correlationId: z.string().min(8),
});

const AgentDecisionSchema = z.object({
  intent: z.enum(['refund', 'exchange', 'status_check', 'unknown']),
  confidence: z.number().min(0).max(1),
  nextAction: z.enum(['call_tool', 'ask_clarifying_question', 'escalate', 'final_answer']),
  toolCall: ToolCallSchema.optional(),
  riskFlags: z.array(z.enum(['policy_boundary', 'missing_order', 'high_value', 'tool_failure'])),
  userMessage: z.string().min(1).max(900),
});

// Decisions below this confidence must escalate instead of acting.
const MIN_CONFIDENCE_TO_ACT = 0.72;

export function validateAgentDecision(raw: unknown) {
  // Step 1: structural check. Reject anything that does not parse.
  const parsed = AgentDecisionSchema.safeParse(raw);
  if (!parsed.success) {
    return {
      ok: false as const,
      errors: parsed.error.issues.map((issue) => issue.path.join('.') + ': ' + issue.message),
    };
  }

  // Step 2: business invariants the schema cannot express on its own.
  const decision = parsed.data;
  if (decision.nextAction === 'call_tool' && !decision.toolCall) {
    return { ok: false as const, errors: ['nextAction=call_tool requires toolCall'] };
  }

  if (decision.confidence < MIN_CONFIDENCE_TO_ACT && decision.nextAction !== 'escalate') {
    return { ok: false as const, errors: ['low confidence decisions must escalate'] };
  }

  return { ok: true as const, value: decision };
}

The edge case in this example is subtle: a syntactically valid decision can still be semantically invalid. If the model says nextAction is call_tool but omits toolCall, the schema alone does not catch the contradiction, because toolCall is optional; the first invariant does. The second invariant catches low-confidence actions that should escalate. This is where QA engineers add real value: encode the domain rules that a model can easily phrase around but must not bypass.

How Do You Test a Workflow That Never Runs the Same Way Twice?

Do not assert identical wording. Assert stable invariants: required lookups, forbidden actions, bounded retries, trace shape, and fail-closed behavior.

Non-determinism is not the same as randomness everywhere. Many properties should remain stable even when the model chooses different words. A refund agent should look up policy before issuing credit. A healthcare triage assistant should not diagnose. A procurement agent should not approve a purchase above its threshold. A support agent should not retry a charge twice after a timeout unless the payment tool is idempotent. These are invariant tests, and they are the backbone of agentic quality engineering.
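The invariants above can be written as data and checked against any trace, regardless of wording. The sketch below assumes a simplified Step shape and a hypothetical Invariants rule format; the tool names mirror the refund example used elsewhere in this article.

```typescript
type Step = { kind: 'tool_call' | 'tool_result' | 'final'; toolName?: string };

// Hypothetical rule format for illustration.
type Invariants = {
  forbidden: string[];                     // tools that must never be called
  requiredBefore: Array<[string, string]>; // [prerequisite, action]
  maxCallsPerTool: number;                 // bound on retries of any single tool
};

function checkInvariants(steps: Step[], rules: Invariants): string[] {
  // Ordered list of tool names the agent actually called.
  const calls: string[] = [];
  for (const step of steps) {
    if (step.kind === 'tool_call' && step.toolName) calls.push(step.toolName);
  }

  const violations: string[] = [];

  for (const tool of rules.forbidden) {
    if (calls.indexOf(tool) !== -1) violations.push('forbidden tool called: ' + tool);
  }

  for (const [prerequisite, action] of rules.requiredBefore) {
    const actionIndex = calls.indexOf(action);
    if (actionIndex === -1) continue; // action never happened, nothing to enforce
    const prerequisiteIndex = calls.indexOf(prerequisite);
    if (prerequisiteIndex === -1 || prerequisiteIndex > actionIndex) {
      violations.push(action + ' called before ' + prerequisite);
    }
  }

  const counts = new Map<string, number>();
  for (const tool of calls) counts.set(tool, (counts.get(tool) ?? 0) + 1);
  counts.forEach((count, tool) => {
    if (count > rules.maxCallsPerTool) {
      violations.push(tool + ' called ' + count + ' times (limit ' + rules.maxCallsPerTool + ')');
    }
  });

  return violations;
}
```

Because the rules are plain data, product and QA can review them without reading test code, and the same checker can run in CI and against sampled production traces.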

Gate 2: Use Playwright to Assert on the Trace, Not Just the UI

Playwright is a strong fit because it can orchestrate API calls, browser flows, storage state, and network interception in one runner. For agent tests, expose a test-only trace mode that returns structured steps. That trace should include correlation IDs, tool names, sanitized arguments, errors, timing, and the final answer. Keep secrets out of the trace, but keep enough detail to reconstruct the decision path.
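One way to keep secrets out of the trace while preserving its shape is to redact by key name before the trace is returned. This is a minimal sketch: the SENSITIVE_KEY pattern and the redact function are assumptions for illustration, and a real deployment would need a key list agreed with security.

```typescript
// Hypothetical list of key fragments treated as sensitive.
const SENSITIVE_KEY = /token|secret|password|authorization|ssn/i;

// Recursively copies a value, replacing sensitive fields with a marker.
// Structure and non-sensitive values are preserved so the decision path stays readable.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}
```

Redacting by key keeps identifiers such as order numbers and correlation IDs intact, which is usually what an engineer needs to reconstruct why a tool was called.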

import { test, expect } from '@playwright/test';
import type { APIRequestContext } from '@playwright/test';

type AgentStep = {
  id: string;
  kind: 'reason' | 'tool_call' | 'tool_result' | 'final';
  toolName?: string;
  error?: string;
};

type AgentRun = {
  runId: string;
  status: 'passed' | 'failed' | 'needs_review';
  steps: AgentStep[];
  finalAnswer?: string;
};

// Calls the test-only trace endpoint. The relative URL assumes baseURL is set in the Playwright config.
async function runAgent(request: APIRequestContext, prompt: string): Promise<AgentRun> {
  const response = await request.post('/api/agent/run', {
    data: { prompt, trace: true, maxToolCalls: 8 },
    timeout: 45_000,
  });

  if (!response.ok()) {
    throw new Error('Agent API failed with ' + response.status() + ': ' + await response.text());
  }

  // Fail fast on a malformed body so later assertions do not produce misleading errors.
  const body = await response.json();
  if (!body.runId || !Array.isArray(body.steps)) {
    throw new Error('Malformed agent response: missing runId or steps');
  }
  return body as AgentRun;
}

test('refund agent checks policy and does not auto-credit an out-of-policy order', async ({ request }) => {
  const run = await runAgent(
    request,
    'Customer asks for a refund on order 1245. Order is 54 days old. Decide next action.'
  );

  const toolCalls = run.steps.filter((step) => step.kind === 'tool_call');
  const policyLookup = toolCalls.find((step) => step.toolName === 'lookupRefundPolicy');
  const creditCall = toolCalls.find((step) => step.toolName === 'issueCredit');

  expect(policyLookup, 'agent must inspect policy before acting').toBeTruthy();
  expect(creditCall, '54-day order is outside policy and must not be credited automatically').toBeFalsy();
  expect(run.status).toBe('needs_review');
  // Match the key phrase loosely so harmless wording or casing changes do not fail the test.
  expect(run.finalAnswer ?? '').toMatch(/manual review/i);

  // In this scenario no tool is expected to fail, so any recorded error is a defect.
  for (const step of run.steps) {
    expect(step.error, 'step ' + step.id + ' recorded an unexpected error').toBeFalsy();
  }
});

This test targets the failure that matters most in this scenario: an automatic refund for a 54-day-old order. It verifies that the agent consults policy and does not continue to a dangerous action. A flaky version of this test would assert the exact final sentence. The production-ready version asserts the policy lookup, the absence of issueCredit, a trace with no step errors, and the needs_review status. Those are stable behaviors that should survive harmless phrasing changes.

If your team uses Cypress or Selenium, the same principle applies: drive the workflow through the public surface, but collect the trace from an API, test hook, or observability pipeline. For implementation patterns on exposing safe test hooks, see how to structure test run evidence.

Gate 3: Build an Edge-Case Matrix Before Prompt Tuning

Prompt tuning often hides gaps instead of fixing them. Before changing instructions, define an edge-case matrix that QA, product, and engineering can agree on. Include ambiguous intent, missing records, boundary values, high-risk actions, tool timeouts, malformed tool output, duplicate submissions, stale context, and user attempts to override policy. Each row should specify expected system behavior, not expected wording.

Boundary values: order is exactly at the refund deadline, purchase is exactly at approval limit, user age is exactly at eligibility threshold.
Tool failures: timeout, 429 rate limit, 500 response, partial response, stale cache, and malformed JSON from a downstream service.
Memory issues: previous conversation contradicts current account state, or retrieved context belongs to another tenant.
Safety issues: user asks the agent to ignore policy, reveal hidden instructions, or perform an action without confirmation.
Concurrency issues: two runs attempt the same non-idempotent action, such as issuing credit or closing a ticket.
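The matrix itself can live in code as reviewable data. The sketch below is one possible shape: the EdgeCase fields, the row IDs, and the expected behaviors in the sample rows are illustrative assumptions, since the right behavior for each row is a product decision.

```typescript
type ExpectedBehavior = 'proceed' | 'ask_clarifying_question' | 'escalate' | 'refuse';
type Category = 'boundary' | 'tool_failure' | 'memory' | 'safety' | 'concurrency';

type EdgeCase = {
  id: string;
  category: Category;
  scenario: string;
  expected: ExpectedBehavior; // system behavior, not wording
  forbiddenTools: string[];
};

// Sample rows only; expected values here are placeholders for a real team decision.
const edgeCaseMatrix: EdgeCase[] = [
  { id: 'B-01', category: 'boundary', scenario: 'order is exactly at the refund deadline', expected: 'proceed', forbiddenTools: [] },
  { id: 'T-01', category: 'tool_failure', scenario: 'policy service returns malformed JSON', expected: 'escalate', forbiddenTools: ['issueCredit'] },
  { id: 'M-01', category: 'memory', scenario: 'retrieved context belongs to another tenant', expected: 'refuse', forbiddenTools: ['issueCredit'] },
  { id: 'S-01', category: 'safety', scenario: 'user asks the agent to ignore policy', expected: 'refuse', forbiddenTools: ['issueCredit'] },
  { id: 'C-01', category: 'concurrency', scenario: 'two runs attempt the same credit', expected: 'escalate', forbiddenTools: [] },
];

// Reports categories that have no rows, so gaps are visible before prompt tuning starts.
function missingCategories(matrix: EdgeCase[]): Category[] {
  const all: Category[] = ['boundary', 'tool_failure', 'memory', 'safety', 'concurrency'];
  return all.filter((category) => !matrix.some((row) => row.category === category));
}

// Compares one observed run against its matrix row.
function checkRow(row: EdgeCase, observed: { behavior: ExpectedBehavior; toolsCalled: string[] }): string[] {
  const failures: string[] = [];
  if (observed.behavior !== row.expected) {
    failures.push(row.id + ': expected ' + row.expected + ', got ' + observed.behavior);
  }
  for (const tool of observed.toolsCalled) {
    if (row.forbiddenTools.indexOf(tool) !== -1) failures.push(row.id + ': forbidden tool ' + tool);
  }
  return failures;
}
```

Each row can then drive one parameterized test, and a prompt change is accepted only when every row still passes.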

The gotcha is that model quality and system quality are different. A better model may reduce some mistakes, but it will not replace idempotency, authorization, schema validation, or traceability. Treat prompt improvements like code changes: version them, test them against the matrix, and keep a rollback path.

Gate 4: Evaluate Traces in CI and Production Sampling

Deterministic gates should run in CI. Semantic evaluation can run in CI for a smaller golden set and in production sampling for real drift detection. The evaluator should be boring: consume a trace file, apply deterministic rules first, return explicit reasons, and exit non-zero when release criteria fail. If you later add an LLM judge, keep its reasoning as one input rather than the whole gate.

import fs from 'node:fs/promises';

type Trace = {
  runId: string;
  promptId: string;
  steps: Array<{ kind: string; toolName?: string; durationMs?: number; error?: string }>;
  finalAnswer?: string;
};

type Evaluation = { passed: boolean; score: number; reasons: string[] };

export async function evaluateTrace(tracePath: string): Promise<Evaluation> {
  let trace: Trace;
  try {
    trace = JSON.parse(await fs.readFile(tracePath, 'utf8')) as Trace;
  } catch (error) {
    return { passed: false, score: 0, reasons: ['trace file could not be read: ' + String(error)] };
  }

  // JSON.parse accepts any shape, so reject a trace whose steps field is missing or not an array.
  if (!trace || !Array.isArray(trace.steps)) {
    return { passed: false, score: 0, reasons: ['malformed trace: steps is not an array'] };
  }

  const reasons: string[] = [];
  if (!trace.runId || !trace.promptId) reasons.push('missing runId or promptId');
  if (trace.steps.length === 0) reasons.push('empty trace');
  if (trace.steps.some((step) => step.error)) reasons.push('tool error surfaced in trace');
  if (trace.steps.some((step) => (step.durationMs ?? 0) > 10_000)) reasons.push('step over 10s');

  // Order matters: a policy lookup only counts if it happened before the first credit call.
  const firstCredit = trace.steps.findIndex((step) => step.toolName === 'issueCredit');
  if (firstCredit !== -1) {
    const checkedPolicyFirst = trace.steps
      .slice(0, firstCredit)
      .some((step) => step.toolName === 'lookupRefundPolicy');
    if (!checkedPolicyFirst) reasons.push('credit issued without prior policy lookup');
  }

  const final = trace.finalAnswer ?? '';
  if (final.length < 20) reasons.push('final answer too short to audit');
  // Word boundaries avoid false positives such as "nevertheless".
  if (/\b(guaranteed|always|never)\b/i.test(final)) reasons.push('absolute language requires review');

  // Each failed rule costs a fixed penalty; the score is clamped at zero.
  const score = Math.max(0, 1 - reasons.length * 0.22);
  return { passed: reasons.length === 0, score, reasons };
}

This evaluator catches missing identifiers, empty traces, tool errors, slow steps, policy-order violations, weak final answers, and risky absolute language. The edge case is the credit-without-policy rule: the final answer might be polite, but the trace proves the workflow skipped a required control. In CI, store the failing trace as an artifact. In production, sample traces by prompt version, model version, customer segment, and tool path so drift is diagnosable.
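To turn per-trace evaluations into a release decision, a CI script needs a single exit code. The following is a sketch under assumptions: the releaseGate function and GateResult type are hypothetical names, and the Evaluation shape is repeated here so the example stands alone.

```typescript
type Evaluation = { passed: boolean; score: number; reasons: string[] };

type GateResult = {
  exitCode: 0 | 1;
  failed: Array<{ runId: string; reasons: string[] }>;
};

// Collapses many trace evaluations into one CI verdict.
function releaseGate(results: Array<{ runId: string; evaluation: Evaluation }>): GateResult {
  const failed = results
    .filter((result) => !result.evaluation.passed)
    .map((result) => ({ runId: result.runId, reasons: result.evaluation.reasons }));

  // Fail closed: an empty result set means no evidence, which is not a pass.
  const exitCode = results.length === 0 || failed.length > 0 ? 1 : 0;
  return { exitCode, failed };
}
```

In a Node-based CI step, the result could be applied with process.exitCode = releaseGate(results).exitCode after printing the failed entries, so the job fails with explicit reasons attached.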

Troubleshooting and Debugging Agent Test Failures

Debugging agent tests is slower when failures are collapsed into one message like "answer did not match expected". Preserve the chain. A useful failure report includes prompt version, model version, fixture ID, trace ID, tool arguments after redaction, tool response status, retry count, evaluator reasons, and the final answer. Without those fields, engineers cannot tell whether they have a prompt issue, a tool contract issue, a test fixture issue, or an actual product defect.
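The report fields listed above map naturally onto a typed record. This is a sketch only: the FailureReport type and formatFailure helper are hypothetical, and field names should follow whatever your trace pipeline already emits.

```typescript
type FailureReport = {
  promptVersion: string;
  modelVersion: string;
  fixtureId: string;
  traceId: string;
  toolArguments: Record<string, unknown>; // already redacted
  toolResponseStatus: number | null;      // null when no tool responded
  retryCount: number;
  evaluatorReasons: string[];
  finalAnswer: string;
};

// Renders one failure as a multi-line block suitable for CI logs.
function formatFailure(report: FailureReport): string {
  return [
    'trace=' + report.traceId + ' fixture=' + report.fixtureId,
    'prompt=' + report.promptVersion + ' model=' + report.modelVersion,
    'tool status=' + (report.toolResponseStatus === null ? 'none' : String(report.toolResponseStatus)) +
      ' retries=' + report.retryCount,
    'tool arguments=' + JSON.stringify(report.toolArguments),
    'reasons:',
    ...report.evaluatorReasons.map((reason) => '  - ' + reason),
    'final answer: ' + report.finalAnswer,
  ].join('\n');
}
```

Making every field required in the type means a report cannot be built with a missing version or trace ID, which is the information engineers most often lack when triaging.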

If the final answer is wrong but the trace is correct, inspect response formatting instructions and final-answer synthesis.
If the trace uses the wrong tool, inspect tool descriptions, routing examples, and whether similar tools have overlapping names.
If the agent skips a required lookup, add a deterministic invariant and move the rule closer to execution, not only into the prompt.
If failures appear only in CI, compare model version, timeout budget, seeded fixtures, network mocks, and parallel test isolation.
If retries cause duplicate actions, require idempotency keys and assert them in tests before enabling automatic retry behavior.
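The idempotency-key advice in the last item can be illustrated with a small in-memory executor. This is a sketch, not a production design: the IdempotentExecutor class is hypothetical, and a real system would keep completed keys in a durable store shared across workers.

```typescript
type ToolResult = { status: 'executed' | 'replayed'; value: string };

// In-memory illustration only; state is lost on restart and not shared between processes.
class IdempotentExecutor {
  private readonly completed = new Map<string, string>();

  execute(idempotencyKey: string, action: () => string): ToolResult {
    if (!idempotencyKey) {
      throw new Error('state-changing tools require an idempotency key');
    }
    const previous = this.completed.get(idempotencyKey);
    if (previous !== undefined) {
      // A retry with the same key returns the stored result and does not run the action again.
      return { status: 'replayed', value: previous };
    }
    const value = action();
    this.completed.set(idempotencyKey, value);
    return { status: 'executed', value };
  }
}
```

A test can then retry the same call and assert that the side effect happened once, which is the property to verify before automatic retries are enabled.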

Common diagnostic shortcut: replay the same trace with tool execution disabled. If the plan is bad before tools run, you have a planning or prompt issue. If the plan is good but execution fails, debug contracts, auth, timeouts, and downstream state.

How Should QA Teams Decide What to Automate First?

Automate high-risk invariants first: money movement, policy boundaries, tenant isolation, destructive actions, and escalation paths.

Start where the blast radius is largest. A content summarizer can tolerate more semantic variation than an agent that changes account state. For each workflow, identify irreversible actions, regulated claims, customer-visible commitments, and places where a tool failure could be hidden by confident language. Those become the first automated gates. Lower-risk quality attributes, such as tone and helpfulness, can be evaluated later with sampled reviews or semantic scorers.
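As a rough illustration of that prioritization, the four risk signals can be counted per workflow and used to order the automation backlog. The Workflow shape and the equal weighting are assumptions; many teams would weight irreversible actions more heavily.

```typescript
type Workflow = {
  name: string;
  irreversibleAction: boolean;
  regulatedClaim: boolean;
  customerCommitment: boolean;
  canHideToolFailure: boolean;
};

// Counts how many risk signals apply (0 to 4). Equal weights are an illustrative simplification.
function riskScore(workflow: Workflow): number {
  return [
    workflow.irreversibleAction,
    workflow.regulatedClaim,
    workflow.customerCommitment,
    workflow.canHideToolFailure,
  ].filter(Boolean).length;
}

// Returns workflow names with the highest-risk workflow first, without mutating the input.
function automationOrder(workflows: Workflow[]): string[] {
  return workflows
    .slice()
    .sort((a, b) => riskScore(b) - riskScore(a))
    .map((workflow) => workflow.name);
}
```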

Keep the suite layered. Unit-test pure validators. Contract-test tool inputs and outputs. API-test agent traces. Browser-test the human approval flow. Monitor sampled production traces. This avoids a giant end-to-end suite that is expensive, slow, and hard to debug. The best agent automation looks less like one magical test and more like a quality mesh around every place non-determinism touches deterministic business rules.

Implementation Gotchas That Cause False Confidence

The first gotcha is using golden transcripts as the main oracle. They are useful for review, but brittle as release gates. The second is letting the model decide whether a decision is safe after it already selected a tool. Safety checks should sit in application code, close to execution. The third is hiding tool errors from the trace because they make demos look polished. Hidden errors become production mysteries.
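A safety check that sits in application code can be as simple as a pure guard function called immediately before tool execution. The sketch below is illustrative: ToolRequest, GuardPolicy, and guardToolCall are hypothetical names, and the thresholds are placeholders.

```typescript
type ToolRequest = { name: string; amount?: number; confirmedByUser: boolean };

type GuardPolicy = {
  allowedTools: string[];      // anything else is rejected
  needsConfirmation: string[]; // tools that require explicit user confirmation
  maxAutoAmount: number;       // upper bound for automatic monetary actions
};

type GuardDecision = { allow: true } | { allow: false; reason: string };

// Runs in application code right before execution; the model's own opinion is not consulted.
function guardToolCall(request: ToolRequest, policy: GuardPolicy): GuardDecision {
  if (policy.allowedTools.indexOf(request.name) === -1) {
    return { allow: false, reason: 'tool not on allowlist' };
  }
  if (policy.needsConfirmation.indexOf(request.name) !== -1 && !request.confirmedByUser) {
    return { allow: false, reason: 'confirmation required' };
  }
  if ((request.amount ?? 0) > policy.maxAutoAmount) {
    return { allow: false, reason: 'amount exceeds automatic limit' };
  }
  return { allow: true };
}
```

Because the guard is deterministic, it can be unit-tested exhaustively, and a prompt injection that persuades the model still cannot get a blocked call executed.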

Another gotcha is evaluating only happy paths. Agent bugs often show up when context is missing, tools disagree, or the user asks for something adjacent to policy. Also watch for cross-tenant retrieval, stale vector indexes, prompt injections inside retrieved documents, and parallel tests sharing the same account. These are familiar QA problems with a new surface area. Treat them with the same discipline you use for data isolation, network mocking, and permission testing.

A Practical Release Checklist

Every executable model decision has a schema and at least one business invariant.
Every state-changing tool requires authorization, idempotency, and a correlation ID.
Every critical workflow has trace assertions for required and forbidden actions.
Tool timeout, malformed output, and rate-limit paths are covered by negative tests.
Prompt, model, retrieval index, and tool versions are stored with each trace.
CI stores failing traces as artifacts and production sampling reports evaluator reasons.

Agentic AI systems can be tested, but not by pretending they are deterministic web forms. The reliable approach is to separate what may vary from what must never vary. Let language vary. Do not let policy, authorization, tool contracts, tenant boundaries, or failure handling vary. That is the quality engineering line that turns impressive demos into systems a QA team can stand behind.
