Prompts are Dead: Skill-Based Test Agents for Reliable CI

Vibe coding gets you moving. You paste a prompt, ask the model to add a test, and maybe it opens Playwright or writes a Cypress spec that passes locally. That is a useful start. It is not a professional testing system. Professional CI does not reward clever one-off prompting. It rewards repeatability, observability, and boring recovery paths. That is why prompts are becoming the weakest part of many AI testing workflows. The moat is moving to skills: versioned, testable operating procedures that tell an agent what to do, what tools it may use, what evidence it must collect, and how it should fail.

The shift matters because AI is already inside developer workflows. The Stack Overflow 2024 Developer Survey reported that 76% of respondents were using or planning to use AI tools in development. Google Cloud's 2024 DORA report also found broad AI adoption among technology professionals and cautioned that AI effects vary by capability and practice. The practical reading is simple: adding AI is not enough. You need to engineer the workflow around it. If you are moving from solo experiments to CI that protects real users, this is where the upgrade begins.

What Is a Skill in an AI Test Agent?

A prompt is a request. A skill is a versioned operating procedure with tools, checks, fallbacks, and CI-visible state that survives model swaps.

A prompt says, "write tests for checkout." A skill says, "given a branch, base URL, and test scope, inspect changed files, choose the smallest relevant test surface, run the approved command, capture traces on failure, classify known edge cases, and return a JSON summary with artifacts." The second form is longer, but it has shape. That shape is what lets teams review it, version it, run it in CI, and improve it without hoping the model remembers tribal knowledge. For a broader migration mindset, see our flaky distributed test environments deep dive.

Skills are especially powerful for testing because test work is full of local context: fixture names, timeout policy, browser matrix, secrets policy, service readiness checks, trace retention, quarantine rules, and release-blocking thresholds. A raw prompt tends to flatten those details into vague instructions. A skill preserves them as executable constraints.
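
As a minimal sketch of what "executable constraints" can look like, the object below packages a few of those details into a contract a script can check. The field names (name, command, scope, requiresEnv, changedFilesPath, timeoutMs) mirror the ones read by the gate script in Example 1. The values, including the manifest path, are illustrative assumptions, not a standard schema.

```javascript
// Hypothetical skill contract: field names follow the Example 1 gate script,
// values are placeholders for illustration only.
const checkoutSkill = {
  name: 'checkout-smoke',
  scope: 'tests/checkout-agent.spec.ts',
  command: 'npx playwright test',
  requiresEnv: ['BASE_URL', 'TEST_BUYER_EMAIL'],
  changedFilesPath: '.artifacts/changed-files.txt',
  timeoutMs: 600000,
};

// Return the names of required string fields that are missing or mistyped.
function missingFields(skill, required = ['name', 'command', 'scope']) {
  return required.filter(key => typeof skill[key] !== 'string' || skill[key] === '');
}

console.log(missingFields(checkoutSkill));              // []
console.log(missingFields({ name: 'checkout-smoke' })); // [ 'command', 'scope' ]
```

Because the contract is plain data, it can be diffed in a pull request and checked by a script before any agent reads it.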

Prompt-only agent	Skill-based test agent
Relies on natural-language memory	Loads a versioned procedure from the repo
May choose any command	Uses an allowlisted command plan
Explains failure after the fact	Captures traces, logs, and structured failure classes
Hard to review in PRs	Reviewed like source code with diffs and tests

Why Do Prompts Break Down in CI?

Because CI rewards repeatability: skills pin context, commands, schemas, and recovery steps so agents fail loudly and fixably.

CI is a hostile environment for vague automation. The filesystem is fresh. Browser caches are empty. Ports collide. Secrets may be masked. A backend may need 20 seconds to warm up. A test can fail because the app is broken, because the fixture is stale, because the selector drifted, or because the agent chose a command that never belonged in the pipeline. A prompt can describe those hazards, but it cannot enforce them. A skill can.

Inputs are explicit: branch, base URL, test scope, retry budget, and artifact directory.
Allowed tools are explicit: for example, Playwright test, curl health checks, git diff, and a JUnit parser.
Outputs are explicit: pass/fail, failure class, evidence paths, and next action.
Edge cases are explicit: missing secrets, absent browser binaries, no changed test files, and flaky upstream dependencies.
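
One way to make the "outputs are explicit" rule enforceable is a small validator that a CI step runs on the agent's result before trusting it. This is a sketch under assumptions: the field names (ok, failureClass, evidencePaths, nextAction) are hypothetical and should match whatever schema your skill defines.

```javascript
// Hypothetical output schema check; the field names are assumptions for illustration.
function validateSkillResult(result) {
  if (result === null || typeof result !== 'object') return ['result must be an object'];
  const problems = [];
  if (typeof result.ok !== 'boolean') problems.push('ok must be a boolean');
  if (!Array.isArray(result.evidencePaths) || result.evidencePaths.length === 0) {
    problems.push('evidencePaths must list at least one artifact');
  }
  if (typeof result.nextAction !== 'string' || result.nextAction === '') {
    problems.push('nextAction must be a non-empty string');
  }
  // A failing run must say what kind of failure it was.
  if (result.ok === false && typeof result.failureClass !== 'string') {
    problems.push('failureClass is required when ok is false');
  }
  return problems;
}

const vague = { ok: false, evidencePaths: [], nextAction: 'rerun' };
console.log(validateSkillResult(vague));
```

An empty array means the result is well-formed. Anything else can fail the pipeline step with a message that names the missing field.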

Example 1: Gate a Test Skill Before the Agent Runs

The first production move is to stop treating the skill as a document only humans read. Validate it before the agent uses it. This Node script checks a skill contract, rejects unsafe commands, handles missing files, times out long-running tests, and emits structured output CI can archive. Run it with node scripts/run-test-skill.mjs skills/checkout-agent.json.

#!/usr/bin/env node
import { spawn } from 'node:child_process';
import { access, readFile, writeFile, mkdir } from 'node:fs/promises';
import path from 'node:path';

const skillPath = process.argv[2];
const artifactDir = process.env.ARTIFACT_DIR || '.artifacts/test-agent';
const allowedCommands = new Set(['npm test', 'npx playwright test', 'pnpm test:e2e']);

function fail(code, message, extra = {}) {
  console.error(JSON.stringify({ ok: false, message, ...extra }, null, 2));
  process.exit(code);
}

if (!skillPath) fail(2, 'Usage: node scripts/run-test-skill.mjs <skill.json>');

let skill;
try {
  skill = JSON.parse(await readFile(skillPath, 'utf8'));
} catch (error) {
  fail(2, 'Skill contract is missing or invalid JSON', { error: String(error.message || error) });
}

for (const key of ['name', 'command', 'scope']) {
  if (!skill[key] || typeof skill[key] !== 'string') fail(2, 'Skill contract missing required string field', { key });
}

if (!allowedCommands.has(skill.command)) {
  fail(2, 'Skill command is not allowlisted', { command: skill.command, allowed: [...allowedCommands] });
}

if (skill.requiresEnv) {
  for (const name of skill.requiresEnv) {
    if (!process.env[name]) fail(3, 'Required environment variable is not set', { name });
  }
}

if (skill.changedFilesPath) {
  try {
    await access(skill.changedFilesPath);
  } catch {
    fail(4, 'Changed-files manifest not found; run the diff step first', { changedFilesPath: skill.changedFilesPath });
  }
}

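await mkdir(artifactDir, { recursive: true });
// Allowlisted commands are plain space-separated strings, so a simple split is enough here.
const [bin, ...args] = skill.command.split(' ');
const child = spawn(bin, args, { stdio: ['ignore', 'pipe', 'pipe'], env: process.env });
const timeoutMs = Number(skill.timeoutMs) || 600000;
let stdout = '';
let stderr = '';
let timedOut = false;

// Stop the run when it exceeds its budget, and record that the timeout was the cause.
const timer = setTimeout(() => {
  timedOut = true;
  child.kill('SIGTERM');
}, timeoutMs);

child.stdout.on('data', chunk => {
  stdout += chunk;
});
child.stderr.on('data', chunk => {
  stderr += chunk;
});

// Resolve on 'close' (normal exit or signal) or on 'error' (for example, binary not found).
const exitCode = await new Promise(resolve => {
  child.on('error', error => {
    stderr += String(error.message || error) + '\n';
    resolve(null);
  });
  child.on('close', code => resolve(code));
});
clearTimeout(timer);

await writeFile(path.join(artifactDir, 'stdout.log'), stdout);
await writeFile(path.join(artifactDir, 'stderr.log'), stderr);
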
const result = {
  ok: exitCode === 0,
  skill: skill.name,
  scope: skill.scope,
  command: skill.command,
  exitCode,
  timedOut,
  artifacts: [path.join(artifactDir, 'stdout.log'), path.join(artifactDir, 'stderr.log')],
};

await writeFile(path.join(artifactDir, 'result.json'), JSON.stringify(result, null, 2));
if (!result.ok) fail(1, 'Skill command failed', result);
console.log(JSON.stringify(result, null, 2));

Notice the edge-case policy. A failing or timed-out test command exits with code 1, an invalid contract or non-allowlisted command with 2, a missing environment variable with 3, and a missing changed-files manifest with 4. That distinction prevents the agent from "fixing" application code when the actual problem is CI setup.
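
A CI step or orchestrating agent can turn that policy into a routing decision. The sketch below assumes the exit codes used by the script above (1, 2, 3, and 4). The nextStep labels are hypothetical names, not part of any tool.

```javascript
// Exit codes follow the gate script above; the nextStep labels are hypothetical.
const EXIT_POLICY = new Map([
  [0, { meaning: 'command passed', nextStep: 'none' }],
  [1, { meaning: 'test command failed or timed out', nextStep: 'run_triage' }],
  [2, { meaning: 'missing or invalid contract, or command not allowlisted', nextStep: 'fix_skill_contract' }],
  [3, { meaning: 'required environment variable missing', nextStep: 'fix_ci_environment' }],
  [4, { meaning: 'changed-files manifest missing', nextStep: 'run_diff_step' }],
]);

// Unknown codes go to a person instead of being guessed at.
function routeExit(code) {
  return EXIT_POLICY.get(code) || { meaning: 'unexpected exit code', nextStep: 'request_human_review' };
}

console.log(routeExit(3).nextStep);   // fix_ci_environment
console.log(routeExit(137).nextStep); // request_human_review
```

Only exit code 1 leads toward triage and a possible patch. Every other non-zero code points at the contract or the pipeline.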

Example 2: Give the Agent a Real Browser Contract

A durable test agent should not invent selectors while CI is red. Give it a browser contract that checks readiness, uses accessible locators, attaches evidence on failure, and handles the edge case where the test buyer account or payment provider is unavailable. Trace capture itself is a Playwright configuration setting (for example, trace: 'retain-on-failure' under use), not something this spec turns on. The example is runnable with npx playwright test tests/checkout-agent.spec.ts.

import { test, expect, request } from '@playwright/test';

const baseURL = process.env.BASE_URL || 'http://127.0.0.1:3000';
const buyerEmail = process.env.TEST_BUYER_EMAIL;

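test.beforeAll(async () => {
  // Confirm the app is up before any browser work, so a dead service reads as an environment failure.
  const api = await request.newContext();
  try {
    const res = await api.get(baseURL + '/api/health', { timeout: 15000 });
    if (!res.ok()) {
      throw new Error('Application health check failed with HTTP ' + res.status());
    }
  } finally {
    // Release the request context even when the health check throws.
    await api.dispose();
  }
});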

test('checkout skill completes a smoke purchase or reports a fixture problem', async ({ page }, testInfo) => {
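  // Fail loudly instead of skipping: a missing test account is a fixture problem, not a pass.
  if (!buyerEmail) {
    throw new Error('Fixture failure: TEST_BUYER_EMAIL is required for checkout smoke tests');
  }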

  await page.goto(baseURL + '/pricing', { waitUntil: 'networkidle' });
  await page.getByRole('button', { name: /start checkout/i }).click();
  await page.getByLabel(/email/i).fill(buyerEmail);

  const plan = page.getByRole('radio', { name: /pro monthly/i });
  await expect(plan).toBeVisible({ timeout: 10000 });
  await plan.check();

  const submit = page.getByRole('button', { name: /continue/i });
  await expect(submit).toBeEnabled();
  await submit.click();

  const paymentUnavailable = page.getByText(/payment provider unavailable/i);
  const confirmation = page.getByRole('heading', { name: /confirm your order/i });

  await Promise.race([
    confirmation.waitFor({ timeout: 20000 }),
    paymentUnavailable.waitFor({ timeout: 20000 }),
  ]);

  if (await paymentUnavailable.isVisible()) {
    await testInfo.attach('checkout-provider-state', {
      body: 'Payment provider unavailable in test environment',
      contentType: 'text/plain',
    });
    throw new Error('Fixture failure: payment provider unavailable');
  }

  await expect(confirmation).toBeVisible();
});

This is not just a test; it is a boundary for the agent. If the health check fails, the agent should report environment failure. If the provider fixture is down, it should report fixture failure. Only when the app reaches the confirmation path and assertions fail should it propose product-code changes.

Example 3: Triage CI Failures Before Asking the Model to Patch

The fastest bad agent is one that patches before it understands the failure. Add a triage step that reads test output and classifies the failure. The agent can then choose a skill: fix selector drift, refresh fixture data, investigate environment readiness, or ask for human input. Run this with node scripts/triage-junit.mjs test-results/junit.xml.

#!/usr/bin/env node
import { readFile, writeFile } from 'node:fs/promises';

const reportPath = process.argv[2];
if (!reportPath) {
  console.error('Usage: node scripts/triage-junit.mjs <junit.xml>');
  process.exit(2);
}

let xml;
try {
  xml = await readFile(reportPath, 'utf8');
} catch (error) {
  console.error(JSON.stringify({ ok: false, category: 'missing_report', error: String(error.message || error) }));
  process.exit(2);
}

// The first matching pattern wins, so order matters. These are text heuristics, not a parser.
const patterns = [
  { category: 'environment_readiness', regex: /ECONNREFUSED|health check failed|net::ERR_CONNECTION_REFUSED/i },
  { category: 'selector_drift', regex: /getByRole|getByLabel|strict mode violation|Timeout.*locator/i },
  // Word boundaries keep 401 and 403 from matching inside longer numbers such as durations.
  { category: 'fixture_or_secret', regex: /Required environment variable|fixture failure|\b(?:401|403)\b|secret/i },
  { category: 'product_regression', regex: /expect\(|AssertionError|toBeVisible|toHaveText/i },
];

if (!xml.includes('<testcase')) {
  const result = { ok: false, category: 'malformed_or_empty_report', nextAction: 'rerun_tests_with_reporter_enabled' };
  await writeFile('triage-result.json', JSON.stringify(result, null, 2));
  console.log(JSON.stringify(result, null, 2));
  process.exit(1);
}

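// Collect <failure> and <error> element bodies. This is a regex heuristic, not an XML parser.
const failures = (xml.match(/<failure\b[\s\S]*?<\/failure>/g) || []).join('\n');
const errors = (xml.match(/<error\b[\s\S]*?<\/error>/g) || []).join('\n');
const text = failures + '\n' + errors;

// Test cases with no failure or error elements mean a passing run: nothing to triage.
if (!/<(?:failure|error)\b/.test(xml)) {
  const result = { ok: true, category: 'no_failures', nextAction: 'none' };
  await writeFile('triage-result.json', JSON.stringify(result, null, 2));
  console.log(JSON.stringify(result, null, 2));
  process.exit(0);
}

let category = 'unknown_failure';
for (const pattern of patterns) {
  if (pattern.regex.test(text)) {
    category = pattern.category;
    break;
  }
}

const result = {
  ok: false,
  category,
  nextAction: {
    environment_readiness: 'check_service_logs_before_code_changes',
    selector_drift: 'inspect_dom_snapshot_and_update_locator_contract',
    fixture_or_secret: 'repair_ci_fixture_or_secret_before_rerun',
    product_regression: 'open_targeted_patch_with_reproduction',
    unknown_failure: 'attach_trace_and_request_human_review',
  }[category],
};

await writeFile('triage-result.json', JSON.stringify(result, null, 2));
console.log(JSON.stringify(result, null, 2));
// Exit non-zero only when the failure could not be classified.
process.exit(category === 'unknown_failure' ? 1 : 0);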
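
Once triage-result.json exists, choosing the follow-up skill can be a lookup rather than a model decision. The sketch below assumes the category names emitted by the triage script. The skill file names are hypothetical placeholders.

```javascript
// Hypothetical skill file names; the categories come from the triage script's output.
const SKILL_BY_CATEGORY = new Map([
  ['selector_drift', 'skills/fix-selector-drift.json'],
  ['fixture_or_secret', 'skills/refresh-fixture-data.json'],
  ['environment_readiness', 'skills/investigate-environment.json'],
  ['product_regression', 'skills/targeted-patch.json'],
]);

// Decide the follow-up for a parsed triage-result.json object.
function chooseNextSkill(triage) {
  if (triage && triage.ok === true) return { skill: null, needsHuman: false };
  const skill = (triage && SKILL_BY_CATEGORY.get(triage.category)) || null;
  // Anything unrecognized goes to a person rather than to a guess.
  return { skill, needsHuman: skill === null };
}

console.log(chooseNextSkill({ ok: false, category: 'selector_drift' }));
console.log(chooseNextSkill({ ok: false, category: 'unknown_failure' }));
```

Keeping the mapping in the repo means a new failure category needs a reviewed change before the agent can act on it.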

Design Principles for Durable Test Skills

The best skills are small enough to review and specific enough to constrain action. A checkout smoke skill should not also own visual regression, load testing, dependency upgrades, and release notes. Split skills by decision boundary. If the agent needs a different mental model, it probably needs a different skill. You can connect these skills to professional QA workflows like the ones in our Vibe QA assessment toolkit.

Pin the entry point. Use one command or script per skill so CI can reproduce the same behavior.
Make outputs structured. JSON beats prose for gates, dashboards, and downstream agents.
Separate diagnosis from patching. Triage first, then choose whether code changes are justified.
Prefer artifact paths over summaries. Traces, screenshots, logs, and JUnit files survive model context limits.
Write refusal rules. The skill should say when the agent must not edit code, such as missing secrets or broken deploys.
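
Refusal rules are easiest to trust when they are code the pipeline evaluates before any edit is allowed. The following is a minimal sketch: the context fields (missingSecrets, deployHealthy, artifactPaths) and the reason strings are hypothetical, and the evidence rule is an added assumption in the spirit of the artifact principle above.

```javascript
// Hypothetical refusal rules; each returns a reason string when the agent must not edit code.
const REFUSAL_RULES = [
  ctx => (ctx.missingSecrets.length > 0 ? 'missing_secrets: ' + ctx.missingSecrets.join(', ') : null),
  ctx => (ctx.deployHealthy ? null : 'deploy_unhealthy'),
  ctx => (ctx.artifactPaths.length === 0 ? 'no_evidence_attached' : null),
];

// Collect every rule that fires; an empty list means code edits may proceed.
function refusalReasons(ctx) {
  return REFUSAL_RULES.map(rule => rule(ctx)).filter(reason => reason !== null);
}

const ctx = { missingSecrets: ['TEST_BUYER_EMAIL'], deployHealthy: true, artifactPaths: ['trace.zip'] };
console.log(refusalReasons(ctx)); // [ 'missing_secrets: TEST_BUYER_EMAIL' ]
```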

Troubleshooting Skill-Based Test Agents

When a skill-based agent fails, debug the contract before blaming the model. Most failures come from ambiguous inputs, missing artifacts, overbroad permissions, or CI state the skill did not model. Here is the practical checklist.

The agent edits too much. Narrow the skill scope and add a changed-files filter before patching is allowed.
The agent reruns forever. Add a retry budget and require a new hypothesis before every rerun.
Failures are vague. Require JUnit, trace, screenshot, stdout, stderr, and a failure category in the output schema.
Local passes but CI fails. Compare browser versions, secrets, base URL, service readiness, and network policy.
The model ignores instructions. Move the instruction into executable validation. Prompts guide; checks enforce.
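
The retry-budget item in the checklist above can be enforced with a few lines of state. This sketch is one possible shape, not a prescribed one: createRetryBudget and its reason strings are hypothetical names, and a "new hypothesis" is simplified to a string that has not been tried before.

```javascript
// Hypothetical retry budget: a rerun needs remaining budget and an untried hypothesis.
function createRetryBudget(maxReruns) {
  const tried = new Set();
  let used = 0;
  return {
    // Returns { allowed, reason } for a proposed rerun.
    request(hypothesis) {
      const key = String(hypothesis || '').trim().toLowerCase();
      if (!key) return { allowed: false, reason: 'hypothesis_required' };
      if (tried.has(key)) return { allowed: false, reason: 'hypothesis_already_tried' };
      if (used >= maxReruns) return { allowed: false, reason: 'retry_budget_exhausted' };
      tried.add(key);
      used += 1;
      return { allowed: true, reason: 'ok' };
    },
    remaining: () => maxReruns - used,
  };
}

const budget = createRetryBudget(2);
console.log(budget.request('selector changed after redesign').allowed); // true
console.log(budget.request('selector changed after redesign').reason);  // hypothesis_already_tried
```

The wrapper script, not the model, holds the counter, so the limit cannot be argued away in a prompt.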

Gotchas matter. Do not let the agent install packages in CI unless the skill explicitly allows it. Do not let it update snapshots without attaching before-and-after evidence. Do not classify every timeout as flake; timeouts can be accessibility regressions, slow API calls, missing waits, or dead services. And do not hide skipped tests. A skill that silently skips because an env var is missing is worse than no skill at all.
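
To keep skipped tests visible, a gate can count them in the report and fail when the number exceeds what the skill allows. The sketch assumes a JUnit-style report that marks skipped cases with a skipped element, as such reports commonly do. The function names and the default of zero allowed skips are illustrative choices.

```javascript
// Count <skipped> elements in a JUnit-style XML string (regex heuristic, not a parser).
function countSkipped(junitXml) {
  return (junitXml.match(/<skipped\b/g) || []).length;
}

// Fail the gate when more tests were skipped than the skill allows.
function skipGate(junitXml, allowedSkips = 0) {
  const skipped = countSkipped(junitXml);
  return { ok: skipped <= allowedSkips, skipped, allowedSkips };
}

const sample = [
  '<testsuite name="checkout" tests="2" skipped="1">',
  '  <testcase name="smoke purchase"><skipped/></testcase>',
  '  <testcase name="pricing page loads"/>',
  '</testsuite>',
].join('\n');

console.log(skipGate(sample)); // { ok: false, skipped: 1, allowedSkips: 0 }
```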

The Migration Path: From Vibe Prompt to Professional Skill

You do not need to rebuild your stack overnight. Start with the prompt you already use. Highlight every sentence that describes a repeatable rule. Turn those rules into a skill file and a wrapper script. Add one structured output. Add one artifact. Add one CI gate. Then review the skill like production code. Over time, the prompt becomes smaller because the process lives in the repo. That is the level-up moment: the agent stops being a clever autocomplete session and becomes a teammate operating inside constraints you can trust.

The teams that win with AI testing will not be the teams with the longest prompts. They will be the teams with the clearest procedures, the best evidence capture, and the fastest feedback loops. Prompts are easy to copy. Skills are harder to copy because they encode your product, your CI, your test philosophy, and your recovery playbook. That is the new moat.

FAQs

Are prompts still useful for test agents? Yes, but treat prompts as the interface, not the system. Durable agents need versioned skills, deterministic tools, logs, and CI checks around that text.

What should go into a testing skill? Include the goal, required inputs, allowed tools, commands, edge cases, failure handling, output schema, and examples of good and bad agent behavior.

Can skills work with Playwright or Cypress? Yes. A skill can wrap Playwright, Cypress, Selenium, API checks, fixtures, trace capture, and triage scripts without changing the test runner itself.

How do I know a skill is production ready? It is production ready when it runs in CI, validates inputs, records artifacts, handles missing secrets, and fails with instructions a teammate can act on.
