May 14, 2026

Playwright vs. Selenium in Headless CI: Memory Leaks, CPU Spikes, and What Actually Breaks

When your suite melts only on CI, the real problem is rarely the framework name alone.

Comparing Playwright and Selenium behavior under constrained headless CI runners

Headless CI is where browser automation frameworks stop being marketing pages and start becoming operating-system problems. The suite that feels fine on a laptop suddenly pegs the runner at 100% CPU, Chrome tabs crash, Selenium sessions hang in teardown, and the same pull request goes green only after a rerun.

When teams ask whether Playwright or Selenium is "faster," they usually compress three separate questions into one: which tool isolates state more cheaply, which tool gives better visibility when the browser starts thrashing, and which tool makes it harder to accidentally keep dead sessions alive. Those are different engineering concerns, and CI punishes each of them differently.

This guide compares Playwright and Selenium specifically through the lens of headless CI environments: shared Linux runners, Docker-based jobs, parallel workers, and suites that are large enough to accumulate memory pressure over time. The goal is not framework fandom. The goal is helping you identify what is actually causing the spike, the leak, or the crash.

Stack Overflow's 2024 Developer Survey used 65,437 qualified responses from 185 countries. In its professional developer section, 56.3% of respondents said automated testing exists at their company, while 31.5% said reliability of tools and systems is one of their biggest work frustrations. That is the real backdrop for CI performance work: test automation is mainstream, but dependable tooling is still not.

What actually creates CPU spikes and memory leaks in headless CI?

Headless CI packs browser sessions onto fewer shared cores, so weak cleanup, busy waits, and orphaned processes accumulate faster than they do on a local machine.

Browser automation jobs burn CPU and memory in four places at once: the test runner process, the browser parent process, one or more renderer processes, and whatever service container the application under test depends on. A leak in any one of those layers can look like a framework problem if you only measure the final job duration.

  • CPU spikes usually come from aggressive polling, too many parallel workers, video or trace capture on every run, or repeated DOM queries against a page that never becomes actionable.
  • Browser memory leaks come from contexts, pages, tabs, service workers, or app-side JavaScript objects that never get released between scenarios.
  • Runner memory leaks often come from test-side caches, retained traces, screenshots, HARs, or child processes that survive after the test has already failed.
  • False framework blame happens when the app itself leaks memory, but one framework keeps the session alive long enough for you to notice it sooner.
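
If you want to see those four layers as numbers instead of a hunch, a one-shot snapshot is enough to start. The sketch below assumes a Linux runner with ps available; the script path and the layer-matching patterns are illustrative, not anything Playwright or Selenium ships.

// scripts/ci-layer-snapshot.mjs
import { execFileSync } from 'node:child_process';

// Sample every process once: CPU percentage, resident memory, full command line.
const rows = execFileSync('ps', ['-eo', '%cpu=,rss=,args='], { encoding: 'utf8' })
  .trim()
  .split('\n')
  .map((line) => {
    const match = line.trim().match(/^(\S+)\s+(\S+)\s+(.*)$/);
    if (!match) return null; // Edge case: unparsable ps line.
    return { cpu: Number(match[1]), rssMb: Number(match[2]) / 1024, args: match[3] };
  })
  .filter((row) => row !== null);

// Heuristic bucketing into the four layers discussed above; adapt to your stack.
function layerOf(args) {
  if (/--type=renderer/.test(args)) return 'browser renderer';
  if (/chrome|chromium|firefox|webkit|headless_shell/i.test(args)) return 'browser parent/helper';
  if (/node|playwright|selenium|java/i.test(args)) return 'test runner';
  return 'other (app, services)';
}

const totals = {};
for (const row of rows) {
  const layer = layerOf(row.args);
  totals[layer] ??= { processes: 0, cpu: 0, rssMb: 0 };
  totals[layer].processes += 1;
  totals[layer].cpu += row.cpu;
  totals[layer].rssMb += row.rssMb;
}

console.table(
  Object.entries(totals).map(([layer, t]) => ({
    layer,
    processes: t.processes,
    cpuPercent: Math.round(t.cpu * 10) / 10,
    rssMb: Math.round(t.rssMb),
  })),
);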

In other words, Playwright and Selenium are not just competing APIs. They are different ways of driving and isolating browsers under pressure, and those differences change how quickly damage accumulates when the job is under load.

Playwright vs. Selenium architecture under CI load

The biggest operational difference is that Selenium standardizes browser control around the W3C WebDriver model, while Playwright is built around its own automation stack and test runner conventions. Selenium's strength is portability. Playwright's strength is cheap, built-in isolation and tighter tooling around modern browser automation.

| Dimension | Playwright | Selenium |
| --- | --- | --- |
| Primary control model | Framework-managed browser automation stack | W3C WebDriver standard across browsers |
| Default isolation story | Browser contexts are cheap and intended for per-test isolation | Isolation usually means new sessions or stricter test design discipline |
| Waiting behavior | Strong built-in actionability and retry model | More explicit control, easier to build slow or conflicting waits |
| Leak visibility | Trace tooling and recent requestGC support help isolate browser-side growth | Grid tracing and logs are excellent, but teams must wire them in deliberately |
| Common CI failure mode | Too many workers, retained contexts, or headless-channel assumptions | Session churn, wait inflation, Grid latency, or driver teardown drift |

That table explains why the same broken app can look healthier in one framework than the other. Playwright tends to surface cross-test state pollution earlier because it nudges you toward fresh contexts. Selenium tends to surface protocol, infra, and session-lifecycle costs earlier because a remote-driver architecture makes those layers impossible to ignore.

Why does Playwright usually recover faster from state pollution?

Playwright's cheap browser contexts reset cookies and storage per test, limiting cross-test contamination without forcing a full browser relaunch each time.

Playwright's isolation documentation is explicit: tests run in isolated clean-slate environments called browser contexts, and each test gets its own storage, cookies, and session state. That matters because state pollution is one of the easiest ways to misread a memory leak. If every test reuses the same authenticated session, the browser may keep background resources, caches, or app-level subscriptions alive for much longer than you realize.
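
A minimal sketch of what that looks like in @playwright/test, where each test receives a page from a brand-new context (example.com stands in for your app):

// tests/isolation-smoke.spec.ts
import { test, expect } from '@playwright/test';

test('writes session state', async ({ page }) => {
  await page.goto('https://example.com');
  // Simulate an app that persists state client-side.
  await page.evaluate(() => localStorage.setItem('cart', 'full'));
});

test('starts from a clean slate', async ({ page }) => {
  await page.goto('https://example.com');
  // Fresh context: the previous test's storage never made it here.
  expect(await page.evaluate(() => localStorage.getItem('cart'))).toBeNull();
});

If the second test fails in your own suite, the shared state lives outside the browser context, usually on the backend or in seeded data.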

Playwright also gives you a useful modern clue for leak hunting: the release notes mention that page.requestGC() may help detect memory leaks. It is not a silver bullet, but it gives you a controlled way to ask, "If I force collection now, does memory settle down, or am I retaining something real?"

// scripts/playwright-ci-leak-smoke.ts
import { chromium } from 'playwright';

type Sample = { run: number; heapUsedMb: number | null };

async function readHeapMb(page: import('playwright').Page): Promise<number | null> {
  return page.evaluate(() => {
    const memory = (performance as Performance & {
      memory?: { usedJSHeapSize: number };
    }).memory;

    if (!memory) return null; // Edge case: non-Chromium engines do not expose this API.
    return Math.round((memory.usedJSHeapSize / 1024 / 1024) * 100) / 100;
  });
}

async function main() {
  const browser = await chromium.launch({
    headless: true,
    args: ['--disable-dev-shm-usage'],
  });

  const samples: Sample[] = [];

  try {
    for (let run = 1; run <= 15; run += 1) {
      const context = await browser.newContext({ serviceWorkers: 'block' });
      const page = await context.newPage();

      try {
        await page.goto(process.env.TARGET_URL ?? 'http://127.0.0.1:3000/dashboard', {
          waitUntil: 'domcontentloaded',
          timeout: 30_000,
        });

        await page.getByRole('button', { name: 'Open analytics' }).click();
        await page.getByTestId('report-filter').selectOption('last-30-days');
        await page.getByRole('button', { name: 'Refresh' }).click();
        await page.waitForLoadState('networkidle');

        const maybeRequestGC = (page as import('playwright').Page & {
          requestGC?: () => Promise<void>;
        }).requestGC;

        if (typeof maybeRequestGC === 'function') {
          await maybeRequestGC.call(page);
        }

        const heapUsedMb = await readHeapMb(page);
        samples.push({ run, heapUsedMb });

        if (heapUsedMb !== null && heapUsedMb > 180) {
          throw new Error(
            `Renderer heap exceeded threshold on run ${run}: ${heapUsedMb} MB. Check retained charts, sockets, or listeners.`,
          );
        }
      } catch (error) {
        await page.screenshot({ path: `artifacts/playwright-leak-run-${run}.png`, fullPage: true }).catch(() => {
          // Ignore screenshot errors when the page has already crashed.
        });
        throw error;
      } finally {
        await context.close().catch((closeError) => {
          console.error(`Failed to close context on run ${run}:`, closeError);
        });
      }
    }

    console.table(samples);
  } finally {
    await browser.close().catch((closeError) => {
      console.error('Failed to close browser cleanly:', closeError);
      process.exitCode = 1;
    });
  }
}

main().catch((error) => {
  console.error('Playwright leak smoke test failed:', error);
  process.exit(1);
});

What breaks here in real life? Older Playwright versions may not expose requestGC. Non-Chromium engines may not expose performance.memory. And if your app leaks WebSocket subscriptions or chart instances, the browser may survive several runs before the heap finally crosses a threshold. That is exactly why repeating a realistic workflow in fresh contexts is more useful than timing a single happy-path test.

Where Selenium still wins, and why teams keep it anyway

Selenium survives because it solves a different class of organizational problem. Its WebDriver model is standardized, the ecosystem is huge, browser coverage expectations are mature, and enterprises already know how to operate remote sessions, vendor clouds, and Grid-based routing. If your problem is broad compatibility or existing investment, Selenium remains rational.

The tradeoff is that Selenium makes session lifecycle mistakes easier to accumulate. A slow remote session creation, a dangling driver after a failed assertion, or a mixed implicit and explicit wait strategy can turn into CPU and memory waste long before the actual test logic is wrong. The framework is not hiding those costs from you; it is exposing them.

Selenium Grid's observability documentation is stronger than many teams realize. The Grid server is instrumented with OpenTelemetry tracing, and every request to the server is traced from start to end. If you are debugging CPU spikes across nodes, that visibility is not a side note. It is the difference between "Chrome was slow" and "session create calls piled up behind an overloaded distributor".
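
Tracing aside, even a cheap poll of Grid's standard /status endpoint tells you whether sessions are queuing behind busy slots. A minimal sketch, assuming a Grid 4 instance reachable at localhost:4444 (override with GRID_URL):

// scripts/grid-status-sample.mjs
const gridUrl = process.env.GRID_URL ?? 'http://localhost:4444';

const response = await fetch(`${gridUrl}/status`);
if (!response.ok) {
  console.error(`Grid status request failed: HTTP ${response.status}`);
  process.exit(1);
}

const { value } = await response.json();
console.log(`Grid ready: ${value.ready} (${value.message})`);

for (const node of value.nodes ?? []) {
  // A slot with a non-null session is currently occupied.
  const busy = (node.slots ?? []).filter((slot) => slot.session).length;
  console.log(`${node.uri}: ${busy}/${node.slots.length} slots busy, availability=${node.availability}`);
}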

This is also where Selenium can outperform sloppy Playwright usage. If your team already has disciplined session provisioning, tuned Grid capacity, and clear node-level telemetry, Selenium's extra protocol boundaries are not automatically a disadvantage. They become an advantage when you need to separate browser slowness from network slowness from node exhaustion. Teams that call Selenium "slow" often mean their surrounding platform is under-instrumented.

// scripts/selenium-ci-leak-smoke.mjs
import { Builder, By, until } from 'selenium-webdriver';
import chrome from 'selenium-webdriver/chrome.js';

async function readHeapMb(driver) {
  try {
    const value = await driver.executeScript(() => {
      const memory = performance.memory;
      return memory ? memory.usedJSHeapSize : null;
    });

    return value === null ? null : Math.round((value / 1024 / 1024) * 100) / 100;
  } catch (error) {
    // Edge case: some browsers or policies block performance.memory.
    console.warn('Heap sampling unavailable:', error.message);
    return null;
  }
}

async function runIteration(iteration) {
  const options = new chrome.Options()
    .addArguments('--headless=new')
    .addArguments('--disable-dev-shm-usage')
    .addArguments('--no-sandbox');

  const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();

  try {
    await driver.get(process.env.TARGET_URL ?? 'http://127.0.0.1:3000/dashboard');

    const refreshButton = await driver.wait(
      until.elementLocated(By.css('[data-testid="refresh-report"]')),
      20_000,
    );
    await driver.wait(until.elementIsVisible(refreshButton), 10_000);
    await refreshButton.click();

    const summary = await driver.wait(
      until.elementLocated(By.css('[data-testid="report-summary"]')),
      20_000,
    );
    await driver.wait(async () => {
      const text = await summary.getText();
      return text.includes('Last 30 days');
    }, 20_000, 'Summary never updated to expected range');

    const heapMb = await readHeapMb(driver);
    if (heapMb !== null && heapMb > 180) {
      throw new Error(`Heap threshold exceeded on iteration ${iteration}: ${heapMb} MB`);
    }

    return { iteration, heapMb };
  } catch (error) {
    const png = await driver.takeScreenshot().catch(() => null);
    if (png) {
      console.error(`Captured screenshot for failed iteration ${iteration} (base64 omitted).`);
    }
    throw error;
  } finally {
    await driver.quit().catch((quitError) => {
      console.error(`Driver quit failed on iteration ${iteration}:`, quitError);
    });
  }
}

const results = [];
for (let iteration = 1; iteration <= 10; iteration += 1) {
  try {
    results.push(await runIteration(iteration));
  } catch (error) {
    console.error(`Selenium smoke failed on iteration ${iteration}:`, error);
    process.exit(1);
  }
}

console.table(results);

Two practical gotchas matter here. First, do not combine long implicit waits with explicit waits unless you enjoy multiplying latency in ways that are hard to see from the test body. Second, do not trust a successful assertion to imply a healthy session lifecycle. If the driver never quits cleanly, the browser process can outlive the useful work and poison later jobs on the same runner.
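
The first gotcha is easy to demonstrate. A minimal sketch of the conflict, assuming a local Chrome and an illustrative selector:

// scripts/selenium-wait-conflict.mjs
import { Builder, By, until } from 'selenium-webdriver';

const driver = await new Builder().forBrowser('chrome').build();

try {
  // Anti-pattern: with a 10s implicit wait, every findElement poll inside the
  // explicit wait below can block for up to 10s before reporting "not found",
  // inflating the 5s ceiling far beyond what the test body suggests.
  await driver.manage().setTimeouts({ implicit: 10_000 });
  await driver.get(process.env.TARGET_URL ?? 'http://127.0.0.1:3000/dashboard');
  await driver.wait(until.elementLocated(By.css('#late-widget')), 5_000);
} finally {
  // Safer default: keep implicit waits at zero and rely on explicit waits alone.
  await driver.manage().setTimeouts({ implicit: 0 });
  await driver.quit();
}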

How do you benchmark the problem instead of arguing on vibes?

Measure the runner, the browser, and the test command together. Single-pass timings hide the cumulative failures that actually make CI expensive.

The right benchmark is not "framework A finished one smoke test first." It is "after N realistic iterations on a constrained runner, did RSS keep climbing, did CPU stay saturated, and did teardown restore the machine to a known baseline?" That is the benchmark that tells you whether your suite will survive a normal workday of pull requests.
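
One way to make "did RSS keep climbing" concrete is to fit a trend to per-iteration samples instead of comparing two points. A small sketch; the sample values and the 2 MB-per-iteration threshold are illustrative:

// scripts/rss-trend-check.mjs
// Least-squares slope over evenly spaced samples: sustained positive growth
// under identical work points at retention rather than noise.
function rssSlopeMbPerRun(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((sum, y) => sum + y, 0) / n;

  let covXY = 0;
  let varX = 0;
  samples.forEach((y, x) => {
    covXY += (x - meanX) * (y - meanY);
    varX += (x - meanX) ** 2;
  });

  return covXY / varX;
}

// Example input: RSS in MB sampled after each of ten identical iterations.
const samples = [412, 418, 431, 429, 447, 455, 470, 468, 483, 495];
const slope = rssSlopeMbPerRun(samples);

console.log(`RSS trend: ${slope.toFixed(2)} MB per iteration`);
if (slope > 2) {
  console.error('Sustained growth under identical work: suspect retention, not noise.');
  process.exit(1);
}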

Separate baseline noise from framework behavior. Run the application under test alone first and record its idle RSS and CPU. Then run the automation command without parallelism. Then add workers one step at a time. That progression tells you whether the spike belongs to the app, the browser automation layer, or simple oversubscription. Without that sequence, it is easy to blame Selenium for a slow API or blame Playwright for a React page that never stops repainting live charts.

// scripts/ci-process-watchdog.mjs
import { execFileSync, spawn } from 'node:child_process';

const command = process.argv[2];
const args = process.argv.slice(3);

if (!command) {
  console.error('Usage: node scripts/ci-process-watchdog.mjs <command> [...args]');
  process.exit(1);
}

function readProcessStats(pid) {
  try {
    const output = execFileSync('ps', ['-o', '%cpu=,rss=', '-p', String(pid)], {
      encoding: 'utf8',
    }).trim();

    if (!output) return null; // Edge case: process exited between samples.

    const [cpu, rssKb] = output.split(/\s+/);
    return {
      cpuPercent: Number(cpu),
      rssMb: Math.round((Number(rssKb) / 1024) * 100) / 100,
    };
  } catch {
    return null;
  }
}

const child = spawn(command, args, {
  stdio: 'inherit',
  shell: false,
  env: process.env,
});

let peakCpu = 0;
let peakRssMb = 0;

const interval = setInterval(() => {
  const stats = readProcessStats(child.pid);
  if (!stats) return;

  peakCpu = Math.max(peakCpu, stats.cpuPercent);
  peakRssMb = Math.max(peakRssMb, stats.rssMb);

  if (stats.rssMb > 1200) {
    console.error(`Memory threshold exceeded: ${stats.rssMb} MB RSS`);
    child.kill('SIGTERM');
  }
}, 1000);

child.on('exit', (code, signal) => {
  clearInterval(interval);
  console.log(`Peak CPU: ${peakCpu}%`);
  console.log(`Peak RSS: ${peakRssMb} MB`);

  if (signal) {
    console.error(`Child terminated by signal: ${signal}`);
    process.exit(1);
  }

  process.exit(code ?? 1);
});

Run that wrapper around both suites on the same runner class and keep the workload identical. For example, point both at the same seeded environment, fix worker counts, disable noisy extras like video unless you are explicitly testing them, and repeat enough iterations to let retained state show up. If one job looks clean only because it ran fewer scenarios or less instrumentation, you learned nothing.
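
As a usage sketch, a small driver script keeps the comparison honest by pushing both suites through the same watchdog. The suite commands and worker counts below are assumptions; substitute your own, but keep everything identical except the framework under test.

// scripts/ci-compare-suites.mjs
import { spawnSync } from 'node:child_process';

const suites = [
  ['npx', ['playwright', 'test', '--workers=2']],
  ['node', ['scripts/selenium-ci-leak-smoke.mjs']],
];

for (const [command, args] of suites) {
  console.log(`\n=== ${command} ${args.join(' ')} ===`);
  const result = spawnSync(
    'node',
    ['scripts/ci-process-watchdog.mjs', command, ...args],
    { stdio: 'inherit' },
  );

  if (result.status !== 0) {
    console.error(`Suite exited with status ${result.status}`);
    process.exitCode = 1;
  }
}

Keep in mind the watchdog samples only the wrapped process, so pair it with a process-tree snapshot if you also need browser-side numbers.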

Troubleshooting headless CI failures without lying to yourself

Most debugging stalls happen because teams inspect only the assertion failure and ignore the runner's resource story. That is how a CPU starvation issue gets mislabeled as flaky selectors for three weeks.

Symptom: CPU pins at the start of the job. Check: worker count, video or trace settings, antivirus or policy overhead in hosted runners, and whether the app boot itself is consuming the same cores your browser needs.

Symptom: Memory climbs only after several tests. Check: unclosed contexts or drivers, retained pages, background polling in the app, large screenshots or HAR retention, and any fixture that shares mutable state between tests.

Symptom: Selenium jobs hang after failure. Check: whether the driver quit path runs inside finally, whether Grid is still creating sessions after the test has already timed out, and whether implicit waits are masking a dead page.

Symptom: Playwright is stable locally but crashes on CI. Check: the browser channel. Playwright docs note that the headless Chromium shell and the newer real Chrome headless mode can behave differently, so a local headed or branded-browser pass may not reproduce a shell-specific issue.
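
For several of these symptoms, one cheap audit catches a whole class of causes: after the suite command exits, ask the runner whether any browser or driver processes survived it. A sketch assuming a Linux runner; the process-name patterns are assumptions to adapt:

// scripts/post-suite-audit.mjs
import { execFileSync } from 'node:child_process';

const patterns = /chrome|chromedriver|geckodriver|firefox|headless_shell/i;

const survivors = execFileSync('ps', ['-eo', 'pid=,etime=,args='], { encoding: 'utf8' })
  .trim()
  .split('\n')
  .filter((line) => patterns.test(line));

if (survivors.length > 0) {
  console.error('Browser or driver processes outlived the suite:');
  survivors.forEach((line) => console.error(`  ${line.trim()}`));
  process.exitCode = 1; // Fail loudly so orphaned sessions cannot hide.
} else {
  console.log('Runner returned to baseline: no surviving browser processes.');
}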

Also watch for Linux container edge cases that have nothing to do with test syntax: /dev/shm exhaustion, cgroup memory limits, and noisy-neighbor CPU contention on shared runners. These failures often manifest as browser crashes, detached targets, or apparently random timeout bursts. They are infra symptoms first and framework symptoms second.
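
A preflight that reads those limits before the suite starts turns "apparently random timeout bursts" into a named constraint. A sketch assuming cgroup v2 and a standard /dev/shm mount, both worth verifying on your runner image:

// scripts/container-preflight.mjs
import { readFileSync } from 'node:fs';
import { execFileSync } from 'node:child_process';

try {
  // cgroup v2 exposes the container memory ceiling here; "max" means unlimited.
  const limit = readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
  console.log(
    limit === 'max'
      ? 'No cgroup memory limit detected.'
      : `cgroup memory limit: ${Math.round(Number(limit) / 1024 / 1024)} MB`,
  );
} catch {
  console.warn('Could not read cgroup v2 memory limit (cgroup v1 host or no container).');
}

try {
  // Chromium leans on this shared-memory mount; a tiny one explains tab crashes.
  console.log(execFileSync('df', ['-h', '/dev/shm'], { encoding: 'utf8' }).trim());
} catch {
  console.warn('/dev/shm unavailable; expect crashes without --disable-dev-shm-usage.');
}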

One more edge case deserves explicit mention: headless browser mode is not one universal implementation. Playwright documents that its default Chromium headless shell differs from the newer real Chrome headless mode, and those differences can matter for extension support, rendering paths, and reproducibility. On the Selenium side, teams moving to --headless=new sometimes discover latent layout or timing changes they never saw in the old mode. If your CI only fails after a browser upgrade, check the headless engine first before rewriting selectors.
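
When in doubt, make the engine explicit. A Playwright-side sketch that launches both variants and prints what actually started; the chrome channel requires branded Chrome to be installed on the runner:

// scripts/headless-channel-probe.ts
import { chromium } from 'playwright';

for (const launchOptions of [
  { headless: true },                    // default Chromium headless shell
  { headless: true, channel: 'chrome' }, // branded Chrome's newer headless mode
]) {
  const browser = await chromium.launch(launchOptions);
  console.log(`${JSON.stringify(launchOptions)} -> ${browser.version()}`);
  await browser.close();
}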

The decision rule most teams actually need

Pick the framework that makes your dominant failure mode easiest to isolate, then spend the rest of the effort on teardown discipline and measurement.

If your suite mostly fails because of state contamination, implicit UI waiting mistakes, and poor visibility into modern front-end behavior, Playwright usually reduces the cost of doing the right thing. If your suite mostly fails because of enterprise topology, remote-browser orchestration, or browser-policy constraints, Selenium remains the more natural fit. In both cases, the real win comes from treating CI as a constrained production system rather than a disposable script runner.

Should you migrate from Selenium to Playwright just to fix performance?

Not always. If Grid topology, waits, or teardown leaks are the real bottleneck, a framework migration may move the pain without removing the cause.

Migrate when your biggest pain is modern web-app ergonomics, per-test isolation, and the cost of debugging client-side flakiness with weak tooling. Stay with Selenium when your biggest pain is organizational: broad browser policy requirements, deep ecosystem integrations, or heavy existing investment that already works once session hygiene is fixed.

  • Choose Playwright when you want strong defaults around isolation, built-in tracing, and a test runner that makes modern browser behavior easier to reason about.
  • Choose Selenium when standardized WebDriver behavior, browser breadth, and existing Grid or vendor-cloud workflows matter more than having the most opinionated local ergonomics.
  • Choose discipline before migration when your current suite leaks because of fixture design, weak teardown, or missing measurement. Changing frameworks will not fix a test culture that refuses to close what it opens.

The practical conclusion

In headless CI, Playwright usually feels lighter because browser contexts make clean-slate isolation cheap, and the debugging ergonomics are better out of the box. Selenium usually feels heavier because the costs of remote sessions, explicit lifecycle management, and mixed infrastructure are visible sooner. But those are tendencies, not laws.

The brutal truth is simpler: the framework you can observe, isolate, and tear down correctly is the one that will look fastest after a thousand runs. If you are not measuring RSS, CPU, session cleanup, and app-side retention, you are not comparing Playwright and Selenium. You are comparing stories.

References

  1. Stack Overflow 2024 Developer Survey, Methodology: 65,437 qualified responses across 185 countries
  2. Stack Overflow 2024 Developer Survey, Professional Developers: automated testing availability and tool-reliability frustration data
  3. Playwright documentation, Isolation: the browser context isolation model
  4. Playwright release notes: page.requestGC() as an aid for leak detection
  5. Selenium documentation, WebDriver: WebDriver as a W3C Recommendation
  6. Selenium documentation, Grid Observability: tracing and logging guidance

Ready to strengthen your test automation?

Desplega.ai helps QA teams build robust test automation frameworks, tighten CI feedback loops, and remove the hidden failure modes that make browser suites expensive to trust.

Get Started

Frequently Asked Questions

Which framework leaks more memory in CI: Playwright or Selenium?

Usually neither by default. Persistent leaks come from unclosed pages, contexts, or drivers, plus app-side state growth. The framework mostly changes how obvious the leak becomes.

Can Selenium be as stable as Playwright on shared CI runners?

Yes, if you control waits, isolate data, quit sessions reliably, and instrument Grid or driver logs. Selenium suffers when teams stack implicit waits and weak teardown on top of slow infra.

Should I measure browser memory, Node memory, or the whole runner?

Measure all three. Browser heap shows front-end retention, runner RSS exposes orphaned processes, and total job pressure explains why suites fail only under parallel CI contention.

Does headless mode always reduce CPU usage for end-to-end tests?

No. Headless removes paint overhead, but bad polling, excessive retries, video capture, and too many concurrent workers can still saturate CPU faster than a local headed run would.