Do I need Docker locally to run sharded Playwright tests?

No. Playwright runs natively on macOS, Windows, and Linux. Docker becomes useful when you want CI parity, reproducible browser versions, and the same image used by GitHub Actions or your runner.

How many shards should I use for a Playwright suite?

Start with the number of free runners you actually have. Beyond that, each shard adds startup overhead. In our experience, 4-8 shards is the sweet spot for suites between 100 and 800 tests.

Why is my shard 3 always slower than shards 1 and 2?

Default sharding splits tests by file count, not duration. If one shard owns the heavy auth or upload specs, it lags. Switch to a duration-aware splitter that uses prior timings to balance work.

Will Docker volume mounts slow my tests on macOS?

Yes, noticeably. Docker Desktop volume sharing on macOS has well-known I/O overhead. Copy artifacts into the image, mount only the result directory, or run the build itself inside the container.

How do I keep sharded reports as a single HTML output?

Generate a blob report per shard, upload each as an artifact, then run "playwright merge-reports" in a follow-up job. The official Playwright docs cover the exact CLI flags for this flow.

Scaling Playwright with Docker Containers and Parallel Sharding

Your Playwright suite runs in eight minutes on your laptop. The same suite takes forty-three minutes in CI, fails on every third pull request with a different error, and somebody is already typing "just rerun it" into Slack. That is not a test-quality problem. That is a test infrastructure problem, and at some point every vibe coder who has shipped real software has had to grow past it.

This guide walks through the level-up: stop pretending one shared laptop and one CI runner are enough. Build a Docker image that pins your browser version, run your suite across a parallel matrix of containers, and use the shard timings you collect to keep that matrix balanced over time. None of these moves is exotic. All of them are standard infrastructure that solo devs and small teams can absolutely operate without an SRE department.

We will cover three production-shaped code examples: a Docker Compose stack with health checks so the app under test is actually ready before Playwright starts, a sharded GitHub Actions workflow with merge-reports, and a Node.js shard calculator that consumes prior test durations and bin-packs them so no single shard runs twice as long as its neighbors. We will finish with a troubleshooting table and a debugging mindset that survives contact with macOS, Windows runners, and shared CI minutes.

Playwright ships official Docker images with the matching browser binaries and required system libraries baked in, and the Playwright Docker docs describe how to use them. That matters more than it sounds: most "works on my machine" Playwright failures in CI are not selector bugs, they are missing fonts, missing libnss3 packages, or a Chromium version drift between your local Mac and an Ubuntu runner. Pinning the image pins all of those at once.

Why does a 40-minute local run sometimes hit 5 minutes in CI?

Because parallel shards run the suite in slices on different machines simultaneously, while your laptop runs the whole suite on one machine, no matter how many workers you configure.

A Playwright config with workers: 4 already gives you four worker processes at once locally, but you are still bound by one machine's CPU, memory, and I/O. CI sharding multiplies that: ten runners with four workers each is effectively forty parallel workers across separate hardware. The wall-clock time of the slowest shard becomes your suite duration, which is why balancing shards is more important than adding more of them.

This is also where Docker starts paying for itself. A locally tuned suite often relies on whatever browser was already installed, whatever Node version nvm last switched to, and whatever fixtures happened to be cached. A Docker image moves all of that into something you can rebuild and version. CI runs the same image. Production tests run the same image. New contributors run the same image. The conversation stops being "does it work on your machine?" and starts being "does it work on the image?"

Containerizing the app under test with health checks

Before you shard tests, make sure the thing they hit comes up cleanly inside the same network as the runner. The classic level-up mistake is to start the dev server with npm run dev &, sleep ten seconds, and hope. That works locally and fails on CI because the runner is slower, the database migration takes longer, and the test command lands before the API is ready. The result is a burst of confusing 502 errors that look like flaky selectors.

Docker Compose health checks fix this by giving the orchestrator a real readiness signal. The web service waits on the database. Playwright waits on the web service. Everything else is bookkeeping.

# docker-compose.ci.yml
# Used by both CI and local debugging. Same image, same wiring.
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app_test
    # Health check uses pg_isready so dependents only start when SQL is responsive.
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U app -d app_test']
      interval: 3s
      timeout: 5s
      retries: 20
    tmpfs:
      - /var/lib/postgresql/data  # Edge case: ephemeral DB makes test runs idempotent.

  web:
    build:
      context: .
      dockerfile: Dockerfile.test
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app_test
      NODE_ENV: test
      PORT: '3000'
    depends_on:
      db:
        condition: service_healthy  # Wait for SQL, not just the container.
    healthcheck:
      # Use a real route. Do not point at '/', which often returns 200 even when broken.
      test: ['CMD-SHELL', 'wget -qO- http://127.0.0.1:3000/api/health || exit 1']
      interval: 3s
      timeout: 5s
      retries: 30  # Migrations on a cold DB can take a while.
    expose:
      - '3000'

  playwright:
    # Pin a specific Playwright image. Drift between local and CI is the #1 cause of
    # confusing failures: missing fonts, libnss3 versions, Chromium build numbers.
    image: mcr.microsoft.com/playwright:v1.49.0-jammy
    working_dir: /workspace
    volumes:
      - .:/workspace:cached  # macOS gotcha: ':cached' helps, but only the result dir is hot.
      - playwright-cache:/root/.cache/ms-playwright
    environment:
      CI: 'true'
      BASE_URL: http://web:3000
      # These two env vars feed the --shard flag in the command below.
      SHARD_INDEX: ${SHARD_INDEX:-1}
      SHARD_TOTAL: ${SHARD_TOTAL:-1}
    depends_on:
      web:
        condition: service_healthy
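    # List form avoids YAML folding surprises: more-indented lines in a folded
    # scalar keep their newlines, which would split the test command in two.
    # '$$' defers expansion to the container shell so the defaults above apply.
    command:
      - sh
      - -c
      - |
        set -e
        npx playwright install --with-deps chromium >/dev/null 2>&1 || true
        npx playwright test --shard="$$SHARD_INDEX/$$SHARD_TOTAL" --reporter=blob,line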

volumes:
  playwright-cache:

Three things in that file are worth lingering on. First, the database health check uses pg_isready, not the port. A port is open the instant Postgres binds, which is several seconds before the server actually accepts SQL. Second, the web service health check hits a real route. A homepage 200 can lie when the backing service is down behind a CDN or a static-export fallback. Third, the Playwright image is pinned to a version. Playwright is one of the few projects that ships browser binaries inside the image; do not chase :latest unless you enjoy debugging Chromium upgrades on a Friday.
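If you ever need the same readiness signal outside Compose (for example, in a plain Node.js bootstrap script), the idea behind a health check can be sketched as a polling loop. This is a minimal sketch, not part of Playwright or Docker: the function name waitUntilReady, its option names, and the healthProbe helper are all hypothetical, and the URL simply mirrors the /api/health route from the Compose file.

```javascript
// Poll a readiness probe until it succeeds or the retry budget runs out.
// `probe` is any async function that resolves to true once the app is ready.
// The function and option names are illustrative assumptions.
async function waitUntilReady(probe, { retries = 30, intervalMs = 3000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt += 1) {
    try {
      if (await probe()) return attempt; // Report how many attempts it took.
    } catch {
      // A refused connection counts as "not ready yet"; keep polling.
    }
    if (attempt < retries) {
      await new Promise((done) => setTimeout(done, intervalMs));
    }
  }
  throw new Error(`Service not ready after ${retries} attempts`);
}

// Example probe for the health route used in the Compose file.
// Relies on the global fetch available in Node.js 18 and later.
const healthProbe = async () => {
  const response = await fetch('http://127.0.0.1:3000/api/health');
  return response.ok;
};
```

A bootstrap script would call await waitUntilReady(healthProbe) before starting the test command. The retry count and interval defaults mirror the web service's health check values above, so the two mechanisms give up at roughly the same point.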

Common gotcha: the official Playwright Docker image runs as root. If your app under test creates files on a mounted volume, those files will be root-owned, and your host user will not be able to delete them without sudo. Either run as a named user inside the image, copy artifacts out at the end, or set user: "${UID}:${GID}" in your compose file for local dev.

Sharding across a GitHub Actions matrix

With the image and stack in place, the second leverage point is fan-out. GitHub Actions has native support for matrix jobs, where one workflow definition spawns N parallel runs across a parameter list. Playwright's test sharding documentation defines the --shard=current/total flag for exactly this: each runner takes a slice of the suite, blob reports get uploaded as artifacts, and a final job stitches everything into one HTML report.

The cost question matters here. The GitHub Actions billing documentation lists included minutes per plan and per-minute rates for additional usage on standard runners. Sharding is not free: ten shards that each take three minutes use thirty runner-minutes, even though wall-clock time is three minutes. For most small teams that trade is great. For very large suites on the free tier, sharding aggressively can burn through monthly minutes faster than expected. Measure first.
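The trade between wall-clock time and runner-minutes can be sketched in a few lines. This is a rough estimate under stated assumptions: each shard is its own job, each job is rounded up to a whole minute for billing, and job startup overhead is ignored. The function name estimateShardedRun is hypothetical.

```javascript
// Estimate wall-clock time and billed runner-minutes for a sharded run.
// Assumption: every shard is a separate job, billed rounded up to a full minute.
function estimateShardedRun(shardSeconds) {
  const wallClockSeconds = Math.max(...shardSeconds); // Slowest shard wins.
  const billedMinutes = shardSeconds
    .map((seconds) => Math.ceil(seconds / 60))
    .reduce((sum, minutes) => sum + minutes, 0);
  return {
    wallClockMinutes: Math.ceil(wallClockSeconds / 60),
    billedMinutes,
  };
}

// Ten shards of three minutes each, as in the paragraph above.
const tenShards = Array.from({ length: 10 }, () => 180);
console.log(estimateShardedRun(tenShards));
// { wallClockMinutes: 3, billedMinutes: 30 }
```

Plugging in your own shard timings before widening the matrix gives a quick sense of whether the extra runner-minutes fit your plan.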

# .github/workflows/e2e.yml
name: e2e

on:
  pull_request:
  push:
    branches: [main]

jobs:
  e2e-shard:
    name: e2e (shard ${{ matrix.shard }}/${{ matrix.total }})
    runs-on: ubuntu-latest
    timeout-minutes: 25  # Hard ceiling: never let a runaway shard burn the whole hour.
    strategy:
      fail-fast: false  # Edge case: keep all shards going so you can see if it's one bad spec.
      matrix:
        # Increase 'total' to fan out wider. Keep 'shard' values dense (1..N) to match.
        total: [6]
        shard: [1, 2, 3, 4, 5, 6]
    steps:
      - uses: actions/checkout@v4

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        with:
          path: ~/.cache/ms-playwright
          # Bust the cache when the lockfile changes (Playwright version usually lives there).
          key: pw-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}

      - name: Build and run tests in Docker
        env:
          SHARD_INDEX: ${{ matrix.shard }}
          SHARD_TOTAL: ${{ matrix.total }}
        run: |
          set -euo pipefail
          docker compose -f docker-compose.ci.yml up --build \
            --abort-on-container-exit \
            --exit-code-from playwright \
            playwright

      - name: Upload blob report
        if: always()  # Even on failure: we need the report to find out WHY.
        uses: actions/upload-artifact@v4
        with:
          name: blob-report-${{ matrix.shard }}
          path: blob-report
          retention-days: 7  # Cost control: blob reports are big.

  merge-reports:
    name: merge reports
    if: always()
    needs: [e2e-shard]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - run: npm install -g @playwright/test@1.49.0

      - name: Download all blob reports
        uses: actions/download-artifact@v4
        with:
          path: all-blob-reports
          pattern: blob-report-*
          merge-multiple: true

      - name: Merge into HTML report
        run: |
          npx playwright merge-reports --reporter=html,github ./all-blob-reports

      - name: Upload combined HTML report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: playwright-report
          retention-days: 14

Two design choices in that workflow are intentional. First, fail-fast: false lets all shards run even if one fails, so you can distinguish "the whole stack is broken" from "shard 4 always fails because it owns the upload spec." Second, the merge job runs with if: always() so the HTML report is produced even when some shards failed. A failing test you cannot browse afterwards is a debugging dead end.

One subtle edge case: the shard values in this matrix are 1-based (1 through 6), and Playwright's --shard flag is also 1-based. If you introduce a custom splitter that uses zero-based indices, you can quietly lose the first or last shard's tests. Stick with the documented Playwright convention to avoid this whole category of bug.
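One way to guard against the off-by-one described above is to validate the index before it ever reaches the command line. The helpers below are a hypothetical sketch (shardFlag and shardFlagFromZeroBased are not Playwright APIs); they only build the documented --shard=current/total string and refuse out-of-range values.

```javascript
// Build a Playwright --shard flag from a 1-based shard index.
// Both helper names are hypothetical; only the flag format comes from Playwright.
function shardFlag(index, total) {
  if (!Number.isInteger(total) || total < 1) {
    throw new RangeError(`total must be a positive integer, got ${total}`);
  }
  if (!Number.isInteger(index) || index < 1 || index > total) {
    throw new RangeError(`shard index must be between 1 and ${total}, got ${index}`);
  }
  return `--shard=${index}/${total}`;
}

// Convert a zero-based index from a custom splitter before building the flag.
function shardFlagFromZeroBased(zeroBasedIndex, total) {
  return shardFlag(zeroBasedIndex + 1, total);
}

console.log(shardFlag(1, 6)); // --shard=1/6
console.log(shardFlagFromZeroBased(5, 6)); // --shard=6/6
```

Failing loudly on index 0 or index 7 of 6 turns a silently skipped slice of the suite into an obvious CI error.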

Balancing shards with duration data

Default Playwright sharding splits the test file list evenly. That is fine until two facts catch up with you. First, file count is a bad proxy for runtime; one file with a single end-to-end checkout test can take longer than thirty unit-like specs. Second, the slowest shard sets your wall-clock time, so an imbalanced split is the same as paying for ten runners and using six.
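To put a number on that imbalance, divide the total work by the slowest shard: the result is how many runners' worth of speedup you actually got. The sketch below uses made-up shard timings and a hypothetical function name (effectiveParallelism); it is an illustration of the arithmetic, not a Playwright feature.

```javascript
// Effective parallelism: total work divided by the slowest shard's runtime.
// With perfectly balanced shards this equals the shard count.
function effectiveParallelism(shardSeconds) {
  const slowest = Math.max(...shardSeconds);
  const totalWork = shardSeconds.reduce((sum, seconds) => sum + seconds, 0);
  return { wallClockSeconds: slowest, effectiveParallelism: totalWork / slowest };
}

// Ten shards: one heavy (600s), eight average (360s), one nearly idle (120s).
const unbalanced = [600, 360, 360, 360, 360, 360, 360, 360, 360, 120];
console.log(effectiveParallelism(unbalanced));
// { wallClockSeconds: 600, effectiveParallelism: 6 }
```

Ten runners delivering the speedup of six is exactly the situation a duration-aware split is meant to fix.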

The fix is to record per-spec durations from prior runs, then use a bin-packing heuristic (longest-processing-time, or LPT) to distribute specs across shards by total estimated runtime. This is not a research problem; it is a small Node.js script that reads JSON, sorts, and writes shard files for each runner to consume.

// scripts/balance-shards.mjs
// Reads test-durations.json from prior runs, splits specs across N shards using
// longest-processing-time (LPT) bin packing, and writes one shard manifest per worker.
// Run BEFORE the matrix job; commit or upload the result for the test runners to consume.
import { readFileSync, writeFileSync, existsSync } from 'node:fs';
import { resolve } from 'node:path';
import { globSync } from 'glob';  // From the 'glob' npm package; keep imports at the top.

const DURATIONS_PATH = resolve('./test-durations.json');
const TOTAL_SHARDS = Number(process.env.SHARD_TOTAL ?? 6);
const FALLBACK_MS = 8_000;  // Assume 8s for any spec we have no data on.

// isInteger (not isFinite) so values like 2.5 are rejected, matching the message.
if (!Number.isInteger(TOTAL_SHARDS) || TOTAL_SHARDS < 1 || TOTAL_SHARDS > 64) {
  console.error('SHARD_TOTAL must be an integer between 1 and 64.');
  process.exit(1);
}

// Edge case: no prior data at all (fresh repo, brand-new suite). Every spec then gets
// FALLBACK_MS, which degrades to an even file-count split.
let durations = {};
if (existsSync(DURATIONS_PATH)) {
  try {
    durations = JSON.parse(readFileSync(DURATIONS_PATH, 'utf8'));
  } catch (error) {
    console.warn(`test-durations.json was unreadable, using fallback: ${error.message}`);
    durations = {};
  }
}

// Discover specs deterministically (sorted) so the same input always produces the same split.
const specs = globSync('tests/**/*.spec.ts', { absolute: false }).sort();

if (specs.length === 0) {
  console.error('No specs found under tests/. Aborting before producing empty shards.');
  process.exit(2);
}

// Sort specs longest-first. Items with no historical data get the fallback so they
// still get spread across shards instead of all landing in the first one.
const ranked = specs
  .map((path) => ({ path, ms: Number(durations[path] ?? FALLBACK_MS) }))
  .sort((a, b) => b.ms - a.ms);

// LPT: keep N bins, always drop the next-largest spec into the lightest bin.
// Bins are never reordered, so bin 0 is always shard 1, bin 1 is shard 2, etc.
const bins = Array.from({ length: TOTAL_SHARDS }, () => ({ totalMs: 0, specs: [] }));

for (const spec of ranked) {
  // Find the lightest bin without sorting the array; ties go to the lowest index.
  const lightest = bins.reduce((min, bin) => (bin.totalMs < min.totalMs ? bin : min));
  lightest.specs.push(spec.path);
  lightest.totalMs += spec.ms;
}

// Number the shards 1..N in bin order.
const shards = bins.map((bin, index) => ({
  shard: index + 1,
  totalMs: bin.totalMs,
  estimatedSeconds: Math.round(bin.totalMs / 1000),
  specs: [...bin.specs].sort(),  // Stable order inside a shard for cache friendliness.
}));

// Sanity: imbalance ratio should be < 1.4. If higher, log a loud warning so CI shows it.
const totals = shards.map((s) => s.totalMs);
const ratio = Math.max(...totals) / Math.max(1, Math.min(...totals));
if (ratio > 1.4) {
  console.warn(
    `Shard imbalance ratio ${ratio.toFixed(2)} > 1.4. Consider splitting your slowest spec.`,
  );
}

for (const shard of shards) {
  writeFileSync(`shard-${shard.shard}.json`, JSON.stringify(shard, null, 2));
}

console.table(
  shards.map((s) => ({
    shard: s.shard,
    estSeconds: s.estimatedSeconds,
    specCount: s.specs.length,
  })),
);

Then feed each shard file into Playwright with a small custom config that whitelists only that shard's specs (or use Playwright's testMatch in a per-shard config). The point is not the exact wiring; it is that you treat shard composition as data, not as a magic constant in your CI YAML.
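One possible wiring, sketched here as an alternative to a per-shard config, is to pass the manifest's spec paths as positional arguments to the test command. The helper name playwrightArgsFor is hypothetical, and the manifest shape is assumed to match what the balancing script writes. Playwright treats positional arguments as filters matched against file paths, so very similar file names could match more than intended.

```javascript
// Turn a shard manifest (the shape written by balance-shards.mjs) into CLI arguments.
// Hypothetical helper; the manifest fields are assumed from the script above.
function playwrightArgsFor(manifest) {
  // Guard: with no file filters, the test command would run the ENTIRE suite.
  if (!Array.isArray(manifest.specs) || manifest.specs.length === 0) {
    throw new Error(`Shard ${manifest.shard} has no specs to run`);
  }
  // No --shard flag here: the manifest already decides which files this runner owns.
  return ['playwright', 'test', ...manifest.specs, '--reporter=blob,line'];
}

const exampleManifest = {
  shard: 2,
  totalMs: 61000,
  estimatedSeconds: 61,
  specs: ['tests/checkout.spec.ts', 'tests/upload.spec.ts'],
};
console.log(['npx', ...playwrightArgsFor(exampleManifest)].join(' '));
// npx playwright test tests/checkout.spec.ts tests/upload.spec.ts --reporter=blob,line
```

Throwing on an empty manifest is a deliberate choice in this sketch: if you have more shards than specs, it is safer for the runner to stop than to fall through and run everything.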

How do you collect duration data in the first place? Add the JSON reporter to your Playwright run (--reporter=json), aggregate spec timings from each shard's output, and commit the merged file as test-durations.json on the main branch. Or upload it to artifact storage and download it before the next run. Either way, the splitter only needs recent data, not perfect data; a rolling 30-day average is more than enough for LPT to do its job.
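The aggregation step could look roughly like the sketch below. It assumes the JSON report is a tree of suites, each with a specs array whose entries carry a file path and tests with per-attempt results holding a duration in milliseconds; verify that shape, and that the file paths match the keys your splitter's glob produces, against your own report before relying on it. The function names are hypothetical, and the blending step swaps the rolling 30-day average for a simpler exponential moving average.

```javascript
// Sum test durations per spec file from a parsed Playwright JSON report.
// Assumed shape: { suites: [{ specs: [{ file, tests: [{ results: [{ duration }] }] }], suites: [...] }] }
function collectDurations(report) {
  const totals = {};
  const visit = (suite) => {
    for (const spec of suite.specs ?? []) {
      const file = spec.file ?? suite.file;
      if (!file) continue;
      for (const test of spec.tests ?? []) {
        // Each retry attempt is a separate result, so retries add to the total.
        for (const result of test.results ?? []) {
          totals[file] = (totals[file] ?? 0) + (result.duration ?? 0);
        }
      }
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  for (const suite of report.suites ?? []) visit(suite);
  return totals;
}

// Blend fresh timings into stored history so one slow run does not dominate.
// Exponential moving average; `weight` is how much the newest run counts.
function mergeDurations(history, fresh, weight = 0.3) {
  const merged = { ...history };
  for (const [file, ms] of Object.entries(fresh)) {
    merged[file] = file in history
      ? Math.round(history[file] * (1 - weight) + ms * weight)
      : ms;
  }
  return merged;
}

const sampleReport = {
  suites: [{
    file: 'tests/login.spec.ts',
    specs: [
      { file: 'tests/login.spec.ts', tests: [{ results: [{ duration: 1200 }] }] },
      { file: 'tests/login.spec.ts', tests: [{ results: [{ duration: 800 }] }] },
    ],
    suites: [{
      specs: [{ file: 'tests/login.spec.ts', tests: [{ results: [{ duration: 500 }] }] }],
    }],
  }],
};
const freshTimings = collectDurations(sampleReport);
console.log(freshTimings); // { 'tests/login.spec.ts': 2500 }
console.log(mergeDurations({ 'tests/login.spec.ts': 2000 }, freshTimings));
// { 'tests/login.spec.ts': 2150 }
```

Run collectDurations once per shard report, merge the results into one object, and write that object out as test-durations.json.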

Sharding strategies, side by side

Strategy	Setup cost	Balance quality	Best for
Default file-count shard	Zero — one CLI flag	Poor for mixed-cost suites	Small or uniform suites under 100 tests
Manual project split	Low — Playwright projects in config	Good if you already group by domain	Suites with clear feature boundaries
Duration-aware LPT (this post)	Medium — script + duration history	Strong — imbalance ratio under 1.4	Mixed suites where one spec is much slower
Dynamic test queue	High — needs a coordinator service	Excellent for very large suites	Hundreds of specs, dedicated infra team

For solo developers and small teams, duration-aware LPT is almost always the right stopping point. A dynamic test queue is impressive at conference talks and overkill in a repo that runs CI thirty times a day. Pick the strategy that matches the size of your real suite, not the size of your aspirations.

Troubleshooting the move from laptop to Docker to sharded CI

Most of the pain in this transition is not Playwright's fault and is not your test code's fault either. It is the seam between three environments: your local machine, the Docker image, and the CI runner. Each one has its own quirks, and the failures look identical from a stack trace alone.

Symptom	Likely root cause	Fix
First request after startup gets 502	App not actually ready; sleep is lying to you	Health check on a real route, not just a TCP probe
"Could not find Chromium" locally only	Image and host Playwright versions disagree	Pin the image tag to match the lockfile version
macOS host runs feel 3x slower than CI	Docker Desktop volume mount I/O overhead	Build inside the image; mount only the results dir
Shard 3 always lasts twice as long as shard 1	File-count split + uneven spec runtimes	Add LPT splitter that reads duration history
Two shards both try to seed the same DB row	Shared database, no per-shard isolation	Schema-per-shard, or unique fixture seeds per worker
Combined report is missing shard 4	Artifact upload skipped on failure	Set the upload step to `if: always()`

Two failure modes deserve a longer note. The first is the database race condition. The moment two shards write to the same Postgres in parallel, you have invited determinism out of your test suite and replaced it with whichever shard got there first. Either run a dedicated database per shard (the simplest option in Docker Compose) or scope every test to a unique tenant ID, schema, or namespace that no other shard can possibly touch. "The tests usually pass" is not an engineering position you want to defend in a code review.

The second is fail-fast. By default, GitHub Actions matrix jobs share a fail-fast: true behavior that cancels the rest of the matrix on first failure. That is great for compile-error workflows and terrible for test runs. If one shard fails, you usually want to see whether the others also fail, because that distinguishes "the auth flow is broken" from "shard 4 owns the only flaky spec." Set fail-fast: false on test matrices specifically, and keep the default for build matrices.
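Returning to the database race: if you go the schema-per-shard route rather than a database per shard, each worker needs a name no other worker can collide with. The sketch below derives one from SHARD_INDEX (set in the Compose file) and TEST_PARALLEL_INDEX (set by Playwright inside worker processes). The naming scheme and the function name schemaNameFor are illustrative assumptions, and creating the schema itself is left to your fixtures.

```javascript
// Derive a Postgres schema name unique to one shard and one Playwright worker slot.
// The 'test_s<shard>_w<worker>' scheme is an illustrative assumption.
function schemaNameFor(env = process.env) {
  const shard = Number.parseInt(env.SHARD_INDEX ?? '1', 10);
  const worker = Number.parseInt(env.TEST_PARALLEL_INDEX ?? '0', 10);
  if (!Number.isInteger(shard) || !Number.isInteger(worker)) {
    throw new Error('SHARD_INDEX and TEST_PARALLEL_INDEX must be integers');
  }
  // Digits only after the prefix, so the name is safe to use as an identifier.
  return `test_s${shard}_w${worker}`;
}

console.log(schemaNameFor({ SHARD_INDEX: '3', TEST_PARALLEL_INDEX: '1' })); // test_s3_w1
```

With six shards and four workers each, this yields at most twenty-four schemas, which is small enough to create up front in a global setup step.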

What does "leveled up" actually look like?

Leveling up is not about adopting every tool at once. It is about choosing the next smallest move that retires the most pain. For most solo developers and small teams, the sequence looks like this. Start with a single Playwright config, no shards, running against a Docker Compose stack. Add CI as a single sharded matrix once that is stable, with default file-count splitting. Add duration-aware LPT only when you can see real shard imbalance in your reports. Add a dynamic test queue or a hosted runner fleet only when those simpler options stop scaling.

Each of those steps is reversible. Each adds one new piece of infrastructure you can actually operate. None of them require giving up the language you write tests in or the framework you already know. That is the whole point of the level-up framing: keep the Playwright assertions, but stop hand-rolling the environment they run in.

In our experience, teams that pause at the wrong step often regret it. Stopping at "works on my laptop" means new hires cannot reproduce failures. Stopping at "single CI job, no shards" means tests stay slow and people start skipping them. Stopping at "default file sharding" with an unbalanced suite means engineers spend more time waiting on shard 3 than they save by parallelizing. The whole staircase is the answer, not any single step.

Edge cases you will trip over eventually

Stale duration data: if you commit test-durations.json and never refresh it, the splitter optimizes for last year's suite. Refresh it from the main branch on a regular schedule.
Artifact retention costs: blob and HTML reports can be tens of megabytes per shard. Multiply by every PR push for a month and you will hit storage limits faster than expected. Tune retention-days.
Headed mode for debugging: Docker Playwright images can drive xvfb for headed runs, but the ergonomics on a CI runner are awful. Reproduce flakes locally with the same image instead of trying to attach a VNC viewer to a hosted runner.
Browser channel drift: Playwright Docker images include browsers that match the bundled version. If your config pins channel: 'chrome', you need Google Chrome installed separately. Use the bundled Chromium in CI unless you have a specific reason not to.
Network access in the runner: sharded jobs often hit rate-limited third-party APIs. If three shards all call the same external service in parallel, you can unintentionally DOS your own credentials. Mock at the network boundary, or share a token-bucket via Redis if you must talk to real services.

The practical conclusion for vibe coders

Playwright is one of the few tools where the local experience and the professional CI experience can use the same image, the same flags, and the same reporters. That is rare. It means you can level up your test infrastructure without throwing away the muscle memory you built shipping side projects on your laptop. Container the app. Shard the test command. Balance the shards with duration data. Watch the build go from forty minutes to five.

You do not need a platform team to do any of this. You need one Compose file, one workflow file, and one small script that knows how to pack tests into bins. The hard part was never Playwright's API. It was treating CI as part of your codebase and not as someone else's problem. That mindset is the real level-up.

