What is test data subsetting and why does it matter?

Test data subsetting extracts a small, referentially consistent slice of production data for dev and test use. It gives you realistic data distributions without privacy risks or terabyte-scale imports.

Is it safe to use production data in my local environment?

Only after masking PII. Replace real emails, names, and phone numbers with synthetic equivalents before loading into any non-production system. Our Python script example includes a masking step.

How does Docker improve test data isolation?

Docker lets you spin up a fully isolated database per branch or test run, seed it from your subset on container start, and destroy it when done — eliminating cross-test contamination and shared state bugs.

What breaks if I ignore foreign key constraints during subsetting?

Your import will fail with FK violation errors, or succeed silently if you disabled triggers — leaving orphaned rows that cause mysterious query failures and test flakiness later. Always walk the FK graph.

Can this approach work with MySQL or SQLite?

Yes. MySQL uses mysqldump with --where clauses for per-table filtering. SQLite loads from CSV. The Docker Compose pattern and FK-walking strategy work across all relational engines with minor syntax changes.

Scaling Beyond Mock Data: Production-Grade Test Data Subsetting with Docker and SQL

You wrote the feature. You wrote the tests. They pass locally against your carefully crafted mock data. Then staging explodes. The user with 12,000 orders breaks pagination. The product with a null category crashes your sort function. The edge case you never thought to mock is exactly the case your first real customer hits.

This is the mock data trap. And you're not alone — in our experience, unreliable or oversimplified test data is one of the most common reasons bugs survive local testing only to surface in staging or production. The data in your test environment doesn't behave like the data your users actually create.

The professional solution isn't "use more production data" — it's test data subsetting: extracting a small, referentially consistent, privacy-safe slice of your real database and running it in an isolated Docker container that spins up in seconds. This guide walks you through building that pipeline from scratch, with production-ready code you can drop into your project today.

What Does "Production-Grade" Test Data Actually Mean?

Production-grade test data mirrors real data distributions, preserves FK constraints, masks PII, and loads cleanly in under 60 seconds. Static mocks satisfy none of these consistently.

Most developers start with one of two test data strategies: hand-crafted fixtures (a JSON file with three users and two orders) or a full production dump (100GB, takes 40 minutes to restore, and contains your users' real email addresses). Neither works at scale.

Production-grade test data sits between these extremes. It must be:

Referentially consistent — every foreign key points to a row that exists. No orphaned orders, no products referencing missing categories.
Representatively distributed — includes power users with thousands of records, brand-new users with zero, and the messy in-between (nulls, partial data, unusual states).
Privacy-safe — PII (emails, names, phone numbers) replaced with realistic synthetic data before it ever leaves the production environment.
Fast to seed — small enough that a fresh Docker container initializes in under a minute, so developers don't skip tests to save time.
Deterministic — the same subset script produces the same rows, so a test failure at 3am is reproducible at 9am without manually rebuilding state.

The SQL subsetting technique achieves all five by starting with a "seed" set of rows and then walking your foreign key graph to pull every related record those rows depend on. The rest of this guide shows you exactly how that works.

Setting Up Your Isolated Test Database with Docker

Before loading subset data, you need a database that is completely isolated from production and fast to rebuild. Docker Compose is the right tool here: it gives you a PostgreSQL instance that mounts initialization scripts on first start, exposes a predictable port, and can be torn down and rebuilt with one command.

The healthcheck configuration below is critical. Without it, your application startup races against PostgreSQL's initialization sequence and you get intermittent connection errors that look like network problems but are actually a timing issue:

# docker-compose.testdb.yml
version: '3.8'

services:
  testdb:
    image: postgres:15-alpine
    container_name: myapp-testdb
    environment:
      POSTGRES_DB: testdb
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpass
      POSTGRES_INITDB_ARGS: "--encoding=UTF8 --locale=en_US.UTF-8"
    ports:
      - "5433:5432"  # 5433 avoids conflict with local Postgres on 5432
    volumes:
      - ./db/schema.sql:/docker-entrypoint-initdb.d/01_schema.sql
      - ./db/subset_data.sql:/docker-entrypoint-initdb.d/02_data.sql
    # CRITICAL: Without this healthcheck your app races Postgres initialization.
    # The --wait flag on docker compose up blocks until this passes.
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U testuser -d testdb"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 10s
    # Durability is irrelevant for throwaway test containers.
    # These flags trade crash-safety for 3-5x faster writes during seed load.
    command: >
      postgres
        -c fsync=off
        -c synchronous_commit=off
        -c full_page_writes=off
        -c max_connections=50
    # Store data in RAM — no persistence needed, startup is faster
    tmpfs:
      - /var/lib/postgresql/data

# Usage:
#   docker compose -f docker-compose.testdb.yml up -d --wait
#   psql postgresql://testuser:testpass@localhost:5433/testdb
#   docker compose -f docker-compose.testdb.yml down

Edge case — tmpfs and persistence: The tmpfs mount stores PostgreSQL data in RAM. This makes container startup significantly faster but means data is gone when the container stops. This is intentional for test databases. If you need to persist data between runs (for example, during a multi-hour local debugging session), replace tmpfs with a named volume: volumes: - testdb-data:/var/lib/postgresql/data at the service level and volumes: testdb-data: at the top level.

The fsync=off and synchronous_commit=off flags are the reason your subset loads in seconds instead of minutes. In production, these settings would risk data loss on a hard crash — PostgreSQL normally flushes every write to disk before acknowledging it. For a throwaway container that you rebuild from a script, durability is irrelevant, and skipping those disk flushes makes a substantial difference on seed loads with many small inserts.

How Does Test Data Subsetting Preserve Referential Integrity?

Subsetting walks your FK graph from seed rows outward, pulling every referenced parent record so every INSERT succeeds without constraint violations on import. Direction matters: children always depend on parents, never the reverse.

Your database schema is a directed acyclic graph (usually). Every foreign key is an edge from a child table to a parent table. A row in orders depends on a row in users. A row in order_items depends on both orders AND products. When you extract a subset, you must traverse this graph and pull every dependency — otherwise PostgreSQL will reject your import with foreign key violation errors.

This SQL subsetting script demonstrates the complete pattern for a typical e-commerce schema. It includes an integrity check phase that validates your subset before exporting, so you catch missing-parent issues before they cause a confusing import failure:

-- subset_extract.sql
-- Run this against your production READ REPLICA, never the primary.
-- Outputs CSV files ready for COPY import into your test database.

BEGIN;  -- We ROLLBACK at the end — this script never modifies source data.

-- ================================================================
-- STEP 1: Seed set — recent active users, capped to manageable size.
-- Adjust LIMIT and WHERE to control subset size vs. coverage tradeoff.
-- ================================================================
CREATE TEMP TABLE _subset_users AS
SELECT *
FROM users
WHERE created_at >= NOW() - INTERVAL '90 days'
  AND status IN ('active', 'trial')
ORDER BY random()  -- Randomize to avoid always pulling the same cohort
LIMIT 500;

-- Always include a handful of power users to test high-volume scenarios.
-- These are the rows that break pagination, slow down queries, etc.
INSERT INTO _subset_users
SELECT * FROM users
WHERE id IN (
  SELECT user_id
  FROM orders
  GROUP BY user_id
  HAVING COUNT(*) > 100
  ORDER BY COUNT(*) DESC
  LIMIT 10
)
ON CONFLICT (id) DO NOTHING;  -- Deduplicate against LIMIT set above

-- ================================================================
-- STEP 2: Walk FK graph — orders depend on users.
-- ================================================================
CREATE TEMP TABLE _subset_orders AS
SELECT o.*
FROM orders o
INNER JOIN _subset_users u ON o.user_id = u.id;

-- ================================================================
-- STEP 3: Order items depend on orders (and products).
-- ================================================================
CREATE TEMP TABLE _subset_order_items AS
SELECT oi.*
FROM order_items oi
INNER JOIN _subset_orders o ON oi.order_id = o.id;

-- ================================================================
-- STEP 4: Products referenced by order items — pull even old products.
-- A product created 3 years ago can still be in a recent order.
-- ================================================================
CREATE TEMP TABLE _subset_products AS
SELECT p.*
FROM products p
WHERE p.id IN (
  SELECT DISTINCT product_id
  FROM _subset_order_items
  WHERE product_id IS NOT NULL
);

-- ================================================================
-- STEP 5: Product categories — parent of products.
-- Don't forget lookup tables! They are easy to miss and always required.
-- ================================================================
CREATE TEMP TABLE _subset_categories AS
SELECT c.*
FROM categories c
WHERE c.id IN (
  SELECT DISTINCT category_id
  FROM _subset_products
  WHERE category_id IS NOT NULL
);

-- ================================================================
-- STEP 6: INTEGRITY CHECK — validate subset before export.
-- This surfaces missing parents before they cause import failures.
-- ================================================================
DO $$
DECLARE
  orphan_products INTEGER;
  orphan_orders   INTEGER;
BEGIN
  -- Check: products referencing categories not in subset
  SELECT COUNT(*) INTO orphan_products
  FROM _subset_products p
  WHERE p.category_id IS NOT NULL
    AND NOT EXISTS (
      SELECT 1 FROM _subset_categories c WHERE c.id = p.category_id
    );

  -- Check: orders referencing users not in subset
  SELECT COUNT(*) INTO orphan_orders
  FROM _subset_orders o
  WHERE NOT EXISTS (
    SELECT 1 FROM _subset_users u WHERE u.id = o.user_id
  );

  IF orphan_products > 0 THEN
    RAISE EXCEPTION 'Integrity check FAILED: % products reference missing categories', orphan_products;
  END IF;

  IF orphan_orders > 0 THEN
    RAISE EXCEPTION 'Integrity check FAILED: % orders reference missing users', orphan_orders;
  END IF;

  RAISE NOTICE 'Integrity check PASSED: %, %, %, %, % rows',
    (SELECT COUNT(*) FROM _subset_categories),
    (SELECT COUNT(*) FROM _subset_products),
    (SELECT COUNT(*) FROM _subset_users),
    (SELECT COUNT(*) FROM _subset_orders),
    (SELECT COUNT(*) FROM _subset_order_items);
END $$;

-- ================================================================
-- STEP 7: Export in dependency order — parents before children.
-- COPY order determines import order in the test database.
-- ================================================================
copy _subset_categories    TO '/tmp/subset_categories.csv'    WITH (FORMAT CSV, HEADER);
copy _subset_products      TO '/tmp/subset_products.csv'      WITH (FORMAT CSV, HEADER);
copy _subset_users         TO '/tmp/subset_users.csv'         WITH (FORMAT CSV, HEADER);
copy _subset_orders        TO '/tmp/subset_orders.csv'        WITH (FORMAT CSV, HEADER);
copy _subset_order_items   TO '/tmp/subset_order_items.csv'   WITH (FORMAT CSV, HEADER);

ROLLBACK;  -- Temp tables are session-scoped; ROLLBACK ensures no accidental writes

Automating Subset Extraction and PII Masking with Python

Running the SQL script manually works once. For a repeatable CI-friendly pipeline, you need automation: connect to the replica, extract the subset, mask PII, and load into the Docker test container. This Python script wraps the full cycle. It uses psycopg2 for direct database access and faker for generating realistic replacement data.

Note the error structure: each phase has its own handler so failures tell you exactly where the pipeline broke. Silent failures are the worst kind in data pipelines — a load that succeeds but skips half the rows is harder to debug than one that explodes immediately:

#!/usr/bin/env python3
"""
subset_and_load.py — Extract a masked subset from prod replica, load into test DB.

Requirements:
    pip install psycopg2-binary faker

Usage:
    PROD_DB_URL="postgresql://readonly:pass@replica.internal/proddb" \
    TEST_DB_URL="postgresql://testuser:testpass@localhost:5433/testdb" \
    python subset_and_load.py --days 90 --user-limit 500

Edge cases handled:
    - FK cycles via DEFERRABLE constraints + SET CONSTRAINTS ALL DEFERRED
    - Empty subsets raise ValueError, not silent success
    - Connection failures use exponential backoff retry
    - PII column scanner warns about columns that may need masking
"""

import argparse
import os
import sys
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Generator

import psycopg2
import psycopg2.extras
from faker import Faker

fake = Faker()

# Tables in the order they must be loaded (parents before children)
LOAD_ORDER = ["categories", "products", "users", "orders", "order_items"]

# Column names that likely contain PII — warn if found in unexpected tables
PII_COLUMN_HINTS = {"email", "phone", "name", "address", "ssn", "dob", "birth"}


@dataclass
class SubsetStats:
    counts: dict = field(default_factory=dict)

    def record(self, table: str, count: int) -> None:
        self.counts[table] = count

    def __str__(self) -> str:
        return ", ".join(f"{t}: {c}" for t, c in self.counts.items())


@contextmanager
def get_conn(url: str, retries: int = 3) -> Generator:
    """DB connection context manager with exponential-backoff retry."""
    conn = None
    last_error = None

    for attempt in range(retries):
        try:
            conn = psycopg2.connect(url, connect_timeout=10)
            conn.autocommit = False
            yield conn
            return
        except psycopg2.OperationalError as e:
            last_error = e
            if attempt < retries - 1:
                wait = 2 ** attempt  # 1s, 2s, 4s
                print(f"  Connection failed (attempt {attempt + 1}/{retries}), retry in {wait}s...")
                time.sleep(wait)
        finally:
            if conn and not conn.closed:
                conn.close()

    raise RuntimeError(f"Could not connect after {retries} attempts: {last_error}") from last_error


def extract_subset(prod_url: str, days: int, user_limit: int) -> dict:
    """
    Extract a referentially consistent subset from the production replica.
    Returns dict of table_name -> list of row dicts.

    IMPORTANT: This function reads only — it calls conn.rollback() at the end
    to ensure no accidental writes to the replica even if autocommit is on.
    """
    print(f"Extracting subset: last {days} days, limit {user_limit} users...")

    with get_conn(prod_url) as conn:
        cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

        # --- Seed: recent active users ---
        cur.execute(
            """
            SELECT * FROM users
            WHERE created_at >= NOW() - INTERVAL %(days)s
              AND status IN ('active', 'trial')
            ORDER BY random()
            LIMIT %(limit)s
            """,
            {"days": f"{days} days", "limit": user_limit},
        )
        users = [dict(row) for row in cur.fetchall()]

        if not users:
            raise ValueError(
                f"No active users found in last {days} days. "
                "Check replica lag or extend --days. "
                f"Replica lag: run SELECT NOW() - pg_last_xact_replay_timestamp();"
            )

        user_ids = [u["id"] for u in users]
        print(f"  Seeded {len(users)} users")

        # --- Always include power users for edge-case coverage ---
        cur.execute(
            """
            SELECT * FROM users WHERE id IN (
                SELECT user_id FROM orders
                GROUP BY user_id
                HAVING COUNT(*) > 50
                ORDER BY COUNT(*) DESC
                LIMIT 5
            )
            """
        )
        for row in cur.fetchall():
            row = dict(row)
            if row["id"] not in set(user_ids):
                users.append(row)
                user_ids.append(row["id"])

        print(f"  Total users after power-user injection: {len(users)}")

        # --- Walk FK graph ---
        cur.execute("SELECT * FROM orders WHERE user_id = ANY(%s)", (user_ids,))
        orders = [dict(row) for row in cur.fetchall()]
        order_ids = [o["id"] for o in orders]
        print(f"  Orders: {len(orders)}")

        cur.execute("SELECT * FROM order_items WHERE order_id = ANY(%s)", (order_ids,))
        items = [dict(row) for row in cur.fetchall()]
        product_ids = list({i["product_id"] for i in items if i.get("product_id")})
        print(f"  Order items: {len(items)}, unique products: {len(product_ids)}")

        cur.execute("SELECT * FROM products WHERE id = ANY(%s)", (product_ids,))
        products = [dict(row) for row in cur.fetchall()]
        category_ids = list({p["category_id"] for p in products if p.get("category_id")})

        cur.execute("SELECT * FROM categories WHERE id = ANY(%s)", (category_ids,))
        categories = [dict(row) for row in cur.fetchall()]
        print(f"  Products: {len(products)}, categories: {len(categories)}")

        conn.rollback()  # Ensure no writes to replica

    return {
        "categories": categories,
        "products": products,
        "users": users,
        "orders": orders,
        "order_items": items,
    }


def mask_pii(data: dict) -> dict:
    """
    Replace PII with realistic synthetic data using Faker.
    This MUST run before loading into any non-production system.

    Scans all tables for PII-like column names and warns about anything
    not explicitly masked — belt-and-suspenders approach.
    """
    print("Masking PII...")
    masked = {table: [dict(row) for row in rows] for table, rows in data.items()}

    for user in masked["users"]:
        user["email"] = fake.email()
        user["name"] = fake.name()
        if user.get("phone"):
            user["phone"] = fake.phone_number()
        # Null out optional PII fields rather than faking them
        for field_name in ("date_of_birth", "ssn", "national_id"):
            if field_name in user:
                user[field_name] = None

    for order in masked["orders"]:
        if "shipping_address" in order and order["shipping_address"]:
            order["shipping_address"] = fake.address()

    # Warn about any column in any table that sounds like PII
    for table, rows in masked.items():
        if not rows:
            continue
        for col in rows[0].keys():
            if any(hint in col.lower() for hint in PII_COLUMN_HINTS):
                if table != "users" or col not in ("email", "name", "phone"):
                    print(f"  WARNING: {table}.{col} may contain PII — verify it is masked")

    return masked


def load_subset(test_url: str, data: dict) -> SubsetStats:
    """
    Load masked subset into the test database.

    Uses SET CONSTRAINTS ALL DEFERRED to handle FK cycles gracefully.
    Truncates tables in reverse order before inserting to ensure clean state.
    """
    print("Loading into test database...")
    stats = SubsetStats()

    with get_conn(test_url) as conn:
        cur = conn.cursor()

        # Disable FK checks for TRUNCATE (re-enabled automatically after)
        cur.execute("SET session_replication_role = 'replica'")
        for table in reversed(LOAD_ORDER):
            cur.execute(f"TRUNCATE TABLE {table} RESTART IDENTITY CASCADE")
        cur.execute("SET session_replication_role = 'origin'")

        # Defer FK validation to COMMIT — handles circular references safely
        cur.execute("SET CONSTRAINTS ALL DEFERRED")

        for table in LOAD_ORDER:
            rows = data.get(table, [])
            if not rows:
                stats.record(table, 0)
                continue

            cols = list(rows[0].keys())
            placeholders = ", ".join(["%s"] * len(cols))
            col_names = ", ".join(cols)
            query = (
                f"INSERT INTO {table} ({col_names}) VALUES ({placeholders}) "
                f"ON CONFLICT DO NOTHING"
            )

            psycopg2.extras.execute_batch(
                cur,
                query,
                [tuple(row.get(c) for c in cols) for row in rows],
                page_size=100,
            )
            stats.record(table, len(rows))
            print(f"  {table}: {len(rows)} rows")

        conn.commit()  # FK constraints validated here — will raise if integrity fails

    return stats


def main() -> None:
    parser = argparse.ArgumentParser(description="Extract and load test data subset")
    parser.add_argument("--days", type=int, default=90)
    parser.add_argument("--user-limit", type=int, default=500)
    args = parser.parse_args()

    prod_url = os.environ.get("PROD_DB_URL")
    test_url = os.environ.get("TEST_DB_URL")

    if not prod_url or not test_url:
        print("ERROR: Set PROD_DB_URL and TEST_DB_URL environment variables.", file=sys.stderr)
        sys.exit(1)

    try:
        raw_data = extract_subset(prod_url, args.days, args.user_limit)
        masked_data = mask_pii(raw_data)
        stats = load_subset(test_url, masked_data)
        print(f"Done. {stats}")
    except ValueError as e:
        print(f"ERROR: {e}", file=sys.stderr)
        sys.exit(1)
    except psycopg2.errors.ForeignKeyViolation as e:
        print(f"FK INTEGRITY ERROR: {e}", file=sys.stderr)
        print("Hint: Check LOAD_ORDER — a parent table may be listed after its child.", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()

Mock Data vs. Subset Data: Side-by-Side Comparison

Understanding when each approach is appropriate prevents over-engineering small projects and under-engineering ones that have grown past the fixture threshold:

Dimension	Static Mock Data	Subset Data
Setup time	Minutes — write a JSON or SQL fixture file	Hours first time; minutes to refresh after
Data realism	Low — hand-crafted, rarely includes real edge cases	High — inherits production data distributions
FK integrity	Manual — breaks on every schema migration	Automatic — FK walk enforces consistency
Privacy risk	None (all fake)	Requires explicit PII masking step
Maintenance burden	High — update fixtures for every schema change	Low — re-run script; SQL adapts to current schema
Edge case coverage	Only what you remember to add	Inherits production edge cases automatically
CI speed	Very fast — tiny in-memory dataset	Fast — optimized Docker container, sub-60s seed
Best for	Unit tests, early projects (< 6 tables)	Integration tests, mature apps (> 10 tables, FK-heavy schemas)

Edge Cases and Gotchas You Will Hit

No subsetting guide is honest without the failure modes. These are the gotchas that bite almost every team on their first implementation:

Circular FK references. Some schemas have cycles: users.default_address_id points to addresses, and addresses.user_id points back to users. Standard parent-before-child import order fails. Solution: set default_address_id = NULL during import, then run a follow-up UPDATE after both tables are loaded. Alternatively, use DEFERRABLE INITIALLY DEFERRED on the FK column definition and PostgreSQL will validate the constraint only at COMMIT time — after all rows exist.
Forgotten lookup tables. Your order_status column might be a FK to a status_types reference table with 8 rows that you completely forgot to include. Always check for tables with under 100 rows — they are usually lookup/enum tables that are implicitly required everywhere.
Soft-deleted records still referenced. If your schema uses deleted_at timestamps, your WHERE status = 'active' filter might exclude products that were soft-deleted but are still referenced by historical orders in your subset. The FK constraint does not care about soft-delete status — it only checks whether the row exists.
Multi-tenant schema fragmentation. If rows are partitioned by tenant_id, your subset must contain complete tenants — not fragments. A user from tenant A whose orders reference tenant B's settings rows creates FK violations and breaks row-level security policies in ways that produce cryptic access errors.
Binary and large JSONB columns bloating subset size. Selecting SELECT * FROM products might include a product_images BYTEA column or a preferences JSONB blob that pushes your "500-user subset" to 2GB. Explicitly select only the columns your tests actually need.
ORDER BY random() on large tables is slow. PostgreSQL must assign a random value to every row before sorting. On a 10M-row table this can take minutes. Use TABLESAMPLE SYSTEM(0.1) for approximate 0.1% sampling — it reads only random 8KB pages and is orders of magnitude faster, at the cost of not being perfectly random at the row level.

Troubleshooting Common Failures

When this pipeline breaks, the error messages can be cryptic. Here is how to diagnose the most common issues quickly:

Error: insert or update on table "orders" violates foreign key constraint "orders_user_id_fkey"

Cause: Orders are being inserted before their referenced users exist, or some users were excluded from the subset but their orders were not.
Fix 1: Verify your load order matches LOAD_ORDER in the script — users must be inserted before orders.
Fix 2: Run the integrity check DO block from the SQL script before exporting. It will tell you exactly which rows have missing parents.
Fix 3: Add SET CONSTRAINTS ALL DEFERRED in the transaction — PostgreSQL will validate FK constraints at COMMIT time instead of per-INSERT.

Problem: Container shows "unhealthy" but never errors out

Cause: An error in your init SQL script caused PostgreSQL to crash after startup. The healthcheck fails silently and Docker keeps restarting the container.
Fix: docker logs myapp-testdb 2>&1 | tail -50. Look for lines starting with ERROR: or FATAL:. Common root causes: a CREATE EXTENSION for an extension not installed in postgres:15-alpine(install it with RUN apk add postgresql15-contrib in a custom Dockerfile), or a syntax error in your subset CSV file causing the COPY to fail.

Error: ValueError: No active users found in last 90 days

Cause 1: The PROD_DB_URL env var points to the wrong database or is missing entirely — double-check with echo $PROD_DB_URL.
Cause 2: Your read replica has high lag. Check: SELECT NOW() - pg_last_xact_replay_timestamp() AS replica_lag;
Cause 3: The time window genuinely has no qualifying users. Extend --days or relax the status IN filter. Start with --days 365 to rule out the window as the issue.

Problem: Import appears successful but tests fail with "record not found"

Cause: ON CONFLICT DO NOTHING silently skipped rows that conflicted with stale data from a previous run. Your test is looking for a row that was skipped, not inserted.
Fix: The TRUNCATE block in the Python script should have cleared old data before inserting. Verify SET session_replication_role = 'replica' is executing before the TRUNCATE — if this is skipped (e.g., due to an exception in a previous run leaving the connection in a bad state), the TRUNCATE may silently fail due to FK violations.

Problem: Extraction takes 5+ minutes and blocks the team

Cause: ORDER BY random() on a large table is doing a full table scan to assign random values before sorting.
Fix: Replace ORDER BY random() LIMIT 500 with TABLESAMPLE SYSTEM(0.05) (approximately 0.05% of rows). For a 10M-row users table, TABLESAMPLE reads random 8KB pages instead of scanning the whole table — typically 100x faster. Caveat: TABLESAMPLE gives approximate counts and slightly less uniform distribution. For test data this tradeoff is almost always acceptable.

Integrating This Pipeline Into CI

Once the local workflow is stable, the CI integration pattern is straightforward: store a compressed subset SQL dump in your repository (refreshed weekly by a scheduled job), and have CI spin up the Docker container using that file as the init script.

This approach keeps CI jobs deterministic — every job on the same commit uses identical data — and eliminates the need for CI runners to access your production replica. The weekly refresh job (a cron task that runs the Python script and commits the updated CSV files) keeps the data from going stale without adding production-replica latency to every CI run.

Anecdotally, teams that switch from static mocks to subset data for integration tests commonly report that classes of bugs which previously only appeared in staging start surfacing locally — catching them earlier, when they are cheapest to fix.

You've Just Graduated From Mock Data

Mock data gets you started. Subset data gets you production-grade confidence. The gap between them is larger than most developers realize until the first time a bug that "couldn't exist locally" becomes a production incident at 2am.

What you've built here — a Docker-isolated database with durability tuned for speed, a SQL subsetting script that walks your FK graph and validates integrity before export, and a Python automation layer with PII masking and FK-cycle handling — is exactly the infrastructure professional engineering teams use. It fits in a few files and a clear process.

Start with the Docker Compose setup. Get your schema loading cleanly. Then layer in the SQL subsetting script. Once that is stable, add the Python automation and wire it into a weekly refresh cron. Build incrementally, and you will have the full pipeline running inside a week — and a test suite that actually reflects your production environment.

Scaling Beyond Mock Data: Implementing Production-Grade Test Data Subsetting with Docker and SQL

Your test suite is lying to you — here's how to give it real data without exposing production.