Getting Playwright running locally takes about 20 minutes. Getting it running properly inside a CI/CD pipeline, in a way that's actually useful rather than just technically passing, is a different problem. I've helped a lot of teams set this up over the years, and the same issues come up every time: flaky tests that only fail in CI, pipelines that take forever, reports nobody looks at, and a configuration that made sense for one person's laptop but falls apart at scale.
This guide is the setup I'd walk a team through from scratch. It uses GitHub Actions because that's what most teams are running, but the same principles apply in GitLab CI, CircleCI, and the rest. The goal isn't just a working pipeline. It's a pipeline that gives your team reliable, actionable feedback on every push.
This guide covers:

- A basic GitHub Actions workflow for Playwright
- Browser caching to keep runs fast
- Test sharding for parallel execution
- HTML report artifacts
- Retry configuration
- Environment variables and secrets
- The most common reasons CI runs behave differently than local runs
Start with a clean playwright.config.ts
Before you touch any CI configuration, your Playwright config needs to be set up correctly for a CI environment. A lot of the frustration people have with Playwright in CI comes from config values that work fine on a developer's machine but cause problems in a headless, containerized environment.
Here's a config that handles the most common CI scenarios cleanly:
```typescript
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 2 : 0,
  workers: process.env.CI ? 1 : undefined,
  reporter: process.env.CI
    ? [['html', { open: 'never' }], ['github']]
    : 'html',
  use: {
    baseURL: process.env.BASE_URL || 'http://localhost:3000',
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```
A few things worth explaining here. The forbidOnly flag prevents someone from accidentally committing a test.only call that would silently skip all other tests in CI. It's caught us more than once. The workers: 1 in CI is intentional for the basic setup. CI runners, especially GitHub's free tier, don't have the memory headroom to run multiple browser contexts in parallel without instability. We'll handle actual parallelization through sharding instead, which is much more reliable.
The basic GitHub Actions workflow
Here's a working starting point that handles browser installation caching, runs tests across all three browsers, and uploads the HTML report as a downloadable artifact:
```yaml
name: Playwright Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    timeout-minutes: 60
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Cache Playwright browsers
        uses: actions/cache@v4
        id: playwright-cache
        with:
          path: ~/.cache/ms-playwright
          key: playwright-${{ hashFiles('package-lock.json') }}

      - name: Install Playwright browsers
        if: steps.playwright-cache.outputs.cache-hit != 'true'
        run: npx playwright install --with-deps

      - name: Install browser system deps (cache hit)
        if: steps.playwright-cache.outputs.cache-hit == 'true'
        run: npx playwright install-deps

      - name: Run Playwright tests
        run: npx playwright test
        env:
          CI: true
          BASE_URL: ${{ secrets.BASE_URL }}

      - name: Upload test report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 14
```
The browser caching step makes a real difference. Without it, every CI run downloads 300-400MB of browser binaries. With caching keyed to your package-lock.json, those binaries are reused until your Playwright version changes. On a busy repo, this saves several minutes per run and reduces egress costs noticeably over time.
The if: always() on the report upload is important. Without it, the artifact only uploads on a passing run, which means you lose the detailed failure report exactly when you need it most.
Adding test sharding for faster pipelines
Once your test suite grows past around 50 tests, a single runner starts to feel slow. Playwright's built-in sharding splits your test suite across multiple parallel jobs, each running a different slice of the total tests. Here's how to add it:
```yaml
jobs:
  test:
    timeout-minutes: 60
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shardIndex: [1, 2, 3, 4]
        shardTotal: [4]
    steps:
      # ... checkout, node, npm ci, browser cache steps same as above ...

      - name: Run Playwright tests (sharded)
        # The blob reporter is what merge-reports consumes later
        run: npx playwright test --reporter=blob --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
        env:
          CI: true
          BASE_URL: ${{ secrets.BASE_URL }}

      - name: Upload blob report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shardIndex }}
          path: blob-report
          retention-days: 1

  merge-reports:
    if: always()
    needs: [test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci

      - name: Download blob reports
        uses: actions/download-artifact@v4
        with:
          path: all-blob-reports
          pattern: blob-report-*
          merge-multiple: true

      - name: Merge into single HTML report
        run: npx playwright merge-reports --reporter html ./all-blob-reports

      - name: Upload merged report
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 14
```
With 4 shards, a 200-test suite that previously took 18 minutes now runs in roughly 5. The shard count is something you tune based on your suite size. A good rule of thumb: aim for each shard to take no more than about 5 minutes. More than that and developers start skipping the CI check rather than waiting.
The five things that trip up CI runs
I want to spend some time on the failure patterns because this is where most teams lose hours. These aren't edge cases; they're the things I see go wrong consistently on the first few attempts.
Hardcoded localhost URLs
The most common first mistake. Tests that point directly to localhost:3000 in test files will fail in CI because there's no running server at that address. Use the baseURL config option and pass the target environment URL as a secret. If you need to test against a locally spun-up server, use Playwright's webServer config option to start it automatically before the test run begins.
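As a sketch of what that looks like in playwright.config.ts (the dev-server command and port here are assumptions; substitute whatever boots your app):

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Tests call page.goto('/some/path') and this prefix is applied
    baseURL: process.env.BASE_URL || 'http://localhost:3000',
  },
  webServer: {
    command: 'npm run start',             // placeholder: your app's start command
    url: 'http://localhost:3000',         // Playwright waits until this URL responds
    reuseExistingServer: !process.env.CI, // locally, reuse a server you already started
    timeout: 120_000,                     // allow for a slow cold start in CI
  },
});
```

With this in place, tests use relative paths like `page.goto('/login')` and the same suite runs unchanged against localhost or a staging URL.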
Missing system dependencies for browsers
Playwright's browsers need OS-level libraries that aren't present on a fresh Ubuntu runner by default. The --with-deps flag on the install command handles this on a clean install. But when the browser cache is hit and you skip the install step, those system deps don't get installed. That's why the workflow above has a separate install-deps step for the cache-hit case.
Tests that rely on local state or fixtures
Tests that pass locally because they depend on cookies, localStorage, or database state left over from a previous run will fail in CI every time, because each CI run starts fresh. Use Playwright's storageState feature to capture and replay auth state, and make sure your test fixtures set up whatever state they need rather than assuming it already exists.
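One way to wire that up is a global setup file that logs in once and saves the resulting state; this is a sketch, and the login URL, form labels, and env variable names are placeholders for your app:

```typescript
// global-setup.ts — log in once, persist auth state for all tests to reuse.
import { chromium, type FullConfig } from '@playwright/test';

async function globalSetup(config: FullConfig) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Hypothetical login flow; adapt URL and selectors to your app
  await page.goto(`${process.env.BASE_URL}/login`);
  await page.getByLabel('Email').fill(process.env.TEST_USER_EMAIL!);
  await page.getByLabel('Password').fill(process.env.TEST_USER_PASSWORD!);
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.waitForURL('**/dashboard');

  // Persist cookies + localStorage so tests start already authenticated
  await page.context().storageState({ path: 'playwright/.auth/user.json' });
  await browser.close();
}

export default globalSetup;
```

Point the config at it with `globalSetup: './global-setup.ts'` and `use: { storageState: 'playwright/.auth/user.json' }`, and every test begins logged in without repeating the UI flow.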
Timing assumptions that break under load
CI runners are slower than developer laptops, and shared runners have variable performance depending on what else is running. Tests that work locally because the page loads in 200ms will intermittently fail in CI where it takes 800ms. The fix is proper use of Playwright's auto-wait and waitFor assertions rather than arbitrary timeout calls. If you're seeing flakiness specifically in CI, check for any page.waitForTimeout() calls. Those are almost always the culprit.
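Here's an illustrative test showing the pattern; the route and element names are hypothetical:

```typescript
import { test, expect } from '@playwright/test';

test('orders page loads', async ({ page }) => {
  await page.goto('/orders'); // resolved against baseURL from the config

  // Flaky in CI — assumes the page settles within a fixed delay:
  //   await page.waitForTimeout(2000);

  // Reliable — a web-first assertion retries until the element appears,
  // up to the expect timeout, so a slow runner simply waits longer:
  await expect(page.getByRole('heading', { name: 'Orders' })).toBeVisible();
});
```

The difference under load is that the fixed sleep fails whenever the page takes longer than two seconds, while the assertion passes the instant the heading renders, whether that's 200ms or 8 seconds in.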
Environment variables not passed through
Any environment variable your tests or application needs has to be explicitly passed in the workflow's env block. This includes things like API keys for staging environments, feature flags, and auth tokens. Store sensitive values as GitHub Actions secrets and reference them as ${{ secrets.YOUR_SECRET_NAME }} rather than hardcoding them anywhere.
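A workflow fragment illustrating this (the secret names here are examples; define your own under Settings → Secrets and variables → Actions):

```yaml
- name: Run Playwright tests
  run: npx playwright test
  env:
    CI: true
    BASE_URL: ${{ secrets.BASE_URL }}
    # Hypothetical examples of values a suite might need:
    API_KEY: ${{ secrets.STAGING_API_KEY }}
    FEATURE_FLAGS: 'new-checkout,dark-mode'
```

If a variable isn't listed here, your tests see `undefined` in CI even though it's set in your local shell or .env file.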
Making the reports actually useful
The HTML report Playwright generates is genuinely good, but only if people look at it. One pattern that helps is adding a summary comment to pull requests that links to the workflow run where the report artifact can be downloaded. Here's a step you can add after the report upload:
```yaml
- name: Comment test results on PR
  uses: actions/github-script@v7
  if: always() && github.event_name == 'pull_request'
  with:
    script: |
      const status = '${{ job.status }}';
      const emoji = status === 'success' ? '✅' : '❌';
      const body = `${emoji} **Playwright Tests ${status}**\n\n` +
        `[View full HTML report](https://github.com/${{ github.repository }}` +
        `/actions/runs/${{ github.run_id }})`;
      github.rest.issues.createComment({
        issue_number: context.issue.number,
        owner: context.repo.owner,
        repo: context.repo.repo,
        body,
      });
```
Small thing, but it means developers see the test result and a direct link to the report right in the PR thread, without having to navigate to the Actions tab separately. When feedback is easier to access, it actually gets looked at.
A few configuration settings worth knowing
These are the config values that come up most often when we're setting this up for clients:
- retries: 2 in CI is a good default for most suites. It handles genuine network or timing flakiness without masking real failures. If a test only passes on the third try, Playwright marks it as flaky in the report, which is a useful signal rather than a hidden problem.
- trace: 'on-first-retry' means you get a full trace file (network requests, DOM snapshots, timeline) on any test that needed a retry. This is usually enough detail to diagnose the problem. Using trace on every test adds overhead you probably don't need.
- video: 'retain-on-failure' keeps video recordings only for failing tests, which keeps artifact size manageable without losing the diagnostic value.
- timeout at the test level defaults to 30 seconds. For tests that involve complex UI flows, you may need to increase this selectively. Don't increase the global timeout to paper over slow tests; that just delays when you find out a test is hanging.
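For the selective timeout increase, the per-test override looks like this (the flow itself is a placeholder):

```typescript
import { test, expect } from '@playwright/test';

test('full checkout flow', async ({ page }) => {
  // This test alone gets 60s instead of the 30s default;
  // the rest of the suite keeps the global timeout.
  test.setTimeout(60_000);

  await page.goto('/checkout');
  // ... multi-step UI flow ...
  await expect(page.getByText('Order confirmed')).toBeVisible();
});
```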
A CI pipeline that takes 25 minutes to complete is almost as bad as no pipeline at all. Developers stop waiting for it, start merging without the green check, and the value drops to near zero. If your full Playwright suite takes more than about 10 minutes end-to-end, sharding is not optional. It's the thing that keeps the pipeline in the feedback loop rather than outside it.
Where to go from here
The setup described here will handle most teams well up to a few hundred tests. As your suite grows further, the next things to look at are test tagging to run targeted subsets on specific triggers, dedicated self-hosted runners if GitHub's shared runners are too slow or expensive, and integrating with a test management tool like TestRail or Allure for reporting that product and QA stakeholders can actually read without downloading a zip file.
If you're just starting out though, get the basics right first. A well-structured 50-test suite running reliably in CI is worth more than 500 tests that nobody trusts.
Need help wiring Playwright into your pipeline?
We set up and maintain Playwright automation for engineering teams as part of our QA consulting work. If your CI runs are unreliable or your suite is getting hard to manage, let's talk.