A buyer's guide, decomposed

Performance testing splits into three layers. Pick one open-source tool per layer.

Every guide on this topic flattens k6, JMeter, Gatling, Locust, and Lighthouse into one bucket called "performance." That framing is what gets teams stuck with a green dashboard and a slow product. The honest decomposition is three layers, a short list of tools per layer, and a willingness to admit your favorite framework should refuse to do the wrong job.

Matthew Diakonov · 11 min read

The three layers, in plain language

When a product feels slow, the slowness is happening at one of three places. The tool you reach for has to match. Mixing them up is the most common reason a performance program produces dashboards that nobody trusts.

| Layer | What it measures | Failure modes it catches | Tools that live here |
| --- | --- | --- | --- |
| Protocol load | Throughput, error rate, latency percentiles under N concurrent users | DB pool exhaustion, queue overflow, autoscaler lag, 5xx under burst | k6, JMeter, Gatling, Locust, Artillery, Tsung |
| Network / edge | Connection setup, TLS handshake, HTTP/2 multiplexing, raw RPS ceiling | Slow TLS renegotiation, head-of-line blocking, single-instance saturation | wrk, Apache Bench, Vegeta, oha, hey |
| Browser-perceived | LCP, CLS, INP, total blocking time, resource waterfall, visible paint | Heavy JS bundle, render-blocking CSS, layout shift, slow third-party scripts | Lighthouse, Sitespeed.io, WebPageTest, web-vitals |

A single tool that claims to span all three layers should be treated with suspicion. The internal architectures are different (load generators are about TCP and concurrency; browser tools are about V8 and the rendering pipeline), and doing both well in one binary is a research project, not a product. The honest open-source projects pick a layer and commit to it.

Layer 1: protocol load (the layer most people mean)

When somebody says "run a load test," they almost always mean: hold N concurrent virtual users hitting our API for M minutes and tell me when it falls over. Six open-source tools share this layer. The differences that matter are language fluency, license, and editorial overhead per test change.

k6

JavaScript-scripted, single-binary, AGPL v3. The default pick for teams that already write JS in CI. k6 Studio (separate, also OSS) records browser HTTP and emits a working test script.
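
To make the shape concrete, here is a minimal k6 sketch: 50 virtual users for five minutes, with thresholds that fail the run instead of only charting it. The URL, duration, and budget numbers are placeholders, not recommendations.

// A minimal k6 load-test sketch. The staging URL and the threshold
// values below are assumptions; swap in your own endpoint and SLOs.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,
  duration: '5m',
  thresholds: {
    http_req_duration: ['p(95)<500'], // fail the run if p95 latency exceeds 500 ms
    http_req_failed: ['rate<0.01'],   // fail the run if more than 1% of requests error
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between iterations
}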

JMeter

Apache 2.0, JVM-based, GUI-first. The most battle-tested option, with twenty-plus years of plugins. .jmx XML test plans run headless in any CI. Steepest editorial overhead per test change.

Gatling

Apache 2.0, Scala/Java, code-first. Asynchronous architecture, low memory per virtual user, excellent HTML reports. Best fit for teams comfortable on the JVM that want both code and depth.

Locust

MIT-licensed, Python-scripted, event-driven. User behavior is plain Python classes. Web UI for live spectator dashboards. Pythonic ergonomics, not the highest single-machine throughput.

Artillery

MPL-2.0, Node-based, YAML or JS scenarios. Has an opt-in Playwright integration for browser-driven load. Lightweight; pleasant for HTTP and WebSocket scenarios at modest scale.

Tsung

GPL-2.0, Erlang-based. The right pick when your protocol is XMPP, MQTT, AMQP, or you need a process model that scales to hundreds of thousands of long-lived connections per node.

Layer 2: network and edge benchmarking

These tools answer a narrower question: what happens at the network edge before any business logic runs? They are the right tool when you suspect the load generator itself is skewing your results, when you need a fast smoke benchmark of one endpoint, or when you are debugging the connection behavior of a CDN, a proxy, or a TLS termination layer. None of them are good at user simulation; that is not their job.

wrk

Apache 2.0, single C binary, the canonical HTTP benchmarker. Lua scripting for custom request shapes. The tool you reach for when you suspect the load generator itself is the bottleneck.

Apache Bench (ab)

Apache 2.0, ships with httpd. Quick smoke benchmark of a single endpoint. Limited concurrency model (single-threaded, no HTTP/2 support). Good for sanity checks, not realistic user simulation.

Vegeta

MIT, Go binary. Constant-rate attack model (rather than concurrent users), which is the right shape if you care about RPS targets. Streams results to a binary log you can report on or convert later.

oha / hey

Single binaries (hey in Go, oha in Rust), permissively licensed. Quick concurrency-controlled load against one URL, with a live TUI in oha's case. The lightweight pick when you want a 30-second answer and not a test plan.

Layer 3: browser-perceived performance

The user does not live in a load test. They live in a browser. This layer measures what the browser actually does with the page: parse, paint, hydrate, interact. None of the load tools above are at this layer; treating Lighthouse as a substitute for k6 (or vice versa) is the textbook category error. A healthy program runs both and correlates them.

Lighthouse / Lighthouse CI

Apache 2.0, Google. The standard for Core Web Vitals scoring (LCP, CLS, INP). Lighthouse CI runs on every push, asserts budgets, surfaces regressions inline in pull requests.
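
As a sketch of what "asserts budgets" looks like in practice, this is the general shape of a lighthouserc.js file. The URLs and the numeric budgets are assumptions; replace them with your own pages and the Core Web Vitals targets you actually commit to.

// lighthouserc.js: a minimal Lighthouse CI budget sketch (values are placeholders).
module.exports = {
  ci: {
    collect: {
      url: [
        'https://staging.example.com/',
        'https://staging.example.com/checkout',
      ],
      numberOfRuns: 3, // median of several runs smooths lab noise
    },
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-blocking-time': ['warn', { maxNumericValue: 300 }],
      },
    },
    upload: { target: 'temporary-public-storage' },
  },
};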

Sitespeed.io

MIT-licensed, Node-based. Multi-page crawls, waterfall capture, video recording, multiple browser engines. Heavier setup than Lighthouse but produces investigation-grade artifacts.

WebPageTest (self-hostable agent)

Polyform Shield license, so source-available rather than OSI-approved open source. Real-browser, real-network performance auditing. The hosted version is famous; the agent and runner can be self-hosted against private staging environments.

web-vitals (the Google library)

Apache 2.0, single npm package. Reports LCP/CLS/INP from real users in production. Pair with your analytics backend (PostHog, GA4, custom) to get field data, not lab data.
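
A minimal field-data sketch, assuming an in-house collection endpoint at /analytics/vitals (that path is hypothetical; point it at whatever your analytics backend expects):

// Field-data collection with the web-vitals package.
import { onLCP, onCLS, onINP } from 'web-vitals';

function sendToAnalytics(metric) {
  // sendBeacon survives page unload, unlike a plain fetch
  const body = JSON.stringify({
    name: metric.name,   // 'LCP' | 'CLS' | 'INP'
    value: metric.value, // milliseconds for LCP/INP, unitless for CLS
    id: metric.id,       // unique per page load, useful for deduplication
  });
  navigator.sendBeacon('/analytics/vitals', body);
}

onLCP(sendToAnalytics);
onCLS(sendToAnalytics);
onINP(sendToAnalytics);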

The discipline most articles miss: scope refusal as a feature

A useful signal of a tool's quality is whether it knows what it does NOT do, and whether that boundary is committed somewhere a buyer can read. Most listicles never check. Below is a working example from a public open-source repo outside the load-testing space: an end-to-end testing framework whose system prompt explicitly forbids the AI from generating performance test cases. That refusal is the right call: an E2E framework that pretends to do performance will produce noisy numbers from a non-isolated browser, and users of that framework will trust numbers they should not.

Anchor fact, checkable in the public repo

The Assrt MCP server's system prompt refuses to author performance tests.

Assrt is an open-source AI browser-testing tool. It is not a performance testing tool; it sits on the correctness layer next to Playwright and Cypress. What is unusual is that the refusal is in the source. Two prompts in two files repeat the same constraint. The line numbers below are the actual file lines on main.

// src/core/agent.ts:267 — Assrt MCP server, MIT licensed.
// Inside the system prompt that the LLM sees when it is asked
// to discover test cases on a brand new page:

const DISCOVERY_SYSTEM_PROMPT = `You are a QA engineer generating
quick test cases for an AI browser agent that just landed on a
new page. The agent can click, type, scroll, and verify visible
text.
...
## Rules
- Generate only 1-2 cases
- Each case must be completable in 3-4 actions max
- Reference ACTUAL buttons/links/inputs visible on the page
- Do NOT generate login/signup cases
- Do NOT generate cases about CSS, responsive layout, or performance`;

// src/mcp/server.ts:233 — the same constraint, restated in the
// MCP tool prompt the IDE agent sees when planning a test run:

3. **Verify observable things** — check for visible text, page
   titles, URLs, element presence. NOT for CSS, colors,
   performance, or responsive layout.

Why it matters for tool selection: when an open-source project ships its own scope refusal, you can stack it with confidence. You know exactly which layer it is at, you know which other tool you still need, and you do not have to discover the limits the hard way after a bad run.

Two source files in the assrt-mcp repo independently encode the same scope refusal: do NOT generate cases about performance. The locations are src/core/agent.ts:267 and src/mcp/server.ts:233 (MIT licensed).

The fantasy of one tool vs. the honest stack

Picking one tool and trusting it is the path most teams start on, because the listicles imply it is sufficient. Almost everyone ends up at the honest three-tool stack eventually. The cost of getting there is one or two incidents that the dashboard did not predict.

One tool vs. three honest tools

One open-source "performance testing" tool, picked from a listicle, run on a cron, dashboard wired up. The symptoms are predictable:

  • Green load runs, but real users still see slow pages.
  • Nobody can localize the slowness: is it API, network, or render?
  • The Lighthouse score never drops, because nobody runs it under load.
  • There is a single dashboard and a single confidence number, so on-call cannot tell which dashboard to open during an incident.
  • The tool's own docs warn against the layer you are using it for.

How to actually pick, in five steps

The selection process is not feature-comparison-spreadsheet shaped. It is failure-mode shaped. Start from the incident you do not want to repeat and work back to a layer, then to a tool inside that layer that fits your team's language fluency.

1. Name the failure mode you actually fear

Write down the specific incident you do not want to repeat. "API tipped over at 3x normal traffic." "Checkout button took 4s to respond on a real Pixel 6." "Mobile users saw a 2s layout shift." The failure mode names the layer, and the layer names the tool.

2. Match the failure mode to a layer

Throughput, error rate, queue depth, and DB connection saturation are protocol-load symptoms; reach for k6, JMeter, Gatling, or Locust. Connection setup, TLS handshake, and HTTP/2 contention are network-edge symptoms; reach for wrk or Vegeta. LCP, CLS, INP, and JS execution time are browser-perceived symptoms; reach for Lighthouse, Sitespeed, or web-vitals.

3. Pick by language fluency, not feature list

Within a layer, the differences between leading tools are smaller than the difference between a tool you can debug at 2am and a tool you cannot. Python team picks Locust. JS team picks k6 or Artillery. JVM team picks Gatling or JMeter. Anyone picks wrk for edge benchmarking; it is a single binary.

4. Add the second tool deliberately

Almost no team is well-served by exactly one of these tools. The minimum viable stack is one load tool plus Lighthouse CI for perceived performance. The third addition (a network-edge tool) shows up the first time you debug a connection that is fine on paper but slow in production.

5. Put functional E2E on a separate track

End-to-end correctness tests (Playwright, Cypress, Assrt) are NOT performance tests. They live in a different CI stage, run on a different cadence, and answer a different question (did the right thing happen). Asking your E2E framework to also report performance is how you get noisy data and bad alerts.

A note on licenses, because most articles bury this

The licenses cluster: most network-edge tools are permissive (Apache 2.0 or MIT); the JVM load tools are Apache 2.0; Locust is MIT; Artillery is MPL-2.0; k6 OSS is AGPL v3, which is the strictest of the bunch. AGPL extends copyleft across the network boundary: if you fork k6 and run a competing managed cloud service, you must publish your modifications. Running k6 in your own CI to test your own application is unaffected, which covers most teams. If your legal posture is allergic to AGPL on principle, JMeter, Gatling, and Locust are the permissive alternatives, and the second-day operability difference is language fluency, not license.

For the browser-perceived layer, Lighthouse is Apache 2.0 (Google), Sitespeed.io is MIT, the web-vitals npm package is Apache 2.0. The license question is rarely the deciding factor at this layer; pick by what artifact you want (lab-data score vs. waterfall vs. real-user telemetry).

Pairing an honest E2E layer with your performance stack?

If you already run k6 or Lighthouse and you want correctness coverage that knows it is not a performance tool, we can sketch the stack with you in 20 minutes.

Frequently asked questions

Is performance testing the same thing as load testing?

No, and conflating them is the root cause of most bad tool selection. Load testing means: hold N concurrent virtual users hitting your API and measure throughput, latency, and error rate at the protocol layer. Browser-perceived performance means: a real Chromium opens your page, paints pixels, runs JavaScript, and the user feels how long until the first interaction worked. Network-edge testing sits between them, asking how a TCP connection or HTTP/2 multiplexing behaves under contention. Open-source tools cluster around one of those three layers and only rarely cross over. k6 is a load testing tool. Lighthouse is a browser-perceived performance tool. wrk is a network-edge tool. Treating them as substitutes is how teams end up with a green k6 dashboard while real users experience a 6-second LCP.

Which open-source load testing tool should a small team start with in 2026?

If your team writes JavaScript or TypeScript every day, k6 is the path of least resistance. Tests are JavaScript modules, the CLI is a single binary, and CI integration is a one-line shell command. The tradeoff is that k6 open source is licensed under AGPL v3, which is copyleft, so be aware if your usage is unusual (most teams running k6 in CI are unaffected). If you write Python, Locust is the same shape but Pythonic, MIT licensed, and the web UI is friendly for non-engineers spectating a run. JMeter and Gatling are both Apache 2.0, both more powerful, both with steeper learning curves. JMeter is GUI-first and battle-tested across two decades of enterprise deployments; Gatling is code-first in Scala or Java with excellent reporting. Pick by language fluency, not by feature checklist, because the second-day operability difference is much larger than any feature gap.

Why does k6 use AGPL v3 when the other tools use permissive licenses?

k6 is now part of Grafana Labs, and Grafana Labs uses AGPL on its open-source projects to limit competing managed cloud services without limiting in-house use. AGPL is a strong copyleft license: if you modify k6 and run it as a service for third parties, you must publish your modifications. Running k6 inside your own CI to test your own application is the normal use case and is not affected. If you are wary of AGPL specifically because your legal team has a blanket policy against it, JMeter (Apache 2.0), Gatling (Apache 2.0), and Locust (MIT) are the permissive alternatives.

Where do Lighthouse and Web Vitals fit if they are not load testing tools?

They measure what an individual user experiences in the browser: First Contentful Paint, Largest Contentful Paint, Cumulative Layout Shift, Interaction to Next Paint. None of those are load metrics. They tell you whether the page is slow for one user, not how it behaves under thirty thousand users. The right pairing is to run k6 or JMeter to put load on the API while Lighthouse runs the page in headless Chrome against the same backend. If LCP doubles under load, you have a backend bottleneck affecting the frontend; if LCP stays flat under load but is bad in isolation, the problem is in the JavaScript bundle or render path. Tools that try to merge both layers usually do one of them poorly.

Can my E2E test framework run performance tests too?

Almost always no, and the framework should tell you so out loud. End-to-end test frameworks (Playwright, Cypress, WebdriverIO, Selenium, Assrt) verify correctness against a real DOM. They use timing internally for waits and stability detection, but their assertions are about whether the right thing rendered, not how fast. The Assrt MCP server, for example, ships a discovery prompt at src/core/agent.ts:267 and an MCP server prompt at src/mcp/server.ts:233 that explicitly forbid the LLM from generating performance test cases. That refusal is committed to the public MIT-licensed repo. It is the right call: a tool that pretends to do performance testing while measuring DOM ready timestamps in a non-isolated browser will mislead you. Use a real load testing tool for load, a real perceived-performance tool for perceived performance, and an E2E framework for correctness.

What is the smallest honest performance testing stack a startup can run?

Three open-source tools, each at one layer. For protocol load: k6 (AGPL) or Locust (MIT). For network-edge benchmarking when you are debugging the connection itself: wrk or oha as a single binary. For browser-perceived performance: Lighthouse CI on every deploy, optionally extended with Sitespeed.io for a fuller waterfall view. The same dev pipeline runs all three; they produce different signal; you do not need a fourth tool unless you are testing a non-HTTP protocol (gRPC, MQTT, WebRTC), at which point Tsung or k6 with the relevant extension fills the gap.

Are GUI-based tools like JMeter actually slower in CI than code-first tools like k6 or Gatling?

Slightly, but not in the way the discussion usually frames it. JMeter test plans are XML files; you can run them headless with the JMeter CLI in any CI. The friction is editorial: every change to the test plan opens the GUI, and most teams end up with a JMeter expert who owns the .jmx files because nobody else wants to. k6 and Gatling avoid that bottleneck by making the test a normal source file in a normal language. Per-run wall clock is comparable. Per-week team velocity is not.

How do I know if my open-source performance tool is generating realistic load?

Three checks. First, look at the connection model: a load testing tool that opens one TCP connection per virtual user (k6, Gatling) is closer to real browser behavior than one that pools connections aggressively (some HTTP benchmarkers). Second, look at think time: if your test fires 10,000 RPS with no inter-request delay, you are stress testing your own load generator's network stack as much as the system under test. Third, run the same scenario from multiple geographic egresses (k6 cloud, your own CI runners in different regions) and compare: if the latency profile changes wildly, you are measuring your egress, not your application.
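
On the think-time point specifically, the difference is one line in the script. A hedged k6 sketch, with a 1-4 second pause standing in for real user pacing; the URL and the pause range are assumptions, and the range is better sampled from your own session data if you have it:

// Without the sleep, each virtual user fires requests back-to-back and you
// are benchmarking the load generator's own network stack as much as the
// system under test.
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  http.get('https://staging.example.com/api/search?q=widgets');
  sleep(1 + Math.random() * 3); // 1-4 seconds between actions, like a human
}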
