Stop hand-writing the broken-test exercise. Auto-draft it, then break three Cases on purpose.
The dev hiring test suite exercise is the take-home where you hand a candidate a repo with a few red tests and watch how they triage. The format is great. Authoring it is painful. Most teams burn half a day hand-writing the specs, and the result drifts the moment someone ships new UI copy. There is a faster path. Point assrt_plan at your running dev server, let it draft five to eight #Case blocks from three screenshots and an accessibility snapshot, prune to the flows you care about, regress three Cases on purpose, ship. Grading becomes a single command and a thirty-second video at 5x speed, not a subjective diff read.
Why the auto-drafted plan is worth grading against
The draft is not a black box. It is six rules, one model call, and a few numbers you can check by opening the source.
- **3 screenshots** at scroll offsets 0 / 800 / 1600 px
- **8,000-char cap** on the concatenated accessibility tree
- **4,096 max tokens** on the Haiku 4.5 response budget
- **6 CRITICAL rules** encoded in PLAN_SYSTEM_PROMPT
The prompt that drafts your canonical plan
Every Case the generator emits obeys the same six rules. Open /Users/matthewdi/assrt-mcp/src/mcp/server.ts at lines 219 to 236 and you will see the full prompt. The rules you care about as an interviewer are rule 1 (“Each case must be SELF-CONTAINED”, which means no candidate can fail a later Case because an earlier one broke) and rule 4 (“Keep cases SHORT, 3-5 actions max per case”, which caps the blast radius of a single failure and makes triage tractable inside a ninety-minute take-home).
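The rules are easiest to see in an example. Here is a hypothetical Case in the shape the generator emits — the flow, field names, and copy are invented for illustration, and the exact #Case syntax in your output may differ — but it shows the constraints: self-contained, four actions, one observable assertion.

```markdown
#Case: Search returns a result for a known term
- Navigate to /app
- Type "react" into the search field
- Press Enter
- Assert: at least one result is visible and the URL contains "q=react"
```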
What actually runs when you call assrt_plan
Four inputs, one model call, one deterministic Markdown output. The pipeline is shallow on purpose; the forcing function is the prompt, not a clever chain.
Inputs to outputs
Hand-written interview suite vs. assrt_plan output
The hand-written suite is usually a relic. Someone drafted it when the signup flow had three steps; it now has two. A timing hack stayed in because nobody remembered why it was there. The generated plan is not more clever; it is simply written against the app that exists today, which is the only app the candidate is going to see.
Drift-resistant by construction
// interview-repo/tests/take-home.spec.ts
// Hand-written interview suite, three days to draft, six tests.
// Two weeks later: app copy changed, four tests are flaky,
// nobody remembers why /signup/step-3 uses a hard-coded wait.
import { test, expect } from "@playwright/test";

test("signup happy path", async ({ page }) => {
await page.goto("/signup");
await page.fill("#email", "interview@example.com");
await page.click("text=Continue");
await page.waitForTimeout(2000); // why? lost to history
await page.fill("#password", "hunter2!");
await page.click("text=Create account");
expect(page.url()).toContain("/welcome");
});
test("search returns something", async ({ page }) => {
await page.goto("/app");
await page.fill("[data-testid='q']", "react");
await page.press("[data-testid='q']", "Enter");
const count = await page.locator(".result").count();
expect(count).toBeGreaterThan(0); // set by vibe, not spec
});
// Plus four more tests like this, each a week of drift
// away from what the app actually does today.

“Generating and pruning the canonical plan is faster than reviewing two candidate diffs. The ROI kicks in on the first candidate.”
From the interviewer-side flow described in this guide
The six-step exercise workflow
This is what the interviewer-side flow looks like end to end. Each step has a fixed output: a file, a commit, or a thirty-second video.
1. Draft the canonical plan
Boot the app, hit the dev server URL, call assrt_plan. You get 5-8 #Case blocks that describe the flows the app actually affords today, not the flows it afforded when someone last hand-wrote the spec. The draft takes under ten seconds and costs a fraction of a cent in Haiku tokens.
2. Prune and pin
Read the draft. Delete Cases that are too narrow, too broad, or that test a flow you do not want candidates in. Keep five. Commit scenario.md to the interview repo so it is versioned alongside the app.
3. Seed the failures
Introduce exactly three bugs: one app regression (one-line source change), one stale test (mutate a Case assertion), one environment miss (break a seed script). Document nothing in the code; the whole point is that the candidate decides which bucket each belongs to.
4. Ship with a one-page README
Tell the candidate how to boot the dev server, how to run the plan, where scenario.md lives, and what a correct submission looks like (green run, branch with fixes, short writeup that names each bug's bucket).
5. Re-run the plan
Check out the candidate's branch, run assrt_test against scenario.md, and watch the video at 5x. Pass/fail is a boolean on the report, not a subjective read.
6. Diff the plan file
Before deciding, run git diff on scenario.md against main. If the candidate changed assertions to fake a green, it shows up here immediately. This is the single highest-signal check in the whole pipeline.
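Step 4's README can be a single block. A minimal sketch — the commands, port, and paths are placeholders for your repo, not assrt conventions:

```markdown
## Take-home: fix the red Cases

1. Boot the dev server: `npm install && npm run dev` (http://localhost:3000).
2. The test plan is `scenario.md` in the repo root. Do not edit it.
3. Run the plan with `npx assrt-mcp` against the dev server URL.
4. Three Cases are red. For each, decide: app bug, flawed test, or
   environment issue — and fix the right thing.
5. Submit a branch with your fixes, a green run, and a short writeup
   naming each bug's bucket.
```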
Five-stage timeline, measured in minutes
Draft (2 min)
Run assrt_plan against the target app URL. Read the 5-8 Case blocks the response returns. This is your starting plan, not your final plan.
Prune (5 min)
Cut Cases that test the wrong thing or duplicate each other. Keep exactly five. Every Case should be self-contained (rule 1 of PLAN_SYSTEM_PROMPT), 3-5 actions (rule 4), and verify something observable (rule 3).
Seed (5 min)
Introduce three failures, one per bucket. App bug: regress one line in the source. Flawed test: change a Case assertion to match stale copy. Environment issue: forget an env var or drop a seed script from package.json scripts.
Ship (2 min)
Commit scenario.md and the regressed app. Write a README with dev server boot command, scenario.md pointer, and the grading rubric (green run, categorized writeup, one regression Case added per app-bug fix).
Grade (3 min per candidate)
Check out their branch, run assrt_test, read the JSON report, watch the video at 5x, git diff scenario.md to catch softening. Decision comes in under three minutes.
Grading a submission in one terminal
You never have to read the candidate's diff first. You diff the plan file to catch softened assertions, then run the canonical plan against their branch. The video and the events log do the rest.
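The softening check itself needs nothing more than a line diff. A minimal sketch in TypeScript, assuming the canonical plan and the candidate's copy are both on disk (the file paths are placeholders):

```typescript
import * as fs from "node:fs";

// Report plan lines the candidate changed or deleted relative to the
// canonical scenario.md -- softened assertions show up here immediately.
export function softenedLines(canonicalPath: string, candidatePath: string): string[] {
  const canonical = fs.readFileSync(canonicalPath, "utf8").split("\n");
  const candidate = new Set(fs.readFileSync(candidatePath, "utf8").split("\n"));
  return canonical.filter((line) => line.trim() !== "" && !candidate.has(line));
}
```

An empty array means the plan text survived untouched; anything else is the first thing to ask the candidate about. (In practice `git diff main -- scenario.md` gives you the same signal with context lines.)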
Assrt vs. closed QA vendor platforms for the hiring use case
Same exercise format. Very different operating characteristics when a hiring pipeline is running.
| Feature | Closed vendor | Assrt |
|---|---|---|
| Time to draft the exercise | 2-5 hours hand-writing specs | 10 seconds via assrt_plan, 15 minutes including pruning |
| Consistency across roles | Each interviewer writes their own version | Same PLAN_SYSTEM_PROMPT produces the canonical plan |
| Exercise artifact format | Proprietary YAML in a vendor dashboard | Plain Markdown #Case blocks in scenario.md, in the repo |
| Candidate setup | Create a vendor account, wait for invite email | npx assrt-mcp, one command, zero signup |
| Grading input | Subjective diff review by one human | Green/red report plus 30-second video at 5x speed |
| Detecting assertion softening | Only caught if the grader reads the diff carefully | git diff scenario.md is the first thing you run |
| Re-running the same exercise | Re-seed the vendor project per candidate | Check out their branch, run one command |
| Cost per hire pipeline | ~$7,500/mo seat-based | $0 + cents of Haiku tokens |
Ship-readiness checklist for your exercise
- Every Case in scenario.md is self-contained (no inter-Case dependencies)
- Every Case is 3-5 actions maximum
- Every assertion checks observable state (visible text, URL, element presence)
- Exactly one seeded app bug, one flawed test, one environment issue
- scenario.md is committed to the interview repo (not in a vendor dashboard)
- README gives the dev server boot command and the rubric
- Grading run uses your own clean scenario.md, not the candidate's edited copy
- Candidate diff of scenario.md is the first thing you inspect
- Video recording is archived with the submission for future appeals
Want this built into your hiring loop?
A 20-minute call walks through your existing take-home, auto-drafts a plan against your app, and shows you the grading flow end to end.
Book a call →

Frequently asked questions
What exactly is a 'dev hiring test suite exercise'?
It is a take-home where the interviewer hands the candidate a small web application and a test suite that includes a handful of intentionally broken Cases. The candidate has to decide, for each red test, whether the app is broken, the test is broken, or the environment is misconfigured, fix the right thing, and hand back a green run. The exercise is popular because it mirrors a real day at the office: you inherit a repo, something is on fire, and the first skill is not coding, it is triage. The mistake most hiring managers make is hand-crafting the broken test suite from scratch, which takes half a day per role and produces an exercise nobody else on the team can reproduce or grade consistently. This page is about flipping that workflow: auto-draft the canonical plan from your running app, break a few Cases on purpose, hand every candidate the exact same artifact.
How does assrt_plan generate the canonical test plan, step by step?
The handler lives at /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 768-862. It launches a local Playwright MCP browser, navigates to the URL you provide, then takes three sequential screenshots at scroll offsets 0, 800, and 1600 pixels. Between each scroll it pulls an accessibility-tree snapshot of the visible viewport. All three snapshot texts are concatenated and sliced to the first 8000 characters, then sent along with the three JPEG screenshots to claude-haiku-4-5-20251001 with max_tokens set to 4096. The system prompt at lines 219-236 forces the output into the #Case format with six numbered rules, including 'Each case must be SELF-CONTAINED' and 'Keep cases SHORT — 3-5 actions max per case'. The response is 5 to 8 cases you can review, prune, and ship as your canonical plan.
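The truncation step is simple enough to sketch. Assuming the three accessibility snapshots arrive as strings (the function name and separator are hypothetical, not lifted from the handler), the model input is just concatenation plus a hard slice:

```typescript
// Concatenate the per-scroll accessibility-tree snapshots and cap the
// combined text at 8,000 characters before it is sent to the model
// alongside the three JPEG screenshots.
const SNAPSHOT_CHAR_CAP = 8000;

export function buildSnapshotText(snapshots: string[]): string {
  return snapshots.join("\n\n").slice(0, SNAPSHOT_CHAR_CAP);
}
```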
Why would I generate the plan instead of writing it by hand?
Three reasons. Speed: one tool call against a running URL takes about ten seconds and produces a draft plan that covers the flows a candidate would actually exercise. Consistency: the generator uses the same PLAN_SYSTEM_PROMPT for every role, so two interviewers reviewing candidates for the same app start from the same canonical plan instead of their own idiosyncratic tests. Coverage honesty: it surfaces gaps. If the plan that pops out does not cover a flow you care about, the signal is that your UI does not expose that flow at all or exposes it badly, which is useful feedback independent of the hiring exercise. None of this is magic; the generator is just a forcing function that makes you actually look at what the app affords before grading someone on it.
How do I deliberately break a Case so a candidate has to fix it?
Three seeded-failure patterns work best because they map cleanly to the three triage buckets. For 'app bug', leave the Case intact and introduce a one-line regression in the app (e.g., drop a column from a SELECT, forget to await a Promise, flip a comparison operator). For 'flawed test', keep the app correct but mutate the Case to assert stale text, a deleted selector, or a timing that stopped holding after a recent refactor. For 'environment issue', keep both app and test correct but misconfigure something out-of-band: a missing env var, a seed script that no longer runs, a port the dev server does not bind to. Ship a README that tells the candidate 'three of the Cases in /tmp/assrt/scenario.md are red; fix the right thing in each bucket, submit a writeup'.
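For the 'app bug' bucket, the flipped-comparison pattern is the cheapest to seed. A hypothetical one-character regression — the function and the pricing rule are invented for illustration, not from assrt:

```typescript
// Correct rule: orders of $100.00 or more qualify for free shipping.
// Seeded bug: ">=" flipped to ">", so an order of exactly $100 no longer
// qualifies -- a one-character change that reliably turns one Case red
// while every other flow stays green.
export function qualifiesForFreeShipping(totalCents: number): boolean {
  return totalCents > 100_00; // correct version: totalCents >= 100_00
}
```

Boundary bugs like this are good seeds precisely because the candidate has to read the spec (or the Case assertion) to decide which side is wrong.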
What makes the exercise reproducible across candidates?
The entire thing is three files on disk: scenario.md (plan text), scenario.json (metadata like name and URL), and optionally a snapshot of /tmp/assrt/results/latest.json from a baseline green run. This layout is fixed at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20. Because scenario.md is plain Markdown with #Case blocks, you can check it into the interview repo next to the app source. Every candidate clones the same repo, runs the same dev server, and executes `npx assrt-mcp` against the same URL with the same plan. There is no dashboard to provision, no vendor account, no seat to allocate. If a candidate submits a fix, you copy their branch, run the same command, and the Pass/Fail outcome is a video recording plus an events log written under /tmp/assrt/<runId>/ rather than a subjective diff review.
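The on-disk layout, sketched — the baseline file name here is an assumption, since the source only says "a snapshot of /tmp/assrt/results/latest.json":

```markdown
interview-repo/
├── scenario.md      # plan text: the #Case blocks
├── scenario.json    # metadata: scenario name and target URL
└── baseline.json    # optional copy of /tmp/assrt/results/latest.json
```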
How do I grade a submission without reading the diff?
Once the candidate's branch is cloned, you run a single assrt_test call against the same URL with the same scenario.md. The runner writes a video (recording.webm) plus per-step screenshots plus an events.json and an execution.log into /tmp/assrt/<runId>/. A passing submission produces a green report with three Cases listed as 'passed: true' and evidence strings showing the exact accessibility refs and text content that satisfied each assertion. The grading signal is: (a) did the plan go green end-to-end, (b) did the candidate weaken any assertion to fake a green (you diff scenario.md before vs after), and (c) did they add a regression Case after each app-bug fix. You can read the video at 5x speed via the auto-opened player and make a decision in under three minutes per candidate.
Does this require AI? Can I still grade it if the candidate does not use an LLM?
Yes on both counts. The candidate only needs Playwright MCP and Node.js to execute the plan; the LLM is only involved when they invoke assrt_diagnose or assrt_plan themselves. They can read the #Case blocks by eye, fix the code, and the runner will still execute the plan deterministically against real Chromium. Your side of the grading pipeline (assrt_test with a fixed scenario.md) does not call any LLM either; it is a scripted browser walkthrough with assertion bookkeeping. That matters if the hiring rules at your company forbid candidates from using AI assistants in take-homes. The exercise format survives that rule; only the candidate's drafting tools change.
How much does this cost to run compared to a closed vendor QA platform?
Assrt is open source and self-hosted. There is no per-seat pricing, no per-run fee, no dashboard subscription. Your only out-of-pocket cost is LLM tokens if you use assrt_plan or assrt_diagnose, and plan generation with Haiku 4.5 at max_tokens 4096 typically comes in under a cent per call. Comparable closed vendor platforms in this space start around $7.5K per month for seat-based access to hosted test runners and proprietary test formats. For a hiring pipeline that evaluates ten candidates a month on a take-home, the realistic cost difference is roughly $7,500 versus a few dollars of Anthropic billing.
What happens if a candidate edits scenario.md while the test runs?
The watcher at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 90-111 picks up edits, debounces for one second, and syncs the new plan text back to Firestore under the same scenario UUID. For a hiring exercise you want to disable this: set ASSRT_NO_SAVE=1 in the candidate's environment so local edits stay local and do not mutate any shared record. You can also gate on the fact that scenario IDs starting with 'local-' skip the watcher entirely. Either way, when you re-grade the candidate's branch you run against your own clean copy of scenario.md, not theirs, so a candidate cannot slip an altered plan into the grading run without you noticing the diff.
What is the fastest way to ship my first version of this exercise?
Boot the app you want candidates to work against. Run `npx assrt-mcp` and invoke assrt_plan with the dev server URL. Copy the 5 to 8 Case blocks from the JSON response into a file called scenario.md under the repo root. Pick two Cases and deliberately regress the underlying app code. Pick one Case and mutate its assertion to something stale. Commit, push, and write a one-paragraph README pointing the candidate at scenario.md and giving them the dev server boot command. End to end, this takes about fifteen minutes the first time and under five on repeat. Compared to hand-writing a broken-test exercise, that is a roughly tenfold reduction in prep time.