The broken test suite interview is a triage test, not a debugging test
You open the take-home. The README says "fix the failing tests, we'll schedule a follow-up to discuss your approach." Four specs are red. The clock is running. If you go straight into the source and start patching, you are already losing. The grade is not the diff. The grade is how fast you decide, for each red test, whether it is an app bug, a flawed test, or an environment issue, and whether the fix you ship respects that distinction. Assrt happens to ship a tool that produces exactly that four-part verdict. This guide walks you through it.
What the top search results for this keyword miss
Search the phrase and you get two kinds of articles. The first is the essay genre: "technical interviewing is broken, here's why." Thoughtful, true, useless when you have ninety minutes to fix a repo. The second is the generic take-home guide: read the README, match the rubric, keep it clean. Also true. Also useless. Neither talks about the specific task the interviewer set you, which is to classify each failing test and produce a fix that respects the classification. That is the uncopyable part of the take-home. That is what this page is about.
Changing `toBe("Ada Lovelace")` to `toBeTruthy()` is the tell. It takes one read of the diff to spot. Do not do it.
After fixing the app, add a second #Case that locks in the regression. It shows you internalized what the original red test was actually protecting.
The triage is encoded in a system prompt you can read
Most tools that claim to help with broken tests are vibes in a trench coat. Assrt's diagnose tool is a named, inspectable system prompt that lives in the repo. Here it is, verbatim. Read it. This is the contract that produces the four-section output every time.
The top of the prompt forces a choice: app bug, flawed test, or environment issue. The bottom enforces a Markdown skeleton the model has to fill in. No free-form prose. No "it depends." The handler then invokes it with a specific, boring model call:
“Every failing test falls into exactly one: app bug, flawed test, or environment issue. Say which one, cite the evidence, then fix.”
DIAGNOSE_SYSTEM_PROMPT, server.ts:240-268
How a failing spec becomes a corrected #Case
You feed the tool three inputs: a URL, the failing scenario, and the stderr. Haiku runs with the triage prompt and returns the four-section output, dominated by the corrected #Case block you can paste back into scenario.md and re-run.
Three inputs, one triage verdict
The three buckets, plus three landmines
The interviewer's rubric almost certainly has these categories, even if they do not use these words. Get the classification right and half the writeup writes itself. Miss it and the best diff in the world reads as lucky.
App bug
The test is right; the code under test is wrong. The diagnose output lands in Recommended Fix with a file path and a one-line change. Example: a route forgets to project a column, a reducer returns the old state, a form submits with the wrong HTTP verb. You keep the test; you fix the source. This is the bucket that makes interviewers smile, because the test suite was telling the truth and you listened.
Flawed test
The app is right; the test is wrong. Brittle selectors, hard-coded timings, assertions against old copy. The Recommended Fix is a corrected #Case block you paste into scenario.md. Don't weaken the assertion; rewrite it.
Environment issue
Both sides are right; the surrounding state is wrong. Missing env var, stale seed data, clock skew, port collision, fixture not loaded. Recommended Fix points at the config or setup step, not the code or the test.
The thing you must not do
Weaken the assertion to make the red test green. It is the single most common mistake and the single most common reject signal. If `expect(name).toBe('Ada Lovelace')` is red, do not change it to `toBeTruthy()`. Diagnose first, fix the real thing, leave the strict assertion alone.
The tests you should not touch
Some seeded failures are red on purpose and belong in your writeup rather than your diff. If a test asserts a behavior the spec does not require, explain it in the writeup. Interviewers read restraint as judgment.
The one-line test you should add
After every fix in the app bug bucket, add a second #Case that covers the specific regression. The original red test proves the bug existed; the new one proves it cannot come back. This move is almost always a green flag.
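What might that second #Case look like? The step syntax below is a sketch, not the canonical format (mirror whatever your scenario.md already uses), but the shape is the point: navigate, wait for something concrete, then assert the exact string the bug used to break.

```markdown
#Case: profile name survives a reload (regression guard)
1. Navigate to /profile
2. Wait for the h1 heading to be visible
3. Assert the h1 text is exactly "Ada Lovelace"
4. Reload the page
5. Assert the h1 text is still exactly "Ada Lovelace"
```

Note that the assertion stays strict. The regression case re-exercises the exact behavior the original red test was protecting, plus the one extra step (the reload) that the bug depended on.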
The diff the interviewer actually reads
Below is what a panicking candidate commits in the first twenty minutes: a timeout, a softened assertion, a fake green. Contrast that with the corrected #Case the diagnose tool produces, which preserves the original assertion and calls out the actual app fix as a one-line change.
Softened assertions are the single most common reject signal
// What most candidates do in the first 20 minutes.
// Bang on the red test. Assume the app is fine. Commit false greens.
test("profile page shows user name", async ({ page }) => {
await page.goto("/profile");
// Test is red. Candidate's diff:
await page.waitForTimeout(3000); // hope it's just slow
const name = await page.textContent("h1");
// Assertion was: expect(name).toBe("Ada Lovelace")
// Candidate softens it to make the test green:
expect(name).toBeTruthy(); // now "passes" against anything
});
// Interviewer reads the diff. Candidate weakened the test to pass.
// Instant reject. The app bug was never found.

One invocation. The writeup is already half-drafted.
Here is the whole flow against a realistic red test. You paste the failing scenario and the error into assrt_diagnose and the four sections come back in under ten seconds. Read the Root Cause. Sanity check it against the code. Apply the fix.
Now you paste the Corrected Test Scenario into scenario.md and verify it against the fixed app in real Chromium. The green plus the video live under /tmp/assrt/<runId>/, which you can link directly in your submission.
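If you want to script the "read the Root Cause" step across four red tests, the four section names are fixed by the tool's contract, so a few lines of JavaScript can split each response into fields. A minimal sketch; the heading markers ("## ") and colon handling are assumptions, so adjust to whatever your diagnose output actually emits.

```javascript
// Split a diagnose response into its four fixed sections.
// The section names come from the tool's contract; the surrounding
// formatting (Markdown headings, colons) is an assumption.
const SECTIONS = ["Root Cause", "Analysis", "Recommended Fix", "Corrected Test Scenario"];

function parseDiagnosis(text) {
  const out = {};
  for (let i = 0; i < SECTIONS.length; i++) {
    const start = text.indexOf(SECTIONS[i]);
    if (start === -1) continue; // section missing: leave it out
    const nextIdx = i + 1 < SECTIONS.length ? text.indexOf(SECTIONS[i + 1]) : -1;
    const end = nextIdx === -1 ? text.length : nextIdx;
    out[SECTIONS[i]] = text
      .slice(start + SECTIONS[i].length, end)
      .replace(/^[:\s]+/, "")   // drop the ": " or newline after the name
      .replace(/[\s#]+$/, "");  // drop the next section's "## " remnant
  }
  return out;
}
```

Four red tests means four calls and four parsed objects; the memo practically assembles itself from `out["Root Cause"]` and `out["Recommended Fix"]`.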
A 30-minute plan for a 4-red-test take-home
This is the order of operations that wins the grade when tests are red and the clock is running. Four phases, each with a hard-edged deliverable. Do not skip the first phase; reading source before you have counted the failures is the classic trap.
Clone → diagnose → apply → memo
1. Clone and boot (5 min)
Clone the interview repo, install, run the dev server, run the test suite once. Count the red tests and copy their names into a notepad. Do not open any source file yet.
2. Diagnose one failure (5 min each)
For each red test, paste its scenario and error into assrt_diagnose. Read the Root Cause. Sanity-check the bucket against what you observe in the code. If the bucket is wrong, override; the tool drafts, you judge.
3. Apply and re-run (10 min total)
Apply each Recommended Fix. For flawed-test buckets, paste the Corrected Test Scenario into scenario.md and re-run assrt_test. Every green is now backed by a video recording you can link in the writeup.
4. Write the four-section memo (5 min)
Mirror the diagnose output in your submission: per-test Root Cause, one-sentence Analysis, file:line Fix, corrected #Case. Interviewers read this first. The candidate with the cleanest memo usually wins, even against a marginally better diff.
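A skeleton for that memo, mirroring the four diagnose sections per test. Every test name, path, and commit hash here is a placeholder; fill them from your own diff.

```markdown
## Triage memo

### spec: "profile page shows user name"
- Root Cause: app bug. The /profile route drops the name column.
- Analysis: one sentence citing the failing assertion and the code path.
- Fix: src/routes/profile.ts:42, one-line projection change (commit abc123).
- Corrected #Case: none needed; original test kept, regression #Case added.

### spec: "checkout shows retry banner"
- Root Cause: flawed test. Asserts copy that shipped two releases ago.
- Analysis: one sentence naming the stale string and where the new copy lives.
- Fix: corrected #Case pasted into scenario.md and re-run green.
- Corrected #Case: (paste the block from the diagnose output)
```

One heading per red test, four bullets each. An interviewer can skim it in thirty seconds, which is exactly the point.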
Submission checklist
Run through this before you click submit. Each item is calibrated to the real reasons interviewers reject broken-test take-homes. A passing grade rarely comes from a better diff; it comes from not tripping any of these wires.
Eight wires not to trip on a broken-test-suite take-home
- You ran the suite once before opening any source file
- Every red test is labeled as app-bug, flawed-test, or environment
- You did not weaken a single assertion to make something green
- Each app-bug fix has a new regression #Case added alongside
- Each flawed-test fix is a full rewrite, not a softened assertion
- Tests you chose not to fix are explained in the writeup
- Your submission has a four-section memo at the top
- Every green test in the submission has a video in /tmp/assrt/<runId>/
Diagnose vs a bare stack trace, line by line
A stack trace tells you where the test blew up. The diagnose tool tells you which of the three things is actually broken and drafts the fix. Both are real artifacts; only one is a writeup scaffold. On an interview clock, that difference is the whole game.
| Feature | Stack trace + guessing | Assrt diagnose |
|---|---|---|
| Verdict format | Stack trace plus your guess | Root Cause, Analysis, Recommended Fix, Corrected #Case |
| Output of a 'flawed test' | Prose suggestion or a diff fragment | Drop-in #Case block you paste into scenario.md |
| Bucket clarity | Implicit; you infer from the trace | Explicit app-bug / flawed-test / environment-issue fork |
| Re-run the fix | Manual edit, rerun, hope | assrt_test against the corrected plan, video proof in seconds |
| Writeup scaffold | None; you build from scratch | Four sections already drafted, edit for tone |
| Vendor setup inside a timed take-home | Account, seat, project, SSO | npx assrt-mcp, one command, zero signup |
| Tests you submit | Proprietary YAML or vendor dashboard | Plain Markdown #Case, lives in the PR |
| Cost for a 2-hour interview | Usually free tier, otherwise per-seat | $0 + a few cents of Haiku tokens |
Bring a take-home, leave with a four-section writeup
Thirty minutes. You share a broken-test repo (real or anonymized). We run assrt_diagnose against the red specs live, draft the memo, and show you the corrected #Case that ships.
Book a call →
FAQ on the broken test suite dev interview
What is a 'broken test suite' take-home interview actually testing?
It looks like a debugging task on the surface. It is really a communication and triage task. The interviewer seeds the repo with two or three failing tests and watches how you decide whether the app has a real bug, the test is written against stale behavior, or the environment (ports, env vars, fixtures) is misconfigured. If you spend 90 minutes fixing tests that were always wrong, you lose. The people who get hired tend to finish in under an hour with a short writeup that names the category of each failure in one sentence, links the commit that fixes it, and (crucially) leaves the untouched red tests explained rather than silently deleted. Assrt's diagnose tool formalizes that triage into four sections the interviewer can skim in 30 seconds.
How does `assrt_diagnose` decide between an app bug and a flawed test?
The decision is driven by the system prompt at /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 240-268. It is not heuristic; it is a structural forcing function. The prompt tells the model its first job is to pick one of three buckets: bug in the application, flawed test, or environment issue. Then it demands four output sections (Root Cause, Analysis, Recommended Fix, Corrected Test Scenario) in that order. The Corrected Test Scenario section has to use the exact #Case format, so you get a drop-in replacement test rather than a prose suggestion. Any evidence you paste in (the failing assertion, the page HTML, the stderr line) gets cited back to you in the Analysis section. That grounds the verdict in what actually happened.
Can I use this during a live interview?
If the take-home rules allow AI assistants (most do as of 2026), yes. `npx assrt-mcp` runs on your laptop. You point it at the interview repo's dev server, paste the failing scenario and its stderr into `assrt_diagnose`, and read the four-section response. Your job is still to read the verdict critically: does the Root Cause match what you see in the codebase? Does the Corrected Test Scenario exercise the actual behavior the feature should have? The tool is not a replacement for judgment; it is a forcing function that makes your triage writeup look exactly like a senior engineer's would. If the interview explicitly forbids AI, use the same four-section structure by hand; the tool is really teaching you a template.
Why does the diagnose output include a full #Case block instead of a patch?
Because the #Case format is executable. You can paste it into scenario.md under /tmp/assrt, rerun `assrt_test`, and watch the corrected test go green against the same URL, in the same real Chromium process, in under a minute. A patch tells the interviewer what you think the fix is. An executable #Case block tells them you ran the fix. That is the difference between a candidate who says 'this probably works' and a candidate who ships. The #Case format also survives any vendor switch — it is just plain Markdown with numbered steps, which means the tests you write during the interview are portable if the company later decides to migrate off Playwright.
What file does Assrt save the corrected test plan to, and how do I run it?
The plan text lives at /tmp/assrt/scenario.md, the metadata (UUID, URL, name) at /tmp/assrt/scenario.json, and the latest run at /tmp/assrt/results/latest.json. This layout is defined at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20. To run, you call `assrt_test` with the URL and either the plan text inline or a scenarioId from a previous run. The runner spawns @playwright/mcp, drives a real Chromium process, and writes a video plus screenshots plus events.json into /tmp/assrt/<runId>/. For an interview, the flow is: diagnose the failure, write the corrected #Case into scenario.md, run assrt_test, screenshot the green result, push the branch, submit.
What model does the diagnose tool run on, and how big is the context window?
It calls claude-haiku-4-5-20251001 with max_tokens set to 4096, per the handler at server.ts lines 896-901. Haiku is chosen because the diagnose task is bounded: you hand it a single failing scenario, a single error message, and a single URL. Most interview-style failures fit comfortably inside the 200k input window, and the 4096-token output budget is calibrated to produce the four required sections without rambling. If you need a longer or more exploratory analysis (for example, untangling a cascade of ten failing tests), run `assrt_diagnose` once per test rather than trying to stuff the whole suite into one call. The tool is designed for one-verdict-at-a-time.
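For illustration, here is what those cited parameters look like assembled into an Anthropic-style messages request. Only the model name and max_tokens come from the handler; the system-prompt variable and the message formatting are stand-ins, not the actual server.ts code.

```javascript
// Sketch of the diagnose request parameters cited above.
// Model and max_tokens are from the handler; the rest is a stand-in.
function buildDiagnoseRequest(systemPrompt, scenario, stderr, url) {
  return {
    model: "claude-haiku-4-5-20251001",
    max_tokens: 4096, // budget calibrated to the four required sections
    system: systemPrompt, // the triage forcing function
    messages: [
      {
        role: "user",
        content: `URL: ${url}\n\nFailing scenario:\n${scenario}\n\nstderr:\n${stderr}`,
      },
    ],
  };
}
```

The bounded shape is the point: one scenario, one error, one URL per call, which is why Haiku is enough.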
What distinguishes Assrt from a closed vendor QA tool for this interview use case?
Three things that matter when a clock is running. First, there is no dashboard to sign into, no seat to provision, no approval from a security team. You `npx assrt-mcp` and go. Second, the tests are plain Markdown in your local /tmp/assrt (or checked into the interview repo), so when you submit you can include them in the PR without any export-and-clean step. Third, there is no vendor lock-in: if the company uses Playwright already, the runner is already Playwright, so the tests you draft are portable. Assrt is also open source and self-hosted at $0 beyond LLM tokens, compared to closed competitors in the $7.5K/month range. For a two-hour take-home, the cost-per-use is dominated by the Anthropic token bill, which is usually under a dollar.
If the test is red because of a race condition I cannot reproduce locally, what does diagnose do?
It usually puts that into the 'environment issue' bucket in Root Cause and writes an Analysis section that names the specific timing assumption in the test. The Recommended Fix tends to be either a snapshot-first re-check (which is how Assrt agents are supposed to handle dynamic state per the system prompt in agent.ts) or a suggestion to wait for a concrete page signal rather than a sleep. The Corrected Test Scenario block then replaces the flaky wait with a `wait_for` on a specific ref from the accessibility tree. Flaky tests are the single most common category of 'broken' in take-home repos because they are easy for the interviewer to seed and they filter for candidates who recognize timing bugs rather than re-running until green.
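In #Case terms, the rewrite might look like this. The step wording and the `wait_for` spelling are assumptions; mirror what your corrected scenario actually emits.

```markdown
<!-- Before: flaky. A fixed sleep papers over the race. -->
#Case: dashboard shows the order count
1. Navigate to /dashboard
2. Wait 3 seconds
3. Assert the counter reads "3 orders"

<!-- After: deterministic. Wait for a concrete page signal. -->
#Case: dashboard shows the order count
1. Navigate to /dashboard
2. wait_for: the "Orders" heading is visible in the accessibility tree
3. Assert the counter reads "3 orders"
```

The assertion is untouched; only the timing assumption changes, which is exactly the restraint the interviewer is grading.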
Do I have to share my code with a third party to use this?
No. Assrt MCP runs entirely on your machine. The only outbound network calls are to the LLM provider (Anthropic for diagnose, optional Gemini for video analysis) and the target URL you are testing. No code upload, no project creation, no vendor SaaS. This matters for interview repos that ship with proprietary or contractual restrictions; you can point the runner at localhost:3000 and the interview source stays on your laptop. If you want to go even further, you can swap the LLM for a local model via the ANTHROPIC_BASE_URL env var and keep the whole loop on-device.
What should I submit at the end of a broken-test-suite take-home?
The fix itself, obviously. Alongside it, submit a short writeup with four sections that mirror the diagnose output: what each failing test was testing, which bucket each failure fell into (app bug, flawed test, environment), the one-line fix, and the one corrected test you rewrote rather than patched. If there are tests you chose not to fix because you believed they were testing the wrong invariant, say so explicitly. Interviewers read this writeup first and the diff second. A candidate who hands over a green test run plus a writeup like that almost always advances, regardless of whether they used a tool to draft the triage. The writeup is what the interviewer is actually grading.