The broken test suite interview is a triage test, not a debugging test
You open the take-home. The README says "fix the failing tests, we'll schedule a follow-up to discuss your approach." Four specs are red. The clock is running. If you go straight into the source and start patching, you are already losing. The grade is not the diff. The grade is how fast you decide, for each red test, whether it is an app bug, a flawed test, or an environment issue, and whether the fix you ship respects that distinction. Assrt happens to ship a tool that produces exactly that four-part verdict. This guide walks you through it.
What the top search results for this keyword miss
Search the phrase and you get two kinds of articles. The first is the essay genre: "technical interviewing is broken, here's why." Thoughtful, true, useless when you have ninety minutes to fix a repo. The second is the generic take-home guide: read the README, match the rubric, keep it clean. Also true. Also useless. Neither talks about the specific task the interviewer set you, which is to classify each failing test and produce a fix that respects the classification. That is the uncopyable part of the take-home. That is what this page is about.
Changing `toBe("Ada Lovelace")` to `toBeTruthy()` is the tell. It takes one read of the diff to spot. Do not do it.
After fixing the app, add a second #Case that locks in the regression. It shows you internalized what the original red test was actually protecting.
The triage is encoded in a system prompt you can read
Most tools that claim to help with broken tests are vibes in a trench coat. Assrt's diagnose tool is a named, inspectable system prompt that lives in the repo. Here it is, verbatim. Read it. This is the contract that produces the four-section output every time.
The top of the prompt forces a choice: app bug, flawed test, or environment issue. The bottom enforces a Markdown skeleton the model has to fill in. No free-form prose. No "it depends." The handler then invokes it with a specific, boring model call:
“Every failing test falls into exactly one: app bug, flawed test, or environment issue. Say which one, cite the evidence, then fix.”
DIAGNOSE_SYSTEM_PROMPT, server.ts:240-268
How a failing spec becomes a corrected #Case
You feed the tool three inputs: a URL, the failing scenario, and the stderr. Haiku runs with the triage prompt and returns the four-section output, dominated by the corrected #Case block you can paste back into scenario.md and re-run.
Three inputs, one triage verdict
The three buckets, plus three landmines
The interviewer's rubric almost certainly has these categories, even if they do not use these words. Get the classification right and half the writeup writes itself. Miss it and the best diff in the world reads as lucky.
App bug
The test is right; the code under test is wrong. The diagnose output lands in Recommended Fix with a file path and a one-line change. Example: a route forgets to project a column, a reducer returns the old state, a form submits with the wrong HTTP verb. You keep the test; you fix the source. This is the bucket that makes interviewers smile, because the test suite was telling the truth and you listened.
Flawed test
The app is right; the test is wrong. Brittle selectors, hard-coded timings, assertions against old copy. The Recommended Fix is a corrected #Case block you paste into scenario.md. Don't weaken the assertion; rewrite it.
Environment issue
Both sides are right; the surrounding state is wrong. Missing env var, stale seed data, clock skew, port collision, fixture not loaded. Recommended Fix points at the config or setup step, not the code or the test.
The thing you must not do
Weaken the assertion to make the red test green. It is the single most common mistake and the single most common reject signal. If `expect(name).toBe('Ada Lovelace')` is red, do not change it to `toBeTruthy()`. Diagnose first, fix the real thing, leave the strict assertion alone.
The tests you should not touch
Some seeded failures are red on purpose and belong in your writeup rather than your diff. If a test asserts a behavior the spec does not require, explain it in the writeup. Interviewers read restraint as judgment.
The one-line test you should add
After every fix in the app bug bucket, add a second #Case that covers the specific regression. The original red test proves the bug existed; the new one proves it cannot come back. This move is almost always a green flag.
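What might that second #Case look like? The step syntax below is a sketch, not the canonical format (mirror whatever your scenario.md already uses), but the shape is the point: navigate, wait for something concrete, then assert the exact string the bug used to break.

```markdown
#Case: profile name survives a reload (regression guard)
1. Navigate to /profile
2. Wait for the h1 heading to be visible
3. Assert the h1 text is exactly "Ada Lovelace"
4. Reload the page
5. Assert the h1 text is still exactly "Ada Lovelace"
```

Note that the assertion stays strict. The regression case re-exercises the exact behavior the original red test was protecting, plus the one extra step (the reload) that the bug depended on.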
The diff the interviewer actually reads
Below is what a panicking candidate commits in the first twenty minutes: a timeout, a softened assertion, a fake green. Contrast that with the corrected #Case the diagnose tool produces, which preserves the original assertion and calls out the actual app fix as a one-line change.
Softened assertions are the single most common reject signal
// What most candidates do in the first 20 minutes.
// Bang on the red test. Assume the app is fine. Commit false greens.
test("profile page shows user name", async ({ page }) => {
await page.goto("/profile");
// Test is red. Candidate's diff:
await page.waitForTimeout(3000); // hope it's just slow
const name = await page.textContent("h1");
// Assertion was: expect(name).toBe("Ada Lovelace")
// Candidate softens it to make the test green:
expect(name).toBeTruthy(); // now "passes" against anything
});
// Interviewer reads the diff. Candidate weakened the test to pass.
// Instant reject. The app bug was never found.

One invocation. The writeup is already half-drafted.
Here is the whole flow against a realistic red test. You paste the failing scenario and the error into assrt_diagnose and the four sections come back in under ten seconds. Read the Root Cause. Sanity check it against the code. Apply the fix.
Now you paste the Corrected Test Scenario into scenario.md and verify it against the fixed app in real Chromium. The green plus the video live under /tmp/assrt/<runId>/, which you can link directly in your submission.
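If you want to script the "read the Root Cause" step across four red tests, the four section names are fixed by the tool's contract, so a few lines of JavaScript can split each response into fields. A minimal sketch; the heading markers ("## ") and colon handling are assumptions, so adjust to whatever your diagnose output actually emits.

```javascript
// Split a diagnose response into its four fixed sections.
// The section names come from the tool's contract; the surrounding
// formatting (Markdown headings, colons) is an assumption.
const SECTIONS = ["Root Cause", "Analysis", "Recommended Fix", "Corrected Test Scenario"];

function parseDiagnosis(text) {
  const out = {};
  for (let i = 0; i < SECTIONS.length; i++) {
    const start = text.indexOf(SECTIONS[i]);
    if (start === -1) continue; // section missing: leave it out
    const nextIdx = i + 1 < SECTIONS.length ? text.indexOf(SECTIONS[i + 1]) : -1;
    const end = nextIdx === -1 ? text.length : nextIdx;
    out[SECTIONS[i]] = text
      .slice(start + SECTIONS[i].length, end)
      .replace(/^[:\s]+/, "")   // drop the ": " or newline after the name
      .replace(/[\s#]+$/, "");  // drop the next section's "## " remnant
  }
  return out;
}
```

Four red tests means four calls and four parsed objects; the memo practically assembles itself from `out["Root Cause"]` and `out["Recommended Fix"]`.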
A 30-minute plan for a 4-red-test take-home
This is the order of operations that wins the grade when tests are red and the clock is running. Four phases, each with a hard-edged deliverable. Do not skip the first phase; reading source before you have counted the failures is the classic trap.
Clone → diagnose → apply → memo
1. Clone and boot (5 min)
Clone the interview repo, install, run the dev server, run the test suite once. Count the red tests and copy their names into a notepad. Do not open any source file yet.
2. Diagnose one failure (5 min each)
For each red test, paste its scenario and error into assrt_diagnose. Read the Root Cause. Sanity-check the bucket against what you observe in the code. If the bucket is wrong, override; the tool drafts, you judge.
3. Apply and re-run (10 min total)
Apply each Recommended Fix. For flawed-test buckets, paste the Corrected Test Scenario into scenario.md and re-run assrt_test. Every green is now backed by a video recording you can link in the writeup.
4. Write the four-section memo (5 min)
Mirror the diagnose output in your submission: per-test Root Cause, one-sentence Analysis, file:line Fix, corrected #Case. Interviewers read this first. The candidate with the cleanest memo usually wins, even against a marginally better diff.
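A skeleton for that memo, mirroring the four diagnose sections per test. Every test name, path, and commit hash here is a placeholder; fill them from your own diff.

```markdown
## Triage memo

### spec: "profile page shows user name"
- Root Cause: app bug. The /profile route drops the name column.
- Analysis: one sentence citing the failing assertion and the code path.
- Fix: src/routes/profile.ts:42, one-line projection change (commit abc123).
- Corrected #Case: none needed; original test kept, regression #Case added.

### spec: "checkout shows retry banner"
- Root Cause: flawed test. Asserts copy that shipped two releases ago.
- Analysis: one sentence naming the stale string and where the new copy lives.
- Fix: corrected #Case pasted into scenario.md and re-run green.
- Corrected #Case: (paste the block from the diagnose output)
```

One heading per red test, four bullets each. An interviewer can skim it in thirty seconds, which is exactly the point.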
Submission checklist
Run through this before you click submit. Each item is calibrated to the real reasons interviewers reject broken-test take-homes. A passing grade rarely comes from a better diff; it comes from not tripping any of these wires.
Eight wires not to trip on a broken-test-suite take-home
- You ran the suite once before opening any source file
- Every red test is labeled as app-bug, flawed-test, or environment
- You did not weaken a single assertion to make something green
- Each app-bug fix has a new regression #Case added alongside
- Each flawed-test fix is a full rewrite, not a softened assertion
- Tests you chose not to fix are explained in the writeup
- Your submission has a four-section memo at the top
- Every green test in the submission has a video in /tmp/assrt/<runId>/
Diagnose vs a bare stack trace, line by line
A stack trace tells you where the test blew up. The diagnose tool tells you which of the three things is actually broken and drafts the fix. Both are real artifacts; only one is a writeup scaffold. On an interview clock, that difference is the whole game.
| Feature | Stack trace + guessing | Assrt diagnose |
|---|---|---|
| Verdict format | Stack trace plus your guess | Root Cause, Analysis, Recommended Fix, Corrected #Case |
| Output of a 'flawed test' | Prose suggestion or a diff fragment | Drop-in #Case block you paste into scenario.md |
| Bucket clarity | Implicit; you infer from the trace | Explicit app-bug / flawed-test / environment-issue fork |
| Re-run the fix | Manual edit, rerun, hope | assrt_test against the corrected plan, video proof in seconds |
| Writeup scaffold | None; you build from scratch | Four sections already drafted, edit for tone |
| Vendor setup inside a timed take-home | Account, seat, project, SSO | npx assrt-mcp, one command, zero signup |
| Tests you submit | Proprietary YAML or vendor dashboard | Plain Markdown #Case, lives in the PR |
| Cost for a 2-hour interview | Usually free tier, otherwise per-seat | $0 + a few cents of Haiku tokens |
Bring a take-home, leave with a four-section writeup
Thirty minutes. You share a broken-test repo (real or anonymized). We run assrt_diagnose against the red specs live, draft the memo, and show you the corrected #Case that ships.
Book a call →
FAQ on the broken test suite dev interview
What is a 'broken test suite' take-home interview actually testing?
It looks like a debugging task on the surface. It is really a communication and triage task. The interviewer seeds the repo with two or three failing tests and watches how you decide whether the app has a real bug, the test is written against stale behavior, or the environment (ports, env vars, fixtures) is misconfigured. If you spend 90 minutes fixing tests that were always wrong, you lose. The people who get hired tend to finish in under an hour with a short writeup that names the category of each failure in one sentence, links the commit that fixes it, and (crucially) leaves the untouched red tests explained rather than silently deleted. Assrt's diagnose tool formalizes that triage into four sections the interviewer can skim in 30 seconds.
How does `assrt_diagnose` decide between an app bug and a flawed test?
The decision is driven by the system prompt at /Users/matthewdi/assrt-mcp/src/mcp/server.ts lines 240-268. It is not heuristic; it is a structural forcing function. The prompt tells the model its first job is to pick one of three buckets: bug in the application, flawed test, or environment issue. Then it demands four output sections (Root Cause, Analysis, Recommended Fix, Corrected Test Scenario) in that order. The Corrected Test Scenario section has to use the exact #Case format, so you get a drop-in replacement test rather than a prose suggestion. Any evidence you paste in (the failing assertion, the page HTML, the stderr line) gets cited back to you in the Analysis section. That grounds the verdict in what actually happened.
Can I use this during a live interview?
If the take-home rules allow AI assistants (most do as of 2026), yes. `npx assrt-mcp` runs on your laptop. You point it at the interview repo's dev server, paste the failing scenario and its stderr into `assrt_diagnose`, and read the four-section response. Your job is still to read the verdict critically: does the Root Cause match what you see in the codebase? Does the Corrected Test Scenario exercise the actual behavior the feature should have? The tool is not a replacement for judgment; it is a forcing function that makes your triage writeup look exactly like a senior engineer's would. If the interview explicitly forbids AI, use the same four-section structure by hand; the tool is really teaching you a template.
Why does the diagnose output include a full #Case block instead of a patch?
Because the #Case format is executable. You can paste it into scenario.md under /tmp/assrt, rerun `assrt_test`, and watch the corrected test go green against the same URL, in the same real Chromium process, in under a minute. A patch tells the interviewer what you think the fix is. An executable #Case block tells them you ran the fix. That is the difference between a candidate who says 'this probably works' and a candidate who ships. The #Case format also survives any vendor switch — it is just plain Markdown with numbered steps, which means the tests you write during the interview are portable if the company later decides to migrate off Playwright.
What file does Assrt save the corrected test plan to, and how do I run it?
The plan text lives at /tmp/assrt/scenario.md, the metadata (UUID, URL, name) at /tmp/assrt/scenario.json, and the latest run at /tmp/assrt/results/latest.json. This layout is defined at /Users/matthewdi/assrt-mcp/src/core/scenario-files.ts lines 16-20. To run, you call `assrt_test` with the URL and either the plan text inline or a scenarioId from a previous run. The runner spawns @playwright/mcp, drives a real Chromium process, and writes a video plus screenshots plus events.json into /tmp/assrt/<runId>/. For an interview, the flow is: diagnose the failure, write the corrected #Case into scenario.md, run assrt_test, screenshot the green result, push the branch, submit.
What model does the diagnose tool run on, and how big is the context window?
It calls claude-haiku-4-5-20251001 with max_tokens set to 4096, per the handler at server.ts lines 896-901. Haiku is chosen because the diagnose task is bounded: you hand it a single failing scenario, a single error message, and a single URL. Most interview-style failures fit comfortably inside the 200k input window, and the 4096-token output budget is calibrated to produce the four required sections without rambling. If you need a longer or more exploratory analysis (for example, untangling a cascade of ten failing tests), run `assrt_diagnose` once per test rather than trying to stuff the whole suite into one call. The tool is designed for one-verdict-at-a-time.
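For illustration, here is what those cited parameters look like assembled into an Anthropic-style messages request. Only the model name and max_tokens come from the handler; the system-prompt variable and the message formatting are stand-ins, not the actual server.ts code.

```javascript
// Sketch of the diagnose request parameters cited above.
// Model and max_tokens are from the handler; the rest is a stand-in.
function buildDiagnoseRequest(systemPrompt, scenario, stderr, url) {
  return {
    model: "claude-haiku-4-5-20251001",
    max_tokens: 4096, // budget calibrated to the four required sections
    system: systemPrompt, // the triage forcing function
    messages: [
      {
        role: "user",
        content: `URL: ${url}\n\nFailing scenario:\n${scenario}\n\nstderr:\n${stderr}`,
      },
    ],
  };
}
```

The bounded shape is the point: one scenario, one error, one URL per call, which is why Haiku is enough.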
What distinguishes Assrt from a closed vendor QA tool for this interview use case?
Three things that matter when a clock is running. First, there is no dashboard to sign into, no seat to provision, no approval from a security team. You `npx assrt-mcp` and go. Second, the tests are plain Markdown in your local /tmp/assrt (or checked into the interview repo), so when you submit you can include them in the PR without any export-and-clean step. Third, there is no vendor lock-in: if the company uses Playwright already, the runner is already Playwright, so the tests you draft are portable. Assrt is also open source and self-hosted at $0 beyond LLM tokens, compared to closed competitors in the $7.5K/month range. For a two-hour take-home, the cost-per-use is dominated by the Anthropic token bill, which is usually under a dollar.
If the test is red because of a race condition I cannot reproduce locally, what does diagnose do?
It usually puts that into the 'environment issue' bucket in Root Cause and writes an Analysis section that names the specific timing assumption in the test. The Recommended Fix tends to be either a snapshot-first re-check (which is how Assrt agents are supposed to handle dynamic state per the system prompt in agent.ts) or a suggestion to wait for a concrete page signal rather than a sleep. The Corrected Test Scenario block then replaces the flaky wait with a `wait_for` on a specific ref from the accessibility tree. Flaky tests are the single most common category of 'broken' in take-home repos because they are easy for the interviewer to seed and they filter for candidates who recognize timing bugs rather than re-running until green.
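In #Case terms, the rewrite might look like this. The step wording and the `wait_for` spelling are assumptions; mirror what your corrected scenario actually emits.

```markdown
<!-- Before: flaky. A fixed sleep papers over the race. -->
#Case: dashboard shows the order count
1. Navigate to /dashboard
2. Wait 3 seconds
3. Assert the counter reads "3 orders"

<!-- After: deterministic. Wait for a concrete page signal. -->
#Case: dashboard shows the order count
1. Navigate to /dashboard
2. wait_for: the "Orders" heading is visible in the accessibility tree
3. Assert the counter reads "3 orders"
```

The assertion is untouched; only the timing assumption changes, which is exactly the restraint the interviewer is grading.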
Do I have to share my code with a third party to use this?
No. Assrt MCP runs entirely on your machine. The only outbound network calls are to the LLM provider (Anthropic for diagnose, optional Gemini for video analysis) and the target URL you are testing. No code upload, no project creation, no vendor SaaS. This matters for interview repos that ship with proprietary or contractual restrictions; you can point the runner at localhost:3000 and the interview source stays on your laptop. If you want to go even further, you can swap the LLM for a local model via the ANTHROPIC_BASE_URL env var and keep the whole loop on-device.
What should I submit at the end of a broken-test-suite take-home?
The fix itself, obviously. Alongside it, submit a short writeup with four sections that mirror the diagnose output: what each failing test was testing, which bucket each failure fell into (app bug, flawed test, environment), the one-line fix, and the one corrected test you rewrote rather than patched. If there are tests you chose not to fix because you believed they were testing the wrong invariant, say so explicitly. Interviewers read this writeup first and the diff second. A candidate who hands over a green test run plus a writeup like that almost always advances, regardless of whether they used a tool to draft the triage. The writeup is what the interviewer is actually grading.