AI agents sometimes mark tasks [X] without doing the work. These phantom completions are rare (~0.36% in my data), but each one is a false claim you’ll either accept at face value or spend precious mental energy untangling. verify-tasks is a spec-kit community extension that runs a multilayer verification cascade against every [X] completion in your tasks.md and renders a verdict on whether the work was actually done.

In The [X] Problem, I documented phantom completions: tasks that AI agents mark [X] complete without doing the work. Across ~830 structured tasks spanning Claude Code /plan and spec-kit workflows, I found three phantom completions, about 0.36%. The preceding post introduced /verify-plan to catch this in Claude Code’s /plan workflow, but nothing equivalent existed for spec-kit’s task-based workflows, where hundreds of [X] marks in tasks.md go unchecked after /speckit.implement finishes.

/speckit.verify-tasks closes that gap.

Why phantom completions matter more than the rate suggests

A 0.36% rate sounds negligible, but a phantom completion is an AI agent making a false claim about work it didn’t do. The [X] mark carries the same confidence as every other completed task, with no hedge or partial-completion flag. When you see a tasks.md with 50 checked boxes, you focus your review energy on code quality and architecture. You don’t re-derive whether each task was performed, because the agent said it was.

This is cognitive dissonance at work: you believe you’re doing a thorough review, but you’re accepting the agent’s self-reported completions without verification. The easiest way to resolve that tension is to think about other things, which is what tends to happen.

How phantom completions happen

Within a single forward pass, marking a task complete and completing it are not distinct operations: both are just text the model generates. When the model has marked T001 through T024 as [X], the highest-probability continuation for T025 is another [X], regardless of what happened on the filesystem. The theory behind this is explored more thoroughly in The [X] Problem. But an agent can verify those claims mechanically after the fact, which is what verify-tasks does.

What does “verify a task” mean?

A task marked [X] is a claim. The question is what evidence would support or refute it. Tasks reference different things (files, functions, classes, method signatures, module wiring), so a single check isn’t enough.

A task can be phantom in several ways:

  • The referenced file doesn’t exist at all
  • The file exists but the class or function body is empty or stubbed
  • The symbol is declared but never imported or referenced elsewhere
  • The file exists but contains entirely different functions than what the task specified
  • A class exists but is missing a required method, so it can’t fulfill its described role

Each failure sits at a different level of evidence. File existence catches the first but misses the rest. Symbol matching catches the fourth and fifth but not the second or third: a declared stub and a never-imported symbol both look fine to a pattern search. And there’s no point checking whether a function is wired into the codebase if the file was never created.

The result is a layered verification cascade where each layer checks a deeper property. Mechanical checks run first, semantic assessment (an LLM reading the code) runs last.

The five-layer verification cascade

When you run /speckit.verify-tasks, each [X] task passes through five layers in sequence:


| Layer | What it checks | How |
| --- | --- | --- |
| 1. File existence | Does the referenced file exist on disk? | `test -f`, `find` |
| 2. Git diff presence | Was the file modified in the relevant commit range? | `git diff`, `git log` |
| 3. Content pattern matching | Does the file contain the declared symbols (functions, classes, methods)? | `grep -n` for expected patterns |
| 4. Dead-code detection | Is the symbol referenced by any other file beyond its definition site? | `grep -rn` across the project |
| 5. Semantic assessment | Does the code contain genuine behavior or is it a stub/placeholder? | Agent reads the code directly |

Layers 1 through 4 are mechanical: they use shell tools (grep, find, git) and produce ternary results (positive, negative, or not applicable). If any mechanical layer returns negative, the task is flagged. Layer 5 has two roles. When the mechanical layers are all positive, it catches hollow implementations: a function containing only return None will pass Layers 1 through 4 but fail Layer 5. When the mechanical layers all return not-applicable (because the task is qualitative or process-oriented), Layer 5 makes the sole determination and assigns SKIPPED when the task can’t be mechanically assessed.
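To make the mechanical layers concrete, here is a minimal Python sketch of what Layers 1 through 4 amount to for a single hypothetical task that references src/routes.py and a register_routes symbol. The file path, symbol name, and commit range are illustrative; the real extension is a markdown prompt that drives grep, find, and git directly rather than going through Python.

```python
# Minimal sketch of Layers 1-4 for one hypothetical task (illustrative names).
import subprocess
from pathlib import Path

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def mechanical_layers(file: str, symbol: str, commit_range: str = "main..HEAD") -> dict:
    results = {}
    # Layer 1: does the referenced file exist on disk?
    results["file_exists"] = Path(file).is_file()
    # Layer 2: was the file modified in the relevant commit range?
    changed = run(["git", "diff", "--name-only", commit_range]).splitlines()
    results["in_git_diff"] = file in changed
    # Layer 3: does the file contain the declared symbol? Plain substring match,
    # so a comment mentioning the symbol also counts -- Layer 5 sorts that out.
    results["symbol_declared"] = results["file_exists"] and symbol in Path(file).read_text()
    # Layer 4: is the symbol referenced by any file other than its definition site?
    hits = run(["grep", "-rln", "--include=*.py", symbol, "."]).splitlines()
    results["referenced_elsewhere"] = any(
        Path(h).resolve() != Path(file).resolve() for h in hits
    )
    return results

print(mechanical_layers("src/routes.py", "register_routes"))
```

Any negative here flags the task; the not-applicable case (a task with no file or symbol references at all) is what pushes the decision entirely onto Layer 5.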

Each task receives one of five verdicts:

  • VERIFIED: all applicable layers returned positive evidence
  • PARTIAL: at least one layer positive, at least one negative; needs human review
  • WEAK: marginal evidence found; the cascade could not confirm the implementation
  • NOT_FOUND: the referenced file does not exist
  • SKIPPED: no file paths or code references; cannot be mechanically verified

The cascade uses an asymmetric error model: a missed phantom is worse than a false alarm, because the walkthrough lets you inspect and dismiss false flags in seconds. Ambiguous evidence always produces PARTIAL or WEAK, never VERIFIED.
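A hypothetical mapping from layer evidence to a verdict makes the asymmetry concrete. The extension’s prompt weighs evidence more holistically than this, but the rounding direction is the same: ambiguity never rounds up to VERIFIED.

```python
# Hypothetical verdict assignment, illustrating the asymmetric error model.
# Each applicable layer reports "positive" or "negative"; inapplicable layers
# report "n/a". Any doubt is surfaced to the human, never silently verified.
def assign_verdict(file_exists: bool, layers: dict[str, str]) -> str:
    applicable = {k: v for k, v in layers.items() if v != "n/a"}
    if not applicable:
        return "SKIPPED"      # nothing the cascade can check mechanically
    if not file_exists:
        return "NOT_FOUND"    # the referenced file is missing outright
    if all(v == "positive" for v in applicable.values()):
        return "VERIFIED"     # only unanimous positive evidence verifies
    if any(v == "positive" for v in applicable.values()):
        return "PARTIAL"      # mixed evidence: flag it for human review
    return "WEAK"             # evidence was sought but nothing confirmed the work
```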

False positives and false negatives in practice

NOT_FOUND and VERIFIED are high-confidence verdicts. PARTIAL and WEAK require human judgment, but in my testing across four fixture suites, they’ve consistently pointed to a legitimate concern: a true phantom, dead code, or incomplete wiring.

False negatives (a phantom passing as VERIFIED) are the more dangerous failure mode. A task can only reach VERIFIED if every applicable layer returns positive and Layer 5 confirms genuine behavior. The likely path to a false negative is a task where the file exists, the symbol is present and imported, and the function has a real body, but the implementation doesn’t match what the task specified. Layer 5 catches the clear cases, but logic mismatches are what code review and testing are for.

This tool does not replace code review or testing. It checks whether the work was done, not whether it was done correctly.

The examples here use Python, but the cascade is language-agnostic. It operates on file paths, symbol names, and import graphs using grep and find, which work the same way across Python, TypeScript, Go, Java, Rust, or C. The README documents how accuracy varies by artifact type.

What the cascade catches

Missing file (Layer 1): “Create notifier.py with EmailNotifier class” but notifier.py was never created. The simplest case, but easily missed when the task sits in a list of hundreds of checked boxes.

Missing symbols (Layer 3): cache.py exists and CacheManager is declared, but the get() and set() methods are absent.

Dead code and stubs (Layers 4, 5): routes.py exists and register_routes is declared with a stub body, but no other file imports it. The most common phantom pattern: the agent created scaffolding, wired nothing, and moved on.
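Concretely, the stub looks something like this (illustrative code, mirroring the file names above):

```python
# routes.py -- passes Layers 1-3 (file exists, committed, symbol declared),
# but fails Layer 4 (no other file imports it) and Layer 5 (the body is a stub).
def register_routes(app):
    pass  # TODO: wire up the handlers
```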

Wrong content (Layer 3): utils.py exists but contains format_date and slugify instead of the specified parse_request_body. The agent wrote something useful in the right file but didn’t implement the task.

Incomplete class (Layers 3, 5): middleware.py has a LoggingMiddleware with __init__ but no __call__ method. The structural artifact exists but can’t perform its specified role.
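And the incomplete class, again with illustrative names:

```python
# middleware.py -- LoggingMiddleware is declared, but Layer 3 flags the missing
# __call__ and Layer 5 notes the class cannot perform its specified role.
class LoggingMiddleware:
    def __init__(self, app):
        self.app = app
    # __call__ was never implemented, so requests are never logged
```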

Pure prompt architecture

The extension command is a single markdown prompt with no dependencies beyond grep, find, and git.

The trade-off is precision: grep lacks the resolution of AST parsing, so a grep for register_routes will match a comment mentioning the function, not just its definition. AST parsing isn’t applicable to all domains anyway. False positives at the grep layer are caught by the semantic layer, which reads the code and recognizes that a comment isn’t an implementation.
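For example, in a hypothetical file where the symbol appears only in a comment, the pattern match still fires:

```python
# app.py -- `grep -n register_routes` matches the comment below, so Layer 3
# reads positive even though the function was never written or called.

# TODO: call register_routes() here once routes.py exists
```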

Testing the verifier with planted phantoms

You need known ground truth to test a verification tool: tasks with predetermined correct verdicts. Real project data doesn’t work because you don’t know in advance which tasks are phantom. The repo ships four synthetic test fixtures with planted phantom completions.

The phantom fixture has 10 tasks: five implemented, five planted phantoms (missing file, empty stub, never-imported function, wrong functions, missing method). All phantoms should produce NOT_FOUND or PARTIAL; all genuine tasks should produce VERIFIED.

The genuine fixture has 10 tasks, all implemented. Every task should produce VERIFIED; any flag here is a false alarm.

The edge-case fixture tests boundaries: behavioral tasks with no file references (should produce SKIPPED), malformed syntax, glob paths, and nested subtasks.

The scalability fixture has 50 tasks. All source files exist with real implementations, but 8 tasks have genuine quality issues (dead-code methods, stub functions). Expected: 42 VERIFIED and 8 PARTIAL. This validates that the cascade scales without losing accuracy and distinguishes phantoms from real quality issues.

Reproducible test runs

Each fixture has setup and teardown scripts for reproducible, isolated test runs. setup-fixture.sh phantom-tasks creates a dedicated git branch, copies fixture files into a feature directory, and commits them. You then open a fresh agent session and run /speckit.verify-tasks. After comparing the report against expected verdicts, teardown-fixture.sh switches back to main and deletes the test branch. expected-verdicts.md documents the expected evidence for every task in every fixture.

The interactive walkthrough

After writing the verification report, the command walks through each flagged item in severity order: NOT_FOUND first, then PARTIAL, then WEAK. For each item, it shows the evidence gap and offers three options:

T007 register_routes in routes.py — PARTIAL
Layer 1 ✅ file exists | Layer 3 ✅ symbol declared | Layer 4 ❌ never imported | Layer 5 ❌ body is stub

[I]nvestigate / [F]ix / [S]kip

Choosing I runs additional searches and outputs a detailed analysis. F proposes a minimal fix without applying it until you confirm. S moves to the next item. The walkthrough log is appended to the report; the original verdict table is never modified, preserving the audit record.

Complementary verification tools

verify-tasks fills a specific gap. Several other tools address adjacent concerns:

/verify-plan (Dave Sharpe / dataStone) catches phantom completions in Claude Code’s /plan workflow by cross-referencing plan items against the actual diff. Claude Code-specific.

/plan-exit-review (Garry Tan) reviews speckit plan quality before implementation begins. Pre-implementation; cannot detect implementation omissions.

spec-kit-verify (Ismael Jimenez) checks whether the speckit implementation satisfies the spec across requirement coverage, test coverage, and constitution compliance. Broad quality gate operating at the spec level.

Code Review for Claude Code (Anthropic) dispatches a team of agents on every pull request to find bugs and rank findings by severity. Checks whether code that was written is correct; does not check whether code that was supposed to be written exists.

The distinction is the unit of analysis. spec-kit-verify and Code Review operate on the implementation as a whole. verify-tasks operates on individual [X] marks. A phantom will likely pass both spec-kit-verify (the code that exists is fine) and Code Review (no bugs in what was written) but will be caught by verify-tasks (the task was never implemented).

|  | spec-kit-verify | verify-tasks |
| --- | --- | --- |
| Unit of analysis | Spec requirements, scenarios, constitution | Individual [X] tasks in tasks.md |
| Verification method | Agent semantic assessment across 7 categories | Mechanical cascade (grep, find, git diff) plus semantic stub detection |
| Error model | Balanced severity reporting | Asymmetric: missed phantoms are catastrophic, false flags are acceptable |
| What it catches | Spec-implementation misalignment | Tasks marked done that were never implemented |
| Fresh-session requirement | No | Yes, by design |

spec-kit-verify checks whether the implementation is correct. verify-tasks checks whether it’s complete.

Where it fits in the workflow

The recommended sequence after /speckit.implement finishes:

/speckit.implement (in the working session) → /speckit.verify-tasks (in a fresh session) → spec-kit-verify (optional, spec-level quality gate) → code review → merge PR.

verify-tasks in the spec-kit workflow

The fresh-session recommendation is structural. The implementing agent carries context that biases it toward confirming its own work. A fresh session reads the task list and diff without that history.

An after_implement hook is included: after /speckit.implement finishes, the agent prompts you to run /speckit.verify-tasks in a new session. The hook is optional and can be disabled in extensions.yml.

The command is designed for iterative use. After the walkthrough, if fixes were applied, re-run /speckit.verify-tasks for a clean re-evaluation. The goal is a report where every [X] task is either VERIFIED or SKIPPED.

Install

specify extension add verify-tasks

MIT licensed. No dependencies beyond grep, find, and git.

Repository: github.com/datastone-inc/spec-kit-verify-tasks

Star the repo if you find it useful. File issues if the cascade produces unexpected verdicts. The expected-verdicts doc and fixture scripts are there to validate the tool before trusting it with your tasks.


Previous in series: Claude Code Said ‘Done.’ It Wasn’t. So I Built a Skill to Catch Phantom Completions

Previous in series: The [X] Problem: Phantom Completions in AI-Assisted Development