The [X] Problem: Phantom Completions in AI-Assisted Development
AI coding agents sometimes mark tasks as complete when the work was never done. The code compiles, the tests pass, and the agent moves on. But the specified file was never created, or the required modification was never applied. I call this failure mode a phantom completion: a false positive in the agent’s own task-tracking output, where the checkbox was marked [X] complete, but the code is missing or “wrong” (syntactically correct but not to spec).
Across multiple agentic coding workflows, spanning Claude Code’s /plan and GitHub’s spec-kit task-based implementation, I have observed three phantom completions in approximately 830 structured tasks. That is a rate of about 0.36%, or roughly one silent failure per 277 tasks.
None of these were caught through proactive task list review. In each case, I was testing the implemented feature, something did not work as expected, and I had to trace the failure backward through the code. That process was slow and painstaking: figuring out why a feature behaved incorrectly, then eventually realizing the problem traced back to a task that had been marked complete but was never actually implemented. I had to mentally puzzle through the fact that I had specified the requirement in the plan and task list, the agent had checked it off, and yet the corresponding code was wrong or missing. Each such task was marked [X] complete by the implementing agent. Each such [X] mark was wrong.
What a phantom completion is not
A phantom completion is distinct from the failure modes that developers already watch for. It is not a bug, which is wrong implementation of something that was attempted. It is not a hallucination in the traditional LLM sense, which is fabricated factual content. It is not scope creep, which is implementing something the spec did not ask for. A phantom completion is specifically a gap between what the task list says was done and what the codebase actually contains. The agent’s self-reported status is the false claim; the code is missing or syntactically correct but not to spec.
A 0.36% rate that you will never catch by hand
Three phantom completions out of roughly 830 tasks means the agent produces a false completion report approximately once every 277 tasks. That is far too rare to develop an intuition for. If you are reviewing a task list where 827 items are genuinely complete, your brain pattern-matches a wall of [X] marks and moves on. The three phantoms are invisible.
There is a deeper reason they stay invisible. When you see a tasks.md with 50 checked boxes, you focus your review energy on code quality and architecture. You do not re-derive whether each task was performed, because the agent said it was. But you also believe you are doing a thorough review. Those two facts are in cognitive dissonance tension. You hold two contradictory beliefs: “I’m reviewing this carefully” and “I’m accepting the agent’s self-reported completions without verification.” The easiest way to resolve the dissonance is to not think about it. The phantoms survive because your review process has a blind spot that feels like diligence.
This is not a problem you solve through vigilance. Vigilance fails at this frequency because there is nothing to trigger suspicion. The code compiles, the structure looks right, and the agent expressed no uncertainty. You need tooling that cross-references the specified work against the actual codebase, independent of the agent’s claim that the work was done.
The concrete failures
The phantom completions occurred across two different agentic workflows during development of Chiron (pronounced “KY-ron”), a TypeScript-based AI collaboration coaching tool currently in closed beta that analyzes VS Code Copilot sessions and provides actionable feedback.
Claude Code /plan workflow
The first phantom I discovered was in a Claude Code session where I had iterated a detailed /plan with six major change groups and over sixty implementation items. The plan called for a new field on a core data structure, populated during construction from an upstream source and consumed by a downstream filter. What actually happened: the field was declared on the type (so it existed in the codebase) but never populated where the structure was built. The downstream filter always read undefined. The code was syntactically correct, structurally present, and referenced in exactly the right places. It was dead code masquerading as a working implementation. I wrote about this experience and built /verify-plan to catch it.
Spec-kit task workflow
The remaining two phantoms emerged during the spec-kit phase of the same project. I used spec-kit to manage nine specifications with 768 total tasks:
| Spec | Tasks |
|---|---|
| 001-working-extension | 60 |
| 002-outcome-correlation | 61 |
| 003-architecture-refactor | 110 |
| 004-remove-session-thresholds | 50 |
| 005-historical-analysis | 73 |
| 006-snippet-grounding | 80 |
| 007-ai-taxonomy-classification | 154 |
| 008-comprehensive-testing-suite | 111 |
| 010-insights-coaching-improvements | 69 |
The two phantoms occurred in different specs, which is significant because it means the phenomenon is not tied to a particular specification’s complexity or the agent’s accumulated state within one implementation session.
Task T025 (spec 002-outcome-correlation) was a pure phantom. The task specified: “Create outcome classification types in src/core/sessionOutcome/outcomeTypes.ts.” The file outcomeTypes.ts was never created. The commit that closed the task only added outcomeAssessor.ts. The types that T025 specified were folded into a different module entirely, but the dedicated file with its runtime constants and validation helpers was silently dropped.
Task T013b (spec 010-insights-coaching-improvements) was a scope-limited phantom and the more dangerous of the two. The task specified: “Modify pattern analysis in patternDetector.ts to filter out turns with operationalFailures.” The file patternDetector.ts was never modified and contains zero references to operationalFailures. The operational failure filtering that the parent task (T013) built was applied to effectiveness scoring but never extended to pattern detection. This means operationally-failed turns, including authentication errors, rate limits, and tool failures, can still contribute to pattern frequency counts and generate coaching suggestions based on noise. That is the exact bug T013b was designed to prevent.
These two subtypes, pure phantoms (no implementation at all) and scope-limited phantoms (partial application reported as full completion), appear to arise from different mechanisms but produce the same visible symptom: a checked box next to work that was not done.
This is not an isolated experience
The spec-kit ecosystem has over 68,000 stars on GitHub and a growing body of users reporting related problems, though the reports describe overlapping but distinct failure modes.
In issue #847, a developer reports that end-of-task guardrails are “routinely skipped” as context fills up, and that no amount of adjustment to the constitution, spec, plan, or task files produces reliable completion behavior. This is a related but different problem. Context-window degradation causes the agent to gradually lose track of instructions as the session grows long, and the developer typically notices because tests fail or the agent takes unauthorized actions like committing code. Phantom completion is a point failure in task status reporting where nothing visibly goes wrong. The agent does not “forget” a guardrail; it confidently and specifically marks a task done that it did not do. There is no downstream signal that anything is amiss until the feature misbehaves in testing.
In issue #442, a user requested a post-implementation debugging workflow, noting that /implement sometimes produces broken or incomplete results. That issue addresses visible implementation failures: code that crashes, tests that fail, features that do not work. A /diagnose command helps when you know something is wrong. Phantom completion is the case where you do not know something is wrong because the agent reported success. A task cross-check helps specifically when there is no signal that debugging is needed.
Outside the spec-kit ecosystem, a practitioner building a Claude-based automation system independently documented a related phenomenon in a DEV.to post on “AI execution hallucination,” observing that the model confidently reports completing actions it never actually took. Their case was more wholesale: the model fabricated entire file operations. Phantom completion is a narrower and subtler variant specific to structured development workflows, where the false completion is embedded in a list of hundreds of genuine completions. The individual phantom is a needle in a haystack, which is what makes it harder to detect.
These reports describe related but distinct problems in the broader space of agent reliability. Phantom completion is a specific failure mode within that space: the agent marks a task [X] complete in a structured implementation list, and the corresponding code is missing or not to spec. It is not the only way agents fail at implementation, but it is one of the hardest to detect because it produces no error signal at all.
Why LLMs produce phantom completions
Understanding why this happens requires looking at how autoregressive language models actually work when they process a task list.
The completion token is the path of least resistance
An LLM generates text one token at a time, where each token is selected based on the probability distribution conditioned on all preceding tokens. When an agent has been working through a task list and has marked tasks T001 through T024 as [X], the context window contains a repeating pattern: task description followed by completion marker. When the model reaches T025, the highest-probability continuation after the task description is another [X].
The model has no internal state register that tracks whether outcomeTypes.ts was created on disk. It operates exclusively on the tokens in its context window. If the surrounding tokens look like a completed task list, the completion marker is the statistically favored continuation regardless of what happened in the filesystem. The token prediction mechanism is the source of the failure. The model is doing what it was trained to do: predict the most probable next token given the context. The context says “completed tasks.” The most probable continuation says “completed.”
This problem is compounded by exposure bias, a well-studied phenomenon in autoregressive models where the discrepancy between training-time inputs (ground truth) and inference-time inputs (the model’s own prior outputs) causes errors to accumulate. Once the model has emitted a few [X] markers in sequence, the distribution shifts further toward emitting more. Each completion marker makes the next one more likely, independent of whether the underlying work was performed.
Planning and execution collapse into the same token stream
Bachmann and Nagarajan’s “The Pitfalls of Next-Token Prediction” (ICML 2024) provides a theoretical framework for this conflation. Their work demonstrates that autoregressive models trained with teacher-forcing can learn to exploit local shortcuts in planning tasks rather than developing genuine global reasoning. They identify what they call the “Clever Hans cheat,” where the model predicts the next correct token by copying from locally available context rather than computing the answer from first principles. In their experiments, both the Transformer and Mamba architectures failed at a minimal planning task in this way, despite the task being straightforward to learn through other means.
The relevance to phantom completions is direct. When the model processes Task T013b (“filter out turns with operationalFailures in pattern detection”), it generates tokens that reason about the concept. This reasoning produces activations structurally similar to tokens that would describe having done the work. The model “thought about” applying the filter to patternDetector.ts. In an autoregressive architecture, thinking about an action and reporting that the action was completed are not distinct operations. They occur in the same forward pass, producing the same kinds of tokens. The actual file modification, which would have required the model to hold a multi-file requirement in working memory and apply the same logic in a new context, was the harder task. The model satisfied the local prediction and moved on.
There is an additional structural factor that compounds the problem. A task list is always a simplified model of the codebase at the time the tasks were written. As implementation progresses and the codebase evolves, the gap between what a task specifies and what the code actually needs can widen. Early tasks change the landscape that later tasks were written to address. The agent naturally adapts to the code as it finds it rather than to the task as it was specified. Sometimes this adaptation is correct: the task was written before the code evolved, and the agent correctly recognizes that a different approach is needed. But sometimes the agent silently decides a task is no longer needed or is already satisfied by prior work, marks it complete, and moves on. From the outside, these two cases are indistinguishable without cross-referencing the task against the code.
Bachmann and Nagarajan’s snowball failure mode is also relevant. They demonstrate that a model can have learned a good plan while simultaneously failing to execute it during inference, because a small deviation early in the sequence pushes the model onto a trajectory where subsequent tokens are consistent with the deviated path rather than the original plan. A phantom completion may represent exactly this: the model deviated from the plan, and the widening gap between plan and code reality made subsequent completion markers consistent with the new trajectory rather than the original task specification.
The model cannot verify its own output
Kambhampati et al. argued in their ICML 2024 position paper “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks” that autoregressive LLMs cannot perform self-verification because verification is itself a form of reasoning that requires the same capabilities the model lacks for planning. Their investigation into self-critiquing found that it actually diminishes plan generation performance compared to systems with external verifiers, and that LLM-based verifiers produce a notable number of false positives. A phantom completion is, by definition, a false positive in the model’s assessment of its own work.
Kambhampati frames LLMs as “approximate knowledge sources” that act as a pseudo System 1 (fast, intuitive) in Kahneman’s dual-process model. Self-verification would require System 2 (slow, deliberate) reasoning: checking whether a specific file was modified, tracing data flow across modules, confirming that a requirement was satisfied in all specified locations. The autoregressive architecture produces the System 1 answer (“looks done”) rather than the System 2 answer (“let me check”).
There is also a subtler force at play. Models trained through RLHF have a statistical bias toward outputs that human evaluators rate positively. Completing all tasks is a positive outcome. Reporting partial completion or flagging a task as blocked creates friction. This is not about the model trying to please you, but about the distribution of highly-rated completions in training data skewing toward success reporting. One practical observation: explicitly asking the model for critical self-assessment (“identify tasks that may have been missed or incompletely implemented”) partially counteracts this bias by shifting the probability distribution away from the default success-reporting mode. It helps, but it is not a fix.
You cannot prompt your way out of this
If the root cause were sloppy instructions, better prompting would solve it. But the root cause is structural. The autoregressive token prediction mechanism has no grounding in execution state. The model predicts what a completed task list looks like. It does not verify what the filesystem contains.
Better prompts can reduce the rate. Explicit instructions to verify each task against the codebase before marking it complete will catch some phantom completions. But as Kambhampati et al. demonstrated, LLM self-verification produces false positives at a rate that undermines reliability. The model that wrote the code carries the full context of every reasoning step that led to the phantom completion, which subtly biases it toward confirming its own work.
This is why I recommend running verification in a separate, clean session. A fresh session reads the task list and the diff without carrying the implementation context. It has no memory of the decisions that led to the phantom. The findings are more reliable. But even a fresh session is still an LLM performing self-assessment, which means the structural limitation remains.
The approach I have found effective is to design an AI verification workflow that uses the LLM’s own biases in reverse. If the model has a statistical tendency to report success, you can build a tool whose entire prompt context is oriented toward finding gaps. Instead of asking the implementing agent to confirm its own work, you spin up a separate agent whose job is to look for problems, whose context is saturated with patterns of incompleteness rather than patterns of completion. You are not eliminating the token-prediction bias; you are deliberately pointing it in the other direction. This is the design principle behind /verify-plan and the forthcoming /speckit.verify-tasks.
The verification gap in spec-driven development
Spec-driven development is the right direction. The structured progression from specification to plan to tasks to implementation produces better outcomes than unstructured prompting. But the current workflow has a structural gap between “implement” and “done.”
Notably, spec-kit invests in verification for the planning artifacts. The /speckit.analyze command checks cross-artifact consistency between the spec, plan, tasks, and constitution. The /speckit.checklist command generates domain-specific quality checklists that act as “unit tests for English” to validate spec completeness before implementation. But there is no corresponding step that cross-references the completed implementation against the task list. The workflow verifies the coherence of the plan before coding begins and then trusts the agent at exactly the stage where trust is least warranted.
Anthropic’s recently announced Code Review for Claude Code represents significant investment in post-implementation verification, dispatching a team of agents on every pull request to find bugs, filter false positives, and rank findings by severity. But Code Review operates on a different axis: it checks whether the code that was written is correct. It does not check whether the code that was supposed to be written actually exists. A phantom completion would pass Code Review because the code that is present is fine. The problem is the code that is absent.
These are complementary verification layers. Code Review asks “did you write this correctly?” Task cross-referencing asks “did you write what you said you would?” Both are necessary. The current ecosystem provides the first and lacks the second.
I built /verify-plan (repo | blog post) to fill this gap for Claude Code’s /plan workflow. It performs a deterministic, grep-level cross-reference of each plan item against the actual diff, flagging declared-but-never-wired symbols, missing file modifications, and dead code. It catches the class of failures that phantom completions produce without relying on an LLM to assess its own work. The next step is building the equivalent for task-based workflows such as spec-kit: a /speckit.verify-tasks command that reads a tasks.md file, parses the structured task items, and cross-checks each one against the codebase. That tool is the subject of the next post in this series.
The principle
Every agentic workflow that marks tasks complete needs a verification step that is independent of the implementing agent. Build verification into your workflow the same way you build testing into your workflow. The agent that wrote the code cannot reliably confirm that it followed the spec.
A 0.36% phantom completion rate sounds low until you consider that it means roughly one silent failure per 277 tasks. For a complex project with hundreds of tasks, that is a near-certainty. And unlike a compilation error or a failing test, a phantom completion leaves no trace except a checked box next to work that was never done. The only way to find it is to look.
Previous in series: Claude Code Said ‘Done.’ It Wasn’t. So I Built a Skill to Catch Plan Implementation Drift
Next in series: Building /speckit.verify-tasks for spec-kit task cross-checking (coming soon)
Tags: AI-assisted development, phantom completion, spec-driven development, spec-kit, agentic coding, Claude Code, LLM verification, plan implementation drift
Comments