
Stop Being Held Hostage by Spec-Kit - In the Strong Model Era, Lightweight Documentation Flow Is the Right Way for AI Programming

Introduction

A few days ago I read a rant about spec-kit. The author said they wasted 34M tokens back and forth, and in the end deleted both the code and specs. After deleting, the more they thought about it, the angrier they got, so they wrote an article documenting the entire experience.

I nodded through almost the entire thing.

If you haven’t seen my previous article “GLM-5 Announced: Agentic Code Era Arrives, Why SDD Is Actually More Important?”, I suggest taking a look first. That article explains why SDD (Spec-Driven Development) matters more in the Agentic Code era; today’s article covers the other side: the problem isn’t SDD itself, but certain tool implementations that remain bloated and inefficient in the strong model era, and that have no mechanism for making a clean break with superseded decisions.

This isn’t an isolated case - I’ve fallen into similar traps myself. I previously used spec-kit on Node.js full-stack projects and Java projects, with almost identical results: the requirements didn’t get clearer and the architecture didn’t get more stable, but a pile of ever-longer Markdown documents got generated first. Before the project even got off the ground, the docs directory had spiraled out of control.

So my conclusion is now clear: The SDD philosophy isn’t wrong, but heavy SDD tools like spec-kit are becoming increasingly like baggage as model capabilities improve.

What truly matters is never “how much documentation was written,” but “whether documentation can stably constrain the model and continuously guide implementation.”


Effect Demo: I Now Only Keep 4 Core Documents

My current approach in Codex 5.4 and Claude Code has converged into a very lightweight documentation flow:

AGENTS.md
docs/
  specify.md
  plan.md
  task.md

Each does only one thing:

  • specify.md: Requirements, goals, acceptance criteria
  • plan.md: Architecture design, boundaries, technical constraints
  • task.md: Task breakdown, execution order, verification methods
  • AGENTS.md: Project tech stack, development principles, safety red lines, and overall command rules

More critically, the 3 documents in docs can’t just float on their own - they must be explicitly mounted in the root AGENTS.md, which spells out each document’s path, its purpose, and when the Agent must re-read it.

The benefits of this flow are direct: Less documentation, but the model is more obedient; fewer constraints, but higher accuracy.

The reason is simple: the model sees high-signal context, not a wall of ritualized paperwork.


Why spec-kit Failed

If I had to summarize in one sentence:

spec-kit’s problem isn’t that it cares too much about specifications, but that it is over-engineered and under-designed at the same time.

1. Over-Engineering: It Writes Too Much, Making the Model Dumber

spec-kit’s standard chain is constitution, specify, plan, tasks, implement, and each step comes with a large volume of prompt text, templates, and MUST constraints.

This might have made sense in the weak model era because models then needed many “handrails.”

But today’s problem is that models like Codex 5.4 have already internalized quite a bit of software engineering common sense. When you pour several large paragraphs of paperwork, dozens of MUSTs, and countless edge-case natural language constraints into its context, you’re essentially competing for its attention.

The result is:

  1. Requirements get diluted
  2. Key constraints get buried
  3. Details frequently get lost during implementation
  4. The model starts oscillating between “writing code” and “looking back to align with the paperwork”

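The most typical phenomenon: you explicitly ask it to use the ORM, and it insists on concatenating SQL by hand; you explicitly tell it not to suppress warnings indiscriminately, and it confidently suppresses them anyway; you have already clarified that a certain path is invalid, and it still ends up implementing the path it hallucinated in the first round, then wraps an extra compatibility layer around it.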

This isn’t the model being dumb - it’s that there’s too much ineffective noise in the context.
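
A rough way to see how much of a document is rule text rather than requirements is to count its MUST statements against its length. The sketch below is only an illustration under my own assumptions: `constraint_density` is a hypothetical helper, it counts uppercase MUST / MUST NOT only, and the number it returns is a heuristic for comparing documents, not a measured threshold.

```python
import re


def constraint_density(text: str) -> dict:
    """Count uppercase MUST / MUST NOT statements relative to document length."""
    words = re.findall(r"\S+", text)
    # "MUST NOT" is matched as a single constraint, not as MUST plus NOT.
    musts = re.findall(r"\bMUST(?:\s+NOT)?\b", text)
    per_100 = 100 * len(musts) / len(words) if words else 0.0
    return {
        "words": len(words),
        "musts": len(musts),
        "musts_per_100_words": round(per_100, 1),
    }


if __name__ == "__main__":
    sample = "The API MUST use the ORM. It MUST NOT build SQL by hand."
    print(constraint_density(sample))
```

Running it over each generated document makes it easy to spot which files are mostly handrails and which actually carry requirements.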

2. Design Deficiency: It Has No Real Break Mechanism

This is the more fatal problem.

spec-kit’s approach is more “append” than “replace.” When you modify requirements, clarify paths, or delete assumptions, it often doesn’t make a clean break with the old solution, but tends to:

  • Keep old requirements
  • Stack new requirements on top
  • Add compatibility for both sets of logic
  • Patch the system thicker and thicker

So you see an absurd state:

  • The frontend still lives in the hallucinated version 1 requirements
  • The backend has already switched to the clarified version 2 path
  • The middle layer carries “compatibility with old logic” transition code

For solo projects, this approach is nearly disastrous.

Because what a solo project should do above all is this: once the route changes, break completely. Delete the old code, no mercy.

No need to be compatible with the past, no need to keep historical baggage, no need to build monuments for hallucinated products.

3. It Treats LLM Like a Compiler, Not a Probabilistic Model

Much of spec-kit’s prompt text reads as if writing enough rules guarantees that the model will follow them.

But the reality is that an LLM isn’t a compiler - it doesn’t execute every rule with equal weight. Once the context gets long, it tends to focus more on the beginning and the end, and more easily ignores the middle.

So the question becomes:

When the context exceeds tens of thousands of tokens, how many of spec-kit’s many MUSTs are actually still in effect?

If a method must rely on massive rule text to work, it’s already very fragile.

4. Negative Constraints Distort in Long Context

This is something I’m increasingly certain about: an LLM’s handling of negative instructions is naturally less stable than its handling of positive constraints.

“Don’t do A” easily turns back into doing A when the context is long and the state is messy.

So you’ll find spec-kit-generated documents often have many sentences like:

  • Don’t do this
  • Don’t do that
  • Don’t be compatible with old logic
  • Don’t suppress warnings

The problem is, once these negatives fail in long context, they directly backfire into positive behavior.

This is why often the most effective writing isn’t “don’t A,” but “only allow B, C, D.” If you must negate, it’s better written as “don’t A, use B instead.”
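
This rewriting rule can be checked mechanically. The following is a minimal sketch, assuming constraints are written one per line in English; `bare_negations` and both regular expressions are my own illustrative choices, and the alternative detection (looking for “instead”, “only allow”, or “, use”) is deliberately crude.

```python
import re

# A line that opens with a negation, optionally behind a bullet marker.
NEGATION = re.compile(
    r"^\s*(?:[-*•]\s*)?(?:don't|don’t|do not|never)\b", re.IGNORECASE
)
# Signs that the line also names what to do instead.
ALTERNATIVE = re.compile(
    r"\binstead\b|\bonly allow\b|[,;]\s*use\b", re.IGNORECASE
)


def bare_negations(doc_text: str) -> list[str]:
    """Return lines that forbid something without offering an alternative."""
    flagged = []
    for line in doc_text.splitlines():
        if NEGATION.search(line) and not ALTERNATIVE.search(line):
            flagged.append(line.strip())
    return flagged


if __name__ == "__main__":
    sample = "\n".join([
        "- Don't suppress warnings",
        "- Don't build SQL by hand, use the ORM instead",
        "- Only allow access through the repository layer",
    ])
    for line in bare_negations(sample):
        print("rewrite as a positive constraint:", line)
```

Each flagged line is a candidate for rewriting as “only allow B” or “don’t A, use B instead.”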

5. Tool Scripts Themselves Aren’t Stable Enough

Another very real problem is that spec-kit isn’t just heavy on philosophy - even the tool layer may not be stable.

For example, managing specs and branches by number was meant to be process-oriented, but if scripts allow generating duplicate prefixes, the entire subsequent flow might error out. More ironically, even when scripts are broken, the model will stubbornly keep running forward, eventually making the whole state messier.
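
A guard against this kind of collision is small. The sketch below assumes spec directories or branches are named with a numeric prefix followed by a hyphen, such as `002-billing`; that naming scheme, the sample names, and the `duplicate_prefixes` helper are assumptions for illustration, not spec-kit’s actual scripts.

```python
import re
from collections import defaultdict

# Assumed naming scheme: digits, a hyphen, then a free-form slug.
PREFIX = re.compile(r"^(\d+)-")


def duplicate_prefixes(names: list[str]) -> dict[str, list[str]]:
    """Group names by numeric prefix and keep only prefixes used more than once."""
    by_prefix = defaultdict(list)
    for name in names:
        match = PREFIX.match(name)
        if match:
            by_prefix[match.group(1)].append(name)
    return {
        prefix: sorted(group)
        for prefix, group in sorted(by_prefix.items())
        if len(group) > 1
    }


if __name__ == "__main__":
    existing = ["001-login", "002-billing", "002-export", "notes"]
    clashes = duplicate_prefixes(existing)
    if clashes:
        # Stop here instead of letting later phases run on a broken state.
        raise SystemExit(f"duplicate spec prefixes: {clashes}")
```

The point of failing loudly is that the flow stops at the broken node rather than running forward on top of it.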

These problems illustrate one thing:

The biggest risk of heavy process tools isn’t that some step has no value, but that if any node breaks, all subsequent phases get polluted.


The SDD Philosophy Isn’t Wrong; What’s Wrong Is Misapplied Heavy SDD Tooling

One thing must be distinguished here:

I’m not criticizing SDD, but spec-kit’s implementation approach, which buries SDD under too much tooling and too much paperwork.

Spec-Driven Development’s core philosophy is actually quite simple:

  • Requirements first
  • Then design
  • Then planning
  • Finally coding

I now believe this more than before.

Because having Agents start writing directly, without requirement docs, design boundaries, or task plans, is essentially asking a skilled cook to work without ingredients. No matter how strong the model is, it can only guess in the absence of constraints.

So what should really be kept is:

  • Requirements first
  • Design first
  • Task breakdown first
  • Then let Agent execute

What should really be deleted is:

  • Overly long templates
  • Low information density phase paperwork
  • Append-only mechanisms when modifying requirements
  • Piles of seemingly professional but actually context-polluting process noise

How I Now Work in Codex and Claude Code

I’ve gradually converged my workflow into a lighter practice.

The global AGENTS.md / CLAUDE.md is essentially the equivalent of a global constitution.

Inside the project, docs/specify.md, docs/plan.md, and docs/task.md separately handle requirements, design, and task breakdown.

Finally, a very restrained AGENTS.md / CLAUDE.md in the root directory acts as the overall command file. It doesn’t just record principles and division of labor; it also explicitly mounts docs/specify.md, docs/plan.md, and docs/task.md, telling the Agent:

  • What are the specific paths to these documents
  • What each document is responsible for
  • Which one must be read first when entering requirements, design, and execution phases
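
Whether the three documents are still mounted is easy to verify automatically. The sketch below is a minimal illustration under stated assumptions: `missing_mounts` and `check_repo` are hypothetical helpers, and the check only confirms that each path string appears somewhere in AGENTS.md, not that its purpose or reading order is described.

```python
from pathlib import Path

# The three documents kept under docs/ in this workflow.
REQUIRED_DOCS = ("docs/specify.md", "docs/plan.md", "docs/task.md")


def missing_mounts(agents_text: str, required=REQUIRED_DOCS) -> list[str]:
    """Return the required doc paths that the AGENTS.md text never mentions."""
    return [path for path in required if path not in agents_text]


def check_repo(root: str = ".") -> list[str]:
    """Read <root>/AGENTS.md and report docs it does not mount."""
    agents_file = Path(root) / "AGENTS.md"
    if not agents_file.is_file():
        # No command file at all means nothing is mounted.
        return list(REQUIRED_DOCS)
    return missing_mounts(agents_file.read_text(encoding="utf-8"))


if __name__ == "__main__":
    unmounted = check_repo(".")
    if unmounted:
        print("AGENTS.md does not reference:", ", ".join(unmounted))
```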

specify.md: Write Only the Requirements Themselves, Not Everything Around Them

What matters most here is positive definition:

  • What to do
  • For whom
  • What are success criteria
  • Which major items are explicitly out of scope

I now write “don’t do X” very sparingly, and try to rewrite such constraints as “only allow Y.”

plan.md: Only Keep Design That Affects Implementation

Don’t write this layer as a thesis.

If AGENTS.md or CLAUDE.md already spells out the global tech stack constraints, plan.md shouldn’t repeat them. It is better used for only the design decisions and special-case constraints that this implementation will actually rely on.

Keeping just these is enough:

  • Module boundaries
  • Data flow
  • Implementation choices specific to this feature
  • Reference implementation sources
  • Places that must be explicitly broken

“Places that must be broken” especially deserve their own section. Once the new path is confirmed, delete the old path; don’t append a compatibility layer.
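
A break is only real once the old path is gone from the repository. As a minimal sketch, assuming the “must be broken” section of plan.md is kept as a plain list of repository-relative paths, the hypothetical helper below reports which of them still exist among the tracked files (for example the output of `git ls-files`). The file names in the example are invented.

```python
def surviving_paths(tracked_files: list[str], marked_for_deletion: list[str]) -> list[str]:
    """Return marked entries that still match a tracked file or directory."""
    survivors = []
    for marked in marked_for_deletion:
        # Treat the entry as a directory prefix as well as an exact file path.
        prefix = marked.rstrip("/") + "/"
        if any(f == marked or f.startswith(prefix) for f in tracked_files):
            survivors.append(marked)
    return survivors


if __name__ == "__main__":
    tracked = ["src/api/v2/routes.py", "src/compat/legacy_adapter.py"]
    marked = ["src/compat", "src/api/v1/routes.py"]
    for path in surviving_paths(tracked, marked):
        print("old path still present, delete it:", path)
```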

task.md: Tasks Must Be Broken Down to Agent-Executable Level

Tasks can’t just be big phrases like “complete backend development.”

More effective is this granularity:

  • Which files to modify
  • Which interfaces to add
  • Which verification commands to run
  • What are completion criteria

The more clearly tasks are broken down, the less likely the model is to drift while writing.
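
The granularity above maps naturally onto a small record with a completeness check. This is a sketch, not a prescribed schema: the `Task` fields, the `missing_fields` helper, and the example file and command names are all hypothetical. Interfaces are treated as optional because not every task adds one.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    files: list[str] = field(default_factory=list)            # files to modify
    interfaces: list[str] = field(default_factory=list)       # interfaces to add
    verify_commands: list[str] = field(default_factory=list)  # commands to run
    done_when: str = ""                                        # completion criteria


def missing_fields(task: Task) -> list[str]:
    """List the details a task still needs before an Agent can execute it."""
    gaps = []
    if not task.files:
        gaps.append("files")
    if not task.verify_commands:
        gaps.append("verify_commands")
    if not task.done_when.strip():
        gaps.append("done_when")
    return gaps


if __name__ == "__main__":
    vague = Task(title="Complete backend development")
    concrete = Task(
        title="Add order export endpoint",
        files=["src/orders/export.py"],
        interfaces=["GET /orders/export"],
        verify_commands=["pytest tests/test_orders_export.py"],
        done_when="Export test passes and returns CSV for an empty order list",
    )
    print(missing_fields(vague))
    print(missing_fields(concrete))
```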

AGENTS.md: Shorter Is Better, More Accurate Is Better

AGENTS.md absolutely cannot be written as a long background introduction.

It should only contain high-signal content:

  • Project tech stack and basic constraints
  • Safety red lines
  • Development principles, such as single responsibility and the open-closed principle, that constrain how the Agent reasons
  • Specific paths and purpose explanations for docs/specify.md, docs/plan.md, docs/task.md
  • Who is responsible for what
  • Who may modify which directories
  • When parallel execution is allowed
  • Which situations must stop and report
  • Which old logic must be directly deleted

AGENTS.md isn’t a second requirements document, but a dispatch hub connecting global principles and 3 core documents.


Best Engineering Practices for Combining Superpowers Skills

I now mostly let superpowers skills trigger passively in Codex or Claude Code, rather than actively stacking them into a long, sacred process.

My suggestion is to follow 5 principles when combining skills.

1. Process Skill Before Implementation Skill

Recommended order:

brainstorming → writing-plans → subagent-driven-development / dispatching-parallel-agents → requesting-code-review

Once the main process is stable, connect domain-specific skills, such as frontend-design for frontend work or spring-boot-engineer for Java.

Don’t let implementation skill jump the gun from the start.

2. One Stage, One Main Document

In the requirements stage, modify only specify.md; in the design stage, only plan.md; in the task stage, only task.md.

Don’t let each skill produce its own new document, otherwise multiple sources of truth will quickly appear.
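
This rule can be enforced with a check over the files a stage has touched. The sketch below assumes stage names of my own choosing (`requirements`, `design`, `tasks`) and a list of changed paths supplied by the caller, for example from `git diff --name-only`; `stray_documents` is a hypothetical helper that only looks at files under docs/.

```python
# One main document per stage; the stage names are illustrative.
STAGE_DOCUMENT = {
    "requirements": "docs/specify.md",
    "design": "docs/plan.md",
    "tasks": "docs/task.md",
}


def stray_documents(stage: str, changed_files: list[str]) -> list[str]:
    """Return changed files under docs/ other than the stage's main document."""
    allowed = STAGE_DOCUMENT[stage]
    return [
        path
        for path in changed_files
        if path.startswith("docs/") and path != allowed
    ]


if __name__ == "__main__":
    changed = ["docs/plan.md", "docs/design-notes.md", "src/app.py"]
    for path in stray_documents("design", changed):
        print("second source of truth:", path)
```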

3. Strong Models for Judgment, Regular Models for Execution

Strong models are better for:

  • Requirements clarification
  • Architecture convergence
  • Code Review

Regular models are better for:

  • Mechanical implementation
  • Writing tests
  • Running verification

This is more stable than having one big model handle everything end-to-end.
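
Written down as routing, the split looks like the sketch below. The tier labels `strong` and `regular` and the task kind names are placeholders I chose to mirror the two lists; no specific model or API is implied.

```python
# Work that needs judgment goes to the strong tier.
JUDGMENT = {
    "requirements_clarification",
    "architecture_convergence",
    "code_review",
}
# Work that is mostly mechanical goes to the regular tier.
EXECUTION = {
    "mechanical_implementation",
    "writing_tests",
    "running_verification",
}


def pick_tier(task_kind: str) -> str:
    """Map a kind of work to a model tier, failing on unknown kinds."""
    if task_kind in JUDGMENT:
        return "strong"
    if task_kind in EXECUTION:
        return "regular"
    # Unknown work forces an explicit decision instead of a silent default.
    raise ValueError(f"no tier assigned for task kind: {task_kind!r}")


if __name__ == "__main__":
    print(pick_tier("code_review"))
    print(pick_tier("writing_tests"))
```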

4. Must Have Review Gates

Real engineering guardrails aren’t document thickness, but checkpoints.

At minimum, two gates:

  • Does this implementation match the current spec?
  • Is the code quality up to standard?

Pass the spec gate first, then the quality gate; otherwise it’s easy to produce code that took real effort but heads in the wrong direction.
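
The ordering matters more than the machinery. A minimal sketch of ordered gates, where each gate is a name plus a callable returning True or False: `first_failed_gate` is a hypothetical helper, and the two lambdas stand in for whatever review step actually answers each question.

```python
from typing import Callable, Optional

Gate = tuple[str, Callable[[], bool]]


def first_failed_gate(gates: list[Gate]) -> Optional[str]:
    """Run gates in order and stop at the first failure.

    Returns the failing gate's name, or None when every gate passes.
    Later gates are not evaluated once an earlier one fails.
    """
    for name, check in gates:
        if not check():
            return name
    return None


if __name__ == "__main__":
    gates = [
        ("spec", lambda: False),    # placeholder: implementation matches spec?
        ("quality", lambda: True),  # placeholder: code quality acceptable?
    ]
    failed = first_failed_gate(gates)
    print("blocked at gate:" if failed else "all gates passed", failed or "")
```

Because the quality gate never runs when the spec gate fails, no review effort is spent polishing code that implements the wrong thing.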

5. Passive Triggering Better Than Manual Stacking

Models are now strong enough that many process skills don’t need you to manually call every one each time.

As long as your document structure is clear, your context is clean, and your boundaries are well defined, the right skill will trigger when it should.

The mark of a mature workflow isn’t many steps, but few steps where each step is effective.


Conclusion: Heavy spec-kit Will Become Obsolete, But Lightweight SDD Won’t

I increasingly believe one thing:

As AI models continue to iterate, heavy SDD tools like spec-kit will only seem increasingly clumsy, but SDD itself won’t become obsolete.

Because software development always needs:

  • Clearly state requirements
  • Set boundaries
  • Break out plans
  • Hand execution to the model

What changes isn’t this sequence.

What changes is that tools carrying this sequence must become increasingly light, increasingly short, increasingly high-signal.

So my position is now clear:

Want SDD, not document worship. Want constraints, not rituals. Want clean breaks, not compatibility with hallucinations.

This might be the more reliable AI engineering approach for the strong model era.


This article partially references the following materials, combined with my own practical experience:

