AI systems | May 11, 2026 | 14 min read

AI Coding’s New Bottleneck Is Control, Not Code

Fast code generation is useful. Controlled software delivery needs specs, task graphs, behavior frameworks, orchestration, verification, and governance.

Tags: AI systems, Spec-driven development, AI orchestration, Developer workflow, Software delivery, Governance


By Hussam Ahmed

There is a strange failure mode that shows up when teams start using AI seriously for software development.

At first, everything looks faster.

The agent writes the first version of the feature. It creates files, edits tests, explains itself with confidence, and sometimes even opens a pull request. The demo works. The team feels the speed immediately.

Then the uncomfortable questions start.

Was this actually the feature we agreed to build? Where is the approved requirement? Which acceptance criteria does this code satisfy? Did anyone review the architecture before the agent touched the auth flow? Are the tests proving the risky part, or just proving the happy path? If this breaks production, who approved the change and where is the rollback plan?

This is where the real problem appears.

AI can generate code faster than most teams can govern it.

That does not mean AI coding is useless. It means the bottleneck has moved. The hard part is no longer only writing the code. The hard part is keeping control of intent, scope, risk, verification, and execution while the code is being produced at machine speed.

That is the useful lens for looking at spec-driven AI development tools.

Not as a leaderboard. Not as a search for the best tool. Not as a fight between Spec Kit, OpenSpec, Kiro, Taskmaster, LangGraph, Claude Flow, or any other name in the current ecosystem.

The better question is simpler and harder:

Where does your team lose control?

The problem behind spec-driven AI development

Most software teams already have some kind of delivery process.

They write PRDs, tickets, design docs, task lists, test plans, and release notes. Some of that work is formal. Some of it lives in Slack threads, meeting notes, or the memory of one senior engineer who somehow knows why a decision was made six months ago.

AI coding agents make this mess more visible.

A human developer may ask clarification questions before touching a risky part of the system. A coding agent may jump straight into implementation if the prompt sounds clear enough. A human may remember that the auth service has a strange legacy constraint. An agent may miss that context unless it is written down and placed in the right workflow.

The result is not always bad code. Sometimes the code is fine.

The deeper risk is uncontrolled delivery.

A vague requirement becomes a confident implementation. A PRD becomes a pile of unrelated tasks. A task list becomes code without architecture review. A test suite passes without covering the actual business risk. An agent retries, edits, and commits without durable state or approval records.

This is why spec-driven AI development is becoming attractive. It tries to force structure before implementation. It asks the team to define the feature, design the change, break down the work, verify the outcome, and keep those artifacts visible.

But there is a trap.

Spec-driven AI development is often discussed as if it were one product category. It is not.

Some tools help write requirements. Some turn PRDs into task graphs. Some shape how agents behave. Some orchestrate agents, tools, memory, retries, state, approvals, and long-running workflows.

Putting all of them in one ranking creates bad recommendations. A spec tool does not replace an orchestrator. An orchestrator does not create good acceptance criteria. A behavior framework does not give you durable workflow state. A task graph does not replace architecture review.

That category mistake is the core issue.

The core idea: spec-driven AI is a control model

The best idea in this analysis is not that one tool wins.

The best idea is that spec-driven AI development should be treated as a layered control model.

[Figure: Spec-driven AI development control model]

Each layer controls a different part of the delivery system.

The spec layer controls intent. It captures what should be built, why it matters, and what must be true before the work is accepted.

The task layer controls decomposition. It turns a requirement or PRD into smaller pieces of executable work, ideally with dependencies, risk levels, and verification steps.

The behavior layer controls how agents operate. It tells them to clarify, plan, write tests, review their own work, debug carefully, and avoid jumping into code too early.

The orchestration layer controls execution. It manages state, tools, memory, retries, approvals, parallel work, and long-running workflows.

The verification layer controls evidence. It proves whether the work satisfies the acceptance criteria through tests, scans, reviews, and release gates.

The governance layer controls risk. It decides when humans must approve a design, review a security-sensitive change, or stop an autonomous loop before it damages something important.

Once you see the space this way, the tool landscape becomes less confusing.

GitHub Spec Kit, OpenSpec, Kiro, and Spec Workflow MCP are mainly spec/control-plane tools.

Taskmaster AI and Ralph-style loops are closer to task execution and PRD-to-work systems.

BMAD Method, Superpowers, GSD, and SPARC-style approaches shape agent behavior and team method.

Microsoft Agent Framework, Google ADK, LangGraph, CrewAI, LlamaIndex Workflows, Ruflo, and Claude Flow belong closer to orchestration.

That split matters because teams often buy or adopt the wrong thing for the wrong problem.

If your requirements are vague, an orchestrator will not save you. If your task breakdown is weak, a better coding agent will still wander. If your tests are shallow, autonomous loops will create false confidence. If your team lacks approval gates, MCP integration alone does not make the system safe.

The right tool depends on the control failure.

How the method works in practice

A useful way to understand this is to imagine a real feature: adding TOTP-based multi-factor authentication to an existing SaaS app.

This is not a cosmetic feature. It touches login, secrets, backup codes, audit logs, admin visibility, recovery flows, and rate limiting. A coding agent can absolutely help build it. But letting an agent "just implement MFA" would be careless.

The first layer is the spec.

A good spec would say that users without MFA should keep the same login flow. Users who enable MFA must enter a valid TOTP code after password login. Backup codes must be single-use. Admins can see MFA status but cannot see secrets. Security auditors need events for enrollment, verification, failure, disablement, and recovery-code use.

That already reduces ambiguity.

The clever part is that the spec also defines acceptance criteria. TOTP secrets must be encrypted at rest. Backup codes must be hashed. Invalid attempts must be rate-limited. Audit events must be written. Unit, integration, and end-to-end tests must cover the sensitive flows.
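To make the idea concrete, here is a minimal sketch of a spec held as data, with acceptance criteria that tests can be mapped to. The class names, the criterion ids (AC-1 and so on), and the test names are all hypothetical; none of this is the file format of any tool named in this article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AcceptanceCriterion:
    """One testable statement that must hold before the work is accepted."""
    id: str
    text: str


@dataclass
class Spec:
    """Hypothetical in-repo representation of a feature spec."""
    title: str
    criteria: list[AcceptanceCriterion]
    # Maps a criterion id to the names of tests that claim to cover it.
    tests_by_criterion: dict[str, list[str]] = field(default_factory=dict)

    def uncovered(self) -> list[str]:
        """Return ids of criteria that no test claims to cover."""
        return [c.id for c in self.criteria
                if not self.tests_by_criterion.get(c.id)]


mfa_spec = Spec(
    title="TOTP-based MFA",
    criteria=[
        AcceptanceCriterion("AC-1", "TOTP secrets are encrypted at rest"),
        AcceptanceCriterion("AC-2", "Backup codes are hashed and single-use"),
        AcceptanceCriterion("AC-3", "Invalid attempts are rate-limited"),
        AcceptanceCriterion("AC-4", "Audit events are written"),
    ],
    # Illustrative test names; only two criteria are covered so far.
    tests_by_criterion={
        "AC-1": ["test_secret_is_encrypted_at_rest"],
        "AC-2": ["test_backup_code_single_use"],
    },
)

print(mfa_spec.uncovered())  # ['AC-3', 'AC-4']
```

A check like `uncovered()` is the kind of thing a release gate could run: the work is not accepted while any criterion has no test mapped to it. A mapped test only claims coverage, so a human still has to judge whether it exercises the risky path.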

Now the work can move into the task layer.

The feature becomes a task graph: design the data model, implement encrypted secret storage, add the enrollment endpoint, add the login challenge flow, add backup codes, add audit events. The risky tasks are marked as risky. Dependencies are explicit. Verification is attached to each task.
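A task graph like this can be sketched with Python's standard `graphlib` module. The task names, the dependency edges, and the choice of which tasks count as risky are my assumptions for the MFA example, not the output of any task tool.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. Names are illustrative.
dependencies = {
    "design_data_model": set(),
    "encrypted_secret_storage": {"design_data_model"},
    "enrollment_endpoint": {"encrypted_secret_storage"},
    "login_challenge_flow": {"enrollment_endpoint"},
    "backup_codes": {"encrypted_secret_storage"},
    "audit_events": {"enrollment_endpoint", "login_challenge_flow", "backup_codes"},
}

# Assumption: these tasks are flagged for human review before an agent starts.
risky = {"encrypted_secret_storage", "login_challenge_flow", "backup_codes"}


def execution_plan(deps, risky_tasks):
    """Return (task, needs_review) pairs in a dependency-respecting order.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which is a useful early failure for a malformed task breakdown.
    """
    order = TopologicalSorter(deps).static_order()
    return [(task, task in risky_tasks) for task in order]


plan = execution_plan(dependencies, risky)
for task, needs_review in plan:
    flag = "REVIEW FIRST" if needs_review else "ok to automate"
    print(f"{task}: {flag}")
```

The exact order among independent tasks can vary, but every task appears after the tasks it depends on. Explicit dependencies give exactly this guarantee.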

This is where tools like Taskmaster AI fit well. They do not make bad requirements good, but they can turn a decent PRD into a more concrete execution queue.

Then comes the behavior layer.

This is where a framework like Superpowers or a method like BMAD can influence how the agent behaves. The agent should not immediately write production code. It should clarify intent, propose a plan, write failing tests, implement narrowly, run checks, and review its own diff against the acceptance criteria.

That sounds basic. It is not basic when agents are running fast and context is changing.

Then comes orchestration.

A runtime like LangGraph, Microsoft Agent Framework, Google ADK, CrewAI, or LlamaIndex Workflows can model the full process as a workflow: load the spec, validate acceptance criteria, require design approval, pick the next task, run implementation, run tests, request review, update state, and open a pull request.

This is the layer that many teams underestimate.

A CLI command can create files. A coding agent can edit code. A task tool can say what comes next. But a serious workflow needs durable state.

It needs to know what was approved, what failed, what was retried, which tool was called, what output was trusted, and when a human must step in.

That is why orchestration is different from prompting. Prompting tells the agent what to do now. Orchestration controls the system across many steps.
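The difference is easier to see in code. Below is a small sketch in plain Python, deliberately not using the API of LangGraph or any other runtime named here. The step names, the JSON state file, and the `advance` and `approve` functions are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical step names taken from the workflow described above.
STEPS = [
    "load_spec", "validate_criteria", "design_approval", "pick_task",
    "implement", "run_tests", "request_review", "open_pr",
]
# Steps that may only complete once a human approval is recorded.
HUMAN_GATES = {"design_approval", "request_review"}


def load_state(path: Path) -> dict:
    """Read persisted state, or start a fresh run if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "approvals": {}, "log": []}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))


def approve(path: Path, step: str, approver: str) -> None:
    """Record who approved a gated step."""
    state = load_state(path)
    state["approvals"][step] = approver
    save_state(path, state)


def advance(path: Path, handlers: dict) -> str:
    """Run steps in order until done, failed, or blocked on a human gate."""
    state = load_state(path)
    for step in STEPS:
        if step in state["completed"]:
            continue  # finished in an earlier run; do not repeat it
        if step in HUMAN_GATES and step not in state["approvals"]:
            state["log"].append(f"blocked: {step} awaits approval")
            save_state(path, state)
            return f"waiting:{step}"
        ok = handlers.get(step, lambda: True)()  # missing handler = no-op success
        state["log"].append(f"{step}: {'ok' if ok else 'failed'}")
        if not ok:
            save_state(path, state)
            return f"failed:{step}"
        state["completed"].append(step)
        save_state(path, state)  # persist after every step
    return "done"


state_file = Path(tempfile.mkdtemp()) / "run_state.json"
first = advance(state_file, {})
print(first)   # waiting:design_approval
approve(state_file, "design_approval", "alice")
second = advance(state_file, {"run_tests": lambda: False})
print(second)  # failed:run_tests
```

Because the state lives on disk rather than in a prompt, a later run resumes after the last completed step, and the file itself answers what was approved, by whom, and what failed. A real runtime adds much more (retries, parallelism, tool calls), but the shape is the same.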

What the analysis found

The strongest finding is that the tools become much easier to reason about once they are separated by control layer.

GitHub Spec Kit looks like a strong default reference for formal spec-driven development because it creates repo-based artifacts across principles, specs, plans, tasks, and implementation. Its strength is not that it magically writes better code. Its strength is that it moves intent and planning out of chat history and into files that can be reviewed.

OpenSpec looks stronger for brownfield change control. Its change-folder model is simple, reviewable, and easy to wrap with an orchestrator. That matters in existing systems, where the problem is often not "create a perfect new architecture," but "make this change without losing the reason, scope, and verification trail."

Kiro is interesting because it brings the spec workflow inside the IDE. That can be useful for teams that do not want to assemble several open-source pieces. The trade-off is that Kiro already has its own workflow model, so whether wrapping it in an external orchestrator is worthwhile depends on the team's context.

Spec Workflow MCP has a clean integration story because it is MCP-native and includes approvals, dashboard visibility, task progress, and implementation logs. That makes it attractive when agents need to call into spec state directly. The cautions are licensing and the operational surface it adds. MCP-native does not automatically mean production-safe.

Taskmaster AI stands out in the PRD-to-task category. Its value is turning a PRD into a task structure that agents and workflows can act on. But it should not be mistaken for architecture review. If the PRD is weak, Taskmaster can produce a tidy task list for the wrong work.

Ralph is one of the more interesting patterns because it represents an autonomous loop: pick one failing story, implement it, run checks, commit if checks pass, update progress, repeat. That is powerful in the right environment. It is also dangerous in the wrong one. For auth, security, migrations, or weak test suites, "tests pass" is not enough.
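The loop can be sketched in a few lines. This is my own simplified reading of the pattern, not Ralph's actual code. The story fields (`id`, `passes`, `risk`) and the injected `implement`, `run_checks`, and `commit` callables are hypothetical, and the stop on high-risk stories is a guard I added to reflect the caution above.

```python
def ralph_style_loop(stories, implement, run_checks, commit, max_iterations=10):
    """Pick one failing story at a time; commit only when checks pass."""
    progress = []
    for _ in range(max_iterations):
        story = next((s for s in stories if not s["passes"]), None)
        if story is None:
            break  # every story passes; nothing left to do
        if story.get("risk") == "high":
            # Added guard: hand off to a human instead of automating.
            progress.append((story["id"], "stopped_for_human"))
            break
        implement(story)
        if run_checks(story):
            commit(story)
            story["passes"] = True
            progress.append((story["id"], "committed"))
        else:
            # The same story is retried on the next iteration.
            progress.append((story["id"], "checks_failed"))
    return progress


# Illustrative run with stub callables standing in for a real agent and CI.
stories = [
    {"id": "S1", "passes": False, "risk": "low"},
    {"id": "S2", "passes": False, "risk": "high"},
]
commits = []
result = ralph_style_loop(
    stories,
    implement=lambda story: None,
    run_checks=lambda story: True,
    commit=lambda story: commits.append(story["id"]),
)
print(result)   # [('S1', 'committed'), ('S2', 'stopped_for_human')]
print(commits)  # ['S1']
```

Everything in this loop trusts `run_checks`. If that function only exercises the happy path, the loop will commit with full confidence and no evidence.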

BMAD, Superpowers, GSD, and SPARC sit closer to method and agent behavior. They matter because many AI coding failures are behavioral. The agent starts too early, skips tests, forgets context, or explains around weak evidence. These tools and methods can improve discipline, but they do not replace durable workflow state or approval gates.

The orchestration runtimes are a different class. Microsoft Agent Framework, Google ADK, LangGraph, CrewAI, LlamaIndex Workflows, Ruflo, and Claude Flow are about execution control. They are not just there to "use agents." They manage how agents, tools, memory, events, approvals, retries, and state interact.

That is why comparing Ruflo or Claude Flow directly with Spec Kit is misleading. One is closer to an orchestration system. The other is closer to a spec-control tool. They may both appear in an AI delivery stack, but they are not solving the same problem.

Why this matters

This matters because teams are starting to build AI-assisted delivery systems before they have a clear architecture for control.

A developer can use these tools solo and still get value. OpenSpec, GSD, or Spec Kit can bring discipline to local development. They force the work to be written down before the agent starts editing files.

A startup team may benefit from Spec Kit plus Taskmaster because it creates a lightweight bridge from feature intent to executable tasks. That is useful when the team is moving fast and needs enough structure without building a full internal platform.

A brownfield engineering team may get more value from OpenSpec plus LangGraph because existing systems need bounded changes, reviewable proposals, and workflow state around implementation and testing.

An enterprise team may need Spec Workflow MCP or Spec Kit wrapped by Microsoft Agent Framework, Google ADK, or another orchestration runtime. In that environment, auditability, approval, security boundaries, and integration with existing engineering systems matter as much as code generation.

The practical takeaway is this:

Do not start by asking which tool is best.

Start by asking where the work becomes unsafe or unclear.

If the problem is vague requirements, improve the spec layer. If the problem is messy execution, improve the task layer. If the problem is agents behaving badly, improve the behavior layer. If the problem is long-running workflows, retries, and approvals, improve orchestration. If the problem is false confidence, improve verification. If the problem is risk ownership, improve governance.
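The diagnosis above is really a lookup from symptom to layer. Here is a small sketch; the symptom strings and the `diagnose` helper are just my shorthand for the mapping in the previous paragraph.

```python
from enum import Enum


class Layer(Enum):
    SPEC = "spec"
    TASK = "task"
    BEHAVIOR = "behavior"
    ORCHESTRATION = "orchestration"
    VERIFICATION = "verification"
    GOVERNANCE = "governance"


# Shorthand symptom labels; a real team would write its own.
SYMPTOM_TO_LAYER = {
    "vague requirements": Layer.SPEC,
    "messy execution": Layer.TASK,
    "agents behaving badly": Layer.BEHAVIOR,
    "long-running workflows, retries, approvals": Layer.ORCHESTRATION,
    "false confidence": Layer.VERIFICATION,
    "unclear risk ownership": Layer.GOVERNANCE,
}


def diagnose(symptoms):
    """Return the layers to improve, in first-seen order, without duplicates."""
    layers = []
    for symptom in symptoms:
        if symptom not in SYMPTOM_TO_LAYER:
            raise ValueError(f"unknown symptom: {symptom!r}")
        layer = SYMPTOM_TO_LAYER[symptom]
        if layer not in layers:
            layers.append(layer)
    return layers


print(diagnose(["false confidence", "vague requirements"]))
# [<Layer.VERIFICATION: 'verification'>, <Layer.SPEC: 'spec'>]
```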

That framing saves time. It also prevents teams from blaming the wrong tool.

Limitations and honest skepticism

The evidence here has a clear boundary.

Most of the analysis comes from public documentation and GitHub repositories: the workflow descriptions, commands, file structures, MCP support, approval concepts, and integration surfaces they describe.

That is enough to judge workflow shape. It is not enough to prove production outcomes.

It does not prove that one tool reduces defects. It does not prove enterprise adoption. It does not prove generated-code quality. It does not prove security completeness. It does not prove ROI.

That is why decimal scores would be misleading. A score like 4.6 out of 5 sounds scientific, but this kind of analysis does not support that level of precision. Coarse judgments like High, Medium, Low, or Context-dependent are more honest.

There is also a danger of tool stacking.

It is easy to imagine a beautiful architecture with a spec tool, a task tool, an orchestration runtime, MCP servers, CI gates, dashboards, and approval workflows. It can look mature on paper and still fail in practice.

If nobody reviews the specs, the spec layer becomes paperwork. If tests are weak, autonomous loops become risky. If MCP servers are trusted blindly, integration becomes a security hole. If approval gates are too heavy, developers route around them. If the orchestrator is poorly designed, the system becomes harder to debug than the original manual process.

Spec-driven AI development does not remove engineering judgment. It makes the missing judgment more visible.

My personal take

What I find most useful in this analysis is the shift from tool comparison to control diagnosis.

That sounds small, but it changes the whole conversation.

When people talk about AI coding tools, the discussion often gets pulled toward model quality or impressive demos. Can it build the app? Can it fix the bug? Can it generate tests? Can it refactor the module?

Those questions still matter.

But in real delivery, the question I care about more is: can I trust the path from intent to shipped change?

I do not want an agent that only writes code quickly. I want a system where the requirement is clear, the plan is visible, the risky parts are marked, the tests map to acceptance criteria, the approvals are recorded, and the final pull request can be reviewed against the original intent.

That is why I like the layered model.

It does not pretend there is one magic tool. It also avoids the opposite mistake: dismissing AI coding because agents sometimes drift. The answer is not blind trust or total rejection. The answer is architecture.

For me, the most interesting pattern is the combination of a simple spec/control layer with a real orchestrator.

Something like OpenSpec plus LangGraph is attractive for brownfield work because the change folder becomes the state object. The orchestrator can read it, execute against it, update it, and stop when the evidence is missing.

Spec Kit plus Taskmaster plus a stronger orchestration runtime also makes sense for teams that want a more formal flow from requirement to task to implementation.

Superpowers-style behavior instructions are useful, but I would treat them as policy, not state. They can make agents behave better, but they do not give me the durable trail I need when something goes wrong.

Ralph-style loops are fascinating, but I would use them carefully. They make sense when stories are small, tests are strong, and the repo is safe to automate. I would not start there for security-sensitive work.

The bigger lesson is that AI software delivery needs fewer magic demos and more boring control surfaces.

Approved specs. Task state. Test evidence. Allowlisted tools. Logged actions. Rollback rules. Human gates where the risk justifies them.

That may sound less exciting than an autonomous swarm building an app overnight. But it is the part that makes the overnight build usable the next morning.

The architecture I would start with

For a serious team, I would start with a simple layered stack:

[Figure: Practical spec-driven AI delivery workflow]

Spec layer: GitHub Spec Kit, OpenSpec, Kiro, or Spec Workflow MCP.

Task layer: Taskmaster AI, a Ralph-style task JSON model, or the team's existing issue tracker.

Orchestration layer: Microsoft Agent Framework, Google ADK, LangGraph, CrewAI, or LlamaIndex Workflows.

Tool layer: trusted MCP servers, GitHub, Jira or Azure DevOps, CI, test runners, docs, and databases.

Verification layer: unit tests, integration tests, end-to-end tests, security scans, migration tests, and acceptance-criteria mapping.

Governance layer: human approval, PR review, audit logs, rollback rules, and restrictions around destructive commands.
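As one small example of what the governance layer can look like in code, here is a sketch of a command gate that combines an allowlist, an approval requirement for destructive commands, and an audit log. The allowlist contents, the choice of which git subcommands need approval, and the `review_command` function are all assumptions for illustration.

```python
import shlex
from datetime import datetime, timezone

# Assumption: programs an agent may run at all.
ALLOWED_PROGRAMS = {"git", "pytest", "ls"}
# Assumption: (program, subcommand) pairs that need a recorded human approval.
NEEDS_APPROVAL = {("git", "push"), ("git", "reset")}


def review_command(command, audit_log, approved_by=None):
    """Return 'allow', 'needs_approval', or 'deny', and log the decision."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_PROGRAMS:
        decision = "deny"
    elif tuple(tokens[:2]) in NEEDS_APPROVAL and approved_by is None:
        decision = "needs_approval"
    else:
        decision = "allow"
    # Every decision is logged, including denials.
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "decision": decision,
        "approved_by": approved_by,
    })
    return decision


log = []
print(review_command("pytest -q", log))                            # allow
print(review_command("rm -rf build", log))                         # deny
print(review_command("git push origin main", log))                 # needs_approval
print(review_command("git push origin main", log, approved_by="alice"))  # allow
print(len(log))                                                    # 4
```

This gate only decides and records; it does not execute anything. Keeping the decision separate from execution makes the policy easy to test and means the log can answer "who allowed that?" after the fact.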

That stack does not need to be heavy on day one. A solo developer does not need enterprise governance. A startup does not need a giant approval board. But every team needs to know which layer is responsible for which kind of control.

The worst architecture is the invisible one: a long prompt, a powerful agent, a repo full of edits, and no durable answer to "why did it do that?"

Final thought

AI coding is getting fast enough that speed is no longer the main story.

The next serious question is control.

Can the team preserve intent while the agent works? Can it prove the change satisfies the requirement? Can it stop the workflow when risk appears? Can it explain, days later, why the code changed?

Spec-driven AI development is useful when it answers those questions.

If it becomes another tool leaderboard, it misses the point.
