In this article
- Observations: the problem with large MRs and AI code generation
- Hypothesis: specifications as a contract
- Method: storage structure and specification sizes
- The Role of AI: editor and executor
- Results: reproducibility experiment
- Limitations and Conclusions
Observations
4000 lines in a single MR. Three hours reviewing, 12 comments, fixes - another 800 lines. On the fourth attempt I closed the tab and realized: the problem isn’t the code, it’s that nobody knew what exactly needed to be written.
If you work with large codebases, this situation is familiar. Large MRs are a symptom. When it’s unclear what exactly needs to be done, developers write more code than necessary. They add things just in case. Cover scenarios nobody asked for. The MR grows not because the task is big, but because the boundaries are fuzzy.
Another cause is the illusion that it’s easier to do everything in one task than to decompose. It seems like splitting creates extra work. In practice, a monolithic 4000-line MR can’t be properly reviewed, and bugs slip through to production.
I use AI daily. Claude Code is my main tool - it can be configured to use different models: Anthropic, DeepSeek, GPT family, local ones via Ollama. At some point I noticed a pattern: the more precisely I formulate the task, the better the result. Prompts became increasingly structured - simple instructions, then templates, then something resembling technical specifications.
Where’s the problem? At first I thought I was simply too lazy to give the AI detailed instructions. Then I decided the AI Agent wasn’t gathering enough context and needed RAG or something similar. Eventually I realized the problem is in both places: writing full instructions feels like overkill, and it’s unclear how to help the agent gather the right context.
But the main thing is that there’s nothing to verify: no artifact you can point to and say, this is what was specified, and this is what was built.
Hypothesis
The idea isn’t new - the industry has been talking about spec-first approach for a while. I wanted to try it for a long time and finally decided to test it.
If requirements are formalized as a specification before coding begins, then:
- AI Agent will generate more predictable code
- Results can be validated against the specification
- Architecture will remain controllable
When AI Agent generates hundreds of lines of code per minute, the only way to control the result is to have a formal description of what should be produced, and a tool that verifies the implementation matches the specification. More on the latter in a separate article.
Method
I decided to test the hypothesis on a real project - a tool for building architectural graphs from Go code. The rule was simple: not a single line of code without a specification.
First specification - create an empty project with standard Go layout. Second - graph model. Third - Go code analyzer. Over several days, 10 completed specifications accumulated.
Storage Structure
Kanban-like organization via the file system.
Prioritization is via a numeric prefix: the lower the number, the higher the priority. State transitions are just moving files between directories.
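A hypothetical layout; the actual directory and file names in the project may differ:

```
specs/
├── backlog/
│   ├── 010-go-code-analyzer.md
│   └── 020-sequence-diagrams.md
├── in-progress/
│   └── 005-collect-command.md
└── done/
    ├── 001-project-init.md
    └── 002-graph-model.md
```

Picking up the next task means taking the lowest-numbered file in backlog/ and moving it to in-progress/; completing it means moving the file to done/.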
Specification Sizes
Classification by time to write the specification (T-shirt sizing):
- S (Small) - up to 10 minutes
- M (Medium) - 10-20 minutes
- L (Large) - more than 20 minutes
Size determines the depth of elaboration. An S-task gets a Problem Statement and 5 acceptance criteria; an L-task gets full UML/C4 diagrams, detailed Requirements, and 15+ acceptance criteria. The correlation between specification size and result predictability is direct: the more elaborate the specification, the more predictable the result.
If you’re starting from scratch - try an S-specification first. Minimal investment for the experiment.
S-Specification Example: Project Initialization
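In rough form, an S-specification of this kind looks like the sketch below; it is simplified, and the actual templates live in the repository linked at the end of the article:

```
Specification: Project Initialization
Size: S

Problem Statement
Create an empty Go project with the standard layout: go.mod, cmd/, internal/,
a Makefile, and a README.

Acceptance Criteria
1. go build ./... succeeds on a clean checkout
2. go test ./... passes
3. cmd/ contains the CLI entry point and it compiles
4. internal/ contains the package skeleton for future components
5. README describes how to build and run the tool
```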
Minimum details, maximum specificity. AI gets clear task boundaries.
M-Specification Example: Collect Command
Medium tasks require diagrams. I experimented with Sequence diagrams, sending them to the agent along with the requirements, and noticed that with them the AI Agent produces roughly what’s expected.
The sequence diagram defines the call order, and the AI follows it literally.
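For illustration, here is a minimal Go sketch of what “following the call order” looks like in the generated code; the package, function, and type names are hypothetical, not the actual archlint code:

```go
// Hypothetical sketch of a collect command following the call order fixed
// by a sequence diagram: load packages -> analyze -> build graph -> write.
package collect

import (
	"fmt"
	"os"
)

type component struct{ name string }

type graph struct{ components []component }

// loadPackages stands in for real package loading
// (a real analyzer would use something like golang.org/x/tools/go/packages).
func loadPackages(path string) ([]string, error) {
	return []string{path}, nil
}

// analyzePackages extracts one component per package in this toy version.
func analyzePackages(pkgs []string) []component {
	out := make([]component, 0, len(pkgs))
	for _, p := range pkgs {
		out = append(out, component{name: p})
	}
	return out
}

// buildGraph assembles the architectural graph from extracted components.
func buildGraph(cs []component) graph { return graph{components: cs} }

// writeGraph persists a trivial textual summary of the graph.
func writeGraph(g graph, output string) error {
	summary := fmt.Sprintf("%d components\n", len(g.components))
	return os.WriteFile(output, []byte(summary), 0o644)
}

// RunCollect follows the diagram's call order literally: steps 1-4, never reordered.
func RunCollect(path, output string) error {
	pkgs, err := loadPackages(path) // 1. load Go packages from the target path
	if err != nil {
		return err
	}
	components := analyzePackages(pkgs) // 2. extract components and dependencies
	g := buildGraph(components)         // 3. assemble the graph
	return writeGraph(g, output)        // 4. persist the result
}
```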
L-Specification Example: Go Code Analyzer
Large tasks require multiple diagrams - a Data Model and a Sequence diagram.
For L-tasks, multiple diagrams are a necessity. Data Model, Sequence, Component - together they define the application architecture and manage dependencies between components.
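As an illustration, the Data Model part of such a specification might translate into Go types along these lines - a hypothetical sketch, not the actual archlint model:

```go
// Hypothetical data model for an architectural graph of a Go codebase.
// Type and field names are illustrative only.
package model

// NodeKind distinguishes the kinds of architectural elements in the graph.
type NodeKind string

const (
	KindPackage  NodeKind = "package"
	KindStruct   NodeKind = "struct"
	KindFunction NodeKind = "function"
)

// Node is a single element of the architecture: a package, type, or function.
type Node struct {
	ID   string
	Kind NodeKind
	Name string
}

// Edge is a directed dependency between two nodes (import, call, embedding).
type Edge struct {
	From string // ID of the dependent node
	To   string // ID of the node it depends on
	Kind string // e.g. "imports", "calls"
}

// Graph ties nodes and edges together; it is what the analyzer produces
// and what later stages (diagram building, validation) consume.
type Graph struct {
	Nodes map[string]Node
	Edges []Edge
}

// AddNode inserts a node, lazily initializing the map.
func (g *Graph) AddNode(n Node) {
	if g.Nodes == nil {
		g.Nodes = make(map[string]Node)
	}
	g.Nodes[n.ID] = n
}
```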
The Role of AI
AI accelerates the work, but I don’t delegate architecture: decisions are fixed in the specification, go through review, and are verified by validators. However, decisions rarely come from thin air: I bring initial options and constraints, the agent suggests alternatives and highlights blind spots. This affects my thinking, and I’m aware of it. The final yes/no and responsibility are mine.
The agent has two areas of responsibility.
Specification Editor
I dictate a raw stream of thoughts by voice (I dictate faster than I type). The agent formats it according to my specification template: it organizes the text into sections, clarifies the unsaid, and formulates requirements and acceptance criteria so they can be verified. After that I review the result and lock in the specification as the source contract.
Implementation Executor
When the specification is agreed upon, I hand it to the agent for implementation. The agent writes code according to the specification, and I verify the result: review, validation, iterations until the architecture becomes clean and predictable.
Results
Over several days: 10 completed specifications and a working project. The code matches the architecture from the diagrams.
To test whether specifications are self-sufficient, I ran an experiment: gave Claude Code an empty directory and 10 specifications from archlint - no access to source code. The task: recreate the project from scratch.
Result in 20 minutes:
- 85.5% reproduction success rate
- 100% structural identity (directories, files, types)
- 23 mutations in implementation details
The project structure was reproduced completely, all acceptance criteria from the specifications were met, and the project compiles and passes tests.
Mutations occurred where specifications described what to do, but not how. Critical example: the sequence diagram building algorithm was implemented differently - functionally equivalent, but with different call stack traversal logic. Another category of mutations - stylistic: comment language, function order in files, variable naming.
Takeaway for improving specifications: critical algorithms need pseudocode or concrete input/output examples. A specification with what + how yields more precise reproduction than what alone.
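For example, “concrete input/output examples” can simply be a small table of expected pairs that the implementation must satisfy. A sketch, using a hypothetical helper that is not part of archlint:

```go
package specs

import "testing"

// priorityFromFilename is a hypothetical helper: it reads the numeric
// priority prefix from a spec filename ("005-collect-command.md" -> 5).
func priorityFromFilename(name string) int {
	n := 0
	for _, r := range name {
		if r < '0' || r > '9' {
			break
		}
		n = n*10 + int(r-'0')
	}
	return n
}

// The table below is the kind of input/output contract worth embedding
// directly in a specification: it leaves no room for interpretation.
func TestPriorityFromFilename(t *testing.T) {
	cases := []struct {
		in   string
		want int
	}{
		{"001-project-init.md", 1},
		{"005-collect-command.md", 5},
		{"020-go-code-analyzer.md", 20},
		{"no-prefix.md", 0},
	}
	for _, c := range cases {
		if got := priorityFromFilename(c.in); got != c.want {
			t.Errorf("priorityFromFilename(%q) = %d, want %d", c.in, got, c.want)
		}
	}
}
```

The same table can live in the specification itself and in the test suite, so the agent and the validator check against identical examples.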
Full report with mutation catalog: github.com/mshogin/archlint-reproduction
Specification Idempotency
For specifications to remain reproducible, all changes must go through them. No tweaks in copilot mode, no chats with “fix this thing”. Every change - update the specification, then implement.
This is the main challenge. You want to quickly fix a bug right in the dialogue with the agent rather than go back to the spec. But every such fix is a loss of reproducibility.
The trade-off is obvious:
- Need results here and now - copilot mode is faster
- Need reproducibility - only through specifications
The choice depends on context. Prototype or experiment - copilot. Production code with a long lifecycle - specifications.
Limitations
The approach doesn’t solve problems automatically. It doesn’t replace domain understanding, doesn’t invent requirements, and doesn’t guarantee good design. It makes errors visible earlier and forces architecture to stay within bounds.
The cost: time for writing specifications, reviewing them, and iterations after validation. If you treat the specs as a mere formality, everything slides back into chaos. Honestly ask yourself: are you willing to spend 10-30 minutes on a specification so the agent can implement it in 5-20 minutes?
Implementation details vary. The reproducibility experiment showed 23 mutations - algorithms are interpreted differently, code style differs. Critical sections need pseudocode, not just descriptions.
I think the approach works well where there is an established, working process: one built on discipline, clear areas of responsibility, review, and a definition of done. You can think of such a process as a conveyor delivering software 24/7.
Conclusions
The hypothesis was confirmed:
- AI generates more predictable code - yes, with diagrams present
- Results can be validated - yes, 85.5% reproducibility
- Architecture remains controllable - yes, 100% structural identity
The essence is simple: without a specification there’s nothing to verify; with one, there’s an artifact for validation. You don’t need a perfect AI or a perfect prompt.
The experiment continues.
Templates and examples: github.com/mshogin/archlint
If you’re trying the spec-driven approach or already using it - share in the comments what works and what doesn’t. I write about AI code generation and architecture practices on Telegram: @MikeShogin