In this article
- Observations: the problem with large MRs and AI code generation
- Hypothesis: specifications as a contract
- Method: storage structure and specification sizes
- The Role of AI: editor and executor
- Results: reproducibility experiment
- Limitations and Conclusions
Observations
4000 lines in a single MR. Three hours reviewing, 12 comments, fixes - another 800 lines. On the fourth attempt I closed the tab and realized: the problem isn’t the code, it’s that nobody knew what exactly needed to be written.
If you work with large codebases, this situation is familiar. Large MRs are a symptom. When it’s unclear what exactly needs to be done, developers write more code than necessary. They add things just in case. Cover scenarios nobody asked for. The MR grows not because the task is big, but because the boundaries are fuzzy.
Another cause is the illusion that it’s easier to do everything in one task than to decompose. It seems like splitting creates extra work. In practice, a monolithic 4000-line MR can’t be properly reviewed, and bugs slip through to production.
I use AI daily. Claude Code is my main tool - it can be configured to use different models: Anthropic, DeepSeek, GPT family, local ones via Ollama. At some point I noticed a pattern: the more precisely I formulate the task, the better the result. Prompts became increasingly structured - simple instructions, then templates, then something resembling technical specifications.
Where’s the problem? At first I thought I was simply too lazy to give the AI detailed instructions. Then I decided the AI Agent wasn’t gathering enough context and needed RAG or something similar. Eventually I realized the problem is in both places: writing full instructions feels like overkill, and it’s unclear how to help the agent gather the right context.
But the main thing is that there’s nothing to verify: no artifact you can point to and say, this is what was specified, and this is what was built.
Hypothesis
The idea isn’t new - the industry has been talking about spec-first approach for a while. I wanted to try it for a long time and finally decided to test it.
If requirements are formalized as a specification before coding begins, then:
- AI Agent will generate more predictable code
- Results can be validated against the specification
- Architecture will remain controllable
When AI Agent generates hundreds of lines of code per minute, the only way to control the result is to have a formal description of what should be produced, and a tool that verifies the implementation matches the specification. More on the latter in a separate article.
Method
I decided to test the hypothesis on a real project - a tool for building architectural graphs from Go code. The rule was simple: not a single line of code without a specification.
First specification - create an empty project with standard Go layout. Second - graph model. Third - Go code analyzer. Over several days, 10 completed specifications accumulated.
Storage Structure
Kanban-like organization via the file system.
Prioritization is via a numeric prefix: the lower the number, the higher the priority. State transitions are just moving files between directories.
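A hypothetical layout; the actual directory and file names in the project may differ:

```
specs/
├── backlog/
│   ├── 010-go-code-analyzer.md
│   └── 020-sequence-diagrams.md
├── in-progress/
│   └── 005-collect-command.md
└── done/
    ├── 001-project-init.md
    └── 002-graph-model.md
```

Picking up the next task means taking the lowest-numbered file in backlog/ and moving it to in-progress/; completing it means moving the file to done/.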
Specification Sizes
Classification by time to write the specification (T-shirt sizing):
- S (Small) - up to 10 minutes
- M (Medium) - 10-20 minutes
- L (Large) - more than 20 minutes
Size determines the depth of elaboration. An S-task gets a Problem Statement and 5 acceptance criteria; an L-task gets full UML/C4 diagrams, detailed Requirements, and 15+ acceptance criteria. The correlation between specification size and result predictability is direct: the more elaborate the specification, the more predictable the result.
If you’re starting from scratch - try an S-specification first. Minimal investment for the experiment.
S-Specification Example: Project Initialization
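In rough form, an S-specification of this kind looks like the sketch below; it is simplified, and the actual templates live in the repository linked at the end of the article:

```
Specification: Project Initialization
Size: S

Problem Statement
Create an empty Go project with the standard layout: go.mod, cmd/, internal/,
a Makefile, and a README.

Acceptance Criteria
1. go build ./... succeeds on a clean checkout
2. go test ./... passes
3. cmd/ contains the CLI entry point and it compiles
4. internal/ contains the package skeleton for future components
5. README describes how to build and run the tool
```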
Minimum details, maximum specificity. AI gets clear task boundaries.
M-Specification Example: Collect Command
Medium tasks require diagrams. I experimented with Sequence diagrams, sending them to the agent along with the requirements, and noticed that with them the AI Agent produces roughly what’s expected.
The sequence diagram defines the call order, and the AI follows it literally.
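For illustration, here is a minimal Go sketch of what “following the call order” looks like in the generated code; the package, function, and type names are hypothetical, not the actual archlint code:

```go
// Hypothetical sketch of a collect command following the call order fixed
// by a sequence diagram: load packages -> analyze -> build graph -> write.
package collect

import (
	"fmt"
	"os"
)

type component struct{ name string }

type graph struct{ components []component }

// loadPackages stands in for real package loading
// (a real analyzer would use something like golang.org/x/tools/go/packages).
func loadPackages(path string) ([]string, error) {
	return []string{path}, nil
}

// analyzePackages extracts one component per package in this toy version.
func analyzePackages(pkgs []string) []component {
	out := make([]component, 0, len(pkgs))
	for _, p := range pkgs {
		out = append(out, component{name: p})
	}
	return out
}

// buildGraph assembles the architectural graph from extracted components.
func buildGraph(cs []component) graph { return graph{components: cs} }

// writeGraph persists a trivial textual summary of the graph.
func writeGraph(g graph, output string) error {
	summary := fmt.Sprintf("%d components\n", len(g.components))
	return os.WriteFile(output, []byte(summary), 0o644)
}

// RunCollect follows the diagram's call order literally: steps 1-4, never reordered.
func RunCollect(path, output string) error {
	pkgs, err := loadPackages(path) // 1. load Go packages from the target path
	if err != nil {
		return err
	}
	components := analyzePackages(pkgs) // 2. extract components and dependencies
	g := buildGraph(components)         // 3. assemble the graph
	return writeGraph(g, output)        // 4. persist the result
}
```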
L-Specification Example: Go Code Analyzer
Large tasks require multiple diagrams - a Data Model and a Sequence diagram.
For L-tasks, multiple diagrams are a necessity. Data Model, Sequence, Component - together they define the application architecture and manage dependencies between components.
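As an illustration, the Data Model part of such a specification might translate into Go types along these lines - a hypothetical sketch, not the actual archlint model:

```go
// Hypothetical data model for an architectural graph of a Go codebase.
// Type and field names are illustrative only.
package model

// NodeKind distinguishes the kinds of architectural elements in the graph.
type NodeKind string

const (
	KindPackage  NodeKind = "package"
	KindStruct   NodeKind = "struct"
	KindFunction NodeKind = "function"
)

// Node is a single element of the architecture: a package, type, or function.
type Node struct {
	ID   string
	Kind NodeKind
	Name string
}

// Edge is a directed dependency between two nodes (import, call, embedding).
type Edge struct {
	From string // ID of the dependent node
	To   string // ID of the node it depends on
	Kind string // e.g. "imports", "calls"
}

// Graph ties nodes and edges together; it is what the analyzer produces
// and what later stages (diagram building, validation) consume.
type Graph struct {
	Nodes map[string]Node
	Edges []Edge
}

// AddNode inserts a node, lazily initializing the map.
func (g *Graph) AddNode(n Node) {
	if g.Nodes == nil {
		g.Nodes = make(map[string]Node)
	}
	g.Nodes[n.ID] = n
}
```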
The Role of AI
AI accelerates the work, but I don’t delegate architecture: decisions are fixed in the specification, go through review, and are verified by validators. However, decisions rarely come from thin air: I bring initial options and constraints, the agent suggests alternatives and highlights blind spots. This affects my thinking, and I’m aware of it. The final yes/no and responsibility are mine.
The agent has two areas of responsibility.
Specification Editor
I dictate a raw stream of thoughts by voice (I dictate faster than I type). The agent formats it according to my specification template: it organizes the text into sections, clarifies the unsaid, and formulates requirements and acceptance criteria so they can be verified. After that I review the result and lock in the specification as the source contract.
Implementation Executor
When the specification is agreed upon, I hand it to the agent for implementation. The agent writes code according to the specification, and I verify the result: review, validation, iterations until the architecture becomes clean and predictable.
Results
Over several days: 10 completed specifications and a working project. The code matches the architecture from the diagrams.
To test whether specifications are self-sufficient, I ran an experiment: gave Claude Code an empty directory and 10 specifications from archlint - no access to source code. The task: recreate the project from scratch.
Result in 20 minutes:
- 85.5% reproduction success rate
- 100% structural identity (directories, files, types)
- 23 mutations in implementation details
The project structure was reproduced completely, all acceptance criteria from the specifications were met, and the project compiles and passes tests.
Mutations occurred where specifications described what to do, but not how. Critical example: the sequence diagram building algorithm was implemented differently - functionally equivalent, but with different call stack traversal logic. Another category of mutations - stylistic: comment language, function order in files, variable naming.
Takeaway for improving specifications: critical algorithms need pseudocode or concrete input/output examples. A specification with what + how yields more precise reproduction than what alone.
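For example, “concrete input/output examples” can simply be a small table of expected pairs that the implementation must satisfy. A sketch, using a hypothetical helper that is not part of archlint:

```go
package specs

import "testing"

// priorityFromFilename is a hypothetical helper: it reads the numeric
// priority prefix from a spec filename ("005-collect-command.md" -> 5).
func priorityFromFilename(name string) int {
	n := 0
	for _, r := range name {
		if r < '0' || r > '9' {
			break
		}
		n = n*10 + int(r-'0')
	}
	return n
}

// The table below is the kind of input/output contract worth embedding
// directly in a specification: it leaves no room for interpretation.
func TestPriorityFromFilename(t *testing.T) {
	cases := []struct {
		in   string
		want int
	}{
		{"001-project-init.md", 1},
		{"005-collect-command.md", 5},
		{"020-go-code-analyzer.md", 20},
		{"no-prefix.md", 0},
	}
	for _, c := range cases {
		if got := priorityFromFilename(c.in); got != c.want {
			t.Errorf("priorityFromFilename(%q) = %d, want %d", c.in, got, c.want)
		}
	}
}
```

The same table can live in the specification itself and in the test suite, so the agent and the validator check against identical examples.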
Full report with mutation catalog: github.com/mshogin/archlint-reproduction
Specification Idempotency
For specifications to remain reproducible, all changes must go through them. No tweaks in copilot mode, no chats with “fix this thing”. Every change - update the specification, then implement.
This is the main challenge. You want to quickly fix a bug right in the dialogue with the agent rather than go back to the spec. But every such fix is a loss of reproducibility.
The trade-off is obvious:
- Need results here and now - copilot mode is faster
- Need reproducibility - only through specifications
The choice depends on context. Prototype or experiment - copilot. Production code with a long lifecycle - specifications.
Limitations
The approach doesn’t solve problems automatically. It doesn’t replace domain understanding, doesn’t invent requirements, and doesn’t guarantee good design. It makes errors visible earlier and forces architecture to stay within bounds.
The cost: time for writing specifications, reviewing them, and iterations after validation. If you treat the specs as a mere formality, everything slides back into chaos. Honestly ask yourself: are you willing to spend 10-30 minutes on a specification so the agent can implement it in 5-20 minutes?
Implementation details vary. The reproducibility experiment showed 23 mutations - algorithms are interpreted differently, code style differs. Critical sections need pseudocode, not just descriptions.
I think the approach works well where there is an established, working process: one built on discipline, clear areas of responsibility, review, and a definition of done. You can think of such a process as a conveyor delivering software 24/7.
Conclusions
The hypothesis was confirmed:
- AI generates more predictable code - yes, with diagrams present
- Results can be validated - yes, 85.5% reproducibility
- Architecture remains controllable - yes, 100% structural identity
The essence is simple: without a specification there’s nothing to verify; with one, there’s an artifact for validation. You don’t need a perfect AI or a perfect prompt.
The experiment continues.
Templates and examples: github.com/mshogin/archlint
If you’re trying the spec-driven approach or already using it - share in the comments what works and what doesn’t. I write about AI code generation and architecture practices on Telegram: @MikeShogin