In our last post, we made the case for design-first development. But translating from theory to practice can be a challenge. To help others clear this hurdle, we’re going to demonstrate design-first development by first creating a CI scheduler, and then fixing a bug in it.
When you move to a design-first workflow, you hit the blank page problem. You sit down to write your first design doc, and the friction shows up immediately: Is this too detailed? Not detailed enough? If I’m already thinking about the database schema, should I put that here, or wait?
This is not a failure. Most of us learned to solve problems by coding our way through them. In a design-first world, that instinct works against you. The blank page feels uncomfortable because you are used to your code giving you traction. In this workflow, traction comes from committing to decisions before you start building.
This shift is not automatic. You have to apply the methodology each time with the same care as the first. Consistency and rigor make it work, not cleverness. The hard part is repeating the process when you are tired, when the deadline is close, and when you feel tempted to skip the writing and go straight to code.
The design doc, the spec, and the plan do different jobs. When you blur them, the AI gets confused, your reviewers get confused, and the methodology stops working.
The design doc comes first, before requirements and technical planning. It lays out what we are building, what we are not building, and which promises the system makes that nothing is allowed to break.
It scopes to a system, not a feature. You keep one document per system, version it in your repository alongside the code, and update the relevant sections in the same PR when a new feature touches that system.
The design doc stays prescriptive: you delete sections that do not apply rather than leaving them empty or marking them N/A. A single-component change does not need a high-level architecture diagram. A pure library with no external callers does not need a backwards compatibility section. The document should contain only what matters for this system.
Goals and Non-Goals
Goals must be verifiable. If you cannot write a test or metric for a goal, rewrite it or delete it. If you cannot verify a goal, you do not have a goal; you have an aspiration.
Non-goals are where design docs can easily fail. A weak non-goal documents an omission. A strong non-goal documents a decision.
Weak: “Out of scope: parallel suite execution.”

Strong: “Parallel suite execution is explicitly deferred. Current suite count makes sequential scheduling sufficient, and adding concurrency control now would complicate failure attribution without proportional benefit. Revisit when suite count exceeds 20.”
Six months from now, someone reads the strong version and knows: we thought about this, we made a call, here is why. Without it, you relitigate the decision in the next planning meeting.
System Invariants
This section tells you whether the design is done.
System invariants are the top-level promises the system makes to every caller, regardless of which component is involved. You number them because failure modes reference them by ID. They set the hardest constraints, and no component can violate them.
If you cannot name the invariants yet, you need more design work before you move on.
Examples, from the scheduler walkthrough below:

- INV-1: No scheduled trigger is silently dropped.
- INV-2: No org’s schedule or run can affect another org.
Test your design against these. If a failure mode violates an invariant, your architecture has a gap that you need to address.
Alternatives Considered
The proposed design looks obvious in retrospect. The rejected alternatives are the proof of judgment calls. If you do not write them down, that decision context evaporates.
Every rejected alternative needs three things: what it was, why it seemed attractive, and why you chose differently.
Component Contracts
For each component in the system, the design doc defines its contract: what it exposes, what it consumes, what it promises to callers, and how it fails. At the bottom of each contract, include a checklist of spec directives: explicit statements of what the downstream spec must define for this component.
The spec directives let the design doc produce the spec. When you sit down to write the spec, you work through the checklist from each component contract instead of starting cold. The design doc has already told you exactly what you need to specify.
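For the scheduler walkthrough below, a spec-directive checklist might look like this (the items are illustrative, reconstructed from the requirements the walkthrough ends up with):

```markdown
Spec directives — Scheduler:
- [ ] Define what is logged for every execution, and with which fields.
- [ ] Define the behavior when a trigger fires while a suite is running.
- [ ] Define how schedules are isolated between orgs.
- [ ] Define the behavior when a trigger fires with no suite configured.
```

Each unchecked item becomes one or more numbered requirements in the spec.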
Failure Modes
Describe how the system fails end-to-end. Cover upstream dependency outages and partial failure behavior. These scenarios span multiple components; they are not single-component error handling.
Each failure mode references the invariants it might affect. If you cannot map a failure to your invariants, you still have thinking to do.
Cross-Cutting Concerns
Security, performance, observability. If a section has no impact, say so explicitly with one sentence of justification. Do not leave it blank.
The Living Document Rule
A design doc does not capture intent once and then freeze. When implementation diverges from the plan (and it will), update the design doc in the same PR as the code. Date-stamp it, and explain what changed and why.
### 2026-03-14: Decision update
Proposed Design amended. Original design stored schedule configs in-memory; implementation revealed this causes config loss on service restart. Moved to persistent store. Design updated to reflect.
If you merge code that contradicts the design doc without updating it, you create debt in your most important artifact.
Eval Suite CI System: Design Document
Goals
Non-Goals
System Invariants
Alternatives Considered
Failure Modes
Cross-Cutting Concerns
Component Manifest
| Component | Module | Size | Depends on |
|---|---|---|---|
| Schedule Store | db/migrations/0012_scheduler.sql | S | — |
| Scheduler | services/scheduler/scheduler.ts | M | Schedule Store |
| Trigger Handler | services/runner/handler.ts | S | Scheduler |
Size key: XS = < 2h · S = half day · M = 1–2 days · L = 3–5 days · XL = needs breakdown
Component Contracts
Scheduler — services/scheduler/scheduler.ts

Depends on: Schedule Store · Size: M
| Error class | When | Caller must |
|---|---|---|
| Invalid config | Trigger time malformed or suite ID missing | Reject at creation; do not store |
| Queue unavailable | Run queue unreachable at dispatch time | Log missed trigger; alert on consecutive misses |
| Org isolation violation | Schedule references a resource outside the requesting org | Reject; log the attempt |
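As a sketch, the error table might translate into an error hierarchy like this in `services/scheduler/errors.ts`. The class names and fields are hypothetical; the design doc only fixes the error categories and the caller obligations:

```typescript
// Hypothetical error hierarchy for the scheduler (names are illustrative).
// Each class corresponds to one row of the error table above.

class SchedulerError extends Error {}

// Trigger time malformed or suite ID missing: reject at creation, never store.
class InvalidConfigError extends SchedulerError {
  constructor(public readonly field: string) {
    super(`invalid schedule config: ${field}`);
  }
}

// Run queue unreachable at dispatch time: caller logs the missed trigger.
class QueueUnavailableError extends SchedulerError {
  constructor(public readonly scheduledFor: Date) {
    super(`queue unreachable for trigger at ${scheduledFor.toISOString()}`);
  }
}

// Schedule references a resource outside the requesting org: reject and log.
class OrgIsolationError extends SchedulerError {
  constructor(public readonly orgId: string, public readonly resource: string) {
    super(`org ${orgId} may not reference ${resource}`);
  }
}
```

Distinct classes let the Trigger Handler branch on `instanceof` instead of parsing error messages, which keeps the “caller must” column enforceable.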
The spec follows from the design doc. Each component’s spec directives act as a checklist for the spec writer: they define exactly what you need to specify for each component. When you finish the design, writing the spec becomes mechanical: you work through each directive and make it testable.
The spec defines success, not approach. No tech stack names, no implementation choices. Use this test: can a stranger write tests against this spec without seeing any code and without asking you a single question? If not, it is not done. If you cannot evaluate a requirement unambiguously pass/fail, it does not belong here.
User Stories in Given/When/Then
This format forces observable behavior, not implementation intent. The “Given” sets the precondition. The “When” names the action. The “Then” defines the testable outcome.
Order stories by priority. P1 is the smallest viable increment: the thing that must work before anything else matters.
Functional Requirements (FR-001, FR-002…)
Number them. A reviewer can say “FR-007 is not satisfied” instead of writing three sentences. Use MUST and MUST NOT only. “Should” creates wiggle room, and wiggle room creates ambiguity.
Bad: “The system should respond quickly.”
Good: “FR-003: The workflow MUST complete within 10 minutes of trigger under normal load.”
Success Criteria (SC-001…)
Functional Requirements describe what the system must do. Success criteria describe what good looks like to the person using it: user-facing and measurable, not system internals.
Bad: “The polling loop runs on a 30-second interval.”
Good: “SC-002: Results are available to the team within 15 minutes of workflow trigger.”
Entity Definitions
If the feature involves data, define the domain objects precisely: what they are, what fields they carry, what constraints apply. You are not writing a schema. You are writing a conceptual definition that two engineers would model the same way.
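As a sketch, the Schedule entity from the walkthrough below might be modeled like this. The field names are assumptions drawn from the spec’s logging fields; a real spec would pin these constraints in prose, not code:

```typescript
// Illustrative conceptual model for a Schedule entity (not a DB schema).
interface Schedule {
  id: string;
  orgId: string;       // schedules are scoped to exactly one org
  suiteId: string;     // the eval suite this schedule triggers
  triggerTime: string; // "HH:MM" in UTC, e.g. "12:00"
  enabled: boolean;
}

// Constraints belong with the definition: two engineers should derive
// the same validity rules. A malformed trigger time or missing suite ID
// is rejected at creation, per the Scheduler contract.
function isValidSchedule(s: Schedule): boolean {
  const validTime = /^([01]\d|2[0-3]):[0-5]\d$/.test(s.triggerTime);
  return validTime && s.suiteId.length > 0 && s.orgId.length > 0;
}
```

The point of the exercise is the constraint list, not the syntax: any notation that forces you to name every field and its validity rule will do.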
Edge Cases
Do not write “handle errors gracefully.” List cases: missing configuration, concurrent triggers, upstream dependency unavailable, partial completion. Ask concrete questions: what happens when this field is empty? What happens when the trigger fires twice? What happens when the downstream service times out halfway through?
The Living Document Rule
When requirements change or gaps surface, amend the existing spec (date-stamped) with a ## Clarifications section and a ### Session YYYY-MM-DD subheading. Explain what changed, why, and what sections you updated. This is how a spec stays authoritative instead of becoming an archaeological artifact no one trusts.
Scheduled Eval Suite CI: Specification
User Story 1 (P1)
Given a schedule configured for 12:00 PM UTC,
When the scheduler clock reaches 12:00 PM UTC,
Then eval-suite-alpha MUST be initiated within 60 seconds of the scheduled time.
User Story 2 (P1)
Given a suite is already running when a scheduled trigger fires,
When the new trigger is received,
Then the trigger MUST be queued and executed after the current run completes, not dropped.
FR-001: The system MUST log the start time, org_id, suite_id, and trigger_source for every execution.
FR-002: If a suite is currently running, a concurrent trigger MUST be queued, not silently dropped. Queued triggers do not expire and MUST all execute in order.
FR-003: Schedules MUST be independently configurable per org. A change to one org’s schedule MUST NOT affect any other org.
FR-004: If a scheduled trigger fires and no suite is configured for that window, the system MUST log the event and take no further action.
SC-001: Results are available to the team within 15 minutes of the scheduled trigger time under normal load.
SC-002: No cross-org suite contamination occurs under any scheduling configuration.
Entities
FR-001 and FR-002 map directly to test cases. FR-003 maps directly to SC-002. The spec directives in the design doc produced these requirements; the spec made each one testable.
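For instance, FR-002’s queue semantics can be pinned down by a test against a toy model. This is a sketch, not the real `services/scheduler/scheduler.ts`:

```typescript
// Toy model of FR-002: while a suite is running, new triggers queue
// in order; nothing is dropped, and queued triggers never expire.
class ToyScheduler {
  private running = false;
  private queue: string[] = [];
  readonly executed: string[] = [];

  trigger(suiteId: string): void {
    if (this.running) {
      this.queue.push(suiteId); // FR-002: queue, never drop
    } else {
      this.running = true;
      this.executed.push(suiteId);
    }
  }

  finishCurrentRun(): void {
    this.running = false;
    const next = this.queue.shift(); // FIFO: execute in order
    if (next !== undefined) this.trigger(next);
  }
}
```

A stranger can write this test from the spec alone: trigger three times while a run is in flight, finish the runs, and assert that all three executed in order. That is the pass/fail bar the spec is supposed to set.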
The design doc is architectural. The spec is tech-agnostic. Neither tells an engineer exactly what to build and where. The planning step does, and it produces two artifacts: the plan and the task list.
The plan itself maps the spec to the real codebase: the technical choices the spec deliberately withheld, what already exists, what needs to be created, and exactly which files will change. It includes the data model with real schemas, affected file paths, and any code generation steps that must run before implementation can start. If any technical questions are genuinely unresolved when you sit down to plan, you resolve them first: research before you commit.
The task list follows from the plan. It is dependency-ordered and phased: setup and code generation first, then foundational shared types, then one phase per user story in priority order, then a final quality gate phase. Every task maps to a specific file. Parallel work is tagged. Each phase ends with a checkpoint that must pass before the next phase starts. The implementation agent executes against the task list. It does not need to interpret the plan.
Scheduled Eval Suite CI: Implementation Plan
Technical Context
| Item | Decision | Notes |
|---|---|---|
| Storage | SQL (persistent) | Preserves schedule state across restarts. No in-memory loss on crash. |
| Queue | Internal job queue | Existing infrastructure, no new dependency |
| Migration | Additive schema migration | New tables only. No destructive changes. |
Affected Locations
| Location | What changes | New or modified? |
|---|---|---|
| db/migrations/0012_scheduler.sql | Schedule + ScheduledRun tables | New |
| services/scheduler/scheduler.ts | Core scheduler logic | New |
| services/scheduler/scheduler.test.ts | Co-located unit tests | New |
| services/runner/handler.ts | Trigger ingestion point | Modified |
Code Generation Steps
Scheduled Eval Suite CI: Tasks
Phase 1 — Setup
- db/migrations/0012_scheduler.sql

Checkpoint: Migration runs cleanly. New tables exist.
Phase 2 — Foundational
- services/scheduler/types.ts
- services/scheduler/errors.ts

Checkpoint: Core types compile and are importable.
Phase 3 — User Story 1: Scheduled Trigger (P1)
- services/scheduler/scheduler.ts
- services/scheduler/scheduler.test.ts
- services/runner/handler.ts

Checkpoint: User Story 1 acceptance criteria pass independently.
Final Phase — Quality Gates
The plan answers “what are we building and where.” The task list answers “in what order, one file at a time.” The implementation agent gets the task list and the spec. Its job is to satisfy requirements, not to interpret the plan.
Step 1: Design Doc First
Design the system from a high level first. Define the goals and non-goals. Work through the alternatives; this is where you discover that upstream CI triggers create a coupling problem, and that external schedulers introduce audit gaps. The Alternatives Considered section only makes sense after you try to map the problem to the available infrastructure.
Then name the invariants. For the scheduler, INV-1 (no trigger is silently dropped) and INV-2 (no org affects another) carry the load. If the design cannot hold them under failure, the architecture has a gap.
Then write the component contracts and spec directives. When you can fill in the directives for each component (these are the exact things the spec must define), you finish the design doc.
Step 2: Spec from the Design Doc
With the design doc complete, writing the spec becomes mechanical. For each component, work through its spec directives: write the user stories and the FRs that satisfy them, then the success criteria the tests can measure. The directives tell you what needs to exist. The spec makes each one pass/fail testable.
For the CI scheduler, the scheduler component’s directives produced FR-001 (logging), FR-002 (queuing), FR-003 (org isolation), and FR-004 (empty-window behavior). The design doc established that these requirements needed to exist; the spec defines exactly what “correct” looks like for each.
Step 3: Clarify, One Question at a Time
Before planning, surface the remaining ambiguities. The discipline matters: ask one question at a time, and integrate each answer into the spec before you ask the next. Each answer shapes the next question.
For the CI scheduler spec, one clarification came back:
“FR-002 says concurrent triggers must be queued. What’s the maximum queue depth? Do queued triggers expire?”
Answer: unbounded queue; triggers do not expire.
Spec updated immediately, with a session entry:
### Session 2026-03-12: clarified FR-002
Queue is unbounded; triggers do not expire. Added “in order” execution constraint.
FR-002: If a suite is currently running, a concurrent trigger MUST be queued. Queued triggers do not expire and MUST all execute in order.
Step 4: Plan and Tasks
With a stable spec, produce the plan: map the work to the real codebase, make the technical choices the spec deliberately deferred, define the data model, and identify any code generation steps that must run before implementation starts. If anything is genuinely unknown at this point, resolve it through research before the plan is finalized. No unresolved technical questions should survive into the task list.
Then generate the task list: phased and dependency-ordered. Every task maps to a specific file. Parallel work is tagged. Each phase ends with a checkpoint. The final phase is always a quality gate: lint, build, test.
Step 5: Pre-Flight
Two checks run against the full artifact set before implementation begins.
First, run a consistency check: does every FR map to at least one task? Do any tasks lack a corresponding requirement? Do the artifacts use terms consistently? Catch these on paper.
Second, write a verification plan: for every FR, every acceptance scenario, every success criterion, every edge case, define the concrete test step that proves it, and name the file that contains that test. If a requirement has no plausible verification method, you have a gap. Fix it before you write code.
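The first check is easy to mechanize. A sketch of the FR-to-task traceability half, with the artifact contents inlined for illustration (a real check would read the spec and task files from disk):

```typescript
// Sketch of a traceability check: every FR-ID that appears in the spec
// must appear in at least one task. IDs follow the FR-NNN convention.
function extractIds(text: string): Set<string> {
  return new Set(text.match(/FR-\d{3}/g) ?? []);
}

function unmappedRequirements(specText: string, tasksText: string): string[] {
  const specIds = extractIds(specText);
  const taskIds = extractIds(tasksText);
  return [...specIds].filter((id) => !taskIds.has(id)).sort();
}

// Inlined sample artifacts, for illustration only.
const spec = "FR-001 logging. FR-002 queueing. FR-003 org isolation.";
const tasks = "T1 covers FR-001. T2 covers FR-002.";
// unmappedRequirements(spec, tasks) reports FR-003 as uncovered.
```

Running the same extraction in the other direction catches tasks that cite requirements no longer in the spec. This is exactly the kind of drift that is cheap to catch on paper and expensive to catch in review.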
Step 6: Implementation
The implementation agent gets the spec, the plan, and the task list. The spec defines what to build. The plan says where and how. The task list says what to do first: “Implement this task. Do not go beyond it. The spec is the definition of success.”
Once the agent completes a task or phase, you run an adversarial review before moving on. A fresh context (a second agent or a new session) receives the spec, the completed work, and one job: find what the spec requires that the implementation missed. It is not a collaborator. It is an inspector. It asks: does every requirement have test coverage? Are the edge cases handled? Does anything in the code exist that the spec did not ask for?
The output is a gap list: specific, traceable items tied to requirements. “FR-004 has no test.” “The empty-config edge case is not covered.” “Implementation added retry logic that no FR specifies.” The implementing agent addresses each item and re-runs. This is the feedback loop.
Two or three iterations is normal. More than that usually means the spec has a gap: an ambiguous requirement, a missing edge case, a constraint that was implied but never written down. When the loop is not converging, stop the implementation and go back to the spec. Fix the artifact, then re-run. Grinding through iteration after iteration without fixing the underlying problem is waste, and it signals that the real issue was never the code.
When the gap list is empty and the implementation satisfies the spec, it goes to review.
Step 7: The Review
The PR arrives as a package: design doc update + spec + plan + implementation + tests. One coherent unit.
The reviewer reads the spec first, then checks whether the plan and tests satisfy the spec. Verify requirements. Reference the design doc as necessary for context.
The gap list stays specific and traceable: “FR-004 has no test coverage.” “SC-001 timing claim is unverified under load.” “The concurrent trigger edge case from FR-002 is handled but not tested.”
Once the spec is satisfied, the implementation is finished by definition.
The PR description points to spec sections satisfied, not implementation choices made: “Satisfies FR-001 through FR-004. SC-001 verified by load test. No design doc amendments required; implementation matched the design.”
Now, what do we do differently when iterating on code that already fully embraces the design/spec methodology?
You find a bug. The instinct is immediate: open the file, change the thing, open a PR. This is how patches introduce regressions, fix symptoms not causes, and arrive with no proof they actually work. The methodology applies here too, not a lighter version of it. The same version, at a narrower scope.
First: Diagnose the Divergence
Before touching any artifact, answer one question: what kind of gap are we dealing with?
Type 1: The spec was correct, the implementation was wrong. The code fails to satisfy an existing FR. Do not touch the spec. Fix the code. Add a regression test. Done.
Type 2: The spec was incomplete. The bug exists because an edge case was not covered. Amend the existing spec: add the missing edge case, add the FR, date-stamp the clarification with justification. Then implement against the updated spec.
Type 3: The design was wrong. The architectural decision caused the bug. Update the existing design doc (date-stamped), explain what was wrong and why direction shifted, check whether you need to amend invariants, and propagate changes into the spec if needed. Then fix the code.
For our walkthrough, this CI scheduler bug is a Type 2.
After deploying the scheduler, an incident revealed that if the suite configuration object existed but contained zero configured suites, the scheduler entered an infinite retry loop. No FR covered it. The spec had a gap.
The fix started with the spec, not the code:
### Session 2026-03-22: Added FR-009
A production incident revealed that a suite config present with zero suites caused an infinite retry loop. This edge case was not specified.
Sections updated: Functional Requirements (added FR-009), Edge Cases (added entry for empty-but-valid config object).
FR-009: If a scheduled trigger fires and the suite config is present but contains zero configured suites, the system MUST log a warning and terminate the trigger without retry.
Then the implementation agent ran against the updated spec. The PR included a regression test:
```typescript
test("empty suite config does not retry", async () => {
  // Trigger the scheduler with a config containing zero suites.
  // Assert: a single warning log entry is emitted, no retry is
  // scheduled, and the trigger terminates cleanly.
});
```
“Works now” is not proof. “There is now a test that would have caught this” is proof.
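For completeness, the code-side fix that satisfies FR-009 is a guard clause at dispatch. A sketch, with hypothetical function and type names:

```typescript
// Hypothetical dispatch guard for FR-009: a present-but-empty suite
// config logs one warning and terminates the trigger without retry.
interface SuiteConfig {
  suites: string[];
}

type DispatchResult = "dispatched" | "warned-no-retry";

function dispatchTrigger(
  config: SuiteConfig | null,
  warn: (msg: string) => void,
): DispatchResult {
  if (config !== null && config.suites.length === 0) {
    warn("suite config present but empty; terminating trigger (FR-009)");
    return "warned-no-retry"; // no retry loop
  }
  // ... normal dispatch path (elided)
  return "dispatched";
}
```

The guard is trivial; the work was diagnosing that the spec, not the code, had the gap, and writing FR-009 before touching the dispatch path.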
Why Not a Separate Patch Spec?
Single source of truth. Any human or agent reading the spec six months later sees the ### Session 2026-03-22 block, understands the full history, and understands the current state. A parallel patch document makes that reconstruction harder and will drift from or contradict the original.
It also forces the right diagnostic question: was this wrong in the code, the spec, or the design? That question is most of the work.
The Legacy Code Edge Case
Sometimes the bug lives in code that predates the methodology. No design doc, no spec.
Write the spec that should have existed. Not a “patch spec”: the spec. Treat it as a first-class artifact even if it is abbreviated. Then add the session entry. Then fix the code.
This is how the methodology spreads into legacy code without a big-bang rewrite: you use it when you need it.
The design doc, the spec, and the plan are not documentation overhead layered on top of engineering work. They are the engineering work.
The methodology does not include a “too small for process” exception. If you apply it to large features but skip it for patches, the patches will introduce regressions that a spec would have caught. A bug fix without the diagnostic question (was this wrong in the code, the spec, or the design?) is not a fix. It is a guess.
The only thing that changes with the size of the work is the surface area of the artifacts. The rigor remains.