In our last post, we made the case for design-first development. But translating from theory to practice can be a challenge. To help others clear this hurdle, we’re going to demonstrate design-first development by first creating a CI scheduler, and then fixing a bug in it.
When you move to a design-first workflow, you hit the blank page problem. You sit down to write your first design doc, and the friction shows up immediately: Is this too detailed? Not detailed enough? If I’m already thinking about the database schema, should I put that here, or wait?
This is not a failure. Most of us learned to solve problems by coding our way through them. In a design-first world, that instinct works against you. The blank page feels uncomfortable because you are used to your code giving you traction. In this workflow, traction comes from committing to decisions before you start building.
This shift is not automatic. You have to apply the methodology each time with the same care as the first. Consistency and rigor make it work, not cleverness. The hard part is repeating the process when you are tired, when the deadline is close, and when you feel tempted to skip the writing and go straight to code.
The design doc, the spec, and the plan do different jobs. When you blur them, the AI gets confused, your reviewers get confused, and the methodology stops working.
The design doc comes first, before requirements and technical planning. It lays out what we are building, what we are not building, and which promises the system makes that nothing is allowed to break.
It scopes to a system, not a feature. You keep one document per system, version it in your repository alongside the code, and update the relevant sections in the same PR when a new feature touches that system.
The design doc stays prescriptive: you delete sections that do not apply rather than leaving them empty or marking them N/A. A single-component change does not need a high-level architecture diagram. A pure library with no external callers does not need a backwards compatibility section. The document should contain only what matters for this system.
Goals and Non-Goals
Goals must be verifiable. If you cannot write a test or metric for a goal, rewrite it or delete it. If you cannot verify a goal, you do not have a goal; you have an aspiration.
Non-goals are where design docs can easily fail. A weak non-goal documents an omission. A strong non-goal documents a decision.
Weak: “Out of scope: parallel suite execution.”

Strong: “Parallel suite execution is explicitly deferred. Current suite count makes sequential scheduling sufficient, and adding concurrency control now would complicate failure attribution without proportional benefit. Revisit when suite count exceeds 20.”
Six months from now, someone reads the strong version and knows: we thought about this, we made a call, here is why. Without it, you relitigate the decision in the next planning meeting.
System Invariants
This section tells you whether the design is done.
System invariants are the top-level promises the system makes to every caller, regardless of which component is involved. You number them because failure modes reference them by ID. They set the hardest constraints, and no component can violate them.
If you cannot name the invariants yet, you need more design work before you move on.
Examples, from the scheduler walkthrough below:

- INV-1: No scheduled trigger is silently dropped.
- INV-2: No org’s schedule or run can affect another org.
Test your design against these. If a failure mode violates an invariant, your architecture has a gap that you need to address.
Alternatives Considered
The proposed design looks obvious in retrospect. The rejected alternatives are the proof of judgment calls. If you do not write them down, that decision context evaporates.
Every rejected alternative needs three things: what it was, why it seemed attractive, and why you chose differently.
Component Contracts
For each component in the system, the design doc defines its contract: what it exposes, what it consumes, what it promises to callers, and how it fails. At the bottom of each contract, include a checklist of spec directives: explicit statements of what the downstream spec must define for this component.
The spec directives let the design doc produce the spec. When you sit down to write the spec, you work through the checklist from each component contract instead of starting cold. The design doc has already told you exactly what you need to specify.
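For the scheduler walkthrough below, a spec-directive checklist might look like this (the items are illustrative, reconstructed from the requirements the walkthrough ends up with):

```markdown
Spec directives — Scheduler:
- [ ] Define what is logged for every execution, and with which fields.
- [ ] Define the behavior when a trigger fires while a suite is running.
- [ ] Define how schedules are isolated between orgs.
- [ ] Define the behavior when a trigger fires with no suite configured.
```

Each unchecked item becomes one or more numbered requirements in the spec.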
Failure Modes
Describe how the system fails end-to-end. Cover upstream dependency outages and partial failure behavior. These scenarios span multiple components; they are not single-component error handling.
Each failure mode references the invariants it might affect. If you cannot map a failure to your invariants, you still have thinking to do.
Cross-Cutting Concerns
Security, performance, observability. If a section has no impact, say so explicitly with one sentence of justification. Do not leave it blank.
The Living Document Rule
A design doc does not capture intent once and then freeze. When implementation diverges from the plan (and it will), update the design doc in the same PR as the code. Date-stamp it, and explain what changed and why.
### 2026-03-14: Decision update
Proposed Design amended. Original design stored schedule configs in-memory; implementation revealed this causes config loss on service restart. Moved to persistent store. Design updated to reflect.
If you merge code that contradicts the design doc without updating it, you create debt in your most important artifact.
Eval Suite CI System: Design Document
Goals
Non-Goals
System Invariants
Alternatives Considered
Failure Modes
Cross-Cutting Concerns
Component Manifest
| Component | Module | Size | Depends on |
|---|---|---|---|
| Schedule Store | db/migrations/0012_scheduler.sql | S | — |
| Scheduler | services/scheduler/scheduler.ts | M | Schedule Store |
| Trigger Handler | services/runner/handler.ts | S | Scheduler |
Size key: XS = < 2h · S = half day · M = 1–2 days · L = 3–5 days · XL = needs breakdown
Component Contracts
Scheduler — services/scheduler/scheduler.ts

Depends on: Schedule Store · Size: M
| Error class | When | Caller must |
|---|---|---|
| Invalid config | Trigger time malformed or suite ID missing | Reject at creation; do not store |
| Queue unavailable | Run queue unreachable at dispatch time | Log missed trigger; alert on consecutive misses |
| Org isolation violation | Schedule references a resource outside the requesting org | Reject; log the attempt |
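As a sketch, the error table might translate into an error hierarchy like this in `services/scheduler/errors.ts`. The class names and fields are hypothetical; the design doc only fixes the error categories and the caller obligations:

```typescript
// Hypothetical error hierarchy for the scheduler (names are illustrative).
// Each class corresponds to one row of the error table above.

class SchedulerError extends Error {}

// Trigger time malformed or suite ID missing: reject at creation, never store.
class InvalidConfigError extends SchedulerError {
  constructor(public readonly field: string) {
    super(`invalid schedule config: ${field}`);
  }
}

// Run queue unreachable at dispatch time: caller logs the missed trigger.
class QueueUnavailableError extends SchedulerError {
  constructor(public readonly scheduledFor: Date) {
    super(`queue unreachable for trigger at ${scheduledFor.toISOString()}`);
  }
}

// Schedule references a resource outside the requesting org: reject and log.
class OrgIsolationError extends SchedulerError {
  constructor(public readonly orgId: string, public readonly resource: string) {
    super(`org ${orgId} may not reference ${resource}`);
  }
}
```

Distinct classes let the Trigger Handler branch on `instanceof` instead of parsing error messages, which keeps the “caller must” column enforceable.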
The spec follows from the design doc. Each component’s spec directives act as a checklist for the spec writer: they define exactly what you need to specify for each component. When you finish the design, writing the spec becomes mechanical: you work through each directive and make it testable.
The spec defines success, not approach. No tech stack names, no implementation choices. Use this test: can a stranger write tests against this spec without seeing any code and without asking you a single question? If not, it is not done. If you cannot evaluate a requirement unambiguously pass/fail, it does not belong here.
User Stories in Given/When/Then
This format forces observable behavior, not implementation intent. The “Given” sets the precondition. The “When” names the action. The “Then” defines the testable outcome.
Order stories by priority. P1 is the smallest viable increment: the thing that must work before anything else matters.
Functional Requirements (FR-001, FR-002…)
Number them. A reviewer can say “FR-007 is not satisfied” instead of writing three sentences. Use MUST and MUST NOT only. “Should” creates wiggle room, and wiggle room creates ambiguity.
Bad: “The system should respond quickly.”
Good: “FR-003: The workflow MUST complete within 10 minutes of trigger under normal load.”
Success Criteria (SC-001…)
Functional Requirements describe what the system must do. Success criteria describe what good looks like to the person using it: user-facing and measurable, not system internals.
Bad: “The polling loop runs on a 30-second interval.”
Good: “SC-002: Results are available to the team within 15 minutes of workflow trigger.”
Entity Definitions
If the feature involves data, define the domain objects precisely: what they are, what fields they carry, what constraints apply. You are not writing a schema. You are writing a conceptual definition that two engineers would model the same way.
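As a sketch, the Schedule entity from the walkthrough below might be modeled like this. The field names are assumptions drawn from the spec’s logging fields; a real spec would pin these constraints in prose, not code:

```typescript
// Illustrative conceptual model for a Schedule entity (not a DB schema).
interface Schedule {
  id: string;
  orgId: string;       // schedules are scoped to exactly one org
  suiteId: string;     // the eval suite this schedule triggers
  triggerTime: string; // "HH:MM" in UTC, e.g. "12:00"
  enabled: boolean;
}

// Constraints belong with the definition: two engineers should derive
// the same validity rules. A malformed trigger time or missing suite ID
// is rejected at creation, per the Scheduler contract.
function isValidSchedule(s: Schedule): boolean {
  const validTime = /^([01]\d|2[0-3]):[0-5]\d$/.test(s.triggerTime);
  return validTime && s.suiteId.length > 0 && s.orgId.length > 0;
}
```

The point of the exercise is the constraint list, not the syntax: any notation that forces you to name every field and its validity rule will do.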
Edge Cases
Do not write “handle errors gracefully.” List cases: missing configuration, concurrent triggers, upstream dependency unavailable, partial completion. Ask concrete questions: what happens when this field is empty? What happens when the trigger fires twice? What happens when the downstream service times out halfway through?
The Living Document Rule
When requirements change or gaps surface, amend the existing spec (date-stamped) with a ## Clarifications section and a ### Session YYYY-MM-DD subheading. Explain what changed, why, and what sections you updated. This is how a spec stays authoritative instead of becoming an archaeological artifact no one trusts.
Scheduled Eval Suite CI: Specification
User Story 1 (P1)
Given a schedule configured for 12:00 PM UTC,
When the scheduler clock reaches 12:00 PM UTC,
Then eval-suite-alpha MUST be initiated within 60 seconds of the scheduled time.
User Story 2 (P1)
Given a suite is already running when a scheduled trigger fires,
When the new trigger is received,
Then the trigger MUST be queued and executed after the current run completes, not dropped.
FR-001: The system MUST log the start time, org_id, suite_id, and trigger_source for every execution.
FR-002: If a suite is currently running, a concurrent trigger MUST be queued, not silently dropped. Queued triggers do not expire and MUST all execute in order.
FR-003: Schedules MUST be independently configurable per org. A change to one org’s schedule MUST NOT affect any other org.
FR-004: If a scheduled trigger fires and no suite is configured for that window, the system MUST log the event and take no further action.
SC-001: Results are available to the team within 15 minutes of the scheduled trigger time under normal load.
SC-002: No cross-org suite contamination occurs under any scheduling configuration.
Entities
FR-001 and FR-002 map directly to test cases. FR-003 maps directly to SC-002. The spec directives in the design doc produced these requirements; the spec made each one testable.
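For instance, FR-002’s queue semantics can be pinned down by a test against a toy model. This is a sketch, not the real `services/scheduler/scheduler.ts`:

```typescript
// Toy model of FR-002: while a suite is running, new triggers queue
// in order; nothing is dropped, and queued triggers never expire.
class ToyScheduler {
  private running = false;
  private queue: string[] = [];
  readonly executed: string[] = [];

  trigger(suiteId: string): void {
    if (this.running) {
      this.queue.push(suiteId); // FR-002: queue, never drop
    } else {
      this.running = true;
      this.executed.push(suiteId);
    }
  }

  finishCurrentRun(): void {
    this.running = false;
    const next = this.queue.shift(); // FIFO: execute in order
    if (next !== undefined) this.trigger(next);
  }
}
```

A stranger can write this test from the spec alone: trigger three times while a run is in flight, finish the runs, and assert that all three executed in order. That is the pass/fail bar the spec is supposed to set.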
The design doc is architectural. The spec is tech-agnostic. Neither tells an engineer exactly what to build and where. The planning step does, and it produces two artifacts: the plan and the task list.
The plan itself maps the spec to the real codebase: the technical choices the spec deliberately withheld, what already exists, what needs to be created, and exactly which files will change. It includes the data model with real schemas, affected file paths, and any code generation steps that must run before implementation can start. If any technical questions are genuinely unresolved when you sit down to plan, you resolve them first: research before you commit.
The task list follows from the plan. It is dependency-ordered and phased: setup and code generation first, then foundational shared types, then one phase per user story in priority order, then a final quality gate phase. Every task maps to a specific file. Parallel work is tagged. Each phase ends with a checkpoint that must pass before the next phase starts. The implementation agent executes against the task list. It does not need to interpret the plan.
Scheduled Eval Suite CI: Implementation Plan
Technical Context
| Item | Decision | Notes |
|---|---|---|
| Storage | SQL (persistent) | Preserves schedule state across restarts. No in-memory loss on crash. |
| Queue | Internal job queue | Existing infrastructure, no new dependency |
| Migration | Additive schema migration | New tables only. No destructive changes. |
Affected Locations
| Location | What changes | New or modified? |
|---|---|---|
| db/migrations/0012_scheduler.sql | Schedule + ScheduledRun tables | New |
| services/scheduler/scheduler.ts | Core scheduler logic | New |
| services/scheduler/scheduler.test.ts | Co-located unit tests | New |
| services/runner/handler.ts | Trigger ingestion point | Modified |
Code Generation Steps
Scheduled Eval Suite CI: Tasks
Phase 1 — Setup
- db/migrations/0012_scheduler.sql

Checkpoint: Migration runs cleanly. New tables exist.
Phase 2 — Foundational
- services/scheduler/types.ts
- services/scheduler/errors.ts

Checkpoint: Core types compile and are importable.
Phase 3 — User Story 1: Scheduled Trigger (P1)
- services/scheduler/scheduler.ts
- services/scheduler/scheduler.test.ts
- services/runner/handler.ts

Checkpoint: User Story 1 acceptance criteria pass independently.
Final Phase — Quality Gates
The plan answers “what are we building and where.” The task list answers “in what order, one file at a time.” The implementation agent gets the task list and the spec. Its job is to satisfy requirements, not to interpret the plan.
Step 1: Design Doc First
Design the system from a high level first. Define the goals and non-goals. Work through the alternatives; this is where you discover that upstream CI triggers create a coupling problem, and that external schedulers introduce audit gaps. The Alternatives Considered section only makes sense after you try to map the problem to the available infrastructure.
Then name the invariants. For the scheduler, INV-1 (no trigger is silently dropped) and INV-2 (no org affects another) carry the load. If the design cannot hold them under failure, the architecture has a gap.
Then write the component contracts and spec directives. When you can fill in the directives for each component (these are the exact things the spec must define), you finish the design doc.
Step 2: Spec from the Design Doc
With the design doc complete, writing the spec becomes mechanical. For each component, work through its spec directives: write the user stories and the FRs that satisfy them, then the success criteria the tests can measure. The directives tell you what needs to exist. The spec makes each one pass/fail testable.
For the CI scheduler, the scheduler component’s directives produced FR-001 (logging), FR-002 (queuing), FR-003 (org isolation), and FR-004 (empty-window behavior). The design doc established that these requirements needed to exist; the spec defines exactly what “correct” looks like for each.
Step 3: Clarify, One Question at a Time
Before planning, surface the remaining ambiguities. The discipline matters: ask one question at a time, and integrate each answer into the spec before you ask the next. Each answer shapes the next question.
For the CI scheduler spec, one clarification came back:
“FR-002 says concurrent triggers must be queued. What’s the maximum queue depth? Do queued triggers expire?”
Answer: unbounded queue; triggers do not expire.
Spec updated immediately, with a session entry:
### Session 2026-03-12: clarified FR-002
Queue is unbounded; triggers do not expire. Added “in order” execution constraint.
FR-002: If a suite is currently running, a concurrent trigger MUST be queued. Queued triggers do not expire and MUST all execute in order.
Step 4: Plan and Tasks
With a stable spec, produce the plan: map the work to the real codebase, make the technical choices the spec deliberately deferred, define the data model, and identify any code generation steps that must run before implementation starts. If anything is genuinely unknown at this point, resolve it through research before the plan is finalized. No unresolved technical questions should survive into the task list.
Then generate the task list: phased and dependency-ordered. Every task maps to a specific file. Parallel work is tagged. Each phase ends with a checkpoint. The final phase is always a quality gate: lint, build, test.
Step 5: Pre-Flight
Two checks run against the full artifact set before implementation begins.
First, run a consistency check: does every FR map to at least one task? Do any tasks lack a corresponding requirement? Do the artifacts use terms consistently? Catch these on paper.
Second, write a verification plan: for every FR, every acceptance scenario, every success criterion, every edge case, define the concrete test step that proves it, and name the file that contains that test. If a requirement has no plausible verification method, you have a gap. Fix it before you write code.
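The first check is easy to mechanize. A sketch of the FR-to-task traceability half, with the artifact contents inlined for illustration (a real check would read the spec and task files from disk):

```typescript
// Sketch of a traceability check: every FR-ID that appears in the spec
// must appear in at least one task. IDs follow the FR-NNN convention.
function extractIds(text: string): Set<string> {
  return new Set(text.match(/FR-\d{3}/g) ?? []);
}

function unmappedRequirements(specText: string, tasksText: string): string[] {
  const specIds = extractIds(specText);
  const taskIds = extractIds(tasksText);
  return [...specIds].filter((id) => !taskIds.has(id)).sort();
}

// Inlined sample artifacts, for illustration only.
const spec = "FR-001 logging. FR-002 queueing. FR-003 org isolation.";
const tasks = "T1 covers FR-001. T2 covers FR-002.";
// unmappedRequirements(spec, tasks) reports FR-003 as uncovered.
```

Running the same extraction in the other direction catches tasks that cite requirements no longer in the spec. This is exactly the kind of drift that is cheap to catch on paper and expensive to catch in review.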
Step 6: Implementation
The implementation agent gets the spec, the plan, and the task list. The spec defines what to build. The plan says where and how. The task list says what to do first: “Implement this task. Do not go beyond it. The spec is the definition of success.”
Once the agent completes a task or phase, you run an adversarial review before moving on. A fresh context (a second agent or a new session) receives the spec, the completed work, and one job: find what the spec requires that the implementation missed. It is not a collaborator. It is an inspector. It asks: does every requirement have test coverage? Are the edge cases handled? Does anything in the code exist that the spec did not ask for?
The output is a gap list: specific, traceable items tied to requirements. “FR-004 has no test.” “The empty-config edge case is not covered.” “Implementation added retry logic that no FR specifies.” The implementing agent addresses each item and re-runs. This is the feedback loop.
Two or three iterations is normal. More than that usually means the spec has a gap: an ambiguous requirement, a missing edge case, a constraint that was implied but never written down. When the loop is not converging, stop the implementation and go back to the spec. Fix the artifact, then re-run. Grinding through iteration after iteration without fixing the underlying problem is waste, and it signals that the real issue was never the code.
When the gap list is empty and the implementation satisfies the spec, it goes to review.
Step 7: The Review
The PR arrives as a package: design doc update + spec + plan + implementation + tests. One coherent unit.
The reviewer reads the spec first, then checks whether the plan and tests satisfy the spec. Verify requirements. Reference the design doc as necessary for context.
The gap list stays specific and traceable: “FR-004 has no test coverage.” “SC-001 timing claim is unverified under load.” “The concurrent trigger edge case from FR-002 is handled but not tested.”
Once the spec is satisfied, the implementation is finished by definition.
The PR description points to spec sections satisfied, not implementation choices made: “Satisfies FR-001 through FR-004. SC-001 verified by load test. No design doc amendments required; implementation matched the design.”
Now, what do we do differently when iterating on code that already fully embraces the design/spec methodology?
You find a bug. The instinct is immediate: open the file, change the thing, open a PR. This is how patches introduce regressions, fix symptoms not causes, and arrive with no proof they actually work. The methodology applies here too, not a lighter version of it. The same version, at a narrower scope.
First: Diagnose the Divergence
Before touching any artifact, answer one question: what kind of gap are we dealing with?
Type 1: The spec was correct, the implementation was wrong. The code fails to satisfy an existing FR. Do not touch the spec. Fix the code. Add a regression test. Done.
Type 2: The spec was incomplete. The bug exists because an edge case was not covered. Amend the existing spec: add the missing edge case, add the FR, date-stamp the clarification with justification. Then implement against the updated spec.
Type 3: The design was wrong. The architectural decision caused the bug. Update the existing design doc (date-stamped), explain what was wrong and why direction shifted, check whether you need to amend invariants, and propagate changes into the spec if needed. Then fix the code.
For our walkthrough, this CI scheduler bug is a Type 2.
After deploying the scheduler, an incident revealed that if the suite configuration object existed but contained zero configured suites, the scheduler entered an infinite retry loop. No FR covered it. The spec had a gap.
The fix started with the spec, not the code:
### Session 2026-03-22: Added FR-009
A production incident revealed that a suite config present with zero suites caused an infinite retry loop. This edge case was not specified.
Sections updated: Functional Requirements (added FR-009), Edge Cases (added entry for empty-but-valid config object).
FR-009: If a scheduled trigger fires and the suite config is present but contains zero configured suites, the system MUST log a warning and terminate the trigger without retry.
Then the implementation agent ran against the updated spec. The PR included a regression test:
```typescript
test("empty suite config does not retry", async () => {
  // Trigger the scheduler with a config containing zero suites.
  // Assert: a single warning log entry is emitted, no retry is
  // scheduled, and the trigger terminates cleanly.
});
```
“Works now” is not proof. “There is now a test that would have caught this” is proof.
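For completeness, the code-side fix that satisfies FR-009 is a guard clause at dispatch. A sketch, with hypothetical function and type names:

```typescript
// Hypothetical dispatch guard for FR-009: a present-but-empty suite
// config logs one warning and terminates the trigger without retry.
interface SuiteConfig {
  suites: string[];
}

type DispatchResult = "dispatched" | "warned-no-retry";

function dispatchTrigger(
  config: SuiteConfig | null,
  warn: (msg: string) => void,
): DispatchResult {
  if (config !== null && config.suites.length === 0) {
    warn("suite config present but empty; terminating trigger (FR-009)");
    return "warned-no-retry"; // no retry loop
  }
  // ... normal dispatch path (elided)
  return "dispatched";
}
```

The guard is trivial; the work was diagnosing that the spec, not the code, had the gap, and writing FR-009 before touching the dispatch path.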
Why Not a Separate Patch Spec?
Single source of truth. Any human or agent reading the spec six months later sees the ### Session 2026-03-22 block, understands the full history, and understands the current state. A parallel patch document makes that reconstruction harder and will drift from or contradict the original.
It also forces the right diagnostic question: was this wrong in the code, the spec, or the design? That question is most of the work.
The Legacy Code Edge Case
Sometimes the bug lives in code that predates the methodology. No design doc, no spec.
Write the spec that should have existed. Not a “patch spec”: the spec. Treat it as a first-class artifact even if it is abbreviated. Then add the session entry. Then fix the code.
This is how the methodology spreads into legacy code without a big-bang rewrite: you use it when you need it.
The design doc, the spec, and the plan are not documentation overhead layered on top of engineering work. They are the engineering work.
The methodology does not include a “too small for process” exception. If you apply it to large features but skip it for patches, the patches will introduce regressions that a spec would have caught. A bug fix without the diagnostic question (was this wrong in the code, the spec, or the design?) is not a fix. It is a guess.
The only thing that changes with the size of the work is the surface area of the artifacts. The rigor remains.