Back to Learning Hub

Autonomous Software Development Agent

System design for an AI agent that receives user stories and autonomously writes code, runs tests, deploys, and iterates — with multi-model LLM support.

Version: 3.1 Date: April 2, 2026 Author: Suman

1. Executive Overview

This document describes the architecture of an Autonomous Software Development Agent — a system that accepts user stories as its sole input and autonomously performs the full software development lifecycle: planning, code generation, testing, deployment, monitoring, and self-improvement.

The agent operates in a continuous loop. It decomposes user stories into implementation tasks, generates code, validates it against automated tests it also writes, deploys to a target environment, observes the result, and iterates until acceptance criteria are met. The system is designed to be LLM-agnostic, supporting pluggable model backends including Claude, GPT, open-source models, or any combination.

This goes beyond developer-assist copilots and aligns with the "AI software engineer" category demonstrated by systems like Devin (Cognition) and open platforms like OpenHands, which describe agents interacting with a codebase via editing, command execution, and web browsing inside sandboxed environments. A durable architecture treats the user story as a requirements artefact that must become machine-checkable before the agent is allowed to ship — Behaviour-Driven Development (BDD) provides a well-known bridge from narrative requirements to automated acceptance tests via Given–When–Then scenarios, making success measurable rather than subjective.

Designing for autonomy also implies designing for bounded autonomy. Recent research on agent autonomy highlights that "oversight" is not a UI add-on but an architectural concern spanning permissioning, escalation, and post-deployment monitoring. The system's core loop cannot be tightly coupled to one provider's API quirks — a practical route is to build a model gateway with a common tool-calling abstraction, and optionally adopt interoperability standards such as MCP (Model Context Protocol), which standardises how agentic applications connect to tools and data sources.

Key Innovation: Unlike code-generation copilots that assist developers, this agent is the developer. It owns the full loop from story → running software, with human oversight at configurable checkpoints.

What Makes This Different from a Copilot

CapabilityCopilot / AssistantThis Autonomous Agent
Scope of actionSingle file / function at a timeFull project — multi-file, multi-module changes
Error handlingHuman reads errors and re-promptsAgent classifies errors, applies targeted fix strategies, retries autonomously
TestingHuman writes or requests testsAgent generates tests from acceptance criteria and verifies coverage
DeploymentNot involvedDeploys to staging, runs smoke tests, monitors health, rolls back on failure
LearningNone (stateless)Remembers patterns, error fixes, project conventions across stories
Context awarenessCurrent file + neighborsFull codebase index with semantic search, dependency graph, convention profile

2. Goals & Design Principles

2.1 Primary Goals

  • Story-to-Software: Accept a user story in natural language and produce working, tested, deployed software.
  • Autonomous Iteration: Detect failures (build errors, test failures, runtime issues) and fix them without human intervention.
  • Self-Improvement: Learn from past successes and failures to improve code quality and reduce iteration cycles over time.
  • Multi-Model Flexibility: Route different tasks (planning, coding, reviewing, testing) to different LLMs based on capability and cost.

2.2 Design Principles

PrincipleDescription
Loop-FirstEvery action feeds back into an evaluation loop. The agent never "finishes" — it converges.
Fail-Safe by DefaultThe agent operates in sandboxed environments. Destructive actions require explicit approval gates.
ObservableEvery decision, code change, and deployment is logged with full provenance for audit and debugging.
Model-AgnosticThe LLM layer is abstracted behind a unified interface. Models can be swapped, mixed, or A/B tested.
Incremental DeliveryThe agent works in small increments — committing, testing, and deploying after each coherent change.
Human OverrideConfigurable approval gates allow humans to review and intervene at any stage of the pipeline.
Bounded AutonomyOversight is an architectural concern, not a UI add-on. The system enforces permissioning, escalation policies, and post-deployment monitoring at every layer.
Machine-Checkable SpecsAcceptance criteria are compiled into executable tests (BDD Given–When–Then) before any code is written, making "done" objectively measurable.

3. High-Level Architecture

INPUT LAYER User Stories API / CLI / UI Existing Codebase Config / Constraints AGENT CORE Orchestrator (State Machine) Story Analyzer Planner Code Generator Test Generator Code Reviewer Build Runner Test Runner Deploy Engine Monitor / Verifier Feedback Loop LLM Router → Claude | GPT | Open-Source Models | Specialized Models INFRASTRUCTURE & PERSISTENCE Sandbox Env Git Repository Artifact Store State Database Memory Store
Figure 1 — High-level system architecture showing input layer, agent core with orchestrator and sub-components, LLM router, and infrastructure.

4. Component Deep-Dive

4.1 Orchestrator (State Machine)

The orchestrator is the brain of the system. It manages the lifecycle of each user story through a finite state machine, deciding which component to invoke next based on the current state, outputs from previous steps, and defined policies. It maintains execution context, handles retries, and enforces maximum iteration limits.

ResponsibilityDetails
State ManagementTracks each story through states: RECEIVED → ANALYZING → PLANNING → CODING → REVIEWING → TESTING → DEPLOYING → VERIFYING → DONE / FAILED
Retry PolicyConfigurable max retries per stage (default: 5). Exponential backoff between attempts. Circuit breaker after repeated failures.
ParallelismCan process independent stories concurrently. Within a story, some stages can be parallelized (e.g., generating code for independent modules).
Approval GatesConfigurable pause points where human approval is required before proceeding (e.g., before deployment).
Context WindowManages context passed to LLMs — including relevant code, past errors, and project conventions.

4.2 Story Analyzer

Parses and enriches incoming user stories. Extracts acceptance criteria, identifies implied constraints, detects dependencies on existing code, and classifies the story type (new feature, bug fix, refactor, infrastructure). Produces a structured StorySpec object consumed by the Planner.

Outputs:

  • Acceptance Criteria — machine-verifiable conditions extracted from the story
  • Dependency Map — files, modules, APIs the story will likely touch
  • Complexity Score — estimated effort used for LLM routing and timeout config
  • Ambiguity Flags — areas where the story is unclear (may trigger a human clarification request)

4.3 Planner

Takes the StorySpec and produces an execution plan: an ordered list of implementation tasks, each with a description, target files, and estimated token budget. The planner also decides the testing strategy (unit, integration, e2e) and whether any infrastructure changes are needed.

{
  "story_id": "STORY-142",
  "plan": {
    "tasks": [
      {
        "id": "task-1",
        "action": "create_file",
        "target": "src/services/payment-processor.ts",
        "description": "Implement PaymentProcessor service with Stripe integration",
        "depends_on": [],
        "estimated_tokens": 4200
      },
      {
        "id": "task-2",
        "action": "modify_file",
        "target": "src/routes/checkout.ts",
        "description": "Add POST /checkout endpoint wiring to PaymentProcessor",
        "depends_on": ["task-1"],
        "estimated_tokens": 1800
      },
      {
        "id": "task-3",
        "action": "create_file",
        "target": "tests/services/payment-processor.test.ts",
        "description": "Unit tests for PaymentProcessor with mocked Stripe client",
        "depends_on": ["task-1"],
        "estimated_tokens": 3000
      }
    ],
    "testing_strategy": "unit + integration",
    "deploy_strategy": "staging_first"
  }
}

4.4 Code Generator

Receives individual tasks from the plan and generates code. It has access to the full project context via the codebase indexer, which provides semantic search over the existing code. The generator respects project conventions (linting rules, naming patterns, architectural boundaries) by loading a project profile at initialization.

Context Strategy: The generator uses a hierarchical context approach — project-level conventions are always included, file-level context is loaded on demand, and cross-file references are resolved via the codebase index. This keeps token usage efficient while maintaining coherence.

4.5 Test Generator

Writes tests that exercise the generated code against the acceptance criteria. Produces unit tests, integration tests, and where applicable, end-to-end test scripts. The test generator works in two modes: coverage mode (aims for high code coverage) and criteria mode (directly tests acceptance criteria from the story).

4.6 Code Reviewer

Performs automated review of generated code before it enters the test/deploy pipeline. Checks for security vulnerabilities, performance anti-patterns, adherence to project conventions, and logical correctness. Can optionally use a different LLM than the code generator to provide an independent perspective.

4.7 Build & Test Runner

Executes build commands and runs the test suite inside a sandboxed environment. Captures stdout/stderr, exit codes, test reports, and coverage data. Parses failures into structured error objects that the orchestrator can feed back to the Code Generator for repair.

4.8 Deploy Engine

Handles deployment to the target environment. Supports multiple strategies: direct push to staging, container builds, serverless deploys, and PR-based workflows. The deploy engine is pluggable via adapters — each target environment has a corresponding adapter that implements a standard interface.

4.9 Monitor / Verifier

Post-deployment, the verifier runs smoke tests, checks health endpoints, monitors error rates, and validates that the acceptance criteria are met in the live environment. If issues are detected, it feeds structured failure reports back to the orchestrator to trigger a new iteration.

4.10 Feedback & Memory Loop

After a story completes (or fails), the feedback loop captures lessons: which patterns worked, common error types encountered, effective fix strategies, and performance metrics. This data is stored in the Memory Store and used to improve future planning and code generation through retrieval-augmented generation (RAG).

5. Codebase Intelligence Engine

A truly autonomous agent cannot generate coherent code without deeply understanding the existing codebase. The Codebase Intelligence Engine is the agent's "memory of the project" — it indexes, analyzes, and provides semantic access to the full repository so that every generated line of code is contextually aware.

5.1 Multi-Layer Indexing

The engine builds and maintains several complementary indexes that are refreshed incrementally as the agent makes changes:

Index LayerWhat It CapturesUse Case
AST IndexAbstract syntax trees for every source file — functions, classes, interfaces, type signatures, exportsUnderstanding function signatures, class hierarchies, available APIs
Dependency GraphImport/require relationships, module boundaries, circular dependency detectionImpact analysis — "if I change this file, what else breaks?"
Semantic EmbeddingsVector embeddings of code chunks (functions, classes, blocks) using a code-specific embedding modelNatural language search — "find the function that handles user authentication"
Symbol TableGlobal registry of all exported symbols, their types, file locations, and usage frequencyAuto-import resolution, avoiding name collisions
Convention ProfileDetected patterns: naming conventions, file structure, error handling patterns, test patterns, comment styleEnsuring generated code matches the project's style and idioms
API Surface MapAll HTTP endpoints, GraphQL resolvers, CLI commands, event handlers exposed by the projectUnderstanding the project's external interface before adding to it

5.2 Incremental Re-indexing

After each code generation step, the engine performs a targeted re-index of only the changed files and their direct dependents. This keeps the index fresh without the cost of a full rebuild. The incremental strategy uses file hashes to detect changes and a dependency-aware invalidation algorithm to propagate updates through the graph.

5.3 Convention Detection & Enforcement

On first encountering a project, the engine runs a one-time "convention scan" that analyzes existing code to build a ProjectProfile:

interface ProjectProfile {
  language: string;                     // "typescript", "python", etc.
  framework: string;                    // "nextjs", "django", "express"
  package_manager: string;              // "npm", "yarn", "pnpm", "pip"
  test_framework: string;              // "jest", "pytest", "vitest"
  naming_conventions: {
    files: "kebab-case" | "camelCase" | "PascalCase" | "snake_case";
    functions: "camelCase" | "snake_case";
    classes: "PascalCase";
    constants: "UPPER_SNAKE_CASE";
  };
  directory_structure: DirectoryPattern;  // e.g., "src/[domain]/[type].ts"
  error_handling_pattern: string;        // "try-catch" | "Result type" | "error callbacks"
  import_style: "named" | "default" | "barrel";
  lint_config?: LintConfig;             // Parsed .eslintrc / pyproject.toml
  formatting_config?: FormatConfig;     // Parsed .prettierrc / black config
}

Every code generation prompt includes a condensed version of this profile, ensuring the agent writes code that looks like a human teammate wrote it — not like AI-generated boilerplate.

5.4 Semantic Code Search

When the Code Generator or Planner needs to find relevant code, it queries the Codebase Intelligence Engine using natural language. The engine combines vector similarity search (for semantic matches) with AST-based structural search (for precise type/signature matches) and returns ranked results with surrounding context. This is far more effective than naive string matching for understanding how existing code works.

6. Tool & Environment Interaction Layer

The agent doesn't just generate code — it must interact with a full development environment: running shell commands, managing files, using git, installing packages, and operating build tools. The Tool Layer provides a secure, structured interface for all these operations.

A useful architectural reference is SWE-agent's Agent-Computer Interface (ACI): rather than exposing an unconstrained Linux shell, it provides a small, curated action set for viewing, searching, and editing — and gives concise feedback at every step with guardrails to avoid common mistakes. This demonstrates that the interface to the environment materially impacts both success rate and safety. Tool integration at scale becomes a bottleneck when every model/app pair needs bespoke glue. MCP (Model Context Protocol) addresses this as an open protocol that standardises how LLM applications connect to external tools and data sources, reducing fragmentation. In this system, MCP is especially valuable at two boundaries: the context boundary (retrieving repository state, tickets, logs, deployment manifests) and the action boundary (invoking CI/CD, infra changes, secret rotation, feature-flag toggles) — all with uniform discovery and invocation semantics.

6.1 Tool Catalog

ToolOperationsRisk LevelApproval Required
File Systemread, write, create, delete, move, list, searchLow-MediumOnly for delete outside sandbox
Shell Executorrun_command with stdout/stderr capture, timeout, resource limitsMediumAllowlist-based; unknown commands flagged
Git Clientclone, branch, commit, push, diff, log, merge, rebaseMediumForce-push and rebase require approval
Package Managerinstall, update, remove, audit (npm, pip, cargo, etc.)MediumMajor version upgrades require approval
Build Toolsbuild, lint, format, type_checkLowNo
Test Runnerrun_all, run_file, run_single, coverageLowNo
Database Clientmigrate, seed, query (test DB only)HighMigrations always require approval
HTTP Clientrequest to whitelisted URLs (APIs, package registries)MediumNew domains require approval
Container Runtimebuild_image, run_container, compose_upMediumPort exposure requires approval

6.2 Command Execution Model

Every tool invocation goes through a structured pipeline: the agent emits a tool call as a structured JSON object, the Tool Layer validates it against the allowlist and risk policy, executes it in the sandbox, and returns a structured result including exit code, stdout, stderr, execution time, and any side effects (files changed, packages added).

// Tool call from agent
{
  "tool": "shell",
  "command": "npm run test -- --coverage",
  "working_dir": "/workspace/project",
  "timeout_ms": 60000,
  "env": { "NODE_ENV": "test" }
}

// Structured result
{
  "exit_code": 1,
  "stdout": "Test Suites: 3 passed, 1 failed, 4 total\n...",
  "stderr": "",
  "duration_ms": 12340,
  "side_effects": {
    "files_written": ["coverage/lcov-report/index.html"],
    "files_modified": []
  },
  "parsed": {
    "type": "test_report",
    "passed": 3, "failed": 1, "total": 4,
    "coverage": 87.3,
    "failures": [
      { "suite": "PaymentProcessor", "test": "handles declined cards", "error": "..." }
    ]
  }
}

6.3 Output Parsing & Structured Extraction

Raw stdout/stderr from tools is parsed by specialized extractors that understand common output formats: Jest/Vitest test reports, ESLint/Pylint warnings, TypeScript compiler errors, Docker build logs, and more. This transforms noisy terminal output into structured data the LLM can reason about efficiently — reducing token waste and improving fix accuracy.

6.4 Resource Limits & Sandboxing

Every tool execution runs inside the sandbox with hard limits: maximum CPU time (configurable, default 5 minutes per command), memory cap (2GB), disk quota (10GB), and network restrictions. Long-running processes are monitored and killed if they exceed limits. The agent receives a structured timeout error and can decide to retry with different parameters or escalate.

7. Core Agent Loop & State Machine

RECEIVED ANALYZING PLANNING CODING REVIEWING TESTING DEPLOYING VERIFYING DONE FAILED pass test fail issues verify fail max retries Forward progress Retry / rollback on failure Success path
Figure 2 — State machine for story processing. Failures trigger retries back to the CODING state. Max retries exhausted → FAILED.

7.1 Iteration Behavior

The core loop follows a generate → validate → fix cycle, consistent with the ReAct (Reasoning + Acting) pattern which interleaves reasoning and actions with observations. ReAct motivates this interleaving specifically because observations from tools help reduce hallucination and update plans dynamically. When a test fails or the reviewer flags issues, the orchestrator packages the error context (stack traces, reviewer comments, failing test output) and sends it back to the Code Generator along with the original task. This retry includes progressively more context with each attempt — on the first retry, just the error; by the third retry, all previous attempts and their failure modes.

7.2 Convergence Strategy

To prevent infinite loops, the agent uses several convergence mechanisms: a hard maximum iteration count per story (configurable, default 10), a diminishing-returns detector that flags when fixes are oscillating, and an escalation path that can request human intervention or decompose the story into smaller sub-stories.

7.3 Tiered Autonomy: Agentless vs. Fully Agentic

Research such as the Agentless approach argues for a simpler three-phase pipeline (localise → repair → validate) rather than complex autonomous tool-using loops, demonstrating that reducing degrees of freedom can improve reliability. This system combines both insights using tiered autonomy: deterministic cores handle safety-critical steps (dependency graphing, static checks, build execution, test running, deployment procedures, policy enforcement — no LLM discretion), while LLM discretion is reserved for where creativity and synthesis matter (requirement interpretation, design alternatives, patch proposals, test drafting, log triage).

8. Context Management Strategy

The single biggest technical challenge for an autonomous coding agent is context management. Real projects have thousands of files, but LLM context windows are finite. The agent must intelligently select what context to include in each prompt to maximize code quality while staying within token budgets.

8.1 Hierarchical Context Architecture

Context is organized in layers, from always-present global context to dynamically-loaded file-level detail:

LayerContentsToken BudgetLifetime
System LayerAgent instructions, tool definitions, output format specs~2,000Always present
Project LayerProjectProfile, directory tree summary, key conventions, tech stack~1,500Per-project, refreshed on changes
Story LayerCurrent user story, acceptance criteria, execution plan, previous iteration outcomes~2,000Per-story
Task LayerCurrent task description, target file(s), relevant imports/dependencies~1,000Per-task
Code LayerFull content of target files, referenced files, type definitions~8,000-20,000Dynamic per-task
Error LayerPrevious attempt code, error messages, stack traces (on retries only)~3,000Per-retry
Memory LayerRAG-retrieved lessons from past stories (patterns, anti-patterns, fixes)~1,500Dynamic, relevance-scored

8.2 Smart File Selection

When generating or modifying code, the agent must decide which files to include in context. The selection algorithm uses multiple signals:

  • Direct dependencies: Files imported by the target file (always included)
  • Type definitions: Interfaces, types, and schemas referenced by the task (always included)
  • Sibling patterns: Similar files in the same directory that demonstrate the project's patterns (included for new file creation)
  • Semantic relevance: Files returned by vector search for the task description (top-k, with relevance threshold)
  • Recency: Files modified in the current story get a boost (they contain the agent's own recent changes)
  • Usage frequency: Frequently-imported utility files and shared types are prioritized

8.3 Context Compression Techniques

  • Skeleton extraction: For large files, include only function/class signatures with docstrings, omitting implementation bodies unless directly relevant
  • Diff-only context: On retries, include only the diff of previous changes plus error output, not the full file again
  • Type-only imports: For dependency files, include only the type signatures the target file actually uses
  • Summary substitution: For very large modules, replace the full source with a generated summary of its API surface

8.4 Context Window Overflow Handling

When the selected context exceeds the model's token limit, the agent applies a priority-based eviction strategy: Memory Layer items are evicted first (lowest relevance score), then semantic search results, then sibling patterns. Direct dependencies and the target file are never evicted. If the task still doesn't fit, the Planner is asked to decompose it into smaller sub-tasks that each operate on fewer files.

9. Error Taxonomy & Recovery Strategies

An autonomous agent encounters many kinds of errors. Treating them all the same ("retry and hope") leads to wasted iterations and poor convergence. The Error Taxonomy classifies errors and maps each type to a specific recovery strategy.

9.1 Error Classification

CategoryExamplesRecovery StrategyTypical Fix Attempts
Syntax Errors Parse failures, missing brackets, invalid syntax Direct fix — usually a single-line correction. Include the exact error location and message. 1
Type Errors Type mismatches, missing properties, wrong argument types Include the full type definitions involved. The agent often needs to see both the caller and callee types. 1-2
Import / Resolution Errors Module not found, missing exports, circular dependencies Consult the Symbol Table and Dependency Graph. May require creating missing modules or fixing export paths. 1-2
Logic Errors Tests fail with wrong output, unexpected behavior Hardest to fix. Include test expectations, actual output, and the algorithm description. May require plan revision. 2-4
Integration Errors API contract mismatches, database schema issues, env config missing Load broader context: API specs, schema definitions, environment configuration. May need multiple file changes. 2-3
Environment Errors Missing packages, wrong runtime version, disk full, permission denied Execute environment repair commands (install packages, update configs). Do not retry code generation — the code is fine. 1
Flaky / Non-Deterministic Race conditions, timing-dependent tests, random seed issues Re-run the same code up to 3 times before considering a code fix. Flag for human review if consistently flaky. 0 (re-run)
Architectural Errors Fundamental design flaw, wrong abstraction, circular coupling Escalate to Planner for a revised execution plan. Do not patch — redesign. N/A (replan)

9.2 Error Fingerprinting

The agent maintains a fingerprint database of errors it has seen before. Each error is hashed by its type, message pattern (with dynamic values stripped), and file location. When a previously-seen fingerprint recurs, the agent can immediately apply the known fix pattern from its Memory Store — skipping the expensive "reason about the error from scratch" step. This dramatically improves fix speed for common issues like missing null checks, wrong import paths, or uncaught promise rejections.

9.3 Oscillation Detection

A critical failure mode is when the agent's fix for Error A introduces Error B, and its fix for Error B reintroduces Error A. The orchestrator detects this by tracking a rolling window of error fingerprints. If the same fingerprint appears twice within 4 iterations, the agent is flagged as "oscillating" and the recovery strategy escalates: the agent is prompted with all previous attempts and their errors simultaneously, forcing it to find a solution that addresses all constraints at once. If oscillation persists, the story is decomposed or escalated to a human.

9.4 Progressive Context Enrichment

Each retry attempt includes more context than the previous one, following a progressive enrichment schedule:

  • Attempt 1: Standard context + error message
  • Attempt 2: + Full stack trace + the failing test source code
  • Attempt 3: + All previous code attempts + a diff showing what changed each time
  • Attempt 4: + Broader project context (related modules, test fixtures) + Memory Store patterns
  • Attempt 5: Switch to a more capable (top-tier) model with maximum context

10. Data Model

10.1 Core Entities

// ── Story ──
interface Story {
  id: string;                      // e.g., "STORY-142"
  raw_text: string;                // Original user story text
  spec: StorySpec;                 // Parsed structured specification
  plan: ExecutionPlan;             // Generated implementation plan
  state: StoryState;               // Current state in the FSM
  iterations: Iteration[];         // History of all attempts
  metadata: {
    created_at: DateTime;
    completed_at?: DateTime;
    total_tokens_used: number;
    total_cost_usd: number;
    model_usage: Record<string, number>;
  };
}

// ── StorySpec ──
interface StorySpec {
  title: string;
  description: string;
  acceptance_criteria: AcceptanceCriterion[];
  story_type: "feature" | "bugfix" | "refactor" | "infra";
  complexity_score: number;        // 1-10
  dependencies: FileDependency[];
  ambiguities: string[];           // Items needing clarification
}

// ── ExecutionPlan ──
interface ExecutionPlan {
  tasks: Task[];
  testing_strategy: TestingStrategy;
  deploy_strategy: DeployStrategy;
  estimated_total_tokens: number;
}

// ── Task ──
interface Task {
  id: string;
  action: "create_file" | "modify_file" | "delete_file" | "run_command";
  target: string;                  // File path or command
  description: string;
  depends_on: string[];            // Task IDs
  status: "pending" | "in_progress" | "done" | "failed";
  result?: TaskResult;
}

// ── Iteration ──
interface Iteration {
  number: number;
  started_at: DateTime;
  ended_at: DateTime;
  stage: StoryState;
  changes: FileChange[];           // Diffs applied
  test_results?: TestReport;
  review_results?: ReviewReport;
  deploy_results?: DeployReport;
  error_context?: ErrorContext;    // If this iteration was a retry
  model_calls: ModelCall[];
}

// ── ModelCall ──
interface ModelCall {
  id: string;
  model: string;                   // e.g., "claude-sonnet-4-6"
  provider: string;                // e.g., "anthropic"
  purpose: string;                 // e.g., "code_generation", "review"
  input_tokens: number;
  output_tokens: number;
  latency_ms: number;
  cost_usd: number;
}

10.2 Memory Store Schema

// ── MemoryEntry ── (for RAG-based self-improvement)
interface MemoryEntry {
  id: string;
  type: "pattern" | "error_fix" | "convention" | "anti_pattern";
  context: string;                 // When does this apply?
  content: string;                 // The actual lesson/pattern
  source_story_id: string;
  effectiveness_score: number;     // How often this helped (0-1)
  embedding: float[];              // Vector embedding for retrieval
  created_at: DateTime;
  last_used_at: DateTime;
  use_count: number;
}

11. API Specifications

11.1 Story Submission

POST /api/v1/stories

Request Body:
{
  "story": "As a user, I want to reset my password via email...",
  "project_id": "proj-abc",
  "priority": "high",
  "constraints": {
    "max_iterations": 8,
    "approval_gates": ["pre_deploy"],
    "target_branch": "feature/password-reset",
    "deploy_target": "staging"
  }
}

Response (202 Accepted):
{
  "story_id": "STORY-142",
  "status": "RECEIVED",
  "estimated_completion": "2026-03-26T15:30:00Z",
  "tracking_url": "/api/v1/stories/STORY-142"
}

11.2 Story Status

GET /api/v1/stories/{story_id}

Response:
{
  "story_id": "STORY-142",
  "state": "TESTING",
  "current_iteration": 2,
  "plan_summary": "3 tasks: create payment service, add route, write tests",
  "progress": {
    "tasks_completed": 2,
    "tasks_total": 3,
    "tests_passing": 12,
    "tests_failing": 1,
    "coverage_percent": 87.3
  },
  "timeline": [
    { "state": "RECEIVED",  "at": "2026-03-26T14:00:00Z" },
    { "state": "ANALYZING", "at": "2026-03-26T14:00:05Z" },
    { "state": "PLANNING",  "at": "2026-03-26T14:00:18Z" },
    { "state": "CODING",    "at": "2026-03-26T14:00:35Z" },
    { "state": "REVIEWING", "at": "2026-03-26T14:02:10Z" },
    { "state": "TESTING",   "at": "2026-03-26T14:02:45Z" }
  ],
  "cost_so_far_usd": 0.42
}

11.3 Approval Gate

POST /api/v1/stories/{story_id}/approve

Request Body:
{
  "gate": "pre_deploy",
  "decision": "approved",
  "comment": "Looks good. Deploy to staging."
}

Response (200 OK):
{
  "story_id": "STORY-142",
  "state": "DEPLOYING",
  "message": "Approval accepted. Deployment initiated."
}

11.4 Batch Submission

POST /api/v1/stories/batch

Request Body:
{
  "project_id": "proj-abc",
  "stories": [
    { "story": "As a user, I want to ...", "priority": "high" },
    { "story": "As an admin, I want to ...", "priority": "medium" }
  ],
  "execution_mode": "parallel"
}

12. LLM Integration Layer

12.1 Architecture

The LLM layer is abstracted behind a Model Router that selects the optimal model for each task based on configurable routing rules. This allows the system to use different models for different purposes — e.g., a fast, cheap model for simple code edits and a powerful model for complex architectural decisions.

Common Tool-Calling Contract: LLM-agnostic execution is easiest when the system standardises on a tool-calling contract: (1) provide the model a list of tools with schemas and policies, (2) accept a structured call from the model, (3) execute it in the application/runtime, (4) return tool outputs as observations, (5) repeat until the model returns a final result. Agent actions are expressed as a stable internal Intermediate Representation (IR), and each model provider adapter maps provider-specific calls to/from that IR — keeping the orchestration layer constant even when switching between hosted frontier models and self-hosted open-source models.
Model Router
Routes tasks to optimal models based on task type, complexity, cost budget, and latency requirements
🟣
Anthropic Adapter
Claude Opus, Sonnet, Haiku
🟢
OpenAI Adapter
GPT-4o, o1, o3
🔵
Open-Source Adapter
Llama, Mistral, DeepSeek, CodeLlama

12.2 Routing Rules

Task TypeRecommended TierRationale
Story AnalysisMid-tier (Sonnet / GPT-4o)Requires good reasoning but not peak coding ability
Architectural PlanningTop-tier (Opus / o3)Complex multi-step reasoning about system structure
Code GenerationTop-tier for complex, Mid for simpleComplexity score from analyzer determines routing
Test GenerationMid-tierTests follow predictable patterns; mid-tier is cost-effective
Code ReviewTop-tier (different provider than generator)Independent perspective catches more issues
Error Analysis & FixTop-tierDebugging requires strong reasoning and context retention
Commit Messages / DocsLow-tier (Haiku / small model)Straightforward text generation

12.3 Provider Interface

interface LLMProvider {
  name: string;

  complete(request: CompletionRequest): Promise<CompletionResponse>;
  stream(request: CompletionRequest): AsyncIterator<StreamChunk>;
  healthCheck(): Promise<HealthStatus>;
  getUsage(): Promise<UsageMetrics>;
}

interface CompletionRequest {
  model: string;
  messages: Message[];
  tools?: ToolDefinition[];
  max_tokens: number;
  temperature: number;
  system_prompt?: string;
  metadata: {
    story_id: string;
    task_id: string;
    purpose: string;
    attempt: number;
  };
}

interface ModelRouter {
  route(task: RoutingContext): SelectedModel;
  fallback(failed_model: string, task: RoutingContext): SelectedModel;
  experiment(task: RoutingContext, models: [string, string]): ExperimentResult;
}

12.4 Cost Management

Each story has a configurable cost budget (default: $5.00). The orchestrator tracks cumulative token usage and cost across all model calls. When 80% of the budget is consumed, the router switches to cheaper models. At 100%, the story is paused and escalated for human decision (continue with more budget, or mark as failed).

13. Sequence Diagrams

13.1 Happy Path — Story to Deployed Feature

User Orchestrator Analyzer LLM Router Sandbox Deploy Submit Story Analyze(story) Route → mid-tier StorySpec spec + plan Generate code → top-tier code + tests Review code → top-tier (alt) APPROVED Build + Run Tests All tests pass Approval gate Approved Deploy to staging Healthy — Smoke tests pass STORY COMPLETE
Figure 3 — Sequence diagram showing the happy path from story submission to successful deployment.

13.2 Retry Path — Test Failure

When tests fail, the orchestrator packages the test output (failing test names, stack traces, expected vs. actual values) and sends it back to the Code Generator as additional context. The generator receives the original task description, the code it previously generated, and the structured error — allowing it to produce a targeted fix rather than regenerating from scratch.

// Error context sent to LLM on retry
{
  "retry_context": {
    "attempt": 2,
    "previous_code": "... (the code that failed) ...",
    "error": {
      "type": "test_failure",
      "failing_tests": [
        {
          "name": "PaymentProcessor.processPayment should handle declined cards",
          "expected": "throws DeclinedCardError",
          "actual": "throws generic Error('Payment failed')",
          "stack_trace": "..."
        }
      ]
    },
    "instruction": "Fix the code to throw DeclinedCardError instead of generic Error for declined cards."
  }
}

14. Git & Source Control Workflow

An autonomous agent that writes code must also be a disciplined git user. The Git Workflow defines how the agent manages branches, commits, pull requests, and code review — ensuring its changes are traceable, reversible, and integrate cleanly with team workflows.

14.1 Branching Strategy

// Branch naming convention
feature/STORY-{id}/{short-description}
// Examples:
feature/STORY-142/password-reset-email
feature/STORY-203/add-payment-webhook
fix/STORY-187/null-check-user-profile

The agent creates a fresh branch from the target base branch for each story. All changes happen on this feature branch. The agent never commits directly to main or develop.

14.2 Commit Discipline

The agent follows an atomic commit strategy — each commit represents one logical change. Commits are made after each successful sub-task completion (not after each file change). The commit message follows the project's convention (detected by the Convention Profile) or defaults to Conventional Commits format:

feat(payments): add PaymentProcessor service with Stripe integration

- Implement charge, refund, and webhook handling
- Add error types for declined cards and insufficient funds
- Wire up to /api/checkout endpoint

Story: STORY-142 | Task: task-1 | Agent: auto-dev-v2

14.3 Pull Request Automation

When all tasks for a story are complete and tests pass, the agent creates a pull request with:

  • Title: Derived from the story title
  • Description: Summary of changes, files modified, acceptance criteria addressed
  • Test report: Coverage metrics, test results summary
  • Iteration report: Number of attempts, errors encountered and fixed, total cost
  • Labels: Auto-applied based on story type (feature, bugfix, refactor)
  • Reviewers: Assigned based on CODEOWNERS or project configuration

14.4 Merge Conflict Resolution

If the target branch has advanced since the agent started, it rebases its feature branch before creating the PR. For simple conflicts (non-overlapping changes), the agent resolves them automatically. For semantic conflicts (overlapping logic), the agent attempts resolution using both versions as context, runs the full test suite to verify, and flags any uncertain merges in the PR description for human review.

15. Multi-Agent Collaboration

Complex projects benefit from multiple specialized agents working together. The Multi-Agent layer defines how agents coordinate, share context, avoid conflicts, and review each other's work.

15.1 Agent Roles

RoleSpecializationWhen Used
Architect AgentSystem design, API contracts, schema design, dependency managementNew features that span multiple services or introduce new abstractions
Implementer AgentCode generation, test writing, bug fixing for a specific moduleMost stories — the primary workhorse
Reviewer AgentCode review, security audit, performance analysisAfter implementation, before merge — uses a different LLM provider for independence
DevOps AgentCI/CD configuration, Docker/K8s manifests, infrastructure-as-codeStories that require infra changes or deployment pipeline updates
Documentation AgentAPI docs, README updates, changelog entries, architecture decision recordsRuns in parallel after implementation is stable

15.2 Coordination Protocol

When multiple agents work on the same codebase, a Coordination Manager prevents conflicts:

  • File locking: Agents acquire advisory locks on files they intend to modify. If two agents need the same file, the higher-priority story goes first; the other waits or works on non-conflicting tasks.
  • Shared context bus: Agents publish their changes (file diffs, new exports, API changes) to a shared event bus. Other agents subscribe to changes in their dependency scope and update their local context accordingly.
  • Contract-first interfaces: When the Architect Agent defines an API contract (interface, schema), Implementer Agents code against the contract independently without needing to see each other's implementations.
  • Sequential merge queue: PRs are merged one at a time through a queue that runs the full test suite after each merge. If a merge causes test failures, it's reverted and the agent is notified to fix the conflict.

15.3 Agent-to-Agent Review

The Reviewer Agent operates on a different LLM provider than the Implementer Agent to ensure independent judgment. It receives the full PR diff, the story specification, and the project's coding standards. It produces structured feedback categorized as: blocking (must fix before merge), suggestion (recommended improvement), or nit (style preference). The Implementer Agent addresses blocking issues automatically and applies suggestions if they don't regress tests.

15.4 Human-Agent Handoff Protocol

When the agent cannot resolve an issue autonomously (max retries exceeded, ambiguous requirements, security-sensitive changes), it creates a structured handoff artifact:

{
  "handoff": {
    "story_id": "STORY-142",
    "reason": "oscillating_error",
    "summary": "Cannot resolve conflict between payment validation and...",
    "what_was_tried": [
      { "attempt": 1, "approach": "...", "result": "..." },
      { "attempt": 2, "approach": "...", "result": "..." }
    ],
    "current_state": {
      "branch": "feature/STORY-142/password-reset",
      "passing_tests": 47,
      "failing_tests": 2,
      "files_changed": ["src/services/payment.ts", "src/routes/checkout.ts"]
    },
    "suggested_next_steps": [
      "Review the interaction between PaymentValidator and StripeClient",
      "Consider whether the retry logic should be at the service or route level"
    ],
    "estimated_human_effort": "30-60 minutes"
  }
}

16. Safety, Guardrails & Human-in-the-Loop

16.1 Sandboxing

All code generation and execution happens inside isolated sandbox environments (containers, VMs, or cloud workspaces). The sandbox has no access to production data, secrets, or external networks beyond what is explicitly whitelisted per project. Sandboxes are ephemeral — they are created fresh for each story and destroyed after completion.

16.2 Approval Gates

GateDefaultTrigger
pre_codeDisabledRequires human approval of the execution plan before coding begins
pre_deployEnabledRequires human approval before deploying to any environment
pre_productionEnabled (locked)Always requires approval before production deployment. Cannot be disabled.
budget_exceededEnabledPauses when cost budget is exceeded and asks for authorization to continue
ambiguity_detectedEnabledPauses when the story has significant ambiguities that could lead to wrong implementation

16.3 Safety Boundaries

  • No secret access: The agent never has access to production secrets, API keys, or credentials. Test environments use mock/test credentials only.
  • No destructive operations: Database migrations, data deletions, and infrastructure teardowns require explicit human approval even in staging.
  • Rate limiting: Maximum LLM calls per story, maximum file operations per iteration, and maximum deploy attempts are all configurable and enforced.
  • Code scanning: Generated code passes through static analysis (SAST) before being committed. Known vulnerability patterns are blocked.
  • Rollback capability: Every deployment is associated with a rollback plan. If post-deploy verification fails, automatic rollback is triggered.

16.4 Policy Engine as a First-Class Subsystem

Because this agent can change production systems, it implements a policy engine that mediates every side-effecting action (file modification, network access, secret retrieval, deployment, rollback). The model proposes actions, but the application executes them — therefore the application must validate arguments, permissions, and context. The policy engine supports configurable checkpoints: a design checkpoint (approve plan and architecture changes before code is touched), a security checkpoint (approve changes impacting authn/authz, cryptography, payments, PII), and a release checkpoint (staging deployment is autonomous; production promotion requires approval unless risk is low and guardrails show high confidence).

16.5 AI Governance & NIST AI RMF

Beyond software supply chain security, the agent is itself an AI system operating across the lifecycle. The NIST AI Risk Management Framework (AI RMF 1.0) provides governance guidance emphasising mapping context, measuring risks, and managing them across the AI lifecycle. In practical terms, this translates into artefacts and controls that must be logged and audited: prompts/config used for each run, tool invocations and outputs, diffs and test results, deployment events, and runtime telemetry tied back to the change that produced it. This is consistent with autonomy research emphasising post-deployment measurement via behavioural telemetry, not only pre-deployment benchmarks.

Important: The agent should never be given credentials to production systems. All production deployments should go through an existing CI/CD pipeline that the agent triggers via API — not through direct access.

17. Quality Gates & Acceptance Criteria Verification

Before any story is marked as DONE, it must pass a series of automated quality gates. These gates go beyond "tests pass" to ensure the generated code meets production-quality standards. The agent treats testing as part of requirements satisfaction — BDD's origin explicitly frames behaviour specifications as a way to connect agile analysis to automated acceptance testing via Given–When–Then scenarios.

The test strategy follows the test pyramid principle: many low-level unit tests (fast, cheap) and fewer high-level "broad stack" tests (slower, more brittle). The agent invests effort accordingly — generating unit tests around new logic, adding integration tests for service boundaries, and reserving end-to-end tests for critical user journeys tied directly to acceptance criteria. When the agent fixes a defect, it writes a regression lock (a failing test first, then the fix), mirroring how SWE-bench tasks validate code changes against test suites derived from real repositories.

17.1 Gate Pipeline

GateThresholdToolBlocking?
Unit Tests Pass100% of related tests passTest RunnerYes
Code Coverage≥ 80% line coverage on new code (configurable)Coverage ReporterYes
Type CheckZero type errors in changed filestsc / mypy / equivalentYes
Lint & FormatZero lint errors; code matches project formatterESLint, Prettier, Black, etc.Yes
Security Scan (SAST)No high/critical vulnerabilities in new codeSemgrep / Bandit / CodeQLYes
Dependency AuditNo known CVEs in newly-added packagesnpm audit / pip auditYes
Bundle SizeNo more than 5% increase (for frontend projects)Webpack analyzer / esbuildWarning
Performance BenchmarkNo significant regression on existing benchmarksCustom benchmark runnerWarning
Acceptance CriteriaAll criteria from story spec are verified by testsCriteria MapperYes
DocumentationPublic APIs have JSDoc/docstrings; README updated if neededDoc CheckerWarning

17.2 Acceptance Criteria Verification

The Criteria Mapper is a critical component that connects each acceptance criterion from the story to one or more test cases. It works in two passes:

  • Forward pass (generation time): When the Test Generator creates tests, it tags each test with the acceptance criterion it validates using a @criterion annotation.
  • Backward pass (verification time): The Criteria Mapper checks that every acceptance criterion has at least one associated passing test. Unmapped criteria are flagged and additional tests are generated.

17.3 Regression Detection

The agent runs the full project test suite (not just new tests) to detect regressions. If existing tests fail after the agent's changes, these failures are treated as high-priority bugs and fixed before the story can proceed. The agent's changes are never allowed to break existing functionality — this is a hard constraint.

18. Deployment Architecture

18.1 Agent Runtime

The agent itself runs as a long-lived service with a task queue. Stories are submitted to the queue and processed by worker instances. Each worker runs one story at a time to maintain focused context. Horizontal scaling is achieved by adding more workers.

API Gateway
REST/GraphQL endpoint for story submission, status queries, and approval gates
Task Queue
Durable message queue (e.g., Redis, RabbitMQ, SQS) for story processing
Worker Pool
Stateless workers that pull stories from the queue and execute the agent loop
Sandbox Manager
Provisions and manages isolated execution environments for each story
State Store
Persistent database for story state, iteration history, and audit logs
Memory / Vector DB
Stores embeddings of past experiences for RAG-based self-improvement

18.2 Sandbox Environment

Each story gets a fresh sandbox containing: the project's git repository (cloned at the target branch), language runtimes and build tools, test frameworks, and a network policy that allows only whitelisted outbound connections (e.g., package registries). The sandbox is destroyed after the story completes.

Because the agent writes and executes code it produced, the platform must assume untrusted execution. Two proven isolation approaches are supported: Firecracker microVMs (lightweight VMs with improved isolation relative to containers while keeping performance closer to container speed) and gVisor (an application kernel interposed between the workload and the host for running untrusted code). A robust execution plane includes: (a) ephemeral, per-attempt workspaces; (b) network egress controls; (c) strict resource quotas; and (d) a controlled tool surface rather than raw shell-by-default — a "defence in depth" approach aligned with production isolation best practices.

19. Observability & Monitoring

19.1 Metrics

MetricDescriptionAlert Threshold
story_completion_rate% of stories reaching DONE state< 60% over 24h
avg_iterations_to_completeMean iteration count for successful stories> 5 (increasing trend)
avg_time_to_completeWall-clock time from RECEIVED to DONE> 30 min (p95)
cost_per_storyAverage LLM + compute cost per story> $3.00 (increasing trend)
test_pass_rate_first_attempt% of stories whose tests pass on first try< 40%
deploy_rollback_rate% of deployments that trigger automatic rollback> 20%
llm_error_rate% of LLM calls that fail (timeout, rate limit, error)> 5%

19.2 Logging & Tracing

Every story execution produces a trace consisting of all orchestrator decisions, LLM calls (with sanitized prompts/responses), test results, and deployment outcomes. Traces are stored for auditing and can be replayed for debugging. Each LLM call is tagged with the story ID, task ID, attempt number, and purpose for cost attribution.

19.3 OpenTelemetry Integration

To iterate autonomously after deployment, the system must observe production-like behaviour. OpenTelemetry provides standardised telemetry signals (logs, metrics, traces) with correlation context, enabling the agent to connect a regression symptom (e.g., a spike in errors) back to specific requests and spans. All agent-generated services are instrumented with OpenTelemetry SDKs, and the agent can query trace data to diagnose issues it detects during post-deployment verification.

19.4 SLOs, SLIs & Error Budgets

Acceptance criteria that reference runtime behaviour (e.g., "latency p95 under X ms") become monitoring assertions via Service Level Objectives (SLOs). An SLO is a target value for a Service Level Indicator (SLI), and error budgets quantify allowable unreliability. The agent's Spec Compiler translates performance-related acceptance criteria into SLO definitions that are monitored post-deployment — the story is not marked DONE until the SLO is met over the observation window.

19.5 DORA Metrics

For engineering feedback at the system level, the agent tracks DORA metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service. These metrics are especially useful for evaluating whether the agent's autonomy is truly improving delivery or merely increasing churn. DORA metrics feed into the self-improvement loop, informing decisions about model routing, prompt optimization, and convergence strategies.

20. Risks & Mitigations

RiskImpactLikelihoodMitigation
Infinite retry loops High (cost, time) Medium Hard iteration cap, diminishing-returns detector, cost budget ceiling
Generated code introduces security vulnerabilities High Medium Mandatory SAST scanning, code review by independent model, sandboxing
LLM hallucinations (inventing non-existent APIs) Medium High Codebase index provides real API surface; build step catches invalid imports; tests verify behavior
Cost overruns from complex stories Medium Medium Per-story budget caps, progressive model downgrading, complexity-based estimation
Ambiguous stories leading to wrong implementation High High Ambiguity detection triggers human clarification gate before coding begins
LLM provider outage Medium Low Multi-provider fallback chain; automatic failover in the Model Router
Sandbox escape or unintended side effects High Low Ephemeral containers, no production access, network whitelisting, resource limits
Context window overflow losing critical information Medium High Hierarchical context architecture, priority-based eviction, skeleton extraction, task decomposition
Multi-agent file conflicts and race conditions Medium Medium Advisory file locking, sequential merge queue, shared context bus, contract-first interfaces
Stale codebase index leading to wrong code references Medium Medium Incremental re-indexing after every code change, hash-based invalidation, dependency-aware propagation
Agent introduces subtle logic bugs that pass tests High Medium Cross-provider code review, acceptance criteria mapping, regression testing, human review gate on PRs
Dependency supply chain attacks via added packages High Low Dependency audit gate, package allowlisting, lockfile verification, major version approval requirement

21. Implementation Roadmap

Phase 1 — Foundation (Weeks 1-4)

  • Build the Orchestrator state machine with basic story lifecycle
  • Implement the LLM Router with support for at least two providers
  • Create sandbox provisioning with Tool Layer (shell, file system, package manager)
  • Build Story Analyzer and Planner components
  • Set up state database and basic API endpoints
  • Implement structured output parsing for common build/test tools

Phase 2 — Codebase Intelligence (Weeks 5-8)

  • Build AST indexer and Symbol Table for primary language (TypeScript or Python)
  • Implement Dependency Graph construction and impact analysis
  • Create Convention Profile detector (naming, structure, patterns)
  • Build semantic code search with vector embeddings
  • Implement hierarchical context management with smart file selection
Milestone 1: Agent can analyze an existing codebase, understand its conventions, and answer questions about its structure and APIs.

Phase 3 — Core Loop (Weeks 9-12)

  • Implement Code Generator with full codebase context awareness
  • Build Test Generator in both coverage and criteria modes
  • Implement Build/Test Runner with structured error parsing and classification
  • Wire up the retry loop with error taxonomy and progressive context enrichment
  • Add Code Reviewer component (cross-provider for independence)
  • Implement oscillation detection and convergence strategies
Milestone 2: Agent can take a simple user story ("Add a health check endpoint") and produce working, tested code in a sandbox. First end-to-end demo.

Phase 4 — Git & Quality (Weeks 13-16)

  • Implement Git workflow: branching, atomic commits, PR creation
  • Build the Quality Gates pipeline (coverage, SAST, lint, type check, dependency audit)
  • Implement Acceptance Criteria Mapper (forward + backward verification)
  • Add merge conflict detection and auto-resolution
  • Build the Human-Agent handoff protocol for escalation

Phase 5 — Deployment & Safety (Weeks 17-20)

  • Build Deploy Engine with at least two adapter types (container + serverless)
  • Implement Monitor / Verifier with smoke tests and health checks
  • Add all approval gates with notification system
  • Implement automatic rollback on post-deploy failures
  • Build the dashboard/UI for story tracking and approval

Phase 6 — Intelligence & Memory (Weeks 21-24)

  • Implement Memory Store with vector embeddings and error fingerprint database
  • Build the Feedback Loop for learning from past iterations
  • Add RAG-augmented prompting for code generation
  • Implement A/B testing framework for model comparison
  • Build observability dashboard with all key metrics

Phase 7 — Multi-Agent & Scale (Weeks 25-30)

  • Implement Multi-Agent coordination (Architect, Implementer, Reviewer, DevOps, Docs)
  • Build file locking, shared context bus, and sequential merge queue
  • Add multi-language sandbox support (Python, TypeScript, Go, Rust)
  • Implement batch story processing and parallel execution
  • Build cost optimization engine (dynamic model routing based on historical data)
  • Harden security: penetration testing of sandbox, audit logging review
  • Documentation, runbooks, and operational readiness
Milestone 3: Full autonomous loop — agent handles a backlog of user stories, coordinates multi-agent work, deploys to staging, monitors results, and self-improves over time.

22. Advanced Prompt Engineering & Reasoning Chains

The quality of an autonomous agent's output depends critically on how it constructs prompts. Unlike simple code-generation tools that use a single prompt, this agent uses a multi-stage reasoning pipeline with structured prompt templates, chain-of-thought decomposition, self-reflection loops, and tool-augmented reasoning.

22.1 Prompt Architecture

Every LLM call uses a layered prompt architecture. Prompts are not strings — they are structured objects assembled from reusable, versioned template components. This allows A/B testing of prompt strategies and ensures consistency across the system.

interface PromptTemplate {
  id: string;                          // e.g., "code-gen-v4.2"
  version: string;
  sections: PromptSection[];
  variables: Record<string, PromptVariable>;
  output_schema?: JSONSchema;          // Structured output specification
  reasoning_mode: "direct" | "chain_of_thought" | "self_reflect" | "tree_of_thought";
  max_output_tokens: number;
}

interface PromptSection {
  role: "system" | "user" | "assistant";
  template: string;                    // Handlebars-style template with {{variables}}
  priority: number;                    // For eviction under token pressure
  required: boolean;                   // Cannot be evicted if true
  cache_control?: "ephemeral";         // Enable prompt caching for stable prefixes
}

// Template registry with versioning and rollback
interface PromptRegistry {
  get(id: string, version?: string): PromptTemplate;
  promote(id: string, fromVersion: string, toVersion: string): void;
  rollback(id: string, toVersion: string): void;
  experiment(id: string, variants: [string, string], trafficSplit: number): void;
}

22.2 Chain-of-Thought Strategies

Different tasks require different reasoning strategies. The agent selects the optimal reasoning mode based on task complexity:

Reasoning ModeWhen UsedToken OverheadImplementation
Direct Generation Simple, well-defined tasks (rename variable, add import, write docstring) ~0% Single prompt → code output. No intermediate reasoning.
Chain-of-Thought Standard code generation and bug fixes ~30-50% Model reasons step-by-step before generating code. Reasoning is parsed and logged but not included in final output.
Self-Reflection Complex logic, security-sensitive code, retry attempts ~100% (two passes) First pass generates code. Second pass reviews the code against requirements and produces corrections. Final output merges both passes.
Tree-of-Thought Architectural decisions, ambiguous requirements, multiple valid approaches ~200-400% Model explores 2-3 candidate approaches in parallel, evaluates each against acceptance criteria, and selects the best. All branches are logged.
Debate / Multi-Perspective Code review, security audit, design decisions ~150% Two different model calls (possibly different providers) independently assess the same code. Disagreements are resolved by a third "judge" call or flagged for human review.

22.3 Self-Reflection Protocol

After generating code, the agent runs a structured self-reflection pass. This is not a simple "review your own code" — it follows a systematic checklist:

// Self-reflection prompt structure
{
  "reflection_axes": [
    {
      "axis": "correctness",
      "prompt": "Does this code correctly implement all acceptance criteria? Walk through each criterion and verify."
    },
    {
      "axis": "edge_cases",
      "prompt": "What edge cases are not handled? Consider: null/undefined inputs, empty collections, concurrent access, boundary values, error states."
    },
    {
      "axis": "security",
      "prompt": "Are there any injection vulnerabilities, unsanitized inputs, hardcoded secrets, or insecure patterns?"
    },
    {
      "axis": "consistency",
      "prompt": "Does this code follow the project's established patterns? Compare naming, error handling, and structure against the convention profile."
    },
    {
      "axis": "testability",
      "prompt": "Can the generated tests actually catch real bugs? Are there missing test cases for the edge cases identified above?"
    },
    {
      "axis": "performance",
      "prompt": "Are there O(n²) loops, unbounded queries, missing indexes, or memory leaks in this code?"
    }
  ],
  "output_format": {
    "issues_found": [{ "axis": "string", "severity": "critical|warning|info", "description": "string", "fix": "string" }],
    "confidence_score": "number (0-1)"
  }
}

If the confidence score falls below 0.7, the agent applies its own suggested fixes and re-runs the reflection. This typically catches 30-40% of issues before the formal code review stage.

22.4 Structured Output Enforcement

LLM outputs are unpredictable by nature. The agent uses multiple strategies to ensure outputs conform to expected formats:

  • JSON Schema constraints: Where the LLM provider supports it (e.g., Anthropic tool use, OpenAI function calling), outputs are constrained to a JSON schema. The LLM cannot produce invalid structures.
  • Post-processing parsers: For code outputs, custom parsers extract code blocks from markdown, validate syntax, and apply AST-level transformations (auto-import, format correction).
  • Retry on parse failure: If the output doesn't parse, the agent retries with a simplified prompt and explicit format examples. Parse failures are tracked — templates that cause frequent parse failures are flagged for revision.
  • Streaming validation: For long outputs, the agent validates partial output during streaming (e.g., checking bracket balance, import validity) and can abort early if the output is clearly going off track.

22.5 Few-Shot Example Management

The agent maintains a curated library of high-quality examples for each task type, drawn from the project's own codebase. When generating a new service, the prompt includes an abbreviated example of an existing well-structured service from the same project. These examples are dynamically selected based on similarity to the current task, not hardcoded — ensuring they stay relevant as the project evolves.

interface FewShotLibrary {
  // Retrieve the most relevant examples for a given task
  getExamples(task: Task, maxTokens: number): FewShotExample[];

  // After a story completes successfully, extract high-quality examples
  extractFromSuccess(story: Story): FewShotExample[];

  // Prune examples that are outdated or no longer match conventions
  prune(projectProfile: ProjectProfile): PruneResult;
}

interface FewShotExample {
  task_type: string;           // "create_service" | "add_endpoint" | "write_test" | etc.
  description: string;         // Natural language description of what this example shows
  input_context: string;       // Abbreviated task context
  output_code: string;         // The exemplary code
  quality_score: number;       // 0-1, based on test pass rate and review results
  source_story_id: string;
  created_at: DateTime;
  embedding: float[];          // For similarity search
}

23. Advanced Memory Architecture

A truly autonomous agent must learn from experience. The Memory Architecture goes beyond simple RAG — it implements a biologically-inspired multi-tier memory system with episodic, semantic, and procedural memory stores, automatic consolidation, and intelligent forgetting.

Reflexion Pattern: Self-improvement is designed as data-driven learning from runs, not uncontrolled self-modification. The Reflexion approach provides a concrete pattern: the agent reflects on feedback signals and stores "lessons" in an episodic memory buffer to improve future decisions without updating model weights. Each attempt is stored as an episode (story/spec version, plan, diffs, tool traces, failures, final outcome), structured learnings are extracted (e.g., "When tests fail with X stacktrace in module Y, run command Z and check config file W"), and these learnings are fed into RAG retrieval for future similar tasks and into heuristics for planning.

23.1 Memory Tiers

Memory TierAnalogyContentsRetentionAccess Pattern
Working Memory Short-term / scratchpad Current story context, in-progress task state, recent tool outputs Duration of current story only Always in context window
Episodic Memory Personal experiences Full execution traces of past stories: what was tried, what failed, what worked, in temporal order Last 500 stories, then summarized Retrieved by similarity to current task
Semantic Memory Factual knowledge Distilled patterns, rules, and facts: "In this project, database queries always use the repository pattern", "React components use named exports" Indefinite (with decay scoring) Matched against current context profile
Procedural Memory Skills / how-to Proven fix strategies: "When Jest mock fails with X error, the fix is Y". Reusable code patterns indexed by task type. Indefinite (reinforced by success) Triggered by error fingerprints or task type
Project Memory Institutional knowledge Per-project conventions, architectural decisions, known tech debt, team preferences, past PR review feedback Lifetime of the project Always loaded for matching project

23.2 Memory Consolidation Pipeline

After each story completes, a background consolidation process extracts durable knowledge from the ephemeral working memory and episodic traces:

interface ConsolidationPipeline {
  // Stage 1: Extract — identify significant events from the story trace
  extractSignificantEvents(story: CompletedStory): SignificantEvent[];

  // Stage 2: Generalize — convert specific events into reusable patterns
  generalize(events: SignificantEvent[]): MemoryCandidate[];

  // Stage 3: Deduplicate — merge with existing memories if similar enough
  deduplicate(candidates: MemoryCandidate[], existing: Memory[]): Memory[];

  // Stage 4: Score — assign initial effectiveness score based on story outcome
  score(memories: Memory[], storyOutcome: StoryOutcome): ScoredMemory[];

  // Stage 5: Store — persist to vector DB with embeddings
  store(memories: ScoredMemory[]): void;
}

// Example consolidation output
{
  "type": "procedural",
  "trigger": "error_fingerprint:TS2339_property_does_not_exist",
  "pattern": "When TypeScript reports 'Property X does not exist on type Y', check if: (1) the interface definition is imported from the correct file, (2) the property was recently renamed, (3) the type needs to be extended or cast",
  "success_rate": 0.87,
  "source_stories": ["STORY-142", "STORY-156", "STORY-189"],
  "generalized_from": 3
}

23.3 Memory Decay & Forgetting

Not all memories are equally useful. The system implements an intelligent forgetting mechanism based on a modified Ebbinghaus forgetting curve:

  • Effectiveness decay: Memories that haven't been used (retrieved and found helpful) within their expected window lose score. A procedural memory unused for 100 stories decays to 50% effectiveness.
  • Contradiction invalidation: When a new memory contradicts an existing one (e.g., "this project uses callbacks" vs. "this project uses async/await"), the newer memory wins and the older one is archived.
  • Compaction: Multiple similar memories are periodically merged into a single, higher-quality memory with a combined effectiveness score.
  • Hard limit: Each memory tier has a maximum capacity. When exceeded, lowest-scoring memories are evicted. This prevents unbounded memory growth and ensures retrieval stays fast.

23.4 Cross-Project Transfer Learning

Some memories are project-specific, but others are generalizable (e.g., "TypeScript strict null checks require guarding against undefined" applies everywhere). The memory system tags memories with a transferability score based on how project-specific the context is. High-transferability memories are stored in a global pool and made available across all projects, while project-specific memories stay scoped.

interface TransferAnalysis {
  // Analyze whether a memory is project-specific or generalizable
  analyzeTransferability(memory: Memory): {
    score: number;                     // 0 = very project-specific, 1 = universally applicable
    dependencies: string[];            // What project-specific context it relies on
    abstractable: boolean;             // Can the memory be abstracted to remove project specifics?
    abstracted_version?: Memory;       // The generalized version if abstractable
  };
}

24. Evaluation & Benchmarking Framework

An autonomous agent must be continuously measured. The Evaluation Framework defines how the system's performance is benchmarked, how model upgrades are tested, and how prompt changes are validated before rollout.

The architecture draws on established evaluation approaches: SWE-bench is a repository-level benchmark based on real GitHub issues where the agent must modify a repository and pass tests; SWE-bench Verified adds human validation to improve scoring reliability; and SWE-bench-Live proposes continuously updated benchmarks with dedicated Docker images for reproducibility and reduced contamination risk. The system includes a "shadow eval" lane: every major model/router/prompt change is tested against a fixed internal suite plus public benchmarks, and results are tracked over time as part of the agent's own release engineering.

24.1 Benchmark Suite

The agent maintains a curated set of benchmark stories — real user stories with known-good implementations. These benchmarks serve as regression tests for the agent itself: when any system component changes (prompt template, model version, routing rule), the benchmark suite runs to verify the change doesn't degrade performance.

Benchmark CategoryStoriesWhat It MeasuresKey Metric
Simple CRUD10 storiesBaseline competence: add endpoint, create model, write testFirst-attempt pass rate
Bug Fixes15 storiesError diagnosis, targeted fixes, regression avoidanceCorrect fix rate, iteration count
Refactoring8 storiesUnderstanding existing code, preserving behavior while improving structureTest preservation rate, code quality delta
Cross-Module Features10 storiesMulti-file changes, dependency management, integration testingIntegration test pass rate
Security-Sensitive8 storiesAuth flows, input validation, secret handlingSAST clean rate, vulnerability introduction rate
Performance-Critical5 storiesAlgorithmic efficiency, database query optimizationBenchmark regression rate
Adversarial10 storiesAmbiguous requirements, conflicting constraints, impossible tasksCorrect escalation rate (should request clarification)
Recovery8 storiesPre-seeded with common error states to test error recoveryRecovery success rate, iterations to fix

24.2 Automated Evaluation Metrics

interface EvalResult {
  benchmark_id: string;
  run_id: string;
  model_config: ModelConfig;
  prompt_versions: Record<string, string>;

  // Core metrics
  stories_attempted: number;
  stories_completed: number;
  completion_rate: number;              // stories_completed / stories_attempted
  first_attempt_pass_rate: number;      // % passing on first code generation
  avg_iterations: number;
  median_iterations: number;
  p95_iterations: number;

  // Quality metrics
  avg_test_coverage: number;            // % line coverage on generated code
  sast_clean_rate: number;              // % with zero SAST findings
  lint_clean_rate: number;              // % passing lint on first attempt
  regression_introduction_rate: number; // % that broke existing tests

  // Efficiency metrics
  avg_tokens_per_story: number;
  avg_cost_per_story: number;
  avg_wall_time_seconds: number;
  avg_llm_calls_per_story: number;

  // Comparison (if comparing two configs)
  comparison?: {
    baseline_run_id: string;
    delta: Record<string, number>;      // Metric name → change (positive = improvement)
    significant_changes: string[];       // Metrics with p < 0.05 difference
    recommendation: "promote" | "reject" | "needs_review";
  };
}

24.3 A/B Testing Protocol

Before any change to the agent's LLM configuration, prompt templates, or routing rules goes to production, it runs through a structured A/B test:

  • Shadow mode: The new configuration runs in parallel with the existing one on real stories (not benchmarks). Both configurations generate code independently, but only the existing config's output is used. Results are compared offline.
  • Canary rollout: If shadow mode shows improvement, the new config handles 10% of incoming stories for 48 hours. Metrics are monitored for regressions.
  • Full rollout: If canary metrics are positive, the new config becomes the default. The old config is retained for instant rollback.

24.4 Code Quality Scoring

Beyond pass/fail, the agent scores the quality of its generated code on multiple dimensions. This composite score is used to compare model performance and track quality trends over time:

interface CodeQualityScore {
  correctness: number;         // 0-1: Do all tests pass? Are acceptance criteria met?
  maintainability: number;     // 0-1: Cyclomatic complexity, function length, nesting depth
  consistency: number;         // 0-1: How well does the code match project conventions?
  security: number;            // 0-1: SAST score, known vulnerability patterns
  test_quality: number;        // 0-1: Coverage, mutation testing survival rate, assertion density
  documentation: number;       // 0-1: JSDoc/docstring presence, README updates
  composite: number;           // Weighted average of all dimensions

  // Weights are configurable per project
  static defaultWeights = {
    correctness: 0.30,
    maintainability: 0.20,
    consistency: 0.15,
    security: 0.15,
    test_quality: 0.15,
    documentation: 0.05
  };
}

25. Security Threat Model & Defense-in-Depth

An autonomous code-generating agent introduces unique security risks. This section provides a formal threat model using STRIDE methodology and defines the defense-in-depth architecture that mitigates each threat category.

25.1 STRIDE Threat Analysis

Threat (STRIDE)Attack VectorImpactDefense
Spoofing Attacker submits stories impersonating authorized user; hijacks agent session Unauthorized code changes mTLS for API access, JWT with short TTL, session binding to authenticated identity
Tampering Man-in-the-middle modifies LLM responses; attacker tampers with sandbox filesystem Injected malicious code TLS for all LLM API calls, immutable sandbox base images, filesystem integrity checks (hash verification post-build)
Repudiation Agent action cannot be traced to a specific story or user; audit trail is incomplete Accountability loss Immutable append-only audit log, cryptographically signed commits, full provenance chain from story → code → deploy
Information Disclosure Agent leaks source code to LLM provider; secrets embedded in generated code; codebase exfiltration via prompt injection IP theft, credential exposure Secret scanning pre-commit (trufflehog/detect-secrets), PII/secret stripping from LLM prompts, data retention agreements with providers, self-hosted model option
Denial of Service Adversarial story causes infinite loop or resource exhaustion; LLM rate limit abuse System unavailability Per-story resource budgets, circuit breakers, per-tenant rate limits, sandbox resource capping (CPU, memory, disk, network)
Elevation of Privilege Agent sandbox escape; LLM prompt injection causes agent to execute unauthorized commands; generated code contains backdoors System compromise Minimal sandbox privileges (non-root), command allowlisting, gVisor/Firecracker isolation, SAST + behavioral analysis of generated code, independent cross-provider review

25.2 Prompt Injection Defense

Because the agent feeds user-controlled content (user stories) and codebase content into LLM prompts, it is vulnerable to prompt injection attacks. The defense strategy operates at multiple layers:

  • Input sanitization: User stories are preprocessed to strip known injection patterns (e.g., "ignore all previous instructions", encoded commands). A classifier model scores injection probability before the story enters the pipeline.
  • Prompt isolation: User content is always placed in clearly delimited sections (XML tags with randomized nonces) that the system prompt instructs the model to treat as data, never as instructions.
  • Output validation: All LLM outputs are validated against expected schemas. Unexpected tool calls, file operations outside the project directory, or network requests to non-whitelisted URLs are blocked regardless of what the model outputs.
  • Behavioral monitoring: An anomaly detector tracks the agent's action patterns. Sudden deviations (e.g., the agent starts reading files outside the project, makes unusually many network requests, or generates code with obfuscation patterns) trigger an automatic pause and alert.

25.3 Supply Chain Security

The agent adds dependencies to projects. This creates a supply chain attack surface:

interface DependencyPolicy {
  // Allowlisted registries (default: npm, pypi, crates.io)
  allowed_registries: string[];

  // Package approval tiers
  auto_approve: {
    conditions: [
      "weekly_downloads > 100000",
      "published_more_than_90_days_ago",
      "no_known_cves",
      "has_type_definitions",
      "maintainer_count > 1"
    ]
  };
  manual_review: {
    conditions: [
      "weekly_downloads < 10000",
      "published_less_than_30_days_ago",
      "new_maintainer_in_last_90_days"
    ]
  };
  always_block: {
    conditions: [
      "known_malicious",
      "typosquat_detected",
      "install_scripts_present_and_obfuscated"
    ]
  };

  // Lockfile enforcement
  lockfile_policy: "require_exact_versions" | "allow_semver_range";

  // Maximum new dependencies per story
  max_new_deps_per_story: number;       // Default: 3
}

Build Artefact Integrity

When the agent builds artefacts (containers, packages), supply chain security controls are part of the "Definition of Done". The system implements:

  • SLSA Framework: An incrementally adoptable framework/checklist to prevent tampering and improve integrity across the software supply chain. The agent tracks provenance from source to deployed artefact.
  • SBOM Generation: Every build produces a Software Bill of Materials using CycloneDX, a full-stack BOM standard designed for cyber risk reduction, enabling downstream consumers to audit transitive dependencies.
  • Artefact Signing & Verification: Built images and packages are signed using cosign (part of the Sigstore project). Kubernetes admission controls via Sigstore policy-controller enforce image signature policies, ensuring only verifiable artefacts run in the cluster — even if internal processes fail.

25.4 Secret Management

The agent never has access to production secrets. Test environments use synthetic credentials managed by a dedicated secret provider:

  • Test credential rotation: Sandbox API keys and database passwords rotate every 24 hours and are injected via environment variables at sandbox creation time — never stored in code or config files.
  • Pre-commit secret scanning: Every generated file passes through detect-secrets and trufflehog before git commit. Matches trigger an immediate block and alert.
  • LLM prompt scrubbing: Before sending any code context to an LLM, a secret-detection pass removes potential credentials, replacing them with placeholder tokens. The original values are restored in the output via a post-processing step that never touches the LLM.

26. Performance Optimization & Caching

LLM calls are the bottleneck — both in latency and cost. The performance optimization layer reduces redundant computation through intelligent caching, speculative execution, and parallel processing strategies.

26.1 Multi-Level Cache Architecture

Cache LayerScopeTTLHit Rate (Typical)What It Caches
Prompt Prefix Cache Per-provider 5 minutes 60-80% Reuses cached system prompt + project context across multiple LLM calls for the same story. Anthropic and some providers support native prompt caching.
Embedding Cache Per-project Until file hash changes 90%+ Code embeddings keyed by file content hash. Avoids re-embedding unchanged files.
AST Cache Per-project Until file hash changes 95%+ Parsed ASTs keyed by file content hash. Parsing is CPU-intensive for large files.
Convention Cache Per-project Until config files change 99% Project conventions re-scanned only when lint/format config files or directory structure changes.
Tool Result Cache Per-sandbox Until relevant files change 40-60% Build outputs, test results, lint results. Invalidated by file changes in the relevant scope.
Semantic Search Cache Per-story Story duration 30-50% Vector search results for repeated queries within a story. Common when retrying tasks.

26.2 Speculative Execution

The orchestrator can predict likely next steps and start them before the current step completes:

  • Speculative test generation: While the Code Generator is producing implementation code, a parallel LLM call generates test skeletons based on the task description and acceptance criteria. The test code is finalized once the implementation is ready.
  • Pre-warm sandbox: When a story enters PLANNING state, the sandbox manager begins provisioning the environment (cloning repo, installing dependencies). By the time CODING begins, the sandbox is ready.
  • Predictive file loading: Based on the task description and dependency graph, likely-needed files are pre-loaded into memory and their embeddings pre-fetched before the Code Generator requests them.
  • Optimistic build: After code generation, the build starts immediately (optimistic path) while the code review runs in parallel. If the review finds blocking issues, the build result is discarded. If the review passes, the build result is ready immediately.

26.3 Parallel Processing Pipeline

// Tasks within a story that can run in parallel
interface ParallelizationPolicy {
  // Independent tasks (no shared files) can generate code in parallel
  code_generation: "parallel_if_independent" | "sequential";

  // Code review and test generation can overlap with build
  review_and_build: "parallel" | "review_first";

  // Multiple independent test suites can run concurrently
  test_execution: "parallel_suites" | "sequential";

  // Documentation generation runs alongside final verification
  documentation: "parallel_with_verification" | "after_verification";
}

// Example: for a story with 3 independent tasks
// Sequential: task1 → task2 → task3 → build → test → review → deploy
// Parallel:   [task1, task2, task3] → [build + review] → test → deploy
// Savings:    ~40% wall-clock time reduction for multi-task stories

26.4 Token Budget Optimization

Every token counts — both for cost and for context window utilization. The agent employs several strategies to minimize token usage without sacrificing output quality:

  • Diff-based context on retries: Instead of sending the full file again, send only the diff from the previous attempt plus the error. Saves 40-60% tokens on retry iterations.
  • Skeleton-first file loading: For reference files (not the target file), load only function signatures and type definitions. Full implementations are loaded only when the Code Generator explicitly requests them via a tool call.
  • Prompt template compression: System prompts are periodically reviewed and compressed — removing redundant instructions, consolidating examples, and shortening boilerplate. A 10% reduction in system prompt length compounds across thousands of calls.
  • Dynamic temperature: Simple tasks (rename, add import) use temperature 0 for deterministic output. Complex tasks use 0.3-0.5 for creativity. Retries after failure increase temperature slightly to explore different solution approaches.

27. Formal Verification & Property-Based Testing

Unit tests verify expected behavior for specific inputs. For critical code paths, the agent goes further — using property-based testing, mutation testing, and where applicable, formal verification to provide stronger correctness guarantees.

27.1 Property-Based Testing

Instead of testing specific input/output pairs, property-based tests define invariants that must hold for all possible inputs. The agent generates these properties from acceptance criteria:

// Traditional unit test (specific input)
test("add(2, 3) returns 5", () => {
  expect(add(2, 3)).toBe(5);
});

// Property-based test (invariant for ALL inputs)
test.property("add is commutative", (a: number, b: number) => {
  expect(add(a, b)).toBe(add(b, a));
});

test.property("add has identity element 0", (a: number) => {
  expect(add(a, 0)).toBe(a);
});

// Agent generates these from acceptance criteria:
// "The payment amount should always equal item price × quantity minus discount"
test.property("payment calculation invariant",
  (price: number, qty: number, discount: number) => {
    fc.pre(price > 0 && qty > 0 && discount >= 0 && discount <= price * qty);
    const payment = calculatePayment(price, qty, discount);
    expect(payment).toBe(price * qty - discount);
    expect(payment).toBeGreaterThanOrEqual(0);
  }
);

27.2 Mutation Testing

Mutation testing measures test suite quality by introducing small changes (mutations) to the generated code and checking whether tests catch them. If a mutation survives (tests still pass despite the code change), it indicates a gap in test coverage.

Mutation OperatorExampleWhat It Tests
Boundary mutation>>=Off-by-one errors
NegationisValid!isValidConditional logic
Return valuereturn resultreturn nullReturn value handling
Arithmetica + ba - bArithmetic correctness
Remove statementDelete a line of codeStatement necessity
Constant swap01, """x"Magic value handling

The agent targets a mutation score of ≥ 85% (85% of mutations are killed by tests). When the score is below threshold, the agent generates additional tests specifically targeting the surviving mutations.

27.3 Contract Testing for APIs

When the agent generates or modifies API endpoints, it creates contract tests that verify:

  • Request schema validation: The endpoint rejects malformed requests with appropriate error codes
  • Response schema compliance: The response matches the documented schema for all status codes
  • Idempotency: For applicable HTTP methods, repeated identical requests produce the same result
  • Error contracts: All error responses follow the project's standard error format
  • Backward compatibility: If modifying an existing endpoint, existing clients are not broken (new fields are additive, no removed fields, status codes preserved)

27.4 Static Analysis Beyond Linting

In addition to standard linters, the agent runs deeper static analysis on generated code:

  • Taint analysis: Tracks data flow from user inputs to security-sensitive sinks (SQL queries, shell commands, file paths) to detect injection vulnerabilities
  • Resource leak detection: Identifies unclosed database connections, file handles, HTTP clients, and event listeners
  • Null safety analysis: Beyond TypeScript's strict mode — detects potential null dereferences across function boundaries and async boundaries
  • Concurrency analysis: For multi-threaded or async code, detects potential race conditions, deadlocks, and shared state mutations without synchronization
  • Complexity gates: Functions with cyclomatic complexity > 15 or nesting depth > 4 are flagged for decomposition before the code is accepted

28. Advanced Deployment Strategies

The basic deploy engine supports push-to-staging. For production-grade autonomous deployment, the agent implements sophisticated release strategies that minimize blast radius and enable safe automated rollouts. Blue/green switches traffic between two environment versions for instant rollback, while canary gradually rolls out to a subset of users/traffic while monitoring key signals. For Kubernetes environments, controllers like Argo Rollouts recommend starting with blue/green as simpler, then moving to canaries as metrics maturity improves — this maps neatly to an autonomy roadmap: begin with conservative rollout patterns until the agent proves stable in your environment.

28.1 Progressive Delivery Pipeline

Build & Test All gates pass Staging Full smoke tests Canary (5%) 15 min observe Linear Ramp 5% → 25% → 50% Full Rollout 100% traffic Automatic rollback at any stage if error rate > threshold or latency p99 regresses > 20%
Figure 4 — Progressive delivery pipeline with automatic rollback at every stage.

28.2 Deployment Strategy Selection

StrategyWhen UsedRollback TimeRisk Level
Blue-Green Stateless services with fast startup. Full environment swap. < 30 seconds Low (instant rollback)
Canary High-traffic services where gradual validation is critical. < 1 minute Very Low (limited blast radius)
Feature Flags New features that need runtime toggles. Code deployed dark, activated separately. Instant (toggle off) Minimal
Rolling Update Kubernetes deployments with health checks. Pods replaced one at a time. < 5 minutes Low-Medium
Shadow / Dark Launch High-risk changes. New code receives production traffic copy but responses are discarded. N/A (not serving) None (no user impact)

28.3 Automated Rollback Triggers

interface RollbackPolicy {
  // Metric-based triggers (any one triggers rollback)
  error_rate_threshold: number;         // Default: 1% increase over baseline
  latency_p99_regression: number;       // Default: 20% increase
  latency_p50_regression: number;       // Default: 10% increase
  cpu_utilization_spike: number;        // Default: 30% increase sustained 5 min
  memory_leak_detection: boolean;       // Monotonically increasing memory over 10 min
  crash_loop_threshold: number;         // Default: 3 restarts in 5 minutes

  // Observation windows
  canary_observation_minutes: number;   // Default: 15
  ramp_step_observation_minutes: number;// Default: 10

  // Rollback behavior
  rollback_type: "instant" | "gradual"; // Instant: flip all traffic. Gradual: reverse the ramp.
  preserve_logs: boolean;               // Keep failed deployment logs for post-mortem
  notify_channels: string[];            // Slack, PagerDuty, email for rollback notifications
  auto_create_incident: boolean;        // Create incident ticket on rollback
}

28.4 Feature Flag Integration

For new features, the agent can generate code wrapped in feature flags. This decouples deployment from release — code ships dark and is activated separately, enabling instant kill-switch capability:

// Agent generates feature-flagged code automatically for high-risk features
async function processPayment(order: Order): Promise<PaymentResult> {
  if (await featureFlags.isEnabled("new-payment-processor", { userId: order.userId })) {
    return newPaymentProcessor.process(order);  // New implementation
  }
  return legacyPaymentProcessor.process(order); // Existing implementation
}

// The agent also generates:
// 1. Flag definition in the feature flag config
// 2. Tests for both code paths
// 3. Cleanup task to remove the flag after full rollout

29. Distributed Agent Orchestration

At scale, a single orchestrator becomes a bottleneck. The distributed orchestration layer enables multiple agent instances to process a backlog of stories concurrently while maintaining consistency, avoiding conflicts, and optimizing resource utilization across a cluster.

29.1 Cluster Architecture

Coordination Service (Raft Consensus)
Leader election, distributed locks, global state management, work assignment
Worker Node A
Orchestrator + Sandbox pool. Processing STORY-142, STORY-145
Worker Node B
Orchestrator + Sandbox pool. Processing STORY-143, STORY-147
Worker Node C
Orchestrator + Sandbox pool. Processing STORY-144 (complex, needs full resources)

29.2 Work Assignment & Scheduling

interface WorkScheduler {
  // Priority-weighted assignment considering:
  // - Story priority (user-defined)
  // - Estimated complexity (from analyzer)
  // - Resource requirements (model tier, sandbox size)
  // - Affinity (prefer workers that recently processed same project for warm caches)
  assign(story: Story, workers: WorkerStatus[]): WorkerAssignment;

  // Preemption: high-priority stories can pause low-priority ones
  preempt(highPriority: Story, workers: WorkerStatus[]): PreemptionPlan;

  // Load balancing: redistribute work when a worker becomes idle or overloaded
  rebalance(workers: WorkerStatus[]): RebalancePlan;
}

interface WorkerStatus {
  id: string;
  active_stories: number;
  max_concurrent: number;               // Based on available resources
  cpu_utilization: number;
  memory_utilization: number;
  active_sandbox_count: number;
  warm_project_caches: string[];         // Projects with warm AST/embedding caches
  current_llm_calls_in_flight: number;
  estimated_completion_times: Record<string, DateTime>;
}

29.3 Distributed Locking & Conflict Prevention

When multiple agents work on stories that touch the same codebase, conflicts must be prevented at the file level, not just the git level:

  • File-level advisory locks: Before modifying a file, an agent acquires a distributed lock (via the coordination service) on that file path. Locks are scoped to the project and branch. If the lock is held, the requesting agent either waits (for dependent work) or proceeds with non-conflicting tasks.
  • Semantic conflict detection: Beyond file-level locks, the system detects semantic conflicts: two stories that modify different files but affect the same API contract, database schema, or shared state. These are identified by the dependency graph and flagged for sequential processing.
  • Optimistic concurrency for reads: Reading files and running queries against the codebase index does not require locks. Writes use an optimistic concurrency model — if the file changed since the agent last read it, the write fails and the agent re-reads and re-generates.
  • Merge queue with CI verification: All completed stories enter a sequential merge queue. Each merge triggers a full CI run. If CI fails, the merge is rejected, the agent is notified, and it re-enters the fix loop with the merge conflict as context.

29.4 Failure Recovery & State Persistence

Worker nodes can fail at any time. The system ensures no work is lost:

  • Checkpointing: After each state transition (ANALYZING → PLANNING → CODING → etc.), the story's complete state is persisted to the distributed state store. This includes all generated code, test results, and iteration history.
  • Heartbeat monitoring: Workers send heartbeats every 10 seconds. If a worker misses 3 consecutive heartbeats, the coordination service marks it as failed and reassigns its stories to other workers.
  • Resumable execution: When a story is reassigned, the new worker loads the checkpoint and resumes from the last persisted state. LLM conversation history is reconstructed from the audit log. The sandbox is re-provisioned from the git branch state.
  • Exactly-once semantics: Tool executions (git commits, deployments, API calls) use idempotency keys to prevent duplicate actions if a story is replayed after a worker failure.

30. Cost Optimization Engine

LLM costs can escalate quickly at scale. The Cost Optimization Engine uses historical data, predictive modeling, and dynamic routing to minimize cost while maintaining quality targets.

30.1 Cost Model

interface CostModel {
  // Real-time cost tracking per story
  track(story_id: string, call: ModelCall): void;

  // Predict total cost for a new story based on similar past stories
  predict(spec: StorySpec): CostEstimate;

  // Recommend the cheapest model configuration that meets quality targets
  optimize(task: Task, qualityTarget: number): OptimalConfig;
}

interface CostEstimate {
  estimated_total_usd: number;
  confidence_interval: [number, number];  // 80% CI
  breakdown: {
    analysis_usd: number;
    planning_usd: number;
    code_generation_usd: number;
    test_generation_usd: number;
    review_usd: number;
    error_fix_usd: number;                // Based on expected retry count
    documentation_usd: number;
  };
  comparable_stories: string[];            // IDs of similar past stories used for estimation
  estimated_iterations: number;
}

30.2 Dynamic Model Routing

Instead of static routing rules, the cost optimizer learns which model to use for which task types based on historical performance data:

Optimization StrategyHow It WorksTypical Savings
Complexity-Based Routing Simple tasks (complexity ≤ 3) route to cheaper models. Only complex tasks use top-tier models. Threshold is learned from historical first-attempt pass rates. 30-50%
Cascade Strategy Start with the cheapest model. If the output fails validation or tests, escalate to the next tier. Most tasks succeed on the first tier. 20-40%
Speculative Smaller Model For retry attempts, first try a smaller/cheaper model with the enriched error context. Error fixes are often simpler than initial generation and don't need top-tier models. 15-25%
Batch Embedding Queue embedding requests and process them in batches rather than one-at-a-time. Embedding APIs often have lower per-token costs at batch scale. 10-20%
Off-Peak Scheduling For non-urgent stories, schedule LLM calls during off-peak hours when provider pricing may be lower or rate limits less constrained. 5-15%

30.3 Budget Allocation & Guardrails

interface BudgetPolicy {
  // Per-story limits
  max_cost_per_story: number;             // Default: $5.00
  warning_threshold: number;              // Default: 0.8 (80% of max)
  escalation_action: "pause" | "downgrade_model" | "notify_and_continue";

  // Per-project limits
  daily_budget: number;                   // Maximum daily spend per project
  monthly_budget: number;                 // Maximum monthly spend per project

  // Per-organization limits
  org_monthly_budget: number;

  // Allocation strategy
  priority_allocation: {
    high_priority_reserve: number;        // % of budget reserved for high-priority stories
    low_priority_max_model_tier: string;  // e.g., "mid" — low-priority stories can't use top-tier
  };

  // Cost alerts
  alerts: {
    per_story_exceeded: boolean;
    daily_80_percent: boolean;
    monthly_80_percent: boolean;
    cost_per_story_trend_increasing: boolean;
  };
}

30.4 ROI Tracking

The system tracks return on investment by comparing the agent's cost against estimated human developer cost for the same stories:

  • Time saved: Estimated hours a developer would spend on the story (based on complexity score and historical data from similar manual stories) × average developer hourly rate
  • Agent cost: Actual LLM + compute cost for the story
  • Quality delta: Comparison of agent-generated code quality scores against historical human-written code quality scores for similar stories
  • Iteration efficiency: Trend of iterations-to-completion over time, showing the agent's learning curve
Target ROI: The system targets a minimum 3x cost efficiency vs. manual development (i.e., agent cost should be less than 33% of equivalent human developer cost). Stories where the agent consistently fails to achieve this ratio are flagged for analysis — they may indicate task types better suited to human developers.