Harness Engineering

FUTRTECHLIFE COURSE

ONLINE COURSE · 12 LESSONS · LEARN AT YOUR PACE

Designing reliable long-running agents — from first principles to production.

Enroll from $799 · See pricing

Format: Online · self-paced
Lessons: 12 · learn at your pace
Video: ~75 min per lesson
Capstone: 1 project
Access: Lifetime
Level: Intermediate

[ WHY THIS COURSE ]

The model isn’t the bottleneck anymore. Models like Opus 4.6 and Sonnet 4.6 can do remarkable work in a single turn. They still fail at long-running tasks for a separate reason: the harness around them is wrong.

A harness is the closed-loop system that surrounds the model — the environment it works in, the artifacts that persist between sessions, the tools it can call, the verification it must pass before declaring victory, and the rules that constrain what it can do. Good models with bad harnesses produce demos. Good models with good harnesses ship products.

This course teaches the engineering discipline of building those systems. Not “prompt tricks.” Not “agent frameworks.” The actual primitives — context, state, tools, evaluation, orchestration — that determine whether your agent works at hour one and still works at hour twenty.

Who this is for

Working developers (any language; we standardize on Python + TypeScript) who have used Claude / ChatGPT but haven’t built agents.
Engineers who built a “quick agent demo” that worked once, then fell apart, and who want to know why.
Tech leads scoping agentic features who need a sharp mental model of what’s achievable, what’s risky, and what the work actually looks like.

Who this is not for

ML researchers wanting model internals — we treat the model as a black box.
Total beginners — you should be comfortable with git, the command line, async code, and reading API docs.
Builders who just want the fastest “wire-up-a-framework” tutorial. We deliberately build from primitives so the abstractions stop feeling like magic.

[ WHAT YOU WALK OUT WITH ]

A rebuilt /build-harness command — your capstone — that takes a one-line prompt and produces a working app via multi-agent orchestration.

A reusable toolkit: AGENTS.md, feature_list.json, claude-progress.md, init.sh, evaluator prompts, and a permission ruleset library.

Evaluation harnesses that catch regressions before they ship.

A vocabulary — initializer, generator, evaluator, sprint contract, context reset, compact boundary, tool registry — that lets you reason about agent systems precisely.

[ CURRICULUM ]

Twelve modules, twelve named failure modes, twelve primitives that fix them.

Each module is self-contained: a recorded lecture, a reading pack, a hands-on lab, and a deliverable you ship when you’re ready.

M01

Why capable agents still fail

We take a frontier model, give it a “build a clone of claude.ai” prompt, run it in a naive loop, and watch it fail in slow motion. Then we name the failure modes every later module addresses: one-shotting, context loss between sessions, premature victory, undocumented progress, environment drift.

Lab · Run a baseline agent against a non-trivial task. Record every place it gets stuck. Write a one-page failure mode taxonomy — your private rubric for the rest of the course.

Deliverable · failure-modes.md

M02

What a harness actually is

The harness is the environment plus the rules plus the verification plus the state. We dissect Claude Code’s architecture as a worked example — boot sequence, system prompt assembly, query loop, tool registry, permission layers, context compaction — and reduce it to a minimal mental model you can re-derive on a whiteboard.

Lab · Sketch the architecture of your last “agent” project. Identify what was the model, what was the harness, and where the harness was implicit (and therefore broken). Re-draw it with explicit components.

Deliverable · Architecture diagram + 1-page diagnosis

M03

The repository as system of record

State doesn’t live in the model’s context; it lives in files. Every fact the next session needs must be written down, in the repo, in a format the next agent can read. We cover why JSON suits state the model shouldn’t rewrite, why Markdown suits state it should append to, and why a clean git history is a harness primitive.
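
A minimal sketch of the JSON-versus-Markdown split, assuming the feature_list.json and claude-progress.md files named in the toolkit. The helper names load_state and append_progress are hypothetical, chosen for illustration only.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read JSON state. The agent reads this file but should not restructure it."""
    return json.loads(path.read_text(encoding="utf-8"))


def append_progress(path: Path, note: str) -> None:
    """Append a timestamped entry to the Markdown log, leaving earlier entries untouched."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with path.open("a", encoding="utf-8") as log:
        log.write(f"\n## {stamp}\n\n{note.strip()}\n")
```

Opening the log in append mode is the point of the design: a session can add to the record but cannot silently rewrite what an earlier session reported.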

Lab · Take the failure modes from Module 1 and design the file layout that would have prevented each one. Build the templates as a reusable starter pack.

Deliverable · harness-starter/ template repo

M04

Context engineering

The system prompt is composed, not written. We pull apart how a real prompt is assembled at runtime: role, tool schemas, project context, memory files, git state, OS info, dynamic injections. The principle: split context by lifecycle — constant, per-project, per-session, per-turn.
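
One way to picture lifecycle-split assembly is the sketch below. The name build_system_prompt matches this module's lab; the ContextSource type, the ordering rule, and the four-characters-per-token estimate are assumptions made for illustration, not a real tokenizer or the course's reference code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    name: str
    lifecycle: str  # "constant", "per-project", "per-session", or "per-turn"
    text: str


LIFECYCLE_ORDER = ("constant", "per-project", "per-session", "per-turn")


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token). An assumption, not a tokenizer."""
    return max(1, len(text) // 4) if text else 0


def build_system_prompt(sources: list[ContextSource]) -> tuple[str, dict[str, int]]:
    """Order sources from most stable to most volatile and report per-source cost."""
    ordered = sorted(sources, key=lambda s: LIFECYCLE_ORDER.index(s.lifecycle))
    prompt = "\n\n".join(f"# {s.name}\n{s.text}" for s in ordered if s.text)
    budget = {s.name: estimate_tokens(s.text) for s in ordered}
    return prompt, budget
```

Placing stable sources first keeps the volatile material at the end of the prompt, and the per-source budget shows where trimming would pay off.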

Lab · Build a build_system_prompt() function that assembles a prompt from five composable sources. Measure the token cost of each. Trim ruthlessly.

Deliverable · prompt_builder + token budget worksheet

M05

Tool design and the tool registry

Tools are the verbs of the harness. Bad tools turn agents into guessers. Schema design that makes misuse hard, descriptions that double as in-context training, parallel-safe execution, primitive vs. high-level operations, and the difference between a tool and an MCP server. We dissect a 43-tool registry.
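
As a rough illustration of what a registry entry carries, here is a small sketch. The Tool and ToolRegistry types are hypothetical, the schema follows JSON Schema conventions only loosely, and the argument check covers required and unknown keys rather than full validation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str               # read by the model, so written like a prompt
    input_schema: dict[str, Any]   # JSON-Schema-style object description
    handler: Callable[..., Any]
    parallel_safe: bool = False    # True only if concurrent calls cannot conflict


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def call(self, name: str, /, **kwargs: Any) -> Any:
        """Reject calls with missing or unknown arguments before running the handler."""
        tool = self._tools[name]
        missing = [k for k in tool.input_schema.get("required", []) if k not in kwargs]
        unknown = [k for k in kwargs if k not in tool.input_schema.get("properties", {})]
        if missing or unknown:
            raise ValueError(f"{name}: missing={missing} unknown={unknown}")
        return tool.handler(**kwargs)

    def schemas(self) -> list[dict[str, Any]]:
        """The part of each tool the model sees: name, description, schema."""
        return [
            {"name": t.name, "description": t.description, "input_schema": t.input_schema}
            for t in self._tools.values()
        ]
```

Failing loudly on a malformed call gives the agent an error it can correct, instead of a handler that guesses at what was meant.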

Lab · Design a 5-tool registry for a specific domain (refactoring, research summarization, or support triage). Write the descriptions as if they were prompts — because they are.

Deliverable · tools.json + tool-design rationale

M06

The initializer pattern

The first session is not like the others. The two-prompt harness: an initializer agent that sets up the environment once, and a coding agent that makes incremental progress every subsequent session. The initializer expands the terse prompt into a feature list, writes init.sh, seeds claude-progress.md, commits initial state.
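
The scaffolding half of an initializer might look like the sketch below. It assumes the feature descriptions have already been produced by the model and are passed in as plain strings; the scaffold_project name, the id and description fields, and the placeholder init.sh body are hypothetical, and the initial git commit is left out.

```python
import json
from pathlib import Path


def scaffold_project(root: Path, prompt: str, features: list[str]) -> None:
    """Write the files a later coding session expects to find."""
    root.mkdir(parents=True, exist_ok=True)

    # Every feature starts as not passing; later sessions flip the flag.
    feature_list = [
        {"id": i, "description": text, "passes": False}
        for i, text in enumerate(features, start=1)
    ]
    (root / "feature_list.json").write_text(
        json.dumps(feature_list, indent=2) + "\n", encoding="utf-8"
    )

    # Placeholder script: a real initializer would install dependencies here.
    init_script = root / "init.sh"
    init_script.write_text(
        "#!/usr/bin/env bash\nset -euo pipefail\necho 'environment ready'\n",
        encoding="utf-8",
    )
    init_script.chmod(0o755)

    (root / "claude-progress.md").write_text(
        f"# Progress log\n\nOriginal prompt: {prompt}\n\nNo sessions recorded yet.\n",
        encoding="utf-8",
    )
```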

Lab · Build an initializer that takes a 1-sentence prompt (“build a markdown-based bug tracker”) and produces a fully-scaffolded project ready for a coding agent to take over.

Deliverable · initializer/ — runnable initializer agent

M07

Feature lists as harness primitives

feature_list.json is more than a TODO. It’s the contract that prevents premature victory. Feature granularity, the passes: false starting state, why the model may only flip a feature’s passes flag (never edit the tests themselves), category taxonomies (functional / visual / behavioral), and end-to-end testable features.
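
A harness can enforce the flip-only rule mechanically. The sketch below compares the feature list before and after a session; the validate_feature_update name and the id field are assumptions, while the passes field comes from the module description.

```python
def validate_feature_update(before: list[dict], after: list[dict]) -> list[str]:
    """Return violations. An empty list means only `passes` values changed."""
    if len(before) != len(after):
        return [f"feature count changed: {len(before)} -> {len(after)}"]

    violations = []
    for old, new in zip(before, after):
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        if old_rest != new_rest:
            violations.append(f"feature {old.get('id')}: fields other than 'passes' were edited")
        elif not isinstance(new.get("passes"), bool):
            violations.append(f"feature {old.get('id')}: 'passes' must be true or false")
    return violations
```

Run after each session, a check like this would reject a commit that reworded or deleted a feature to make it easier to pass.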

Lab · Decompose a real product spec into a 50+ feature list. Have your initializer reproduce the decomposition. Compare.

Deliverable · feature_list.json (50+) + decomposition rubric

M08

Verification: the generator / evaluator split

Agents grading their own work skew positive. The fix: split the agent doing the work from the agent judging it. In the GAN-inspired pattern, a generator produces work, and an evaluator with its own tools, rubric, and skeptical disposition grades it and writes critique. We calibrate the evaluator with few-shot examples and learn to spot evaluator drift.
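
The control flow of the split can be sketched without any model calls by passing the two agents in as plain functions. The four criteria are the ones used in this module's lab; the Critique type, the 1 to 5 scale, and the passing threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

CRITERIA = ("design", "originality", "craft", "functionality")


@dataclass
class Critique:
    scores: dict[str, int]  # criterion -> score on an assumed 1-5 scale
    notes: str              # structured feedback handed back to the generator


def run_loop(
    generate: Callable[[str, str], str],       # (task, feedback) -> work
    evaluate: Callable[[str, str], Critique],  # (task, work) -> critique
    task: str,
    passing_score: int = 4,
    max_rounds: int = 5,
) -> tuple[str, list[Critique]]:
    """Alternate generator and evaluator until every criterion passes or rounds run out."""
    history: list[Critique] = []
    feedback = ""
    work = ""
    for _ in range(max_rounds):
        work = generate(task, feedback)
        critique = evaluate(task, work)
        history.append(critique)
        # Assumes the evaluator scored at least one criterion.
        if min(critique.scores.values()) >= passing_score:
            break
        feedback = critique.notes
    return work, history
```

The generator never sees its own score, only the critique text, and the loop stops on the weakest criterion rather than the average.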

Lab · Build an evaluator that grades a frontend across four criteria — design, originality, craft, functionality — and writes structured critique back to the generator. Run the loop 5+ times and watch output evolve.

Deliverable · evaluator/ + calibration set of 10 graded examples

M09

Long-running sessions: handoffs, compaction, resets

Sessions are short. Projects are long. The three mechanisms that bridge sessions: handoff artifacts, compaction, and context resets. When to use which, what “context anxiety” looks like, and the “get my bearings” boot sequence — pwd → progress → features → git log → init.sh test — before any new work.
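
The bearings sequence can be sketched as one function that gathers everything before new work starts. This assumes the file names from the toolkit, reads "init.sh test" as running init.sh as a smoke check, and uses hypothetical names (get_bearings, run_command). The runner is injectable so the sequence can be exercised without git installed.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable


def run_command(cmd: list[str], cwd: Path) -> str:
    """Run a command and return its output; stderr is used when stdout is empty."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return (result.stdout or result.stderr).strip()


def get_bearings(
    root: Path, runner: Callable[[list[str], Path], str] = run_command
) -> dict:
    """Collect pwd, progress, features, git log, and an init.sh check, in that order."""
    cwd = str(root.resolve())
    progress = (root / "claude-progress.md").read_text(encoding="utf-8").splitlines()
    features = json.loads((root / "feature_list.json").read_text(encoding="utf-8"))
    commits = runner(["git", "log", "--oneline", "-10"], root)
    init_check = runner(["bash", "init.sh"], root)
    return {
        "cwd": cwd,
        "progress_tail": progress[-20:],  # most recent notes only
        "failing_features": [f for f in features if not f.get("passes")],
        "recent_commits": commits,
        "init_check": init_check,
    }
```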

Lab · Take a half-finished, deliberately-broken project and write the bearings sequence + handoff artifacts that let a fresh agent recover state, identify the issues, and resume.

Deliverable · recovery-runbook.md + recovered project

M10

Observability inside the harness

If you can’t see what the agent did, you can’t fix what it broke. Tool-call logging, post-sampling hooks (compact, memory extract), token accounting, cost dashboards, replay tools, and structured traces. Observability lives inside the harness and shapes how you debug, tune, and trust the system.
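
A minimal sketch of tool-call logging and token accounting, assuming an append-only JSONL file as the trace format. The TraceLog class, its event fields, and the usage counters are hypothetical; a real harness would take token counts from the API response.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


class TraceLog:
    """Append-only JSONL trace of tool calls and token usage."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _write(self, event: dict[str, Any]) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event, default=str) + "\n")

    def traced_call(self, tool: str, handler: Callable[..., Any], /, **kwargs: Any) -> Any:
        """Run a tool handler and record its arguments, status, and duration."""
        start = time.perf_counter()
        status = "ok"
        try:
            return handler(**kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            self._write({
                "type": "tool_call", "tool": tool, "args": kwargs,
                "status": status, "seconds": round(time.perf_counter() - start, 4),
            })

    def record_usage(self, input_tokens: int, output_tokens: int) -> None:
        self._write({"type": "usage", "input_tokens": input_tokens,
                     "output_tokens": output_tokens})

    def events(self) -> list[dict[str, Any]]:
        if not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines if line]

    def token_totals(self) -> dict[str, int]:
        usage = [e for e in self.events() if e["type"] == "usage"]
        return {
            "input_tokens": sum(e["input_tokens"] for e in usage),
            "output_tokens": sum(e["output_tokens"] for e in usage),
        }
```

Because every event is one JSON line, the same file can feed a replay tool, a cost report, or a dashboard without a separate store.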

Lab · Instrument your evaluator-generator loop. Build a dashboard showing tool calls, token spend, evaluator scores per iteration, and time per phase. Identify three optimization opportunities from the data.

Deliverable · Trace logs + dashboard + optimization memo

M11

Multi-agent orchestration

When does one agent stop being enough? The planner / generator / evaluator architecture, sprint contracts (agreements on what “done” looks like before code is written), and file-based agent communication. Also when not to do this: every component encodes an assumption about what the model can’t do alone, and assumptions go stale.
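
A sprint contract exchanged through files could be as small as the sketch below. The SprintContract fields, the sprint-NNN.json naming, and the sign-off rule (both generator and evaluator must agree) are assumptions for illustration, not the course's reference format.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class SprintContract:
    sprint: int
    goal: str
    acceptance_criteria: list[str]
    agreed_by: list[str] = field(default_factory=list)


def write_contract(directory: Path, contract: SprintContract) -> Path:
    """Persist the contract where every agent can read it."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"sprint-{contract.sprint:03d}.json"
    path.write_text(json.dumps(asdict(contract), indent=2) + "\n", encoding="utf-8")
    return path


def read_contract(path: Path) -> SprintContract:
    return SprintContract(**json.loads(path.read_text(encoding="utf-8")))


def sign(path: Path, agent: str) -> None:
    """Record an agent's agreement by rewriting the shared file."""
    contract = read_contract(path)
    if agent not in contract.agreed_by:
        contract.agreed_by.append(agent)
    path.write_text(json.dumps(asdict(contract), indent=2) + "\n", encoding="utf-8")


def is_agreed(contract: SprintContract, required=("generator", "evaluator")) -> bool:
    """Work on the sprint starts only once every required agent has signed."""
    return all(agent in contract.agreed_by for agent in required)
```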

Lab · Add a planner in front of your generator/evaluator from Module 8. Run end-to-end on a one-line prompt. Then remove the planner and compare — is it carrying its weight?

Deliverable · Three-agent harness + ablation memo

M12

Permissions, safety, and leaving a clean state

Zooming back out: the three-layer permission model (registry filter → per-call check → interactive prompt), Bash AST safety analysis, allow/deny rule files, hooks, plan vs. auto mode, and the discipline of leaving every session “mergeable to main.” We close with a tour of extension points — MCP, custom agents, skills, hooks, plugins.
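
The three layers can be sketched as three small functions. The rule format here (a tool name plus a glob over the call's argument string) is a simplifying assumption and not Claude Code's actual rule syntax; glob matching also stands in for the Bash AST analysis mentioned above, which it does not attempt.

```python
from fnmatch import fnmatchcase
from typing import Callable

Rule = tuple[str, str]  # (tool name, glob pattern over the argument string)


def visible_tools(all_tools: list[str], disabled: set[str]) -> list[str]:
    """Layer 1, registry filter: disabled tools are never offered to the model."""
    return [t for t in all_tools if t not in disabled]


def _matches(rules: list[Rule], tool: str, arg: str) -> bool:
    return any(name == tool and fnmatchcase(arg, pattern) for name, pattern in rules)


def check_call(tool: str, arg: str, allow: list[Rule], deny: list[Rule]) -> str:
    """Layer 2, per-call check: deny rules win, then allow rules, otherwise ask."""
    if _matches(deny, tool, arg):
        return "deny"
    if _matches(allow, tool, arg):
        return "allow"
    return "ask"


def authorize(
    tool: str,
    arg: str,
    *,
    tools: list[str],
    allow: list[Rule],
    deny: list[Rule],
    ask_user: Callable[[str, str], bool],
) -> bool:
    """Run all three layers; layer 3 is the interactive prompt."""
    if tool not in tools:
        return False
    decision = check_call(tool, arg, allow, deny)
    if decision == "ask":
        return ask_user(tool, arg)
    return decision == "allow"
```

Checking deny rules first means a broad allow pattern cannot accidentally re-enable something that was explicitly forbidden.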

Lab · Write a permission ruleset for your capstone. Add three hooks (pre-tool, post-tool, post-sampling). Stress-test against five adversarial prompts.

Deliverable · permissions.json + hooks + adversarial test log

[ CAPSTONE · FINAL PROJECT ]

Rebuild /build-harness

Rebuild /build-harness, the multi-agent command that takes a short prompt and autonomously produces a working application end to end. You build it from scratch, using only the primitives you built in the preceding twelve modules. You receive the reference implementation as a black box and reproduce its observable behavior (input → planner spec → generator output → evaluator critique → handoff artifacts) with your own code. Grading covers feature parity, whether each component is actually load-bearing, and at least one deliberate, measured improvement over the reference.

Source repo with a passing end-to-end demo on a fresh prompt.
A 5-minute demo video.
An ablation memo — what each component does, what breaks if it’s removed.
A “next iteration” doc — what you’d change against the next model release.

Prerequisites

Comfort with Python OR TypeScript (we provide both tracks).
Working command line, git, and a modern editor.
An Anthropic API key (budget ~$100 for capstone runs).
A few hours a week — go as fast or as slow as you like.

How it works

Watch the recorded lesson (~75 min) whenever it suits you.
Work through the reading pack at your own pace.
Do the hands-on lab — no deadlines, no scheduled sessions.
Ship the deliverable when you’re ready and move to the next lesson.
Get unstuck in the async community whenever you need a second pair of eyes.

[ PRICING ]

Buy the course once, or add the community. Both include the full curriculum, code, labs, and capstone.

Course

$799

One-time purchase. Lifetime access — learn entirely at your own pace.

Full curriculum, code, labs, and capstone materials
All 12 recorded lessons + reading packs
Lifetime access, including future updates
Learn on your own clock — no schedule

Course + Community

Stop blaming the model. Your harness is the bug.

By module 6 you’ve shipped an initializer. By the capstone you’ve rebuilt /build-harness from scratch.
