The simple version

Prompt engineering helped us talk to AI. Harness engineering is how we make AI work.

The model is the brain. The harness is the body around it: tools, memory, permissions, verification, workflows, audit trails and feedback loops. But if that body is going to operate inside a real business, it needs discipline. That is where I think every serious AI agent needs something like a Self-Improvement Protocol, or SIP.

Why prompts were only the first layer

For the last two years, most business conversations about AI have been about prompts.

What should we ask the model? How should we phrase the instruction? Which prompt gives the cleanest answer?

That was a useful phase. It taught people that language itself could become an interface. It made non-technical leaders feel, for the first time, that they could direct a machine through intent rather than code.

But it was also the surface layer.

The deeper question is not only how to ask better questions. It is how to build AI systems that can act, verify, learn from mistakes and stay within the limits of human judgment. That is where harness engineering begins. And in my own operating vocabulary, that is where a Self-Improvement Protocol, or SIP, becomes necessary.

A prompt improves one interaction. A harness improves the operating environment around the model. A Self-Improvement Protocol improves the system after the interaction is over.

OpenAI captured the shift neatly in its own writing on harness engineering: “Humans steer. Agents execute.”

That line matters because it moves the discussion away from chatbots and toward operations. A company does not need an AI that sounds impressive in a demo. It needs AI that can act safely inside real work, with the right context, the right tools, the right supervision and the right discipline.

A chatbot can impress in a meeting. An operator has to survive Monday morning.

A real operator example

In my own AI operating setup, one lesson came from something small but important: file delivery.

A document could appear prepared on the assistant side while the actual WhatsApp group did not receive a visible attachment. To a chatbot, the task looked done. To an operator, it was not done, because the file had not reached the working channel.

The fix was not a better sentence. It was a harness rule: any WhatsApp file must be sent through the native WhatsApp send path, with a message ID recorded as proof. If that path fails, the system retries through a fallback route. SIP then promoted the lesson into a standing rule, so future file tasks cannot be marked complete on the strength of an internal signal that an attachment was prepared. They need the recorded message ID.

This is exactly the business point. In an ERP follow-up, a CRM update, a supplier quote or an approval chain, "I prepared it" is not completion. Completion means the external system changed, the right person received the output, and the evidence is recorded.
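The delivery rule can be sketched in a few lines of Python. This is a minimal illustration, not the actual setup described above: the `Sender` type, the route names and the `task_status` helper are all assumptions, and a real sender would call whatever messaging API the harness uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sender type: takes (group, file_path) and returns a message ID
# on success, or None when the channel did not confirm delivery.
Sender = Callable[[str, str], Optional[str]]


@dataclass
class DeliveryReceipt:
    route: str        # which send path succeeded
    message_id: str   # proof returned by the channel


def deliver_file(group: str, file_path: str,
                 routes: list[tuple[str, Sender]]) -> Optional[DeliveryReceipt]:
    """Try each route in order; return a receipt only when a message ID exists."""
    for route_name, send in routes:
        try:
            message_id = send(group, file_path)
        except Exception:
            message_id = None  # treat a crashed route like a failed send
        if message_id:
            return DeliveryReceipt(route=route_name, message_id=message_id)
    return None


def task_status(receipt: Optional[DeliveryReceipt]) -> str:
    """A file task is complete only with recorded proof of delivery."""
    return "complete" if receipt is not None else "not_delivered"
```

The point of the design is that "complete" cannot be reached without a receipt. The native route goes first in `routes`, the fallback second, and a task with no message ID stays open.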

What harness engineering really means

Think of RoboCop.

The human brain is still the source of judgment. But the capability does not come from the brain alone. It comes from the system around it: armor, sensors, targeting discipline, communications, diagnostics, memory, rules of engagement and command structure.

The brain matters. But the harness turns the brain into an operator.

LLMs are moving in the same direction.

A model by itself can answer a question, summarize a document or draft an email. A model inside a well-engineered harness can work through a process. It can use tools. It can retrieve the right context. It can remember what matters. It can check its own output. It can escalate risky actions. It can recover when something fails. It can leave evidence behind.

That is not simply better prompting. It is a different category of capability.

Prompt engineering changes the words the model reads.

Context engineering changes the information the model sees.

Harness engineering changes the environment in which the model acts.

Context engineering was an important bridge. People like Andrej Karpathy and Tobi Lütke have helped popularize the idea that giving the model the right context may matter more than crafting clever prompts. Anthropic has also treated context management, memory and long-running agent behavior as production concerns rather than small implementation details.

But harness engineering is wider still. It includes context, but also tools, permissions, task state, observability, failure attribution, verification and human intervention.

In other words: context helps the model think. The harness helps the model operate.

From chatbots to operators

This is where business leaders need to shift their question.

The question is no longer only, “Which model should we use?”

That still matters. But it is incomplete.

The better questions are:

  • What business systems can the model safely access?
  • What information does it need before acting?
  • What must it verify before claiming success?
  • Where does a human need to approve the next step?
  • What evidence is recorded after the task is complete?
  • How does the system learn when it fails?

This is much closer to how real companies work.

In a real business, work does not happen inside a clean prompt window. It happens across ERP systems, CRM records, WhatsApp messages, email threads, supplier quotations, dashboards, approvals, invoices, customer exceptions, payment follow-ups and human accountability.

No serious company wants an AI that confidently says something is done when nothing has actually happened.

A sales follow-up is not complete because the AI drafted a message. It is complete when the right customer was identified, the right context was checked, the CRM was updated, the next action was recorded and any exception was escalated.

A finance task is not complete because the AI summarized a ledger. It is complete when the source data was verified, the calculation was checked, the assumptions were visible and the human accountable for the decision had the right evidence.

That is harness engineering.
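One way to make such completion criteria concrete is a checklist the agent cannot skip. The sketch below uses the sales follow-up criteria from above; the key names and both helper functions are illustrative assumptions, not part of any particular CRM.

```python
# Hypothetical evidence keys for a sales follow-up, taken from the criteria above.
# "exceptions_escalated" is assumed to be True when there were no exceptions.
SALES_FOLLOW_UP_CRITERIA = (
    "customer_identified",
    "context_checked",
    "crm_updated",
    "next_action_recorded",
    "exceptions_escalated",
)


def missing_evidence(evidence: dict[str, bool],
                     criteria: tuple[str, ...] = SALES_FOLLOW_UP_CRITERIA) -> list[str]:
    """Return the criteria that have no recorded evidence yet."""
    return [name for name in criteria if not evidence.get(name, False)]


def is_complete(evidence: dict[str, bool]) -> bool:
    """The task counts as done only when nothing is missing."""
    return not missing_evidence(evidence)
```

A drafted message with no CRM update would leave `crm_updated` in the missing list, so the task stays open no matter how confident the draft sounds.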

Guides and sensors: the business version

Martin Fowler and Birgitta Böckeler describe the system around coding agents in terms of guides and sensors. Guides are feedforward controls: instructions, conventions, examples and constraints that shape the work before the agent acts. Sensors are feedback controls: tests, reviews, logs and checks that reveal whether the work is actually good.

This vocabulary is useful beyond software.

Every business process needs guides and sensors.

The guide might be a pricing rule, a customer policy, a brand tone, a credit limit, a margin threshold or an escalation rule.

The sensor might be a reconciliation, an approval, a dashboard, a human review, a system log, a test, or a simple check that asks: did the thing actually happen?

Without guides, the AI improvises.

Without sensors, the AI only performs.

With both, it starts to operate.
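As a rough sketch of how guides and sensors might wrap a single business action: guides run before the agent acts, sensors run after. The 15 percent margin threshold, the `crm_record_id` field and the function names are assumptions chosen for illustration.

```python
from typing import Callable, Optional

# Assumed shape: a check returns an error message, or None when it passes.
Check = Callable[[dict], Optional[str]]


def margin_guide(quote: dict) -> Optional[str]:
    """Guide: block quotes below a minimum margin (threshold is illustrative)."""
    if quote["margin"] < 0.15:
        return "margin below 15% threshold"
    return None


def crm_sensor(result: dict) -> Optional[str]:
    """Sensor: confirm the external record actually changed."""
    if not result.get("crm_record_id"):
        return "no CRM record id returned"
    return None


def run_step(action_input: dict, act: Callable[[dict], dict],
             guides: list[Check], sensors: list[Check]) -> dict:
    """Apply guides before acting and sensors after acting."""
    blocked = [msg for g in guides if (msg := g(action_input))]
    if blocked:
        return {"status": "blocked", "reasons": blocked}
    result = act(action_input)
    failed = [msg for s in sensors if (msg := s(result))]
    if failed:
        return {"status": "unverified", "reasons": failed}
    return {"status": "verified", "result": result}
```

Three outcomes are possible: the guide stops the action before it runs, the sensor flags an action that ran but left no evidence, or both pass and the step is verified.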

A good harness usually contains seven layers. These are not the same as the seven SIP steps. The harness layers describe the operating environment; SIP describes how that environment learns safely.

First, tools: what systems the agent can use.

Second, permissions: what it is allowed to touch, and what requires approval.

Third, context: what information is brought into the task, and what noise is deliberately excluded.

Fourth, memory: what can be remembered across work, and what must be verified because memory can become stale.

Fifth, workflow: how the agent plans, acts, checks, escalates and recovers.

Sixth, observability: logs, traces and evidence of what happened.

Seventh, feedback loops: the mechanism by which repeated failures become permanent improvements.

The seventh layer is where the discussion becomes more than automation. It becomes discipline.

A harness should not only help an AI act. It should help the AI improve safely from experience.
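To show how the seven layers could sit together, here is a hedged sketch with one field per layer. The class name, field names, default values and the `authorize` method are all hypothetical. A real harness would be far richer, but the shape is the same.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """One field per layer; every name and value here is illustrative."""
    tools: set[str] = field(default_factory=set)               # 1. tools
    needs_approval: set[str] = field(default_factory=set)      # 2. permissions
    context_sources: list[str] = field(default_factory=list)   # 3. context
    memory_max_age_days: int = 30                               # 4. memory
    workflow: tuple[str, ...] = ("plan", "act", "check", "escalate", "recover")  # 5. workflow
    log: list[dict] = field(default_factory=list)               # 6. observability
    lessons: list[str] = field(default_factory=list)            # 7. feedback loops

    def authorize(self, tool: str, approved: bool = False) -> str:
        """Decide whether a tool call may run, and log the decision."""
        if tool not in self.tools:
            decision = "denied"
        elif tool in self.needs_approval and not approved:
            decision = "needs_approval"
        else:
            decision = "allowed"
        self.log.append({"tool": tool, "decision": decision})
        return decision
```

For example, a harness created with `tools={"crm.read", "email.send"}` and `needs_approval={"email.send"}` would allow CRM reads, hold outgoing email until a human approves, deny anything else, and log every decision either way.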

What a Self-Improvement Protocol looks like

In my own work with AI agents, I use the phrase Self-Improvement Protocol, or SIP, for the operating discipline around improvement.

I do not mean it as a public standard or a claim that every company needs to use the same vocabulary. Most companies will never call it SIP. But every serious company using agents will need something like it.

A self-improvement protocol answers one practical question:

When an AI system fails, how does it improve without becoming dangerous?

That question matters because the wrong kind of improvement is not improvement. It is drift.

An agent can accumulate stale memory. It can promote a bad rule. It can hide uncertainty. It can overfit to one incident. It can gain more freedom than its judgment deserves. It can become more confident without becoming more reliable.

A practical SIP prevents that by turning experience into disciplined improvement.

Capture: every important action leaves evidence. What was the task? What source was used? What decision was made? What changed? What failed? Without capture, every mistake disappears into chat history.

Classify: not every failure means the model was wrong. Was it a prompt issue, a context issue, a tool issue, a permission issue, a memory issue, a workflow issue or a verification issue? Good operators diagnose the system, not just the answer.
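Capture and classification can be as simple as a structured record plus a fixed set of failure categories. The sketch below takes the categories from the paragraph above; the `TaskRecord` fields and the `failure_counts` helper are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    PROMPT = "prompt"
    CONTEXT = "context"
    TOOL = "tool"
    PERMISSION = "permission"
    MEMORY = "memory"
    WORKFLOW = "workflow"
    VERIFICATION = "verification"


@dataclass
class TaskRecord:
    task: str       # what was the task?
    source: str     # what source was used?
    decision: str   # what decision was made?
    changed: str    # what changed?
    failure: Optional[FailureKind] = None  # what failed, if anything


def failure_counts(records: list[TaskRecord]) -> Counter:
    """Count failures by category so recurring system issues stand out."""
    return Counter(r.failure for r in records if r.failure is not None)
```

Counting by category is what shifts the diagnosis from "the answer was wrong" to "the tool layer keeps failing".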

Verify: before the AI says something is done, there must be evidence. A test passed. A source was checked. A record was updated. A message was actually sent. A human approved the step. The system must prove completion, not perform confidence.

Promote: repeated lessons become durable controls. A checklist. A better tool. A stricter permission. A clearer workflow. A test that runs automatically. A rule that prevents the same mistake from returning.
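A promotion rule might look like the following sketch. The threshold of three occurrences is an assumption, chosen only to show the guard against overfitting to a single incident. A high-impact lesson could reasonably be promoted sooner by a human.

```python
from collections import Counter

PROMOTION_THRESHOLD = 3  # assumed: a lesson must recur this often before promotion


def lessons_to_promote(observed: list[str], standing_rules: set[str],
                       threshold: int = PROMOTION_THRESHOLD) -> list[str]:
    """Return lessons seen at least `threshold` times that are not yet rules."""
    counts = Counter(observed)
    return sorted(
        lesson for lesson, n in counts.items()
        if n >= threshold and lesson not in standing_rules
    )
```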

Govern: the agent does not get unlimited freedom. External messages, money, privacy, credentials, destructive actions and sensitive decisions need approval gates. Improvement should increase reliability, not remove accountability.
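An approval gate can be sketched as a lookup against the risk categories named above. The category labels, the action dictionary shape and the `gate` function are hypothetical. The approvals would come from a human reviewer, not from the agent.

```python
# Assumed action categories that always need a human decision, from the list above.
GATED_CATEGORIES = {
    "external_message", "money", "privacy",
    "credentials", "destructive", "sensitive_decision",
}


def gate(action: dict, approvals: set[str]) -> str:
    """Return 'run' or 'hold' for an action dict with 'id' and 'category' keys."""
    if action["category"] in GATED_CATEGORIES and action["id"] not in approvals:
        return "hold"
    return "run"
```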

Retrospect: hard tasks should be reviewed. This is where Retrospective Harness Optimization, or RHO, fits. RHO looks back at difficult trajectories, compares attempts, identifies failure patterns and proposes updates to the harness. It is one retrospective method inside SIP.

Clean entropy: the harness itself must be maintained. Memories become stale. Rules conflict. Tools change. Processes drift. If the harness is never cleaned, yesterday's intelligence becomes tomorrow's confusion.
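One small piece of entropy cleaning, stale memory, can be sketched like this. The 90-day window and the `last_verified` field are assumptions. Stale items are set aside for re-verification, not silently deleted.

```python
from datetime import date, timedelta


def split_stale(memories: list[dict], today: date, max_age_days: int = 90):
    """Separate memories into (fresh, stale) by their last verification date."""
    cutoff = today - timedelta(days=max_age_days)
    fresh = [m for m in memories if m["last_verified"] >= cutoff]
    stale = [m for m in memories if m["last_verified"] < cutoff]
    return fresh, stale
```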

This is why SIP matters for business. It connects learning with governance. It lets the system improve, but only through evidence, verification and human-defined boundaries.

A harness gives the model a body. SIP gives the body discipline.

Why RHO matters

This is why the research around RHO is important. The paper on Retrospective Harness Optimization explores how agents can improve their own harness from past trajectories, without relying on externally labelled validation data. It selects hard and diverse tasks, compares different attempts, identifies patterns of failure and proposes updates to the harness.

The reported result is striking: a single RHO round improved the SWE-Bench Pro pass rate from 59 percent to 78 percent, a gain of 19 percentage points.

Benchmarks should always be treated with care. But the direction is hard to ignore.

The model did not become smarter in isolation. The system around the model became better at helping intelligence turn into work.

RHO is retrospective. SIP is operational.

RHO looks backward at what happened. SIP defines what should be captured, verified, promoted, governed and cleaned so the system can improve safely after every serious task.
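To make the retrospective step less abstract, here is a loose sketch of picking hard and diverse tasks for review. It is not the algorithm from the RHO paper. The `success` flag stands in for whatever internal signal the harness already records (for example a passed sensor), the success-rate ordering is an assumed proxy for hardness, and "diverse" is reduced to one task per category.

```python
from collections import defaultdict


def select_review_tasks(attempts: list[dict], limit: int = 3) -> list[str]:
    """Pick up to `limit` task ids: lowest success rate first, one per category.

    Each attempt is a dict with 'task', 'category' and 'success' keys.
    """
    outcomes = defaultdict(list)
    category = {}
    for a in attempts:
        outcomes[a["task"]].append(a["success"])
        category[a["task"]] = a["category"]

    # Hardness proxy (an assumption): tasks with the lowest success rate come first.
    by_hardness = sorted(
        outcomes, key=lambda t: (sum(outcomes[t]) / len(outcomes[t]), t)
    )

    chosen, seen_categories = [], set()
    for task in by_hardness:
        if category[task] not in seen_categories:
            chosen.append(task)
            seen_categories.add(category[task])
        if len(chosen) == limit:
            break
    return chosen
```

The selected tasks are only the input to a review. Comparing the attempts, naming the failure pattern and proposing a harness update still have to pass through the verify, promote and govern steps of SIP.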

The management question leaders should ask

A simpler way for leaders to think about it is this: the model provides capability, the harness provides operating discipline, and the business process defines where that capability should be applied.

If the surrounding system is badly designed, even a powerful model will fail in real work. If there is no verification, no audit trail and no approval discipline, speed becomes risk.

The next generation of AI leadership will not be about who can write the cleverest prompt. That skill will still matter, but it will not be the moat.

The moat will be in the operating system around intelligence.

What can the agent see? What can it do? What must it prove? What must it never touch without approval? What does it remember? What does it forget? How does it learn from failure? How do humans stay in the loop where judgment matters?

That is the work.

Prompt engineering helped us talk to AI.

Context engineering helped us give AI the right information.

Harness engineering is how we make AI work.

But work is not enough. If AI is going to operate inside real companies, it also needs a disciplined way to improve from experience.

That is the role of a Self-Improvement Protocol.

The winners will not be the companies that treat AI as a smarter chatbot. They will be the companies that build the best systems around intelligence, and the best disciplines for making those systems better after every serious task.

Frequently asked questions

What is harness engineering?

Harness engineering is the discipline of designing the operating system around an AI model: the tools it can use, the context it can see, the memory it can trust, the permissions it must respect, the checks it must pass, and the feedback loops that help it improve.

How is harness engineering different from prompt engineering?

Prompt engineering improves a single interaction with a model. Harness engineering improves the operating environment that lets the model work across real processes, systems and decisions.

How is context engineering different from harness engineering?

Context engineering focuses on what information the model sees. Harness engineering is broader. It includes context, but also tools, permissions, verification, observability, feedback loops and governance.

Why does harness engineering matter for business?

Businesses do not need AI that only gives clever answers. They need AI that can operate safely across ERP systems, CRM records, email, dashboards, approvals, follow-ups and accountability.

What is a Self-Improvement Protocol?

A Self-Improvement Protocol, or SIP, is a practical operating discipline for AI agents. It captures evidence of what happened, classifies failures, verifies outcomes, promotes repeated lessons into durable controls, governs risk, reviews hard tasks and cleans entropy over time.

What is RHO?

RHO stands for Retrospective Harness Optimization. It is a method for improving an AI agent's harness by reviewing past trajectories, comparing attempts, identifying failure patterns and updating the system around the model. In this article, RHO is treated as one retrospective method inside the broader SIP idea.

Source notes

  1. OpenAI, “Harness engineering: leveraging Codex in an agent-first world”, 2026. Key reference: “Humans steer. Agents execute.” URL: https://openai.com/index/harness-engineering/
  2. Martin Fowler / Birgitta Böckeler, “Harness engineering for coding agent users”, 2026. Useful framing: guides as feedforward controls, sensors as feedback controls. URL: https://martinfowler.com/articles/harness-engineering.html
  3. Wenbo Pan et al., “Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts”, arXiv:2606.05922. Key claim used carefully: reported SWE-Bench Pro improvement from 59 percent to 78 percent in one round. URL: https://arxiv.org/abs/2606.05922
  4. Hailin Zhong and Shengxin Zhu, “AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents”, arXiv:2605.13357. URL: https://arxiv.org/abs/2605.13357
  5. Adjacent context engineering references: Andrej Karpathy, Tobi Lütke and Anthropic are used as support for the move beyond prompt engineering, not as direct endorsements of the phrase “harness engineering”.