What does 'the agent stack is the new OS' mean?

Andrew Stroup argues that an agent stack is not a product feature with a measurable 90-day ROI. It is the intelligence layer an organization runs on, analogous to an operating system. An OS creates the conditions for ROI; it does not generate it directly. Asking 'what is the return on our agent stack?' is structurally the same question as asking 'what is the ROI of Windows?' — the answer depends entirely on what you install and how you use it.

What separates the operators who get real value from those who don't?

The variance is almost never in the technology. Operators who get value feed the system continuously — filing decisions as they make them, writing down standing rules, closing the feedback loop when the agent misses something. They treat the stack like an extension of their working memory. Operators who don't get value hand work to the agent from a cold start, get one mediocre output, and disengage. Writer's 2026 enterprise AI survey found AI super-users deliver 5x productivity gains while only 29% of organizations report significant ROI overall. That gap is operator behavior at scale.

How long does it actually take for an agent stack to compound?

The deployment curve does not look like a SaaS feature rollout. Month one is overhead. Months two and three are parity: the stack can do what was done before at roughly the same quality. Most ROI evaluations happen at month two and declare the deployment inconclusive. By month six, if the input loop has been closed and the deployment landed on real work, the baseline shifts from 'how fast do we do X?' to 'we could not do X at all before.' Measuring an OS migration at month two with a feature-launch rubric produces a false negative that kills deployments that would have compounded.

The Agent Stack Is the New OS

Why is the ROI question the wrong frame for evaluating agentic AI?

Stroup traces the ROI framing to an analogy mismatch. Deloitte found only 10% of organizations report significant ROI from agentic AI today, with most projecting one to five years before it materializes. McKinsey found 80% of organizations use generative AI regularly but only 6% qualify as high performers, a 74-point gap between adoption and impact. The gap is not a technology gap. It is organizational: integration with existing systems, data quality, and change management are the top blockers. The technology is capable. The organization has not caught up to it yet.

Where could the 'agent stack as OS' argument be wrong?

Two scenarios weaken it. First, if model capability scales faster than operator skill, the input quality bottleneck eventually compresses — a model with a strong enough memory layer might extract real value from poor operator input by inference. Stroup does not think this is the base case in the next two to three years, because the bottleneck is operator-side context the model cannot observe. Second, if agentic platforms consolidate into a dominant standard with strong defaults, the peripheral deployment risk shrinks, and even low-engagement operators get usable output without building input discipline themselves.

Every conversation I have about agentic AI with an operator or a board eventually lands in the same place. They know it matters. They cannot tell you how, or by how much, or what the timeline looks like. And the conversation almost always resolves the same way: someone quotes a headcount reduction, because headcount is the one thing in the room everyone knows how to measure.

I think the measurement problem is real. I also think the cause is not a data problem. It is an analogy problem, and the wrong analogy produces wrong answers and wrong decisions. I wrote earlier this year about why replacement is the wrong default and what the defensible layer actually is. This post is about the evaluation frame that sits underneath both of those questions.

The personal computer did not have a calculable ROI when it first arrived in an office. Neither did the internet. Neither did interchangeable parts, which are the reason you can replace a brake caliper at a shop instead of commissioning a craftsman to build you one. Each of those was a generational platform shift: a technology where asking “what is the return?” is almost the wrong question, because the value is not in the technology itself. It is in what organizations can do on top of it that they structurally could not do before, and that answer takes years to become legible. Agentic AI is in that category, and the frame that actually fits the evaluation is an operating system. An agent stack is not a feature. It is the intelligence layer an organization runs on.

The ROI question kills the conversation

Asking “what is the return on our agent stack?” is structurally the same question as “what is the ROI of Windows?” The honest answer is that it depends entirely on what you install and how you use it. An OS creates the conditions for ROI. It does not generate ROI directly, and a company that measured the value of its PC investment by counting replaced spreadsheets in year one missed the thing that made the PC valuable.

The data here is striking. Deloitte found that only 10% of organizations report significant ROI from agentic AI today, with most projecting one to five years before it materializes. McKinsey found that 80% of organizations now use generative AI regularly, but only 6% qualify as high performers, meaning more than 5% of EBIT is attributable to AI. That is a 74-point gap between adoption and impact, and when you dig into why, the explanation is almost never model quality. Anthropic surveyed 500-plus technical leaders and found the top blockers were integration with existing systems, data quality, and change management, in that order. The technology is capable. The organization has not caught up to it yet.

The narrative that fills the ROI vacuum is headcount reduction, because it closes the CFO conversation. “We replaced five SDRs with one agent” is legible, fits in a board deck, and is almost never the interesting part of the story. It focuses attention on the execute layer — the middle of what researchers studying AI’s effect on software engineering call the decide-execute-deliver sandwich — rather than on the two ends that actually determine whether the deployment compounds. The same pattern holds outside software: AI compresses execution, but the decision-making and accountability layers resist automation in ways that do not resolve with capability improvements alone. Headcount reduction measures the cheapest layer. The interesting question is what the organization can do at the other two.

The operator is the whole variable

The part that breaks most evaluations is this: the same agent stack deployed to two different operators produces wildly different outcomes. Same model, same tooling, same configuration. One operator gets real leverage inside a quarter. The other gets marginal lift and eventually stops using it. I have sat in both rooms. The variance is almost never in the technology. It is in the operator.

The operators who get value share a pattern that has nothing to do with how technical they are. They feed the system continuously. They file decisions as they make them, write down standing rules instead of keeping context in their heads, and close the feedback loop when the agent misses something. They treat the stack like an extension of their working memory, which means the system stays warm because they keep putting things into it. In the sandwich framing, these operators are doing the decide and deliver work (the ends that resist automation) and delegating the execute layer to the agent. The ones who disengage are trying to hand off all three layers at once, and the agent cannot hold the ends.

The operators who do not get value do the opposite. They expect the agent to reconstruct context from a cold start, get one mediocre output, decide the model is not ready for their use case, and disengage. The system stays cold. The conclusion becomes “this does not work for us” when the real conclusion is “we never gave it what it needed to work.”

Writer’s 2026 enterprise AI survey found that AI super-users deliver 5x productivity gains while only 29% of organizations report significant ROI overall. The gap between those two numbers is not a technology gap. It is the operator behavior gap, at scale, across every company running the same deployment pattern and wondering why the results are flat.

Where new technology goes to die

There is a pattern in how organizations adopt foundational new technology, and it is almost always the same. Because the capability is unfamiliar and the failure modes are not understood yet, the instinct is to protect mission-critical workflows. You do not introduce the new system where a failure is expensive. You start with lower-priority work, internal tasks, the back-office processes that feel safe to experiment on.

Lower-priority work gets lower-priority attention from a team with less urgency and a feedback loop that runs slower, so the wins that do come never compound into anything visible enough to build organizational momentum. The experiment fails quietly, and the conclusion becomes “agents are not ready for our workflows” when the honest conclusion is “we gave the new OS the least important applications and then measured how unimportant they were.”

Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027, and while it cites legacy system integration as the primary cause, I think the organizational pattern I am describing is underneath most of those failures. The projects that stall are the ones that started on the periphery, never built the input discipline, and got measured at month two with a feature-launch rubric. The organizations getting real returns are doing something different: they pick a high-leverage workflow, one where the stakes are high enough that the team is actually motivated to close the input loop, and they deploy there first. A failed experiment on a real workflow carries genuine risk. The upside is that you get accurate signal fast and build organizational muscle in an environment that actually demands it.

The measurement conversation that actually works

I spoke recently with a growth-stage founder who has been running an agentic deployment across his company for several months. The things he kept coming back to as the hardest parts were not technical. They were getting the team to adopt a new way of working, framing the investment to a board that wanted a number, and structuring the rollout so early wins built confidence rather than early stumbles building skepticism.

His framing for the board was capability expansion, full stop. The question he put in front of his directors was “what can this team do in twelve months that they cannot do today, and what is the value of that?” That single reframe changed the entire conversation, because it put the evaluation on the right time horizon and the right unit. A CFO who cannot approve a twelve-month payback without concrete metrics can approve an investment in capability expansion, because the comparison case is not last quarter’s baseline. It is what it would cost to hire for the capability you are building instead.

His experience matches what the research is showing. MIT Sloan found that organizations are deploying agentic AI faster than they are building strategy for it, with 35% adoption already reached and another 44% planning to deploy soon. An analysis of more than 30 enterprise AI reports found that the primary constraint has shifted from model capability to organizational readiness — defined as the gap between what the technology can do and what the institution is designed to absorb. The technology is further along than most operators realize. The bottleneck is the organization’s ability to absorb it.

What the curve actually looks like

Month one is overhead: configuration, context loading, and the system learning the shape of the operator’s world. It genuinely feels slower than doing the work yourself, and it often is. Months two and three are parity, where the stack can do what was being done before at roughly the same quality. Most ROI evaluations happen at month two and declare the deployment inconclusive, which is the equivalent of measuring an OS migration by the apps that shipped on day one.

By month six, if the input loop has been closed and the deployment landed on real work, the comparison baseline has shifted. It is no longer “how fast do we do X?” It is “we could not do X at all before.” By month twelve the organization has capabilities that did not exist in its pre-agent structure, and the right question is no longer what the stack returned. It is what it would cost to go back.

Measuring that curve at month two produces a false negative that kills deployments that would have compounded, and it reinforces the instinct to keep the next experiment on the periphery where the stakes are lower and the cycle repeats itself.
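One way to keep this curve in front of a review team is to write it down as a simple lookup. The sketch below is illustrative only. The names `Phase`, `phase_for_month`, and `is_premature_roi_review` are hypothetical, the month boundaries follow the description above, and the label for months four and five is an assumption, since nothing above characterizes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    question: str  # the question worth asking at this point in the curve


def phase_for_month(month: int) -> Phase:
    """Map months since deployment to the phase described in this section."""
    if month < 1:
        raise ValueError("month must be 1 or greater")
    if month == 1:
        return Phase("overhead", "Is context being loaded into the system?")
    if month <= 3:
        return Phase("parity", "Does the stack match prior quality?")
    if month < 6:
        # Assumption: months four and five are not described in the text,
        # so they are treated here as a transition period.
        return Phase("transition", "Is the input loop closed on real work?")
    if month < 12:
        return Phase("baseline shift", "What can we do that we could not do before?")
    return Phase("new capability", "What would it cost to go back?")


def is_premature_roi_review(month: int) -> bool:
    """Flag a feature-launch ROI review that lands during overhead or parity."""
    return phase_for_month(month).name in {"overhead", "parity"}


if __name__ == "__main__":
    for m in (1, 2, 6, 12):
        phase = phase_for_month(m)
        print(m, phase.name, is_premature_roi_review(m))
```

The point of the flag is narrow: it marks a review as early, not the deployment as healthy. A month-two review can still be useful if it asks the phase question instead of asking for a return.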

Where I could be wrong

Two scenarios weaken this. If model capability scales faster than operator skill, the input quality bottleneck eventually compresses. A model with a strong enough memory layer might extract real value from poor operator input by inference, narrowing the variance I described. I do not think this is the base case in the next two to three years, because the bottleneck is operator-side context the model cannot observe. The pace of model improvement has surprised me before, though.

And if agentic platforms consolidate into a dominant standard with strong defaults, the peripheral deployment risk shrinks. Good-enough default workflows for common business functions would mean even low-engagement operators get usable output without building the input discipline themselves. The current trajectory looks more like the early PC era than the iPhone era, but that could shift faster than I expect.

The diagnostic

If you are measuring an agent deployment and the results are flat, the first thing to check is not the model and not the tooling. How much context has been loaded? How many decisions are being filed as they get made? How tight is the feedback loop? Is the team treating this like an OS that needs to be built out, or like a chatbot with extra steps that should already know everything?

And if you are trying to explain the value to a board that needs a number, the framing that works is the one the growth-stage founder used: what can this team do in twelve months that it cannot do today. The return on an operating system is not the OS. It is everything the organization builds on top of it, most of which did not exist before and therefore has no prior baseline to measure against.

The agent stack is the new OS. Whether a deployment compounds or stalls is determined almost entirely by organizational choices that have nothing to do with which model is running underneath. That is the variable worth spending time on.

