Scaling Agentic Ambition
Progressing from Pilots to Permission
Is Healthcare Really “Behind”?
Healthcare is often described as being behind on AI: slow to adopt, stuck in pilots, overly cautious compared to other industries.
That framing assumes adoption in healthcare should look like adoption elsewhere.
Healthcare does not adopt tools in the same way, for good reason. Its environments involve decisions that carry risk to human life as well as legal liability. What is often underappreciated is that its systems are built around highly sensitive, regulated personal data, an attractive target for identity theft.
Recently, at the U.S. ambassador’s residence in the Netherlands, we had a conversation on the cybersecurity aspects of medical devices, a topic recently in the news. Healthcare is part of critical infrastructure that can be a target for operational disruption and is at risk during broader geopolitical activity. That unpredictability could be anything from a sudden immigration raid in a hospital to a multi-national war and cyberattacks.
Given that, moving from a pilot to broader use is not just a question of whether something works. Decision-makers in healthcare ask whether new technology can be used safely, repeatedly, across population types, and under conditions that are not always fully controlled.
Pilot Stage Is Not the Problem
Pilots without scaling are often treated as evidence that an industry is “slow” to innovate. To understand what it takes to progress past pilots, one must consider what pilots can and cannot prove compared with what decision-makers are responsible for safeguarding. By design, pilots demonstrate performance under defined conditions, as “proof of concept,” and for a limited slate of metrics.
The limitation of pilots is not unique to healthcare AI. Measuring “time saved” per interaction, for instance, does not capture total workflow impact. There might be trade-offs made, new risks created, or new inefficiencies introduced.
The existence of pilots shows institutions that are engaged with innovation and curious about it. That is different from commitment. Pilots show possibility. They do not establish reliability.
Recently, my colleague David Simcik and I gave a talk at Stanford titled “Scaling Agentic Ambition,” with the premise that to get past the pilot and scale, you need to understand how healthcare decision-makers and purchasers think. Beyond the pilot, you must be credible on de-risking your tech solution. As a former Chief Medical Officer, I was one of those decision-makers. David has worked in Big Tech enterprise sales, hearing what keeps a CIO or CSO (Chief Security Officer) up at night. This month’s issue is inspired by our combined presentation.
When Agentic AI Acts Between Systems
Agentic AI refers to systems that do not simply generate static outputs but take actions within an environment and between systems: initiating steps, accelerating workflows, reducing administrative burden, and, most importantly, interacting with other systems.
Multi-agent AI that orchestrates actions or collaborates within a single system can increase safety, improve quality, and reduce risk.
Agentic systems, if acting autonomously, can reduce or even remove the pause needed for effective “human-in-the-loop” oversight. The promised efficiency of AI allows agents to route information, trigger actions, and respond in real time. The faster that happens, the more likely that by the time something is reviewed, the action has already occurred, the data has already been shared, or access has already been given to an outside agent.
An output that is slightly off can be corrected.
An action that is slightly off can propagate.
Actions can include:
Patient information is sent.
A payment is made.
A patient is directed down a path.
(Keep in mind what we covered last issue on the reliability of large language models, or LLMs, when it comes to the seemingly simple task of triage (“Should I see a doctor immediately or wait a couple of days or weeks?”). For life-threatening conditions, a health-specific LLM under-triaged more than 50% of the time.)
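To make the pause-before-action point concrete, here is a minimal sketch of an approval gate, assuming hypothetical action classes and risk tiers (none of these names come from a real product): high-consequence actions are blocked until a named human approves them, and unknown actions default to high risk so the gate fails closed rather than open.

```python
from enum import Enum

class Risk(Enum):
    LOW = 1
    HIGH = 2

# Hypothetical action classes and risk tiers; a real deployment would
# classify these far more granularly.
ACTION_RISK = {
    "draft_visit_summary": Risk.LOW,
    "send_patient_records": Risk.HIGH,
    "issue_payment": Risk.HIGH,
}

def execute(action: str, approved_by: str | None = None) -> str:
    """Block high-consequence actions until a named human approves them."""
    risk = ACTION_RISK.get(action, Risk.HIGH)  # unknown actions default to HIGH
    if risk is Risk.HIGH and approved_by is None:
        return f"BLOCKED: '{action}' queued for human review before execution"
    # ... the action itself would run here ...
    return f"EXECUTED: '{action}' (approved by {approved_by or 'standing policy'})"

print(execute("send_patient_records"))                               # blocked
print(execute("send_patient_records", approved_by="attending_md"))   # runs
print(execute("draft_visit_summary"))                                # low risk, runs
```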
Agentic systems also do not operate within clean boundaries but within a “society of agents” via a range of connections. They rely on tools, APIs, and, in some cases, other agents outside the original system. That introduces a different category of risk. These external agents are prone to model error, may use incomplete or flawed data, or may have insufficient security controls. This can increase exposure to manipulated prompts, malicious instructions, injected false data, or attempts to extract or redirect sensitive information.
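One defensive pattern that addresses part of this exposure is a per-agent tool allowlist. Below is a minimal sketch, with hypothetical agent and tool names: no agent, inside or outside the original system, can call a tool it was not explicitly granted, and every denial is logged for review.

```python
# Hypothetical agent IDs and tool names; in practice these grants
# would live in audited configuration, not in code.
AGENT_GRANTS = {
    "scheduling_agent": {"read_calendar", "propose_appointment"},
    "billing_agent": {"read_invoice"},
}

denied_log = []  # every refused request is kept for later review

def authorize(agent_id: str, tool: str) -> bool:
    """Permit a tool call only if this agent was explicitly granted it."""
    allowed = tool in AGENT_GRANTS.get(agent_id, set())  # unknown agents get nothing
    if not allowed:
        denied_log.append((agent_id, tool))
    return allowed

assert authorize("scheduling_agent", "read_calendar")
assert not authorize("scheduling_agent", "send_patient_records")  # outside its grant
assert not authorize("outside_agent", "read_invoice")             # never registered
```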
When Control Systems Fail
We have already seen what happens when systems that are meant to manage and control infrastructure become the point of failure. The recent cyberattack on Stryker did not just involve unauthorized access or data exposure. It disrupted operations—manufacturing, ordering, delivery—by leveraging internal device management systems.
The impact did not stay within IT. Surgeries were delayed. Patient care was affected. Revenue was lost. Lawsuits may be filed. Trust was affected.
While this could have happened to any company, it shows how a single failure is not contained within a single system or function. It moves through connected processes and shows up in operations across hospital systems and domains of healthcare. Because hospitals are fairly low-margin operations (with those margins falling further in 2025), a few days or weeks of canceled surgeries, an important source of hospital revenue, can have an outsized impact on income statements and fiscal health.
As systems become more connected and more capable of acting, this becomes even more consequential.
Action Changes the Accountability Structure
The relevant questions, when doing a root-cause analysis or review of any error, are whether the system had the authority to act, under what conditions that authority was granted, and what safeguards governed that delegation.
These systems are exercising delegated authority inside regulated workflows with fluctuating risks.
Who approved this class of action?
Was the “human-in-the-loop” approver correct for the task?
For instance, simply having “a doctor” as the authorized approver may not be appropriate. There is a difference between a fresh medical school graduate in residency training and a board-certified attending physician.
Additional questions:
What inputs was the system allowed to rely on?
What evidence exists of the context in which the action was taken?
What review was required before or after the fact?
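Those questions are only answerable after the fact if the system records the answers at the time of action. A minimal sketch of such a record, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DelegatedActionRecord:
    """One record per delegated action; all field names are illustrative."""
    action_class: str       # which class of action was taken
    approved_by: str        # who approved this class of action
    approver_role: str      # e.g., "board-certified attending", not just "a doctor"
    permitted_inputs: list  # what inputs the system was allowed to rely on
    context_snapshot: dict  # evidence of the context in which the action was taken
    review_required: str    # "pre-action", "post-action", or "both"
    executed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = DelegatedActionRecord(
    action_class="route_referral",
    approved_by="dr_example",
    approver_role="board-certified attending",
    permitted_inputs=["referral_form", "problem_list"],
    context_snapshot={"queue_depth": 12},
    review_required="post-action",
)
```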
If harm occurs, the system does not hold responsibility. Responsibility remains with the organization and, ultimately, with individuals: clinically, operationally, legally, and financially. In the case of malpractice risk, even if the error is the technology’s, the liability rests with the attending physician.
This Is Not a New Kind of Problem
When systems become complex, interconnected, and high-consequence, the limitation is no longer individual performance. It is how the system behaves under real conditions. These high-stakes situations are not unique to healthcare; they exist in the military, aviation, and a number of other industries. Just as healthcare has learned from other industries before, it can learn, for agentic AI, from what has been tried in industries that were faster to adopt this technology.
In aviation, checklists addressed the limits of memory and reduced pilots’ cognitive load, ensuring consistent pre-flight safety checks. That translated into pre-surgery checklists.
In manufacturing, lean systems reduced variability, made processes observable, reduced waste, and empowered front-line workers to prevent error. That, too, has been brought into healthcare.
Scaling systems in high-risk environments has required this transition: from relying on individual judgment to building systems where behavior is visible, constrained, and accountable by design. The more you can cite successful, reliable examples from other highly regulated industries, the more credible you will sound to responsible parties in healthcare who seek to mitigate risk while adopting new technologies.
From Pilot to Permission
Pilot projects demonstrate that something can work. After they are completed, their results are used as evidence for scaling. Future-facing, optimistic innovators are blindsided when they fail to understand that the pilot is just one form of evidence, not the complete set of criteria that convinces a decision-maker in healthcare to grant permission to scale.
That permission depends on whether the system can operate under conditions that are variable, incomplete, and interconnected—and whether its behavior can be contained, understood, and attributed when it fails.
Below are five lenses my colleague David offered in our talk at Stanford, based on his conversations with C-level executives who are considering introducing agentic AI.
First, visibility: knowing which agents are in use, where they sit, what functions they serve, and what they connect to externally.
Second, evaluation: understanding how they behave across use cases and contexts, not just an average across all functions.
This requires reference benchmarks and the right training data, which could mean curated “golden datasets.” These validated datasets can demonstrate performance in a way that is legible to regulators and other oversight bodies. A golden dataset is not sufficient by itself, however; systems trained on historical data may perform differently across specific population types. Examples include women’s health, pediatrics, elder care, rural health, and rare diseases. (You would not use elder nursing-home data to predict disease in a newborn nursery, even if the elder population data were a “golden dataset.”) A sketch of this kind of stratified evaluation follows this list.
Third, control and security: limiting what they can do, what they can access, and stopping actions before they propagate. In practice, that means defining boundaries—what systems an agent can access, what actions require escalation, and where execution must pause.
Fourth, trust and safety: managing variability, bias, and the effects of incomplete or changing information.
Fifth, auditability: reconstructing what happened, why it happened, and where responsibility sits.
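To make the evaluation lens concrete, here is the stratified-evaluation sketch referenced above, with entirely made-up outcomes: performance is reported per population stratum, because a single average can hide a stratum where the system fails.

```python
from collections import defaultdict

# Made-up outcomes: (population stratum, whether the system was correct).
results = [
    ("adult_general", True), ("adult_general", True), ("adult_general", True),
    ("adult_general", True), ("pediatrics", False), ("pediatrics", False),
    ("pediatrics", True),
]

tallies = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
for stratum, correct in results:
    tallies[stratum][0] += int(correct)
    tallies[stratum][1] += 1

correct_total = sum(c for c, _ in tallies.values())
n_total = sum(t for _, t in tallies.values())
print(f"overall: {correct_total / n_total:.0%}")  # the average hides the weak stratum
for stratum, (c, t) in sorted(tallies.items()):
    print(f"{stratum}: {c}/{t} = {c / t:.0%}")
```

On this toy data the overall accuracy looks reasonable (5/7), while the pediatrics stratum fails two out of three times, which is exactly the failure mode an average conceals.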
If an innovator or tech vendor has a plan to address these domains, that innovator has a greater chance of getting permission to progress past the pilot stage and scale across a health system. Even if you do not have all the answers, being able to engage in a conversation on these topics will help you understand what trade-offs to make when building your tech solution in order to be a good candidate for scaling past the pilot.
Open Questions
What level of variation is acceptable when systems are acting within workflows?
Acceptable compared to what: current practice, existing failure rates, or an idealized baseline?
What level of autonomy is acceptable before authority is transferred, not just assistance provided?
And once systems are allowed to act and interact across boundaries, who is responsible for what follows: clinically, operationally, legally, and financially?