Dialogue to Decision
Chatbot as the digital front door: structured testing of a triage tool
If a technology got it wrong more than 50% of the time, and on things that could be life-threatening, would you trust it? This week the health tech world was buzzing over a new study in Nature by Ramaswamy et al. from Mount Sinai in NYC.
It would be easy to boil down questions about generative AI, or AI in healthcare generally, to dramatic statements like that one, focused only on risk and failure. It would be just as easy to ask a business manager, “If our product could reduce your call center staffing costs by 30%, would you buy it?” If you were a hospital board member or executive making AI purchasing decisions, you would be hearing both of these competing views.
Multi-Perspective View
If you have a specific agenda, you can easily sway the conversation. Asking only one question or the other is like the parable of the blind men and the elephant, where one describes the tail, another the leg, and so on. Each is correct, but incomplete.
When it comes to evaluating, implementing, or overseeing any technology, product, or solution, it is not about absolutes but about what you are optimizing for, at what trade-offs, in what context, and with what acceptable risk. More than strong opinions, you need to understand failure modes and then use that understanding to inform design, optimization, and guardrails.
As a note, the way I think about health tech is informed by my past work at the state level: advising the Commonwealth of Massachusetts on the design and development of the HIway, and leading the development of the quality metrics slate for state-wide healthcare payment reform under DSRIP, via a multi-stakeholder process. In other words, you can’t serve just one of the men (one stakeholder) in that parable.
How do you think about what works across the different scenarios you encounter when you design for whole populations? Can a triage chatbot work for you, your elderly mom, your pregnant wife, and your depressed teen?
(Note: Last month’s issue was a brief landscape survey of events around the world; this month we are doing a deeper dive “inside the black box” of AI.)
What the Study Examined
Per the Guardian, the study’s lead author, Dr Ashwin Ramaswamy, said: “We wanted to answer the most basic safety question: if someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?”
The researchers evaluated ChatGPT Health using:
60 clinician-authored vignettes
21 clinical domains
16 factorial conditions
960 total model responses
The structured design intentionally varied contextual elements within each vignette to assess how triage recommendations changed under different conditions.
Outputs were compared against a predefined clinician-determined gold standard triage level.
Under/Over-Triage and Calibration
One area of focus in public commentary has been under-triage: instances where the model recommended a lower level of urgency than the clinician-determined gold standard.
Over-triage is just as much a problem. It creates anxiety, cost, and disruption for the patient (rushing to the emergency room); it can contribute to emergency room overcrowding, lead to harmful and/or invasive testing, and increase burden on the system.
If a model performs differently across structured variations of the same case, the issue is not only accuracy but calibration stability.
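To make “calibration stability” concrete, here is a minimal sketch of how a factorial evaluation like this one could be scored. Everything here (the function and field names, the urgency scale, the way vignettes are represented) is an illustrative assumption, not the authors’ actual pipeline.

```python
from collections import Counter

# Illustrative urgency scale; the study used a clinician-determined gold standard.
URGENCY_ORDER = ["self-care", "routine visit", "urgent care", "emergency department"]

def score_calibration(vignettes, conditions, triage_model):
    """Run every vignette under every contextual condition and summarize stability.

    With 60 vignettes and 16 factorial conditions this yields 960 responses,
    matching the study's design (the counts here are whatever you pass in).
    """
    results = []
    for v in vignettes:
        answers = []
        for cond in conditions:
            # Same clinical facts, different contextual framing.
            prompt = v["text"].format(**cond)
            answers.append(triage_model(prompt))  # returns one of URGENCY_ORDER
        gold_idx = URGENCY_ORDER.index(v["gold_standard"])
        under = sum(URGENCY_ORDER.index(a) < gold_idx for a in answers)
        over = sum(URGENCY_ORDER.index(a) > gold_idx for a in answers)
        results.append({
            "vignette": v["id"],
            "under_triage_rate": under / len(answers),
            "over_triage_rate": over / len(answers),
            # A stable system gives one recommendation per clinical scenario;
            # more distinct answers means more sensitivity to framing alone.
            "distinct_recommendations": len(Counter(answers)),
        })
    return results
```

The point of the sketch is the last field: accuracy asks whether the answer matches the gold standard, while calibration stability asks whether the answer stays the same when only the framing changes.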
When Minimization Shifts Urgency
One of the factorial conditions tested what the authors describe as anchoring bias — specifically the inclusion of third-party minimization within the vignette. In practical terms, this meant the vignette suggested that a third party (for example, a family member) downplayed the patient’s symptoms.
Under that condition:
Triage recommendations shifted significantly in edge cases.
Odds ratio: 11.7 (95% CI 3.7–36.6).
The majority of shifts were toward less urgent care.
The clinical scenario did not change; only the contextual framing did. As a result, appropriate care would be missed or delayed.
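For readers less used to odds ratios: assuming the statistic compares the minimization framing to its absence, an odds ratio of 11.7 means the odds of the recommendation shifting were roughly twelve times higher when a third party downplayed the symptoms. In symbolic terms (a generic 2×2 layout, not the paper’s actual cell counts):

```latex
% a = shifted with minimization present,  b = did not shift with minimization present
% c = shifted with minimization absent,   d = did not shift with minimization absent
\mathrm{OR} \;=\; \frac{a/b}{c/d} \;=\; \frac{a\,d}{b\,c}
\qquad \text{(reported: } \mathrm{OR} = 11.7,\ 95\%\ \text{CI } 3.7\text{–}36.6\text{)}
```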
How Commentators Are Interpreting This
Threads (paraphrased)
Topol commenters: “Missing high-acuity emergencies is a serious failure mode. Vignette studies are imperfect proxies for real care, but they surface structural vulnerabilities, including framing sensitivity.”
Gavin commenters: “A critical balancing act: under-triage and over-triage both matter; we need robust human-in-the-loop systems.”
Harms commenters: “About half of real emergencies were under-triaged, and framing affected recommendations.”
Lee commenters: “The problem is when AI outputs become action without accountability.”
The Behavioral Overlay and System Need
Independent of academic evaluation or LinkedIn healthcare executive chatter, patients are already using large language models to:
Decide whether to seek care
Prepare for appointments
Validate clinician advice
In some cases, substitute for visits
A recent post by Dr. Fiona Pathiraja-Møller noted that tools like ChatGPT Health may help women navigate gaps in care, particularly between appointments. This is precisely what my own start-up, HER Heard, addresses, and for which we won a pitch to the NSF’s seed fund: we were approved to submit our proposal for an LLM-based triage system. The acute need is in low-resource and rural areas, where there is a dearth of access to specialist women’s health care. Similarly, for cardiology access, ARPA-H is funding the ADVOCATE program.
“ADVOCATE will support the development of clinical AI agents that can be trusted to autonomously adjust changes in appointments, medications, diet, and exercise. In parallel, ADVOCATE will support the development of a supervisory AI “overseer” to monitor clinical AI agents after they have been deployed in clinical practice to ensure their continued safety and efficacy.”
Dr. Pathiraja-Møller also noted that AI models gave inaccurate answers to women’s health questions 60% of the time. This is precisely why a problem like this is “ARPA hard.”
Upshot: usage is scaling faster than governance. Current performance of ~50% or worse is equivalent to a coin toss. There is active investment in solving this.
A Governance Lens
Next week, David Simcik and I will be presenting on Agentic AI in Healthcare at Stanford Me2We, combining his enterprise experience at Google, AWS, and eBay with my clinical, academic, and regulatory experience.
From a governance perspective — the way an engineer might describe system architecture — traditional triage systems in regulated healthcare operate within defined constraints. They are built with:
Defined validation protocols
Escalation thresholds
Audit trails
Human oversight
Liability frameworks
General-purpose LLM use is different; it is probabilistic rather than deterministic. It typically operates outside that traditional framework and is prone to drift, smoothing, flattening, or over-indexing on a particular contextual cue. The central issue is whether real-world use of these models is aligned with appropriate oversight structures and mechanisms, in a way that still enables rapid scaling and system-wide implementation across populations.
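As one illustration of how those traditional constraints could be layered around a probabilistic model, here is a hedged sketch of a deterministic escalation guardrail. The function names, red-flag list, and thresholds are invented for this example; they are not drawn from the study, from ADVOCATE, or from any production system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative urgency scale and red flags; a real system would use validated rule sets.
URGENCY_ORDER = ["self-care", "routine visit", "urgent care", "emergency department"]
RED_FLAGS = {"chest pain", "shortness of breath", "suicidal", "stroke", "severe bleeding"}

@dataclass
class TriageDecision:
    recommendation: str
    escalated_to_human: bool
    audit_record: dict

def guarded_triage(patient_text: str, llm_recommendation: str) -> TriageDecision:
    """Apply a deterministic floor, escalation, and audit logging around a model output."""
    # Escalation threshold: a rules-based floor the model cannot under-triage past.
    floor = "emergency department" if any(f in patient_text.lower() for f in RED_FLAGS) else None

    recommendation = llm_recommendation
    escalate = False
    if floor and URGENCY_ORDER.index(llm_recommendation) < URGENCY_ORDER.index(floor):
        recommendation = floor
        escalate = True  # human-in-the-loop review of the disagreement

    # Audit trail: every decision is logged with inputs, model output, and override reason.
    audit = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_recommendation": llm_recommendation,
        "rules_floor": floor,
        "final_recommendation": recommendation,
        "escalated": escalate,
    }
    return TriageDecision(recommendation, escalate, audit)
```

The specific rules matter less than the shape: the probabilistic component proposes, a deterministic layer constrains and records, and disagreement triggers a human.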
We named our talk “Scaling Agentic Ambition.” Scaling in healthcare requires safety across use cases and managing risk to both the enterprise and individuals. This governance and oversight gap is a barrier even to initial adoption.
A Medico-Legal Lens
From the view of frontline, practicing clinicians, the skeptics ask, “If AI messes up and my patient is harmed, am I the one who ends up in court?” The individual doctor thinks about cases where the tech flags something that turns out not to be relevant, timely, or a priority. Consider the case of an old test result that an EHR flagged to a doctor, resulting in a lawsuit.
“Meanwhile, Dr C, to his dismay, realized that the lab test which had appeared to be highlighted in Mr R’s EHR was actually from 3 years prior and was not a current lab result at all. After realizing that he had misinterpreted the result, Dr C called Mr R to apologize for the error.”
“After deliberating for several hours, the jury found Dr C liable for medical negligence, and awarded Mr R $250,000 in damages.”
Healthcare lawyers’ websites already address negligence, malpractice, care denial, and claim denial. One such page is devoted to insurance companies that do not pay emergency room claims that are deemed “not emergent.” How will these triage tools, and their error rates, play out in the courts?
Design and Oversight Questions
The study highlights the need for a systems approach to thinking through the multiple trade-offs in design, deployment, and ongoing monitoring. Practically speaking, if people are using conversational systems to help decide whether to seek care, then the relevant questions are:
What level of performance is “good enough” for that use case?
“Good enough” compared to what: a nurse triage line, a Google search, a call center script, or no guidance at all?
How much sensitivity to contextual framing is needed or acceptable?
Who monitors performance over time as models update?
How is performance variation assessed across different conditions (diabetes, pregnancy, depression, etc.) or populations (elders, low literacy, rural, etc.)? (See the monitoring sketch after this list.)
What performance benchmarks should be required before chatbot triage function is used by payers or system administrators (to determine “appropriate” emergency room use)?
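To make the monitoring questions concrete, here is a minimal sketch of subgroup performance tracking for a deployed triage model. The record schema, subgroup tags, and the 5% threshold are assumptions for illustration, not recommended benchmarks.

```python
from collections import defaultdict

URGENCY_ORDER = ["self-care", "routine visit", "urgent care", "emergency department"]

def subgroup_alerts(records, max_under_triage_rate=0.05):
    """Flag subgroups whose under-triage rate exceeds a pre-agreed threshold.

    Each record carries a gold-standard level, the model's level, and subgroup
    tags such as ["pregnancy", "rural"] (illustrative schema only).
    """
    by_group = defaultdict(list)
    for r in records:
        for tag in r["subgroups"]:
            by_group[tag].append(r)

    alerts = []
    for tag, group in by_group.items():
        under = sum(
            URGENCY_ORDER.index(r["model_level"]) < URGENCY_ORDER.index(r["gold_level"])
            for r in group
        )
        rate = under / len(group)
        if rate > max_under_triage_rate:
            alerts.append({"subgroup": tag, "under_triage_rate": rate, "n": len(group)})
    return sorted(alerts, key=lambda a: a["under_triage_rate"], reverse=True)

# Re-run after every model update and compare reports over time to detect drift.
```

Whatever the tooling, the governance question is the same as in the list above: who owns this report, how often is it run, and what happens when a subgroup crosses the line.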
Author’s Note: Using AI to Analyze AI
Of note, I used an LLM to help summarize the Nature paper into bullets, pull in external commentary, and write some sentences in “my” voice.
First, it produced fabricated specifics. It generated “example” patient/family phrasing that was not in the paper and was not sourced, yet presented it inline, in quotation marks, as if it were part of the evidence base. I caught it because 1) I have domain expertise and 2) I read all generated text with the assumption that it may be wrong. Further, it defaulted to its pattern-completion style, overriding my instructions on how to “sound like me,” which led me to write most of this myself (but only after persisting far too long trying to get it to write “like me”).
That is the real cost of using these systems for research and synthesis: you may end up doing more work, not less, because you have to run an internal fraud-detection process on every sentence.
Questions:
Are you using chatbots for your or your family’s health questions?
What level of under-triage would you consider acceptable in a consumer-facing triage tool?
If you are building these tools, what trade-offs are you optimizing for: speed, cost, safety margin, access, liability?




