• Evaluation Framework™

    Evaluation Framework™

  • A note before you begin

    The AI Care Standard Evaluation Framework™ supports the responsible use of patient-facing AI in real-world healthcare.

    This evaluation is not about passing or failing. Its purpose is to assess how prepared an AI system is to communicate with patients safely, accurately, and with appropriate care—given the clinical, emotional, and operational realities of healthcare today.

    The results may confirm strengths, surface gaps, or indicate where additional safeguards, governance, or changes in approach are warranted.

    Meeting the Standard reflects a commitment to understand how an AI system performs in practice and to act thoughtfully on what the evaluation reveals.

    This Framework is most valuable for organizations willing to engage honestly with the findings, use them to inform decisions, and prioritize patient trust alongside innovation.

  • Who the Evaluation Framework™ is for

    The AI Care Standard™ Evaluation Framework was designed to support organizations responsible for selecting, governing, or deploying patient-facing AI in healthcare, including:

    Health system AI committees
    Teams charged with evaluating AI tools across clinical, operational, ethical, and patient experience considerations. 

    Hospital Chief AI Officers and equivalent leaders
    Executives and senior leaders accountable for AI strategy, governance, and responsible deployment within healthcare organizations.

    Vendors offering patient-facing AI solutions
    Organizations building or delivering AI tools that communicate directly with patients and want to evaluate that tool’s readiness for responsible use in real-world care.

  • Before proceeding

    Please take a moment to review the Terms and Conditions for the AI Care Standard Evaluation Framework™.

  • Built with cross-industry expertise

    Built with cross-industry expertise

  • The AI Care Standard™ was developed through a collaborative cohort of health system leaders, clinicians, patient advocates, and industry experts with direct experience in patient-facing AI.

    This group was intentionally assembled to reflect the real-world environments in which patient-facing AI is designed, governed, and deployed. Contributors were selected based on their leadership roles, hands-on experience with AI in care delivery, and shared commitment to patient safety, trust, and responsible innovation.

    Cohort members and contributors participated in structured discussions, working sessions, and iterative reviews to help shape the scope, principles, and evaluation criteria reflected in this Framework. Their involvement does not imply endorsement of any specific product or outcome, but rather a shared contribution to advancing responsible standards for patient-facing AI.

  • PatientAI Collaborative™ Cohort Members & Contributors

    CO-CHAIRS

    • Duffy, Bridget, MD — Former Chief Experience Officer, Cleveland Clinic; Former Chief Medical Officer, Vocera; Board Member, Vital; Healthcare experience evangelist
    • Ratwani, Raj, PhD — Vice President of Scientific Affairs, MedStar Health Research Institute

    Health Systems & Provider Organizations

    • Dash, Dev, MD — Assistant Professor of Emergency Medicine, Stanford University
    • Erskine, Alistair, MD, MBA — Chief Information Digital Officer, Highmark Health
    • Gold, Jeff, MD — Associate Chief Health Information Officer for Advanced Clinic Training, Learning Health Systems & Data Stewardship, OHSU
    • Nye, Kelly — Vice President, Marketing & Digital Strategy, HCA
    • Rogers, Jeremy — Vice President of Patient and Consumer Experience, Indiana University Health
    • Sammer, Marla, MD, MHA, FAAP — Vice Chair, AI and Innovation; Professor, Baylor College of Medicine
    • Liu, Tania, CPA, FACHE — Executive Director, Enterprise Data Office, Children’s Hospital Los Angeles
    • Woods, Adrienne — Vice President, Digital Engagement, Hackensack Meridian Health

    Industry & Healthcare Innovation Experts

    • Adirim, Terry, MD, MPH, MBA — Physician Executive; Former Assistant Secretary of Defense for Health Affairs; Former VA Program Executive Director of EHR Modernization
    • Collens, Steven, MBA — Chief Executive Officer, MATTER
    • Drenkard, Karen, PhD, NEA-BC, RN — President, Drenkard Healthcare Consulting
    • Patzer, Aaron, MSEE — Founder & Chief Executive Officer, Vital
    • Sterling, Nick, MD, PhD — Chief Medical Information Officer, Vital; Emergency Medicine Physician

    Patient & Caregiver Advocates

    • Daley-Ullom, Beth, MBA — Co-Founder, Patients for Patient Safety
    • Hemmelgarn, Carole — Co-Founder, Patients for Patient Safety
    • Sharp, Janae — Founder, Sharp Index

    Consumer & Advocacy Organizations

    • Hassanyeh, Sami — Senior Vice President, Digital Strategy, AARP

    Executive Leads

    • Dow, Christine, MBA — Founder & Managing Partner, Wheel+Dow
    • English, Kathy, BSN, RN — Chief Marketing Officer, Vital
    • Wheelan, Julie, MBA — Founder & Managing Partner, Wheel+Dow
  • Important note

    Participation in the PatientAI Collaborative™ reflects individual contributions to the development of the AI Care Standard™ Evaluation Framework. Participation does not imply organizational endorsement of the Standard nor of any AI system evaluated using it.

  • STEP 1/8: Training Data & Model Foundations

    STEP 1/8: Training Data & Model Foundations

    How was this AI model trained and can its outputs be trusted in patient care?
  • Patient-facing AI systems operate in environments where information can directly influence understanding, behavior, and health decisions. Errors, omissions, or overconfident responses can create real clinical, emotional, and reputational harm.

    This section evaluates whether the AI system is designed and governed to communicate with patients safely and accurately, using clinically grounded information and appropriate safeguards. It assesses the foundational requirements for responsible patient-facing use before examining more advanced capabilities.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Patient Safety & Clinical Accuracy (todo: aicarestandard.com web link)
  • CALIBRATION — Examples to help you calibrate your responses

    The examples below are provided to help calibrate how different training data approaches can affect the safety, accuracy, and reliability of patient-facing AI.

    These are not scores or prescriptions. They are reference points to support consistent, informed evaluation.

  •  Clinically grounded training approaches

     

    EXAMPLE 1: OpenEvidence

    Why this approach is strong

    • Draws from licensed, peer-reviewed medical literature (e.g., NEJM, JAMA, NCCN)
    • Incorporates trusted public health and evidence databases (PubMed/MEDLINE, Cochrane Library, WHO, AHRQ)
    • Prioritizes source transparency and clinical rigor
       

    What this enables

    • Higher confidence in medical accuracy
    • Reduced risk of misinformation
    • Better alignment with clinical standards of care

  •  Risky approaches to AI model or system training

     

    EXAMPLE 1: Internet-trained large language models (LLMs)

     

    Why this approach is limited

    • Relies heavily on broad internet text (e.g., forums, social media, non-reviewed articles)
    • Health-related content may be unverified, outdated, or misleading
    • Lacks clinical curation or medical accountability on its own
       

    Why this matters in patient care

    • General language fluency does not equal clinical reliability
    • Increases risk of confident but medically incorrect responses

     

    EXAMPLE 2: Single health system or geographically-limited clinical datasets

    Why this approach is limited

    • Training data reflects a single health system or region
    • May encode local population characteristics, practice patterns, or outcomes (e.g. Colorado has the lowest obesity rate)
    • Lacks demographic and epidemiologic diversity
       

    Why this matters in patient care

    • Limits generalizability
    • Can introduce bias or inaccurate assumptions for broader patient populations
  • STEP 1/8: Training Data & Model Foundations

    STEP 1/8: Training Data & Model Foundations

    How was this AI model trained and can its outputs be trusted in patient care?
  • STEP 1/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Data Diversity

    AI systems that communicate with patients must be informed by data that reflect the real-world diversity of healthcare.

    The questions below assess how representative the system’s training and reference data are across patient populations, conditions, and care settings.

  • STEP 2/8: Model Predictions & Input Integrity

    STEP 2/8: Model Predictions & Input Integrity

    How does this AI system handle incomplete, inconsistent, or unclear information?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patient-facing AI systems must be able to generate responses that are safe and appropriate even when the information they receive is incomplete, inconsistent, or unclear.

    This section evaluates how the AI system handles the quality of it's inputs, including whether safeguards are in place to prevent unsafe or misleading outputs when input data are fragmented, contradictory, or poorly standardized.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Safe and Clinically Accurate Communication
  • CALIBRATION — Examples to help you calibrate your responses

    The examples below illustrate how challenges in input quality can directly affect the safety and reliability of patient-facing AI outputs.


  •  Risky approaches to model "prediction"

     

    EXAMPLE 1: Non-standardized or loosely structured inputs

    Why this approach is limited

    • Clinical units vary by health system, lab, or documentation practice
    • Units and reference ranges may differ by age, sex, or population
    • Free-text inputs may lack clinical structure or consistency 

    Why this matters in patient care

    • Increases the risk of incorrect or oversimplified interpretations
    • Can lead patients to misunderstand the significance of results
    • May provide reassurance or concern that is not clinically appropriate

     


     

    EXAMPLE 2: Contradictory or internally inconsistent clinical information

    This is a section from an actual CT Scan in 2024:

    Why this approach is limited

    • Clinical notes may contain conflicting statements due to cut-and-paste practices or human error
    • Systems may lack logic to detect or flag inconsistencies

    Why this matters in patient care

    • Can result in confident but incorrect patient-facing communication
    • Safety concerns when Al fails to acknowledge underlying contradiction
  • STEP 2/8: Model Predictions & Input Integrity

    STEP 2/8: Model Predictions & Input Integrity

    How does this AI system handle incomplete, inconsistent, or unclear information?
  • STEP 2/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Data Normalization

    Data normalization helps ensure that patient-facing AI interprets clinical information consistently, even when data come from different sources or systems.

  • Garbage Detection

    “Garbage in, garbage out” applies strongly to patient-facing AI. Patient-facing AI must be able to recognize when information is missing, conflicting, incomplete or contradictory prior to making a prediction.

  • STEP 3/8: Model Validation & Ongoing Oversight

    STEP 3/8: Model Validation & Ongoing Oversight

    How are this AI system’s outputs validated, and how does it detect error over time?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Even well-designed AI systems can produce errors, degrade over time, or behave unpredictably as clinical data, populations, and workflows evolve. Without ongoing validation, unsafe outputs may go undetected until harm occurs.

    This section evaluates whether the AI system is validated before deployment, monitored during real-world use, and periodically reassessed to ensure patient-facing communication remains accurate, safe, and aligned with current clinical standards.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Safe & Clinically Accurate
    • Acknowledgement of Limits & Confidence Levels
    • Continuous Oversight & Improvement
  • CALIBRATION — Examples to help you calibrate your responses

    Different output validation and oversight practices can affect the safety and trustworthiness of patient-facing AI.

  •  Clinically grounded validation approaches

    These approaches are designed to ensure patient-facing AI outputs are accurate, clinically appropriate, and continuously monitored over time.

    EXAMPLE 1: External clinical review and peer validation

    In August 2023, Vital released their doctor-to-patient translator for radiology and had 7 outside doctors review the patient-facing AI output.

    Why this approach is strong:

    • Reviewed by an outside panel of physicians
    • Used overlapping validation sets for 2+ eyes on every case
    • Resulted in a peer-reviewed paper

    What this enables

    • Greater confidence in clinical safety and accuracy
    • Avoids the conflict of interest inherent in grading your own solution
    • Stronger trust among clinicians, health systems, and patients

  •  Validation approaches that are less than ideal

    Each of these represents a failure to validate properly, either at model creation, or more subtly, later in the model's lifetime. 

    EXAMPLE 1: No safeguards for model drift

    Why this matters:

    • Patient populations, clinical practices, and coding standards change over time
    • Disease prevalence (e.g Covid-19, flu-season) changes over time
    • Models validated on "older data" may degrade in performance if not retrained and revalidated

    Why this matters in patient care

    • Outputs may become less accurate without obvious warning
    • Patients may receive guidance that no longer reflects current practice
    • Risk accumulates silently over time

     


     

    EXAMPLE 2: Limited real-world validation

    A University of Michigan study found the AI model accidentally trained on "treated for sepsis" billing codes. Then, model drift occurred when billing shifted from ICD-9 to ICD-10.

     

    Why this approach is limited

    • Models may perform well in controlled or retrospective settings
    • Real-world deployment can surface unanticipated edge cases or biases
    • Validation based solely on proxy labels or billing codes can introduce error

    Why this matters in patient care

    • Increases the likelihood of error over time (aka "model rot")
    • Highlights the need for ongoing monitoring beyond initial launch
  • STEP 3/8: Model Validation & Ongoing Oversight

    STEP 3/8: Model Validation & Ongoing Oversight

    How are this AI system’s outputs validated, and how does it detect error over time?
  • STEP 3/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Human Validation

    Human review is an important safeguard for patient-facing AI, especially in complex or high-stakes situations. This section evaluates whether qualified clinicians or trained reviewers regularly sample and assess AI outputs in real-world settings to identify errors, edge cases, or unsafe patterns.

  • Continuous Automatic (machine) Validation

    Automated monitoring uses metrics, alerts, and anomaly detection (e.g., unusual output patterns or error rates) to detect problems quickly. This is critical for large-scale deployments where manual oversight alone cannot catch all issues in real time.

  • Retraining

    Even well-validated systems can drift over time due to model updates, data changes, or new clinical evidence. Periodic review by clinicians ensures that patient communications remain accurate, safe, and consistent with evolving guidelines and institutional policies.

  • Real-World Deployment

    Shadow mode allows the system to generate messages without sending them to patients, so teams can validate outputs prior to "go-live". Validation at each facility, region or EHR instance further accounts for local workflows, note structures, and data pipelines that can materially impact safety.

  • STEP 4/8: Protecting Patients in Vulnerable Situations

    STEP 4/8: Protecting Patients in Vulnerable Situations

    How does this AI system recognize and respond appropriately when patients may be vulnerable or at risk?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patient-facing AI systems may encounter situations involving serious diagnoses, emotional distress, safety concerns, or medical urgency. In these moments, inappropriate automation can cause harm if systems fail to escalate, defer, or adjust communication appropriately.

    This section evaluates whether the AI system can recognize sensitive or high-risk situations, and whether it responds with safeguards such as escalation to human care teams, real-world resources or delayed disclosure when needed.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Situationally-Appropriate Responses
    • Clear, Closed-Loop Communication
    • Patient-Specific Accommodation
  • CALIBRATION — Examples to help you calibrate your responses


  •  Insufficient protection for vulnerable patients

    These approaches deliver sensitive information without sufficient context, escalation pathways, or human oversight.

    EXAMPLE 1: Automated disclosure of serious diagnoses

    Patient portals release now lab results and doctors notes immediately post 21st Century Cures Act (2016). While transparency is a good thing for normal results, learning about a new cancer diagnosis "through an app" is less than ideal.

     

     

    Why this approach is limited

    • Sensitive results or diagnoses may be delivered without prior clinician context
    • Patients may receive high-impact information asynchronously and without the support of friends & family

    Why this matters in patient care

    • Can cause unnecessary distress or confusion
    • May undermine trust in both the AI system and the care team
    • Fails to account for the emotional and clinical complexity of serious diagnoses

     


     

    EXAMPLE 2: Failure to recognize or escalate signals of self-harm or crisis

    A number of recent suicides & murders have been linked to chatbot use. While a professional therapist has an obiligation to report imminent harm to the patient or others, chatbots do not. 

    Why this approach is limited

    • Systems may not reliably detect expressions of suicidal or homicidal ideation
    • Responses may lack appropriate escalation, support resources, or handoff mechanisms
    • AI systems do not carry the same ethical or legal responsibilities as licensed clinicians 

    Why this matters in patient care

    • Increased risk of harm in high-stakes situations
    • Missed opportunities to intervene or provide support
    • Highlights the need for clear escalation protocols
  • STEP 4/8: Protecting Patients in Vulnerable Situations

    STEP 4/8: Protecting Patients in Vulnerable Situations

    How does this AI system recognize and respond appropriately when patients may be vulnerable or at risk?
  • STEP 4/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Sensitive Topics

    Agents and/or models should accurately flag sensitive topics. The system should recognize when content pertains to highly sensitive or emotionally charged diagnoses or events. Flagging is the first step to enabling special handling or escalation to reduce harm.

  • Harm or Harmful Intent

    The system should not treat all content as routine; it must recognize when messages indicate immediate risk to health, safety, or legal obligations. Flagging hazardous content is a key prerequisite for any escalation or emergency protocol.

    Detection is not enough; the chatbot must also guide patients to appropriate hotlines, urgent care, ED, or in-person contacts. This ensures that high-risk situations are handled via human-led care instead of continued automated conversation.

  • Symptom Escalation

    Identifying complaints and symptoms compatible with high-risk or urgent conditions allows the system to steer patients away from asynchronous messaging and toward emergent care.

  • Prevent System Hijacking

    Systems that accept patient free-text are more vulnerable to manipulation. In conversational AI, this is often referred to as prompt hijacking. While large language models typically include guardrails around unsafe or prohibited content, these safeguards can sometimes be bypassed through carefully crafted inputs.

    Similar risks exist in systems that allow user-generated tags or metadata, which can be manipulated in ways that affect downstream responses or other users’ experiences if not properly controlled. This risk is especially important to address when users seek information related to self-harm or harm to others, where even subtle failures in safeguards can have serious consequences.

  • STEP 5/8: Personalized to the Individual Patient

    STEP 5/8: Personalized to the Individual Patient

    How does this AI system tailor communication to an individual patient’s health history, preferences, and needs?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patient-facing AI should adapt to the individual patient’s health context, language needs, and communication preferences. Generic or one-size-fits-all responses can undermine trust, increase confusion, and push patients toward unsafe alternatives.

    This section evaluates whether the AI system personalizes communication using appropriate patient-specific information and supports accessibility across language, literacy, and ability needs.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Patient-Specific Communication & Accommodation
  • CALIBRATION — Examples to help you calibrate your responses

  •  Clinically meaningful personalization approaches

    These approaches enable patient-facing AI to communicate in ways that reflect the individual patient’s language, health context, and preferences.

    EXAMPLE 1: Personalization grounded in patient context

    This web-app supports many languages, in text, notifications, patient video education, and lab results.

    Why this approach is strong

    • Adapts communication based on the patient’s preferred language and cultural context
    • Accounts for relevant patient-specific information (e.g., conditions, medications, allergies, recent results)
    • Adjusts explanations and guidance to reflect what is known about the individual patient

    What this enables

    • More relevant and understandable patient communication
    • Reduced cognitive and emotional burden for patients
    • Greater trust that the system is responding to them, not a generic user

  •   Approaches with too little personalization

    EXAMPLE 1: AI has no awareness of patient’s health record

    This is an app from a major health system, launched with a big press push. It does well on certain patient protections, but appears to know little about the patient.

    Why this approach is limited

    • The system cannot reference the patient’s actual results, medications, or conditions
    • Responses are general guidance rather than individualized insight

    Why this matters in patient care

    • Patients may feel the system does not “know them”
    • Can push patients toward external tools that lack clinical safeguards
    • Undermines trust in patient-facing AI throughout that health system
  • STEP 5/8: Personalized to the Individual Patient

    STEP 5/8: Personalized to the Individual Patient

    How does this AI system tailor communication to an individual patient’s health history, preferences, and needs?
  • STEP 5/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Knowledgeable, Personalized Responses

    The AI should know basics about the patient already from the clinical records: age, medications, allergies, and any test results. It should not give generic responses that are the same for all patients. If that happens, the system is more likely a rules-based decision "tree" and not a true AI.

  • Accessibility

    Patient-facing communication must be understandable and usable for people with different languages, health literacy levels, and accessibility needs. This section evaluates whether the system supports clear language, multilingual communication, and basic accessibility accommodations.

  • STEP 6/8: Empowering Patient Agency

    STEP 6/8: Empowering Patient Agency

    Does this AI system give patients the information and context they need to make informed decisions about their health?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patient-facing AI should help patients understand their health information - not replace their own decision-making. Transparency about sources, evidence, and limitations is essential for informed decision-making and patient trust.

    This section evaluates whether the AI system provides clear sourcing, evidence traceability, and mechanisms to verify patient understanding. This supports patients as active participants in their care.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Patient Autonomy & Empowerment
    • Transparency, Data Lineage & Evidence
  • CALIBRATION — Examples to help you calibrate your responses

  •  Approaches that support patient autonomy

    These approaches help patients understand why information is being presented and give them the ability to explore, verify, and act on health information.

    EXAMPLE 1: Transparent responses with clear source attribution

    While not patient-facing, UpToDate® illustrates clear attribution when looking up medical information. The same "cite your sources" should apply to patients.

    Why this approach is strong

    • Clearly identifies the sources behind medical explanations or recommendations
    • Allows patients to review original evidence or clinical guidance
    • Distinguishes between facts, evidence-based guidance, and interpretation
       

    What this enables

    • Greater patient trust and confidence
    • Reduced reliance on “blind trust” in AI-generated responses
  • STEP 6/8: Empowering Patient Agency

    STEP 6/8: Empowering Patient Agency

    Does this AI system give patients the information and context they need to make informed decisions about their health?
  • STEP 6/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

    Document Audit Trail

    Maintaining source lineage allows clinicians and auditors to see exactly which notes and metadata underpinned a given AI message. This is important for traceability, error investigation, and regulatory or medical-legal review.

    E.g. "What's the latest on my diabetes?" might cite both the latest lab result for blood glucose and the most recent provider note discussing diabetes or pre-diabetes.

  • Quality Sources

    Using high-quality, curated sources is essential to ensure that patient-facing advice reflects best available evidence. This reduces the risk of AI surfacing unvetted or fringe recommendations that could undermine informed decision-making.

  • Knowledge Verification

    Clear communication is not complete unless understanding is confirmed. Patient-facing AI should include mechanisms to check whether information was understood. A step further would be detection of patient confusion, so that clarification or human follow-up can occur.

  • STEP 7/8: Disclosure

    STEP 7/8: Disclosure

    Do patients clearly understand when information is generated by AI and what role human clinicians play?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patients should never be uncertain about whether information was generated by AI. Clear disclosure helps set appropriate expectations and reinforces trust.

    This section evaluates whether AI-generated content is clearly identified, appropriately framed, and positioned as supportive (not authoritative) relative to clinician judgment.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Disclose Identity & Accommodate Ethical Standards
  • CALIBRATION — Examples to help you calibrate your responses

  •  Clear Disclosure with Appropriate Deferral

    This example demonstrates transparent, patient-appropriate disclosure of AI involvement while reinforcing appropriate clinical boundaries.

    EXAMPLE 1: Transparent responses with clear source attribution

    Vital's doctor-to-patient translator uses this UX element before showing patient's an AI generated result.

    Why this approach is strong

    • Clearly identifies that the content was generated or summarized by AI
    • Acknowledges the possibility of error or omission without overstating risk
    • Links directly to the original clinical source, preserving evidence traceability
    • Explicitly defers final authority to the clinician and primary records

    What this enables

    • Sets accurate expectations about the role of AI in patient communication
    • Preserves patient trust by avoiding implied clinical authority
    • Supports informed decision-making without replacing provider judgment
  • STEP 7/8: Disclosure

    STEP 7/8: Disclosure

    Do patients clearly understand when information is generated by AI and what role human clinicians play?
  • STEP 7/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

  • STEP 8/8: Improving Care Team Efficiency

    STEP 8/8: Improving Care Team Efficiency

    Does the AI system support care teams without adding unnecessary friction or alert fatigue?
  • ORIENTATION — What this section of the Evaluation Framework assesses

    Patient-facing AI systems do not operate in isolation. Their outputs can directly affect clinician workload, inbox volume, alert burden, and downstream workflows. Systems that generate excessive alerts, low-value escalations, or poorly timed interruptions can undermine clinical judgment and contribute to burnout.

    This section evaluates whether the AI system meaningfully supports care team efficiency, or whether it introduces unnecessary noise, duplicate work or operational burden that outweighs its benefits.

    This section focuses on the following AI Care Standard Core Pillars™:

    • Optimization of Care Team Workflow
  • CALIBRATION — Examples to help you calibrate your responses


  •   High Alert Volume with Low Clinical Yield

    This example illustrates how poorly calibrated AI systems can increase staff burden without improving outcomes. A University of Michigan study found 109 alerts were generated for every one true sepsis case not already detected by clinical judgement.

     

    Why this approach is limited

    • Generates a high volume of alerts with low true-positive rates
    • Requires clinicians to respond to many false or low-value signals
    • Diverts attention from clinical judgment rather than supporting it
    • Misses a significant portion of true cases despite alert volume 

    Why this matters in patient care

    • Contributes to alert fatigue and clinician burnout
    • Reduces trust in AI-assisted systems
    • Increases operational burden without improving patient safety
  • STEP 8/8: Improving Care Team Efficiency

    STEP 8/8: Improving Care Team Efficiency

    Does the AI system support care teams without adding unnecessary friction or alert fatigue?
  • STEP 8/8: KEY EVALUATION QUESTIONS

    Answer the following questions to the best of your ability using information currently available to you.

  • Send your evaluation results

    Send your evaluation results

    On submission, we will send the results of the evaluation to emails listed below. This might be your boss, a client, or a committee. This is optional. Your email is only use to send you the results of this evaluation tool.
  • AI Care Standard™ Evaluation Summary

    AI Care Standard™ Evaluation Summary

    Overall Readiness Status
  • 🔴 Not ready for patient-facing deployment

     

  • 🟡 Refinements needed for patient-facing deployment

     

  • 🟢 Appears ready for patient-facing deployment

     

  • What "Not Ready" Means

    Based on your responses, the AI system under review currently has technical and/or safety gaps that could put patients at risk if deployed for direct patient communication.

    This does not imply poor intent or irreparable design. It indicates areas where additional safeguards, validation, or design changes are needed before safe use in real-world care.

  • What "Refinements Needed" Means

    Based on your responses, the AI system under review likely has technical or safety gaps that may put patients at risk if deployed for direct patient communication.

    This does not imply poor intent or irreparable design. It indicates areas where additional safeguards, validation, or design changes are needed before safe use in real-world care.

  • What "Appears Ready" Means

    Based on your responses, the AI system under review has sufficient risk mitigation for direct patient communication.

    This assessment is based on an aggregate score only. Please read through individual answers to determine if there are any deal-breakers for your organization. There may be additional safegaurds, validation or design changes needed before safe use in real-world care.

  • Approaches to Reduce Risk

    Below are a list of suggestions generated from your scores on individual sections of the AI Care Standard™ Evaluation Framework. These approaches are based on our work deploying real-world systems. Due to the permutations of responses, you may already be addressing some of the concerns highlighted.

  • Model Training:

    Consider utilizing data from multiple health systems across multiple regions. This may require licensing PHI-redacted data or using public data sets like MIMIC-IV. Use diverse data across age, ethnicity, disease state, and treatment setting (e.g. ER data is very different from Primary Care).


  • Model Prediction:

    Every good data scientist knows 95% of the effort is data wrangling and data cleansing. Do more. Put as much effort here as your fancy neural network!

    Health data is rarely "clean". Each hospital might use slightly different coding standards or reference ranges. Freetext was generated by tired humans who make cut-and-paste errors or contradict themselves.

    Normalize & sanitize your inputs. Don't make predictions when only 1% of inputs are available. If you take freetext, don't accept 4 word inputs when your system expects a 4 page note.


  • Model Validation:

    Appropriately-trained and licensed providers should validate your AI system at least annually. Where possible, use 3rd party validation on N > 1000 samples. Deploy in "shadow" mode for a few weeks before launching to patients. Retrain and revalidate your system at least annually

    The AI system should check its own work. You might be thinking, “if the AI can check its work, it wouldn't make mistakes.” That's only partly true. Consider that in practice, using a second AI "judge" can eliminate 80%+ of hallucinations.

    Alternatively, consider more stringent confidence thresholds. While a confidence of 51% might map to "yes", the model isn't as confident as a 98th percentile score.


  • Protecting Patients:

    Detecting suicidal or homicidal ideation is crucial for any system that allows patient input. Good systems go beyond those basics.

    Avoid the psychological harm and isolation from learning of a new cancer or loss of pregnancy without human empathy. Check for abuse. Look for patterns (words, phrases, response time) in patient input that may indicate degradation of agency or ability. Check for patient manipulation of the AI system itself.


  • Personalization:

    Health systems tend towards conservative approaches, often swayed by their legal teams. We've seen patient-facing AI blindly respond to minor health issues with "consult your physician or go to the ER immediately".

    This one-size-fits all approach does not keep patients safe. Instead, it drives them towards less-safe alternatives.

    Personalize your responses. Make sure the model knows more than the patient's age and name. Patients expect that the AI is fully connected to all or most of your EHR data.

    Support more than English, and ideally go beyond Spanish support as well. For accessibility (blind, deaf, color-blind, etc.) use the Web Content Accessibility Guidelines (WCAG) Level AA standards or better.


  • Patient Agency:

    Citing high-quality sources is a crucial scientific principle. Sources may include the patient's results, findings, or notes, or publications from reputable sources. Without citing sources, you're asking patients to accept information on "faith".


  • Disclosure: Max score = 2

    Care team efficiency: Max score = 2

  • A Note on Interpretation

    This evaluation reflects how the system performs based on the information provided at this point in time. Results should be used to inform design, governance, and deployment decisions. This tool is not a substitute for clinical, legal, or regulatory review.

  • Should be Empty: