How should we think about monitoring the effect of AI-based tools on clinical outcomes?

Last Updated: February 28, 2024

Disclosure: All authors - employment and equity ownership in Verily Life Sciences. Amy P. Abernethy - stock ownership in EQRx, Georgiamune, One Health and Iterative Health; consulting fees from Sixth Street and ClavystBio.
Pub Date: Wednesday, Feb 28, 2024
Author: Caroline Marra PhD; Joseph B. Franklin JD PhD; Amy P. Abernethy MD, PhD
Affiliation: Verily Life Sciences; South San Francisco, CA

Artificial intelligence (AI)-based tools have the potential to transform cardiovascular care in numerous areas - from diagnosis and disease classification to treatment selection and clinical decision support. The many examples and uses of AI tools in cardiovascular care are united by a common objective: to make better use of available data to inform care delivery and, ultimately, improve patient outcomes. The American Heart Association (AHA)'s Scientific Statement, "Use of Artificial Intelligence in Improving Outcomes in Heart Disease," highlights best practices and challenges with the use of AI-based tools in cardiovascular medicine. The authors describe many promising uses of AI tools but emphasize that, in general, these tools "have yet to improve patient outcomes at scale." Across a variety of applications, including imaging, electrocardiography, and in-hospital and at-home monitoring, there is a common call to action: to develop and implement appropriate means of evaluating and monitoring AI tools that achieve the scale and scientific rigor needed to ensure responsible deployment.

This need is widely recognized and features prominently in recent policymaking related to AI. For example, a newly issued executive order, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," reflects a growing understanding in recent years that the application of AI-based tools for healthcare will need to be accompanied by systematic monitoring to ensure accuracy and guard against unintended consequences.(1) Many of these themes are also emphasized in multi-sector consensus statements, such as the Coalition for Health AI's "Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare."(2)

As highlighted in AHA's Scientific Statement, "a unique hazard to AI/ML-based systems is that algorithm performance may degrade over time." Further concerns include amplifying bias and inequity in healthcare. The introduction of Peter Embi's "algorithmovigilance" concept has brought clarity to the scientific activities required for the evaluation, monitoring, understanding, and prevention of adverse effects of algorithms.(3) Akin to pharmacovigilance for monitoring drug effects, algorithmovigilance underscores the need - perhaps even the ethical imperative - to continuously monitor AI-based tools used in healthcare practice, not only for safety, but also for effectiveness and equity.
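As a purely illustrative sketch of what monitoring for degradation and inequity could look like in its simplest form, the Python below compares accuracy per time period and patient subgroup against a baseline period. The function names, the `(period, subgroup, prediction, outcome)` row layout, the sample data, and the 0.05 tolerance are all assumptions made for this example; they are not drawn from the AHA statement or from any specific monitoring framework.

```python
from collections import defaultdict


def accuracy_by_group(rows):
    """Return {(period, subgroup): accuracy} for rows of
    (period, subgroup, prediction, outcome). Hypothetical data layout."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, subgroup, prediction, outcome in rows:
        key = (period, subgroup)
        totals[key] += 1
        hits[key] += int(prediction == outcome)
    return {key: hits[key] / totals[key] for key in totals}


def flag_degradation(accuracy, baseline_period, tolerance=0.05):
    """List (period, subgroup) pairs whose accuracy fell more than
    `tolerance` below the same subgroup's baseline accuracy.
    The tolerance is an arbitrary illustrative threshold."""
    flagged = []
    for (period, subgroup), value in sorted(accuracy.items()):
        baseline = accuracy.get((baseline_period, subgroup))
        if period == baseline_period or baseline is None:
            continue
        if baseline - value > tolerance:
            flagged.append((period, subgroup))
    return flagged


# Made-up example rows: (period, subgroup, prediction, outcome)
rows = [
    ("2023Q1", "A", 1, 1), ("2023Q1", "A", 0, 0),
    ("2023Q1", "A", 1, 1), ("2023Q1", "A", 1, 0),
    ("2023Q1", "B", 1, 1), ("2023Q1", "B", 0, 0),
    ("2023Q1", "B", 1, 1), ("2023Q1", "B", 0, 0),
    ("2023Q2", "A", 1, 1), ("2023Q2", "A", 0, 0),
    ("2023Q2", "A", 1, 1), ("2023Q2", "A", 0, 1),
    ("2023Q2", "B", 1, 0), ("2023Q2", "B", 0, 1),
    ("2023Q2", "B", 1, 1), ("2023Q2", "B", 0, 0),
]

accuracy = accuracy_by_group(rows)
flagged = flag_degradation(accuracy, baseline_period="2023Q1")
print(flagged)
```

In this made-up data, subgroup A holds steady while subgroup B drops, so only subgroup B in the later period is flagged. A real program would need far more than raw accuracy, including confidence intervals, adequate sample sizes per subgroup, and clinically chosen thresholds.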

Though there is growing consensus on the need for adequate monitoring of AI tools, agreement on the right level of monitoring is lacking, and figuring out how to accomplish monitoring across so many domains is a daunting challenge. We still need to resolve many fundamental questions. For example, what data and what analysis techniques are needed to achieve appropriate monitoring of AI-based tools, whether in cardiovascular health or other therapeutic areas? How should the role and type of monitoring change depending on the nature of an AI tool and the specific context in which the tool will be used? Fundamentally, to what degree is the evaluation of clinical outcomes needed depending on the context of AI deployment, and how do we streamline the process of evaluating clinical performance to match the anticipated scale of widespread AI deployment?

In many cases, adequate monitoring will likely require attention not just to algorithm performance in the narrower sense of analytical validation (for example, how the algorithm is performing on the data it is using and whether the underlying data the algorithm was trained on has been validated),(4) but also to the effect of using the algorithm on clinical outcomes. Evaluating clinical outcomes requires the use of many different data sources to assess the efficacy and safety of AI-based tools when used in various care settings.(5) Fundamentally, it requires confidently linking individuals or cohorts, the AI-based intervention they received, and reliably assessed outcomes. Electronic health record (EHR) data may be a start, but to truly accomplish this, we must look beyond data that is collected within traditional clinical trial infrastructure to data collected during routine care and by patients outside of regularly scheduled visits (e.g., through digital health technologies and patient-reported outcomes). And, importantly, we must build new methods for combining these data sources to answer questions such as "for what proportion of patients for whom the AI tool predicted a therapeutic response did a response actually occur?", "did use of the AI tool to support decision-making lead to a change in how the patient feels or functions overall?", and, most fundamentally, "is the AI tool leading to positive, and not negative, clinical outcomes across the patients for whom it is used?"
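The first of these questions corresponds to what is commonly called positive predictive value: among patients for whom the tool predicted a response, the share who actually responded. A minimal Python sketch follows, assuming that predictions and observed outcomes have already been linked at the patient level; the `PatientRecord` type, its field names, and the sample records are hypothetical and serve only to illustrate the calculation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PatientRecord:
    """Hypothetical linked record: one prediction and one observed outcome."""
    predicted_response: bool  # the AI tool predicted a therapeutic response
    observed_response: bool   # a response was confirmed in follow-up data


def positive_predictive_value(records: Iterable[PatientRecord]) -> Optional[float]:
    """Share of predicted responders who actually responded.

    Returns None when the tool predicted no responders, because the
    proportion is undefined in that case.
    """
    predicted = [r for r in records if r.predicted_response]
    if not predicted:
        return None
    return sum(r.observed_response for r in predicted) / len(predicted)


# Made-up records for illustration only.
records = [
    PatientRecord(predicted_response=True, observed_response=True),
    PatientRecord(predicted_response=True, observed_response=False),
    PatientRecord(predicted_response=True, observed_response=True),
    PatientRecord(predicted_response=False, observed_response=False),
    PatientRecord(predicted_response=True, observed_response=True),
]

ppv = positive_predictive_value(records)
print(ppv)  # 3 of the 4 predicted responders responded -> 0.75
```

The arithmetic is trivial; the hard part, as the paragraph above notes, is producing the linked records in the first place, since the observed outcome may have to be assembled from EHRs, digital health technologies, and patient-reported sources.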

To monitor and evaluate performance of AI-based tools in improving clinical outcomes, we must urgently create the infrastructure to enable scientific analysis of multiple data sources, at scale. Yet, to date, efforts to develop this infrastructure have generated more questions than answers. Using many different data sources to evaluate clinical outcomes across broad populations is inherently difficult for a variety of reasons.

First, clinical outcomes can be challenging to measure even with the best available data. While some outcomes may occur suddenly, others can take years to manifest to a point where they can be measured. Further, in some therapeutic areas, there may be a lack of consensus about which outcomes are clinically meaningful, and the measurement criteria may not be clearly defined.

Second, monitoring performance on clinical outcomes at anything near the scale that will be needed for AI tools will require the use of data that is already collected during the normal course of clinical care. It would be impractical to rely solely on a parallel data collection method akin to a traditional clinical trial. That said, using data that is collected in routine care and available from EHRs and other "real-world" sources requires careful scientific attention, because such data typically is not generated with the purpose of evaluating clinical outcomes across many patients. For example, in cardiovascular care, acute myocardial infarction (AMI) is a key clinical outcome that, unlike many other clinical events, has widely adopted international consensus diagnostic criteria. However, the use of EHRs to ascertain occurrence of an AMI is not straightforward because the clinical circumstances around an AMI can be complex and the variables that are essential for clinical diagnosis, including markers of myocardial necrosis, ECG, and symptoms, are often recorded in EHRs inconsistently or not at all.(6)

And third, proactive and intentional design of study methods will be critical to assess the performance of AI tools on clinical outcomes in a manner that instills confidence in the findings. Here again, the scale of the task of monitoring AI tools in real-world settings presents challenges. We will need agreement on when it is appropriate to use fit-for-purpose study designs and statistical methods that may rely, in many cases, on observational data without randomization. At the same time, because clinical outcomes necessarily involve real patients and many confounding variables that make it difficult to ascertain causal links, randomization will be needed in some circumstances, particularly in higher-risk settings. But in these cases, the likely scale of widespread AI tool adoption will require pragmatic approaches to randomized study design, in which the conduct of the study resembles usual clinical practice and the results can be applied to multiple settings beyond those reflected in a traditional clinical trial.(7) And though the concept of pragmatic studies has gained traction in recent years, implementation of pragmatic randomized designs requires thorough planning and consideration of research ethics.

Overall, AI tools provide an incredible opportunity to enable continuous improvement, innovation, and equity in our healthcare systems (i.e., the learning health system) by bringing together powerful data sources that can inform and help improve clinical care.(8) In fact, broad use of AI tools in healthcare is likely essential to achieving the goal of optimized health for every patient. However, this objective will only be possible, and more importantly, will only be responsible, if we have adequate means to monitor the performance of AI tools as they are deployed in real-world settings. This includes consideration of the effect of AI tools on clinical outcomes, at least in some cases. To enable this vision, we need to keep advancing efforts to develop consensus standards around the use of multiple data sources to support continuous evaluation of AI tools in the real world.


Armoundas AA, Narayan SM, Arnett DK, Spector-Bagdady K, Bennett DA, Celi LA, Friedman PA, Gollob MH, Hall JL, Kwitek AE, Lett E, Menon BK, Sheehan KA, Al-Zaiti SS; on behalf of the American Heart Association Institute for Precision Cardiovascular Medicine; Council on Cardiovascular and Stroke Nursing; Council on Lifelong Congenital Heart Disease and Heart Health in the Young; Council on Cardiovascular Radiology and Intervention; Council on Hypertension; Council on the Kidney in Cardiovascular Disease; and Stroke Council. Use of artificial intelligence in improving outcomes in heart disease: a scientific statement from the American Heart Association. Circulation. Published online February 28, 2024. doi: 10.1161/CIR.0000000000001201


  1. Executive Office of the President. 2023 [Accessed 2023 Nov 6]. Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Available from:
  2. Coalition for Health AI. Coalition for Health AI. 2023 [Accessed 2023 Nov 6]. Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare. Version 1.0. Available from:
  3. Embi PJ. Algorithmovigilance-Advancing Methods to Analyze and Monitor Artificial Intelligence-Driven Health Care for Effectiveness and Equity. JAMA Netw Open. 2021 Apr 1;4(4):e214622.
  4. Gottlieb S, Silvis L. Regulators Face Novel Challenges as Artificial Intelligence Tools Enter Medical Practice. JAMA Health Forum. 2023 Jun 2;4(6):e232300.
  5. Tsopra R, Fernandez X, Luchinat C, Alberghina L, Lehrach H, Vanoni M, et al. A framework for validating AI in precision medicine: considerations from the European ITFoC consortium. BMC Med Inform Decis Mak. 2021 Oct 2;21(1):274.
  6. Rubbo B, Fitzpatrick NK, Denaxas S, Daskalopoulou M, Yu N, Patel RS, et al. Use of electronic health records to ascertain, validate and phenotype acute myocardial infarction: A systematic review and recommendations. Int J Cardiol. 2015 Mar 5;187:705–11.
  7. Dal-Ré R, Janiaud P, Ioannidis JPA. Real-world evidence: How pragmatic are randomized controlled trials labeled as pragmatic? BMC Med. 2018 Apr 3;16(1):49.
  8. Institute of Medicine, Roundtable on Evidence-Based Medicine. The Learning Healthcare System: Workshop Summary. National Academies Press; 2007. 374 p.

-- The opinions expressed in this commentary are not necessarily those of the editors or of the American Heart Association --