Systematic Framework for Healthcare AI
Comprehensive framework for systematic data design in biomedical AI - problem definition, bias detection, modeling strategies, and validation protocols.
© The Author(s) 2024
G. J. Simon, C. Aliferis (eds.), Artificial Intelligence and Machine Learning in Health Care and Medical Sciences, Health Informatics, https://doi.org/10.1007/978-3-031-39355-6_7

Data Design in Biomedical AI/ML

Gyorgy Simon and Constantin Aliferis

G. Simon (*) · C. Aliferis, Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA. E-mail: constantinaibestpractices@gmail.com

Abstract

Data Design refers to the systematic choice of what data are modeled for analysis and how these data and the model output(s) are mapped between the Problem Space (real world) and the Model Space (features for the ML modeling). ML data design is an essential element of ML modeling. ML data design differs from classical statistical, epidemiological, and similar study designs in that (a) ML data design relies heavily on the existence of digital data repositories that are created independently of the problem-solving intent at hand; (b) ML modeling is highly scalable and mostly automated; (c) when using experimental data, ML data design may be used to guide the experiments conducted; (d) ML data design uses a richer set of data representations that transcend the classical design matrices, such as text, relational databases, and graphs; and (e) ML modeling has its own distinct capabilities, limitations, and other properties, and these are reflected in the data design choices. The present chapter covers tried and tested strategies and protocols that contribute to successful data designs and addresses a number of important biases that threaten the validity and generalizability of results. Lower-level data transformations, data storage, and security aspects are covered in the “Data Preparation, Transforms, Quality, and Management” chapter.

Data Design Overview

Figure 1 provides the context for Data Design by depicting the process of problem solving via ML modeling. Elements in black outlines are naturally occurring phenomena (i.e., occurring outside the control of the problem-solving team); elements in green outlines are controllable by the analyst/scientist involved in the ML-based problem solving.

ML-based problem solving involves two steps. The first is high-level data design, where one or more problem-solving data sets are constructed. These data contain the subjects and data elements relevant to the problem, and are either extracted from a repository of naturally collected data (i.e., collected without the control of the ML-based problem-solving team; top branch in the figure) or collected specifically for the purpose of the modeling (lower branch). In the second step, modeling data sets are constructed, upon which machine learning algorithms can operate.

The goal of data design. An ML model can be viewed as a function that maps inputs to outputs. The inputs to the ML model are ML features, which exist in the ML model space. ML features do not always have a one-to-one correspondence to real-world entities or to the naturally collected data elements. These entities and data elements exist in the real-world problem space. Similarly, the model output, which exists in the model space, needs to be mapped to the real-world entities in the problem space. The goal of data design is to create this mapping between the real-world problem space and the model space in such a way that mapping the real-world entities onto the ML model space, solving the problem in the ML model space, and finally mapping the ML solution from the model space back to the real world (problem space) yields a correct solution to the real-world problem.

Figure 2 depicts the context around data design from the perspective of the ML modeling process. It presents an overview of the ML modeling process and its iterative interaction with modeling. The blue rectangle highlights the elements of the modeling steps that fall under data design, and these are the elements that are discussed in this chapter.
Elements outside the blue box, namely data transformations, model fitting, evaluation, and iterative improvement of the model, are covered in other chapters of this book (“The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models,” “Data Preparation, Transforms, Quality, and Management,” and “Evaluation”).

Fig. 1 Overview of problem solving using ML modeling

Fig. 2 Overview of the modeling process. Steps of the Data Design are highlighted inside the blue rectangle

The first step in ML-based problem solving is to define the problem to be solved according to the following five elements:

1. Outcome. What clinical outcomes are we considering? If we consider multiple outcomes, which is the primary outcome and which are secondary?
2. Exposure/Intervention. Is there a particular exposure or intervention whose effect we wish to estimate?
3. Predictor variables (aka independent non-interventional variables). Which variables are relevant to this analysis and should be included? Which variables are confounders that we absolutely must include? Which variables must we omit?
4. Target population. Which patients should the answer hold true for?
5. Time frame. When should the answer hold true? When do we plan to intervene, apply the model, or use the knowledge that we gained from this analysis? How long does it take for the outcome to manifest itself?

The above five elements describe the real-world solutions we wish to obtain and the data elements needed to obtain them.

Data design is the process of creating a formal specification of project goals and variables and establishing a bidirectional mapping between the real-world entities (and data elements) in the problem space and the ML features in the ML model space. ML then provides the function approximation models of the real-world data generating process that can be used along with the aforementioned mapping to solve problems in the real world by first solving them in the ML model.
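The bidirectional mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's own method: the `PatientRecord` fields (including HbA1c), the weights inside `toy_model`, and the 0.5 decision threshold are all hypothetical and chosen only to make the three stages visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatientRecord:
    """A problem-space entity. All fields are hypothetical examples."""
    age_years: int
    hba1c_percent: float
    on_diabetes_treatment: bool


def to_features(record: PatientRecord) -> List[float]:
    """Map a real-world record (problem space) to ML features (model space)."""
    return [
        record.age_years / 100.0,                      # rescaled age
        record.hba1c_percent,                          # lab value used as-is
        1.0 if record.on_diabetes_treatment else 0.0,  # indicator feature
    ]


def toy_model(features: List[float]) -> float:
    """Stand-in for a fitted ML model; the weights are made up for illustration."""
    weights = [0.5, 0.05, -0.1]
    score = sum(w * x for w, x in zip(weights, features))
    return min(1.0, max(0.0, score))  # clip the score to [0, 1]


def to_problem_space(score: float, threshold: float = 0.5) -> str:
    """Map the model output (model space) back to a real-world statement."""
    return "high risk" if score >= threshold else "low risk"


def solve(record: PatientRecord) -> str:
    """Problem space -> model space -> solve -> back to problem space."""
    return to_problem_space(toy_model(to_features(record)))


print(solve(PatientRecord(age_years=70, hba1c_percent=9.0, on_diabetes_treatment=False)))
```

The point of the sketch is the separation of concerns: `to_features` and `to_problem_space` are the data design, while `toy_model` is the part that ML fitting would normally supply. An error in either mapping function yields a wrong real-world answer even if the model itself is accurate.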
As part of the data design, we also need to consider potential data design biases. Data design biases are systematic errors in the choice of data sources, including variables and samples, as well as mismatches between the data and the modeling methods to be used.

Classical statistical study design vs. data design. In classical statistical study design, the data representation for modeling is a matrix (the “design matrix” in statistical terminology), often a two-dimensional table of numbers. In ML, however, we often deal with other data representations such as higher-dimensional matrices (tensors), graphs, sequences, text documents, images, and combinations of the above. The data design methods we describe here are generally applicable, although for simplicity we give examples based on flat matrix representations.

Defining the Problem

At the initiation of an ML project, usually only a rough clinical or health science question and the context of use of the results are known. The objective of this step is to refine the rough problem statement into a more precise, formalized, and operationalizable format. By precise, we mean that the problem statement contains all and only the information we need to solve it; by formalized, we mean that the answer to the problem can be expressed as an estimand (a computable quantity); and by operationalizable, we mean that we can compute the answer in terms of the available data elements. In the following sections, we describe the setting, the five elements that make problem statements precise, and the most commonly used estimands. Afterwards, we describe common data design types to which we can map our problem, and in the last section of this chapter, we describe the inference process and explain what a “valid answer” means.

Setting

The first critical junction in defining the problem statement is to decide the operative setting. We consider three broad settings.
The first type is clinical settings, where the problem concerns clinical decision making, including risk models, estimates of effects of exposures, targeting of interventions, and timing of interventions. Such clinical models ideally will directly inform patient care or otherwise become part of health care delivery.

The second type is operational settings, where the results from the analysis are not directly used for treating patients, but rather for managing the system of health care.

The third type is health science research. ML models can be used for a broad array of research problems, which include biomarker discovery, optimized treatment protocols based on biomarkers, discovery of biological causal pathways, clinical trials, etc. Translational research contexts bridge the health sciences with the health care problem-solving domains.
The setting in which the modeling results are used determines many attributes of the data design and modeling. For example, evaluation of health care-oriented ML models needs to take patient safety into account. Clearly, the direct risk of harm to patients is highest in the clinical setting and lowest in the research setting. The scope of populations involved in health care versus health science modeling can vary from very narrow to full-population studies. However, health care modeling is often restricted to specific health systems, with or without examination of translation across systems.

Setting refers to the context in which the modeling results will be used. We broadly distinguish between three settings: clinical, operational, and research. Different settings impose different requirements on the steps of the modeling process.

Best Practice 7.1.1 The ML data design needs to take the operative setting of the ML models into account.

Elements of the Problem Statement

As we discussed earlier, the modeling project is typically motivated by a clinical question, an operational opportunity, or a research need. This initial motivation offers only a rough outline for the problem statement.

The five elements of a problem statement (Outcome, Exposure/Intervention, Predictor variables, Target population, Time frame) help make a rough problem statement more precise [1].

Example 7.2.1 As a hypothetical example, experts in a health system may believe that “starting diabetes treatment earlier could reduce major cardiac events”. The rough problem statement is “Does starting diabetes treatment earlier reduce major cardiac events?”. This question is not precise: how much earlier should we start? It is not formalized: what metric (estimand) for outcomes should we compute? It is also not operationalizable: how do we define “diabetic” using the available data elements?
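One way to make the gap between a rough and a precise problem statement concrete is to record the five elements explicitly and list which ones are still unspecified. The sketch below is an illustration only: the `ProblemStatement` class and its `missing_elements` method are hypothetical names, and the field values are taken from the rough question in the running example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemStatement:
    """Hypothetical container for the five elements of a problem statement."""
    outcome: Optional[str] = None
    exposure_or_intervention: Optional[str] = None
    predictor_variables: Optional[List[str]] = None
    target_population: Optional[str] = None
    time_frame: Optional[str] = None

    def missing_elements(self) -> List[str]:
        """Return the names of elements that have not been specified yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# The rough question only names an outcome and an intervention.
rough = ProblemStatement(
    outcome="major cardiac events",
    exposure_or_intervention="starting diabetes treatment earlier",
)

print(rough.missing_elements())
# ['predictor_variables', 'target_population', 'time_frame']
```

A missing element is a prompt for discussion rather than an error in itself, since some questions legitimately have no exposure or no designated outcome.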
To make the question more precise, we need to define five elements. Not all elements are needed for all questions, but most questions need most of these elements.

Target Population

If we construct a clinical risk model, the target population consists of the patients to whom we are going to apply the resulting model. All patients in the target population must be at risk of developing the outcome in the problem statement. If the problem concerns an interventional treatment, the target population is the patients who are eligible for the treatment. If the problem concerns the study of a biological function, then the target population is the set of research subjects in which this biological function exists.

The target population is the set of patients to which the problem statement applies.

Example 7.2.1 The example problem statement is related to the patient population of the health system in question, so the target population is the patient population (1) served by the hypothetical health system and (2) who would be considered for diabetes treatment, or who could conceivably benefit from earlier diabetes treatment. So the example question is further refined to: “Can earlier initiation of diabetes treatment in diabetic patients eligible for it reduce major cardiac events in this health system?”

Exposure/Intervention

Some studies are concerned with the effect of an intervention or of an exposure (defined below). Not all studies have an intervention of interest, but if there is one, we need to specify it. The intervention in our running example is the earlier initiation of diabetes treatment.

The intervention or exposure divides the population into two groups. In the case of an exposure to a naturally occurring factor, patients with the exposure are referred to as the exposed group and the remaining patients form the unexposed group (or controls, if similar to the exposed group before exposure).

The patients receiving the therapeutic intervention are referred to as the treatment group, while the remaining (untreated) patients are referred to as untreated patients (or as controls, if untreated and similar to the treated ones before treatment).
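The target population and the two intervention groups from the running example can be sketched as simple filters over patient records. This is a hedged illustration: the `Patient` class and its boolean flags are hypothetical, and in real data each flag would itself require an operational definition (for instance, what counts as "eligible" or "early").

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Patient:
    """Hypothetical patient record with pre-computed boolean flags."""
    patient_id: str
    in_health_system: bool
    eligible_for_treatment: bool
    treated_early: bool


def target_population(patients: List[Patient]) -> List[Patient]:
    """Keep patients served by the health system who are eligible for treatment."""
    return [p for p in patients if p.in_health_system and p.eligible_for_treatment]


def split_by_intervention(population: List[Patient]) -> Dict[str, List[str]]:
    """Divide the target population into treated and untreated groups."""
    groups: Dict[str, List[str]] = {"treated": [], "untreated": []}
    for p in population:
        key = "treated" if p.treated_early else "untreated"
        groups[key].append(p.patient_id)
    return groups


patients = [
    Patient("p1", True, True, True),
    Patient("p2", True, True, False),
    Patient("p3", True, False, False),  # not eligible: outside the target population
    Patient("p4", False, True, True),   # not served by this health system
]

groups = split_by_intervention(target_population(patients))
print(groups)
# {'treated': ['p1'], 'untreated': ['p2']}
```

Note that the untreated group here is not automatically a valid control group; as the text states, untreated patients serve as controls only if they were similar to the treated ones before treatment.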
In non-designed data (e.g., data collected from routine care records), therapeutic interventions may be considered as exposures. It is also common to collect data about interventions and multiple exposures and model them simultaneously.

Note that classical study design does not distinguish between exposure and treatment, and refers to both as ‘exposure’.

Outcome

Not all analyses have a designated response variable (e.g., a clinical outcome of interest). For example, finding comorbidities in older diabetic patients does not have a designated outcome of interest. However, the product of the analysis still needs to be specified. In this example, this product is the set of common comorbidities.

Commonly, studies may also have multiple outcomes, which are then categorized as primary and secondary. The primary outcome(s) is (are) the main focus of interest; other outcomes are called secondary outcomes.

Patients with the outcome in question are referred to as cases, while patients without the outcome are referred to as controls. Notice that the meaning of ‘control’ depends on the comparison being made: it can refer to two different groups, either those without the outcome or those without the intervention/exposure [2].

Example 7.2.1 In our running example, the main and only outcome is major cardiac events (MACE). Additional (secondary) outcomes could also be of interest, e.g., health care utilization or quality of life.

Time Period

Time period is the time frame encompassing the data to be modeled. Such time frames may concern, e.g., the time point at which the intervention is carried out (or a decision support model is used), or the time period during which the outcomes develop. There could also be a time period for collecting information before the intervention is applied.
Example 7.2.1 In our running example, for a retrospective analysis aiming to establish the effects of early diabetes treatment on MACE, we can use a design in which we collect historical patient data covering a 10-year time window starting 10 years before analysis and ending at the time of analysis. MACE occurs 5–10 years after diabetes, hence the choice of time window length. Note that there are alternative designs to accomplish this modeling that will be discussed later.
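The retrospective window in this example, together with the case/control terminology introduced earlier, can be sketched as follows. The function names, the input format (a mapping from patient ID to the date of the first recorded MACE, or `None`), and the "outside window" label are assumptions made for illustration; how to handle events that fall outside the window is a design decision the example does not settle.

```python
from datetime import date
from typing import Dict, Optional, Tuple


def observation_window(analysis_date: date, years: int = 10) -> Tuple[date, date]:
    """Return (start, end) of a window ending at the analysis date."""
    try:
        start = analysis_date.replace(year=analysis_date.year - years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        start = analysis_date.replace(year=analysis_date.year - years, day=28)
    return start, analysis_date


def label_outcomes(
    first_mace: Dict[str, Optional[date]], window: Tuple[date, date]
) -> Dict[str, str]:
    """Label each patient by whether a first MACE falls inside the window."""
    start, end = window
    labels: Dict[str, str] = {}
    for patient_id, event_date in first_mace.items():
        if event_date is None:
            labels[patient_id] = "control"         # no outcome recorded
        elif start <= event_date <= end:
            labels[patient_id] = "case"            # outcome inside the window
        else:
            labels[patient_id] = "outside window"  # needs a design decision
    return labels


window = observation_window(date(2024, 1, 1))
labels = label_outcomes(
    {"p1": date(2019, 6, 1), "p2": None, "p3": date(2010, 3, 15)},
    window,
)
print(window)
print(labels)
```

Patients labeled "outside window" are kept separate on purpose: an event before the window suggests the patient may not have been at risk of a first event during the window, which bears on membership in the target population.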
Predictor (Non-Outcome) Variables

Predictor variables are all the data (in addition to outcomes) that could possibly be relevant to our modeling task. Predictor variables include demographics (age, sex or gender as appropriate), risk factors, exposures, interventions, social behavioral data,