Perspectives on AI/ML Safety Assurance

Reliability Analysis & Safety Standards


ML reliability glass ceiling analysis (~10⁻³ vs 10⁻⁹ required for safety-critical systems) - Topological Data Analysis for ultra-reliable ML, DAL A/ASIL D/SIL 4 standards.




HAL Id: hal-04635957
https://hal.science/hal-04635957
Submitted on 4 Jul 2024

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Emmanuel Ledinot, Philippe Quere, Philippe Baufreton, Jean Gassino, Franck Serratrice, et al. Perspectives on AI-ML Safety Assurance. ERTS2024, Jun 2024, Toulouse, France. ⟨hal-04635957⟩

Perspectives on AI-ML Safety Assurance

Emmanuel Ledinot, emmanuel.ledinot@thalesgroup.com
Philippe Quere, philippe.quere@stellantis.com
Philippe Baufreton, philippe.baufreton@safran.com
Jean Gassino, jean.gassino@irsn.fr
Franck Serratrice, franck.serratrice@renault.com
Hugues Bonnin, hugues.bonnin@continental.com
Damien Chabrol, damien.chabrol@kronosafe.com
Amina Mekki-Mokhtar, amina.mekkimokhtar@ansys.com
Olivier Appere, appere@adacore.com
Joseph Machrouh, joseph.machrouh@thalesgroup.com

Abstract: AI-ML suffers from a reliability glass-ceiling phenomenon (e.g. ~10⁻³ error/inference), making it incompatible with safety-criticality. Several orders of magnitude are missing. We explain why, and we point to the characteristics of ML that conflict with the assurance objectives assigned to safety-critical developments. Could encapsulation of ML constituents into fault-tolerant architectures, ML development assurance, and software/hardware development assurance together close the gap? We argue that, in spite of the impressive progress of the ML state of the art, the answer is negative. Drawing from Topological Data Analysis (TDA) and set-based non-linear control, we propose to supplement ML point-based specification and verification with volume-based specification and verification to meet 10⁻⁵ err./inf. levels, as a minimum. We outline the rationale of a new research field we name (Ultra) Reliable Machine Learning, at the confluence of TDA, statistics on manifolds, and ML safety assurance. Some cross-domain safety regulation principles guide the underlying rationale. We illustrate the methodology on image classification.

Keywords: Machine Learning, ML reliability, Safety assurance, ML assurance, latent manifold, Topological Data Analysis, persistent homology, extensional coverage analysis.

I. INTRODUCTION

Data analysis and statistics first developed to extract synthetic information from population data, as insights on complex phenomena (descriptive statistics).
Inferential statistics then focused on explanatory models of past observations, to get predictors on some limited aspects of complex phenomena. Never until recently had statistical estimation had to address safety-critical ‘control’. We use ‘control’ in the broad sense of OODA loops (Observation, Orientation, Decision, Action), where control of physics is involved and life, goods or the environment are at risk.

Machine Learning, especially Deep Learning (DL), opened a new era: unprecedented performance in machine vision and problem solving in high dimension. However, chaotic behavior exemplified by adversarial examples limited DL applicability [39], and is still a matter of concern. Could DL-based components, developed with extreme rigor and encapsulated in fault-tolerant architectures, deliver services that meet the reliability requirements specific to safety-critical ‘control’? This type of requirement is new to Machine Learning and data science.

The co-authors of this paper are members of the Embedded France association’s working group dedicated to the analysis of safety assurance standards in safety-related industrial domains, to contribute to their evolution [24]. We investigate the case of Machine Learning in this paper since ML-dependent safety-criticality is now on the agenda of aeronautics [2] and of the automotive industry. Our focus is limited to ML reliability, ML verification, and to safety assurance of ML-dependent systems.

To our knowledge, the best current accuracy scores on the easiest of image classification benchmarks (MNIST) are about 2 × 10⁻³ error/inference [40]. From a system safety perspective, this reliability level is poor: one error every seven lines containing 80 digits each. To make the gap more explicit, let us assume a 50 Hz input stream of digits processed by an AI-ML-dependent safety-critical vision-based controller. It would make ~360 generalization errors per hour (50 inferences/s × 3600 s/h × 2 × 10⁻³ error/inference), whereas the reliability target in the most critical case discussed in this paper would be one error per billion hours.

To address this gap, [1] screened the techniques amenable to improve ML reliability. They questioned the feasibility of reaching the reliability levels required by the highest DALs¹ and concluded negatively.

After some scoping and terminological preliminaries, we summarize this survey of reliability augmentation methods. We propose a conjectural explanation of why the reliability enhancement attempts uniformly failed (sections II, III, IV). Then, we discuss why software assurance will have no impact on this reliability gap (section V), and why fault-tolerant architectures will solve only the easy cases (section VI). At this stage, we conclude that for true ML-dependent safety-criticality, there is no escape from improving ML reliability by several orders of magnitude. From a geometric and topological perspective on approximant adjustment, we convey intuition on how great the challenge is. Thanks to recent advances in Topological Data Analysis (TDA), we propose a research path that would control ODD² modeling, data sampling, generalization domain definition, and approximant adjustment more tightly than is achieved today. We review some recent papers that suggest the relevance of such an attempt. We compare the rationale of safety-critical software verification with our TDA-enabled (U)R-ML verification proposal (section VIII).

¹ Development Assurance Level.
² Operational Design Domain; see the ISO 21448 standard (Road vehicles: Safety of the Intended Functionality).
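The 50 Hz arithmetic given in the Introduction can be checked with a few lines of Python. This is only a minimal sketch: the function and variable names are illustrative, and the target is taken as 10⁻⁹ errors per hour, following the "one per billion hours" figure quoted in the text.

```python
import math


def errors_per_hour(error_per_inference: float, rate_hz: float) -> float:
    """Expected number of erroneous inferences during one hour of operation."""
    return error_per_inference * rate_hz * 3600.0


def gap_in_orders_of_magnitude(observed: float, target: float) -> float:
    """Base-10 logarithm of the ratio between observed and target error rates."""
    return math.log10(observed / target)


mnist_error = 2e-3       # error/inference quoted in the text for MNIST [40]
stream_rate_hz = 50.0    # input stream rate assumed in the text
target_per_hour = 1e-9   # one error per billion hours (most critical case)

observed_per_hour = errors_per_hour(mnist_error, stream_rate_hz)
gap = gap_in_orders_of_magnitude(observed_per_hour, target_per_hour)

print(f"errors per hour: {observed_per_hour:.0f}")
print(f"gap to target:   {gap:.1f} orders of magnitude")
```

Under these assumptions the sketch gives about 360 errors per hour, between 11 and 12 orders of magnitude away from the target expressed per hour of operation.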

Finally, we discuss whether ML-dependent safety-critical ‘control’ could reach the ultimate reliability level of 1, i.e. correctness. Software engineering and assurance managed to ensure extremely high levels of quality. We compare the two domains on specification and verification.

Contribution: We propose a diagnosis of the ML-reliability plateau. We propose orientations to overcome the reliability gap by supplementing the current point-based approach of data science with a TDA-enabled volume-based approach.

Disclaimer: The views expressed in this paper are those of the authors as members of the Embedded France Working Group on safety assurance standards. They may not reflect the opinion of their affiliations.

II. SCOPING AI-ML-DEPENDENT SAFETY

A. Systems perimeter

We address ML-dependent safety-critical systems. Since our group is cross-domain, for the rest of the paper we use the following convention: DAL A is an abbreviation of all the corresponding assurance levels in the other industrial domains. DAL A stands for DAL A (aeronautics), ASIL D (automotive), SIL 4 (railway, process industry and many other domains) and class 1 (nuclear).

In this paper, an ML-component is classified as safety-critical if, and only if, it is a “Single Point of Catastrophic Failure” (SPCF). In other words, some error, in adverse foreseeable conditions, could lead to a catastrophic accident. DAL A is mandatory for SPCF components: there is no mitigation mechanism in the system architecture to prevent a failure causality chain originating from the ML-component from evolving into a catastrophic accident scenario. We abbreviate such situations as “SPCF-ML”.

Our prototypical SPCF-ML example in automotive is pedestrian detection systems coupled to automatic-braking systems. See [29] for the state of the art on DL-dependent pedestrian detection performance: robustness and accuracy are still a major concern. In aeronautics, inhabited autonomous urban air mobility is the example we have in mind.
More generally, we consider ML-dependent vehicle control, safety-critical healthcare devices, and all kinds of safety-critical operational technologies (OTs).

B. ML perimeter

We consider off-line supervised learning in high to very high input-space dimension (e.g. 10⁴ to 10⁶ and beyond). We exclude continuous learning and recent ML developments like transformers and LLMs. Regarding the ML-safety survey [5], we address Robustness and Monitoring. Ethics and Alignment are out of the scope of this paper.

C. Machine-vision perimeter

Open world semantic scene segmentation is the natural long-term goal. However, we do not claim to supplement such complex ML developments with TDA at first. In this paper, we limit ourselves to the development and assurance rationale of a proof of concept based on MNIST³. 10⁻⁵ err./inf. is our first milestone to fill the reliability gap. We present it as an illustrative example of a generic methodology expected to be progressively scaled up to ML processing pipelines as complex as 3D scene segmentation. After MNIST [35], the planned next step is LARD (Landing Approach Runway Detection) [25]. Only then could one conclude on the practical viability of (U)R-ML. MNIST and LARD have in common the existence of strong knowledge of the data generation process, which enables structured data interpretation.

³ MNIST is a prominent entry-point benchmark in the image classification community. It consists of 70000 handwritten digits elaborated by NIST in the USA.

III. TERMINOLOGICAL PRELIMINARIES

We need to avoid misinterpretation of terms like ‘dimension’, ‘dimension reduction’, ‘latent’ and a few more.

A. Machine learning

• Approximant: any function ℝⁿ → ℝᵖ, estimator of an underlying function specified by textual requirements and labeled datasets. We use ‘ML-model’, after adjustment, as a synonym of fitted approximant.
• Inference and generalization are used as synonyms: approximant activation on some input vector not seen during the training, calibration, and testing phases.
• Ambient space, also named embedding space: the space where the vectors (or points) of the datasets spread. Depending on the context, we use “ambient space” for the input-only (nD)⁴, output-only (pD), or input-output ((n+p)D) space. For greyscale image classifiers, n is the number of pixels and p the number of classes (e.g. MNIST: n = 28 × 28 = 784, p = 10).
• Latent space or latent manifold: the regions of the ambient space where the dataset points concentrate, i.e. cluster. The latent space has its own dimension, named latent dimension or intrinsic dimension.
• Dimension reduction: the classical interpretation of this term is the identification of the input space features that prominently condition the form of the output latent manifold (projection on a lower-dimensional space keeping most of the information, like PCA⁵). We never use this meaning. We consider ambient-to-latent dimensionality collapse by shifting from an external view to an internal view of the point cloud. When continuous natural processes generate data, dimensionality collapse occurs. Physical, operational, and control laws constrain input, state and output data to concentrate in low-dimensional regions that unfold, split, curl, merge, etc. in ambient space (Manifold Hypothesis (MH) on point clouds [11]).

⁴ nD stands for n dimensions (1D curves, 2D surfaces, etc.).
⁵ Principal Component Analysis.

B. Logics

• Extensional refers to extension as defined in “Extension Theory” [6], i.e. vector encoding of magnitudes for geometric and algebraic calculation. In the sequel, we regard geometric and topological analysis of point clouds in vector spaces as synonymous with “extensional approach”.
• Intensional qualifies definitions of sets or objects by symbol sequences (logical formulas, analytical expressions, characteristic predicates, etc.). For example, first-principle models are intensional characterizations of process behaviors. Structural coverage in software testing is intensional: it is hooked to the symbols of programs’ source or binary code. Ontologies of ODDs and analytic formulations of data-augmentation processes are on the intensional side as well.

IV. ML-RELIABILITY GLASS CEILING

A. Reliability augmentation techniques

In [1], a group of researchers investigated the means to improve ML reliability. Though ML made major progress on accuracy over the last two decades (1 to 2 orders of magnitude), 10⁻³/inf. is still too poor from a safety engineering viewpoint. [1] reviews quantitative reliability results obtained by model diversification, by monitoring (ODD, robustness, I/O consistency), by robustness enhancement techniques (model stability and training stability), by selective classification, by conformal prediction, and by temporal redundancy on sequences.

Their main conclusion is the following: all the methods that tried to increase reliability by redundancy of independent models, i.e. models resorting to independent approximant spaces, independent datasets and independent optimization processes, succeeded only marginally. Reliability stayed stuck in the range of 10⁻²/inference instead of the expected 10⁻⁴ = 10⁻² × 10⁻² or even 10⁻⁶ = 10⁻² × 10⁻² × 10⁻². Moreover, these techniques improved reliability at the expense of significant availability losses.

B. Common Cause Analysis

Strong correlation of inference errors between independently developed ML-models, i.e. lack of independence between redundancies, is an experimental fact evidenced by [1]. It is consistent with [39], where evidence is given that an adversarial example designed for model1, trained and tested on dataset1, still fools model2, specifically developed to be independent of model1 (datasets, approximant space, and optimization process). Similarly, [38] demonstrated a limited 13% reliability progress.
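The independence arithmetic quoted above (10⁻² × 10⁻² = 10⁻⁴) and its breakdown under correlated errors can be sketched numerically. For two models with the same marginal error probability p whose error indicators have Pearson correlation ρ, the probability that both err on the same input is p² + ρ·p·(1 − p), a standard identity for two Bernoulli variables. The function name and the values of ρ below are illustrative assumptions; they are not measurements taken from [1].

```python
def joint_error_probability(p: float, rho: float) -> float:
    """P(both models err) for two Bernoulli error indicators that share the
    marginal error probability p and have Pearson correlation rho.

    Restricted to 0 <= rho <= 1 (rho = 0: independent, rho = 1: identical errors).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # Covariance of the two indicators is rho * p * (1 - p);
    # E[X * Y] = E[X] * E[Y] + Cov(X, Y).
    return p * p + rho * p * (1.0 - p)


p = 1e-2  # per-model error probability, order of magnitude used in the text
for rho in (0.0, 0.1, 0.5, 1.0):  # hypothetical correlation levels
    print(f"rho = {rho:.1f} -> P(both err) = {joint_error_probability(p, rho):.2e}")
```

Under these assumptions, a correlation as small as 0.1 already gives a joint error probability close to 10⁻³, an order of magnitude worse than the 10⁻⁴ expected from independence, and ρ = 1 brings no gain at all over a single model.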
A 13% gain is negligible from a safety engineering perspective, given the reliability targets mentioned previously. Since in this paper we are going to compare ML and software engineering in the safety-critical case, we recall that in the 1980s [37] gave experimental evidence rejecting the independence hypothesis on N-version programming.

What could be an explanation? Our working hypothesis, which motivates our interest in TDA-augmented ML, is that the complexity of the latent manifold’s shape could be the common mode that correlates error occurrences between the so-called “independent” redundancies⁶.

Fig. 1. Model adjustment to a point cloud (green shape adjusted to the red spots). The dashed ellipses delineate topologically complex regions that are hard to fit correctly.

⁶ In section X, another potential cause is considered on MNIST: labeling errors [41].

State space complexity of non-linear dynamical systems (attractors, curvature, holes, cavities, etc.) compelled control engineers to start by splitting the state space into covering subspaces where the dynamics regime has some homogeneity and regularity amenable to a local linear approach. Then, they aggregate these local controllers into a unique global controller by mode switching and scheduling logics, up to complete coverage of the topologically complex reachable input/state/output space. The ML components we consider in this paper address the same type of continuous data manifolds. By contrast, standard data science addresses training datasets all at once, straight away at global scale.

Possibly, the ML model redundancies used in [1] failed to adjust reliably on the same topologically complex regions. Hard-to-fit regions of input space are problem dependent. In other words, they are ML-model independent, so they can correlate any pair of redundancies. The shape of training datasets is a potential common cause in ensemble learning.

C. Plateauing performance

When the approximant space is defined by the solutions to (n − 1) polynomial equations over n variables, the ambient space is nD and the latent space consists of 1D algebraic curves. Given k points in nD Euclidean space, finding a polynomial curve that links the k points is still an open mathematical problem [9]. By 2022, a proof of existence was published on the Web. It is under peer review. If it is confirmed, more than a century will have been necessary to solve the (n-ambient, 1-latent) case for an intensively investigated class of functions.

Fig. 2. The picture is courtesy of [8]. 1D latent manifolds in 3D ambient space. Limiting generalization errors to a very small number of occurrences requires controlling adjustment with extreme precision. Impact of “fitting” variability on the 3 projected curves when “adjustment” varies slightly (difference between the dashed and non-dashed curves).

Admittedly, equation solving (i.e. ‘exact adjustment’) is of a different nature than ML-model fitting. It is harder because of the exactness it requires. However, precision-controlled fitting in high dimension is a very difficult problem as well, even if a “flexible” one⁷. We advocate that high reliability of generalization will necessitate sophisticated mathematical tools to control where and why generalization errors occur. The ability to explain why a generalization error occurred, in order to fix it, will be mandatory for DAL A ML. Any known error that could potentially be a single cause of catastrophic failure should be eliminated to comply with regulation.

D. Zero-measure verification

Any behavioral specification defined by a cloud of points is extremely poor with respect to:

The immensity of the high dimensional ambient space,
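To give a sense of scale for the immensity of the ambient space mentioned above, the following sketch counts the cells of a deliberately coarse grid over the MNIST input space and compares that count with the dataset size. The dimension (784) and the dataset size (70000) come from the text; coarsening each pixel to two levels (black/white) is an assumption made here purely for illustration, and real greyscale inputs would make the count far larger.

```python
import math

n_pixels = 28 * 28     # MNIST ambient input dimension, as given in the text
n_samples = 70_000     # MNIST dataset size, as given in the text
levels_per_pixel = 2   # ASSUMPTION: each pixel coarsened to black or white

# Work in log10 to keep the numbers readable: 2**784 has 237 decimal digits.
log10_cells = n_pixels * math.log10(levels_per_pixel)

# Upper bound on the fraction of cells holding at least one sample
# (reached only if every sample falls into a distinct cell).
log10_max_coverage = math.log10(n_samples) - log10_cells

print(f"grid cells:       about 10^{log10_cells:.0f}")
print(f"max. coverage:    about 10^{log10_max_coverage:.0f} of the cells")
```

Even on this coarse grid, the samples can occupy at most a fraction of roughly 10⁻²³¹ of the cells, which illustrates why a point cloud alone specifies so little of the ambient space.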

References & Citation

Source: Phoenix Technical Documentation Library
Category: AI/ML Theory
Original: Peer-reviewed research paper / Official guideline
License: CC BY 4.0 (unless otherwise noted)

Suggested Citation:
Perspectives on AI/ML Safety Assurance. Phoenix Technical Documentation Library, Avondale.AI. Accessed May 2026. https://avondale.ai/technical/