
Quality Assurance for AI Software and Machine Learning

Marcos E. Mehle

The time when artificial intelligence (AI) was limited to labs and research groups is long gone. Its widespread presence in the internet-related industry (search engines, social networks, e-commerce) is now obvious to practically everyone, and we can no longer imagine the IT world without it. Today, the application of AI has spread to all kinds of industries, including those where reliability and safety are critical, and where strict regulations apply. And since quality assurance (QA) is an essential contributor to safety, it has become necessary to develop QA and regulatory frameworks for higher risk AI applications.


Artificial Intelligence and Machine Learning

Although the terms “artificial intelligence” and “machine learning” (ML) are sometimes used interchangeably, there is a substantial difference between them. We generally talk about artificial intelligence as the “science and engineering of making intelligent machines, especially intelligent computer programs” [1], while we see machine learning as a subset of AI. In ML, the model is built by an algorithm capable of inferring patterns and correlations from a set of data, which is usually quite large.
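As a toy illustration of this idea, the snippet below (with made-up data) "learns" a single decision threshold from labelled examples instead of having the rule hand-coded, which is the simplest possible instance of a model inferred from data:

```python
# Toy sketch: infer a decision rule from data rather than programming it.
# The data points and labels are invented for illustration.

def learn_threshold(points):
    """Pick the threshold that misclassifies the fewest labelled examples."""
    candidates = sorted(x for x, _ in points)
    def errors(t):
        # Predict 1 when x >= t; count disagreements with the true labels.
        return sum((x >= t) != bool(y) for x, y in points)
    return min(candidates, key=errors)

data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
t = learn_threshold(data)
print(t)  # the inferred decision boundary
```

Real ML algorithms infer far richer structure than one threshold, but the principle is the same: the "rule" comes out of the data, not out of the programmer's head.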

While one part of AI deals with algorithms defined by programmed sets of rules understandable to humans, in ML the “rules” by which the algorithm operates may be difficult or even impossible to interpret. This is because an enormous degree of complexity is entangled in layers of mathematical operations that cannot be understood in a “this-does-that” sense. There are many types of ML models; some are less complex and more interpretable, such as linear regression, logistic regression or simple decision trees, while others, like deep learning algorithms and random forests, are virtual black boxes. Their explainability is an active area of research, with some methods that let us at least partially understand what was learned.
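The interpretability of the simpler models can be made concrete with a minimal sketch. In the linear model below, each weight directly states how much a feature contributes to the score (the feature names, weights and inputs are all hypothetical), which is exactly the “this-does-that” reading that deep models lack:

```python
# Minimal sketch of an interpretable linear model. All numbers and
# feature names are invented for illustration.

def linear_predict(weights, bias, features):
    """Interpretable: each weight states how much a feature contributes."""
    return bias + sum(w * x for w, x in zip(weights, features))

def explain(weights, names, features):
    """Per-feature contribution: a 'this-does-that' reading of the model."""
    return {n: w * x for n, w, x in zip(names, weights, features)}

weights, bias = [0.8, -0.5], 0.1
names = ["age_normalised", "dose_normalised"]
x = [0.6, 0.4]

print(linear_predict(weights, bias, x))   # overall score
print(explain(weights, names, x))         # contribution per feature
```

A deep network replaces the two weights above with millions of parameters composed through non-linear layers, so no comparable per-feature reading exists.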

Transparency issues

“I don’t think we should use this on real people. And that’s because it’s a black box. We don’t actually understand what it’s doing.” [2] Some years ago, Dr Rich Caruana, a researcher at Microsoft Research, trained a neural net on a data set of pneumonia cases and found it extremely accurate at identifying when pneumonia was present. But when he was asked to use it on real patients, he was not enthusiastic. The mere fact that an algorithm proves effective at predicting the likelihood of a disease does not mean it is safe to use. Even if the algorithm was tested with methods like cross-validation and checked for repeatability and stability, it might still fail on some inputs.

To minimise the probability of such failures, ML models can be tested, especially in sensitive domains. Generally speaking, the outcome of an ML model is a prediction, which is not easy to compare or verify against an expected value. Nevertheless, developers can evaluate model performance by comparing predictions against the known values of a held-out test set; this is different from verifying the model for arbitrary inputs, because the range of expected values is limited to the test data. So-called black box testing of ML models can employ a variety of techniques, such as metamorphic testing, model performance testing, dual coding, comparison with simpler linear models, coverage-guided fuzzing and testing with varying data slices. There is also the problem of causality: a machine learning algorithm does not know whether a regularity found in the input data is really a cause of the outcome it predicts or just a correlation.
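As a sketch of one of these techniques, the snippet below applies a metamorphic test to a hypothetical stand-in model, `risk_score`. The point of metamorphic testing is that no expected output value is needed; only a relation that should hold between runs (here, invariance of the score under reordering of the symptom list):

```python
# Minimal metamorphic test. `risk_score` is a hypothetical stand-in for a
# trained model; the metamorphic relation is that reordering the input
# symptoms must not change the score.
import random

def risk_score(symptoms):
    # Stand-in for a trained model: here just a weighted count.
    weights = {"fever": 0.4, "cough": 0.2, "dyspnoea": 0.6}
    return sum(weights.get(s, 0.1) for s in symptoms)

def metamorphic_permutation_test(model, symptoms, trials=100):
    """Check the relation on many random permutations of the same input."""
    baseline = model(symptoms)
    for _ in range(trials):
        shuffled = random.sample(symptoms, len(symptoms))
        if abs(model(shuffled) - baseline) > 1e-9:
            return False   # relation violated: the model would be flagged
    return True

print(metamorphic_permutation_test(risk_score, ["fever", "cough", "dyspnoea"]))
```

The same pattern works with other relations, such as label-preserving noise on the input or monotonicity in one feature; which relations are appropriate depends on the model and domain.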

Standards for Quality

The current regulatory framework for different kinds of software relies on a software and system engineering paradigm that was clearly not designed with machine learning in mind. Widely used standards for software development life-cycle processes, like IEC 62304 for medical device software, ISO 26262 [3] for automotive or ISO/IEC 12207 for general-purpose software, are based on defining requirements, defining the architecture, decomposing the system into smaller units, integrating, verifying and validating the result – what is known as the V-model.


All these activities were defined on the assumption that software errors are systematic, not random; and since it is practically impossible to test software in all its possible “internal states”, it is better to build complex software from smaller, more testable units. So, how would you effectively apply such concepts to software that is developed in a non-deterministic way by running an automated training process on a data set, and where the “unit level” of software engineering doesn’t seem to exist? There are, fortunately, some short-term solutions to such problems, like treating the ML algorithm as a black box or “software of unknown provenance”, and then mitigating the related risks with external risk mitigations. But this approach is more a workaround than an actual solution.
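A minimal sketch of this “black box plus external risk mitigation” pattern: a deterministic, reviewable guard wraps the untrusted ML component and rejects implausible outputs. The bounds and fallback behaviour below are hypothetical.

```python
# Sketch of an external risk mitigation around an ML component treated as
# software of unknown provenance: a deterministic guard intercepts
# implausible outputs before they reach the rest of the system.
# Bounds and fallback value are invented for illustration.

class GuardedModel:
    def __init__(self, model, low, high, fallback):
        self.model = model
        self.low, self.high = low, high
        self.fallback = fallback   # e.g. "refer to a human operator"

    def predict(self, x):
        y = self.model(x)
        # Deterministic, testable safety logic outside the black box.
        if not (self.low <= y <= self.high):
            return self.fallback
        return y

opaque = lambda x: x * 2.5          # stand-in for the untrusted ML model
guarded = GuardedModel(opaque, 0.0, 1.0, None)
print(guarded.predict(0.3))   # in range -> 0.75
print(guarded.predict(2.0))   # out of range -> None (fallback)
```

Only the guard needs to satisfy the unit-level verification that standards like IEC 62304 expect; the ML model inside stays a black box, which is exactly why this is a workaround rather than a solution.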

There are currently at least two international standards under development that are likely to address these issues: the ISO/IEC CD 23053 (“Framework for Artificial Intelligence Systems Using Machine Learning”) and the ISO/AWI TR 23348 (“Statistics – Big Data Analytics – Model Validation”). These standards will hopefully provide a common approach for assessing the compliance of AI software, and will be especially useful for developing high-risk AI applications for regulated industries.

Regulatory framework

Considering the huge advances in the application of AI and ML, it is clear that standardisation and regulatory frameworks are lagging. Currently, no regulatory document addresses the specific challenges that arise from the substantial differences between ML and other software. Recently, the EU published a white paper discussing a high-level approach to the regulatory compliance of AI [4].


The white paper proposes an approach focusing on the following key aspects: requirements for training data; requirements for record-keeping of data and methodologies; requirements for application specifications and characteristics; requirements for robustness and accuracy; requirements for human oversight; and specific requirements for particular AI applications. It is clear that, once available, the EU regulations will no longer allow the “black box” approach (at least for high-risk applications) and will require a process-based approach that covers the collection of data, the design and choice of models, and human oversight.
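The record-keeping requirement, for instance, could be sketched as below: fingerprint the training data and log it alongside the methodology metadata, so that it can later be verified exactly which data and settings produced a given model. The field names and data here are hypothetical.

```python
# Sketch of record-keeping for training data and methodology.
# Record structure and field names are invented for illustration.
import hashlib
import json

def dataset_fingerprint(records):
    """Deterministic SHA-256 fingerprint of the training data."""
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def training_record(records, model_name, params):
    """Audit entry tying a model to the exact data and settings used."""
    return {
        "data_sha256": dataset_fingerprint(records),
        "n_records": len(records),
        "model": model_name,
        "hyperparameters": params,
    }

data = [{"age": 54, "outcome": 1}, {"age": 37, "outcome": 0}]
rec = training_record(data, "logistic_regression", {"C": 1.0})
print(rec["n_records"], rec["data_sha256"][:8])
```

Because the fingerprint is deterministic, re-hashing the archived data at audit time either reproduces the logged value or proves the data has changed since training.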

The healthcare industry

The US Food and Drug Administration (FDA) has probably gone the farthest in this regard by proposing a Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device [5], with concrete approaches and steps for the approval of medical devices employing ML. The potential for ML applications in the healthcare industry is substantial, and there are already several applications on the market with FDA approval [6]. In the EU, by contrast, the lack of concrete guidance on healthcare AI seems to be aggravated by the confusion surrounding the transition to the new MDR and the new rules for the classification of medical software.


Even in regulated industries and high-risk applications, there is enormous potential for machine learning. However, the QA and regulatory frameworks do not seem to be quite ready for it. The main challenge is that the established approach to developing and certifying software for critical applications was not designed to accommodate machine learning. Regulatory bodies are working on new regulations to address these challenges, supported by the standardisation community, which is preparing new standards for machine learning applications.



Marcos E. Mehle is an expert in the field of Quality Assurance of medical and other complex-domain products. He joined Cosylab in 2011 and has served in the roles of Hardware Engineer, Project Manager, Group Manager and Head of Quality Management. Marcos holds a B.Sc. in Electrical and Electronics Engineering.

[1] John McCarthy, “What is artificial intelligence?”





