Statistical Significance vs. Equivalence: What Clinical Trials Really Show

At medXteam, the focus is on clinical data. In this context, as a CRO, we not only carry out clinical trials with medical devices in accordance with MDR and ISO 14155, but also offer all other options and forms of data collection and product approval as well as market surveillance. In clinical trials, the focus is on the data collected, their evaluation and the interpretation of the results. A common mistake in interpretation is to treat the lack of a statistically significant difference between two treatments or products as evidence of their equivalence. In this blog post we examine why a non-significant difference does not mean equivalence and what consequences this can have for clinical studies of medical devices.
Underlying regulations
EU Regulation 2017/745 (MDR)
ISO 14155
1. Introduction
An essential step after collecting data in clinical trials is evaluating them. Depending on the nature of the study and the aim of the investigation, this involves testing for statistical significance or for equivalence. Statistical significance concerns whether the observed results would be unlikely if only random fluctuation were at work, which is taken as evidence of a real effect. Equivalence, on the other hand, means that two treatments or products can be treated as interchangeable because any difference between them is too small to be clinically relevant.
2. What does a non-significant difference mean?

A non-significant difference in a clinical trial means that the observed difference between two groups is not large enough to rule out chance as an explanation with the required confidence. Typically, a p-value greater than 0.05 is considered not significant. The p-value indicates how likely it is that the observed data, or something more extreme, would occur if the null hypothesis were true. The significance level (usually 0.05) is the threshold below which the p-value is considered small enough to reject the null hypothesis.


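For example, suppose a clinical study compares a new implant with an existing implant and finds a p-value of 0.08. This means that, if there were truly no difference between the implants, a difference at least as large as the one observed would be expected in about 8% of such studies. It does not mean that there is an 8% probability that the result is due to chance. Since the p-value is above the established significance level of 0.05, the difference is considered not significant.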
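To make the p-value in this example concrete, here is a minimal sketch of how such a value can arise. It assumes a normal approximation and uses hypothetical numbers (an observed difference of 3.5 points with a standard error of 2.0); neither the function name nor the numbers come from a real study.

```python
from statistics import NormalDist

def two_sided_p_value(diff: float, std_error: float) -> float:
    """Two-sided p-value for H0: true difference = 0 (normal approximation)."""
    z = diff / std_error
    # Probability of a test statistic at least this extreme in either direction
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical implant comparison: difference 3.5 points, standard error 2.0
p = two_sided_p_value(3.5, 2.0)
print(f"p = {p:.3f}")  # p = 0.080
print("significant" if p < 0.05 else "not significant")
```

The result says nothing about how small the true difference is; it only says that the data are not unusual enough to reject "no difference".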

3. Why is this not equivalent to equivalence?

In contrast to testing for a statistically significant difference, equivalence testing aims to show that the differences between two treatments or products are so small that they lie within a clinically acceptable range. This is achieved through specific study designs such as equivalence or non-inferiority studies.

Equivalence studies:

These studies set two predefined limits (equivalence limits) within which the difference between treatments must lie for them to be considered equivalent. The goal is to show that the effectiveness or safety of the new product does not differ from that of the established product by a clinically relevant amount.

Non-inferiority studies:

These studies check whether the new product is not worse than the existing product by more than a clinically acceptable amount. Only one limit is set (the non-inferiority margin), and the new product must not fall below it.
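The one-sided logic of a non-inferiority study can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the difference is defined as new minus standard with higher values being better, a normal approximation is used, and the one-sided level of 0.025 and all numbers are hypothetical.

```python
from statistics import NormalDist

def non_inferiority_check(diff, std_error, margin, alpha=0.025):
    """Return (is_non_inferior, lower_bound).

    diff   : estimated difference, new minus standard (higher = better)
    margin : positive non-inferiority margin fixed before the study
    The new product is non-inferior if the lower bound of the one-sided
    (1 - alpha) confidence interval lies above -margin.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    lower_bound = diff - z * std_error
    return lower_bound > -margin, lower_bound

# Hypothetical result: new product 1.0 point worse, standard error 1.0
for margin in (4.0, 2.0):
    ok, lower = non_inferiority_check(-1.0, 1.0, margin)
    print(f"margin {margin}: lower bound {lower:.2f}, non-inferior: {ok}")
```

The same data support non-inferiority with a margin of 4.0 but not with a margin of 2.0, which is why the margin has to be justified clinically and fixed in advance.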

4. Differences in methodology

4.1 Null hypothesis

When testing for statistically significant differences, the null hypothesis is usually that there is no difference. In equivalence studies, however, the null hypothesis is that the treatments are not equivalent. The study must provide enough evidence to refute this null hypothesis.
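Written out formally, the hypotheses look as follows. The notation is an assumption made for illustration: \mu_N and \mu_S are the true mean outcomes of the new and the standard product, and \Delta > 0 is the predefined equivalence or non-inferiority margin.

```latex
% Test for a difference
H_0:\ \mu_N - \mu_S = 0
\qquad \text{vs.} \qquad
H_1:\ \mu_N - \mu_S \neq 0

% Equivalence test
H_0:\ |\mu_N - \mu_S| \geq \Delta
\qquad \text{vs.} \qquad
H_1:\ |\mu_N - \mu_S| < \Delta

% Non-inferiority test (higher values are better)
H_0:\ \mu_N - \mu_S \leq -\Delta
\qquad \text{vs.} \qquad
H_1:\ \mu_N - \mu_S > -\Delta
```

In the difference test, "no difference" is the starting assumption and can never be proven; in the equivalence test, "not equivalent" is the starting assumption that the data must refute.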

Statistical significance tests play a central role in both types of studies, but the objectives and interpretation of the results differ. In classic tests of statistical significance, one looks for evidence that an observed difference did not occur by chance. The null hypothesis is rejected if a statistically significant difference is found (p-value < α).

In equivalence studies, the roles are reversed: the null hypothesis states that the treatments are not equivalent, that is, that the true difference lies outside the equivalence range. To reject this null hypothesis, the study must show that the difference between treatments is small enough to fall within the predefined equivalence range. Statistical significance is also tested here, but with a different procedure: typically two one-sided tests (TOST) at level α, which is the same as checking that the (1 − 2α) confidence interval, for example the 90% interval for α = 0.05, lies entirely within the equivalence range.
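The contrast between the two kinds of test can be illustrated with a short sketch. It uses a normal approximation and the two one-sided tests (TOST) procedure, a common way to test equivalence; the function names, the equivalence limits of ±4.0 and the data are hypothetical.

```python
from statistics import NormalDist

NORM = NormalDist()

def difference_p_value(diff, std_error):
    """Two-sided p-value for H0: true difference = 0."""
    return 2.0 * (1.0 - NORM.cdf(abs(diff / std_error)))

def tost_p_value(diff, std_error, lower, upper):
    """Equivalence p-value for H0: true difference <= lower or >= upper."""
    p_lower = 1.0 - NORM.cdf((diff - lower) / std_error)  # tests against the lower limit
    p_upper = NORM.cdf((diff - upper) / std_error)        # tests against the upper limit
    return max(p_lower, p_upper)  # both one-sided tests must be significant

# Same observed difference (0.5), measured precisely and imprecisely
for std_error in (1.0, 3.0):
    p_diff = difference_p_value(0.5, std_error)
    p_equiv = tost_p_value(0.5, std_error, lower=-4.0, upper=4.0)
    print(f"SE {std_error}: difference p = {p_diff:.3f}, equivalence p = {p_equiv:.4f}")
```

In both scenarios the difference test is non-significant, but only the precise study (standard error 1.0) reaches significance in the equivalence test. The imprecise study shows neither a difference nor equivalence.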

So in both cases statistical significance is used, but with different goals and interpretations.

4.2 Confidence intervals

When testing for significant differences, confidence intervals show the uncertainty of the estimate. In equivalence studies, they are instead compared against the established equivalence limits: if the entire confidence interval lies within these limits, equivalence can be concluded.

These differences in methodology make it clear that the mere absence of a statistically significant difference is not sufficient to demonstrate equivalence. There are other factors that must be taken into account to ensure correct interpretation of the study results.

4.3 Lack of power of the study

A study with a small sample size or insufficient power may miss true differences. The lack of a significant difference may therefore simply mean that the study was not sufficiently powered to detect it. The power of a study is the probability that it will detect a real effect of a given size if that effect actually exists. This is where sample size planning comes in: without it, a study may have too few participants to detect differences even when they exist.
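How strongly power depends on sample size can be sketched with a simple calculation. The example assumes two equally sized groups, a known common standard deviation and a two-sided z-test; the true difference of 5 points and the standard deviation of 10 are hypothetical.

```python
from statistics import NormalDist

def power_two_sample(true_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    norm = NormalDist()
    std_error = sd * (2.0 / n_per_group) ** 0.5
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(true_diff) / std_error
    # Probability that the test statistic falls beyond either critical value
    return (1.0 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

# Hypothetical setting: true difference 5 points, standard deviation 10
for n in (20, 64, 100):
    print(f"n = {n:3d} per group: power = {power_two_sample(5.0, 10.0, n):.2f}")
```

With 20 participants per group, a real difference of this size would be missed in roughly two out of three studies, so a non-significant result from such a study says very little.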

4.4 Confidence intervals and uncertainty of the estimate

A non-significant difference often comes with a wide confidence interval that is compatible both with no difference and with clinically important differences. Such an interval reflects the uncertainty of the estimate; it does not suggest equivalence.
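One way to read a result is to place the confidence interval next to the equivalence limits. The following sketch assumes a normal approximation, symmetric limits of ±margin and a 90% interval; the labels and numbers are hypothetical and only illustrate the reasoning.

```python
from statistics import NormalDist

def interpret_interval(diff, std_error, margin, level=0.90):
    """Classify a result by comparing its confidence interval with +/- margin."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    low, high = diff - z * std_error, diff + z * std_error
    if -margin < low and high < margin:
        verdict = "equivalence shown"           # interval entirely inside the limits
    elif high <= -margin or low >= margin:
        verdict = "clinically relevant difference"  # interval entirely outside
    else:
        verdict = "inconclusive"                # interval crosses a limit
    return verdict, (low, high)

# Both hypothetical intervals include zero, i.e. neither shows a difference
for std_error in (0.8, 3.5):
    verdict, (low, high) = interpret_interval(1.0, std_error, margin=4.0)
    print(f"SE {std_error}: CI ({low:.2f}, {high:.2f}) -> {verdict}")
```

The wide interval reaches beyond the upper limit, so a clinically relevant difference cannot be excluded even though the difference test is non-significant.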

4.5 Failure to reject the null hypothesis

The null hypothesis in most studies is that there is no difference. Failure to reject this null hypothesis does not mean that it has been proven that there is no difference, just that there is not enough evidence to claim the opposite.

5. Examples of problems in clinical trials of medical devices

5.1 Comparison of two implants

In a study evaluating a new hip implant compared to an established product, a p-value of 0.06 was found. Although the difference is not statistically significant, the new implant could still be less effective or less safe. A wide confidence interval could range from large superiority to clinically relevant inferiority.

5.2 Evaluation of a new diagnostic device

A new diagnostic device is tested against a standard device and the results show a p-value of 0.09. This doesn't mean that both devices are equally good, just that the study didn't find enough evidence to determine a difference. The study may not be large enough to detect small but clinically relevant differences.

6. How should equivalence be checked?

6.1 Equivalence and non-inferiority studies

To test equivalence, specific study designs such as equivalence or non-inferiority studies must be used. These studies have specific hypotheses and statistical methods to show that the differences between treatments are within a predefined tolerance limit.


An equivalence study could define that the new implant is clinically equivalent if the difference in functionality is within a range of ± 2% compared to the standard implant.

6.2 Confidence intervals and equivalence limits

Instead of just looking at p-values, confidence intervals should also be considered. If the entire confidence interval lies within the predefined equivalence limits, equivalence can be assumed.

7. Practical steps to avoid misunderstandings

Clear study design:

The study should clearly define whether it aims to find differences (superiority study) or to prove equivalence or non-inferiority. This influences the choice of statistical methods and the interpretation of the results.

Adequate sample size:

A sufficient sample size is crucial to ensure the power of the study. This helps detect real differences and avoid false negatives.

Predefined equivalence limits:

Before starting the study, clear equivalence limits should be established based on clinical considerations. This helps to better assess the clinical relevance of the results.
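The points on sample size and predefined limits can be combined in a rough planning sketch. It uses standard normal-approximation formulas for two equally sized groups with a known common standard deviation; the equivalence formula additionally assumes that the true difference is zero. All inputs (standard deviation 10, difference or margin of 5, 80% power) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

NORM = NormalDist()

def n_per_group_difference(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a true difference delta (two-sided test)."""
    z_alpha = NORM.inv_cdf(1.0 - alpha / 2.0)
    z_beta = NORM.inv_cdf(power)
    return ceil(2.0 * (sd * (z_alpha + z_beta) / delta) ** 2)

def n_per_group_equivalence(margin, sd, alpha=0.05, power=0.80):
    """Participants per group to show equivalence within +/- margin.

    Assumes the true difference is zero; a non-zero true difference
    would require more participants.
    """
    z_alpha = NORM.inv_cdf(1.0 - alpha)             # each one-sided test at level alpha
    z_beta = NORM.inv_cdf(1.0 - (1.0 - power) / 2.0)
    return ceil(2.0 * (sd * (z_alpha + z_beta) / margin) ** 2)

print(n_per_group_difference(delta=5.0, sd=10.0))    # 63
print(n_per_group_equivalence(margin=5.0, sd=10.0))  # 69
print(n_per_group_equivalence(margin=2.5, sd=10.0))  # tighter margin, far more participants
```

Halving the equivalence margin roughly quadruples the required sample size, which is one reason the margin should be set on clinical grounds before the study rather than adjusted afterwards.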

8. Conclusion

The absence of a statistically significant difference in clinical trials does not automatically mean that the medical devices tested are equivalent. Specific study designs and statistical methods are required to demonstrate equivalence. Careful planning and interpretation of study results are crucial to assess the true effectiveness and safety of medical devices. This is the only way we can ensure that new products meet the high standards of clinical practice and offer real benefits for patients.

9. How we can help you

Our statisticians accompany you from data collection through analysis to interpretation of the results, so that you are on the safe side.

As CRO, we support you throughout the entire process of generating and evaluating clinical data and in the approval and market monitoring of your product. And we start with the clinical strategy! We also create the complete clinical evaluation file for you.

In the case of clinical trials, we consider together with you whether and, if so, which clinical trial needs to be carried out, under what conditions and in accordance with what requirements. We clarify this as part of the pre-study phase: In 3 steps, we determine the correct and cost-effective strategy with regard to the clinical data collection required in your case.

If a clinical trial is to be carried out, basic safety and performance requirements must first be met. The data from the clinical trial then flow into the clinical evaluation, which in turn forms the basis for post-market clinical follow-up (PMCF) activities (including a PMCF study if necessary).

In addition, all medical device manufacturers require a quality management system (QMS), including when developing Class I products.

We support you throughout your entire project with your medical device, starting with a free initial consultation, help with the introduction of a QM system, study planning and implementation through to technical documentation - always with primary reference to the clinical data on the product: from beginning to end.

Do you already have some initial questions?

You can request a free initial consultation from us.

How is a clinical evaluation based on performance data created?

At medXteam, the focus is on clinical data. In this context, as a CRO, we not only carry out clinical trials with medical devices in accordance with MDR and ISO 14155, but also offer all other options and forms of data collection and product approval as well as market surveillance. The focus is always on clinical evaluation, both during product approval and during clinical follow-up. One possible route for creating the clinical evaluation is based on so-called performance data. How can such a clinical evaluation be carried out? What options are there to provide clinical evidence? And what role do clinical data play in this? In this blog post, we explore these questions and explain in particular when and how this route of clinical evaluation can be used.


MDR Medical Device Regulation; EU Regulation 2017/745

PMCF Post-Market Clinical Follow-up

CEP Clinical Evaluation Plan

CDP Clinical Development Plan

Underlying regulations

EU Regulation 2017/745 (MDR)

1. Introduction

As already described in the last blog post, the clinical evaluation for all medical devices - from Class I to Class III - is an essential step for every manufacturer of medical devices. This is derived from Article 61 of EU Regulation 2017/745 (MDR):

“The manufacturer shall specify and justify the level of clinical evidence necessary to demonstrate conformity with the relevant general safety and performance requirements. That level of clinical evidence shall be appropriate in view of the characteristics of the device and its intended purpose. To that end, manufacturers shall plan, conduct and document a clinical evaluation in accordance with this Article and Part A of Annex XIV.”

If the “performance data” route was defined in the CEP during planning, all requirements for the process and for preparing the clinical evaluation that result from the MDR and from MEDDEV 2.7/1 Rev. 4 must still be met. This blog post explains how this works.

2. The route via performance data

Demonstrating the clinical performance of a product through performance data has always been possible and remains so under the MDR (Article 61(10)):

“Without prejudice to paragraph 4, where the demonstration of conformity with general safety and performance requirements based on clinical data is not deemed appropriate, adequate justification for any such exception shall be given based on the results of the manufacturer's risk management and on consideration of the specifics of the interaction between the device and the human body, the clinical performance intended and the claims of the manufacturer. In such a case, the manufacturer shall duly substantiate in the technical documentation referred to in Annex II why it considers a demonstration of conformity with general safety and performance requirements that is based on the results of non-clinical testing methods alone, including performance evaluation, bench testing and pre-clinical evaluation, to be adequate.”

The decision is based on various aspects:

  • the result of risk management
  • the characteristics of the interaction between product and body
  • proof of performance based on product evaluations (technical, in-vitro)
  • the result of the preclinical assessment (initial literature search, verification tests, etc.)

This decision must be appropriately explained and documented in the clinical evaluation plan.

This route is preferred when a clinical trial offers little benefit. A typical example of this is the wooden tongue depressor, for which clinical data does not exist in the literature. In such cases, technical data such as breaking strength and workmanship indicate the safety and performance of the product.

As the equivalence route becomes less and less feasible, the route based on performance data is increasingly becoming the new standard when there is no need to generate your own clinical data.

Below are examples of when this route makes sense:

2.1 Example – Medical Software

Most software products (Class I and IIa) are examples of products where performance data makes sense. The reasoning for this decision is as follows:

The product has been fully verified as part of the software life cycle process in accordance with IEC 62304 and all tests have been successfully completed. The testing included unit testing, integration testing, system testing and usability testing. Based on these tests, it can be shown that the product works effectively.

According to MDCG 2020-1 (Guidance on Clinical Evaluation (MDR)/Performance Evaluation (IVDR) of Medical Device Software), scientific validity is defined as the extent to which the output of the software product, based on the selected inputs and algorithms, is associated with the targeted physiological state or clinical condition. To provide proof of scientific validity, a literature search is carried out. This search also covers proof of benefit according to the MDR, determines the state of the art and identifies the safety and performance of the medical device.

The clinically relevant components of the system are the implementations of the algorithms/questionnaires for diagnosis or the course of therapy. The literature search focuses on scores/detection algorithms as well as on the general use of digital products in the diagnosis/therapy of the relevant indications.

Table 1: Clinical evaluation of a software product

2.2 Example – Dental chair

Another product whose clinical performance, safety and benefits can be readily assessed using performance data, and for which a clinical trial makes no sense, is the dental treatment unit, that is, the dental chair.

Such products are active medical devices that are used to treat children and adults in the dental field. These products are dental treatment devices according to ISO 7494 with a dental patient chair according to ISO 6875. They are intended exclusively for use in dentistry and may only be operated by medical professionals. The dental treatment unit is used as an aid for patient positioning and for treatment in the dental field. Depending on whether dental instruments are part of this treatment unit and, if so, which ones, these products are classified in class IIa or IIb.

Because the intended purpose of these products is so clear, the question of whether a clinical trial on humans should be carried out does not arise. The claims about the product relate to ergonomics for the patient as well as for the practitioner and user. They also emphasize efficiency and ease of operation, as well as prescribed procedures and supporting components that facilitate infection control and maintain water quality. These claims are not suitable endpoints for a clinical trial. However, they can be supported with performance data. For example, ergonomics and ease of use can be demonstrated via the usability test (DIN EN 62366-1). Compliance with the relevant standards and regulations on water hygiene and quality also supports these claims. The reasons for choosing the route based on performance data are listed in Table 2: