Blog

From figures to findings - statistics in practice

At medXteam, clinical data is our focus. As a CRO, we not only conduct clinical trials (studies) with medical devices in accordance with the MDR and ISO 14155, but also offer support in the statistical planning and analysis of study data. This article provides an overview of the most important statistical concepts in clinical studies, from basic explanations and practical examples to more advanced topics.

Abbreviations

GCP: Good Clinical Practice

MDR: Medical Device Regulation (EU Regulation 2017/745)

Underlying regulations

EU Regulation 2017/745 (MDR)
General Data Protection Regulation (GDPR)
Medical Devices Implementation Act (MPDG)
ISO 14155

1. Introduction

Statistical methods play a central role in the clinical investigation of medical devices. They are key to data analysis, results interpretation, and regulatory compliance. This article covers the following topics:

  • Confidence intervals
  • Type I and type II errors
  • Acceptance criteria
  • Box plots
  • Forest plots
  • Paired (dependent) data
  • Sensitivity and specificity

2. What was that about the confidence interval again?

The confidence interval indicates the range within which an estimated parameter, such as the mean or an effect size, lies with a given probability. It quantifies the uncertainty surrounding an estimate and is therefore an indispensable tool in statistics.

The confidence interval indicates how precise an estimate is. The narrower the interval, the more confident we can be that the true value is close to the estimated value. Conversely, a wide interval suggests greater uncertainty. A confidence interval is often given with a confidence level of 95%. This means that the true value will lie within the stated range in 95 out of 100 cases if the study is repeated under identical conditions.

Example: 

Suppose a study shows a mean wound healing time of 10 days with a 95% confidence interval of [8, 12]. The data are thus compatible with a true mean anywhere between 8 and 12 days; the procedure used to construct such an interval captures the true mean in 95% of repeated studies.
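As a sketch of how such an interval is computed in practice, the snippet below uses hypothetical healing-time data and the classical t-distribution approach:

```python
import numpy as np
from scipy import stats

# Hypothetical healing times (days) for 20 patients -- illustrative only
healing = np.array([8, 9, 11, 10, 12, 9, 8, 10, 11, 13,
                    10, 9, 12, 11, 10, 8, 9, 12, 10, 11])

mean = healing.mean()
sem = stats.sem(healing)  # standard error of the mean (ddof=1)

# 95% CI from the t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, df=len(healing) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f} days, 95% CI = [{low:.1f}, {high:.1f}]")
```

Doubling the sample size (with similar variability) would shrink the interval by roughly a factor of √2, illustrating the sample-size effect discussed below.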

2.1 In-depth study

How do confidence intervals work and why are they crucial?

  • The basic idea of the confidence interval: The confidence interval is based on the uncertainty inherent in any sample. It quantifies this uncertainty by specifying a range within which the true value of a parameter is highly likely to lie. The more data we collect and the lower the variance of the data, the more precise (i.e., narrower) the interval becomes.
  • Interpreting a 95% confidence interval: It does not mean that the true value lies within the interval "with 95% probability". Instead, the statement refers to the procedure: if we were to repeat the data collection many times, the resulting intervals would contain the true value in 95% of cases.
  • What factors influence the width of a confidence interval? The width depends on three main factors:
    • Sample size: Larger samples provide more precise estimates because the influence of random fluctuations is reduced, leading to narrower confidence intervals. With small samples, the intervals are wider because the uncertainty is greater.
    • Data variability: If the values are widely scattered around the mean, the intervals are wider because the uncertainty about the true value increases.
    • Selected confidence level: Higher confidence levels (e.g., 99% instead of 95%) result in wider intervals because more uncertainty is taken into account. Conversely, a lower confidence level (e.g., 90%) results in narrower intervals.

Practical implications: A particularly wide interval indicates that additional data is needed to better narrow down the true value.

2.2 Alternative methods for estimating confidence intervals

The classical method assumes that the underlying data are normally distributed and that the sample size is sufficiently large. Alternative approaches can be used when these assumptions are not met or when samples are small.

  • Bootstrapping: This method is ideal when the assumption of normality is violated or when dealing with small sample sizes. It involves repeatedly drawing a large number of samples with replacement from the available data. For each of these samples, the parameter of interest (e.g., the mean) is calculated. The distribution of these estimates then serves as the basis for deriving the confidence interval.
    • Advantages: Robust against violations of distributional assumptions; flexibly applicable.
    • Application example: For non-normally distributed data, such as highly skewed blood pressure values, bootstrapping provides a reliable estimate.
  • Bayesian credible intervals: Unlike classical statistics, the Bayesian approach treats the parameter itself as a random variable. Prior knowledge about the parameter is introduced through a so-called prior distribution, which is combined with the observed data (the likelihood) to obtain the posterior distribution. The credible interval then indicates the range within which the true value lies with a certain probability.
    • Advantages: Integration of prior knowledge; more direct interpretability with small sample sizes.
    • Application example: If previous studies show that a medical device typically yields a wound healing time of about 10 days, this information can be included in the analysis to reduce uncertainty.
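A percentile bootstrap needs nothing beyond NumPy; the skewed healing times below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical, right-skewed healing times (days) -- illustrative only
data = np.array([7, 8, 8, 9, 9, 10, 10, 11, 12, 14, 18, 25])

# Percentile bootstrap: resample with replacement, recompute the mean each time
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(10_000)])

# The 2.5th and 97.5th percentiles of the bootstrap distribution form the CI
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: [{low:.1f}, {high:.1f}]")
```

Because the interval is read off the empirical resampling distribution, no normality assumption is required; the same recipe works for medians, proportions, or other statistics.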

2.3 Practical relevance of confidence intervals in clinical research

  • Clinical relevance versus statistical significance: Confidence intervals provide more information than a p-value. While a p-value only indicates whether an effect is statistically significant, the confidence interval also shows whether the effect is clinically relevant. For example, a medical device might produce a statistically significant reduction in wound healing time, but this reduction might be so small that it is clinically irrelevant.
  • Assessment of uncertainties: Regulatory decisions often involve checking whether the lower limit of the confidence interval is above a certain threshold that is considered clinically significant.

3. Type I and type II errors

Statistical tests carry the risk of errors because decisions are made based on sample data that can only partially reflect reality. Type I and type II errors are therefore important concepts in statistics and particularly relevant in clinical research, where incorrect decisions can have significant consequences.

Two types of errors can occur in statistical tests:

  • Type I error (alpha error): This occurs when the null hypothesis is rejected even though it is true. This error is also known as a "false alarm." Example: An ineffective medical device is classified as effective.
  • Type II error (beta error): This occurs when the null hypothesis is retained even though the alternative hypothesis is true. This is often described as "missing an effect." Example: An effective medical device is not detected as such.

Maintaining a balance between these types of errors is a key task in planning clinical trials. The significance level and statistical power play a central role in this.

Example: 

A new medical device is being tested. An alpha error would lead to the approval of an ineffective product, while a beta error could classify an effective product as ineffective.

Practical implications: While an alpha error is problematic from a regulatory and economic perspective, a beta error can hinder medical innovation.
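To make both error rates tangible, here is a small simulation sketch: two-sample t-tests at alpha = 0.05 on invented healing-time distributions, first with no real difference (to estimate alpha) and then with a real 1-day difference (to estimate beta):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, runs = 30, 0.05, 2000

# Type I error rate: both groups come from the same distribution (null is true),
# so every significant result is a false alarm
false_pos = sum(
    stats.ttest_ind(rng.normal(10, 2, n), rng.normal(10, 2, n)).pvalue < alpha
    for _ in range(runs)) / runs

# Type II error rate: a real 1-day difference exists (alternative is true),
# so every non-significant result is a missed effect
false_neg = sum(
    stats.ttest_ind(rng.normal(10, 2, n), rng.normal(9, 2, n)).pvalue >= alpha
    for _ in range(runs)) / runs

print(f"estimated alpha: {false_pos:.3f}, estimated beta: {false_neg:.3f}")
```

With 30 patients per group the estimated alpha lands near the nominal 0.05, while the beta error for this modest effect stays uncomfortably high, which is exactly why sample-size planning matters.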

3.1 In-depth study

  • Relationship between alpha and beta errors: There is a direct trade-off between these two types of errors. If the significance level (alpha) is chosen more strictly to reduce the probability of a type I error (e.g., 0.01 instead of 0.05), the risk of a type II error often increases. Conversely, relaxing the alpha level reduces the beta error but increases the risk of declaring spurious effects.
  • Statistical power: Power measures how well a test can detect a real effect. A power of 80% means that a real effect is missed in 20% of cases (beta error). Power is influenced by sample size, effect size, and the chosen significance level; a larger sample increases the probability of detecting small effects and reduces the beta error.
  • Adjustment for multiple analyses:
    • Interim analyses: In studies with repeated data looks, the probability of a type I error increases because each analysis carries the chance of detecting a random effect. Methods such as the O'Brien-Fleming approach use stricter thresholds in early analyses to control the overall error rate.
    • Bonferroni correction: This method divides the significance level by the number of comparisons to keep the overall error rate low. It is conservative, however, and can reduce statistical power when many tests are performed.
  • Bayesian perspective: Instead of rigid significance levels, Bayesian statistics assesses probabilities directly, for example: How likely is it that the effect is larger than a clinically relevant threshold? This can lead to more flexible and interpretable results, especially with small samples.
  • ROC curves: The receiver operating characteristic (ROC) curve shows the trade-off between sensitivity (the true-positive rate) and 1 − specificity (the false-positive rate). It helps to identify thresholds that balance the two error types.
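The Bonferroni adjustment is simple enough to sketch in a few lines; the p-values below are hypothetical results from four study endpoints:

```python
# Bonferroni correction: divide the significance level by the number of tests
alpha = 0.05
p_values = [0.003, 0.012, 0.04, 0.20]   # hypothetical p-values from 4 endpoints

adjusted_alpha = alpha / len(p_values)  # stricter per-test threshold
significant = [p for p in p_values if p < adjusted_alpha]
print(f"threshold per test: {adjusted_alpha}, significant: {significant}")
```

Note that p = 0.04 would count as significant against the unadjusted alpha of 0.05 but not against the corrected threshold, illustrating the method's conservatism.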

4. Acceptance criteria

Acceptance criteria define the conditions under which a clinical outcome is considered successful. They are crucial for the interpretation of study results and the decision as to whether a medical device is effective or safe.

Acceptance criteria define which results are required to achieve a specific goal. They influence study planning, hypothesis formulation, and ultimately, the approval decision for a product.

Example: 

A medical device is being developed to shorten the healing time after surgery. The acceptance criterion is that the average healing time must be reduced by at least 20% compared to standard treatment. The study will verify whether the confidence interval of the result exceeds this threshold.
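A minimal sketch of such a check, using simulated healing times and a bootstrap interval for the relative reduction (all numbers are hypothetical, not real study data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated healing times (days): standard care vs. new device (hypothetical)
standard = rng.normal(14.0, 2.0, 40)
device = rng.normal(10.0, 2.0, 40)

reduction = 1 - device.mean() / standard.mean()  # observed relative reduction

# Bootstrap the lower 95% bound of the relative reduction
boot = [1 - rng.choice(device, 40, replace=True).mean()
          / rng.choice(standard, 40, replace=True).mean()
        for _ in range(5000)]
lower = np.percentile(boot, 2.5)

# Acceptance criterion: the lower bound must exceed a 20% reduction
print(f"reduction = {reduction:.1%}, lower 95% bound = {lower:.1%}, "
      f"criterion met: {lower > 0.20}")
```

The decision rests on the lower confidence bound, not on the point estimate: the observed reduction may comfortably exceed 20% while the interval still dips below the threshold.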

4.1 In-depth study

  1. Non-inferiority, superiority, and equivalence tests:
    • Non-inferiority test: Shows that the new product is no worse than an existing treatment, within an acceptable tolerance range.
    • Superiority test: Demonstrates that the product is significantly better.
    • Equivalence test: Checks whether the product is similar within a defined range (e.g., ±10%).
  2. Bayesian approaches: Instead of setting a fixed threshold for acceptance, Bayesian methods calculate the probability that the true effect exceeds a predefined threshold. This allows for a dynamic and probabilistic analysis.
  3. Clinical significance: A statistically significant effect does not automatically meet an acceptance criterion, as clinical relevance must also be assessed. For example, a 1% reduction in pain might be statistically significant but clinically insignificant.
  4. Cost-benefit analysis: Strict acceptance criteria can increase the quality of the assessment, but often require larger samples, which increases the cost and duration of the study.

5. How do I read a boxplot?

A box plot, also known as a box-and-whisker plot, is a versatile statistical tool that visualizes the distribution of data at a glance. It helps to identify central tendency, variability, and potential outliers and is particularly useful for comparisons between groups.

A boxplot provides a compact representation of the distribution of a data set. The most important components are:

  • Median: The line in the middle of the box represents the central value of the data.
  • Quartiles: The bottom edge of the box is the 1st quartile (Q1), the top edge the 3rd quartile (Q3).
  • Interquartile range (IQR): The range between Q1 and Q3 comprises the middle 50% of the data.
  • Whiskers: The lines above and below the box extend to the most extreme data points within a defined limit (often 1.5 times the IQR beyond the box).
  • Outliers: Data points that lie outside this limit are displayed separately as points.

Example 

Let's imagine we have data on the healing time of two patient groups (Group A and Group B):

  • Group A has shorter healing times with low variability, resulting in a compact box with short whiskers.
  • Group B shows greater differences between patients, resulting in a wider box and longer whiskers.

A direct comparison of the two boxplots can quickly show which group is more homogeneous and whether there are any extreme outliers.
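The quantities behind a box plot can be computed directly; the two groups below are invented to mirror the example (Group A compact, Group B spread out with an extreme value):

```python
import numpy as np

# Hypothetical healing times (days) for two groups -- illustrative only
group_a = np.array([8, 9, 9, 10, 10, 10, 11, 11, 12])
group_b = np.array([7, 9, 10, 12, 13, 15, 16, 19, 30])

for name, data in [("A", group_a), ("B", group_b)]:
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box edges are drawn as outliers
    outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
    print(f"Group {name}: median={med}, IQR={iqr}, outliers={outliers}")
```

Group A produces a narrow box with no outliers, while Group B's wide IQR and flagged extreme value (30 days) are exactly what the longer whiskers and separate points in a plot would show.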

5.1 In-depth study

  1. Detailed interpretation:
    • Median: Indicates the central tendency of the data and is robust against outliers.
    • IQR: Shows the dispersion of the middle 50% of the data and gives an impression of the variability.
    • Whiskers and outliers: Help identify extreme values that could potentially distort the analysis.
  2. Comparing groups: Boxplots are ideal for illustrating differences between groups, for example, to compare the effect of a medical device across different age groups. Differences in the height of the box or the length of the whiskers can indicate variability or systematic effects.
  3. Enhanced visualizations:
    • Violin plots: A combination of box plot and density plot that shows the entire distribution of the data. Particularly useful for multimodal distributions (e.g., two peaks in the data).
    • Parallel boxplots: Multiple boxplots placed side by side facilitate the direct comparison of groups.
  4. Use in clinical trials:
    • Outlier analysis: In a clinical trial, outliers could indicate patients who respond exceptionally well or poorly to a treatment. Such findings can provide clues to individual differences that are important for further research.
    • Stratification: Boxplots can be used to stratify and visually represent data by subgroups (e.g., age groups, gender).
  5. Robustness: Because the median and quartiles are insensitive to outliers, the box plot is particularly robust. However, strongly asymmetrical distributions (e.g., long "tails" on one side) can be misleading. In such cases, alternative representations such as the violin plot can be helpful.

6. How do I read a forest plot?

A forest plot is an indispensable tool in meta-analysis, enabling the presentation and interpretation of results from multiple studies or subgroups. It displays estimates and their confidence intervals in a single, unified diagram.

A forest plot consists of:

  • Estimate points: These points or squares represent the effect (e.g. mean, odds ratio) of each study or subgroup.
  • Confidence intervals: The horizontal lines indicate the uncertainty of the estimate.
  • Vertical line: This represents the "no effect" point, e.g., an odds ratio of 1 or an effect size of 0.
  • Overall effect: A diamond at the bottom indicates the weighted mean of all studies, with the width of the diamond representing the confidence interval.

Example

A meta-analysis examines the effectiveness of a plaster on wound healing time in various studies.

  • Study A shows a significant reduction in wound healing time, with a confidence interval that lies entirely on the "shorter healing time" side of the "no effect" line.
  • Study B has a wide confidence interval that crosses the line, which suggests uncertainty about whether the plaster helps.
  • The overall effect (diamond) also lies clear of the line, indicating a significant overall effectiveness of the plaster.

6.1 In-depth study

  1. Analysis of heterogeneity:
    • Cochran's Q test: Checks whether the variation between studies is greater than would be expected by chance.
    • I² statistic: Indicates the percentage of variability that is explained by heterogeneity rather than chance. A high value (e.g., above 50%) suggests that a random-effects model is more appropriate.
  2. Fixed-effects vs. random-effects models:
    • Fixed-effects model: Assumes that all studies measure the same true effect and that differences arise only by chance.
    • Random-effects model: Takes into account that studies may have different populations and conditions, and allows for greater variability between studies.
  3. Bayesian forest plots: Bayesian approaches use prior knowledge to better model uncertainty. The forest plot can then visualize posterior distributions and credible intervals, allowing for a deeper interpretation.
  4. Interpretation in practice: A forest plot can be used to assess the consistency of results. Studies whose confidence intervals do not cross the "no effect" line provide strong evidence. Divergent results from individual studies may indicate methodological differences or specific population effects.

7. What are paired (dependent) data?

Paired (dependent) data are measurements that are not independent of each other. This situation often occurs in clinical trials when, for example, the same patient is measured multiple times (e.g., before and after treatment) or when observations come in natural pairs or groups (e.g., twins, or devices tested on the same patient).

With paired data, one measurement is directly related to another. The best-known examples are before-and-after measurements and matched samples. In such cases, it is important to use statistical methods that account for the dependency; otherwise, incorrect conclusions may be drawn.

  • Typical scenario: Data collection before and after a patient's treatment. Since both measurements come from the same patient, they are not independent.

Example

A study is investigating the effectiveness of a new wound dressing in accelerating post-operative healing. Healing time is measured in the same patients before and after application of the dressing. Since both measurements are taken from the same patient, they are related. A simple comparison of means without considering this relationship would lead to biased results. Instead, a paired t-test should be used to correctly analyze the differences in healing times.
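A paired analysis of hypothetical before/after healing times with SciPy, contrasted with the (incorrect) unpaired test that ignores the pairing:

```python
import numpy as np
from scipy import stats

# Hypothetical healing times (days) for the same 10 patients, before and after
before = np.array([14, 12, 15, 13, 16, 14, 13, 15, 12, 14])
after = np.array([11, 10, 12, 12, 13, 11, 12, 12, 10, 11])

# Paired t-test: analyzes the within-patient differences
t_stat, p_paired = stats.ttest_rel(before, after)

# For comparison only: an unpaired test wrongly treats the groups as independent
_, p_unpaired = stats.ttest_ind(before, after)

print(f"paired p = {p_paired:.6f}, unpaired p = {p_unpaired:.6f}")
```

Because the paired test removes between-patient variability, it yields a smaller p-value here than the unpaired test on the very same numbers; with noisier data, ignoring the pairing can even hide a real effect entirely.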

7.1 In-depth study

  1. Why does dependency matter? Many statistical tests assume independent observations. Paired data violate this assumption, so a specific analysis is required to avoid biased results.
  2. Suitable statistical methods:
    • Paired t-test: Compares the means of two related groups by analyzing the differences within pairs.
    • Wilcoxon signed-rank test: The non-parametric alternative when the differences are not normally distributed.
    • Linear mixed models (LMM): Particularly useful in complex study designs with multiple time points or groups; they can model random effects (e.g., individual differences) and fixed effects (e.g., treatment) simultaneously.
  3. Variance-covariance structure: In advanced models such as repeated-measures ANOVA, the correlation between measurements must be modeled correctly. Different assumptions about the structure (e.g., compound symmetry or autoregressive structures) influence the results.
  4. Practical challenges:
    • Missing values: Paired data are particularly susceptible to bias when measurements are missing. Methods such as multiple imputation or maximum-likelihood estimation can help minimize distortions.
    • Complexity: Analyzing dependent data often requires specialized software and knowledge of advanced statistical methods.

8. Difference between sensitivity and specificity

Sensitivity and specificity are fundamental measures for evaluating the quality of a diagnostic test. They describe how well a test is able to identify sick individuals and correctly exclude healthy individuals.

Sensitivity: The proportion of truly ill individuals correctly identified by the test (true positives). It measures the ability to avoid overlooking sick individuals.

Specificity: The proportion of truly healthy individuals who are correctly identified as healthy (true negatives). It describes how well the test can avoid false alarms.

Why is this important? A perfect test would have a sensitivity and specificity of 100%. In practice, however, compromises often have to be made, for example in mass screenings where a test with high sensitivity is preferred to avoid missing any sick people.

Example

A test for diagnosing a rare disease has the following properties:

  • 90% sensitivity: Of 100 actually sick patients, the test correctly identifies 90; 10 are incorrectly classified as healthy.
  • 80% specificity: Of 100 healthy individuals, 80 are correctly identified as healthy; 20 are incorrectly classified as ill.
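Both figures follow directly from the confusion-matrix counts in the example:

```python
# Confusion-matrix counts from the example above
tp, fn = 90, 10   # 100 truly ill: 90 detected (true positives), 10 missed
tn, fp = 80, 20   # 100 truly healthy: 80 cleared (true negatives), 20 false alarms

sensitivity = tp / (tp + fn)   # ability to detect the ill
specificity = tn / (tn + fp)   # ability to correctly clear the healthy
print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```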

8.1 In-depth study

  1. Relationship with prevalence: The positive and negative predictive values (PPV and NPV) depend directly on the prevalence of the disease. At low prevalence, even a test with good specificity produces many false positives relative to the few true positives, so the PPV can be low.
  2. ROC curves and thresholds: A receiver operating characteristic (ROC) curve shows how the sensitivity and specificity of a test change across different threshold values. Choosing a threshold is a trade-off between false positives and false negatives. The area under the ROC curve (AUC) is a measure of the overall performance of the test.
  3. Trade-offs between sensitivity and specificity: Tests with high sensitivity (e.g., screening tests) often have lower specificity and produce more false-positive results. Combined testing strategies (e.g., a sensitive screening test followed by a specific confirmatory test) can improve diagnostic accuracy.
  4. Bayesian analysis: Bayesian analysis allows the calculation of the probability that a patient is actually ill, given a positive test result and a known prevalence. This helps to better inform diagnostic decisions.
  5. Practical applications: Diagnostic tests such as COVID-19 antigen tests or mammography screenings, and the evaluation of new diagnostic devices or methods in clinical trials.
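The prevalence dependence noted above can be verified directly with Bayes' theorem; the sketch below reuses the example test's 90% sensitivity and 80% specificity:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem:
    P(ill | positive) = P(positive | ill) * P(ill) / P(positive)."""
    p_positive = (sensitivity * prevalence
                  + (1 - specificity) * (1 - prevalence))
    return sensitivity * prevalence / p_positive

# The same test looks very different depending on how rare the disease is
for prev in (0.50, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.9, 0.8, prev):.1%}")
```

At 50% prevalence a positive result is highly informative, but at 1% prevalence the vast majority of positives are false alarms, which is why confirmatory testing is essential in screening rare diseases.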

9. Conclusion

Statistical methods are indispensable tools in clinical research and medical device development. They enable precise data analysis, the quantification of uncertainties, and informed decision-making. From calculating confidence intervals and controlling type I and type II errors to interpreting box plots and forest plots, statistics offers a wide range of techniques to improve the quality and validity of clinical studies. The targeted application of these methods allows us not only to demonstrate the efficacy and safety of medical devices but also to meet regulatory requirements and ultimately optimize patient care. In a world where data plays an increasingly important role, statistics remains an essential component of evidence-based medicine.

10. How we can help you

We would be happy to support you in the development, implementation, and use of a database-driven system. As your CRO, we will also handle the complete data management and monitoring via the EDC system.

We support you throughout your entire project with your medical device, starting with a free initial consultation, help with the introduction of a QM system, study planning and execution, right up to technical documentation - always with primary reference to the clinical data on the product: from the beginning to the end.

Do you already have some initial questions?

You can get a free initial consultation here: free initial consultation

medXteam GmbH

Hetzelgalerie 2 67433 Neustadt / Weinstraße
+49 (06321) 91 64 0 00
kontakt (at) medxteam.de