Question
In the running of a clinical trial, much laboratory data has been collected and hand entered into a database. There are 50 different lab tests and approximately 1000 values for each test, so there are about 50,000 data points in the database. To ensure accuracy of these data, a sample must be taken and compared against source documents (i.e. printouts of the data) provided by the laboratories that performed the analyses.
The study manager for the trial can allocate resources to check up to 15% of the data and he wants the QC efforts to be focused on checking outlier values so that clinically improbable or impossible values may be identified and reviewed. He suggests that the sample consist of the 75 highest and 75 lowest values for each lab test since that represents about 15% of the data. However, he would be delighted if there was a way to select less than 15% of the data and thus free up resources for other study tasks.
The study statistician is consulted. He suggests calculating the mean and standard deviation for each lab test and including in the sample only the values that are more than 3 standard deviations from the mean.
Given that the study manager wants the QC efforts to be focused on selecting outlier values, whose method is a better way of selecting the sample? Why? Using what you have learned about measures of central tendency and dispersion, how would your answer change if you knew the data were not normally distributed? Explain your reasoning and answers.
Explanation / Answer
The manager’s method of selecting the 75 highest and 75 lowest values for each lab test (about 15% of the data in total) is better than the one suggested by the statistician, because the mean and standard deviation are computed from all values, including the outliers. Outliers pull the mean towards themselves and inflate the standard deviation, so they bias the very criterion used to judge whether a value is an outlier.
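To illustrate how outliers can distort the 3-standard-deviation rule, here is a minimal Python sketch. The lab values are made up for illustration and do not come from the trial, and the function name three_sigma_outliers is hypothetical.

```python
from statistics import mean, stdev

def three_sigma_outliers(values):
    """Return the values lying more than 3 sample SDs from the mean."""
    m = mean(values)
    s = stdev(values)  # sample SD, computed from ALL values, outliers included
    return [v for v in values if abs(v - m) > 3 * s]

# Hypothetical results for one lab test (assumed data): 20 plausible values.
clean = [3.8, 3.9, 4.0, 4.1, 4.2] * 4

# A single entry error (e.g. a dropped decimal point) is caught by the rule.
one_error = clean + [40.0]
print(three_sigma_outliers(one_error))

# Three similar errors inflate the SD enough to hide one another.
three_errors = clean + [40.0, 41.0, 42.0]
print(three_sigma_outliers(three_errors))
```

With these made-up numbers the first call flags 40.0, while the second flags nothing: the three erroneous values raise the mean and standard deviation so much that none of them lies more than 3 standard deviations from the mean.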
I would suggest a third alternative, i.e. to calculate the interquartile range (IQR) for each lab test and derive a lower and an upper bound from it. Any value that is higher than the upper bound or lower than the lower bound is treated as an outlier.
IQR = Third Quartile (Q3) – First Quartile (Q1)
Upper Bound = Q3 + 1.5 * IQR
Lower Bound = Q1 – 1.5 * IQR
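This approach remains the best choice even when the data are not normally distributed, because the quartiles are barely affected by extreme values and the rule makes no assumption about the shape of the distribution.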
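The IQR rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are made up, the function names iqr_fences and iqr_outliers are hypothetical, and the quartiles come from statistics.quantiles with its default "exclusive" method (Python 3.8+). Other quartile conventions can give slightly different bounds.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower bound, upper bound) placed k * IQR beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds, in input order."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical results for one lab test (assumed data): 20 plausible values
# plus three entry errors.
labs = [3.8, 3.9, 4.0, 4.1, 4.2] * 4 + [40.0, 41.0, 42.0]

print(iqr_fences(labs))    # bounds close to 3.45 and 4.65 for this data
print(iqr_outliers(labs))  # the three erroneous values
```

Because the quartiles depend only on the middle of the sorted data, the three erroneous values do not move the bounds, and all three are flagged. In the trial setting the same function would be applied to each of the 50 lab tests separately.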