An error grid, when augmented with a tally of nonreported results, accounts for 100% of data and separates errors into multiple severity-of-harm zones.
By: Jan S. Krouwer
To legally sell in vitro diagnostic assays, manufacturers must submit performance data to FDA demonstrating that the assay meets specifications. However, FDA approval does not guarantee that the assay is free of serious problems that could cause severe patient harm and subsequent financial loss to the manufacturer. A typical performance specification sets limits for average bias and imprecision. The logic of this specification is that, according to a popular model, average bias plus two times imprecision gives limits that contain 95% of the results and that are thought to prevent patient harm.1
There are two problems with this specification. First, the implied single set of limits for average bias plus two times imprecision suggests that results within the limits cause no harm and results outside the limits cause harm. But just as the magnitude of error can increase, so can the severity of harm. Single limits are usually placed between no harm and minor harm so that 95% of the results cause no harm. If the limits were widened to guard only against serious harm, then minor or moderate harm could occur frequently, which is undesirable. The second problem is that only 95% of the results are specified. This leaves up to 5% of the results not only unspecified but possibly carrying large errors and causing significant harm. Because the 95% limits are often constructed using two times the CV, an assay that just met the specification would have errors greater than two times the CV in the remaining 5% of results. Thus, for a clinical laboratory that reports a million results per year, up to 50,000 results could cause severe harm.
Now, it is clear that this does not occur; the percentage of results that cause severe harm is much lower. But if it did happen, the assay would still be acceptable according to the specification. Examples of specifications that cover less than 100% of the results, or that are based on average bias plus a multiple of the standard deviation, are those for LDL cholesterol and other lipids, creatinine, and CLIA assays.2-4
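The arithmetic behind this kind of specification can be sketched in a few lines of Python. This is only an illustration: it assumes that assay errors are normally distributed, and the bias and SD values are hypothetical, not taken from any real assay.

```python
from statistics import NormalDist

def spec_limit(avg_bias, sd, k=2.0):
    """Single-limit specification: |average bias| + k * imprecision (SD)."""
    return abs(avg_bias) + k * sd

def fraction_outside(avg_bias, sd, limit):
    """Fraction of errors beyond +/- limit, ASSUMING normally distributed errors."""
    errors = NormalDist(mu=avg_bias, sigma=sd)
    return errors.cdf(-limit) + (1.0 - errors.cdf(limit))

# Hypothetical assay: no average bias, SD of 2.5 concentration units.
limit = spec_limit(0.0, 2.5)
outside = fraction_outside(0.0, 2.5, limit)

# The specification is silent about up to 5% of results.
unspecified_per_million = round(1_000_000 * 0.05)

print(f"limit = +/-{limit}, fraction outside = {outside:.4f}")
print(f"results left unspecified per million: {unspecified_per_million}")
```

Nothing in the specification constrains how large the errors outside the limits may be, which is the gap the rest of the article addresses.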
Error Grids: Requirements and Limits
A solution to these two problems is to use an error grid, which, when augmented with a tally of non-reported results, accounts for 100% of the data and separates errors into multiple zones associated with different severities of harm.5 Error grids are largely unused except for glucose, although FDA now requires an error grid for CLIA-waived assay submissions.6 A generic error grid, shown in Figure 1, contains at least three zones. Zone A, the innermost zone, is the region whose limits demarcate no harm from minor harm. Zone C, the outermost zone (known as Limit of Erroneous Results [LER] by regulators), is the region in which errors can cause major harm. Although the original glucose error grids did not specify percentages of results within zones, error grids used to support CLIA-waived assays specify that a large percentage of results (often 95%) be in zone A and that no results be in zone C. This leaves the remainder of results (5%) in zone B, the zone between zones A and C. An important attribute of zone B is that it prevents a small increase in error from moving a result directly from no harm (zone A) to serious harm (zone C).
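A minimal sketch of how results might be tallied against a three-zone grid follows. The zone limits here are hypothetical percent-error cutoffs chosen only for illustration; a real error grid is a two-dimensional plot whose limits usually vary with concentration. The point of the sketch is that non-reported results are counted too, so the shares sum to 100% of the data.

```python
from collections import Counter

# Hypothetical zone limits, as absolute percent error versus the reference method.
ZONE_A_LIMIT = 15.0   # assumed: at or below this, no harm
ZONE_C_LIMIT = 40.0   # assumed: at or above this, major harm

def classify(reference, measured):
    """Return 'A', 'B', 'C', or 'not reported' for one paired result."""
    if measured is None:   # the assay gave no reportable result
        return "not reported"
    pct_error = abs(measured - reference) / reference * 100.0
    if pct_error <= ZONE_A_LIMIT:
        return "A"
    if pct_error < ZONE_C_LIMIT:
        return "B"
    return "C"

def tally(pairs):
    """Share of all attempted results in each zone, including non-reported ones."""
    counts = Counter(classify(ref, meas) for ref, meas in pairs)
    total = len(pairs)
    return {zone: counts.get(zone, 0) / total
            for zone in ("A", "B", "C", "not reported")}

# Made-up paired results: (reference value, candidate assay value or None).
pairs = [(100, 104), (100, 88), (100, 125), (100, 150), (100, None)]
shares = tally(pairs)
print(shares)
```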
Determining the limits for error grid zones can be challenging; requesting clinician opinion is common. One should be wary of the term “clinically acceptable” and similar terms, which may be used especially for the zone-A limits. Zone-A limits represent best judgment and are likely to contain the region where most (but not all) patients are unharmed by assay error. There is always a trade-off in setting zone-A limits between having limits that can’t be met and harming too many patients. Thus, it does not seem right to use the term clinically acceptable when some patients are harmed. Rather than defining “clinically acceptable” errors, zone-A limits represent a socioeconomic statement about the practice of medicine; for example, the performance of an assay could usually be improved by running a second assay that uses a different methodological principle, but doing so would be very costly.
Given that one has established limits for an error grid, it remains to provide data to show that the error-grid requirements have been met. Consider the simple error grid shown in Figure 1. The innermost zone is often evaluated by performing a method comparison. For example, with 125 specimens, if 124 are within zone A, the 95% confidence limits for this result of 99.2% within zone A are 95.6% to 99.9%. It is important to obtain these results in as unbiased a manner as possible.7 This can be achieved by estimating total error directly and not excluding use error in the protocol.8 Although the same experiment could be used to evaluate zone C, the conclusions will not be very compelling. For example, if no results out of 125 specimens are in zone C, then the upper 95% confidence limit for this 0% rate is 2.9%. This means that for a million reported results, as many as 29,000 results could be in zone C, which is not very comforting. In fact, to have 95% confidence that fewer than 1 result in a million will land in zone C requires 3.71 million specimens, none of which falls in zone C. Unfortunately, method comparisons are sometimes used as evidence that zone-C requirements have been met, a strategy that is perhaps reinforced by FDA’s CLIA waiver requirement of performing a method comparison to establish zone-C performance. Providing evidence that zone-C results are acceptable requires evaluating low-probability events. Risk management is an effective way to do this. There are general risk-management guidelines on failure mode and effects analysis (FMEA) and fault trees, and a 2008 IVD Technology article listed ten tips on conducting risk management.9,10
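The confidence limits quoted above can be approximated with an exact binomial (Clopper-Pearson) calculation. The sketch below assumes that a two-sided 95% interval was used, which the article does not state; under that assumption, the zone-A lower limit and the 2.9% zone-C upper limit are reproduced closely, and the required number of specimens comes out near 3.7 million. The function names are illustrative only.

```python
from math import ceil, comb, log, log1p

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def _solve(f, target, increasing, iters=60):
    """Bisection on [0, 1] for f(p) == target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_interval(k, n, conf=0.95):
    """Two-sided Clopper-Pearson interval for k successes out of n."""
    tail = (1.0 - conf) / 2.0
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), tail, increasing=True)
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), tail, increasing=False)
    return lower, upper

def specimens_needed(max_rate, conf=0.95):
    """Smallest n for which 0 events in n puts the upper limit below max_rate."""
    tail = (1.0 - conf) / 2.0
    return ceil(log(tail) / log1p(-max_rate))

low_a, high_a = exact_interval(124, 125)   # zone A: 124 of 125 inside
_, high_c = exact_interval(0, 125)         # zone C: 0 of 125 inside
n_needed = specimens_needed(1e-6)          # fewer than 1 per million in zone C

print(f"zone A: {low_a:.3f} to {high_a:.4f}")
print(f"zone C upper limit: {high_c:.3f}")
print(f"specimens needed: {n_needed:,}")
```

A one-sided limit would give a smaller required sample size (roughly 3 million), so the exact figure depends on the convention chosen; the conclusion that a method comparison cannot establish zone-C performance is unaffected.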
Residual and Acceptable Risk
Perhaps the biggest challenge is to know, after all mitigations have been made, whether the residual risk is acceptable. The ISO standard on risk management (14971) suggests performing a risk-benefit analysis to determine the acceptable level for cases when the manufacturer’s level of risk has not been met; however, this is not useful for most diagnostic assays.11 For almost any assay, the residual risk of serious harm to a small percentage of patients due to assay error is lower than the risk of withholding the assay from clinicians and thereby denying all patients the clinical information that the assay would have provided. For example, a small percentage of diabetic patients have been seriously harmed by glucose meters, but this harm is less than the harm that would occur if all glucose meters were withdrawn from the market.12 This argument applies across manufacturers. An individual manufacturer’s assay must have performance comparable to that of other assays on the market. Hence, from a regulatory standpoint, the level of acceptable risk is tied to current performance, which again is a socioeconomic statement about the practice of medicine. However, when serious harm does occur, the current performance is usually no longer deemed acceptable until the error event that caused the harm has been corrected. So it is up to the manufacturer to determine its own level of acceptable risk.
A commonly used approach to estimate risk in FMEA and fault trees is to classify the severity and probability of occurrence of each of the many potential failure events, rank the combination of severity times probability of all events in a Pareto chart, and ensure that mitigations are applied to the events with the highest risk until the risk is “acceptable.” The problem with this approach is that the probabilities are qualitative, not quantitative.
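The ranking step described above can be sketched as follows. The failure events, the 1-to-5 ordinal scores, and the acceptability threshold are all hypothetical; the scores are rank categories rather than measured probabilities, which is exactly the limitation noted in the text.

```python
# Hypothetical failure events with ordinal scores (1 = lowest, 5 = highest).
failure_events = [
    {"event": "interfering substance not detected", "severity": 5, "probability": 2},
    {"event": "noisy response not suppressed", "severity": 4, "probability": 3},
    {"event": "operator mislabels specimen", "severity": 5, "probability": 1},
    {"event": "display rounding", "severity": 1, "probability": 4},
]

def rank_by_risk(events):
    """Score each event as severity x probability and sort highest first (Pareto order)."""
    scored = [dict(e, risk=e["severity"] * e["probability"]) for e in events]
    return sorted(scored, key=lambda e: e["risk"], reverse=True)

def needs_mitigation(events, acceptable_risk):
    """Events whose score exceeds an (assumed) acceptable level, highest risk first."""
    return [e["event"] for e in rank_by_risk(events) if e["risk"] > acceptable_risk]

ranked = rank_by_risk(failure_events)
to_mitigate = needs_mitigation(failure_events, acceptable_risk=8)

for e in ranked:
    print(f'{e["risk"]:>3}  {e["event"]}')
```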
In principle, one could quantify these probabilities and assess the probability of serious harm with quantified fault trees. For example, consider an algorithm to suppress reporting results from a noisy response. Although simple in principle, these algorithms can be difficult to create because there can be overlap between the response signals that give results with little or no error and those that yield larger errors. Thus, one trades off suppressing too many results against failing to suppress results that have large errors. Studies to tune such an algorithm are run over many samples, both with and without the conditions that cause noisy responses. When these studies are complete, one may be able to estimate a probability of large error due to failure of the algorithm.
Although one does not want to allow any large errors, it is important to realize that the probability of a large error is never zero, which implies that there must be an acceptable probability (i.e., risk) of a large error. If one constructs a fault tree with known probabilities of all events that can cause large errors, and one sums these probabilities across all events, then one can estimate the frequency of occurrence of large errors (and potential patient harm) for the assay. This fault tree quantification is complex and costly enough that it is not performed for most diagnostic assays. Moreover, one must not confine potential error events to assay design. The manufacturing process is also a candidate for errors, including use errors, that can lead to large assay errors. There are many examples of reagent lot recalls or even, in one case, a software build that was incorrect.13,14
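Summing event probabilities at the top of a fault tree can be illustrated with a short sketch. The per-result probabilities below are hypothetical, and the events are assumed to be independent. For rare events, the simple sum is a close and slightly conservative approximation to the exact probability that at least one event occurs.

```python
from math import prod

def rare_event_sum(probs):
    """Sum of event probabilities (approximation valid when all are small)."""
    return sum(probs)

def or_gate(probs):
    """Exact probability that at least one event occurs, ASSUMING independence."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical probabilities, per reported result, of events causing a large error.
event_probs = {
    "assay design": 2e-6,
    "manufacturing": 5e-7,
    "use error": 1e-6,
}

p_sum = rare_event_sum(event_probs.values())
p_exact = or_gate(event_probs.values())
large_errors_per_million = p_exact * 1_000_000

print(f"sum = {p_sum:.3e}, exact OR = {p_exact:.3e}")
print(f"expected large errors per million results: {large_errors_per_million:.2f}")
```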
Another tool to lower the risk of serious errors, especially in released products, is a failure reporting and corrective action system (FRACAS).9 With FRACAS, manufacturers can collect a mixture of customer complaint data and electronically captured data. If a manufacturer has 1000 fielded instruments that each run 50 samples a day for an assay, this yields more than 18 million results a year. If these instruments are connected to the manufacturer and can transmit error conditions, then the manufacturer can achieve the sample size needed to prove reliability with respect to these error conditions. Of course, this is not the same as a method comparison and will typically not reduce the risk of errors such as interferences, but FRACAS still can lower the overall risk. The challenge is to automate the system, since there are so many results. If large errors do occur, an efficient FRACAS can expose these errors before they become widespread, which will lower the frequency of patient harm. Practically speaking, the amount of effort manufacturers devote to lowering risk is limited by financial constraints. Thus, the residual risk is acceptable if one has lowered risk as much as possible within financial constraints. It may also be helpful to assess financial risk by using decision-analysis-based financial modeling.15
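The sample-size argument for fielded instruments can be checked with a few lines of arithmetic. The sketch assumes that every instrument runs every day of the year, that every occurrence of the error condition is detected and transmitted, and that a two-sided 95% limit is used; none of these assumptions comes from the article.

```python
from math import expm1, log

def yearly_results(instruments, samples_per_day, days=365):
    """Results per year from a fielded population (assumes daily operation)."""
    return instruments * samples_per_day * days

def upper_limit_zero_events(n, conf=0.95):
    """Upper two-sided confidence limit on an event rate when 0 events occur in n results."""
    tail = (1.0 - conf) / 2.0
    return -expm1(log(tail) / n)   # equals 1 - tail**(1/n), computed stably

n_field = yearly_results(1000, 50)
rate_limit = upper_limit_zero_events(n_field)

print(f"results per year: {n_field:,}")
print(f"upper limit if no error conditions are seen: {rate_limit * 1e6:.2f} per million")
```

This applies only to error conditions the instrument can recognize and report; errors such as interferences would not appear in this tally.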
Conclusion
Use of an error grid addresses the problems in many current specifications. Method comparison protocols can evaluate specifications with respect to minor harm but not major harm. Risk management is appropriate to assess risk of major harm. FMEA and fault trees can help evaluate whether residual risk is at an acceptable level. Ultimately, residual risk is at an acceptable level when it has been made as low as possible within financial constraints.
References
1. JC Boyd and DE Bruns, “Quality Specifications for Glucose Meters: Assessment by Simulation Modeling of Errors in Insulin Dose.” Clinical Chemistry 2001; 47: 209-214.
2. PS Bachorik and JW Ross, “National Cholesterol Education Program recommendations for measurement of low-density lipoprotein cholesterol: executive summary.” The National Cholesterol Education Program Working Group on Lipoprotein Measurement. Clinical Chemistry 1995; 41: 1414-1420.
3. GL Myers et al, “National Kidney Disease Education Program Laboratory Working Group Recommendations for Improving Serum Creatinine Measurement: A Report from the Laboratory Working Group of the National Kidney Disease Education Program.” Clinical Chemistry 2006; 52: 5-18.
4. “Subpart I—Proficiency Testing Programs for Tests of Moderate Complexity (Including the Subcategory), High Complexity, or Any Combination of These Tests.” See http://www.cdc.gov/clia/pdf/42cfr49300.pdf, pp. 933-937.
5. “How to construct and interpret an error grid for diagnostic assays: EP27 proposed guideline.” CLSI/NCCLS document EP27-P. (Wayne, PA: NCCLS), 2009.
6. Recommendations: Clinical Laboratory Improvement Amendments of 1988 (CLIA) Waiver Applications for Manufacturers of In Vitro Diagnostic Devices, January 30, 2008. Accessed at http://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm079632.htm.
7. JS Krouwer and GS Cembrowski, “A review of standards and statistics used to describe blood glucose monitor performance.” Journal of Diabetes Science and Technology 2010; 4: 75-83.
8. Estimation of total analytical error for clinical laboratory methods: approved guideline. CLSI/NCCLS document EP21-A. (Wayne, PA: NCCLS), 2003.
9. Risk management techniques to identify and control laboratory error sources. Approved guideline. 2nd ed. CLSI/NCCLS document EP18-A2. (Wayne, PA: NCCLS), 2009.
10. JS Krouwer, “Ten tips to improve risk assessment.” IVD Technology 2008; 14: 16-23.
11. Medical devices – application of risk management to medical devices. ISO 14971. Geneva, Switzerland: International Organization for Standardization, 2007.
12. FDA Public Health Notification: Potentially Fatal Errors with GDH-PQQ Glucose Monitoring Technology. Accessed at http://www.fda.gov/medicaldevices/safety/alertsandnotices/publichealthnotifications/ucm176992.htm.
13. Search through http://www.accessdata.fda.gov/scripts/cdrh/CFdocs/cfRES/res.cfm.
15. RA Howard and J Matheson, eds. The principles and applications of decision analysis. (Palo Alto, CA: Strategic Decisions Group), 1983.
Jan S. Krouwer, PhD, has more than 30 years’ experience working for IVD manufacturers. He was the executive director of Evaluations and Reliability at Ciba Corning Diagnostics and is now President of Krouwer Consulting (Sherborn, MA). He can be reached at jan.krouwer@comcast.net.