Context.-Most proficiency testing materials (PTM) contain an artificial matrix that may cause immunoassays to perform differently with this material than with clinical samples. We hypothesized that matrix effects would be reduced by using fresh frozen serum (FFS).
Objective.-To compare the performance of an FFS pool to standard PTM for measurement of α-fetoprotein, carcinoembryonic antigen, human chorionic gonadotropin (hCG), and prostate-specific antigen (PSA).
Design.-One FFS specimen and 4 different admixtures of PTM were distributed in the 2003 College of American Pathologists K/KN-A (for α-fetoprotein, carcinoembryonic antigen, hCG, and total and free PSA) and C-C (hCG only) Surveys.
Participants.-The number of laboratories that participated in the surveys varied from a low of 288 (free PSA, K/KN-A Survey) to a high of 2659 (hCG, C-C Survey).
Main Outcome Measures.-Method imprecision and method bias were compared between the FFS specimen and the standard PTM specimen with the closest value. Method imprecision was determined by calculating the coefficients of variation for each method and for all methods combined. Bias was defined as the proportional difference between peer-group mean and the median of all method means.
Results.-The FFS specimen gave significantly higher imprecision than PTM for the analytes α-fetoprotein, carcinoembryonic antigen, total PSA, and free PSA. For hCG, no substantial imprecision differences were observed in either survey. Bias was significantly greater for the α-fetoprotein, carcinoembryonic antigen, and total PSA assays and significantly lower for the hCG and free PSA assays when comparing the FFS with the PTM.
Conclusions.-Fresh frozen serum did not provide consistently lower imprecision or bias than standard PTM in a survey of commonly ordered tumor markers.
(Arch Pathol Lab Med. 2005;129:331-337)
A key component of quality assurance in the clinical laboratory is participation in external proficiency testing programs. These programs allow individual laboratories to assess the relative accuracy of their methods in comparison with a peer group of other laboratories that subscribe to the survey. Such programs can also provide data on the comparability of different methods for the same analyte. This information is useful to laboratory and medical personnel, who must interpret test results obtained by different methods, as well as to the manufacturers who produce these assays.
The materials that are used by proficiency testing programs should, as much as possible, behave like genuine clinical specimens. Most proficiency testing materials (PTM) consist of treated human serum that is spiked with one or more analytes of interest to produce the desired concentrations. The artificial matrix may cause assays to perform differently with this material than with clinical samples. This difference is referred to as "matrix bias," the presence and magnitude of which is typically unknown for any given analyte/method combination. Consequently, the performance of a given method in an external survey may not reflect the results obtained when measuring patient specimens.
Immunoassays of serum proteins, such as hormones and tumor markers, introduce another form of bias. Many proteins that are secreted or shed into the circulation exist in more than one molecular form. Because of the nature of these analytes, there is no reference method that gives a universally accepted correct value for each analyte's concentration in serum. The amount measured by a given immunoassay depends on how the assay is standardized and the antigenic specificity of the antisera used. Performance of immunoassays in external proficiency surveys is thus subject to bias from the molecular nature of the spiked analyte (spike bias) as well as matrix bias.
One potential approach to reducing matrix and spike biases is to use pooled human serum as a PTM. To test this hypothesis, the College of American Pathologists (CAP) included a fresh frozen serum (FFS) sample in the first mailing of the 2003 K and C-C surveys. The questions to be answered were as follows:
* Is there a methodology bias between FFS and standard PTM?
* Is there a difference in imprecision between FFS and PTM?
* Do PTM specimens act like FFS specimens?
In this manuscript, we analyze the results of these surveys for 5 tumor markers: α-fetoprotein (AFP), carcinoembryonic antigen (CEA), human chorionic gonadotropin (hCG), and total and free prostate-specific antigen (PSA).
MATERIALS AND METHODS
Samples
The 2003 CAP K/KN-A and C-C surveys included a commutable FFS specimen and 4 different admixtures of PTM. Fresh frozen serum was prepared by Aalto Scientific (Carlsbad, Calif) using a modification of the NCCLS C37-A Guideline 13. Briefly, donor blood was collected into plastic bags that were immersed in ice water, then centrifuged at 1500g for 8 minutes at 4°C to obtain platelet-rich plasma. The plasma was transferred to sterile plastic centrifuge bottles and allowed to clot for 4 hours at room temperature. Following centrifugation for 18 minutes at 2100g at room temperature, the resulting serum was removed and flash frozen. Serum units were shipped frozen from the donor center to the processing center and stored at -70°C for up to 2 months prior to pooling. A more detailed description of the collection and preparation of FFS appears in an accompanying manuscript by Miller et al.1
The 4 PTM specimens included in the K/KN-A and C-C surveys were prepared to CAP specifications by Bio-Rad Laboratories (Irvine, Calif). These specimens are based on human serum that may have been chemically treated or admixed with nonhuman protein products. Analytes were added to the base serum protein material to prepare master pools containing the desired quantity of each analyte. The specific composition of these specimens is proprietary.
The AFP, CEA, total PSA, and free PSA analytes were included in the K/KN Survey only, whereas hCG was present in both the K/KN and C-C Surveys. Proficiency testing material specimens with similar analyte value to the FFS specimen were selected for the comparisons in this study. For AFP, CEA, and hCG (K-A), specimen K-01 from the 2003 K/KN-A Survey was used. For total PSA and free PSA, specimen K-16 of the same survey was used. For the 2003 C-C Survey comparisons, C-04 was chosen for hCG (C-C). The designated FFS specimens in both surveys were K-02 and C-02.
Study Design
The CAP 2003 K/KN-A and C-C surveys were mailed approximately 6 months apart in a blinded study design. The inclusion of FFS in both surveys was not revealed to survey participants. In each survey, the participants analyzed the FFS sample along with 4 other challenges as if they were real patient samples, in the manner prescribed by the Clinical Laboratory Improvement Amendments of 1988.2 The number of laboratories that participated in the surveys varied from a low of 288 (free PSA, K/KN-A Survey) to a high of 2659 (hCG, C-C Survey), as indicated in Table 1.
Data
The survey participant results were screened for outlying values prior to the statistical analysis. First, the histograms of the data were visually inspected, and errors that occurred because participants incorrectly filled out the report form were removed. The data were then subjected to a 2-pass, 3-SD test for outliers. Laboratory results that were greater than 3 SDs from their peer-group mean on the first or second pass were eliminated. After outlier exclusion, participant results from peer groups with fewer than 10 laboratories were not considered for analysis.
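The 2-pass, 3-SD screen described above can be sketched in a few lines. The function name, defaults, and return convention below are our own illustration; the Survey's actual screening software was not published.

```python
import numpy as np

def two_pass_outlier_filter(results, n_sd=3.0, passes=2, min_group=10):
    """Drop results more than n_sd SDs from the peer-group mean,
    re-testing against the recomputed mean and SD on each pass.
    Groups left with fewer than min_group results are excluded
    from further analysis (returned as None)."""
    vals = np.asarray(results, dtype=float)
    for _ in range(passes):
        mean, sd = vals.mean(), vals.std(ddof=1)
        # Keep only results within n_sd standard deviations of the mean
        vals = vals[np.abs(vals - mean) <= n_sd * sd]
    return vals if len(vals) >= min_group else None
```

Note that recomputing the mean and SD on the second pass lets the test catch a moderate outlier that the first pass masked by inflating the SD.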
Statistical Analysis
The method imprecision and method bias were compared between the FFS specimen (K-02 or C-02) and a comparable PTM specimen with similar target value (either K-01, K-16, or C-04). Method imprecision was determined in 2 ways: by calculating the coefficient of variation (CV) for all methods combined (all-method CV) and by taking the average of the CVs for each method (mean method CV). The method bias was calculated as the percentage difference between a peer-group mean and the median of all peer-group means. The absolute values of individual method biases were averaged to obtain a mean bias for each analyte. In addition, a nonparametric Spearman rank correlation on peer-group mean and CV between FFS and PTM was performed to assess the likely matrix effects. For hCG, the performance of the same pool of FFS in both K/KN-A and C-C surveys was also compared and analyzed. All data analyses were performed using SAS for Windows version 8.2 software (SAS Institute Inc, Cary, NC).
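The statistics defined above reduce to a few lines of arithmetic. The sketch below (our own naming; the published analysis used SAS) computes the all-method CV, the mean method CV, the mean absolute bias against the median of peer-group means, and a tie-free Spearman rank correlation:

```python
import numpy as np

def all_method_cv(values):
    """CV (%) computed across all results, methods combined."""
    return 100.0 * np.std(values, ddof=1) / np.mean(values)

def mean_method_cv(groups):
    """Average of the per-method (peer-group) CVs."""
    return float(np.mean([all_method_cv(g) for g in groups]))

def mean_abs_bias(group_means):
    """Percentage bias of each peer-group mean relative to the median
    of all peer-group means, averaged over absolute values."""
    med = np.median(group_means)
    return float(np.mean([abs(100.0 * (m - med) / med) for m in group_means]))

def spearman_rho(x, y):
    """Spearman rank correlation (assumes no ties): Pearson
    correlation of the rank-transformed data."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```

A high, significant `spearman_rho` between the FFS and PTM peer-group means would indicate that methods rank the two materials similarly, ie, little matrix effect.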
RESULTS
Table 1 shows the data summary for all analytes. The number of laboratories measuring each analyte is given in the first column, followed by the number of methods used to measure that analyte in the second column. For each analyte, the performance of FFS is compared with that of the PTM specimen with the closest value, as determined from the median of means. Calculated values for the mean method CV, all-method CV, and mean bias appear for each analyte, grouped according to specimen (FFS vs PTM).
The FFS specimen resulted in significantly higher imprecision than the comparable PTM specimen for the analytes AFP, CEA, total PSA, and free PSA, based on the mean method CV. For hCG, no substantial imprecision differences were observed in either survey. Method variation, which was assessed by the all-method CV, was significantly greater for the AFP, hCG (C-C), and total PSA assays and significantly lower for the hCG (K-A) and free PSA assays when comparing the FFS with the PTM specimen. Figure 1 provides a detailed comparison of the imprecision obtained with FFS and PTM specimens, organized by analyte and reagent. Individual reagents are identified on the x-axis by an alphanumeric code (Table 2).
Figure 2 plots the method calibration bias for the FFS and PTM specimens by analyte. As illustrated, the magnitude of bias from FFS and PTM specimens was quite inconsistent among the various assays. For AFP, the calibration bias was greater in FFS than in PTM. For free PSA and hCG, the observed bias was lower in FFS than in PTM. There was no clear difference in calibration bias between FFS and PTM for CEA and total PSA when discounting the outlier performance of reagent I for FFS in total PSA.
Some reagent systems seem to perform more consistently than others when measuring the different analytes in both FFS and PTM specimens. The observed bias differences between FFS and PTM for each analyte are graphed in Figure 3 as a scatterplot. For each analytic method located along the x-axis, the percentage bias differences for the assays performed with that method are graphed along the y-axis. These plots allow one to compare the consistency of reagent performance across one or several different assays. Methods A3, A4, D1, E1, E3, G3, and H show fairly low bias differences (approximately 10% or less) between FFS and PTM for the analytes measured. Among this group, D1 and E3 were used for only one assay, whereas A3, A4, E1, G3, and H demonstrated low bias differences between FFS and PTM across 2 or more analytes.
Table 3 shows the results of Spearman rank correlation between FFS and PTM on peer-group mean and peer-group CV. If FFS acts like standard PTM, then the correlation between FFS and PTM should be high and significant. For the peer-group mean, the comparable PTM sample appears to correlate better with other PTM samples than with the FFS sample. The correlations among PTM samples are all significant, whereas the FFS-PTM correlations are low and nonsignificant for most assays. This result suggests the presence of matrix effects between FFS and PTM. Overall, CEA had the least matrix effects.
The same correlation analysis was then performed on the CV, revealing a somewhat different picture. There were no substantial differences in FFS-PTM and PTM-PTM CV correlations. It appears that the correlation on CV between FFS and PTM was much more significant than on the peer-group means. This suggests that the matrix impact on method imprecision is less significant than on the calibration bias, because the method imprecision ranking does not differ substantially between FFS and PTM (ie, the methods with higher CVs in FFS will remain high in PTM, and vice versa).
Figure 4 compares the performance of the replicate FFS specimens (K-02 and C-02, sent out 6 months apart) in measuring hCG. For each reagent (x-axis), the overall mean and CV obtained in the 2 surveys are graphed on the y-axis. If the assay systems are perfectly consistent over time, one would expect to see very similar means and CVs for the FFS specimens measured in each survey. However, there were substantial differences in peer-group mean and CV between the 2 replicate FFS specimens for some reagents. More than half of the methods used in both surveys showed significant differences in mean and CV between K-02 and C-02, as noted in Figure 4. These differences are most likely due to long-term method calibration issues rather than to between-laboratory variations.3 Reagents G1, G2, H, and I appeared to generate the most consistent results between the 2 replicate specimens.
COMMENT
The matrix that is found in proficiency testing materials may differ in a number of ways from native human serum. It may consist of human serum that has been chemically treated to remove lipids, individual fractions of human serum (eg, albumin and globulins), or serum products of animal origin (eg, bovine serum albumin). Though similar to native human serum, these materials are not identical to the patient samples that are analyzed in clinical laboratories. Subtle differences in the matrix may affect the conformation of proteins to be measured, availability of antigenic binding sites, and performance of reagents that are used in immunoassay kits. As a result, assays that have been validated with human serum can still perform differently when challenged with PTM.
This study was designed to assess the performance of FFS as an alternative to standard PTM for 5 commonly assayed tumor markers. In comparing FFS to PTM we focused on 2 measures, imprecision and method bias. Imprecision was significantly lower for PTM than FFS for all analytes except hCG, as determined from the mean method CV. Although this is an average value, a detailed review of Figure 1 shows that the CV for PTM was below that for FFS in 7 of 9 methods for AFP, in 9 of 12 methods for CEA, in 14 of 15 methods for total PSA, and in all 6 methods for free PSA. There is a consistency to this finding across methods, so the mean method CVs are reflective of a trend and are not exaggerated by large differences in performance in only a few methods. The trend is especially obvious in the case of total PSA and free PSA. By contrast, the results for hCG in both the K-A and C-C surveys (shown in the lower panels of Figure 1) are mixed, with neither FFS nor PTM showing consistently lower CVs across methods.
One possible explanation for these findings is the relative concentrations of the analytes in FFS and PTM. Reviewing the data in Table 1, the concentrations of AFP, CEA, total PSA, and free PSA are 1.7- to 6-fold higher in PTM than in FFS. Imprecision of an assay generally improves as the concentration of the measured analyte increases. Therefore, the lower CVs associated with PTM may be attributable not to the nature of the specimen but to the higher concentrations of these 4 analytes.
The method bias provides a measure of how far individual method means vary from the median of those means. Inspection of Figure 2 reveals that for every assay except total PSA, several reagents produced a method bias in excess of 20% with either FFS or PTM. In the case of AFP, FFS gave a significantly greater mean method bias than did PTM. The situation was reversed for free PSA and hCG (K-A), and there was no significant difference in mean method bias between FFS and PTM for the analytes CEA and total PSA. Overall, neither FFS nor PTM produced a consistently lower method bias when compared across all analytes and reagents.
In several of the panels in Figure 2, there are large discrepancies between the method biases obtained with FFS and PTM. Examples include reagent E2 for AFP, reagent G1 for free PSA, and reagent C for hCG (K-A). These discrepancies highlight the effect that a different matrix can have on some reagent systems but not on others. The scatterplot in Figure 3 provides a quantitative picture of the bias differences between FFS and PTM. Depending on which reagent a laboratory uses for a particular analyte, a switch from PTM to FFS (or vice versa) could have a large effect on the values that are produced and reported to external proficiency testing providers.
Because the same FFS specimen was sent out in 2 separate surveys (K-A and C-C), both of which included hCG, we compared the results from each survey in Table 1. The medians of the mean and mean method CVs were similar in both surveys. However, the all-method CV and mean bias were lower in the K-A than in the C-C Survey. There were also significant differences between the means and CVs obtained by about half of the reagents (Figure 4). The mailings for these 2 surveys were about 6 months apart. These results suggest that there is a drift in assay calibration over time with some of the reagent systems used by participating laboratories. Alternatively, sample degradation in the FFS pool during the 6-month interval could also explain the difference in performance between the 2 surveys.
To attempt to answer the question of comparability between PTM and FFS, Spearman correlations were calculated on both peer-group mean and peer-group CV (Table 3). The results for peer-group CV show that matrix effects are not a significant determinant of imprecision. In this respect, PTM specimens act like FFS specimens. However, peer-group mean correlations between FFS and PTM are weaker. With the exception of 2 analytes (CEA and AFP), PTM did not behave like FFS. We therefore conclude that matrix effects are a significant source of bias in at least 3 of the tumor marker assays.
References
1. Miller WG, Myers GL, Ashwood ER, et al. Creatinine measurement: state of the art in accuracy and interlaboratory harmonization. Arch Pathol Lab Med. 2005;129:297-304.
2. Clinical Laboratory Improvement Amendments of 1988, Final Rule, 57 Federal Register 7002-7288 (1992).
3. Steele BW, Wang E, Palomaki G, Klee GG, Elin RJ, Witte DL. Sources of variability: a College of American Pathologists Therapeutic Drug Monitoring Survey study. Arch Pathol Lab Med. 2001;125:183-190.
William E. Schreiber, MD; David B. Endres, PhD; Geraldine A. McDowell, PhD; Glenn E. Palomaki, BS; Ronald J. Elin, MD, PhD; George G. Klee, MD, PhD; Edward Wang, PhD
Accepted for publication November 8, 2004.
From the Department of Pathology, Vancouver General Hospital, Vancouver, British Columbia (Dr Schreiber); the Department of Pathology, University of Southern California, Los Angeles (Dr Endres); the Laboratory Corporation of America, Research Triangle Park, NC (Dr McDowell); the Foundation for Blood Research, Scarborough, Me (Mr Palomaki); the Department of Pathology and Laboratory Medicine, University of Louisville, Louisville, Ky (Dr Elin); the Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minn (Dr Klee); and the College of American Pathologists, Northfield, Ill (Dr Wang).
Dr Klee declares that he has received research grants from Beckman Coulter and Biosite for work unrelated to the preparation of this manuscript. All other authors have no relevant financial interest in the products or companies described in this article.
Reprints: William E. Schreiber, MD, Department of Pathology and Laboratory Medicine, Vancouver General Hospital, 855 W 12th Ave, Vancouver, British Columbia, Canada V5Z 1M9 (e-mail: schr@interchange.ubc.ca).
Copyright College of American Pathologists Mar 2005