Validation of Syndromic Surveillance for Respiratory Pathogen Activity

The studied respiratory syndromes are suitable for syndromic surveillance because they reflect respiratory pathogen activity patterns

Syndromic surveillance is increasingly used to signal unusual illness events. To validate data-source selection, we retrospectively investigated the extent to which 6 respiratory syndromes (based on different medical registries) reflected respiratory pathogen activity. These syndromes showed higher levels in winter, which corresponded with higher laboratory counts of Streptococcus pneumoniae, respiratory syncytial virus, and influenza virus. Multiple linear regression models indicated that most syndrome variations (up to 86%) can be explained by counts of respiratory pathogens. Absenteeism and pharmacy syndromes might reflect nonrespiratory conditions as well. We also observed systematic syndrome elevations in the fall, which were unexplained by pathogen counts but likely reflected rhinovirus activity. Earliest syndrome elevations were observed in absenteeism data, followed by hospital data (+1 week), pharmacy/general practitioner consultations (+2 weeks), and deaths/laboratory submissions (test requests) (+3 weeks). We conclude that these syndromes can be used for respiratory syndromic surveillance, since they reflect patterns in respiratory pathogen activity.
Early warning surveillance for emerging infectious disease has become a priority in public health policy since the anthrax attacks in 2001, the epidemic of severe acute respiratory syndrome in 2003, and the renewed attention on possible influenza pandemics. As a result, new surveillance systems for earlier detection of emerging infectious diseases have been implemented. These systems, often labeled "syndromic surveillance," benefit from the increasing timeliness, scope, and diversity of health-related registries (1-6). Such alternative surveillance uses symptoms or clinical diagnoses such as "shortness of breath" or "pneumonia" as early indicators for infectious disease.
This approach not only allows clinical syndromes to be monitored before laboratory diagnoses, but also allows disease to be detected for which no additional diagnostics were requested or available (including activity of emerging pathogens). Our study assessed the suitability of different types of healthcare data for syndromic surveillance of respiratory disease.
We assumed that syndrome data, to be suitable for early detection of an emerging respiratory disease, should reflect patterns in common respiratory infectious diseases (7-10). Therefore, we investigated the extent to which time-series of respiratory pathogens (counts per week in existing laboratory registries) were reflected in respiratory syndrome time-series as recorded in 6 medical registries in the Netherlands. We also investigated syndrome variations that could not be explained by pathogen counts. As an indication for syndrome timeliness, we investigated the delays between the syndrome and pathogen time-series.

Syndrome Data Collection and Case Definitions
We defined syndrome data as data in health-related registries that reflect infectious disease activity without identifying causative pathogen(s) or focusing on pathogen-specific symptoms (such as routine surveillance data for influenza-like illness [11] or surveillance of acute flaccid paralysis for polio [12]).
Registries for syndrome data were included if they met the following criteria: 1) registration on a daily basis; 2) availability of postal code, age, and sex; 3) availability of retrospective data (>2 years); and 4) (potential) real-time data availability.
Six registries were selected (Table 1) that collected data on work absenteeism, general practice (GP) consultations, prescription medications dispensed by pharmacies, diagnostic test requests (laboratory submissions) (13), hospital diagnoses, and deaths. In all registries, data were available for all or a substantial part of 1999-2004. For the GP, hospital, and mortality registries, definition of a general respiratory syndrome was guided by the case definitions and codes found in the International Classification of Diseases, 9th revision, Clinical Modification (ICD-9-CM), as selected by the Centers for Disease Control and Prevention (Atlanta, GA, USA) (www.bt.cdc.gov/surveillance/syndromedef). For the laboratory submissions and the pharmacy syndrome, we selected all data that experts considered indicative of respiratory infectious disease (for detailed syndrome definitions, see online Technical Appendix, available from www.cdc.gov/EID/content/14/6/917-Techapp.pdf).

Respiratory Pathogen Counts
As a reference for the syndrome data, we included specific pathogen counts for 1999-2004 from the following sources: 1) Weekly Sentinel Surveillance System of the Dutch Working Group on Clinical Virology (which covers 38%-73% of the population of the Netherlands [14]  respiratory disease-related counts of Streptococcus pneumoniae (data in 2003-2004 were interpolated for 2 laboratories during short periods of missing data; total coverage 24%); and 3) national mandatory notifications of pertussis. The networks that supplied the respiratory pathogen counts are distinct from the laboratory submissions network described earlier for syndrome data.

Data Analysis and Descriptive Statistics
Data were aggregated by week and analyzed by using SAS version 9.1 (SAS Institute Inc., Cary, NC, USA). For the GP, pharmacy, and laboratory submissions registries, we expressed the respiratory counts as a percentage of total weekly counts to adjust for the influence of holidays and, for laboratory submissions, changes in the number of included laboratories over time. We explored the relationship between the time-series of respiratory pathogens and syndromes graphically and calculated Pearson correlation coefficients.
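The percentage adjustment and the correlation step can be illustrated with a minimal Python sketch. The original analysis used SAS; the function names and the weekly numbers below are hypothetical and serve only as an illustration.

```python
from math import sqrt

def weekly_percentage(respiratory_counts, total_counts):
    """Express respiratory counts as a percentage of all counts in the same week."""
    return [100.0 * r / t if t else 0.0
            for r, t in zip(respiratory_counts, total_counts)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: the last week is a holiday week with few records overall.
respiratory = [30, 45, 80, 60, 20]
total = [300, 300, 400, 400, 100]
pathogen_counts = [5, 9, 14, 10, 12]

syndrome_pct = weekly_percentage(respiratory, total)
r = pearson(syndrome_pct, pathogen_counts)
print(syndrome_pct, round(r, 2))
```

In this made-up series the raw respiratory count drops in the holiday week while the percentage does not, which is the kind of artifact the adjustment is meant to remove.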

Linear Regression Models
To investigate whether the respiratory syndromes reflect patterns in respiratory pathogen counts, we constructed multiple linear regression models. These models estimated respiratory syndrome levels at a certain time, with the lagged pathogen counts (lag range of -5 to +5 weeks) as explanatory variables. We used linear regression of the untransformed syndrome to estimate the additive contributions of individual pathogens to the total estimated syndrome. We assumed a constant syndrome level attributable to factors other than the respiratory pathogens and constant scaling factors for each of the lagged pathogens. A forward stepwise regression approach was used, each step selecting the lagged pathogen that most improved model fit as measured by Akaike's information criterion (15). Each pathogen entered the model only once and only if it contributed significantly (p<0.05). Negative associations (e.g., between enteroviruses, which peak in summer, vs. respiratory syndromes, which peak in winter) were excluded to avoid noncausal effects.
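The selection procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SAS procedure used in the study: the AIC is taken as n ln(RSS/n) + 2k, the p<0.05 entry test and the S. pneumoniae adjustment are omitted, a positive lag is taken to mean that the syndrome precedes the pathogen counts, and all names and data are hypothetical.

```python
from math import log

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("singular design matrix")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least squares with intercept; returns (coefficients, residual sum of squares)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(rows, y))
    return beta, rss

def aic(rss, n, k):
    """AIC for a Gaussian linear model with k parameters (assumed form)."""
    return n * log(max(rss, 1e-12) / n) + 2 * k

def forward_select(syndrome, pathogens, max_lag=5):
    """Greedy forward selection of one lag per pathogen; returns {name: lag}."""
    weeks = range(max_lag, len(syndrome) - max_lag)
    y = [syndrome[t] for t in weeks]
    chosen, columns = {}, []
    best_aic = aic(ols([], y)[1], len(y), 1)
    while True:
        best = None
        for name, counts in pathogens.items():
            if name in chosen:
                continue  # each pathogen enters the model only once
            for lag in range(-max_lag, max_lag + 1):
                col = [counts[t + lag] for t in weeks]
                try:
                    beta, rss = ols(columns + [col], y)
                except ValueError:
                    continue
                score = aic(rss, len(y), len(columns) + 2)
                # Negative associations are excluded, as in the text.
                if beta[-1] > 0 and score < best_aic and (best is None or score < best[0]):
                    best = (score, name, lag, col)
        if best is None:
            return chosen
        best_aic, name, lag, col = best
        chosen[name] = lag
        columns.append(col)

# Synthetic example: the syndrome leads influenza A by 2 weeks and trails RSV by 1 week.
n = 60
flu = [max(0, 9 - (t - 18) ** 2) for t in range(n)]
rsv = [max(0, 16 - (t - 38) ** 2) for t in range(n)]
syndrome = [10.0] * n
for t in range(5, n - 5):
    noise = 0.05 * ((7 * t) % 5 - 2)
    syndrome[t] = 10 + 2 * flu[t + 2] + rsv[t - 1] + noise
lags = forward_select(syndrome, {"influenza A": flu, "RSV": rsv})
print(lags)
```

Trimming the first and last 5 weeks keeps the same set of weeks in every candidate model, so that the AIC values of models with different lags are comparable.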
To discriminate between primary and secondary infections by S. pneumoniae (as a complication of respiratory virus infection) (16-19), we used the residuals from regressing S. pneumoniae counts on other pathogens as the variable for S. pneumoniae (instead of its counts) for all the earlier described models for respiratory syndromes.
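The residual step can be illustrated with a small Python sketch. As a simplification, it regresses S. pneumoniae counts on a single other pathogen, whereas the study regressed on several; the function name and the counts are hypothetical.

```python
def residuals_after_regression(y, x):
    """Residuals of a simple least-squares regression of y on x (with intercept)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical weekly counts.
influenza = [1, 3, 8, 12, 6, 2]
s_pneumoniae = [10, 12, 18, 20, 17, 11]

# The part of the S. pneumoniae series that the influenza counts do not explain.
adjusted = residuals_after_regression(s_pneumoniae, influenza)
print([round(v, 2) for v in adjusted])
```

By construction the residuals are uncorrelated with the predictor, so the adjusted variable carries only the S. pneumoniae variation that does not track the other pathogen.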
We checked for autocorrelation in the residuals of the models with hierarchical time-series models (using S-PLUS 6.2) (20,21). We calculated R² values to estimate to what extent respiratory pathogen counts explain variations in syndromes. To explore to what extent seasonal variation could be a confounder, we also calculated R² values of the models after adding seasonal variables (sine and cosine terms) and R² values for seasonal terms alone. We also investigated the pathogen-specific effects in the models by calculating the standardized parameter estimates before and after adding seasonal terms.
The models were used to estimate the expected syndrome level with 95% upper confidence limits (UCLs). We considered distinct syndrome elevations that exceeded the UCLs as unexplained by the models (for model details, see online Technical Appendix).
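In equation form, the models described in this section might be written as below. This is a sketch under assumptions: the symbols, the single annual harmonic, and the 52-week period are illustrative choices that are not taken from the original analysis (the Technical Appendix holds the actual model details).

```latex
\[
y_t = \beta_0 + \sum_{p \in P} \beta_p \, x_{p,\,t+\ell_p}
      + \gamma_1 \sin\!\left(\frac{2\pi t}{52}\right)
      + \gamma_2 \cos\!\left(\frac{2\pi t}{52}\right) + \varepsilon_t ,
\qquad
R^2 = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (y_t - \bar{y})^2}
\]
```

Here y_t is the syndrome level in week t, x_{p,t} is the count of pathogen p, ℓ_p is its selected lag (between -5 and +5 weeks), β_0 is the constant level attributable to other factors, and β_p is the constant scaling factor of pathogen p. The pathogen-only model corresponds to γ_1 = γ_2 = 0, and the seasonal-only model corresponds to β_p = 0 for all p.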
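A minimal Python sketch of flagging observations above an upper limit follows. It assumes a normal approximation with one constant residual standard error and a multiplier of 1.96; these choices, the function name, and the data are assumptions for illustration and may differ from the limits used in the study.

```python
from math import sqrt

def flag_exceedances(observed, expected, n_params, z=1.96):
    """Return the week indices where the observation exceeds expected + z * SE.

    SE is the residual standard error of the fitted model (assumed constant).
    """
    n = len(observed)
    residuals = [o - e for o, e in zip(observed, expected)]
    se = sqrt(sum(r * r for r in residuals) / (n - n_params))
    return [t for t, (o, e) in enumerate(zip(observed, expected))
            if o > e + z * se]

# Hypothetical syndrome levels and model expectations for 10 weeks.
observed = [10, 11, 9, 10, 30, 10, 11, 9, 10, 10]
expected = [10.0] * 10
unexplained_weeks = flag_exceedances(observed, expected, n_params=1)
print(unexplained_weeks)
```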

Timeliness
We investigated the timeliness of the registry syndromes in 2 ways: 1) as a measure of differences in timeliness between registries, we evaluated the time delays of the syndromes relative to each other by calculating for each of the syndromes the time lag that maximized the Pearson correlation coefficient with the hospital registry (as a reference); 2) by estimating the time delays between each of the syndromes and the lagged pathogens included in its regression model.
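The first approach can be sketched in Python as a search over lags for the highest Pearson correlation with the reference series. The function names, the sign convention (a positive lag means the syndrome follows the reference), and the toy series are assumptions made for this illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denom = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return sxy / denom if denom else 0.0

def best_lag(series, reference, max_lag=5):
    """Return (lag, r) such that series[t + lag] correlates best with reference[t]."""
    best = None
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(series[t + lag], reference[t])
                 for t in range(len(reference))
                 if 0 <= t + lag < len(series)]
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if best is None or r > best[1]:
            best = (lag, r)
    return best

# Toy series: the same epidemic curve, shifted relative to the hospital data.
hospital = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0]
mortality = [0, 0, 0, 0, 1, 4, 9, 4, 1, 0, 0, 0]    # 2 weeks after hospital
absenteeism = [0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0, 0]  # 1 week before hospital

mortality_lag, mortality_r = best_lag(mortality, hospital, max_lag=3)
absenteeism_lag, absenteeism_r = best_lag(absenteeism, hospital, max_lag=3)
print(mortality_lag, absenteeism_lag)
```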

Data Exploration and Descriptive Statistics
Respiratory syndrome time series were plotted for all registries (Figure 1). The Christmas and New Year holidays coincided with peaks and dips in the pharmacy and absenteeism syndromes (not shown). Because these results were probably artifacts, we smoothed these yearly peaks and dips and censored them in the analyses performed on the absenteeism registry, in which they had a strong influence on outcomes. For all registries, the respiratory syndromes demonstrated higher levels of activity in winter, which overlapped or coincided roughly with the seasonal peaks of influenza A, influenza B, respiratory syncytial virus (RSV), and (albeit less pronounced) S. pneumoniae laboratory counts (Figure 1). Infections with parainfluenza virus, Mycoplasma pneumoniae, adenovirus, and rhinovirus were detected slightly more frequently during winter (data not shown). Bordetella pertussis and enterovirus showed seasonal peaks only in summer (data not shown).
The seasonal peaks in laboratory counts of influenza A, influenza B, and RSV corresponded with peaks in the GP, pharmacy, and hospital syndromes. The other syndromes corresponded less obviously. Each year, around October, the respiratory syndrome showed a peak in the GP
We calculated Pearson correlation coefficients between the different unlagged time series of respiratory pathogens and syndromes (Table 2). Syndrome time series in all registries correlated strongly with S. pneumoniae (unadjusted total counts). The hospital, GP, pharmacy, and laboratory submissions data correlated strongly with RSV and influenza A counts (Table 2). Mortality data correlated strongly with influenza A (r = 0.65) and influenza B (r = 0.50) infections. The highest correlations between pathogen time series were between S. pneumoniae and the other pathogens (up to 0.51 with influenza A, Table 3).

Linear Regression Models
Table 4 presents, for each registry, the time lag (in weeks) that maximized the model fit of regressing syndrome on pathogens. For the GP, hospital, mortality, and pharmacy data, the respiratory pathogens explained the syndrome variation very well (78%-86%). Variations in pathogen counts explained 68% of the variation in the absenteeism syndrome. Although the laboratory submissions syndrome had the lowest explained variance, 61% of the variation in this syndrome was still explained by variations in pathogen counts. Hierarchical time-series models did not show significant autocorrelation in the residuals of the models with pathogen counts as explanatory variables (20,21).
920 Emerging Infectious Diseases • www.cdc.gov/eid • Vol. 14, No. 6, June 2008
When seasonal terms were added to the model, the variations in the mortality syndrome were explained just as well as by the model with only pathogen counts (Table 5; R² remained 78%), whereas the model with only seasonal terms explained much less variance (only 52%, Table 5). For the hospitalizations, laboratory submissions, and GP data, only slightly more syndrome variation was explained by adding seasonal terms. With only seasonal terms, the explained variance for these syndromes was clearly lower than with only pathogens in the models (8%-11% lower, Table 5). However, for the absenteeism and, to a lesser extent, the pharmacy data, the model with both pathogen and seasonal terms clearly explained more syndrome variation (Table 5, absenteeism 68% vs. 80%; pharmacy 80% vs. 87%). Furthermore, for the absenteeism data, the model with only seasonal terms had an even higher R² than the model with only pathogens, whereas for the pharmacy data, the R² with only seasonal terms was only slightly lower (3%, Table 5).
Table 6 shows that for mortality, hospitalizations, laboratory submissions, and GP data, the pathogens with the highest effect clearly were RSV, influenza A, and influenza B, with no or only modest decline in standardized parameter estimates after adding seasonal terms. For the GP and hospital data, some pathogens became insignificant after seasonal terms were added (GP: rhinovirus and adenovirus; hospital: parainfluenza virus). For the pharmacy data, half of all pathogen variables became insignificant after seasonal terms were added, whereas for the absenteeism data, almost all pathogens became insignificant (Table 6).
Several syndrome observations exceeded the 95% UCLs of the models (0-10/registry/year), which indicates that those syndrome observations deviated strongly from model predictions. The recurrent October elevation of the absenteeism, GP, and pharmacy syndromes exceeded the UCLs several times (October 2001: pharmacy and GP; 2002: absenteeism; 2003: GP, absenteeism; not shown), which indicated that the model could not explain these elevations.

Timeliness
In Figure 2, for each registry, the difference in timeliness with the hospital registry is indicated by the lag that maximizes R². The absenteeism syndrome (green line) preceded the hospital syndrome by 1 week, followed by the GP-based and prescription-based syndromes at +1 week and the syndromes based on mortality and laboratory submission data at +2 weeks after the hospital syndrome (projected on x-axis, Figure 2). The differences in timeliness between the syndromes and the pathogen surveillance data were reflected by the regression models relating the syndromes to the (positive or negative) lagged pathogens (Table 4). Influenza A and influenza B had lags of 0-5 weeks, which suggests that the registry syndromes were 0-5 weeks ahead of laboratory counts for these infections. Fluctuations in the time series of respiratory hospitalizations and the laboratory RSV counts seemed to appear in the same week (lag = 0). All other syndromes appeared to be 1-3 weeks later than the RSV counts, except absenteeism, which was 2 weeks earlier. Again, absenteeism seemed to be the earliest syndrome (2-5 weeks earlier than RSV, influenza A, and influenza B), followed by the hospital syndrome (0-2 weeks earlier), the GP-based and prescription-based syndromes (2 weeks earlier until 1 week later), the laboratory submission syndrome (1 week earlier until 2 weeks later), and the mortality syndrome (0-3 weeks later than RSV, influenza A, and influenza B).

Discussion
We explored the potential of 6 Dutch medical registries for respiratory syndromic surveillance. Although several other studies also evaluated routine (medical) data for syndromic surveillance purposes (22-27), most evaluated only 1 syndrome and correlated this only to influenza data. Exceptions are Bourgeois et al. (24), who validated a respiratory syndrome in relation to diagnoses of several respiratory pathogens in a pediatric population, and Cooper et al. (27), who estimated the contribution of specific respiratory pathogens to variations in respiratory syndromes. Both studies concluded that RSV and influenza explain most of the variations in these syndromes, consistent with our findings.
All syndrome data in this study showed higher levels in winter, which corresponded to the seasonal patterns of RSV, S. pneumoniae, and influenza A and B viruses. Linear regression showed that the syndromes can be explained by lagged laboratory counts for respiratory pathogens (up to 86%, highest effect of influenza A, influenza B, and RSV), which indicates their potential usefulness for syndromic surveillance. Timeliness differed, with a potential gain in early warning of up to 5 weeks for syndromic data compared with routine laboratory surveillance data.
A limitation of our study is the short duration of our time series, especially for absenteeism and pharmacy data. Therefore, whether our observed associations between syndromes and pathogen counts can be generalized remains unclear.
We relied on laboratory pathogen counts as a proxy for their prevalence and the illness they cause. Changes in test volume over time would result in misclassification bias (as noncausative pathogens will be detected as well). However, such changes are presumably dwarfed by changes during "truly" epidemic elevations of common respiratory pathogens. Additionally, laboratory diagnostics are mostly performed on hospitalized patients, and thus results inadequately reflect activity of pathogens that predominantly cause mild illness.
Table 4*†
Absenteeism  2  5  4  2  4  5  ---
GP  -1  1  2  -1  1  2  2  --3
Pharmacy  -1  0  2  0  2  5  2  -5  3
Hospitalization  0  2  1  --2  3  ---
Laboratory submissions  -2  0  1  -3  -2  -5
Mortality  3  1  0  ------
*S. pneumoniae, Streptococcus pneumoniae; RSV, respiratory syncytial virus; RV, rhinovirus; PIV, parainfluenza virus; GP, general practice; -, pathogen not included in model.
†The lag time (in weeks) that showed optimal fit between syndrome time-series and lagged pathogen counts included in the linear regression model; e.g., according to the model, the trend in hospitalizations precedes the influenza A laboratory counts by 2 weeks.
By adding seasonal terms, we observed that for the absenteeism and, to a lesser extent, the pharmacy registry, the associations between the respiratory syndromes and the pathogen counts might be biased to some extent. For the GP, hospital, laboratory submission, and mortality data, season is probably not an important confounder for the association between the syndromes and pathogens, because including seasonal terms in the models resulted in the same or only slightly higher explained syndrome variance (measured by R²). Models with seasonal terms alone mostly had lower explained variance than the pathogen models. For the GP and hospital data, some pathogens became insignificant after seasonal terms were added (Table 6) but not those pathogens with the largest effect estimates (RSV, influenza A and B). Therefore, we are confident in concluding that the GP, hospital, laboratory submission, and mortality syndromes do reflect pathogen activity sufficiently for use in syndromic surveillance. The higher R² value of the absenteeism model with seasonal terms alone suggests seasonality of absenteeism caused by several nonrespiratory conditions (28,29).
To some extent, this also applies to the pharmacy syndrome, which includes medications that are not specific for respiratory infections (e.g., antimicrobial drugs). This could be validated in future studies by linking medications to illness. However, for both the absenteeism and pharmacy syndromes, the variation explained by seasonal terms is probably overestimated to some extent because data for only 2 and 3 years were used. Consequently, these time series contained less information on variation between different years than those of the other registries, which benefits the fitting of a model with several sine and cosine terms.
To our knowledge, laboratory submission data (test requests) have not been evaluated before as a data source for syndromic surveillance. The modest explained variance for the laboratory submissions syndrome could possibly reflect the limited use in our country of laboratory testing algorithms, which leads to substantial differences in diagnostic regimes for patients with similar clinical symptoms. In addition, occasional extra alertness by clinicians can make these data unreliable for surveillance. For instance, an unusual peak was observed in the laboratory submissions syndrome in 1999, after the official announcement of an outbreak of Legionnaires' disease (30).
An unexpected increase was also observed in the absenteeism, GP, and pharmacy syndromes, which occurred consistently each year around October (2001-2004). These peaks preceded the syndrome peaks that coincided with peaks in influenza A, influenza B, and RSV counts and may be caused by rhinovirus activity (and asthma exacerbations caused by rhinovirus), which usually rises in the fall (31-33). Rhinovirus might go undetected because GPs rarely ask for diagnostics if they suspect a nonbacterial cause for relatively mild respiratory disease. Although specific asthma diagnoses were excluded from the respiratory syndrome definitions, exacerbations of asthma might affect other respiratory categories in the GP or pharmacy syndrome. This observation illustrates that additional diagnostics are needed for identifying the causes of unexplained respiratory disease elevations. Several novel respiratory pathogens for which diagnostics are not yet widely available have been discovered in recent years, underlining that it is quite possible that "hidden" epidemics occur (34-36). The extra October peak and several other syndrome elevations above the 95% UCLs in our study may well reflect such hidden epidemics. The fact that these occur is supported by studies showing that many individual syndrome cases cannot be linked to known pathogens. For example, Cooper et al. (37), who investigated syndromic signals by using patient self-sampling (at home), could only obtain diagnostic results for 22% of these cases.
Figure 2. ... (Table 1). Measured by the syndrome lag with the maximized R², the timeliness differed between the registries in the following order: absenteeism, hospital, pharmacy/general practice (GP), mortality/laboratory submissions (as projected on the x-axis).
For early warning surveillance, timeliness is crucial. Absenteeism data seem to have the best timeliness, but their lack of medical detail complicates interpretation. Unexpectedly, the hospital data reflect respiratory pathogen activity earlier than the GP data. Although in the Netherlands patients are encouraged to consult their GP before going to the hospital, elderly persons, for whom respiratory infections are more likely to cause severe illness, may often go to a hospital directly. Therefore, hospital data may prove to be an earlier marker for respiratory disease than GP data, but this possibility needs further exploration.
An important concern when using syndromic surveillance is that it may generate nonspecific alerts, which, if they happen regularly, would lead to lack of confidence in a syndrome-based surveillance system. Here, we see a clear advantage of using data from multiple registries in parallel so that signal detection can be made more specific by focusing on signals that occur concurrently in >1 data source. To illustrate this, we defined every exceedance of the UCLs of the regression models as a "signal," i.e., a syndrome elevation unexplained by known pathogen activity and therefore possibly reflecting activity of underdiagnosed or emerging infectious disease. Over 2002-2003 (the period in which all 6 registries were included in the study), only 5 "concurrent" signals occurred versus 34 "single" signals over all registries.
We did not evaluate whether the syndromes indeed detect outbreaks of infectious diseases earlier than clinical or laboratory pathogen surveillance. Such an evaluation is often performed by testing the ability to detect historical natural outbreaks or simulated outbreaks (10,38). However, historical natural outbreaks are rare and simulated outbreaks may be unrealistic. Nevertheless, further research into the outbreak detection performance of these syndromes would be worthwhile.
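Separating concurrent from single signals can be sketched in Python as follows. The sketch assumes that "concurrent" means the same calendar week in more than one registry, which may differ from the counting rule used in the study; the function name and the week numbers are hypothetical.

```python
from collections import Counter

def classify_signals(signals_by_registry):
    """Split signal weeks into those seen in >1 registry and those seen in exactly 1."""
    counts = Counter(week
                     for weeks in signals_by_registry.values()
                     for week in set(weeks))
    concurrent = sorted(w for w, c in counts.items() if c > 1)
    single = sorted(w for w, c in counts.items() if c == 1)
    return concurrent, single

# Hypothetical week indices in which each registry exceeded its UCL.
signals = {
    "GP": [40, 41, 95],
    "pharmacy": [40, 60],
    "absenteeism": [41, 77],
    "hospital": [],
}
concurrent_weeks, single_weeks = classify_signals(signals)
print(concurrent_weeks, single_weeks)
```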
The results of this study suggest that it might be best to combine syndromic data and pathogen counts in a prospective surveillance system. Such surveillance can identify distinct syndrome elevations that cannot be explained by respiratory pathogen activity as indicated by routine laboratory pathogen surveillance.

Conclusion
Overall, the GP, hospital, mortality and, to a lesser extent, laboratory submission syndromes reflect week-to-week fluctuations in the time-series of respiratory pathogens as detected in the laboratory. Registries monitoring trends of these syndromes will therefore most likely reflect illness caused by emerging or underdiagnosed respiratory pathogens as well and therefore are suited for syndromic surveillance. Further research would be required to assess to what extent absenteeism and pharmacy data reflect respiratory illness. Investigating the actual outbreak detection performance of the syndromes in this study would also be worthwhile.
Data from the registries in this study are not yet available in real time, although given modern information technology, such availability is clearly feasible. Our study can help prioritize which type of healthcare data to include in future syndromic real-time surveillance systems.