Novel application of a statistical technique, Random Forests, in a bacterial source tracking study

Bibliographic Details
Published in Water research (Oxford) Vol. 44; no. 14; pp. 4067-4076
Main Authors Smith, Amanda; Sterba-Boatwright, Blair; Mott, Joanna
Format Journal Article
Language English
Published Kidlington: Elsevier Ltd, 01.07.2010
Summary: In this study, data from bacterial source tracking (BST) analysis using antibiotic resistance profiles were examined with two statistical techniques, Random Forests (RF) and discriminant analysis (DA), to determine sources of fecal contamination of a Texas water body. Cow Trap and Cedar Lakes are potential oyster harvesting waters located in Brazoria County, Texas, that have been listed as impaired for bacteria on the 2004 Texas 303(d) list. Unknown-source Escherichia coli were isolated from water samples collected in the study area during two sampling events. Isolates were confirmed as E. coli using carbon source utilization profiles and then analyzed via antibiotic resistance analysis (ARA), following the Kirby–Bauer disk diffusion method. Zone diameters from ARA profiles were analyzed with both DA and RF. Using a two-way classification (human vs. nonhuman), both DA and RF assigned over 90% of the 299 unknown-source isolates to a nonhuman source. The average rates of correct classification (ARCCs) for the library of 1172 isolates using DA and RF were 74.6% and 82.3%, respectively. ARCCs from RF were 7.7 to 12.0% higher than those from DA. Rates of correct classification (RCCs) for individual sources classified with RF were 0.2 to 23.2% higher than those from DA, with a mean difference of 9.0%. Additional evidence that RF outperformed DA came from the comparison of training and test set ARCCs and from examination of specific disputed isolates; RF produced higher ARCCs (8 to 13% higher) than DA in all 1000 trials, except in the two-way classification, where RF outperformed DA in 999 of 1000 trials. This is of practical significance for the analysis of BST data. Overall, based on both DA and RF results, migratory birds were found to be the source of the largest portion of the unknown E. coli isolates. This study is the first known published application of Random Forests in the field of BST.
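Note: The summary compares classifiers by per-source rates of correct classification (RCCs) and their average (ARCC). A minimal Python sketch of how such figures could be computed is given here for orientation only. It assumes the ARCC is the unweighted mean of the per-source RCCs; the record does not state the exact formula, and some BST studies pool all isolates instead. The function names and the ten example labels are hypothetical and are not data from the study.

```python
from collections import defaultdict


def rates_of_correct_classification(true_sources, predicted_sources):
    """Return {source: fraction of that source's isolates classified correctly}."""
    if len(true_sources) != len(predicted_sources):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true_label, predicted_label in zip(true_sources, predicted_sources):
        totals[true_label] += 1
        if true_label == predicted_label:
            correct[true_label] += 1
    return {source: correct[source] / totals[source] for source in totals}


def average_rate_of_correct_classification(rccs):
    """Unweighted mean of per-source RCCs (an assumed definition of ARCC)."""
    return sum(rccs.values()) / len(rccs)


# Hypothetical library of ten known-source isolates (illustrative only).
true_sources = ["human"] * 4 + ["nonhuman"] * 6
predicted_sources = (
    ["human", "human", "human", "nonhuman"]   # one human isolate misclassified
    + ["nonhuman"] * 5 + ["human"]            # one nonhuman isolate misclassified
)

rccs = rates_of_correct_classification(true_sources, predicted_sources)
arcc = average_rate_of_correct_classification(rccs)
print(rccs, arcc)
```

Running the same two functions on the predictions of each classifier (for example, DA and RF fitted to the same library) would give the kind of side-by-side RCC and ARCC comparison the summary reports.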
ISSN: 0043-1354, 1879-2448
DOI: 10.1016/j.watres.2010.05.019