Comparative genomics of 245 enterohemorrhagic Escherichia coli genomes provides a phylogenomic framework to track the origin and emergence of highly pathogenic clones

, PhD
Postdoctoral Fellow
Department of Biology
South Texas Center for Emerging Infectious Diseases
University of Texas at San Antonio
One UTSA Circle, San Antonio, TX 78249 USA

B. Rusconi1, F. Sanjar1, S.S.K. Koenig1, M. Eppinger1

Background: Escherichia coli of the O157:H7 serotype are the dominant Shiga toxin-producing enterohemorrhagic E. coli (EHEC) in North America and cause widespread and potentially lethal outbreaks of foodborne disease. Unlike other E. coli, the O157:H7 lineage features a genetically homogeneous population structure, whose limited marker base significantly complicates phylogenetic marker development. Isolates are routinely differentiated using molecular and phenotypic typing strategies, such as pulsed-field gel electrophoresis or metabolic profiling. The introduction of cost-efficient and rapid next-generation sequencing technologies has allowed the field to transition from assessing genome plasticity in only a few selected loci to unbiased marker discovery using whole-genome sequence typing approaches.

Methods: For this study we sequenced 189 epidemiologically diverse EHEC isolates, collected through collaborations with our partners at PSU, FDA, USDA, Washington University, and the National Veterinary Institute (Sweden), as a prerequisite to capturing the genomic diversity found in the environment and in clinical settings. Assembled genomes were screened for single nucleotide polymorphisms (SNPs) and variants in the genome architecture using our custom-developed mutation discovery pipeline. The resulting SNPs and whole-genome alignments served as the basis for developing a phylogenetic hypothesis, using maximum parsimony and maximum likelihood, to probe the isolates' phylogenetic relatedness.

Results: We recorded a total of 9,045 SNPs and categorized them according to their potential physiological impact. The enriched polymorphic marker base comprises 8,080 genic (3,822 sSNPs, 4,207 nsSNPs) and 965 intergenic SNPs, resulting in more than 500 SNP profiles. Our refined phylogenetic hypothesis for the EHECs revealed how individual outbreak isolates fit into the global phylogeographic patterns when compared to globally and temporally dispersed reference isolates from genomic repositories.
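
The distinction between synonymous (sSNP) and nonsynonymous (nsSNP) genic SNPs can be illustrated with a minimal Python sketch. It is not the authors' mutation discovery pipeline; the names CODON_TABLE and classify_snp are hypothetical, and the sketch assumes the standard genetic code, an in-frame forward-strand coding sequence, and a 0-based SNP position.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed in TCAG codon order
# (third base varies fastest); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds: str, pos: int, alt: str) -> str:
    """Label a single-base substitution in a coding sequence.

    cds -- in-frame coding sequence (assumption: forward strand, A/C/G/T only)
    pos -- 0-based position of the substituted base within cds
    alt -- the alternative base observed at that position
    """
    cds = cds.upper()
    offset = pos % 3                      # position of the SNP inside its codon
    start = pos - offset                  # first base of the affected codon
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "sSNP" if same_residue else "nsSNP"


# Toy gene ATG CTT GAA encodes Met-Leu-Glu.
print(classify_snp("ATGCTTGAA", 5, "C"))  # CTT -> CTC, both Leu: sSNP
print(classify_snp("ATGCTTGAA", 3, "A"))  # CTT -> ATT, Leu -> Ile: nsSNP
```

In this sketch a SNP is synonymous when the reference and alternative codons translate to the same residue. SNPs outside annotated coding regions would be counted as intergenic and are not handled here.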

Discussion: The high-resolution whole-genome sequence typing methodologies deployed here on growing sets of samples provide unprecedented phylogenetic accuracy and resolution, elucidate the evolutionary relationships among EHEC, and allow the degree of genomic plasticity to be reassessed. The identified canonical SNPs can readily be made available for efficient and cost-effective typing assays in support of classical technologies. Finally, the synergistic use of genomic resources enriched with strain-associated epidemiological and phenotypic/clinical metadata will open the avenue for genome-wide association studies that link bacterial EHEC genotype to human disease severity.

1University of Texas at San Antonio, South Texas Center for Emerging Infectious Diseases, San Antonio, TX 78249