TY2482, LB226692 vs Genbank Ecoli

eparejatobes edited this page Jun 9, 2011 · 12 revisions
  • who Konrad Paszkiewicz, University of Exeter Sequencing Service. khp204 at ex.ac.uk
  • what Whole genome phylogenies
  • date 03/06/2011

Objective:

To produce a whole-genome phylogeny of the outbreak strains against the existing (non-draft) NCBI E. coli genomes. The following are the results of my attempt to analyse the two sequenced E. coli outbreak isolates (identified as serotype O104), TY2482 and LB226692. Both were sequenced using Life Technologies Ion Torrent technology.

Datasets:

The TY2482 reads were assembled by Nick Loman using MIRA: http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt

The annotation for TY2482 was obtained from http://www.era7bioinformatics.com/en/E_Coli_EHEC_O104_STRAIN_EU_OUTBREAK_era7bioinformatics.html

LB226692 reads were assembled by Life Tech and University of Muenster: http://www.ncbi.nlm.nih.gov/nuccore/AFOB00000000

Note that these are also now available on GitHub.

Methods:

TY2482 was used as the 'reference strain' here.

  1. The dnadiff command from the MUMmer 3.21 package (http://mummer.sourceforge.net/) was used to generate whole-genome alignments. As part of this process the MUMmer show-snps command is executed and 'calls' (if you can define it that way) SNPs between the two genomes. dnadiff was run for TY2482 against all other (non-draft) E. coli genomes in NCBI Genbank and against the LB226692 assembly of another O104 isolate.
  2. The out.snps files for each TY2482 vs Query alignment were parsed and SNPs from all alignments extracted into a single file.
  3. The GFF annotation performed by BG7 was used to identify putative gene locations and determine whether a SNP would cause a synonymous or non-synonymous change.
  4. Only SNPs causing synonymous changes were used to generate a pseudo-sequence.
  5. The program FastTreeMP was used to generate a tree using a generalised time-reversible model (options: FastTreeMP -nucleotide -gtr).
  6. The resulting tree can be visualised in MEGA.

Steps 2 and 3 used custom scripts partly based on work originally performed by David Studholme.
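
Steps 2 and 4 above could be sketched roughly as follows. This is not the custom script used for the analysis, just a minimal illustration. It assumes each out.snps file is tab-delimited with no header and 12 columns (reference position, reference base, query base, query position, two distance columns, two sequence lengths, two frame columns, reference ID, query ID), which should be checked against your own files. It also assumes that any genome without a call at a site carries the reference base there, and it leaves the synonymous-only filter to the caller. The function names are hypothetical.

```python
def parse_snps(lines):
    """Yield (ref_id, ref_pos, ref_base, query_base) for substitutions only.

    Assumes the 12-column tab-delimited layout described in the text above.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        ref_pos, ref_base, qry_base = fields[0], fields[1], fields[2]
        if ref_base == "." or qry_base == ".":
            continue  # '.' marks an indel, which is not a substitution
        yield fields[10], int(ref_pos), ref_base.upper(), qry_base.upper()


def build_pseudo_sequences(snps_by_genome, reference_name="TY2482"):
    """Concatenate the bases at every variable reference site, per genome.

    snps_by_genome maps a genome name to a list of parse_snps() tuples.
    A genome with no call at a site is ASSUMED to match the reference.
    """
    # Union of all variable sites, remembering the reference base at each.
    sites = {}
    for calls in snps_by_genome.values():
        for ref_id, pos, ref_base, _ in calls:
            sites[(ref_id, pos)] = ref_base
    ordered = sorted(sites)

    sequences = {reference_name: "".join(sites[s] for s in ordered)}
    for genome, calls in snps_by_genome.items():
        alt = {(ref_id, pos): qry for ref_id, pos, _, qry in calls}
        sequences[genome] = "".join(alt.get(s, sites[s]) for s in ordered)
    return sequences


def to_fasta(sequences):
    """Format the pseudo-sequences as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())
```

The FASTA text produced this way is an alignment of equal-length pseudo-sequences, which is the kind of input a tree-building program such as FastTreeMP expects.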

Results:

Full MUMmer alignments and SNP calls are available at http://bio-ruby.ex.ac.uk/ecoli_outbreak

A SNP table listing synonymous/non-synonymous changes and gene IDs is available for download here (though it is 270Mb).

The following is the phylogeny produced by the above analysis. Highlighted in red are the outbreak strains.

EHEC outbreak phylogeny

Comments:

First of all, I should point out that Kat Holt's excellent SNP analysis of TY2482 and LB226692 uses some filtering which the analysis here would definitely benefit from, especially given the homopolymer issues Kat noted: http://bacpathgenomics.wordpress.com/2011/06/04/ehec-genomes/. Even with this relatively dirty dataset, it's reassuring that the results here agree with David Studholme's analysis, which found that the closest relative is Escherichia coli 55989 (NC_011748).

According to MUMmer, the TY2482 strain shares 97.23% of its sequence with LB226692. However, LB226692 only shares 95.56% of its sequence with TY2482. It is possible there are one or two plasmids lurking there, but it could also be an artefact of the different methods used to assemble these two isolates. The isolates have around 1500 SNPs between them: 1281 are within coding regions and 239 are classified as synonymous changes. This seems a rather high number to me if they are merely different clinical isolates of the same strain, but I don't think we have enough background knowledge to really get a handle on how common and/or significant this is. Certainly a good proportion are due to crappy filtering on my part.

The same comparison has been done for each Genbank E. coli genome with respect to TY2482.
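
The synonymous/non-synonymous labelling used in these comparisons can be illustrated with a small sketch. It is not the script used here: it assumes the standard genetic code, a coding sequence given 5' to 3' on the coding strand and starting in frame, and a 0-based SNP offset within that sequence. Genes on the reverse strand would need reverse-complementing first. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
# (third base varies fastest), '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds, offset, alt_base):
    """Return 'synonymous' or 'non-synonymous' for a single-base change.

    cds      -- coding sequence on the coding strand, in frame (assumption)
    offset   -- 0-based position of the SNP within cds
    alt_base -- the substituted base, on the coding strand
    """
    codon_start = offset - offset % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("SNP falls in an incomplete codon")
    within = offset % 3
    alt_codon = ref_codon[:within] + alt_base.upper() + ref_codon[within + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"
```

For example, in the toy sequence ATGCTTGAA, changing the last base of the CTT codon to C gives CTC, which still encodes leucine, so `classify_snp("ATGCTTGAA", 5, "C")` reports a synonymous change. A homopolymer-induced indel would shift the frame and is outside what this sketch handles.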

It may be worthwhile checking out other closely related species in case cross-species hybridisation has occurred. By trade I am a facility manager and a bioinformatician rather than a bona fide pathogenomicist, so I apologise if there are any issues I have neglected to take into account or anything which would be obvious to a proper pathogen researcher. Please let me know if you think there is anything clearly wrong with this approach and I will do my best to correct it.

From a bioinformatics point of view, it's interesting just to see that our current tools and knowledge make it quite difficult to rapidly pin down exactly what happened to form this strain.

I'll try to look at the differences between the two strains in more detail and perhaps see whether the latest BGI assembly significantly changes anything. It would also be really useful if someone could repeat the above analysis using the raw reads (if the reads for LB226692 become available), aligning them to Escherichia coli 55989 with TMAP, Newbler or some other homopolymer-aware alignment suite. I'd be interested to know if the SNPs identified by MUMmer (when suitably filtered) give comparable results to those called by alignment of the reads to a reference.