IMG: Integrated Microbial Genomes

Data Sources

IMG includes publicly available genomic sequence data integrated with JGI sequence data for microbes and selected model eukaryotes. Genome data are provided by secondary data sources such as EBI Genome Reviews and RefSeq, with protein information from UniProt/SwissProt and UniProt/TrEMBL. In addition, relevant functional context data are integrated from: InterPro (protein families and domains), GO (gene ontologies), Pfam and TIGRfam (curated protein families), COG (curated clusters of orthologous groups of proteins), Enzyme (enzyme nomenclature), KO (KEGG Orthology), KEGG (metabolic pathways and reactions), and LIGAND (compounds). Specific data sources include the following:

JGI Sequence Data

All JGI microbial genomes are sequenced and assembled at the Production Genomics Facility, with subsequent finishing at various JGI partner or collaborator institutions. Automated annotation for archaeal and bacterial genomes sequenced at JGI is provided by the Genome Analysis Pipeline at Oak Ridge National Laboratory. This pipeline is described in Genetic Engineering, Vol. 26.[1]

Public Sequence Data

Archaeal, Bacterial, Plasmid, and Viral Sequence Data

NCBI's RefSeq is the primary data source for IMG for publicly available finished and draft microbial genomes, viral genomes (including phages), and isolate plasmid genomes (i.e. plasmids that are not already part of a sequenced microbe). RefSeq contains curated versions of entries in the GenBank nucleotide sequence database representing the complete sequences of chromosomes and plasmids. RefSeq is updated regularly, thus ensuring continuous improvement and standardization of gene annotations. In addition, gene product names used inconsistently among the original submissions are curated, thus improving comparative analysis across genomes.

Eukaryotic Sequence Data

IMG currently integrates genomic data from several lower eukaryotes (fungi, protozoa). Primary genomic sequence data are from NCBI's RefSeq. For higher-level eukaryotes (such as human, mouse, and fly), RefSeq and Entrez/Gene serve as the primary data sources, mainly for their currency, consistency, and uniformity of cross-references with functional resources such as UniProt [2], InterPro [4], KEGG [9], PIR [12-13], UniGene [14], and model organism databases.

UniProt

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins[2]. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. It is a "one-stop shop" that allows easy access to all publicly available information about protein sequence annotation.

The UniProt/SwissProt Protein Knowledgebase is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases[3]. SwissProt is the gold standard for protein annotation, with extensive cross-references, literature citations, and computational analyses provided by expert curators.

UniProt/TrEMBL (Translated EMBL Nucleotide Sequence Data Library) is a computer-annotated protein sequence database complementing the UniProt/SwissProt protein knowledgebase. Recognizing that sequence data were being generated at a pace exceeding SwissProt's ability to keep up, TrEMBL was created to provide automated annotations for those proteins not in SwissProt. UniProt/TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ nucleotide sequence databases and also protein sequences extracted from the literature or submitted to UniProt/Swiss-Prot. The database is enriched with automated classification and annotation. All protein translations submitted to the GenBank/EMBL/DDBJ nucleotide sequence databases are automatically incorporated into the next incremental UniProt Knowledgebase release (fortnightly).

InterPro

InterPro is a database of protein families, domains, and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences[4,5]. InterPro provides an integrated view of the commonly used signature databases.

Pfam

Pfam is a collection of multiple-sequence alignments and hidden Markov models of common protein domains and families[6]. Pfam can be used to view protein domain structures and species distributions of common domain signatures of proteins. IMG computes Pfam hits using NCBI's RPS-BLAST on Position Specific Scoring Matrices (PSSMs) provided by the Conserved Domain Database (CDD), chosen for its computational speed.

TIGRfam

TIGRfams are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs), and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family, where achievable, complements classification by orthology, superfamily, domain, or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects [15].

COG

Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages[7]. Each COG consists of individual proteins or groups of paralogs from at least three lineages and thus corresponds to an ancient conserved domain. COG is therefore an attempt at phylogenetic classification of the proteins encoded in complete genomes. Each COG includes proteins that are inferred to be orthologs (direct evolutionary counterparts). As with Pfam, IMG computes top COG hits using RPS-BLAST on PSSMs provided by CDD.
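As a rough illustration of what selecting a "top hit" per gene can look like, the sketch below picks the highest-scoring family for each gene from BLAST-style tabular output. It is not IMG's actual pipeline: the 12-column tab-separated layout (with e-value and bit score in the last two columns), the e-value cutoff of 1e-2, the names parse_tabular and top_hits, and the sample rows are all assumptions made for the example.

```python
from collections import namedtuple

# One search hit: query gene, matched family, e-value, and bit score.
Hit = namedtuple("Hit", "gene family evalue bitscore")


def parse_tabular(lines):
    """Parse tab-separated hit lines (assumed 12-column BLAST-style layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[10]), float(cols[11])))
    return hits


def top_hits(hits, max_evalue=1e-2):
    """Keep, for each gene, the hit with the highest bit score under the cutoff."""
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # hypothetical significance cutoff
        current = best.get(hit.gene)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.gene] = hit
    return best


# Made-up rows for illustration only.
sample = [
    "gene1\tCOG0001\t45.0\t200\t100\t5\t1\t200\t1\t210\t1e-50\t180.0",
    "gene1\tCOG0002\t30.0\t150\t90\t8\t1\t150\t1\t160\t1e-10\t60.5",
    "gene2\tCOG0003\t25.0\t80\t55\t4\t1\t80\t1\t90\t5.0\t20.1",
]
best = top_hits(parse_tabular(sample))
for gene, hit in sorted(best.items()):
    print(gene, hit.family, hit.bitscore)
```

In this toy input, gene2 receives no assignment because its only hit fails the assumed e-value cutoff.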

GO

The Gene Ontology (GO) provides a hierarchical, controlled vocabulary of terms that can be used to annotate gene products at varying levels of specificity[8]. GO is a dynamic vocabulary defined in three ontologies: molecular function, biological process, and cellular component. A gene product may be a component of one or more parts of a cell or part of the extracellular environment. Coverage can be improved by using a reduced vocabulary of high-level GO terms (GO Slim), in which more specific GO terms have been replaced by a limited number of general-purpose ancestor terms. This is particularly useful when performing comparative analysis between proteomes (for example, to allow a user to compare the proportions of two proteomes that are classified as transcription factors). GO information will be available in future versions of IMG.
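The GO Slim idea described above can be sketched as follows: each specific term is replaced by whichever of its ancestors belong to a small set of general-purpose terms, and proteome-level proportions are then computed on the reduced vocabulary. The term names, the parent links, the slim set, and the function names below are hypothetical and are not taken from any GO release.

```python
def ancestors(term, parents):
    """Return the term plus all of its ancestors in a child -> parents graph."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def to_slim(term, parents, slim_terms):
    """Map a specific term to the slim terms found among its ancestors."""
    return ancestors(term, parents) & slim_terms


def slim_fraction(proteome, parents, slim_terms, slim_term):
    """Fraction of proteins with at least one annotation mapping to slim_term."""
    if not proteome:
        return 0.0
    count = sum(
        1
        for terms in proteome.values()
        if any(slim_term in to_slim(t, parents, slim_terms) for t in terms)
    )
    return count / len(proteome)


# Toy ontology fragment (illustrative names only).
parents = {
    "zinc-finger TF activity": ["transcription factor activity"],
    "transcription factor activity": ["molecular function"],
    "ATPase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}
slim = {"transcription factor activity", "catalytic activity"}

proteome_a = {
    "p1": ["zinc-finger TF activity"],
    "p2": ["ATPase activity"],
    "p3": [],
    "p4": ["transcription factor activity"],
}
tf_fraction = slim_fraction(proteome_a, parents, slim, "transcription factor activity")
print(tf_fraction)
```

Running the same function on a second proteome would give the kind of proportion comparison mentioned in the text.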

Enzyme

Enzyme is a repository of information about the nomenclature of enzymes. It is primarily based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB), and it describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided.

KO

The KEGG Orthology (KO) is an extension of the KEGG ortholog identifiers and is structured as a four-level hierarchy. The top level consists of the following five categories: metabolism, genetic information processing, environmental information processing, cellular processes, and human diseases. The second level divides the five functional categories into finer sub-categories. The third level corresponds directly to the KEGG pathways, and the fourth level consists of the leaf nodes, which are the functional terms [11].
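One simple way to picture the four-level hierarchy is as nested mappings: category, sub-category, pathway, and finally a list of leaf functional terms. The sketch below is only an illustration; the entries are placeholders chosen for the example rather than an excerpt from KEGG, and leaf_paths is a hypothetical helper name.

```python
def leaf_paths(tree, prefix=()):
    """Yield (path, leaf) pairs, where path is (category, sub-category, pathway)."""
    if isinstance(tree, dict):
        for name, subtree in tree.items():
            yield from leaf_paths(subtree, prefix + (name,))
    else:
        for leaf in tree:  # fourth level: list of functional terms
            yield prefix, leaf


# Illustrative fragment only; not copied from a KEGG release.
ko = {
    "Metabolism": {
        "Carbohydrate metabolism": {
            "Glycolysis / Gluconeogenesis": ["hexokinase", "pyruvate kinase"],
        },
    },
    "Genetic information processing": {
        "Translation": {
            "Ribosome": ["ribosomal protein S1"],
        },
    },
}

paths = {leaf: path for path, leaf in leaf_paths(ko)}
for leaf, path in paths.items():
    print(" > ".join(path + (leaf,)))
```

Each leaf is reached through exactly three enclosing levels, which matches the four-level structure described above.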

KEGG

KEGG consists of pathway maps, which are collections of diagrams representing the information pathways of interacting molecules or genes[9]. It contains all known metabolic pathways and a limited, but increasing, number of regulatory pathways and molecular assemblies. KEGG currently represents most of the known metabolic pathways and some of the known regulatory pathways in about 100 diagrams. The metabolic pathways were originally compiled from the book "Metabolic Maps" by the Japanese Biochemical Society, and the Boehringer wall chart, "Biochemical Pathways," but KEGG attempts to cover a wider range of biochemical pathways at a higher level of abstraction. Each diagram in KEGG has been manually drawn and is continuously updated. A KEGG diagram does not represent a consensus of known pathways in different genomes; it is intended as a reference drawing of all chemically feasible pathways. The genome-specific pathways are then automatically generated by matching the enzyme genes in the gene catalog with the enzymes on the reference pathway diagrams.
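The matching step in the last sentence can be sketched as a lookup of each gene's enzyme assignments against the enzymes on every reference pathway. This is a minimal sketch rather than KEGG's implementation: the function name genome_pathways, the toy pathway names, the gene identifiers, and the use of EC numbers as the matching key are assumptions made for the example.

```python
def genome_pathways(reference, gene_catalog):
    """Return, per reference pathway, the enzymes present in the genome and their genes.

    reference:    pathway name -> set of enzyme identifiers on the reference diagram
    gene_catalog: gene identifier -> list of enzyme identifiers assigned to that gene
    """
    result = {}
    for pathway, enzymes in reference.items():
        found = {}
        for gene, assigned in gene_catalog.items():
            for enzyme in assigned:
                if enzyme in enzymes:
                    found.setdefault(enzyme, []).append(gene)
        if found:  # keep only pathways with at least one matched enzyme
            result[pathway] = found
    return result


# Toy inputs for illustration only.
reference = {
    "toy pathway A": {"2.7.1.1", "2.7.1.11", "2.7.1.40"},
    "toy pathway B": {"1.1.1.1"},
}
gene_catalog = {
    "geneA": ["2.7.1.1"],
    "geneB": ["2.7.1.40"],
    "geneC": ["3.5.1.5"],
}

specific = genome_pathways(reference, gene_catalog)
for pathway, enzymes in specific.items():
    print(pathway, enzymes)
```

Enzymes on a reference pathway that have no matching gene (here 2.7.1.11) are simply absent from the genome-specific result.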

KEGG LIGAND

KEGG LIGAND contains knowledge on the universe of chemical substances and reactions that are relevant to life. It is a composite database currently consisting of COMPOUND, DRUG, GLYCAN, REACTION, RPAIR, and ENZYME databases. ENZYME is derived from the Enzyme Nomenclature, but the others are internally developed and maintained.

GOLD

GOLD (Genomes Online Database) provides comprehensive information regarding complete and ongoing genome sequencing projects around the world[10].

Taxonomy

The taxonomic lineage of each genome (domain, phylum, class, order, family, genus, species, and strain information) is stored in IMG and is linked to NCBI’s taxonomy resource. The IMG taxonomy also draws from Bergey's Manual of Systematic Bacteriology.

Protein Data Bank (PDB)

The Protein Data Bank (PDB) is the single worldwide depository of information about the three-dimensional structures of large biological molecules, including proteins and nucleic acids. A variety of information associated with each structure is available through the PDB, including sequence details, atomic coordinates, crystallization conditions, 3-D structure neighbors computed using various methods, derived geometric data, structure factors, 3-D images, and a variety of links to other resources [16].

References

1. Hauser, L., F. Larimer, M. Shah, and E. Uberbacher. 2004. Analysis and Annotation of Microbial Genome Sequences. Genetic Engineering. Vol. 26, 225-238. Kluwer Academic/Plenum Publishers.

2. UniProt Consortium. 2007. The Universal Protein Resource (UniProt). Nucl. Acids Res. 35: D193-D197.

3. Bairoch, A., and R. Apweiler. 1997. The SWISS-PROT protein sequence database: Its relevance to human molecular medical research. J. Mol. Med. 75: 312-316.

4. Mulder N.J., R. Apweiler, et al. 2007. New developments in the InterPro Database. Nucl. Acids Res. 35: D224-228.

5. Apweiler, R., et al. 2000. InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16(12): 1145-1150.

6. Sonnhammer, E.L., S.R. Eddy, and R. Durbin. 1997. Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins 28(3): 405-420.

7. Tatusov, R.L., E.V. Koonin, and D.J. Lipman. 1997. A genomic perspective on protein families. Science 278(5338): 631-637.

8. Gene Ontology Consortium. 2004. The Gene Ontology Database and Informatics Resource. Nucl. Acids Res. 32: D258-D261.

9. Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl. Acids Res. 27: 29-34.

10. Liolios, K., N. Tavernarakis, P. Hugenholtz, and N.C. Kyrpides. 2006. The Genomes On Line Database (GOLD) v.2: A Monitor of Genome Projects Worldwide. Nucl. Acids Res. 34: D332-D334.

11. Kanehisa, M., S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. 2004. The KEGG Resource for Deciphering the Genome. Nucl. Acids Res. 32: D277-D280.

12. Wu CH, Huang H, Nikolskaya A, Hu Z, Yeh LS, Barker. 2004. The iProClass Integrated database for protein functional analysis. Computational Biology and Chemistry, 28: 87-96.

13. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, Suzek BE, Arminski L, Chen Y, Zhang J, Cardenas JL, Chung S, Castro-Alvear J, Dinkov G, Barker. 2004. PIRSF: family classification system at the Protein Information Resource. Nucl. Acids Res. 32: D112-D114.

14. Wheeler, D.L., et al. 2003. Database Resources of the National Center for Biotechnology. Nucl. Acids Res. 31: 28-33.

15. Haft, D.H., J.D. Selengut, and O. White. 2003. The TIGRFAMs database of protein families. Nucl. Acids Res. 31: 371-373.

16. Berman, H.M., K. Henrick, and H. Nakamura. 2003. Announcing the worldwide Protein Data Bank. Nature Structural Biology 10(12): 980.