IMG: Integrated Microbial Genomes

Data Model

Background

Genomic data reside in multiple specialized resources. For effective genome analysis, it is important to integrate related information into a coherent common framework. This integration helps identify and resolve data inconsistencies and allows the detection of new relationships between genes, such as phylogenetic conservation and cross-species homologies. Furthermore, integrating data from different sources increases the reliability of the interpretation of the results.

Integrating gene annotations from various data sources involves identifying important and reliable data sources, regularly monitoring these sources, parsing and interpreting the results, and establishing associations between related entities such as the correlation of gene fragments and known genes.

The data model underlying the IMG data warehouse, outlined below, provides the framework for integrating and managing microbial and selected eukaryotic genomic data collected from multiple data sources. The data model captures primary genomic sequence data, computationally predicted and curated gene models, pre-computed sequence similarity relationships, and functional annotations of the genome in a coherent biological context.

Figure: Abstract data model

Overview

Object class Taxon/Organism models an organism with its taxonomic lineage (domain, phylum, class, order, family, genus, species, and strain). The primary DNA sequence data of a chromosome/replicon is assembled into scaffolds and/or contigs, which are modeled by class Scaffold. Scaffolds make it possible to capture data about draft (unfinished) genomes as well as the genomic elements of a taxon (replicons, including chromosomes, plasmids, mitochondria, extra-chromosomal elements, etc.). Computationally predicted genomic regions, such as CRISPRs, and additional sub-sequence features, such as intervening regions, repeats, and promoters, are modeled by attribute Feature of class Scaffold. Class Assembly models the assembly procedure and the methods used for assembling the chromosome and scaffolds.
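As a rough illustration only, the Taxon, Scaffold, and Feature relationships described above might be sketched as Python dataclasses. The field names (taxon_id, scaffold_id, feature_type, start, end) and the helper methods are assumptions made for this sketch; they are not the actual IMG schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Taxonomic ranks named in the text, from broadest to narrowest.
LINEAGE_RANKS = ("domain", "phylum", "class", "order",
                 "family", "genus", "species", "strain")


@dataclass
class Feature:
    """Sub-sequence feature on a scaffold (e.g. CRISPR, repeat, promoter)."""
    feature_type: str
    start: int  # hypothetical convention: 1-based, inclusive
    end: int


@dataclass
class Scaffold:
    """A scaffold or contig; the text models features as an attribute of it."""
    scaffold_id: str
    length: int
    features: List[Feature] = field(default_factory=list)

    def features_of_type(self, feature_type: str) -> List[Feature]:
        return [f for f in self.features if f.feature_type == feature_type]


@dataclass
class Taxon:
    """An organism with its lineage and its scaffolds."""
    taxon_id: int
    lineage: Dict[str, str]  # rank -> name; ranks may be missing
    scaffolds: List[Scaffold] = field(default_factory=list)

    def lineage_string(self) -> str:
        # Join the known ranks in canonical order, skipping absent ones.
        return "; ".join(self.lineage[r] for r in LINEAGE_RANKS if r in self.lineage)

    def total_length(self) -> int:
        return sum(s.length for s in self.scaffolds)
```

A draft genome would simply be a Taxon with many Scaffold objects, while a finished genome would typically have one per replicon.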

NCBI RefSeq is the main source of genome sequence data and associated annotations. For selected public genomes from RefSeq, a gene model review and curation is carried out at JGI, with the revised data sets replacing the original data sets. So far, 22 archaeal public genomes have been reviewed; see Archaeal QA for details on the revisions for these genomes.

Class Gene can be viewed as the main object class of the IMG data model, whereby genes are characterized by attributes of Gene, such as gene name and symbol, as well as through relationships with other object classes.

Gene-specific features such as predicted ORFs (Open Reading Frames), mRNA transcripts, and non-coding RNAs are identified by start/end coordinates and modeled by class Gene. Predicted ORFs are further characterized by attributes of class Gene, such as gene name, symbol, and pseudogene status.

Protein-coding genes are further characterized by their associated molecular function (class Enzyme) and by protein families and functional domains (classes IPR Family, Pfam Family, TIGRfam, and PDB Xref). Class GO Term models gene function in terms of molecular function, cellular component, and biological process. Class KO Term models the KEGG Orthology hierarchy: KEGG Orthology is an extension of the KEGG ortholog identifiers, structured as a hierarchy in the context of KEGG pathways. Class COG models clusters of orthologous groups of genes and further characterizes gene function.
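One way to picture Gene as the hub of the model is a record that carries its own attributes plus links to annotation classes. The sketch below is a simplification under stated assumptions: the field names, the single annotations mapping keyed by source name, and the helper genes_with_term are all hypothetical, and the identifier strings a caller passes in would be placeholders rather than real IMG annotations.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Gene:
    """A gene with coordinates, descriptive attributes, and annotation links."""
    gene_id: str
    scaffold_id: str
    start: int
    end: int
    gene_type: str = "ORF"  # e.g. ORF, mRNA transcript, non-coding RNA
    name: str = ""
    symbol: str = ""
    is_pseudogene: bool = False
    # Hypothetical layout: annotation source (e.g. "Enzyme", "Pfam Family",
    # "COG", "GO Term", "KO Term") -> set of term identifiers.
    annotations: Dict[str, Set[str]] = field(default_factory=dict)

    def annotate(self, source: str, term_id: str) -> None:
        """Link this gene to one term from one annotation source."""
        self.annotations.setdefault(source, set()).add(term_id)


def genes_with_term(genes: Iterable[Gene], source: str, term_id: str) -> List[str]:
    """Return the IDs of genes linked to a given term, in input order."""
    return [g.gene_id for g in genes
            if term_id in g.annotations.get(source, set())]
```

In a relational warehouse these links would more likely be separate association tables; the in-memory mapping is used here only to keep the example short.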

Ortholog and paralog gene relationships are modeled by classes Ortholog and Paralog, respectively. Ortholog and paralog groups are modeled by Ortholog Group and Paralog Group, respectively, where gene grouping is based on bidirectional best hit (BBH) single-linkage. For a description of gene relationships, groups, and clusters, see http://img.jgi.doe.gov/v1.0/help/concepts.html.
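The grouping by bidirectional best hit (BBH) with single linkage can be sketched as follows. This is a minimal illustration for the ortholog case, assuming the best hits have already been computed; the input layout (a best_hit mapping keyed by query gene and target genome, and a genome_of lookup) is hypothetical, and any similarity scoring or cutoffs used by IMG are not described here and are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def bidirectional_best_hits(
    best_hit: Dict[Tuple[str, str], str],
    genome_of: Dict[str, str],
) -> Set[Tuple[str, str]]:
    """Return gene pairs that are each other's best hit.

    best_hit maps (query_gene, target_genome) to the best-scoring gene
    in that target genome; genome_of maps each gene to its own genome.
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        # Look up the reverse direction: the hit's best match in the query's genome.
        back = best_hit.get((hit, genome_of[query]))
        if back == query:
            pairs.add(tuple(sorted((query, hit))))
    return pairs


def single_linkage_groups(pairs: Set[Tuple[str, str]]) -> List[List[str]]:
    """Merge BBH pairs into groups: connected components via union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())
```

Because linkage is single, two genes can land in the same group without being a BBH pair themselves, as long as a chain of BBH pairs connects them.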

Class User Annotation captures community user annotations for IMG genes.

Pathway maps, reactions, and compounds from KEGG and LIGAND are modeled by classes KEGG Map, Reaction, and Compound, respectively.

IMG curated compounds, reactions, interactions, pathways, and networks are modeled by classes IMG Compound, IMG Reaction, IMG Pathway and IMG Network, respectively. The functional role of genes is characterized by terms modeled by class IMG Term, with corresponding links to Enzymes. Details on IMG native terms, compounds, reactions, interactions, pathways, and networks are further discussed below.