IMG: Integrated Microbial Genomes

Eukaryotic Genomes in IMG

The following eukaryotic genomes have been included in IMG to broaden its genomic context for comparative analysis:

IMG 2.1 (March 2007)

model organism eukaryotic genomes

  1. Anopheles gambiae PEST
  2. Arabidopsis thaliana
  3. Caenorhabditis elegans
  4. Cryptosporidium parvum Iowa type II
  5. Danio rerio
  6. Dictyostelium discoideum AX4
  7. Drosophila melanogaster
  8. Homo sapiens
  9. Mus musculus
  10. Rattus norvegicus
  11. Trypanosoma brucei TREU927

IMG 2.3 (Sep 2007)

fungal genomes

  1. Aspergillus nidulans FGSC A4
  2. Aspergillus niger CBS 513.88
  3. Aspergillus terreus NIH2624
  4. Candida albicans SC5314
  5. Cryptococcus neoformans var. neoformans B-3501A
  6. Gibberella zeae PH-1
  7. Magnaporthe grisea 70-15
  8. Neurospora crassa OR74A
  9. Pichia stipitis CBS 6054
  10. Ustilago maydis 521

protist genomes

  1. Cryptosporidium hominis TU502
  2. Entamoeba histolytica HM-1:IMSS
  3. Giardia lamblia ATCC 50803
  4. Leishmania major Friedlin
  5. Theileria parva Muguga
  6. Plasmodium yoelii yoelii 17XNL

plant genomes

  1. Ostreococcus lucimarinus CCE9901

Integration Process

RefSeq and Entrez/Gene serve as the primary data sources for eukaryotic genomes. The integration process, outlined in the diagram below, consists of two stages: data collection and data integration.

Integration Process

Data Collection

The data collection stage involves the following steps:
  1. Genome selection
    1. Starting with the most current list of NCBI Genome Projects for eukaryotes, select eukaryotic organisms that have RefSeq files (R), Gene information (G), and good sequence coverage.
    2. For the organisms selected in the previous step, check that the NCBI taxon_ID is consistent and accurate across the various views for each genome, including the Genome Project detail view, the taxonomy view, etc. Resolve any discrepancies in organism name (synonyms, spellings) and taxon ID.
  2. Data collection
    1. Download the Entrez/Gene data files from the FTP site.
    2. For each taxon_ID in the list, count the genes in the gene_info and gene2refseq files and compare the counts with those shown in the NCBI Genome Project and NCBI Taxonomy views. Clarify any discrepancies, contacting NCBI if necessary.
    3. Collect RefSeq protein (.faa) files and genomic contig (.fna) files.
    4. Collect the EC number to gene association data files from KEGG.
    5. Collect cross-reference data sources such as InterPro and PIR's iProClass.
  3. Revise and customize the integration utilities (ETL) as needed.
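
The gene-count check in the data collection step can be sketched roughly as follows. This is a minimal illustration, not the IMG utilities themselves: it assumes tab-delimited input in which the first column is the taxon ID, the second column is the gene ID, and header lines start with `#`. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def count_genes_by_taxon(lines, taxon_ids, gene_col=1):
    """Count distinct gene IDs per taxon ID in tab-delimited lines.

    Assumption: column 0 holds the taxon ID, `gene_col` holds the gene ID,
    and header/comment lines start with '#'.
    """
    wanted = {str(t) for t in taxon_ids}
    genes = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted and len(fields) > gene_col:
            genes[fields[0]].add(fields[gene_col])
    # Taxa with no rows are reported with a count of 0.
    return {t: len(genes[t]) for t in sorted(wanted)}


def find_discrepancies(counts_a, counts_b):
    """Return {taxon_id: (count_a, count_b)} where the two sources disagree."""
    return {
        t: (counts_a.get(t, 0), counts_b.get(t, 0))
        for t in set(counts_a) | set(counts_b)
        if counts_a.get(t, 0) != counts_b.get(t, 0)
    }


# Made-up sample rows standing in for gene_info and gene2refseq content.
gene_info_rows = [
    "#tax_id\tGeneID\tSymbol\n",
    "9606\t101\tgeneA\n",
    "9606\t102\tgeneB\n",
    "10090\t201\tgeneC\n",
]
gene2refseq_rows = [
    "#tax_id\tGeneID\tstatus\n",
    "9606\t101\tREVIEWED\n",
    "9606\t101\tREVIEWED\n",  # one row per transcript, same gene
    "10090\t201\tVALIDATED\n",
]

info_counts = count_genes_by_taxon(gene_info_rows, [9606, 10090])
refseq_counts = count_genes_by_taxon(gene2refseq_rows, [9606, 10090])
mismatches = find_discrepancies(info_counts, refseq_counts)
```

Counting distinct gene IDs rather than rows matters for files that list one row per transcript. Any taxon that appears in `mismatches` would then be checked by hand against the NCBI views.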

Data Integration

For each organism:

  1. Chromosomal genomic data (NC_nnnnnn, NT_nnnnnn or NW_nnnnnn), mRNA (NM_nnnnnn or XM_nnnnnn) and protein sequence (NP_, XP_, YP_) data files were downloaded from the current RefSeq release, using either the organism name or the NCBI taxon ID.
  2. Gene-specific data were downloaded from Entrez/Gene, with RefSeq data providing the context for the associated primary transcript and alternate variant transcripts. For each gene:
    1. the longest transcript associated with the gene is used in IMG as the primary transcript, while other transcripts are recorded as alternate transcripts linked to that gene;
    2. KEGG EC numbers, KO terms, GO terms and dbxref are associated with the gene when correlations are available;
    3. transcript- and protein-specific links to UniGene, UniProt, InterPro, TIGRFAMs and PIR superfamilies were derived using PIR's iProClass.
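
The primary transcript rule described above (longest transcript per gene, all others recorded as alternates) might look like the sketch below. The `Transcript` record, the function name and the accessions are hypothetical, and the tie-breaking rule (first transcript encountered wins) is an assumption, since the text does not say how equal lengths are handled.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    accession: str  # hypothetical accession, for illustration only
    gene_id: str
    length: int     # transcript sequence length in bases


def split_transcripts(transcripts):
    """Map gene_id -> (primary, alternates), where primary is the longest.

    Assumption: on a length tie, the transcript seen first is primary.
    """
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene_id, []).append(t)

    result = {}
    for gene_id, group in by_gene.items():
        primary = max(group, key=lambda t: t.length)  # first max wins ties
        alternates = [t for t in group if t is not primary]
        result[gene_id] = (primary, alternates)
    return result


transcripts = [
    Transcript("NM_000001", "gene1", 1200),
    Transcript("XM_000002", "gene1", 1850),
    Transcript("XM_000003", "gene1", 900),
    Transcript("NM_000004", "gene2", 640),
]
assignments = split_transcripts(transcripts)
primary, alternates = assignments["gene1"]
```

Here `primary.accession` is `XM_000002` for `gene1`, and the two shorter transcripts are kept as alternates linked to the same gene.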

Integration Notes

1. Data resource issues
  1. Organelle replicons (i.e. mitochondrial/chloroplast replicons) are sometimes, but not always, included as part of the RefSeq genome data files. Discrepancies were noticed between the taxon ID specified for the organelle replicon and the taxon ID specified for the chromosomal replicons of the same organism. Consequently, the gene count for an organism in IMG may differ from the count in its corresponding model organism database.
  2. Data from different resources are not always consistent with each other because each resource is updated on its own schedule. This may result in missing links or in inconsistent or missing annotations such as KO terms, enzymes and GO terms. For example, changes to the various NCBI resources (Entrez/Gene, NCBI Taxonomy, Genome Projects list, RefSeq) are not coordinated, and this lack of synchronization may cause inconsistencies in the connections between data items.

    Example: during the week of 07-20-2007, a new version of chromosome 7 of Magnaporthe grisea 70-15 appeared in Entrez/Gene, while the Release 23 RefSeq files still contained the contigs of the previous version of that chromosome. Consequently, references from Entrez/Gene pointed to the new version of the chromosome, but the data for this version were not available from RefSeq's FTP site. As a result, 1,195 genes could not be integrated because the genomic and corresponding protein files were missing from RefSeq Release 23.

2. Gene location
  1. Some genes may not have a chromosomal assignment or a precise genomic location associated with them. This is possible in higher eukaryotes, where at times the nomenclature committee of the model organism has assigned only a symbol or an identifier and no other data about the gene are known yet (e.g. NCBI and IMG view). In other cases, RefSeq may lack this information for the IMG-defined primary transcript of the gene. Genes without a chromosomal assignment are associated with a virtual, IMG-created Unknown chromosome/scaffold.
  2. Some genes may be associated with multiple chromosomes or chromosomal regions. IMG assigns the first chromosomal location provided in the data file and ignores the others.
3. Gene categories
  1. In addition to protein-coding, rRNA and pseudo genes, Entrez/Gene uses the categories unknown and other, for which little documentation is available. IMG interprets genes in these categories as uncharacterized genes.
  2. IMG uses its own convention for rRNA gene naming, which may differ from the rRNA naming conventions traditionally followed by the corresponding model organism databases.
4. DNA sequence
  1. There is no direct correspondence between the length of a gene's DNA sequence and the length of its protein's amino acid sequence, because of introns and UTR elements in the gene structure. IMG uses the length of the corresponding mRNA sequence as the DNA sequence length of the gene.
5. Protein coding transcript
  1. IMG uses the longest protein-coding transcript of a gene as its primary transcript. All other associated transcripts are treated as alternate transcripts. This may confuse users who compare IMG with the corresponding model organism sources or with the NCBI Entrez or Ensembl data sources. RefSeq does not explicitly specify which transcript is primary and which are alternates; this information is usually embedded in the definition line as free text with no standard convention.
  2. RefSeq assigns NP_ accessions to transcripts supported by cDNA/EST/UniGene data or full-length mRNA data, while predicted transcripts have XP_ or YP_ accessions. Thus a gene may have better-curated transcripts (of NP_ type) as well as multiple predicted transcripts (of XP_ or YP_ type).
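
Two of the conventions in the notes above lend themselves to a short sketch: sorting RefSeq accessions by prefix, and choosing the first chromosomal location or falling back to the Unknown chromosome/scaffold. The prefix groupings follow the text of this page. The function names are hypothetical, and the `|` separator and `-` placeholder for a missing value are assumptions about the input file format.

```python
# Prefix groupings as described on this page.
ACCESSION_KINDS = {
    "NC_": "genomic", "NT_": "genomic", "NW_": "genomic",
    "NM_": "mRNA", "XM_": "mRNA",
    "NP_": "protein", "XP_": "protein", "YP_": "protein",
}
# Per the note above, XP_ and YP_ accessions are treated as predicted.
PREDICTED_PROTEIN_PREFIXES = {"XP_", "YP_"}


def classify_accession(accession):
    """Return (kind, is_predicted_protein) based on the 3-character prefix."""
    prefix = accession[:3]
    kind = ACCESSION_KINDS.get(prefix, "unknown")
    return kind, prefix in PREDICTED_PROTEIN_PREFIXES


def assign_chromosome(field, separator="|", missing="-"):
    """Pick the first listed chromosome, or 'Unknown' if none is given.

    Assumption: multiple locations are separated by `separator` and a
    missing value is written as `missing` or left empty.
    """
    parts = [p.strip() for p in field.split(separator) if p.strip()]
    if not parts or parts[0] == missing:
        return "Unknown"
    return parts[0]


curated = classify_accession("NP_000001")
predicted = classify_accession("XP_000002")
first_location = assign_chromosome("X|Y")
no_location = assign_chromosome("-")
```

With these inputs, a gene listed on both X and Y is placed on X only, and a gene with no location is routed to the Unknown chromosome/scaffold.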

Comparative Analysis Context

The inclusion of eukaryotic model organisms in IMG broadens its genomic context for comparative analysis, as illustrated by the following examples.

Example 1

Consider the Chloroflexus gene 638348236. BLAST against IMG 2.0 does not return any experimentally verified hits:

Example 1a

However, BLAST against the IMG 2.1 database shows that this gene is highly similar to the Arabidopsis monogalactosyldiacylglycerol synthases and therefore may have this function.

Example 1b

Example 2

Consider the Nitrosomonas gene 637427593. BLAST against IMG 2.0 returns the following genes, none of which indicates its potential function.

Example 2a

However, BLAST against IMG 2.1 shows that this gene has high similarity to human and mouse cyclooxygenases and therefore may also be a fatty acid oxygenase.

Example 2b