Data Processing
Data Extraction, Transformation, and Loading
Data are downloaded from external sources in flat-file or database format via FTP. Perl-based tools are then used to extract and parse information from these sources. The general methodology is to extract the data from the database or flat file, convert the data into a structure that is consistent with the object structure in the IMG warehouse, then export these data into flat files. Each file has a corresponding *.ctl file, which is used to load the file into an Oracle database using the SQL*Loader tool.
The process starts with extracting data from external sources (RefSeq, EBI Genome Reviews, GenBank/EMBL, JGI microbial genome data, LIGAND, GO, COG, Enzyme, InterPro, KEGG, KO, Pfam, TIGRFam, CDD databases, and GOLD).
Next, auxiliary functional reference tables from LIGAND, KEGG, GO, COG, Enzyme, and InterPro, Pfam and TIGRFam are parsed, transformed and loaded. Finally, genomic data from RefSeq, EBI Genome Reviews, GenBank/EMBL and JGI microbial genome data are parsed, transformed and loaded.
Loading the genome data from each of the sources in the previous step involves manual addition and editing of taxon information, merging of revised gene model annotations for JGI microbes, and calculating GC content and protein statistics.
For genomes loaded into IMG, the NCBI Taxonomy information (domain, phylum, order, etc.) sometimes needs further expert revision and correction. Additional information, such as ecotype, phenotype, disease and relevance metadata are added from GOLD database.
Postprocessing
After data are loaded into the IMG warehouse, additional postprocessing is carried out. First, auxiliary data files are generated by organizing protein sequences into taxon batches. These files are then used to create FASTA query files and BLAST databases grouped by genome.
BLAST is used for computing homolog, paralog, and ortholog gene relationships. RPS-BLAST (Reverse Position Specific BLAST) is used for computing gene associations with COG as well as precomputed alignment positions. Gene relationships and COG gene associations are subsequently loaded into the warehouse.
Starting with IMG 2.4, genes are associated with Pfam and TIGRfam using hmmsearch, a tool in the HMMER package (http://hmmer.janelia.org).
A BLAST pre-filter is used for quickly narrowing down the candidate HMM models. BLAST is run with a non stringent e-value cutoff in order to pick up any sub-sequence from the seed sequences of an HMM model that could be a candidate for full HMM scoring. Low complexity masking is turned off. For this step, the BLAST e-value cutoff is set to 10,000 for Pfam and 1000 for TIGRfam seed sequences databases, respectively. hmmsearch is then applied on the candidate models with a per family noise cutoff (--cut_nc). The scoring for domain level hits are recorded for Pfam, while scoring is done for the full model is recorded for TIGRfam. The "ls" models are used, though in the future "fs" will also be considered for fragmented metagenomic data in the case of Pfams.
Additional postprocessing involves computing statistics for each genome, such as total number of bases, GC content, number of genes, number of genes without associated function, number of genes with/ without homologs, genes associated with KEGG pathways, COG, Pfam, and InterPro. The phylogenetic occurrence profiles for each gene are also precomputed using bidirectional best hits (BBH) to genes in other genomes.
Finally, database postprocessing involves generating derived data, loading foreign key constraints, indexing, and generating statistics for performance optimization. Cache data may be generated to improve performance of the Web UI application.
Data Cleaning
Before they are loaded into IMG, new JGI finished genomes undergo a QA process for identifying and correcting overlapping CDS errors, identifying CDSs with an incorrect start codon, and identifying potentially missing CDSs. Overlapping CDSs are classified as same-direction (type I), divergent (type II), or convergent (type III) overlaps. The process involves three steps: editing overlapping CDSs, fixing non-overlapping CDSs with incorrect start codons, and finding missing CDSs.
To correct overlapping CDSs, BLAST queries for each of the overlapping CDSs are run against the NR database and the top 10 hits are examined. Based on the BLAST results, each CDS is then trimmed manually by adjusting the 5' end, deleting the ORF, or flagging it as a pseudogene.
To find nonoverlapping CDSs with incorrect start codons, first BLAST queries are run for all nonoverlapping CDSs and the 10 best hits are recorded. All the genes that are more than 30 amino acids shorter or longer than the 10 best hits are then identifed. For ORFs longer than the 10 best hits, the errors are corrected by manual trimming. For genes shorter than the 10 best hits, the CDS are manually extended or a pseudogene tag is added. When the CDS is trimmed or extended, a BLAST query is run for the resulting protein and the alignment is checked against best hits and NCBI domain database. In addition to BLAST queries, multi-domain proteins are compared to the COG and Pfam databases to confirm the integrity of the protein domains that can be assigned to a function; this comparison also allows checking for possible frameshift mutations since we are able to detect parts of a specific domain in two or more adjacent regions in different frames.
- Regions that hit an intact COG or Pfam are considered to be functional genes, while regions which do not comply with the above criterion are annotated in a different way, using the internal IMG nomenclature for truncated coding sequences.
- Regions with no COG or Pfam hits are annotated based on the length of their alignment with homologous proteins.
To find missing CDSs, BLAST queries are run for all DNA fragments, which do not encode conserved proteins (i. e., proteins with hits in the conserved domain databases or NR). Such DNA fragments are designated as "intergenic regions" and for those of more than 200 base pairs long BLAST hits with an E-value of 10-5 are recorded. Regions with significant hits may correspond to: (a) annotated CDS which could correspond to functional genes or pseudogenes (i.e. fragments of the functional protein); (b) sequences that are not annotated, which are then annotated as genes or pseudogenes. .
The re-annotated genome is then used to update our database i.e. add new proteins or remove proteins from it that are annotated as pseudogenes.
Genes in new JGI genomes deemed problematic are tagged as follows:
- /short
This tag is used for CDSs that are shorter (80% of the length) than their homologs (at least 90% of their homologs), they cannot be extended to the full length and the missing fragments cannot be identified in the upstream or downstream DNA. They correspond to a truncated COG (or Pfam) hit (less than 80% and more than 30%) and they are longer than 100 aa. - /long
This tag is used in the case that a CDS is longer than at least 90% of its homologs on the 3' end. It overlaps with another gene and the overlap cannot be removed by simple gene truncation, without changing the nucleotide sequence. Long CDSs are suspected cases of sequencing errors (stop codon missed due to a frameshift or a incorrect nucleotide assignment). - /frameshift
These are 2 or more adjacent CDSs homologous to different parts of the same gene in other genomes. They are generated by frameshifts, or by the introduction of stop codons. The differences between frameshift fragments and suspected fusion proteins are that 1) frameshift fragments do not have a complete COG (or Pfam) hit and 2) frameshift fragments of this particular length are present in only one genome (or in different strains of the same species). If there are suspected frameshift fragments of approximately the same length present in phylogenetically distant species, this could be the indication of a fusion protein rather than real frameshift fragments. Presence of 3 or more fragments of the same gene (generated by multiple stop codons or frameshifts) is a sign of a pseudogene. - /pseudo
This tag is used for pseudogenes. Pseudogenes are by definition the genes that are in the process of being destroyed and are no longer functional. CDSs are tagged as pseudogenes if they 1) are interrupted by more than 1 stop codon or frameshift, or 2) are separated by another ORF, or 3) correspond to a truncated COG (or Pfam), less then 30% of the full-length COG.
Computing Gene Relationships
Homolog, paralog, and ortholog relationships are established through BLAST hits between genes, with genes in all genomes compared to genes in all other genomes ("all vs. all"). The computations are carried out incrementally between different versions of the warehouse.
Orthologous pairwise relationships are computed as bidirectional best hits between genomes. Paralogous pairwise relationships are computed as reciprocal hits within the same genome. Homologs are unidirectional hits. The E-value of 10-2 or better is used for these pairwise relationships. Additional filtering by percent identity, bit score, and more stringent E-values can be applied as needed through the Web UI application filtering or database queries.
Orthologous groups are formed by using the Markov Cluster Algorithm (MCL). A conservation score is calculated to normalize the strength of similarity. This is basically the bit score between two sequences divided by bit score of the sequences when BLASTed against itself (self bit score). More precisely, it is
cons_scorexy = bit_scorexy / max( bit_scorexx, bit_scoreyy )
where x and y are two separate sequences. The mcl tool is runned with default parameters. Paralog groups are formed by using the same procedure.
Genes are associated with COG using RPS-BLAST (Reverse Position Specific BLAST) and NCBI's Conserved Domain Database (CDD). COG computations use an E-value cutoff of 10-2. These computations also provide additional alignment information.
Genes are associated with Pfam and TIGRfam using hmmsearch, a tool in the HMMER package (http://hmmer.janelia.org).
Positional clusters are computed as sequences of genes within 300 base pairs of each other, regardless of strand orientation. Positional clusters are usually used for highlighting genes that are positional neighbors in the KEGG map.
Parameters
The following table summarizes the parameters used for the various forms of NCBI BLAST.
| Computation | Tool | Maximum E-value |
Minimum Percent Identity |
Low Complexity Filtering | Limit on top Hits | Other Notes |
|---|---|---|---|---|---|---|
| COG | rpsblast | 1e-2 | None | None | 1 | Top hit only. |
| PRIAM | rpsblast | 1e-10 | 45% | Soft Masking (-F 'm S') | 1 | Top hit only. ≥ 70% alignment length on query gene and PRIAM sequence. |
| Protein all. vs. all |
blastall -p blastp | 1e-2 | None | Soft Masking (-F 'm S') | 2500 | The results is the foundation of other computations such as BBH clusters, paralog clusters, phylogenetic profiler searches which may use more stringent filtering. |
| RNA all vs. all | blastall -p blastn | 1e-5 | None | None | 250 | ≥ 50% alignment over length of both sequences. |
Tools
The IMG system makes use of the standard bioinformatics tools BLAST[1]. CLUSTALW[2], and MCL[3]. Information about BLAST can be viewed on the NCBI web site. Details about ClustalW are available from EBI. Mview is used for formatting ClustalW alignment output. Information about Mview is available at Mview documentation. An introdution into MCL may be found at MCL - a cluster algorithm for graphs.
References
1. Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
2. Thompson, J.D., D.G. Higgins, and T.J. Gibson. 1994. CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673-4680.
3. Enright A.J., Van Dongen S., Ouzounis C.A. An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res. 30(7):1575-1584 (2002).
