1. What is KOBAS 3.0?


KOBAS (KEGG Orthology Based Annotation System) is a web server for gene/protein functional annotation (Annotation module) and functional gene set enrichment (Enrichment module). Given a set of genes or proteins, it determines whether a pathway, disease, or Gene Ontology (GO) term is statistically significantly enriched. The previous version, KOBAS 2.0, offers abundant gene set annotation from multiple databases covering pathways (KEGG PATHWAY, Reactome, BioCyc, PANTHER), diseases (KEGG DISEASE, OMIM, NHGRI GWAS Catalog), and GO terms, and supports more than 4,000 species. Because KOBAS 2.0 is widely used by researchers worldwide, we have updated it to KOBAS 3.0, which accepts more input data formats and provides more accurate functional enrichment algorithms.

KOBAS 3.0 consists of two modules, Annotation and Enrichment, described below:

1.1 Annotation

The Annotation module accepts a gene/protein list as input, given either as IDs or as sequences. It annotates each gene using multiple databases of pathways, diseases, and Gene Ontology terms. That is, for each gene, you can find which pathways, diseases, and GO terms are related to it.

1.2 Enrichment

The Enrichment module tells you which pathways, diseases, and GO terms are statistically significantly associated with the genes/proteins you input.

The Enrichment module has two submodules, which differ in their input format:

1.2.1 Gene list Enrichment

This module was called “Identify” in KOBAS 2.0. It accepts the same input formats as the Annotation module, and the output of the Annotation module can also be used as input (see details at 3.1). It is based on the first-generation gene set enrichment method, Overrepresentation Analysis (ORA), a simple and frequently used gene-level test based on the hypergeometric distribution. Many tools, such as DAVID, apply this method. In addition, KOBAS supports other statistical methods (binomial test, chi-square test, frequency list) and 3 FDR correction methods: Benjamini and Hochberg (1995), Benjamini and Yekutieli (2001), and QVALUE.
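
As a rough illustration of ORA with the hypergeometric test followed by Benjamini and Hochberg correction, the sketch below computes an upper-tail p-value for one term and adjusts a list of p-values. It is not the code KOBAS runs; the function names and the example counts are assumptions made purely for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k).

    k: input genes annotated to the term
    n: total input genes
    K: background genes annotated to the term
    N: total background genes
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: 40 input genes, 12 of them in a term that contains
# 150 of the 20000 background genes. The other two p-values are placeholders.
p = hypergeom_pvalue(k=12, n=40, K=150, N=20000)
adjusted = benjamini_hochberg([p, 0.20, 0.03])
```

The Benjamini and Yekutieli and QVALUE corrections are not shown here.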

1.2.2 Exp-data Enrichment

This module is a new feature in KOBAS 3.0. Accepting gene expression data as input is a major change for functional gene set enrichment, because it allows the use of second-generation set-based or net-based gene set enrichment methods, which exploit the molecular measurements that ORA ignores. By considering coordinated changes in gene expression, these methods account for dependence between genes in a pathway, which ORA does not.

This module integrates 9 methods: the set-based methods Globaltest, GSEA, GSA, PADOG, PLAGE, GAGE, and SAFE, and the net-based methods GANPA and CEPA.

Furthermore, to detect enriched gene sets supported by multiple methods, the Exp-data Enrichment module reports, for each gene set, an enrichment score and the probability of being enriched, based on the results of the 9 gene set enrichment (GSE) methods.

2. Procedures


2.1 Annotation

Step 1: Check and choose your type of gene list (see details at 3.1)

Step 2: Select the corresponding species

Step 3: Input the gene list

Step 4: Click Run to start the annotation.

2.2 Gene list Enrichment

Step 1: Input:

  • Check and choose your type of gene list (see details at 3.1)
  • Select the corresponding species
  • Input the gene list

Step 2: Choose the databases you want to use for enrichment analysis. By default, all databases except OMIM are used.

Step 3: Input the background to compare against the gene list from Step 1. The background should be all the genes in your experiment, while the gene list from Step 1 is the set of genes you are interested in. The default background is all genes of the selected species. If you want to use your own background, first run Annotation on all the background genes, then input that result into “Background”.

Step 4: Choose options for statistics. The default options work well. For the hypergeometric test / Fisher's exact test and the chi-square test, the foreground must be a subset of the background. For the chi-square test, a background is needed.
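
Before submitting a custom background, it can help to confirm that every foreground gene also appears in it. The following is a minimal sketch of such a check; the function name and the gene IDs are hypothetical, and this is not part of KOBAS itself.

```python
def genes_missing_from_background(foreground, background):
    """Return the foreground gene IDs that are absent from the background."""
    return sorted(set(foreground) - set(background))


# Hypothetical IDs used only to show the call.
missing = genes_missing_from_background(
    foreground=["1017", "1018", "9999"],
    background=["1017", "1018", "1019", "1020"],
)
if missing:
    print("Not in background:", ", ".join(missing))
```

An empty list means the foreground is a subset of the background, as the tests above require.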

Step 5: Click Run.

2.3 Exp-data Enrichment

Step 1: Input gene expression data with associated information:

  • Select the type of gene IDs in the gene expression data.
  • Select the corresponding species.
  • Upload the gene expression matrix data file.
  • Select the technology that generated the expression data: Microarray or RNA-Seq. For microarray data, you can input either raw data (please check "Need normalization") or normalized data. For RNA-Seq data, please input normalized data only.

Step 2: Upload the sample group codes associated with the gene expression matrix file (see details at 3.2.2).

Step 3: Select the gene set database. Each term in this database will be regarded as a gene set. Because running multiple methods is time-consuming, only one database can be selected per run.

3. Input data format:


3.1 About gene/protein list:

You can use gene/protein IDs, sequences, or even BLAST results.

3.1.1 Gene/protein ID:

These 4 types of ID are currently allowed in KOBAS 3.0:

  • NCBI Entrez Gene ID
  • RefSeq Protein ID
  • Ensembl Gene ID
  • UniprotKB AC/ID

If your IDs are not of these 4 types, some convenient tools are available for converting them:
https://biodbnet-abcc.ncifcrf.gov/db/db2db.php
http://www.ensembl.org/biomart/

3.1.2 Sequences:

Only FASTA format is allowed. Either nucleotide sequence or protein sequence is supported.
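
To check a sequence file before uploading it, a small FASTA reader such as the one below can be used. This is only a sketch: the function name is an assumption, and it takes the first word after ">" as the sequence name, which may differ from how KOBAS reads headers.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Made-up sequences, one nucleotide and one protein, for illustration only.
example = ">seq1 example nucleotide\nACGTAC\nGGTA\n>seq2\nMKVLA\n"
sequences = read_fasta(example)
```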

3.1.3 Tabular BLAST output


See details at http://www.pangloss.com/wiki/Blast
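
For orientation, the sketch below parses tabular BLAST output, assuming the standard 12-column layout produced by BLAST+ with `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Whether KOBAS requires exactly this layout is an assumption here, and the hit shown is invented.

```python
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")
INT_COLUMNS = ("length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send")


def parse_blast_tabular(text):
    """Parse 12-column tabular BLAST output into a list of dicts."""
    hits = []
    for line in text.splitlines():
        # Skip blank lines and the comment lines written by -outfmt 7.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(BLAST_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        hit = dict(zip(BLAST_COLUMNS, fields))
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


# Invented hit line, for illustration only.
example = "query1\tsubject1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t350\n"
hits = parse_blast_tabular(example)
```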

3.2 About the gene expression data

You should input 2 files:

  • Gene expression matrix
  • Binary condition or phenotype data of each sample

3.2.1 Gene expression matrix:

A tab-delimited text file that contains expression values. Columns correspond to samples, and rows correspond to genes.

Header = 'Genes' and sample names; row names = gene IDs.

See the example:

Suppose we have M genes and N samples, and Eij is the gene expression matrix (i=1,2,…M, j=1,2,…N). The file storing Eij should look like this:

The header line contains GeneID and the identifiers for each sample in the dataset.

The remaining lines each contain a GeneID and the corresponding gene expression value for each sample.

Header Line format: GeneID(tab)[Sample 1 name](tab)[Sample 2 name](tab) … [Sample N name]

Remaining line format: [geneid](tab)[Ei1](tab)[Ei2](tab)…[EiN]
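
The layout above can be produced with a few lines of Python. The sketch below builds the tab-delimited text in memory; the function name, sample names, gene IDs, and expression values are all placeholders chosen for illustration.

```python
import csv
import io


def format_expression_matrix(sample_names, rows):
    """Build the tab-delimited matrix text: a header line, then one line per gene.

    rows is a list of (gene_id, values) pairs with one value per sample.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["GeneID"] + list(sample_names))
    for gene_id, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{gene_id}: expected {len(sample_names)} values")
        writer.writerow([gene_id] + list(values))
    return buffer.getvalue()


# Placeholder data: 2 genes (M=2) and 3 samples (N=3).
matrix_text = format_expression_matrix(
    ["Sample1", "Sample2", "Sample3"],
    [("1017", [5.2, 6.1, 5.8]), ("1018", [3.0, 2.9, 3.4])],
)
```

Writing `matrix_text` to a file gives something that can be compared against the downloadable example files.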

You can download our example files from http://kobas.cbi.pku.edu.cn/expression.php

KOBAS 3.0 supports gene expression matrices from either microarray or RNA-Seq.

  • For microarray data, if you input the raw data and want to do normalization, please check "Need normalization".
  • For RNA-Seq data, only gene-level FPKM/RPKM expression data are allowed, not transcript-level data or raw read data.

Note:
  1. The same gene ID must not appear in more than one row.
  2. All genes in the first column should use the same type of identifier, and that type should be declared at “Gene ID Type”.
  3. The species of these gene IDs should be declared at “Species”.
  4. The technology that generated the expression data should be declared at “Technology”.
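
Some of these requirements can be checked locally before uploading. The following sketch reports duplicate gene IDs, rows whose column count differs from the header, and non-numeric expression values. The function name and the messages are assumptions, and the checks KOBAS itself performs may differ.

```python
def validate_expression_matrix(text):
    """Return a list of problems found in tab-delimited matrix text."""
    problems = []
    n_columns = None
    seen = set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if n_columns is None:
            # The first non-empty line is the header.
            n_columns = len(fields)
            continue
        gene_id = fields[0]
        if gene_id in seen:
            problems.append(f"line {number}: duplicate gene ID {gene_id}")
        seen.add(gene_id)
        if len(fields) != n_columns:
            problems.append(
                f"line {number}: expected {n_columns} columns, got {len(fields)}")
        for value in fields[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {number}: non-numeric value {value!r}")
                break
    if n_columns is None:
        problems.append("file is empty")
    return problems


# Placeholder matrix in which the third line repeats a gene ID and is short.
bad = "GeneID\tS1\tS2\n1017\t5.2\t6.1\n1017\t3.0\n"
for problem in validate_expression_matrix(bad):
    print(problem)
```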

3.2.2 Sample class file

Sample class labels, one per line. Each label should correspond to one sample in the expression matrix, in the same order (from the 2nd column to the last column of the matrix). Use '0' for unaffected samples (controls) and '1' for affected samples (cases). Only binary classes are supported.

Note: Each group must have more than 2 samples in the gene expression matrix and the phenotype data.
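
As an illustration of these rules, the sketch below checks a sample class file against the number of samples in the matrix. The function name is hypothetical, and "more than 2 samples" is read literally as at least 3 per group, which is an interpretation of the note above.

```python
def validate_sample_classes(text, n_samples):
    """Return a list of problems found in sample class file text."""
    labels = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if len(labels) != n_samples:
        problems.append(
            f"expected {n_samples} labels, found {len(labels)}")
    unknown = sorted(set(labels) - {"0", "1"})
    if unknown:
        problems.append(f"labels must be 0 or 1, found: {', '.join(unknown)}")
    for label, group in (("0", "control"), ("1", "case")):
        # "More than 2 samples" is taken to mean at least 3 per group.
        if labels.count(label) <= 2:
            problems.append(f"{group} group has only {labels.count(label)} samples")
    return problems


# Placeholder file for 6 samples: 3 controls followed by 3 cases.
class_text = "0\n0\n0\n1\n1\n1\n"
problems = validate_sample_classes(class_text, n_samples=6)
```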

4. Output result explanation:


4.1 About Exp-data Enrichment

The output has 5 columns:

  • GENE SET: ID of the gene set in the chosen database, e.g. KEGG PATHWAY.
  • NAME: name of the gene set.
  • ENRICH RESULT: the KOBAS 3.0 SVM model makes a determination for each gene set based on the 9 methods mentioned above. “True” means the gene set is determined to be statistically significantly enriched, while “False” means it is not.
  • PROBABILITY: the probability associated with this determination.
  • ENRICH SCORE: the distance from the target pathway to the separating hyperplane of the SVM. For enriched gene sets, this value is above 0, and the larger the better.

All results are sorted by ENRICH SCORE in descending order.
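
If the results are processed further, the same ordering and a filter for enriched gene sets can be reproduced as sketched below. The in-memory row layout, the function names, and every value shown are made up for illustration; the actual file format of the output is not described here.

```python
def rank_gene_sets(results):
    """Sort result rows by ENRICH SCORE, highest first."""
    return sorted(results, key=lambda row: row["ENRICH SCORE"], reverse=True)


def enriched_only(results):
    """Keep only rows whose ENRICH RESULT is True, in ranked order."""
    return [row for row in rank_gene_sets(results) if row["ENRICH RESULT"]]


# Placeholder rows with invented IDs, names, and numbers.
results = [
    {"GENE SET": "set_B", "NAME": "Example set B",
     "ENRICH RESULT": False, "PROBABILITY": 0.20, "ENRICH SCORE": -0.8},
    {"GENE SET": "set_A", "NAME": "Example set A",
     "ENRICH RESULT": True, "PROBABILITY": 0.95, "ENRICH SCORE": 1.7},
    {"GENE SET": "set_C", "NAME": "Example set C",
     "ENRICH RESULT": True, "PROBABILITY": 0.70, "ENRICH SCORE": 0.4},
]
top = enriched_only(results)
```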

For example:

For our example data (GSE1297.exp.txt), the output shows that Alzheimer’s disease is the most likely enriched pathway, with the highest ENRICH SCORE. All gene sets with POSITIVE=1.0 are the enriched pathways that you may want.