aGOtool

PARAMETERS


Identifiers

Protein identifiers can be provided as UniProt accessions (e.g. P31946), UniProt entry names (formerly called UniProt ID) (e.g. 1433B_HUMAN), or STRING identifiers (e.g. 9606.ENSP00000361930).
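The three identifier styles above can be told apart by their shape. The following is a minimal sketch, not part of aGOtool itself: the accession pattern follows the regular expression UniProt publishes for accession numbers, while the entry-name and STRING patterns, as well as the function name classify_identifier, are assumptions made for illustration.

```python
import re

# Accession pattern as published in the UniProt documentation.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed shape of an entry name: mnemonic, underscore, species code.
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")
# Assumed shape of a STRING identifier: NCBI taxon ID, dot, protein name.
STRING_ID = re.compile(r"[0-9]+\.\S+")


def classify_identifier(identifier: str) -> str:
    """Guess which of the three supported identifier styles is used."""
    identifier = identifier.strip()
    if ACCESSION.fullmatch(identifier):
        return "uniprot_accession"
    if ENTRY_NAME.fullmatch(identifier):
        return "uniprot_entry_name"
    if STRING_ID.fullmatch(identifier):
        return "string_id"
    return "unknown"


for example in ("P31946", "1433B_HUMAN", "9606.ENSP00000361930"):
    print(example, classify_identifier(example))
```

A check like this only validates the format; it cannot tell whether an identifier actually exists in UniProt or STRING.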

Upload data

Drag & drop or click to upload a file: Expects a tab-delimited text file (".txt" or ".tsv") with 1, 2, or 3 columns, depending on the "Enrichment method" used. Additional columns will simply be ignored. You can either drag and drop a file or click the "Choose file" button. Alternatively, you can use the copy and paste fields below.

Foreground and Background consist of protein identifiers (or protein groups, e.g. "P02407;P14127"). The Foreground is the test group: the proteins of interest you want to characterize in order to e.g. obtain overrepresented GO terms. To test for statistical significance, the Foreground is compared to the Background. Typically, the Background consists of the entire genome, which is often not optimal for proteomics data. We therefore highly recommend using your own custom Background, since it will have the same biases as the Foreground (sample preparation, instrumentation, etc.). Intensity (i.e. abundance) is coupled to the Background, and any kind of abundance measure can be used, e.g. copy number, iBAQ, LFQ, spectral counts, etc. This means that for each protein identifier (or protein group) in the Background there should be a corresponding numeric value in the Intensity column (proteins with missing abundance data will not be ignored, but placed together into a missing values group). No Intensity values have to be provided for the Foreground. Further details are described below and on the FAQ page. Unnecessary data and parameters will be ignored.
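As an illustration of the three-column layout described above, here is a minimal sketch that writes such a tab-delimited table. The header names "Foreground", "Background", and "Intensity" in the first row, and the helper name build_input_table, are assumptions for this example; check the example files on the site for the exact header the service expects.

```python
import csv
import io
from itertools import zip_longest


def build_input_table(foreground, background):
    """Return tab-delimited text with Foreground, Background, Intensity columns.

    foreground: list of protein identifiers (or protein groups).
    background: list of (identifier, intensity) pairs; intensity may be None
                when no abundance value is available.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Foreground", "Background", "Intensity"])  # assumed header
    # The columns are independent lists, so the shorter one is padded.
    for fg, bg in zip_longest(foreground, background, fillvalue=None):
        bg_id, intensity = bg if bg is not None else ("", None)
        writer.writerow([fg or "", bg_id, "" if intensity is None else intensity])
    return buffer.getvalue()


table = build_input_table(
    ["P02407;P14127"],
    [("P02407;P14127", 1500000), ("P31946", None)],
)
print(table)
```

The Foreground column is usually shorter than the Background column, so rows beyond the last Foreground entry simply start with an empty field.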

Enrichment methods

  • abundance_correction expects 3 columns: Foreground, Background, and Intensity. This method was tailor-made to account for the inherent abundance bias in mass spectrometry-based shotgun proteomics data (since proteins can't be amplified, highly abundant proteins are more likely to be detected than low abundant proteins). This bias can influence GO-term enrichment analysis by showing enriched terms for abundant rather than e.g. post-translationally modified (PTM) proteins. Please see the original publication and the FAQ page on "How does the abundance_correction method work?". When should you use this method? If you have PTM data or data that suffers from a similar bias. When comparing PTM proteins to the genome (as the Background) we've found abundance bias, simply because a PTM will in most cases not be present at a stoichiometry of 100%. Hence it is more likely to identify PTM proteins/peptides on abundant proteins (rather than low abundant proteins), and enrichment analysis will therefore show enrichment for abundant rather than modified proteins.
  • characterize_foreground expects 1 column: Foreground. Display existing functional annotations for your protein(s) of interest. No statistical test for enrichment is performed.
  • compare_samples expects 2 columns: Foreground and Background. Provide your own Foreground and Background and compare them without abundance_correction. Instead of using the genome as the Background (since you will probably not detect every protein of the entire genome of your organism), you can provide your own custom Background, which you've measured in your lab, and therefore account for the biases you might have in your data (sample preparation, instrumentation, etc.). When should you use this method? This is the most generic method and should be used if you are not using 'abundance_correction'. An example would be case (Foreground) vs. control (Background). So in general the Background should simply consist of the experimentally measured proteome of your model organism using control conditions.
  • compare_groups expects 2 columns: Foreground and Background. This method is intended to compare two groups, each with a user-defined number of replicates. It is similar to compare_samples, but takes redundancies within each group into account. The idea is to count protein-to-function associations multiple times per protein (e.g. 8 out of 10 replicates in the treatment group identified protein A vs. 4 out of 10 in the control group, or 8 PTM sites were found on the protein vs. 4). In short, redundant proteins are not reduced to a set (in the mathematical sense) but are kept "as is", and all of them are counted regardless of redundancy. When should you use this method? One example would be to account for multiple PTM sites of the same protein, counting the protein once for each site instead of once for the entire protein. Another is when selecting subsets of proteins from one group is impeded or unfavourable. Note that frequent terms will be favoured, since the counts in Fisher's exact test will be higher.
  • genome expects 1 column: Foreground. This method provides a Background from UniProt Reference Proteomes restricted to "one protein sequence per gene" (as provided by UniProt). To determine which organism's reference proteome to use as the Background, you're expected to provide the NCBI taxon identifier (e.g. 9606 for Homo sapiens). Please make sure to use the exact TaxID UniProt provides, since some cases can be unexpected: e.g. instead of TaxID '4932' for 'Saccharomyces cerevisiae', UniProt provides '559292' for 'Saccharomyces cerevisiae S288C'. We therefore support '559292'.
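The difference between compare_samples and compare_groups described above comes down to whether repeated proteins are collapsed into a set before term associations are counted. The following is a minimal sketch of that counting step only; the function name count_associations, the toy protein names, and the toy term labels are hypothetical, and this is not aGOtool's actual implementation.

```python
from collections import Counter


def count_associations(proteins, annotations, keep_redundancy):
    """Count how often each functional term is hit by a list of proteins.

    proteins: list of identifiers, possibly containing repeats
              (e.g. one entry per replicate or per PTM site).
    annotations: dict mapping a protein to its set of functional terms.
    keep_redundancy: False collapses repeats into a set (compare_samples style),
                     True counts every occurrence (compare_groups style).
    """
    if not keep_redundancy:
        proteins = set(proteins)
    counts = Counter()
    for protein in proteins:
        for term in annotations.get(protein, ()):
            counts[term] += 1
    return counts


# Toy data: protein A was identified in 8 replicates, protein B in 1.
proteins = ["A"] * 8 + ["B"]
annotations = {"A": {"term_1"}, "B": {"term_1", "term_2"}}

print(count_associations(proteins, annotations, keep_redundancy=False))
print(count_associations(proteins, annotations, keep_redundancy=True))
```

With redundancy kept, term_1 is counted nine times instead of twice, which illustrates why frequent terms are favoured in this mode.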

Analysis options

These options enable you to restrict the enrichment analysis to specific categories, subsets, etc., which means that p values corrected for multiple testing will be penalized less. This is in contrast to the Report options below, which simply hide certain terms according to the filtering criteria. There is a REST API for easy programmatic access to this service, which enables more customized search criteria (see API).
Category of functional associations: Select a specific functional category from the drop-down menu, e.g. one or all three GO categories (molecular function, biological process, cellular component), UniProt keywords, KEGG pathways, PubMed IDentifiers (PMIDs), Reactome, WikiPathways, InterPro domains, Pfam domains, Brenda Tissue Ontology (BTO), or Disease Ontology IDs (DOID).
Over-, under-represented or both: Choose to only test overrepresented or underrepresented GO-terms, or to report both of them.
GO basic or a slim subset: Choose between the full Gene Ontology or a GO slim subset. GO slims are curated subsets of GO terms that are less fine-grained and tailored to a specific organism or domain.
Multiple testing per entity type: Perform multiple testing separately for each functional category (i.e. entity type) or together for all functional terms (as in the original version of aGOtool).
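To see why correcting per entity type penalizes p values less than correcting all terms together, consider the sketch below. It assumes the Benjamini-Hochberg procedure as the FDR method, which this page does not state explicitly; the function names and the toy p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def correct(results, per_entity_type):
    """results: list of (etype, term, p_value). Returns {term: corrected p}."""
    groups = {}
    for etype, term, p in results:
        key = etype if per_entity_type else "all"
        groups.setdefault(key, []).append((term, p))
    corrected = {}
    for members in groups.values():
        adjusted = benjamini_hochberg([p for _, p in members])
        for (term, _), q in zip(members, adjusted):
            corrected[term] = q
    return corrected


# Toy results: two terms of etype -21 and one of etype -51.
results = [(-21, "a", 0.01), (-21, "b", 0.04), (-51, "c", 0.03)]
print(correct(results, per_entity_type=True))
print(correct(results, per_entity_type=False))
```

Each group is corrected only for the number of tests it contains, so smaller groups lead to smaller corrected p values for the same raw p value.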

Report options

These options filter the results so that only relevant ones are reported. More results were most probably generated, but those that do not pass the given cutoffs are not shown.
p value cutoff: Maximum value (threshold) for uncorrected p values (e.g. 0.01 means 1%). "1" means no filter will be applied.
p value corrected (FDR) cutoff: Maximum value (threshold) for FDR corrected p values (e.g. 0.01 means 1%). "1" means no filter will be applied.
Filter foreground count one: Don't report any functional terms that are associated with only a single protein for the given Foreground.
Filter redundant parent terms: The aim of this filter is to reduce the plethora of results to a clearer and more concise subset, without losing essential information. It can be applied to functional terms with a hierarchical structure (such as GO, Brenda Tissue Ontology, Disease Ontology, and Reactome), where “child” terms are more specialized than their “parent” terms (geneontology.org). This filter retains the most specific terms and removes the less specific terms closer to the root of the tree. The filter only applies if the same Foreground proteins are associated with the terms.
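The parent-term filter can be pictured as follows: a less specific term is dropped only when a more specific descendant is associated with exactly the same Foreground proteins. This is a minimal sketch under that reading of the description above; the function name, the ancestors mapping, and the toy term names are hypothetical, and the real implementation may differ in detail.

```python
def filter_redundant_parents(term_to_fg_ids, ancestors):
    """Drop ancestor terms that carry the same Foreground proteins as a descendant.

    term_to_fg_ids: dict mapping a term to the set of Foreground IDs it covers.
    ancestors: dict mapping a term to the set of all its ancestor terms.
    """
    redundant = set()
    for term, fg_ids in term_to_fg_ids.items():
        for ancestor in ancestors.get(term, ()):
            # Only remove the ancestor if the protein sets are identical.
            if term_to_fg_ids.get(ancestor) == fg_ids:
                redundant.add(ancestor)
    return {t: ids for t, ids in term_to_fg_ids.items() if t not in redundant}


# Toy hierarchy: root -> mid -> leaf.
terms = {
    "root": {"A", "B", "C"},
    "mid": {"A", "B"},
    "leaf": {"A", "B"},
}
ancestors = {"leaf": {"mid", "root"}, "mid": {"root"}}

print(sorted(filter_redundant_parents(terms, ancestors)))
```

Here "mid" is removed because "leaf" covers the same proteins, while "root" is kept because it is associated with an additional protein.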


Results

The results can be displayed in a compact and a comprehensive form, or downloaded as tab-delimited text files.

Column definitions

rank: The rank is a combination of uncorrected p value and effect size (based on s value). It serves to highlight the most interesting results and tries to emphasize the importance of the effect size.
term: A unique identifier for a specific functional category.
description: A short description (or title) of a functional term.
p value corrected: p value without multiple testing correction, stemming from either Fisher's exact test or Kolmogorov Smirnov test (only for "Gene Ontology Cellular Component TEXTMINING", "Brenda Tissue Ontology", and "Disease Ontology" since these are based on a continuous score from text mining rather than a binary classification).
effect size: We chose the difference in proportions (positive associations divided by all associations) of the Foreground and the Background as the effect size. For the functional categories of "Gene Ontology Cellular Component TEXTMINING", "Brenda Tissue Ontology", and "Disease Ontology" we are reporting the Kolmogorov Smirnov distance instead. The values range from -1 to 1.
year: Year of the scientific publication.
over under: Overrepresented (o) or underrepresented (u).
s value: The s value is a combination of (minus log) p value and effect size.
ratio in FG: The ratio in the ForeGround is calculated by dividing the number of positive associations for a given term by the number of input proteins (protein groups) for the Foreground.
ratio in BG: The ratio in the BackGround is analogous to the above ratio in the FG, using the associations for the background and Background input proteins instead.
FG count: The ForeGround count consists of the number of all positive associations for the given term (i.e. how many proteins are associated with the given term).
FG n: ForeGround n is the number of input proteins for the Foreground.
BG count: The BackGround count is analogous to the "FG count" for the Background.
BG n: BackGround n is analogous to "FG n".
FG IDs: ForeGround IDentifierS are semicolon-separated protein identifiers of the Foreground that are associated with the given term.
BG IDs: BackGround IDentifierS are analogous to "FG IDs" for the Background.
etype: Short for "Entity type", a numeric internal identifier for the different functional categories.
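The count-based columns above relate to each other as in the following sketch, which derives the two ratios, the effect size, and a one-sided Fisher's exact p value for over-representation from "FG count", "FG n", "BG count", and "BG n". The layout of the contingency table (Foreground and Background treated as two separate groups) and the function name enrichment_stats are assumptions for illustration; the s value and rank are left out because their exact formulas are not given here.

```python
from math import comb


def enrichment_stats(fg_count, fg_n, bg_count, bg_n):
    """Return (ratio_in_fg, ratio_in_bg, effect_size, p_value)."""
    ratio_in_fg = fg_count / fg_n
    ratio_in_bg = bg_count / bg_n
    effect_size = ratio_in_fg - ratio_in_bg  # difference in proportions

    # One-sided Fisher's exact test (over-representation), assuming the table
    #   [[fg_count, fg_n - fg_count], [bg_count, bg_n - bg_count]].
    total = fg_n + bg_n
    with_term = fg_count + bg_count
    upper = min(fg_n, with_term)
    favourable = sum(
        comb(with_term, k) * comb(total - with_term, fg_n - k)
        for k in range(fg_count, upper + 1)
    )
    p_value = favourable / comb(total, fg_n)
    return ratio_in_fg, ratio_in_bg, effect_size, p_value


# Toy numbers: 3 of 4 Foreground proteins and 1 of 6 Background proteins
# are associated with the term.
print(enrichment_stats(fg_count=3, fg_n=4, bg_count=1, bg_n=6))
```

A positive effect size goes together with over-representation in the Foreground, a negative one with under-representation.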

Entity types

entity type   description
-20           GO cellular component TEXTMINING
-21           GO biological process
-22           GO cellular component
-23           GO molecular function
-51           UniProt keyword
-52           KEGG
-54           InterPro
-55           Pfam
-56           PMID
-57           Reactome
-58           WikiPathways