Frequently asked questions

What is "aGOtool"?

It is a Gene Ontology enrichment tool. This site can be used for functional annotation enrichment for proteomics data. It contains various enrichment methods, one of them is called "abundance_correction", which is specifically catered towards Post Translationally Modified proteins. You can perform functional enrichment for GO (molecular function, biological process, cellular component), UniProt keywords, KEGG pathways, PubMed publications, Reactome, Wiki Pathways, Interpro domains, PFAM domains, Brenda Tissues, and Disaeses.

What is enrichment analysis?

Enrichment analysis is frequently used to examine "-omics" data sets for enriched functional terms in a subset of the data set, such as regulated genes or modified proteins. It involves a statistical test to find significant differences in the frequency of GO-terms associated with e.g. modified proteins relative to their frequency in the genome.

Why use aGOtool?

There are many GO enrichment tools freely available, so why would I want to use aGOtool instead? aGOtool stands out, because it was specifically designed for the abundance bias in Mass Spectrometry (MS) based proteomics data. This bias exists, since proteins can't be amplified, and abundant proteins are more likely to be detected by MS. This effect is even more pronounced when investigating post-translationally-modified (PTM) proteins, due to the fact that site occupancy is usually not as high as 100%. Meaning, that not every single copy of a given protein will have the PTM of interest at the same location. Therefore, it will be easier to detect PTMs on abundant proteins compared to low abundant proteins.

A GO-enrichment analyses comparing post-translationally modified proteins to unmodified proteins is likely to reveal enriched GO-terms associated with abundant proteins, rather than modified proteins specifically. We have developed a method to correct for this bias (see publication or explanation below) and this freely accessible web-tool, which you are most welcome to use.

This will result in

a comparison to an appropriate control group
increased specificity
fewer significantly enriched, but more accurate and biologically meaningful terms

How can I use aGOtool?

You need to provide a list of UniProt protein accession numbers for the foreground (e.g. a list of proteins you've identified in an experiment having a certain PTM), and a list of accession numbers including their abundance values for the background (e.g. a reference proteome with "normal" conditions).

What is the "foreground" and "background" proteome?

The foreground consists of proteins you are interested in, e.g. proteins with a specific PTM, or upregulated proteins due to a treatment. The foreground is characterized by comparing it to the background. The background is what you compare against. The foreground is a subset of the background, in case you are using the abundance_correction method.

How does the "abundance_correction" method work?

We assume that you have abundance values for the background proteome and that the foreground is a proper subset of the background. For the background proteome you're asked to provide a list of protein identifiers (such as a list of UniProt accession numbers, UniProt entry names or STRING identifiers) and corresponding abundance values. For the foreground only protein identifiers but no abundance values are used. Instead of comparing the foreground to the entire proteome, it is instead compared to the (experimentally measured) background proteome. Furtheremore, a correction for the abundance range is incorporated into the analysis as follows.
The abundance values of the foreground are assumed to be the same as the background. Based on the abundance values of the foreground proteome 100 bins are created (background proteins outside of that range are disregarded). For each bin a correction factor $w_i$ is calculated by dividing the number of foreground proteins $n_{fi}$ by the number of background proteins $n_{bi}$ occurring within that bin. $$w_i = {n_{fi} \over n_{bi} } $$
For each GO-term a contingency table is constructed in order to calculate Fisher’s exact test. Where for a given GO-term

$a$ is the number of all positively associated proteins (with that GO-term) of the foreground,
$b$ is the number of all non-associated proteins of the foreground,
$c$ is the number of all positively associated proteins of the background, and
$d$ is the number of all non-associated proteins of the background.

	foreground	background
+	$a$	$c$
-	$b$	$d$

Let $hasGoTerm(p)$ be defined as $$hasGOterm(p) = \begin{cases} 1, \text{if p has the GOterm of interest} \over 0, \text{otherwise} \end{cases}$$ and $hasnotGOterm(p)$ be defined as $$hasnotGOterm(p) = \begin{cases} 1, \text{if p has not the GOterm of interest} \over 0, \text{otherwise} \end{cases}$$ and $prot_i$ be the set of proteins in the $i^{th}$ bin.

Thus the four values ($a$, $b$, $c$, $d$) for the Fisher’s exact test are calculated as follows: $$a = {\sum_{i=1}^{i=100} \sum_{p∈prot_i} hasGOterm(p)}$$ $$b = {\sum_{i=1}^{i=100} \sum_{p∈prot_i} hasnotGOterm(p)}$$ $$c = {\sum_{i=1}^{i=100} w_i \sum_{p∈prot_i} hasGOterm(p)}$$ $$d = {\sum_{i=1}^{i=100} w_i \sum_{p∈prot_i} hasnotGOterm(p)}$$ Notice that $a$ and $b$ are calculated in the conventional way, in contrast to $c$ and $d$, where $w_i$ is introduced in order to match the distribution of the foreground.

In short, $a$ and $b$ are calculated by simply counting (summing) the associations for all the proteins in the foreground. While $c$ and $d$ are calculated by summing the appropriate correction-factors.

I only have a foreground proteome, can I still use aGOtool?

Yes! You can select the "genome" method and thereby chose a reference background proteome appropriate for your organism.

Why is "background n" equal to "foreground n", specially since I've uploaded more Accessions for the background than the foreground?

This is due to the abundance correction method, which aims to compare same sized groups. Therefore the associations (e.g. GO-terms) of the background downscaled to reflect the number of associations of the foreground (see correction-factor "How does aGOtool work?" above).

Why is "percent associated background" equal to Zero?

This can occur when the above described correction-factor is very small, and the summed correction-factors are between 0 and 1, and are subsequently rounded down to zero instead of up to one. e.g. a bin containing many background proteins, but only few foreground proteins will lead to a small correction-factor.

I still have questions...

Feel free to write me an email [email protected]

	foreground	background
+	\(a\)	\(c\)
-	\(b\)	\(d\)