Algorithm and format requirements
SDPfox is a software package for predicting functional specificity groups
and the amino acid residues that determine specificity, using a multiple protein alignment (MPA).
Many protein families contain homologous proteins that have a common biological function but different specificity towards substrates, ligands, effectors, DNA, proteins and other interacting molecules, including other monomers of the same protein. All these interactions must be highly specific. Our aim is to find groups of proteins with the same specificity (specificity groups) and the amino acid residues that determine this specificity.
Amino acid residues that determine differences in protein functional specificity and account for correct recognition of interaction partners are usually thought to correspond to those positions of a multiple protein alignment where the distribution of amino acids is closely associated with the grouping of proteins by specificity. SDPfox searches for a division into specificity groups and for positions that are well conserved within these groups but differ between them. These positions are called SDPs (specificity-determining positions).
SDPfox includes the following interconnected procedures: SDPlight to predict SDPs; SDPprofile to assign specificity to unannotated proteins; SDPgroup to split a family into specificity groups starting from a training sample (a small number of proteins from the family for which the specificity is known); and SDPclust to construct a cluster tree of protein specificity. SDPlight, SDPprofile and SDPgroup are available on the SDPfox web server; all SDPfox methods are implemented in the stand-alone console program SDPfox.
Consider a multiple protein sequence alignment. The proteins are divided into N specificity
groups, numbered by i=1,...,N. The goal is to identify columns (positions) in the alignment
in which the amino acid distribution is closely associated with the grouping by specificity.
This association in column p of the alignment is measured by the mutual information Ip.
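As a minimal sketch of this measure, the snippet below computes the textbook mutual information between residue identity and group label for a single alignment column, using raw (unsmoothed) frequencies. The function name `column_mutual_information` and the toy columns are illustrative assumptions, not part of SDPfox, and gap handling is omitted.

```python
import math
from collections import Counter


def column_mutual_information(column, groups):
    """Mutual information between residue and group label for one column.

    column: residues at one alignment position, one per sequence.
    groups: specificity-group labels, parallel to `column`.
    Raw frequencies are used here; no smoothing is applied (assumption).
    """
    total = len(column)
    joint = Counter(zip(column, groups))   # counts of (residue, group) pairs
    residue_counts = Counter(column)       # counts of each residue in the column
    group_counts = Counter(groups)         # sizes of the groups
    mi = 0.0
    for (res, grp), n in joint.items():
        f_joint = n / total
        f_res = residue_counts[res] / total
        f_grp = group_counts[grp] / total
        mi += f_joint * math.log(f_joint / (f_res * f_grp))
    return mi


# Hypothetical toy data: four sequences in two groups.
groups = ["A", "A", "B", "B"]
mi_sdp_like = column_mutual_information("KKEE", groups)   # residue tracks the group
mi_conserved = column_mutual_information("GGGG", groups)  # fully conserved column
```

A column whose residues follow the grouping exactly gives the largest value for that grouping, while a fully conserved column carries no information about the group.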
To address the facts that frequencies are calculated based on a small sample, and that substitutions to amino
acids with similar physical properties should be weakly penalized, the observed amino acid frequencies are modified.
Instead of using the raw frequencies f(α,i) = n(α,i)/n(i), where
n(α,i) is the number of occurrences of residue α in group i and
n(i) is the size of group i (here i is a single group or
the whole alignment), SDPlight uses smoothed frequencies.
To calculate the statistical significance of the obtained values of
Ip, the mean and variance of the mutual information (M(Ip)
and D(Ip)) are calculated theoretically. To offset the background similarity of proteins,
which is higher within groups than between groups, we calculate the expected mutual information for column p as
Iexp = aM(Ip) + b, where a and b do not depend on the position, i.e.
are the same for every position of the alignment.
Then, Z-scores are calculated for every position.
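The following sketch illustrates one way this step could look. It rests on two assumptions that the text does not confirm: that a and b are obtained by an ordinary least-squares fit of the observed values against M(Ip), and that the Z-score has the usual form (Ip - Iexp) / sqrt(D(Ip)). The function names and the numbers are hypothetical.

```python
import math


def fit_linear(xs, ys):
    """Ordinary least-squares fit ys ~ a * xs + b (assumed fitting rule)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def z_scores(observed, expected_mean, variance):
    """Z-score per position, assuming Z = (I - Iexp) / sqrt(D)."""
    a, b = fit_linear(expected_mean, observed)
    return [
        (i_obs - (a * m + b)) / math.sqrt(d)
        for i_obs, m, d in zip(observed, expected_mean, variance)
    ]


# Hypothetical per-position values: observed Ip, M(Ip) and D(Ip).
observed = [0.10, 0.20, 0.30, 0.90]
means = [0.05, 0.10, 0.15, 0.20]
variances = [0.01, 0.01, 0.01, 0.01]
scores = z_scores(observed, means, variances)
```

Because a and b are shared by all positions, the fit removes the alignment-wide trend and leaves position-specific deviations to be scored.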
Given a series of Z-scores corresponding to every position
of the multiple alignment, one needs to evaluate their significance in order to tell whether an
observed Z-score is sufficiently high to indicate an SDP. SDPpred uses an automated procedure for setting the
threshold, based on the computation of the Bernoulli estimator. The observed Z-scores are sorted in
decreasing order before the threshold is defined.
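A minimal sketch of how a Bernoulli-type cutoff could be computed is given below. It assumes that each Z-score is compared against a standard normal tail, and that the cutoff is the rank k for which observing at least k scores as high as the k-th largest one is least probable under independence. Both assumptions, and the names `normal_tail`, `binomial_tail` and `choose_cutoff`, are illustrative and not taken from the SDPfox description.

```python
import math


def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def binomial_tail(n, k, q):
    """P(at least k successes in n independent trials with success probability q)."""
    return sum(math.comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k, n + 1))


def choose_cutoff(z_scores):
    """Return (k, top-k scores) minimising the binomial tail probability (assumed rule)."""
    ordered = sorted(z_scores, reverse=True)
    n = len(ordered)
    best_k, best_p = 0, 1.0
    for k in range(1, n + 1):
        p = binomial_tail(n, k, normal_tail(ordered[k - 1]))
        if p < best_p:
            best_k, best_p = k, p
    return best_k, ordered[:best_k]


# Hypothetical Z-scores: three clear outliers among background values.
example_scores = [0.1, 8.0, -0.2, 0.3, 7.5, -0.5, 0.0, 7.0, 0.4, -0.1]
cutoff, selected = choose_cutoff(example_scores)
```

With this rule the number of reported positions follows from the data rather than from a fixed Z-score threshold.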
To predict the specificity of other proteins, we calculate a profile matrix based on the SDPs for each group.
Then, for each protein, we calculate profile weights for all specificity groups
and select the group providing the maximal weight. One can then assume that the protein has the specificity of the group with the maximal weight.
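The assignment step can be illustrated with a simple stand-in profile. The sketch below assumes that a profile entry is the logarithm of a pseudocount-smoothed residue frequency at each SDP, and that a protein's weight for a group is the sum of these entries over the SDPs. The actual SDPprofile matrix may be defined differently; all names and sequences here are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(group_seqs, sdp_positions, pseudocount=1.0):
    """One dict of log-frequencies per SDP column (assumed profile definition)."""
    profile = []
    total = len(group_seqs) + pseudocount * len(ALPHABET)
    for pos in sdp_positions:
        counts = Counter(seq[pos] for seq in group_seqs)
        profile.append(
            {aa: math.log((counts[aa] + pseudocount) / total) for aa in ALPHABET}
        )
    return profile


def profile_weight(seq, profile, sdp_positions):
    """Sum of profile scores at the SDPs; gaps and unknown symbols get the column minimum."""
    return sum(
        col.get(seq[pos], min(col.values()))
        for col, pos in zip(profile, sdp_positions)
    )


def assign_group(seq, profiles, sdp_positions):
    """Pick the group whose profile gives the maximal weight."""
    return max(profiles, key=lambda g: profile_weight(seq, profiles[g], sdp_positions))


# Hypothetical aligned sequences with SDPs at columns 1 and 3 (0-based).
sdps = [1, 3]
training = {"A": ["MKTA", "MKSA"], "B": ["METG", "MESG"]}
profiles = {name: build_profile(seqs, sdps) for name, seqs in training.items()}
predicted_a = assign_group("MKQA", profiles, sdps)
predicted_b = assign_group("MEQG", profiles, sdps)
```

Only the SDP columns contribute to the weight, so differences elsewhere in the sequence do not affect the assignment.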
SDPgroup is an iterative procedure for splitting a family into specificity groups using a training sample.
SDPclust is a stochastic procedure for constructing a specificity cluster tree.
The only information needed for the prediction of SDPs is a multiple alignment of protein sequences divided into
specificity groups. The aligned sequences should be in GDE or FASTA format. Amino acids may be written in
lower-case or upper-case letters. Gaps should be indicated by dots or hyphens. The name of each sequence is placed on a
separate line that begins with '%' or '>'.
The alignment may be manually edited in order to define the specificity groups. The groups may be
separated by lines that begin and end with the "equals" sign and contain the name of the following group.
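To make the format rules concrete, here is a hypothetical grouped alignment and a minimal parser for it. The group names, sequence names and the exact number of "equals" signs in the separator lines are invented for illustration. The parser applies only the rules stated above: name lines start with '%' or '>', gaps are dots or hyphens, and residue case is ignored.

```python
def parse_grouped_alignment(text):
    """Parse a grouped alignment into {group: {sequence_name: aligned_sequence}}."""
    groups = {}
    current_group = None
    current_name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("=") and line.endswith("="):
            # Separator line: the text between the equals signs names the next group.
            current_group = line.strip("= ")
            groups.setdefault(current_group, {})
            current_name = None
        elif line[0] in "%>":
            # Sequence name line (GDE uses '%', FASTA uses '>').
            if current_group is None:
                current_group = "ungrouped"  # hypothetical label for unannotated input
                groups.setdefault(current_group, {})
            current_name = line[1:].strip()
            groups[current_group][current_name] = ""
        elif current_name is not None:
            # Sequence data: normalise case and use '-' for every gap.
            groups[current_group][current_name] += line.replace(".", "-").upper()
    return groups


# Hypothetical input mixing both name markers and both gap symbols.
example = """\
=== group_one ===
>seq1
mkt.a
>seq2
MKS-A
=== group_two ===
%seq3
MET-G
"""
parsed = parse_grouped_alignment(example)
```

Normalising case and gap symbols on input keeps later column-wise comparisons independent of how the file was written.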