Assume that we have n observations (for example, calculated profile score in different positions of a sequence) producing values {rk}. Fixing a threshold t one selects kt values. The following question arises: “what is better – to select small number of scores with a high threshold or to select many scores with moderate threshold?” To answer this question one can calculate the probability Pt(kt) that at least k values from the set {ri} exceed the given value t to select the threshold providing the minimum to the probability . The motivation here is as follow, biologically significant values (e.g. site scores) should be “nonrandom”. We select a threshold t that provides the highest degree of “non-randomness”. The obtained minimal value of Pt(kt) can be used as the p-value for the threshold t .
Clearly it is not necessary to scan all possible threshold values. The best threshold necessarily is one of the observed values {ri}. To minimize Pt(kt), sort the set of observed values by decrease. Then scan the observed values and for every k consider the threshold tk=rk and calculate the probability Pk=P(rk). The probability pk that exactly k out of n scores exceed rk is given by the Bernoulli distribution:
The probability Pk that at least k values exceed rk is obtained by summation of the above probabilities:
(1)
Scanning over all possible values k produces the minimum for Pk and defines the threshold :(2)
It is well known (Balakrishnan and Cohen, 1991) that if {rk} are instances of the same random variable, then the values Pk, P*, k* have the following properties: Pk is uniformly distributed in the interval [0,1]; k* is uniformly distributed in the interval [1,n]. These distributions do not depend on the distribution of the source random variable r. The real objects (e.g. the real binding sites) are not random and their scores do not follow the same distribution as the source random variable r. Hence if the data set contains real objects, the values Pk, P*, k* will not have the above properties of rank statistics and can be used as indicators that allow one to separate the real data with non-random high scores from random noise. The traditional approaches based on likelihoods or confidence probability are local and do not take into account the complete set of data. On the contrary, our approach is global and involves analysis of the entire dataset. For example, suppose that the dataset {rk} contains 15 values, ten of which have the significance pk=0.1. In this case the local approach provides week significance 0.1 while our approach produces the significance 1.9e-7. On the other hand, if only one observation out of fifteen has significance 0.1 our approach will give significance of the dataset 0.79.(3)
Here fαk is the observed frequency character α in column k, fα is the background frequency. To apply the rank statistic technique for threshold selection one need the background probability distribution of the information content. This distribution can be calculated in some simple cases (for example if the character probabilities are uniform), but in the general case it is unknown. In a general case this probability can be obtained using the Monte-Carlo generation of random columns. We can assume that the distribution for is normal or exponential.Here we describe application of the rank statistics to select appropriate (site-containing) sequences in the MEME setting. The algorithm identifies the highest-scoring hits of the current profile in each sequence. Then using the selected sites, it reconstructs the profile. At that point our technique can be used to retrain only significant sites. We use the current profile score as a measure of the site quality. We assume that the profile score has the normal distribution (profile score is sum of independent random variables, positional nucleotide weights). Hence, the probability that the score exceeds a given value can be determined. Using the rank statistics, we can select sites that should be included in the profile. The modified MEME algorithm will have the following iterations:
Andrey Mironov |
Department of Department of Bioengineering and Bioinformatics, MSU, Moscow, Russia |