Mutual Information

The program calculates the correlation between each pair of columns [i, j], where column i comes from the protein alignment and column j from the site alignment. Mutual information is used as the measure of correlation:

$$I(i, j) = \sum_{x \in X} \sum_{y \in Y} f_{obs}(x_i, y_j) \, \log \frac{f_{obs}(x_i, y_j)}{f_{exp}(x_i, y_j)}$$

$X$, $Y$: the sets of 20 amino acids and 4 nucleotides, respectively.
$f_{obs}(x_i, y_j)$: observed frequency of amino acid x being in position i and nucleotide y being in position j.
$f_{exp}(x_i, y_j)$: expected frequency under the hypothesis of no correlation between the columns, calculated as the frequency of amino acid x in column i multiplied by the frequency of nucleotide y in column j.

A large difference between the observed and the expected distributions is reflected in a high mutual information value.
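The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name `mutual_information`, the use of the natural logarithm, and the representation of a column as a string with one character per aligned sequence are all assumptions, and gap characters are treated like any other symbol.

```python
from collections import Counter
from math import log


def mutual_information(protein_col, site_col):
    """Mutual information between two alignment columns of equal length.

    Row k of protein_col and row k of site_col are assumed to belong
    to the same protein/site pair (hypothetical input layout).
    """
    if len(protein_col) != len(site_col):
        raise ValueError("columns must have the same number of rows")
    n = len(protein_col)
    if n == 0:
        raise ValueError("columns must not be empty")

    pair_counts = Counter(zip(protein_col, site_col))  # joint counts
    aa_counts = Counter(protein_col)                   # amino acid counts, column i
    nt_counts = Counter(site_col)                      # nucleotide counts, column j

    mi = 0.0
    # Pairs that never occur have observed frequency 0 and contribute nothing.
    for (x, y), count in pair_counts.items():
        f_obs = count / n
        f_exp = (aa_counts[x] / n) * (nt_counts[y] / n)
        mi += f_obs * log(f_obs / f_exp)  # natural log is an assumption
    return mi
```

For example, `mutual_information("AAKK", "CCGG")` gives ln 2 (the columns determine each other completely), while `mutual_information("AAKK", "CGCG")` gives 0 (observed and expected frequencies coincide).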

Statistical significance

To determine which values of mutual information are high enough to be non-random, statistical significance is calculated as a Z-score:

$$Z(i, j) = \frac{I(i, j) - \mu}{\sigma}$$

$I(i, j)$ is the mutual information of the [i, j] pair of columns.
$D$ is the distribution of the mutual information for random (non-correlated) pairs of columns with the same amino acid and nucleotide compositions as in the [i, j] pair.
$\mu$ and $\sigma$ are the mean and the standard deviation of $D$, respectively.
To estimate $D$, 10,000 random pairs of columns are generated and mutual information values are calculated for them.
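The significance estimate can be sketched as follows. The text does not say how the random pairs are generated, so this example assumes one simple option: shuffling the nucleotide column, which breaks any pairing between rows while keeping both compositions exactly the same. The names `z_score`, `n_random` and `seed` are hypothetical, and `mutual_information` is an illustrative helper using the natural logarithm, not the program's own routine.

```python
import random
from collections import Counter
from math import log
from statistics import mean, stdev


def mutual_information(protein_col, site_col):
    """Mutual information between two equal-length columns (natural log)."""
    n = len(protein_col)
    pairs = Counter(zip(protein_col, site_col))
    aa, nt = Counter(protein_col), Counter(site_col)
    return sum(
        (c / n) * log((c / n) / ((aa[x] / n) * (nt[y] / n)))
        for (x, y), c in pairs.items()
    )


def z_score(protein_col, site_col, n_random=10_000, seed=None):
    """Z-score of the observed mutual information against shuffled columns.

    Assumption: the random (non-correlated) pairs are produced by
    permuting the rows of the nucleotide column. n_random must be >= 2.
    """
    rng = random.Random(seed)
    observed = mutual_information(protein_col, site_col)

    shuffled = list(site_col)
    null_values = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # same composition, random pairing
        null_values.append(mutual_information(protein_col, shuffled))

    mu = mean(null_values)
    sigma = stdev(null_values)
    if sigma == 0:
        # Happens e.g. when one column is fully conserved; Z is undefined.
        return float("nan")
    return (observed - mu) / sigma
```

Calling `z_score("A" * 10 + "K" * 10, "C" * 10 + "G" * 10, n_random=500, seed=0)` on two perfectly coupled columns yields a large positive value, whereas a fully conserved column gives `nan` because every shuffle produces the same mutual information.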