I was trying to rewrite (badly needed) my M.Sc. thesis, and had to go traipsing through the paper trail behind information content calculations.

It turns out it is calculated as -log(P(Concept)) (as per this paper).

Here P(x) is the probability of x.

And in the case of the nltk.wordnet package, the probability is calculated directly from a given corpus with the straightforward n(Concept) / N(Concepts in corpus).

n(Concept): number of occurrences of the Concept in the corpus.

N(Concepts in corpus): size of the corpus in terms of Concepts/synsets.
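To make the formula concrete, here is a minimal sketch of the calculation on a toy count table. The function name `information_content` and the counts are made up for illustration, and the sketch only implements the plain ratio described above, not whatever else nltk does internally. It also assumes the natural log; a different base would only rescale the numbers.

```python
import math
from collections import Counter


def information_content(concept, counts):
    """Return -log(n(concept) / N) for a table of concept counts."""
    total = sum(counts.values())       # N: size of the corpus in concepts
    n = counts[concept]                # n: occurrences of this concept
    if n == 0:
        raise ValueError(f"{concept!r} never occurs, so its IC is undefined")
    return -math.log(n / total)        # natural log is an assumption here


# Hypothetical counts, purely for illustration.
toy_counts = Counter({"dog": 3, "cat": 1})

for concept in toy_counts:
    print(concept, round(information_content(concept, toy_counts), 4))
```

The rarer concept ("cat") comes out with the higher information content, which is the whole point of the measure.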

Now, my innately (insanely?) curious brain goes off: why those specific choices, i.e. the negative log and the probability calculation?

1. The -ve is mostly convenience. Since all probabilities are at most 1, the log (to base 2, which I presume is the case here, though the same holds for any base greater than 1) will never be positive, so a -ve makes sense: it flips the result into a non-negative number.

2. Log: now here's the interesting part. A log is essentially the inverse of an exponential function. Exponential functions turn additive differences into multiplicative ones, so a log does the reverse and turns ratios into differences. So in effect, two probabilities that differ by the same factor end up the same distance apart after the log, whether both are tiny or both are large, and a concept that is ten times rarer always gains the same fixed amount of information content.

3. P(x): this one's rather straightforward. As it is a simple ratio, it gives a good idea of which concepts are used most often in a given corpus.
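Points 1 and 2 are easy to check numerically. The sketch below assumes base 2, as presumed above, and uses made-up probabilities; `neg_log2` and `gap` are just helper names for this illustration.

```python
import math


def neg_log2(p):
    """Negative base-2 log of a probability in (0, 1]."""
    return -math.log2(p)


def gap(p_common, p_rare):
    """How far apart two probabilities land after the -log2 transform."""
    return neg_log2(p_rare) - neg_log2(p_common)


# Point 1: the result is never negative for a valid probability.
for p in (1.0, 0.5, 0.05, 0.005):
    print(p, neg_log2(p))

# Point 2: halving the probability adds the same amount,
# no matter where on the scale the two values sit.
print(gap(0.4, 0.2))
print(gap(0.004, 0.002))
```

Both gaps come out as (approximately) one bit, even though the raw differences are 0.2 and 0.002.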