What is Shannon Information?

8 min read

Information theory started with Claude Shannon’s 1948 paper, where he was trying to solve the problem of how to communicate through electrical signals. Electrical signals have two letters. 1 or 0. The alphabet has 26 letters. There’s only so much you can say with two letters, or can you say more?

Let’s say that Alice is trying to talk to Bob through an electrical wire which can only transmit 1 or 0, and let’s also suppose that they communicate perfectly so that the channel is noiseless. We would like to send messages that are built from any of four symbols: {a,b,c,d}\{ a, b, c, d \} with the following probability distribution:

P(a)=1/2P(b)=1/8P(c)=1/4P(d)=1/8\begin{aligned} \mathcal{P}(a)=1 / 2 \\ \mathcal{P}(b)=1/ 8 \\ \mathcal{P}(c)=1 / 4 \\ \mathcal{P}(d)=1 / 8 \end{aligned}

The idea is to use a coding scheme:

a00,b01c10,d11a \rightarrow 00, \quad b \rightarrow 01\\ c \rightarrow 10, \quad d \rightarrow 11

This solves the problem, but if we have a non-uniform distribution as Alice and Bob do, or in the English alphabet (vowels are more common than consonants) than you might guess that this coding scheme is inefficient.

The expected length of this coding scheme is given by:

E[L]=12(2)+18(2)+14(2)+18(2)=2 bits\begin{aligned} \mathbb{E}[L] &=\frac{1}{2}(2)+\frac{1}{8}(2)+\frac{1}{4}(2)+\frac{1}{8}(2)\\ &= 2 \text{ bits} \end{aligned}

So the message


would be uniquely coded and decoded as

00 00 01 10 11 00 10 10 00 00 10 = 0000011011001010000010 = 22 digits

For example, a “smarter” coding scheme that is more efficient would be:

So the same above message


would be uniquely coded and decoded as

0 0 110 10 111 0 10 10 0 0 10 = 0011010111010100010 = 19 digits

and so the expected length is

E[L]=12(1)+18(3)+14(2)+18(3)=74 bits\begin{aligned} \mathbb{E}[L] &=\frac{1}{2}(1)+\frac{1}{8}(3)+\frac{1}{4}(2)+\frac{1}{8}(3)\\ &= \frac{7}{4} \text{ bits} \end{aligned}

The reason this coding scheme works so well is that since aa appears so often, we might as well assign aa its own binary string, and assign less likely symbols b,db, d longer binary strings.

A Measure of Information

We need a way to measure whether a symbol x{a,b,c,d}x \in \{a,b,c,d\}, in the most efficient coding scheme contains a short or long binary string. Recall that the least likely symbols would have the longest coding scheme.

We can do this by defining the information content as inversely proportional to the probability: (XX is a random variable)

i(x)1/P(x)i(x) \propto 1/\mathcal{P}(x)

Thus the least likely symbols would have the greatest amount of information. This isn’t a bad, but Claude Shannon says a better measure would be to take the base 2 logarithm:

i(x)log(1P(x))=log(P(x))i(x) \equiv \log \left(\frac{1}{\mathcal{P}(x)}\right)=-\log \left(\mathcal{P}(x)\right)

Which we define i(x)i(x) as the information content of xx. The reason we take the logarithm is so that if the source is memoryless, which means that it produces each symbol independently, the information content is additive:

i(x1x2)=log(P(x1x2))=log(P(x1)P(x2))\begin{aligned} i\left(x_{1} \cap x_{2}\right) &=-\log \left(\mathcal{P}\left(x_{1} \cap x_{2}\right)\right) \\ &=-\log \left(\mathcal{P}\left(x_{1}\right) \mathcal{P}\left(x_{2}\right)\right) \end{aligned}=log(P(x1))log(P(x2))=i(x1)+i(x2)\begin{aligned} &=-\log \left(\mathcal{P}\left(x_{1}\right)\right)-\log \left(\mathcal{P}\left(x_{2}\right)\right) \\ &=i\left(x_{1}\right)+i\left(x_{2}\right) \end{aligned}

The expected value of information content is given by:

H(P)=E[i(x)]=xP(x)i(x)=xP(x)log(P(x))\begin{aligned} H(\mathcal{P}) &= \mathbb{E}[i(x)]= \sum_{x} \mathcal{P}(x) i(x)\\ &=-\sum_{x} \mathcal{P}(x) \log \left(\mathcal{P}(x)\right) \end{aligned}

H(P)H(\mathcal{P}) is called the Shannon entropy of the information source. Some authors call this the amount of missing information, or the expected surprisal. Note that Shannon entropy depends only on the probability distribution.

If the distribution is continuous we simply do the oldest trick in the book, which is replace summation by integration:

H(P)=xP(x)log(P(x))H(\mathcal{P}) =-\int_{x} \mathcal{P}(x) \log \left(\mathcal{P}(x)\right)

Where we integrate over all xx.


Once you plug in a probability distribution, we can calculate a numerical value for the entropy. This numerical value has units of bits, if you use take the logarithm to base 2.

Okay now the important part. So from a probability distribution you can calculate the entropy, which gives a number in bits. What is a bit?

Let’s imagine the following game with just N=8N=8 coins:

  1. I set the coins to be in some configuration of heads and tails. Say I chose HTTHTHTH, but you don’t know that.

  2. You have to determine the minimum number of yes/no questions until you know what the configuration is.

How would you solve this? Well since every coin can be heads or tails, we could ask:

  1. Is the first coin heads?

  2. Is the second coin heads?

… and so on. A yes would mean heads, and a no would be tails. We can store the information as some string of ones and zeros, with 1 for heads, and 0 for tails.


In this case, we always need to ask 8 questions, as long as the coin is fair.

Let’s work backwards and try to figure out the probability distribution in this game. The possible configurations are the 2N2^N binary strings:

10000010 10101010 10000000…

and each configuration has 1/2N1/2^N chance of being drawn, so the probability distribution for 8 fair coins is:

P(x)=128\mathcal{P}(x) = \frac{1}{2^8}

The entropy is given by:

H(P)=xP(x)log(P(x))=28128log(128)=8 bits\begin{aligned} H(\mathcal{P}) &= -\sum_{x} \mathcal{P}(x) \log \left(\mathcal{P}(x)\right)\\ &= -2^8 \frac{1}{2^8} \log \left(\frac{1}{2^8}\right) = 8 \text{ bits} \end{aligned}

We see that the number of yes/no questions (8) is the number of bits measured by the entropy. This is why the entropy is sometimes called the missing information of a probability distribution, since it’s the number of yes/no questions to determine what the random variable xx is, given the probability distribution P(x)\mathcal{P}(x). In the game, xx corresponds to the binary strings, which is essentially a random variable we need to figure out.

For non-uniform distributions as in the introduction, since the number of yes/no questions is an integer (you can’t ask half a yes/no question) then the shannon entropy measures the average minimum yes/no questions it would take to figure out xx. This agrees with what we said about entropy being the expected value of information content.

In the introduction, the average length of the efficient coding scheme is just the shannon entropy. Thus, efficiency in the coding scheme corresponds to the minimum yes/no questions, which is just the Shannon entropy.

Information is not subjective

Consider these two sentences:

A: Each of the ten houses in the street costs one million dollars.

B: The first house in this street costs one million dollars, the second house costs one million dollars,… and the tenth house costs one million dollars.

Intuitively, it seems that A and B are saying the same “information”. The content of A and B are the same. However, the size of A is much less than the size of B.

You can go further, which C has more content, but less size than B:

C: Each of the houses in the country costs one million dollars.

The point is that Information theory does not care about the contents of the message. Information theory is only concerned with the size of the message, and nothing else. Thus, information content is not a subjective measure, but an objective one.

If I hid a coin in either my left or right hand, while it might seem that you, the player has less information on where the coin is, Information theory doesn’t answer the question: “Where is the coin?” Instead, it answers questions like: “How many binary questions would I need to ask before finding out where is the coin?” The answer to this exact question is the Shannon entropy, in bits.

Shannon Entropy of a Coin/Bernoulli Distribution

Given a probability distribution, we can compute its entropy:

Suppose that a probability distribution has the following values for p[0,1]p \in [0 , 1]:

P(x=H)=pP(x=T)=1p\mathcal{P}(x = H) = p \quad \mathcal{P}(x = T) = 1 - p

We can plug this probability distribution into Shannon’s entropy formula:

H(P)=plog(p)+(1p)log(1p)H(\mathcal{P}) = p\log(p) + (1-p)\log(1-p)

Which is plotted below:

Entropy of a coin. The entropy is the highest when the coin is fair.

If the coin is fair, the entropy is maximal, which means that we have the maximum missing information or maximum surprisal.

If the coin is unfair, the entropy decreases, until there is zero missing information or we are completely not surprised at all.


Given a probability distribution P(x)\mathcal{P}(x), we can calculate the Shannon entropy:

H(P)=xP(x)log(P(x))H(\mathcal{P}) =-\sum_{x} \mathcal{P}(x) \log \left(\mathcal{P}(x)\right)

The entropy only depends on the probability distribution P(x)\mathcal{P}(x) we put in, and has units of bits. It is interpreted as the average minimum yes or no questions needed to determine what a random variable xx can be.

Further Reading

If you are hungry for more, check out the information theory playlist by Art of the Problem:

Get The Best Of All Hands Delivered To Your Inbox

Subscribe to our newsletter and stay updated.

Copyright © 2020

Zhi Han