Introduction to Information Theory

copyright (c) 1999 by R. I. 'Scibor-Marchocki

This is a collections of e-mail letters, which I wrote to my very good friend Deb, as a set of lessons in Information Theory. I have deleted each salutation, to protect her privacy. If anybody shows interest, someday, I might edit these lessons into a better organized tutorial. So, please write me at webmaster@rism.com.


Subject functional

Consider two functions f(x) for x on the domain X and g(y) for y on the domain Y.

Then f is a set of ordered pairs (x, f(x)) and g is a set of ordered pairs (y, g(y)).

Let us suppose, further, that the range of f(x) is the space Y.

The composite function g(f(x)) for x on the domain X. Probably, we should write -- but we never do -- this composite function as g(y = f(x)), to emphasize that g is not a function of f. It is a function of the value of y=f(x).

On the other hand, a functional -- notice the suffix "al" -- is a function of a function. Thus the functional g(x, f(x)) for x on the domain X, is not to be considered as evaluated at a specific x. Instead, it has a value for the whole domain X. There would be a family of related functions f.

One way to construct a functional, is by means of an integral over the domain X. Thus the functional G(f(X)) is

G = the integral, over X, of g(x, f(x)) dx.

For brevity, we often write G(f) or G(X), depending upon which we want to emphasize.

The function f often will be a probability density distribution, p(x) for x on X.

Then, some functionals are

1 = the integral, over X, of p(x)*dx. This is the zeroth moment.

mean = the integral, over X, of x*p(x)*dx. This is the first moment.

variance (about the mean) = the integral, over X, of (x - mean)**2 * p(x) * dx. This is the second moment.

kurtosis = the integral, over X, of (x-mean)**3 * p(x) * dx. This is the third moment. Do you know this name, "kurtosis"?

In general, the n-th order moment is

I(n) = the integral, over X, of (x - mean)**n * p(x) * dx.

The moment generating function m(y) is

m(y) = the integral, over X, of exp((x - mean) * y) * p(x) * dx.

Then the n-th order derivative of m(y), with respect to y, evaluated at y=0, is the n-th order moment I(n). Did you know this theorem?

The Information Rate H(X) is

H(X) = - the integral, over X, of p(x) * ln(kappa * p(x)) * dx.

In Information Theory, we use the minus sign, as shown, to have the H(X) positive. However, in Thermodynamics, it is conventional to omit the minus sign. Thus H(X) is the neg-entropy.

You may find it interesting to take the specific example of the Gaussian probability density distribution

p(x) = (1 / ((sqrt(2 * pi) * sigma)) * exp((- 1 / 2) * ((x - mean) / sigma)**2)

for x on X = (-oo, oo).

Verify that the moment generating function, indeed, generates the moments. And verify that

H(X) = ln(sqrt(2 * pi) * epsilon * (sigma / kappa))

where epsilon, of course, it the base of the Naperian logarithms.

We had most of this, already; but, we did not emphasize that these are functionals.

We are sneaking into the formal development of Information Theory.


Subject pseudo-metrizable topology

Shortly before the first World War, the concept of a topology was formalized. In France, Lebesque invented what has become known as the Lebesque integral. In Germany, Stieltjes invented what has become known as the Stieltjes integral.

Most of this work was centered in France. Germany and Italy also contributed.

All scientific journals in Europe ceased publication until some time in early 1920's.

By 1925, when the scientific journals were being published again, the literature shows that the Stieltjes integral had been subsumed into the Lebesque integral -- by means of a mapping -- as the Lebesque-Stieltjes integral. Exactly how these two integrals became associated is lost to history and to mathematics.

Also, by 1925, it became obvious that a metric-geometry has a natural-topology. The question arose, what properties does a topology need posses to be metrizable; that is, for a metric to exist?

It took two-and-a-half decades to answer that question. In the early 1950's, Kelley (sorry, I do not remember the specific variant of the name) published a book with the title _General Topology_. Immediately, several other authors plagiarized this book. They even did not bother to change the title. Identical copies -- different authors. These books include the formal statement of the Pseudo-metrizable theorem.

Roughly speaking, the condition is that a measure exist.

In 1956, I took a two-semester class of Real Variables, under Alexandrovicz, at USC (= University of Southern California). The Mathematics Department had assigned a textbook; but, he employed his own notes, instead.

At the end of the year, I asked him, how come he had not told us to purchase the book -- published in 1955 -- _Measure and Integration_ by Munroe? He said that he never heard of it. That he had been using notes, which he had transcribed verbatim, from the lectures of his professor, in around 1950, in Wiede'n (Vienna), in Polish. And that that professor had likewise transcribed verbatim from his professor in Warszawa (Warsaw), in around 1925.

We compared. He selected at random in his notebook. He would read a page -- in Polish. I read the corresponding material from Munroe -- in English. We observed that it was a literal translation.

Now, the question arises Where did Munroe obtain his material? And why did he not credit its origin? Obviously, it must have come from that professor in Warszawa, in 1925. Could Munroe have been that professor, migrated to the USA?

This is the *only* book of Real Variables, which does not pre-suppose a metric. So, even though it is out-of-print, by now, I have to use it as the specific reference for Real Variables; that is, for the development of measure and integration.

Munroe shows that integration is actually just another measure.

Thus, we may say that the condition for a topology to be pseudo-metrizable is either that it be measurable or that it be integrable.


Subject Lebesque-Stietjes integration

When we say that a proposition is true ae (= almost everywhere), we mean that the proposition is true, except on a set of measure zero.

 

Each point of the Stieltjes portion of the Lebesque-Stieltjes integral is to be replaced by a Lebesque-integral over a suitable interval of a suitable function, such that its Lebesque-integral is equal to the aforementioned Stieltjes portion of the Lebesque-Stieltjes integral. This equivalence arose sometime during WWI. Who, what, when, or how, is lost because of the war.

 

Munroue shows that the Lebesque and the Rieman integrals are equal provided that they both exist.

Also, it is obvious that if two Lebesque-integrable functions f(x) and g(x) are equal ae on X, then their Lebesque-integrals are equal.

Thus, the usual strategy for the evaluation of the Lebesque integral of f(x) is

1. Make sure that f(x) in Lebesque-integrable on X.

2. Find a function g(x) such that

a. g(x) is Lebesque-integrable

b. g(x) = f(x) ae on X.

c. The Rieman integral of g(x) exists.

3. Since the Rieman integral is an anti-derivative, guess a function h(x), whose derivative is the function g(x).

4. Then h(x) is the indefinite Lebesque-integral of f(x).

 

The Rieman-Stieltjes integral is a denumerable-summation. If the summation is infinite, it is to be interpreted as the limit (if it exists) of the corresponding finite summation, as the upper limit increases without bound.

 

Thus, we have two strategies for the evaluation of the Stieltjes portion of a Lebesque-Stieltjes integral either convert it to a pure Lebesque-integral or convert it to a denumerable summation.

 

On the other hand, a denumerable summation may be converted to a Rieman-Stieltjes integral, which immediately is convertible to a Lebesque-Stieltjes integral. Which, in turn, may be reduced to a pure Lebesque-integral. It may be easier or appropriate to go directly from a denumerable summation to a Lebesque-Stieltjes integral, however.

 

With the aforementioned methods, we reduce any Lebesque-Stieltjes integral to a Riemann-integral, and, perhaps, a denumerable summation. Or we may reduce a denumerable summation to a Rieman-integral.

We have several choices. We do what appears easiest. Thus, we see that instead of being something formidable, the Lebesque-Stieltjes integrals actually make it much easier to evaluate various integrals or summations.


Subject The three measures of Information Theory

Most of Information Theory makes use of only the Lebesque-Stieltjes integral. Only near the end do we consider the metrization of the pseudo-metrizable topology.

Thus, one may follow Munroe and specify, as axioms, the existence of each of the three measures/integrals. Or else, one may postulate a pseudo-metrizable topology and then infer the same existence as a consequence.

 

We designate our space as X. Then, when we have some function f(x) defined for each x on X, we write the function as f(X).

 

The three measures of Information Theory.

First, we have the mu-measure u(X). It is the underlying measure, which supports the other integrals. However, it is only implicitly that this measure is involved. It stays hidden from view, except for the two theorems The invariance of the Information Rate H(X) and the mapping -- which we have discussed, already -- which carries any given probability-density distribution, on a flat space, to that of the Gaussian probability-density distribution.

Second, we have the probability-density measure p(X). The associated cumulative-probability distribution P(X) is the indefinite integral of p(X); that is, P(x) is the integral of p(x), with the upper-limit of integration being x. It is constrained to be P(oo) = the definite integral, over the whole space X, of p(x) = 1.

I emphasize that, in Information Theory, we do not specialize the probability distribution. Except, that -- as Boltzman showed more than a century ago -- any conservation law is associated with a probability distribution. Each implies the other, uniquely.

And, third, we state as the first axiom of Information Theory, the existence of the functional H(X).

The remaining three axioms of Information Theory will be next.


Subject invariance

Invariance is dear to the heart of a mathematician. An entity can be characterized by its invariance. One can construct a whole concept around its invariance. A book can be written.

Let us be more specific. We have two measures The underlying mu-measure of the space and a p-measure -- the probability-density distribution. It is the latter which describes the properties of the object, which is in the space.

Hence, it is easier for us to relate to the p-measure. Somehow, we would like to say that we are utilizing a property of the probability-density distribution. We would be willing enough to impose an invariance under a p-measure preserving mapping.

For those readers who are impatient, we remark that the Channel Rate R(X Y), indeed, is invariant under one-to-one mappings, which are p-measure preserving. And we will show how Channel Rate R(X, Y) can be axiomatized, directly. However, it is not easy to do so, without the experience of the prior axiomatization of the Information Rate H(x).

By that time, however, we already will have obtained the Channel Rate as a difference of three Information Rates.

So, we will impose the greater constraint of invariance under both the p-measure preservation and, in addition, the mu-measure preservation. Thus, our second axiom states that

The Information Rate H(X) is invariant under one-to-one mappings, which are both mu-measure preserving and p-measure preserving.

We observe that any mapping which preserves both the mu-measure and the p-measure, of necessity, preserves the p-measure. Thus, the Information Rate H(X) satisfies the axioms of the Channel Rate R(X, Y). Hence H(X) may be considered as a special case of the R(X, Y). But, again, this is only a curiosity; because, by that time, we will have studied both the Information Rate and the Channel Rate.

Another observation is that, in the discrete case, *any* one-to-one mapping, of necessity, preserves the mu-measure. Hence, in the discrete case, we would not need to mention the mu-measure preservation, explicitly, in the foregoing axiom. As a result, in the discrete case, the distinction between the Information Rate and the Channel Rate is blurred.

Two down; two to go. Continuity and addition remain..


Subject first look at the continuity axiom

The second axiom had been that of invariance. We said that H(X) does not change, if the p-measure (the probability-density distribution function) does not change.

Now, for the third axiom, we consider what happens if the probability-density distribution changes, a little.

Do we want H(X) to be a continuous function of the p(X)? The answer turns out to be "Yes." But, it is not obvious. We might have wanted either a less or a more stringent condition.

Furthermore, exactly what do we mean, when we say that the p(X) changes a little?

Clearly, we are not in a position to be able to formulate this axiom, precisely, at present. We will have to skip it for now; return to it after we learn more about the Information Rate H(X).

However, for now, we observe that if the p(X) changes, to say q(X), this q(X) may not satisfy the axiom of p(X)

1 = integral over X of p(x)

For a given fixed a strictly greater than zero, let us define q(x) = a * p(x) for x on X. Then

a = integral over X of q(x)

Rather than attempting to do something about this failure, let us investigate its consequence. While we have not proved it as yet, we do know that the Information Rate H(X) is

H(X) = - integral over X of p(x) * ln(kappa * p(x))

Let us perform the same integration using q(X) and call the result Ha(X)

Ha(X) = - integral over X of q(x) * ln(kappa * q(x))

Substitution gives us

= - integral over X of (a * p(x) * ln(kappa * a * p(x))

= - a * integral over X of p(x) * (ln(a) + ln(kappa * p(x)))

= - a * ln(a) * integral over X of p(x)

- a * integral over X of p(x) * ln(kappa * p(x))

Substitution of the known values of these integrals yields

= - a * ln(a) + a * H(X)

Solve for H(X), to obtain

H(X) = (Ha(X) + a * ln(a)) / a

Now, when we perturb the p(X) locally, we will not have to worry about the global consequence of this perturbation; because we have a method of computing the resulting H(X).

This is all we can do at this time, with this continuity axiom.


Subject addition axiom for the Information Rate

Given a *joint* probability-density distribution p(X, Y) on the Cartesian-product space XxY.

The *marginal* probability-density distribution p(X) is defined as

p(x) = integral over Y of p(x, y) for x in X.

Likewise, then, by symmetry of notation, the marginal probability-density distribution p(Y) is

p(y) = integral over X of p(x, y) for y in Y.

The *conditional* probability-density distribution p(y|x) for a fixed x on X, is defined as

p(y|x) = p(x, y) / p(x) for (x, y) on XxY.

And, likewise,

p(x|y) = p(x, y) / p(y) for (x, y) on XxY.

Each of these equations may be solved for the joint probability-density distribution, to yield

p(x, y) = p(y|x) * p(x) = p(x|y) * p(y) for (x, y) on XxY.

We define the *product* probability-density distribution as

prod-p(x, y) = p(x) * p(y) for (x, y) on XxY.

Observe that

1 = integral over XxY of p(x, y)

1 = integral over Y of p(y|x) for x on X

1 = integral over X of p(x|y) for y on Y

1 = integral over X of p(x)

1 = integral over Y of p(y)

1 = integral over XxY of prod-p(x, y)

 

Remember that the Information Rate H(X) is

H(X) = - integral over X of p(x) * ln(p(x)),

where I have left-out the kappa, for easier writing here. But, in a formal presentation, we have to account for the kappa.

Compute the H(Y|X) as

H(Y|x) = - integral over Y of p(y|x) * ln(p(y|x)) for x on X.

Compute the H(XxY) and substitute the conditional probability-distribution p(y|x)

H(XxY) = integral over XxY of (p(y|x) * p(x)) * (ln(p(y|x)) + ln(p(x)))

Evaluate this integral to yield

H(XxY) = H(X) + integral over X of H(Y|x) * p(x).

We take this last equation as our addition axiom for the Information Rate.

These are the four axioms for the Information Rate.

For completeness, we observe that the Information Rate of the product probability-density distribution is

H(X*Y) = H(X) + H(Y).


Subject The set of four axioms is equivalent to the postulate

We have presented the set of four axiom, upon which Information Theory may be based.

One may take any set of axioms/postulates for whatever one desires to study. One *only* requirement is that they be self-consistent.

Unfortunately, self-consistency is impossible to prove. The best one can do is to claim that one is not aware of any inconsistency.

There is a heuristic, which says that the fewer axioms/postulates in the set, and the less that they resemble each other, the greater the chances of self-consistency. Well, we have only four axioms, and they do not resemble each other. Thus, we are in fairly good shape.

We have an even better alternative. Instead of these four axioms, we may claim the

Postulate of Information Theory. The integral

H(X) = - integral over X of p(x) * ln(kappa * p(x))

exists.

There is little reason to doubt its existence. Especially, if we begin with a pseudo-metrizable topology, in which case, this is the third measure/integral, whose existence we would claim. Furthermore, there are numerous examples of the existence of such an integral -- for instance, that of the Gaussian probability-distribution, which we have presented, already.

And with only one postulate, we pretty well can guarantee the self-consistence of this set.

However, from a philosophical point of view, this postulate looks very ad-hoc. We can do better. We have the

Theorem The set of four axioms is equivalent to the postulate. The postulate implies the set of four axioms. And conversely, the set of four axioms implies the postulate, which is unique up to an arbitrary multiplicative constant.

The proof of the first half is easy.

The first axiom -- the existence of H(X) -- already is asserted in the postulate.

The second axiom -- the invariance of H(X) -- is obvious from the postulate.

The fourth axiom -- the addition of H(X) -- we had set-up from the postulate. Thus, we already have proven it.

The third axiom -- the continuity of H(X) -- we have not stated this axiom, precisely, as yet. However, we can prove more -- that the first derivative of H(X) exists -- fairly easily.

Given a fixed p(X), a fixed (small) interval i, of length i > 0, and a fixed (small, in magnitude) constant d != 0.

Let q(x) = (1 + d) * p(x), with x on i and q(x) = p(x), with x on the complement of i in X.

Then delta-a, by definition, is the integral over X of q(x) minus that of p(x). It is equal to d times the integral over i of p(x). By the mean-value theorem, it becomes

delta-a = d * i * mean of p(x) on i

This is not equal to zero and proportional to d * i.

The integral Ha(X) is equal to H(X) plus assorted integrals over i. Substitute into the formula new-H(X) = Ha(X) / a + ln(a). Then, delta-H(X), by definition, is

delta-H(X) = new-H(X) - H(X) = Ha(X) / a + ln(a) - H(X)

Each term has a factor of d. One term has a factor of the integral over i of p(x) * ln(p(x). By the mean-value theorem, this factor is i * mean over i of p(x) * ln(p(x)). Each of the other terms has a factor of the integral over i of p(x).

Then, in the quotient of delta-H(X) divided by delta-a, the factors d and i cancel. What remains is the sum of several constant terms, plus a constant times the quotient of the mean of p(x) * ln(p(x)) divided by the mean of p(x).

Thus, the limit, as the product d * i approaches zero, is bounded. This is the derivative of H(X). Since the derivative exists, the function H(X), of necessity, is continuous. QED.

We have finished the proof of the first part.

The original concept of the Stieltjes-integral, as well as how it had been subsumed into the Lebesque-integral is lost during the WWI. Modern textbooks gloss over this topic.

The H-integral does not fit the requirements of the subsumtion -- at least as stated in modern textbooks. Thus, I had to re-construct the Stieltjes integral, in its own right. Evaluate the Stieltjes-integral for H. Construct an appropriate mapping into a Lebesque-integral for H. As I remember, this exercise took about fifty type-written pages.

The derivation of the postulate from the axioms, as I remember, takes another fifty type-written pages -- in addition to employing the Stieltjes-integral. The proof proceeds in a manner analogous to that employed in the _Grundlagen der Analysis_, to develop real-numbers from the Peano axioms.

If I ever find my original notes, I will present this proof, in detail.


Last time, I should have provided the final formula for the derivative of H(X) with respect to a. It is

(- the mean over i of (p(x) * ln(p(x)))) / (the mean over i of p(x)) - H(X)

The first term represents the direct contribution of the perturbation to the p(X). The second term represents the effect of the subsequent requirement to adjust the p(X) to continue to satisfy the

1 = integral over X of p(x)


 

 


 

 

 



webmaster@rism.com
Last modified on Tuesday 10-th August 1999
Copyright (c) 1999 by R. I. 'Scibor-Marchocki