copyright (c) 1999 by R. I. 'Scibor-Marchocki
This is a collections of e-mail letters, which I wrote to my very good friend
Deb, as a set of lessons in Linear Algebra. I have deleted each salutation, to
protect her privacy. If anybody shows interest, someday, I might edit these
lessons into a better organized tutorial. So, please write me at
webmaster@rism.com.
There are two names for this probability distribution normal and Gaussian. Since the word "normal" is over-used, I prefer to call it "Gaussian".
We use oo as the symbol for infinity.
Let f(x) = (1/sqrt(2 * pi)) * exp((- 1 / 2) * x**2)
Then, we have the normalized Gaussian probability integrals
#1 integral of f(x), with respect to x, from 0 to oo is 1 / 2.;
hence from -oo to oo is 1.
#2 integral of x * f(x) .. 1 / sqrt(2 * pi) and by symmetry 0.
#3 integral of x**2 * f(x) ... 1 / 2 and 1.
#4 integral of f(x) * f(y), with respect to y and x, each from 0 to oo is 1 / 4;
hence each from -oo to oo is 1.
Let g(x) = f(x / sigma).
Then, from the foregoing, we have Gaussian probability integrals
#21 integral of g(x), with respect to x, from 0 to oo is 1 / 2;
hence from -oo to oo is 1.
#22 integral of x * g(x) ... sigma / sqrt(2 * pi) and 0. This is the mean.
#23 integral of x**2 * g(x) ... sigma**2 / 2 and sigma**2. This is the variance.
#24 integral of g(x) * g(y), with respect to y and x, each from -oo to oo is 1.
Proofs
#2. Let u = x**2 / 2. Then du = x * dx. Substitution yields the integral of (1 / sqrt(2 * pi)) * exp(- u) du, from 0 to oo. The exponential integrates to one. Hence the value of the expression is 1 / sqrt(2 * pi). QED.
#4. Change to polar coordinates. We obtain the integral of (1 / (2 * pi)) * r * exp(- r**2 / 2) d-theta dr, theta from 0 to pi / 2 and r from 0 to oo. This integral separates into the product of (1 / (2 * pi)) times the integral of d-theta, from 0 to pi / 2 times the integral of r * exp(- r**2 / 2) dr, from 0 to oo. The first integral is pi / 2. For the second integral, again, let u = r**2 / 2. Then du = r * dr. Substitution yields the integral of exp(- u) du, from 0 to oo. It is one. -- just as we had in the proof of #2. Multiplication yields the desired result. QED.
#1. Let I be the required integral. Since the integral is strictly positive, the integral, obviously, is non-negative.. Then I**2 is the integral in #4. Taking the square-root, we obtain the desired result. QED.
#3. Use integration by parts u times the integral of dv equals u * v - the integral of v times du. Thus, let u = x and dv = x * exp(- x**2 / 2) dx. Then du = dx and v = - exp(- x**2 / 2). Thus, the integral becomes (1 / sqrt(2 * pi)) * [ - x * exp(- x**2 / 2) + the integral of exp(- x**2 / 2) dx], from 0 to oo. The first term is zero, at each bound. The second term, by #1, yields 1 / 2, as required. QED.
We have completed the proofs of the foregoing table.
We generalize #24, above. First to n factors, by Mathematical Induction, we have the integral of ((1 / sqrt(2 * pi))**n) * exp((- 1 / 2) * ((x1 / sigma1)**2 + (x2 / sigma2)**2 + ... + (xn / sigman)**2)), with respect to the volume, each x from -oo to oo. The result is one.
Now, given any positive-definite nxn matrix A and the row vectors x and mu -- its mean, we have the integral of ((1 / sqrt(2 * pi))**n) * sqrt(determinant of A) * exp((- 1 / 2) * (x - mu) * A * (x - mu)tr), with respect to the volume, each x from -oo to oo. The result is one. Proof Factor A into O * D * Otr. We know that the determinant of A is equal to that of D. And associate the ortho-normal matrix O with the vector (x - mu) as u = (x - mu) * O.. We have reduced the problem to that of the previous paragraph. QED.
Finally, write the previous integral as the integral of ((1 / sqrt(2 * pi))**n) (1 / sqrt(det(A))) * exp((- 1 / 2) * (x - mu) * (inverse of A) * (x - mu)tr), with respect to the volume, each x from -oo to oo. The result still is one. This A is the inverse of the previous one. It still is a positive-definite symmetric nxn matrix.
This matrix A is called, by statisticians, the variance-matrix-- sometimes, the co-variance matrix.
Next time, the correlation coefficient and linear regression.
The logical way, would be to derive all of Information Theory, first. Then to connect the results with the several other fields, for instance, with Statistics.
I am attempting to provide a short-cut between some of the results and Statistics. Without Information Theory, it is like walking a tight-rope, without a safety-net. But, I will try, anyhow.
We already have all of the tools, from last time. Only a few odds and ends remain. Then, we link the Gaussian probability distribution to the 3Ch of Information Theory, whose matrix we presented a two lessons ago.
Statisticians have a long tradition of employing the Gaussian probability. They do not have any good justification for doing so. But; hey, it works! Information Theory would provide the motivation, justification, conditions when applicable, and how to adapt/adopt a situation where the probability distribution is not Gaussian.
Information Theory would justify the hypothesis that the variance matrix is positive-definite symmetric. Here we just take it as the given hypothesis.
In the factorization of the 3Ch matrix, we employed only the positive symmetric portion of the hypothesis. Here, we use the definite portion of the positive-definite.
The matrix A = (1 s s 1) is positive-definite, by hypothesis. As written, it obviously is a symmetric 2x2 matrix. What restriction does being positive-definite impose upon the value of the elements s?
Compute the eigen-values from their definition 0 = det(A - xI). To this end,
A - xI = (1 s s 1) - x * (1 0 0 1) = (1 - x s s 1 - x).
Its determinant is
0 = (1-x)**2 - s**2 = x**2 - 2 * x + (1 - s**2).
By the quadratic formula, the roots are
(2 +- sqrt(2**2 - 4 * (1 - s**2))) / 2 = 1 +- s.
Impose the condition that the product of the eigen-values be positive-definite. Actually, we did not need to compute the roots. Their product is just the constant term of the eigen-polynomial. Thus, 0 < (1 - s**2). Transposing, we have 1 > s**2. From which it follows that -1 < s < 1. That is, s is in the open interval (-1, 1).
Consider the special case of the Gaussian probability density distribution, which employs the aforementioned matrix A = (1 s s 1). It is
g(x) = (1 / (2 * pi * sqrt(1 - s**2))) * exp((- 1 / 2) * x * (inverse of (1 s s 1)) * xtr), where x is a row vector. We will take it as (x y).
While we could employ the general formulae, which we had obtained previously, it actually is easier to work it out directly.
The inverse of (1 s s 1) is (1 -s -s 1) / (1 - s**2).
Multiplication of the numerator by the two x-vectors yields the expression
x**2 - 2 * s * x * y + y ** 2
Completing the square in y, yields
x**2 + (y - s * x)**2 - s * x**2 = (1 - s**2) * x**2 + (y - s * x)**2
The division gives us
x**2 + (y - s * x)**2 / (1 - s**2)
Let u = (y - s * x) / sqrt(1 - s**2). Then y = u * sqrt(1 - s**2) + s * x. Of course, we also have x = x. The Jacobian (the same Jacobi) of this transformation is
J = det(dx/dx dx/du dy/dx dy/du) = sqrt(1 - s**2).
The foregoing derivatives are *partial*-derivatives; but we do not have the necessary type-face.
Then, g(x) becomes
g(x) * J = (1 / (2 * pi)) * exp((- 1 / 2) * (x**2 + u**2))
This is the normalized Gaussian probability density distribution. Its integral, we remember, is one -- as it should be. Its first moment -- the mean -- is zero. Its second moment -- the variance -- is one, when computed in the space (x u). In any ortho-normal space, the cross-variance always is zero.
But, we need the three second-moments computed in the (x y) space. The variance of x**2 still is one.
The cross-variance of
x * y = x * u * sqrt(1 - s**2) + s * x **2
is The first term is a cross-variance, which in the (x, u) space still is zero. The second term has the variance of x**2, which, as we stated before is, one. Multiply it by the factor s. Hence, the value is s.
The variance of
y**2 = s**2 * x**2 + 2 * s * sqrt(1 - s**2) * x * u + (1 - s**2) * u**2
is The first term is s**2 times the variance of x**2, which is one. This product is s**2. The variance of the second term is zero. The variance of the third term is (1 - s**2). The sum of these three terms is just one.
Again, without the benefit of the explanation available from Information Theory, we just define the cross-correlation as the cross-variance divided by the square-root of the product of each of the individual single-variable variances. The result is that the cross-correlation is equal to s, which in turn is in the open interval (-1, 1).
Whew! All of this work to come back the s, which we had two lessons ago. -)
All that remains, is to describe the link from the Gaussian probability distribution to the factorization of the 3Ch matrix. Actually, this is just a summary of what has been said in previous lessons.
In n-dimensional space, we are given a row-vector x of n elements -- the independent variable. We are given another like-sized row vector mu -- the mean of the vector x. And we are given an nxn matrix A, which by hypothesis is a positive-definite symmetric matrix -- the variance matrix.
Then the Gaussian probability density function is
g(x) = (1 / sqrt(2 * pi))**n * (1 / sqrt(det(A)) * exp((- 1 / 2) * (x - mu) * (inverse of A) * (x - mu)tr)
The matrix A may be factored, in order, into the composite-bra, the spectrum, and the composite-ket. In the large-style, they are
composite-bra = (O1 * S1 * < 0 0 O2 * S2 * >tr)
spectrum = (I S Str I)
composite-ket = (<tr * S1 * O1tr 0 0 > * S2 * O2tr)
Since the matrix A is symmetric, this ket is the transpose of the bra. However, since they are not ortho-normal (unitary) matrices, their inverses are not their transposes. Since they are diagonal matrices and since their sub-matrices are either ortho-normal or diagonal, their inverses are easy to compute. Thus, the
inverse of composite-bra = (<tr * (1 / S1) * O1tr 0 0 > * (1 / S2) * O2tr)
inverse of composite-ket = (O1 * (1 / S1) * < 0 0 O2 * (1 / S2) * >tr)
The composite-bra consists of three-layers -- factors -- which are, from left-to-right
exterior-bra = (O1 0 0 O2), aka external-bra
gauge-transformation = (S1 0 0 S2), aka scale or scaling
interior-bra = (< 0 0 >tr), aka internal-bra
The composite-ket may be factored, likewise.
The regression line is y = x * S or x = y * S -- they are not equivalent. These are in terms of the principal coordinates. In terms of the given coordinates
x = mu + (x y) * (inverse of composite-bra)
Since, usually, the magnitude of the individual elements of the small spectrum decreases rapidly, a good approximation may be obtained by retaining only the first few elements.
When they perform a factor-analysis, statisticians call the exterior-bra the loading of the principal factors upon the coordinates. They call the gauge-transformation the principal factors. Usually, they would be ordered in decreasing order. Statisticians lack the necessary tools to go any further. Hence, they do not have any more names.
A suitable name for the interior-bra would be the loading of the principal correlation upon the principal factors. Finally, the small spectrum S is the set of principal cross-correlations. Again, they would be ordered in decreasing order of their magnitudes.
Unable to do any better, statisticians replace the interior-bra by an identity matrix. Then the anomaly of a poorer fit of the regression may result as more correlation elements are considered.
This principal-regression should be a great boon to statisticians. We offer it here without much explanation. But, statisticians are pragmatic. If it works, use it. But, will anybody pick-up on this?
Several lessons ago, I believe that I forgot to mention the obvious theorem, with an equally obvious proof.
Theorem The absolute-value of the determinant of any matrix A is the positive square-root of the determinant of (A * Atr), likewise of (Atr * A).
If the matrix A is not square, the larger of the aforementioned products introduces an extraneous null-space of a size equal to the difference of the sizes of the two products. Since most algorithms for the computation of a determinant have trouble with null-spaces, use that product which results in the smaller matrix.
The inverse of an ortho-normal (unitary) matrix is just its transpose. The inverse of a diagonal matrix is just its reciprocal. The inverse of a 2x2 matrix is a little more difficult; but not too bad.
In our case, the only 2x2 matrix which we may want to invert is the large-style spectrum (where S is a small-style diagonal matrix)
(I S Str I)
Its inverse is particularity easy
(I -S -Str I) / (I - S * Str)
Once we have a problem reduced to the multiplication of ortho-normal (unitary), diagonal, or 2x2 matrices; we can manipulate the expressions easily.
For example, a Fourrier series and its various generalizations may be expressed in terms of ortho-normal (unitary) matrices in Banach (Hilbert) space. These are real (complex) infinite-dimensional -- either denumerable or of the power of the continuum -- spaces. By now, we clearly are in the realm of Algebraic Topology. In passing, Banach was a famous Polish mathematician, in Warszawa (English spelling Warsaw).
The various spread-spectrum concepts also may be represented in terms of such matrices.
In Information Theory, it gets even easier. We take the Arc cosine of a certain -- very specific -- matrix. Then, we show that the matrix multiplication may be performed by the addition of the angles. These angles are matrices, themselves, of course.
Once we have achieved this level of sophistication, everything is reduced to easy matrix manipulations.
I have exasperated my friends by responding, "It is just another ortho-normal (unitary) matrix." When they were all excited about some newest wavelets, pulse-code modulation, or other new-fanged communications scheme.
In the previous lesson, we had related the principal linear regression, expressed in the coordinates of the spectrum, to the external coordinates. We may make this linear regression more usable.
Let y be a row-vector, like the vector x. However, we think of y as the dependent variable, corresponding to x, the independent variable.
Start with the last equation of the previous lesson. Post-multiply each side by the composite-bra. Now, express the principal linear regression, in terms of the coordinates of the spectrum, by the expression large-style spectrum minus the identity ((I S Str I) - I).
Post-multiply by the inverse of the composite-bra to obtain the principal linear regression, expresses in terms of the external coordinates
y = mu + (x - mu) * (composite bra) * ((I S Str I) - I) * (inverse of composite bra)
The inverse of the composite-bra was provided in the previous lesson, which inverse we had obtained easily -- employing transposition and reciprocals.
Now, that was easy, was it not?
Any other juggling-act upon a tight-rope -- without a safety net -- you want performed? -)
webmaster@rism.com
Last modified on Saturday 17-th July 1999
Copyright (c) 1999 by R. I. 'Scibor-Marchocki