Linear Algebra

Selected theorems

and

Documentation of the Shareware Library

 

=== Factorization of a channel ===

Construction of the factorization of a channel S.

Definition:  A channel is a partitioned-in-the-middle positive-definite symmetric matrix S.  We do not need the additional properties which are discussed in Information Theory.

Note:  If the partitioning of the matrix S as given is not in the middle; augment the matrix -- with an identity -- as required.  For cascaded channels, augment each to match the largest one.

Comments:  If the actual matrix is not definite; i.e., if the matrix is singular; the subroutines utilize our convention that the reciprocal of zero is zero.  If the actual matrix is not positive; i.e., if an eigenvalue is negative; the subroutines take the absolute-value of the eigenvalue, before finding its square-root.  However, in doing so; the matrix is modified.  Thus, the subroutines survive; but, the result is of questionable significance.

These abbreviations are employed in the following discussion:

Disassemble it into its parts:  S = (a, b; btr, c).  By hypothesis, this is a positive-definite symmetric matrix with equal-size square partitions.

Hence, the matrices a and c obviously are symmetric.  Factor them into a = LBRA LDIAG LKET and c = RBRA RDIAG RKET, where each of the kets is the transpose of its corresponding bra.

Let the matrix d be the product d = (1 / LDIAG) LKET b RBRA (1 / RDIAG).

Factor it into d = MBRA MDIAG MKET.  It will be shown that each element of the diagonal matrix MDIAG lies in the open interval (-1, 1).

Assemble the partitioned matrices

ExBRA = (LBRA, 0; 0, RKETtr)

ExDIAG = (LDIAG, 0; 0, RDIAG)

InBRA = (MBRA, 0; 0, MKETtr)

InDIAG = (I, MDIAG; MDIAG, I)    this is not really a diagonal matrix.  However, since it is a convenient name, we use it anyhow.  Since the MDIAG is a true diagonal matrix, it is obvious that InDIAG is a symmetric matrix.  Moreover, it will be shown that it is a positive-definite matrix.

InKET = InBRAtr = (MBRAtr, 0; 0, MKET)

ExKET = ExBRAtr = (LBRAtr, 0; 0, RKET)

Then, the requisite factorization is S = ExBRA ExDIAG InBRA InDIAG InKET ExDIAG ExKET.

Theorem:  The preceding assertion.

Proof is obvious.

[I discovered/devised this factorization around 1962.] 

Lemma:  Given the symmetric matrix S = (a, b; b, c).  The matrix S is positive definite iff (= if and only if) a > 0, c > 0, and b^2 < a c.

Proof:  The eigenvalues are defined as the roots of the characteristic equation

0 = det(S - x I) = det((a - x, b; b, c - x)) = (a - x) (c - x) - b^2 = x^2 - (a + c) x + (a c - b^2).

By Descartes rule of signs, there will be no negative roots iff a + c > 0 and a c > b^2, which, of necessity is >= 0.  QED.

Corollary:  The last inequality of the lemma is equivalent to rho is in the open interval (-1, 1), where rho is defined as rho = b / sqrt(a c).

Corollary:  A single Jacobi rotation diagonalizes the matrix S.  The rho may be recovered (except for its sign) from this diagonalized form.

Theorem:  The preceding generalizes to an nxn matrix S = (a, b; btr, c), each of whose partitions is a diagonal matrix.  This matrix may be diagonalized by means of n Jacobi rotations.  The set of the rho may be recovered (except for each sign) from this diagonalized form.

Corollary:  Each rho of a channel is in the open interval (-1, 1).

Proof:  This assertion is expressible in terms of eigenvalues, which are invariant under similarity transformations.  QED.

Eigenvalues, as a set, are unique.  Because of our convention to sort the eigenvalues of a channel, its eigenvalues become individually unique.  Hence the analysis of a channel, as we presented it, is unique – up to the usual caveat regarding axial vectors and degeneracies.  The only possible problem is addressed by the

Theorem:  Given a channel represented by the positive definite symmetric matrix S = (a, b; btr, c).  Each of the two partitions a and c is a positive definite symmetric matrix.

Proof:  Factor the matrix S into S = O diag Otr.  By hypothesis, each diagonal element is strictly positive.  Work backwards, to obtain the set of the rho.  Work forward, to obtain the diagonals of the a and c matrices.  Temporarily, ignore all of the ortho-normal matrices.  The remaining factors have to multiply.  By the preceding theorem, each rho is in the open interval (-1, 1) and each element upon a principal diagonal is strictly positive.  The product of the determinant of each factor is the determinant of S.  The value of each determinant is a real number; hence, an element of a field.  Remember that any determinant is invariant under a similarity transformation.

Now, we prove by contra positive:  Introduce each of the ortho-normal matrices gradually, employing the bilinear transformation.  By continuity, if any eigenvalue were to become negative, it would have to pass through zero.  By doing so, it would violate the product of the determinants.  QED.

In passing, if you actually want to factor the matrix representation of a channel into S = O E Otr; employ the Jacobi algorithm.  In principle, the set of the rho may be recovered from the E matrix.  But, in practice, it is easier to employ the provided channel routines upon the original matrix.

Converse:  If each of the two matrices a and c is positive-definite symmetric and if each of the rho lies in the open interval (-1, 1); then the matrix S indeed is positive-definite symmetric.

Proof is obvious.

Theorem:  A positive definite Hermitian matrix H is a channel.  The converse is not true, however.

The proof is obvious.

Comment:  If you want to analyze it as a channel; make a copy of H and set its cmp=0 and its part=1.  Then, employ the channel routines.  However, the result is of dubious significance,  since the MDIAG is not of a form that could be an isomorph of a complex matrix.  Hence, since a single factor by itself could not be a non-complex matrix, the MBRA and MKET must not be either.

We will require a formula for the determinant of InDIAG and that of the channel.  So, let us prove two theorems.

Lemma:  Given a scalar d.  Let S be the matrix S = (1, d; d, 1).  It is symmetric.  It may be factored as S = O E Otr, where the ortho-normal matrix O is O = (1, -1; 1, 1) / sqrt(2) and the diagonal matrix of eigen-values E is E = (1 + d, 0; 0, 1 - d).

Proof:  Either employ the Jacobi algorithm or just multiply it out.  That S is symmetric and that O is ortho-normal is obvious.

Theorem:  Given a diagonal matrix d.  Let S be the partitioned matrix S = (I, d; d, I).  It is symmetric.  It may be factored as S = O E Otr, where the ortho-normal matrix O is O = (I, -I; I, I) / sqrt(2) and the diagonal matrix of eigen-values E is E = (I + d, 0; 0, I - d).

Proof:  Either by Mathematical Induction or just multiply it out.  That S is symmetric and that O is ortho-normal is obvious.

Corollary:  The determinant of S is det(S) = product of (1 - di^2), where di are the elements of the diagonal matrix d.

Proof follows from the definition of a determinant.

Theorem:  The determinant of the channel S = (a, b; btr, c) is det(S) = det(a) det(c) product of (1 - di^2), where di are the elements of the diagonal matrix MDIAG.

Proof:  The determinant of an ortho-normal matrix is either one or minus one.  The determinant of its transpose has the same value.  The determinant of a product of matrices is the product of their individual determinants.  Since the ortho-normal matrices occur in pairs of O and Otr, their determinants multiply to one.  All that remains is the formula in the conclusion.  QED.  By now, you should not require any hyperlinks to the foregoing statements.  :-)

=== Factor analysis and multi-dimensional linear-regression ===

Some time ago, we already had solved the one-dimensional linear regression.  Here, we anticipate the generalization to multi-dimensional  Thus, we will include the theorems, which provide the obvious generalizations, as part of the definitions.

Given a set of m ordered pairs (x, y).  The x and y generalize to vectors, which we take as horizontal.

Definition of average:  xo = (1 / m) sum(x).  yo = (1 / m) sum(y).  So do the xo and yo generalize to vectors.

Definition of the second order moment I:  Ixx = sum((x – xo)^2).  Iyy = sum((y – yo)^2).  The Ixx and Iyy generalize to positive-definite symmetric matrices.

Definition of variance Var:  Var-xx = (1 / m) Ixx.  Var-yy = (1 / m) Iyy.  So do the Var-xx and Var-yy generalize to positive-definite symmetric matrices.

Definition of standard-deviation sigma:  sigma-x = sqrt(Var-xx).  sigma-y = sqrt(Var-yy).  By convention, we take the positive square-root.  Yes, we can take a square-root of either a positive-definite Hermitian or positive-definite symmetric matrix.  Hence, the sigma-x and sigma-y generalize to positive-definite symmetric matrices.

Definition of second-order cross-moment I:  Ixy = sum((x – xo) (y – yo)).  The Ixy generalizes to a real matrix.

Definition of second-order cross-variance Var:  Var-xy = (1 / m) Ixy.  So does the Var-xy generalize to a real matrix.

Observe that there is no such thing as a cross standard-deviation; because a cross-variance may be negative.  Good thing; because a square-root of a real matrix is not definable.  Thus, we have the

Definition of the correlation-coefficient rho and the angle Theta:  rho = Var-xy / (sigma-x sigma-y) = cos(Theta).  The correlation-coefficient generalizes to the real matrices rho = (1 / sigma-x) Var-xy (1 / sigma-y).  We postpone the discussion of the generalization of Theta.

Theorem:  Var-xx = (1 / m) sum(x^2) – xo^2.  Var-xy = (1 / m) sum(x y) – xo yo.  Var-yy = (1 / m) sum(y^2) – yo^2.

Proof:  By the definition of Var-xy = (1 / m) sum((x – xo) (y – yo)) = (1 / m) (sum(x y) – xo sum(y) – yo sum(x) – m xo yo) = (1 / m) sum(x y) – xo yo.  QED.  Likewise for the other two variances.

Theorem:  Single-dimensional linear-regression also known as a least-squares fit.

(y – yo) / sigma-y = ((x – xo) / sigma-x)) rho.

This equation generalizes to:

y = (x – xo) (1 / sigma-x) rho sigma-y + yo = (x – xo) (1 / Var-xx) Var-xy + yo.

Proof:  Let us fit the line eta = a x + b.  The residue is (y – eta).  We want to minimize its variance, which is  (1 / m) times the sum of the squares of this residue:

u = (1 / m) sum((y – eta)^2) = (1 / m) sum((y – (a x + b))^2).

To do so, set each of the partial derivatives equal to zero:

0 = du/da = - (2 / m) sum((y – a x – b) x) = - 2 ((Var-xy + xo yo) – a (Var-xx + xo^2) – b xo).

0 = du/db = - (2 / m) sum(y – a x – b) = - 2 (yo – a xo – b).

To solve this system of linear equations, multiply the first equation by (-1 / 2) and the second by (xo / 2), and add, to obtain

0 = (Var-xy + xo yo) – a (Var-xx + xo^2) – xo yo + a xo^2 = Var-xy – a Var-xx.

Hence,

a = Var-xy  / Var-xx.

b = yo - a xo.

QED.

Corollary:  The resulting minimum value of u is u-min = Var-yy (1 - rho^2).

Proof:  First, substitute b into the u, to obtain

u-min = (1 / m) sum((y – (a x + (yo – a xo)))^2) = (1 / m) sum(((y – yo) – (a (x – xo)))^2 = Var-yy – 2 a Var-xy + a^2 Var-xx.

Then, substitute a, to obtain

u-min = Var-yy – 2 (Var-xy / Var-xx) Var-xy + (Var-xy / Var-xx)^2 Var-xx = Var-yy – (Var-xy)^2 / Var-xx.  QED.

By observing the formula of this corollary, statisticians say that “rho^2 is the relative variance, of Var-yy, explained by the regression.”

Corollary:  The average of the residue is zero:  0 = (1 / m) sum(y – eta).

Proof:  0 =? (1 / m) sum(y – eta) = (1 / m) sum(y – (a x + b)).

Substitute b, to obtain

= (1 / m) sum(y – (a x + (yo – a xo))) = (1 / m) sum((y – yo) – a (x – xo)) = 0.  QED.

Corollary:  The cross-variance Var-xr of x by the residue is zero.  0 = Var-xr = (1 / m) sum((x – xo) ((y – eta) – ave(y – eta))).

Proof:  0 =? (1 / m) sum((x – xo) ((y – eta) – ave(y – eta))) = (1 / m) sum((x – xo) ((y – yo) – (a (x – xo) + (b – b)))) = Var-xy – a Var-xx = Var-xy – (Var-xy / Var-xx) Var-xx = 0.  QED.

Drop the a and the b.

Remember that in the analysis of a channel, we partition its matrix into (a, b; btr, c).  In terms of the variances, we have LBRA LDIAG^2 LBRAtr = a = Var-xx, b = Var-xy, RKETtr RDIAG^2 RKET = c = Var-yy.  Hence, sigma-x = LBRA LDIAG LBRAtr and sigma-y = RKETtr RDIAG RKET.  The matrix d is defined as d = (1 / LDIAG) LKET b RBRA (1 / RDIAG) = LDIAGinv LBRAtr Var-xy RKETtr RDIAGinv = MBRA MDIAG MKET.  Thus, Var-xy = LBRA LDIAG MBRA MDIAG MKET RDIAG RKET.  Hence, rho = (1 / sigma-x) VAR-xy (1 / sigma-y) = (LBRA LDIAGinv LBRAtr) (LBRA LDIAG MBRA MDIAG MKET RDIAG RKET) (RKETtr RDIAGinv RKET) = LBRA MBRA MDIAG MKET RKET.

Theorem:  For the multi-dimensional linear-regression, the coefficient of (x – xo) in the regression formula generalizes to LBRA LDIAGinv MBRA MDIAG MKET RDIAG RKET.  As a mnemonic, these are the seven small-style factors of the channel, taken in order; but, with the LDIAG being inverted.

Proof:  This coefficient is (1 / sigma-x) rho sigma-y = (1 / (LBRA LDIAG LBRAtr)) (LBRA MBRA MDIAG MKET RKET) (RKETtr RDIAG RKET) = LABRA LDIAGinv MBRA MDIAG MKET RDIAG RKET.  QED.  Alternatively, this coefficient is (1 / Var-xx) Var-xy = (1 / (LBRA LDIAG^2 LBRAtr)) (LBRA LDIAG MBRA MDIAG MKET RDIAG RKET) = LBRA LDIAGinv MBRA MDIAG MKET RDIAG RKET.  QED.

Corollary:  The y from whence came a given x is given by the same formula; but, with the middle-factor being MDIAGinv.

Proof is obvious.

Theorem:  The cross-variance Var-xr generalizes to the zero matrix.  The average of the residue generalizes to the zero horizontal vector.

Proofs are obvious.

Theorem:  The minimum of u generalizes to Wtr (I – MDIAG^2) W, where W = MKET RDIAG RKET.  This matrix is positive-definite symmetric, as it has to be by virtue of being a variance.

Proof is obvious.

By observing the formula of this theorem, we may say that “the square of each element of MDIAG is the relative variance, of the corresponding simple-channel, explained by the regression.”

[I formulated and proved this theorem S 08-th May 2004.]

Corollary:  The corresponding standard deviation is Wtr sqrt(I – MDIAG^2) W and the standard error of the mean is Wtr sqrt(I – MDIAG^2) W (1 / sqrt(m - 1)), where m is the amount of points (that is, of observations).

Proof is obvious.

[Corollary added on Su 11-th December 2005.]

These last two theorems apply only to the original event data-set or to any event data-set with the same variance and average.  Such may be constructed easily.

[Statisticians have named the LBRA "factor analysis" and the LDIAG "loading upon the factors".  Since there is no such factorization in the available literature, the MBRA, MDIAG, and MKET have not been named.]

[This multi-dimensional linear-regression was demonstrated by me in 1980 or 1981.]

By now, our analysis has reduced the problem to that of n statistically-independent single-dimensional simple-channels.  Each of these has its independent-variable x and dependent-variable y in what statisticians call "z-scale"; that is, the mean is zero and the standard deviation is one.  Thus, the standard deviation of the mean of the centroid is a circle, centered at the origin, with a radius of one.  Likewise, the standard error of this mean is a circle, centered at the origin, with a radius of 1 / sqrt(m - 1).

For a fixed x, the standard deviation of the corresponding y consists of two components:  1) The residual error from the regression; namely, sqrt(1 - rho^2).  2) The standard error of the mean of the centroid, reverse-projected on the y-axis; namely, sqrt(1 + rho^2) / sqrt(m - 1).  Since they are independent of each other, their combined value is the square-root of the sum of each of their squares; namely, sqrt(1 - rho^2 + (1 + rho^2) / (m - 1)), where (as already stated) m is the amount of points; that is, of obvervations.

The corresponding confidence-interval is centered on the regression line y = rho x and its length (height) is the aforementioned standard-deviation multiplied by the cumulative student-t distribution for the specified confidence criterion.

A similar argument shows that the standard-error of the y-intercept of the regression line is sqrt(2 / (m - 1)); because (1 + rho^2) / (m - 1) comes from the standard-error of the centroid and (1 - rho^2) / (m - 1) comes from the standard-error of the regression.  Pray, observe that this standard-error does not depent upon rho.  Multiply the standard-error of the y-intercept of the regression line by the same cumulative student t distribution, to obtain the confidence-interval.  In summary, we have established the --

TheoremVertical confidence-interval.  The regression line is y = rho x.  The end-points of the confidence interval of the y for a fixed x are +- t sqrt(1 - rho^2 + (1 + rho^2) / (m - 1) from the nominal regression-line.  The y-intercept of the regression-line has end-points of its confidence interval at +- t sqrt(2 / (m - 1)).  Where t is the cumulative student-t distribution for the desired confidence-criterion.

Proof was presented in the preceding arguments.

Corollary:  For large m, the y predicted for a given x is not statistically significantly different from zero; if abs(x) <= t sqrt(1 - rho^2) / abs(rho).  The Student-t probability distribution may be approximated by the corresponding cumulative Gaussian probability distribution.

Proof is obvious.

The foregoing is for a single prediction.  Of course, for m' predictions, the standard error of their mean is divided by the sqrt(m' - 1).

Definition:  Let erf (=ERror Function) be the integral, from minus-infinity to x, of the Gaussian probability density distribution.
This forward reference does not constitute a circular logical argument; because, it refers to a definition.

Lemma:  The derivative of erf(x) is just the Gaussian probability density distribution.

Proof:  The Riemann integral is an anti-derivative.

Consider the one-to-one bi-continuous mapping of the correlation-coefficient rho to the x, given by (1 + rho) / 2 = erf(x).  The domain of rho is (-1, 1) and of x is (-oo, oo).

Theorem:  The derivative of rho, with respect to x, is given by drho/dx = 2 * (the Gaussian probability density distribution, at x).

Proof is obvious.

Corollary:  The derivative of rho, with respect to x, is given by drho/dx = 2 * (the Gaussian probability density distribution, at x = inverse of erf((1 + rho) / 2)).

Proof:  Substitute x from the aforementioned mapping.

Theorem:  The rho confidence interval.  The confidence-interval of the rho is given by the product of t sqrt((1 - rho^2) / m - 1) by the derivative which was the subject of the previous corollary.

Proof:  The confidence-interval has to be computed in terms of a variable with the Gaussian probability distribution.

Since the algorithm that factors a composite channel into the n simple channels provides only the absolute value of the correlation coefficient rho, by symmetry, we have to employ the two-tailed Student-t distribution.

On the other hand, the conventionally accepted t-statistic for the correlation coefficient is rho / sqrt((1 - rho^2) / (m - 2)), with m-2 degrees of freedom.  Again (as stated in the previous sentence), we have to employ the two-tailed Student-t distribution.

Which t-statistic is correct, or the better, one to employ?  Each is credible.  Statistics is more of an art than a science.  It is your judgment call.  The bottom-line, in either case, is that you need many observations to obtain a reliable prediction.

A discussion of the Student-t probability distribution is out of the scope of this work on Linear Algebra.  There are numerous references available on the Internet.

 

 

 

UP        NEXT

Copyright (c) 2003, 4, 5, 6 by R.I. ‘Scibor-Marchocki.  Last revised Saturday 28-th January 2006.