Linear Regression

Given a set of points, the purpose of linear regression is to pass a surface, which these points over-determine, as close as possible (in some sense) to the points.

First we have to select a metric, both for the distance between a pair of points and for the aggregate over a set of such pairs. There are innumerable possible choices, several of them popular. We are going to adopt the possibly-weighted Lebesgue-Stieltjes integral of the square of the difference of the dependent variables,
LS-integral over the domain X of (y - f(x))^2 p(x), with respect to x,
where p(x), x in X, is the probability density distribution p(X); y = y(x), x in X, is the given function y(X); and f(x) is the function to be fitted to it.
For the purpose of this exercise, we are going to restrict ourselves to the one-dimensional case.
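
For a finite set of points the integral reduces to a weighted sum. The following is a minimal sketch under that assumption; the function name weighted_squared_error, the sample data, and the default of equal weights 1/n are our own illustrative choices and not part of the derivation.

```python
def weighted_squared_error(xs, ys, f, weights=None):
    """Discrete stand-in for the LS-integral of (y - f(x))^2 p(x)."""
    n = len(xs)
    if weights is None:
        # Assumption: every point carries the same probability 1/n.
        weights = [1.0 / n] * n
    return sum(p * (y - f(x)) ** 2 for x, y, p in zip(xs, ys, weights))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # these points lie exactly on y = 2 x + 1

exact = weighted_squared_error(xs, ys, lambda x: 2.0 * x + 1.0)
off_by_one = weighted_squared_error(xs, ys, lambda x: 2.0 * x)
print(exact, off_by_one)
```

A candidate f that passes through every point scores zero; shifting it vertically by one unit raises the score to one.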

Definitions

We take the probability density distribution p(x), x in X, as p(X).  Its Lebesgue-Stieltjes integral over X is, by definition, one:
1 = LS-integral over X of p(x), with respect to x
The Greek letter mu, which we write as u, is the mean of x over X. Likewise, yo is the mean of y over X. The subscript of the capital I, which we write as a suffix, indicates the variable of the moment. The second (and higher) order moments are taken about the center of mass.  The second-order moments are called "variance" by statisticians and "moment of inertia" by physicists.  The third-order moments, once normalized, are called "skewness"; "kurtosis" is the normalized fourth-order moment.  The Greek letter sigma, which we write as s, is the standard deviation of the variable indicated in its subscript, which again we write as a suffix.  The Greek letter rho is the correlation coefficient.  The subscript ee stands for expected error.
u = Ix = LS-integral over X of x p(x), with respect to x
yo = Iy = LS-integral over X of y p(x), with respect to x
sx^2 = Ix2 = LS-integral over X of (x - u)^2 p(x), with respect to x
sy^2 = Iy2 = LS-integral over X of (y - yo)^2 p(x), with respect to x
Ixy = LS-integral over X of (x - u)(y - yo) p(x), with respect to x
rho = Ixy / (sx sy)
see^2 = Iee = LS-integral over X of (yp - y)^2 p(x), with respect to x
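
As a sketch of how these definitions look for a finite set of points, the integrals below are replaced by weighted sums. The function name moments, the dictionary keys, and the equal-weight default are assumptions made for illustration only.

```python
import math


def moments(xs, ys, weights=None):
    """Means, second-order central moments, and rho for discrete points."""
    n = len(xs)
    if weights is None:
        # Assumption: p(x) puts probability 1/n on each given point.
        weights = [1.0 / n] * n
    u = sum(p * x for x, p in zip(xs, weights))    # mean of x
    yo = sum(p * y for y, p in zip(ys, weights))   # mean of y
    Ix2 = sum(p * (x - u) ** 2 for x, p in zip(xs, weights))
    Iy2 = sum(p * (y - yo) ** 2 for y, p in zip(ys, weights))
    Ixy = sum(p * (x - u) * (y - yo) for x, y, p in zip(xs, ys, weights))
    rho = Ixy / math.sqrt(Ix2 * Iy2)               # Ixy / (sx sy)
    return {"u": u, "yo": yo, "Ix2": Ix2, "Iy2": Iy2, "Ixy": Ixy, "rho": rho}


sample = moments([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0])
print(sample)
```

For this sample the means are 2.5 and 5.0, and the moments Ix2, Iy2, and Ixy come out as 1.25, 6.5, and 2.75.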

Lemmata

We observe that each of these two Lebesgue-Stieltjes integrals is zero, because the mean has been subtracted from the variable.
0 = LS-integral over X of (x - u) p(x), with respect to x
0 = LS-integral over X of (y - yo) p(x), with respect to x
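
The lemmata can be checked numerically on a small discrete example. This is only a sketch: the points and the unequal weights are made up for illustration, and the integrals are replaced by weighted sums.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
weights = [0.1, 0.2, 0.3, 0.4]  # hypothetical probabilities, summing to one

u = sum(p * x for x, p in zip(xs, weights))    # mean of x
yo = sum(p * y for y, p in zip(ys, weights))   # mean of y

# First-order moments taken about the means.
first_x = sum(p * (x - u) for x, p in zip(xs, weights))
first_y = sum(p * (y - yo) for y, p in zip(ys, weights))
print(first_x, first_y)  # both vanish up to rounding error
```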

Derivation


The straight line (y predicted, abbreviated as yp) that we want to fit to the set of points, by what is known as the method of least squares, is
yp = m (x - u) + b + yo
We adopt the square of the Euclidean metric for the vertical distance between a point and the line. Then, we integrate it, weighted by p(x), over X. Thus, we want to minimize the function
w = Iee = LS-integral over X of (yp - y)^2 p(x), with respect to x = LS-integral over X of [m (x - u) + b - (y - yo)]^2 p(x), with respect to x
To this end, employing "differentiation under the integral sign" (theorem to be quoted and the reference to be located and cited), we find the partial derivatives of w with respect to m and b, set each equal to zero, and solve for b and m.
0 = dw/dm = 2 LS-integral over X of [m (x - u) + b - (y - yo)] (x - u) p(x), with respect to x = 2 (m Ix2 - Ixy)
0 = dw/db = 2 LS-integral over X of [m (x - u) + b - (y - yo)] p(x), with respect to x = 2 b
Thus
m = Ixy / Ix2 = rho sy / sx
b = 0
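
A small numerical sketch can confirm that m = Ixy / Ix2 and b = 0 do minimize w. We assume a finite set of equally weighted points; the helper names centered_moments and w_of, the sample data, and the step size 0.1 are illustrative choices.

```python
def centered_moments(xs, ys):
    """Return u, yo, Ix2, Ixy for equally weighted points."""
    n = len(xs)
    u = sum(xs) / n
    yo = sum(ys) / n
    Ix2 = sum((x - u) ** 2 for x in xs) / n
    Ixy = sum((x - u) * (y - yo) for x, y in zip(xs, ys)) / n
    return u, yo, Ix2, Ixy


def w_of(xs, ys, m, b):
    """Mean of [m (x - u) + b - (y - yo)]^2 over the points."""
    u, yo, _, _ = centered_moments(xs, ys)
    n = len(xs)
    return sum((m * (x - u) + b - (y - yo)) ** 2 for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
u, yo, Ix2, Ixy = centered_moments(xs, ys)

m_best = Ixy / Ix2
b_best = 0.0
w_best = w_of(xs, ys, m_best, b_best)

# Evaluate w at the eight neighbours of (m_best, b_best) on a small grid.
steps = (-0.1, 0.0, 0.1)
w_nearby = [
    w_of(xs, ys, m_best + dm, b_best + db)
    for dm in steps
    for db in steps
    if (dm, db) != (0.0, 0.0)
]
print(m_best, w_best, min(w_nearby))
```

Every neighbouring value of w exceeds w_best, as expected for a minimum of a quadratic in m and b.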

Substitution back into w, where the cross terms containing b vanish by the lemmata, yields
see^2 = Iee = w = m^2 Ix2 + b^2 + Iy2 - 2 m Ixy = Iy2 - Ixy^2 / Ix2 = (Ix2 Iy2 - Ixy^2) / Ix2 = sy^2 (1 - rho^2)
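
The closed form Iee = Iy2 - Ixy^2 / Ix2, which is the same quantity as sy^2 (1 - rho^2), can be compared with the directly computed mean squared residual. The sketch below assumes equally weighted points; the data and variable names are illustrative.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
n = len(xs)

u = sum(xs) / n
yo = sum(ys) / n
Ix2 = sum((x - u) ** 2 for x in xs) / n
Iy2 = sum((y - yo) ** 2 for y in ys) / n
Ixy = sum((x - u) * (y - yo) for x, y in zip(xs, ys)) / n

m = Ixy / Ix2
rho = Ixy / math.sqrt(Ix2 * Iy2)

# Direct evaluation: mean of (yp - y)^2 with yp = m (x - u) + yo.
Iee_direct = sum((m * (x - u) + yo - y) ** 2 for x, y in zip(xs, ys)) / n
# Closed forms in terms of the second-order moments.
Iee_closed = Iy2 - Ixy ** 2 / Ix2
Iee_rho = Iy2 * (1.0 - rho ** 2)
print(Iee_direct, Iee_closed, Iee_rho)
```

All three values agree to rounding error.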

Results

Substitution back into the equation for the straight line yields the regression of y upon x
yp = rho (sy / sx) (x - u) + yo
which, alternatively, may be written as
(yp - yo) / sy = rho ((x - u) / sx)
The expected error is
see^2 = Iee = Iy2 - Ixy^2 / Ix2 = sy^2 (1 - rho^2)
Observe that, when written in terms of the normalized coordinates, the line passes through the origin and has a slope equal to the correlation coefficient rho.  Had we asked for the regression of x upon y, we would have obtained
(xp - u) / sx = rho ((y - yo) / sy)
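These are not the same straight line unless rho is +1 or -1: plotted in the same normalized coordinates, the first has slope rho and the second has slope 1 / rho.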
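
To see the two regression lines differ on actual numbers, the sketch below computes both slopes as dy/dx in the ordinary (x, y) plane. Equal weights are assumed; the function name regression_slopes and the data are illustrative, and the function requires Ixy to be nonzero.

```python
def regression_slopes(xs, ys):
    """Slopes dy/dx of the y-upon-x and x-upon-y lines, plus rho^2."""
    n = len(xs)
    u = sum(xs) / n
    yo = sum(ys) / n
    Ix2 = sum((x - u) ** 2 for x in xs) / n
    Iy2 = sum((y - yo) ** 2 for y in ys) / n
    Ixy = sum((x - u) * (y - yo) for x, y in zip(xs, ys)) / n
    slope_y_on_x = Ixy / Ix2   # from yp - yo = (Ixy / Ix2) (x - u)
    slope_x_on_y = Iy2 / Ixy   # from xp - u = (Ixy / Iy2) (y - yo), solved for y
    rho_squared = Ixy ** 2 / (Ix2 * Iy2)
    return slope_y_on_x, slope_x_on_y, rho_squared


slope_y_on_x, slope_x_on_y, rho_squared = regression_slopes(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 9.0]
)
print(slope_y_on_x, slope_x_on_y, rho_squared)
```

Both lines pass through the point (u, yo), and the ratio of the two slopes equals rho^2, so they coincide only when rho^2 = 1.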

Unbiased expected error

If the cardinality n of the set of given points is finite, the unbiased expected error is

unbiased-see^2 = unbiased-Iee = Iee n / (n - 2)

It is this unbiased expected error that should be employed in the Student-t probability distribution.
References:
Statistics: A First Course, Donald H. Sanders, McGraw-Hill, Inc., 1995 (newer editions may be available).  ISBN 0-07-054900-1.
Introduction to Mathematical Statistics, Robert V. Hogg and Allen T. Craig, fifth edition, 1994.  ISBN 0-02-355722-2.
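
The finite-sample correction Iee n / (n - 2) given in this section can be sketched as a one-line function. The name unbiased_expected_error and the guard against n <= 2 are our own additions; the guard reflects that the factor is undefined or negative when two or fewer points are given.

```python
def unbiased_expected_error(Iee, n):
    """Scale the expected error Iee by n / (n - 2)."""
    if n <= 2:
        # Two parameters (m and b) are fitted, so more than two points are needed.
        raise ValueError("need more than two points")
    return Iee * n / (n - 2)


# Hypothetical example: Iee = 0.45 obtained from n = 4 points.
corrected = unbiased_expected_error(0.45, 4)
print(corrected)
```

With four points the correction doubles the expected error, which shows how strongly it matters for small n.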

Generalization

The generalization to multiple dimensions would promote the second-order moments Ix2, Iy2, and Ixy to matrices.  The parity of the rank of these tensors is invariant under the generalization, as one would expect.  Since this generalization is a profound part of Linear Algebra, we place the one-dimensional special case within College Algebra, rather than Calculus.

Copyright 2000 by R. I. 'Scibor-Marchocki
Last modified on Wednesday 16-th February 2000
Webmaster@rism.com