The objective in a principal components analysis (or PCA) of a set of variables is to find a new set of variables that have certain desirable properties and that efficiently represent the information in the original set of variables (efficiently here meaning with a smaller number of new variables than in the original set). The numerical analysis problem that must be solved in PCA is: given p variables, \({X_1},{X_2}, \ldots ,{X_p},\) find the linear combinations (new variables) \({Z_1},{Z_2}, \ldots ,{Z_p},\) e.g.
\[{Z_{1i}} = {a_{11}}{X_{1i}} + {a_{12}}{X_{2i}} + \cdots + {a_{1p}}{X_{pi}},{\rm{ }}i = 1, \ldots ,n,\]
that have the following properties:
(Usually, but not always, the X’s have been rescaled to have a mean of 0 and a standard deviation of 1, i.e. transformed to z-scores.)
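As a minimal sketch of this rescaling (hypothetical data; numpy, with variables as rows and observations as columns):

```python
import numpy as np

# Hypothetical data matrix: p variables (rows) observed on n cases (columns).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(3, 100))

# Convert each variable to z-scores: subtract its mean, divide by its standard deviation.
X = (X_raw - X_raw.mean(axis=1, keepdims=True)) / X_raw.std(axis=1, keepdims=True)

print(X.mean(axis=1))   # approximately 0 for each variable
print(X.std(axis=1))    # 1 for each variable
```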
The first linear combination, \({Z_1},\) looks somewhat like a regression equation, and the properties are designed so that a) the (new) components are defined in decreasing order of importance, b) the components represent the “underlying dimensions” of the set of original variables, and c) the components are not correlated with one another.
The first, or principal, component (hence the name of the analysis) is
\[{Z_{1i}} = {a_{11}}{X_{1i}} + {a_{12}}{X_{2i}} + \cdots + {a_{1p}}{X_{pi}},{\rm{ }}i = 1, \ldots ,n,\]
and is defined by choosing the a’s in order to
\[\max ({\lambda _1}) = {\mathop{\rm var}} ({Z_1}) = (1/n)\sum\limits_{i = 1}^n {Z_{1i}^2,}\]
subject to the constraint that \(a_{11}^2 + a_{12}^2 + \cdots + a_{1p}^2 = 1.\) This constraint ensures that \({\lambda _1}\) is not made arbitrarily large simply by increasing the values of the a’s.
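As a hedged illustration of the quantity being maximized, the sketch below (hypothetical data) forms \({Z_1}\) from z-scored variables using an arbitrary unit-length coefficient vector and evaluates its variance; the solution developed below chooses the coefficients that make this variance as large as possible.

```python
import numpy as np

# Continues the sketch above: X is a p x n matrix of z-scores (rows = variables).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(3, 100))
X = (X_raw - X_raw.mean(axis=1, keepdims=True)) / X_raw.std(axis=1, keepdims=True)

# An arbitrary candidate coefficient vector, rescaled so that a11^2 + ... + a1p^2 = 1.
a1 = np.array([0.6, 0.5, 0.2])
a1 = a1 / np.linalg.norm(a1)

# Z_{1i} = a11*X_{1i} + a12*X_{2i} + ... + a1p*X_{pi}
Z1 = a1 @ X

# Because the X's have mean 0, so does Z1, and var(Z1) = (1/n) * sum(Z1^2).
lam1_candidate = np.mean(Z1 ** 2)
print(lam1_candidate)
```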
A second component
\[{Z_{2i}} = {a_{21}}{X_{1i}} + {a_{22}}{X_{2i}} + \cdots + {a_{2p}}{X_{pi}},{\rm{ }}i = 1, \ldots ,n,\]
can also be defined in order to
\[\max ({\lambda _2}) = {\mathop{\rm var}} ({Z_2}) = (1/n)\sum\limits_{i = 1}^n {Z_{2i}^2,}\]
subject to the constraints that \(a_{21}^2 + a_{22}^2 + \cdots + a_{2p}^2 = 1\), and \({\mathop{\rm cov}} ({Z_1},{Z_2}) = 0.\)
These constraints again make sure that \({\lambda _2}\) is not made arbitrarily large, and that the first and second components are uncorrelated with one another. The optimization problem amounts to requiring the first component to be defined in such a way as to have the maximum variance over the n observations. The optimization problem may be stated as:
\[\begin{array}{c}\max ({\lambda _1}) = (1/n)\sum\limits_{i = 1}^n {Z_{1i}^2,{\rm{ s.t. }}\sum\limits_{j = 1}^p {a_{1j}^2} } = 1\\ = (1/n){({{{\bf{a'}}}_1}{\bf{X}})^2},{\rm{ s.t. }}{{{\bf{a'}}}_1}{{\bf{a}}_1} = 1\\ = (1/n)({{{\bf{a'}}}_1}{\bf{X}})({\bf{X'}}{{\bf{a}}_1}),{\rm{ s.t. }}{{{\bf{a'}}}_1}{{\bf{a}}_1} = 1\\ = {{{\bf{a'}}}_1}((1/n){\bf{XX'}}){{\bf{a}}_1},{\rm{ s.t. }}{{{\bf{a'}}}_1}{{\bf{a}}_1} = 1.\end{array} \]
Note that if \({\mu _X} = 0\) and \({\sigma _X} = 1\) (i.e. the X’s are z-scores), then \((1/n){\bf{XX'}} = {\bf{R}},\) the correlation matrix.
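A quick numerical check of this identity, again with hypothetical data:

```python
import numpy as np

# Check that (1/n) X X' equals the correlation matrix R when the rows of X are z-scores.
rng = np.random.default_rng(1)
n, p = 200, 4
raw = rng.standard_normal((p, n))
raw[1] += 0.7 * raw[0]                       # introduce some correlation among the variables

X = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

R_from_X = (X @ X.T) / n
R_direct = np.corrcoef(raw)                  # correlation matrix computed directly

print(np.allclose(R_from_X, R_direct))       # True, up to floating-point error
```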
Maximizing \({\lambda _1},\) i.e.
\[\max ({\lambda _1}) = {{\bf{a'}}_1}{\bf{R}}{{\bf{a}}_1},{\rm{ s.t.}\space}{{\bf{a'}}_1}{{\bf{a}}_1} = 1\]
is equivalent to maximizing \({u_1}\), i.e.
\[\max ({u_1}) = {{\bf{a'}}_1}{\bf{R}}{{\bf{a}}_1} - {\lambda _1}({{\bf{a'}}_1}{{\bf{a}}_1} - 1),\]
where \({\lambda _1}\) appears in the equation as a “Lagrange multiplier” which incorporates the constraint into the equation.
To maximize \({u_1}\), set the partial derivative of \({u_1}\) with respect to \({{\bf{a}}_1}\) to zero, i.e.,
\[{{\partial {u_1}}}/{{\partial {{\bf{a}}_1}}} = 2{\bf{R}}{{\bf{a}}_1} - 2{\lambda _1}{{\bf{a}}_1} = {\bf{0}},{\rm{ or}}\]
\[({\bf{R}} - {\lambda _1}{\bf{I}}){{\bf{a}}_1} = {\bf{0}}.\]
This last expression contains p simultaneous equations that must be solved for \({{\bf{a}}_1}.\) Comparison of this expression with the one describing the “eigenstructure” of a square matrix reveals that choosing \({{\bf{a}}_1}\) to be the first eigenvector of R, with \({\lambda _1}\) as the first (largest) eigenvalue, solves the optimization problem in PCA. It also happens that the second and higher-numbered components are described by the second and higher-numbered eigenvectors and eigenvalues of the matrix R.

There is a second way to state the optimization problem that winds up leading to the same result. In this second case, instead of choosing the coefficients \({{\bf{a}}_1}\) so that the first component has the maximum variability over the observations, \({{\bf{a}}_1}\) is chosen so that the first component has the maximum simultaneous resemblance to the X’s, i.e.,
\[\begin{array}{c}\max ({\lambda _1}) = \frac{1}{n}\sum\limits_{i = 1}^n {\left( {{{\left( {\sum\limits_{j = 1}^p {{a_{1j}}{X_{ji}}} } \right)}^2}/\sum\limits_{j = 1}^p {a_{1j}^2} } \right)} \\ = \frac{1}{n}{({{{\bf{a'}}}_1}{\bf{X}})^2}/({{{\bf{a'}}}_1}{{\bf{a}}_1})\\ = \frac{1}{n}({{{\bf{a'}}}_1}{\bf{XX'}}{{\bf{a}}_1})/({{{\bf{a'}}}_1}{{\bf{a}}_1})\\ = ({{{\bf{a'}}}_1}{\bf{R}}{{\bf{a}}_1})/({{{\bf{a'}}}_1}{{\bf{a}}_1}),\end{array}\]
which leads to the same problem as before, i.e. \(\max ({\lambda _1}) = {{\bf{a'}}_1}{\bf{R}}{{\bf{a}}_1},{\rm{ s.t. }}{{\bf{a'}}_1}{{\bf{a}}_1} = 1.\)
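As an illustrative sketch (hypothetical data; numpy’s symmetric eigensolver), the code below checks that the eigenvectors of R supply the coefficient vectors, that the eigenvalues equal the component variances, that the components are uncorrelated, and that the ratio \({{\bf{a'}}_1}{\bf{R}}{{\bf{a}}_1}/({{\bf{a'}}_1}{{\bf{a}}_1})\) from the second formulation is maximized at the first eigenvector.

```python
import numpy as np

# Hypothetical z-scored data: p variables (rows) on n observations (columns).
rng = np.random.default_rng(2)
n, p = 500, 3
raw = rng.standard_normal((p, n))
raw[1] += 0.8 * raw[0]                             # induce correlation
X = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

R = (X @ X.T) / n                                  # correlation matrix

# Eigenstructure of R: eigh handles symmetric matrices; sort eigenvalues in decreasing order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                               # lambda_1 >= lambda_2 >= ... >= lambda_p
A = eigvecs[:, order]                              # column j holds the coefficients a_j

# Component scores: Z_j = a_j' X for each component.
Z = A.T @ X

print(np.allclose(np.mean(Z ** 2, axis=1), lam))   # var(Z_j) = lambda_j
print(np.allclose((Z @ Z.T) / n, np.diag(lam)))    # cov(Z_j, Z_k) = 0 for j != k

# Second formulation: the ratio a'Ra / (a'a) never exceeds lambda_1,
# and it attains lambda_1 at the first eigenvector.
rayleigh = lambda a: (a @ R @ a) / (a @ a)
trials = rng.standard_normal((1000, p))
print(max(rayleigh(a) for a in trials) <= lam[0] + 1e-12)
print(np.isclose(rayleigh(A[:, 0]), lam[0]))
```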