## Archive for the ‘machine learning’ Category

## Listing of Statistics and Machine Learning Conferences

Occasionally, I will query Google with “statistics conferences”, “machine learning conferences”, “pattern recognition conferences”, and the like. Often, though, it is difficult to find anything beyond the conferences of which I’m already aware (such as JSM, ICML, and some IEEE conferences). Today, I found WikiCFP, which bills itself as “A Wiki for Calls For Papers.” This seems to be what I needed. In particular, the following are very useful to me:

It seems limited for statistics though, as JSM is not even listed.

## Principal Component Analysis vs Linear Discriminant Analysis for Dimension Reduction

Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found this article in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) that compares Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in facial recognition. Published in 2001, it is a bit dated. However, there are few papers (to my knowledge) with such a specific focus.

Before we discuss the paper further, let’s take a look at a summary of LDA and PCA.

The goal of LDA is to find a linear projection from the feature space (with dimension \(p\)) to a subspace of dimension \(C - 1\), where \(C\) is the number of classes, that maximizes the separability of the classes. It must be noted that LDA is often advertised as a Gaussian parametric model, but Fisher only assumed homoscedastic populations; that is, he assumed that the covariance matrices of the classes are equal. We refer to the common covariance matrix as \(\mathbf{\Sigma}\). Under the additional assumption that the homoscedastic populations are Gaussian, however, LDA can be derived as the maximum likelihood method. In practice, \(\mathbf{\Sigma}\) is unknown and must be estimated from data; the estimate is often called the pooled sample covariance matrix, \(\mathbf{S}_p\). Of course, when the sample size \(N\) is large relative to the dimension of the feature space (the number of variables) \(p\), this estimate is excellent. However, when \(p > N\) (indeed, whenever \(p > N - C\)), \(\mathbf{S}_p\) is singular, which causes a problem for the method because the LDA rule requires its inverse. Often the inverse is replaced with the Moore-Penrose pseudoinverse, or the estimate is regularized. In the modern, high-dimensional case where \(p \gg N\), the unregularized estimate is terrible.

A good overview of LDA is given here.
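To make the singularity issue concrete, here is a minimal NumPy sketch. Everything in it is my own illustrative choice rather than anything from the paper: the function name `pooled_covariance`, the simulated data, and the shrinkage weight `lam`, which would normally be tuned rather than fixed.

```python
import numpy as np


def pooled_covariance(X, y):
    """Pooled sample covariance: summed within-class scatter over N - C."""
    classes = np.unique(y)
    n_samples, n_features = X.shape
    scatter = np.zeros((n_features, n_features))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)
        scatter += centered.T @ centered
    return scatter / (n_samples - len(classes))


# Simulated data with more features than observations (illustrative only).
rng = np.random.default_rng(0)
N, p, C = 20, 50, 2
X = rng.normal(size=(N, p))
y = np.repeat(np.arange(C), N // C)

S_p = pooled_covariance(X, y)
rank = np.linalg.matrix_rank(S_p)  # cannot exceed N - C, so S_p is singular

# Option 1: Moore-Penrose pseudoinverse in place of the inverse.
S_p_pinv = np.linalg.pinv(S_p)

# Option 2: shrink toward a scaled identity. lam is a hypothetical weight
# that would normally be tuned (for example, by cross-validation).
lam = 0.5
target = (np.trace(S_p) / p) * np.eye(p)
S_shrunk = (1 - lam) * S_p + lam * target
S_shrunk_inv = np.linalg.inv(S_shrunk)  # well defined: S_shrunk is positive definite
```

The rank of \(\mathbf{S}_p\) is bounded by \(N - C\), so with \(p = 50\) and \(N = 20\) it cannot be inverted directly, while the shrunken estimate can.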

Wikipedia defines PCA nicely:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

PCA essentially rotates the data (via a linear transformation) so that most of the variability in the data is contained in as few dimensions as possible. For dimension reduction purposes, the usual practice is to drop the remaining dimensions (those that correspond to the smallest eigenvalues) because they contain little of the variability. This works best when the original features are highly correlated, since a few components then capture most of the variance. To borrow from Wikipedia once again,

PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance.

A good overview of PCA can be found here.
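As a rough sketch of the rotation-and-truncation idea, PCA can be computed from the eigendecomposition of the sample covariance matrix. The function name `pca` and the simulated data (five features driven by two latent factors) are assumptions made purely for illustration.

```python
import numpy as np


def pca(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :k]            # rotated data, first k coordinates
    explained = eigvals[:k] / eigvals.sum() # share of total variance retained
    return scores, explained


# Five highly correlated features generated from two latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

scores, explained = pca(X, k=2)
score_cov = np.cov(scores, rowvar=False)    # diagonal up to rounding error
```

Here the two retained components are uncorrelated with each other and, because the features are nearly redundant, carry almost all of the variance.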

The problem that I have with PCA for dimension reduction in the classification context, which the PAMI paper considers, is that it ignores the response, and thus the eigenvectors (and corresponding eigenvalues) are found after considering the features as one data set. In other words, the training data is treated as if it all comes from the same population, which can be especially problematic in the multiclass classification setting. The paper acknowledges this issue:

Of late, there has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter deals with the data in its entirety for the principal components analysis without paying any particular attention to the underlying class structure.
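A toy example of this concern, under assumptions of my own choosing (two simulated Gaussian classes that differ only along a low-variance axis, and a hypothetical helper `separation`), shows how the first principal component can miss the discriminating direction entirely while Fisher's direction finds it. It illustrates the worry only; it says nothing about the paper's small-sample results.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Both classes share a large spread along axis 0; they differ only in their
# mean along axis 1, where the within-class spread is small.
X0 = rng.normal(loc=[0.0, -1.0], scale=[5.0, 0.3], size=(n, 2))
X1 = rng.normal(loc=[0.0, 1.0], scale=[5.0, 0.3], size=(n, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# First principal component: computed from all data, ignoring the labels.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC1

# Fisher's direction: inverse pooled covariance times the mean difference.
D0 = X0 - X0.mean(axis=0)
D1 = X1 - X1.mean(axis=0)
S_p = (D0.T @ D0 + D1.T @ D1) / (2 * n - 2)
w = np.linalg.solve(S_p, X1.mean(axis=0) - X0.mean(axis=0))
w /= np.linalg.norm(w)


def separation(direction):
    """Absolute mean difference over pooled standard deviation after projecting."""
    z = X @ direction
    z0, z1 = z[y == 0], z[y == 1]
    pooled_sd = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2)
    return abs(z1.mean() - z0.mean()) / pooled_sd


sep_pca = separation(pc1)  # small: PC1 follows the shared high-variance axis
sep_lda = separation(w)    # large: w follows the axis that separates the classes
```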

The paper then makes the claim that

we will show that the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the size of the learning database is small.

I have no qualms about their claim and their subsequent results. However, there is no acknowledgement of the poor estimation of \(\mathbf{\Sigma}\) by \(\mathbf{S}_p\), which leads to poor performance of LDA in the \(p \gg N\) case. There have been many suggestions on how to improve this estimation, and shrinkage methods often improve it significantly. LDA is not always the best choice either because of the need to pool covariance matrices: if the covariance matrices of the classes describe very different shapes, then pooling amounts to a weighted average of the shapes, which may lead to a new shape not representative of any class. (This is similar to the classic independent two-sample t-test, where a pooled sample variance is used.)
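The pooling problem is easy to see in a small simulation. In the sketch below, the two class covariance matrices and the helper `axis_ratio` are hypothetical choices of mine: each class is a long, thin ellipse, but the two ellipses are oriented at right angles to each other.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
# Two elongated classes with perpendicular orientations (illustrative values).
cov_a = np.array([[4.0, 0.0], [0.0, 0.25]])
cov_b = np.array([[0.25, 0.0], [0.0, 4.0]])
Xa = rng.multivariate_normal([0.0, 0.0], cov_a, size=n)
Xb = rng.multivariate_normal([0.0, 0.0], cov_b, size=n)

Sa = np.cov(Xa, rowvar=False)
Sb = np.cov(Xb, rowvar=False)
# Pooled estimate: class covariances weighted by their degrees of freedom.
S_p = ((n - 1) * Sa + (n - 1) * Sb) / (2 * n - 2)


def axis_ratio(S):
    """Largest over smallest eigenvalue: 1 is a circle, large is elongated."""
    ev = np.linalg.eigvalsh(S)
    return ev[-1] / ev[0]


ratio_a = axis_ratio(Sa)        # strongly elongated
ratio_b = axis_ratio(Sb)        # strongly elongated, other orientation
ratio_pooled = axis_ratio(S_p)  # close to 1: nearly circular
```

The pooled matrix is close to a circle, a shape that matches neither class.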

It would be interesting to see a follow-up study done with the appropriate regularizations performed with LDA and PCA in the \(p \gg N\) case.

As a side note, I find it humorous that these methods are often pitted against each other. Two bitter enemies, R. A. Fisher and Karl Pearson, developed LDA and PCA, respectively. My favorite quote from the rivalry, which can be found in Agresti’s Categorical Data Analysis (p. 622), is Pearson’s response to a criticism from Fisher:

I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the Journal of the Royal Statistical Society. … I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors…