## Archive for the ‘statistics’ Category

## Listing of Statistics and Machine Learning Conferences

Occasionally, I will query Google with “statistics conferences”, “machine learning conferences” or “pattern recognition conferences” and the like. But often, it is difficult to obtain anything meaningful other than the conferences of which I’m already aware (such as JSM, ICML, some IEEE conferences). Today, I found WikiCFP, which is a “A Wiki for Calls For Papers.” This seems to be what I needed. In particular, the following are very useful to me:

It seems limited for statistics though, as JSM is not even listed.

## Getting Started with Some Baseball Data

With all of the discussions (hype?) regarding applied statistics, machine learning, and data science, I have been looking for a go-to source of work-unrelated data. I loved baseball as a kid. I love baseball now. I love baseball stats. Why not do a grown-up version of what I used to do when I spent hours staring at and memorizing baseball stats on the back of a few pieces of cardboard on which I spent my allowance?

To get started, I purchased a copy of Baseball Hacks. The author suggests the usage of MySQL, so I will oblige. First, I downloaded some baseball data in MySQL format on my web server (Ubuntu 10.04) and decompressed it; when I downloaded the data, it was timestamped as 28 March 2011, so double-check if there is an updated version.

mkdir baseball cd baseball wget http://www.baseball-databank.org/files/BDB-sql-2011-03-28.sql.zip unzip BDB-sql-2011-03-28.sql.zip |

Next, in MySQL I created a user named “baseball”, a database entitled “bbdatabank” and granted all privileges on this database to the user “baseball.” To do this, first open MySQL as root (mysql -u root -p)

CREATE USER 'baseball'@'localhost' IDENTIFIED BY 'YourPassword'; CREATE databas bbdatabank; GRANT ALL PRIVILEGES ON `bbdatabank`.* TO 'baseball'@'localhost'; FLUSH PRIVILEGES; quit |

Note the tick marks (`) around bbdatabank when privileges are granted. Also, notice the deliberate misspelling when I constructed the db. WordPress freaks out on me because mod_security steps in and says, “Umm, no.” For more info about this, go here and here (see the comments as well).

Finally, we read the data into the database we just created by:

mysql -u baseball -p -s bbdatabank < BDB-sql-2011-03-28.sql |

That’s it! Most of this code has been adapted from the Baseball Hacks book, although I’ve tweaked a couple of things. As I progress through the book, I will continue to add interesting finds and code as posts. Eventually, I will move away from the book’s code as it focuses too much on the “Intro to Data Exploration” reader with constant mentions of MS Access/Excel. The author means well though as he urges the reader to use *nix/Mac OS X.

## Autocorrelation Matrix in R

I have been simulating a lot of data lately with various covariance (correlation) structures, and one that I have been using is the autocorrelation (or autoregressive) structure, where there is a “lag” between variables. The matrix is a v-dimension matrix of the form

$$\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{v-1}\\ \rho & 1& \ddots & \dots & \rho^{v-2}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ \rho^{v-2} & \dots & \ddots & \ddots & \rho\\ \rho^{v-1} & \rho^{v-2} & \dots & \rho & 1 \end{bmatrix}$$,

where \(\rho \in [-1, 1]\) is the lag. Notice that the lag decays to 0 as v increases.

My goal was to make the construction of such a matrix simple and easy in R. The method that I used explored a function I have not used yet in R called “lower.tri” for the lower triangular part of the matrix. The upper triangular part is referenced with “upper.tri.”

My code is as follows:

`autocorr.mat` |

I really liked it because I feel that it is simple, but then I found Professor Peter Dalgaard’s method, which I have slightly modified. It is far better than mine, easy to understand, and slick. Oh so slick. Here it is:

`autocorr.mat` |

Professor Dalgaard’s method puts mine to shame. It is quite obvious how to do it once it is seen, but I certainly wasn’t thinking along those lines.

## Principal Component Analysis vs Linear Discriminant Analysis for Dimension Reduction

Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found this article in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) that compares Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in facial recognition. Published in 2001, it is a bit dated. However, there are few papers (to my knowledge) with such a specific focus.

Before we discuss the paper further, let’s take a look at a summary of LDA and PCA.

The goal of LDA is to find a linear projection from the feature space (with dimension \(p\)) to a subspace of dimension \(C – 1\), where \(C\) is the number of classes, that maximizes the separability of the classes. It must be noted that LDA is often advertised as a Gaussian parametric model, but Fisher only assumed homoscedastic populations; that is, he assumed that the covariance matrices of each class are equal. We refer to the common covariance matrix as \(\mathbf{\Sigma}\). However, under the homoscedastic Gaussian assumption, LDA can be found to be the maximum likelihood method. In practice this covariance matrix must be estimated with data because it is unknown; the estimated covariance matrix is often called the pooled sample covariance matrix, \(\mathbf{S}_p\). Of course, when the sample size \(N\) is large relative to the dimension of the feature space (the number of variables) \(p\), this estimation is excellent. However, when \(p > N\), \(\mathbf{S}_p\) is singular, which causes a problem for the method. Often the inverse of this estimate is replaced with the Moore-Penrose pseudoinverse or is regularized. In the modern, high-dimensional case where \(p >> N\), this estimation is terrible.

A good overview of LDA is given here.

Wikipedia defines PCA nicely:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

PCA essentially rotates the data (via a linear transformation) so that most of the variability in the data is contained in as few dimensions as possible. For dimension reduction purposes, the usual practice is to drop the remaining dimensions containing little variability (the dimensions that correspond to the smallest eigenvalues) because they are highly correlated with the remaining dimensions. To borrow from Wikipedia once again,

PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance.

A good overview of PCA can be found here.

The problem that I have with PCA for dimension reduction in the classification context, which the PAMI paper considers, is that it ignores the response, and thus the eigenvectors (and corresponding eigenvalues) are found after considering the features as one data set. In other words, the training data is treated as if it all comes from the same population, which can be especially problematic in the multiclass classification setting. The paper acknowledges this issue:

Of late, there has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter deals with the data in its entirety for the principal components analysis without paying any particular attention to the underlying class structure.

The paper then makes the claim that

we will show that the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the size of the learning database is small.

I have no qualms about their claim and their subsequent results. However, there is no acknowledgement about the poor estimation of \(\mathbf{S}_p\), which leads to poor performance of LDA in the \(p >> n\) case. There have been many suggestions on how to improve this estimation, and often shrinkage methods significantly improve the estimation of \(\mathbf{\Sigma}\). LDA is not always the best choice either because of the need to pool covariance matrices: if the covariance matrix for each class describe very different shapes, then pooling essentially is a weighted average of the shapes, which may lead to a new shape not representative of any class. (This is similar to the classic independent two-sample t-test, where a pooled sample variance is used.)

It would be interesting to see a follow-up study done with the appropriate regularizations performed with LDA and PCA in the \(p >> N\) case.

As a side note, I find it humorous that these methods are often paired against each other. Two bitter enemies, R. A. Fisher and Karl Pearson, developed LDA and PCA, respectively. My favorite quote, which can be found in Agresti’s Categorical Data Analysis (p. 622), within the rivalry is Pearson’s response to a Fisher criticism:

I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the

Journal of the Royal Statistical Society. … I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors…