## Introduction to Probability Book

I discovered a fantastic introduction to probability book on Dartmouth’s math department’s website. The book covers the basics, as the title suggests, but the writing is terrific. Topics include random variables, moments, sums of random variables through both convolution and cross-correlation as well as generating functions, laws of large numbers, central limit theorems, Markov chains, and random walks.

I found it to be very readable and filled with lots of examples through derivation and simulation. Here is a link to the book.

## Wait til the Weekend

Probabilistic Modeling with WinBugs

Large Collection of Facial Recognition Databases, Papers, and Algorithms

Microarray data analysis: from disarray to consolidation and consensus

Accounting for Computer Scientists

Genetic Algorithm for Hello World

Five ways to visualize your pairwise comparisons (in R)

Cross-Validation and Mean-Square Stability

Pennant (iOS App for Displaying and Visualizing Baseball Data)

## Wait til the Weekend – Initial

I am terrible about leaving tabs open in Google Chrome. Often I find useful links, which are related (whether directly or indirectly) to my current project, but in order to get anything done *right now*, I have to push these off onto a stack. My stack of choice is opening a new tab. As I open more and more tabs without ever returning to the previous ones, Chrome becomes bogged down, sluggish, and filled with many potential distractions.

I currently have 3 Chrome windows open with a total of 20 tabs present. As an example, below is a screenshot of the tabs from one window.

At one point, InstaPaper was my fix for this problem. If I encountered a new link (that are really Internet shine-ys), my first response was to stuff the link quickly into my InstaPaper account through the nice icon on my address bar. My InstaPaper account became a huge repository of links that I wanted to read and to understand. Unfortunately, when a mass collection resulted, I became more unlikely to return to them.

Recently, through my Twitter account, I encountered John D. Cook’s blog and noticed his Weekend miscellany. It seems that he has a decent way of dealing with useful but potentially distracting links. Although this may not be his motivation, I am interpreting this as a temporary delay in dealing with a link with the goal of returning to them at a set time, say the weekend. I have decided I am going to do much the same thing and beginning something I am naming *Wait til the Weekend*.

Here are my first links:

Visualization of Support Vector Machines with Polynomial Kernels

Generalized Eigenvalue Decomposition in C#

Handwriting Recognition using Kernel Discriminant Analysis

Generalized Linear Model Notes

## Converting a String to a Variable Name On-The-Fly and Vice-versa in R

Recently, I had a professor ask me how to take a string and convert it to an R variable name on-the-fly. One possible way is:

x <- 42 eval(parse(text = "x")) [1] 42 |

Now, suppose we want to go the other way. The trick is just as simple:

x <- 42 deparse(substitute(x)) [1] "x" |

## Automatic Simulation Queueing in R

I spend much of my time writing R code for simulations to compare the supervised classification methods that I have developed with similar classifiers from the literature. A large challenge is to determine which datasets (whether artificial/simulated or real) are interesting comparisons. Even if we restricted ourselves to multivariate Gaussian data, there are a large number of covariance matrix configurations that we could use to simulate the data. In other words, there are too many possibilities to consider all of them. However, it is often desirable to consider as many as possible.

Parallel processing certainly has reduced the runtime for simulations. In fact, most of my simulations are ridiculously parallelizeable, so I can run multiple simulations side-by-side.

I have been searching for ways to automate a lot of what I do, so I can spend less time on the mundane portions of simulation and focus on classification improvement. As a first attempt, I have written some R code that generates a Bash script that can be queued on my university’s high-performance computer. The code to create a Bash script is create.shell.file(), which is given here:

# Arguments: # shell.file: The name of the shell file (usually ends in '.sh'). # r.file: The name of the R file that contains the actual R simulation. # output.file: The name of the file where all output will be echoed. # r.options: The options used when R is called. # sim.args: The simulation arguments that will be passed to the R file. # create.shell.file <- function(shell.file, r.file, output.file, r.options = "--no-save --slave", sim.args = NULL, chmod = TRUE, chmod.permissions = "750") { args.string <- '' if(!is.null(sim.args)) args.string <- paste('--args', sim.args) r.command <- paste('R', r.options, args.string, '<', r.file, '>', output.file) sink(shell.file) cat('#!/bin/bash\n') cat('#PBS -S /bin/bash\n') cat('echo "Starting R at `date`"\n') cat(r.command, '\n') cat('echo "R run completed at `date`"\n') sink() # If the chmod flag is TRUE, then we will chmod the created file to have the appropriate chmod.permissions. if(chmod) { chmod.command <- paste("chmod", chmod.permissions, shell.file) system(chmod.command) } } |

To actually queue the simulation, we make a call to queue.sim():

# Arguments: # sim.config.df: a dataframe that contains the current simulation configuration. # sim.name: The name of the simulation. The queued sim will be prepended to the queue name. # np: The number of processors to use for this simulation. # npn: The number of processors to use per node for this simulation. # email: The email address that will be notified upon completion or an error. # cleanup: Delete all of the shell files after the simulations are queued? # verbose: Echo the status of the current task? # queue.sim <- function(sim.config.df, sim.type = "rlda-duin", np = 1, npn = 1, email = "johnramey@gmail.com", cleanup = FALSE, verbose = TRUE) { sim.config <- paste(names(sim.config.df), sim.config.df, collapse = "-", sep = "") sim.name <- paste(sim.type, "-", sim.config, sep = "") shell.file <- paste(sim.name, ".sh", sep = "") r.file <- paste(sim.type, '.r', sep = '') out.file <- paste(sim.name, '.out', sep = '') sim.args <- paste(sim.config.df, collapse = " ") if(verbose) { cat("sim.config:", sim.config, "\n") cat("sim.name:", sim.name, "\n") cat("shell.file:", shell.file, "\n") cat("r.file:", r.file, "\n") cat("out.file:", out.file, "\n") cat("sim.args:", sim.args, "\n") } if(verbose) cat("Creating shell file\n") create.shell.file(shell.file, r.file, out.file, sim.args = sim.args) if(verbose) cat("Creating shell file...done!\n") # Example # scasub -np 8 -npn 8 -N "rlda-prostate" -m "johnramey@gmail.com" ./rlda-prostate.sh if(verbose) cat("Queueing simulation\n") queue.command <- paste("scasub -np ", np, " -npn ", npn, " -N '", sim.name, "' -m '", email, "' ./", shell.file, sep = "") if(verbose) cat("Queue command:\t", queue.command, "\n") system(queue.command) if(verbose) cat("Queueing simulation...done!\n") if(cleanup) { if(verbose) cat("Cleaning up shell files\n") file.remove(shell.file) if(verbose) cat("Cleaning up shell files...done\n") } } |

Let’s look at an example to see what is actually happening. Suppose that we have a simulation file called “gaussian-sim.r” that generates N observations from two different p-dimensional Gaussian distributions each having the identity covariance matrix. Of course, this is a boring example, but it’s a start. One interesting question that always arises is: “Does classification performance degrade for small values of N and (extremely) large values of p?” We may wish to answer this question with a simulation study by looking at many values of N and many values of p and see if we can find a cutoff where classification performance declines. Let’s further suppose that for each configuration that we will repeat the experiment B times. (As a note, I’m not going to actually examine the gaussian-sim.r file or its contents here. I may return to this example later and extend it, but for now I’m going to focus on the automated queueing.) We can queue the simulation for each of several configurations the following code:

library('plyr') sim.type <- "gaussian-sim" np <- 8 npn <- 8 verbose <- TRUE cleanup <- TRUE N <- seq.int(10, 50, by = 10) p <- seq.int(250, 1000, by = 250) B <- 1000 sim.configurations <- expand.grid(N = N, p = p, B = B) # Queue a simulation for each simulation configuration d_ply(sim.configurations, .(N, p, B), queue.sim, sim.type = sim.type, np = np, npn = npn, cleanup = cleanup, verbose = verbose) |

This will create a Bash script with a descriptive name. For example, with the above code, a file called “gaussian-sim-N10-p1000-B1000.sh” is created. Here are its contents:

#!/bin/bash #PBS -S /bin/bash echo "Starting R at `date`" R --no-save --slave --args 10 1000 1000 < gaussian-sim.r > gaussian-sim-N10-p1000-B1000.out echo "R run completed at `date`" |

A note about the shell file created. The actual call to R can be customized, but this call has worked well for me. I certainly could call R in batch mode, but I never do without any specific reason. Perhaps one is more efficient than the other? I’m not sure about this.

Next, for each *.sh file created, the following command is executed to queue the R script using the above configuration.

scasub -np 8 -npn 8 -N 'gaussian-sim-N10-p1000-B1000' -m 'johnramey@gmail.com' ./gaussian-sim-N50-p1000-B1000.sh |

The scasub command is used for my university’s HPC. I know that there are other systems out there, but you can always alter my code to suit your needs. Of course, your R script needs to take advantage of the commandArgs() function in R to use the above code.

## Autocorrelation Matrix in R

I have been simulating a lot of data lately with various covariance (correlation) structures, and one that I have been using is the autocorrelation (or autoregressive) structure, where there is a “lag” between variables. The matrix is a v-dimension matrix of the form

$$\begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{v-1}\\ \rho & 1& \ddots & \dots & \rho^{v-2}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ \rho^{v-2} & \dots & \ddots & \ddots & \rho\\ \rho^{v-1} & \rho^{v-2} & \dots & \rho & 1 \end{bmatrix}$$,

where \(\rho \in [-1, 1]\) is the lag. Notice that the lag decays to 0 as v increases.

My goal was to make the construction of such a matrix simple and easy in R. The method that I used explored a function I have not used yet in R called “lower.tri” for the lower triangular part of the matrix. The upper triangular part is referenced with “upper.tri.”

My code is as follows:

`autocorr.mat` |

I really liked it because I feel that it is simple, but then I found Professor Peter Dalgaard’s method, which I have slightly modified. It is far better than mine, easy to understand, and slick. Oh so slick. Here it is:

`autocorr.mat` |

Professor Dalgaard’s method puts mine to shame. It is quite obvious how to do it once it is seen, but I certainly wasn’t thinking along those lines.

## Principal Component Analysis vs Linear Discriminant Analysis for Dimension Reduction

Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found this article in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) that compares Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in facial recognition. Published in 2001, it is a bit dated. However, there are few papers (to my knowledge) with such a specific focus.

Before we discuss the paper further, let’s take a look at a summary of LDA and PCA.

The goal of LDA is to find a linear projection from the feature space (with dimension \(p\)) to a subspace of dimension \(C – 1\), where \(C\) is the number of classes, that maximizes the separability of the classes. It must be noted that LDA is often advertised as a Gaussian parametric model, but Fisher only assumed homoscedastic populations; that is, he assumed that the covariance matrices of each class are equal. We refer to the common covariance matrix as \(\mathbf{\Sigma}\). However, under the homoscedastic Gaussian assumption, LDA can be found to be the maximum likelihood method. In practice this covariance matrix must be estimated with data because it is unknown; the estimated covariance matrix is often called the pooled sample covariance matrix, \(\mathbf{S}_p\). Of course, when the sample size \(N\) is large relative to the dimension of the feature space (the number of variables) \(p\), this estimation is excellent. However, when \(p > N\), \(\mathbf{S}_p\) is singular, which causes a problem for the method. Often the inverse of this estimate is replaced with the Moore-Penrose pseudoinverse or is regularized. In the modern, high-dimensional case where \(p >> N\), this estimation is terrible.

A good overview of LDA is given here.

Wikipedia defines PCA nicely:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

PCA essentially rotates the data (via a linear transformation) so that most of the variability in the data is contained in as few dimensions as possible. For dimension reduction purposes, the usual practice is to drop the remaining dimensions containing little variability (the dimensions that correspond to the smallest eigenvalues) because they are highly correlated with the remaining dimensions. To borrow from Wikipedia once again,

PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance.

A good overview of PCA can be found here.

The problem that I have with PCA for dimension reduction in the classification context, which the PAMI paper considers, is that it ignores the response, and thus the eigenvectors (and corresponding eigenvalues) are found after considering the features as one data set. In other words, the training data is treated as if it all comes from the same population, which can be especially problematic in the multiclass classification setting. The paper acknowledges this issue:

Of late, there has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter deals with the data in its entirety for the principal components analysis without paying any particular attention to the underlying class structure.

The paper then makes the claim that

we will show that the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the size of the learning database is small.

I have no qualms about their claim and their subsequent results. However, there is no acknowledgement about the poor estimation of \(\mathbf{S}_p\), which leads to poor performance of LDA in the \(p >> n\) case. There have been many suggestions on how to improve this estimation, and often shrinkage methods significantly improve the estimation of \(\mathbf{\Sigma}\). LDA is not always the best choice either because of the need to pool covariance matrices: if the covariance matrix for each class describe very different shapes, then pooling essentially is a weighted average of the shapes, which may lead to a new shape not representative of any class. (This is similar to the classic independent two-sample t-test, where a pooled sample variance is used.)

It would be interesting to see a follow-up study done with the appropriate regularizations performed with LDA and PCA in the \(p >> N\) case.

As a side note, I find it humorous that these methods are often paired against each other. Two bitter enemies, R. A. Fisher and Karl Pearson, developed LDA and PCA, respectively. My favorite quote, which can be found in Agresti’s Categorical Data Analysis (p. 622), within the rivalry is Pearson’s response to a Fisher criticism:

I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the

Journal of the Royal Statistical Society. … I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors…

## Initial Post

This blog serves as a platform to post my current endeavors, most of which will be related to my profession — statistics. My goal is to increase my experience analyzing data and to improve my knowledge and familiarity about methods that I did not learn in school. Some methods might have been glossed over in class for sake of time, or I simply was preoccupied.

I do not expect the analyses to be perfect. In fact I am not looking for perfection; I will not sacrifice quality for quantity of analyses necessarily, but I prefer not to waste time on minutiae. If you see that I have made an error or are willing to suggest an improvement in my analysis or code, please let me know. I appreciate feedback and constructive criticism, especially because I am new to some of these methods.

At one point, I had two websites: http://johnramey.net and http://ramhiser.com. Although I had plans for both, I quickly realized that I should consolidate into one. So, I have moved things here. Currently I am migrating a couple of posts from my previous blog.