Bayesian Regularized Logistic Regression in High-Dimensional Classification

John A. Ramey
Working Paper, Baylor University.

Availability:

Abstract. High-dimensional data are increasingly common, largely because modern systems can automatically collect large quantities of data and store them on high-capacity devices. Statistical data analysis is consequently shifting from a focus on a few well-chosen variables to identifying the most relevant variables among a very large number of them. High dimensionality typically increases the difficulty of standard machine learning and supervised learning problems such as classification and regression, so much so that the phrase "curse of dimensionality" is often used. Practically, one would like to determine which variables, if any, are relevant. The practitioner can discard variables that appear relatively unimportant through variable (or feature) selection. Alternatively, the practitioner can weight variables algorithmically through methods such as regularization, or project the data onto a lower-dimensional subspace through dimension reduction, thereby lessening the problem's difficulty.

Our focus in this presentation is the machine learning problem of classification. The goal is to classify an unlabeled observation into one of K groups after building a classifier from training data in which each observation's label is known; because the training observations are labeled, this problem is referred to as supervised learning. The error rate of a classifier is the true proportion of unlabeled observations that it will misclassify. In this presentation we focus on data sets with a large number of variables, whose high dimensionality can severely inflate the classification error rate. To improve this situation, we apply a form of regularization to the logistic regression classifier.
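As a concrete illustration of the setting (a minimal sketch in Python, not the paper's code; the simulated Gaussian data, the choice of an L2 penalty, and the scikit-learn usage are assumptions for illustration only):

# A minimal sketch: fit an L2-regularized logistic regression to
# simulated high-dimensional data (n << p) and estimate its conditional
# error rate on a held-out test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, p = 100, 500                        # far fewer observations than features
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0                        # only 10 of the 500 features matter
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# In scikit-learn, C is the inverse of the regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

# The held-out misclassification rate estimates the conditional error rate.
print("estimated error rate:", np.mean(clf.predict(X_test) != y_test))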

Regularization is a popular alternative to feature selection in high-dimensional classification, especially when the feature-space dimension p is much larger than the sample size n. Regularized logistic regression provides an effective classification tool by maximizing a penalized likelihood of \bm\beta = (\beta_0, \beta_1, \ldots, \beta_p)', the vector of feature parameters. A regularization parameter \lambda controls the amount by which the likelihood is penalized, so that ||\bm\beta|| \le \lambda; the intercept term \beta_0 is rarely included in the penalty. Often \lambda is chosen by minimizing the estimated conditional error rate. Few studies have considered the regularized logistic regression classifier from a Bayesian perspective, and those studies primarily place multivariate normal prior distributions on \bm\beta while ignoring that ||\bm\beta|| \le \lambda a priori. We propose a Bayesian regularization method that uses a truncated multivariate normal prior structure, exploiting our a priori knowledge of \bm\beta in relation to \lambda. We perform a simulation experiment to compare the error rates of non-Bayesian and Bayesian regularized logistic regression methods using randomly generated data as well as a high-dimensional real data set from the University of California, Irvine (UCI) Machine Learning Repository.
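To make this setup concrete, the constrained maximum likelihood problem and one form of the proposed prior can be sketched as follows (the Bernoulli log-likelihood and the prior variance hyperparameter \sigma^2 are written out for illustration; the exact norm and prior covariance are not specified here):

\hat{\bm\beta} = \arg\max_{\bm\beta} \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i'\bm\beta - \log\left(1 + e^{\mathbf{x}_i'\bm\beta}\right) \right] \quad \text{subject to} \quad ||\bm\beta|| \le \lambda,

where (y_i, \mathbf{x}_i) denotes the i-th labeled training observation. A truncated multivariate normal prior that encodes the constraint a priori can then be written as

\pi(\bm\beta \mid \lambda) \propto \exp\left( -\frac{1}{2\sigma^2}\, \bm\beta'\bm\beta \right) \mathbb{1}\{ ||\bm\beta|| \le \lambda \},

so that, unlike an untruncated multivariate normal prior, no prior mass is placed outside the region ||\bm\beta|| \le \lambda.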

Keywords: Bayesian methods, logistic regression, regularization, high-dimensional classification.

BibTeX Record:

@TechReport{ramey10bayeslogistic,
  author       = {John A. Ramey},
  title        = {Bayesian Regularized Logistic Regression in High-Dimensional Classification},
  year         = 2010,
  institution  = {Baylor University},
  type         = {ERID Working Paper},
  number       = 50
}