<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>johnramey &#187; statistics</title>
	<atom:link href="http://johnramey.net/tag/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://johnramey.net/blog</link>
	<description>Don&#039;t think. Compute.</description>
	<lastBuildDate>Wed, 07 Dec 2011 23:52:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Listing of Statistics and Machine Learning Conferences</title>
		<link>http://johnramey.net/blog/2011/06/12/listing-of-statistics-and-machine-learning-conferences/</link>
		<comments>http://johnramey.net/blog/2011/06/12/listing-of-statistics-and-machine-learning-conferences/#comments</comments>
		<pubDate>Sun, 12 Jun 2011 23:20:27 +0000</pubDate>
		<dc:creator>ramhiser</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[conferences]]></category>

		<guid isPermaLink="false">http://johnramey.net/blog/?p=149</guid>
		<description><![CDATA[Occasionally, I will query Google with &#8220;statistics conferences&#8221;, &#8220;machine learning conferences&#8221; or &#8220;pattern recognition conferences&#8221; and the like. But often, it is difficult to obtain anything meaningful other than the conferences of which I&#8217;m already aware (such as JSM, ICML, some IEEE conferences). Today, I found WikiCFP, which is &#8220;A Wiki for Calls For [...]]]></description>
			<content:encoded><![CDATA[<p>Occasionally, I will query Google with &#8220;statistics conferences&#8221;, &#8220;machine learning conferences&#8221; or &#8220;pattern recognition conferences&#8221; and the like. But often, it is difficult to obtain anything meaningful other than the conferences of which I&#8217;m already aware (such as JSM, ICML, some IEEE conferences). Today, I found <a href="http://www.wikicfp.com/cfp/">WikiCFP</a>, which is &#8220;A Wiki for Calls For Papers.&#8221; This seems to be what I needed. In particular, the following are very useful to me:</p>
<ul>
<li><a href="http://www.wikicfp.com/cfp/call?conference=machine%20learning">Machine Learning on WikiCFP</a></li>
<li><a href="http://www.wikicfp.com/cfp/call?conference=statistics">Statistics on WikiCFP</a></li>
</ul>
<p>It seems limited for statistics though, as JSM is not even listed.</p>
]]></content:encoded>
			<wfw:commentRss>http://johnramey.net/blog/2011/06/12/listing-of-statistics-and-machine-learning-conferences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting Started with Some Baseball Data</title>
		<link>http://johnramey.net/blog/2011/05/24/getting-started-with-some-baseball-data/</link>
		<comments>http://johnramey.net/blog/2011/05/24/getting-started-with-some-baseball-data/#comments</comments>
		<pubDate>Wed, 25 May 2011 03:34:53 +0000</pubDate>
		<dc:creator>ramhiser</dc:creator>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[baseball]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://johnramey.net/blog/?p=89</guid>
		<description><![CDATA[With all of the discussions (hype?) regarding applied statistics, machine learning, and data science, I have been looking for a go-to source of work-unrelated data. I loved baseball as a kid. I love baseball now. I love baseball stats. Why not do a grown-up version of what I used to do when I spent hours [...]]]></description>
			<content:encoded><![CDATA[<p>With all of the discussions (hype?) regarding applied statistics, machine learning, and data science, I have been looking for a go-to source of work-unrelated data. I loved baseball as a kid. I love baseball now. I love baseball stats. Why not do a grown-up version of what I used to do when I spent hours staring at and memorizing baseball stats on the back of a few pieces of cardboard on which I spent my allowance?</p>
<p>To get started, I purchased a copy of <a href="http://www.amazon.com/Baseball-Hacks-Joseph-Adler/dp/0596009429/ref=sr_1_1?ie=UTF8&amp;qid=1306290220&amp;sr=8-1">Baseball Hacks</a>. The author suggests using MySQL, so I will oblige. First, I downloaded some baseball data in MySQL format to my web server (Ubuntu 10.04) and decompressed it. When I downloaded the data, it was timestamped 28 March 2011, so check whether a newer version is available.</p>

<div class="wp_codebox"><table><tr id="p894"><td class="code" id="p89code4"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">mkdir</span> baseball
<span style="color: #7a0874; font-weight: bold;">cd</span> baseball
<span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>www.baseball-databank.org<span style="color: #000000; font-weight: bold;">/</span>files<span style="color: #000000; font-weight: bold;">/</span>BDB-sql-<span style="color: #000000;">2011</span>-03-<span style="color: #000000;">28</span>.sql.zip
<span style="color: #c20cb9; font-weight: bold;">unzip</span> BDB-sql-<span style="color: #000000;">2011</span>-03-<span style="color: #000000;">28</span>.sql.zip</pre></td></tr></table></div>

<p>Next, in MySQL I created a user named &#8220;baseball&#8221; and a database named &#8220;bbdatabank&#8221;, then granted all privileges on this database to the user &#8220;baseball.&#8221; To do this, first open MySQL as root (mysql -u root -p) and run:</p>

<div class="wp_codebox"><table><tr id="p895"><td class="code" id="p89code5"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> USER <span style="color: #ff0000;">'baseball'</span>@<span style="color: #ff0000;">'localhost'</span> <span style="color: #993333; font-weight: bold;">IDENTIFIED</span> <span style="color: #993333; font-weight: bold;">BY</span> <span style="color: #ff0000;">'YourPassword'</span>;
<span style="color: #993333; font-weight: bold;">CREATE</span> databas bbdatabank;
<span style="color: #993333; font-weight: bold;">GRANT</span> <span style="color: #993333; font-weight: bold;">ALL</span> PRIVILEGES <span style="color: #993333; font-weight: bold;">ON</span> <span style="color: #ff0000;">`bbdatabank`</span><span style="color: #66cc66;">.*</span> <span style="color: #993333; font-weight: bold;">TO</span> <span style="color: #ff0000;">'baseball'</span>@<span style="color: #ff0000;">'localhost'</span>;
<span style="color: #993333; font-weight: bold;">FLUSH</span> PRIVILEGES;
quit</pre></td></tr></table></div>

<p>Note the backticks (`) around bbdatabank when privileges are granted. Also, notice the deliberate misspelling of the keyword DATABASE when I created the db; spell it out in full when you run the statement. WordPress freaks out on me otherwise, because mod_security steps in and says, &#8220;Umm, no.&#8221; For more info about this, go <a href="http://abing.gotdns.com/posts/2006/wordpress-error-404-when-publishing-or-saving-post/">here</a> and <a href="http://drupal.org/node/110204">here</a> (see the comments as well).</p>
<p>Finally, we load the data into the database we just created:</p>

<div class="wp_codebox"><table><tr id="p896"><td class="code" id="p89code6"><pre class="bash" style="font-family:monospace;">mysql <span style="color: #660033;">-u</span> baseball <span style="color: #660033;">-p</span> <span style="color: #660033;">-s</span> bbdatabank <span style="color: #000000; font-weight: bold;">&lt;</span> BDB-sql-<span style="color: #000000;">2011</span>-03-<span style="color: #000000;">28</span>.sql</pre></td></tr></table></div>

<p>That&#8217;s it! Most of this code has been adapted from the <a href="http://www.amazon.com/Baseball-Hacks-Joseph-Adler/dp/0596009429/ref=sr_1_1?ie=UTF8&amp;qid=1306290220&amp;sr=8-1">Baseball Hacks</a> book, although I&#8217;ve tweaked a couple of things. As I progress through the book, I will continue to post interesting finds and code. Eventually, I will move away from the book&#8217;s code, as it caters too much to the &#8220;Intro to Data Exploration&#8221; reader, with constant mentions of MS Access/Excel. The author means well, though, as he urges the reader to use *nix/Mac OS X.</p>
]]></content:encoded>
			<wfw:commentRss>http://johnramey.net/blog/2011/05/24/getting-started-with-some-baseball-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Autocorrelation Matrix in R</title>
		<link>http://johnramey.net/blog/2010/12/26/autocorrelation-matrix-in-r/</link>
		<comments>http://johnramey.net/blog/2010/12/26/autocorrelation-matrix-in-r/#comments</comments>
		<pubDate>Sun, 26 Dec 2010 04:55:09 +0000</pubDate>
		<dc:creator>ramhiser</dc:creator>
				<category><![CDATA[r]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[code]]></category>

		<guid isPermaLink="false">http://johnramey.net/blog/?p=12</guid>
		<description><![CDATA[I have been simulating a lot of data lately  with various covariance (correlation) structures, and one that I have been using is the autocorrelation (or autoregressive) structure, where there is a &#8220;lag&#8221; between variables. The matrix is a v-dimension matrix of the form $$\begin{bmatrix} 1 &#38; \rho &#38; \rho^2 &#38; \dots &#38; \rho^{v-1}\\ \rho &#38; [...]]]></description>
			<content:encoded><![CDATA[<p>I have been simulating a lot of data lately with various covariance (correlation) structures, and one that I have been using is the autocorrelation (or autoregressive) structure, where there is a &#8220;lag&#8221; between variables. The matrix is a \(v \times v\) matrix of the form</p>
<p>$$\begin{bmatrix} 1 &amp; \rho &amp; \rho^2 &amp; \dots &amp; \rho^{v-1}\\ \rho &amp; 1&amp; \ddots &amp; \dots &amp; \rho^{v-2}\\ \vdots &amp; \ddots &amp; \ddots &amp; \ddots &amp; \vdots\\ \rho^{v-2} &amp; \dots &amp; \ddots &amp; \ddots &amp; \rho\\ \rho^{v-1} &amp; \rho^{v-2} &amp; \dots &amp; \rho &amp; 1 \end{bmatrix}$$,</p>
<p>where \(\rho \in [-1, 1]\) is the correlation between adjacent variables and the exponent is the lag, that is, the number of positions separating two variables. Notice that when \(|\rho| &lt; 1\), the correlation \(\rho^k\) decays to 0 as the lag \(k\) increases.</p>
<p>My goal was to make the construction of such a matrix simple and easy in R. My method relies on a function I had not used before in R called &#8220;lower.tri&#8221; for the lower triangular part of the matrix. The upper triangular part is referenced with &#8220;upper.tri.&#8221;</p>
<p>My code is as follows:</p>

<div class="wp_codebox"><table><tr id="p129"><td class="code" id="p12code9"><pre class="c" style="font-family:monospace;">autocorr.<span style="color: #202020;">mat</span></pre></td></tr></table></div>

<p>I really liked it because I feel that it is simple, but then I found <a href="http://tolstoy.newcastle.edu.au/R/e2/help/07/05/16585.html">Professor Peter Dalgaard&#8217;s method</a>, which I have slightly modified. It is far better than mine, easy to understand, and slick. Oh so slick. Here it is:</p>

<div class="wp_codebox"><table><tr id="p1210"><td class="code" id="p12code10"><pre class="c" style="font-family:monospace;">autocorr.<span style="color: #202020;">mat</span></pre></td></tr></table></div>

<p>Professor Dalgaard&#8217;s method puts mine to shame. It is quite obvious how to do it once it is seen, but I certainly wasn&#8217;t thinking along those lines.</p>
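<p>For readers who want something they can run, here is a minimal sketch of the matrix defined above. It is written in plain Python rather than R and is not meant to reproduce either of the implementations discussed in this post. The function name <em>autocorr_matrix</em> is hypothetical; the only idea it uses is that the entry in row \(i\) and column \(j\) equals \(\rho^{|i - j|}\).</p>

```python
def autocorr_matrix(v, rho):
    """Return the v x v matrix whose (i, j) entry is rho ** |i - j|."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    # The exponent |i - j| is the lag between variable i and variable j.
    return [[rho ** abs(i - j) for j in range(v)] for i in range(v)]


# Example: a 4 x 4 matrix with rho = 0.5, printed one row per line.
for row in autocorr_matrix(4, 0.5):
    print(row)
```

<p>The diagonal is always 1 because the lag of a variable with itself is 0, and the matrix is symmetric because the lag depends only on the distance between the two indices.</p>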
]]></content:encoded>
			<wfw:commentRss>http://johnramey.net/blog/2010/12/26/autocorrelation-matrix-in-r/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Principal Component Analysis vs Linear Discriminant Analysis for Dimension Reduction</title>
		<link>http://johnramey.net/blog/2010/12/26/principal-component-analysis-vs-linear-discriminant-analysis-for-dimension-reduction/</link>
		<comments>http://johnramey.net/blog/2010/12/26/principal-component-analysis-vs-linear-discriminant-analysis-for-dimension-reduction/#comments</comments>
		<pubDate>Sun, 26 Dec 2010 04:46:47 +0000</pubDate>
		<dc:creator>ramhiser</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[dimension reduction]]></category>
		<category><![CDATA[linear discriminant analysis]]></category>
		<category><![CDATA[principal components]]></category>

		<guid isPermaLink="false">http://johnramey.net/blog/?p=9</guid>
		<description><![CDATA[Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found this article in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) that compares Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in facial recognition.  Published in 2001, it is a bit dated.  However, there are few [...]]]></description>
			<content:encoded><![CDATA[<p>Lately I have been reviewing much of the electrical engineering literature on pattern recognition and machine learning and found <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.5303&amp;rep=rep1&amp;type=pdf">this article</a> in <a href="http://www.computer.org/portal/web/tpami/">IEEE Transactions on Pattern Analysis and Machine Intelligence</a> (PAMI) that compares <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA) and <a href="http://en.wikipedia.org/wiki/Linear_discriminant_analysis">Linear Discriminant Analysis</a> (LDA) in facial recognition.  Published in 2001, it is a bit dated.  However, there are few papers (to my knowledge) with such a specific focus.</p>
<p>Before we discuss the paper further, let&#8217;s take a look at a summary of LDA and PCA.</p>
<p>The goal of LDA is to find a linear projection from the feature space (with dimension \(p\)) to a subspace of dimension at most \(C - 1\), where \(C\) is the number of classes, that maximizes the separability of the classes. It must be noted that LDA is often advertised as a <a href="http://en.wikipedia.org/wiki/Gaussian_distribution">Gaussian</a> parametric model, but Fisher only assumed <a href="http://en.wikipedia.org/wiki/Homoscedasticity">homoscedastic</a> populations; that is, he assumed that the covariance matrices of the classes are equal. We refer to the common covariance matrix as \(\mathbf{\Sigma}\). Under the <a href="http://en.wikipedia.org/wiki/Homoscedasticity">homoscedastic</a> <a href="http://en.wikipedia.org/wiki/Gaussian_distribution">Gaussian</a> assumption, however, LDA can be shown to be the <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood</a> method. In practice, \(\mathbf{\Sigma}\) is unknown and must be estimated from data; the estimate is often called the pooled sample covariance matrix, \(\mathbf{S}_p\). Of course, when the sample size \(N\) is large relative to the dimension of the feature space (the number of variables) \(p\), this estimate is excellent. However, when \(p &gt; N\), \(\mathbf{S}_p\) is singular, which causes a problem for the method because its inverse does not exist. Often the inverse of this estimate is replaced with the <a href="http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse">Moore-Penrose pseudoinverse</a>, or the estimate is regularized. In the modern, high-dimensional case where \(p \gg N\), this estimate is terrible.</p>
<p>A good overview of LDA is given <a href="http://courses.cs.tamu.edu/rgutier/cs790_w02/l6.pdf">here</a>.</p>
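<p>To make the singularity issue concrete, here is a minimal sketch in plain Python. It does not come from the paper or from the overview linked above; the helper names <em>pooled_covariance</em> and <em>det2</em> and the toy data are assumptions chosen for illustration. The pooled estimate has rank at most \(N - C\), so it is singular whenever \(N - C &lt; p\), which includes the \(p &gt; N\) case. The example uses \(p = 2\) so that the determinant is easy to check by hand.</p>

```python
def pooled_covariance(classes):
    """Pooled sample covariance S_p for a list of classes.

    Each class is a list of observations and each observation is a list of
    p feature values. The divisor is N - C (observations minus classes).
    """
    p = len(classes[0][0])
    n_total = sum(len(obs) for obs in classes)
    scatter = [[0.0] * p for _ in range(p)]
    for obs in classes:
        mean = [sum(x[k] for x in obs) / len(obs) for k in range(p)]
        for x in obs:
            d = [x[k] - mean[k] for k in range(p)]
            for i in range(p):
                for j in range(p):
                    scatter[i][j] += d[i] * d[j]
    dof = n_total - len(classes)
    return [[s / dof for s in row] for row in scatter]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Two classes with three observations each: N - C = 4, which is at least p.
well_sampled = [
    [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]],
    [[5.0, 5.0], [7.0, 6.0], [6.0, 4.0]],
]
# Two classes with three observations in total: N - C = 1, below p = 2.
under_sampled = [
    [[1.0, 2.0], [3.0, 5.0]],
    [[6.0, 4.0]],
]
print(det2(pooled_covariance(well_sampled)))   # nonzero: S_p is invertible
print(det2(pooled_covariance(under_sampled)))  # zero: S_p is singular
```

<p>In the second data set, the class with a single observation contributes nothing to the scatter, and the remaining two observations span only one direction, so no inverse exists.</p>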
<p><a href="http://en.wikipedia.org">Wikipedia</a> <a href="http://en.wikipedia.org/wiki/Principal_component_analysis#Details">defines PCA</a> nicely:</p>
<blockquote><p>PCA is mathematically defined<span style="font-size: small;"> </span>as an <a title="Orthogonal transformation" href="http://en.wikipedia.org/wiki/Orthogonal_transformation">orthogonal</a> <a title="Linear transformation" href="http://en.wikipedia.org/wiki/Linear_transformation">linear transformation</a> that transforms the data to a new <a title="Coordinate system" href="http://en.wikipedia.org/wiki/Coordinate_system">coordinate system</a> such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.</p></blockquote>
<p>PCA essentially rotates the data (via a linear transformation) so that most of the variability in the data is contained in as few dimensions as possible. For dimension reduction purposes, the usual practice is to keep the leading components and drop those containing little variability (the ones that correspond to the smallest eigenvalues); when the original features are highly correlated, a few leading components capture most of the variance. To borrow from Wikipedia once again,</p>
<blockquote><p>PCA has the distinction of being the optimal <a title="Linear transformation" href="http://en.wikipedia.org/wiki/Linear_transformation">linear transformation</a> for keeping the subspace that has largest variance.</p></blockquote>
<p>A good overview of PCA can be found <a href="http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf">here</a>.</p>
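<p>As a small illustration of the rotation idea, the sketch below finds the first principal component of a toy two-dimensional data set in plain Python. It is an assumption-laden example rather than a production implementation: the function names, the toy data, and the use of power iteration to get the top eigenvector of the sample covariance matrix are all my choices for illustration.</p>

```python
import math


def covariance(data):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n, p = len(data), len(data[0])
    mean = [sum(x[k] for x in data) / n for k in range(p)]
    centered = [[x[k] - mean[k] for k in range(p)] for x in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(cov, iterations=200):
    """Approximate the top eigenvector and eigenvalue by power iteration.

    Assumes cov is nonzero and that the start vector is not orthogonal to
    the top eigenvector.
    """
    p = len(cov)
    vec = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in nxt))
        vec = [c / norm for c in nxt]
    image = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(vec, image))
    return vec, eigenvalue


# Toy data scattered around the line y = x.
data = [[2.0, 1.0], [1.0, 2.0], [4.0, 3.0], [3.0, 4.0], [6.0, 5.0], [5.0, 6.0]]
cov = covariance(data)
direction, variance = leading_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print(direction)                  # first principal direction (unit length)
print(variance / total_variance)  # share of total variance it captures
```

<p>Because the two features move together, a single direction carries most of the variance, which is exactly the situation in which dropping the trailing component loses little. Note that the class labels never enter the computation.</p>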
<p>The problem that I have with PCA for dimension reduction in the classification context, which the PAMI paper considers, is that it ignores the response, and thus the eigenvectors (and corresponding eigenvalues) are found after considering the features as one data set. In other words, the training data is treated as if it all comes from the same population, which can be especially problematic in the multiclass classification setting. The paper acknowledges this issue:</p>
<blockquote><p>Of late, there has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter deals with the data in its entirety for the principal components analysis without paying any particular attention to the underlying class structure.</p></blockquote>
<p>The paper then makes the claim that</p>
<blockquote><p>we will show that the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the size of the learning database is small.</p></blockquote>
<p>I have no qualms about their claim and their subsequent results. However, there is no acknowledgement that \(\mathbf{S}_p\) is a poor estimate of \(\mathbf{\Sigma}\) in the \(p \gg N\) case, which leads to poor performance of LDA. There have been many <a href="http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices">suggestions</a> on how to improve this estimation, and often <a href="http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices#Shrinkage_estimation">shrinkage</a> methods significantly improve the estimation of \(\mathbf{\Sigma}\). LDA is not always the best choice either because of the need to pool covariance matrices: if the covariance matrices of the classes describe very different shapes, then pooling is essentially a weighted average of the shapes, which may lead to a new shape not representative of any class. (This is similar to the classic <a href="http://en.wikipedia.org/wiki/Student's_t-test#Independent_two-sample_t-test">independent two-sample t-test</a>, where a pooled sample variance is used.)</p>
<p>It would be interesting to see a follow-up study with appropriate regularization applied to LDA and PCA in the \(p \gg N\) case.</p>
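<p>To give a feel for what shrinkage does, here is a minimal sketch in plain Python. It shrinks a covariance estimate toward its own diagonal, \((1 - \lambda)\mathbf{S} + \lambda\,\mathrm{diag}(\mathbf{S})\). The diagonal target, the value of \(\lambda\), the function names, and the toy matrix are assumptions made for illustration; in practice the target and the amount of shrinkage are chosen from the data.</p>

```python
def shrink_toward_diagonal(s, lam):
    """Return (1 - lam) * S + lam * diag(S) for a square matrix S."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p = len(s)
    # Diagonal entries are unchanged; off-diagonal entries are scaled down.
    return [[s[i][j] if i == j else (1.0 - lam) * s[i][j] for j in range(p)]
            for i in range(p)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


singular = [[2.0, 3.0], [3.0, 4.5]]  # rank one, so it has no inverse
shrunk = shrink_toward_diagonal(singular, 0.2)
print(det2(singular))  # zero
print(det2(shrunk))    # positive, so the shrunken estimate is invertible
```

<p>Even a modest amount of shrinkage turns a singular estimate into one that can be inverted, which is what LDA needs. The price is some bias in the off-diagonal entries.</p>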
<p>As a side note, I find it humorous that these methods are often pitted against each other. Two bitter enemies, <a href="http://en.wikipedia.org/wiki/Ronald_Fisher">R. A. Fisher</a> and <a href="http://en.wikipedia.org/wiki/Karl_Pearson">Karl Pearson</a>, developed LDA and PCA, respectively. My favorite quote from the rivalry, which can be found in Agresti&#8217;s <a href="http://www.amazon.com/gp/product/0471360937?ie=UTF8&amp;tag=ramhiser-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0471360937">Categorical Data Analysis</a><img style="border: none !important; margin: 0px !important;" src="http://www.assoc-amazon.com/e/ir?t=ramhiser-20&amp;l=as2&amp;o=1&amp;a=0471360937" border="0" alt="" width="1" height="1" /> (p. 622), is Pearson&#8217;s response to a criticism from Fisher:</p>
<blockquote><p>I hold that such a view [Fisher's] is entirely erroneous, and that the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the <em><a href="http://en.wikipedia.org/wiki/Journal_of_the_Royal_Statistical_Society">Journal of the Royal Statistical Society</a></em>. &#8230; I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors&#8230;</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://johnramey.net/blog/2010/12/26/principal-component-analysis-vs-linear-discriminant-analysis-for-dimension-reduction/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

