Chapter 7 Classification

Classification shares many similarities with regression: We have a response variable \(Y\) and a set of one or more predictors \(X_1,\dotsc,X_p\). The difference is that for classification problems, the response \(Y\) is discrete, meaning \(Y\in\{1,2,\dotsc,C\}\) where \(C\) is the number of classes that \(Y\) can take on.

We will focus our attention on binary responses \(Y\in\{0,1\}\), but all of the methods we discuss can be extended to the more general case outlined above.

To illustrate classification methods, we will use the Default data set in the ISLR R package. The data set contains four variables: default, an indicator of whether the customer defaulted on their debt; student, an indicator of whether the customer is a student; balance, the average balance that the customer has remaining on their credit card after making their monthly payment; and income, the customer’s income.
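For example, the data can be loaded and the first few rows inspected as follows.

```r
# Load the ISLR package and the Default data set
library(ISLR)
data(Default)

# Look at the first few observations
head(Default)
```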

default  student  balance  income
No       No         729.5   44362
No       Yes        817.2   12106
No       No        1073.5   31767
No       No         529.3   35704
No       No         785.7   38463
No       Yes        919.6    7492
We also need to split up the data into training and test samples in order to measure the predictive accuracy of different approaches to classification.
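One way to create the split is sketched below; the 7,000/3,000 sample sizes are inferred from the output later in the chapter, and the train.id object and random seed are illustrative (the names train and test match the output shown below).

```r
# Randomly split the 10,000 customers into a training set of 7,000
# and a test set of 3,000 (sizes inferred from the output below)
set.seed(1)
train.id <- sample(1:nrow(Default), 7000)
train <- Default[train.id, ]
test  <- Default[-train.id, ]
```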

7.1 \(k\)-Nearest Neighbors


The \(k\)-NN algorithms are built on the following idea: given a new observation \(X^*\) for which we want to predict an associated response \(Y^*\), we can find values of \(X\) in our data that look similar to \(X^*\) and then classify \(Y^*\) based on the associated \(Y\)’s. We will use Euclidean distance as a measure of similarity (which is only defined for real-valued \(X\)’s).

Let’s take a small portion (first 10 rows) of the Default data to work through a simple example. Notice that we will exclude the student variable since it is a categorical rather than numeric variable. We will use the 11th observation as our “test” data \(X^*\) that we want to make predictions for.
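A sketch of setting up this toy example (the object names X, Y, and Xstar are illustrative):

```r
# First 10 observations: numeric predictors only (balance and income)
X <- Default[1:10, c("balance", "income")]
Y <- Default[1:10, "default"]

# The 11th observation plays the role of the test point X*
Xstar <- unlist(Default[11, c("balance", "income")])
```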

We now need to compute the similarity (i.e., Euclidean distance) between \(X^*=(X_1^*,X_2^*)\) and \(X_i=(X_{1i},X_{2i})\) for each \(i=1,\dotsc,n\).

\[dist(X^*,X_i)=||X^*-X_i||=\sqrt{(X_1^*-X_{1i})^2+(X_2^*-X_{2i})^2}\]

To do this in R, we can use the apply( ) function. The first argument is the matrix of \(X\) variables that we want to cycle through to compare with \(X^*\).

The second argument of the apply( ) function tells R whether we want to perform an operation for each row (=1) or for each column (=2). The last argument tells R what function we want to compute. Here, we need to evaluate \(dist(X^*,X_i)\) for each row.
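For example, continuing with the objects defined above:

```r
# Euclidean distance between X* and each of the 10 training observations
distance <- apply(X, 1, function(Xi) sqrt(sum((Xstar - Xi)^2)))
distance
```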

##     1     2     3     4     5     6     7     8     9    10 
## 22502  9799  9954 13844 16611 14409  3144  4347 15641  7404

Notice that the function returns a set of 10 distances. If we wanted to use the 1st-nearest neighbor classifier to predict \(Y^*\), for example, then we would need to find the \(Y\) value of \(X_i\) for the observation \(i\) that has the smallest distance. We can find that value using the which.min( ) function.
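For example:

```r
# Index of the observation with the smallest distance to X*
nearest <- which.min(distance)
nearest

# Response value of that nearest neighbor
Y[nearest]
```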

## 7 
## 7
## [1] No
## Levels: No Yes

Therefore, we would predict \(Y^*=No\) having observed \(X^*\).

Now let’s go back to the full data set and test the performance of the \(k\)-NN classifier. The first thing we should do is standardize the \(X\)’s since the nearest neighbors algorithm depends on the scale of the covariates.
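One way to do this is with the scale( ) function; here the training data are standardized and the same centering and scaling are applied to the test data (a sketch; the exact approach that produced the summary below may differ).

```r
# Standardize balance and income using the training means and SDs
X.train <- scale(train[, c("balance", "income")])
X.test  <- scale(test[, c("balance", "income")],
                 center = attr(X.train, "scaled:center"),
                 scale  = attr(X.train, "scaled:scale"))
summary(X.train)
```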

##     balance           income       
##  Min.   :-1.728   Min.   :-2.4632  
##  1st Qu.:-0.733   1st Qu.:-0.9207  
##  Median :-0.030   Median : 0.0804  
##  Mean   : 0.000   Mean   : 0.0000  
##  3rd Qu.: 0.686   3rd Qu.: 0.7730  
##  Max.   : 3.454   Max.   : 3.0059

Now we can use the knn( ) function in the class R library to run the algorithm on the training data and then make predictions for each observation in the test data. The first argument calls for the \(X\)’s in the training data, the second calls for the \(X\)’s in the test data (for which we want to predict), the third calls for the \(Y\)’s in the training data, and the fourth calls for \(k\), the number of nearest neighbors we want to use to make the prediction.
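For example:

```r
# 1-nearest neighbor predictions for the test data
library(class)
knn1 <- knn(X.train, X.test, train$default, k = 1)
```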

The knn1 object now contains a vector of predicted \(Y\)’s for each value of \(X\) in the test data. We can then compare the predicted response \(\hat{Y}\) to the true response in the test data \(Y\) to assess the performance of the classification algorithm. In particular, we will see the fraction of predictions the algorithm gets wrong.
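For example:

```r
# Out-of-sample misclassification rate for the 1-NN classifier
mean(knn1 != test$default)
```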

## [1] 0.04467

In this case, the 1-NN classifier has an error rate of about 4.5% (or equivalently, an accuracy of about 95.5%).

We can try increasing \(k\) to see if there is any effect on predictive fit.
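A sketch of how this could be done (the particular values of \(k\) shown here are illustrative):

```r
# Misclassification rates for a few larger values of k
for (k in c(5, 10, 50, 100)) {
  knn.k <- knn(X.train, X.test, train$default, k = k)
  print(mean(knn.k != test$default))
}
```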

## [1] 0.029
## [1] 0.026
## [1] 0.024
## [1] 0.027

We would then likely choose the model that predicts best (i.e., has the lowest error/misclassification rate).

The last object of interest when doing classification is the confusion matrix, which allows us to decompose misclassification mistakes into two groups: false positives (predict \(\hat{Y}=1\) when \(Y=0\)) and false negatives (predict \(\hat{Y}=0\) when \(Y=1\)).

Let’s produce the confusion matrix for the 10-NN classifier.
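For example:

```r
# 10-NN predictions and the associated confusion matrix
knn10 <- knn(X.train, X.test, train$default, k = 10)
table(knn10, test$default)

# False positive rate: fraction of true "No" observations predicted "Yes"
mean(knn10[test$default == "No"] == "Yes")

# False negative rate: fraction of true "Yes" observations predicted "No"
mean(knn10[test$default == "Yes"] == "No")
```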

##      
## knn10   No  Yes
##   No  2889   60
##   Yes   18   33
## [1] 0.006192
## [1] 0.6452

The false negative rate is especially high, which would be concerning given the risks to the lending agency (e.g., bank).


7.2 Logistic Regression


The \(k\)-NN algorithms have a few shortcomings: they cannot accommodate categorical \(X\)’s, they are not based on a formal statistical model so we cannot do inference (or learn about how the \(X\)’s relate to \(Y\)), and they implicitly assume that all \(X\)’s matter, and matter equally, in determining \(Y\).

Our first solution to these problems is logistic regression.

Given a response \(Y\in\{0,1\}\) and a set of predictors \(X_1,\dotsc,X_p\), the logistic regression model is written as follows.

\[\text{Pr}(Y=1|X)={\exp(\beta_0+\beta_1X_1+\dotsc+\beta_pX_p)\over 1 + \exp(\beta_0+\beta_1X_1+\dotsc+\beta_pX_p)}\]

The intuition for this formula is as follows. If \(Y\in\{0,1\}\), then we can assume that \(Y\sim\text{Bernoulli}(\theta)\) where \(\theta=\text{Pr}(Y=1)\). We can then write down a regression model for \(\theta\) rather than \(Y\). The only remaining problem is that \(\theta\in(0,1)\), so we need to transform the linear regression function \(h(X)=\beta_0+\beta_1X_1+\dotsc+\beta_pX_p\) so that it is constrained to be between 0 and 1. The function \(e^{h(X)}/(1 + e^{h(X)})\) does just that.
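As a quick check, this transformation maps any real number into the interval (0,1):

```r
# The inverse-logit transformation is bounded between 0 and 1
h <- c(-10, -1, 0, 1, 10)
exp(h) / (1 + exp(h))   # equivalently, plogis(h)
```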

Estimating a logistic regression model in R can be done using the glm( ) function, which is similar to the lm( ) command we use to estimate linear regression models.

Let’s illustrate with the training sample from the Default data set.
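For example (glm1 is an assumed name for the fitted model object):

```r
# Fit the logistic regression of default on student, balance, and income
glm1 <- glm(default ~ student + balance + income,
            family = "binomial", data = train)
summary(glm1)
```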

## 
## Call:
## glm(formula = default ~ student + balance + income, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.187  -0.142  -0.055  -0.020   3.686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.10e+01   5.89e-01  -18.70   <2e-16 ***
## studentYes  -6.46e-01   2.85e-01   -2.27    0.023 *  
## balance      5.83e-03   2.78e-04   20.96   <2e-16 ***
## income       4.71e-06   9.87e-06    0.48    0.633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2090.7  on 6999  degrees of freedom
## Residual deviance: 1109.4  on 6996  degrees of freedom
## AIC: 1117
## 
## Number of Fisher Scoring iterations: 8

Notice that we added one more option in the glm( ) function: family="binomial". This option tells R to use the logistic regression model rather than other types of generalized linear models.

The output from the logistic regression model looks fairly similar to that of linear regression models. However, the interpretation of the model parameters (and their estimates) changes a bit.

For example, we find that the coefficient on balance is estimated to be about 0.0058, which means that a one dollar increase in balance multiplies the odds of default by exp(0.0058)=1.006. Since this number is greater than 1, we can say that increasing the balance increases the odds of default.
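For reference, the odds-ratio interpretation of every coefficient can be obtained by exponentiating the estimates (using the assumed glm1 object from above):

```r
# Multiplicative effect on the odds of default for a one-unit increase in each X
exp(coef(glm1))
```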

To predict responses in the test data, we can use the predict( ) function in R. We again need to add one option: type="response", which will tell R to return the predicted probabilities that \(Y=1\).

Then we can compute \(\hat{Y}\) by using the rule that \(\hat{Y}=\text{Yes}\) if the predicted probability is greater than 0.5 and \(\hat{Y}=\text{No}\) otherwise.

Just as before, we can compare the model predictions with the actual \(Y\)’s in the test data to compute the out-of-sample error (misclassification) rate.
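A sketch of these steps (glm.prob is an illustrative name; glm.predict matches the output below):

```r
# Predicted probabilities that default = "Yes" for the test data
glm.prob <- predict(glm1, newdata = test, type = "response")

# Classify as "Yes" when the predicted probability exceeds 0.5
glm.predict <- ifelse(glm.prob > 0.5, "Yes", "No")

# Out-of-sample misclassification rate
mean(glm.predict != test$default)
```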

## [1] 0.024

This error rate can be decomposed by producing the associated confusion matrix and computing the false positive and false negative rates.
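For example:

```r
# Confusion matrix for the logistic regression classifier
table(glm.predict, test$default)

# False positive and false negative rates
mean(glm.predict[test$default == "No"] == "Yes")
mean(glm.predict[test$default == "Yes"] == "No")
```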

##            
## glm.predict   No  Yes
##         No  2896   61
##         Yes   11   32
## [1] 0.003784
## [1] 0.6559


7.3 Classification Trees


Classification trees offer the same advantages over logistic regression that regression trees do for linear regression. That is, classification trees provide a classification rule that does not assume any form of linearity in the covariates \(X\).

The nice thing is that their implementation in R is nearly identical to that of regression trees.
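A minimal sketch, assuming the tree package is used (tree1 is an illustrative object name; a different tree-fitting package such as rpart may have been used in the original):

```r
# Fit a classification tree on the training data
library(tree)
tree1 <- tree(default ~ student + balance + income, data = train)
```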

We can again use the predict( ) function to predict the response values for the test data and compute the out-of-sample error (misclassification) rate. We need to specify the type="class" option so that the predict( ) function returns the predicted values \(\hat{Y}\).
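For example, continuing with the assumed tree1 object:

```r
# Predicted classes for the test data
tree.predict <- predict(tree1, newdata = test, type = "class")

# Out-of-sample misclassification rate
mean(tree.predict != test$default)
```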

## [1] 0.027

Finally, the error rate can be decomposed by producing the associated confusion matrix and computing the false positive and false negative rates.
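For example:

```r
# Confusion matrix for the classification tree
table(tree.predict, test$default)

# False positive and false negative rates
mean(tree.predict[test$default == "No"] == "Yes")
mean(tree.predict[test$default == "Yes"] == "No")
```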

##             
## tree.predict   No  Yes
##          No  2880   54
##          Yes   27   39
## [1] 0.009288
## [1] 0.5806