# Chapter 7 Classification

Classification shares many similarities with regression: We have a response variable \(Y\) and a set of one or more predictors \(X_1,\dotsc,X_p\). The difference is that for classification problems, the response \(Y\) is **discrete**, meaning \(Y\in\{1,2,\dotsc,C\}\) where \(C\) is the number of classes that \(Y\) can take on.

We will focus our attention on **binary** responses \(Y\in\{0,1\}\), but all of the methods we discuss can be extended to the more general case outlined above.

To illustrate classification methods, we will use the Default data in the `ISLR` R library. The data set contains four variables: `default` is an indicator of whether the customer defaulted on their debt, `student` is an indicator of whether the customer is a student, `balance` is the average balance that the customer has remaining on their credit card after making their monthly payment, and `income` is the customer’s income.

default | student | balance | income |
---|---|---|---|
No | No | 729.5 | 44362 |
No | Yes | 817.2 | 12106 |
No | No | 1073.5 | 31767 |
No | No | 529.3 | 35704 |
No | No | 785.7 | 38463 |
No | Yes | 919.6 | 7492 |

We also need to split up the data into training and test samples in order to measure the predictive accuracy of different approaches to classification.
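The exact split is not shown here; a minimal sketch, assuming a 7,000/3,000 split of the 10,000 observations and the object names `train` and `test` used throughout the rest of the chapter:

```
library(ISLR)

# split the Default data into a training sample (7,000 obs) and a test sample (3,000 obs)
# (the seed and the sampling scheme here are assumptions)
set.seed(1)
train.index = sample(1:nrow(Default), 7000)
train = Default[train.index, ]
test = Default[-train.index, ]
```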

## 7.1 \(k\)-Nearest Neighbors

The \(k\)-NN algorithms are built on the following idea: given a new observation \(X^*\) for which we want to predict an associated response \(Y^*\), we can find values of \(X\) in our data that look similar to \(X^*\) and then classify \(Y^*\) based on the associated \(Y\)’s. We will use Euclidean distance as a measure of similarity (which is only defined for real-valued \(X\)’s).

Let’s take a small portion (first 10 rows) of the Default data to work through a simple example. Notice that we will exclude the `student` variable since it is a categorical rather than numeric variable. We will use the 11th observation as our “test” data \(X^*\) that we want to make predictions for.

We now need to compute the similarity (i.e., Euclidean distance) between \(X^*=(X_1^*,X_2^*)\) and \(X_i=(X_{1i},X_{2i})\) for each \(i=1,\dotsc,n\).

\[dist(X^*,X_i)=||X^*-X_i||=\sqrt{(X_1^*-X_{1i})^2+(X_2^*-X_{2i})^2}\]

To do this in R, we can use the `apply( )` function. The first argument is the matrix of \(X\) variables that we want to cycle through to compare with \(X^*\). The second argument of the `apply( )` function tells R whether we want to perform an operation for each row `(=1)` or for each column `(=2)`. The last argument tells R what function we want to compute. Here, we need to evaluate \(dist(X^*,X_i)\) for each row.
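A sketch of this computation; the names `smallX` (balance and income for the first 10 observations) and `Xstar` (the 11th observation) are assumptions, not names from the original code:

```
# first 10 observations (excluding student) and the 11th "test" observation
smallX = Default[1:10, c("balance", "income")]
Xstar = as.numeric(Default[11, c("balance", "income")])

# Euclidean distance between Xstar and each row of smallX
distances = apply(smallX, 1, function(Xi) sqrt(sum((Xstar - Xi)^2)))
round(distances)
```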

```
## 1 2 3 4 5 6 7 8 9 10
## 22502 9799 9954 13844 16611 14409 3144 4347 15641 7404
```

Notice that the function returns a set of 10 distances. If we wanted to use the 1st-nearest neighbor classifier to predict \(Y^*\), for example, then we would need to find the \(Y\) value of \(X_i\) for the observation \(i\) that has the smallest distance. We can find that value using the `which.min( )` function.
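Continuing the sketch (again assuming the `distances` object from above):

```
# index of the nearest neighbor
which.min(distances)

# response value of the nearest neighbor: our prediction for Ystar
Default$default[which.min(distances)]
```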

```
## 7
## 7
```

```
## [1] No
## Levels: No Yes
```

Therefore, we would predict \(Y^*=No\) having observed \(X^*\).

Now let’s go back to the full data set and test the performance of the \(k\)-NN classifier. The first thing we should do is standardize the \(X\)’s since the nearest neighbors algorithm depends on the scale of the covariates.
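The standardization could be done with `scale( )`; the matrix names `stdtrainX` and `stdtestX` match the code used below, but the exact construction here is an assumption (note that the test data are scaled using the training means and standard deviations):

```
# standardize the X's in the training data
stdtrainX = scale(train[, c("balance", "income")])

# apply the same centering and scaling to the test data
stdtestX = scale(test[, c("balance", "income")],
                 center = attr(stdtrainX, "scaled:center"),
                 scale = attr(stdtrainX, "scaled:scale"))

summary(stdtrainX)
```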

```
## balance income
## Min. :-1.728 Min. :-2.4632
## 1st Qu.:-0.733 1st Qu.:-0.9207
## Median :-0.030 Median : 0.0804
## Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 0.686 3rd Qu.: 0.7730
## Max. : 3.454 Max. : 3.0059
```

Now we can use the `knn( )` function in the `class` R library to run the algorithm on the training data and then make predictions for each observation in the test data. The first argument calls for the \(X\)’s in the training data, the second calls for the \(X\)’s in the test data (for which we want to predict), the third calls for the \(Y\)’s in the training data, and the fourth calls for \(k\), the number of nearest neighbors we want to use to make the prediction.

The `knn1` object now contains a vector of predicted \(Y\)’s for each value of \(X\) in the test data. We can then compare the predicted response \(\hat{Y}\) to the true response in the test data \(Y\) to assess the performance of the classification algorithm. In particular, we will compute the fraction of predictions the algorithm gets wrong (the misclassification rate).
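The call for \(k=1\) mirrors the blocks for larger \(k\) shown below:

```
library(class)

# 1 nearest neighbor
knn1 = knn(stdtrainX, stdtestX, train$default, k=1)
mean(knn1 != test$default)
```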

`## [1] 0.04467`

In this case, the 1-NN classifier has an error rate of about 4.5% (or equivalently, an accuracy of 95.5%).

We can try increasing \(k\) to see if there is any effect on predictive fit.

```
# 5 nearest neighbors
knn5 = knn(stdtrainX, stdtestX, train$default, k=5)
mean(knn5 != test$default)
```

`## [1] 0.029`

```
# 10 nearest neighbors
knn10 = knn(stdtrainX, stdtestX, train$default, k=10)
mean(knn10 != test$default)
```

`## [1] 0.026`

```
# 50 nearest neighbors
knn50 = knn(stdtrainX, stdtestX, train$default, k=50)
mean(knn50 != test$default)
```

`## [1] 0.024`

```
# 100 nearest neighbors
knn100 = knn(stdtrainX, stdtestX, train$default, k=100)
mean(knn100 != test$default)
```

`## [1] 0.027`

We would then likely choose the model that predicts best (i.e., has the lowest error/misclassification rate).

The last object of interest when doing classification is the **confusion matrix**, which allows us to decompose misclassification mistakes into two groups: **false positives** (predict \(\hat{Y}=1\) when \(Y=0\)) and **false negatives** (predict \(\hat{Y}=0\) when \(Y=1\)).

Let’s produce the confusion matrix for the 10-NN classifier.
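A sketch of how the table and the two rates could be computed (the exact code is an assumption; rows of the table are the predicted classes and columns the true classes):

```
# confusion matrix: rows are predictions, columns are true responses
conf.mat = table(knn10, test$default)
conf.mat

# false positive rate: fraction of true "No" observations predicted as "Yes"
conf.mat["Yes", "No"] / sum(conf.mat[, "No"])

# false negative rate: fraction of true "Yes" observations predicted as "No"
conf.mat["No", "Yes"] / sum(conf.mat[, "Yes"])
```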

```
##
## knn10 No Yes
## No 2889 60
## Yes 18 33
```

`## [1] 0.006192`

`## [1] 0.6452`

The false negative rate is especially high, which would be concerning given the risks to the lending agency (e.g., bank).

## 7.2 Logistic Regression

Issues with the \(k\)-NN algorithms include the fact that they can’t accommodate categorical \(X\)’s, that they aren’t based on a formal statistical model (so we can’t do inference or learn how the \(X\)’s relate to \(Y\)), and that they implicitly assume all \(X\)’s matter, and matter equally, in determining \(Y\).

Our first solution to these problems is **logistic regression**.

Given a response \(Y\in\{0,1\}\) and a set of predictors \(X_1,\dotsc,X_p\), the logistic regression model is written as follows.

\[\text{Pr}(Y=1|X)={\exp(\beta_0+\beta_1X_1+\dotsc+\beta_pX_p)\over 1 + \exp(\beta_0+\beta_1X_1+\dotsc+\beta_pX_p)}\]

The intuition for this formula is as follows. If \(Y\in\{0,1\}\), then we can assume that \(Y\sim\text{Bernoulli}(\theta)\) where \(\theta=\text{Pr}(Y=1)\). We can then write down a regression model for \(\theta\) rather than \(Y\). The only remaining problem is that \(\theta\in(0,1)\), so we need to transform the linear regression function \(h(X)=\beta_0+\beta_1X_1+\dotsc+\beta_pX_p\) so that it is constrained to be between 0 and 1. The function \(e^{h(X)}/(1 + e^{h(X)})\) does just that.

Estimating a logistic regression model in R can be done using the `glm( )` function, which is similar to the `lm( )` command we use to estimate linear regression models.

Let’s illustrate with the training sample from the Default data set.
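The call producing the output below would look roughly like this; the object name `glm.fit` is an assumption:

```
# fit a logistic regression of default on student, balance, and income
glm.fit = glm(default ~ student + balance + income, family="binomial", data=train)
summary(glm.fit)
```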

```
##
## Call:
## glm(formula = default ~ student + balance + income, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.187 -0.142 -0.055 -0.020 3.686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.10e+01 5.89e-01 -18.70 <2e-16 ***
## studentYes -6.46e-01 2.85e-01 -2.27 0.023 *
## balance 5.83e-03 2.78e-04 20.96 <2e-16 ***
## income 4.71e-06 9.87e-06 0.48 0.633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2090.7 on 6999 degrees of freedom
## Residual deviance: 1109.4 on 6996 degrees of freedom
## AIC: 1117
##
## Number of Fisher Scoring iterations: 8
```

Notice that we added one more option in the `glm( )` function: `family="binomial"`. This option tells R to use the logistic regression model rather than other types of *generalized linear models*.

The output from the logistic regression model looks fairly similar to that of linear regression models. However, the interpretation of the model parameters (and their estimates) changes a bit.

For example, we find that the coefficient on balance is estimated to be about 0.0058, which means that a one dollar increase in balance multiplies the odds of default by exp(0.0058)=1.006. Since this number is greater than 1, we can say that increasing the balance *increases* the odds of default.
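This multiplier can be recovered directly from the fitted coefficients (using the assumed `glm.fit` object from above):

```
# multiplicative effect on the odds of default of a one dollar increase in balance
exp(coef(glm.fit)["balance"])
```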

To predict responses in the test data, we can use the `predict( )` function in R. We again need to add one option: `type="response"`, which will tell R to return the predicted probabilities that \(Y=1\).

Then we can compute \(\hat{Y}\) by using the rule that \(\hat{Y}=\text{Yes}\) if the predicted probability is greater than 0.5 and \(\hat{Y}=\text{No}\) otherwise.

Just as before, we can compare the model predictions with the actual \(Y\)’s in the test data to compute the out-of-sample error (misclassification) rate.
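A sketch of these steps; `glm.prob` is an assumed name, while `glm.predict` matches the name in the confusion matrix output below:

```
# predicted probabilities that Y = Yes for the test data
glm.prob = predict(glm.fit, newdata=test, type="response")

# classify as Yes when the predicted probability exceeds 0.5
glm.predict = ifelse(glm.prob > 0.5, "Yes", "No")

# out-of-sample misclassification rate
mean(glm.predict != test$default)
```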

`## [1] 0.024`

This error rate can be decomposed by producing the associated confusion matrix and computing the false positive and false negative rates.

```
##
## glm.predict No Yes
## No 2896 61
## Yes 11 32
```

`## [1] 0.003784`

`## [1] 0.6559`

## 7.3 Classification Trees

Classification trees offer the same advantages over logistic regression that regression trees do for linear regression. That is, classification trees provide a classification rule that does not assume any form of linearity in the covariates \(X\).

The nice thing is that their implementation in R is nearly identical to that of regression trees.

```
library(rpart)
# estimate classification tree
tree.fit = rpart(default ~ student + balance + income, method="class", data=train)
# plot estimated tree
plot(tree.fit,uniform=TRUE,margin=0.05,main="DEFAULT")
text(tree.fit)
```

We can again use the `predict( )` function to predict the response values for the test data and compute the out-of-sample error (misclassification) rate. We need to specify the `type="class"` option so that the `predict( )` function returns the predicted values \(\hat{Y}\).
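A sketch of this step; the name `tree.predict` matches the confusion matrix output below:

```
# predicted classes for the test data
tree.predict = predict(tree.fit, newdata=test, type="class")

# out-of-sample misclassification rate
mean(tree.predict != test$default)
```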

`## [1] 0.027`

Finally, the error rate can be decomposed by producing the associated confusion matrix and computing the false positive and false negative rates.

```
##
## tree.predict No Yes
## No 2880 54
## Yes 27 39
```

`## [1] 0.009288`

`## [1] 0.5806`