---
title: "Post 1: k-Nearest Neighbors Algorithm"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: html_notebook
---
<br/><br/>
*__Machine Learning Series Intro__*

*All posts from this series can be found [here](https://smadaplaysfantasy.com/mlr/).*

*During this time at home that most of us have been forced into, I've decided to start retooling my arsenal as a statistician. After getting my Master's in Applied Statistics, I've spent the past 7 years working in marketing analytics. I've kept some aspects of these skills sharp, in particular linear regression and coding in R (building Shiny apps and developing a tidyverse toolbox). But it's time to both refresh and learn some new statistical techniques.*

*I decided to dive into machine learning and to do it in R. It's a topic to which I've had little exposure, so I have a lot to learn. After coming across the book [Machine Learning with R, the tidyverse, and mlr](https://www.manning.com/books/machine-learning-with-r-the-tidyverse-and-mlr) by [Hefin Rhys](https://twitter.com/HRJ21), I decided to buy it and make the leap. I'm not far into the book, but so far it's been easy to follow along and does a great job covering the tidyverse.*

*My goal is to go through the entire book chapter by chapter, and as I'm a big baseball nerd, create small projects with baseball data for each algorithm.*

<br/>

### kNN Project: Predicting which MiLB hitters from A+ in 2013 made it to MLB
<br/>
After going through the entire chapter on k-nearest neighbors (kNN), I decided to use kNN to build a model to predict which minor league players from high-A (A+) in 2013 ended up making a major league debut. I used a 100 plate appearance (PA) cutoff to be included in the data.

I settled on four predictors: the player's seasonal age (Age) plus three rate statistics, strikeout percentage (K%), walk percentage (BB%) and isolated slugging percentage (ISO). The rate stats are typically decent indicators of player performance, while age plays a big factor in projecting talent. Additionally, the three performance metrics shouldn't be strongly correlated with one another.
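
The three rate stats are simple ratios of counting stats. As a quick illustration with made-up numbers (the FanGraphs export already provides these columns pre-computed, so this is purely for reference):

```{r}
# Derive the three rate stats from counting stats (hypothetical player line)
pa <- 450    # plate appearances
ab <- 400    # at-bats
so <- 90     # strikeouts
bb <- 40     # walks
h  <- 110    # hits
tb <- 170    # total bases

k_pct  <- so / pa          # K%  = strikeouts per PA
bb_pct <- bb / pa          # BB% = walks per PA
iso    <- (tb - h) / ab    # ISO = slugging percentage minus batting average

round(c(K = k_pct, BB = bb_pct, ISO = iso), 3)
#     K    BB   ISO
# 0.200 0.089 0.150
```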

I pulled the data from FanGraphs; the leaderboard can be accessed [here](https://www.fangraphs.com/leaders/minor-league?pos=all&lg=8,9,10&stats=bat&qual=100&type=1&team=&season=2013&seasonEnd=2013&org=&ind=0&splitTeam=false&players=&sort=18,1).

```{r message=FALSE}
library("tidyverse")
library("mlr")

MiLB2013 <- read_csv("data/MiLB_HighA_2013.csv")

# Remove the player names from the data
MiLB2013knn <- select(MiLB2013,-Name)

summary(MiLB2013knn)

MiLB2013knn %>% 
  group_by(MLB) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
```

To figure out what value of k is optimal for the kNN model, I did hyperparameter tuning for k. I initially expected k to be on the lower side and searched over k = 1:10. When k=10 came back as the best performer, I expanded the search to k = 1:25. Again the boundary value, k=25, was best, so I opened the search up to k = 1:60, as shown below.

```{r message=FALSE}
# Task
MiLBTask <- makeClassifTask(data = MiLB2013knn, target = "MLB")

# hyperparameter tuning
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:60))
gridSearch <- makeTuneControlGrid()
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)
tunedMiLB <- tuneParams("classif.knn", task = MiLBTask,
                        resampling = cvForTuning,
                        par.set = knnParamSpace, control = gridSearch)
tunedMiLB
tunedMiLB$x

knnTuningData <- generateHyperParsEffectData(tunedMiLB)

plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
  theme_bw()

```

The value of k that minimizes the mean mmce is 33. The graph above displays the mean mmce for every value of k between 1 and 60. While 33 is the minimum on this run, any k between roughly 28 and 35 could work; in another iteration, k=35 had the lowest value.

I expected a lower optimum value for k because the data only has 461 observations.
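
That expectation also lines up with a rough rule of thumb (not from the book, just a sanity check) that puts k somewhere near the square root of the number of observations:

```{r}
# Heuristic only: k near sqrt(n) is a common starting point, not a tuning substitute
n <- 461
round(sqrt(n))
# [1] 21
```

That 21 sits between the two k values examined in this post, so the tuned values are at least in a plausible neighborhood.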

```{r message=FALSE}
tunedKnnMiLB <- setHyperPars(makeLearner("classif.knn"),
                             par.vals = tunedMiLB$x)

tunedKnnModelMiLB <- train(tunedKnnMiLB, MiLBTask)
```

<br/><br/>

__Model Performance__ 

To evaluate performance I used repeated k-fold cross-validation with 5 folds and 50 repetitions, starting with the tuned k=33.

```{r message=FALSE}

knn <- makeLearner("classif.knn", par.vals = list("k"=33))
knnModel <- train(knn, MiLBTask)

kFold <- makeResampleDesc(method = "RepCV", folds = 5, reps = 50,
                          stratify = TRUE)

kFoldCV <- resample(learner = knn, task = MiLBTask,
                    resampling = kFold, measures = list(mmce, acc), show.info = FALSE)

kFoldCV$aggr

calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)

```

The confusion matrix above shows that "No" MLB predictions were correct 82% of the time and "Yes" MLB predictions were correct 72% of the time.

However, the model misclassified 66% of the players that did make it to MLB.

So the model seems to be conservative when predicting that a player will make it to MLB. The data had 24% of players make it to MLB and in the cross validation, it only predicted "Yes" 11% of the time.

Ultimately the model does a pretty good job of identifying which players won't make MLB, but a poor job of identifying who does.
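
Those shares fall straight out of the absolute confusion matrix, whose counts are aggregated over all held-out folds. A quick back-of-the-envelope check using the counts from the run above (they will vary slightly from run to run):

```{r}
# Counts from one run's absolute confusion matrix for the k=33 model
no_no   <- 16720   # true No,  predicted No
no_yes  <- 730     # true No,  predicted Yes
yes_no  <- 3704    # true Yes, predicted No
yes_yes <- 1896    # true Yes, predicted Yes
total   <- no_no + no_yes + yes_no + yes_yes

actual_yes    <- (yes_no + yes_yes) / total   # share of players who made MLB
predicted_yes <- (no_yes + yes_yes) / total   # share the model labeled "Yes"
round(c(actual = actual_yes, predicted = predicted_yes), 2)
#    actual predicted
#      0.24      0.11
```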

Because the "Yes" error was so high, I wondered whether a lower k value might trade off some overall model error for a better "Yes" MLB result. Looking back at the hyperparameter plot, there was a local minimum below k=20, so I tested the model with k=17.

```{r}
knn2 <- makeLearner("classif.knn", par.vals = list("k"=17))
knnModel2 <- train(knn2, MiLBTask)

# Use same CV as above
# kFold <- makeResampleDesc(method = "RepCV", folds = 5, reps = 50,
#                          stratify = TRUE)

kFoldCV2 <- resample(learner = knn2, task = MiLBTask,
                    resampling = kFold, measures = list(mmce, acc), show.info = FALSE)

kFoldCV2$aggr

calculateConfusionMatrix(kFoldCV2$pred, relative = TRUE)
```

The result here is mixed. When the model predicted that a player made MLB, it was correct only 65% of the time, down from 72% for the first model. However, looking at the players that did make it to MLB, the model correctly classified 39%, up from 34% in the first model.

The number of total players that this model predicted would make MLB is 15%, up from 11% from the first model. This is still well short of the 24% mark that did make MLB.

Which model is best depends on your goals. I'm leaning toward the second model because it comes closer to predicting how many players actually made it, even if it's wrong more often. On average per repetition, model 1 correctly predicted 38 players to make MLB, while model 2 correctly predicted 44. Considering 112 players from 2013 made it up, that's still not a great model!
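
Those per-model averages are just the aggregated true-positive counts divided by the 50 CV repetitions; checking the arithmetic with the counts from the confusion matrices above:

```{r}
# Aggregated true positives over 50 repetitions (from the CV output above)
reps   <- 50
tp_k33 <- 1896   # model 1, k = 33
tp_k17 <- 2209   # model 2, k = 17
round(c(model1 = tp_k33 / reps, model2 = tp_k17 / reps))
# model1 model2
#     38     44
```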

<br/><br/>
__Predictions__

Even though neither model did a great job in cross-validation, and both were trained on all of the 2013 data, I wanted to explore predicting some 2014 players with them.

I selected the players ranked 26-30 in wRC+ in 2014: Carlos Correa, Derrick Chung, Tyler White, Kyle Waldrop and Daniel Carbonell. That gives us a major league regular who debuted at a young age in Correa, a quad-A hitter in White, a cup-of-coffee MLB player in Waldrop and two players who never made it up to the bigs.

```{r}
# data for Carlos Correa, Derrick Chung, Tyler White, Kyle Waldrop, Daniel Carbonell
newMiLBdata <- tibble(Age = c(19, 26, 23, 22, 23),
                      BB = c(.123, .095, .151, .076, .06),
                      K = c(.154, .1, .145, .194, .19),
                      ISO = c(.185, .118, .260, .156, .194))

# Using k=33
newMiLBPred1 <- predict(knnModel, newdata = newMiLBdata)

getPredictionResponse(newMiLBPred1)

# Using k=17
newMiLBPred2 <- predict(knnModel2, newdata = newMiLBdata)

getPredictionResponse(newMiLBPred2)
```

The first kNN model (k=33) predicted only Carlos Correa to make MLB out of the 5 players. In past iterations it also correctly selected Kyle Waldrop to make it to MLB. The model's conservatism in handing out "Yes" MLB predictions shows up even in this small sample. In both iterations, the model was wrong on Tyler White. Given that his stats were all superior to Waldrop's (and Waldrop was predicted correctly in a past iteration), this points to age potentially having a large impact in the kNN model.
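
One plausible reason for that age effect, flagged here as a conjecture rather than something tested above: Age spans roughly 19 to 29 while the three rate stats all live between 0 and about 0.33, so if the features enter the Euclidean distance unscaled, even a small age gap swamps a realistic gap in any rate stat. A toy comparison:

```{r}
# Toy example: distance contribution of a 3-year age gap vs a 50-point ISO gap,
# assuming the features are used unscaled in the Euclidean distance
age_gap <- 3       # e.g., a 22-year-old vs a 25-year-old
iso_gap <- 0.050   # a meaningful ISO difference at this level

age_gap / iso_gap  # the age gap contributes 60x the distance
# [1] 60
```

If that imbalance is a concern, standardizing the predictors (e.g., with base R's `scale()`) before building the task is a common remedy for kNN.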

The second kNN model (k=17) correctly predicted the playing futures of all 5 players. Given that this model's "Yes" predictions were only correct about 65% of the time in cross-validation, the results here are either (a) lucky or (b) biased by my selection of high-performing 2014 players (ranked 26-30 in wRC+).

<br/><br/>
__Closing Thoughts__

While kNN was easy to implement and fun to test on this question, it may not be the best algorithm for predicting which players will make MLB. It's easy to identify who won't make it, but much harder to identify who will! It would also be nice to have information about the variables used in the model, like which one is most closely related to making it to the big leagues; that's something kNN doesn't offer.

I also want to note that a decent amount of noise may have been introduced by the low bar of 100 plate appearances for inclusion in the data. It's well known that any player can get "hot" or "cold" for 100+ PA; see our 2014 example of Derrick Chung! Ideally we'd be looking at more data than just A+ performance to gauge players' full abilities.

I'll be looking to use this data in future algorithms - including in the next chapter on logistic regression!

<br/><br/>
