kNN Project: Predicting which MiLB hitters from A+ in 2013 made it to MLB
After going through the entire chapter on k-nearest neighbors (kNN), I decided to use kNN to build a model to predict which minor league players from high-A (A+) in 2013 ended up making a major league debut. Players needed at least 100 plate appearances (PA) to be included in the data.
I settled on four predictors: the player’s seasonal age (Age) and three rate statistics, namely strikeout percentage (K%), walk percentage (BB%) and isolated power (ISO). The performance stats are typically decent indicators of player performance, while age is a big factor in projecting talent. Additionally, the three performance metrics shouldn’t have much covariance between them.
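For reference, the three rate statistics can be computed from a standard batting line as sketched below. The counting stats are made up for a hypothetical player, and the formulas are the usual public definitions (K% and BB% per plate appearance, ISO as extra bases per at-bat) rather than anything taken from the data export itself.

```r
# Hypothetical batting line (illustrative numbers, not a real player)
pa  <- 450  # plate appearances
ab  <- 400  # at-bats
so  <- 90   # strikeouts
bb  <- 40   # walks
x2b <- 25   # doubles
x3b <- 3    # triples
hr  <- 12   # home runs

k_pct  <- so / pa                        # strikeout rate per plate appearance
bb_pct <- bb / pa                        # walk rate per plate appearance
iso    <- (x2b + 2 * x3b + 3 * hr) / ab  # isolated power, equal to SLG - AVG

round(c(K = k_pct, BB = bb_pct, ISO = iso), 3)
```

These are stored as fractions (0.20 rather than 20%), which matches the scale of the BB, K and ISO columns in the data.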
I pulled the data from FanGraphs, and the leaderboard can be accessed here.
library("tidyverse")
library("mlr")
MiLB2013 <- read_csv("data/MiLB_HighA_2013.csv")
# Remove the player names from the data
MiLB2013knn <- select(MiLB2013,-Name)
summary(MiLB2013knn)
      Age              BB                  K                ISO               MLB
 Min.   :19.00   Min.   :0.009901   Min.   :0.05991   Min.   :0.009615   Length:461
 1st Qu.:22.00   1st Qu.:0.065089   1st Qu.:0.15353   1st Qu.:0.085308   Class :character
 Median :23.00   Median :0.085389   Median :0.19312   Median :0.125767   Mode  :character
 Mean   :22.88   Mean   :0.087832   Mean   :0.19784   Mean   :0.130792
 3rd Qu.:24.00   3rd Qu.:0.107784   3rd Qu.:0.23709   3rd Qu.:0.174545
 Max.   :29.00   Max.   :0.204409   Max.   :0.40415   Max.   :0.325243
MiLB2013knn %>%
group_by(MLB) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
To find the optimal k for the kNN model, I tuned k as a hyperparameter. I initially expected a fairly low k and started with k = 1:10. When k=10 performed best, I expanded the range to k=25. Again the upper limit, k=25, performed best, so I opened the search up to 60, as shown below.
# Task
MiLBTask <- makeClassifTask(data = MiLB2013knn, target = "MLB")
Provided data is not a pure data.frame but from class tbl_df, hence it will be converted.
# Hyperparameter tuning
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:60))
gridSearch <- makeTuneControlGrid()
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)
tunedMiLB <- tuneParams("classif.knn", task = MiLBTask,
resampling = cvForTuning,
par.set = knnParamSpace, control = gridSearch)
tunedMiLB
Tune result:
Op. pars: k=35
mmce.test.mean=0.1906707
tunedMiLB$x
$k
[1] 35
knnTuningData <- generateHyperParsEffectData(tunedMiLB)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
plot.type = "line") +
theme_bw()

The graph above displays the mean mmce for every value of k between 1 and 60. In the run I plotted, k=33 minimized the mean mmce, but any k from about 28 to 35 performs similarly. The cross-validation folds are drawn at random, so the optimum moves between runs: in another iteration, the one whose tuning output is printed above, k=35 had the lowest value.
I expected a lower optimum value for k because the data only has 461 observations.
tunedKnnMiLB <- setHyperPars(makeLearner("classif.knn"),
par.vals = tunedMiLB$x)
tunedKnnModelMiLB <- train(tunedKnnMiLB, MiLBTask)
Model Performance
To evaluate performance I used repeated k-fold cross-validation with 5 folds and 50 repetitions, stratified by MLB outcome, and started with the tuned k=33.
knn <- makeLearner("classif.knn", par.vals = list("k"=33))
knnModel <- train(knn, MiLBTask)
kFold <- makeResampleDesc(method = "RepCV", folds = 5, reps = 50,
stratify = TRUE)
kFoldCV <- resample(learner = knn, task = MiLBTask,
resampling = kFold, measures = list(mmce, acc), show.info = FALSE)
kFoldCV$aggr
mmce.test.mean acc.test.mean
0.1923622 0.8076378
calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
        predicted
true     No        Yes       -err.-
  No     0.96/0.82 0.04/0.28 0.04
  Yes    0.66/0.18 0.34/0.72 0.66
  -err.-      0.18      0.28 0.19

Absolute confusion matrix:
        predicted
true        No  Yes -err.-
  No     16720  730    730
  Yes     3704 1896   3704
  -err.-  3704  730   4434
The confusion matrix above shows that “No” MLB predictions were correct 82% of the time and “Yes” MLB predictions were correct 72% of the time.
However, of the players that did make it to MLB, it was wrong on their classification 66% of the time.
So the model is conservative when predicting that a player will make it to MLB: 24% of the players in the data made it to MLB, but in cross-validation the model predicted “Yes” only 11% of the time.
Ultimately the model does a pretty good job of identifying which players won’t make MLB but a horrible job identifying who ultimately does make MLB.
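The percentages quoted above can be recovered from the absolute confusion matrix with a few lines of base R. This is only a sketch that retypes the counts from the output above. Because each of the 461 players is predicted once per repetition, the counts should sum to 461 × 50 = 23,050.

```r
# Counts copied from the absolute confusion matrix for k = 33
cm <- matrix(c(16720,  730,
                3704, 1896),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("No", "Yes"),
                             predicted = c("No", "Yes")))

sum(cm) == 461 * 50                         # each player is predicted once per repetition

by_true      <- prop.table(cm, margin = 1)  # rows sum to 1: split of each true class
by_predicted <- prop.table(cm, margin = 2)  # columns sum to 1: split of each prediction

by_true["Yes", "No"]                        # share of actual MLB players the model missed
by_predicted["Yes", "Yes"]                  # share of "Yes" predictions that were right
yes_share <- sum(cm[, "Yes"]) / sum(cm)     # share of all predictions that were "Yes"
yes_share
```

The two numbers in each cell of the relative matrix (for example 0.34/0.72) are the row-normalized and column-normalized values, in that order.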
Because the “Yes” error was so high, I wondered whether a lower k value might trade some overall model error for a somewhat better “Yes” MLB result. Looking back at the hyperparameter plot, there was a local minimum under 20, so I tested the model with k=17.
knn2 <- makeLearner("classif.knn", par.vals = list("k"=17))
knnModel2 <- train(knn2, MiLBTask)
# Use same CV as above
# kFold <- makeResampleDesc(method = "RepCV", folds = 5, reps = 50,
# stratify = TRUE)
kFoldCV2 <- resample(learner = knn2, task = MiLBTask,
resampling = kFold, measures = list(mmce, acc), show.info = FALSE)
kFoldCV2$aggr
mmce.test.mean acc.test.mean
0.1995227 0.8004773
calculateConfusionMatrix(kFoldCV2$pred, relative = TRUE)
Relative confusion matrix (normalized by row/column):
        predicted
true     No        Yes       -err.-
  No     0.93/0.83 0.07/0.35 0.07
  Yes    0.61/0.17 0.39/0.65 0.61
  -err.-      0.17      0.35 0.20

Absolute confusion matrix:
        predicted
true        No  Yes -err.-
  No     16242 1208   1208
  Yes     3391 2209   3391
  -err.-  3391 1208   4599
The result here is mixed. When the model predicted that a player made MLB, it was correct only 65% of the time, down from 72% for the first model. However, of the players that did make it to MLB, the model correctly classified 39%, up from 34% in the first model.
This model predicted “Yes” for 15% of players, up from 11% for the first model. This is still well short of the 24% of players that did make MLB.
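One way to see why both models hand out so few “Yes” predictions: assuming the classifier takes a majority vote among the k neighbors, a player is only called “Yes” when more than half of those neighbors reached MLB. The sketch below is a deliberately crude thought experiment, not a description of the fitted model. It treats each neighbor as an independent draw at the overall MLB rate, which real neighbors (who resemble the player) are not, and the helper `p_majority_yes` is a name made up for this illustration.

```r
base_rate <- 112 / 461   # share of 2013 players in the data who reached MLB

# Probability that strictly more than half of k random neighbors are "Yes",
# under the (unrealistic) assumption of independent draws at rate p
p_majority_yes <- function(k, p) {
  1 - pbinom(floor(k / 2), size = k, prob = p)
}

p_k33 <- p_majority_yes(33, base_rate)
p_k17 <- p_majority_yes(17, base_rate)
c(k33 = p_k33, k17 = p_k17)
```

Under this simplification a “Yes” majority is rarer at k=33 than at k=17, which is in line with the smaller k predicting “Yes” more often.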
Which model is best depends on your goals. I’m leaning toward the second model because it’s making a closer attempt at predicting how many players actually made it, even if it’s wrong more often. On average per repetition, model 1 correctly identified 38 of the players who made MLB, while model 2 correctly identified 44. Considering that 112 players from the 2013 data made it up, neither is a great model!
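The averages quoted above follow from dividing the correct “Yes” counts in the two absolute confusion matrices by the 50 repetitions. A quick check, with the counts retyped from the outputs above:

```r
reps       <- 50                         # repetitions of the 5-fold cross-validation
true_pos   <- c(k33 = 1896, k17 = 2209)  # correct "Yes" predictions, summed over repetitions
actual_yes <- 5600                       # "Yes" row total in either matrix

true_pos / reps     # average correct "Yes" calls per repetition
actual_yes / reps   # players in the data who actually reached MLB
```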
Predictions
Even though neither model did a great job in cross validation and the models use all 2013 data, I wanted to explore predicting some players from 2014 using our models.
I selected players that were ranked 26-30 in wRC+ in 2014. These players were Carlos Correa, Derrick Chung, Tyler White, Kyle Waldrop and Daniel Carbonell. We have a major league regular that debuted at a young age in Correa, a quad-A hitter in Tyler White, a cup of coffee MLB player in Waldrop and two other players that never made it up to the bigs.
# data for Carlos Correa, Derrick Chung, Tyler White, Kyle Waldrop, Daniel Carbonell
newMiLBdata <- tibble(Age = c(19, 26, 23, 22, 23),
BB = c(.123, .095, .151, .076, .06),
K = c(.154, .1, .145, .194, .19),
ISO = c(.185, .118, .260, .156, .194))
# Using k=33
newMiLBPred1 <- predict(knnModel, newdata = newMiLBdata)
Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
getPredictionResponse(newMiLBPred1)
[1] Yes No No No No
Levels: No Yes
# Using k=17
newMiLBPred2 <- predict(knnModel2, newdata = newMiLBdata)
Provided data for prediction is not a pure data.frame but from class tbl_df, hence it will be converted.
getPredictionResponse(newMiLBPred2)
[1] Yes No Yes Yes No
Levels: No Yes
The first kNN model, with k=33, predicted only Carlos Correa to make MLB out of the 5 players. In an earlier run it also correctly selected Kyle Waldrop. The model’s conservative approach to handing out “Yes” predictions shows up in this small sample. In both runs, the model was wrong on Tyler White. Given that his stats were all superior to Waldrop’s (who was predicted correctly in an earlier run), this points to age potentially having a large impact in the kNN model.
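A possible reason for age carrying so much weight: as far as I know, `classif.knn` wraps `class::knn`, which measures Euclidean distance on the raw columns, and nothing in the code above rescales them. Age is measured in whole years while the three rates are fractions below 0.5, so a one-year age gap can swamp large differences in the rates. A minimal sketch using the White and Waldrop rows from the prediction data above:

```r
white   <- c(Age = 23, BB = 0.151, K = 0.145, ISO = 0.260)
waldrop <- c(Age = 22, BB = 0.076, K = 0.194, ISO = 0.156)

sq_diff <- (white - waldrop)^2              # per-feature squared differences
sqrt(sum(sq_diff))                          # Euclidean distance between the two players

age_share <- sq_diff["Age"] / sum(sq_diff)  # fraction of squared distance due to Age alone
age_share
```

If that assumption holds, standardizing the predictors before fitting (for example with base R’s `scale()`) would be worth trying in a future iteration.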
The second kNN model, with k=17, correctly predicted the playing futures of all 5 players. Given that in cross-validation only 65% of this model’s “Yes” predictions were correct, the results here are either (a) lucky or (b) biased by my selection of high-performing 2014 players (ranked 26-30 in wRC+).
Closing Thoughts
While kNN was easy to implement and fun to test out on this question, it may not be the best algorithm for predicting which players would make MLB. It’s easy to identify who won’t make it, but it’s a lot harder to identify who does! It would also be nice to have information about the variables used in the model - like which one is most closely related to making it to the big leagues, which is something kNN doesn’t offer.
I also want to note that a decent amount of error may have been introduced by requiring only 100 plate appearances for inclusion in the data. It’s well known that any player can get “hot” or “cold” for 100+ PA - see our 2014 example of Derrick Chung! Ideally we’d be looking at more data than just A+ performance to gauge their full abilities.
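To put a rough number on that noise, here is a back-of-the-envelope sketch. It assumes each plate appearance is an independent trial with a fixed strikeout probability (a simplification), and uses the mean K rate of about 0.198 from the summary output earlier.

```r
p <- 0.198   # mean K rate in the 2013 data (from the summary output)
n <- 100     # plate appearance cutoff

se <- sqrt(p * (1 - p) / n)   # binomial standard error of an observed rate
se

k_range <- c(lower = p - 2 * se, upper = p + 2 * se)  # rough two-standard-error band
k_range
```

Under these assumptions, a hitter whose true strikeout rate equals the data’s mean could show an observed rate anywhere in a band that is wider than the interquartile range of K in the data (about 0.154 to 0.237).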
I’ll be looking to use this data in future algorithms - including in the next chapter on logistic regression!
