Analysis of MLB pitchers and what factors influence a pitcher's FIP. Bayesian Ridge Regression is used to determine which covariates are most important. The model is fit using both hand written MCMC algorithms and Stan.

dteuscher1/MLB-Bayesian-Ridge

Bayesian Ridge Regression for MLB Pitcher Performance

This repository implements Bayesian ridge regression to determine how certain variables influence a pitcher's Fielding Independent Pitching (FIP). This was a project for my Bayesian statistics class at BYU, which is why Bayesian ridge regression is implemented and compared with the results of frequentist ridge regression.

The goals of the analysis are to 1) determine what a pitcher can change to improve his FIP and 2) show how shrinkage/penalization, as done by the commonly used frequentist ridge regression, can be implemented in the Bayesian paradigm.
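As a brief sketch of the connection in goal 2, consider the standard textbook setup below. It assumes a normal likelihood and independent zero-mean normal priors on the coefficients with $\sigma^2$ and $\sigma_b^2$ held fixed. The symbols $X$, $y$, $\beta$, $p$, and $\lambda$ are introduced here for illustration, and the exact model used in the project is the one given in the project report.

$$
y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I), \qquad \beta_j \mid \sigma_b^2 \sim N(0, \sigma_b^2), \quad j = 1, \dots, p
$$

$$
-2 \log p(\beta \mid y, \sigma^2, \sigma_b^2) = \frac{1}{\sigma^2} \lVert y - X\beta \rVert^2 + \frac{1}{\sigma_b^2} \lVert \beta \rVert^2 + \text{const}
$$

$$
\hat{\beta} = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 \right\} = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \lambda = \frac{\sigma^2}{\sigma_b^2}
$$

Under these assumptions the posterior mean (and mode) of $\beta$ is the ridge estimate with penalty $\lambda = \sigma^2 / \sigma_b^2$, so a smaller prior variance $\sigma_b^2$ corresponds to a stronger penalty.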

The data for this analysis came from two different sources. The FIP for each pitcher was taken from FanGraphs and was obtained using the baseballr package. Information about each pitcher was downloaded from Baseball Savant for the 2016 season. It included the average spin rate, break, and speed for fastballs, breaking pitches, and offspeed pitches; the percentage of pitches thrown inside and outside the strike zone; the percentage of each type of batted ball a pitcher gave up (fly ball, popup, ground ball, line drive, etc.); and the average launch angle of batted balls. A total of 49 variables were included in the analysis. With so many variables available, many of which are likely correlated with each other, it seemed reasonable to implement shrinkage or regularization. To avoid potential confounding, only right-handed pitchers who pitched at least 90 innings in the 2016 season were included. A total of 108 pitchers met these qualifications.

The model was fit two ways: once using Stan and once using a hand-written Markov chain Monte Carlo (MCMC) algorithm. Both the Stan code and the hand-written MCMC algorithm are included in the repository. One of the requirements of the project was that the model could not be completely conjugate, so the prior for $\sigma_b^2$ was a Gamma distribution rather than an Inverse-Gamma distribution. The specific prior distributions and the formal MCMC algorithm can be found in the project report, 651_Project_Report.pdf.
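To illustrate why a Gamma prior on $\sigma_b^2$ breaks conjugacy, here is a minimal Python sketch of one way to update $\sigma_b^2$ with a Metropolis step while the coefficients are held fixed. It is not the algorithm from the project report. It assumes $\beta_j \sim N(0, \sigma_b^2)$ and a Gamma(shape `a`, rate `b`) prior, and the function names, the step size, and the example values are all hypothetical.

```python
import math
import random


def log_cond_sigma_b2(s2, beta, a, b):
    """Log full conditional of sigma_b^2, up to an additive constant.

    Assumes beta_j ~ Normal(0, s2) and s2 ~ Gamma(shape=a, rate=b).
    """
    if s2 <= 0:
        return -math.inf
    p = len(beta)
    ss = sum(bj * bj for bj in beta)
    log_lik = -0.5 * p * math.log(s2) - ss / (2.0 * s2)
    log_prior = (a - 1.0) * math.log(s2) - b * s2
    return log_lik + log_prior


def metropolis_sigma_b2(beta, a, b, n_iter=5000, step=0.5, init=1.0, seed=1):
    """Random-walk Metropolis on log(sigma_b^2) with beta held fixed."""
    rng = random.Random(seed)
    s2 = init
    draws = []
    accepted = 0
    for _ in range(n_iter):
        # Multiplicative proposal keeps the variance positive.
        prop = s2 * math.exp(step * rng.gauss(0.0, 1.0))
        # The log(prop) - log(s2) term is the Hastings correction
        # for the log-normal proposal.
        log_ratio = (log_cond_sigma_b2(prop, beta, a, b) + math.log(prop)
                     - log_cond_sigma_b2(s2, beta, a, b) - math.log(s2))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            s2 = prop
            accepted += 1
        draws.append(s2)
    return draws, accepted / n_iter


# Hypothetical coefficient values, only to exercise the sampler.
draws, rate = metropolis_sigma_b2([0.5, -0.3, 0.1, 0.8], a=2.0, b=2.0)
print(f"acceptance rate: {rate:.2f}")
```

In a full sampler this step would alternate with updates for the coefficients and the error variance, which is the part that remains conjugate under the assumptions above.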

A sensitivity analysis was done to show the influence of the prior distribution on the results. Three different priors for $\sigma_b^2$ were used, and the results showed that the prior for $\sigma_b^2$ greatly influenced the posterior distribution. This illustrates how shrinkage and regularization occur in the Bayesian paradigm: the prior distribution for the variance of the coefficients influences the values of the coefficients, so the coefficients are penalized more or less depending on the prior that is specified.
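The direction of this effect can be seen in a deliberately simplified Python sketch. It uses a single centered predictor and treats $\sigma^2$ and $\sigma_b^2$ as fixed values rather than giving $\sigma_b^2$ a prior, so it is not the sensitivity analysis from the project. The function name and the toy data are hypothetical.

```python
def shrunken_slope(x, y, sigma2, sigma_b2):
    """Posterior mean of a single slope under beta ~ Normal(0, sigma_b2).

    Assumes y_i ~ Normal(beta * x_i, sigma2) with sigma2 and sigma_b2 fixed.
    The ratio sigma2 / sigma_b2 plays the role of the ridge penalty.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + sigma2 / sigma_b2)


# Toy data with a least-squares slope of exactly 2.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

# A tighter prior (smaller sigma_b2) pulls the slope further toward zero.
for sigma_b2 in (0.1, 1.0, 100.0):
    print(sigma_b2, shrunken_slope(x, y, sigma2=1.0, sigma_b2=sigma_b2))
```

A prior that concentrates $\sigma_b^2$ near small values acts like the first case, and a diffuse prior acts like the last, where the estimate is close to least squares.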

Frequentist ridge regression was also fit to the data, and 95% bootstrap confidence intervals were computed for the coefficients. There were 16 variables whose confidence intervals did not include zero, while the Bayesian model had only three variables whose 95% posterior credible intervals did not include zero. The results illustrate that Bayesian and frequentist methods will often give different results, in part because the Bayesian results are influenced by the prior distributions. The two approaches give the same results only in certain situations, so it is important to understand how they differ and to choose whichever approach the researcher prefers.
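For readers unfamiliar with bootstrap intervals, the following Python sketch shows a percentile bootstrap for a ridge slope with a single predictor. It is an illustration under simplifying assumptions (one predictor, a fixed penalty `lam`, resampling of cases), not the procedure used in the project, and the function names and data are hypothetical.

```python
import math
import random


def ridge_slope(x, y, lam):
    """Ridge estimate of a single slope with no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def bootstrap_interval(x, y, lam, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the ridge slope (case resampling)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Resample (x_i, y_i) pairs with replacement and refit.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            ridge_slope([x[i] for i in idx], [y[i] for i in idx], lam)
        )
    estimates.sort()
    alpha = 1.0 - level
    lo = estimates[math.floor(alpha / 2.0 * (n_boot - 1))]
    hi = estimates[math.ceil((1.0 - alpha / 2.0) * (n_boot - 1))]
    return lo, hi


# Hypothetical toy data with a slope near 2.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [-4.1, -2.8, -2.2, -0.9, 1.2, 1.9, 3.1, 3.9]
lo, hi = bootstrap_interval(x, y, lam=1.0)
print(f"95% bootstrap interval: ({lo:.3f}, {hi:.3f})")
```

A coefficient is flagged when its interval excludes zero, which is the criterion behind the count of 16 variables above.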

The three variables whose 95% posterior credible intervals did not include zero in the Bayesian model are the percentage of balls in play where the batter is very under the baseball, the average offspeed vertical break, and the overall average offspeed break (in all directions). It makes sense that these variables are significant: if batters are frequently under the baseball, they won't be hitting home runs, which are one of the components of FIP. Also, the more offspeed pitches break, the harder they are to hit, the more likely batters are to chase them, and the less likely they are to be hit for home runs. As a result, the most effective way for a right-handed starting pitcher to decrease his FIP would be to develop an effective offspeed pitch. It is more difficult to control how much batters get under the ball when they swing, but a pitcher can work to improve his offspeed pitches, so that seems to be the ideal suggestion for improving FIP.
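To make the role of home runs concrete, here is a small Python sketch of the standard FIP formula. The function name is hypothetical, and the default `constant` is only a placeholder: the FIP constant is season-specific and should be looked up for the season of interest.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching from a pitcher's season totals.

    hr: home runs allowed, bb: walks, hbp: hit batters,
    k: strikeouts, ip: innings pitched.
    constant: season-specific scaling constant (placeholder default).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line, before and after allowing five fewer home runs.
before = fip(hr=20, bb=50, hbp=5, k=200, ip=200)
after = fip(hr=15, bb=50, hbp=5, k=200, ip=200)
print(before, after)
```

Home runs carry the largest weight in the formula, which is why variables associated with fewer home runs would be expected to lower FIP.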

If there are any comments or questions about the analysis, I am happy to discuss them with anyone. Feel free to email me at [email protected] or DM me on Twitter.
