This repository implements Bayesian ridge regression to determine how certain variables influence a pitcher's Fielding Independent Pitching (FIP). It was a project for my Bayesian statistics class at BYU, which is why Bayesian ridge regression is implemented and compared with results from Frequentist ridge regression.
The goals of the analysis are to 1) determine what a pitcher can change to improve his FIP and 2) show how shrinkage/penalization, as in the commonly used Frequentist ridge regression, can be implemented in the Bayesian paradigm.
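The link between the two paradigms can be checked numerically. The sketch below is not the project's model: it assumes a known noise variance `sigma2`, an independent `N(0, tau2)` prior on every coefficient, centered data with no intercept, and synthetic data. Under those assumptions the posterior mean of the coefficients equals the ridge estimate with penalty `lam = sigma2 / tau2`. All function names are hypothetical helpers written for this illustration.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def gram(X):
    """Return X'X."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

def xty(X, y):
    """Return X'y."""
    p = len(X[0])
    return [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

def ridge(X, y, lam):
    """Frequentist ridge: minimize ||y - X b||^2 + lam * ||b||^2."""
    G = gram(X)
    for i in range(len(G)):
        G[i][i] += lam
    return solve(G, xty(X, y))

def posterior_mean(X, y, sigma2, tau2):
    """Posterior mean of b when y ~ N(X b, sigma2 I) and b ~ N(0, tau2 I)."""
    G = [[g / sigma2 for g in row] for row in gram(X)]
    for i in range(len(G)):
        G[i][i] += 1.0 / tau2
    return solve(G, [v / sigma2 for v in xty(X, y)])

# Synthetic data (assumed values, unrelated to the pitching data).
rng = random.Random(0)
n, p = 50, 3
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_beta = [0.5, 0.0, -0.3]
y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 0.5) for row in X]

sigma2, tau2 = 0.25, 0.05
b_ridge = ridge(X, y, lam=sigma2 / tau2)       # penalized least squares
b_bayes = posterior_mean(X, y, sigma2, tau2)   # Gaussian prior, known variance
b_ols = ridge(X, y, lam=0.0)                   # no shrinkage, for comparison

print("ridge:", b_ridge)
print("bayes:", b_bayes)
print("ols:  ", b_ols)
```

A tighter prior (smaller `tau2`) corresponds to a larger penalty and pulls the coefficients closer to zero. Once the variance itself gets a prior, as in a non-conjugate model, this exact equivalence no longer holds and sampling is needed.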
The data for this analysis came from two sources. The FIP for each pitcher was taken from FanGraphs and obtained using the baseballr package. Information about each pitcher for the 2016 season was downloaded from Baseball Savant and included the average spin rate, break, and speed for fastballs, breaking pitches, and offspeed pitches; the percentage of pitches thrown inside and outside the strike zone; the percentage of each type of batted ball a pitcher gave up (fly ball, popup, ground ball, line drive, etc.); and the average launch angle of batted balls. A total of 49 variables were included in the analysis. Because many of these variables are likely correlated with each other, it seemed reasonable to implement shrinkage or regularization. To avoid potential confounding, only right-handed pitchers who pitched at least 90 innings in the 2016 season were included. A total of 108 pitchers met these qualifications.
The model was fit two ways: once using Stan, and once using a hand-written Markov chain Monte Carlo (MCMC) algorithm. Both the Stan code and the hand-written MCMC algorithm are included in the repository. One of the requirements of the project was that the model could not be completely conjugate, so the prior for 651_Project_Report.pdf
.
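For readers unfamiliar with hand-written samplers, here is a minimal sketch of one common approach, random-walk Metropolis. It is not the algorithm in this repository. It assumes a single predictor, a `N(0, tau^2)` prior on the slope, and a half-Cauchy(0, 1) prior on the noise standard deviation, chosen here only because it is not conjugate. The data are synthetic, and every name and tuning value is an assumption.

```python
import math
import random

def log_posterior(beta, log_sigma, x, y, tau=1.0):
    """Unnormalized log posterior on the (beta, log sigma) scale."""
    sigma = math.exp(log_sigma)
    sse = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    log_lik = -len(y) * log_sigma - 0.5 * sse / sigma ** 2
    log_prior_beta = -0.5 * (beta / tau) ** 2       # beta ~ N(0, tau^2): ridge-type shrinkage
    log_prior_sigma = -math.log(1.0 + sigma ** 2)   # sigma ~ half-Cauchy(0, 1): assumed, non-conjugate
    # The final + log_sigma is the Jacobian for sampling log(sigma) instead of sigma.
    return log_lik + log_prior_beta + log_prior_sigma + log_sigma

def metropolis(x, y, n_iter=6000, burn_in=1000, step=0.15, seed=1):
    """Random-walk Metropolis; returns post-burn-in draws and the acceptance rate."""
    rng = random.Random(seed)
    beta, log_sigma = 0.0, 0.0
    current = log_posterior(beta, log_sigma, x, y)
    draws, accepted = [], 0
    for i in range(n_iter):
        prop_beta = beta + rng.gauss(0.0, step)
        prop_log_sigma = log_sigma + rng.gauss(0.0, step)
        candidate = log_posterior(prop_beta, prop_log_sigma, x, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, log_sigma, current = prop_beta, prop_log_sigma, candidate
            accepted += 1
        if i >= burn_in:
            draws.append((beta, math.exp(log_sigma)))
    return draws, accepted / n_iter

# Synthetic data with slope 0.8 and noise sd 0.5 (assumed values).
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(40)]
y = [0.8 * xi + data_rng.gauss(0, 0.5) for xi in x]

draws, accept_rate = metropolis(x, y)
beta_mean = sum(b for b, _ in draws) / len(draws)
sigma_mean = sum(s for _, s in draws) / len(draws)
print("posterior mean of beta:", round(beta_mean, 3))
print("posterior mean of sigma:", round(sigma_mean, 3))
print("acceptance rate:", round(accept_rate, 3))
```

Sampling `log(sigma)` rather than `sigma` keeps every proposal valid without a positivity check. A model with many coefficients would usually update them in blocks or one at a time rather than with a single joint proposal.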
A sensitivity analysis was done to show the influence of the prior distribution on the results. Three different priors for
Frequentist ridge regression was also fit to the data, and 95% bootstrap intervals were computed for the coefficients. Sixteen variables had confidence intervals that excluded zero, while in the Bayesian model only three variables had 95% posterior credible intervals that excluded zero. The results illustrate that Bayesian and Frequentist methods often give different results, with the Bayesian results influenced by the prior distributions. The two approaches give the same results only in certain situations, so it is important to understand how they differ and to use whichever approach the researcher prefers.
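The bootstrap step can be sketched as follows. This is an illustration under assumptions, not the code used for the analysis: it uses one centered predictor so the ridge estimate has a simple closed form, resamples (x, y) pairs with replacement, and takes percentile intervals. The penalty `lam`, the number of resamples, and the synthetic data are all hypothetical choices.

```python
import random

def ridge_slope(x, y, lam):
    """Ridge estimate for one centered predictor with no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def bootstrap_interval(x, y, lam, n_boot=2000, level=0.95, seed=2):
    """Percentile bootstrap interval for the ridge slope, resampling pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        estimates.append(ridge_slope([x[i] for i in idx], [y[i] for i in idx], lam))
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    lower = estimates[int(alpha * n_boot)]
    upper = estimates[int((1.0 - alpha) * n_boot) - 1]
    return lower, upper

# Synthetic data with slope 1.0 and noise sd 0.5 (assumed values).
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(100)]
y = [1.0 * xi + data_rng.gauss(0, 0.5) for xi in x]

lam = 5.0
estimate = ridge_slope(x, y, lam)
lower, upper = bootstrap_interval(x, y, lam)
print("ridge slope:", round(estimate, 3))
print("95% bootstrap interval:", (round(lower, 3), round(upper, 3)))
```

A coefficient is flagged when its interval excludes zero, which is the criterion applied to the 49 coefficients above. In a full analysis the penalty would typically be re-tuned within each resample rather than held fixed as it is here.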
The Bayesian model had three variables whose 95% posterior credible intervals did not include zero: the percentage of balls in play where the batter gets well under the baseball, the average offspeed vertical break, and the overall average offspeed break (in all directions). It makes sense that these variables are significant. If batters are frequently under the baseball, they will not be hitting home runs, which are one of the components of FIP. Likewise, the more offspeed pitches break, the harder they are to hit, the more likely batters are to chase them, and the less likely they are to be hit for home runs. As a result, the most effective way for a right-handed starting pitcher to decrease his FIP would be to develop an effective offspeed pitch. It is difficult to control how much batters get under the ball when they swing, but a pitcher can work to improve his offspeed pitches, so that seems to be the ideal suggestion for improving FIP.
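To make the role of home runs concrete, FIP is commonly defined as (13·HR + 3·(BB + HBP) − 2·K) / IP plus a league constant that changes each season. The sketch below encodes that formula. The function name, the stat line, and the default constant of 3.10 are illustrative assumptions, not values taken from this analysis.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching from counting stats.

    hr: home runs allowed, bb: walks, hbp: hit batters, k: strikeouts,
    ip: innings pitched as a decimal number (for example 200.333 for 200 1/3).
    constant: season-specific league constant (3.10 is an assumed placeholder).
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Hypothetical stat line: the same pitcher with 20 versus 15 home runs allowed.
fip_20_hr = fip(hr=20, bb=50, hbp=5, k=180, ip=200)
fip_15_hr = fip(hr=15, bb=50, hbp=5, k=180, ip=200)
print(round(fip_20_hr, 3), round(fip_15_hr, 3))
```

Because each home run carries a weight of 13, compared with 3 for a walk, anything that suppresses home runs moves FIP more per event than any other component.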
If you have any comments or questions about the analysis, I am happy to discuss them. Feel free to email me or DM me on Twitter.