Is the MLB All Star Game a Popularity Contest or Are Nominations Founded on Actual Statistics?

Chris Emm

Saving Session Data

Loading Session Data

Introduction

Major League Baseball's All Star game, held every July (with the exception of the COVID-shortened 2020 season), highlights the top players in the game and lets them play on the same teams in front of a national audience. Three votes determine which players make the game: fan, player, and coach. Ideally, the players who perform the best should make the All Star game; however, it is possible that an avid fanbase can vote in their team's players even if they are not performing well. In this project, I am going to analyze whether the All Star game truly has the season's biggest and brightest stars, or whether the game is simply a popularity vote.

Baseball is a game of scoring runs. There's a reason that the team with the most runs at the end of a game wins. Major League Baseball (MLB), especially in the past 20 years, has seen an uptick in scoring, as the game has become more and more about offensive firepower rather than pitchers completely dominating the hitters. A team's front office, and everyone else involved in roster decisions, needs to be able to analyze player performance and determine which players will score them the most runs and, in effect, help them win the most games. In this project, we will analyze which offensive metrics are most closely related to scoring runs, using team data from 2011-2021. Then, based on our findings, we will analyze whether or not the players who make the All Star game are also the ones who are producing the most at the plate.

Part I: Scraping Team Data for 2000-2021 Seasons

The first thing we are going to do is analyze a variety of offensive metrics and their relation to producing runs on offense. In order to do this, we will need to scrape team data from FanGraphs (https://www.fangraphs.com/). We will gather the basic, advanced, batted ball, Statcast, and plate discipline data that each team accumulated over each season for the last decade. Below are two functions that scrape the data from the website.

The following function scrapes the table that is located at the specified url, and creates a dataframe using pandas from the table that is scraped. The additional year and team arguments allow us to add respective columns based on which team each row is for.
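As a rough illustration of the table-to-rows step, here is a minimal sketch that uses only the standard library instead of pandas, so it is a stand-in for the approach described above rather than the notebook's actual function. It parses an HTML string that has already been downloaded; the names `TableParser` and `table_to_records` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every table row in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html, year, team):
    """Turn an HTML table into a list of dicts tagged with Year and Team."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows  # first row is assumed to hold column names
    return [dict(zip(header, row), Year=year, Team=team) for row in body]


# Made-up markup standing in for a scraped page.
sample = (
    "<table><tr><th>Name</th><th>HR</th></tr>"
    "<tr><td>Player X</td><td>12</td></tr></table>"
)
records = table_to_records(sample, 2021, "BAL")
```

Every value comes back as a string, so numeric columns would still need converting before any analysis.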

The function below simply compiles a list of urls based on which FanGraphs page we want to visit. Since each statistical category is on a separate url, we have an argument, called stat, which determines which url we are looking to scrape from. This function will be used to create urls for all 30 MLB teams for the years that are specified (2011-2021). The page argument is used because some teams have too many players to fit on one page, so the remaining players are placed on separate pages. As you can see, we will use this function for both team and player scraping.
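The url-building idea can be sketched as follows. This is only a pattern demonstration: the endpoint path, the query keys (`type`, `season`, `team`, `page`), and the numeric stat and team codes are assumptions made for illustration and may not match the real FanGraphs url scheme.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and query keys, used only to show the pattern;
# check the real FanGraphs urls before relying on any of these names.
BASE_URL = "https://www.fangraphs.com/leaders.aspx"


def build_urls(stat, years, team_ids, page=1):
    """Return one url per (year, team) pair for the given stat category."""
    urls = []
    for year in years:
        for team_id in team_ids:
            query = urlencode(
                {"type": stat, "season": year, "team": team_id, "page": page}
            )
            urls.append(f"{BASE_URL}?{query}")
    return urls


# 11 seasons x 30 teams for one (hypothetical) stat code.
urls = build_urls(stat=8, years=range(2011, 2022), team_ids=range(1, 31))
```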

Scraping Team Data From Fangraphs

Below, we are actually compiling the web scrape results and merging all resulting dataframes into one overall dataframe called team_batting. We will perform our initial analysis on this dataset.

As you can see, there are some NaN values that we added for the EV, LA, Barrel%, and HardHit% categories. This is because Statcast was not implemented before the 2015 season, so there is no data on these stats for earlier seasons. I didn't make them 0, because that would actually impact the data, whereas a NaN value can be ignored.
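To see why zero-filling would distort the data, here is a small sketch with made-up exit velocities (not values from the dataset). The helper name `mean_ignoring_nan` is hypothetical.

```python
import math


def mean_ignoring_nan(values):
    """Average only the entries that are actual numbers."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


# Made-up exit velocities: the first two seasons predate Statcast.
ev = [math.nan, math.nan, 88.0, 89.0, 90.0]

skipping_missing = mean_ignoring_nan(ev)
zero_filled = sum(0.0 if math.isnan(v) else v for v in ev) / len(ev)

print(skipping_missing)       # 89.0
print(round(zero_filled, 1))  # 53.4
```

Treating the missing seasons as 0 drags the average far below any value that was actually measured, which is the distortion the NaN placeholder avoids.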

Correlation Between Winning Games and Scoring Runs

Baseball is a game about scoring runs on offense and not allowing runs on defense. The team with the most runs scored at the end of the game wins the game, so I would expect there to be at least a moderate correlation between scoring runs and winning games. Obviously, scoring runs is not the only factor -- if a team scores 10 runs in a game but gives up 11, they still lose. So what's more important is run differential, but for the sake of studying only offensive metrics in this project, we will only discuss scoring runs on offense. Below, I will make a plot that shows the correlation between runs scored and wins for teams in the past 20 years.

As I predicted, there was a moderate correlation between winning and scoring runs. Again, the reason the correlation is only moderate and not strong is that run differential is really what's important. If I had the stats for runs given up (Runs Against), I could plot run differential against wins, and I would expect that correlation to be really close to 1. Now that we know that runs scored is fairly correlated to winning games and being a productive offense, I will examine what metrics make a player able to score more runs, and thus be a more productive player.
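For reference, the correlation coefficient behind a plot like this can be computed by hand. The sketch below uses made-up season totals for five hypothetical teams, so the resulting numbers say nothing about real MLB seasons; it only shows the calculation for runs scored and for run differential.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up season totals for five hypothetical teams (not real data).
runs_scored = [650, 700, 720, 780, 810]
runs_allowed = [760, 690, 750, 700, 640]
wins = [68, 82, 79, 90, 99]

run_diff = [rs - ra for rs, ra in zip(runs_scored, runs_allowed)]
r_runs = pearson_r(runs_scored, wins)   # runs scored vs. wins
r_diff = pearson_r(run_diff, wins)      # run differential vs. wins
```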

Correlation Between Scoring Runs and Various Batting Metrics

Since the team that has more runs wins the game, runs are directly correlated to winning games. Obviously, that is a generic statement that can have some nuance; a team that scores a lot of runs but gives up even more will lose games, so really a team's Run%, Runs Scored / (Runs Scored + Runs Allowed), is more directly related to winning, but we aren't worried about defense for this exercise.
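The run share described here, runs scored divided by all runs scored and allowed, is a one-line calculation. A tiny sketch, with `run_share` as a hypothetical helper name:

```python
def run_share(runs_scored, runs_allowed):
    """Fraction of all runs in a team's games that the team itself scored."""
    return runs_scored / (runs_scored + runs_allowed)


# A team that scores 10 and allows 11 has a share just under one half.
share = run_share(10, 11)
```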

Below, we are going to try to find the offensive metric(s) that best correlate with scoring runs, because scoring runs wins games, to an extent. We will plot the important metrics, described below, against a team's run total and find the correlation between the datapoints. This will show which stat is most correlated to scoring runs, and thus the stat that is likely important in terms of helping a team win games. Below are the metrics that we will be analyzing:


AVG: Batting Average

The percentage of at-bats in which the batter gets a hit.
Formula: H / AB


OBP: On-Base Percentage

The ratio of the sum of the batter's hits, walks, and hit by pitches to the sum of his at-bats, walks, hit by pitches, and sacrifice flies.
Formula: (H + BB + HBP) / (AB + BB + HBP + SF)


SLG: Slugging Percentage

The total number of bases a player records per at-bat.
Formula: (1B + 2(2B) + 3(3B) + 4(HR)) / AB


OPS: On-Base Plus Slugging Percentage

Measures the ability of a player both to get on base and to hit for power
Formula: OBP + SLG


wOBA: Weighted On-Base Average

Designed to measure a player's overall offensive contributions per plate appearance
Formula: ((0.69 NIBB) + (0.719 HBP) + (0.87 1B) + (1.217 2B) + (1.529 3B) + (1.94 HR)) / (AB + BB - IBB + SF + HBP), where NIBB is non-intentional walks (BB - IBB)
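As a sketch of how the wOBA formula turns counting stats into a single number, the function below applies the weights listed above. The weights are recalculated each season, so these specific values should be treated as one season's set; the function name and argument order are assumptions for illustration.

```python
# Linear weights copied from the wOBA formula in the text; they vary by season.
WEIGHTS = {
    "NIBB": 0.69, "HBP": 0.719, "1B": 0.87,
    "2B": 1.217, "3B": 1.529, "HR": 1.94,
}


def woba(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr):
    """Weighted on-base average from a batter's counting stats."""
    events = {
        "NIBB": bb - ibb,  # non-intentional walks
        "HBP": hbp,
        "1B": singles,
        "2B": doubles,
        "3B": triples,
        "HR": hr,
    }
    numerator = sum(WEIGHTS[name] * count for name, count in events.items())
    return numerator / (ab + bb - ibb + sf + hbp)


# Three singles in ten at-bats, nothing else.
singles_only = woba(ab=10, bb=0, ibb=0, hbp=0, sf=0,
                    singles=3, doubles=0, triples=0, hr=0)
```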

A baseball fan with basic knowledge might be under the assumption that batting average and homeruns can determine whether or not a player is good at hitting. As the plots have shown, this is not exactly the case. At the end of the day, teams want to score runs, regardless of how they do so. The plots, however, show that out of the five metrics we studied, batting average was the least correlated to scoring runs, with a correlation coefficient of just .705. The metric that had the greatest correlation to scoring runs was On-Base Plus Slugging (OPS) with a coefficient of .953.

What we can gather from this is that a player with a high OPS will likely produce more runs for a team than a player who has a high batting average but an OPS close to the batting average.

For example, say player A gets 3 hits in 10 at-bats but all 3 hits are singles and he gets 0 walks. His batting average is .300, which is good, but his OPS is only .600, which is below average.

Now say player B gets 3 hits in 10 at-bats but all 3 hits are homeruns and he also walks 3 times. His batting average is also .300, but now his OPS is 1.6615 (a .4615 OBP plus a 1.200 SLG), which is almost 3 times better than player A's OPS. Obviously, player B was more productive for his team than player A -- batting average does not show this, but OPS most certainly does.
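The two stat lines above can be checked with a few lines of code. This sketch uses the standard definitions, OBP = (H + BB + HBP) / (AB + BB + HBP + SF) and SLG = total bases / AB; the function name `batting_line` is hypothetical.

```python
def batting_line(ab, singles=0, doubles=0, triples=0, hr=0, bb=0, hbp=0, sf=0):
    """Return AVG, OBP, SLG and OPS for a batter's counting stats."""
    hits = singles + doubles + triples + hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = hits / ab
    obp = (hits + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


player_a = batting_line(ab=10, singles=3)    # three singles, no walks
player_b = batting_line(ab=10, hr=3, bb=3)   # three home runs, three walks

print(round(player_a["OPS"], 4))  # 0.6
print(round(player_b["OPS"], 4))  # 1.6615
```

Both players have the same AVG of .300, so the whole gap in OPS comes from the walks (through OBP) and the extra bases (through SLG).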

Correlation Between Plate Discipline and Scoring Runs

Now, with OPS, we have an offensive metric that we determined to be highly correlated to scoring runs. Next, we want to determine what metrics are going to correlate to having a high OPS. One of the most important skills a player can have is plate discipline. In an era where strikeouts are happening at historic rates, having a player with a keen batting eye can be the difference between starting a rally and a rally fizzling out. The metrics we will look at are below:


BB/K: Walk to Strikeout Rate

The rate at which a batter walks compared to striking out. A value over 1 means that the batter walks more than he strikes out and a value under 1 means that he strikes out more than he walks.


O-Swing%: Swing Rate on Pitches Outside the Strike Zone

The percentage of pitches that are outside of the strike zone that the batter swings at.


Z-Swing%: Swing Rate on Pitches Inside the Strike Zone

The percentage of pitches that are inside of the strike zone that the batter swings at.


Swing%: Swing Rate

The percentage of pitches that the batter swings at.
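The swing-rate definitions above can be illustrated with a tiny pitch log. The data here is made up and the `(in_zone, swung)` record layout is an assumption for the example; FanGraphs publishes these rates already computed.

```python
def swing_rates(pitches):
    """Compute O-Swing%, Z-Swing% and Swing% from (in_zone, swung) pairs."""
    in_zone = [swung for zone, swung in pitches if zone]
    out_zone = [swung for zone, swung in pitches if not zone]
    return {
        "O-Swing%": 100 * sum(out_zone) / len(out_zone),
        "Z-Swing%": 100 * sum(in_zone) / len(in_zone),
        "Swing%": 100 * sum(swung for _, swung in pitches) / len(pitches),
    }


# Made-up pitch log: (pitch was in the strike zone, batter swung).
pitches = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
rates = swing_rates(pitches)

# BB/K is simply walks divided by strikeouts, e.g. 60 walks and 120 strikeouts.
bb_per_k = 60 / 120
```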

As the plots show, a batter's walk to strikeout rate is moderately correlated to a player's ability to produce at the plate, as it has a .547 correlation coefficient. Furthermore, after looking at how plate discipline affects a batter's walk to strikeout rate, we determined that the correlation between a batter having a low BB/K and a batter swinging at pitches outside of the strike zone is strong. In addition, we also found that the more pitches a batter swings at overall, the lower his BB/K tends to be. Through this, we can conclude that in order for a batter to be productive at the plate, it's important for them to make smart swing decisions, meaning that they should be selective about which pitches to swing at; minimizing the number of swings at pitches outside of the strike zone is very beneficial to improving BB/K and consequently improving overall production with the bat in their hands.

Correlation Matrix for Offensive Metrics

Building onto what we have done in the last two sections, in this section, I am going to display a correlation matrix for all offensive metrics to provide an even clearer visual into which stats correlate the best to each other, including scoring runs.
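A correlation matrix is just the pairwise correlation coefficient for every pair of columns. The sketch below builds one by hand from made-up team-season values; with a pandas dataframe, `DataFrame.corr()` produces the same kind of table. The column values are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def correlation_matrix(columns):
    """Map each pair of column names to the Pearson r of their values."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up team-season values (not real data).
columns = {
    "R": [650, 700, 720, 780, 810],
    "OPS": [0.690, 0.715, 0.730, 0.760, 0.790],
    "AVG": [0.245, 0.240, 0.262, 0.251, 0.266],
}
matrix = correlation_matrix(columns)
```

Reading across the "R" row of such a matrix is what identifies the metrics most closely tied to scoring runs.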

As we can see from the matrix -- BB/K, OBP, SLG, OPS, and ISO are all strongly correlated to scoring runs, just as we showed in the first section. Below, I will list the metrics that correlate well to the above metrics.


BB/K: Walk to Strikeout Rate

O-Contact%, Z-Contact%, Contact%, wOBA, wRC+, BB%


OBP: On-Base Percentage

LD%, HR/FB, EV, LA, Barrel%, HardHit%, wOBA, wRC+, BB%


SLG: Slugging Percentage

LD%, FB%, HR/FB, EV, LA, Barrel%, HardHit%, wOBA, wRC+, BB%


OPS: On-Base Plus Slugging

LD%, FB%, HR/FB, EV, LA, Barrel%, HardHit%, wOBA, wRC+, BB%


ISO: Isolated Power

LD%, FB%, HR/FB, EV, LA, Barrel%, HardHit%, wOBA, wRC+, BB%

Part II: Scraping Player Data for 2010-2021 Seasons

Now that we have begun to understand which stats are the best representatives of offensive production, we will scrape player data from the last decade, clean any missing data, perform an analysis on players' offensive production (instead of teams'), and then create a model that predicts an All Star appearance based on a player's stats. After we create the model, I will then analyze whether or not a player making the All Star game is related to their offensive production, or if the All Star game is just a popularity vote.

Just like I did above for the team dataframe, I am going to scrape player statistics from Fangraphs (the same categories as I did for the team). In addition, I am going to scrape Cot's Baseball Contracts to obtain player salaries, as well as the Lahman dataset to obtain positions for the players.

Scraping Standard Player Data From Fangraphs

Scraping Advanced Player Data From Fangraphs

Scraping Batted Ball Player Data From Fangraphs

Scraping Statcast Player Data From Fangraphs

Scraping Plate Discipline Player Data From Fangraphs

Merging Dataframes

Now that I have 5 different dataframes for player data, I am going to merge them all together to get them into one overall dataset.
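The merge can be pictured as repeatedly joining tables on the columns that identify a player-season. The sketch below does this with plain dictionaries; the key columns (`Name`, `Team`, `Year`) and the inner-join behavior are assumptions, since the text does not say which keys were used, and with pandas the same idea is a chain of `merge` calls.

```python
def merge_on(tables, keys=("Name", "Team", "Year")):
    """Inner-join lists of row dicts on the given key columns."""
    merged = {tuple(row[k] for k in keys): dict(row) for row in tables[0]}
    for table in tables[1:]:
        lookup = {tuple(row[k] for k in keys): row for row in table}
        merged = {
            key: {**row, **lookup[key]}
            for key, row in merged.items()
            if key in lookup  # rows without a match are dropped
        }
    return list(merged.values())


# Made-up rows standing in for two of the five scraped tables.
standard = [
    {"Name": "Player X", "Team": "BAL", "Year": 2021, "HR": 30},
    {"Name": "Player Y", "Team": "BAL", "Year": 2021, "HR": 5},
]
advanced = [{"Name": "Player X", "Team": "BAL", "Year": 2021, "wOBA": 0.372}]
players = merge_on([standard, advanced])
```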

Removing Suffix From Player Names

In order to match the names in the Lahman dataset, we will remove the suffix from player names. We do this because I will need to merge data from the Lahman dataset into my dataframe by player name and year. If the player names are not exactly the same, the row will be lost when merging. For example, as you will see below, Cedric Mullins is recorded as Cedric Mullins II on Fangraphs, but I know that he is recorded as Cedric Mullins in the Lahman dataset.
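One way to strip suffixes is a regular expression anchored at the end of the name. This is a sketch under the assumption that only the common generational suffixes (Jr., Sr., II, III, IV) need handling; the pattern and the function name are not taken from the original notebook.

```python
import re

# Generational suffixes at the end of a name, with an optional comma or period.
SUFFIX = re.compile(r",?\s+(Jr|Sr|II|III|IV)\.?$")


def strip_suffix(name):
    """Remove a trailing Jr./Sr./II/III/IV from a player name."""
    return SUFFIX.sub("", name.strip())


print(strip_suffix("Cedric Mullins II"))  # Cedric Mullins
```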

As you can see, the dataset I created from Fangraphs now has only first and last name in the Name column. Now, when I merge dataframes, it will just add a position column to the row that has the correct player name and year.

Add Player ID to Player Table from Lahman Dataset

The appearances dataframe (from the Lahman Dataset) only has player ID instead of name. Fangraphs does not have playerIDs, so in order to merge the two together, I had to obtain the playerID that the Lahman Dataset uses from the People dataframe. I then merged the playerID into my dataframe and that allowed me to successfully merge the appearance dataframe.
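The lookup step can be sketched as building a name-to-ID dictionary from the People table and attaching the ID to each player row. The rows below are made up; `nameFirst`, `nameLast`, and `playerID` are the People column names as I understand the Lahman schema, and the helper name is hypothetical.

```python
def add_player_ids(players, people):
    """Attach a Lahman playerID to each row by matching on full name."""
    ids = {f"{p['nameFirst']} {p['nameLast']}": p["playerID"] for p in people}
    # Rows with no matching name get None so they can be inspected later.
    return [{**row, "playerID": ids.get(row["Name"])} for row in players]


# Made-up rows for illustration.
people = [{"playerID": "examp01", "nameFirst": "Pat", "nameLast": "Example"}]
players = [
    {"Name": "Pat Example", "Year": 2021},
    {"Name": "No Match", "Year": 2021},
]
with_ids = add_player_ids(players, people)
```

Matching on name alone assumes names are unique; two players sharing a name would need an extra key, such as debut year, to be told apart.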

Add Player's Position to Table

Reorder Columns of Dataframe

Remove Pitchers from Dataset

For this project, I will only be focusing on offensive metrics. In the National League (NL), up until 2022 (excluding the special COVID season in 2020), pitchers batted. If an American League (AL) team played an NL team at the NL team's stadium, the DH was not used. As a result, many pitchers have been included in this batting dataset. However, since pitchers are usually terrible at batting, we want to remove them from the dataset so that they do not skew the data. This is especially important when we will take a look at the stats for All Stars; pitchers make the All Star game for being dominant on the mound, not because they can produce at the plate.
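A minimal sketch of the filtering step follows. It assumes a player's position is taken as the one with the most games in the Lahman Appearances table (columns such as `G_p` and `G_ss`); the text does not state how the position was chosen, so that rule and both helper names are assumptions.

```python
# Games-by-position columns, named as in the Lahman Appearances table.
POSITION_COLUMNS = {
    "G_p": "P", "G_c": "C", "G_1b": "1B", "G_2b": "2B", "G_3b": "3B",
    "G_ss": "SS", "G_lf": "LF", "G_cf": "CF", "G_rf": "RF", "G_dh": "DH",
}


def primary_position(appearance):
    """Return the position at which the player appeared in the most games."""
    column = max(POSITION_COLUMNS, key=lambda col: appearance.get(col, 0))
    return POSITION_COLUMNS[column]


def drop_pitchers(rows):
    """Keep only rows whose Position is not pitcher."""
    return [row for row in rows if row["Position"] != "P"]


# Made-up appearance rows.
rows = [
    {"Name": "Pitcher X", "G_p": 30, "G_dh": 1},
    {"Name": "Shortstop Y", "G_ss": 140, "G_2b": 12},
]
for row in rows:
    row["Position"] = primary_position(row)
hitters = drop_pitchers(rows)
```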

Add Player Salaries to Dataset

Just as a fun aside, at the end of this project, we will compare a player's salary to whether or not he was an All Star. It will be interesting to see how many players that make top dollar end up not making the All Star game. In order to do that, I have scraped salaries from Cot's Baseball Contracts and placed them in a CSV file so that I could create a dataframe for them. I then merge the salary dataframe onto my player dataframe so that a new column, Salary, is added to the dataframe.

Part III: Analyzing Offensive Metrics Using Player Data

Now that we have player data, we can further analyze what metrics lead to more production from a player. For the next section, I am going to analyze how batted ball data as well as plate discipline affect a player's offensive production. While I did have this data available for the team dataframe, doing this with the player data gives me a larger sample from which to take the mean. If I did teams, it would only be 30 samples each year, whereas the player data has hundreds of players each year. I felt it would be more effective and representative to take a larger sample size.

Effect of Contact Quality on Production

As I expected, as exit velocity and barrel rate increase, so does a player's production. The same goes for launch angle; however, offensive production starts to plateau around the 23-degree mark. This makes sense, as the higher the launch angle, the more arc in the trajectory -- at a certain point, too much arc will result in easy pop-outs and flyouts.

Effect of Plate Discipline on Production

Next, I will plot plate discipline and how it affects offensive production. I would expect that a player who is more disciplined will be able to work counts to get in spots where he gets a good pitch to hit, thus increasing his chances of producing at the plate and driving in runs. In this plot, I will once again focus on O-Swing%, SwStr%, Contact%, and BB/K.

As I suspected, there is a direct relationship between plate discipline and offensive production. In the top 3 plots, I plotted 3 different measurements of plate discipline against BB/K. The plots show that the more selective a player is, the more likely he is to draw more walks and strike out less. For example, the less a player swings at pitches outside the strike zone, the higher his walk to strikeout rate will be. The same goes for the number of swinging strikes against him. And then, for contact rate, the more contact the player makes, the more likely he is to get base hits, foul off tough pitches, and avoid striking out. Then, to put it all together, I showed that BB/K has a direct correlation to offensive production. In fact, the correlation of BB/K was even stronger with OPS, which was the metric we determined to be most correlated to scoring runs.

Part IV: Creating a Model to Predict All Star Seasons

Now that I have completed the analysis of offensive production, I am going to use what we have learned thus far and determine whether or not All Stars are actually the top performers or if the All Star nominations are simply a popularity contest. First, I am going to scrape the AllStar Full dataframe from the Lahman Dataset and then merge it into my dataframe, with the value of 1 representing an All Star nomination and 0 representing no All Star nomination. After I have completed that, I will create a predictive model that will be able to predict, using the stats in the dataframe, whether a certain stat line is going to get an All Star nomination.

Now that we have a dataframe of all of the All Stars from 2011-present, I will merge it into my player dataframe. All players that are in the All Star dataframe will have an All_Star value of 1, and the rest will have an All_Star value of 0. Then, using the modified dataframe, I will perform Linear Discriminant Analysis to model whether or not a certain statline will be nominated for the All Star game.
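The flagging step can be sketched as a set-membership test on (player, year) pairs. The rows below are made up; `playerID` and `yearID` are the AllstarFull column names as I understand the Lahman schema, and the `Year` column on the player rows is an assumption.

```python
def flag_all_stars(players, all_stars):
    """Add All_Star = 1 for rows found in the All Star table, else 0."""
    selected = {(row["playerID"], row["yearID"]) for row in all_stars}
    return [
        {**row, "All_Star": int((row["playerID"], row["Year"]) in selected)}
        for row in players
    ]


# Made-up rows for illustration.
players = [
    {"playerID": "examp01", "Year": 2021, "OPS": 0.910},
    {"playerID": "examp02", "Year": 2021, "OPS": 0.701},
]
all_stars = [{"playerID": "examp01", "yearID": 2021}]
flagged = flag_all_stars(players, all_stars)
```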

Merge Allstar Dataframe Onto Players Dataframe

Predictive Model Disregarding Name and Team

First, I am going to completely remove any personal information (Name, Team, Position) and only make a model using the actual stats. If the All Star game is not just a popularity contest, a model that disregards personal information should score pretty highly and be able to predict whether a player's statline warrants an All Star nomination fairly accurately.
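To make the idea of Linear Discriminant Analysis concrete, here is a from-scratch sketch with a single feature. It is a simplification, not the model used in this project, which works on the full statline and would normally rely on a library implementation such as scikit-learn's `LinearDiscriminantAnalysis`. The OPS values and labels are made up.

```python
import math


def fit_lda_1d(xs, labels):
    """Fit a one-feature LDA: class means, priors and a pooled variance."""
    classes = sorted(set(labels))
    groups = {c: [x for x, y in zip(xs, labels) if y == c] for c in classes}
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    priors = {c: len(g) / len(xs) for c, g in groups.items()}
    pooled = sum(
        (x - means[c]) ** 2 for c, g in groups.items() for x in g
    ) / (len(xs) - len(classes))
    return means, priors, pooled


def predict_lda_1d(x, means, priors, pooled):
    """Pick the class with the largest linear discriminant score."""
    def score(c):
        return (x * means[c] / pooled
                - means[c] ** 2 / (2 * pooled)
                + math.log(priors[c]))
    return max(means, key=score)


# Made-up OPS values and All Star labels (1 = nominated), not real data.
ops = [0.650, 0.700, 0.720, 0.740, 0.760, 0.880, 0.910, 0.950]
all_star = [0, 0, 0, 0, 0, 1, 1, 1]

model = fit_lda_1d(ops, all_star)
predictions = [predict_lda_1d(x, *model) for x in ops]
```

Each class gets its own mean but shares one variance, which is what makes the decision boundary linear. Scoring on the same rows used for fitting, as done here, overstates how well a model would do on unseen seasons.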

I have printed out the model score, which tells us the percent of players the predictive model got right when predicting whether each player would make the All Star game or not. A success rate of 93% is excellent, and tells us that overall, using just the statline, the model can predict fairly accurately whether a player will get a nomination or not.
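One way to put a score like this in context is to compare it with the accuracy of always predicting the most common label, since All Stars are a small share of all player seasons. The sketch below uses made-up labels, so the baseline it prints is not the baseline for this dataset; both helper names are hypothetical.

```python
def accuracy(predicted, actual):
    """Share of rows where the prediction matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(actual):
    """Accuracy of always predicting the most common label."""
    most_common = max(set(actual), key=actual.count)
    return accuracy([most_common] * len(actual), actual)


# Made-up labels: 1 All Star among 10 player seasons.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
baseline = majority_baseline(actual)

print(baseline)  # 0.9
```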

Predictive Model Disregarding Just Team

We want to find out whether a player's popularity has any impact on them getting an All Star nomination, so we will eliminate the Team information and one-hot encode the player names, so that the model can use the player names as a field to build its predictions on. The reason I am looking at names is that if a player is continually voted into the All Star game (ASG), then either they are consistently good, or they have a good reputation with all fanbases and everybody votes for them. For example, Mike Trout, one of the greatest players in the world, and likely in the history of MLB, is consistently in the ASG. A lot of this is due to the fact that he is really good. However, last year, he made the ASG, which is held in July, even though he had been out with an injury since early May. This is because Mike Trout is a household name, and his reputation alone warranted a nomination. If name has an impact on nominations, this model should be able to predict nominations at a higher rate than our previous model.
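One-hot encoding replaces a text column with one 0/1 column per distinct value, so that a model can use it as a numeric input. A minimal sketch with made-up rows follows; with pandas, `get_dummies` does the same job. The helper name and the `column_value` naming of the new columns are assumptions.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = int(row[column] == value)
        encoded.append(new_row)
    return encoded


# Made-up rows for illustration.
rows = [
    {"Name": "Player X", "OPS": 0.900},
    {"Name": "Player Y", "OPS": 0.700},
]
encoded = one_hot(rows, "Name")
```

With one column per player name, the number of features grows with the number of distinct players, which is worth keeping in mind when comparing this model's score with the stats-only model.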

The success rate of this model is still very high; however, it is not significantly better than our original model, so we would presume that including a player's name in the model structure has no real impact, which means that most voting fans likely don't consider the actual player they're voting for, but just the statline. Obviously, it is still possible that this varies for certain players; for example, if a specific player is known to be disliked around the league, the player's name (personality) might be considered over the statline. It just isn't a widespread reason for how a player gets nominated.

Predictive Model Disregarding Just Name

In addition to player popularity, there is team popularity. Teams with stronger fanbases could potentially be biased and vote for their own players to be in the ASG, regardless of how well they are performing. Because of this, I am going to eliminate the player Names from the dataframe and I am going to one-hot encode the Teams they are on. For example, in 2015, the year after the Kansas City Royals made an improbable run to the World Series (and the year they ended up actually winning it), the Royals had 4 starters and 3 reserves in the All Star game. This was most definitely due to team popularity, considering one of the starters had the lowest slugging percentage and ISO in all of MLB. It was a disgrace that many players who actually played well that year didn't get to play in the game; however, this was the fans' vote. Maybe this year was an outlier, maybe it was representative. If team popularity has any say in ASG nominations, the model below should have a higher predictive rate than the original model.

As was the case above, the success rate of this model is still very high; however, it is not significantly better than our original model. This means that overall, it is unlikely that specific fanbases make a concerted effort to overtake the vote in order to put all of their team's players in the ASG. Most fans that care enough to vote seem to generally place emphasis on the actual stats, instead of having complete and blind loyalty to their favorite team. In order to confirm this, in the next section, we will perform further analysis on how stats line up compared to All Star nominations.

Part V: Are All Stars Truly the Most Productive Players or is the All Star Game Just a Popularity Vote?

In order to test our hypothesis that fans voted on All Star game nominations with consideration for the actual stats rather than the player or team, I am going to make some plots that show how the stats of All Stars line up to the stats of non-All Stars. First, I am going to make a violin plot of the offensive metrics we have studied in past sections with respect to each year the All Star game was held. Each violin is going to be split in half, with one side representing the distribution for non-All Stars and the other half representing All Stars.
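The per-year comparison behind such a plot comes down to grouping rows by season and All Star status and summarizing each group. A minimal sketch with made-up rows is shown below; the `Year` and `All_Star` column names follow the text, and `median_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, metric):
    """Median of a metric for each (Year, All_Star) combination."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Year"], row["All_Star"])].append(row[metric])
    return {key: median(values) for key, values in groups.items()}


# Made-up player seasons (not real data).
rows = [
    {"Year": 2021, "All_Star": 0, "OPS": 0.700},
    {"Year": 2021, "All_Star": 0, "OPS": 0.720},
    {"Year": 2021, "All_Star": 0, "OPS": 0.740},
    {"Year": 2021, "All_Star": 1, "OPS": 0.900},
    {"Year": 2021, "All_Star": 1, "OPS": 0.950},
]
medians = median_by_group(rows, "OPS")
```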

As the plots show, All Stars are most certainly more productive at the plate than non-All Stars. In the plot, I made a split violin plot with the light green side being non-All Stars and the light yellow side being All Stars. I then grouped the dataframe by All Stars and non-All Stars and calculated the average of the metric we were plotting for each year in the dataset. The quartiles are printed inside each violin distribution, and the mean is highlighted by a point on the plot. The blue points represent the median for All Stars and the magenta points represent the median for the non-All Stars. As you can see in every single plot, the median of each offensive metric is much higher for All Stars than it is for non-All Stars. As we showed in the previous sections, these stats correlate to offensive production; thus, we can say that since the All Stars have higher average marks for these metrics, they are, in fact, more productive than non-All Stars. This trend seems to indicate that All Stars are indeed nominated for their production, rather than popularity.

Part VI: Do All Stars Get Paid More Than Non-All Stars?

As a fun and interesting aside, I also decided to investigate whether or not All Stars get paid more than their Non-All Star counterparts. Intuitively, I predict they do, because players tend to become All Stars in their prime years, which is when they should be off the league minimum salary. Also, since we determined in the last section that All Stars are more productive, teams will be more likely to pay them more, because production often comes at a premium unless you have a phenom like Bryce Harper, who has been making the All Star game since his rookie year when he was 19 years old. If a team is lucky enough to have a rookie become an All Star, then the player will have a low salary, because for the first six years of an MLB career, players do not have free rein to sign anywhere (teams own their rights during this time), so the team can offer them lower salaries and the player basically has to play for whatever is offered. However, this only happens for generational players, so I expect the plot to show an overall trend that All Stars have higher salaries than Non-All Stars.

As I expected, the plot shows that All Stars are paid, on average, more than Non-All Stars. Again, this makes sense because more productive players usually make more money, and we showed that All Stars are more productive than Non-All Stars in section V.

Conclusion

As I have shown throughout this project, there are very distinct metrics that we keep track of in baseball that help determine a player's worth, value, and production to a team. The most indicative metric for offensive production is on-base plus slugging percentage (OPS), as this is the metric that correlated to scoring runs the most. The other statistics, most specifically OBP, SLG, and wOBA, were also highly correlated. In the process, we also showed that batting average and homeruns, while they do moderately correlate to scoring runs, are not the best representation of offensive production. With our final analysis of offensive production, we found that a player who is able to work counts, and make good swing decisions (swinging at hittable pitches, not swinging at pitcher's pitches) is going to be more productive at the plate.

Then, tying it all together, we studied how offensive production correlates to All Star game nominations. Through our analysis, we found that All Star game nominations are, in fact, related to offensive production, instead of just player or team popularity. In addition to offensive production, we also looked at how All Star game nominations are related to player salaries and found that All Stars, on average, are paid more than Non-All Stars. This is likely due to a couple of reasons -- productive players make more money than unproductive players, and All Stars tend to be players in their prime, who are on contracts that pay them more than when they were rookies.