Data Storytelling Team Project - Baseball
Data
The following data frames cover MLB since 1985:
- df_teams: Yearly statistics and standings for MLB teams
- df_battings: Batting statistics
- df_pitchings: Pitching statistics
- df_salaries: Player salary data
- df_salariesTeam: Team salary data
- df_postseasons: Post season series information
- df_players: People table - Player names, DOB, and biographical info.
  - This data.frame is to be used to get details about players listed in df_battings, df_pitchings, and df_salaries, where players are identified only by the variable playerID.
MLB Teams (df_teams)
df_teams: Yearly statistics and standings for MLB teams
- A data frame with 1128 observations on the 52 variables.
df_teams <- read_csv("http://bcdanl.github.io/data/mlb_teams.csv")
yearID Year
lgID League; a factor with levels AA AL FL NL PL UA
teamID Team; a factor
franchID Franchise (links to TeamsFranchises table)
divID Team’s division; a factor with levels C E W
Rank Position in final standings
G Games played
Ghome Games played at home
W Wins
L Losses
DivWin Division Winner (Y or N)
WCWin Wild Card Winner (Y or N)
LgWin League Champion (Y or N)
WSWin World Series Winner (Y or N)
R Runs scored
AB At bats
H Hits by batters
X1B Singles
X2B Doubles
X3B Triples
HR Homeruns by batters
TB Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR
BB Walks by batters
SO Strikeouts by batters
SB Stolen bases
CS Caught stealing
HBP Batters hit by pitch
BB_HBP BB_HBP = BB + HBP
RC Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)
SF Sacrifice flies
RA Opponents runs scored
ER Earned runs allowed
ERA Earned run average
CG Complete games
SHO Shutouts
SV Saves
IPouts Outs Pitched (innings pitched x 3)
HA Hits allowed
HRA Homeruns allowed
BBA Walks allowed
SOA Strikeouts by pitchers
E Errors
DP Double Plays
FP Fielding percentage
name Team’s full name
park Name of team’s home ballpark
attendance Home attendance total
BPF Three-year park factor for batters
PPF Three-year park factor for pitchers
teamIDBR Team ID used by Baseball Reference website
teamIDlahman45 Team ID used in Lahman database version 4.5
teamIDretro Team ID used by Retrosheet
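The derived columns TB and BB_HBP follow directly from the counting statistics listed above. The following is a minimal base R sketch of those two formulas; `toy_team` is a hypothetical one-row data frame with made-up numbers, not a row from mlb_teams.csv.

```r
# Hypothetical team-season row; the values are made up for illustration.
toy_team <- data.frame(
  X1B = 900, X2B = 300, X3B = 30, HR = 200,
  BB = 500, HBP = 60
)

# Total bases: each hit type weighted by the number of bases it is worth.
toy_team$TB <- with(toy_team, X1B + 2 * X2B + 3 * X3B + 4 * HR)

# Times reaching base via a walk or a hit by pitch.
toy_team$BB_HBP <- with(toy_team, BB + HBP)

print(toy_team)
```

Recomputing these columns from the real df_teams and comparing them with the stored TB and BB_HBP values is one quick way to check that the data loaded as expected.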
MLB Batting (df_battings)
df_battings: Batting statistics
- A data frame with 113799 observations on the 28 variables.
df_battings <- read_csv("http://bcdanl.github.io/data/mlb_battings.csv")
playerID Player ID code
yearID Year
stint player’s stint (order of appearances within a season)
teamID Team; a factor
lgID League; a factor with levels AA AL FL NL PL UA
G Games: number of games in which a player played
AB At Bats
R Runs
H Hits: times reached base because of a batted, fair ball without error by the defense
X1B Singles
X2B Doubles: hits on which the batter reached second base safely
X3B Triples: hits on which the batter reached third base safely
HR Homeruns
TB Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base on Balls
SO Strikeouts
IBB Intentional walks
HBP Hit by pitch
BB_HBP Representing all the times a player reaches base via walks (including intentional walks) and hit by pitches. BB_HBP = BB + HBP
RC Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)
SH Sacrifice hits
SF Sacrifice flies
GIDP Grounded into double plays
Outs The total number of outs a player is responsible for during a season. Outs = 0.982 * AB - H + GIDP + SF + SH + CS. AB is multiplied by 0.982 because approximately 1.8% of all at bats result in an error.
OBP On-Base Percentage. OBP measures how frequently a batter reaches base per plate appearance. OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
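The Outs and OBP definitions above can be checked by hand on a small example. The following base R sketch applies both formulas to `toy_bat`, a hypothetical player-season with made-up numbers rather than a row from mlb_battings.csv.

```r
# Hypothetical player-season; the values are made up for illustration.
toy_bat <- data.frame(
  AB = 500, H = 150, BB = 50, HBP = 5,
  SF = 5, SH = 2, GIDP = 10, CS = 3
)

# Estimated outs: 0.982 * AB discounts at bats that end in a defensive error.
toy_bat$Outs <- with(toy_bat, 0.982 * AB - H + GIDP + SF + SH + CS)

# On-base percentage: times on base divided by (AB + BB + HBP + SF).
toy_bat$OBP <- with(toy_bat, (H + BB + HBP) / (AB + BB + HBP + SF))

print(toy_bat[, c("Outs", "OBP")])
```

Here Outs works out to 491 - 150 + 10 + 5 + 2 + 3 = 361, and OBP is 205 / 560.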
MLB Pitching (df_pitchings)
df_pitchings: Pitching statistics
- A data frame with 26384 observations on the 33 variables.
df_pitchings <- read_csv("http://bcdanl.github.io/data/mlb_pitchings.csv")
playerID Player ID code
yearID Year
stint player’s stint (order of appearances within a season)
teamID Team; a factor
lgID League; a factor with levels AA AL FL NL PL UA
W Wins
L Losses
G Games
GS Games Started
CG Complete Games
SHO Shutouts
SV Saves
IPouts Outs Pitched (innings pitched x 3)
H Hits
BAOpp Batting average against (a measure of how effectively a pitcher prevents hitters from getting hits) BAOpp = H / (H + IPouts)
ER Earned Runs
HR Homeruns
BB Walks
WHIP Walks plus Hits per Inning Pitched. The number of base runners a pitcher has allowed per inning pitched. It is a common indicator of a pitcher’s effectiveness. WHIP = (H + BB) * 3 / IPouts
SO Strikeouts
BAOpp Opponent’s Batting Average
ERA Earned Run Average
IBB Intentional Walks
KperBB Strikeouts per Walk. KperBB represents the ratio of strikeouts to walks, indicating a pitcher’s control and ability to dominate hitters. KperBB = SO/(BB - IBB)
WP Wild Pitches
HBP Batters Hit By Pitch
BK Balks
BFP Batters faced by Pitcher
GF Games Finished
R Runs Allowed
SH Sacrifices by opposing batters
SF Sacrifice flies by opposing batters
GIDP Grounded into double plays by opposing batter
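The three derived pitching columns (BAOpp, WHIP, KperBB) are simple ratios of the counting statistics above. The following base R sketch reproduces them as defined in this list; `toy_pit` is a hypothetical pitcher-season with made-up numbers.

```r
# Hypothetical pitcher-season; the values are made up for illustration.
toy_pit <- data.frame(IPouts = 600, H = 180, BB = 50, IBB = 5, SO = 200)

# Walks plus hits per inning pitched; IPouts / 3 converts outs to innings.
toy_pit$WHIP <- with(toy_pit, (H + BB) * 3 / IPouts)

# Strikeouts per unintentional walk, as defined in the variable list.
# Note: this is Inf when BB equals IBB, so real data may need filtering.
toy_pit$KperBB <- with(toy_pit, SO / (BB - IBB))

# Batting average against, using H + IPouts as the denominator
# (the definition given in the variable list).
toy_pit$BAOpp <- with(toy_pit, H / (H + IPouts))

print(toy_pit[, c("WHIP", "KperBB", "BAOpp")])
```

With these numbers WHIP is 230 * 3 / 600 = 1.15, so the pitcher allows a little more than one base runner per inning.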
MLB Salaries (df_salaries)
df_salaries: Player salary data from 1985 to 2016
- A data frame with 26428 observations on the 5 variables.
df_salaries <- read_csv("http://bcdanl.github.io/data/mlb_salaries.csv")
yearID Year
teamID Team; a factor
lgID League; a factor
playerID Player ID code
salary Salary
Team Salaries (df_salariesTeam)
df_salariesTeam: Team salary data from 1985 to 2016
df_salariesTeam <- read_csv("http://bcdanl.github.io/data/mlb_payrolls.csv")
payroll Team payroll in million dollars
WSWin Whether or not the team won the World Series
MLB Post Season (df_postseasons)
df_postseasons: Post season series information
- A data frame with 389 observations on the 9 variables.
df_postseasons <- read_csv("http://bcdanl.github.io/data/mlb_postseasons.csv")
yearID Year
round Level of playoffs
teamIDwinner Team ID of the team that won the series; a factor
lgIDwinner League ID of the team that won the series; a factor with levels AL NL
teamIDloser Team ID of the team that lost the series; a factor
lgIDloser League ID of the team that lost the series; a factor with levels AL NL
wins Wins by team that won the series
losses Losses by team that won the series
ties Tie games
MLB Players (df_players)
df_players: People table - Player names, DOB, and biographical info.
- A data frame with 21010 observations on the 26 variables.
- This data.frame is to be used to get details about players listed in df_battings, df_pitchings, and df_salaries, where players are identified only by the variable playerID.
df_players <- read_csv("http://bcdanl.github.io/data/mlb_players.csv")
playerID A unique code assigned to each player. The playerID links the data in this data.frame with records on players in the other data.frames.
birthYear Year player was born
birthMonth Month player was born
birthDay Day player was born
birthCountry Country where player was born
birthState State where player was born
birthCity City where player was born
deathYear Year player died
deathMonth Month player died
deathDay Day player died
deathCountry Country where player died
deathState State where player died
deathCity City where player died
nameFirst Player’s first name
nameLast Player’s last name
nameGiven Player’s given name (typically first and middle)
weight Player’s weight in pounds
height Player’s height in inches
bats a factor: Player’s batting hand (left (L), right (R), or both (B))
throws a factor: Player’s throwing hand (left (L) or right (R))
debut Date that player made first major league appearance
finalGame Date that player made last major league appearance (blank if still active)
retroID ID used by retrosheet, https://www.retrosheet.org/
bbrefID ID used by Baseball Reference website, https://www.baseball-reference.com/
birthDate Player’s birthdate, in as.Date format
deathDate Player’s deathdate, in as.Date format
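Because the other data frames identify players only by playerID, names and biographical details have to be attached with a join. The following is a minimal sketch of that lookup using base R's `merge()`; `toy_players` and `toy_salaries` are hypothetical stand-ins with made-up IDs, names, and salaries, containing only a few of the real columns.

```r
# Hypothetical stand-in for df_players (made-up IDs and names).
toy_players <- data.frame(
  playerID  = c("aaa01", "bbb01"),
  nameFirst = c("Alex", "Ben"),
  nameLast  = c("Able", "Baker"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for df_salaries (made-up salaries).
toy_salaries <- data.frame(
  yearID   = c(2015, 2016, 2016),
  playerID = c("aaa01", "aaa01", "bbb01"),
  salary   = c(500000, 750000, 1200000),
  stringsAsFactors = FALSE
)

# Left join: keep every salary row and attach the player's name by playerID.
salaries_named <- merge(toy_salaries, toy_players,
                        by = "playerID", all.x = TRUE)

print(salaries_named)
```

Setting `all.x = TRUE` keeps salary rows even when a playerID has no match in the players table, which makes missing matches visible as NA names instead of silently dropping them.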
Sabermetrics
Baseball analysis underwent a significant transformation with Bill James’s pioneering work in the late 1970s. Through his publication, the Bill James Baseball Abstract, James challenged the reliance on traditional metrics like batting average, runs, and RBIs for hitters, and wins, ERA, and strikeouts for pitchers. Instead, he introduced innovative metrics such as runs created, Pythagorean win percentage, and game score. He called this analytical framework sabermetrics, named after the Society for American Baseball Research (SABR). While initially dismissed by traditionalists as secondary to the expertise of seasoned “baseball people,” James’s methods have become foundational in modern baseball. Today, every Major League Baseball team employs analytics experts who utilize advanced mathematics, predictive modeling, and optimization techniques to assemble competitive teams and make in-game decisions aimed at maximizing success.
For many, the 2011 movie Moneyball provides the most accessible introduction to sports analytics. The film recounts how the Oakland A’s used data-driven insights during their 2002 season, despite skepticism from traditional scouts who mocked Bill James’s ideas. Since that era, analytics have not replaced scouts but have become an essential complement, with every team now leveraging sophisticated data analyses alongside traditional evaluations.
Baseball Metrics
Historical Metrics
For over a century, baseball players were evaluated using a limited set of straightforward offensive and defensive statistics. However, advancements in computing and the introduction of sabermetrics revolutionized how player performance is assessed, giving rise to more nuanced and meaningful metrics.
Offensive Statistics
| Statistic | Definition |
|---|---|
| Batting Average | Hits divided by at-bats (Hits ÷ At-Bats) |
| Home Runs | Total number of home runs hit by a player |
| Runs Batted In (RBI) | Number of runs a player has batted in |
| Walks | Number of times a batter reaches first base via four balls (also known as Base on Balls) |
| On Base Percentage (OBP) | Times on base (hits + walks + hit by pitch) divided by plate appearances |
| Runs Scored | Number of times a player crosses home plate to score a run |
| Slugging Percentage (SLG) | Total bases divided by at-bats (Total Bases ÷ At-Bats) |
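Two of the ratios in this table, batting average and slugging percentage, can be sketched in a few lines of base R. The helper names `batting_average()` and `slugging()` are hypothetical and chosen for this example, and the inputs are made-up numbers.

```r
# Batting average: hits divided by at-bats.
batting_average <- function(H, AB) H / AB

# Slugging percentage: total bases divided by at-bats,
# with total bases built from singles, doubles, triples, and home runs.
slugging <- function(X1B, X2B, X3B, HR, AB) {
  total_bases <- X1B + 2 * X2B + 3 * X3B + 4 * HR
  total_bases / AB
}

# Made-up season: 100 singles, 30 doubles, 5 triples, 15 home runs in 500 AB.
avg <- batting_average(H = 100 + 30 + 5 + 15, AB = 500)
slg <- slugging(X1B = 100, X2B = 30, X3B = 5, HR = 15, AB = 500)

print(c(AVG = avg, SLG = slg))
```

In this example the 150 hits give a batting average of 0.300, while the 235 total bases give a slugging percentage of 0.470; the gap between the two reflects how many hits went for extra bases.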
Defensive Statistics
| Statistic | Definition |
|---|---|
| Wins | Credited to the pitcher who was pitching when his team took the lead for good |
| Innings Pitched | Total number of innings a pitcher has pitched |
| Earned Run Average (ERA) | Average number of earned runs a pitcher allows per nine innings pitched |
| Walks | Number of batters a pitcher has allowed to reach first base via four balls |
| Strikeouts | Number of batters a pitcher has retired by striking them out |
| WHIP | Walks plus hits per inning pitched ((Walks + Hits) ÷ Innings Pitched) |
| Saves | Credited to a pitcher who finishes a game for the winning team under certain conditions |
| Fielding Percentage | Percentage of times a defensive player handles the ball without making an error |
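Innings pitched and ERA can both be recovered from the IPouts and ER columns described earlier. The following base R sketch assumes a hypothetical pitcher-season, `toy_era`, with made-up numbers.

```r
# Hypothetical pitcher-season; the values are made up for illustration.
toy_era <- data.frame(IPouts = 600, ER = 70)

# Innings pitched: three outs per inning.
toy_era$IP <- toy_era$IPouts / 3

# Earned run average: earned runs allowed per nine innings pitched.
toy_era$ERA <- 9 * toy_era$ER / toy_era$IP

print(toy_era)
```

Here 600 outs correspond to 200 innings, and 70 earned runs over 200 innings give an ERA of 9 * 70 / 200 = 3.15.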
Runs Created
One of Bill James’s original metrics, Runs Created (\(RC\)), estimates the number of runs a player or team contributes to scoring. The formula is:
\[ RC = \frac{(H + BB + HBP) \times TB}{AB + BB + HBP} \]
Where:
- \(H\) = Hits
- \(BB\) = Walks (Base on Balls)
- \(HBP\) = Hit by Pitch
- \(TB\) = Total Bases
- \(AB\) = At-Bats
This metric helps evaluate a player’s or team’s offensive value. For example, when a team is considering adding a player, comparing that player’s runs created with other candidates’ gives a data-driven basis for the decision.
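As a small worked example of the formula, the following base R sketch compares two candidates by runs created. The function name `runs_created()` and both stat lines are hypothetical; the numbers are made up for illustration.

```r
# Runs created: (H + BB + HBP) * TB / (AB + BB + HBP)
runs_created <- function(H, BB, HBP, TB, AB) {
  (H + BB + HBP) * TB / (AB + BB + HBP)
}

# Two hypothetical candidates (made-up stat lines).
rc_a <- runs_created(H = 150, BB = 50, HBP = 5, TB = 235, AB = 500)
rc_b <- runs_created(H = 140, BB = 80, HBP = 5, TB = 220, AB = 480)

print(round(c(A = rc_a, B = rc_b), 1))
```

In this made-up comparison candidate B comes out slightly ahead despite having fewer hits and total bases, because the additional walks raise the on-base term of the formula.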