df_teams <- read_csv("http://bcdanl.github.io/data/mlb_teams.csv")
Data Storytelling Team Project - Baseball
Data
The following lists data frames about MLB since 1985:
- df_teams: Yearly statistics and standings for MLB teams
- df_battings: Batting statistics
- df_pitchings: Pitching statistics
- df_salaries: Player salary data
- df_salariesTeam: Team salary data
- df_postseasons: Post season series information
- df_players: People table - Player names, DOB, and biographical info.
  - This data.frame is to be used to get details about players listed in df_battings, df_pitchings, and df_salaries, where players are identified only by the variable playerID.
MLB Teams (df_teams)
df_teams: Yearly statistics and standings for MLB teams
- A data frame with 1128 observations on 52 variables.
yearID
Year
lgID
League; a factor with levels AA AL FL NL PL UA
teamID
Team; a factor
franchID
Franchise (links to TeamsFranchises table)
divID
Team’s division; a factor with levels C E W
Rank
Position in final standings
G
Games played
Ghome
Games played at home
W
Wins
L
Losses
DivWin
Division Winner (Y or N)
WCWin
Wild Card Winner (Y or N)
LgWin
League Champion (Y or N)
WSWin
World Series Winner (Y or N)
R
Runs scored
AB
At bats
H
Hits by batters
X1B
Singles
X2B
Doubles
X3B
Triples
HR
Homeruns by batters
TB
Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR
BB
Walks by batters
SO
Strikeouts by batters
SB
Stolen bases
CS
Caught stealing
HBP
Batters hit by pitch
BB_HBP
BB_HBP = BB + HBP
RC
Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)
SF
Sacrifice flies
RA
Opponents runs scored
ER
Earned runs allowed
ERA
Earned run average
CG
Complete games
SHO
Shutouts
SV
Saves
IPouts
Outs Pitched (innings pitched x 3)
HA
Hits allowed
HRA
Homeruns allowed
BBA
Walks allowed
SOA
Strikeouts by pitchers
E
Errors
DP
Double Plays
FP
Fielding percentage
name
Team’s full name
park
Name of team’s home ballpark
attendance
Home attendance total
BPF
Three-year park factor for batters
PPF
Three-year park factor for pitchers
teamIDBR
Team ID used by Baseball Reference website
teamIDlahman45
Team ID used in Lahman database version 4.5
teamIDretro
Team ID used by Retrosheet
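The derived columns TB and BB_HBP in the list above follow directly from the counting statistics. The sketch below recomputes them in base R on a made-up team-season row; `teams_toy` and its numbers are hypothetical and are not taken from df_teams.

```r
# Toy team-season row; the values are invented for illustration only.
teams_toy <- data.frame(
  teamID = "AAA",
  X1B = 900, X2B = 280, X3B = 30, HR = 190,
  BB = 500, HBP = 50
)

# Total bases: each hit type weighted by the number of bases it yields
teams_toy$TB <- with(teams_toy, X1B + 2 * X2B + 3 * X3B + 4 * HR)

# Times reaching base via a walk or a hit by pitch
teams_toy$BB_HBP <- with(teams_toy, BB + HBP)

teams_toy[, c("teamID", "TB", "BB_HBP")]
```

Running the same two expressions against the real df_teams would be one way to check that its stored TB and BB_HBP columns match these definitions.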
MLB Batting (df_battings)
df_battings: Batting statistics
- A data frame with 113799 observations on 28 variables.
df_battings <- read_csv("http://bcdanl.github.io/data/mlb_battings.csv")
playerID
Player ID code
yearID
Year
stint
player’s stint (order of appearances within a season)
teamID
Team; a factor
lgID
League; a factor with levels AA AL FL NL PL UA
G
Games: number of games in which a player played
AB
At Bats
R
Runs
H
Hits: times reached base because of a batted, fair ball without error by the defense
X1B
Singles
X2B
Doubles: hits on which the batter reached second base safely
X3B
Triples: hits on which the batter reached third base safely
HR
Homeruns
TB
Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR
RBI
Runs Batted In
SB
Stolen Bases
CS
Caught Stealing
BB
Base on Balls
SO
Strikeouts
IBB
Intentional walks
HBP
Hit by pitch
BB_HBP
Representing all the times a player reaches base via walks (including intentional walks) and hit by pitches. BB_HBP = BB + HBP
RC
Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)
SH
Sacrifice hits
SF
Sacrifice flies
GIDP
Grounded into double plays
Outs
The total number of outs a player is responsible for during a season. Outs = 0.982 * AB - H + GIDP + SF + SH + CS
AB is multiplied by 0.982 because approximately 1.8% of all at-bats result in an error.
OBP
On-Base Percentage. OBP measures how frequently a batter reaches base per plate appearance OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
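The Outs and OBP definitions above can be applied column-wise to a batting table. This is a minimal base R sketch on two invented player-seasons; `batting_toy` and the player IDs are hypothetical, and only the formulas come from the variable list.

```r
# Two made-up player-seasons using the column names from df_battings
batting_toy <- data.frame(
  playerID = c("player_a", "player_b"),
  AB   = c(500, 400),
  H    = c(150, 100),
  BB   = c(50, 30),
  HBP  = c(5, 2),
  SF   = c(5, 3),
  SH   = c(0, 2),
  GIDP = c(10, 8),
  CS   = c(3, 1)
)

# Outs = 0.982 * AB - H + GIDP + SF + SH + CS
batting_toy$Outs <- with(batting_toy, 0.982 * AB - H + GIDP + SF + SH + CS)

# OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
batting_toy$OBP <- with(batting_toy, (H + BB + HBP) / (AB + BB + HBP + SF))

batting_toy[, c("playerID", "Outs", "OBP")]
```

For player_a this gives Outs = 359 and OBP = 205/560, about 0.366.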
MLB Pitching (df_pitchings)
df_pitchings: Pitching statistics
- A data frame with 26384 observations on 33 variables.
df_pitchings <- read_csv("http://bcdanl.github.io/data/mlb_pitchings.csv")
playerID
Player ID code
yearID
Year
stint
player’s stint (order of appearances within a season)
teamID
Team; a factor
lgID
League; a factor with levels AA AL FL NL PL UA
W
Wins
L
Losses
G
Games
GS
Games Started
CG
Complete Games
SHO
Shutouts
SV
Saves
IPouts
Outs Pitched (innings pitched x 3)
H
Hits
BAOpp
Batting average against (a measure of how effectively a pitcher prevents hitters from getting hits) BAOpp = H / (H + IPouts)
ER
Earned Runs
HR
Homeruns
BB
Walks
WHIP
Walks plus Hits per Inning Pitched. The number of base runners a pitcher has allowed per inning pitched. It is a common indicator of a pitcher’s effectiveness. WHIP = (H + BB) * 3 / IPouts
SO
Strikeouts
BAOpp
Opponent’s Batting Average
ERA
Earned Run Average
IBB
Intentional Walks
KperBB
Strikeouts per Walk. KperBB represents the ratio of strikeouts to walks, indicating a pitcher’s control and ability to dominate hitters. KperBB = SO/(BB - IBB)
WP
Wild Pitches
HBP
Batters Hit By Pitch
BK
Balks
BFP
Batters faced by Pitcher
GF
Games Finished
R
Runs Allowed
SH
Sacrifices by opposing batters
SF
Sacrifice flies by opposing batters
GIDP
Grounded into double plays by opposing batter
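The three derived pitching columns (BAOpp, WHIP, KperBB) can be reproduced from the raw counts using the formulas given in the list above. The following base R sketch uses a single invented pitcher-season; `pitching_toy` is hypothetical.

```r
# One made-up pitcher-season using the column names from df_pitchings
pitching_toy <- data.frame(
  playerID = "pitcher_a",
  IPouts = 600,  # 200 innings pitched
  H = 180, BB = 50, IBB = 5, SO = 200
)

# BAOpp = H / (H + IPouts), as defined in this document
pitching_toy$BAOpp <- with(pitching_toy, H / (H + IPouts))

# WHIP = (H + BB) * 3 / IPouts; the factor 3 converts outs to innings
pitching_toy$WHIP <- with(pitching_toy, (H + BB) * 3 / IPouts)

# KperBB = SO / (BB - IBB); R returns Inf if BB equals IBB and SO > 0
pitching_toy$KperBB <- with(pitching_toy, SO / (BB - IBB))

pitching_toy[, c("playerID", "BAOpp", "WHIP", "KperBB")]
```

Here WHIP works out to 1.15. Rows with IPouts of 0 or BB equal to IBB would need separate handling before ranking pitchers on these ratios.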
MLB Salaries (df_salaries)
df_salaries: Player salary data from 1985 to 2016
- A data frame with 26428 observations on 5 variables.
df_salaries <- read_csv("http://bcdanl.github.io/data/mlb_salaries.csv")
yearID
Year
teamID
Team; a factor
lgID
League; a factor
playerID
Player ID code
salary
Salary
Team Salaries (df_salariesTeam)
df_salariesTeam: Team salary data from 1985 to 2016
df_salariesTeam <- read_csv("http://bcdanl.github.io/data/mlb_payrolls.csv")
payroll
Team payroll in million dollars
WSWin
Whether the team won the World Series
MLB Post Season (df_postseasons)
df_postseasons: Post season series information
- A data frame with 389 observations on 9 variables.
df_postseasons <- read_csv("http://bcdanl.github.io/data/mlb_postseasons.csv")
yearID
Year
round
Level of playoffs
teamIDwinner
Team ID of the team that won the series; a factor
lgIDwinner
League ID of the team that won the series; a factor with levels AL NL
teamIDloser
Team ID of the team that lost the series; a factor
lgIDloser
League ID of the team that lost the series; a factor with levels AL NL
wins
Wins by team that won the series
losses
Losses by team that won the series
ties
Tie games
MLB Players (df_players)
df_players: People table - Player names, DOB, and biographical info.
- A data frame with 21010 observations on 26 variables.
- This data.frame is to be used to get details about players listed in df_battings, df_pitchings, and df_salaries, where players are identified only by the variable playerID.
df_players <- read_csv("http://bcdanl.github.io/data/mlb_players.csv")
playerID
A unique code assigned to each player. The playerID links the data in this data.frame with records on players in the other data.frames.
birthYear
Year player was born
birthMonth
Month player was born
birthDay
Day player was born
birthCountry
Country where player was born
birthState
State where player was born
birthCity
City where player was born
deathYear
Year player died
deathMonth
Month player died
deathDay
Day player died
deathCountry
Country where player died
deathState
State where player died
deathCity
City where player died
nameFirst
Player’s first name
nameLast
Player’s last name
nameGiven
Player’s given name (typically first and middle)
weight
Player’s weight in pounds
height
Player’s height in inches
bats
a factor: Player’s batting hand (left (L), right (R), or both (B))
throws
a factor: Player’s throwing hand (left(L) or right(R))
debut
Date that player made first major league appearance
finalGame
Date that player made last major league appearance (blank if still active)
retroID
ID used by retrosheet, https://www.retrosheet.org/
bbrefID
ID used by Baseball Reference website, https://www.baseball-reference.com/
birthDate
Player’s birthdate, in as.Date format
deathDate
Player’s deathdate, in as.Date format
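Because the batting, pitching, and salary tables identify players only by playerID, names and biographical details have to be attached by joining on that key. The sketch below shows the idea with base R's `merge()` on two tiny invented tables; `players_toy`, `salaries_toy`, and every value in them are hypothetical stand-ins for df_players and df_salaries.

```r
# Made-up stand-in for df_players (only a few of its columns)
players_toy <- data.frame(
  playerID  = c("p01", "p02", "p03"),
  nameFirst = c("Alex", "Blake", "Casey"),
  nameLast  = c("Able", "Baker", "Clark"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for df_salaries
salaries_toy <- data.frame(
  yearID   = c(2015, 2016, 2016),
  playerID = c("p01", "p01", "p03"),
  salary   = c(500000, 750000, 1200000),
  stringsAsFactors = FALSE
)

# Left join: keep every salary row and attach the matching player's name
salaries_named <- merge(salaries_toy, players_toy,
                        by = "playerID", all.x = TRUE)

salaries_named
```

The result has one row per salary record, with nameFirst and nameLast filled in from the players table. With the tidyverse loaded, `dplyr::left_join(df_salaries, df_players, by = "playerID")` would be an equivalent way to do the same join on the real data.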
Sabermetrics
Baseball analysis underwent a significant transformation with Bill James’s pioneering work in the late 1970s. Through his publication, the Bill James Baseball Abstract, James challenged the reliance on traditional metrics like batting average, runs, and RBIs for hitters, and wins, ERA, and strikeouts for pitchers. Instead, he introduced innovative metrics such as runs created, Pythagorean win percentage, and game score. He called this analytical framework sabermetrics, named after the Society for American Baseball Research (SABR). While initially dismissed by traditionalists as secondary to the expertise of seasoned “baseball people,” James’s methods have become foundational in modern baseball. Today, every Major League Baseball team employs analytics experts who utilize advanced mathematics, predictive modeling, and optimization techniques to assemble competitive teams and make in-game decisions aimed at maximizing success.
For many, the 2011 movie Moneyball provides the most accessible introduction to sports analytics. The film recounts how the Oakland A’s used data-driven insights during their 2002 season, despite skepticism from traditional scouts who mocked Bill James’s ideas. Since that era, analytics have not replaced scouts but have become an essential complement, with every team now leveraging sophisticated data analyses alongside traditional evaluations.
Baseball Metrics
Historical Metrics
For over a century, baseball players were evaluated using a limited set of straightforward offensive and defensive statistics. However, advancements in computing and the introduction of sabermetrics revolutionized how player performance is assessed, giving rise to more nuanced and meaningful metrics.
Offensive Statistics
Statistic | Definition |
---|---|
Batting Average | Hits divided by at-bats (Hits ÷ At-Bats) |
Home Runs | Total number of home runs hit by a player |
Runs Batted In (RBI) | Number of runs a player has batted in |
Walks | Number of times a batter reaches first base via four balls (also known as Base on Balls) |
On Base Percentage (OBP) | Times on base (hits + walks + hit by pitch) divided by plate appearances |
Runs Scored | Number of times a player crosses home plate to score a run |
Slugging Percentage (SLG) | Total bases divided by at-bats (Total Bases ÷ At-Bats) |
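As a small worked example of the two ratio statistics in the table above, the base R sketch below computes batting average and slugging percentage for one invented hitter. The numbers are hypothetical, and singles are derived as hits minus extra-base hits.

```r
# Made-up season line for one hitter
AB  <- 500
H   <- 150
X2B <- 30
X3B <- 5
HR  <- 20

# Singles are the hits that were not doubles, triples, or home runs
X1B <- H - X2B - X3B - HR

# Total bases, then the two ratios from the table
TB  <- X1B + 2 * X2B + 3 * X3B + 4 * HR
AVG <- H / AB    # batting average
SLG <- TB / AB   # slugging percentage

c(AVG = AVG, SLG = SLG)
```

This hitter has 250 total bases, so AVG is 0.300 and SLG is 0.500. The gap between the two reflects how many of the hits went for extra bases.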
Defensive Statistics
Statistic | Definition |
---|---|
Wins | Credited to the pitcher who was pitching when his team took the lead for good |
Innings Pitched | Total number of innings a pitcher has pitched |
Earned Run Average (ERA) | Average number of earned runs a pitcher allows per nine innings pitched |
Walks | Number of batters a pitcher has allowed to reach first base via four balls |
Strikeouts | Number of batters a pitcher has retired by striking them out |
WHIP | Walks plus hits per inning pitched ((Walks + Hits) ÷ Innings Pitched) |
Saves | Credited to a pitcher who finishes a game for the winning team under certain conditions |
Fielding Percentage | Percentage of times a defensive player handles the ball without making an error |
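The ERA definition in the table above (earned runs per nine innings) can be computed from the IPouts and ER columns, since IPouts is innings pitched times 3. A minimal base R sketch with invented numbers:

```r
# Made-up pitcher-season
IPouts <- 600  # outs recorded
ER     <- 70   # earned runs allowed

# Convert outs to innings, then scale earned runs to a nine-inning rate
IP  <- IPouts / 3
ERA <- 9 * ER / IP

c(IP = IP, ERA = ERA)
```

With 200 innings and 70 earned runs, ERA comes out to 3.15.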
Runs Created
One of Bill James’s original metrics, Runs Created (\(RC\)), estimates the number of runs a player or team contributes to scoring. The formula is:
\[ RC = \frac{(H + BB + HBP) \times TB}{AB + BB + HBP} \]
Where:
- \(H\) = Hits
- \(BB\) = Walks (Base on Balls)
- \(HBP\) = Hit by Pitch
- \(TB\) = Total Bases
- \(AB\) = At-Bats
This metric helps evaluate a player or team’s offensive value. For example, if considering adding a player to a team, comparing their runs created to other candidates provides a data-driven basis for decision-making.
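To make the comparison concrete, the sketch below wraps the formula in a small helper and applies it to two invented candidates. The function name `runs_created`, the data frame `candidates`, and all numbers are hypothetical; only the formula comes from the text above.

```r
# Runs Created, following the formula above
runs_created <- function(H, BB, HBP, TB, AB) {
  (H + BB + HBP) * TB / (AB + BB + HBP)
}

# Two made-up candidates with different offensive profiles
candidates <- data.frame(
  player = c("candidate_a", "candidate_b"),
  H   = c(150, 140),
  BB  = c(50, 80),
  HBP = c(5, 5),
  TB  = c(250, 220),
  AB  = c(500, 480),
  stringsAsFactors = FALSE
)

candidates$RC <- with(candidates, runs_created(H, BB, HBP, TB, AB))

# Rank the candidates from most to fewest runs created
candidates[order(-candidates$RC), c("player", "RC")]
```

In this toy case candidate_a (about 92.3 runs created) edges out candidate_b (about 87.6), even though candidate_b reaches base more often, because candidate_a accumulates more total bases.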