Data Storytelling Team Project - Baseball

Author
Affiliation

Byeong-Hak Choe

SUNY Geneseo

Published

November 20, 2024

Data

The following lists data frames about MLB since 1985:

  • df_teams: Yearly statistics and standings for MLB teams
  • df_battings: Batting statistics
  • df_pitchings: Pitching statistics
  • df_salaries: Player salary data
  • df_salariesTeam: Team salary data
  • df_postseasons: Post season series information
  • df_players: People table - Player names, DOB, and biographical info.
    • This data.frame is to be used to get details about players listed in the df_battings, df_pitchings, and df_salaries where players are identified only by variable playerID.

MLB Teams (df_teams)

  • df_teams: Yearly statistics and standings for MLB teams
    • A data frame with 1128 observations on the 52 variables.
df_teams <- read_csv("http://bcdanl.github.io/data/mlb_teams.csv")


yearID Year

lgID League; a factor with levels AA AL FL NL PL UA

teamID Team; a factor

franchID Franchise (links to TeamsFranchises table)

divID Team’s division; a factor with levels C E W

Rank Position in final standings

G Games played

Ghome Games played at home

W Wins

L Losses

DivWin Division Winner (Y or N)

WCWin Wild Card Winner (Y or N)

LgWin League Champion(Y or N)

WSWin World Series Winner (Y or N)

R Runs scored

AB At bats

H Hits by batters

X1B Singles

X2B Doubles

X3B Triples

HR Homeruns by batters

TB Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR

BB Walks by batters

SO Strikeouts by batters

SB Stolen bases

CS Caught stealing

HBP Batters hit by pitch

BB_HBP BB_HBP = BB + HBP

RC Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)

SF Sacrifice flies

RA Opponents runs scored

ER Earned runs allowed

ERA Earned run average

CG Complete games

SHO Shutouts

SV Saves

IPouts Outs Pitched (innings pitched x 3)

HA Hits allowed

HRA Homeruns allowed

BBA Walks allowed

SOA Strikeouts by pitchers

E Errors

DP Double Plays

FP Fielding percentage

name Team’s full name

park Name of team’s home ballpark

attendance Home attendance total

BPF Three-year park factor for batters

PPF Three-year park factor for pitchers

teamIDBR Team ID used by Baseball Reference website

teamIDlahman45 Team ID used in Lahman database version 4.5

teamIDretro Team ID used by Retrosheet

MLB Batting (df_battings)

  • df_battings: Batting statistics
    • A data frame with 113799 observations on the 28 variables.
df_battings <- read_csv("http://bcdanl.github.io/data/mlb_battings.csv")


playerID Player ID code

yearID Year

stint player’s stint (order of appearances within a season)

teamID Team; a factor

lgID League; a factor with levels AA AL FL NL PL UA

G Games: number of games in which a player played

AB At Bats

R Runs

H Hits: times reached base because of a batted, fair ball without error by the defense

X2B Singles

X2B Doubles: hits on which the batter reached second base safely

X3B Triples: hits on which the batter reached third base safely

HR Homeruns

TB Total bases TB = X1B + 2*X2B + 3*X3B + 4*HR

RBI Runs Batted In

SB Stolen Bases

CS Caught Stealing

BB Base on Balls

SO Strikeouts

IBB Intentional walks

HBP Hit by pitch

BB_HBP Representing all the times a player reaches base via walks (including intentional walks) and hit by pitches. BB_HBP = BB + HBP

RC Runs created RC = (H + BB + HBP)*TB/(AB + BB + HBP)

SH Sacrifice hits

SF Sacrifice flies

GIDP Grounded into double plays

Outs The total number of outs a player is responsible for during a season Outs = 0.982 * AB - H + GIDP + SF + SH + CS Multiply AB by 0.982 due to the fact that approximately 0.8% of all at bats result in an error

OBP On-Base Percentage. OBP measures how frequently a batter reaches base per plate appearance OBP = (H + BB + HBP) / (AB + BB + HBP + SF)

MLB Pitching (df_pitchings)

  • df_pitchings: Pitching statistics
    • A data frame with 26384 observations on the 33 variables.
df_pitchings <- read_csv("http://bcdanl.github.io/data/mlb_pitchings.csv")


playerID Player ID code

yearID Year

stint player’s stint (order of appearances within a season)

teamID Team; a factor

lgID League; a factor with levels AA AL FL NL PL UA

W Wins

L Losses

G Games

GS Games Started

CG Complete Games

SHO Shutouts

SV Saves

IPouts Outs Pitched (innings pitched x 3)

H Hits

BAOpp Batting average against (a measure of how effectively a pitcher prevents hitters from getting hits) BAOpp = H / (H + IPouts)

ER Earned Runs

HR Homeruns

BB Walks

WHIP Walks plus Hits per Inning Pitched. The number of base runners a pitcher has allowed per inning pitched. It is a common indicator of a pitcher’s effectiveness. WHIP = (H + BB) * 3 / IPouts

SO Strikeouts

BAOpp Opponent’s Batting Average

ERA Earned Run Average

IBB Intentional Walks

KperBB Strikeouts per Walk. KperBB represents the ratio of strikeouts to walks, indicating a pitcher’s control and ability to dominate hitters. KperBB = SO/(BB - IBB)

WP Wild Pitches

HBP Batters Hit By Pitch

BK Balks

BFP Batters faced by Pitcher

GF Games Finished

R Runs Allowed

SH Sacrifices by opposing batters

SF Sacrifice flies by opposing batters

GIDP Grounded into double plays by opposing batter

MLB Salaries (df_salaries)

  • df_salaries: Player salary data from 1985 to 2016
    • A data frame with 26428 observations on the 5 variables.
df_salaries <- read_csv("http://bcdanl.github.io/data/mlb_salaries.csv")


yearID Year

teamID Team; a factor

lgID League; a factor

playerID Player ID code

salary Salary

Team Salaries (df_salariesTeam)

  • df_salariesTeam: Team salary data from 1985 to 2016
df_salariesTeam <- read_csv("http://bcdanl.github.io/data/mlb_payrolls.csv")

payroll Team payroll in million dollars

WSWin Whether or not winning the World Series


MLB Post Season (df_postseasons)

  • df_postseasons: Post season series information
    • A data frame with 389 observations on the 9 variables.
df_postseasons <- read_csv("http://bcdanl.github.io/data/mlb_postseasons.csv")


yearID Year

round Level of playoffs

teamIDwinner Team ID of the team that won the series; a factor

lgIDwinner League ID of the team that won the series; a factor with levels AL NL

teamIDloser Team ID of the team that lost the series; a factor

lgIDloser League ID of the team that lost the series; a factor with levels AL NL

wins Wins by team that won the series

losses Losses by team that won the series

ties Tie games

MLB Players (df_players)

  • df_players: People table - Player names, DOB, and biographical info.
    • A data frame with 21010 observations on the 26 variables.
    • This data.frame is to be used to get details about players listed in the df_battings, df_pitchings, and df_salaries where players are identified only by variable playerID.
df_players <- read_csv("http://bcdanl.github.io/data/mlb_players.csv")


playerID A unique code assigned to each player. The playerID links the data in this data.frame with records on players in the other data.frames.

birthYear Year player was born

birthMonth Month player was born

birthDay Day player was born

birthCountry Country where player was born

birthState State where player was born

birthCity City where player was born

deathYear Year player died

deathMonth Month player died

deathDay Day player died

deathCountry Country where player died

deathState State where player died

deathCity City where player died

nameFirst Player’s first name

nameLast Player’s last name

nameGiven Player’s given name (typically first and middle)

weight Player’s weight in pounds

height Player’s height in inches

bats a factor: Player’s batting hand (left (L), right (R), or both (B))

throws a factor: Player’s throwing hand (left(L) or right(R))

debut Date that player made first major league appearance

finalGame Date that player made first major league appearance (blank if still active)

retroID ID used by retrosheet, https://www.retrosheet.org/

bbrefID ID used by Baseball Reference website, https://www.baseball-reference.com/

birthDate Player’s birthdate, in as.Date format

deathDate Player’s deathdate, in as.Date format

Sabermetrics

Baseball analysis underwent a significant transformation with Bill James’s pioneering work in the late 1970s. Through his publication, the Bill James Baseball Abstract, James challenged the reliance on traditional metrics like batting average, runs, and RBIs for hitters, and wins, ERA, and strikeouts for pitchers. Instead, he introduced innovative metrics such as runs created, Pythagorean win percentage, and game score. He called this analytical framework sabermetrics, named after the Society for American Baseball Research (SABR). While initially dismissed by traditionalists as secondary to the expertise of seasoned “baseball people,” James’s methods have become foundational in modern baseball. Today, every Major League Baseball team employs analytics experts who utilize advanced mathematics, predictive modeling, and optimization techniques to assemble competitive teams and make in-game decisions aimed at maximizing success.

For many, the 2011 movie Moneyball provides the most accessible introduction to sports analytics. The film recounts how the Oakland A’s used data-driven insights during their 2002 season, despite skepticism from traditional scouts who mocked Bill James’s ideas. Since that era, analytics have not replaced scouts but have become an essential complement, with every team now leveraging sophisticated data analyses alongside traditional evaluations.

Baseball Metrics

Historical Metrics

For over a century, baseball players were evaluated using a limited set of straightforward offensive and defensive statistics. However, advancements in computing and the introduction of sabermetrics revolutionized how player performance is assessed, giving rise to more nuanced and meaningful metrics.

Offensive Statistics

Statistic Definition
Batting Average Hits divided by at-bats (Hits ÷ At-Bats)
Home Runs Total number of home runs hit by a player
Runs Batted In (RBI) Number of runs a player has batted in
Walks Number of times a batter reaches first base via four balls (also known as Base on Balls)
On Base Percentage (OBP) Times on base (hits + walks + hit by pitch) divided by plate appearances
Runs Scored Number of times a player crosses home plate to score a run
Slugging Percentage (SLG) Total bases divided by at-bats (Total Bases ÷ At-Bats)

Defensive Statistics

Statistic Definition
Wins Credited to the pitcher who was pitching when his team took the lead for good
Innings Pitched Total number of innings a pitcher has pitched
Earned Run Average (ERA) Average number of earned runs a pitcher allows per nine innings pitched
Walks Number of batters a pitcher has allowed to reach first base via four balls
Strikeouts Number of batters a pitcher has retired by striking them out
WHIP Walks plus hits per inning pitched ((Walks + Hits) ÷ Innings Pitched)
Saves Credited to a pitcher who finishes a game for the winning team under certain conditions
Fielding Percentage Percentage of times a defensive player handles the ball without making an error

Runs Created

One of Bill James’s original metrics, Runs Created (\(RC\)), estimates the number of runs a player or team contributes to scoring. The formula is:

\[ RC = \frac{(H + BB + HBP) \times TB}{AB + BB + HBP} \]

Where:

  • \(H\) = Hits
  • \(BB\) = Walks (Base on Balls)
  • \(HBP\) = Hit by Pitch
  • \(TB\) = Total Bases
  • \(AB\) = At-Bats

This metric helps evaluate a player or team’s offensive value. For example, if considering adding a player to a team, comparing their runs created to other candidates provides a data-driven basis for decision-making.

Back to top