Midterm Exam 2
Classwork 13
Summary for Midterm Exam 2 Performance
The following provides the descriptive statistics for each part of the Midterm Exam 2 questions:
The following describes the distribution of Midterm Exam 2 scores:
Section 1. Multiple Choice, Short Answer, and Various Questions
Question 1
The quote “I would have written a shorter letter, but I did not have the time” by Blaise Pascal emphasizes the challenge of being:
- Detailed
- Accurate
- Concise
- Creative
Answer:
- Concise
Explanation:
Blaise Pascal’s quote underscores the difficulty of expressing ideas succinctly. Being concise often requires more thought and effort to distill complex ideas into clear, brief statements.
Question 2
When the distribution of a variable has a single peak and is positively skewed (i.e., having a long right tail), which of the following is correct?
- Mean < Mode < Median
- Mode < Median < Mean
- Median < Mean < Mode
- Median = Mean = Mode
Answer:
- Mode < Median < Mean
Explanation:
In a positively skewed distribution, the long tail is on the right side. The mean is pulled in the direction of the skew (right), so the mean is greater than the median, which is greater than the mode.
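The ordering can be checked on a small example. The sketch below uses made-up values (not exam data) and a hypothetical helper, stat_mode(), because base R has no built-in function for the statistical mode.

```r
# Toy right-skewed sample (hypothetical values, not exam data).
x <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

# Base R has no statistical mode function, so tabulate the values
# and take the one that appears most often.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mode   <- stat_mode(x)  # most frequent value
x_median <- median(x)     # middle of the sorted values
x_mean   <- mean(x)       # pulled upward by the large values 9 and 15

c(mode = x_mode, median = x_median, mean = x_mean)
```

Here the mode is 2, the median is 3, and the mean is 4.6, so Mode < Median < Mean, as expected for a long right tail.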
Question 3
What is NOT an essential component in ggplot() data visualization?
- Data frames
- Geometric objects
- Facets
- Aesthetic attributes
Answer:
- Facets
Explanation:
While facets are useful for creating multiple plots based on a factor, they are not essential components of a ggplot(). The essential components are data frames, geometric objects, and aesthetic attributes.
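As a minimal sketch of this point, the code below builds a plot from only the three essential pieces and then adds a facet as an optional extra. The data frame toy and its columns are made up for illustration.

```r
library(ggplot2)

# Toy data frame (hypothetical values).
toy <- data.frame(
  age  = c(19, 22, 25, 28, 31, 34),
  xg   = c(0.40, 0.35, 0.22, 0.20, 0.18, 0.21),
  team = c("A", "A", "A", "B", "B", "B")
)

# The three essential pieces: data, an aesthetic mapping, and a geometric object.
p_basic <- ggplot(data = toy, mapping = aes(x = age, y = xg)) +
  geom_point()

# Facets are optional: they split a plot that already works into panels.
p_faceted <- p_basic + facet_wrap(~ team)
```

p_basic is a complete plot without any facet specification; p_faceted only changes how that same plot is arranged.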
Question 4
____(1)____ does not necessarily imply ____(2)____
- (1) Correlation; (2) causation
- (1) Causation; (2) correlation
- (1) Correlation; (2) correlation
- (1) Causation; (2) causation
Answer:
- (1) Correlation; (2) causation
Explanation:
The phrase “correlation does not necessarily imply causation” means that just because two variables are correlated, it doesn’t mean one causes the other.
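One way to see this is a small simulation in which a third variable drives both x and y. This is only an illustrative sketch: the variables z, x, and y and all numeric settings are assumptions chosen for the example.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

n <- 5000
z <- rnorm(n)                # hypothetical common cause (confounder)
x <- z + rnorm(n, sd = 0.5)  # x depends on z only
y <- z + rnorm(n, sd = 0.5)  # y depends on z only, never on x

# x and y are strongly correlated even though neither causes the other.
raw_cor <- cor(x, y)

# After removing the part of each variable explained by z,
# the remaining association is close to zero.
adj_cor <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = raw_cor, adjusted = adj_cor)
```

The raw correlation is large, yet by construction changing x would not change y.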
Questions 5-12
For Questions 5-12, consider the following R packages and the data.frame, nwsl_player_stats, containing individual player statistics for the National Women’s Soccer League (NWSL) in the 2022 season:
library(tidyverse)
library(skimr)
nwsl_player_stats <- read_csv("https://bcdanl.github.io/data/nwsl_player_stats.csv")
The first 5 observations in the nwsl_player_stats data frame are displayed below:
player | nation | pos | squad | age | mp | starts | min |
---|---|---|---|---|---|---|---|
M. A. Vignola | us USA | MFFW | Angel City | 23 | 2 | 0 | 18 |
Michaela Abam | cm CMR | FW | Dash | 24 | 12 | 3 | 273 |
Kerry Abello | us USA | FWMF | Pride | 22 | 21 | 12 | 1042 |
Jillienne Aguilera | NA | DFMF | Red Stars | NA | 17 | 5 | 580 |
Tinaya Alexander | eng ENG | FWMF | Spirit | 22 | 9 | 1 | 167 |
player | nation | pos | squad | xGp90 | xAp90 | xGxAp90 |
---|---|---|---|---|---|---|
M. A. Vignola | us USA | MFFW | Angel City | 0.00 | 0.00 | 0.00 |
Michaela Abam | cm CMR | FW | Dash | 0.26 | 0.10 | 0.36 |
Kerry Abello | us USA | FWMF | Pride | 0.16 | 0.05 | 0.20 |
Jillienne Aguilera | NA | DFMF | Red Stars | 0.05 | 0.04 | 0.09 |
Tinaya Alexander | eng ENG | FWMF | Spirit | 0.58 | 0.03 | 0.62 |
player | nation | pos | squad | npxGp90 | npxGxAp90 |
---|---|---|---|---|---|
M. A. Vignola | us USA | MFFW | Angel City | 0.00 | 0.00 |
Michaela Abam | cm CMR | FW | Dash | 0.26 | 0.36 |
Kerry Abello | us USA | FWMF | Pride | 0.16 | 0.20 |
Jillienne Aguilera | NA | DFMF | Red Stars | 0.05 | 0.09 |
Tinaya Alexander | eng ENG | FWMF | Spirit | 0.16 | 0.19 |
- The nwsl_player_stats data frame has 314 observations and 13 variables.
Description of Variables in nwsl_player_stats:
- player: Player name
- nation: Player home country
- pos: Player position (e.g., GK, FW, MF, etc.)
- squad: Player team
- age: Age of player
- mp: Matches played
- starts: Number of matches in which player started the game
- min: Total minutes played in the season
- xGp90: Expected goals per ninety minutes
- xG is simply the probability of scoring a goal from a given spot on the field when a shot is taken.
- xAp90: Expected assists per ninety minutes
- xA is simply the probability of assisting a goal by delivering a pass that creates a scoring opportunity.
- xGxAp90: Expected goals plus assists per ninety minutes
- npxGp90: Expected goals minus penalty goals per ninety minutes
- npxGxAp90: Expected goals plus assists minus penalty goals per ninety minutes
A player who is consistently achieving a high number of xG (or xA) will be one who is getting into a good position consistently on the field. Coaches and scouts can use this to evaluate whether a player is exceedingly (un)lucky over a given number of games, and this will help in evaluating that player’s offensive skills beyond simple counts.
The following is the summary of the nwsl_player_stats data.frame, including descriptive statistics for each variable.
Name | nwsl_player_stats |
Number of rows | 314 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | min | max | empty | n_unique |
---|---|---|---|---|---|
player | 0 | 5 | 26 | 0 | 303 |
nation | 17 | 6 | 7 | 0 | 29 |
pos | 0 | 2 | 4 | 0 | 10 |
squad | 0 | 4 | 10 | 0 | 12 |
Variable type: numeric
skim_variable | n_missing | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|
age | 15 | 25.98 | 3.96 | 16 | 23.00 | 25.00 | 28.00 | 38.00 |
mp | 0 | 12.61 | 6.95 | 1 | 6.00 | 14.00 | 19.00 | 22.00 |
starts | 0 | 9.25 | 7.31 | 0 | 2.00 | 9.00 | 16.00 | 22.00 |
min | 0 | 831.81 | 631.52 | 1 | 234.00 | 743.50 | 1398.00 | 1980.00 |
xGp90 | 2 | 0.13 | 0.16 | 0 | 0.01 | 0.06 | 0.20 | 0.77 |
xAp90 | 2 | 0.10 | 0.47 | 0 | 0.01 | 0.05 | 0.11 | 8.26 |
xGxAp90 | 2 | 0.23 | 0.50 | 0 | 0.04 | 0.13 | 0.32 | 8.26 |
npxGp90 | 2 | 0.12 | 0.14 | 0 | 0.01 | 0.06 | 0.18 | 0.77 |
npxGxAp90 | 2 | 0.22 | 0.50 | 0 | 0.04 | 0.13 | 0.30 | 8.26 |
Question 5
Write code to produce the above summary for the nwsl_player_stats data.frame, including descriptive statistics for each variable.
Answer: skim(nwsl_player_stats)
Explanation:
The skim() function from the skimr package provides a comprehensive summary of a data frame, including descriptive statistics for each variable.
Question 6
What code would you use to count the number of players in each team?
nwsl_player_stats |> count(player)
nwsl_player_stats |> count(nation)
nwsl_player_stats |> count(pos)
nwsl_player_stats |> count(squad)
Answer:
nwsl_player_stats |> count(squad)
Explanation:
To count the number of players in each team, you need to count the occurrences of each team in the squad variable.
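A small sketch of what count() returns, using a made-up stand-in for the real data (toy_players and its rows are hypothetical). It loads dplyr directly, which library(tidyverse) would also attach.

```r
library(dplyr)

# Toy stand-in for nwsl_player_stats (hypothetical rows).
toy_players <- tibble(
  player = c("P1", "P2", "P3", "P4", "P5"),
  squad  = c("Dash", "Dash", "Pride", "Spirit", "Dash")
)

# count() groups by squad and stores the number of rows per group in column n.
squad_counts <- toy_players |> count(squad)

# Equivalent long form with group_by() and summarize().
squad_counts_long <- toy_players |>
  group_by(squad) |>
  summarize(n = n())

squad_counts
```

Because each row is one player, the number of rows per squad is the number of players on that team.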
Question 7
What is the median value of mp? Find this value from the summary of the nwsl_player_stats data frame.
Answer: 14.00
Explanation:
From the summary provided by skim(nwsl_player_stats), the median value for mp is 14.00. This means that half of the players played 14 matches or fewer.
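The p50 column in the skim() output is the 50th percentile, which is the median. The sketch below shows the equivalence on made-up values (mp_toy is hypothetical, not the real mp column).

```r
# Toy matches-played values (hypothetical, not the real mp column).
mp_toy <- c(1, 6, 9, 14, 14, 19, 22)

# The median and the 50th percentile are the same quantity.
mp_median <- median(mp_toy)
mp_p50    <- unname(quantile(mp_toy, probs = 0.5))

# At least half of the values are at or below the median.
share_at_or_below <- mean(mp_toy <= mp_median)

c(median = mp_median, p50 = mp_p50, share = share_at_or_below)
```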
Question 8
- We are interested in players who score or assist on a goal.
- To achieve this, we create a new data.frame, nwsl_nonGK_stats, which includes only players who are NOT goalkeepers from the nwsl_player_stats data frame.
nwsl_nonGK_stats <- nwsl_player_stats |>
  filter(___BLANK___)
- The pos value is “GK” for a goalkeeper. Which condition correctly fills in the BLANK to complete the code above?
!is.na(pos)
is.na(pos)
pos == "GK"
pos != "GK"
Answer:
pos != "GK"
Explanation:
To filter out goalkeepers, we select observations where the position (pos) is not equal to “GK”.
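The sketch below applies the same condition to a made-up data frame (toy_players is hypothetical). It also includes one row with a missing pos to show a side effect worth knowing: filter() drops rows where the condition evaluates to NA. In the real data, pos has no missing values, so this does not affect the answer.

```r
library(dplyr)

# Toy stand-in for nwsl_player_stats (hypothetical rows).
toy_players <- tibble(
  player = c("P1", "P2", "P3", "P4"),
  pos    = c("GK", "FW", "DFMF", NA)
)

# Keep rows whose position is not "GK".
# NA != "GK" evaluates to NA, so P4 is dropped as well.
non_gk <- toy_players |> filter(pos != "GK")

non_gk
```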
Question 9
- Additionally, we are interested in non-goalkeeper players who played matches consistently throughout the season.
- To achieve this, we create a new data.frame, nwsl_nonGK_stats_filtered, which includes only non-GK players who played in at least 10 matches (mp) and started in at least 7 matches (starts).
nwsl_nonGK_stats_filtered <- nwsl_nonGK_stats |>
  filter(___BLANK___)
- Which condition correctly fills in the BLANK to complete the code above?
mp > 10 | starts > 7
mp >= 10 | starts >= 7
mp > 10 & starts > 7
mp >= 10 & starts >= 7
Answer:
mp >= 10 & starts >= 7
Explanation:
We want players who meet both conditions: played in at least 10 matches (mp >= 10) and started in at least 7 matches (starts >= 7). The logical operator & ensures both conditions are met.
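To see how the four options differ, the sketch below runs three of them on a made-up data frame (toy_players is hypothetical) that includes players sitting exactly on the thresholds.

```r
library(dplyr)

# Toy players, including one exactly at the thresholds (hypothetical rows).
toy_players <- tibble(
  player = c("P1", "P2", "P3", "P4"),
  mp     = c(10, 12, 9, 21),
  starts = c(7, 3, 8, 12)
)

# & keeps rows where BOTH conditions hold: P1 and P4.
both <- toy_players |> filter(mp >= 10 & starts >= 7)

# | keeps rows where AT LEAST ONE condition holds: all four players.
either <- toy_players |> filter(mp >= 10 | starts >= 7)

# Strict inequalities exclude the boundary player P1: only P4 remains.
strict <- toy_players |> filter(mp > 10 & starts > 7)
```

"At least 10" includes 10 itself, which is why >= is needed rather than >.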
Question 10
How would you describe the relationship between age and xGxAp90 (expected goals plus assists per ninety minutes) using the nwsl_nonGK_stats_filtered data.frame?
- To identify outlier players, such as star players and young players, some player names are added to such points in the plot.
- Note that it is NOT required to provide the code for adding these texts to the plot.
Complete the code by filling in the blanks (1)-(4).
ggplot(data = ___(1)___,
mapping = aes(x = ___(2)___,
y = ___(3)___)) +
___(4)___(alpha = 0.5) +
geom_smooth()
Blank (1)
nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats
Answer:
nwsl_nonGK_stats_filtered
Blank (2)
age
xGxAp90
xGp90
xAp90
Answer:
age
Blank (3)
age
xGxAp90
xGp90
xAp90
Answer:
xGxAp90
Blank (4)
geom_boxplot
geom_scatterplot
geom_point
geom_histogram
Answer:
geom_point
Young Players
Who are the young players under the age of 20?
Answer: Olivia Moultrie and Trinity Rodman
Star Players
Who are the star players whose xGxAp90 is greater than 0.75?
Answer: Sophia Smith, Mallory Pugh, Debinha, and Megan Rapinoe
Relationship
Describe the overall relationship between age and xGxAp90 (expected goals plus assists per ninety minutes).
Answer: Overall, xGxAp90 decreases as age increases up to 24, after which it remains relatively constant.
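Putting the four answers for Question 10 together gives the code sketched below. To keep the example self-contained it runs on randomly generated stand-in data (toy_filtered is hypothetical), so it shows the structure of the plot only and will not reproduce the pattern described in the answer.

```r
library(ggplot2)

set.seed(42)  # fixed seed so the toy data are reproducible

# Random stand-in for nwsl_nonGK_stats_filtered (hypothetical values).
toy_filtered <- data.frame(
  age     = sample(18:36, 80, replace = TRUE),
  xGxAp90 = runif(80, min = 0, max = 0.9)
)

p_age <- ggplot(data = toy_filtered,
                mapping = aes(x = age,
                              y = xGxAp90)) +
  geom_point(alpha = 0.5) +  # semi-transparent points reveal overlap
  geom_smooth()              # fitted curve with its uncertainty ribbon
```

With the real data, replace toy_filtered with nwsl_nonGK_stats_filtered.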
Question 11
How would you describe how the distribution of xAp90 (expected assists per ninety minutes) varies by team (squad) using the nwsl_nonGK_stats_filtered data.frame?
- Note that the squad categories are sorted by the median of xAp90 in the plot.
Complete the code by filling in the blanks.
ggplot(data = ___(1)___,
mapping = aes(x = ___(2)___,
y = ___(3)___)) +
___(4)___() +
labs(y = "NWSL Teams")
Blank (1)
nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats
Answer:
nwsl_nonGK_stats_filtered
Blank (2)
squad
xGxAp90
xGp90
xAp90
Answer:
xAp90
Blank (3)
squad
xGxAp90
xGp90
xAp90
fct_reorder(squad, xGxAp90)
fct_reorder(xGxAp90, squad)
fct_reorder(squad, xGp90)
fct_reorder(xGp90, squad)
fct_reorder(squad, xAp90)
fct_reorder(xAp90, squad)
Answer:
fct_reorder(squad, xAp90)
Blank (4)
geom_boxplot
geom_bar
geom_point
geom_histogram
Answer:
geom_boxplot
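The reason fct_reorder(squad, xAp90) matches the note about sorting by the median is that fct_reorder() uses the median as its default summary function. The sketch below checks this on made-up data (toy_assists is hypothetical) and then builds the plot with the answers filled in.

```r
library(ggplot2)
library(forcats)

# Toy data with clearly separated medians per squad (hypothetical values).
toy_assists <- data.frame(
  squad = rep(c("A", "B", "C"), each = 3),
  xAp90 = c(0.30, 0.32, 0.34,  # A: median 0.32
            0.05, 0.06, 0.07,  # B: median 0.06
            0.15, 0.16, 0.17)  # C: median 0.16
)

# Levels are ordered by the median of xAp90 within each squad, ascending.
reordered <- fct_reorder(toy_assists$squad, toy_assists$xAp90)
levels(reordered)  # "B" "C" "A"

p_box <- ggplot(data = toy_assists,
                mapping = aes(x = xAp90,
                              y = fct_reorder(squad, xAp90))) +
  geom_boxplot() +
  labs(y = "NWSL Teams")
```

The labs(y = ...) call is needed because the default axis title would otherwise be the full fct_reorder() expression.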
Question 12
Question 12 is about a ggplot code to visualize how the distribution of pos (player position) varies by teams (squad) using the nwsl_player_stats data.frame.
Part 1
Complete the code by filling in the blanks to replicate the given plot.
ggplot(data = ___(1)___,
mapping = aes(___(2)___ = squad,
___(3)___ = pos)) +
geom_bar(___(4)___)
Blank (1)
nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats
Answer:
nwsl_player_stats
Blank (2)
x
y
color
fill
count
Answer:
y
Blank (3)
x
y
color
fill
count
Answer:
fill
Blank (4)
- a. position = "stack"
- b. position = "fill"
- c. position = "dodge"
- d. Leaving (4) empty
- e. Both a and d
- f. Both b and d
- g. Both c and d
Answer:
- Both a and d
Explanation:
Since position = "stack" is the default for geom_bar(), leaving it empty achieves the same effect. So both options are correct.
Part 2
Complete the code by filling in the blanks to replicate the given plot.
ggplot(data = ___(1)___,
mapping = aes(___(2)___ = squad,
___(3)___ = pos)) +
geom_bar(___(4)___)
Blank (1)
nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats
Answer:
nwsl_player_stats
Blank (2)
x
y
color
fill
count
Answer:
y
Blank (3)
x
y
color
fill
count
Answer:
fill
Blank (4)
- a. position = "stack"
- b. position = "fill"
- c. position = "dodge"
- d. Leaving (4) empty
- e. Both a and d
- f. Both b and d
- g. Both c and d
Answer:
position = "dodge"
Part 3
Complete the code by filling in the blanks to replicate the given plot.
ggplot(data = ___(1)___,
mapping = aes(___(2)___ = squad,
___(3)___ = pos)) +
geom_bar(___(4)___) +
labs(x = "Proportion")
Blank (1)
nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats
Answer:
nwsl_player_stats
Blank (2)
x
y
color
fill
count
Answer:
y
Blank (3)
x
y
color
fill
count
Answer:
fill
Blank (4)
- a. position = "stack"
- b. position = "fill"
- c. position = "dodge"
- d. Leaving (4) empty
- e. Both a and d
- f. Both b and d
- g. Both c and d
Answer:
position = "fill"
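The three parts of Question 12 differ only in the position argument of geom_bar(). The sketch below builds all three variants side by side on a made-up roster (toy_roster is hypothetical).

```r
library(ggplot2)

# Toy roster (hypothetical rows standing in for nwsl_player_stats).
toy_roster <- data.frame(
  squad = c("A", "A", "A", "B", "B", "B", "B"),
  pos   = c("FW", "MF", "DF", "FW", "FW", "MF", "GK")
)

p_base <- ggplot(data = toy_roster,
                 mapping = aes(y = squad, fill = pos))

# Part 1: counts stacked within each squad (the default position).
p_stack <- p_base + geom_bar()

# Part 2: one bar per position, placed side by side within each squad.
p_dodge <- p_base + geom_bar(position = "dodge")

# Part 3: bars rescaled to the same length, showing proportions.
p_fill <- p_base + geom_bar(position = "fill") +
  labs(x = "Proportion")
```

With position = "fill" the axis no longer shows counts, which is why Part 3 relabels it as "Proportion".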
Section 2. Filling-in-the-Blanks
Question 13
When collecting data in real life, measured values often differ. In this context, we can observe variation easily; for example, if we measure any numeric variable (e.g., friends’ heights) twice, we are likely to get two different values.
Question 14
The mode of a variable is the value that appears most frequently within the set of that variable’s values.
Question 15
The gg in ggplot stands for Grammar of Graphics.
Question 16
Using regression, a machine learning method, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable. The grey ribbon around the curve illustrates the uncertainty surrounding the estimated curve.
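As a rough illustration of what the curve and ribbon represent, the sketch below fits a straight line with geom_smooth(method = "lm") and then computes the same kind of prediction and interval directly with lm(). The data are simulated, and choosing a linear model instead of the default smoother is an assumption made to keep the comparison simple.

```r
library(ggplot2)

set.seed(7)  # fixed seed so the toy data are reproducible

# Simulated data with a linear trend plus noise (hypothetical values).
toy <- data.frame(x = seq(1, 20, length.out = 60))
toy$y <- 2 + 0.5 * toy$x + rnorm(60)

# The line is the predicted y for each x; se = TRUE draws the ribbon.
p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE)

# The same idea outside ggplot: a prediction at x = 10 with a
# 95% confidence interval (the default level in geom_smooth() too).
fit  <- lm(y ~ x, data = toy)
pred <- predict(fit, newdata = data.frame(x = 10), interval = "confidence")
pred
```

The fit column corresponds to the height of the line at x = 10, and lwr and upr correspond to the edges of the ribbon there.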
Question 17
When making a scatterplot, it is a common practice to place the input variable along the x-axis and the output variable along the y-axis.
Question 18
In ggplot, we can set alpha between 0 (full transparency) and 1 (no transparency) manually to adjust a geometric object’s transparency level.
Section 3. Short Essay
Question 19
List at least three distinct pieces of advice shared by the invited guests for students seeking jobs in the data analytics industry.
Answer:
- Leverage Available Opportunities:
- Build resumes through tutoring, research, and extracurricular activities while in college.
- Gain Relevant Experience:
- Secure internships and engage in projects to stand out in a competitive job market.
- Work on Personal Projects:
- Develop one or two significant projects to showcase during interviews.
- Master Essential Tools:
- Focus on Python, R, and SQL for data-related roles.
- Combine Skills with Passion:
- Align data analytics skills with industries of personal interest for a fulfilling career.
- Understand Business Fundamentals:
- Learn basic finance and accounting to connect analytics with business needs.
- Develop Soft Skills:
- Enhance interpersonal and communication skills to collaborate with non-technical stakeholders effectively.
- Stay Current and Adaptable:
- Embrace learning new tools and technologies to remain relevant in the field.
- Network Strategically:
- Build relationships with professionals to gain insights and collaboration opportunities.
- Explore Various Projects:
- Experiment with diverse projects to gain confidence and transition smoothly from academia to industry.
Question 20
How does data storytelling bridge the gap between data and insights?
Answer:
Data storytelling bridges the gap between data and insights by combining descriptive statistics, visualization, and narration within the appropriate audience context, so that findings are presented effectively and drive data-informed decisions.
Question 21
What two main factors should a storyteller consider about the context before creating a data visualization or communication?
Answer:
- Audience: Understanding who the audience is, their level of expertise, interests, and what they care about, to tailor the message accordingly.
- Purpose: Clarifying the goal of the communication—whether to inform, persuade, or explore—and what action or understanding is desired from the audience.
Question 22
Provide at least three techniques to make data visualization more colorblind-friendly.
Answer:
- Use Colorblind-Friendly Palettes: Utilize colorblind-safe palettes (e.g., scale_color_tableau(), scale_color_colorblind()), which are designed to be distinguishable by colorblind individuals.
- Incorporate Textures and Patterns: Use different shapes or line types in addition to colors to differentiate data points.
- Provide Additional Visual Cues: Ensure that color is not the only means of conveying information; include labels, annotations, or legends that clarify the data.
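The three techniques can be combined in one plot, as in the sketch below. To avoid extra packages it uses scale_color_viridis_d(), which ships with ggplot2, in place of the palette functions named in the answer; that substitution and the data frame toy_groups are assumptions made for this example.

```r
library(ggplot2)

# Toy data with three groups (hypothetical values).
toy_groups <- data.frame(
  x     = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  y     = c(1, 2, 3, 2, 3, 4, 3, 4, 5),
  group = rep(c("A", "B", "C"), each = 3)
)

p_cb <- ggplot(toy_groups,
               aes(x = x, y = y,
                   color = group,       # technique 1: colorblind-safe palette below
                   shape = group,       # technique 2: shapes also separate groups
                   linetype = group)) + # technique 2: line types also separate groups
  geom_point(size = 3) +
  geom_line() +
  scale_color_viridis_d() +
  # technique 3: a clearly titled legend so color is not the only cue
  labs(color = "Group", shape = "Group", linetype = "Group")
```

Giving the three legends the same title lets ggplot2 merge them into a single legend that shows color, shape, and line type together.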