Midterm Exam 2

Classwork 13

Author

Byeong-Hak Choe

Published

December 7, 2024

Modified

December 7, 2024

Summary for Midterm Exam 2 Performance

The following provides the descriptive statistics for each part of the Midterm Exam 2 questions:

The following describes the distribution of Midterm Exam 2 score:

Section 1. Multiple Choice, Short Answer, and Various Questions

Question 1

The quote “I would have written a shorter letter, but I did not have the time” by Blaise Pascal emphasizes the challenge of being:

Detailed
Accurate
Concise
Creative

Answer:

Concise

Explanation:

Blaise Pascal’s quote underscores the difficulty of expressing ideas succinctly. Being concise often requires more thought and effort to distill complex ideas into clear, brief statements.

Question 2

When the distribution of a variable has a single peak and is positively skewed (i.e., having a long right tail), which of the following is correct?

Mean < Mode < Median
Mode < Median < Mean
Median < Mean < Mode
Median = Mean = Mode

Answer:

Mode < Median < Mean

Explanation:

In a positively skewed distribution, the long tail is on the right side. The mean is pulled in the direction of the skew (right), so the mean is greater than the median, which is greater than the mode.
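
The ordering can be checked on a small right-skewed sample. The sketch below uses made-up values chosen only for illustration; `mode_x` is a helper name introduced here, since base R has no built-in function for the statistical mode.

```r
# Hypothetical right-skewed sample: most values are small, a few are large
x <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

# Tabulate the values and take the most frequent one as the mode
mode_x   <- as.numeric(names(which.max(table(x))))
median_x <- median(x)
mean_x   <- mean(x)

# Mode is 2, median is 3, mean is 4.6
c(mode = mode_x, median = median_x, mean = mean_x)
```

The two large values (9 and 15) pull the mean to the right, while the median and mode stay near the bulk of the data.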

Question 3

What is NOT an essential component in ggplot() data visualization?

Data frames
Geometric objects
Facets
Aesthetic attributes

Answer:

Facets

Explanation:

While facets are useful for creating multiple plots based on a factor, they are not essential components of a ggplot(). The essential components are data frames, geometric objects, and aesthetic attributes.

Question 4

____(1)____ does not necessarily imply ____(2)____

(1) Correlation; (2) causation
(1) Causation; (2) correlation
(1) Correlation; (2) correlation
(1) Causation; (2) causation

Answer:

(1) Correlation; (2) causation

Explanation:

The phrase “correlation does not necessarily imply causation” means that just because two variables are correlated, it doesn’t mean one causes the other.
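
A short simulation can make this concrete. In the sketch below, which is an illustration and not part of the exam, a hidden variable `z` drives both `x` and `y`, so the two are strongly correlated even though neither causes the other. All variable names and parameter values are assumptions chosen for the example.

```r
set.seed(1)
n <- 1000

z <- rnorm(n)                 # hidden common cause
x <- z + rnorm(n, sd = 0.5)   # x depends on z only
y <- z + rnorm(n, sd = 0.5)   # y depends on z only, not on x

# x and y are strongly correlated
cor_xy <- cor(x, y)

# After removing the part explained by z, little correlation remains
cor_given_z <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(cor_xy = cor_xy, cor_given_z = cor_given_z)
```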

Questions 5-12

For Questions 5-12, consider the following R packages and the data.frame, nwsl_player_stats, containing individual player statistics for the National Women’s Soccer League (NWSL) in the 2022 season:

library(tidyverse)
library(skimr)
nwsl_player_stats <- read_csv("https://bcdanl.github.io/data/nwsl_player_stats.csv")

The first 5 observations in the nwsl_player_stats data frame are displayed below:

player	nation	pos	squad	age	mp	starts	min
M. A. Vignola	us USA	MFFW	Angel City	23	2	0	18
Michaela Abam	cm CMR	FW	Dash	24	12	3	273
Kerry Abello	us USA	FWMF	Pride	22	21	12	1042
Jillienne Aguilera	NA	DFMF	Red Stars	NA	17	5	580
Tinaya Alexander	eng ENG	FWMF	Spirit	22	9	1	167

player	nation	pos	squad	xGp90	xAp90	xGxAp90
M. A. Vignola	us USA	MFFW	Angel City	0.00	0.00	0.00
Michaela Abam	cm CMR	FW	Dash	0.26	0.10	0.36
Kerry Abello	us USA	FWMF	Pride	0.16	0.05	0.20
Jillienne Aguilera	NA	DFMF	Red Stars	0.05	0.04	0.09
Tinaya Alexander	eng ENG	FWMF	Spirit	0.58	0.03	0.62

player	nation	pos	squad	npxGp90	npxGxAp90
M. A. Vignola	us USA	MFFW	Angel City	0.00	0.00
Michaela Abam	cm CMR	FW	Dash	0.26	0.36
Kerry Abello	us USA	FWMF	Pride	0.16	0.20
Jillienne Aguilera	NA	DFMF	Red Stars	0.05	0.09
Tinaya Alexander	eng ENG	FWMF	Spirit	0.16	0.19

The nwsl_player_stats data frame has 314 observations and 13 variables.

Description of Variables in `nwsl_player_stats`:

player: Player name
nation: Player home country
pos: Player position (e.g., GK, FW, MF, etc.)
squad: Player team
age: Age of player
mp: Matches played
starts: Number of matches in which player started the game
min: Total minutes played in the season
xGp90: Expected goals per ninety minutes
- xG is simply the probability of scoring a goal from a given spot on the field when a shot is taken.
xAp90: Expected assists per ninety minutes
- xA is simply the probability of assisting a goal by delivering a pass that creates a scoring opportunity.
xGxAp90: Expected goals plus assists per ninety minutes
npxGp90: Non-penalty expected goals (expected goals excluding penalties) per ninety minutes
npxGxAp90: Non-penalty expected goals plus expected assists per ninety minutes

A player who is consistently achieving a high number of xG (or xA) will be one who is getting into a good position consistently on the field. Coaches and scouts can use this to evaluate whether a player is exceedingly (un)lucky over a given number of games, and this will help in evaluating that player’s offensive skills beyond simple counts.

The following is a summary of the nwsl_player_stats data.frame, including descriptive statistics for each variable.

Data summary
Name	nwsl_player_stats
Number of rows	314
Number of columns	13
_______________________
Column type frequency:
character	4
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	min	max	n_unique
player	0	5	26	303
nation	17	6	7	29
pos	0	2	4	10
squad	0	4	10	12

Variable type: numeric

skim_variable	n_missing	mean	sd	p0	p25	p50	p75	p100
age	15	25.98	3.96	16	23.00	25.00	28.00	38.00
mp	0	12.61	6.95	1	6.00	14.00	19.00	22.00
starts	0	9.25	7.31	0	2.00	9.00	16.00	22.00
min	0	831.81	631.52	1	234.00	743.50	1398.00	1980.00
xGp90	2	0.13	0.16	0	0.01	0.06	0.20	0.77
xAp90	2	0.10	0.47	0	0.01	0.05	0.11	8.26
xGxAp90	2	0.23	0.50	0	0.04	0.13	0.32	8.26
npxGp90	2	0.12	0.14	0	0.01	0.06	0.18	0.77
npxGxAp90	2	0.22	0.50	0	0.04	0.13	0.30	8.26

Question 5

Write code to produce the above summary for the nwsl_player_stats data.frame, including descriptive statistics for each variable.

Answer: skim(nwsl_player_stats)

Explanation:

The skim() function from the skimr package provides a comprehensive summary of a data frame, including descriptive statistics for each variable.

Question 6

What code would you use to count the number of players in each team?

nwsl_player_stats |> count(player)
nwsl_player_stats |> count(nation)
nwsl_player_stats |> count(pos)
nwsl_player_stats |> count(squad)

Answer:

nwsl_player_stats |> count(squad)

Explanation:

To count the number of players in each team, you need to count the occurrences of each team in the squad variable.
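
As a minimal sketch of what count() does, the example below uses a small data frame with hypothetical squad labels (not the real NWSL data) and compares count(squad) with the equivalent group_by() and summarise() pipeline.

```r
library(dplyr)

# Hypothetical rows: one row per player, only the squad column is needed
toy <- data.frame(
  squad = c("Dash", "Pride", "Dash", "Spirit", "Pride", "Dash")
)

# One row per squad, with the number of players in column n
by_count <- toy |> count(squad)

# The same result written out with group_by() and summarise()
by_summarise <- toy |>
  group_by(squad) |>
  summarise(n = n())

by_count
```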

Question 7

What is the median value of mp? Find this value from the summary of the nwsl_player_stats data frame.

Answer: 14.00

Explanation:

From the summary provided by skim(nwsl_player_stats), the median value for mp is 14.00. This means that half of the players played 14 matches or fewer.

Question 8

We are interested in players who score or assist on a goal.
To achieve this, we create a new data.frame, nwsl_nonGK_stats, which includes only players who are NOT goalkeepers from the nwsl_player_stats data frame.

nwsl_nonGK_stats <- nwsl_player_stats |> 
  filter(___BLANK___)

The pos value is “GK” for a goalkeeper. Which condition correctly fills in the BLANK to complete the code above?

!is.na(pos)
is.na(pos)
pos == "GK"
pos != "GK"

Answer:

pos != "GK"

Explanation:

To filter out goalkeepers, we select observations where the position (pos) is not equal to “GK”.

Question 9

Additionally, we are interested in non-goalkeeper players who played matches consistently throughout the season.
To achieve this, we create a new data.frame, nwsl_nonGK_stats_filtered, which includes only non-goalkeeper players who played in at least 10 matches (mp) and started in at least 7 matches (starts).

nwsl_nonGK_stats_filtered <- nwsl_nonGK_stats |> 
  filter(___BLANK___)

Which condition correctly fills in the BLANK to complete the code above?

mp > 10 | starts > 7
mp >= 10 | starts >= 7
mp > 10 & starts > 7
mp >= 10 & starts >= 7

Answer:

mp >= 10 & starts >= 7

Explanation:

We want players who meet both conditions: played in at least 10 matches (mp >= 10) and started in at least 7 matches (starts >= 7). The logical operator & ensures both conditions are met.
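
The filters from Questions 8 and 9 can be tried on a small example. The sketch below uses the five players displayed earlier, plus one hypothetical goalkeeper row ("Example Keeper") added only so that the pos filter has something to remove. The object names `players`, `non_gk`, `both`, and `either` are introduced here for illustration.

```r
library(dplyr)

# First five rows shown above, plus one HYPOTHETICAL goalkeeper row
players <- data.frame(
  player = c("M. A. Vignola", "Michaela Abam", "Kerry Abello",
             "Jillienne Aguilera", "Tinaya Alexander", "Example Keeper"),
  pos    = c("MFFW", "FW", "FWMF", "DFMF", "FWMF", "GK"),
  mp     = c(2, 12, 21, 17, 9, 20),
  starts = c(0, 3, 12, 5, 1, 20)
)

# Question 8: drop goalkeepers
non_gk <- players |> filter(pos != "GK")

# Question 9: both conditions must hold (&)
both <- non_gk |> filter(mp >= 10 & starts >= 7)

# For contrast: at least one condition holds (|)
either <- non_gk |> filter(mp >= 10 | starts >= 7)

both$player    # only Kerry Abello meets both thresholds
either$player  # Abam, Abello, and Aguilera meet at least one
```

Using | instead of & would also keep players who appeared often but rarely started, which is not what the question asks for.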

Question 10

How would you describe the relationship between age and xGxAp90 (expected goals plus assists per ninety minutes) using the nwsl_nonGK_stats_filtered data.frame?

To identify outlier players, such as star players and young players, the plot labels some of these points with player names.
Note that it is NOT required to provide the code for adding these texts to the plot.

Complete the code by filling in the blanks (1)-(4).

ggplot(data = ___(1)___,
       mapping = aes(x = ___(2)___, 
                     y = ___(3)___)) +
  ___(4)___(alpha = 0.5) +
  geom_smooth()

Blank (1)

nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats

Answer:

nwsl_nonGK_stats_filtered

Blank (2)

age
xGxAp90
xGp90
xAp90

Answer:

age

Blank (3)

age
xGxAp90
xGp90
xAp90

Answer:

xGxAp90

Blank (4)

geom_boxplot
geom_scatterplot
geom_point
geom_histogram

Answer:

geom_point

Young Players

Who are the young players under the age of 20?

Answer: Olivia Moultrie and Trinity Rodman

Star Players

Who are the star players whose xGxAp90 is greater than 0.75?

Answer: Sophia Smith, Mallory Pugh, Debinha, and Megan Rapinoe

Relationship

Describe the overall relationship between age and xGxAp90 (expected goals plus assists per ninety minutes).

Answer: Overall, xGxAp90 decreases as age increases up to 24, after which it remains relatively constant.

Question 11

How would you describe how the distribution of xAp90 (expected assists per ninety minutes) varies by team (squad), using the nwsl_nonGK_stats_filtered data.frame?

Note that the squad categories are sorted by the median of xAp90 in the plot.

Complete the code by filling in the blanks.

ggplot(data = ___(1)___,
       mapping = aes(x = ___(2)___, 
                     y = ___(3)___)) +
  ___(4)___() +
  labs(y = "NWSL Teams")

Blank (1)

nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats

Answer:

nwsl_nonGK_stats_filtered

Blank (2)

squad
xGxAp90
xGp90
xAp90

Answer:

xAp90

Blank (3)

squad
xGxAp90
xGp90
xAp90
fct_reorder(squad, xGxAp90)
fct_reorder(xGxAp90, squad)
fct_reorder(squad, xGp90)
fct_reorder(xGp90, squad)
fct_reorder(squad, xAp90)
fct_reorder(xAp90, squad)

Answer:

fct_reorder(squad, xAp90)

Blank (4)

geom_boxplot
geom_bar
geom_point
geom_histogram

Answer:

geom_boxplot
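
The sorting in this plot comes from fct_reorder(), which by default orders the levels of its first argument by the median of its second. The sketch below shows the same pattern on the mpg dataset that ships with ggplot2, used here as a stand-in for the NWSL data: class plays the role of squad and hwy the role of xAp90.

```r
library(ggplot2)
library(forcats)

# Reorder the categories by the median of the numeric variable
ordered_class <- fct_reorder(mpg$class, mpg$hwy)

# Medians listed in level order are non-decreasing
medians_in_order <- as.numeric(tapply(mpg$hwy, ordered_class, median))
medians_in_order

# Same structure as the Question 11 answer, with stand-in variables
p <- ggplot(data = mpg,
            mapping = aes(x = hwy,
                          y = fct_reorder(class, hwy))) +
  geom_boxplot() +
  labs(y = "Vehicle class")
```

Swapping the arguments, as in fct_reorder(hwy, class), would try to reorder the numeric variable by the category, which is why those options are incorrect.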

Question 12

Question 12 concerns ggplot code that visualizes how the distribution of pos (player position) varies by team (squad), using the nwsl_player_stats data.frame.

Part 1

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___)

Blank (1)

nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats

Answer:

nwsl_player_stats

Blank (2)

x
y
color
fill
count

Answer:

Blank (3)

x
y
color
fill
count

Answer:

fill

Blank (4)

a. position = "stack"
b. position = "fill"
c. position = "dodge"
d. Leaving (4) empty
e. Both a and d
f. Both b and d
g. Both c and d

Answer:

Both a and d

Explanation:

Since position = "stack" is the default for geom_bar(), leaving it empty achieves the same effect. So both options are correct.

Part 2

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___)

Blank (1)

nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats

Answer:

nwsl_player_stats

Blank (2)

x
y
color
fill
count

Answer:

Blank (3)

x
y
color
fill
count

Answer:

fill

Blank (4)

a. position = "stack"
b. position = "fill"
c. position = "dodge"
d. Leaving (4) empty
e. Both a and d
f. Both b and d
g. Both c and d

Answer:

position = "dodge"

Part 3

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___) +
  labs(x = "Proportion")

Blank (1)

nwsl_nonGK_stats_filtered
nwsl_nonGK_stats
nwsl_player_stats

Answer:

nwsl_player_stats

Blank (2)

x
y
color
fill
count

Answer:

y

Blank (3)

x
y
color
fill
count

Answer:

fill

Blank (4)

a. position = "stack"
b. position = "fill"
c. position = "dodge"
d. Leaving (4) empty
e. Both a and d
f. Both b and d
g. Both c and d

Answer:

position = "fill"
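
The three position settings in Question 12 can be compared side by side. The sketch below is an illustration on the mpg dataset that ships with ggplot2, not the NWSL data: class stands in for squad and drv for pos. Mapping the category to y is an assumption based on labs(x = "Proportion") in Part 3, which implies the bar length runs along the x-axis.

```r
library(ggplot2)

# Shared data and aesthetics; only the bar position changes
base <- ggplot(data = mpg, mapping = aes(y = class, fill = drv))

p_default <- base + geom_bar()                      # Part 1, blank left empty
p_stack   <- base + geom_bar(position = "stack")    # Part 1, explicit default
p_dodge   <- base + geom_bar(position = "dodge")    # Part 2, side-by-side bars
p_fill    <- base + geom_bar(position = "fill") +   # Part 3, bars scaled to 1
  labs(x = "Proportion")

# The default and the explicit "stack" produce the same computed bars
same_bars <- isTRUE(all.equal(layer_data(p_default), layer_data(p_stack)))

# With "fill", every stacked bar ends at a proportion of 1
longest_fill_bar <- max(layer_data(p_fill)$xmax)
```

Printing any of the four plot objects draws the corresponding chart.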

Section 2. Filling-in-the-Blanks

Question 13

When collecting data in real life, measured values often differ. In this context, we can observe variation easily; for example, if we measure any numeric variable (e.g., friends’ heights) twice, we are likely to get two different values.

Question 14

The mode of a variable is the value that appears most frequently within the set of that variable’s values.

Question 15

The gg in ggplot stands for Grammar of Graphics.

Question 16

Using regression, a machine learning method, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable. The grey ribbon around the curve illustrates the uncertainty surrounding the estimated curve.

Question 17

When making a scatterplot, it is a common practice to place the input variable along the x-axis and the output variable along the y-axis.

Question 18

In ggplot, we can set alpha between 0 (full transparency) and 1 (no transparency) manually to adjust a geometric object’s transparency level.
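
The ideas in Questions 16 to 18 can be seen together in one small plot. The sketch below uses the built-in mtcars dataset as a stand-in, with wt as the input on the x-axis and mpg as the output on the y-axis. Choosing method = "lm" (a straight-line fit) is an assumption made here to keep the example simple; calling geom_smooth() with no arguments picks a smoother automatically.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5) +                     # half-transparent points
  geom_smooth(method = "lm", formula = y ~ x)   # fitted line plus ribbon

# Layer 2 is the smooth: y is the predicted value, and ymin/ymax are the
# edges of the grey ribbon (a 95% confidence interval by default)
ribbon <- layer_data(p, 2)
head(ribbon[, c("x", "y", "ymin", "ymax")])
```

Setting alpha closer to 0 makes the points fainter, which helps when many points overlap.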

Section 3. Short Essay

Question 19

List at least three distinct pieces of advice shared by the invited guests for students seeking jobs in the data analytics industry.

Answer:

Leverage Available Opportunities:
- Build resumes through tutoring, research, and extracurricular activities while in college.
Gain Relevant Experience:
- Secure internships and engage in projects to stand out in a competitive job market.
Work on Personal Projects:
- Develop one or two significant projects to showcase during interviews.
Master Essential Tools:
- Focus on Python, R, and SQL for data-related roles.
Combine Skills with Passion:
- Align data analytics skills with industries of personal interest for a fulfilling career.
Understand Business Fundamentals:
- Learn basic finance and accounting to connect analytics with business needs.
Develop Soft Skills:
- Enhance interpersonal and communication skills to collaborate with non-technical stakeholders effectively.
Stay Current and Adaptable:
- Embrace learning new tools and technologies to remain relevant in the field.
Network Strategically:
- Build relationships with professionals to gain insights and collaboration opportunities.
Explore Various Projects:
- Experiment with diverse projects to gain confidence and transition smoothly from academia to industry.

Question 20

How does data storytelling bridge the gap between data and insights?

Answer:

Data storytelling bridges the gap between data and insights by combining descriptive statistics, visualization, and narration, framed for the appropriate audience, to present findings effectively and drive data-informed decisions.

Question 21

What two main factors should a storyteller consider about the context before creating a data visualization or communication?

Answer:

Audience: Understanding who the audience is, their level of expertise, interests, and what they care about, to tailor the message accordingly.
Purpose: Clarifying the goal of the communication—whether to inform, persuade, or explore—and what action or understanding is desired from the audience.

Question 22

Provide at least three techniques to make data visualization more colorblind-friendly.

Answer:

Use Colorblind-Friendly Palettes: Utilize colorblind-safe palettes (e.g., scale_color_tableau(), scale_color_colorblind()), which are designed to be distinguishable by colorblind individuals.
Incorporate Textures and Patterns: Use different shapes or line types in addition to colors to differentiate data points.
Provide Additional Visual Cues: Ensure that color is not the only means of conveying information; include labels, annotations, or legends that clarify the data.
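
Two of these techniques can be combined in a single plot. The sketch below is an illustration on the mpg dataset that ships with ggplot2. It uses scale_color_viridis_d(), a palette built into ggplot2, in place of the ggthemes scales named above so that no extra package is needed; that substitution is a choice made for this example. The same variable is also mapped to shape, so groups remain distinguishable without color.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy,
                          color = drv, shape = drv)) +  # color AND shape
  geom_point(size = 2) +
  scale_color_viridis_d() +                             # colorblind-friendly palette
  labs(color = "Drive train", shape = "Drive train")    # one merged, labeled legend

# Each of the three drive-train groups gets its own color and its own shape
drawn <- layer_data(p)
c(colors = length(unique(drawn$colour)), shapes = length(unique(drawn$shape)))
```

With ggthemes installed, scale_color_viridis_d() could be replaced by scale_color_colorblind() or scale_color_tableau().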
