Midterm Exam 2

Classwork 13

Author

Byeong-Hak Choe

Published

December 7, 2024

Modified

December 7, 2024

Summary for Midterm Exam 2 Performance

The following provides the descriptive statistics for each part of the Midterm Exam 2 questions:


The following describes the distribution of Midterm Exam 2 scores:


Section 1. Multiple Choice, Short Answer, and Various Questions

Question 1

The quote “I would have written a shorter letter, but I did not have the time” by Blaise Pascal emphasizes the challenge of being:

  1. Detailed
  2. Accurate
  3. Concise
  4. Creative

Answer:

  3. Concise

Explanation:

Blaise Pascal’s quote underscores the difficulty of expressing ideas succinctly. Being concise often requires more thought and effort to distill complex ideas into clear, brief statements.

Question 2

When the distribution of a variable has a single peak and is positively skewed (i.e., having a long right tail), which of the following is correct?

  1. Mean < Mode < Median
  2. Mode < Median < Mean
  3. Median < Mean < Mode
  4. Median = Mean = Mode

Answer:

  2. Mode < Median < Mean

Explanation:

In a positively skewed distribution, the long tail is on the right side. The mean is pulled in the direction of the skew (right), so the mean is greater than the median, which is greater than the mode.
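A small numeric check can make this ordering concrete. The sketch below uses a made-up, right-skewed vector (the values are illustrative, not exam data) and a hypothetical helper, stat_mode(), because base R has no built-in function for the statistical mode.

```r
# Hypothetical right-skewed sample: most values are small, a few are large
x <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

# Helper (not part of base R): return the most frequent value
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mode   <- stat_mode(x)  # most frequent value
x_median <- median(x)     # middle value of the sorted data
x_mean   <- mean(x)       # pulled toward the long right tail

c(mode = x_mode, median = x_median, mean = x_mean)
```

For this vector the mode is 2, the median is 3, and the mean is 4.6, which matches Mode < Median < Mean.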

Question 3

What is NOT an essential component in ggplot() data visualization?

  1. Data frames
  2. Geometric objects
  3. Facets
  4. Aesthetic attributes

Answer:

  3. Facets

Explanation:

While facets are useful for splitting a plot into multiple panels based on a categorical variable, they are not an essential component of a ggplot() visualization. The essential components are data frames, geometric objects, and aesthetic attributes.
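As a minimal sketch of the three essential components, the following uses the mpg data frame bundled with ggplot2 (an assumption for illustration; it is not part of the exam data). The facet layer is added separately to show that the plot is already complete without it.

```r
library(ggplot2)

# 1. Data frame: mpg ships with ggplot2
# 2. Aesthetic attributes: map displ to x and hwy to y
# 3. Geometric object: draw each observation as a point
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Facets are optional: adding them splits the same plot into panels
p_faceted <- p + facet_wrap(~ drv)
```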

Question 4

____(1)____ does not necessarily imply ____(2)____

  1. (1) Correlation; (2) causation
  2. (1) Causation; (2) correlation
  3. (1) Correlation; (2) correlation
  4. (1) Causation; (2) causation

Answer:

  1. (1) Correlation; (2) causation

Explanation:

The phrase “correlation does not necessarily imply causation” means that just because two variables are correlated, it doesn’t mean one causes the other.
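One way to see this is a small simulation, sketched here under the assumption that a hidden third variable z drives both x and y. All names and values are made up for illustration; neither x nor y influences the other.

```r
set.seed(1)  # make the simulation reproducible

n <- 1000
z <- rnorm(n)                 # hidden common cause (confounder)
x <- z + rnorm(n, sd = 0.3)   # x depends on z only
y <- z + rnorm(n, sd = 0.3)   # y depends on z only, not on x

r_xy <- cor(x, y)             # strong correlation despite no causal link

# Removing the part explained by z leaves x and y nearly uncorrelated
r_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = r_xy, after_removing_z = r_resid)
```

The raw correlation is high, yet it largely disappears once z is accounted for, so the correlation alone says nothing about x causing y.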

Questions 5-12

For Questions 5-12, consider the following R packages and the data.frame, nwsl_player_stats, containing individual player statistics for the National Women’s Soccer League (NWSL) in the 2022 season:

library(tidyverse)
library(skimr)
nwsl_player_stats <- read_csv("https://bcdanl.github.io/data/nwsl_player_stats.csv")

The first 5 observations in the nwsl_player_stats data frame are displayed below:

player              nation   pos   squad       age  mp  starts   min
M. A. Vignola       us USA   MFFW  Angel City   23   2       0    18
Michaela Abam       cm CMR   FW    Dash         24  12       3   273
Kerry Abello        us USA   FWMF  Pride        22  21      12  1042
Jillienne Aguilera  NA       DFMF  Red Stars    NA  17       5   580
Tinaya Alexander    eng ENG  FWMF  Spirit       22   9       1   167

player              nation   pos   squad       xGp90  xAp90  xGxAp90
M. A. Vignola       us USA   MFFW  Angel City   0.00   0.00     0.00
Michaela Abam       cm CMR   FW    Dash         0.26   0.10     0.36
Kerry Abello        us USA   FWMF  Pride        0.16   0.05     0.20
Jillienne Aguilera  NA       DFMF  Red Stars    0.05   0.04     0.09
Tinaya Alexander    eng ENG  FWMF  Spirit       0.58   0.03     0.62

player              nation   pos   squad       npxGp90  npxGxAp90
M. A. Vignola       us USA   MFFW  Angel City     0.00       0.00
Michaela Abam       cm CMR   FW    Dash           0.26       0.36
Kerry Abello        us USA   FWMF  Pride          0.16       0.20
Jillienne Aguilera  NA       DFMF  Red Stars      0.05       0.09
Tinaya Alexander    eng ENG  FWMF  Spirit         0.16       0.19
  • The nwsl_player_stats data frame has 314 observations and 13 variables.

Description of Variables in nwsl_player_stats:

  • player: Player name

  • nation: Player home country

  • pos: Player position (e.g., GK, FW, MF, etc.)

  • squad: Player team

  • age: Age of player

  • mp: Matches played

  • starts: Number of matches in which player started the game

  • min: Total minutes played in the season

  • xGp90: Expected goals per ninety minutes

    • xG is simply the probability of scoring a goal from a given spot on the field when a shot is taken.
  • xAp90: Expected assists per ninety minutes

    • xA is simply the probability of assisting a goal by delivering a pass that creates a scoring opportunity.
  • xGxAp90: Expected goals plus assists per ninety minutes

  • npxGp90: Expected goals minus penalty goals per ninety minutes

  • npxGxAp90: Expected goals plus assists minus penalty goals per ninety minutes

A player who is consistently achieving a high number of xG (or xA) will be one who is getting into a good position consistently on the field. Coaches and scouts can use this to evaluate whether a player is exceedingly (un)lucky over a given number of games, and this will help in evaluating that player’s offensive skills beyond simple counts.


The following is a summary of the nwsl_player_stats data.frame, including descriptive statistics for each variable.

Data summary
Name nwsl_player_stats
Number of rows 314
Number of columns 13
_______________________
Column type frequency:
character 4
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing min max empty n_unique
player 0 5 26 0 303
nation 17 6 7 0 29
pos 0 2 4 0 10
squad 0 4 10 0 12

Variable type: numeric

skim_variable n_missing mean sd p0 p25 p50 p75 p100
age 15 25.98 3.96 16 23.00 25.00 28.00 38.00
mp 0 12.61 6.95 1 6.00 14.00 19.00 22.00
starts 0 9.25 7.31 0 2.00 9.00 16.00 22.00
min 0 831.81 631.52 1 234.00 743.50 1398.00 1980.00
xGp90 2 0.13 0.16 0 0.01 0.06 0.20 0.77
xAp90 2 0.10 0.47 0 0.01 0.05 0.11 8.26
xGxAp90 2 0.23 0.50 0 0.04 0.13 0.32 8.26
npxGp90 2 0.12 0.14 0 0.01 0.06 0.18 0.77
npxGxAp90 2 0.22 0.50 0 0.04 0.13 0.30 8.26

Question 5

Write code to produce the above summary for the nwsl_player_stats data.frame, including descriptive statistics for each variable.

Answer: skim(nwsl_player_stats)

Explanation:

The skim() function from the skimr package provides a comprehensive summary of a data frame, including descriptive statistics for each variable.

Question 6

What code would you use to count the number of players in each team?

  1. nwsl_player_stats |> count(player)
  2. nwsl_player_stats |> count(nation)
  3. nwsl_player_stats |> count(pos)
  4. nwsl_player_stats |> count(squad)

Answer:

  4. nwsl_player_stats |> count(squad)

Explanation:

To count the number of players in each team, you need to count the occurrences of each team in the squad variable.
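The behavior of count() can be sketched on a tiny, hypothetical data frame (toy_players is made up and only mirrors the column names of nwsl_player_stats):

```r
library(dplyr)

# Hypothetical toy data with the same column names as nwsl_player_stats
toy_players <- tibble(
  player = c("A", "B", "C", "D", "E"),
  squad  = c("Dash", "Pride", "Dash", "Spirit", "Dash")
)

# One row per squad, with the number of players in column n
squad_counts <- toy_players |> count(squad)
squad_counts
```

Counting player instead would return one row per player name, which is why option 1 does not answer the question.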

Question 7

What is the median value of mp? Find this value from the summary of the nwsl_player_stats data frame.

Answer: 14.00

Explanation:

From the summary provided by skim(nwsl_player_stats), the median of mp is the value in the p50 column (the 50th percentile), which is 14.00. This means that at least half of the players played 14 matches or fewer.
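The link between the median and the 50th percentile can be checked directly in base R. The vector below is a hypothetical stand-in, not the real mp column:

```r
# Hypothetical matches-played values (not the real mp column)
mp_toy <- c(1, 6, 9, 14, 15, 19, 22)

mp_median <- median(mp_toy)                 # middle value of the sorted data
mp_p50    <- unname(quantile(mp_toy, 0.5))  # 50th percentile

# Share of observations at or below the median
share_at_or_below <- mean(mp_toy <= mp_median)

c(median = mp_median, p50 = mp_p50, share = share_at_or_below)
```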

Question 8

  • We are interested in players who score or assist on a goal.
  • To achieve this, we create a new data.frame, nwsl_nonGK_stats, which includes only the players from the nwsl_player_stats data frame who are NOT goalkeepers.
nwsl_nonGK_stats <- nwsl_player_stats |> 
  filter(___BLANK___)
  • The pos value is “GK” for a goalkeeper. Which condition correctly fills in the BLANK to complete the code above?
  1. !is.na(pos)
  2. is.na(pos)
  3. pos == "GK"
  4. pos != "GK"

Answer:

  4. pos != "GK"

Explanation:

To filter out goalkeepers, we select observations where the position (pos) is not equal to “GK”.
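A minimal sketch of this filter on hypothetical data (toy_players and toy_nonGK are made-up names used only for illustration):

```r
library(dplyr)

# Hypothetical toy data; pos uses the same codes as nwsl_player_stats
toy_players <- tibble(
  player = c("A", "B", "C", "D"),
  pos    = c("GK", "FW", "DFMF", "GK")
)

# Keep only rows whose position is not "GK"
toy_nonGK <- toy_players |>
  filter(pos != "GK")

toy_nonGK
```

filter() keeps only rows where the condition is TRUE, so rows with a missing pos would also be dropped. The summary above shows that pos has no missing values, so this does not matter here.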

Question 9

  • Additionally, we are interested in non-goalkeeper players who played matches consistently throughout the season.
  • To achieve this, we create a new data.frame, nwsl_nonGK_stats_filtered, which includes only non-GK players who played in at least 10 matches (mp) and started in at least 7 matches (starts).
nwsl_nonGK_stats_filtered <- nwsl_nonGK_stats |> 
  filter(___BLANK___)
  • Which condition correctly fills in the BLANK to complete the code above?
  1. mp > 10 | starts > 7
  2. mp >= 10 | starts >= 7
  3. mp > 10 & starts > 7
  4. mp >= 10 & starts >= 7

Answer:

  4. mp >= 10 & starts >= 7

Explanation:

We want players who meet both conditions: played in at least 10 matches (mp >= 10) and started in at least 7 matches (starts >= 7). The logical operator & ensures both conditions are met.
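The difference between the four options shows up clearly on boundary cases. The sketch below uses a hypothetical four-row data frame in which player A sits exactly on both thresholds:

```r
library(dplyr)

# Hypothetical toy data; player A sits exactly on both thresholds
toy_players <- tibble(
  player = c("A", "B", "C", "D"),
  mp     = c(10, 12, 9, 20),
  starts = c(7, 3, 8, 15)
)

both_conditions   <- toy_players |> filter(mp >= 10 & starts >= 7)  # correct answer
either_condition  <- toy_players |> filter(mp >= 10 | starts >= 7)  # too permissive
strict_inequality <- toy_players |> filter(mp > 10 & starts > 7)    # drops boundary cases
```

Here both_conditions keeps A and D, either_condition keeps all four players, and strict_inequality keeps only D because "at least" requires >= rather than >.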

Question 10

How would you describe the relationship between age and xGxAp90 (expected goals plus assists per ninety minutes) using the nwsl_nonGK_stats_filtered data.frame?

  • To identify outlier players, such as star players and young players, the names of some players are added as text labels next to their points in the plot.
  • Note that it is NOT required to provide the code for adding these text labels to the plot.

Complete the code by filling in the blanks (1)-(4).

ggplot(data = ___(1)___,
       mapping = aes(x = ___(2)___, 
                     y = ___(3)___)) +
  ___(4)___(alpha = 0.5) +
  geom_smooth()

Blank (1)

  1. nwsl_nonGK_stats_filtered
  2. nwsl_nonGK_stats
  3. nwsl_player_stats

Answer:

  1. nwsl_nonGK_stats_filtered

Blank (2)

  1. age
  2. xGxAp90
  3. xGp90
  4. xAp90

Answer:

  1. age

Blank (3)

  1. age
  2. xGxAp90
  3. xGp90
  4. xAp90

Answer:

  2. xGxAp90

Blank (4)

  1. geom_boxplot
  2. geom_scatterplot
  3. geom_point
  4. geom_histogram

Answer:

  3. geom_point
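Putting the four answers together gives the completed code. To keep the sketch runnable on its own, it uses toy_stats, a simulated stand-in for nwsl_nonGK_stats_filtered; the values are random and carry no information about real players.

```r
library(ggplot2)

set.seed(42)  # reproducible stand-in data

# Hypothetical stand-in for nwsl_nonGK_stats_filtered (values are simulated)
toy_stats <- data.frame(
  age     = sample(18:35, size = 60, replace = TRUE),
  xGxAp90 = round(runif(60, min = 0, max = 0.9), 2)
)

p <- ggplot(data = toy_stats,
            mapping = aes(x = age,
                          y = xGxAp90)) +
  geom_point(alpha = 0.5) +   # semi-transparent points reveal overlap
  geom_smooth()               # fitted curve with an uncertainty ribbon
```

With the real data, toy_stats would simply be replaced by nwsl_nonGK_stats_filtered.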

Young Players

Who are the young players under the age of 20?

Answer: Olivia Moultrie and Trinity Rodman

Star Players

Who are the star players whose xGxAp90 is greater than 0.75?

Answer: Sophia Smith, Mallory Pugh, Debinha, and Megan Rapinoe

Relationship

Describe the overall relationship between age and xGxAp90 (expected goals plus assists per ninety minutes).

Answer: Overall, xGxAp90 decreases as age increases up to 24, after which it remains relatively constant.

Question 11

How would you describe how the distribution of xAp90 (expected assists per ninety minutes) varies by team (squad) using the nwsl_nonGK_stats_filtered data.frame?

  • Note that the squad categories are sorted by the median of xAp90 in the plot.

Complete the code by filling in the blanks.

ggplot(data = ___(1)___,
       mapping = aes(x = ___(2)___, 
                     y = ___(3)___)) +
  ___(4)___() +
  labs(y = "NWSL Teams")

Blank (1)

  1. nwsl_nonGK_stats_filtered
  2. nwsl_nonGK_stats
  3. nwsl_player_stats

Answer:

  1. nwsl_nonGK_stats_filtered

Blank (2)

  1. squad
  2. xGxAp90
  3. xGp90
  4. xAp90

Answer:

  4. xAp90

Blank (3)

  1. squad
  2. xGxAp90
  3. xGp90
  4. xAp90
  5. fct_reorder(squad, xGxAp90)
  6. fct_reorder(xGxAp90, squad)
  7. fct_reorder(squad, xGp90)
  8. fct_reorder(xGp90, squad)
  9. fct_reorder(squad, xAp90)
  10. fct_reorder(xAp90, squad)

Answer:

  9. fct_reorder(squad, xAp90)

Blank (4)

  1. geom_boxplot
  2. geom_bar
  3. geom_point
  4. geom_histogram

Answer:

  1. geom_boxplot
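The completed code, together with a quick check of what fct_reorder() does, can be sketched as follows. toy_stats is a hypothetical stand-in for nwsl_nonGK_stats_filtered, with values chosen so that each squad's median is easy to read off.

```r
library(ggplot2)
library(forcats)

# Hypothetical stand-in for nwsl_nonGK_stats_filtered
toy_stats <- data.frame(
  squad = rep(c("Dash", "Pride", "Spirit"), each = 3),
  xAp90 = c(0.10, 0.20, 0.30,   # Dash:   median 0.20
            0.01, 0.05, 0.09,   # Pride:  median 0.05
            0.40, 0.50, 0.60)   # Spirit: median 0.50
)

# fct_reorder() orders squad levels by the median of xAp90 by default
squad_sorted <- fct_reorder(toy_stats$squad, toy_stats$xAp90)
levels(squad_sorted)

p <- ggplot(data = toy_stats,
            mapping = aes(x = xAp90,
                          y = fct_reorder(squad, xAp90))) +
  geom_boxplot() +
  labs(y = "NWSL Teams")
```

The levels come out as Pride, Dash, Spirit, that is, in increasing order of median xAp90, which is the ordering the boxplot uses along the y-axis.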

Question 12

Question 12 is about ggplot code that visualizes how the distribution of pos (player position) varies by team (squad) using the nwsl_player_stats data.frame.

Part 1

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___)

Blank (1)
  1. nwsl_nonGK_stats_filtered
  2. nwsl_nonGK_stats
  3. nwsl_player_stats

Answer:

  3. nwsl_player_stats
Blank (2)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  2. y
Blank (3)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  4. fill
Blank (4)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (4) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

Answer:

  5. Both a and d

Explanation:

Since position = "stack" is the default for geom_bar(), leaving blank (4) empty produces the same plot as writing position = "stack" explicitly. Both the first option and the fourth option are therefore correct.
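This equivalence can be checked without looking at the plots, by comparing the values ggplot2 computes for the bars. The sketch assumes a small, hypothetical stand-in for nwsl_player_stats named toy_players.

```r
library(ggplot2)

# Hypothetical stand-in for nwsl_player_stats
toy_players <- data.frame(
  squad = c("Dash", "Dash", "Dash", "Pride", "Pride", "Spirit"),
  pos   = c("GK", "FW", "FW", "GK", "MF", "FW")
)

base_plot <- ggplot(data = toy_players,
                    mapping = aes(y = squad, fill = pos))

p_default <- base_plot + geom_bar()                    # blank (4) left empty
p_stack   <- base_plot + geom_bar(position = "stack")  # explicit default

# layer_data() returns the values ggplot2 computes for drawing each bar
same_bars <- isTRUE(all.equal(layer_data(p_default), layer_data(p_stack)))
same_bars
```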

Part 2

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___)

Blank (1)
  1. nwsl_nonGK_stats_filtered
  2. nwsl_nonGK_stats
  3. nwsl_player_stats

Answer:

  3. nwsl_player_stats
Blank (2)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  2. y
Blank (3)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  4. fill
Blank (4)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (4) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

Answer:

  3. position = "dodge"

Part 3

Complete the code by filling in the blanks to replicate the given plot.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___ = squad,
                     ___(3)___ = pos)) +
  geom_bar(___(4)___) +
  labs(x = "Proportion")

Blank (1)
  1. nwsl_nonGK_stats_filtered
  2. nwsl_nonGK_stats
  3. nwsl_player_stats

Answer:

  3. nwsl_player_stats
Blank (2)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  2. y
Blank (3)
  1. x
  2. y
  3. color
  4. fill
  5. count

Answer:

  4. fill
Blank (4)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (4) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

Answer:

  2. position = "fill"
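With position = "fill", every bar is rescaled to the same total length, so each colored segment shows the share of a position within its squad. As a rough sketch, the same shares can be computed by hand with dplyr on a hypothetical toy_players data frame:

```r
library(dplyr)

# Hypothetical stand-in for nwsl_player_stats
toy_players <- tibble(
  squad = c("Dash", "Dash", "Dash", "Dash", "Pride", "Pride"),
  pos   = c("GK", "FW", "FW", "MF", "GK", "MF")
)

# Within-squad share of each position: the quantity that position = "fill" displays
pos_share <- toy_players |>
  count(squad, pos) |>
  group_by(squad) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

pos_share
```

The shares add up to 1 within each squad, which is why the axis in Part 3 is labeled "Proportion" rather than showing a count.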

Section 2. Filling-in-the-Blanks

Question 13

When collecting data in real life, measured values often differ. In this context, we can observe variation easily; for example, if we measure any numeric variable (e.g., friends’ heights) twice, we are likely to get two different values.

Question 14

The mode of a variable is the value that appears most frequently within the set of that variable’s values.

Question 15

The gg in ggplot stands for Grammar of Graphics.

Question 16

Using regression, a machine learning method, geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable. The grey ribbon around the curve illustrates the uncertainty surrounding the estimated curve.

Question 17

When making a scatterplot, it is a common practice to place the input variable along the x-axis and the output variable along the y-axis.

Question 18

In ggplot, we can set alpha between 0 (full transparency) and 1 (no transparency) manually to adjust a geometric object’s transparency level.

Section 3. Short Essay

Question 19

List at least three distinct pieces of advice shared by the invited guests for students seeking jobs in the data analytics industry.

Answer:

  1. Leverage Available Opportunities:
    • Build resumes through tutoring, research, and extracurricular activities while in college.
  2. Gain Relevant Experience:
    • Secure internships and engage in projects to stand out in a competitive job market.
  3. Work on Personal Projects:
    • Develop one or two significant projects to showcase during interviews.
  4. Master Essential Tools:
    • Focus on Python, R, and SQL for data-related roles.
  5. Combine Skills with Passion:
    • Align data analytics skills with industries of personal interest for a fulfilling career.
  6. Understand Business Fundamentals:
    • Learn basic finance and accounting to connect analytics with business needs.
  7. Develop Soft Skills:
    • Enhance interpersonal and communication skills to collaborate with non-technical stakeholders effectively.
  8. Stay Current and Adaptable:
    • Embrace learning new tools and technologies to remain relevant in the field.
  9. Network Strategically:
    • Build relationships with professionals to gain insights and collaboration opportunities.
  10. Explore Various Projects:
    • Experiment with diverse projects to gain confidence and transition smoothly from academia to industry.

Question 20

How does data storytelling bridge the gap between data and insights?

Answer:

Data storytelling bridges the gap between data and insights by combining descriptive statistics, visualization, and narration, framed for the appropriate audience context, to present findings effectively and drive data-informed decisions.

Question 21

What two main factors should a storyteller consider about the context before creating a data visualization or communication?

Answer:

  1. Audience: Understanding who the audience is, their level of expertise, interests, and what they care about, to tailor the message accordingly.
  2. Purpose: Clarifying the goal of the communication—whether to inform, persuade, or explore—and what action or understanding is desired from the audience.

Question 22

Provide at least three techniques to make data visualization more colorblind-friendly.

Answer:

  1. Use Colorblind-Friendly Palettes: Utilize colorblind-safe palettes (e.g., scale_color_tableau() or scale_color_colorblind() from the ggthemes package), which are designed to be distinguishable by colorblind individuals.
  2. Incorporate Textures and Patterns: Use different shapes or line types in addition to colors to differentiate data points.
  3. Provide Additional Visual Cues: Ensure that color is not the only means of conveying information; include labels, annotations, or legends that clarify the data.
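The techniques above can be combined in a single plot. The sketch below is one possible arrangement, not a prescribed solution: it uses the mpg data frame bundled with ggplot2 and swaps in ggplot2's built-in viridis scale instead of the ggthemes palettes named above, so that it runs without an extra package.

```r
library(ggplot2)

# mpg ships with ggplot2; drv (drive type) has three categories
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy,
                          color = drv,     # color distinguishes the groups
                          shape = drv)) +  # shape repeats the same information
  geom_point(size = 2) +
  scale_color_viridis_d() +                # palette built to stay readable under color vision deficiency
  labs(color = "Drive type",
       shape = "Drive type")               # a shared title merges the two legends
```

Because drv is mapped to both color and shape, the groups remain distinguishable even if the colors cannot be told apart.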