Midterm Exam II

Version A

Published

November 19, 2025

Section 1. Multiple Choice

Question 1

During the ETL process, the “Transform” stage in a DANL 101 context using R is best represented by which set of functions?

  1. read_csv()
  2. nrow() and head()
  3. filter(), select(), and left_join()
  4. mean() and sd()

Answer: c

These are typical data-wrangling verbs used to transform data.


Question 2

When the distribution of a variable has a single peak and is negatively skewed (i.e., having a long left tail), which of the following is correct?

  1. Median < Mode < Mean
  2. Mean < Median < Mode
  3. Mode < Mean < Median
  4. Median = Mean = Mode

Answer: b

With a long left tail, the mean is pulled left the most, then the median, then the mode stays at the peak:
Mean < Median < Mode.


Question 3

What is NOT an essential component in ggplot() data visualization?

  1. Data frames
  2. Geometric objects
  3. Facets
  4. Aesthetic attributes

Answer: c

Facets are useful but optional. You can build a valid ggplot without them.


Question 4

____(1)____ does not necessarily imply ____(2)____

    1. Correlation; (2) causation
    1. Causation; (2) correlation
    1. Correlation; (2) correlation
    1. Causation; (2) causation

Answer: a

The classic warning: correlation does not imply causation.


Question 5

In the context of the lecture, which of the following correctly interprets a change in log-transformed GDP per capita and its meaning for GDP per capita?

  1. A one-unit increase in log(GDP per capita) means to a 1% increase in GDP per capita.
  2. A one-unit increase in log(GDP per capita) means to a 100% increase in GDP per capita.
  3. A one-unit increase in GDP per capita means an 8.4% increase in GDP per capita.
  4. A one-unit increase in GDP per capita means a 0.084% increase in GDP per capita.

Answer: b

A 1-unit increase in log(GDP per capita) corresponds to GDP per capita multiplying by about e (roughly a 100%+ increase), not just 1%.


Question 6

In a relational database, a key is best described as:

  1. A column that stores numeric values only
  2. Any column used in a chart or visualization
  3. A special column that can never have missing values
  4. A column that uniquely identifies each row in a table

Answer: d

A key is a column (or combination of columns) whose value is unique for each row.



Section 2. Filling-in-the-Blanks

Question 7

When collecting data in real life, measured values often differ. In this context, we can observe ____________________ easily; for example, if we measure any numeric variable (e.g., friends’ heights) twice, we are likely to get two different values.

Answer: variability


Question 8

The ____________________ of a variable is the value that appears most frequently within the set of that variable’s values.

Answer: mode


Question 9

The gg in ggplot stands for ____________________.

Answer: Grammar of Graphics


Question 10

Using ____________________—a machine learning method—the geom_smooth() visualizes the ____________________ value of the y variable for a given value of the x variable. The grey ribbon around the curve illustrates the ____________________ surrounding the estimated curve.

Using regression — a machine learning method — the geom_smooth() visualizes the predicted value of the y variable for a given value of the x variable. The grey ribbon around the curve illustrates the uncertainty surrounding the estimated curve.


Question 11

When making a scatterplot, it is a common practice to place the ____________________ variable along the x-axis and the ____________________ variable along the y-axis.

x-axis: input variable
y-axis: outcome variable


Question 12

In ggplot, we can set alpha between ____________________ (full transparency) and ____________________ (no transparency) manually to adjust a geometric object’s transparency level.

Answer: 0 (full transparency) and 1 (no transparency)


Section 3. Data Transformation and Visualization with R

Questions 13-19

For Questions 13-19, consider the following R packages and the data.frame, esg_info, containing individual company statistics for the Environmental, Social, and Governance (ESG) risk score in 2024.

library(tidyverse)
library(skimr)
esg_info <- read_csv("https://bcdanl.github.io/data/stock_esg_simple.csv")
  • The esg_info data.frame is with 631 observations and 7 variables.
  • The first 5 observations in the esg_info data.frame are displayed below:
Ticker Company_Name sector total_esg
A Agilent Technologies, Inc. Industrials 13.6
AA Alcoa Corporation Industrials 24.0
AAL American Airlines Group Inc. Consumer Discretionary 26.4
AAP Advance Auto Parts, Inc. Consumer Discretionary 11.5
AAPL Apple Inc Technology 17.2
Ticker Company_Name sector environmental social governance
A Agilent Technologies, Inc. Industrials 1.1 6.4 6.1
AA Alcoa Corporation Industrials 13.8 5.9 4.3
AAL American Airlines Group Inc. Consumer Discretionary 9.9 11.6 4.8
AAP Advance Auto Parts, Inc. Consumer Discretionary 0.1 8.3 3.1
AAPL Apple Inc Technology 0.5 7.4 9.4

Description of Variables in esg_info:

  • Ticker: The stock symbol used to uniquely identify a publicly traded company on financial markets.
  • Company_Name: The full name of the company corresponding to the ticker symbol.
  • sector: The broader industry category to which the company belongs, such as Technology, Healthcare, or Financials.
  • total_esg: The company’s overall Environmental, Social, and Governance (ESG) score. It reflects how well the company is performing in terms of sustainability and ethical impact.
  • environmental: The company’s score related to environmental practices, such as energy efficiency, waste management, and carbon footprint.
  • social: The company’s score related to social practices, including employee relations, diversity, community impact, and human rights.
  • governance: The company’s score related to governance practices, like board structure, executive pay, and shareholder rights.

Interpreting ESG Data

The ESG data helps investors evaluate a company’s sustainability profile and exposure to long-term environmental, social, and governance risks. Here’s how to interpret each metric:

Total ESG Risk Score

  • What it means: A composite score reflecting the company’s overall exposure to ESG-related risks.
  • How to interpret:
    • Higher score = higher risk → More vulnerable to ESG-related issues.
    • Example: A company with total_ESG = 15 is considered to have lower ESG risk than one with total_ESG = 30.

Environmental Risk Score

  • What it measures: Exposure to environmental risks such as:
    • Carbon emissions, waste management, climate change strategy
    • Higher score → more environmental liabilities or poor sustainability measures.

Social Risk Score

  • What it measures: Exposure to social risks, including:
    • Labor practices, human rights, inclusive culture and representation, customer and community relations
    • Higher score = more risk from labor issues, public relation (PR) problems, etc.

Governance Risk Score

  • What it measures: Exposure to governance-related risks, such as:
    • Board structure and independence, executive compensation, shareholder rights, transparency and ethics
    • Higher score suggests poor governance structures.


The followings are the summary of the esg_info data.frame, including descriptive statistics for each variable.

Data summary
Name esg_info
Number of rows 631
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing min max empty n_unique
Ticker 0 1 5 0 631
Company_Name 0 5 47 0 630
sector 0 6 22 0 12

Variable type: numeric

skim_variable n_missing mean sd p0 p25 p50 p75 p100
total_esg 0 21.62 7.08 6.4 16.30 21.20 26.10 52.0
environmental 23 5.78 5.25 0.0 1.78 3.95 9.00 25.3
social 23 9.02 3.57 0.8 6.70 8.90 11.20 22.5
governance 23 6.83 2.40 2.4 5.20 6.30 7.93 19.4

Question 13

Write a code to produce the above summary for the esg_info data.frame, including descriptive statistics for each variable.

Answer: ______________________________________________

skim(esg_info)


Question 14

What code would you use to count the number of companies in each sector?

  1. esg_info |> count(Sector)
  2. esg_info |> count(Company_Name)
  3. esg_info |> count(Company_Name, Sector)
  4. esg_info |> count(Sector, Company_Name)

Answer: a


Question 15

What is the median value of total_esg? Find this value from the summary of the esg_info data.frame.

Answer: ______________________________________________

Answer: 21.2


Question 16

  • We are interested in companies whose overall ESG performance is relatively weak.
  • To achieve this, we create a new data.frame, a new data.frame, bottom25_esg, which includes only companies whose total_esg value is greater than or equal to the third quartile of total_esg variable.
bottom25_esg <- esg_info |> 
  filter(total_esg >= ___BLANK___)
  • Using the summary of the esg_info data.frame, find condition correctly fills in the BLANK to complete the code above:
  1. 6.4
  2. 7.08
  3. 16.3
  4. 21.2
  5. 21.6
  6. 26.1

Answer: f


Question 17

  • Additionally, we are interested in companies with excellent environmental or governance performance, despite having weak overall ESG scores.
  • To achieve this, we create a new data.frame, bottom25_esg_filtered, which includes only companies whose environmental risk score is at most 3.95 or whose governance risk score is at most 5.2.
bottom25_esg_filtered <- esg_info |> 
  filter(___BLANK___)
  • Which condition correctly fills in the BLANK to complete the code above?
  1. environmental <= 3.95 | governance <= 5.2
  2. environmental <= 3.95 , governance <= 5.2
  3. environmental <= 3.95 & governance <= 5.2
  4. environmental <= 3.95 ! governance <= 5.2
  5. Both (b) and (c)

Answer: a


Question 18

How would you describe the relationship between environmental and governance using the bottom25_esg_filtered data.frame?

  • To identify leader companies, such as environmental leaders and governance leaders, some company names are added to such points in the plot.
  • Note that it is NOT required to provide the code for adding these texts to the plot.

Complete the code by filling in the blanks (1)-(4).

ggplot(data = ___(1)___,
       mapping = aes(x = ___(2)___, 
                     y = ___(3)___)) +
  geom_point(alpha = 0.5) +
  ___(4)___()

Blank (1)

  1. bottom25_esg_filtered
  2. bottom25_esg
  3. esg_info

Answer: a


Blank (2)

  1. environmental
  2. social
  3. governance
  4. total_esg

Answer: a


Blank (3)

  1. environmental
  2. social
  3. governance
  4. total_esg

Answer: c


Blank (4)

  1. geom_fit
  2. geom_scatterplot
  3. geom_smooth
  4. geom_histogram

Answer: c


Environmental Leaders

Which companies qualify as environmental leaders — that is, companies with an environmental risk score below 2.5 whose plotted points lie below the gray ribbon in the figure?

Answer: ______________________________________________

JPMorgan Chase, PNC Financial, Bank of America, AbbVie, and Ionis Pharmaceuticals


Governance Leaders

Which companies qualify as governance leaders — that is, companies with an governance risk score below 10 whose plotted points lie below the gray ribbon in the figure?

Answer: ______________________________________________

Ionis Pharmaceuticals, HCA Healthcare, IAC, and Stericycle


Relationship

Describe the overall relationship between environmental and governance, in the given plot.

Overall, the governance risk score decreases as the environmental risk score increases up to about 10, after which it levels off and remains relatively constant.

This pattern suggests that among companies with weaker overall ESG performance, deterioration in environmental risk is associated with stronger governance up to a threshold, beyond which further changes in environmental risk have little additional correlation with governance risk.


Question 19

How would you describe how the distribution of total_esg varies by sector (sector) using the bottom25_esg_filtered data.frame?

  • Note that the sector categories are sorted by the median of total_esg in the plot.

Complete the code by filling in the blanks.

ggplot(data = ___(1)___,
       mapping = aes(___(2)___, 
                     y = ___(3)___)) +
  ___(4)___() +
  labs(y = "Sector")

Blank (1)

  1. bottom25_esg_filtered
  2. bottom25_esg
  3. esg_info

a


Blank (2)

  1. x = total_esg
  2. y = total_esg
  3. x = environmental
  4. y = environmental
  5. x = social
  6. y = social
  7. x = governance
  8. y = governance

a


Blank (3)

  1. fct_reorder(total_esg, sector)
  2. fct_reorder(environmental, total_esg)
  3. fct_reorder(total_esg, social)
  4. fct_reorder(governance, total_esg)
  5. fct_reorder(total_esg, controversy)
  6. fct_reorder(sector, total_esg)

f


Blank (4)

  1. geom_bar
  2. geom_box
  3. geom_boxplot
  4. geom_histogram

c


Question 20

For Question 20, you will use the following R packages and a data.frame named chess_top4, which contains information about chess games played by four of the world’s top online chess players during a special event called “Titled Tuesday” on chess.com. These games were played in a format where each player has 3 minutes to make all their moves, with 1 second added to their clock after each move. The data includes games from October 2022 to October 2024 played only among the following four players:

  1. Magnus Carlsen
  2. Hikaru Nakamura
  3. Alireza Firouzja
  4. Daniel Naroditsky

Note: Titled Tuesday is a weekly event held every Tuesday on chess.com, where titled chess players (such as Grandmasters and International Masters) compete in online tournaments.

library(tidyverse)
chess_top4 <- read_csv("https://bcdanl.github.io/data/chess_titled_tuesday.csv")

The first 15 observations in the chess_top4 data.frame are displayed below:

Date White Black Result
2022-10-11 Hikaru Nakamura Magnus Carlsen White Wins
2022-10-18 Hikaru Nakamura Alireza Firouzja Draw
2022-10-25 Daniel Naroditsky Alireza Firouzja Draw
2022-11-08 Hikaru Nakamura Alireza Firouzja Black Wins
2022-12-13 Alireza Firouzja Hikaru Nakamura Draw
2022-12-20 Hikaru Nakamura Alireza Firouzja White Wins
2022-12-20 Magnus Carlsen Hikaru Nakamura Draw
2023-01-03 Daniel Naroditsky Hikaru Nakamura Black Wins
2023-01-03 Hikaru Nakamura Magnus Carlsen White Wins
2023-01-24 Hikaru Nakamura Magnus Carlsen White Wins
2023-02-28 Alireza Firouzja Magnus Carlsen Draw
2023-02-28 Hikaru Nakamura Alireza Firouzja Draw
2023-02-28 Hikaru Nakamura Magnus Carlsen Draw
2023-02-28 Magnus Carlsen Hikaru Nakamura Draw
2023-03-14 Hikaru Nakamura Alireza Firouzja White Wins
  • The chess_top4 data.frame contains 70 observations and 4 variables, representing 70 unique chess games.

Description of Variables in chess_top4:

  • Date: The date when the game was played.

  • White: The name of the player who played with the white pieces.

  • Black: The name of the player who played with the black pieces.

  • Result: The outcome of the game, which can be one of the following:

    • “White Wins” (the player with the white pieces won the game)
    • “Black Wins” (the player with the black pieces won the game)
    • “Draw” (the game ended in a tie)

Question 20 is about a ggplot code to visualize how the distribution of Result varies among these top 4 chess players.

Part 1

Complete the code by filling in the blanks to replicate the given plot.

  • The White player is displayed on the vertical axis.
  • The Black player is labeled at the top of each panel.
ggplot(data = chess_top4,
       mapping = aes(___(1)___,
                     fill = ___(2)___)) +
  geom_bar(___(3)___) +
  facet_wrap(___(4)___, ncol = 1) +
  labs(x = "Proportion")

Blank (1)
  1. x = White
  2. y = White
  3. x = Black
  4. y = Black
  5. x = Proportion
  6. y = Proportion

b


Blank (2)
  1. chess_top4
  2. White
  3. Black
  4. Result
  5. count

d


Blank (3)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (3) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

b


Blank (4)
  1. White
  2. ~White
  3. Black
  4. ~Black
  5. Player
  6. ~Player
  7. both a and b
  8. both c and d
  9. both e and f
  10. both b and f
  11. both d and f

d


Part 2

Complete the code by filling in the blanks to replicate the given plot.

  • The White player is displayed on the vertical axis.
  • The Black player is labeled at the top of each panel.
ggplot(data = chess_top4,
       mapping = aes(___(1)___,
                     fill = ___(2)___)) +
  geom_bar(___(3)___) +
  facet_wrap(___(4)___, ncol = 1)

Blank (1)
  1. x = White
  2. y = White
  3. x = Black
  4. y = Black
  5. x = count
  6. y = count

b


Blank (2)
  1. chess_top4
  2. White
  3. Black
  4. Result
  5. count

d


Blank (3)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (3) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

e


Blank (4)
  1. White
  2. ~White
  3. Black
  4. ~Black
  5. Player
  6. ~Player
  7. both a and b
  8. both c and d
  9. both e and f
  10. both b and f
  11. both d and f

d


Part 3

Complete the code by filling in the blanks to replicate the given plot.

  • The White player is displayed on the vertical axis.
  • The Black player is labeled at the top of each panel.
ggplot(data = chess_top4,
       mapping = aes(___(1)___,
                     fill = ___(2)___)) +
  geom_bar(___(3)___) +
  facet_wrap(___(4)___, ncol = 1)

Blank (1)
  1. x = White
  2. y = White
  3. x = Black
  4. y = Black
  5. x = count
  6. y = count

b


Blank (2)
  1. chess_top4
  2. White
  3. Black
  4. Result
  5. count

d


Blank (3)
  1. position = "stack"
  2. position = "fill"
  3. position = "dodge"
  4. Leaving (3) empty
  5. Both a and d
  6. Both b and d
  7. Both c and d

c


Blank (4)
  1. White
  2. ~White
  3. Black
  4. ~Black
  5. Player
  6. ~Player
  7. both a and b
  8. both c and d
  9. both e and f
  10. both b and f
  11. both d and f

d


Part 4 - Magnus Carlsen vs. Hikaru Nakamura in the Titled Tuesday

Who had more wins in the games where Magnus Carlsen played with the white pieces and Hikaru Nakamura played with the black pieces in the Titled Tuesday?

Answer: ______________________________________________

Magnus Carlsen


Who had more wins in the games where Hikaru Nakamura played with the white pieces and Magnus Carlsen played with the black pieces in the Titled Tuesday?

Answer: ______________________________________________

Hikaru Nakamura


Who won more games in the encounters between Magnus Carlsen and Hikaru Nakamura in the Titled Tuesday?

Answer: ______________________________________________

They had an equal number of wins.


Section 3. Short Essay

Question 21

Explain the “Connection Principle” in the context of line charts and why it is useful for visualizing time-series data.

The Connection Principle states that points in a line chart should be visually connected when the data represent a meaningful sequence, such as time. This connection helps viewers naturally perceive trends, patterns, growth, and cycles across periods. It is especially useful in time-series data because it reinforces the idea of continuity and change over time rather than treating each observation as isolated.


Question 22

How does data storytelling bridge the gap between data and insights?

Data Storytelling bridges the gap between data and insight by integrating descriptive statistics, data transformation, visualization, and narration within the appropriate audience context to communicate findings effectively and support data-informed decision-making.


Question 23

Provide two main reasons why the log transformation of a variable can be useful.

Two main reasons for using a log transformation are:

  1. It helps reduce skewness in highly right-skewed data, making the distribution more symmetric and the variation easier to see.
  2. It allows us to interpret changes in percentage terms, which is especially useful for growth rates and economic variables such as income, GDP, stock price, or housing price.


Question 24

Provide at least three techniques to make data visualization more colorblind-friendly.

To make visualizations more accessible and colorblind-friendly, consider:

  1. Using colorblind-friendly color palettes
  2. Adding non-color cues like shape to scatterplots or linetype to line charts
  3. Including additional visual cues to highlight important information (e.g., annotations or labels)
  4. Ensuring strong contrast between colors and between the foreground and background so that elements remain distinguishable.


Question 25

For each of the workplaces represented by our three alumni guest speakers, explain how generative AI is used in their day-to-day work.

Across the three alumni workplaces, generative AI is integrated into daily workflows in different but complementary ways.

At the analytics and engineering–focused workplace, many employees (roughly 60–80%) use AI assistants such as Copilot or Claude to support coding, debugging, and data processing. AI significantly speeds up routine analytical tasks, but all outputs still require human review, meaning AI augments rather than replaces analysts.

For a traditional data analyst role, generative AI is commonly used to assist with SQL coding, data summarization, and report drafting. Analysts rely on AI to improve efficiency, but data sensitivity and responsible AI use are essential, so such tools are typically deployed only within secure internal environments.

At the policy and government-related workplace (e.g., the Federal Reserve), the use of public AI tools like ChatGPT is prohibited due to strict data confidentiality rules. Instead, analysts rely on approved internal systems and traditional analytical tools to protect sensitive economic and financial information.

Together, these examples show how generative AI is widely used to increase productivity, while security, privacy, and human oversight remain central across all workplaces.


Back to top