Data Storytelling Team Project - Guideline

What You Should Do for the Team Project

Author: Byeong-Hak Choe

Affiliation: SUNY Geneseo

Published: December 2, 2024

Presentation Schedule and Format

Each team will deliver a 10-minute presentation, followed by a 1–2 minute Q&A session, during one of the following class sessions:
- December 4, Wednesday: Four teams
- December 6, Friday: Four teams
- December 9, Monday: Two teams
The order of team presentations will be determined by a random draw during class.
- If multiple teams choose the same topic, I will try to schedule these teams in different sessions to minimize repetition within a single class.
To ensure fairness and equal participation, each student must contribute evenly to the presentation.

Suggested Topics

You are welcome to use your own dataset for the project; however, here are some suggested topics with corresponding data for your convenience.
- Dataset Approval: If your team decides to use a dataset that is not suggested by Byeong-Hak, you must obtain approval in advance.
- Dataset Submission:
  - Once your dataset is approved, make sure to send the dataset files to Byeong-Hak.
  - If your team adds datasets to the suggested topic you have chosen, please submit those dataset files to Byeong-Hak as well.
The suggested data frames are quite large.
- Feel free to transform the data as needed to focus on the story your team wants to tell.

1. Beer Market

The beer_markets data frame contains detailed information about household beer purchases across different brands and markets in the United States. It includes purchase details, product attributes, promotional information, and demographic data of the households.

library(tidyverse)
beer_markets <- read_csv("https://bcdanl.github.io/data/beer_markets_all.csv")

Variable Description

hh: an identifier of the household;
X_purchase_desc: details on the purchased item;
quantity: the number of items purchased;
brand: Bud Light, Busch Light, Coors Light, Miller Lite, or Natural Light;
dollar_spent: total dollar value of purchase;
beer_floz: total volume of beer, in fluid ounces;
price_floz: price per fl.oz. (i.e., dollar_spent/beer_floz);
container: the type of container;
promo: whether the item was promoted (coupon or otherwise);
region: US region;
state: US state;
market: Scan-track market (or state if rural);
demographic data, including gender, marital status, household income, class of work, race, education, age, the size of household, and whether or not the household has a microwave or a dishwasher.
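
As a quick illustration of the variables above, the sketch below recomputes price per fluid ounce from its definition (dollar_spent/beer_floz) and summarizes it by brand. It runs on a small made-up tibble rather than the real beer_markets data, and the names beer_toy, price_floz_check, and brand_summary are hypothetical names introduced only for this example.

```r
library(dplyr)

# Toy rows that mimic a few beer_markets columns; all values are made up.
beer_toy <- tibble(
  brand        = c("Bud Light", "Bud Light", "Miller Lite", "Miller Lite"),
  dollar_spent = c(12, 24, 10, 30),
  beer_floz    = c(144, 288, 144, 360),
  promo        = c(FALSE, TRUE, FALSE, TRUE)
)

brand_summary <- beer_toy %>%
  # Recompute price per fluid ounce from its definition.
  mutate(price_floz_check = dollar_spent / beer_floz) %>%
  group_by(brand) %>%
  summarize(
    n_purchases    = n(),                     # number of purchase records
    avg_price_floz = mean(price_floz_check),  # average price per fl. oz.
    share_promo    = mean(promo)              # share of promoted purchases
  )

brand_summary
```

The same pipeline should work on beer_markets itself, assuming its columns are typed as described above.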

2. NYC Housing Market

The nyc_housing_sales data frame includes property sale transactions in New York City from 2003 to 2024. It provides detailed information about each sale, including property characteristics, sales prices, and building classifications.

nyc_housing_sales <- read_csv("https://bcdanl.github.io/data/nyc_housing_sales_2003-2024.csv")

Variable Description

For the description of variables in the nyc_housing_sales data frame, please refer to the following webpage:
- https://www.nyc.gov/site/finance/property/glossary-property-sales.page
  - For the variables of building class code, please refer to the following webpage:
    - https://www.nyc.gov/assets/finance/jump/hlpbldgcode.html

3. Stock Market and ESG (Environmental, Social, and Governance)

The stock_markets data frame contains daily trading data for publicly traded companies, including information about stock prices, trading volume, dividends, and stock splits, obtained from Yahoo! Finance.

stock_markets <- read_csv("https://bcdanl.github.io/data/stock_history_2024_10.csv")

Description of Variables in the `stock_markets` data frame

Date: The specific date for the recorded stock data.
Ticker: The unique symbol that identifies a publicly traded company’s stock on financial markets. The data frame contains 631 unique tickers.
Close: The stock’s price at the end of the trading day. It does not account for adjustments like dividends or splits.
Dividends: The cash dividend paid per share on the given date, if applicable. It represents the portion of a company’s earnings distributed to shareholders.
High: The highest price at which the stock traded during the trading day.
Low: The lowest price at which the stock traded during the trading day.
Open: The stock’s price at the beginning of the trading day.
Stock_Splits: The ratio of any stock split that occurred on the given date. A stock split increases the number of shares outstanding while reducing the price per share proportionally.
Volume: The total number of shares traded during the day. It reflects the activity level and liquidity of the stock.
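
One common transformation for this kind of data is a daily percent change in Close within each ticker. The sketch below shows one way to compute it on a made-up tibble with the same column names; the tickers AAA and BBB, the prices, and the column name daily_return are all hypothetical. Because Close is not adjusted for dividends or splits, returns computed this way need extra care around those dates.

```r
library(dplyr)

# Toy rows with the same column names as stock_markets; prices are made up.
stock_toy <- tibble(
  Date   = as.Date(c("2024-10-01", "2024-10-02", "2024-10-03",
                     "2024-10-01", "2024-10-02", "2024-10-03")),
  Ticker = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),  # hypothetical tickers
  Close  = c(100, 110, 99, 50, 50, 55)
)

stock_returns <- stock_toy %>%
  arrange(Ticker, Date) %>%
  group_by(Ticker) %>%
  # Percent change in Close from the previous row of the same ticker;
  # the first row of each ticker has no previous value, so it is NA.
  mutate(daily_return = (Close - lag(Close)) / lag(Close) * 100) %>%
  ungroup()

stock_returns
```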

The esg_info data frame provides ESG scores and company details, including sector, industry, and market capitalization, obtained from Yahoo! Finance.

esg_info <- read_csv("https://bcdanl.github.io/data/stock_esg_list.csv")

Description of Variables in the `esg_info` data frame

Ticker: The stock symbol used to uniquely identify a publicly traded company on financial markets.
Company_Name: The full name of the company corresponding to the ticker symbol.
Sector: The broader industry category to which the company belongs, such as Technology, Healthcare, or Financials.
Industry: A more specific category within the sector that describes the company’s line of business. For example, within the Technology sector, an industry could be Semiconductors or Software.
Market_Cap: The total market value of the company’s outstanding shares. It is calculated by multiplying the current stock price by the total number of outstanding shares and is used to classify companies as small-cap, mid-cap, or large-cap.
Country: The country where the company is headquartered. This variable helps identify the geographical location of the company’s operations.
IPO_Year: The year in which the company went public and its shares were first offered on a stock exchange.
total_esg: The company’s overall Environmental, Social, and Governance (ESG) score. It reflects how well the company is performing in terms of sustainability and ethical impact.
environmental: The company’s score related to environmental practices, such as energy efficiency, waste management, and carbon footprint.
social: The company’s score related to social practices, including employee relations, diversity, community impact, and human rights.
governance: The company’s score related to governance practices, like board structure, executive pay, and shareholder rights.
controversy: A score reflecting the level of public controversies associated with the company, which may impact its reputation. A higher score often indicates more significant controversies.
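
Since Ticker appears in both stock_markets and esg_info, the two data frames can be combined on that column. The sketch below illustrates a left join on made-up data; the tickers, values, and the names stock_toy, esg_toy, stock_esg, and no_esg are hypothetical. It also lists tickers that have no match, which is worth checking before drawing conclusions from the joined data.

```r
library(dplyr)

# Toy versions of the two data frames; all values are made up.
stock_toy <- tibble(
  Ticker = c("AAA", "AAA", "BBB", "CCC"),  # hypothetical tickers
  Close  = c(100, 110, 50, 20)
)
esg_toy <- tibble(
  Ticker    = c("AAA", "BBB"),
  Sector    = c("Technology", "Healthcare"),
  total_esg = c(15.2, 28.4)
)

# Keep every trading record and attach the company's ESG details.
stock_esg <- stock_toy %>%
  left_join(esg_toy, by = "Ticker")

# Tickers that appear in the trading data but have no ESG record.
no_esg <- stock_toy %>%
  anti_join(esg_toy, by = "Ticker") %>%
  distinct(Ticker)

stock_esg
no_esg
```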

(Optional) The income_stmts data frame contains income statement details for each company, including revenue, expenses, and net income.

income_stmts <- read_csv("https://bcdanl.github.io/data/stock_income_stmts_2024_10.csv")

(Optional) The balance_sheets data frame contains balance sheet details, including assets, liabilities, and shareholder equity.

balance_sheets <- read_csv("https://bcdanl.github.io/data/stock_balance_sheets_2024_10.csv")

4. Chess

The compressed chesscom_3100.zip file includes chesscom_3100.csv, containing records of high-level blitz or bullet games on Chess.com, played by 171 players with a peak rating of at least 3100.

Download the compressed zip file, extract it (some systems extract it automatically), and upload it to your team project in Posit Cloud.

Description of Variables in the `chesscom_3100.csv` file

Date: The date on which the chess game was played, formatted as YYYY-MM-DD.
Year: The year of the game, extracted from the date (e.g., 2017).
Month: The month of the game, extracted from the date (e.g., 2 for February).
Day: The day of the month on which the game was played, extracted from the date (e.g., 13).
White: The username of the player who played with the white pieces.
Black: The username of the player who played with the black pieces.
Result: The outcome of the game from White’s perspective, with options like:
- “1-0” indicating a win for White.
- “0-1” indicating a win for Black.
- “1/2-1/2” indicating a draw.
WhiteElo: The Elo rating of the player playing White at the time of the game.
BlackElo: The Elo rating of the player playing Black at the time of the game.
ECO: The ECO (Encyclopaedia of Chess Openings) code that classifies the opening used in the game, such as D02 or E36.
ECO_name: A description or detailed name of the chess opening corresponding to the ECO code, which may include the opening variation.
ECO_moves: The sequence of the opening moves for the chess opening corresponding to the ECO code.
Termination: The method by which the game ended, such as “won by resignation,” “drawn by insufficient material,” or “won on time.”
TimeControl: The total time control for the game in seconds (e.g., 180 indicates a three-minute game).
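
The Result codes and rating columns described above often need recoding before they are useful in a story. The sketch below shows one possible recoding on a made-up tibble; the values and the names games_toy, outcome, elo_diff, and minutes are hypothetical, and it assumes TimeControl is stored as a number of seconds, as described above.

```r
library(dplyr)

# Toy rows with a few chesscom_3100.csv columns; all values are made up.
games_toy <- tibble(
  Result      = c("1-0", "0-1", "1/2-1/2", "1-0"),
  WhiteElo    = c(3050, 2980, 3100, 3010),
  BlackElo    = c(3000, 3020, 3090, 3060),
  TimeControl = c(180, 60, 180, 60)
)

games <- games_toy %>%
  mutate(
    # Turn the result code into a readable label (NA for any other code).
    outcome = case_when(
      Result == "1-0"     ~ "White win",
      Result == "0-1"     ~ "Black win",
      Result == "1/2-1/2" ~ "Draw"
    ),
    elo_diff = WhiteElo - BlackElo,  # positive when White is higher rated
    minutes  = TimeControl / 60      # time control in minutes
  )

# Number of games by outcome.
games %>% count(outcome)
```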

The rating_3100_players data frame provides 68 players’ profiles, including rankings, ratings, titles, and follower counts on Chess.com.

rating_3100_players <- read_csv("https://bcdanl.github.io/data/chesscom_GMs_profile.csv")

Description of Variables in the `rating_3100_players` data frame

username: The unique username associated with the player on Chess.com.
name: The real name of the player, if available.
username_raw: The original or unformatted version of the player’s username.
country: The country where the player is located or registered.
rank_classical: The player’s ranking position in Classical chess format.
rank_rapid: The player’s ranking position in Rapid chess format.
rank_blitz: The player’s ranking position in Blitz chess format.
rating_classical: The player’s Elo rating in Classical chess format.
rating_rapid: The player’s Elo rating in Rapid chess format.
rating_blitz: The player’s Elo rating in Blitz chess format.
title: The official chess title held by the player, such as GM (Grandmaster), IM (International Master), etc.
profile_name: The full name displayed on the player’s profile page.
profile_username: The URL or link to the player’s profile on Chess.com.
profile_image: The URL to the player’s profile image.
followers: The number of followers the player has on Chess.com.
is_streamer: Indicates whether the player is a streamer (TRUE or FALSE).
status: The account status, such as “premium” or other types of account membership.
date_joined: The date when the player joined Chess.com, formatted as YYYY-MM-DD.

5. Sports

Sports data can be quite complex. If your team wants to work with sports data, please schedule a meeting with Byeong-Hak for guidance at your earliest convenience.
Byeong-Hak can assist with data for:
- Major League Baseball (MLB)
- National Basketball Association (NBA)
- National Football League (NFL)
- National Hockey League (NHL)
- Soccer (EPL, La Liga, Serie A, Bundesliga, MLS)
- Golf

Key Components in the Data Storytelling Project

Title:
- Pick a title that’s clear, catchy, and gives a good sense of what your project is about.
Introduction:
- Background: Give some context about your topic and why it matters. Think of this as setting the stage for your data story and explaining what motivated you to dig into this topic.
- Statement of the Project Interest: Spell out the problem or issue you’re tackling. This will help guide your data analysis and keep things focused.
Data Storytelling:
- Questions and Objectives: List the questions you’re trying to answer. Use these to shape your story and show how your data insights relate to real-world problems.
- Data Transformation and Descriptive Statistics: Walk your audience through your findings, weaving together data transformations and stats to highlight the big takeaways. Explain how your data transformations bring out the important stuff.
- Data Visualization: Use clear, color-blind friendly visuals that fit right into your narrative. Each visual should pack a punch, highlighting key insights and moving your story forward. Make sure they’re easy to interpret and add value to your story.
Significance of the Project:
- Talk about why your findings matter. How can they be used in the real world, influence business decisions, or inform public policy? Connect your data analysis to broader themes and show why it’s relevant.
Visual Materials and Slide Quality:
- Keep your slides clean, visually appealing, and easy to follow. Good visuals and a smart layout will make your story more engaging.
- Your slides will be judged on how clear and effective they are, and how well they pull everything together.
Team Presentation:
- Make sure your presentation is engaging and flows well. Everyone on the team should contribute, showing a solid grasp of the project while keeping the audience interested.
- We’ll be looking at how well you deliver, how organized your presentation is, and how clearly you explain your ideas.
Code Quality:
- Write clean, efficient, and well-documented code for your data work. Keep it organized and readable, with helpful comments so it’s easy to follow.
- Your code should show best practices and clearly support your analysis.
References:
- List all your sources properly and make sure your citations are consistent and complete. Give credit where it’s due!
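
For the color-blind friendly visuals mentioned under Data Visualization, one option among several is the viridis color scale that ships with ggplot2. The sketch below is only an illustration: the data are made up, and brand_toy and p are hypothetical names.

```r
library(ggplot2)

# Made-up averages for illustration only; not taken from any course dataset.
brand_toy <- data.frame(
  brand          = c("Bud Light", "Coors Light", "Miller Lite"),
  avg_price_floz = c(0.060, 0.058, 0.057)
)

p <- ggplot(brand_toy, aes(x = brand, y = avg_price_floz, fill = brand)) +
  geom_col(show.legend = FALSE) +  # brand is already on the x-axis
  scale_fill_viridis_d() +         # discrete viridis palette
  labs(
    title = "Illustrative data only",
    x     = "Brand",
    y     = "Average price per fl. oz. (dollars)"
  ) +
  theme_minimal()

p
```

Clear axis labels with units, as in labs() above, also help make each visual easy to interpret on a slide.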

Rubric for the Data Storytelling Project

Attribute	Very Deficient (1)	Somewhat Deficient (2)	Acceptable (3)	Very Good (4)	Outstanding (5)
1. Quality of Data Transformation and Descriptive Statistics	No transformation or cleaning applied; Very poor data transformation; Contains significant errors	Minimal transformation or cleaning; Basic data transformation with errors; Contains several errors	Basic transformation applied; Adequate data transformation; Contains minor errors	Effective transformation; Thorough data transformation; Data is accurate	Advanced transformation; Exceptional data transformation; Data is impeccable
2. Quality of Data Visualization	Visualizations are missing or unclear; Misrepresents data	Visualizations are basic and lack clarity; Some misrepresentation	Visualizations are clear and accurate; Data is appropriately represented	Visualizations are insightful and enhance understanding; Data is accurately represented	Visualizations are highly creative and compelling; Data representation is impeccable
3. Effectiveness of Data Storytelling	No narrative or storyline; Insights are absent or irrelevant; Fails to engage the audience	Weak narrative structure; Insights are superficial; Minimal audience engagement	Clear narrative present; Insights are relevant; Audience is adequately engaged	Compelling narrative; Insights are significant; Engages audience effectively	Exceptional and captivating narrative; Insights are profound and impactful; Audience is highly engaged
4. Quality of Slides and Visual Materials	Very poorly organized; Difficult to read and understand; Numerous errors present	Somewhat disorganized; Some slides are unclear; Several errors present	Well organized; Mostly clear and understandable; Few errors present	Very well organized; Clear and visually appealing; Very few errors	Exceptionally well organized; Highly clear and visually compelling; No errors
5. Quality of Team Presentation	Presentation is disjointed; Poor team coordination; Unable to address questions	Lacks flow; Some coordination issues; Difficulty with several questions	Cohesive presentation; Team works well together; Addresses most questions adequately	Engaging presentation; Team is well-coordinated; Addresses almost all questions professionally	Highly engaging and polished presentation; Excellent team coordination; Addresses all questions expertly
6. Quality of Code (Descriptive Statistics, Transformation, Visualization)	Code is missing or non-functional; No documentation; Disorganized code	Code has major errors; Minimal documentation; Code is somewhat disorganized	Code is functional; Basic documentation provided; Code is organized	Code is efficient and well-structured; Good documentation; Code is well-organized	Code is highly efficient and elegant; Excellent documentation; Code is exceptionally well-organized

Data Transformation Support

Although useful data transformation functions will be covered in class to support your project, feel free to reach out to Byeong-Hak if you need additional guidance with data transformation tasks for your team project.
- Consider what your ideal data frame should look like for effective visualization and storytelling.

Requirements

Peer Evaluation

If a team member fails to collaborate on the project by December 1, 2024, please send an email to bchoe@geneseo.edu and cc the non-collaborating team member in the email.
- If the conflict is not resolved after the initial notification, the non-collaborating team member may receive a reduced or zero score for the project component, depending on the severity of their lack of participation. Byeong-Hak will follow up with both parties to attempt to mediate and address the issue fairly.
Each student is required to evaluate the presentations of other teams. Peer evaluations will make up 5% of the total project score.
- An Excel spreadsheet for the peer evaluation will be provided. Make sure to save the spreadsheet and submit it to Brightspace.
- Failure to complete the peer evaluation will result in a reduction of your class participation score.
- Score Calculation: For each of rubric attributes 1 through 5, the four highest and four lowest scores will be dropped when calculating the peer evaluation score, to ensure fairness.
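
To make the score calculation above concrete, here is a minimal sketch for a single rubric attribute. It assumes that the scores remaining after dropping the four highest and four lowest are averaged; the guideline states only which scores are dropped, so the averaging step, the function name trimmed_peer_score, and the sample scores are assumptions made for illustration.

```r
# Drop the k highest and k lowest scores, then average the rest.
# Averaging is an assumption; the guideline only says which scores are dropped.
trimmed_peer_score <- function(scores, k = 4) {
  stopifnot(length(scores) > 2 * k)  # need at least one score left over
  kept <- sort(scores)[(k + 1):(length(scores) - k)]
  mean(kept)
}

# Made-up scores from 12 evaluators for one rubric attribute.
scores_toy <- c(1, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
trimmed_peer_score(scores_toy)  # 4
```

In this toy example, the scores 1, 2, 3, 3 and 5, 5, 5, 5 are dropped, so a single unusually low or high evaluation does not move the result.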

Submission

Each team must email (1) the presentation slides (in Microsoft PowerPoint or Google Slides format) and (2) the associated R script for the project at least 30 minutes before class time on the day of your presentation.
- New Techniques: If your project involves data transformation or visualizations not covered in class, your team must provide a brief explanation of the code for these sections during the presentation.
R Script Guidelines:
- Organization: Use section headers (created with Ctrl/Cmd + Shift + R) and comments (#) to clearly label which parts of your code correspond to specific visualizations, transformations, and descriptive statistics. This will make your script easy to follow.
- Explanation: Include comments to explain any parts of your code that use techniques not covered in the course. This should provide enough detail for others to understand the purpose and functionality of your code.
- Reproducibility: Ensure your R script is complete and reproducible, meaning anyone who runs it should be able to replicate your results without needing additional information.
- Clarity: Write clear and concise comments throughout your code to enhance readability and comprehension. Avoid overly complex or redundant code.
- Error Handling: Make sure your code runs smoothly without errors.
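
As one possible way to follow the R script guidelines above, the skeleton below uses RStudio section headers (comment lines ending in four or more dashes, which is what Ctrl/Cmd + Shift + R inserts) and short comments. The section names, the made-up data, and the object names sales_toy and region_totals are hypothetical; adapt them to your own project.

```r
# Setup -------------------------------------------------------------------
library(dplyr)

# Data --------------------------------------------------------------------
# Made-up data so that this skeleton runs on its own; in your script,
# read your project data here instead.
sales_toy <- tibble(
  region       = c("West", "West", "South"),
  dollar_spent = c(10, 15, 20)
)

# Transformation: total spending by region (used in Figure 1) -------------
region_totals <- sales_toy %>%
  group_by(region) %>%
  summarize(total_spent = sum(dollar_spent))

# Descriptive statistics --------------------------------------------------
summary(region_totals$total_spent)

# Visualization: Figure 1 -------------------------------------------------
# The plotting code for Figure 1 would go here, with a comment explaining
# any technique that was not covered in class.
```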
