Lecture 1

Syllabus, Course Outline, and Introduction

Byeong-Hak Choe

SUNY Geneseo

January 22, 2025

Instructor

Current Appointment & Education

  • Name: Byeong-Hak Choe.

  • Assistant Professor of Data Analytics and Economics, School of Business at SUNY Geneseo.

  • Ph.D. in Economics from University of Wyoming.

  • M.S. in Economics from Arizona State University.

  • M.A. in Economics from SUNY Stony Brook.

  • B.A. in Economics & B.S. in Applied Mathematics from Hanyang University at Ansan, South Korea.

    • Minor in Business Administration.
    • Concentration in Finance.

Economics and Data Science

  • Choe, B.H., Newbold, S. and James, A., “Estimating the Value of Statistical Life through Big Data”
    • Question: How much is society willing to pay to reduce the likelihood of fatality?
  • Choe, B.H., “Social Media Campaigns, Lobbying and Legislation: Evidence from #climatechange and Energy Lobbies.”
    • Question: To what extent do social media campaigns compete with fossil fuel lobbying on climate change legislation?
  • Choe, B.H. and Ore-Monago, T., 2024. “Governance and Climate Finance in the Developing World”
    • Question: In what ways and through what forms does poor governance act as a significant barrier to reducing greenhouse gas emissions in developing countries?

Syllabus

Email, Class & Office Hours

Course Description

  • This course is designed to provide a comprehensive overview of data handling techniques, focusing on practical application through case studies.

  • Key topics include:

    1. data loading, cleaning, transformation, merging, and reshaping;
    2. techniques for slicing, dicing, and summarizing datasets;
    3. data collection via web scraping and APIs.
  • These areas will be explored through detailed, real-world examples to address common data analysis challenges.

  • Throughout the course, students will gain hands-on experience with Python and its data analysis libraries, along with practical applications of git and GitHub.

Required Materials

Reference Materials

Course Requirements

  • Laptop: You should bring your own laptop (Mac or Windows) to the classroom.

    • The minimum specification for your laptop in this course is 2+ core CPU, 4+ GB RAM, and 500+ GB disk storage.
  • Homework: There will be six homework assignments.

  • Project: There will be one project on a personal website.

  • Exams: There will be two Midterm Exams and one Final Exam.

    • The final exam is comprehensive.
  • Discussions: You are encouraged to participate in GitHub-based online discussions, class discussions, and office hours.

    • Check out the netiquette policy in the syllabus.

Personal Website

  • You will create your own website using Quarto, RStudio, and Git.

  • You will publish your homework assignments and team project on your website.

  • Your website will be hosted on GitHub.

  • The basics of Markdown will be discussed.

  • References:

Why Personal Website?

  • Here are the example websites:
  • Professional Showcase: Display skills and projects
  • Visibility and Networking: Increase online presence
  • Content Sharing and Engagement: Publish articles, insights
  • Job Opportunities: Attract potential employers and clients
  • Long-term Asset: A growing repository of your career journey

Team Project

  • Team formation is scheduled for late March.

    • Each team must have one to two students.
  • The project report should include data collection and exploratory data analysis using summary statistics, visual representations, and data wrangling.

  • The document for the team project must be published on each member’s website.

  • Any changes to team composition require approval from Byeong-Hak Choe.

Class Schedule and Exams

  • Tentatively, there will be 42 class sessions.

  • Midterm Exam I is scheduled for Friday, February 28, 2025, during class time.

  • Midterm Exam II is scheduled for Wednesday, April 9, 2025, during class time.

  • The Final Exam is scheduled for Wednesday, May 14, 2025, 8:30 A.M.–10:30 A.M.

  • No class on

    • March 17, 19, and 21 (Spring Break)
    • April 23 (GREAT Day)
  • The team project is due on Friday, May 16, 2025, at 11:59 P.M. Eastern Time.

Course Contents

  • The first part of the course covers Python basics and pandas basics.

  • The second part of the course covers data collection.

  • The third part of the course covers advanced pandas.

Grading

\[ \begin{align} (\text{Total Percentage Grade}) =&\quad\;\, 0.05\times(\text{Attendance Score})\notag\\ &\,+\, 0.05\times(\text{Participation Score})\notag\\ &\,+\, 0.15\times(\text{Project and Website Score})\notag\\ &\,+\, 0.25\times(\text{Total Homework Score})\notag\\ &\,+\, 0.50\times(\text{Total Exam Score}).\notag \end{align} \]
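
As a quick worked illustration, the weighted sum above can be sketched as a small Python function. The function name, the dictionary keys, and the sample component scores are made up for illustration; only the weights come from the formula.

```python
# Weights taken from the Total Percentage Grade formula above.
WEIGHTS = {
    "attendance": 0.05,
    "participation": 0.05,
    "project_website": 0.15,
    "homework": 0.25,
    "exam": 0.50,
}


def total_percentage_grade(scores):
    """Return the weighted sum of component scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores for one student (not real data).
example_scores = {
    "attendance": 100,
    "participation": 90,
    "project_website": 80,
    "homework": 85,
    "exam": 70,
}

# 5 + 4.5 + 12 + 21.25 + 35, rounded to avoid floating-point noise
print(round(total_percentage_grade(example_scores), 2))
```

Because the exam component carries half of the total weight, a 10-point change in the Total Exam Score moves the final grade by 5 points.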

Grading

  • You are allowed up to 5 absences without penalty.

    • Send me an email if you are absent for a standard excusable reason (illness, family emergency, transportation problems, etc.).
  • For each absence beyond the initial five, there will be a deduction of 1% from the Total Percentage Grade.

  • Participation will be evaluated by the quantity and quality of your GitHub-based online discussions and in-person discussions.

  • The single lowest homework score will be dropped when calculating the total homework score.
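
The two rules above (the absence deduction and the dropped homework score) can be sketched in Python as follows. This is only an illustration under two assumptions that the syllabus does not state: the total homework score is taken to be the simple average of the remaining scores, and "1%" is read as one percentage point. The function names and sample numbers are hypothetical.

```python
def total_homework_score(homework_scores):
    """Average the homework scores after dropping the single lowest one.

    Assumption: the remaining scores are averaged with equal weights.
    """
    if len(homework_scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(homework_scores)[1:]  # drop the single lowest score
    return sum(kept) / len(kept)


def absence_deduction(num_absences, free_absences=5, points_per_absence=1.0):
    """Percentage points deducted for absences beyond the free allowance."""
    return max(0, num_absences - free_absences) * points_per_absence


# Hypothetical scores for six homework assignments.
print(total_homework_score([90, 75, 100, 60, 85, 95]))  # 89.0 (the 60 is dropped)
print(absence_deduction(7))  # 2.0 (two absences beyond the five allowed)
print(absence_deduction(3))  # 0.0 (within the allowance)
```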

Grading

\[ \begin{align} &(\text{Midterm Exam Score}) \\ =\, &\text{max}\,\left\{0.50\times(\text{Midterm Exam I Score}) \,+\, 0.50\times(\text{Midterm Exam II Score})\right.,\notag\\ &\qquad\;\,\left.0.25\times(\text{Midterm Exam I Score}) \,+\, 0.75\times(\text{Midterm Exam II Score})\right\}.\notag \end{align} \]

  • The Midterm Exam Score is the maximum of
    1. the simple average of the Midterm Exam I Score and the Midterm Exam II Score and
    2. the weighted average of them with one-fourth weight on the Midterm Exam I Score and three-fourths weight on the Midterm Exam II Score.
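
A minimal Python sketch of this rule, with a hypothetical function name and made-up scores, may help make the "maximum of two averages" concrete:

```python
def midterm_exam_score(midterm_1, midterm_2):
    """Return the larger of the 50/50 average and the 25/75 weighted average."""
    simple_average = 0.50 * midterm_1 + 0.50 * midterm_2
    weighted_average = 0.25 * midterm_1 + 0.75 * midterm_2
    return max(simple_average, weighted_average)


# Improving student: the 25/75 weighting is the larger of the two.
print(midterm_exam_score(60, 90))  # 82.5
# Declining student: the simple average is the larger of the two.
print(midterm_exam_score(90, 60))  # 75.0
```

The weighted average is the larger one exactly when the Midterm Exam II Score exceeds the Midterm Exam I Score, so improvement on the second midterm is rewarded and a drop is not penalized beyond the simple average.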

Grading

\[ \begin{align} &(\text{Total Exam Score}) \\ =\, &\text{max}\,\left\{0.50\times(\text{Midterm Exam Score}) \,+\, 0.50\times(\text{Final Exam Score})\right.,\notag\\ &\qquad\;\,\left.0.25\times(\text{Midterm Exam Score}) \,+\, 0.75\times(\text{Final Exam Score})\right\}.\notag \end{align} \]

  • The Total Exam Score is the maximum of
    1. the simple average of the Midterm Exam Score and the Final Exam Score and
    2. the weighted average of them with one-fourth weight on the Midterm Exam Score and three-fourths weight on the Final Exam Score.
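
Putting the exam formulas together, the following Python sketch chains the Midterm Exam Score into the Total Exam Score. The helper and function names are hypothetical, and the three exam scores are made up for illustration.

```python
def best_of_two_weightings(earlier, later):
    """Return the larger of the 50/50 average and the 25/75 weighted average."""
    simple_average = 0.50 * earlier + 0.50 * later
    weighted_average = 0.25 * earlier + 0.75 * later
    return max(simple_average, weighted_average)


def total_exam_score(midterm_1, midterm_2, final_exam):
    """Apply the rule to the two midterms, then to the midterm score and the final."""
    midterm = best_of_two_weightings(midterm_1, midterm_2)
    return best_of_two_weightings(midterm, final_exam)


# Midterm Exam Score = max(75.0, 82.5) = 82.5
# Total Exam Score   = max(88.75, 91.875) = 91.875
print(total_exam_score(60, 90, 95))  # 91.875
```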

Make-up Policy

  • Make-up exams will not be given unless you have either a medically verified excuse or an absence excused by the University.

  • If you cannot take exams because of religious obligations, notify me by email at least two weeks in advance so that an alternative exam time may be set.

  • A missed exam without an excused absence earns a grade of zero.

  • Late homework submissions will be accepted with a penalty.

  • A zero will be recorded for a missed assignment.

Academic Integrity and Plagiarism

  • All homework assignments and exams must be your own original work.

  • Examples of academic dishonesty include:

    • representing the work, thoughts, and ideas of another person as your own;
    • allowing others to represent your work, thoughts, or ideas as theirs; and
    • being complicit in academic dishonesty by suspecting or knowing of it and not taking action.
  • Geneseo’s Library offers frequent workshops to help you understand how to paraphrase, quote, and cite outside sources properly.

Accessibility

  • The Office of Accessibility Services will coordinate reasonable accommodations for persons with physical, emotional, or cognitive disabilities to ensure equal access to academic programs, activities, and services at Geneseo.

  • Please contact me and the Office of Accessibility Services for questions related to access and accommodations.

Well-being

  • You are strongly encouraged to communicate your needs to faculty and staff and seek support if you are experiencing unmanageable stress or are having difficulties with daily functioning.

  • Liz Felski, the School of Business Student Advocate (felski@geneseo.edu, South Hall 303), or the Dean of Students (585-245-5706) can assist and provide direction to appropriate campus resources.

  • For more information, see https://www.geneseo.edu/dean_students.

Career Design

  • To get information about career development, you can visit the Career Development Events Calendar (https://www.geneseo.edu/career_development/events/calendar).

  • You can stop by South 112 for help completing your Handshake profile (https://app.joinhandshake.com/login).

    • Handshake is ranked #1 by students as the best place to find full-time jobs.
    • 50% of the 2018-2020 graduates received a job or internship offer on Handshake.
    • Handshake is trusted by all 500 of the Fortune 500.

Prologue

Why Data Analytics?

  • Fill in the gaps left by traditional business and economics classes.
    • Practical skills that will benefit your future career.
    • Neglected skills like how to actually find datasets in the wild and clean them.
  • Data analytics skills are largely distinct from (and complementary to) the core quantitative coursework familiar to business undergrads.
    • Data visualization, cleaning and wrangling; databases; machine learning; etc.
  • In short, we will cover things that I wish someone had taught me when I was an undergraduate.

You, at the end of this course

Why Data Analytics?

  • Data analysts use analytical tools and techniques to extract meaningful insights from data.
    • Skills in data analytics are also useful for business analysts or market analysts.
  • The Bureau of Labor Statistics projects 36% employment growth from 2021 to 2031 in the industry related to data analytics.
    • The average growth rate for all occupations is 5%.

The State of the Art

Generative AI and ChatGPT

Data Science and Big Data Trend

From 2008 to 2023

Programmers in 2025

Generative AI and ChatGPT

  • Users around the world have explored how best to use GPT for writing essays and code.
  • Is AI a threat to data analytics?
    • Fundamental understanding of the subject matter is still crucial for effectively utilizing AI’s capabilities.
  • If you use generative AI such as ChatGPT, please try to understand what it gives you.
    • Copying and pasting its output without understanding it undermines your learning.

DANL Tools

What is Git?

  • Git is the most widely used version control tool in software development.
    • It tracks changes as a series of snapshots of the project, allowing developers to revert to previous versions, compare changes, and merge different versions.
    • It is the industry standard for collaborative coding.

What is GitHub?

  • GitHub is a web-based hosting platform for Git repositories to store, manage, and share code.

  • Our class website is hosted in a GitHub repository.

  • Course contents will be posted not only on Brightspace but also in our GitHub repositories (“repos”) and websites.

  • GitHub is useful for many reasons, but the main one is how easy it makes uploading and sharing code.

What is RStudio?

  • RStudio is an integrated development environment (IDE) mainly for R programming.
    • An IDE is a software application that provides comprehensive facilities (e.g., text code editor, graphical user interface (GUI)) to computer programmers for software development.
  • RStudio is a user-friendly interface that makes using R easier and more interactive.
    • It provides a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management.

What is Python?

  • Python is a versatile programming language known for its simplicity and readability.

  • Python has become a dominant tool in various fields including data analysis, machine learning, and web development.

    • It is widely used among developers, data scientists, and researchers for building applications and performing data-driven tasks.
    • Python is open source and has a vast ecosystem of libraries and frameworks.

What is Jupyter?

  • Jupyter is an open-source IDE primarily for Python, though it supports many other languages.
    • Jupyter provides a notebook interface that allows users to write and execute code in a more interactive and visual format.
  • Jupyter Notebook (*.ipynb) is a user-friendly environment that enhances coding, data analysis, and visualization.
    • It offers a web-based interface that combines live code, equations, visualizations, and narrative text.
    • Jupyter Notebook is widely used for data science, machine learning, and research, enabling easy sharing and collaboration.
  • We will use Google Colab, a free cloud-hosted version of Jupyter.
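
To give a feel for what "live code" in a notebook looks like, here is a small sketch of a first code cell. The numbers are made up for illustration, and it uses only Python's built-in statistics module, so it should run as-is in any Jupyter or Colab notebook.

```python
import statistics

# Made-up scores, used only for illustration.
scores = [72, 85, 90, 66, 78]

mean_score = statistics.mean(scores)      # arithmetic mean
median_score = statistics.median(scores)  # middle value after sorting

print(f"mean = {mean_score}, median = {median_score}")
```

In a notebook, the printed result appears directly below the cell, and a Markdown cell next to it can hold the narrative explanation.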

Installing the Tools

Anaconda

R programming

RStudio

  • For Mac users, try the following steps:
    1. Open the RStudio-*.dmg file.
    2. In the pop-up window, click the RStudio icon.
    3. While holding the click, drag the icon to the Applications directory.

RStudio Environment

  • Script Pane is where you write R commands in a script file that you can save.
  • An R script is simply a text file containing R commands.
  • RStudio will color-code different elements of your code to make it easier to read.
  • To create a new R script,
    • File \(>\) New File \(>\) R Script
  • To save the R script,
    • File \(>\) Save

RStudio Environment

  • Console Pane allows you to interact directly with the R interpreter and type commands where R will immediately execute them.

RStudio Environment

  • Environment Pane is where you can see the values of variables, data frames, and other objects that are currently stored in memory.

  • Type the following in the Console Pane, and then press Enter:

a <- 1

RStudio Environment

  • Plots Pane contains any graphics that you generate from your R code.

R Packages and tidyverse

  • R packages are collections of R functions, compiled code, and data that are combined in a structured format.
  • The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.
    • The tidyverse packages work harmoniously together to make data manipulation, exploration, and visualization easier.
    • We will use several R packages from tidyverse (e.g., ggplot2, dplyr, tidyr) throughout the course.

Installing R packages with install.packages("packageName")

  • R packages can be easily installed from within R using the function install.packages("packageName").
    • To install the R package tidyverse, type and run the following from the R console:
install.packages("tidyverse")
  • While running the above code, you may encounter one of the questions below:
    • Mac: “Do you want to install from sources the packages which need compilation?” in the Console Pane.
    • Windows: “Would you like to use a personal library instead?” in a pop-up message.
  • Type no in the R Console, and then hit Enter.

Loading R packages with library(packageName)

  • Once installed, a package is loaded into an R session using library(packageName) so that its functions and data can be used.
    • To load the R package tidyverse, type and run the following command from an R script:
library(tidyverse)
df_mpg <- mpg
  • mpg is a data.frame provided by the R package ggplot2, one of the R packages in tidyverse.

RStudio Options Setting

  • The options menu is found as follows:
    • Tools \(>\) Global Options
  • Check the boxes as shown on the left.
  • Choose the option Never for Save workspace to .RData on exit.

Building a Personal Website on GitHub

Let’s Practice Markdown!

  • Jupyter Notebook, Quarto, and GitHub-based Discussion Boards use Markdown as their underlying document syntax.

  • Let’s do Classwork 2.