Lecture 1

Syllabus, Course Outline, and Installing the Tools

Byeong-Hak Choe

SUNY Geneseo

January 23, 2024

Instructor

Current Appointment & Education

  • Name: Byeong-Hak Choe.

  • Assistant Professor of Data Analytics and Economics, School of Business at SUNY Geneseo.

  • Ph.D. in Economics from University of Wyoming.

  • M.S. in Economics from Arizona State University.

  • M.A. in Economics from SUNY Stony Brook.

  • B.A. in Economics & B.S. in Applied Mathematics from Hanyang University at Ansan, South Korea.

    • Minor in Business Administration.
    • Concentration in Finance.

Economics, Data Science, and Climate Change

  • Choe, B.H., 2021. “Social Media Campaigns, Lobbying and Legislation: Evidence from #climatechange and Energy Lobbies.”

  • Question: To what extent do social media campaigns compete with fossil fuel lobbying on climate change legislation?

  • Data include:

    • 5.0 million tweets with #climatechange/#globalwarming around the globe;
    • 12.0 million retweets/likes to those tweets;
    • 0.8 million Twitter users who wrote those tweets;
    • 1.4 million Twitter users who retweeted or liked those tweets;
    • 0.3 million US Twitter users with their location at a city level;
    • Firm-level lobbying data (expenses, targeted bills, etc.).

Economics, Data Science, and Climate Change

  • Choe, B.H. and Ore-Monago, T., 2024. “Governance and Climate Finance in the Developing World”

  • Climate finance refers to the financial resources allocated for mitigating and adapting to climate change, including support for initiatives that reduce greenhouse gas emissions and enhance resilience to climate impacts.

    • We focus on transnational financing, in which rich countries provide poor countries with financial resources to help them adapt to climate change and mitigate greenhouse gas (GHG) emissions.
    • Because GHG emissions in developing countries are growing rapidly, it is crucial to assess the effectiveness of climate finance.
    • Poor governance can be a significant barrier to emissions reductions within these countries.

Economics, Data Science, and Climate Change

  • Choe, B.H. and Ore-Monago, T., 2024. “Governance and Climate Finance in the Developing World”

  • Data include:

    • Global climate finance data (e.g., donors, recipients, characteristics of climate change projects)
    • World Bank Governance Indicators over the years (e.g., government effectiveness, voice and accountability, political stability and absence of violence/terrorism, regulatory quality, rule of law, control of corruption)
    • Various economic indicators (e.g., trade pattern of low carbon technology products, macroeconomic risks, energy)

Syllabus

Email, Class & Office Hours

Course Description

  • This course offers a practical overview of data analytics processes and methodologies.

  • Its objective is to empower you with the skills to leverage data analysis effectively, enhancing your decision-making capabilities.

  • You will acquire fundamental data analytics competencies, laying a foundation for a career or advanced studies in data analytics.

  • The curriculum includes:

    1. An introduction to Data Analytics concepts;
    2. Hands-on exploratory data analysis using data wrangling and visualization techniques;
    3. Creating HTML documentation for data analysis using Quarto.
  • You will gain practical experience with the R programming language.

Required Materials

Course Requirements

  • Laptop: You should bring your own laptop (Mac or Windows) to the classroom.
    • A laptop with a 2+ core CPU, 4+ GB of RAM, and 500+ GB of disk storage is recommended for this course.
  • Homework: There will be six homework assignments.

  • Team Project: There will be one team project on a personal website.

  • Exams: There will be one midterm exam and one final exam.

    • The final exam is comprehensive.
  • Discussions: You are encouraged to participate in GitHub-based online discussions for each lecture, classwork, and homework.

    • Check out the netiquette policy in the syllabus.

Personal Website

  • You will create your own website using Quarto, RStudio, and Git.

  • You will publish your homework assignments and team project on your website.

  • Your website will be hosted on GitHub.

  • The basics of Markdown will be covered.

  • References:

Team Project

  • Team formation is scheduled for late March.
  • Each team must have three to five students.

  • For the team project, a team must choose data related to business or socioeconomic issues.

  • The project report should include exploratory data analysis using summary statistics, visual representations, and data wrangling.

  • The document for the team project must be published on each member’s website.

  • Any changes to team composition require approval from Byeong-Hak Choe.

Class Schedule and Exams

  • Tentatively, there will be 27 lecture sessions.
  • The midterm exam is scheduled for Thursday, March 7, 2024, during class time.

  • The final exam is scheduled for Tuesday, May 14, 2024, 3:30 P.M.–5:30 P.M.

  • There is no class on:

    • Tuesday, February 27 (Diversity Summit)
    • Tuesday, March 12, and Thursday, March 14 (Spring Break)
  • The team project is due on Thursday, May 16, 2024.

Course Contents

  • The first half of the course covers R basics and data visualization:

Course Contents

  • The second half of the course covers data wrangling:

Grading

\[ \begin{align} (\text{Total Percentage Grade}) =&\quad\;\, 0.05\times(\text{Total Attendance Score})\notag\\ &\,+\, 0.05\times(\text{Discussion Score})\notag\\ &\,+\, 0.05\times(\text{Website Score})\notag\\ &\,+\, 0.15\times(\text{Team Project and Website Score})\notag\\ &\,+\, 0.20\times(\text{Total Homework Score})\notag\\ &\,+\, 0.50\times(\text{Total Exam Score}).\notag \end{align} \]
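
As a rough illustration of the weighting above, the R sketch below computes the total percentage grade from the six component scores. The function name total_percentage_grade(), its argument names, and the sample scores are hypothetical, and the sketch assumes every component is reported on a 0-100 scale.

```r
# Hypothetical helper: weighted total following the grading formula above.
# Assumption: every input is a score on a 0-100 scale.
total_percentage_grade <- function(attendance, discussion, website,
                                   team_project, homework, exam) {
  0.05 * attendance +
    0.05 * discussion +
    0.05 * website +
    0.15 * team_project +   # Team Project and Website Score
    0.20 * homework +
    0.50 * exam
}

# Made-up scores, for illustration only.
example_total <- total_percentage_grade(
  attendance = 100, discussion = 90, website = 80,
  team_project = 85, homework = 88, exam = 92
)
print(example_total)
```

Because the six weights sum to 1, a student who scores 100 on every component ends up with a total of 100.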

Grading

  • You are allowed up to 5 absences without penalty.
    • Send me an email if you have standard excused reasons (illness, family emergency, transportation problems, etc.).
  • For each absence beyond the initial five, there will be a deduction of 1% from the Total Percentage Grade.

  • Participation in discussions will be evaluated by quantity and quality of discussions in the GitHub-based discussion boards.

  • The single lowest homework score will be dropped when calculating the total homework score.

    • Each of the five remaining homework assignments accounts for 20% of the total homework score.
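
The two rules above can be sketched in R as follows. Both helper names and the sample scores are hypothetical. The sketch assumes six homework scores on a 0-100 scale and reads the 1% deduction as one percentage point per extra absence.

```r
# Hypothetical helper: average the homework scores after dropping the lowest.
# With six assignments, each of the five kept scores has a weight of 20%.
total_homework_score <- function(scores) {
  kept <- sort(scores)[-1]   # sort ascending, then remove the lowest score
  mean(kept)
}

# Hypothetical helper: percentage points deducted for absences beyond five.
# Assumption: "1%" means one percentage point of the Total Percentage Grade.
absence_deduction <- function(absences, allowed = 5) {
  max(0, absences - allowed)
}

hw <- c(90, 75, 100, 85, 60, 95)   # made-up scores
total_homework_score(hw)           # the 60 is dropped before averaging
absence_deduction(7)               # two absences beyond the allowed five
```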

Grading

  • The total exam score is the maximum of
    1. the simple average of the midterm exam score and the final exam score, and
    2. their weighted average, with one-fourth weight on the midterm exam score and three-fourths weight on the final exam score:

\[ \begin{align} &(\text{Total Exam Score}) \\ =\, &\text{max}\,\left\{0.50\times(\text{Midterm Exam Score}) \,+\, 0.50\times(\text{Final Exam Score})\right.,\notag\\ &\qquad\;\,\left.0.25\times(\text{Midterm Exam Score}) \,+\, 0.75\times(\text{Final Exam Score})\right\}.\notag \end{align} \]
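
A minimal R sketch of this rule is given below. The function name total_exam_score() and the sample scores are hypothetical; both exam scores are assumed to be on the same scale.

```r
# Hypothetical helper: total exam score as the larger of two averages.
total_exam_score <- function(midterm, final) {
  simple   <- 0.50 * midterm + 0.50 * final   # equal weights
  weighted <- 0.25 * midterm + 0.75 * final   # more weight on the final
  max(simple, weighted)
}

# Made-up scores, for illustration only.
total_exam_score(midterm = 70, final = 90)   # weighted average is larger
total_exam_score(midterm = 90, final = 70)   # simple average is larger
```

The weighted average is the larger of the two whenever the final exam score exceeds the midterm exam score, so a strong final can partly offset a weak midterm.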

Make-up Policy

  • Make-up exams will not be given unless you have either a medically verified excuse or an absence excused by the University.
  • If you cannot take exams because of religious obligations, notify me by email at least two weeks in advance so that an alternative exam time may be set.

  • A missed exam without an excused absence earns a grade of zero.

  • Late submissions of homework assignments will be accepted with a penalty.

  • A zero will be recorded for a missed assignment.

Academic Integrity and Plagiarism

  • All homework assignments and exams must be your own original work.

  • Examples of academic dishonesty include:

    • representing the work, thoughts, or ideas of another person as your own,
    • allowing others to represent your work, thoughts, or ideas as theirs, and
    • being complicit in academic dishonesty by suspecting or knowing of it and not taking action.
  • Geneseo’s Library offers frequent workshops to help you understand how to paraphrase, quote, and cite outside sources properly.

Accessibility

  • The Office of Accessibility will coordinate reasonable accommodations for persons with physical, emotional, or cognitive disabilities to ensure equal access to academic programs, activities, and services at Geneseo.

  • Please contact me and the Office of Accessibility Services for questions related to access and accommodations.

Well-being

  • You are strongly encouraged to communicate your needs to faculty and staff and seek support if you are experiencing unmanageable stress or are having difficulties with daily functioning.

  • Liz Felski, the School of Business Student Advocate (felski@geneseo.edu, South Hall 303), or the Dean of Students (585-245-5706) can assist and provide direction to appropriate campus resources.

  • For more information, see https://www.geneseo.edu/dean_students.

Career Design

  • To get information about career development, you can visit the Career Development Events Calendar (https://www.geneseo.edu/career_development/events/calendar).

  • You can stop by South 112 for help completing your Handshake profile (https://app.joinhandshake.com/login).

    • Handshake is ranked #1 by students as the best place to find full-time jobs.
    • 50% of the 2018-2020 graduates received a job or internship offer on Handshake.
    • Handshake is trusted by all 500 of the Fortune 500.

Prologue

Why Data Analytics?

  • Fill in the gaps left by traditional business and economics classes.
    • Practical skills that will benefit your future career.
    • Neglected skills like how to actually find datasets in the wild and clean them.
  • Data analytics skills are largely distinct from (and complementary to) the core quantitative coursework familiar to business undergraduates.
    • Data visualization, cleaning and wrangling; databases; machine learning; etc.
  • In short, we will cover things that I wish someone had taught me when I was an undergraduate.

You, at the end of this course

Why Data Analytics?

  • Data analysts use analytical tools and techniques to extract meaningful insights from data.
    • Skills in data analytics are also useful for business analysts or market analysts.
  • The Bureau of Labor Statistics projects that employment related to data analytics will grow by 36% from 2021 to 2031.
    • The average growth rate for all occupations is 5%.

Why Personal Website?

Benefits of a Personal Website in Data Analytics

  • Professional Showcase: Display skills and projects
  • Visibility and Networking: Increase online presence
  • Controlled Narrative: Manage your professional brand
  • Content Sharing and Engagement: Publish articles, insights
  • Job Opportunities: Attract potential employers and clients
  • Long-term Asset: A growing repository of your career journey
  • Reproducible Research: Showcase data-driven reports

Why R, Python, and Databases?

  • Stack Overflow is the world’s most popular Q&A website for programmers and software developers.

  • See how programming languages have trended over time based on use of their tags in Stack Overflow from 2008 to 2022.

Data Science and Big Data

The State of the Art

Generative AI and ChatGPT

Data Science and Big Data Trend

From 2008 to 2023

Programmers in 2024

The State of the Art

Generative AI and ChatGPT

  • Generative AI refers to a category of artificial intelligence (AI) that is capable of generating new content, ranging from text, images, and videos to music and code.
  • In the early 2020s, advances in transformer-based deep neural networks enabled a number of generative AI systems notable for accepting natural language prompts as input.
    • These include large language model (LLM) chatbots such as ChatGPT, Copilot, Bard, and LLaMA.
  • ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot developed by OpenAI and launched on November 30, 2022.
    • By January 2023, it had become what was then the fastest-growing consumer software application in history.

The State of the Art

Generative AI and ChatGPT

  • Users around the world have explored how best to use ChatGPT for writing essays and code.
  • Is AI a threat to data analytics?
    • Fundamental understanding of the subject matter is still crucial for effectively utilizing AI’s capabilities.
  • If you use generative AI such as ChatGPT, make sure you understand what it gives you.
    • Copying and pasting its output without understanding it undermines your learning.

Today’s Learning Objectives

Learning Objectives

  • Understand the concept of the tools we will use throughout the course:
    • Git
    • GitHub
    • R
    • RStudio
    • R Packages
  • Set up the tools on your laptop.
    • R
    • RStudio
    • R Packages

DANL Tools

What is Git?

  • Git is the most popular version control tool for software development.
    • It tracks changes in a series of snapshots of the project, allowing developers to revert to previous versions, compare changes, and merge different versions.
    • It is the industry standard and ubiquitous for coding collaboration.

What is Git?

git add .
git commit -m "any message is here"
git push -u origin main

  • Git operates primarily through command-line tools (e.g., Terminal) and is local to a user’s computer.

    • It has a steep learning curve.
  • We will not use Git for collaboration; we will use only the three Git commands above, run in Terminal, to update a website.

What is GitHub?

  • GitHub is a web-based hosting platform for Git repositories to store, manage, and share code.
  • Your personal website will be hosted on a GitHub repository.

  • Course contents will be posted not only in Brightspace but also in our GitHub repositories (“repos”) and websites.

  • GitHub is useful for many reasons, but the main one is how easy it makes uploading and sharing code.

What is R?

  • R is a programming language and software environment designed for statistical computing and graphics.

  • R has become a major tool in data analysis, statistical modeling, and visualization.

    • It is widely used among statisticians and data scientists for developing statistical software and performing data analysis.
    • R is open source and freely available.

What is RStudio?

  • RStudio is an integrated development environment (IDE) for R.
    • An IDE is a software application that provides comprehensive facilities (e.g., text code editor, graphical user interface (GUI)) to computer programmers for software development.
  • RStudio is a user-friendly interface that makes using R easier and more interactive.
    • It provides a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management.
    • It integrates well with Git.

Installing the Tools

Getting a GitHub account

  • Create a GitHub account with your Geneseo email.
    1. Go to GitHub.
    2. Click “Sign up for GitHub”.
  • Choose your GitHub username carefully:
    • https://YOUR_GITHUB_USERNAME.github.io will be the address for your website.
    • Byeong-Hak’s GitHub username is bcdanl, so his web address is https://bcdanl.github.io.
  • A username in all lowercase letters is recommended.

R programming

RStudio

  • For Mac users, try the following steps:
    1. Open the RStudio-*.dmg file.
    2. In the window that pops up, click the RStudio icon.
    3. While holding the click, drag the icon to the Applications folder.

RStudio Environment

  • Script Pane is where you write R commands in a script file that you can save.
  • An R script is simply a text file containing R commands.
  • RStudio will color-code different elements of your code to make it easier to read.
  • To create a new R script,
    • File > New File > R Script
  • To save the R script,
    • File > Save
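
For illustration, a very small R script might look like the sketch below. The file name, variable names, and values are made up for this example. In RStudio, Ctrl+Enter (Cmd+Enter on a Mac) runs the current line or selection from the Script Pane.

```r
# my_first_script.R (hypothetical file name)
# Lines starting with # are comments and are ignored by R.

heights_cm <- c(160, 172, 181)    # made-up values stored in a numeric vector
mean_height <- mean(heights_cm)   # average of the three values
print(mean_height)
```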

RStudio Environment

  • Console Pane allows you to interact directly with the R interpreter and type commands where R will immediately execute them.

RStudio Environment

  • Environment Pane is where you can see the values of variables, data frames, and other objects that are currently stored in memory.

  • Type the following in the Console Pane, and then press Enter:

a <- 1
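
As a small follow-up sketch, the lines below create a second object and then inspect what is stored in memory. The object b is hypothetical and only serves to show a second entry appearing in the Environment Pane.

```r
a <- 1        # the same assignment as above
b <- a + 2    # hypothetical second object built from a

ls()          # lists the names of objects in the current environment
a             # typing a name prints its value
b
```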

RStudio Environment

  • Plots Pane contains any graphics that you generate from your R code.
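
A minimal way to see the Plots Pane in action is the base R sketch below. The data are made up for this example, and no extra package is needed.

```r
x <- 1:10     # made-up x values
y <- x^2      # square of each x value

# Draw a scatter plot; in RStudio it appears in the Plots Pane.
plot(x, y, main = "y = x^2", xlab = "x", ylab = "y")
```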

R Packages and tidyverse

  • R packages are collections of R functions, compiled code, and data that are combined in a structured format.
  • The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.
    • The tidyverse packages work together to support data manipulation, exploration, and visualization.
    • We will use several R packages from the tidyverse (e.g., ggplot2, dplyr, tidyr) throughout the course.

Installing R packages with install.packages("packageName")

  • R packages can be easily installed from within R using the function install.packages("packageName").
    • To install the R package tidyverse, type and run the following in the R Console:
install.packages("tidyverse")
  • While running the above code, you may encounter one of the questions below:
    • Mac: “Do you want to install from sources the packages which need compilation?” in the Console Pane.
    • Windows: “Would you like to use a personal library instead?” in a pop-up message.
  • Type no in the R Console, and then hit Enter.
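
As an optional sketch, the code below checks whether a package is already installed before calling install.packages(), so re-running it does not reinstall anything. The helper missing_packages() is hypothetical and not part of R. Restricting the install step to interactive sessions is an assumption made here so that the code is safe to run non-interactively.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse"))

# Install only what is missing, and only in an interactive session
# such as the RStudio Console.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```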

Loading R packages with library(packageName)

  • Once installed, a package is loaded into an R session using library(packageName) so that its functions and data can be used.
    • To load the R package tidyverse, type and run the following command from an R script:
library(tidyverse)
df_mpg <- mpg
  • mpg is a data frame provided by the R package ggplot2, one of the R packages in the tidyverse.
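
A quick way to check that the data loaded as expected is to inspect df_mpg with a few base R functions, as in the sketch below. It assumes ggplot2 is already installed. Loading ggplot2 on its own is enough here because mpg ships with that package.

```r
library(ggplot2)   # also loaded by library(tidyverse)

df_mpg <- mpg      # copy the built-in data frame, as above

nrow(df_mpg)       # number of rows (observations)
ncol(df_mpg)       # number of columns (variables)
names(df_mpg)      # column names
head(df_mpg)       # first six rows
```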

RStudio Options Setting

  • Open the options menu as follows:
    • Tools > Global Options
  • Check the boxes as shown on the left.
  • Choose the option Never for Save workspace to .RData on exit.