Lecture 20

Data Visualization - Overview

Byeong-Hak Choe

SUNY Geneseo

October 23, 2024

Data Visualization

  • Data visualization: converting data into meaningful graphics so that the data are easier to understand.

  • There are many different graphs and other types of visual displays of information.

  • We will visualize:

    • The distribution of a categorical/numeric variable
    • The relationship between two numeric variables
    • The time trend of a numeric variable

Distribution

  • Distribution refers to how the values of a variable are spread out or grouped within a data.frame.
    • It shows what type of variation occurs within a variable.
  • Variation is the tendency of the values of a variable to change from measurement to measurement.
    • We can see variation easily in real life; if we measure any numeric variable twice, we will likely get two different numbers.
    • Which values are the most common? Why?
      • The mode of a variable is the value that appears most frequently within the set of that variable’s values.
    • Which values are rare? Why? Does that match your expectations?
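
The questions above (which values are most common, which are rare) can be answered with a frequency table before drawing anything. The following is a minimal sketch in Python using only the standard library; the `sports` data and the helper names `frequency_table`, `modes`, and `rare_values` are hypothetical and are not part of the lecture's materials.

```python
from collections import Counter

# Hypothetical sample: favorite sports reported by ten students.
sports = ["soccer", "tennis", "soccer", "basketball", "soccer",
          "tennis", "basketball", "soccer", "golf", "tennis"]

def frequency_table(values):
    """Return (value, count) pairs sorted from most to least common."""
    return Counter(values).most_common()

def modes(values):
    """Return every value tied for the highest count (there may be several)."""
    table = frequency_table(values)
    top_count = table[0][1]
    return [value for value, count in table if count == top_count]

def rare_values(values, max_count=1):
    """Return the values that appear at most max_count times."""
    return [value for value, count in Counter(values).items() if count <= max_count]

for value, count in frequency_table(sports):
    print(f"{value:<12}{count}")
print("mode:", modes(sports))
print("rare:", rare_values(sports))
```

Returning a list from `modes` is a deliberate choice: a variable can have more than one most frequent value, and a single return value would hide ties.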

Distribution

  • How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.
  • Categorical Variables: Represent categories or groups (e.g., colors, departments, types)
    • Common visualizations: Bar charts
    • Example: Distribution of favorite sports among students
  • Numerical Variables: Represent numbers with meaningful values (e.g., age, income, temperature)
    • Common visualizations: Histograms, Box plots
    • Example: Distribution of heights in a class
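
To make the distinction concrete, the sketch below computes the numbers that each chart type would draw: one count per category for a bar chart, and one count per equal-width bin for a histogram. It is written in Python with the standard library only; the data, the bin settings, and the function names `text_bar_chart` and `histogram_counts` are assumptions chosen for illustration.

```python
from collections import Counter

def text_bar_chart(categories):
    """Return one line per category, with one '#' per observation."""
    counts = Counter(categories)
    return [f"{name:<8}{'#' * n}" for name, n in sorted(counts.items())]

def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical data for illustration only.
departments = ["econ", "math", "econ", "bio", "econ", "math"]
heights_cm = [158, 162, 165, 167, 168, 170, 171, 174, 179, 186]

print("\n".join(text_bar_chart(departments)))
print(histogram_counts(heights_cm, start=150, width=10, n_bins=4))
```

A bar chart needs no binning decision because the categories already define the bars, whereas the shape of a histogram depends on the chosen bin width.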

Skewness

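  • For a histogram, we can describe the asymmetry of the distribution with its skewness: a right-skewed (positively skewed) distribution has a longer right tail, and a left-skewed (negatively skewed) distribution has a longer left tail.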
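
One common way to put a number on skewness is the third standardized moment (the average cubed z-score). The sketch below assumes that population definition; statistical software often applies a small-sample correction, so its results may differ slightly. The function name `skewness` and the sample data are hypothetical.

```python
from statistics import fmean

def skewness(values):
    """Population skewness: the mean of the cubed z-scores.

    Positive means a longer right tail, negative a longer left tail,
    and zero a symmetric distribution.
    """
    n = len(values)
    mean = fmean(values)
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("skewness is undefined when all values are equal")
    std = variance ** 0.5
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical data for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(round(skewness(symmetric), 6))   # symmetric data: approximately zero
print(skewness(right_skewed) > 0)      # one large value creates a right tail
```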

Titanic

Bar Chart

Stacked Bar Chart

100% Stacked Bar Chart

Clustered Bar Chart

Histogram

Boxplot

Relationship

  • From plots of two numeric variables, we want to see co-variation: the tendency for the values of two or more variables to vary together in a related way.

  • What type of co-variation occurs between variables?

    • Are they positively associated?
    • Are they negatively associated?
    • Is there no association between them?
  • Common visualizations:

    • Scatterplot
    • Fitted curves/line
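
The direction of an association and the fitted line on a scatterplot can both be computed directly. The sketch below uses the Pearson correlation coefficient and an ordinary least-squares line, written in Python with the standard library only. The price and sales numbers are made up for illustration and are not the lecture's orange juice data; the function names `pearson_r` and `fit_line` are hypothetical.

```python
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation: +1 perfect positive, -1 perfect negative, 0 no linear association."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept of the line y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: unit price and units sold.
price = [1.0, 1.5, 2.0, 2.5, 3.0]
sales = [90, 80, 65, 50, 40]

r = pearson_r(price, sales)
slope, intercept = fit_line(price, sales)
print(f"r = {r:.3f}")                          # negative: sales fall as price rises
print(f"sales = {intercept:.1f} + ({slope:.1f}) * price")
```

The sign of `r` always matches the sign of the fitted slope, so either one answers whether the association is positive or negative. A value of `r` near zero only rules out a linear association; a scatterplot is still needed to spot curved patterns.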

Orange Juice Sales

Scatterplot

Scatterplot with Fitted Line

MPG

Scatterplot with Fitted Line

Weather

Scatterplot with Fitted Line

Time Trend

  • A time trend plot (also known as a time series plot) is used to visualize trends, patterns, and fluctuations in a variable over a specific time period.

    • The x-axis typically represents time, while the y-axis represents the variable being measured.
  • We can check the overall direction in which the time-series variable is moving: upward, downward, or staying relatively constant over time.

  • Common visualizations:

    • Line chart
    • Fitted Curve
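
Checking the overall direction can be sketched numerically as well. Below, a simple moving average stands in for the fitted curve (it is a basic smoother, not necessarily the method used for the lecture's plots), and the sign of a least-squares slope against the time index classifies the direction. The price series and the names `moving_average` and `trend_direction` are hypothetical.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 smoothed points."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def trend_direction(values, tolerance=0.0):
    """Classify the overall direction from the least-squares slope against time 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    if slope > tolerance:
        return "upward"
    if slope < -tolerance:
        return "downward"
    return "flat"

# Hypothetical daily closing prices, in time order.
prices = [10, 12, 11, 14, 13, 16, 18]

print(moving_average(prices, window=3))
print(trend_direction(prices))
```

A wider `window` gives a smoother curve at the cost of fewer points and a slower response to recent changes.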

NVDA Stock Price

Line Chart

Line Chart with Fitted Curve

Visualization Tools

  • Many tools exist for visualizing data: Power BI, Tableau, Excel, Python, R, and more.

  • Power BI and Tableau have drag-and-drop interfaces, making them accessible to users with little to no coding experience.

  • In R, there are multiple packages for creating data visualizations—ggplot2 is the most widely used one.

    • While we will briefly use Power BI and Excel for visualization, the primary visualization tool for this course will be ggplot2 in R.
  • Using ggplot2 helps develop important coding and data skills, which are critical for more advanced data analytics work.