Data Preparation and Management
September 20, 2024
Airbnb attributes much of its growth, from a small San Francisco startup to a $77.23 billion company (as of September 2024), to its effective use of data science and management.
The company shifted from using basic “cold numeric data” to leveraging complex data analysis for understanding individual user experiences and community trends.
Big data and analytics are likely to be significant components of future careers across various fields.
Big data refers to enormous and complex data collections that traditional data management tools can’t handle effectively.
Five key characteristics of big data (5 V’s):
Name | Symbol | Value (bytes) |
---|---|---|
Kilobyte | kB | 10³ |
Megabyte | MB | 10⁶ |
Gigabyte | GB | 10⁹ |
Terabyte | TB | 10¹² |
Petabyte | PB | 10¹⁵ |
Exabyte | EB | 10¹⁸ |
Zettabyte | ZB | 10²¹ |
Yottabyte | YB | 10²⁴ |
Brontobyte* | BB | 10²⁷ |
Gegobyte* | GeB | 10³⁰ |
Note: Brontobyte and Gegobyte (marked with *) are informal names rather than part of the standard system of byte units. The official SI prefixes for 10²⁷ and 10³⁰, adopted in 2022, are ronna and quetta.
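To make the scale in the table concrete, the following R sketch converts a raw byte count into the largest fitting decimal unit. The helper name `format_bytes` is hypothetical, written only for this illustration, and it stops at the yottabyte, leaving out the two asterisked units.

```r
# Decimal byte units from the table: each step up is a factor of 1000.
units <- c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
thresholds <- 1000^(seq_along(units) - 1)  # 1, 10^3, 10^6, ..., 10^24

# format_bytes() is a hypothetical helper, not a base R function.
format_bytes <- function(bytes) {
  # findInterval() counts how many thresholds are <= bytes;
  # pmax() keeps values below 1 byte in the plain "B" unit.
  idx <- pmax(findInterval(bytes, thresholds), 1)
  paste(bytes / thresholds[idx], units[idx])
}

format_bytes(500)     # stays in bytes
format_bytes(1e3)     # one kilobyte
format_bytes(2.5e12)  # 2.5 terabytes
```

Note that these are powers of 1000, matching the table, not the powers of 1024 used by binary units such as KiB and MiB.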
Increase in size of the global datasphere
data.frame
A data.frame is a table-like data structure in R used for storing data in a tabular format with rows and columns.
Columns of a data.frame are variables, each representing a specific attribute or characteristic measured across different units of observation.
Examples:
Name, Age, Grade, Major
EmployeeID, Name, Age, Department
Rows of a data.frame are observations, each representing a single entity or unit for which data is collected and recorded.
Nominal Data: Categorical data where the categories have no inherent order.
ID | Animal |
---|---|
1 | Dog |
2 | Cat |
3 | Bird |
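As a minimal sketch, the ID/Animal table above can be built as a data.frame in R to show how columns map to variables and rows to observations. Storing Animal as an unordered factor is an assumption made here, on the reading that animal type is a category with no ranking.

```r
# One column per variable, one row per observation.
animals <- data.frame(
  ID = 1:3,
  Animal = c("Dog", "Cat", "Bird")
)

names(animals)  # the variables (columns)
nrow(animals)   # the number of observations (rows)
animals[2, ]    # a single observation: the row for ID 2

# Assumption: Animal is an unordered category, so a plain factor fits.
animals$Animal <- factor(animals$Animal)
levels(animals$Animal)      # sorted alphabetically, which is not a ranking
is.ordered(animals$Animal)  # FALSE
```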
Ordinal Data: Categorical data where the categories have a meaningful order or ranking.
Order Matters: Categories can be ranked or ordered, but the differences between categories are not necessarily uniform.
Examples:
ID | Education Level |
---|---|
1 | Bachelor’s |
2 | Master’s |
3 | PhD |
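One way to represent ordinal data in R is an ordered factor. The sketch below uses the education levels from this section; the column name `EducationLevel` is chosen for illustration.

```r
education <- data.frame(ID = 1:3)
education$EducationLevel <- factor(
  c("Bachelor's", "Master's", "PhD"),
  levels = c("Bachelor's", "Master's", "PhD"),  # lowest to highest
  ordered = TRUE
)

# Comparisons follow the declared ranking.
education$EducationLevel[1] < education$EducationLevel[3]  # TRUE
max(education$EducationLevel)                              # PhD

# Arithmetic is not meaningful: the gap between levels has no fixed size,
# so subtracting two levels yields NA with a warning.
```

Declaring the levels explicitly matters here, because the default ordering is alphabetical and only coincidentally matches the intended ranking.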
Interval Data: Numeric data where the differences between values are meaningful, but there is no true zero point.
Meaningful Intervals: The difference between values is consistent.
No True Zero: Zero does not indicate the absence of the quantity.
Examples:
ID | Temperature (°F) |
---|---|
1 | 70 |
2 | 80 |
3 | 90 |
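The practical consequence of having no true zero can be checked with a short R sketch using the Fahrenheit temperatures 70, 80, and 90 from this section. Differences behave consistently, but a ratio changes when the same temperatures are expressed in Celsius.

```r
temps_f <- c(70, 80, 90)

# Differences are meaningful: each step is the same 10 degrees.
diff(temps_f)

# Ratios are not: they depend on where the scale places zero.
temps_c <- (temps_f - 32) * 5 / 9
temps_f[3] / temps_f[1]  # about 1.29 in Fahrenheit
temps_c[3] / temps_c[1]  # about 1.53 in Celsius, for the same two temperatures
```

Because the two ratios disagree, a statement such as "90 °F is 1.29 times as hot as 70 °F" has no physical meaning.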
Ratio Data: Numeric data with a true zero point, allowing for a full range of mathematical operations.
Meaningful Ratios: Comparisons like twice as much or half as much are valid.
True Zero: Zero indicates the absence of the quantity.
Examples:
ID | Height (cm) | Weight (kg) |
---|---|---|
1 | 160 | 55 |
2 | 175 | 70 |
3 | 170 | 65 |
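For contrast with interval data, this R sketch uses the height and weight values from this section to show that ratios of ratio-scale data survive a change of unit. The column names and the approximate kilogram-to-pound factor are assumptions added for illustration.

```r
people <- data.frame(
  ID = 1:3,
  HeightCm = c(160, 175, 170),
  WeightKg = c(55, 70, 65)
)

# Ratios are meaningful because zero means none of the quantity.
people$WeightKg[2] / people$WeightKg[1]  # person 2 weighs about 1.27 times person 1

# The same ratio holds after converting units (approximate factor).
weight_lb <- people$WeightKg * 2.20462
weight_lb[2] / weight_lb[1]
```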