import pandas as pd
import numpy as np
# Below is for an interactive display of DataFrame in Colab
from google.colab import data_table
data_table.enable_dataframe_formatter()
cereal = pd.read_csv('https://bcdanl.github.io/data/cereals_oatmeal.csv')Classwork 17
Pandas Fundamental VI: Group Operation 1
Direction
The dataset ,cereals_oatmeal.csv,(with its pathname https://bcdanl.github.io/data/cereals_oatmeal.csv) is a listing of 77 popular breakfast cereals and oatmeal.
Question 1
Group the data by Manufacturer, and determine the number of groups and the number of cereals per group.
cereal['Manufacturer'].nunique()
len(
cereal.groupby('Manufacturer')
)(
cereal
.groupby('Manufacturer')
.size()
)
(
cereal['Manufacturer']
.value_counts()
)Question 2
Calculate the mean of the Calories, Fiber, and Sugars for every manufacturer.
q2 = (
cereal
.groupby('Manufacturer')
.agg(
Calories_mean = ("Calories", "mean"),
Fiber_mean = ("Fiber", "mean"),
Sugars_mean = ("Sugars", "mean"),
)
)
q2Question 3
Create a new DataFrame that includes the maximum Sugars and the minimum Fiber per manufacturer.
q3 = (
cereal
.groupby('Manufacturer')
.agg(
Sugars_max = ("Sugars", "max"),
Fiber_min = ("Fiber", "min")
)
)
q3Question 4
Add a Normalized_Sugars variable to the cereal DataFrame that standardizes each cereal’s sugar content within its manufacturer group.
\[ \text{Normalized\_Sugars} = \frac{\text{Sugars} - \text{mean(Sugars)}}{\text{std(Sugars)}} \]
This formula adjusts the sugar content of each product by subtracting the mean sugar content of its manufacturer and then dividing by the standard deviation of the sugar content within its manufacturer.
cereal['Normalized_Sugars'] = (
cereal
.groupby('Manufacturer')['Sugars']
.transform(
lambda s: ( s - s.mean() ) / s.std()
)
)
cerealQuestion 5
Put the two highest-sugar cereals for every manufacturer in a new DataFrame.
q5 = (
cereal
.groupby('Manufacturer')
.apply(
lambda df: df.nlargest(2, "Sugars", keep = "all")
)
)
q5Question 6
- Compute the correlation between
CaloriesandSugarsfor each manufacturer.corr()calculates the correlation betweenSERIES_1andSERIES_2.
cereal["Calories"].corr( cereal["Sugars"] ) # returns a correlation value
cereal[["Calories", "Sugars"]].corr() # returns a DataFrame of correlation matrix
cereal[["Calories", "Sugars", "Fiber"]].corr() # returns a DataFrame of correlation matrixq6 = (
cereal
.groupby('Manufacturer')[["Calories", "Sugars"]]
.corr()
)
q6Question 7
For each manufacturer, choose one “representative cereal” using these rules:
- First, calculate the manufacturer’s average calories.
- Keep only cereals with calories below the manufacturer’s average.
- If a manufacturer has no cereals below its average calories, use all cereals from that manufacturer instead.
- Among the remaining cereals, choose the cereal with the highest fiber.
- If there is a tie in fiber, choose the cereal with the lowest sugars.
- If there is still a tie, choose the cereal with the alphabetically first name.
- Return the full row for the selected cereal.
def rep_cereal(df):
avg = df['Calories'].mean()
candidate = df[ df['Calories'] < avg ]
if len(candidate) == 0:
candidate = df
candidate = candidate.sort_values(['Fiber', 'Sugars', 'Name'],
ascending = [False, True, True]).iloc[0]
return candidate
q7 = (
cereal
.groupby('Manufacturer')
.apply(rep_cereal)
)
q7Discussion
Welcome to our Classwork 17 Discussion Board! đź‘‹
This space is designed for you to engage with your classmates about the material covered in Classwork 17.
Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.
If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 17 materials or need clarification on any points, don’t hesitate to ask here.
Let’s collaborate and learn from each other!