Homework 5

Dealing with Missing Values; Group Operations

Author

Byeong-Hak Choe

Published

March 5, 2024

Modified

March 5, 2024

Direction

Please submit your Jupyter Notebook for Homework 5 to the Brightspace with the name below:
- danl-m1-hw5-LASTNAME-FIRSTNAME.ipynb
  ( e.g., danl-m1-hw5-choe-byeonghak.ipynb )
The due is March 12, 2024, 7:00 P.M.
Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.
Please prepare a Jupyter/Python Notebook (*.ipynb) to address all questions.
Make at least some simple comment (# ...) in each question.
Make one text cell to explain things in each question.

Import the pandas library, and read fao_stat.csv as fao:

import pandas as pd

fao = pd.read_csv("https://bcdanl.github.io/data/fao_stat.csv",
                  encoding = 'ISO-8859-1')

Variable Description

The fao DataFrame contains the country-year level observation regarding variables below:

SSA: A boolean indicating if the country is in Sub-Saharan Africa.
Area: The name of the country.
Year: The year of observation.
gdp_per_capita: GDP per capita.
drinking_water: The percentage of the population with access to safe drinking water.
sanitation_service: The percentage of the population with access to improved sanitation services.
children_stunted: The percentage of children under 5 years old who are stunted.
children_overweight: The percentage of children under 5 years old who are overweight.
investment_pct: The percentage of GDP invested in public health.

Question 1

What percentage of the values is missing for each variable?

Answer

Question 2

Fill missing values in the gdp_per_capita variable with the mean value of that variable.

Answer

Question 3

Drop observations where drinking_water or sanitation_service information is missing.

Answer

Question 4

What is the average drinking_water access percentage for each Area grouped by SSA status?

Hint: We can group a DataFrame by a list of multiple variables. Then, each group corresponds to a unique combination of values across the specified variables.

(
  fao
  .groupby(['VARIABLE_1', 'VARIABLE_2'])
)

Answer

Question 5

Calculate the mean sanitation_service percentage for each combination of SSA status and Year.

Answer

Question 6

For each year, find the 5 worst countries in terms of drinking_water.

Hint: It would be a good idea to start with the DataFrame.sort_values() method.
Note: DataFrameGroupBy does not support the sort_values() method. That is, DataFrameGroupBy.sort_values() will result in an error.

Answer

Question 7

For each year, find the 5 worst countries in terms of children_stunted.

Answer