Homework 5

Dealing with Missing Values; Group Operations

Author

Byeong-Hak Choe

Published

March 5, 2024

Modified

March 5, 2024

Direction

  • Please submit your Jupyter Notebook for Homework 5 to the Brightspace with the name below:

    • danl-m1-hw5-LASTNAME-FIRSTNAME.ipynb
      ( e.g., danl-m1-hw5-choe-byeonghak.ipynb )
  • The due is March 12, 2024, 7:00 P.M.

  • Please send Byeong-Hak an email (bchoe@geneseo.edu) if you have any questions.

  • Please prepare a Jupyter/Python Notebook (*.ipynb) to address all questions.

  • Make at least some simple comment (# ...) in each question.

  • Make one text cell to explain things in each question.


Import the pandas library, and read fao_stat.csv as fao:

import pandas as pd

fao = pd.read_csv("https://bcdanl.github.io/data/fao_stat.csv",
                  encoding = 'ISO-8859-1')

Variable Description

The fao DataFrame contains the country-year level observation regarding variables below:

  • SSA: A boolean indicating if the country is in Sub-Saharan Africa.
  • Area: The name of the country.
  • Year: The year of observation.
  • gdp_per_capita: GDP per capita.
  • drinking_water: The percentage of the population with access to safe drinking water.
  • sanitation_service: The percentage of the population with access to improved sanitation services.
  • children_stunted: The percentage of children under 5 years old who are stunted.
  • children_overweight: The percentage of children under 5 years old who are overweight.
  • investment_pct: The percentage of GDP invested in public health.




Question 1

What percentage of the values is missing for each variable?

Answer





Question 2

Fill missing values in the gdp_per_capita variable with the mean value of that variable.

Answer



Question 3

Drop observations where drinking_water or sanitation_service information is missing.

Answer



Question 4

What is the average drinking_water access percentage for each Area grouped by SSA status?

Hint: We can group a DataFrame by a list of multiple variables. Then, each group corresponds to a unique combination of values across the specified variables.

(
  fao
  .groupby(['VARIABLE_1', 'VARIABLE_2'])
)

Answer



Question 5

Calculate the mean sanitation_service percentage for each combination of SSA status and Year.

Answer



Question 6

For each year, find the 5 worst countries in terms of drinking_water.

  • Hint: It would be a good idea to start with the DataFrame.sort_values() method.
  • Note: DataFrameGroupBy does not support the sort_values() method. That is, DataFrameGroupBy.sort_values() will result in an error.

Answer



Question 7

For each year, find the 5 worst countries in terms of children_stunted.

Answer



Back to top