Classwork 5

PySpark Basics - Loading, Summarizing, Selecting, Counting, and Sorting Data

Author

Byeong-Hak Choe

Published

February 10, 2025

Modified

February 12, 2025

Direction

The nfl.csv file contains a list of players in the National Football League with similar Name, Team, Position, Birthday, and Salary variables in the nba.csv file.

import pandas as pd
from pyspark.sql import SparkSession
df = pd.read_csv("https://bcdanl.github.io/data/nfl.csv")
nfl = spark.createDataFrame(df)

Question 1

How can we read the nfl.csv file, and assign it to a PySpark DataFrame object, nfl?

Answer:

Question 2

How many observations are in nfl?
What are the mean, standard deviation, minimum, and maximum of Salary in nfl?

Answer:

Question 3

How can we count the number of players per team in nfl?
How many unique teams are in nfl?

Answer:

Question 4

What is an effective way to convert the values in its Birthday variable to date?
- The format of Birthday is “M/d/yy”

Answer:

Question 5

Who are the five highest-paid players?
Who is the oldest player?

Answer:

Question 6

How can we sort the DataFrame first by Team in alphabetical order and then by Salary in descending order?

Answer:

Question 7

Who is the oldest player on the Kansas City Chiefs roster, and what is his birthday?

Answer:

Question 8

What is the median of Salary in nfl?

Answer:

Discussion

Welcome to our Classwork 5 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 5.

Whether you are looking to delve deeper into the content, share insights, or have questions about the content, this is the perfect place for you.

If you have any specific questions for Byeong-Hak (@bcdanl) regarding the Classwork 5 materials or need clarification on any points, don’t hesitate to ask here.

All comments will be stored here.

Let’s collaborate and learn from each other!