Pandas Basics - Loading, Summarizing, Selecting, Counting, Sorting, and Indexing Data
February 13, 2024
Series and DataFrame
Series: a one-dimensional object containing a sequence of values.
DataFrame: a collection of Series columns with an index.
read_csv()
A CSV (comma-separated values) file is a plain-text file that uses commas to separate values (e.g., nba.csv).
The CSV format is widely used for storing data, and we will use it throughout the module.
We use the read_csv() function to load a CSV data file.
The DataFrame is the workhorse data structure of the pandas library.
With read_csv(), we can use the parse_dates parameter to coerce the values into datetimes.
from google.colab import data_table
data_table.enable_dataframe_formatter()  # Enable an interactive DataFrame display
nba
We use the from keyword to specify the Python package from which we want to import something (e.g., functions).
Colab includes an extension that renders pandas DataFrames into interactive tables.
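To make the read_csv() call with parse_dates concrete, here is a minimal sketch. It assumes a small made-up CSV held in memory instead of the real nba.csv, and the Birthday column name is an assumption used only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for nba.csv; the rows are made up for illustration.
csv_text = """Name,Team,Birthday,Salary
Player A,Team X,1995-03-14,3000000
Player B,Team Y,1998-11-02,1500000
Player C,Team X,1990-07-30,4500000
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
# parse_dates lists the columns that should be coerced into datetimes.
nba = pd.read_csv(io.StringIO(csv_text), parse_dates=["Birthday"])

print(nba.dtypes)  # Birthday is stored as a datetime column, not as text
```

With a real file, the first argument would be the path to the CSV (e.g., "nba.csv") rather than the StringIO object.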
The dot operator (DataFrame.) is used to access an attribute or a method of a DataFrame.
A method (DataFrame.METHOD()) is a function that we can call on a DataFrame to perform operations, modify data, or derive insights.
nba.info()
An attribute (DataFrame.ATTRIBUTE) is a property that provides information about the DataFrame’s structure or content without modifying it.
nba.dtypes
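The difference between a method and an attribute can be sketched as follows. The toy DataFrame is hypothetical and its values are made up. Methods are called with parentheses, while attributes are read without them.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [3_000_000, 1_500_000, 4_500_000],
})

# Method: called with parentheses; .info() prints a summary of the DataFrame.
nba.info()

# Attributes: read without parentheses; they only report information.
column_types = nba.dtypes  # data type of each variable
dimensions = nba.shape     # (number of observations, number of variables)

print(column_types)
print(dimensions)
```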
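Summarizing a DataFrame with .info()
Every DataFrame object has a .info() method that provides a summary of a DataFrame:
Variable names (.columns)
Number of observations and variables (.shape)
Data type of each variable (.dtypes)
Number of non-missing values in each variable (.count())
pandas marks a missing value with NaN.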
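As a rough sketch, the pieces that .info() reports can also be inspected one at a time. The toy DataFrame below is hypothetical (made-up values, with two missing entries added on purpose).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Team X", None, "Team Y"],
    "Salary": [3_000_000, np.nan, 4_500_000],
})

nba.info()  # prints the whole summary in one call

variable_names = nba.columns  # variable names
dimensions = nba.shape        # (number of observations, number of variables)
variable_types = nba.dtypes   # data type of each variable
non_missing = nba.count()     # number of non-missing values in each variable

print(non_missing)
```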
Summarizing a DataFrame with .describe()
The .describe() method generates descriptive statistics that summarize the central tendency, dispersion, and distribution of each variable.
It can also summarize string-type variables if specified explicitly (include='all').
Selecting a variable by its name
nba_player_name_1 = nba['Name']  # Series
nba_player_name_1
nba_player_name_2 = nba[ ['Name'] ]  # DataFrame
nba_player_name_2
In a DataFrame, we can access a variable by its name using square brackets, [ ].
DataFrame[ 'var_1' ] (returns a Series)
DataFrame[ ['var_1'] ] (returns a DataFrame)
DataFrame[ ['var_1', 'var_2', ... ] ] (returns a DataFrame)
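The three bracket forms above can be checked with a short sketch. The toy DataFrame is hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Team X", "Team Y", "Team X"],
    "Salary": [3_000_000, 1_500_000, 4_500_000],
})

name_series = nba["Name"]           # single brackets: a Series
name_frame = nba[["Name"]]          # double brackets: a one-column DataFrame
two_vars = nba[["Name", "Salary"]]  # a list of names: a DataFrame

print(type(name_series).__name__)
print(type(name_frame).__name__)
print(two_vars)
```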
Selecting variables by data type with select_dtypes()
# To include only string variables
nba.select_dtypes(include = "object")
# To exclude string and integer variables
nba.select_dtypes(exclude = ["object", "int"])
We can use the select_dtypes() method to select columns based on their data types.
The method accepts two parameters, include and exclude.
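A minimal sketch of include and exclude follows. It uses the generic "number" selector, which is not the one used in the slides, because how strings are stored can vary by pandas version. The Height column and all values are made up.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [3_000_000, 1_500_000, 4_500_000],
    "Height": [1.98, 2.06, 1.91],
})

# Keep only the numeric variables (integers and floats).
numeric_only = nba.select_dtypes(include="number")

# Drop the numeric variables and keep everything else.
non_numeric = nba.select_dtypes(exclude="number")

print(numeric_only.columns.tolist())
print(non_numeric.columns.tolist())
```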
Counting with .count()
The .count() method counts the number of non-missing values in a Series/DataFrame.
Counting with .value_counts()
The .value_counts() method counts the number of occurrences of each unique value in a Series/DataFrame.
Counting with .nunique()
The .nunique() method counts the number of unique values in each variable in a DataFrame.
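The three counting methods can be compared side by side in a small sketch. The toy DataFrame is hypothetical (made-up values, with two missing entries). Missing values are left out of all three counts by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Team X", "Team Y", "Team X", None],
    "Salary": [3_000_000, 1_500_000, 1_500_000, np.nan],
})

non_missing = nba.count()                # non-missing values per variable
team_counts = nba["Team"].value_counts() # occurrences of each unique team
unique_counts = nba.nunique()            # number of unique values per variable

print(non_missing)
print(team_counts)
print(unique_counts)
```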
Let’s do Questions 1-4 in Classwork 2!
Getting the first/last n observations with .head() & .tail()
We can use the .head()/.tail() method of a DataFrame to keep only the first/last n observations.
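A short sketch of .head() and .tail() on a hypothetical five-row DataFrame (values are made up). When n is omitted, both methods default to five observations.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 2_000_000, 2_500_000],
})

first_two = nba.head(2)  # first 2 observations
last_two = nba.tail(2)   # last 2 observations

print(first_two)
print(last_two)
```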
Sorting with sort_values()
The sort_values() method’s first parameter, by, accepts the variables that pandas should use to sort the DataFrame.
The sort_values() method’s ascending parameter determines the sort order.
ascending has a default argument of True.
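The by and ascending parameters can be sketched on a hypothetical toy DataFrame (values are made up). Note that each observation keeps its original index label after sorting.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 2_000_000],
})

# ascending=True is the default, so the smallest salary comes first.
lowest_first = nba.sort_values(by="Salary")

# ascending=False reverses the order.
highest_first = nba.sort_values(by="Salary", ascending=False)

print(lowest_first)
print(highest_first)
```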
sort_values() and head() & tail()
sort_values() with .head() or .tail() can be useful to find the observations with the n smallest/largest values in a variable.
A DataFrame has various methods that return a modified version of the existing DataFrame.
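Combining sort_values() with .head() or .tail() can be sketched as a chain of method calls. The toy DataFrame is hypothetical and the values are made up. The original nba object is left unchanged because each call returns a new DataFrame.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 2_000_000],
})

# Sort in ascending order, then keep the first/last two observations.
two_lowest = nba.sort_values("Salary").head(2)
two_highest = nba.sort_values("Salary").tail(2)

print(two_lowest)
print(two_highest)
```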
nsmallest() and nlargest()
nsmallest() is useful to get the first n observations ordered by a variable in ascending order.
nlargest() is useful to get the first n observations ordered by a variable in descending order.
keep = "all" keeps all duplicates, even if it means selecting more than n observations.
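The effect of keep = "all" shows up only when there are ties, so this sketch uses a hypothetical toy DataFrame with tied salaries (values are made up).

```python
import pandas as pd

# Hypothetical toy data with tied salaries (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 1_500_000, 3_000_000],
})

top_two = nba.nlargest(2, "Salary")        # default keep="first"
lowest_one = nba.nsmallest(1, "Salary")    # only the first tied observation

# keep="all" returns every observation tied at the cut-off.
lowest_all = nba.nsmallest(1, "Salary", keep="all")
top_two_all = nba.nlargest(2, "Salary", keep="all")

print(top_two)
print(lowest_all)
print(top_two_all)
```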
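Sorting by multiple variables with sort_values()
We can sort a DataFrame by multiple columns by passing a list to the by parameter.
We can pass a single Boolean to the ascending parameter to apply the same sort order to each column.
To sort each column in a different order, we can pass a list of Booleans to the ascending parameter.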
Q. Which players on each team are paid the most?
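One possible way to approach this question is to sort by team in ascending order and by salary in descending order, so that the best-paid player appears first within each team. The sketch below assumes Team and Salary columns and uses made-up values.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Team X", "Team Y", "Team X", "Team Y"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 2_000_000],
})

# A single Boolean applies the same sort order to every column in by.
same_order = nba.sort_values(by=["Team", "Salary"], ascending=True)

# A list of Booleans sets the order column by column:
# Team from A to Z, Salary from highest to lowest within each team.
by_team_pay = nba.sort_values(by=["Team", "Salary"], ascending=[True, False])

print(same_order)
print(by_team_pay)
```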
Sorting by index with sort_index()
How can we return the DataFrame to its original form?
Our nba DataFrame still has its numeric index.
If we could sort the data set by index positions rather than by column values, we could return it to its original order.
The sort_index() method can also be used to sort the variables in alphabetical order.
To do so, we add the axis parameter and pass it an argument of "columns" or 1.
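Both uses of sort_index() can be sketched on a hypothetical toy DataFrame (values are made up): restoring the original row order after a sort, and ordering the variables alphabetically.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Team X", "Team Y", "Team X", "Team Y"],
    "Salary": [3_000_000, 1_500_000, 4_500_000, 2_000_000],
})

shuffled = nba.sort_values("Salary")  # rows reordered, index labels kept
restored = shuffled.sort_index()      # back to the original row order

# axis="columns" (or axis=1) sorts the variable names instead of the rows.
alphabetical = nba.sort_index(axis="columns")

print(restored)
print(alphabetical.columns.tolist())
```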
Setting a new index with set_index()
We can use the set_index() method when we want to change the current index of a DataFrame to one or more existing columns.
The set_index() method returns a new DataFrame with a given column set as the index.
Its first parameter, keys, accepts the column name.
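A minimal sketch of set_index() with one column and with two columns, on a hypothetical toy DataFrame (values are made up). Because the method returns a new DataFrame, the original nba keeps its numeric index.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
nba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Team X", "Team Y", "Team X"],
    "Salary": [3_000_000, 1_500_000, 4_500_000],
})

# One column as the index: Name moves out of the columns.
by_name = nba.set_index("Name")

# A list of columns gives an index with more than one level.
by_team_and_name = nba.set_index(["Team", "Name"])

# Observations can now be looked up by index label.
salary_b = by_name.loc["Player B", "Salary"]

print(by_name)
print(salary_b)
```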
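Resetting an index with reset_index()
nba2 = nba.set_index("Name")
nba2.reset_index(inplace=True)  # Alters nba2 directly and returns None
The reset_index() method:
Moves the current index into a DataFrame column;
Replaces the former index with the default numeric index.
With inplace=True, the operation alters the original DataFrame directly.
Let’s do Questions 5-7 in Classwork 2!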