Add a new variable with mutate()
April 16, 2024
There will be two more homework assignments.
The single lowest homework score will be dropped when calculating the total homework score.
Each of the remaining homework assignments accounts for 25% of the total homework score.
A team can be composed of two, three, four, or five members.
One representative from each team must email Byeong-Hak at bchoe@geneseo with the following details by Tuesday, April 23, 11:59 P.M.:
The project is about presenting your findings through data analysis using skim, ggplot, and dplyr functions on your personal website.
Think of this project as a more comprehensive homework assignment.
Details of the project will be announced next week.
data.frame for the project.
data.frames if the data.frames are related.
data.frames for the project, I give each team an option to freely choose its own data.frame for the project upon Choe’s approval.
While this is team work, your website and its corresponding GitHub repository will be assessed independently for the project.
Pipe (|>) Operator
Because the first argument is a data.frame and the output is a data.frame, dplyr verbs work well with the pipe, |>.
The pipe (|>) takes the thing on its left and passes it along to the function on its right so that f(x, y) is equivalent to x |> f(y).
For example, filter(DATA_FRAME, LOGICAL_STATEMENT) is equivalent to DATA_FRAME |> filter(LOGICAL_STATEMENT).
The pipe (|>) can be read as “then”.
The pipe (|>) is super useful when we have a chain of data transforming operations to do.
dplyr basics
data.frame |> filter(LOGICAL_CONDITIONS)
data.frame |> arrange(VARIABLES)
data.frame |> distinct(VARIABLES)
data.frame |> select(VARIABLES)
data.frame |> rename(NEW_VARIABLE = EXISTING_VARIABLE)
data.frame |> mutate(NEW_VARIABLE = ... )
data.frame |> relocate(VARIABLES)
data.frame |> group_by(VARIABLES)
data.frame |> summarize(NEW_VARIABLE = ...)
The subsequent arguments describe what to do with the data.frame, mostly using the variable names.
The result is a data.frame.
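The pipe equivalence and the verb templates above can be sketched on a small example. The `scores` data.frame and its values are made up for illustration; only the dplyr verbs come from the list above.

```r
library(dplyr)

# Hypothetical example data (made-up values, for illustration only)
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  team    = c("x", "x", "y", "y"),
  score   = c(90, 75, 82, 68)
)

# f(x, y) is equivalent to x |> f(y)
nested <- filter(scores, score >= 80)
piped  <- scores |> filter(score >= 80)
identical(nested, piped)  # TRUE

# A chain of verbs: take scores, then filter, then group, then summarize
team_summary <- scores |>
  filter(score >= 70) |>
  group_by(team) |>
  summarize(mean_score = mean(score))
```

Reading each `|>` as "then" makes the chain easier to follow than the equivalent nested calls.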
mutate()
mutate() is useful to add new variables that are functions of existing variables.
Arithmetic operators: +, -, *, /, ^
Modular arithmetic: %/% (integer division) and %% (remainder).
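As a minimal sketch of these operators inside mutate(), the example below splits a duration in minutes into whole hours and leftover minutes. The `trips` data.frame and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: trip durations in minutes (made-up values)
trips <- data.frame(minutes = c(45, 130, 200))

trips <- trips |>
  mutate(
    hours        = minutes %/% 60,  # integer division: whole hours
    minutes_left = minutes %% 60,   # remainder: minutes past the hour
    hours_exact  = minutes / 60,    # ordinary division
    minutes_sq   = minutes^2        # exponentiation
  )
```

Here 130 minutes becomes 2 hours with 10 minutes left over.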
Functions commonly used within the mutate() function:
Offsets: lead() and lag()
If-else conditions: ifelse()
Ranking functions: min_rank(), dense_rank(), percent_rank(), row_number(), and more
Other useful functions: log(), log10(), exp(), sqrt(), round(), as.character(), as.numeric(), as.integer(), and more
Factor-related functions: factor(), fct_reorder(), and more
String-related functions: str_detect(), str_replace(), str_replace_all(), str_sub(), and more
lead() and lag()
lead() and lag() allow us to refer to leading or lagging values.
For example, the yearly change in GDP and its growth rate compare each year's value with the previous year's value:
\[ \begin{align} \Delta GDP_{y} = GDP_{y} - GDP_{y-1} \end{align} \]
\[ \begin{align} \%\Delta GDP_{y} = \frac{GDP_{y} - GDP_{y-1}}{GDP_{y-1}} \end{align} \]
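A minimal sketch of the yearly change and growth rate using lag(). The GDP figures are made up for illustration, and the growth rate divides by the previous year's value, as in the usual definition of a growth rate.

```r
library(dplyr)

# Made-up GDP figures for illustration; not real data
gdp <- data.frame(
  year = 2019:2022,
  GDP  = c(100, 98, 104, 110)
)

gdp <- gdp |>
  arrange(year) |>                      # lag() depends on row order
  mutate(
    GDP_lag      = lag(GDP),            # previous year's GDP
    GDP_diff     = GDP - GDP_lag,       # yearly change
    GDP_pct_diff = GDP_diff / GDP_lag,  # growth rate relative to previous year
    GDP_lead     = lead(GDP)            # next year's GDP
  )
```

The first row has no previous year, so `GDP_lag`, `GDP_diff`, and `GDP_pct_diff` are NA there; likewise `GDP_lead` is NA in the last row.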
ifelse()
ifelse() takes three arguments: ifelse(CONDITION, <if TRUE>, <else>)
It returns <if TRUE> where CONDITION is TRUE and <else> where CONDITION is FALSE.
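A minimal sketch of ifelse() inside mutate(). The `grades` data.frame and the cutoff of 60 are hypothetical.

```r
library(dplyr)

# Hypothetical scores (made-up values), including one missing value
grades <- data.frame(score = c(55, 70, 91, NA))

grades <- grades |>
  mutate(pass = ifelse(score >= 60, "Pass", "Fail"))

grades$pass  # "Fail" "Pass" "Pass" NA
```

Where the condition itself is NA, ifelse() returns NA rather than either branch.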
min_rank(), dense_rank(), row_number(), percent_rank(), and more
library(dplyr)

rank_me <- data.frame(x = c(10, 5, 1, 5, 5, NA))

# Ascending ranks: the smallest value gets rank 1
rank_me_asce <- rank_me |>
  mutate(x_min_rank   = min_rank(x),
         x_dense_rank = dense_rank(x),
         x_row_number = row_number(x),
         x_perc_rank  = percent_rank(x))

# Descending ranks: the largest value gets rank 1
rank_me_desc <- rank_me |>
  mutate(x_min_rank   = min_rank(-x),  # instead of -x, we can use desc(x)
         x_dense_rank = dense_rank(-x),
         x_row_number = row_number(-x),
         x_perc_rank  = percent_rank(-x))
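The four ranking functions differ mainly in how they treat ties. The sketch below applies them to the same x values used in rank_me; the values in the comments follow from dplyr's documented tie rules.

```r
library(dplyr)

x <- c(10, 5, 1, 5, 5, NA)

r_min   <- min_rank(x)        # ties share the lowest rank, gaps follow: 5 2 1 2 2 NA
r_dense <- dense_rank(x)      # ties share a rank, no gaps:              3 2 1 2 2 NA
r_row   <- row_number(x)      # ties broken by position:                 5 2 1 3 4 NA
r_pct   <- percent_rank(x)    # min_rank rescaled to [0, 1]:             1 0.25 0 0.25 0.25 NA
r_desc  <- min_rank(desc(x))  # largest value ranked first:              1 2 5 2 2 NA
```

In every case the missing value keeps a rank of NA.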
log(), log10(), exp(), sqrt(), round(VAR, digit), factor(), as.character(), as.numeric(), as.integer(), and more
df <- data.frame(x = 1:10) |>
  mutate(x_log        = log(x),
         x_log10      = log10(x),
         x_exp        = exp(x),
         x_sqrt       = sqrt(x),
         x_sqrt_round = round(x_sqrt, 2),
         x_fct        = factor(x),
         x_chr        = as.character(x),
         x_num        = as.numeric(x),
         x_int        = as.integer(x))
select(), rename(), relocate(), and mutate()
Let’s do Question 1 and Question 2 up to Q2e in Classwork 10!
mutate()
df <- data.frame(
  city   = c("Rochester", "Buffalo", "Geneseo", "Syracuse"),
  income = c(80, 82, 70, 75)
)
df <- df |>
  mutate(city_fct = factor(city))
df$city      # character vector
df$city_fct  # factor; levels are sorted alphabetically by default
In R, factors are categorical variables, that is, variables that have a fixed and known set of possible values.
NY_cities <- c("Geneseo", "Rochester", "Buffalo", "Syracuse")
df <- df |>
  mutate(city_fct_new = factor(city, levels = NY_cities))
levels(df$city_fct)
levels(df$city_fct_new)
The levels argument of factor() sets the order of the levels.
fct_reorder(f, x, fun) reorders the levels of a factor and can take three arguments:
f: the factor whose levels we want to modify.
x: a numeric vector that we want to use to reorder the levels.
fun: a function that’s used if there are multiple values of x for each value of f.
The default value of fun is median.
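A minimal sketch of fct_reorder() on this section's city and income values (with Syracuse spelled out). In the forcats package the arguments are named `.f`, `.x`, and `.fun`; they are passed by position here. The data.frame name `cities` and the new column names are chosen only for this illustration.

```r
library(dplyr)
library(forcats)

cities <- data.frame(
  city   = c("Rochester", "Buffalo", "Geneseo", "Syracuse"),
  income = c(80, 82, 70, 75)
)

cities <- cities |>
  mutate(
    city_fct            = factor(city),
    city_by_income      = fct_reorder(city_fct, income),               # ascending income
    city_by_income_desc = fct_reorder(city_fct, income, .desc = TRUE)  # descending income
  )

levels(cities$city_fct)        # "Buffalo" "Geneseo" "Rochester" "Syracuse"
levels(cities$city_by_income)  # "Geneseo" "Syracuse" "Rochester" "Buffalo"
```

Because each city appears only once, the default summary function (median) just returns that city's income.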
ggplot with a factor variable - Sorted bar chart/dot plot
mutate()
library(dplyr)
library(stringr)

df_str <- data.frame(fruit = c("apple", "banana", "pear"))
df_str <- df_str |>
  mutate(fruit_e = str_detect(fruit, "e"))  # TRUE if fruit contains "e"
str_detect() returns TRUE if a character value matches a pattern and FALSE otherwise.
select(), rename(), relocate(), and mutate()
Let’s do Q2f and Q2g in Classwork 10!
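For reference while working on the string questions, here is a minimal sketch of the string-related functions listed earlier, applied to the same fruit values. The patterns and the result variable names are chosen only for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

has_e       <- str_detect(fruit, "e")            # TRUE FALSE TRUE
first_a     <- str_replace(fruit, "a", "A")      # replaces the first match only
all_a       <- str_replace_all(fruit, "a", "A")  # replaces every match
first_three <- str_sub(fruit, 1, 3)              # characters 1 through 3
```

All of these are vectorized, so they can be used directly inside mutate() on a character column.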