R Basics I

Classwork 1

Author

Byeong-Hak Choe

Published

September 16, 2024

Modified

October 8, 2024

Question 1.

Base R provides the R object state.name. Write R code to assign state.name to a variable named US_states.

Answer:

US_states <- state.name

state.name is a predefined R object that contains the names of all 50 U.S. states. The code assigns the contents of state.name to a new variable US_states, effectively storing the state names in this new variable.
- In the general R environment, a variable is a name assigned to any object or data stored in memory, whether it’s a simple value, a vector, or a more complex structure like a data frame. For this reason, I often refer to a variable as “the name of this object” in the general R environment.
- In a data.frame, a variable is a column that represents a particular attribute of the data.frame.
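As a minimal sketch of the two meanings of "variable" described above, the snippet below first inspects US_states as a named object and then places it in a data.frame, where it becomes a column. The names states_df and n_chars are illustrative choices and are not part of the original classwork.

```r
# "Variable" as the name of an object in the R environment
US_states <- state.name
class(US_states)    # type of the object the name refers to
length(US_states)   # number of elements in the vector
head(US_states, 3)  # first three state names

# "Variable" as a column of a data.frame
# (states_df and n_chars are hypothetical names used for illustration)
states_df <- data.frame(
  state   = US_states,
  n_chars = nchar(US_states)  # number of characters in each state name
)
names(states_df)  # the variables (columns) of the data.frame
```

Here US_states and states_df are variables in the first sense, while state and n_chars are variables in the second sense.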

Question 2.

The temp_F vector contains the average high temperatures in January for the following cities: Seoul, Lagos, Paris, Rio de Janeiro, San Juan, and Rochester.

temp_F <- c(35, 88, 42, 84, 81, 30)

Create a new vector named temp_C that stores the converted Celsius temperatures. Below is the conversion formula:

\[ C = \frac{5}{9}\times(F - 32) \]

Answer:

temp_F <- c(35, 88, 42, 84, 81, 30)
temp_C <- (5/9) * (temp_F - 32)
temp_C

[1]  1.666667 31.111111  5.555556 28.888889 27.222222 -1.111111

The formula to convert Fahrenheit to Celsius is applied element-wise to the temp_F vector, which stores the temperatures in Fahrenheit. The code then assigns the converted Celsius temperatures to the temp_C vector.
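The element-wise behavior can be made easier to read by labeling each temperature with its city. The sketch below assumes the cities are listed in the same order as the values in temp_F, as the question suggests; the name temp_F_back is introduced here only for the round-trip check.

```r
temp_F <- c(35, 88, 42, 84, 81, 30)
temp_C <- (5/9) * (temp_F - 32)  # applied to every element at once

# Label each element with its city (order taken from the question text)
names(temp_C) <- c("Seoul", "Lagos", "Paris",
                   "Rio de Janeiro", "San Juan", "Rochester")
round(temp_C, 1)

# Sanity check: converting back should recover the Fahrenheit values
# (temp_F_back is a hypothetical helper name)
temp_F_back <- unname(temp_C) * 9/5 + 32
isTRUE(all.equal(temp_F_back, temp_F))
```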

Question 3.

Write R code to calculate the standard deviation (SD) of the integer vector x below manually, that is, without using the sd() or var() functions.

x <- 1:25

Also, write R code to test whether the standard deviation you calculated manually is equal to sd(x).

Answer:

# Manual calculation of standard deviation
x <- 1:25
n <- length(x)                                    # number of observations
mean_x <- sum(x) / n                              # sample mean
variance_manual <- sum((x - mean_x)^2) / (n - 1)  # sample variance
sd_manual <- sqrt(variance_manual)                # sample standard deviation

# Test if it is equal to sd(x)
sd_manual == sd(x)

[1] TRUE

The standard deviation is the square root of the variance. The sample variance is the sum of the squared differences from the mean, divided by the number of observations minus 1.
sd_manual is then compared with the result of the built-in sd() function to confirm correctness. The == operator tests exact equality, which holds here; for floating-point values in general, all.equal() is the safer check because it allows a small numerical tolerance.
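Since both sd_manual and sd(x) are floating-point numbers, a tolerance-based comparison is a more robust way to run the same test. The following sketch repeats the manual calculation and compares with all.equal(); the sqrt(2) lines are an added illustration of why exact comparison can be fragile and are not part of the original answer.

```r
x <- 1:25
n <- length(x)
mean_x <- sum(x) / n
sd_manual <- sqrt(sum((x - mean_x)^2) / (n - 1))

# Tolerance-based comparison; isTRUE() turns the result into a single TRUE/FALSE
same_sd <- isTRUE(all.equal(sd_manual, sd(x)))
same_sd

# Why exact comparison can mislead with floating-point arithmetic
exact_check     <- sqrt(2)^2 == 2                   # exact test
tolerance_check <- isTRUE(all.equal(sqrt(2)^2, 2))  # test within tolerance
c(exact = exact_check, tolerance = tolerance_check)
```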

Question 4.

Consider the vectors:

my_vec <- c(-10, -20, 30, 10, 50, 40, -100)
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", 
           "GENESEE LIGHT", "MILLER LITE", "NATURAL LIGHT")

Write R code to filter only the positive values in my_vec.
Write R code to access the beers that are in positions 2, 4, and 6 using indexing.

Answer:

# Filtering positive values
my_vec <- c(-10, -20, 30, 10, 50, 40, -100)
positive_values <- my_vec[my_vec > 0]

# Accessing beers in positions 2, 4, and 6
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", 
           "GENESEE LIGHT", "MILLER LITE", "NATURAL LIGHT")
selected_beers <- beers[c(2, 4, 6)]

positive_values

[1] 30 10 50 40

selected_beers

[1] "BUSCH LIGHT"   "GENESEE LIGHT" "NATURAL LIGHT"

The positive values from my_vec are filtered using logical indexing (my_vec > 0), and the selected beers are accessed using the positions 2, 4, and 6 through direct indexing (beers[c(2, 4, 6)]).
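To see what logical indexing does step by step, it can help to look at the intermediate logical vector. The sketch below stores it under the illustrative name is_positive, and also builds the positions 2, 4, and 6 with seq() as one possible alternative to typing them out; both names are assumptions introduced for this example.

```r
my_vec <- c(-10, -20, 30, 10, 50, 40, -100)
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT",
           "GENESEE LIGHT", "MILLER LITE", "NATURAL LIGHT")

# Step 1: the comparison returns one TRUE/FALSE per element
is_positive <- my_vec > 0
is_positive

# Step 2: indexing with that logical vector keeps only the TRUE positions
my_vec[is_positive]

# which() reports the positions that are TRUE
which(is_positive)

# Positional indexing: generate every second position starting at 2
even_positions <- seq(2, length(beers), by = 2)
beers[even_positions]
```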

Question 5.

Write R code to read the CSV file at https://bcdanl.github.io/data/mlb_teams.csv using the tidyverse’s read_csv() function, and assign the result to MLB_teams.

Answer:

library(tidyverse)  # to use the read_csv() function
MLB_teams <- read_csv("https://bcdanl.github.io/data/mlb_teams.csv")

The read_csv() function comes from the readr package, which is loaded as part of the tidyverse. It reads the CSV file from the given URL, guesses the type of each column, and returns the data as a data.frame (a tibble), which is assigned to the name MLB_teams.
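To illustrate how read_csv() parses a file without depending on a network connection, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The file contents, toy_path, and toy_teams are hypothetical and are not taken from the real MLB data; the show_col_types argument assumes readr 2.0 or later.

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Write a tiny made-up CSV to a temporary file (values are NOT real MLB data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("teamID,W,L",
             "AAA,90,72",
             "BBB,81,81"),
           toy_path)

# show_col_types = FALSE silences the column-specification message
toy_teams <- read_csv(toy_path, show_col_types = FALSE)

class(toy_teams)          # a tibble, which is also a data.frame
sapply(toy_teams, class)  # column types guessed by read_csv()
dim(toy_teams)            # number of rows and columns
```

The same inspection calls, such as dim(MLB_teams), can be run on the real data after it has been read from the URL.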

Question 6.

Write R code to provide descriptive statistics (mean, standard deviation, minimum, first quartile, median, third quartile, and maximum) for the variables in the MLB_teams data.frame.

Answer:

library(skimr)
skim(MLB_teams)

Data summary
Name	MLB_teams
Number of rows	300
Number of columns	56
_______________________
Column type frequency:
character	13
numeric	43
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
lgID	1	2	2	2
teamID	1	3	3	31
franchID	1	3	3	30
divID	1	1	1	3
DivWin	1	1	1	2
WCWin	1	1	1	2
LgWin	1	1	1	2
WSWin	1	1	1	2
name	1	12	29	31
park	1	8	31	37
teamIDBR	1	3	3	31
teamIDlahman45	1	3	3	30
teamIDretro	1	3	3	31

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
yearID	1	2013.50	2.88	2009.00	2011.00	2013.50	2016.00	2018.00	▇▇▇▇▇
Rank	1	3.01	1.44	1.00	2.00	3.00	4.00	6.00	▇▃▅▃▁
G	1	161.99	0.26	161.00	162.00	162.00	162.00	163.00	▁▁▇▁▁
Ghome	1	80.99	0.53	78.00	81.00	81.00	81.00	84.00	▁▁▇▁▁
W	1	80.99	11.40	47.00	73.00	81.00	90.00	108.00	▁▅▇▇▂
L	1	80.99	11.37	54.00	72.00	81.00	89.00	115.00	▂▇▇▅▁
R	1	707.24	73.22	513.00	650.75	707.00	755.00	915.00	▁▆▇▅▁
AB	1	5519.63	70.84	5294.00	5465.00	5519.50	5565.00	5735.00	▁▅▇▅▁
H	1	1405.71	73.91	1199.00	1353.50	1403.00	1452.00	1625.00	▁▆▇▃▁
X2B	1	278.00	25.55	219.00	260.00	276.50	294.00	363.00	▂▇▇▂▁
X3B	1	29.05	9.12	5.00	22.00	29.00	35.00	57.00	▁▆▇▃▁
HR	1	167.32	35.86	91.00	141.75	164.00	191.25	267.00	▃▇▇▅▁
BB	1	504.87	64.18	375.00	457.00	503.00	547.00	672.00	▃▇▇▅▂
SO	1	1235.67	135.21	905.00	1142.75	1232.00	1324.25	1594.00	▂▅▇▅▁
SB	1	93.12	29.71	19.00	71.00	91.00	112.25	194.00	▂▇▇▂▁
CS	1	35.53	9.59	13.00	29.00	34.00	42.00	74.00	▂▇▅▂▁
HBP	1	54.38	13.74	26.00	44.00	53.00	63.00	101.00	▃▇▅▂▁
SF	1	41.70	7.94	24.00	36.00	41.00	47.00	64.00	▃▇▇▃▁
RA	1	707.24	79.76	525.00	646.75	704.50	760.00	894.00	▂▆▇▅▂
ER	1	652.10	74.82	478.00	598.00	649.50	700.50	846.00	▂▇▇▅▂
ERA	1	4.06	0.49	2.94	3.71	4.04	4.37	5.36	▂▆▇▃▂
CG	1	3.83	2.91	0.00	2.00	3.00	6.00	18.00	▇▅▁▁▁
SHO	1	10.36	4.07	2.00	7.00	10.00	13.00	23.00	▃▇▇▂▁
SV	1	41.44	7.09	24.00	37.00	41.00	46.00	62.00	▂▇▇▃▁
IPouts	1	4341.87	40.37	4235.00	4314.75	4340.00	4369.00	4485.00	▂▇▇▂▁
HA	1	1405.71	84.18	1125.00	1350.50	1405.00	1462.25	1637.00	▁▃▇▆▁
HRA	1	167.32	28.32	96.00	147.00	167.00	184.00	258.00	▂▆▇▂▁
BBA	1	504.87	55.82	352.00	466.00	504.00	540.00	653.00	▁▅▇▅▂
SOA	1	1235.67	132.99	911.00	1153.00	1231.00	1312.50	1687.00	▂▇▇▂▁
E	1	96.28	15.14	54.00	86.00	97.00	106.25	143.00	▁▆▇▃▁
DP	1	144.00	17.18	95.00	133.00	144.00	155.00	190.00	▁▅▇▅▁
FP	1	0.98	0.00	0.98	0.98	0.98	0.99	0.99	▁▃▇▅▁
attendance	1	2439186.37	647151.08	811104.00	1924926.50	2373285.50	2988512.50	3857500.00	▁▆▇▆▃
BPF	1	100.09	5.58	88.00	96.00	99.50	103.00	120.00	▃▇▇▁▁
PPF	1	100.09	5.45	88.00	97.00	100.00	103.00	121.00	▂▇▅▁▁
TB	1	2243.78	154.63	1810.00	2136.75	2235.00	2346.25	2703.00	▁▅▇▅▁
WinPct	1	0.50	0.07	0.29	0.45	0.50	0.56	0.67	▁▅▇▇▂
rpg	1	4.37	0.45	3.17	4.02	4.37	4.66	5.65	▁▆▇▅▁
hrpg	1	1.03	0.22	0.56	0.88	1.01	1.18	1.65	▃▇▇▅▁
tbpg	1	13.85	0.95	11.17	13.19	13.81	14.47	16.69	▁▅▇▅▁
kpg	1	7.63	0.83	5.59	7.06	7.61	8.18	9.84	▂▅▇▅▁
k2bb	1	2.48	0.40	1.53	2.20	2.49	2.74	3.75	▂▇▇▃▁
whip	1	1.32	0.07	1.16	1.27	1.31	1.37	1.56	▂▇▆▂▁

skimr::skim() is a function from the skimr package that provides a more comprehensive summary of data than the base summary() function.
- It generates descriptive statistics for a vector or for each column (variable) in a data.frame, in an easy-to-read layout with more detail than basic summaries.
- In the numeric summary, p0, p25, p50, p75, and p100 are the minimum, first quartile, median, third quartile, and maximum, respectively.
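For comparison, the same seven statistics can be computed with base R functions alone. The helper below, describe(), is a hypothetical name written for this sketch (it is not a skimr or base R function), and it is demonstrated on the vector 1:25 from Question 3 so that it runs without downloading the data.

```r
# describe() is a hypothetical helper, not part of base R or skimr
describe <- function(v) {
  c(mean   = mean(v, na.rm = TRUE),
    sd     = sd(v, na.rm = TRUE),
    min    = min(v, na.rm = TRUE),
    q1     = unname(quantile(v, 0.25, na.rm = TRUE)),  # first quartile
    median = median(v, na.rm = TRUE),
    q3     = unname(quantile(v, 0.75, na.rm = TRUE)),  # third quartile
    max    = max(v, na.rm = TRUE))
}

stats_x <- describe(1:25)
stats_x

# With the real data loaded, the helper could be applied to numeric columns, e.g.:
# sapply(MLB_teams[c("W", "L")], describe)
```

This is only a sketch of the underlying calculations; skim() remains the more convenient choice because it covers every column and adds completeness information and histograms.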

Discussion

Welcome to our Classwork 1 Discussion Board! 👋

This space is designed for you to engage with your classmates about the material covered in Classwork 1.

Whether you want to delve deeper into the material, share insights, or ask questions, this is the perfect place for you.

If you have specific questions for Byeong-Hak (@bcdanl) or a classmate (@GitHub-Username) about the Classwork 1 materials, or need clarification on any point, don’t hesitate to ask here.

Let’s collaborate and learn from each other!