library(tidyverse)
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
In this lesson, we are going to cover two ideas that are fundamental to more advanced programming
Loops
Creating Functions
In essence, both of these skills build on something we have been doing this whole class: assignment with <-
We have assigned data, plots, summaries, and values
data <- haven::read_dta("data/hsls-small.dta")
plot <- ggplot(data) +
  geom_histogram(aes(x = x1txmtscor))
data_sum <- data |>
  summarize(mean = mean(x1txmtscor, na.rm = T))
uf_age <- 2025 - 1853
Everything we are going to cover today comes back to this basic principle: things being assigned names
The watchwords for this lesson are DRY vs WET:
DRY: Don’t repeat yourself
WET: Write every time
Let’s say you have a three-step analysis process for 20 files (read, lower names, add a column). Under a WET programming paradigm in which each command gets its own line of code, that’s 60 lines of code. If the number of your files grows to 50, that’s now 150 lines of code — for just three tasks! When you write every time, you not only make your code longer and harder to parse, you also increase the likelihood that your code will contain bugs while simultaneously decreasing its scalability.
If you need to repeat an analytic task (which may be a set of commands), then it’s better to have one statement of that process that you repeat, perhaps in a loop or in a function. Don’t repeat yourself — say it once and have R repeat it for you!
The goal of DRY programming is not abstraction or slickness for its own sake. That runs counter to the clarity and replicability we’ve been working toward. Instead, we aspire to DRY code since it is more scalable and less buggy than WET code. To be clear, a function or loop can still have bugs, but the bugs it introduces are often the same across repetitions and fixed at a single point of error. That is, it’s typically easier to debug when the bug has a single root cause than when it could be anywhere in 150 similar but slightly different lines of code.
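To make the contrast concrete, here is a minimal sketch that uses made-up in-memory data frames instead of files. The names raw_list, clean_wet_a, and clean_dry are hypothetical and only for illustration. Both versions lower the column names and add a column, but the DRY version states the process once.

```r
# Made-up data: three small data frames standing in for three files
raw_list <- list(
  a = data.frame(ID = 1:2, SCORE = c(10, 20)),
  b = data.frame(ID = 3:4, SCORE = c(30, 40)),
  c = data.frame(ID = 5:6, SCORE = c(50, 60))
)

# WET: every step written out by hand for one data frame
clean_wet_a <- raw_list$a
names(clean_wet_a) <- tolower(names(clean_wet_a))
clean_wet_a$source <- "a"
# ...and the same three lines again for b, and again for c

# DRY: say it once and let the loop repeat it
clean_dry <- list()
for (nm in names(raw_list)) {
  df_i <- raw_list[[nm]]               # take one data frame
  names(df_i) <- tolower(names(df_i))  # lower the column names
  df_i$source <- nm                    # add a column
  clean_dry[[nm]] <- df_i              # store the result under the same name
}

names(clean_dry$c)
```

If the process needs to change (say, a different new column), the DRY version needs one edit; the WET version needs one edit per data frame.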
As we work through the lesson examples, keep in the back of your mind:
We’ll use a combination of nonce data and the school test score data we’ve used in a past lesson. We won’t read in the school test score data until the last section, but we’ll continue following our good organizational practice by setting the directory paths at the top of our script.
for() Loops
The idea of loops is relatively simple
Take a list of things
for(i in class_list) {
Do a set of things
print(i) }
Wait, but what’s i
i is the most common word to use here, but we could call it anything
It is just the name we give to each item (i for item) in the list
Okay, but what’s { and }
Whatever we put between { and } is done for each i in the list
class_list <- c("Let's", "go", "Gators", "!")
for(i in class_list) { print(i) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
To show that the name is arbitrary, let’s call it something other than i
for(word in class_list) { print(word) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
for(gator_egg in class_list) { print(gator_egg) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
All we are doing is assigning a name to each item in the list
We can do the same thing with numbers
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)
for(i in gators_points_23) { print(i) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
for(billy_napier in gators_points_23) { print(billy_napier) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick exercise
Create a list of the names of every school you’ve attended, then use a for loop to print them out
Adding if() and else to Loops
Loops that print things are all well and good, but really we want to be able to do a little more than that
We are going to use if() and else to do that
Remember ifelse() from Data Wrangling?
This is just splitting that up: if() something is true, do this, else do that (in R code, else is a keyword written without parentheses)
Let’s start with just an if()
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  }
}
[1] 49
[1] 38
[1] 41
[1] 36
[1] 35
[1] 31
Notice we only got scores if they were above 30
Next, we can add an else to say what to do if the score was not above 30
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  } else {
    print(i)
  }
}
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick Question: Is that the same list we had before? Why or why not?
Let’s see how we can make it different
We are going to use a new command, paste(), which combines strings, then print() the result
You will probably want to use paste() in your assignment as well
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    print(i)
  }
}
[1] 11
[1] "Yay the Gators scored 49 points, which is more than 30!"
[1] 29
[1] 22
[1] 14
[1] "Yay the Gators scored 38 points, which is more than 30!"
[1] "Yay the Gators scored 41 points, which is more than 30!"
[1] 20
[1] "Yay the Gators scored 36 points, which is more than 30!"
[1] "Yay the Gators scored 35 points, which is more than 30!"
[1] "Yay the Gators scored 31 points, which is more than 30!"
[1] 15
Now let’s add a message to the else statement as well
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay, the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    paste("Sad times, the Gators only scored", i, "points...") |> print()
  }
}
[1] "Sad times, the Gators only scored 11 points..."
[1] "Yay, the Gators scored 49 points, which is more than 30!"
[1] "Sad times, the Gators only scored 29 points..."
[1] "Sad times, the Gators only scored 22 points..."
[1] "Sad times, the Gators only scored 14 points..."
[1] "Yay, the Gators scored 38 points, which is more than 30!"
[1] "Yay, the Gators scored 41 points, which is more than 30!"
[1] "Sad times, the Gators only scored 20 points..."
[1] "Yay, the Gators scored 36 points, which is more than 30!"
[1] "Yay, the Gators scored 35 points, which is more than 30!"
[1] "Yay, the Gators scored 31 points, which is more than 30!"
[1] "Sad times, the Gators only scored 15 points..."
This may seem a little silly right now, but this fun example was just meant to show the basics of what we are doing
We will cover these in a more serious way at the end of the lesson
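As a small step beyond printing, here is a sketch of the same if/else loop storing its results in objects. The names big_games and labels are hypothetical; the scores are the ones used above.

```r
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

big_games <- 0          # counter we update inside the loop
labels <- character()   # empty character vector to collect one label per game

for (i in gators_points_23) {
  if (i > 30) {
    big_games <- big_games + 1          # count the game
    labels <- c(labels, "over 30")      # and record a label for it
  } else {
    labels <- c(labels, "30 or under")
  }
}

big_games      # number of games with more than 30 points
table(labels)  # how many games fell into each group
```

Both objects are still there once the loop has finished, so they can be used in later code.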
Functions work much the same way as loops: whatever we put inside { } is what gets done
The difference is that instead of doing it for each item in a list, we do it for a single input
We also have to call the function for it to do anything
filter(), summarize(), mutate() are all functions just like the one we are going to make
To demonstrate, let’s make a function that prints a welcome message for students arriving at UF
welcome <- function() { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
To personalize the message for each student, we need some data, so we are just going to make some up
tribble() is just a way of making a tidyverse data frame (think about it as a table with rows and columns); don’t worry about it for now, it’s not the main idea for the lesson
fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class, ~meal_plan, ~roommate,
                     1853, "Jack", "Cyprus", "BIO-1001", 1, "Mike",
                     1854, "Hailey", "Simpson", "BIO-1001", 0, "Jessica",
                     1855, "Tamika", "Simpson", "CHEM-1002", 1, "Hannah",
                     1856, "Jessica", "Simpson", "ARCH-1003", 1, "Hailey",
                     1857, "Mike", "Cyprus", "STA-1002", 0, "Jack",
                     1858, "Hannah", "Simpson", "EDF-1005", 1, "Tamika")
For our function to be able to work, it needs to be able to take an input, in this case UFID
Let’s run our same function again, but adding id inside the parentheses of function(), saying that it takes id (the student’s UFID) as the only input
welcome <- function(id) { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
Quick question: It ran, but why did this not change anything?
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  print(student)
}
welcome(1853)
# A tibble: 1 × 6
   ufid name  dorm   first_class meal_plan roommate
  <dbl> <chr> <chr>  <chr>           <dbl> <chr>
1  1853 Jack  Cyprus BIO-1001            1 Mike
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  paste("Welcome to UF", name)
}
welcome(1853)
[1] "Welcome to UF Jack"
Okay, now we’re getting somewhere!
Let’s add a bit more info to say where they live and what their first class will be
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  dorm <- student |> pull(dorm)
  first_class <- student |> pull(first_class)
  paste("Welcome to UF", name, "you will be living in", dorm, "and your first class is", first_class)
}
welcome(1853)
[1] "Welcome to UF Jack you will be living in Cyprus and your first class is BIO-1001"
Quick exercise: Add to the above block of code, to also say who their roommate is
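Functions and loops also work together: a function can be called once per item inside a loop. Here is a minimal sketch using a cut-down copy of the made-up student data. The names fake_students and welcome_short are hypothetical, chosen so they do not overwrite the objects above.

```r
library(dplyr)   # filter(), pull()
library(tibble)  # tribble()

# A cut-down copy of the made-up student data (two students, three columns)
fake_students <- tribble(
  ~ufid, ~name,    ~dorm,
  1853,  "Jack",   "Cyprus",
  1854,  "Hailey", "Simpson"
)

# Same idea as welcome(): look up one student by id and paste a message
welcome_short <- function(id) {
  student <- fake_students |> filter(ufid == id)
  name <- student |> pull(name)
  paste("Welcome to UF", name)
}

# Call the function once for every ufid in the data
for (i in fake_students$ufid) {
  print(welcome_short(i))
}
```

The loop supplies a different id on each pass, and the function does the same job for whichever id it is given.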
To get the list of files we want to read, we use the list.files() function
We add full.names = TRUE so that we get the full path to each file
Quick Exercise
Try removing the full.names = TRUE, see what the differences are, and think why we need to include it
files <- list.files("data/sch-test/by-school",
                    full.names = T)
for(i in files) {
  print(i)
}
[1] "data/sch-test/by-school/bend-gate-1980.csv"
[1] "data/sch-test/by-school/bend-gate-1981.csv"
[1] "data/sch-test/by-school/bend-gate-1982.csv"
[1] "data/sch-test/by-school/bend-gate-1983.csv"
[1] "data/sch-test/by-school/bend-gate-1984.csv"
[1] "data/sch-test/by-school/bend-gate-1985.csv"
[1] "data/sch-test/by-school/east-heights-1980.csv"
[1] "data/sch-test/by-school/east-heights-1981.csv"
[1] "data/sch-test/by-school/east-heights-1982.csv"
[1] "data/sch-test/by-school/east-heights-1983.csv"
[1] "data/sch-test/by-school/east-heights-1984.csv"
[1] "data/sch-test/by-school/east-heights-1985.csv"
[1] "data/sch-test/by-school/niagara-1980.csv"
[1] "data/sch-test/by-school/niagara-1981.csv"
[1] "data/sch-test/by-school/niagara-1982.csv"
[1] "data/sch-test/by-school/niagara-1983.csv"
[1] "data/sch-test/by-school/niagara-1984.csv"
[1] "data/sch-test/by-school/niagara-1985.csv"
[1] "data/sch-test/by-school/spottsville-1980.csv"
[1] "data/sch-test/by-school/spottsville-1981.csv"
[1] "data/sch-test/by-school/spottsville-1982.csv"
[1] "data/sch-test/by-school/spottsville-1983.csv"
[1] "data/sch-test/by-school/spottsville-1984.csv"
[1] "data/sch-test/by-school/spottsville-1985.csv"
Okay, that looks like what we want, a list of files to read in
Now let’s try reading these
for(i in files) {
  read_csv(i)
}
Okay, that read the files, but we didn’t save them anywhere; we need to assign them
Now, this gets a little tricky: with <- we have to type the object name ourselves, so every pass through the loop would overwrite the same object. To give each file its own object, we need two extra steps
First, we read it in mostly like normal, into something called file, which is temporary and gets overwritten on each pass through the loop
Second, we make another temporary object, name, which holds the name we want to assign the data frame to
For now, this will look like data_<name of the file>
Third, we use assign(), which does what <- would do in normal circumstances, but takes the object name as a string, so each file gets its own object
So name <- read_csv() becomes file <- read_csv() then assign(name, file)
for(i in files) {
  file <- read_csv(i)
  name <- paste0("data_", i)
  assign(name, file)
}
Cool! That seems to have worked, but our data_ names are really, really long, which probably isn’t that useful for future analysis
So, instead of simply pasting together data_ and i, let’s use our friend regular expressions to get something more usable
Get the school name (anything that matches the options we give)
Get the year (any digits)
Paste those together with data_ to get our nicer object names
for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
Much better!
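The naming steps can be tried without reading any files. The sketch below uses two made-up paths in the same shape as the real ones (fake_paths is a hypothetical name) and assigns a placeholder string where read_csv(i) would go.

```r
library(stringr)  # str_extract()

# Made-up paths in the same shape as the real ones; no files are read
fake_paths <- c("data/sch-test/by-school/niagara-1980.csv",
                "data/sch-test/by-school/bend-gate-1981.csv")

for (i in fake_paths) {
  school <- str_extract(i, "niagara|bend|east|spot")  # first matching school name
  year <- str_extract(i, "\\d+")                      # first run of digits
  name <- paste0("data_", school, year)               # e.g. "data_niagara1980"
  assign(name, paste("pretend contents of", i))       # stand-in for read_csv(i)
  print(name)
}

# The objects were created under the names we built as strings
exists("data_niagara1980")
exists("data_bend1981")
```

Printing name inside the loop is a quick way to check the regular expressions before reading any real data.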
One last thing: instead of reading each of those into a new data frame, we could append them all together
Depending on your data, you may need to join data instead. Review Data Wrangling II for a refresher on how to append and join data appropriately
To do this, we first need to set up an empty data frame (a.k.a., “tibble”), as otherwise we will get an error the first time through the loop, as we would be attempting to bind something that doesn’t exist
Then, we simply run the loop like before, but use bind_rows() to stack each file onto the existing data frame
data_bind <- tibble()
for(i in files) {
  file <- read_csv(i)
  data_bind <- bind_rows(data_bind, file)
}
Great, that worked!
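To see the stacking pattern on its own, here is a sketch with two tiny made-up tibbles standing in for what read_csv() would return. The names fake_files and stacked are hypothetical.

```r
library(dplyr)   # bind_rows()
library(tibble)  # tibble()

# Made-up stand-ins for the data frames read_csv() would return
fake_files <- list(
  tibble(school = "Niagara", year = 1980, math = 514),
  tibble(school = "Niagara", year = 1981, math = 499)
)

stacked <- tibble()  # start empty so the first bind_rows() has something to bind to
for (file in fake_files) {
  stacked <- bind_rows(stacked, file)  # add this file's rows underneath
}

nrow(stacked)  # one row per made-up file
```

bind_rows() also accepts a list of data frames directly, so bind_rows(fake_files) should give the same result here; the loop version is shown because it matches the pattern used for reading files.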
Finally, what if we wanted to do this only for certain schools?
Hint: this might be useful in the homework
With a loop, we do something for each item in the list
So to do something only for certain schools, we want to change the list, not the loop
In this case, we can add a pattern to our list.files() function saying to only list files that match that pattern
files_niagara <- list.files("data/sch-test/by-school",
                            full.names = T,
                            pattern = "niagara")
data_niagara <- tibble()
for(i in files_niagara) {
  file <- read_csv(i)
  data_niagara <- bind_rows(data_niagara, file)
}
print(data_niagara)
# A tibble: 6 × 5
  school   year  math  read science
  <chr>   <dbl> <dbl> <dbl>   <dbl>
1 Niagara  1980   514   292     787
2 Niagara  1981   499   268     762
3 Niagara  1982   507   310     771
4 Niagara  1983   497   301     814
5 Niagara  1984   483   311     818
6 Niagara  1985   489   275     805
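The pattern argument can be explored safely in a throwaway folder. This sketch creates a few empty files in R's temporary directory (tmp_dir and the file names are made up for illustration) and lists them with two different patterns.

```r
# Build a throwaway folder with empty files named like the lesson's files
tmp_dir <- file.path(tempdir(), "by-school-demo")
dir.create(tmp_dir, showWarnings = FALSE)
file.create(file.path(tmp_dir, c("niagara-1980.csv", "niagara-1981.csv",
                                 "bend-gate-1980.csv", "spottsville-1980.csv")))

# pattern is matched against the file names as a regular expression
niagara_only <- list.files(tmp_dir, pattern = "niagara")
year_1980 <- list.files(tmp_dir, pattern = "1980")

niagara_only  # the two niagara files
year_1980     # one 1980 file from each school
```

Because the pattern can match any part of the file name, the same approach can select files by school, by year, or by anything else the names encode.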
In this lesson we have mostly print()-ed our output, because it’s one of the easiest ways to see what’s going on
But, you can use functions and loops for other things too, like modifying variables, reading data, etc.
A good way to build a loop is to start by print()-ing what you are looping through, then make it more sophisticated from there
Loops and Functions can be tricky, so don’t worry if you can’t get everything now
You might see some variations of the loops we learned today such as
while() loops
for() loops that use the index (line) number of the list instead
What’s important is that the underlying logic remains the same
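For reference, here is a sketch of those two variations using the class_list vector from earlier in the lesson. seq_along() gives the positions 1, 2, 3, and so on, and while() keeps repeating for as long as its condition is TRUE.

```r
class_list <- c("Let's", "go", "Gators", "!")

# Index version: i is the position (1, 2, 3, 4), not the item itself
for (i in seq_along(class_list)) {
  print(paste(i, class_list[i]))
}

# while() version: keep going for as long as the condition is TRUE
i <- 1
while (i <= length(class_list)) {
  print(class_list[i])
  i <- i + 1  # move to the next position, otherwise the loop never ends
}
```

In both cases the body between { and } is still what gets repeated; only the way the loop decides what to work on next has changed.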
a) Using a for() loop, read in the school test files from data/sch-test/by-school/.
b) Do the same, but, only for Bend Gate and Niagara
Hint: Look inside the list.files() help file for somewhere you might use some regex
c) Do it one more time, but include some code that adds a column called file_path which contains the location where each row’s information is coming from
d) Bind the data sets from part c) together
a) Read in hsls-small.dta
b) Write a function that
Takes stu_id as the input
paste()s back a sentence which says if() they ever attended college, else whether their parent expected them to go to college
Optional: if() the student went to college, add a second sentence that says how many months they had between high school and going to college
Once complete, turn in the .qmd file (it must render/run) and the rendered PDF to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.
## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Functions & Loops Solution]
##' [INIT: March 18 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##' [EDIT: Jue Wu]
##
## -----------------------------------------------------------------------------
setwd(this.path::here())
## ---------------------------
##' [Libraries]
## ---------------------------
library(tidyverse)
## ---------------------------
##' [Q1]
## ---------------------------
# 1a
files <- list.files("data/sch-test/by-school",
full.names = TRUE)
for(i in files) {
school <- str_extract(i, "niagara|bend|east|spot")
year <- str_extract(i, "\\d+")
name <- paste0("data_", school, year)
file <- read_csv(i)
assign(name, file)
}
# 1b
files_niagara_bend <- list.files("data/sch-test/by-school",
full.names = T,
pattern = "niagara|bend")
for(i in files_niagara_bend) {
school <- str_extract(i, "niagara|bend")
year <- str_extract(i, "\\d+")
name <- paste0("data_", school, year)
file <- read_csv(i)
assign(name, file)
}
# 1c
files_niagara_bend <- list.files("data/sch-test/by-school",
full.names = T,
pattern = "niagara|bend")
for(i in files_niagara_bend) {
school <- str_extract(i, "niagara|bend")
year <- str_extract(i, "\\d+")
name <- paste0("data_", school, year)
file <- read_csv(i) |> mutate(file_path = i)
assign(name, file)
}
# 1d
files_niagara_bend <- list.files("data/sch-test/by-school",
full.names = T,
pattern = "niagara|bend")
data_niagara_bend_bind <- tibble()
for(i in files_niagara_bend) {
file <- read_csv(i) |>
mutate(file_path = i)
data_niagara_bend_bind <- bind_rows(data_niagara_bend_bind, file)
}
## ---------------------------
##' [Q2]
## ---------------------------
# 2a
data <- haven::read_dta("data/hsls-small.dta")
# 2b
college <- function(id) {
student <- data |> filter(stu_id == id)
college <- student |> pull(x4evratndclg) # pull out if the student attend college
parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
if(is.na(college)) { # if whether they went to college is missing
paste("We do not know if this student went to college") # print this message
} else if(college == 1) { # if they went to college
paste("This student went to college") # paste this message
} else {
if(is.na(parent_exp)|parent_exp == 11) {
paste("This student did not go to college, and their parental expectation is unknown")
} else if(parent_exp >= 5) {
paste("This student did not go to college, but their parents expected them to")
} else if(parent_exp < 5) {
paste("This student did not go to college, and their parents did not expect them to")
}
}
}
# test
college(10031)
# Optional
college <- function(id) {
student <- data |> filter(stu_id == id)
college <- student |> pull(x4evratndclg) # pull out if the student attend college
parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
months_gap <- student |> pull(x4hs2psmos) # pull out the months gap
if(is.na(college)) { # if whether they went to college is missing
paste("We do not know if this student went to college") # print this message
} else if(college == 1) {
if(is.na(months_gap)) {
paste("This student went to college, but we do not know how long the gap was")
} else{
paste("This student went to college, and they had", months_gap, "months between high school and going to college")
}
} else {
if(is.na(parent_exp)|parent_exp == 11) {
paste("This student did not go to college, and their parental expectation is unknown")
} else if(parent_exp >= 5) {
paste("This student did not go to college, but their parents expected them to")
} else if(parent_exp < 5) {
paste("This student did not go to college, and their parents did not expect them to")
}
}
}
# test
college(10007)
# Super Optional
college <- function(id) {
student <- data |> filter(stu_id == id)
college <- student |> pull(x4evratndclg) # pull out if the student attend college
parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
months_gap <- student |> pull(x4hs2psmos) # pull out the months gap
avg_delay <- mean(data$x4hs2psmos, na.rm = TRUE) # calculate the average delay
if(is.na(college)) { # if whether they went to college is missing
paste("We do not know if this student went to college") # print this message
} else if(college == 1) {
if(is.na(months_gap)) {
paste("This student went to college, but we do not know how long the gap was")
} else{
if(months_gap == avg_delay) {
paste("This student went to college, and their gap was the average delay between high school and going to college")
} else if(months_gap < avg_delay) {
paste("This student went to college, and their gap was shorter than the average delay")
} else if(months_gap > avg_delay) {
paste("This student went to college, and their gap was longer than the average delay")
}
}
} else {
if(is.na(parent_exp)|parent_exp == 11) {
paste("This student did not go to college, and their parental expectation is unknown")
} else if(parent_exp >= 5) {
paste("This student did not go to college, but their parents expected them to")
} else if(parent_exp < 5) {
paste("This student did not go to college, and their parents did not expect them to")
}
}
}
# test
college(10007)
## Test it using a for loop
test_ids <- data |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(college(i)) }
##'[Matt's Solution]
##'[Without the optional part]
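df <- data # the functions below call the data df, so copy it from data (read in for 2a) before running them
id <- 10007 # Tip: if you're debugging a function or loop, manually assign something to input values (for a function) or i (for a loop) and then you can test it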
did_they_go <- function(id) {
student <- df |> filter(stu_id == id) # pull out the student id (so we can directly use the value)
college <- student |> pull(x4evratndclg) # pull out if the student attend college
parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
"and we do not know if their parents wanted them to" # assign this text to expect
} else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
"and their parents expected them to" # assign this text to expect
} else if(parent < 5) { # otherwise if the parent expectation is less than 5
"and their parents did not expect them to" # assign this text to expect
}
if(is.na(college)) { # if whether they went to college is missing
paste("We do not know if this student went to college") # print this message
} else if(college == 1) { # if they went to college
paste("This student went to college") # paste this message
} else if(college == 0) { # if they did not go to college
paste("This student never went to college", expect) # paste this message
}
}
## Test it using a for loop
test_ids <- df |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(did_they_go(i)) }
##'[With the optional part]
did_they_go <- function(id) {
student <- df |> filter(stu_id == id) # pull out the student id (so we can directly use the value)
college <- student |> pull(x4evratndclg) # pull out if the student attend college
parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
"and we do not know if their parents wanted them to" # assign this text to expect
} else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
"and their parents expected them to" # assign this text to expect
} else if(parent < 5) { # otherwise if the parent expectation is less than 5
"and their parents did not expect them to" # assign this text to expect
}
## NEW PART
median_delay <- df |> summarize(median = median(x4hs2psmos, na.rm = T)) |> pull(median) # summarize the median completion and pull the value out (notice we start with df not student)
delay <- student |> pull(x4hs2psmos) # pull out the students months delay
difference <- delay - median_delay # calculate the difference between the median delay and the students delay
delay_statement <- if(is.na(delay)) { # if the students delay was NA (note, it doesn't matter if they didn't go to college, as we only use this if they went)
"and we do not know how long they delayed college" # paste this message
} else if(delay == 0) { # if the students didn't delay going to college
"and they did not delay attending college at all" # paste this message
} else if(difference == 0) { # if the students delay was the median
paste("and they delayed college by", delay, "months which is the average amount of time") # paste this, using the value delay from above
} else if(difference < 0) { # if the students delay was below the median
paste("and they delayed college by", delay, "months which is", abs(difference), "months less than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
} else if(difference > 0) { # if the students delay was above the median
paste("and they delayed college by", delay, "months which is", abs(difference), "months more than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
}
if(is.na(college)) { # if whether they went to college is missing
paste("We do not know if this student went to college") # print this message
} else if(college == 1) { # if they went to college
paste("This student went to college", delay_statement) # paste this message NEW added delay_statement
} else if(college == 0) { # if they did not go to college
paste("This student never went to college", expect) # paste this message
}
}
## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------