library(tidyverse)
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
In this lesson, we are going to cover two ideas that are fundamental to more advanced programming
Loops
Creating Functions
<-
In essence, both of these skills are built off something we have been doing this whole class, assignment
We have assigned data
data <- haven::read_dta("data/hsls-small.dta")
plot <- ggplot(data) +
  geom_histogram(aes(x = x1txmtscor))
data_sum <- data |>
  summarize(mean = mean(x1txmtscor, na.rm = T))
uf_age <- 2025 - 1853
Everything we are going to cover today comes back to this basic principle, things being assigned names
The watchwords for this lesson are DRY (Don’t Repeat Yourself) vs WET (Write Every Time)
Let’s say you have a three-step analysis process for 20 files (read, lower names, add a column). Under a WET programming paradigm in which each command gets its own line of code, that’s 60 lines of code. If the number of your files grows to 50, that’s now 150 lines of code — for just three tasks! When you write every time, you not only make your code longer and harder to parse, you also increase the likelihood that your code will contain bugs while simultaneously decreasing its scalability.
If you need to repeat an analytic task (which may be a set of commands), then it’s better to have one statement of that process that you repeat, perhaps in a loop or in a function. Don’t repeat yourself — say it once and have R repeat it for you!
The goal of DRY programming is not abstraction or slickness for its own sake. That runs counter to the clarity and replicability we’ve been working toward. Instead, we aspire to DRY code since it is more scalable and less buggy than WET code. To be clear, a function or loop can still have bugs, but the bugs it introduces are often the same across repetitions and fixed at a single point of error. That is, it’s typically easier to debug when the bug has a single root cause than when it could be anywhere in 150 similar but slightly different lines of code.
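As a minimal sketch of the difference, assume three made-up score vectors (the names scores_a, scores_b, scores_c and all_scores are hypothetical and not part of the lesson data). The WET version writes the same command once per object, while the DRY version states it once and lets a loop repeat it:

```r
# Hypothetical nonce data, made up for illustration only
scores_a <- c(10, 20, 30)
scores_b <- c(5, 15, 25)
scores_c <- c(2, 4, 6)

# WET: the same command written out once per object
print(mean(scores_a))
print(mean(scores_b))
print(mean(scores_c))

# DRY: say it once and let R repeat it for each item in the list
all_scores <- list(scores_a, scores_b, scores_c)
for(i in all_scores) {
  print(mean(i))
}
```

If a fourth vector were added, only the list would need to change, not the loop.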
As we work through the lesson examples, keep in the back of your mind:
We’ll use a combination of nonce data and the school test score data we’ve used in a past lesson. We won’t read in the school test score data until the last section, but we’ll continue following our good organizational practice by setting the directory paths at the top of our script.
for() Loops
The idea of loops is relatively simple
Take a list of things
for(i in class_list) {
Do a set of things
print(i) }
Wait, but what’s i?
i is the most common word to use here, but we could call it anything
i simply stands for each item in the list, one at a time
Okay, but what’s { and }?
They mark the start and end of what we do for each i in the list
class_list <- c("Let's", "go", "Gators", "!")
for(i in class_list) { print(i) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
Remember, i could be called anything
for(word in class_list) { print(word) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
for(gator_egg in class_list) { print(gator_egg) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
All we are doing is assigning a name to the item in the list
We can do the same thing with numbers
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)
for(i in gators_points_23) { print(i) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
for(billy_napier in gators_points_23) { print(billy_napier) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick exercise
Create a list of the names of every school you’ve attended, then use a for loop to print them out
Adding if() and else to Loops
Loops that print things are all well and good, but really we want to be able to do a little more than that
We are going to use if() and else to do that
Remember ifelse() from Data Wrangling?
This is just splitting that up: if() something is true, do this, else do that
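To make the comparison concrete, here is a small sketch using a made-up vector (example_points and labels are hypothetical names). ifelse() checks every value at once, while if() and else inside a loop check one value per pass:

```r
# Hypothetical scores, made up for illustration
example_points <- c(12, 45, 31)

# ifelse() works on the whole vector in one go
labels <- ifelse(example_points > 30, "high", "low")
print(labels)

# if() and else inside a loop handle one value at a time
for(i in example_points) {
  if(i > 30) {
    print("high")
  } else {
    print("low")
  }
}
```

Both versions apply the same test; the loop version simply gives us room to do more than one thing in each branch.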
Let’s start with just if()
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  }
}
[1] 49
[1] 38
[1] 41
[1] 36
[1] 35
[1] 31
Notice we only got scores if they were above 30
Next, we can add an else to say what to do if the score was not above 30
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  } else {
    print(i)
  }
}
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick Question: Is that the same list we had before? Why or why not?
Let’s see how we can make it different
We are going to use a new command, paste(), which combines strings, then print() the result
You will use paste() in your assignment as well
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    print(i)
  }
}
[1] 11
[1] "Yay the Gators scored 49 points, which is more than 30!"
[1] 29
[1] 22
[1] 14
[1] "Yay the Gators scored 38 points, which is more than 30!"
[1] "Yay the Gators scored 41 points, which is more than 30!"
[1] 20
[1] "Yay the Gators scored 36 points, which is more than 30!"
[1] "Yay the Gators scored 35 points, which is more than 30!"
[1] "Yay the Gators scored 31 points, which is more than 30!"
[1] 15
Now let’s do the same for the else statement
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay, the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    paste("Sad times, the Gators only scored", i, "points...") |> print()
  }
}
[1] "Sad times, the Gators only scored 11 points..."
[1] "Yay, the Gators scored 49 points, which is more than 30!"
[1] "Sad times, the Gators only scored 29 points..."
[1] "Sad times, the Gators only scored 22 points..."
[1] "Sad times, the Gators only scored 14 points..."
[1] "Yay, the Gators scored 38 points, which is more than 30!"
[1] "Yay, the Gators scored 41 points, which is more than 30!"
[1] "Sad times, the Gators only scored 20 points..."
[1] "Yay, the Gators scored 36 points, which is more than 30!"
[1] "Yay, the Gators scored 35 points, which is more than 30!"
[1] "Yay, the Gators scored 31 points, which is more than 30!"
[1] "Sad times, the Gators only scored 15 points..."
This may seem a little silly right now, but this fun example was just meant to show the basics of what we are doing
We will cover these in a more serious way at the end of the lesson
Functions work much the same way as loops: whatever we say inside { } is done
The difference is that instead of doing it for each item in a list, we do it for a single input
We also have to use the function for it to work
filter(), summarize(), mutate() are all functions just like the one we are going to make
To demonstrate, let’s make a function that prints a welcome message for students arriving at UF
welcome <- function() { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
To do this, we need some data, so we are just going to make some up
tribble() is just a way of making a tidyverse data frame (think about it as a table with rows and columns); don’t worry about it for now, it’s not the main idea for the lesson
fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class, ~meal_plan, ~roommate,
                     1853, "Jack", "Cyprus", "BIO-1001", 1, "Mike",
                     1854, "Hailey", "Simpson", "BIO-1001", 0, "Jessica",
                     1855, "Tamika", "Simpson", "CHEM-1002", 1, "Hannah",
                     1856, "Jessica", "Simpson", "ARCH-1003", 1, "Hailey",
                     1857, "Mike", "Cyprus", "STA-1002", 0, "Jack",
                     1858, "Hannah", "Simpson", "EDF-1005", 1, "Tamika")
For our function to be able to work, it needs to be able to take an input, in this case UFID
Let’s run our same function again, but adding id to the parentheses, saying that it takes id (the UFID) as the only input
welcome <- function(id) { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
Quick question: It ran, but why did this not change anything?
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  print(student)
}
welcome(1853)
# A tibble: 1 × 6
ufid name dorm first_class meal_plan roommate
<dbl> <chr> <chr> <chr> <dbl> <chr>
1 1853 Jack Cyprus BIO-1001 1 Mike
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  paste("Welcome to UF", name)
}
welcome(1853)
[1] "Welcome to UF Jack"
Okay, now we’re getting somewhere!
Let’s add a bit more info to say where they live and what their first class will be
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  dorm <- student |> pull(dorm)
  first_class <- student |> pull(first_class)
  paste("Welcome to UF", name, "you will be living in", dorm, "and your first class is", first_class)
}
welcome(1853)
[1] "Welcome to UF Jack you will be living in Cyprus and your first class is BIO-1001"
Quick exercise: Add to the above block of code, to also say who their roommate is
We use the list.files() function to list the files in a folder, with the argument full.names = TRUE
Quick Exercise
Try removing the full.names = TRUE, see what the differences are, and think about why we need to include it
files <- list.files("data/sch-test/by-school",
                    full.names = T)
for(i in files) {
print(i)
}
[1] "data/sch-test/by-school/bend-gate-1980.csv"
[1] "data/sch-test/by-school/bend-gate-1981.csv"
[1] "data/sch-test/by-school/bend-gate-1982.csv"
[1] "data/sch-test/by-school/bend-gate-1983.csv"
[1] "data/sch-test/by-school/bend-gate-1984.csv"
[1] "data/sch-test/by-school/bend-gate-1985.csv"
[1] "data/sch-test/by-school/east-heights-1980.csv"
[1] "data/sch-test/by-school/east-heights-1981.csv"
[1] "data/sch-test/by-school/east-heights-1982.csv"
[1] "data/sch-test/by-school/east-heights-1983.csv"
[1] "data/sch-test/by-school/east-heights-1984.csv"
[1] "data/sch-test/by-school/east-heights-1985.csv"
[1] "data/sch-test/by-school/niagara-1980.csv"
[1] "data/sch-test/by-school/niagara-1981.csv"
[1] "data/sch-test/by-school/niagara-1982.csv"
[1] "data/sch-test/by-school/niagara-1983.csv"
[1] "data/sch-test/by-school/niagara-1984.csv"
[1] "data/sch-test/by-school/niagara-1985.csv"
[1] "data/sch-test/by-school/spottsville-1980.csv"
[1] "data/sch-test/by-school/spottsville-1981.csv"
[1] "data/sch-test/by-school/spottsville-1982.csv"
[1] "data/sch-test/by-school/spottsville-1983.csv"
[1] "data/sch-test/by-school/spottsville-1984.csv"
[1] "data/sch-test/by-school/spottsville-1985.csv"
Okay, that looks like what we want, a list of files to read in
Now let’s try reading these
for(i in files) {
read_csv(i)
}
Okay, that read the files, but we didn’t save them anywhere, we need to assign them
Now, this gets a little tricky: if we assign to a single fixed name with <- inside the loop, that object is overwritten on every pass, so only the last file would be left at the end. We want a differently named object for each file, so we need two extra steps
First, we read it in mostly like normal, into something called file, which is temporary and gets overwritten on each pass through the loop
Second, we make another temporary object, name, which holds the name we want to assign the data frame to: data_<name of the file>
Third, we use assign(), which basically does what <- would do in normal circumstances, but lets us supply the object’s name as a string, so each file is kept under its own name after the loop
So name <- read_csv() becomes file <- read_csv() then assign(name, file)
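A toy sketch of the same idea without any files, assuming a made-up vector of labels (toy_labels, value, and the toy_ prefix are hypothetical). assign() takes the object name as a string, so each pass through the loop can create a differently named object, and get() retrieves an object by its name:

```r
# Hypothetical labels, made up for illustration
toy_labels <- c("a", "b", "c")

for(i in toy_labels) {
  value <- toupper(i)          # temporary object, overwritten on every pass
  name <- paste0("toy_", i)    # build the object name as a string
  assign(name, value)          # like writing toy_a <- "A", toy_b <- "B", ...
}

print(toy_a)
print(get("toy_c"))
print(value)   # only the value from the last pass is left under this name
```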
for(i in files) {
  file <- read_csv(i)
  name <- paste0("data_", i)
  assign(name, file)
}
Cool! That seems to have worked, but our data_ names are really really long, which probably isn’t that useful for future analysis
So, instead of simply pasting together data_ and i, let’s use our friend regular expressions to get something more usable
Get the school name (anything that matches the options we give)
Get the year (any digits)
Paste those together with data_ to get our nicer object names
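The three steps above can be tried on a single path before putting them in a loop. This is a small sketch that assumes the stringr package (part of the tidyverse) is installed; example_path is a hypothetical object holding one path in the same shape as the files listed earlier:

```r
library(stringr)

# One example path in the same shape as the files listed in the lesson
example_path <- "data/sch-test/by-school/east-heights-1983.csv"

school <- str_extract(example_path, "niagara|bend|east|spot")  # first option that matches
year <- str_extract(example_path, "\\d+")                      # first run of digits
name <- paste0("data_", school, year)

print(name)
```

Checking the pieces on one path first makes it easier to spot a pattern that matches the wrong part of the file name.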
for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
Much better!
One last thing: instead of reading each of those into a new data frame, we could append them all together
Note that here we append rather than join the data. Review Data Wrangling II for a refresher on how to append and join data appropriately
To do this, we first need to set up an empty data frame (a.k.a., “tibble”), as otherwise we will get an error the first time through the loop, as we would be attempting to bind something that doesn’t exist
Then, we simply run the loop like before, but use bind_rows() to stack each file onto the existing data frame
data_bind <- tibble()
for(i in files) {
  file <- read_csv(i)
  data_bind <- bind_rows(data_bind, file)
}
Great, that worked!
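The same stacking pattern can be checked on made-up data, with no files involved. This sketch assumes the dplyr and tibble packages are installed; toy_files and toy_bind are hypothetical names, and the small tibbles stand in for files read from disk:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the data frames we would get from read_csv()
toy_files <- list(
  tibble(school = "A", year = 1980, math = 500),
  tibble(school = "A", year = 1981, math = 510),
  tibble(school = "B", year = 1980, math = 490)
)

toy_bind <- tibble()                    # start with an empty tibble
for(i in toy_files) {
  toy_bind <- bind_rows(toy_bind, i)    # stack this piece under what we have so far
  print(nrow(toy_bind))                 # the row count grows on each pass
}
```

Printing the row count inside the loop is a quick way to confirm that each pass adds rows rather than replacing them.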
Finally, what if we wanted to do this only for certain schools?
Hint: this might be useful in the homework
With a loop, we do something for each item in the list
So to do something only for certain schools, we want to change the list, not the loop
In this case, we can add a pattern to our list.files() function saying to only list files that match that pattern
files_niagara <- list.files("data/sch-test/by-school",
                            full.names = T,
                            pattern = "niagara")
data_niagara <- tibble()
for(i in files_niagara) {
  file <- read_csv(i)
  data_niagara <- bind_rows(data_niagara, file)
}
print(data_niagara)
# A tibble: 6 × 5
school year math read science
<chr> <dbl> <dbl> <dbl> <dbl>
1 Niagara 1980 514 292 787
2 Niagara 1981 499 268 762
3 Niagara 1982 507 310 771
4 Niagara 1983 497 301 814
5 Niagara 1984 483 311 818
6 Niagara 1985 489 275 805
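To see how pattern narrows the list without touching the real data, here is a sketch that uses a throwaway folder of empty files. The folder name toy-by-school and the file names are made up for illustration; pattern is treated by list.files() as a regular expression:

```r
# Make a throwaway folder with empty, hypothetical files so nothing real is touched
toy_dir <- file.path(tempdir(), "toy-by-school")
dir.create(toy_dir, showWarnings = FALSE)
toy_names <- c("niagara-1980.csv", "niagara-1981.csv", "bend-gate-1980.csv")
invisible(file.create(file.path(toy_dir, toy_names)))

all_files <- list.files(toy_dir)                          # every file in the folder
niagara_files <- list.files(toy_dir, pattern = "niagara") # only names matching the pattern

print(all_files)
print(niagara_files)
```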
In this lesson we have mostly print()-ed our output, because it’s one of the easiest ways to see what’s going on
But, you can use functions and loops for other things too, like modifying variables, reading data, etc.
Start by print()-ing what you are looping through, then make it more sophisticated from there
Loops and functions can be tricky, so don’t worry if you can’t get everything now
You might see some variations of the loops we learned today, such as
while() loops
for() loops that use the index (line) number of the list instead
What’s important is that the underlying logic remains the same
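For reference, here is a brief sketch of both variations on a made-up vector (example_points and count are hypothetical names). seq_along() gives the index numbers 1, 2, 3, and so on, and a while() loop keeps running for as long as its condition is TRUE:

```r
# Hypothetical values, made up for illustration
example_points <- c(11, 49, 29)

# for() loop over the index numbers instead of the items themselves
for(i in seq_along(example_points)) {
  paste("Item", i, "is", example_points[i]) |> print()
}

# while() loop: repeats until the condition is no longer TRUE
count <- 1
while(count <= length(example_points)) {
  print(example_points[count])
  count <- count + 1   # without this line the loop would never stop
}
```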
a) Using a for() loop, read in the school test files from data/sch-test/by-school/.
b) Do the same, but only for Bend Gate and Niagara
Hint: Look inside the list.files() help file for somewhere you might use some regex
c) Do it one more time, but include some code that adds a column called file_path which contains the location where each row’s information is coming from
d) Bind the data sets from part c) together
a) Read in hsls-small.dta
b) Write a function that
Takes stu_id as the input
paste()s back a sentence which says if() they ever attended college, or else whether their parent expected them to go to college
Optional: if() the student went to college, add a second sentence that says how many months they had between high school and going to college
Once complete, turn in the .qmd file (it must render/run) and the rendered PDF to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.
## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Functions & Loops Solution]
##' [INIT: March 18 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##' [EDIT: Jue Wu]
##
## -----------------------------------------------------------------------------
setwd(this.path::here())
## ---------------------------
##' [Libraries]
## ---------------------------
library(tidyverse)
## ---------------------------
##' [Q1]
## ---------------------------
# 1a
files <- list.files("data/sch-test/by-school",
                    full.names = TRUE)
for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
# 1b
files_niagara_bend <- list.files("data/sch-test/by-school",
                                 full.names = T,
                                 pattern = "niagara|bend")
for(i in files_niagara_bend) {
  school <- str_extract(i, "niagara|bend")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
# 1c
files_niagara_bend <- list.files("data/sch-test/by-school",
                                 full.names = T,
                                 pattern = "niagara|bend")
for(i in files_niagara_bend) {
  school <- str_extract(i, "niagara|bend")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i) |> mutate(file_path = i)
  assign(name, file)
}
# 1d
files_niagara_bend <- list.files("data/sch-test/by-school",
                                 full.names = T,
                                 pattern = "niagara|bend")
data_niagara_bend_bind <- tibble()
for(i in files_niagara_bend) {
  file <- read_csv(i) |>
    mutate(file_path = i)
  data_niagara_bend_bind <- bind_rows(data_niagara_bend_bind, file)
}
## ---------------------------
##' [Q2]
## ---------------------------
# 2a
data <- haven::read_dta("data/hsls-small.dta")
# 2b
college <- function(id) {
  student <- data |> filter(stu_id == id)
  college <- student |> pull(x4evratndclg) # pull out if the student attend college
  parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) { # if they went to college
    paste("This student went to college") # paste this message
  } else {
    if(is.na(parent_exp)|parent_exp == 11) {
      paste("This student did not go to college, and their parental expectation is unknown")
    } else if(parent_exp >= 5) {
      paste("This student did not go to college, but their parents expected them to")
    } else if(parent_exp < 5) {
      paste("This student did not go to college, and their parents did not expect them to")
    }
  }
}
# test
college(10031)
# Optional
college <- function(id) {
  student <- data |> filter(stu_id == id)
  college <- student |> pull(x4evratndclg) # pull out if the student attend college
  parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
  months_gap <- student |> pull(x4hs2psmos) # pull out the months gap
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) {
    if(is.na(months_gap)) {
      paste("This student went to college, but we do not know how long the gap was")
    } else {
      paste("This student went to college, and they had", months_gap, "months between high school and going to college")
    }
  } else {
    if(is.na(parent_exp)|parent_exp == 11) {
      paste("This student did not go to college, and their parental expectation is unknown")
    } else if(parent_exp >= 5) {
      paste("This student did not go to college, but their parents expected them to")
    } else if(parent_exp < 5) {
      paste("This student did not go to college, and their parents did not expect them to")
    }
  }
}
# test
college(10007)
# Super Optional
college <- function(id) {
  student <- data |> filter(stu_id == id)
  college <- student |> pull(x4evratndclg) # pull out if the student attend college
  parent_exp <- student |> pull(x1paredexpct) # pull out the parent student expectation
  months_gap <- student |> pull(x4hs2psmos) # pull out the months gap
  avg_delay <- mean(data$x4hs2psmos, na.rm = TRUE) # calculate the average delay
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) {
    if(is.na(months_gap)) {
      paste("This student went to college, but we do not know how long the gap was")
    } else {
      if(months_gap == avg_delay) {
        paste("This student went to college, and their gap was the average delay between high school and going to college")
      } else if(months_gap < avg_delay) {
        paste("This student went to college, and their gap was shorter than the average delay")
      } else if(months_gap > avg_delay) {
        paste("This student went to college, and their gap was longer than the average delay")
      }
    }
  } else {
    if(is.na(parent_exp)|parent_exp == 11) {
      paste("This student did not go to college, and their parental expectation is unknown")
    } else if(parent_exp >= 5) {
      paste("This student did not go to college, but their parents expected them to")
    } else if(parent_exp < 5) {
      paste("This student did not go to college, and their parents did not expect them to")
    }
  }
}
# test
college(10007)
## Test it using a for loop
test_ids <- data |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(college(i)) }
##'[Matt's Solution]
##'[Without the optional part]
id <- 10007 # Tip: if you're debugging a function or loop, manually assign something to input values (for a function) or i (for a loop) and then you can test it
did_they_go <- function(id) {
  student <- data |> filter(stu_id == id) # pull out the student's row (so we can directly use the values)
  college <- student |> pull(x4evratndclg) # pull out if the student attend college
  parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
  expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
    "and we do not know if their parents wanted them to" # assign this text to expect
  } else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
    "and their parents expected them to" # assign this text to expect
  } else if(parent < 5) { # otherwise if the parent expectation is less than 5
    "and their parents did not expect them to" # assign this text to expect
  }
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) { # if they went to college
    paste("This student went to college") # paste this message
  } else if(college == 0) { # if they did not go to college
    paste("This student never went to college", expect) # paste this message
  }
}
## Test it using a for loop
test_ids <- data |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(did_they_go(i)) }
##'[With the optional part]
did_they_go <- function(id) {
  student <- data |> filter(stu_id == id) # pull out the student's row (so we can directly use the values)
  college <- student |> pull(x4evratndclg) # pull out if the student attend college
  parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
  expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
    "and we do not know if their parents wanted them to" # assign this text to expect
  } else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
    "and their parents expected them to" # assign this text to expect
  } else if(parent < 5) { # otherwise if the parent expectation is less than 5
    "and their parents did not expect them to" # assign this text to expect
  }
  ## NEW PART
  median_delay <- data |> summarize(median = median(x4hs2psmos, na.rm = T)) |> pull(median) # summarize the median completion and pull the value out (notice we start with data not student)
  delay <- student |> pull(x4hs2psmos) # pull out the students months delay
  difference <- delay - median_delay # calculate the difference between the median delay and the students delay
  delay_statement <- if(is.na(delay)) { # if the students delay was NA (note, it doesn't matter if they didn't go to college, as we only use this if they went)
    "and we do not know how long they delayed college" # paste this message
  } else if(delay == 0) { # if the students didn't delay going to college
    "and they did not delay attending college at all" # paste this message
  } else if(difference == 0) { # if the students delay was the median
    paste("and they delayed college by", delay, "months which is the average amount of time") # paste this, using the value delay from above
  } else if(difference < 0) { # if the students delay was below the median
    paste("and they delayed college by", delay, "months which is", abs(difference), "months less than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
  } else if(difference > 0) { # if the students delay was above the median
    paste("and they delayed college by", delay, "months which is", abs(difference), "months more than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
  }
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) { # if they went to college
    paste("This student went to college", delay_statement) # paste this message NEW added delay_statement
  } else if(college == 0) { # if they did not go to college
    paste("This student never went to college", expect) # paste this message
  }
}
## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------