library(tidyverse)
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
In this lesson, we are going to cover two fundamental ideas to more advanced programming
Loops
Creating Functions
<-
In essence, both of these skills are built off something we have been doing this whole class, assignment
We have assigned data
data <- haven::read_dta("data/hsls-small.dta")
plot <- ggplot(data) +
  geom_histogram(aes(x = x1txmtscor))
data_sum <- data |>
  summarize(mean = mean(x1txmtscor, na.rm = T))
uf_age <- 2024 - 1853
Everything we are going to cover today comes back to this basic principle, things being assigned names
The watchwords for this lesson are DRY (Don’t Repeat Yourself) vs WET (Write Every Time)
Let’s say you have a three-step analysis process for 20 files (read, lower names, add a column). Under a WET programming paradigm in which each command gets its own line of code, that’s 60 lines of code. If the number of your files grows to 50, that’s now 150 lines of code — for just three tasks! When you write every time, you not only make your code longer and harder to parse, you also increase the likelihood that your code will contain bugs while simultaneously decreasing its scalability.
If you need to repeat an analytic task (which may be a set of commands), then it’s better to have one statement of that process that you repeat, perhaps in a loop or in a function. Don’t repeat yourself — say it once and have R repeat it for you!
The goal of DRY programming is not abstraction or slickness for its own sake. That runs counter to the clarity and replicability we’ve been working toward. Instead, we aspire to DRY code since it is more scalable and less buggy than WET code. To be clear, a function or loop can still have bugs, but the bugs it introduces are often the same across repetitions and fixed at a single point of error. That is, it’s typically easier to debug when the bug has a single root cause than when it could be anywhere in 150 similar but slightly different lines of code.
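As a rough sketch of what "say it once" can look like, the block below states a two-step cleaning process (lower the names, add a column) a single time and repeats it over three nonce data frames. The names raw, clean, and cleaned, and the data themselves, are made up for illustration, and the data frames stand in for files that have already been read in. Loops and functions are explained later in the lesson, so the details can wait.

```r
# Three nonce data frames standing in for files that were already read in
raw <- list(
  a = data.frame(ID = 1:2, Score = c(10, 20)),
  b = data.frame(ID = 3:4, Score = c(30, 40)),
  c = data.frame(ID = 5:6, Score = c(50, 60))
)

# One statement of the process: lower the names, add a column
clean <- function(df, source) {
  names(df) <- tolower(names(df))
  df$source <- source
  df
}

# R repeats the process for each data frame
cleaned <- list()
for (nm in names(raw)) {
  cleaned[[nm]] <- clean(raw[[nm]], nm)
}

print(names(cleaned$b))
```

If the cleaning steps ever need to change, they change in one place, inside clean(), rather than once per file.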
As we work through the lesson examples, keep in the back of your mind:
We’ll use a combination of nonce data and the school test score data we’ve used in a past lesson. We won’t read in the school test score data until the last section, but we’ll continue following our good organizational practice by setting the directory paths at the top of our script.
for() Loops
The idea of loops is relatively simple
Take a list of things
for(i in class_list) {
Do a set of things
print(i) }
Wait, but what’s i?
i is the most common word to use here, but we could call it anything
It is just the name given to each item (i for item) in the list
Okay, but what’s { and }?
Whatever we put between { and } is done for each i in the list
class_list <- c("Let's", "go", "Gators", "!")
for(i in class_list) { print(i) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
We could call i anything, and the loop works the same way
for(word in class_list) { print(word) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
for(gator_egg in class_list) { print(gator_egg) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
All we are doing is assigning a name to the item in the list
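To make the link to assignment concrete, here is a minimal sketch that writes out the first pass of the loop by hand and then lets the loop repeat it. It reuses class_list from above; nothing else is assumed.

```r
class_list <- c("Let's", "go", "Gators", "!")

# What the loop does on its first pass, written out by hand
word <- class_list[1]
print(word)

# The loop repeats that assignment for every item in turn
for (word in class_list) {
  print(word)
}

# After the loop, word still holds the last item it was assigned
print(word)
```

The final print() shows that the loop variable is an ordinary named object, just like anything else created with <-.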
We can do the same thing with numbers
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)
for(i in gators_points_23) { print(i) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
for(billy_napier in gators_points_23) { print(billy_napier) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick exercise
Create a list of the names of every school you’ve attended, then use a for loop to print them out
Adding if() and else() to Loops
Loops that print things are all well and good, but really we want to be able to do a little more than that
We are going to use if() and else() to do that
Remember ifelse() from Data Wrangling?
This is just splitting that up: if() something is true, do this, else() do that
if()
for(i in gators_points_23) {
if(i > 30) {
print(i)
} }
[1] 49
[1] 38
[1] 41
[1] 36
[1] 35
[1] 31
Notice we only got scores if they were above 30
Next, we can add an else()
to say what to do if the score was not above 30
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  } else {
    print(i)
  }
}
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
Quick Question: Is that the same list we had before? Why or why not?
Let’s see how we can make it different
We are going to use a new command, paste(), which combines strings, then print() that
You will use paste() in your assignment as well
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    print(i)
  }
}
[1] 11
[1] "Yay the Gators scored 49 points, which is more than 30!"
[1] 29
[1] 22
[1] 14
[1] "Yay the Gators scored 38 points, which is more than 30!"
[1] "Yay the Gators scored 41 points, which is more than 30!"
[1] 20
[1] "Yay the Gators scored 36 points, which is more than 30!"
[1] "Yay the Gators scored 35 points, which is more than 30!"
[1] "Yay the Gators scored 31 points, which is more than 30!"
[1] 15
We can do the same thing in the else() statement
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay, the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    paste("Sad times, the Gators only scored", i, "points...") |> print()
  }
}
[1] "Sad times, the Gators only scored 11 points..."
[1] "Yay, the Gators scored 49 points, which is more than 30!"
[1] "Sad times, the Gators only scored 29 points..."
[1] "Sad times, the Gators only scored 22 points..."
[1] "Sad times, the Gators only scored 14 points..."
[1] "Yay, the Gators scored 38 points, which is more than 30!"
[1] "Yay, the Gators scored 41 points, which is more than 30!"
[1] "Sad times, the Gators only scored 20 points..."
[1] "Yay, the Gators scored 36 points, which is more than 30!"
[1] "Yay, the Gators scored 35 points, which is more than 30!"
[1] "Yay, the Gators scored 31 points, which is more than 30!"
[1] "Sad times, the Gators only scored 15 points..."
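To tie this back to ifelse(), the sketch below builds one label per game twice: once with if and else inside a loop, and once with a single ifelse() call. The label text and the names labels_loop and labels_ifelse are made up for illustration. Note that in code else is written without parentheses.

```r
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

# Loop version: decide one game at a time
labels_loop <- character(0)
for (i in gators_points_23) {
  if (i > 30) {
    labels_loop <- c(labels_loop, "more than 30")
  } else {
    labels_loop <- c(labels_loop, "30 or fewer")
  }
}

# ifelse() version: decide for every game in one call
labels_ifelse <- ifelse(gators_points_23 > 30, "more than 30", "30 or fewer")

# Both approaches give the same labels
print(identical(labels_loop, labels_ifelse))
```

The loop version earns its keep when each branch needs to do more than return a value, such as printing a message or reading a file.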
This may seem a little silly right now, but this fun example was just meant to show the basics of what we are doing
We will cover these in a more serious way at the end of the lesson
Functions work much the same way as loops: whatever we say inside { } is done
The difference is that instead of doing it for each item in a list, we do it for a single input
We also have to call the function for it to do anything
filter(), summarize(), and mutate() are all functions, just like the one we are going to make
To demonstrate, let’s make a function that prints a welcome message for students arriving at UF
welcome <- function() { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
To do this, we need some data, so we are just going to make some up
tribble() is just a way of making a tidyverse data frame, don’t worry about it for now, it’s not the main idea for the lesson
fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class, ~meal_plan, ~roommate,
                     1853, "Jack", "Cyprus", "BIO-1001", 1, "Mike",
                     1854, "Hailey", "Simpson", "BIO-1001", 0, "Jessica",
                     1855, "Tamika", "Simpson", "CHEM-1002", 1, "Hannah",
                     1856, "Jessica", "Simpson", "ARCH-1003", 1, "Hailey",
                     1857, "Mike", "Cyprus", "STA-1002", 0, "Jack",
                     1858, "Hannah", "Simpson", "EDF-1005", 1, "Tamika")
For our function to be able to work, it needs to be able to take an input, in this case UFID
Let’s run our same function again, but adding id to the parentheses, saying that it takes id (which will be a UFID) as the only input
welcome <- function(id) { print("Welcome to UF!") }
welcome()
[1] "Welcome to UF!"
Quick question: It ran, but why did this not change anything?
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  print(student)
}
welcome(1853)
# A tibble: 1 × 6
ufid name dorm first_class meal_plan roommate
<dbl> <chr> <chr> <chr> <dbl> <chr>
1 1853 Jack Cyprus BIO-1001 1 Mike
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  paste("Welcome to UF", name)
}
welcome(1853)
[1] "Welcome to UF Jack"
Okay, now we’re getting somewhere!
Let’s add a bit more info to say where they live and what their first class will be
welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)
  name <- student |> pull(name)
  dorm <- student |> pull(dorm)
  first_class <- student |> pull(first_class)
  paste("Welcome to UF", name, "you will be living in", dorm, "and your first class is", first_class)
}
welcome(1853)
[1] "Welcome to UF Jack you will be living in Cyprus and your first class is BIO-1001"
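Functions and loops also combine naturally: the loop supplies each input in turn and the function does the work. The sketch below is a simplified stand-in, not the lesson's welcome() function. It uses a small base R data frame called students (a made-up subset of fake_data) and a hypothetical function welcome_name(), with base R subsetting in place of filter() and pull() so that it runs without loading any packages.

```r
# Nonce data mirroring two columns of fake_data (a base data.frame here)
students <- data.frame(
  ufid = c(1853, 1854, 1855),
  name = c("Jack", "Hailey", "Tamika")
)

# Hypothetical simplified welcome function: one id in, one message out
welcome_name <- function(id) {
  # base R equivalent of filter(ufid == id) |> pull(name)
  name <- students$name[students$ufid == id]
  paste("Welcome to UF", name)
}

# The loop hands each id to the function and collects the messages
messages <- character(0)
for (id in students$ufid) {
  messages <- c(messages, welcome_name(id))
}

print(messages)
```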
Quick exercise: Add to the above block of code, to also say who their roommate is
First, we list the files we want to read, using the list.files() function with full.names = TRUE
Quick Exercise
Try removing full.names = TRUE, see what the differences are, and think about why we need to include it
files <- list.files("data/sch-test/by-school",
                    full.names = T)
for(i in files) {
print(i)
}
[1] "data/sch-test/by-school/bend-gate-1980.csv"
[1] "data/sch-test/by-school/bend-gate-1981.csv"
[1] "data/sch-test/by-school/bend-gate-1982.csv"
[1] "data/sch-test/by-school/bend-gate-1983.csv"
[1] "data/sch-test/by-school/bend-gate-1984.csv"
[1] "data/sch-test/by-school/bend-gate-1985.csv"
[1] "data/sch-test/by-school/east-heights-1980.csv"
[1] "data/sch-test/by-school/east-heights-1981.csv"
[1] "data/sch-test/by-school/east-heights-1982.csv"
[1] "data/sch-test/by-school/east-heights-1983.csv"
[1] "data/sch-test/by-school/east-heights-1984.csv"
[1] "data/sch-test/by-school/east-heights-1985.csv"
[1] "data/sch-test/by-school/niagara-1980.csv"
[1] "data/sch-test/by-school/niagara-1981.csv"
[1] "data/sch-test/by-school/niagara-1982.csv"
[1] "data/sch-test/by-school/niagara-1983.csv"
[1] "data/sch-test/by-school/niagara-1984.csv"
[1] "data/sch-test/by-school/niagara-1985.csv"
[1] "data/sch-test/by-school/spottsville-1980.csv"
[1] "data/sch-test/by-school/spottsville-1981.csv"
[1] "data/sch-test/by-school/spottsville-1982.csv"
[1] "data/sch-test/by-school/spottsville-1983.csv"
[1] "data/sch-test/by-school/spottsville-1984.csv"
[1] "data/sch-test/by-school/spottsville-1985.csv"
Okay, that looks like what we want, a list of files to read in
Now let’s try reading these
for(i in files) {
read_csv(i)
}
Okay, that read the files, but we didn’t save them anywhere, so we need to assign them
Now, this gets a little tricky: <- always assigns to the exact name we type, so inside a loop the same object would be overwritten on every pass, and we need two extra steps to give each file its own name
First, we read it in mostly like normal, into something called file, which is temporary and gets overwritten on each pass through the loop
Second, we make another temporary object, name, which holds the name we want to assign the data frame to: data_<name of the file>
Third, we use assign(), which does what <- would do in normal circumstances, but takes the object’s name from a string, so each file is kept under its own name after the loop
name <- read_csv() becomes file <- read_csv() then assign(name, file)
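Before applying this to the files, here is a minimal sketch of the assign() step on its own, with nonce vectors standing in for the data frames that read_csv() would return. The list scores, its numbers, and the school labels are made up for illustration.

```r
# Nonce values standing in for data read from two files
scores <- list(alpha = c(1, 2, 3), beta = c(4, 5, 6))

for (school in names(scores)) {
  values <- scores[[school]]        # temporary, overwritten on each pass
  name <- paste0("data_", school)   # the object name, built as a string
  assign(name, values)              # create an object with that name
}

# Each pass created its own object
print(exists("data_alpha"))
print(data_beta)
```

Writing name <- values inside the loop would instead leave a single object called name holding only the last set of values.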
for(i in files) {
  file <- read_csv(i)
  name <- paste0("data_", i)
  assign(name, file)
}
Cool! That seems to have worked, but our data_ names are really, really long, which probably isn’t that useful for future analysis
So, instead of simply pasting together data_ and i, let’s use our friend regular expressions to get something more usable
Get the school name (anything that matches the options we give)
Get the year (any digits)
Paste those together with data_ to get our nicer object names
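To see what those three steps produce, here is a sketch applied to a single path taken from the file list printed earlier, using str_extract() from stringr (loaded as part of the tidyverse). The variable path is introduced here only for illustration.

```r
library(stringr)

# One path from the list of files printed earlier
path <- "data/sch-test/by-school/niagara-1980.csv"

# School name: the first of the options that matches
school <- str_extract(path, "niagara|bend|east|spot")

# Year: the first run of digits
year <- str_extract(path, "\\d+")

# Paste together with data_ for the object name
name <- paste0("data_", school, year)
print(name)
```

For this path the steps give data_niagara1980, which is much shorter than the full path pasted onto data_.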
for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
Much better!
One last thing, instead of reading each of those into a new data frame, we could append them all together
In some cases you may need to join data instead. Review Data Wrangling II for a refresher on how to append and join data appropriately
To do this, we first need to set up an empty data frame (a.k.a., “tibble”); otherwise we will get an error the first time through the loop, as we would be attempting to bind something that doesn’t exist
Then, we simply run the loop like before, but use bind_rows() to stack each file onto the existing data frame
data_bind <- tibble()
for(i in files) {
  file <- read_csv(i)
  data_bind <- bind_rows(data_bind, file)
}
Great, that worked!
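The same pattern can be tried without any files. The sketch below uses two nonce tibbles in place of the data read by read_csv(); the list pieces, the school name, and the scores are made up, and not_created_yet is deliberately an object that does not exist. It first shows the failure when there is nothing to bind onto, then the empty-tibble approach.

```r
library(dplyr)

# Nonce data frames standing in for the files read by read_csv()
pieces <- list(
  tibble(school = "Nonce High", year = 1980, math = 500),
  tibble(school = "Nonce High", year = 1981, math = 510)
)

# Binding onto an object that was never created fails
attempt <- tryCatch(
  bind_rows(not_created_yet, pieces[[1]]),
  error = function(e) "error: nothing to bind onto"
)
print(attempt)

# Starting from an empty tibble gives the first pass something to bind onto
data_bind <- tibble()
for (piece in pieces) {
  data_bind <- bind_rows(data_bind, piece)
}
print(nrow(data_bind))
```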
Finally, what if we wanted to do this only for certain schools?
Hint: this might be useful in the homework
With a loop, we do something for each item in the list
So, to do something only for certain schools, we want to change the list, not the loop
In this case, we can add a pattern to our list.files() function, saying to only list files that match that pattern
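To try the pattern argument without the course data, the sketch below creates three empty files in a temporary folder and lists them with and without a pattern. The folder name by-school-sketch and the object names tmp, all_files, and only_niagara are made up for illustration; the file names copy the style of the real ones.

```r
# A temporary folder with three empty files named like the real ones
tmp <- file.path(tempdir(), "by-school-sketch")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(
  tmp,
  c("niagara-1980.csv", "niagara-1981.csv", "bend-gate-1980.csv")
)))

# Every file in the folder
all_files <- list.files(tmp, full.names = TRUE)

# Only the files whose names match the pattern
only_niagara <- list.files(tmp, full.names = TRUE, pattern = "niagara")

print(length(all_files))
print(basename(only_niagara))
```

Because pattern is a regular expression, it can also describe more than one school at once.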
files_niagara <- list.files("data/sch-test/by-school",
                            full.names = T,
                            pattern = "niagara")
data_niagara <- tibble()
for(i in files_niagara) {
  file <- read_csv(i)
  data_niagara <- bind_rows(data_niagara, file)
}
print(data_niagara)
# A tibble: 6 × 5
school year math read science
<chr> <dbl> <dbl> <dbl> <dbl>
1 Niagara 1980 514 292 787
2 Niagara 1981 499 268 762
3 Niagara 1982 507 310 771
4 Niagara 1983 497 301 814
5 Niagara 1984 483 311 818
6 Niagara 1985 489 275 805
In this lesson we have mostly print()-ed our output, because it’s one of the easiest ways to see what’s going on
But, you can use functions and loops for other things too, like modifying variables, reading data, etc.
When writing your own, start by print()-ing what you are looping through, then make it more sophisticated from there
Loops and Functions can be tricky, so don’t worry if you can’t get everything now
You might see some variations of the loops we learned today, such as
while() loops
for() loops that use the index (line) number of the list instead
What’s important is that the underlying logic remains the same
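For reference, here is a rough sketch of both variations applied to class_list from earlier. The names by_index and by_while are made up for illustration; seq_along() is a base R function that returns the positions 1, 2, 3, and so on for a vector.

```r
class_list <- c("Let's", "go", "Gators", "!")

# for() loop over index numbers: i is a position, class_list[i] is the item
by_index <- character(0)
for (i in seq_along(class_list)) {
  by_index <- c(by_index, paste(i, class_list[i]))
}
print(by_index)

# while() loop: keeps going for as long as the condition is TRUE
i <- 1
by_while <- character(0)
while (i <= length(class_list)) {
  by_while <- c(by_while, class_list[i])
  i <- i + 1   # without this line the loop would never stop
}
print(by_while)
```

In both cases the body inside { } is still repeated once per item, which is the same logic as the loops above.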
a) Using a for() loop, read in the school test files from data/sch-test/by-school/.
b) Do the same, but only for Bend Gate and Niagara
Hint: Look inside the list.files() help file for somewhere you might use some regex
c) Do it one more time, but include some code that adds a column called file_path, which contains the location where each row’s information is coming from
d) Bind the data sets from part c) together
d) Bind the data sets from part c) together
a) Read in hsls-small.dta
b) Write a function that
takes stu_id as the input
paste()s back a sentence which says if() they ever attended college, else() whether their parent expected them to go to college
if() the student went to college, adds a second sentence that says how many months they had between high school and going to college
Once complete, turn in the .qmd file (it must render/run) and the rendered PDF to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.