Functions & Loops

Lesson
Assignment

library(tidyverse)

Warning: package 'purrr' was built under R version 4.4.1

Warning: package 'lubridate' was built under R version 4.4.1

In this lesson, we are going to cover two fundamental ideas to more advanced programming
- Loops
  - Do something (or multiple things) for each value in a list
- Creating Functions
  - Do something (or multiple things) using things I give you

Review of Assignment `<-`

In essence, both of these skills are built off something we have been doing this whole class, assignment
We have assigned data

data <- haven::read_dta("data/hsls-small.dta")

We have assigned plots

plot <- ggplot(data) +
  geom_histogram(aes(x = x1txmtscor))

We have assigned summary tables

data_sum <- data |>
  summarize(mean = mean(x1txmtscor, na.rm = T))

We even assigned results

uf_age <- 2024 - 1853

Everything we are going to cover today comes back to this basic principle, things being assigned names
- We will be working with more than one assigned or named thing at once, which gets confusing at first, but it always comes back to this idea

Why Write Loops & Functions: DRY vs WET Code

The watchwords for this lesson are DRY vs WET:

DRY: Don’t repeat yourself
WET: Write every time

Let’s say you have a three-step analysis process for 20 files (read, lower names, add a column). Under a WET programming paradigm in which each command gets its own line of code, that’s 60 lines of code. If the number of your files grows to 50, that’s now 150 lines of code — for just three tasks! When you write every time, you not only make your code longer and harder to parse, you also increase the likelihood that your code will contain bugs while simultaneously decreasing its scalability.

If you need to repeat an analytic task (which may be a set of commands), then it’s better to have one statement of that process that you repeat, perhaps in a loop or in a function. Don’t repeat yourself — say it once and have R repeat it for you!

The goal of DRY programming is not abstraction or slickness for its own sake. That runs counter to the clarity and replicability we’ve been working toward. Instead, we aspire to DRY code since it is more scalable and less buggy than WET code. To be clear, a function or loop can still have bugs, but the bugs it introduces are often the same across repetitions and fixed at a single point of error. That is, it’s typically easier to debug when the bug has a single root cause than when it could be anywhere in 150 similar but slightly different lines of code.

As we work through the lesson examples, keep in the back of your mind:

What would this code look like if I wrote everything twice (WET)?
How does this DRY process not only reduce the number of lines of code, but also make my intent clearer?

Setup

We’ll use a combination of nonce data and the school test score data we’ve used in a past lesson. We won’t read in the school test score data until the last section, but we’ll continue following our good organizational practice by setting the directory paths at the top of our script.

`for()` Loops

The idea of loops if relatively simple
- Take a list of things
  - for(i in class_list) {
- Do a set of things
  - print(i) }
Wait, but what’s i
- i is the most common word to use here, but we could call it anything
  - It is just the name we are assigning to the item (think i for item) in the list
Okay, but what’s { and }
- Since we are doing one or more things for each i in the list

class_list <- c("Let's", "go", "Gators", "!")

for(i in class_list) { print(i) }

[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"

But, we can use anything we want instead of i

for(word in class_list) { print(word) }

[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"

Literally anything

for(gator_egg in class_list) { print(gator_egg) }

[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"

All we are doing is assigning a name to the item in the list
We can do the name thing with numbers

gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

for(i in gators_points_23) { print(i) }

[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15

Again, we can literally use anything as the name

for(billy_napier in gators_points_23) { print(billy_napier) }

[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15

Quick exercise

Create a list of the names of every school you’ve attended, then use a for loop to print them out

Adding `if()` and `else()` to Loops

Loops that print things are all well and good, but really we want to be able to do a little more than that
We are going to use if() and else() to that
- Remember ifelse() from Data Wrangling?
- This is just splitting that up, if() something is true, do this, else() do that
  - Let’s just start with an if()

for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  }
}

[1] 49
[1] 38
[1] 41
[1] 36
[1] 35
[1] 31

Notice we only got scores if they were above 30
Next, we can add an else() to say what to do if the score was not above 30

for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  } else {
    print(i)
  }
}

[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15

Quick Question: Is that the same list we had before? Why or why not?

Let’s see how we can make it different
- We are going to use a new command paste() which combines strings, then print() that
  - Top tip: You’re going to want to use paste() in your assignment as well

for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    print(i)
  }
}

[1] 11
[1] "Yay the Gators scored 49 points, which is more than 30!"
[1] 29
[1] 22
[1] 14
[1] "Yay the Gators scored 38 points, which is more than 30!"
[1] "Yay the Gators scored 41 points, which is more than 30!"
[1] 20
[1] "Yay the Gators scored 36 points, which is more than 30!"
[1] "Yay the Gators scored 35 points, which is more than 30!"
[1] "Yay the Gators scored 31 points, which is more than 30!"
[1] 15

Then we can extend that same logic to the else() statement

for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay, the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    paste("Sad times, the Gators only scored", i, "points...") |> print()
  }
}

[1] "Sad times, the Gators only scored 11 points..."
[1] "Yay, the Gators scored 49 points, which is more than 30!"
[1] "Sad times, the Gators only scored 29 points..."
[1] "Sad times, the Gators only scored 22 points..."
[1] "Sad times, the Gators only scored 14 points..."
[1] "Yay, the Gators scored 38 points, which is more than 30!"
[1] "Yay, the Gators scored 41 points, which is more than 30!"
[1] "Sad times, the Gators only scored 20 points..."
[1] "Yay, the Gators scored 36 points, which is more than 30!"
[1] "Yay, the Gators scored 35 points, which is more than 30!"
[1] "Yay, the Gators scored 31 points, which is more than 30!"
[1] "Sad times, the Gators only scored 15 points..."

This may seem a little silly right now, but this fun example was just meant to show the basics of we are doing
We will cover these in a more serious way at the end of the lesson

Writing Your Own Functions

Functions work much the same way as loops, whatever we say inside { } is done
The difference is that instead of doing it for each item in a list, we do it for a single input
We also have to use the function for it to work
- We have been using functions all semester filter(), summarize(), mutate() are all functions just like the one we are going to make
To demonstrate, let’s make a function that prints a welcome message for students arriving at UF

welcome <- function() { print("Welcome to UF!") }

welcome()

[1] "Welcome to UF!"

To do this, we need some data, so we are just going to make some up
- tribble() is just a way of making a tidyverse data frame, don’t worry about it for now, it’s not the main idea for the lesson

fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class, ~meal_plan, ~roommate,
                     1853, "Jack", "Cyprus", "BIO-1001", 1, "Mike",
                     1854, "Hailey", "Simpson", "BIO-1001", 0, "Jessica",
                     1855, "Tamika", "Simpson", "CHEM-1002", 1, "Hannah",
                     1856, "Jessica", "Simpson", "ARCH-1003", 1, "Hailey",
                     1857, "Mike", "Cyrpus", "STA-1002", 0, "Jack",
                     1858, "Hannah", "Simpson", "EDF-1005", 1, "Tamika")

For our function to be able to work, it needs to be able to take an input, in this case UFID
- You can imagine a much more sophisticated version of this could be used for automated dorm check in
Let’s run our same function again, but adding ufid to the brackets, saying that it takes ufid as the only input

welcome <- function(id) { print("Welcome to UF!") }

welcome()

[1] "Welcome to UF!"

Quick question: It ran, but why did this not change anything?

welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  print(student)
  
}

welcome(1853)

# A tibble: 1 × 6
   ufid name  dorm   first_class meal_plan roommate
  <dbl> <chr> <chr>  <chr>           <dbl> <chr>   
1  1853 Jack  Cyprus BIO-1001            1 Mike

Okay so that ran, but it spat a data frame, how can we make it more of a welcome message?

welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  name <- student |> pull(name)
  
  paste("Welcome to UF", name)
  
}

welcome(1853)

[1] "Welcome to UF Jack"

Okay, now we’re getting somewhere!
Let’s add a bit more info to say where they live and what their first class will be

welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  name <- student |> pull(name)
  dorm <- student |> pull(dorm)
  first_class <- student |> pull(first_class)
  
  paste("Welcome to UF", name, "you will be living in", dorm, "and your first class is", first_class)
  
}

welcome(1853)

[1] "Welcome to UF Jack you will be living in Cyprus and your first class is BIO-1001"

Quick exercise: Add to the above block of code, to also say who their roommate is

Practical Example: Batch Reading Files

Remember, to use a loop, we need a list to loop through
To get that, let’s use the list.files() function
- By default, this will list files in the current working directory
- If we want to list files in a different folder, we need to say where using a relative path from the working directory
  - For this, we want to list the by school files we used in Data Wrangling II, so we give it a path to that folder
- We also include the argument full.names = TRUE

Quick Exercise

Try removing the full.names = TRUE see what the differences are, and think why we need to include it

files <- list.files("data/sch-test/by-school",
                    full.names = T)

As a starting point, let’s print out what this list looks like using a loop just like before

for(i in files) {
  print(i)
}

[1] "data/sch-test/by-school/bend-gate-1980.csv"
[1] "data/sch-test/by-school/bend-gate-1981.csv"
[1] "data/sch-test/by-school/bend-gate-1982.csv"
[1] "data/sch-test/by-school/bend-gate-1983.csv"
[1] "data/sch-test/by-school/bend-gate-1984.csv"
[1] "data/sch-test/by-school/bend-gate-1985.csv"
[1] "data/sch-test/by-school/east-heights-1980.csv"
[1] "data/sch-test/by-school/east-heights-1981.csv"
[1] "data/sch-test/by-school/east-heights-1982.csv"
[1] "data/sch-test/by-school/east-heights-1983.csv"
[1] "data/sch-test/by-school/east-heights-1984.csv"
[1] "data/sch-test/by-school/east-heights-1985.csv"
[1] "data/sch-test/by-school/niagara-1980.csv"
[1] "data/sch-test/by-school/niagara-1981.csv"
[1] "data/sch-test/by-school/niagara-1982.csv"
[1] "data/sch-test/by-school/niagara-1983.csv"
[1] "data/sch-test/by-school/niagara-1984.csv"
[1] "data/sch-test/by-school/niagara-1985.csv"
[1] "data/sch-test/by-school/spottsville-1980.csv"
[1] "data/sch-test/by-school/spottsville-1981.csv"
[1] "data/sch-test/by-school/spottsville-1982.csv"
[1] "data/sch-test/by-school/spottsville-1983.csv"
[1] "data/sch-test/by-school/spottsville-1984.csv"
[1] "data/sch-test/by-school/spottsville-1985.csv"

Okay, that looks like what we want, a list of files to read in
Now let’s try reading these

for(i in files) {
  read_csv(i)
}

Okay, that read the files, but we didn’t save them anywhere, we need to assign them
Now, this gets a little tricky, as anything we assign with <- in a loop only exists in a loop, so we need two extra steps
- First, we read it in mostly like normal, into something called file , which is temporary and only exists within the loop
- Second, we make another temporary object, which is the name we want to assign the data frame to
  - For now, let’s just use data_<name of the file>
- Third, we use assign(), which basically does what <- would do in normal circumstances, but keep the object after the loop
  - name <- read_csv() becomes file <- read_csv() then assign(name, file)

for(i in files) {
  file <- read_csv(i)
  name <- paste0("data_", i)
  assign(name, file)
}

Cool! That seems to have worked, but, our data_ names are really really long, which probably isn’t that useful for future analysis
So, instead of simply pasting together data_ and i, let’s use our friend regular expressions to get something more usable
- Get the school name (anything that matches the options we give)
- Get the year (any digits)
- Paste those together with data_ to get our nicer object names

for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("data_", school, year)
  file <- read_csv(i)
  assign(name, file)
}

Much better!
One last thing, instead of reading each of those into a new data frame, we could append them all together
- Note: This is only appropriate here due to the nature of the school data, with each data frame having the school name and year in it, other times we may want to join data instead. Review Data Wrangling II for a refresher on how to append and join data appropriately
To do this, we first need to set up an empty data frame (a.k.a., “tibble”), as otherwise we will get an error the first time through the loop, as we would be attempting to bind something that doesn’t exist
Then, we simply run the loop like before, but use bind_rows() to stack each file onto the existing list

data_bind <- tibble()

for(i in files) {
  file <- read_csv(i)
  data_bind <- bind_rows(data_bind, file)
}

Great, that worked!
Finally, what if we wanted to this only for certain schools?

Hint: this might be useful in the homework

With a loop, we do something for each item in the list
So do something only for certain schools, we want to change the list, not the loop
- In this case, we can add a pattern to our list.files() function saying to only list files that match that pattern
  - This could be some fancy regex, but in our case, we just need any files that have the word “niagara” it their name

files_niagara <- list.files("data/sch-test/by-school",
                            full.names = T,
                            pattern = "niagara")

data_niagara <- tibble()

for(i in files_niagara) {
  file <- read_csv(i)
  data_niagara <- bind_rows(data_niagara, file)
}

Let’s see what our final output looks like

print(data_niagara)

# A tibble: 6 × 5
  school   year  math  read science
  <chr>   <dbl> <dbl> <dbl>   <dbl>
1 Niagara  1980   514   292     787
2 Niagara  1981   499   268     762
3 Niagara  1982   507   310     771
4 Niagara  1983   497   301     814
5 Niagara  1984   483   311     818
6 Niagara  1985   489   275     805

Perfect!

Summary

In this lesson we have mostly print()-ed our output in this lesson, because it’s one of the easiest ways to see what’s going on
But, you can use functions and loops for other things too, like modifying variables, reading data, etc.
- This said, whenever you are writing a loop, you can always start by print()-ing what you are looping through, then make it more sophisticated from there
Loops and Functions can be tricky, so don’t worry if you can’t get everything now
- While confusing, they are super useful in the real world
You might see some variations of the loops we learned today such as
- while() loops
- for() loops that use the index (line) number of the list instead
- What’s important is that the underlying logic remains the same

Assignment Template

Question One

a) Using a for() loop, read in the school test files from data/sch-test/by-school/.

b) Do the same, but, only for Bend Gate and Niagara

Hint: Looking inside the list.files() help file for somewhere you might use some regex

c) Do it one more time, but include some code that add a column called file_path which contains the location where each row’s information is coming from

d) Bind the data sets from part c) together

Question Two

a) Read in hsls-small.dta

b) Write a function that

Takes a stu_id as the input
paste()s back a sentence which says if() they ever attended college else() whether their parent expected them to go to college

If a student went to college it should return something like “this student went to college”
If they didn’t, it should return something like “this student did not go to college, their students did/did not expect them to”

Optional: if() the student went to college, add a second sentence that says how many months they had between high school and going to college
Super-Optional: Following from that, edit the second sentence to include how that student’s delay compares to the average delay (you can pick, mean or median)

Submission

Once complete turn in the .qmd file (it must render/run) and the rendered PDF to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.

Review of Assignment <-