I: Functions & Loops

library(tidyverse)
  • In this lesson, we are going to cover two ideas that are fundamental to more advanced programming

    • Loops

      • Do something (or multiple things) for each value in a list
    • Creating Functions

      • Do something (or multiple things) using things I give you

Review of Assignment <-

  • In essence, both of these skills are built off something we have been doing this whole class, assignment

  • We have assigned data

df <- haven::read_dta("data/hsls-small.dta")
  • We have assigned plots
plot <- ggplot(df) +
  geom_histogram(aes(x = x1txmtscor))
  • We have assigned summary tables
df_sum <- df |>
  summarize(mean = mean(x1txmtscor, na.rm = TRUE))
  • We even assigned results
uf_age <- 2024 - 1853
  • Everything we are going to cover today comes back to this basic principle, things being assigned names

    • We will be working with more than one assigned or named thing at once, which gets confusing at first, but it always comes back to this idea

Why Write Loops & Functions: DRY vs WET Code

The watchwords for this lesson are DRY vs WET:

  • DRY: Don’t repeat yourself
  • WET: Write every time

Let’s say you have a three-step analysis process for 20 files (read, lower names, add a column). Under a WET programming paradigm in which each command gets its own line of code, that’s 60 lines of code. If the number of your files grows to 50, that’s now 150 lines of code — for just three tasks! When you write every time, you not only make your code longer and harder to parse, you also increase the likelihood that your code will contain bugs while simultaneously decreasing its scalability.

If you need to repeat an analytic task (which may be a set of commands), then it’s better to have one statement of that process that you repeat, perhaps in a loop or in a function. Don’t repeat yourself — say it once and have R repeat it for you!

The goal of DRY programming is not abstraction or slickness for its own sake. That runs counter to the clarity and replicability we’ve been working toward. Instead, we aspire to DRY code since it is more scalable and less buggy than WET code. To be clear, a function or loop can still have bugs, but the bugs it introduces are often the same across repetitions and fixed at a single point of error. That is, it’s typically easier to debug when the bug has a single root cause than when it could be anywhere in 150 similar but slightly different lines of code.

As we work through the lesson examples, keep in the back of your mind:

  1. What would this code look like if I wrote everything twice (WET)?
  2. How does this DRY process not only reduce the number of lines of code, but also make my intent clearer?
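
To make the three-step example above (read, lower names, add a column) concrete, here is a minimal sketch of the DRY version. The temporary files, their column names, and the added source column are all made up for illustration; the loop itself is explained step by step later in the lesson.

```r
library(tidyverse)

# Hypothetical setup: write three tiny CSV files to a fresh temporary folder
tmp_dir <- tempfile("dry-example-")
dir.create(tmp_dir)
for (n in 1:3) {
  write_csv(tibble(ID = n, SCORE = n * 10),
            file.path(tmp_dir, paste0("file-", n, ".csv")))
}

files <- list.files(tmp_dir, full.names = TRUE)

# DRY: the three steps are written once and the loop repeats them per file
df_all <- tibble()
for (f in files) {
  temp <- read_csv(f, show_col_types = FALSE) |>  # step 1: read
    rename_with(tolower) |>                       # step 2: lower the column names
    mutate(source = basename(f))                  # step 3: add a column
  df_all <- bind_rows(df_all, temp)
}

print(df_all)
```

The WET version of this would need the three steps typed out once per file, so going from 3 files to 50 would change the length of the script. Here, only the contents of the folder change.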

Setup

We’ll use a combination of nonce data and the school test score data we’ve used in a past lesson. We won’t read in the school test score data until the last section, but we’ll continue following our good organizational practice by setting the directory paths at the top of our script.
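
As a minimal sketch of what setting directory paths at the top of a script could look like: the object names below are hypothetical, and the folder names are taken from the paths used later in this lesson.

```r
# Hypothetical object names; paths are relative to the working directory
dat_dir <- file.path("data")                 # top-level data folder
sch_dir <- file.path(dat_dir, "sch-test")    # school test score data
bys_dir <- file.path(sch_dir, "by-school")   # one file per school and year

print(bys_dir)
```

Building paths with file.path() means that if a folder moves, only one line at the top of the script needs to change.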

for() Loops

  • The idea of loops is relatively simple

    • Take a list of things

      • for(i in matts_list) {
    • Do a set of things

      • print(i) }
  • Wait, but what’s i?

    • i is the most common word to use here, but we could call it anything

      • It is just the name we are assigning to the item (think i for item) in the list
  • Okay, but what are { and }?

    • Since we are doing one or more things for each i in the list, the braces mark where that set of things starts and ends
matts_list <- c("Let's", "go", "Gators", "!")

for(i in matts_list) { print(i) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
  • But, we can use anything we want instead of i
for(word in matts_list) { print(word) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
  • Literally anything
for(gator_egg in matts_list) { print(gator_egg) }
[1] "Let's"
[1] "go"
[1] "Gators"
[1] "!"
  • All we are doing is assigning a name to the item in the list

  • We can do the same thing with numbers

gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

for(i in gators_points_23) { print(i) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
  • Again, we can literally use anything as the name
for(billy_napier in gators_points_23) { print(billy_napier) }
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15
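
The loops so far only print each item. As a small sketch of a loop that does something with each value instead, the example below keeps a running total of the points; the object name total is just an illustrative choice.

```r
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

total <- 0                      # start the running total at zero
for (i in gators_points_23) {
  total <- total + i            # add this game's points to the total
}

print(total)
```

In practice sum(gators_points_23) gives the same number in one step; the loop is only here to show that i stands in for each value in turn.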

Quick exercise

Create a list of the names of every school you’ve attended, then use a for loop to print them out

Adding if() and else to Loops

  • Loops that print things are all well and good, but really we want to be able to do a little more than that

  • We are going to use if() and else to do that

    • Remember ifelse() from Data Wrangling?

    • This is just splitting that up: if() something is true, do this; else, do that

      • Let’s just start with an if()
for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  }
}
[1] 49
[1] 38
[1] 41
[1] 36
[1] 35
[1] 31
  • Notice we only got scores if they were above 30

  • Next, we can add an else to say what to do if the score was not above 30

for(i in gators_points_23) {
  if(i > 30) {
    print(i)
  } else {
    print(i)
  }
}
[1] 11
[1] 49
[1] 29
[1] 22
[1] 14
[1] 38
[1] 41
[1] 20
[1] 36
[1] 35
[1] 31
[1] 15

Quick Question: Is that the same list we had before? Why or why not?

  • Let’s see how we can make it different

    • We are going to use a new command paste() which combines strings, then print() that

      • Top tip: You’re going to want to use paste() in your homework as well
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    print(i)
  }
}
[1] 11
[1] "Yay the Gators scored 49 points, which is more than 30!"
[1] 29
[1] 22
[1] 14
[1] "Yay the Gators scored 38 points, which is more than 30!"
[1] "Yay the Gators scored 41 points, which is more than 30!"
[1] 20
[1] "Yay the Gators scored 36 points, which is more than 30!"
[1] "Yay the Gators scored 35 points, which is more than 30!"
[1] "Yay the Gators scored 31 points, which is more than 30!"
[1] 15
  • Then we can extend that same logic to the else statement
for(i in gators_points_23) {
  if(i > 30) {
    paste("Yay, the Gators scored", i, "points, which is more than 30!") |> print()
  } else {
    paste("Sad times, the Gators only scored", i, "points...") |> print()
  }
}
[1] "Sad times, the Gators only scored 11 points..."
[1] "Yay, the Gators scored 49 points, which is more than 30!"
[1] "Sad times, the Gators only scored 29 points..."
[1] "Sad times, the Gators only scored 22 points..."
[1] "Sad times, the Gators only scored 14 points..."
[1] "Yay, the Gators scored 38 points, which is more than 30!"
[1] "Yay, the Gators scored 41 points, which is more than 30!"
[1] "Sad times, the Gators only scored 20 points..."
[1] "Yay, the Gators scored 36 points, which is more than 30!"
[1] "Yay, the Gators scored 35 points, which is more than 30!"
[1] "Yay, the Gators scored 31 points, which is more than 30!"
[1] "Sad times, the Gators only scored 15 points..."
  • This may seem a little silly right now, but this fun example was just meant to show the basics of what we are doing

  • We will cover these in a more serious way at the end of the lesson
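
For comparison with the ifelse() function mentioned at the start of this section, here is a sketch of the same messages built without a loop. This is an illustration rather than a replacement for the loop: ifelse() checks every score at once and returns one message per score.

```r
gators_points_23 <- c(11, 49, 29, 22, 14, 38, 41, 20, 36, 35, 31, 15)

# ifelse(test, value if TRUE, value if FALSE), applied to every score at once
messages <- ifelse(gators_points_23 > 30,
                   paste("Yay, the Gators scored", gators_points_23, "points, which is more than 30!"),
                   paste("Sad times, the Gators only scored", gators_points_23, "points..."))

print(messages)
```

The if and else version inside a loop is more flexible, because each branch can contain several lines of code rather than a single value.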

Writing Your Own Functions

  • Functions work much the same way as loops, whatever we say inside { } is done

  • The difference is that instead of doing it for each item in a list, we do it for a single input

  • We also have to call the function for it to run

    • We have been using functions all semester: filter(), summarize(), and mutate() are all functions, just like the one we are going to make
  • To demonstrate, let’s make a function that prints a welcome message for students arriving at UF

welcome <- function() { print("Welcome to UF!") }

welcome()
[1] "Welcome to UF!"
  • To make the message personal, we need some data, so I am just going to make some up

    • tribble() is just a way of making a tidyverse data frame by typing it out row by row; don’t worry about it for now, it’s not the main idea for the lesson
fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class, ~meal_plan, ~roommate,
                     1853, "Jack", "Cyprus", "BIO-1001", 1, "Mike",
                     1854, "Hailey", "Simpson", "BIO-1001", 0, "Jessica",
                     1855, "Tamika", "Simpson", "CHEM-1002", 1, "Hannah",
                     1856, "Jessica", "Simpson", "ARCH-1003", 1, "Hailey",
                     1857, "Mike", "Cyprus", "STA-1002", 0, "Jack",
                     1858, "Hannah", "Simpson", "EDF-1005", 1, "Tamika")
  • For our function to be able to work, it needs to be able to take an input, in this case UFID

    • You can imagine a much more sophisticated version of this could be used for automated dorm check in
  • Let’s run our same function again, but adding id inside the parentheses, saying that it takes id (the student’s UFID) as its only input

welcome <- function(id) { print("Welcome to UF!") }

welcome()
[1] "Welcome to UF!"

Quick question: It ran, but why did this not change anything?

welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  print(student)
  
}

welcome(1853)
# A tibble: 1 × 6
   ufid name  dorm   first_class meal_plan roommate
  <dbl> <chr> <chr>  <chr>           <dbl> <chr>   
1  1853 Jack  Cyprus BIO-1001            1 Mike    
  • Okay, so that ran, but it spat out a data frame. How can we make it more of a welcome message?
welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  name <- student |> pull(name)
  
  paste("Welcome to UF", name)
  
}

welcome(1853)
[1] "Welcome to UF Jack"
  • Okay, now we’re getting somewhere!

  • Let’s add a bit more info to say where they live and what their first class will be

welcome <- function(id) {
  
  student <- fake_data |> filter(ufid == id)
  
  name <- student |> pull(name)
  dorm <- student |> pull(dorm)
  first_class <- student |> pull(first_class)
  
  paste("Welcome to UF", name, "you will be living in", dorm, "and your first class is", first_class)
  
}

welcome(1853)
[1] "Welcome to UF Jack you will be living in Cyprus and your first class is BIO-1001"
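
Loops and functions also work together. As a sketch, the block below calls a short version of welcome() once for every student by looping over the IDs. The two-row data frame is a cut-down copy of fake_data so the example runs on its own.

```r
library(tidyverse)

# Cut-down copy of fake_data, just enough for this example
fake_data <- tribble(~ufid, ~name, ~dorm, ~first_class,
                     1853, "Jack", "Cyprus", "BIO-1001",
                     1854, "Hailey", "Simpson", "BIO-1001")

welcome <- function(id) {
  student <- fake_data |> filter(ufid == id)  # keep only this student's row
  name <- student |> pull(name)               # pull out their name
  paste("Welcome to UF", name)
}

ids <- fake_data |> pull(ufid)  # the list of IDs to loop through

for (i in ids) {
  print(welcome(i))             # the function is written once and called per ID
}
```

Note that print() is needed inside the loop: a function's result is only shown automatically when it is called directly at the console.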

Quick exercise: Add to the above block of code, to also say who their roommate is

Practical Example: Batch Reading Files

  • Remember, to use a loop, we need a list to loop through
  • To get that, let’s use the list.files() function
    • By default, this will list files in the current working directory
    • If we want to list files in a different folder, we need to say where using a relative path from the working directory
      • For this, we want to list the by-school files we used in Data Wrangling II, so we give it a path to that folder
    • We also include the argument full.names = TRUE

Quick Exercise

Try removing full.names = TRUE, see what the difference is, and think about why we need to include it

files <- list.files("data/sch-test/by-school",
                    full.names = TRUE)
  • As a starting point, let’s print out what this list looks like using a loop just like before
for(i in files) {
  print(i)
}
[1] "data/sch-test/by-school/bend-gate-1980.csv"
[1] "data/sch-test/by-school/bend-gate-1981.csv"
[1] "data/sch-test/by-school/bend-gate-1982.csv"
[1] "data/sch-test/by-school/bend-gate-1983.csv"
[1] "data/sch-test/by-school/bend-gate-1984.csv"
[1] "data/sch-test/by-school/bend-gate-1985.csv"
[1] "data/sch-test/by-school/east-heights-1980.csv"
[1] "data/sch-test/by-school/east-heights-1981.csv"
[1] "data/sch-test/by-school/east-heights-1982.csv"
[1] "data/sch-test/by-school/east-heights-1983.csv"
[1] "data/sch-test/by-school/east-heights-1984.csv"
[1] "data/sch-test/by-school/east-heights-1985.csv"
[1] "data/sch-test/by-school/niagara-1980.csv"
[1] "data/sch-test/by-school/niagara-1981.csv"
[1] "data/sch-test/by-school/niagara-1982.csv"
[1] "data/sch-test/by-school/niagara-1983.csv"
[1] "data/sch-test/by-school/niagara-1984.csv"
[1] "data/sch-test/by-school/niagara-1985.csv"
[1] "data/sch-test/by-school/spottsville-1980.csv"
[1] "data/sch-test/by-school/spottsville-1981.csv"
[1] "data/sch-test/by-school/spottsville-1982.csv"
[1] "data/sch-test/by-school/spottsville-1983.csv"
[1] "data/sch-test/by-school/spottsville-1984.csv"
[1] "data/sch-test/by-school/spottsville-1985.csv"
  • Okay, that looks like what we want, a list of files to read in

  • Now let’s try reading these

for(i in files) {
  read_csv(i)
}
  • Okay, that read the files, but we didn’t save them anywhere, so we need to assign them

  • Now, this gets a little tricky: <- needs a fixed name on its left-hand side, but we want a different name on each pass through the loop, so we need three steps

    • First, we read the data in mostly like normal, into something called file, which is temporary because it gets overwritten on every pass through the loop

    • Second, we make another temporary object, name, which holds (as a string) the name we want to assign the data frame to

      • For now, let’s just use df_<name of the file>
    • Third, we use assign(), which does what <- would normally do, but takes the object’s name from the string stored in name

      • name <- read_csv() becomes file <- read_csv() then assign(name, file)
for(i in files) {
  file <- read_csv(i)
  name <- paste0("df_", i)
  assign(name, file)
}
  • Cool! That seems to have worked, but our df_ names are really long (each one contains the full file path), which probably isn’t that useful for future analysis

  • So, instead of simply pasting together df_ and i, let’s use our friend regular expressions to get something more usable

    • Get the school name (anything that matches the options we give)

    • Get the year (any digits)

    • Paste those together with df_ to get our nicer object names

for(i in files) {
  school <- str_extract(i, "niagara|bend|east|spot")
  year <- str_extract(i, "\\d+")
  name <- paste0("df_", school, year)
  file <- read_csv(i)
  assign(name, file)
}
  • Much better!
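
To see what each step of the naming produces without needing the data files, here is a small sketch that runs the same steps on two of the paths printed earlier. The one-column data frame is a made-up stand-in for read_csv(i), so no files are read.

```r
library(stringr)

paths <- c("data/sch-test/by-school/bend-gate-1980.csv",
           "data/sch-test/by-school/niagara-1981.csv")

for (i in paths) {
  school <- str_extract(i, "niagara|bend|east|spot")  # first school keyword in the path
  year <- str_extract(i, "\\d+")                      # first run of digits in the path
  name <- paste0("df_", school, year)                 # e.g. "df_bend1980"
  file <- data.frame(path = i)                        # stand-in for read_csv(i)
  assign(name, file)                                  # create an object named by the string in name
}

print(ls(pattern = "^df_"))  # list objects whose names start with df_
print(df_niagara1981)
```

After the loop, df_bend1980 and df_niagara1981 exist as separate objects and can be used by name like any other data frame.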

  • One last thing, instead of reading each of those into a new data frame, we could bind them all together

    • Note: This is only appropriate here due to the nature of the school data, with each data frame having the school name and year in it, other times we may want to join data instead. Review Data Wrangling II for a refresher on how to bind and join data appropriately
  • To do this, we first need to set up an empty data frame (a.k.a., “tibble”), as otherwise we will get an error the first time through the loop, as we would be attempting to bind something that doesn’t exist

  • Then, we simply run the loop like before, but use bind_rows() to stack each file onto the existing data frame

df_bind <- tibble()

for(i in files) {
  file <- read_csv(i)
  df_bind <- bind_rows(df_bind, file)
}
  • Great, that worked!
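
To show what happens on each pass of a bind_rows() loop, here is the same idea unrolled by hand. The two small tibbles are stand-ins for files, with values copied from the first two Niagara rows shown in the output at the end of this section.

```r
library(tidyverse)

year_1 <- tibble(school = "Niagara", year = 1980, math = 514)
year_2 <- tibble(school = "Niagara", year = 1981, math = 499)

df_bind <- tibble()                     # empty starting point: 0 rows, 0 columns
df_bind <- bind_rows(df_bind, year_1)   # first pass: df_bind now matches year_1
df_bind <- bind_rows(df_bind, year_2)   # second pass: year_2 is stacked underneath

print(df_bind)
```

Binding onto an empty tibble simply returns the other data frame, which is why it works as a starting point for the loop.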

  • Finally, what if we wanted to do this only for certain schools?

Hint: this might be useful in the homework

  • With a loop, we do something for each item in the list

  • So, to do something only for certain schools, we want to change the list, not the loop

    • In this case, we can add a pattern to our list.files() function saying to only list files that match that pattern

      • This could be some fancy regex, but in our case, we just need any files that have the word “niagara” in their name
files_niagara <- list.files("data/sch-test/by-school",
                            full.names = TRUE,
                            pattern = "niagara")

df_niagara <- tibble()

for(i in files_niagara) {
  file <- read_csv(i)
  df_niagara <- bind_rows(df_niagara, file)
}
  • Let’s see what our final output looks like
print(df_niagara)
# A tibble: 6 × 5
  school   year  math  read science
  <chr>   <dbl> <dbl> <dbl>   <dbl>
1 Niagara  1980   514   292     787
2 Niagara  1981   499   268     762
3 Niagara  1982   507   310     771
4 Niagara  1983   497   301     814
5 Niagara  1984   483   311     818
6 Niagara  1985   489   275     805
  • Perfect!
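
Because pattern is a regular expression, it can be as simple or as specific as needed. A quick sketch using empty files with made-up names in a temporary folder:

```r
# Hypothetical files in a fresh temporary folder; they are empty, only the names matter
tmp_dir <- tempfile("pattern-example-")
dir.create(tmp_dir)
invisible(file.create(file.path(tmp_dir, c("niagara-1980.csv", "niagara-1981.csv",
                                           "bend-gate-1980.csv", "notes.txt"))))

only_niagara <- list.files(tmp_dir, pattern = "niagara")   # names containing "niagara"
only_csv <- list.files(tmp_dir, pattern = "\\.csv$")       # names ending in ".csv"

print(only_niagara)
print(only_csv)
```

Here full.names is left at its default, so only the file names are returned; to read the files in a loop you would add full.names = TRUE as in the lesson code.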

Summary

  • In this lesson we have mostly print()-ed our output, because it’s one of the easiest ways to see what’s going on

  • But, you can use functions and loops for other things too, like modifying variables, reading data, etc.

    • This said, whenever I am writing a loop, I usually start by print()-ing what I am looping through, then make it more sophisticated from there
  • Loops and Functions were something that I didn’t fully grasp until I had been using R for about a year, so I don’t expect you to get everything

    • While confusing, they are super useful in the real world
  • You might see some variations of the loops we learned today such as

    • while() loops

    • for() loops that use the index (line) number of the list instead

    • What’s important is that the underlying logic remains the same

  • Time permitting, let’s take a look at a few examples of my code that uses loops and functions!
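
As a rough sketch of the two loop variations mentioned in the summary above (these are illustrations, not something needed for the assignment): a for() loop over index numbers, and a while() loop that repeats until a condition stops being true.

```r
matts_list <- c("Let's", "go", "Gators", "!")

# for() loop over the index numbers (1, 2, 3, 4) rather than the items themselves
for (i in seq_along(matts_list)) {
  print(paste(i, matts_list[i]))   # matts_list[i] is the item at position i
}

# while() loop: keeps running as long as the condition is TRUE
count <- 1
while (count <= length(matts_list)) {
  print(matts_list[count])
  count <- count + 1               # without this line the loop would never end
}
```

In both cases the logic is the same as before: a set of things inside { } is repeated, and only the rule for what to repeat over changes.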

Assignment

  1. Using a for() loop, do the following:

    • Read in each of the individual school test files for Bend Gate and Niagara only.

      HINT: The vertical bar, |, means OR in regular expression patterns

    • Add some code to the loop which adds a column to the data frame that is called relative_path and contains, as a string, the relative path to the data file you just read in

      • i.e., there should be a new variable in the data frame which says where each row comes from
    • Bind all the data sets together

  2. Read in hsls-small.dta

  3. Write your own function that

    • Takes a stu_id as the input

    • paste()s back a sentence which says if() they ever attended college else() whether their parent expected them to go to college

      • If a student went to college it should return something like “this student went to college”

      • If they didn’t, it should return something like “this student did not go to college, their parents did/did not expect them to”

    • Optional: if() the student went to college, add a second sentence that says how many months they had between high school and going to college

    • Super-Optional: Following from that, edit the second sentence to include how that student’s delay compares to the average delay (you can pick, mean or median)

Once complete, turn in the .R script (no data etc.) to Canvas by the due date (Sunday 11:59pm following the lesson). Assignments will be graded on the following Monday (time permitting) in line with the grading policy outlined in the syllabus.

Solution

## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Functions & Loops Solution]
##' [INIT: March 18 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------

setwd(this.path::here())

## ---------------------------
##' [Libraries]
## ---------------------------

library(tidyverse)

## ---------------------------
##' [Q1]
## ---------------------------

files_to_read <- list.files("data/sch-test/by-school", # look in this folder
                            full.names = TRUE, # we want to keep the full path, not just the file names
                            pattern = "bend|niagara") # only list files that contain either "bend" or "niagara" 

data <- tibble() # create a blank tibble to store our data in

for(i in files_to_read) { # on each pass through the loop, i becomes the next item from the list of file paths we created above
  
  temp_data <- read_csv(i) |> # read in the file i 
    mutate(relative_path = i) # make a new variable relative_path that stores i (remember i is the path to the file, not the file itself)
  
  data <- bind_rows(data, temp_data) # bind data and the temp_data we just read in

}

## ---------------------------
##' [Q2]
## ---------------------------

df <- haven::read_dta("data/hsls-small.dta")

## ---------------------------
##' [Q3]
## ---------------------------

##'[Without the optional part]

id <- 10007 # Tip: if you're debugging a function or loop, manually assign something to input values (for a function) or i (for a loop) and then you can test it

did_they_go <- function(id) {
  
  student <- df |> filter(stu_id == id) # keep only the row for this student (so we can pull values from it)
  college <- student |> pull(x4evratndclg) # pull out whether the student attended college
  
  parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
  expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
    "and we do not know if their parents wanted them to" # assign this text to expect
  } else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
    "and their parents expected them to" # assign this text to expect
  } else if(parent < 5) { # otherwise if the parent expectation is less than 5
    "and their parents did not expect them to" # assign this text to expect
  }
  
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) { # if they went to college
    paste("This student went to college") # paste this message
  } else if(college == 0) { # if they did not go to college
    paste("This student never went to college", expect) # paste this message
    
  }
} 


## Test it using a for loop
test_ids <- df |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(did_they_go(i)) }


##'[With the optional part]

did_they_go <- function(id) {
  
  student <- df |> filter(stu_id == id) # keep only the row for this student (so we can pull values from it)
  college <- student |> pull(x4evratndclg) # pull out whether the student attended college
  
  parent <- student |> pull(x1paredexpct) # pull out the parent student expectation
  expect <- if(is.na(parent)|parent == 11) { # if the parent expectation is either NA or 11
    "and we do not know if their parents wanted them to" # assign this text to expect
  } else if(parent >= 5) { # otherwise if the parent expectation is 5 or above
    "and their parents expected them to" # assign this text to expect
  } else if(parent < 5) { # otherwise if the parent expectation is less than 5
    "and their parents did not expect them to" # assign this text to expect
  }
  
  ## NEW PART
  median_delay <- df |> summarize(median = median(x4hs2psmos, na.rm = T)) |> pull(median) # summarize the median delay and pull the value out (notice we start with df not student)
  delay <- student |> pull(x4hs2psmos) # pull out the student's delay in months
  difference <- delay - median_delay # calculate the difference between the student's delay and the median delay
  delay_statement <- if(is.na(delay)) { # if the students delay was NA (note, it doesn't matter if they didn't go to college, as we only use this if they went)
    "and we do not know how long they delayed college" # paste this message
  } else if(delay == 0) { # if the students didn't delay going to college
    "and they did not delay attending college at all" # paste this message
  } else if(difference == 0) { # if the students delay was the median
    paste("and they delayed college by", delay, "months which is the average amount of time") # paste this, using the value delay from above
  } else if(difference < 0) { # if the students delay was below the median
    paste("and they delayed college by", delay, "months which is", abs(difference), "months less than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
  } else if(difference > 0) { # if the students delay was above the median
    paste("and they delayed college by", delay, "months which is", abs(difference), "months more than the average") # paste this, using delay and abs(difference) which is the absolute value of the difference (i.e., the value regardless of positive vs negative)
  }
  
  if(is.na(college)) { # if whether they went to college is missing
    paste("We do not know if this student went to college") # print this message
  } else if(college == 1) { # if they went to college
    paste("This student went to college", delay_statement) # paste this message NEW added delay_statement
  } else if(college == 0) { # if they did not go to college
    paste("This student never went to college", expect) # paste this message
    
  }
} 

## Test it using a for loop
test_ids <- df |> slice_head(n = 50) |> pull(stu_id)
for(i in test_ids) { print(did_they_go(i)) }

## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------