IV: Tidyverse Tricks & SQL

R Code

  • In this brand new lesson for 2024, we are going to cover two main topics:
    1. Some more advanced tidyverse commands that can save a lot of time
    2. How the tidyverse we have learned so far relates to SQL (Structured Query Language) commonly used to access databases

Tidyverse Tricks

  • Now that you have a decent grasp of the core tidyverse functions, we can start exploring some helper functions that can make our lives much easier!
  • These commands have been chosen as I have personally found them wildly helpful in working with IPEDS and other data
    • To demonstrate, we are going to use IPEDS finance data files for 2018/19 school year
    • Notice: IPEDS uses separate data files for public (_f1a), non-profit (_f2), and for-profit (_f3) colleges
df_18_pub <- read_csv(file.path("data", "ipeds-finance", "f1819_f1a_rv.csv"))
df_18_np <- read_csv(file.path("data", "ipeds-finance", "f1819_f2_rv.csv"))
df_18_fp <- read_csv(file.path("data", "ipeds-finance", "f1819_f3_rv.csv"))
  • First, since no college should appear in more than one of these data files, and each college has only one row, we can use bind_rows() to stack them on top of one another
    • Then run a quick test to check no UNITID appears more than once (no duplicates)
df_18 <- bind_rows(df_18_pub, df_18_np, df_18_fp)

df_18 |>
  count(UNITID) |>
  filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: UNITID <dbl>, n <int>
  • Notice: we now have 6069 obs. of 663 variables
    • Does this pass the “eyeball test”?
      • The number of obs. looks right as it is just sum of the obs. from the 3 dfs
      • The number of vars. may look concerning at first
        • Why, if all we did was “stack” rows on top of each other, would the number of vars increase?
  • What we have created is a somewhat “sparse” data frame, as each institution category has differently named variables (in part due to differing reporting requirements)

Our data looks something like this

Depiction of Sparse DataFrame
                   Public Vars   Non-Profit Vars   For-Profit Vars
  Public IDs       values        NA                NA
  Non-Profit IDs   NA            values            NA
  For-Profit IDs   NA            NA                values
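A toy example makes this stacking behavior concrete (all data and column names below are made up for illustration, not real IPEDS variables):

```r
library(dplyr)

## Two tiny frames sharing an ID column but with otherwise different
## variable names, mimicking the public vs. non-profit IPEDS files
pub <- tibble(UNITID = c(1, 2), F1_inst = c(100, 200))
np  <- tibble(UNITID = c(3, 4), F2_inst = c(300, 400))

## bind_rows() keeps every column from every frame and fills the gaps
## with NA, producing a "sparse" layout
stacked <- bind_rows(pub, np)
print(stacked)
```

With 2 rows and 1 variable apiece (plus the shared ID), the stacked frame has 4 rows and 3 columns, half of them NA.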
  • Hmm, sounds like this could get tricky… Luckily tidyverse is here to help!

coalesce() Data Split Across Columns

  • Let’s say we want to combine this information into one variable to show how much all institutions spend on instruction, research, and student services, respectively
    • Side note: If combining variables like this in your own work, check the code book to ensure you know what you’re combining and why
    • In this case, the variables we will be combining appear to be close equivalents
  • First things first, let’s select() the relevant variables from the larger data set using the IPEDS dictionary files
    • Quick question: How did I find the variable names to use below?
df_18 <- df_18 |>
  select(UNITID,
         F1C011, F1C021, F1C061,
         F2E011, F2E021, F2E051,
         F3E011, F3E02A1, F3E03B1)

print(df_18[100:105,])
# A tibble: 6 × 10
  UNITID    F1C011  F1C021   F1C061 F2E011 F2E021 F2E051 F3E011 F3E02A1 F3E03B1
   <dbl>     <dbl>   <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 109819  81760530       0 13575072     NA     NA     NA     NA      NA      NA
2 109907   8880360  241606  5093396     NA     NA     NA     NA      NA      NA
3 110246  40754283       0 13656135     NA     NA     NA     NA      NA      NA
4 110334  54226032       0 18432007     NA     NA     NA     NA      NA      NA
5 110398  28937539       0  5591440     NA     NA     NA     NA      NA      NA
6 110422 219559005 2847727 78536402     NA     NA     NA     NA      NA      NA
print(df_18[3000:3005,])
# A tibble: 6 × 10
  UNITID F1C011 F1C021 F1C061   F2E011 F2E021   F2E051 F3E011 F3E02A1 F3E03B1
   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>   <dbl>   <dbl>
1 206862     NA     NA     NA 17364863      0  8130862     NA      NA      NA
2 207157     NA     NA     NA  1055254      0   975876     NA      NA      NA
3 207324     NA     NA     NA 16367709      0  9449815     NA      NA      NA
4 207403     NA     NA     NA 14538443      0 10456812     NA      NA      NA
5 207458     NA     NA     NA 39875546      0 12083472     NA      NA      NA
6 207582     NA     NA     NA 26366669 101944 12445590     NA      NA      NA
  • Luckily, tidyverse has a command to help us: coalesce()
    • Returns the first non-missing value across any number of columns
      • This works perfectly in situations like this, where each row has exactly one data point that could sit in any of several columns

Let’s coalesce() all the columns and get a clean data frame

df_18_clean <- df_18 |>
  mutate(inst_spend = coalesce(F1C011, F2E011, F3E011),
         rsch_spend = coalesce(F1C021, F2E021, F3E02A1),
         serv_spend = coalesce(F1C061, F2E051, F3E03B1)) |>
  select(UNITID, inst_spend, rsch_spend, serv_spend)

print(df_18_clean[100:105,])
# A tibble: 6 × 4
  UNITID inst_spend rsch_spend serv_spend
   <dbl>      <dbl>      <dbl>      <dbl>
1 109819   81760530          0   13575072
2 109907    8880360     241606    5093396
3 110246   40754283          0   13656135
4 110334   54226032          0   18432007
5 110398   28937539          0    5591440
6 110422  219559005    2847727   78536402
print(df_18_clean[3000:3005,])
# A tibble: 6 × 4
  UNITID inst_spend rsch_spend serv_spend
   <dbl>      <dbl>      <dbl>      <dbl>
1 206862   17364863          0    8130862
2 207157    1055254          0     975876
3 207324   16367709          0    9449815
4 207403   14538443          0   10456812
5 207458   39875546          0   12083472
6 207582   26366669     101944   12445590
  • This is a real time saver if working with IPEDS finance data
  • Another use case: imagine 30 columns, one for each test date, with a row of scores for each student. Every student took the test on exactly one date, so you want a single score column, but each score could have been recorded under any of the date columns
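That hypothetical test-date scenario might be sketched like this (student names, column names, and scores all invented):

```r
library(dplyr)

## Each student took the test on exactly one of three dates, so their
## score sits in exactly one of the three date columns (toy data)
scores <- tibble(student = c("A", "B", "C"),
                 date1   = c(90, NA, NA),
                 date2   = c(NA, 85, NA),
                 date3   = c(NA, NA, 78))

## coalesce() grabs the first non-missing value for each row
clean <- scores |>
  mutate(score = coalesce(date1, date2, date3)) |>
  select(student, score)
print(clean)
```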

Finding if_any() Issues

  • Although compliance is a federal requirement, completeness of reporting remains a struggle when using IPEDS data
  • From the snapshots of the data we have seen so far, it seems that there’s a good number of $0 reports for these values
  • Let’s try and get a data frame containing only institutions that reported $0 for one of these spending categories
df_0 <- df_18_clean |>
  filter(if_any(everything(), ~ . == 0)) ## h/t https://stackoverflow.com/questions/69585261/dplyr-if-any-and-numeric-filtering

print(df_0)
# A tibble: 4,803 × 4
   UNITID inst_spend rsch_spend serv_spend
    <dbl>      <dbl>      <dbl>      <dbl>
 1 100733          0          0          0
 2 100760    8740330          0    3605430
 3 100812   17101084          0    2918417
 4 101028    6298586          0    2062946
 5 101143    7532616          0    3094312
 6 101161   23308825          0    7205781
 7 101240   20593662          0    7278193
 8 101286   17872473          0    4563448
 9 101295   20945006          0    6225596
10 101301    5373481          0    3216811
# ℹ 4,793 more rows
  • Let’s walk through this code

    1. Assign our results to a new object called df_0
    2. Take df_18_clean and pipe it into filter()
    3. Inside filter() we have our if_any() helper function, which takes two arguments
       1. .cols: which columns to look across - Here we have just gone for all columns with everything(), but you could input a list of specific column names, or another selection function like where(is.numeric)
       2. .fns: a function to test the columns against - Here we have used ~ . == 0
      • We haven’t used purrr from tidyverse in this class, but the ~ comes from there; in short, it starts a very simple function for us
        • The function takes any value . and asks if it equals 0 (== 0)
        • If the function returns TRUE (i.e., the value equals 0) for any column, that row will be filter()-ed in
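The same pattern in a minimal, self-contained sketch (toy data; in recent dplyr versions you can also spell the anonymous function as \(x) x == 0):

```r
library(dplyr)

## Toy spending data: rows 2 and 3 each have a zero somewhere
spend <- tibble(id = 1:3,
                a  = c(5, 0, 2),
                b  = c(1, 3, 0))

## Keep any row where at least one of the listed columns is 0
zeros <- spend |>
  filter(if_any(c(a, b), ~ . == 0))
print(zeros)
```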

Working across() Multiple Columns

  • Now let’s explore that data a little more with if_any()’s sister function across()
    • Internally if_any() and across() are set up the same
      • They both take
        1. .cols which columns to look across
        2. .fns function to test the columns against
    • The difference between them comes down to which functions they work in
      • if_any() is used in a handful of functions like filter()
      • across() is used in most functions like summarize() and mutate()
        • If you try to use across() where you aren’t meant to, tidyverse will throw you a warning
  • Here, we will use the across() function inside count() to get a breakdown of which spending categories are unreported most often
df_0 |>
  select(-UNITID) |>
  count(across(everything(), ~ . == 0))
# A tibble: 7 × 4
  inst_spend rsch_spend serv_spend     n
  <lgl>      <lgl>      <lgl>      <int>
1 FALSE      FALSE      TRUE          12
2 FALSE      TRUE       FALSE       4411
3 FALSE      TRUE       TRUE         336
4 TRUE       FALSE      FALSE          5
5 TRUE       FALSE      TRUE           2
6 TRUE       TRUE       FALSE         12
7 TRUE       TRUE       TRUE          25
  • The internal logic of the across() here is identical to the if_any() above
    • everything() works across all columns, ~ . == 0 is a simple function to test if any value equals 0
  • across() just works in the count() function
    • Outputs a count table counting combinations of the variables equaling 0
      • By far the most common variable to report zero is research spending, with 4411 reporting only that variable as 0
      • 25 schools reported 0 for all three variables
  • Although we’ve only been working across 3 variables in these examples, the power of these commands is that they can work across an unlimited number of columns, so the bigger your data, the more you should be thinking across() and if_any()
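For instance, across() inside summarize() applies one function to every matched column at once; a small sketch with made-up data:

```r
library(dplyr)

spend <- tibble(grp = c("x", "x", "y"),
                a   = c(1, 3, 5),
                b   = c(2, 4, 6))

## One across() call computes the mean of every numeric column per group;
## .names controls how the new columns are named
sums <- spend |>
  group_by(grp) |>
  summarize(across(where(is.numeric), \(x) mean(x), .names = "{.col}_m"))
print(sums)
```

Whether the data has 2 numeric columns or 200, the across() call stays one line.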

Moving Beyond ifelse() with case_when()

  • Continuing to dig into differences in spending categories, let’s say we want to create a new variable that records the order of the three spending categories for each school

  • Let’s walk through the case_when() code

    1. Much like we used ifelse() inside mutate() to make a new variable in Data Wrangling I, we can use case_when() when we have more than a binary test
       • case_when() goes down the list of conditions in order until it finds one that answers TRUE, at which point it returns the value on the right-hand side of the ~
       • Here we have listed out all possible orders of spending categories with a label for each scenario
    2. Unlike ifelse(), there is a unique danger that you don’t cover every eventuality with your conditions
       • This is why I like to end with TRUE, as that will be TRUE no matter what
         • I then assign some kind of catch-all phrase that tells me I made an error (it will just be NA if you don’t do this)
    3. To check whether a case_when() worked, I like to pipe it into a count() of the new variable we made
df_18_clean |>
  mutate(highest_cat = case_when(inst_spend > rsch_spend & rsch_spend > serv_spend ~ "inst_rsch_serv",
                                 inst_spend > serv_spend & serv_spend > rsch_spend ~ "inst_serv_rsch",
                                 rsch_spend > inst_spend & inst_spend > serv_spend ~ "rsch_inst_serv",
                                 rsch_spend > serv_spend & serv_spend > inst_spend ~ "rsch_serv_inst",
                                 serv_spend > inst_spend & inst_spend > rsch_spend ~ "serv_inst_rsch",
                                 serv_spend > rsch_spend & rsch_spend > inst_spend ~ "serv_rsch_inst",
                                 TRUE ~ "You missed a condition Matt")) |>
  count(highest_cat)
# A tibble: 7 × 2
  highest_cat                     n
  <chr>                       <int>
1 You missed a condition Matt   377
2 inst_rsch_serv                262
3 inst_serv_rsch               5004
4 rsch_inst_serv                 35
5 rsch_serv_inst                  4
6 serv_inst_rsch                384
7 serv_rsch_inst                  3
  • Looks like I missed something in my case_when(), can anyone guess what it is?

  • What would happen to this school

    • inst_spend 35,000,000
    • rsch_spend 20,000,000
    • serv_spend 20,000,000
      • As I only specified each ordering with strict “greater than” comparisons, situations where two categories are equal slip through the cracks
        • This was a genuine mistake when writing the lesson, precisely why I always include that catch all
  • Let’s see how our results change if I use “greater than or equal to” signs

df_18_clean |>
  mutate(highest_cat = case_when(inst_spend >= rsch_spend & rsch_spend >= serv_spend ~ "inst_rsch_serv",
                                 inst_spend >= serv_spend & serv_spend >= rsch_spend ~ "inst_serv_rsch",
                                 rsch_spend >= inst_spend & inst_spend >= serv_spend ~ "rsch_inst_serv",
                                 rsch_spend >= serv_spend & serv_spend >= inst_spend ~ "rsch_serv_inst",
                                 serv_spend >= inst_spend & inst_spend >= rsch_spend ~ "serv_inst_rsch",
                                 serv_spend >= rsch_spend & rsch_spend >= inst_spend ~ "serv_rsch_inst",
                                 TRUE ~ "You missed a condition Matt")) |>
  count(highest_cat)
# A tibble: 6 × 2
  highest_cat        n
  <chr>          <int>
1 inst_rsch_serv   624
2 inst_serv_rsch  5005
3 rsch_inst_serv    37
4 rsch_serv_inst     4
5 serv_inst_rsch   396
6 serv_rsch_inst     3
  • No missing conditions this time, hooray!
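The first-match behavior (and why the catch-all matters) can be seen in a tiny example; the data and labels here are invented:

```r
library(dplyr)

df <- tibble(a = c(3, 2, 2), b = c(1, 2, 3))

## case_when() checks conditions top to bottom and returns the value for
## the first TRUE; the final TRUE ~ branch is the catch-all for anything
## the earlier conditions miss
labeled <- df |>
  mutate(order = case_when(a > b  ~ "a_first",
                           a == b ~ "tie",
                           TRUE   ~ "b_first"))
print(labeled)
```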

  • Also, I’m definitely surprised how few institutions spend most on research, and just how many spend instruction > services > research

    • Does this surprise you as well?

Tidyverse Tricks Summary

  • I hope some of these more advanced tidyverse commands will prove helpful, particularly as you move into the world of bigger data sets with more variables!
  • Next, we are going to revisit some code from Data Wrangling I and Data Wrangling II and see how it translates to SQL

From tidyverse to SQL

  • For this section, we are going to see how tasks from Data Wrangling II could be performed using SQL, a common tool used to work with databases

What is SQL?

  • SQL, short for Structured Query Language, is a way of retrieving, modifying, and storing data from databases. For a solid background page on SQL see this overview from AWS
  • One thing that confuses people about SQL is that there’s the base SQL language that all SQL-based products share, and then multiple commercial implementations that take the base language and add additional, unique functions
    • The dbplyr package we are going to use below attempts to mimic the implementation we tell it to
  • A key difference between R and SQL is that in SQL we don’t have the option to assign our results to an object like we do in R
    • Instead, an SQL query is simply the code to retrieve the data, what happens to that data (e.g., does it show up on your screen, is it saved somewhere, etc.) will be determined by the SQL-based software you’re using

How does SQL Relate to this Class?

  • First, SQL is a really important tool for data management jobs in higher education. Larger institutions like UF almost certainly store their institutional data in a database that uses SQL
    • However, it is rarely taught in Higher Ed degree programs, so this little intro already sets you up for success
  • Second, the tidyverse language we have been learning this semester is in many ways similar to SQL
    • In fact, a lot of the functions and function names in tidyverse come directly from SQL
    • There’s an entire package dbplyr which can translate our standard dplyr commands to SQL queries
      • That’s what we are going to play around with today!

Data Wrangling II in SQL

  • To explore SQL in a familiar environment, we are going to re-visit Data Wrangling II
    • While more complicated data wrangling is certainly possible in SQL, the most common use in education is to pull or “query” data out of institutional databases with some simple summaries or joins
  • First, we have to simulate an SQL database for dbplyr
    • This tells dbplyr exactly how to translate our code
    • For today’s class, we will simulate a Microsoft Access database with the simulate_access() command
  • Second, instead of a normal df, we want R to pretend that df is a table in a database, which we do with the memdb_frame() command
    • If you’re curious what this command does, try print(df) and print(db) to see the difference
df <- read_csv(file.path("data", "sch-test", "all-schools.csv"))

microsoft_access <- simulate_access()

db <- memdb_frame(df)
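If you want to see the lazy-evaluation difference for yourself, here is a minimal self-contained sketch (toy data and column names invented; memdb_frame() requires the RSQLite package to be installed):

```r
library(dplyr)
library(dbplyr)  # memdb_frame() needs the RSQLite package installed

## Tiny stand-in for the test-score data, copied into an in-memory
## SQLite database table
db <- memdb_frame(year = c(1980, 1980, 1981),
                  math = c(510, 490, 500))

qry <- db |>
  group_by(year) |>
  summarize(math_m = mean(math)) |>
  arrange(year)

## show_query() only prints the SQL translation; nothing has run yet
show_query(qry)

## collect() actually executes the query and pulls the result into R
res <- collect(qry)
print(res)
```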

Create Summary Table

  • Our first command is pretty simple, we want to group our data by year then calculate the mean test score, which will give us average test scores for each year
df_sum <- db |>
    ## grouping by year so average within each year
    group_by(year) |>
    ## get mean(<score>) for each test
    summarize(math_m = mean(math),
              read_m = mean(read),
              science_m = mean(science)) |>
  show_query()
<SQL>
SELECT
  `year`,
  AVG(`math`) AS `math_m`,
  AVG(`read`) AS `read_m`,
  AVG(`science`) AS `science_m`
FROM `dbplyr_oyyxBbLE68`
GROUP BY `year`
  • We start with SELECT
    • We list out what we want, which is the year variable and AVG()s of math, reading, and science test scores
  • We are taking them FROM our simulated database table, here given the auto-generated name dbplyr_oyyxBbLE68, which we want GROUP BY-ed year
    • As SQL code is nested, we can’t read it top to bottom: the GROUP BY takes place as the data is pulled out of the table, before we calculate the AVG()s

Left-Join

  • Next, we want to join these averages back into the main data frame
df_joined <- db |>
    ## pipe into left_join to join with df_sum using "year" as key
    left_join(df_sum, by = "year") |>
  show_query()
<SQL>
SELECT `dbplyr_oyyxBbLE68`.*, `math_m`, `read_m`, `science_m`
FROM `dbplyr_oyyxBbLE68`
LEFT JOIN (
  SELECT
    `year`,
    AVG(`math`) AS `math_m`,
    AVG(`read`) AS `read_m`,
    AVG(`science`) AS `science_m`
  FROM `dbplyr_oyyxBbLE68`
  GROUP BY `year`
) AS `RHS`
  ON (`dbplyr_oyyxBbLE68`.`year` = `RHS`.`year`)
  • We see our preceding query nested inside our new query, from the second SELECT statement to the GROUP BY statement
    • That whole subquery becomes RHS inside our LEFT JOIN statement
  • We see this is joined ON the year variable with the original table dbplyr_oyyxBbLE68

Pivot-Longer

  • Our next query pivots the data longer, which, remember from our original lesson, takes the math, reading, and science score columns and turns them into one column for the test type and one column for the score
df_long <- db |>
    ## cols: current test columns
    ## names_to: where "math", "read", and "science" will go
    ## values_to: where the values in cols will go
    pivot_longer(cols = c("math","read","science"),
                 names_to = "test",
                 values_to = "score") |>
  show_query()
<SQL>
SELECT `school`, `year`, 'math' AS `test`, `math` AS `score`
FROM `dbplyr_oyyxBbLE68`

UNION ALL

SELECT `school`, `year`, 'read' AS `test`, `read` AS `score`
FROM `dbplyr_oyyxBbLE68`

UNION ALL

SELECT `school`, `year`, 'science' AS `test`, `science` AS `score`
FROM `dbplyr_oyyxBbLE68`
  • What we see here is a bit different, as it’s more manual
  • First we SELECT school and year as they are, the literal string 'math' as test, and the math column as score
  • This is then UNION ALL-ed (think bind_rows()-style stacking) with the same query for reading and then again for science

Pivot-Wider

  • Next, let’s pivot that data back wide
df_wide <- df_long |>
    ## names_from: values in this column will become new column names
    ## values_from: values in this column will become values in new cols
    pivot_wider(names_from = "test",
                values_from = "score") |>
  show_query()
<SQL>
SELECT
  `school`,
  `year`,
  MAX(CASE WHEN (`test` = 'math') THEN `score` END) AS `math`,
  MAX(CASE WHEN (`test` = 'read') THEN `score` END) AS `read`,
  MAX(CASE WHEN (`test` = 'science') THEN `score` END) AS `science`
FROM (
  SELECT `school`, `year`, 'math' AS `test`, `math` AS `score`
  FROM `dbplyr_oyyxBbLE68`

  UNION ALL

  SELECT `school`, `year`, 'read' AS `test`, `read` AS `score`
  FROM `dbplyr_oyyxBbLE68`

  UNION ALL

  SELECT `school`, `year`, 'science' AS `test`, `science` AS `score`
  FROM `dbplyr_oyyxBbLE68`
) AS `q01`
GROUP BY `school`, `year`
  • Our first SELECT statement asks for school and year, then…
  • We use CASE WHEN for each type of test, creating a new variable for each subject WHEN the test type equaled that subject
  • Beneath that, we see our previous query that creates the long data that we pivoted back wider

There is no homework assignment associated with this lesson. Instead, your initial analysis for your reproducible report is due Sunday at 11:59pm. See the final project for details.