I: Enter the tidyverse

R Code

  • Today we take our first full dive into R. Though every data wrangling / analysis project is unique, the best way to learn and practice R is to answer a question using data, which is how this lesson is set up
  • By the end of this lesson, you will have read in a data set, lightly cleaned it, produced results, and saved your findings in a new file
    • As part of this process, you will have practiced translating a research question into data analysis steps, which is a skill every bit as important as technical sophistication with a statistical language like R

Re: Urgent Data Question from the Provost

  • Through today’s lesson, we will explore some of the basics of data wrangling
    • But to make it more practical, we will be doing so to answer a realistic question you may be asked by your advisor or supervisor

Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.

  • A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
    • Instead, it’s up to you to
      1. translate a research question into the discrete coding steps necessary to provide an answer, and then
      2. translate the answer such that everyone understands what you’ve found

What we need to do is some combination of the following:

  1. Read in the data
  2. Select the variables we need
  3. Mutate a new value that’s the higher of student and parental degree expectations
  4. Filter out observations with missing degree expectation values
  5. Summarize the data within region to get average degree expectation values
  6. Write out the results to a file so we have it for later

Let’s do it!

  • Throughout this lesson (and class), we are going to lean heavily on the tidyverse collection of packages
    • If you don’t already have this installed, use install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Check working directory

  • This script — like the one from the organizing lesson — assumes that the class folder is the working directory and that the required data file is in the data sub-directory

  • If you need a refresher on setting the working directory, see the prior lesson.

  • Notice that I’m not setting (i.e. hard coding) the working directory in the script. That would not work well for sharing the code. Instead, I tell you where you need to be (a common landmark), let you get there, and then rely on relative paths afterwards
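
  • If you want to double check where R thinks you are, two base R functions help (the output will depend on your machine):

## print the current working directory
getwd()
## list the files in the data sub-directory; hsls-small.csv should be among them
list.files("data")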

Reading Data in with read_csv()

  • Our data come from the High School Longitudinal Study of 2009 (HSLS09): a nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016
  • Students followed throughout secondary and postsecondary years
  • Surveys of students, their parents, math and science teachers, school administrators, and school counselors
  • A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades
  • 10 state representative data sets
  • If you are interested in using HSLS09 for future projects, DO NOT rely on this subset; be sure to download the full data set with all relevant variables and weights. But for our purposes in this lesson, it will work just fine.

  • Throughout, we’ll need to consult the code book. An online version can be found at this link.

## data are CSV, so we use read_csv() from the readr library
df <- read_csv(file.path("data", "hsls-small.csv"))
Rows: 23503 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): stu_id, x1sex, x1race, x1stdob, x1txmtscor, x1paredu, x1hhnumber, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • You may notice that read_csv() prints out information about the data just read in. Nothing is wrong! The read_csv() function, like many other functions in the tidyverse, assumes you’d rather have more information than less and acts accordingly

  • Up to now, this should all seem fairly consistent with last week’s lesson. That was just the setup; now it’s time to dive in!

Pipe Operator |> in R

The pipe is one of the things that makes R code more intuitive than many other programming languages, as it allows us to write code in the order we think about it, passing output from one function to the next, rather than nesting function calls the way traditional code is written

For this example, don’t worry about the actual processes (we will go over them more below); just look at how much more intuitive the code is with |>s.

First, let’s say I want to take the data we just read in and select the x1txmtscor (math test scores) column

Without |>

## Without |>
select(df, x1txmtscor)

With |>

## With |>
df |> select(x1txmtscor)

Neither is that confusing… But, what if we want to take that output and select only students with a math score above 50?

Without |>

## Without |>
filter(select(df, x1txmtscor), x1txmtscor > 50)

With |>

## With |>
df |> select(x1txmtscor) |> filter(x1txmtscor > 50)

See how the non-piped version is getting messy? Let’s add one more level to really make the point, creating a new variable that is the square root of the test score

Without |>

## Without |>
mutate(filter(select(df, x1txmtscor), x1txmtscor > 50), square_root = sqrt(x1txmtscor))

With |>

## With |>
df |> select(x1txmtscor) |> filter(x1txmtscor > 50) |> mutate(square_root = sqrt(x1txmtscor))

As our pipes get longer, I’m going to use a new line for each step, just to make it even clearer (it makes no difference to R)

## Best to use a new line for each pipe when code gets longer
df |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))

Even though we haven’t covered any of these commands yet, I think (tell me if I’m wrong) that with |> it’s still pretty easy to get a rough sense of what’s going on, whereas the traditional nested way gets really tricky beyond a couple of commands.

Of course, if you wanted, you could do each step separately, constantly assigning and overwriting an object like below

## Without the |>, we could technically break it down step by step
temp <- select(df, x1txmtscor)
temp <- filter(temp, x1txmtscor > 50)
temp <- mutate(temp, square_root = sqrt(x1txmtscor))
temp

But again, I think it’s less intuitive than simply piping the results.

  • If we do want to see step-by-step output in a piped command, you can either
    1. Run the code as you write it line-by-line (usually what I do)
    2. Highlight sections of the piped code to run up to a point

Assigning Output from Pipes

  • Assigning output from pipes works the same way we have covered a few times already: we use <- to pass the result backwards to an object (in this case we called that object df_backward_pass)
## Always assign backwards
df_backward_pass <- df |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))
  • If you want to think of it this way, we are effectively continuing to pass it forward like this…
## You can think of the assignment as a continuation of the pipe like this
## but don't write it this way, it's then hard to find what you called something later
df |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor)) ->
  df_forward_pass
  • That’s how I first thought about how pipes and assignment work together, and you can see the outputs are all.equal()
## Checking they are the same
all.equal(df_backward_pass, df_forward_pass)
[1] TRUE
  • However, you really shouldn’t write code like this, as although it runs, it’s then hard to later find where you created df_forward_pass.
  • Starting the string of code with where you store the result is MUCH clearer.
    • But, it really helped me to think about it this way at first, so hopefully it will help some of you understand how |> and <- work together

A Brief History of the R Pipe


  • The pipe was originally a tidyverse tool (it comes from the magrittr package) and used the %>% symbol, which is probably still the most common pipe you’ll see “in the wild”
  • The pipe we are using, |>, was added to the vanilla (base) version of R a few years ago, in version 4.1.0
  • The reason for the change is some benefits that are beyond the scope of this class, but I wanted you to be aware that |> and %>% are essentially the same thing
    • You can intermingle them, but for clarity I would stick with |>
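
For example, with the tidyverse loaded, these two lines do exactly the same thing (the %>% version works because dplyr re-exports the magrittr pipe):

## the base R pipe
df |> select(x1txmtscor)
## the original tidyverse/magrittr pipe
df %>% select(x1txmtscor)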

Basic Tidyverse Commands

  • Now we have the pipe |> covered, it’s time to dig into some basic data wrangling commands

Selecting Variables/Columns with select()

  • Often data sets contain hundreds, thousands, or even tens of thousands of variables when we are only interested in a handful. Our first tidyverse command, select(), helps us deal with this by, as you may have guessed, select()-ing the variables we want
  • Since we are going to pipe our R commands, we start with our data then pipe it into the select() command
  • In the select() command, we list out the variables we want
    • stu_id, x1stuedexpct, x1paredexpct, x1region
      • Notice: in this (and most tidyverse) commands, we don’t have to use “quotes” around the variable names. A common error message when you did need to put something in quotes (but didn’t) is Error: object <thing you should have "quoted"> not found
df |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
# ℹ 23,493 more rows

Quick question: if we want to use this reduced data frame going forward, what should we do?

That’s right, assign it to an object! Let’s call that df_small

df_small <- df |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)
  • Our next step is to create a variable that’s the higher of student and parent expectations. Sounds simple enough, but first, we have a problem…
    • Can anyone guess what it is?

Understanding our data

First things first, however, we need to check the code book to see what the numerical values for our two education expectation variables represent. To save time, I’ve copied them here:

x1stuedexpct

How far in school 9th grader thinks he/she will get

value label
1 Less than high school
2 High school diploma or GED
3 Start an Associate’s degree
4 Complete an Associate’s degree
5 Start a Bachelor’s degree
6 Complete a Bachelor’s degree
7 Start a Master’s degree
8 Complete a Master’s degree
9 Start Ph.D/M.D/Law/other prof degree
10 Complete Ph.D/M.D/Law/other prof degree
11 Don’t know
-8 Unit non-response
x1paredexpct

How far in school parent thinks 9th grader will go

value label
1 Less than high school
2 High school diploma or GED
3 Start an Associate’s degree
4 Complete an Associate’s degree
5 Start a Bachelor’s degree
6 Complete a Bachelor’s degree
7 Start a Master’s degree
8 Complete a Master’s degree
9 Start Ph.D/M.D/Law/other prof degree
10 Complete Ph.D/M.D/Law/other prof degree
11 Don’t know
-8 Unit non-response
-9 Missing
  • The good news is that the categorical values are the same for both variables (meaning we can make an easy comparison) and move in a logical progression
  • The bad news is that we have three values — -8, -9, and 11 — that we need to deal with so that the averages we compute later represent what we mean

Exploring Categorical Variables with count()

  • First, let’s see how many observations are affected by these values using count()

Notice that we don’t assign to a new object; this means we’ll see the result in the console, but nothing in our data or object will change

## see unique values for student expectation
df_small |> count(x1stuedexpct)
# A tibble: 12 × 2
   x1stuedexpct     n
          <dbl> <int>
 1           -8  2059
 2            1    93
 3            2  2619
 4            3   140
 5            4  1195
 6            5   115
 7            6  3505
 8            7   231
 9            8  4278
10            9   176
11           10  4461
12           11  4631
## see unique values for parental expectation
df_small |> count(x1paredexpct)
# A tibble: 13 × 2
   x1paredexpct     n
          <dbl> <int>
 1           -9    32
 2           -8  6715
 3            1    55
 4            2  1293
 5            3   149
 6            4  1199
 7            5   133
 8            6  4952
 9            7    76
10            8  3355
11            9    37
12           10  3782
13           11  1725
  • Dealing with -8 and -9 is straightforward — we’ll convert them to missing.
    • In R, missing values are technically stored as NA.
      • Not all statistical software uses the same values to represent missing values (for example, Stata uses a dot .)
    • NCES has decided to represent missing values as a limited number of negative values. In this case, -8 and -9 represent missing values
  • Note: how to handle missing values is a very important topic, one we could spend all semester discussing
    • For now, we are just going to drop observations with missing values; but be forewarned that how you handle missing values can have real ramifications for the quality of your final results
    • In real research, a better approach is usually to impute missing values, but that is beyond our scope right now
  • Deciding what to do with 11 is a little trickier. While it’s not a missing value per se, it also doesn’t make much sense in its current ordering, that is, for “Don’t know” to be “higher” than completing a professional degree
    • For now, we’ll make a decision to convert these to NA as well, effectively deciding that an answer of “I don’t know” is the same as missing an answer
  • So first step: convert -8, -9, and 11 in both variables to NA. For this, we’ll use the mutate() and ifelse() functions

Conditional Values with ifelse()

  • ifelse() is a really common command in R and has three parts
    1. statement that can be TRUE or FALSE
    2. What to return if the statement is TRUE
    3. else what to return when the statement is FALSE
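
For example, here’s a toy ifelse() call on a made-up vector (not our data):

## is each score above 50? if TRUE return "pass", else return "fail"
scores <- c(45, 72, 60)
ifelse(scores > 50, "pass", "fail")
[1] "fail" "pass" "pass"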

Modifying an Existing Variable with mutate()

  • When we want to add variables and change existing ones, we can use the mutate() function
    • The basic idea of mutate commands is mutate(<where to go> = <what to go there>)
      • This is probably the trickiest function we cover today to understand
    • New variables are created if you provide <where to go> a new variable name (or nothing)
    • Variables are modified if you provide <where to go> an existing variable name
    • <what to go there> can be as simple as a single number or as complex as a chain of piped functions, so long as there’s a clear answer for every row
  • In this case, we want to modify x1stuedexpct to be NA when x1stuedexpct is -8, -9, or 11
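
To see the two cases in action, here are two throwaway examples (nothing is assigned, so df_small is unchanged):

## new variable: every row gets the number 1
df_small |> mutate(flag = 1)
## modifying an existing variable: doubling x1region (just a demo!)
df_small |> mutate(x1region = x1region * 2)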

Now, we have three values we want to convert to NA, so we could do them one at a time, like below

df_small <- df_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct == -8, NA, x1stuedexpct))
  • Let’s walk through this code
    1. Assign the results back to df_small (which will overwrite our previous df_small)
    2. Take df_small and pipe |> it into mutate()
    3. Inside mutate(), assign our results to x1stuedexpct (which will modify the existing variable)
    4. Modify x1stuedexpct with an ifelse() statement, which remember has 3 parts
       1. a statement asking “is x1stuedexpct == -8?” — Notice == means “is equal to”, while = means “assign to”. Yes it’s confusing, but you’ll get it over time!
       2. if that statement is true, make it NA
       3. else (if the statement is false) return the original value of x1stuedexpct

Okay, make sense? Let’s see what we just did (look at row 26)

print(df_small, n = 26)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           11           11        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           11           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           -8        2
19  10020            8            8        4
20  10021            8           11        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           -8        4
24  10026            7           10        3
25  10027           11            6        1
26  10028           NA           -8        3
# ℹ 23,477 more rows
  • This is fine, but we would have to do it three times for each of the student and parent expectation variables
    • Instead, can anyone think (not in R code, just in terms of logic) how we could change our statement piece of the ifelse() to be more efficient?

Being Efficient with %in% and c()

  • What we can do is group -8, -9, and 11 together into a single vector using c()
    • c() (“combine”) is a very common R function used to combine values into a vector
  • Then, we can use the %in% operator to ask whether a value is any of the numbers in that vector
    • This keeps our code shorter and easier to read
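
Here’s the idea on its own, before we use it for real:

## is the value on the left one of the numbers in the vector on the right?
-8 %in% c(-8, -9, 11)
[1] TRUE
3 %in% c(-8, -9, 11)
[1] FALSE

Now let’s apply the same pattern to our data: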
df_small <- df_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct))
  • The code now works just as above, but instead of asking if x1stuedexpct is equal to -8, it asks if it’s in the vector of -8, -9, and 11, then does the same for parental expectations!
    • Let’s view those first 26 rows again to see what we did
print(df_small, n = 26)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           NA            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           NA        1
 8  10008            8            6        1
 9  10009           NA           NA        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           NA           NA        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           NA           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           NA        2
19  10020            8            8        4
20  10021            8           NA        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           NA        4
24  10026            7           10        3
25  10027           NA            6        1
26  10028           NA           NA        3
# ℹ 23,477 more rows
  • Just to be doubly sure, let’s check count() again
df_small |> count(x1stuedexpct) 
# A tibble: 11 × 2
   x1stuedexpct     n
          <dbl> <int>
 1            1    93
 2            2  2619
 3            3   140
 4            4  1195
 5            5   115
 6            6  3505
 7            7   231
 8            8  4278
 9            9   176
10           10  4461
11           NA  6690
df_small |> count(x1paredexpct)
# A tibble: 11 × 2
   x1paredexpct     n
          <dbl> <int>
 1            1    55
 2            2  1293
 3            3   149
 4            4  1199
 5            5   133
 6            6  4952
 7            7    76
 8            8  3355
 9            9    37
10           10  3782
11           NA  8472

Success!

Creating a New Variable with mutate()

  • So, with that tangent out of the way, let’s get back to our original task, creating a new variable that is the highest of parental and student expectations
  • To make a new variable which is the highest of two variables, we can use our friends mutate() and ifelse() some more
df_small <- df_small |>
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct))
  • That code is almost what we want to do
    • If x1stuedexpct is higher, take that; if not, take x1paredexpct
      • There are two things I haven’t fully accounted for though…
        • One doesn’t actually matter here, but might in other circumstances
        • One definitely matters here
          • Before scrolling ahead, can you tell me what they might be?

Sloppy Mistake 1 (doesn’t matter here)

  • I was a little sloppy with the statement piece: I just asked whether x1stuedexpct was greater than x1paredexpct or not
    • If I were being more careful, I might have said “greater than or equal to”
      • Why doesn’t this matter in this context, and when might it matter?
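
If you want to see the mechanics, here’s a tiny illustration: with a strict >, a tie falls through to the else value

## 5 is not greater than 5, so the else value is returned
ifelse(5 > 5, "student", "parent")
[1] "parent"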

Sloppy Mistake 2 (does matter here)

  • Now let’s check our data frame to see the one that does matter
print(df_small, n = 26)
# A tibble: 23,503 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10002           NA            6        1       NA
 3  10003           10           10        4       10
 4  10004           10           10        3       10
 5  10005            6           10        3       10
 6  10006           10            8        3       10
 7  10007            8           NA        1       NA
 8  10008            8            6        1        8
 9  10009           NA           NA        3       NA
10  10010            8            6        1        8
11  10011            8            6        3        8
12  10012           NA           NA        2       NA
13  10013            8           10        3       10
14  10014            2            6        3        6
15  10015           NA           10        3       NA
16  10016            4            6        2        6
17  10018            6            7        2        7
18  10019            8           NA        2       NA
19  10020            8            8        4        8
20  10021            8           NA        2       NA
21  10022           10            8        4       10
22  10024            6            8        2        8
23  10025            8           NA        4       NA
24  10026            7           10        3       10
25  10027           NA            6        1       NA
26  10028           NA           NA        3       NA
# ℹ 23,477 more rows
  • Hmm, that seems odd, why would R consider NA to be greater than 6?
    • Any thoughts?
  • Generally, R is overly cautious when dealing with NAs, to ensure you don’t accidentally drop them without realizing it
    • For example, if I asked you what the mean(c(5, 6, 4, NA)) would be, you’d probably say 5, right?
      • R is never going to just ignore the NA values like that unless we tell it to
mean(c(5, 6, 4, NA))
[1] NA
  • See, we have to explicitly tell it to remove the NA values
mean(c(5, 6, 4, NA), na.rm = T)
[1] 5
  • So in our case of trying to get the highest expectation, R doesn’t want us to forget we have NA values, so it throws them back at us.
  • For now, let’s keep the results we got, but, if we got an NA and there is a non-NA value in the other column, sub that in

Dealing With Missing Values with is.na(), &, and !

  • To do this, we will add another couple of code helpers
    • is.na(high_exp) simply asks whether high_exp is NA or not
      • R doesn’t let you just say high_exp == NA
    • ! is a really helpful tool, which can be used to negate or invert a command
      • !is.na(x1stuedexpct) just returns the opposite of is.na(x1stuedexpct), so it tells us the column is not NA
    • & can be useful inside conditional statements, as it means both conditions must be TRUE (FYI: | means or)
  • Used together, we can now ensure we have as few NAs as possible in the data (there will still be some when both student and parent were NA)
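
Here’s a quick toy illustration of each helper (x is just a made-up vector):

x <- c(4, NA, 7)
## which elements are NA?
is.na(x)
[1] FALSE  TRUE FALSE
## which elements are NOT NA?
!is.na(x)
[1]  TRUE FALSE  TRUE
## & is TRUE only where both sides are TRUE
is.na(x) & c(TRUE, TRUE, FALSE)
[1] FALSE  TRUE FALSE

Putting these helpers together on our data: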
df_small <- df_small |>
  mutate(high_exp = ifelse(is.na(high_exp) & !is.na(x1stuedexpct), x1stuedexpct, high_exp),
         high_exp = ifelse(is.na(high_exp) & !is.na(x1paredexpct), x1paredexpct, high_exp))
  • Let’s print this one last time to check our work
print(df_small, n = 26)
# A tibble: 23,503 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10002           NA            6        1        6
 3  10003           10           10        4       10
 4  10004           10           10        3       10
 5  10005            6           10        3       10
 6  10006           10            8        3       10
 7  10007            8           NA        1        8
 8  10008            8            6        1        8
 9  10009           NA           NA        3       NA
10  10010            8            6        1        8
11  10011            8            6        3        8
12  10012           NA           NA        2       NA
13  10013            8           10        3       10
14  10014            2            6        3        6
15  10015           NA           10        3       10
16  10016            4            6        2        6
17  10018            6            7        2        7
18  10019            8           NA        2        8
19  10020            8            8        4        8
20  10021            8           NA        2        8
21  10022           10            8        4       10
22  10024            6            8        2        8
23  10025            8           NA        4        8
24  10026            7           10        3       10
25  10027           NA            6        1        6
26  10028           NA           NA        3       NA
# ℹ 23,477 more rows
  • Okay, looks good!

  • There are other ways we could have gone about this analysis, some more sophisticated and slicker

    • But, what matters more is that we checked our work as we went, caught the issues, and finished with the correct data
      • We will learn more tools along the way that speed things up, especially in the brand new Data Wrangling IV Lesson
      • However, the more steps you take at once, the more you have to check at once!
      • There’s rarely a single correct way to get to the right answer; what matters is that you get there, it’s clear what you did, and you catch issues as they come up

“The point to keep in mind is that the process is often iterative (two steps forward, one step back…) and that there’s seldom a single correct way.” B.T. Skinner

  • Now that it’s made correctly, let’s check the counts of our new variable
## get summary of our new variable
df_small |> count(high_exp)
# A tibble: 11 × 2
   high_exp     n
      <dbl> <int>
 1        1    71
 2        2  2034
 3        3   163
 4        4  1282
 5        5   132
 6        6  4334
 7        7   191
 8        8  5087
 9        9   168
10       10  6578
11       NA  3463
  • Hmm… We still have a large number of NAs, meaning neither the parent nor the student provided an expectation
    • In more sophisticated analyses, this is where we might need to think about imputation or something else to handle the missing-ness
      • For the sake of this lesson, however, we just want to drop the observations with NA for high_exp

Keeping Rows Based on a Condition with filter()

  • To do this, we are going to use the filter() command from tidyverse
  • filter() works by only keeping observations that meet the condition(s) we set
    • As in, to make it through the filter, a row must answer “yes” to “does it meet this condition?”
## filter out missing values
df_small_cut <- df_small |> filter(!is.na(high_exp))
  • I re-used our !is.na() helper to keep all rows that answer “yes” to “are you not NA?”
    • Double negatives are something you’ll have to get used to in coding, sorry
    • Notice, instead of overwriting df_small we assigned this to a new object df_small_cut
      • Generally, when making substantial changes to a data set like dropping observations, I like to be able to double check what I did, which is easier if we make a new df

Quick Question: A common confusion from this lesson is between filter() and select(). Can someone explain when you’d use select() over filter()?
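
One way to see the difference (neither line is assigned, so nothing changes):

## select() keeps columns (variables): every row, one column
df_small |> select(x1region)
## filter() keeps rows (observations): every column, only region-1 rows
df_small |> filter(x1region == 1)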

  • To see if that worked, let’s count() our new variable one more time
df_small_cut |> count(high_exp)
# A tibble: 10 × 2
   high_exp     n
      <dbl> <int>
 1        1    71
 2        2  2034
 3        3   163
 4        4  1282
 5        5   132
 6        6  4334
 7        7   191
 8        8  5087
 9        9   168
10       10  6578
  • Okay, so no NAs, perfect!
  • Just to be extra sure we only removed NAs, we can check how many fewer rows df_small_cut has than our original df_small
    • Who can tell me how many rows should have been dropped?
## does original # of rows - current # of rows == # of NAs from count()?
nrow(df_small) - nrow(df_small_cut)
[1] 3463
  • nrow() does what you’d expect: it counts the number of rows
    • You could also just check by looking at the dfs in the environment tab, but this way leaves no room for mental math errors

Summarizing Data with summarize()

  • Okay, so we have our data selected, we made our high_exp variable, and we’ve done some work to handle missing data
  • In our week one data talk, we discussed how massive tables of data are not particularly helpful for someone to read or interpret
    • We will cover how to make graphs of our data in a few weeks
    • For our final task today, we are going to make some summary tables using summarize() from the tidyverse
  • summarize() allows us to apply a summary statistic (mean, sd, median, etc.) to a column in our data
  • summarize() takes an entire data frame as input and spits out a small data frame with just the summary variables
    • Note: for this reason, you rarely ever want to assign <- the output of summarize() back to the main data frame object, as you’ll overwrite it
      • You can either spit the summary tables out into the console without assignment (which we will do) or if you need to use them for something else, assign them to a new object
## get average (without storing)
df_small_cut |> summarize(mean(high_exp))
# A tibble: 1 × 1
  `mean(high_exp)`
             <dbl>
1             7.27
  • See, the output is a 1x1 table with the mean expectation mean(high_exp) of 7.27, which is just above “Start a Master’s degree” on our scale
    • Note: if we want to name the summary variable, we can name it just like we did earlier in mutate() with a single =
df_small_cut |> summarize(mean_exp = mean(high_exp))
# A tibble: 1 × 1
  mean_exp
     <dbl>
1     7.27
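
As an aside, summarize() can calculate several statistics at once, each getting its own column (output omitted here):

## mean, standard deviation, and number of observations in one call
df_small_cut |>
  summarize(mean_exp = mean(high_exp),
            sd_exp = sd(high_exp),
            n_students = n())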
  • But, that wasn’t quite the question we were asked
    • We were asked if it varied by region…
      • For time’s sake, I can tell you the region variable is x1region and splits the US into 4 Census regions

Grouping Data with group_by()

  • The group_by() function, following the tidyverse principle of intuitive naming, groups the data (and any output) by the variable(s) you specify
    • So, since we want to calculate the average high expectation by region, we group_by(x1region)
      • Since we only want it for our summarize(), we simply add it to the pipe
        • If you wanted to save the data in its group_by()-ed state, you could assign it to an object
## get grouped average
df_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp))
# A tibble: 4 × 2
  x1region mean_exp
     <dbl>    <dbl>
1        1     7.39
2        2     7.17
3        3     7.36
4        4     7.13
  • Success! While expectations are similar across the country, there’s some variation by region
    • While there are a few things we could do to make this a little fancier (e.g., changing the region numbers to names, formatting the table, etc.), we have answered our question and have clear documentation of how we got here, so I will call that a win!

Saving Data with write_csv()

  • Sometimes we want to be able to access objects from scripts without having to re-run the whole thing
    • Remember: one of the main advantages of R is that the original data file we read in remains untouched
  • To do this, we want to write_ a new csv() file, containing our modified data
    • unlike read_csv(), which only needed a file name/path, write_csv() needs to know both what you’re trying to save and the file name/path you want to save it to
      • The only way you can overwrite or change the original data is by saving to the same file name as the original data, so NEVER do that!
  • Since we didn’t assign our summary table to anything, we can just add write_csv() to the end of the pipe and add a file.path()
    • If you want to save a data frame you already assigned to an object, write_csv(<object>, file.path(<path>)) would work just fine (see the example after the code below)!
## write with useful name

df_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp)) |>
  write_csv(file.path("data", "region-expects.csv"))
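
Alternatively, if we had assigned the summary table to an object first (region_exp is just a hypothetical name), the two-argument form would look like this:

## same result, just starting from a stored object
region_exp <- df_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp))
write_csv(region_exp, file.path("data", "region-expects.csv"))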

Phew!

Appendix: All at Once

We went through that piece by piece to demonstrate each function, but, there’s no reason we can’t just |> pipe it all together

## Let's redo the analysis above, but with a fully chained set of
## functions.

## start with original df
df |>
  ## select columns we want
  select(stu_id, x1stuedexpct, x1paredexpct, x1region) |>
  ## If expectation is -8, -9, or 11, make it NA
  mutate(student_exp = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         parent_exp = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct)) |>
  ## Make a new variable called high_exp that is the higher of parent and student exp
  mutate(high_exp = ifelse(student_exp > parent_exp, student_exp, parent_exp)) |>
  ## If one exp is NA but the other isn't, keep the value not the NA
  mutate(high_exp = ifelse(is.na(high_exp) & !is.na(student_exp), student_exp, high_exp),
         high_exp = ifelse(is.na(high_exp) & !is.na(parent_exp), parent_exp, high_exp)) |>
  ## Drop if high_exp is still NA (neither parent nor student answered)
  filter(!is.na(high_exp)) |>
  ## Group the results by region
  group_by(x1region) |>
  ## Get the mean of high_exp (by region)
  summarize(mean_exp = mean(high_exp)) |>
  ## Write that to a .csv file
  write_csv(file.path("data", "region-expects-chain.csv"))

To double check, let’s just check these are the same…

non_chain <- read_csv(file.path("data", "region-expects.csv"))
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chain <- read_csv(file.path("data", "region-expects-chain.csv"))
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all.equal(non_chain, chain)
[1] TRUE

WooHoo!

Final notes

  • This rather lengthy lesson has thrown you into the (medium) deep end of the coding pool
    • By no means are you expected to get everything we just did
      • We will continue to revisit all these commands throughout the class; by the end of the semester, they will be second nature!
  • We also saw how to use code to answer a realistic question we might be asked in a data management job, a translation skill that will prove invaluable later on!
    • We had to plan out steps and then make some adjustments along the way (e.g., our NA issues); that’s all part of the process!

“Becoming a better quantitative researcher mostly means becoming a better translator: question –> data/coding –> answer.” B.T. Skinner

  • Using the hsls-small.csv data set, answer the following questions
    • Hint: You’re going to need the code book to identify the necessary variables

Questions

  1. What is the average standardized math test score?
     • How does this differ by gender?
  2. Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (include what that category represents)
  3. Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?
     • How does this differ by region?
  4. What percentage of students ever attended a post-secondary institution by February 2016?
     • Also, give the cross tabulation for both family incomes above/below $35,000 and region
       • This means you should have percentages for 8 groups: above/below $35k within each region
       • Hint: group_by() can be given more than one group

Once complete, turn in the .R script (no data etc.) to Canvas by the due date (Sunday 11:59pm following the lesson). Assignments will be graded on the following Monday (time permitting) in line with the grading policy outlined in the syllabus.

Solution

R Solution Code

## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Data Wrangling I Solution]
##' [INIT: Jan 28 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------

setwd(this.path::here())

## ---------------------------
##' [Libraries]
## ---------------------------

library(tidyverse)

## ---------------------------
##' [Input]
## ---------------------------

df <- read_csv(file.path("data", "hsls-small.csv"))

## ---------------------------
##' [Q1]
## ---------------------------

## Part One
df |>
  filter(!x1txmtscor %in% c(-8, -9)) |>
  summarize(mean = mean(x1txmtscor))

## Part Two
math <- df |>
  filter(!x1txmtscor %in% c(-8, -9),
         x1sex != -9) |>
  group_by(x1sex) |>
  summarize(mean = mean(x1txmtscor))

## ---------------------------
##' [Q2]
## ---------------------------

df |>
  filter(x1poverty185 == 1,
         !x1famincome %in% c(-8,-9)) |>
  summarize(med_inc_cat = median(x1famincome))

print("Median income category in 2, which represents Family income > $15,000 and <= $35,000")

## ---------------------------
##' [Q3]
## ---------------------------

## Simplified
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  count(x4hscompstat) |>
  mutate(perc = n / sum(n) * 100)

## Part One
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  summarize(ged = sum(x4hscompstat == 2),
            total = sum(x4hscompstat %in% c(1,2)),
            perc = ged/total*100)

## Part Two
## Simplified
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  group_by(x1region) |>
  count(x4hscompstat) |>
  mutate(perc = n / sum(n) * 100) |>
  filter(x4hscompstat == 1)

df |>
  filter(x4hscompstat %in% c(1,2)) |>
  group_by(x1region) |>
  summarize(ged = sum(x4hscompstat == 2),
            total = sum(x4hscompstat %in% c(1,2)),
            perc = ged/total*100)

## ---------------------------
##' [Q4]
## ---------------------------

## Part One
df |>
  filter(x4evratndclg != -8) |>
  count(x4evratndclg) |>
  mutate(perc = n / sum(n) * 100)


df |>
  filter(x4evratndclg != -8) |>
  summarize(college = sum(x4evratndclg == 1),
            total = sum(x4evratndclg %in% c(0, 1)),
            perc = college/total*100)

## Part Two
df |>
  filter(x4evratndclg != -8,
         !x1famincome %in% c(-8, -9)) |>
  mutate(below_35k = ifelse(x1famincome %in% c(1,2), 1, 0)) |>
  group_by(x1region, below_35k) |>
  count(x4evratndclg) |>
  mutate(perc = n / sum(n) * 100) |>
  filter(x4evratndclg == 1)



df |>
  filter(x4evratndclg != -8,
         !x1famincome %in% c(-8, -9)) |>
  mutate(below_35k = ifelse(x1famincome %in% c(1,2), 1, 0)) |>
  group_by(x1region, below_35k) |>
  summarize(college = sum(x4evratndclg == 1),
            total = sum(x4evratndclg %in% c(0, 1)),
            perc = college/total*100)

## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------