I: Enter the tidyverse

R Code

  • Today we take our first full dive into R. Though every data wrangling / analysis task is unique, the best way to learn and practice R is to answer a question using data, which is how this lesson is set up
  • By the end of this lesson, you will have read in a data set, lightly cleaned it, produced results, and saved your findings in a new file
    • As part of this process, you will have practiced translating a research question into data analysis steps, which is a skill every bit as important as technical sophistication with a statistical language like R

Re: Urgent Data Question from the Provost

  • Through today’s lesson, we will explore some of the basics of data wrangling
    • But to make it concrete, we will be doing so to answer a realistic question you may be asked by your advisor or supervisor

Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.

  • A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
    • Instead, it’s up to you to
      1. translate a research question into the discrete coding steps necessary to provide an answer, and then
      2. translate the answer such that everyone understands what you’ve found

What we need to do is some combination of the following:

  1. Read in the data
  2. Select the variables we need
  3. Mutate a new value that’s the higher of student and parental degree expectations
  4. Filter out observations with missing degree expectation values
  5. Summarize the data within region to get average degree expectation values
  6. Write out the results to a file so we have it for later

Let’s do it!

  • Throughout this lesson (and class), we are going to lean heavily on the tidyverse collection of packages
    • If you don’t already have this installed, use install.packages("tidyverse")
library(tidyverse)
Warning: package 'lubridate' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Check working directory

  • This script — like the one from the organizing lesson — assumes that the class folder is the working directory and that the required data file is in the data sub-directory

  • If you need a refresher on setting the working directory, see the prior lesson.

  • Notice that we’re not setting (i.e., hard coding) the working directory in the script, as that would not work well for sharing the code. Instead, once you know where you need to be and have gotten there, we rely on relative paths
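A quick sketch of what that looks like in practice (this assumes you have already navigated to the class folder):

## check where R thinks you are; this should print the class folder path
getwd()
## from here, relative paths like "data/hsls-small.csv" will resolve correctly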

Reading Data in with read_csv()

  • Our data come from the High School Longitudinal Study of 2009 (HSLS09): a nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016
  • Students followed throughout secondary and postsecondary years
  • Surveys of students, their parents, math and science teachers, school administrators, and school counselors
  • A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades
  • 10 state representative data sets
  • If you are interested in using HSLS09 for future projects, DO NOT rely on this subset. Be sure to download the full data set with all relevant variables and weights if that’s the case. But for our purposes in this lesson, it will work just fine.

  • Throughout, we’ll need to consult the code book. An online version can be found at this link.

## data are CSV, so we use read_csv() from the readr library
data <- read_csv("data/hsls-small.csv")
Rows: 23503 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): stu_id, x1sex, x1race, x1stdob, x1txmtscor, x1paredu, x1hhnumber, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## alternatively, you can also use read_csv(file.path("data", "hsls-small.csv"))
  • You may notice that read_csv() prints out information about the data just read in. Nothing is wrong! The read_csv() function, like many other functions in the tidyverse, assumes you’d rather have more information than less and acts accordingly (see the sketch after this list for how to quiet it)

  • Here we assign our data to an object called data

    • You can call it whatever you want
  • Up to now, this should all seem fairly consistent with last week’s lesson. That was just the set-up; now it’s time to dive in!
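One quick aside before we dive in: if you ever want a quieter read, the printed message itself tells you how; a minimal sketch:

## suppress the column specification message when reading
data <- read_csv("data/hsls-small.csv", show_col_types = FALSE)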

Native Pipe Operator |> in R

The pipe is one of the things that makes R code more intuitive than other programming languages, as it allows us to write code in the order we think about it, passing output from one function to the next, rather than nesting function calls as traditional code would require

A Brief History of the R Pipe

  • The pipe was originally a tidyverse invention that used the %>% symbol, which is probably still the most common pipe you’ll see “in the wild”
  • The pipe we are using, |> (the “native pipe”), was brought into the vanilla version of R a few years ago
  • The reason for the change is a set of benefits that are beyond the scope of this class; you just need to know that |> and %>% are essentially the same thing
    • The default shortcut for pipe (ctrl + shift + m) will generate %>%, but you can change it to |> by going to Tools > Global Options > Code > Use native pipe operator, |>
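Since they’re essentially interchangeable, here’s a quick sanity check you can run (any simple expression works):

## both pipes give the same result once the tidyverse is loaded
1:5 |> mean()
[1] 3
1:5 %>% mean()
[1] 3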

How the Pipe Works

For this example, don’t worry about the actual processes (we will go over them more below); just look at how much more intuitive the code is with |>s.

First, let’s say I want to take the data we just read in and select the x1txmtscor (math test scores) column

Without |>

## Without |>
select(data, x1txmtscor)

With |>

## With |>
data |> select(x1txmtscor)

Neither is that confusing… But what if we want to take that output and keep only students with a math score above 50?

Without |>

## Without |>
filter(select(data, x1txmtscor), x1txmtscor > 50)

With |>

## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50)

See how the non-piped version is getting messy? Let’s add one more level to really make the point, creating a new variable that is the square root of the test score

Without |>

## Without |>
mutate(filter(select(data, x1txmtscor), x1txmtscor > 50), square_root = sqrt(x1txmtscor))

With |>

## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50) |> mutate(square_root = sqrt(x1txmtscor))

As our chains get longer, let’s use a new line for each pipe, just to make it even clearer (it makes no difference to R)

## Best to use a new line for each pipe when code gets longer
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))

Even though we haven’t covered any of these commands yet, with |> it’s pretty easy to get a rough sense of what’s going on, whereas the traditional nested way gets really tricky beyond a couple of commands.

Of course, if you wanted, you could do each step separately, constantly assigning and overwriting an object like below

## Without the |>, we could technically break it down step by step
temp <- select(data, x1txmtscor)
temp <- filter(temp, x1txmtscor > 50)
temp <- mutate(temp, square_root = sqrt(x1txmtscor))
temp

But it’s less intuitive than simply piping the results.

  • If we do want to see step-by-step output in a piped command, you can either
    1. Run the code as you write it line-by-line
    2. Highlight sections of the piped code to run up to a point

Assigning Output from Pipes

  • Assigning output from pipes works the same way we have covered a few times: we use <- to pass the result backwards into an object (in this case we called that object data_backward_pass)
## Always assign backwards
data_backward_pass <- data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))
  • If you want to think of it this way, we are effectively continuing to pass it forward like this…
## You can think of the assignment as a continuation of the pipe like this
## but don't write it this way, it's then hard to find what you called something later
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor)) ->
  data_forward_pass
  • That’s how pipes and assignment work together, and you can see the outputs are all.equal()
## Checking they are the same
all.equal(data_backward_pass, data_forward_pass)
[1] TRUE
  • However, you really shouldn’t write code like this, as although it runs, it’s then hard to later find where you created data_forward_pass.
  • Starting the string of code with where you store the result is MUCH clearer.
    • But hopefully it will help some of you understand how |> and <- work together

Basic Tidyverse Commands

  • Now that we have the pipe |> covered, it’s time to dig into some basic data wrangling commands

Selecting Variables/Columns with select()

  • Data sets often contain hundreds, thousands, or even tens of thousands of variables when we are only interested in a handful. Our first tidyverse command, select(), helps us deal with this by, as you may have guessed, select()-ing the variables we want
  • Since we are going to pipe our R commands, we start with our data then pipe it into the select() command
  • In the select() command, we list out the variables we want
    • stu_id, x1stuedexpct, x1paredexpct, x1region
      • Notice: in this (and most tidyverse) command(s) we don’t have to use “quotes” around the variable names. A common error message when something did need quotes (and you left them off) is Error: object <thing you should have "quoted"> not found
data |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
# ℹ 23,493 more rows

Quick question: if we want to use this reduced dataset going forward, what should we do?

That’s right, assign it to an object! Let’s call that data_small

data_small <- data |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)
  • Our next step is to create a variable that’s the higher of student and parent expectations. Sounds simple enough, but first, we have a problem…
    • Can anyone guess what it is?

Understanding our data

First things first: we need to check the code book to see what the numerical values for our two education expectation variables represent. To save time, we’ve copied them here:

x1stuedexpct

How far in school 9th grader thinks he/she will get

value  label
    1  Less than high school
    2  High school diploma or GED
    3  Start an Associate’s degree
    4  Complete an Associate’s degree
    5  Start a Bachelor’s degree
    6  Complete a Bachelor’s degree
    7  Start a Master’s degree
    8  Complete a Master’s degree
    9  Start Ph.D/M.D/Law/other prof degree
   10  Complete Ph.D/M.D/Law/other prof degree
   11  Don’t know
   -8  Unit non-response

x1paredexpct

How far in school parent thinks 9th grader will go

value  label
    1  Less than high school
    2  High school diploma or GED
    3  Start an Associate’s degree
    4  Complete an Associate’s degree
    5  Start a Bachelor’s degree
    6  Complete a Bachelor’s degree
    7  Start a Master’s degree
    8  Complete a Master’s degree
    9  Start Ph.D/M.D/Law/other prof degree
   10  Complete Ph.D/M.D/Law/other prof degree
   11  Don’t know
   -8  Unit non-response
   -9  Missing
  • The good news is that the categorical values are the same for both variables (meaning we can make an easy comparison) and move in a logical progression
  • The bad news is that we have three values — -8, -9, and 11 — that we need to deal with so that the averages we compute later represent what we mean

Exploring Categorical Variables with count()

  • First, let’s see how many observations are affected by these values using count()

Notice that we don’t assign to a new object; this means we’ll see the result in the console, but nothing in our data objects will change

## see unique values for student expectation
data_small |> count(x1stuedexpct)
# A tibble: 12 × 2
   x1stuedexpct     n
          <dbl> <int>
 1           -8  2059
 2            1    93
 3            2  2619
 4            3   140
 5            4  1195
 6            5   115
 7            6  3505
 8            7   231
 9            8  4278
10            9   176
11           10  4461
12           11  4631
## see unique values for parental expectation
data_small |> count(x1paredexpct)
# A tibble: 13 × 2
   x1paredexpct     n
          <dbl> <int>
 1           -9    32
 2           -8  6715
 3            1    55
 4            2  1293
 5            3   149
 6            4  1199
 7            5   133
 8            6  4952
 9            7    76
10            8  3355
11            9    37
12           10  3782
13           11  1725
  • Dealing with -8 and -9 is straightforward: we’ll convert them to missing.
    • In R, missing values are technically stored as NA.
      • Not all statistical software uses the same values to represent missing values (for example, Stata uses a dot .)
    • NCES has decided to represent missing values as a limited number of negative values. In this case, -8 and -9 represent missing values
  • Note: how to handle missing values is a very important topic, one we could spend all semester discussing
    • For now, we are just going to drop observations with missing values; but be forewarned that how you handle missing values can have real ramifications for the quality of your final results
    • In real research, a better approach is usually to impute missing values, but that is beyond our scope right now
  • Deciding what to do with 11 is a little trickier. While it’s not a missing value per se, it also doesn’t make much sense in its current ordering, that is, as a value “higher” than completing a professional degree
    • For now, we’ll make a decision to convert these to NA as well, effectively deciding that an answer of “I don’t know” is the same as missing an answer
  • So first step: convert -8, -9, and 11 in both variables to NA. For this, we’ll use the mutate() and ifelse() functions

Conditional Values with ifelse()

  • ifelse() is a really common command in R and has three parts
    1. A statement that can be TRUE or FALSE
    2. What to return if the statement is TRUE
    3. else, what to return when the statement is FALSE
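A minimal standalone example with made-up values, before we use it for real:

## for each value: is it greater than 5? if TRUE return "yes", else "no"
ifelse(c(3, 7, 5) > 5, "yes", "no")
[1] "no"  "yes" "no" 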

Modifying an Existing Variable with mutate()

  • When we want to add variables and change existing ones, we can use the mutate() function
    • The basic idea of mutate commands is mutate(<where to go> = <what to go there>)
      • This is probably the trickiest function to understand of those we cover today
    • New variables are created if you provide <where to go> a new variable name (or nothing)
    • Variables are modified if you provide <where to go> an existing variable name
    • <what to go there> can be as simple as a single number all the way up to a chain of piped functions, so long as there’s a clear answer for every row (see the tiny sketch after this list)
  • In this case, we want to modify x1stuedexpct to be NA when x1stuedexpct is -8, -9, or 11
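To see that <where to go> = <what to go there> pattern in isolation, here’s a tiny sketch (the variable name ones is made up purely for illustration, and nothing is assigned, so data_small is unchanged):

## new variable: every row gets the value 1
data_small |> mutate(ones = 1)
## modifying an existing variable: replace x1region with 10 times its value
data_small |> mutate(x1region = x1region * 10)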

Now, we have three values we want to convert to NA, so we could do them one at a time, like below

data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct == -8, NA, x1stuedexpct))
  • Let’s walk through this code
    1. Assign the results back to data_small (which will overwrite our previous data_small)
    2. Take data_small and pipe |> it into mutate()
    3. Inside mutate(), assign our results to x1stuedexpct (which will modify the existing variable)
    4. Modify x1stuedexpct with an ifelse() statement, which remember has 3 parts
       1. a statement asking “is x1stuedexpct == -8?” - Notice == means “is equal to”, while = means “assign to”. Yes, it’s confusing, but you’ll get it over time!
       2. if that statement is TRUE, make it NA
       3. else (if the statement is FALSE) return the original value of x1stuedexpct

Okay, make sense? Let’s see what we just did (look at row 26)

print(data_small, n = 26)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           11           11        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           11           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           -8        2
19  10020            8            8        4
20  10021            8           11        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           -8        4
24  10026            7           10        3
25  10027           11            6        1
26  10028           NA           -8        3
# ℹ 23,477 more rows
  • This is fine, but we’d have to do it three times each for the student and parental expectation variables
    • Instead, can anyone think (not in R code, just in terms of logic) how we could change our statement piece of the ifelse() to be more efficient?

Being Efficient with %in% and c()

  • What we can do is group -8, -9, and 11 together into a single vector using c()
    • c() is a very common function in R used to combine values into a vector
  • Then, we can use the %in% operator to ask whether a value matches any of the numbers in that vector
    • This keeps our code shorter and easier to read
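As a quick standalone illustration (made-up values), %in% returns TRUE or FALSE for each element on its left:

## which of these values appear in our "missing" set?
c(-8, 5, 11) %in% c(-8, -9, 11)
[1]  TRUE FALSE  TRUE

Now let’s apply that to our data: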
data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct))
  • The code now works just as above, but instead of asking if x1stuedexpct is equal to -8, it asks whether it’s any of -8, -9, or 11, then does the same for parental expectations!
    • Let’s view those first 26 rows again to see what we did
print(data_small, n = 26)
# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           NA            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           NA        1
 8  10008            8            6        1
 9  10009           NA           NA        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           NA           NA        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           NA           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           NA        2
19  10020            8            8        4
20  10021            8           NA        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           NA        4
24  10026            7           10        3
25  10027           NA            6        1
26  10028           NA           NA        3
# ℹ 23,477 more rows
  • Just to be doubly sure, let’s check count() again
data_small |> count(x1stuedexpct) 
# A tibble: 11 × 2
   x1stuedexpct     n
          <dbl> <int>
 1            1    93
 2            2  2619
 3            3   140
 4            4  1195
 5            5   115
 6            6  3505
 7            7   231
 8            8  4278
 9            9   176
10           10  4461
11           NA  6690
data_small |> count(x1paredexpct)
# A tibble: 11 × 2
   x1paredexpct     n
          <dbl> <int>
 1            1    55
 2            2  1293
 3            3   149
 4            4  1199
 5            5   133
 6            6  4952
 7            7    76
 8            8  3355
 9            9    37
10           10  3782
11           NA  8472

Success!

Creating a New Variable with mutate()

  • So, with that tangent out of the way, let’s get back to our original task: creating a new variable that is the higher of parental and student expectations
  • To make a new variable that takes the higher of two variables, we can use our friends mutate() and ifelse() some more
data_small <- data_small |>
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct))
  • That code is almost what we want to do
    • If x1stuedexpct is higher, take that; if not, take x1paredexpct
      • There are two things we haven’t fully accounted for though…
        • One doesn’t actually matter here, but might in other circumstances
        • One definitely matters here
          • Without scrolling past the duck, does anyone know what they might be?

Sloppy Mistake 1 (doesn’t matter here)

  • We were a little sloppy with the statement piece: we just asked whether x1stuedexpct was greater than x1paredexpct or not
    • If we were being more careful, we might have said “greater than or equal to”
      • Why doesn’t this matter in this context, and when might it matter?

Sloppy Mistake 2 (does matter here)

  • Now let’s check our data frame to see the one that does matter
print(data_small, n = 26)
# A tibble: 23,503 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10002           NA            6        1       NA
 3  10003           10           10        4       10
 4  10004           10           10        3       10
 5  10005            6           10        3       10
 6  10006           10            8        3       10
 7  10007            8           NA        1       NA
 8  10008            8            6        1        8
 9  10009           NA           NA        3       NA
10  10010            8            6        1        8
11  10011            8            6        3        8
12  10012           NA           NA        2       NA
13  10013            8           10        3       10
14  10014            2            6        3        6
15  10015           NA           10        3       NA
16  10016            4            6        2        6
17  10018            6            7        2        7
18  10019            8           NA        2       NA
19  10020            8            8        4        8
20  10021            8           NA        2       NA
21  10022           10            8        4       10
22  10024            6            8        2        8
23  10025            8           NA        4       NA
24  10026            7           10        3       10
25  10027           NA            6        1       NA
26  10028           NA           NA        3       NA
# ℹ 23,477 more rows
  • Hmm, that seems odd. Why would R return NA for high_exp instead of just using the 6 we do have?
    • Any thoughts?
  • Generally, R is overly-cautious when dealing with NAs to ensure you don’t accidentally drop them without realizing it
    • For example, if you were asked what the mean(c(5, 6, 4, NA)) would be, you’d probably say 5, right?
      • R is never going to just ignore the NA values like that unless we tell it to
mean(c(5, 6, 4, NA))
[1] NA
  • See, we have to explicitly tell it to remove the NA values
mean(c(5, 6, 4, NA), na.rm = T)
[1] 5
  • So in our case of trying to get the highest expectation, R doesn’t want us to forget we have NA values, so it returns NA rather than silently ignoring them.

  • Dealing with missing values is a huge topic in data analysis, and there are many ways to handle them, which is beyond the scope of this lesson

    • For now, let’s remove rows that have NA values in either x1stuedexpct or x1paredexpct or both

Dealing With Missing Values with is.na(), &, and !

  • To do this, we will use another couple of code helpers
    • is.na(x1stuedexpct) simply asks whether x1stuedexpct is NA or not
      • R doesn’t let you just say x1stuedexpct == NA
    • ! is a really helpful tool that can be used to negate or invert a command
      • !is.na(x1stuedexpct) just returns the opposite of is.na(x1stuedexpct), so it tells us the column is not NA
    • & can be useful inside conditional statements, as it means both sides must be TRUE (FYI: | means or)
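Before we use them, a quick standalone look at these helpers (toy vector for illustration):

## which elements are missing?
is.na(c(5, NA, 7))
[1] FALSE  TRUE FALSE
## negate it: which elements are NOT missing?
!is.na(c(5, NA, 7))
[1]  TRUE FALSE  TRUE

Putting those together on our data: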
data_small_cut <- data_small |>
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct))
  • So !is.na(x1stuedexpct) & !is.na(x1paredexpct) makes sure that both x1stuedexpct and x1paredexpct are not NA

  • Now what does filter() do here?

Keeping Rows Based on a Condition with filter()

  • The filter() command from tidyverse works by only keeping observations that meet the condition(s) we set
    • As in, to make it through the filter, a row must answer “yes” to “does it meet this condition?”
  • So in this case, we are keeping all rows where x1stuedexpct and x1paredexpct are not NA
    • Notice, instead of overwriting data_small we assigned this to a new object data_small_cut
      • Generally, when making substantial changes to a data set like dropping observations, we might want to be able to double-check what we did, which is easier if we make a new data object

Quick Question: A common confusion from this lesson is between filter() and select(). Can someone explain when you’d use select() over filter()?
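If it helps, here’s a quick side-by-side sketch you can run to see the difference for yourself (nothing is assigned, so nothing changes):

## select() keeps COLUMNS: one column, all rows
data_small |> select(x1region)
## filter() keeps ROWS: all columns, only rows meeting the condition
data_small |> filter(x1region == 1)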

  • Let’s check the counts of our x1stuedexpct and x1paredexpct again to see if filter() worked
data_small_cut |> count(x1stuedexpct) 
# A tibble: 10 × 2
   x1stuedexpct     n
          <dbl> <int>
 1            1    39
 2            2  1483
 3            3    83
 4            4   769
 5            5    78
 6            6  2539
 7            7   170
 8            8  3180
 9            9   127
10           10  3336
data_small_cut |> count(x1paredexpct)
# A tibble: 10 × 2
   x1paredexpct     n
          <dbl> <int>
 1            1    41
 2            2   911
 3            3   105
 4            4   825
 5            5    97
 6            6  3761
 7            7    66
 8            8  2746
 9            9    30
10           10  3222
  • Okay, so no NAs, perfect!
  • Just to be extra sure we only removed NAs, we can check the difference in the number of rows between our original data_small and data_small_cut
## what's the difference between original # of rows and current # of rows?
nrow(data_small) - nrow(data_small_cut)
[1] 11699
  • nrow() does what you’d expect: it counts the number of rows
    • You could also just check by looking at the dataframes in the environment tab, but this way leaves no room for mental math errors
  • Now let’s check our high_exp variable in the new data_small_cut
print(data_small_cut, n = 26)
# A tibble: 11,804 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10003           10           10        4       10
 3  10004           10           10        3       10
 4  10005            6           10        3       10
 5  10006           10            8        3       10
 6  10008            8            6        1        8
 7  10010            8            6        1        8
 8  10011            8            6        3        8
 9  10013            8           10        3       10
10  10014            2            6        3        6
11  10016            4            6        2        6
12  10018            6            7        2        7
13  10020            8            8        4        8
14  10022           10            8        4       10
15  10024            6            8        2        8
16  10026            7           10        3       10
17  10029            2            2        2        2
18  10030            6            8        3        8
19  10036            8            6        2        8
20  10037            8            8        3        8
21  10038            6            4        4        6
22  10039            8           10        3       10
23  10041            3            1        3        3
24  10044           10           10        3       10
25  10045            4            4        2        4
26  10049            8            8        4        8
# ℹ 11,778 more rows
  • To be more straightforward, let’s compare the counts of high_exp in data_small and data_small_cut
data_small |> count(high_exp)
# A tibble: 11 × 2
   high_exp     n
      <dbl> <int>
 1        1     3
 2        2   516
 3        3    62
 4        4   482
 5        5    59
 6        6  2177
 7        7   120
 8        8  3380
 9        9   112
10       10  4893
11       NA 11699
data_small_cut |> count(high_exp)
# A tibble: 10 × 2
   high_exp     n
      <dbl> <int>
 1        1     3
 2        2   516
 3        3    62
 4        4   482
 5        5    59
 6        6  2177
 7        7   120
 8        8  3380
 9        9   112
10       10  4893
  • Yay, all NAs are gone!

  • Is there any other way to remove NAs?

    • You know what, there is a drop_na() function!

Simply Dropping NAs

  • The tidyr package in tidyverse has a handy function drop_na()
    • Without arguments in (), it will remove any row that has NA in any column
    • Specify columns to remove rows that have NA in the designated columns
data_small_drop <- data_small |> 
  drop_na(x1stuedexpct, x1paredexpct)
  • Let’s compare our two ways of removing NAs and see if they end up being the same
    • What’s the function to check whether two objects are equal? (We used it when introducing |>)
## compare two ways of dropping NAs
all.equal(data_small_cut, data_small_drop)
[1] TRUE

Summarizing Data with summarize()

  • Okay, so we have our data selected, we made our high_exp variable, and we’ve done some work to handle missing data
  • In our week one data talk we discussed how massive tables of data are not particularly helpful for someone to read or interpret
    • We will cover how to make graphs of our data in a few weeks
    • For our final task today, we are going to make some summary tables using summarize() from the tidyverse
  • summarize() allows us to apply a summary statistic (mean, sd, median, etc.) to a column in our data
  • summarize() takes an entire data frame as an input, and spits out a small data frame with just the summary variables
    • Note: for this reason, you rarely ever want to assign <- the output of summarize() back to the main data frame object, as you’ll overwrite it
      • You can either spit the summary tables out into the console without assignment (which we will do) or if you need to use them for something else, assign them to a new object
## get average (without storing)
data_small_cut |> summarize(mean(high_exp))
# A tibble: 1 × 1
  `mean(high_exp)`
             <dbl>
1             7.99
  • See, the output is a 1x1 table with the mean expectation mean(high_exp) of 7.99, which sits almost exactly at “Complete a Master’s degree” (8)
    • Note: if we want to name the summary variable, we can name it just like we did earlier in mutate() with a single =
data_small_cut |> summarize(mean_exp = mean(high_exp))
# A tibble: 1 × 1
  mean_exp
     <dbl>
1     7.99
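A side note: summarize() isn’t limited to one statistic at a time; a quick sketch (the result column names are our own choices):

## several summary statistics in one call
data_small_cut |>
  summarize(mean_exp   = mean(high_exp),
            sd_exp     = sd(high_exp),
            median_exp = median(high_exp))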
  • But, that wasn’t quite the question we were asked
    • We were asked if it varied by region…
      • For time’s sake, we’re going to tell you the region variable is x1region and that it splits the US into 4 Census regions

Grouping Data with group_by()

  • The group_by() function, following the tidyverse principle of intuitive naming, groups the data (and any downstream output) by the variable(s) you specify
    • So, since we want to calculate the average high expectation by region, we group_by(x1region)
      • Since we only need it for our summarize(), we just add it to the pipe
        • If you wanted to save the data in its group_by()-ed state, you could assign it to something
## get grouped average
data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp))
# A tibble: 4 × 2
  x1region mean_exp
     <dbl>    <dbl>
1        1     8.13
2        2     7.88
3        3     8.06
4        4     7.86
  • Success! While expectations are similar across the country, there’s some variance by region
    • While there are a few things we could do to make this a little fancier (e.g., changing the region numbers to names, formatting the table, etc.), we have answered our question and have clear documentation of how we got here, so let’s call that a win!

Saving Data with write_csv()

  • Sometimes we want to be able to access objects from scripts without having to re-run the whole thing
    • Remember: one of the main advantages of R is that the original data file we read in remains untouched
  • To do this, we want to write_ a new csv() file, containing our modified data
    • unlike read_csv() which only needed a file name/path, write_csv() needs to know what you’re trying to save and the file name/path you want to save it to
      • The only way you can overwrite or change the original data is by saving to the same file name as the original data, so NEVER do that!
  • Since we didn’t assign our summary table to anything, we can just add write_csv() to the end of the pipe and add a path to where you want to save it
    • If you want to save a data frame you already assigned to an object write_csv(<object>, "path") would work just fine!
## write with useful name

data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp)) |>
  write_csv("data/region-expects.csv")

Phew!

Appendix: All at Once

We went through that piece by piece to demonstrate each function, but, there’s no reason we can’t just |> pipe it all together

## Let's redo the analysis above, but with a fully chained set of
## functions.

## start with original dataset
data |>
  ## select columns we want
  select(stu_id, x1stuedexpct, x1paredexpct, x1region) |>
  ## If expectation is -8, -9, or 11, make it NA
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct)) |>
  ## Make a new variable called high_exp that is the higher of parent and student exp
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct)) |>
  ## Drop if either or both parent or student exp is NA
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct)) |> 
  ## Group the results by region
  group_by(x1region) |>
  ## Get the mean of high_exp (by region)
  summarize(mean_exp = mean(high_exp)) |>
  ## Write that to a .csv file
  write_csv("data/region-expects-chain.csv")

To double-check, let’s confirm these files are the same…

non_chain <- read_csv("data/region-expects.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chain <- read_csv("data/region-expects-chain.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all.equal(non_chain, chain)
[1] TRUE

WooHoo!

Final notes

  • This rather lengthy lesson has thrown you in the (medium) deep end of the coding pool
    • By no means are you expected to get everything we just did
      • We will continue to revisit all these commands throughout the class; by the end of the semester, they will be second nature!
  • We also saw how to use code to answer a realistic question we might be asked in a data management job, a translation skill that will prove invaluable later on!
    • We had to plan out steps and then make some adjustments along the way (e.g., our NA issues), that’s all part of the process!

Assignment Template

Use the hsls-small.csv data set from the lesson to answer the following questions

  • Throughout, you should account for missing values by dropping them

Hint: You’re going to need the code book to identify the necessary variables

Question One

a) What is the average (mean) standardized math score?

b) How does this differ by gender?

Question Two

a) Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (Include a description of what that category represents)

Question Three

a) Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?

b) How does this differ by region?

Question Four

a) What percentage of students ever attended a post-secondary institution (as of the data collection in February 2016)?

b) Give the cross tabulation for both family incomes above/below $35,000 and region

  • This means you should have percentages for 8 groups: above/below $35,000 within each region

Hint: group_by() can be given more than one group
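As a purely hypothetical sketch (var_one and var_two are placeholder names, not real HSLS09 variables):

## grouping by two variables gives one summary row per combination
data |>
  group_by(var_one, var_two) |>
  summarize(n = n())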

Submission

Once complete, turn in the .qmd file (it must render/run) and the rendered PDF to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before the next lesson on Wednesday, in line with the grading policy outlined in the syllabus.