I: Enter the tidyverse

Lesson
Assignment

Today we take our first full dive into R. Though every data wrangling / analysis is unique, the best way to learn and practice R is to answer a question using data, which is how this lesson is set up
By the end of this lesson, you will have read in a data set, lightly cleaned it, produced results, and saved your findings in a new file
- As part of this process, you will have practiced translating a research question into data analysis steps, which is a skill every bit as important as technical sophistication with a statistical language like R

Re: Urgent Data Question from the Provost

Through today’s lesson, we will explore some of the basics of data wrangling
- But to make it more realistic, we will be doing so to answer a realistic question you may be asked by your advisor or supervisor

Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.

A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
- Instead, it’s up to you to
  1. translate a research question into the discrete steps coding steps necessary to provide an answer, and then
  2. translate the answer such that everyone understands what you’ve found

What we need to do is some combination of the following:

Read in the data
Select the variables we need
Mutate a new value that’s the higher of student and parental degree expectations
Filter out observations with missing degree expectation values
Summarize the data within region to get average degree expectation values
Write out the results to a file so we have it for later

Let’s do it!

Throughout this lesson (and class), we are going to lean heavily on the tidyverse collection of packages
- If you don’t already have this installed, use install.packages("tidyverse")

library(tidyverse)

Warning: package 'purrr' was built under R version 4.4.1

Warning: package 'lubridate' was built under R version 4.4.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Check working directory

This script — like the one from the organizing lesson — assumes that the class folder is the working directory and that the required data file is in the data sub-directory
If you need a refresher on setting the working directory, see the prior lesson.
Notice that we’re not setting (i.e. hard coding) the working directory in the script. That would not work well for sharing the code. Instead, we rely on relative paths once you know where you need to be and have gotten there

Reading Data in with `read_csv()`

For this lesson, we’ll use a subset of the High School Longitudinal Study of 2009 (HSLS09), an IES /NCES data set that features:

Nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016

Students followed throughout secondary and postsecondary years

Surveys of students, their parents, math and science teachers, school administrators, and school counselors

A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades

10 state representative data sets

If you are interested in using HSLS09 for future projects, DO NOT rely on this subset. Be sure to download the full data set with all relevant variables and weights if that’s the case. But for our purposes in this lesson, it will work just fine.
Throughout, we’ll need to consult the code book. An online version can be found at this link.

## data are CSV, so we use read_csv() from the readr library
data <- read_csv("data/hsls-small.csv")

Rows: 23503 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): stu_id, x1sex, x1race, x1stdob, x1txmtscor, x1paredu, x1hhnumber, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## alternatively, you can also use read_csv(file.path("data", "hsls-small.csv"))

You may notice the read_csv() prints out information about the data just read in. Nothing is wrong! The read_csv() function, like many other functions in the tidyverse, assumes you’d rather have more rather than less information and acts accordingly
Here we assign our data to an object called data
- You can call it whatever you want
Up to now, this should all seem fairly consistent with last week’s lesson. This is just the set up, now it’s time to dive in!

Native Pipe Operator `|>` in R

The pipe is one of the things that makes R code more intuitive than other programming languages, as it allows us write code in the order we think about it, passing it one from one function to another, rather than nesting it like traditional code would be written

A Brief History of the R Pipe

badge

The pipe was originally a tidyverse invention, and used %>% symbol, which is probably still a common pipe you see “in the wild”
The pipe we are using was brought into the Vanilla version of R a few years ago as |> (the “native pipe”)
The reason for the change is some benefits that are beyond the scope of this class, but you just need to know that |> and %>% are essentially the same thing
- The default shortcut for pipe (ctrl + shift + m) will generate %>%, but you can change it to |> by going to Tools > Global Options > Code > Use native pipe operator, |>

How the Pipe Works

For this example don’t worry about the actual processes (we will go over them more below), just look at how much more intuitive the code is with |>s.

First, let’s say we want to take the data we just read in and select the x1txmtscor (math test scores) column

Without `|>`

## Without |>
select(data, x1txmtscor)

With `|>`

## With |>
data |> select(x1txmtscor)

Neither is that confusing… But, what if we want to take that output and select only students with a math score above 50?

Without `|>`

## Without |>
filter(select(data, x1txmtscor), x1txmtscor > 50)

With `|>`

## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50)

See how the non-piped version is getting messy? Let’s add one more level to really make the point, creating a new variable that is the square root of the test score

Without `|>`

## Without |>
mutate(filter(select(data, x1txmtscor), x1txmtscor > 50), square_root = sqrt(x1txmtscor))

With `|>`

## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50) |> mutate(square_root = sqrt(x1txmtscor))

As we are getting longer, let’s use a new line for each pipe, just to make it even clearer (it makes no difference to R)

## Best to use  a new line for each pipe when code gets longer
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))

Even though we haven’t covered any of these commands yet, we can see that the |> is pretty easy to know roughly what’s going on. Whereas, the traditional nested way gets really tricky beyond a couple of commands.

Of course, if you wanted, you could do each step separately, constantly assigning and overwriting an object like below

## Without the |>, we could technically break it down step by step
temp <- select(data, x1txmtscor)
temp <- filter(temp, x1txmtscor > 50)
temp <- mutate(temp, square_root = sqrt(x1txmtscor))
temp

But it’s less intuitive than simply piping the results.

If we do want to see step-by-step output in a piped command, you can either
1. Run the code as you write it line-by-line
2. Highlight sections of the piped code to run up to a point

Assigning Output from Pipes

Assigning output from pipes is the same as we have covered a few times, we use a <- to pass it backwards to an object (in this case we called that object data_backward_pass)

## Always assign backwards
data_backward_pass <- data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))

If you want to think of it this way, we are effectively continuing to pass it forward like this…

## You can think of the assignment as a continuation of the pipe like this
## but don't write it this way, it's then hard to find what you called something later
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor)) ->
  data_forward_pass

That’s how pipes and assignment work together, and you can see the outputs are all.equal()

## Checking they are the same
all.equal(data_backward_pass, data_forward_pass)

[1] TRUE

However, you really shouldn’t write code like this, as although it runs, it’s then hard to later find where you created data_forward_pass.
Starting the string of code with where you store the result is MUCH clearer.
- But hopefully it will help some of you understand how |> and <- work together

Basic Tidyverse Commands

Now we have the pipe |> covered, it’s time to dig into some basic data wrangling commands

Selecting Variables/Columns with `select()`

Often data sets contain hundreds, thousands, or even tens of thousands of variables, when we are only interested in a handful. Our first tidyverse command select() helps us deal with this, by, as you may have guessed, select()-ing the variables we want
Since we are going to pipe our R commands, we start with our data then pipe it into the select() command
In the select() command, we list out the variables we want
- stu_id, x1stuedexpct, x1paredexpct, x1region
  - Notice: in this (and most tidyverse) command(s) we don’t have to use “quotes” around the variable names. A common error message when you did need to put something in quotes is Error: object <thing you should have "quoted"> not found

data |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)

# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
# ℹ 23,493 more rows

Quick question: if we want to use this reduced dataset going forward, what should we do?

That’s right, assign it to an object! Let’s call that data_small

data_small <- data |> select(stu_id, x1stuedexpct, x1paredexpct, x1region)

Our next step is to create a variable that’s the higher of student and parent expectations. Sounds simple enough, but first, we have a problem…
- Can anyone guess what it is?

Understanding our data

First things first, however, we need to check the code book to see what the numerical values for our two education expectation variables represent. To save time, we’ve copied them here:

`x1stuedexpct`

How far in school 9th grader thinks he/she will get

value	label
1	Less than high school
2	High school diploma or GED
3	Start an Associate’s degree
4	Complete an Associate’s degree
5	Start a Bachelor’s degree
6	Complete a Bachelor’s degree
7	Start a Master’s degree
8	Complete a Master’s degree
9	Start Ph.D/M.D/Law/other prof degree
10	Complete Ph.D/M.D/Law/other prof degree
11	Don’t know
-8	Unit non-response

`x1paredexpct`

How far in school parent thinks 9th grader will go

value	label
1	Less than high school
2	High school diploma or GED
3	Start an Associate’s degree
4	Complete an Associate’s degree
5	Start a Bachelor’s degree
6	Complete a Bachelor’s degree
7	Start a Master’s degree
8	Complete a Master’s degree
9	Start Ph.D/M.D/Law/other prof degree
10	Complete Ph.D/M.D/Law/other prof degree
11	Don’t know
-8	Unit non-response
-9	Missing

The good news is that the categorical values are the same for both variables (meaning we can make an easy comparison) and move in a logical progression
The bad news is that we have three values — -8, -9, and 11 — that we need to deal with so that the averages we compute later represent what we mean

Exploring Catagorical Variables with `count()`

First, let’s see how many observations are affected by these values using count()

Notice that we don’t assign to a new object; this means we’ll see the result in the console, but nothing in our data or object will change

## see unique values for student expectation
data_small |> count(x1stuedexpct)

# A tibble: 12 × 2
   x1stuedexpct     n
          <dbl> <int>
 1           -8  2059
 2            1    93
 3            2  2619
 4            3   140
 5            4  1195
 6            5   115
 7            6  3505
 8            7   231
 9            8  4278
10            9   176
11           10  4461
12           11  4631

## see unique values for parental expectation
data_small |> count(x1paredexpct)

# A tibble: 13 × 2
   x1paredexpct     n
          <dbl> <int>
 1           -9    32
 2           -8  6715
 3            1    55
 4            2  1293
 5            3   149
 6            4  1199
 7            5   133
 8            6  4952
 9            7    76
10            8  3355
11            9    37
12           10  3782
13           11  1725

Dealing with -8 and -9 is straight forward — we’ll convert it missing.
- In R, missing values are technically stored as NA.
  - Not all statistical software uses the same values to represent missing values (for example, STATA uses a dot .)
- NCES has decided to represent missing values as a limited number of negative values. In this case, -8 and -9 represent missing values
Note: how to handle missing values is a very important topic, one we could spend all semester discussing
- For now, we are just going to drop observations with missing values; but be forewarned that how you handle missing values can have real ramifications for the quality of your final results
- In real research, a better approach is usually to impute missing values, but that is beyond our scope right now
Deciding what to do with 11 is a little trickier. While it’s not a missing value per se, it also doesn’t make much sense in its current ordering, that is, to be “higher” than completing a professional degree
- For now, we’ll make a decision to convert these to NA as well, effectively deciding that an answer of “I don’t know” is the same as missing an answer
So first step: convert -8, -9, and 11 in both variables to NA. For this, we’ll use the mutate() and ifelse() functions

Conditional Values with `ifelse()`

ifelse() is a really common command in R and has three parts
1. statement that can be TRUE or FALSE
2. What to return if the statement is TRUE
3. else what to return when the statement is FALSE

Modifying an Existing Variable with `mutate()`

When we want to add variables and change existing ones, we can use the mutate() function
- The basic idea of mutate commands is mutate(<where to go> = <what to go there>)
  - This is probably the trickiest function we cover today to understand
- New variables are created if you provide <where to go> a new variable name (or nothing)
- Variables are modified if you provide <where to go> an existing variable name
- <what to go there> can as simple as a single number all the way to a chain of piped functions, so long as there’s a clear answer for every row
In this case, we want to modify x1stuedexpct to be NA when x1stuedexpct is -8, -9, or 11

Now, we have three values we want to covert to NA, so we could do them one-at-a-time, like below

data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct == -8, NA, x1stuedexpct))

Let’s walk through this code

Assign the results back to data_small (which will overwrite our previous data_small)
Take data_small and pipe |> it into mutate()
Inside mutate() assign our results to x1stuedexpct (which will modify the existing variable)
Modify x1stuedexpct with an ifelse() statement, which remember has 3 parts

statement which is asking “is x1stuedexpct == -8”? - Notice == means “is equal to”, while = means “assign to”. Yes it’s confusing, but you’ll get it over time!
if that statement is true, make it NA
else (if the statement is false) return the original variable x1stuedexpct

Okay, make sense? Let’s see what we just did (look at row 26)

print(data_small, n = 26)

# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           11            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           11        1
 8  10008            8            6        1
 9  10009           11           11        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           11           11        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           11           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           -8        2
19  10020            8            8        4
20  10021            8           11        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           -8        4
24  10026            7           10        3
25  10027           11            6        1
26  10028           NA           -8        3
# ℹ 23,477 more rows

This is fine, but we have to do it 3 times for both parent and student expectation
- Instead, can anyone think (not in R code, just in terms of logic) how we could change our statement piece of the ifelse() to be more efficient?

Being Efficient with `%in%` and `c()`

What we can do, is group -8, -9, and 11 together into a list using c()
- c() is a very common function in R used to create a list
Then, we can use the %in% operator to ask if that result is any of the numbers in that list
- This keeps our code shorter and easier to read

data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct))

The code now works just as above, but instead of asking if x1stuedexpct is equal to -8, it asks if it’s in the list of -8, -9, and 11, then does the same for parental expectations!
- Let’s view those first 26 rows again to see what we did

print(data_small, n = 26)

# A tibble: 23,503 × 4
   stu_id x1stuedexpct x1paredexpct x1region
    <dbl>        <dbl>        <dbl>    <dbl>
 1  10001            8            6        2
 2  10002           NA            6        1
 3  10003           10           10        4
 4  10004           10           10        3
 5  10005            6           10        3
 6  10006           10            8        3
 7  10007            8           NA        1
 8  10008            8            6        1
 9  10009           NA           NA        3
10  10010            8            6        1
11  10011            8            6        3
12  10012           NA           NA        2
13  10013            8           10        3
14  10014            2            6        3
15  10015           NA           10        3
16  10016            4            6        2
17  10018            6            7        2
18  10019            8           NA        2
19  10020            8            8        4
20  10021            8           NA        2
21  10022           10            8        4
22  10024            6            8        2
23  10025            8           NA        4
24  10026            7           10        3
25  10027           NA            6        1
26  10028           NA           NA        3
# ℹ 23,477 more rows

Just to be doubly-sure, lets check count() again

data_small |> count(x1stuedexpct)

# A tibble: 11 × 2
   x1stuedexpct     n
          <dbl> <int>
 1            1    93
 2            2  2619
 3            3   140
 4            4  1195
 5            5   115
 6            6  3505
 7            7   231
 8            8  4278
 9            9   176
10           10  4461
11           NA  6690

data_small |> count(x1paredexpct)

# A tibble: 11 × 2
   x1paredexpct     n
          <dbl> <int>
 1            1    55
 2            2  1293
 3            3   149
 4            4  1199
 5            5   133
 6            6  4952
 7            7    76
 8            8  3355
 9            9    37
10           10  3782
11           NA  8472

Success!

Creating a New Variable with `mutate()`

So, with that tangent out of the way, let’s get back to our original task, creating a new variable that is the highest of parental and student expectations
To make a new variable which is the highest of two variables, we can use our friends mutate() and ifelse() some more

data_small <- data_small |>
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct))

That code is almost what we want to do
- If x1stuedexpct is higher then take that, if not, take x1paredexpct
  - There’s two things we haven’t fully accounted for though…
    - One doesn’t actually matter here, but might in other circumstances
    - One definitely matters here
      - Without scrolling past the duck, does anyone know what they might be?

“Rubber duck png sticker, transparent” is marked with CC0 1.0.

Sloppy Mistake 1 (doesn’t matter here)

We were a little sloppy with the statement piece, we just asked if x1stuedexpct was greater than x1paredexpct or not
- If we were being more careful, we might have said “greater than or equal to”
  - Why doesn’t this matter in this context, and when might it matter?

Sloppy Mistake 2 (does matter here)

Now let’s check our data frame to see the one that does matter

print(data_small, n = 26)

# A tibble: 23,503 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10002           NA            6        1       NA
 3  10003           10           10        4       10
 4  10004           10           10        3       10
 5  10005            6           10        3       10
 6  10006           10            8        3       10
 7  10007            8           NA        1       NA
 8  10008            8            6        1        8
 9  10009           NA           NA        3       NA
10  10010            8            6        1        8
11  10011            8            6        3        8
12  10012           NA           NA        2       NA
13  10013            8           10        3       10
14  10014            2            6        3        6
15  10015           NA           10        3       NA
16  10016            4            6        2        6
17  10018            6            7        2        7
18  10019            8           NA        2       NA
19  10020            8            8        4        8
20  10021            8           NA        2       NA
21  10022           10            8        4       10
22  10024            6            8        2        8
23  10025            8           NA        4       NA
24  10026            7           10        3       10
25  10027           NA            6        1       NA
26  10028           NA           NA        3       NA
# ℹ 23,477 more rows

Hmm, that seems odd, why would R consider NA to be greater than 6?
- Any thoughts?
Generally, R is overly-cautious when dealing with NAs to ensure you don’t accidentally drop them without realizing it
- For example, if you were asked what the mean(c(5, 6, 4, NA)) would be, you’d probably say 5, right?
  - R is never going to just ignore the NA values like that unless we tell it to

mean(c(5, 6, 4, NA))

[1] NA

See, what have to explicitly tell it to remove the NA values

mean(c(5, 6, 4, NA), na.rm = T)

[1] 5

So in our case of trying to get the highest expectation, R doesn’t want us to forget we have NA values, so it throws them at us.
Dealing with missing values is a huge topic in data analysis, and there are many ways to handle them, which is beyond the scope of this lesson
- For now, let’s remove rows that have NA values in either x1stuedexpct or x1paredexpct or both

Dealing With Missing Values with `is.na()`, `&`, and `!`

To do this, we will add another couple of code helpers -is.na(x1stuedexpct) simply asks if the x1stuedexpct is NA or not
- R doesn’t let you just say x1stuedexpct == NA -! is really helpful tool, which can be used to negate or invert a command
  - !is.na(x1stuedexpct) just returns the opposite of is.na(x1stuedexpct) so it tells us that the column is not NA -& can be useful inside conditional statements, as it means both must be TRUE (FYI: | means or)

data_small_cut <- data_small |>
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct))

So !is.na(x1stuedexpct) & !is.na(x1paredexpct) makes sure that both x1stuedexpct and x1paredexpct are not NA
Now what does filter() do here?

Keeping Rows Based on a Condition with `filter()`

The filter() command from tidyverse works by only keeping observations that meet the condition(s) we set
- As in, to make it through the filter, a row must answer “yes” to “does it meet this condition?”
So in this case, we are keeping all rows where x1stuedexpct and x1paredexpct are not NA
- Notice, instead of overwriting data_small we assigned this to a new object data_small_cut
  - Generally, when making substantial changes to a data set like dropping observations, we might want to be able to double check what we did, which is easier if we make a new data

Quick Question: A common confusion from this lesson is between filter() and select(). Can someone explain when you’d use select() over filter()?

Let’s check the counts of our x1stuedexpct and x1paredexpct again to see if filter() worked

data_small_cut |> count(x1stuedexpct)

# A tibble: 10 × 2
   x1stuedexpct     n
          <dbl> <int>
 1            1    39
 2            2  1483
 3            3    83
 4            4   769
 5            5    78
 6            6  2539
 7            7   170
 8            8  3180
 9            9   127
10           10  3336

data_small_cut |> count(x1paredexpct)

# A tibble: 10 × 2
   x1paredexpct     n
          <dbl> <int>
 1            1    41
 2            2   911
 3            3   105
 4            4   825
 5            5    97
 6            6  3761
 7            7    66
 8            8  2746
 9            9    30
10           10  3222

Okay, so no NAs, perfect!
Just to be extra sure we only removed NAs, we can check the difference in how many rows our original data_small has with data_small_cut

## what's the difference between original # of rows and current # or rows?
nrow(data_small) - nrow(data_small_cut)

[1] 11699

nrow() does what you’d expect, counts the number of rows
- You could also just check by looking at the dataframes in the environment tab, but this way leaves no room for mental math errors
Now let’s check our high_exp variable in the new data_small_cut

print(data_small_cut, n = 26)

# A tibble: 11,804 × 5
   stu_id x1stuedexpct x1paredexpct x1region high_exp
    <dbl>        <dbl>        <dbl>    <dbl>    <dbl>
 1  10001            8            6        2        8
 2  10003           10           10        4       10
 3  10004           10           10        3       10
 4  10005            6           10        3       10
 5  10006           10            8        3       10
 6  10008            8            6        1        8
 7  10010            8            6        1        8
 8  10011            8            6        3        8
 9  10013            8           10        3       10
10  10014            2            6        3        6
11  10016            4            6        2        6
12  10018            6            7        2        7
13  10020            8            8        4        8
14  10022           10            8        4       10
15  10024            6            8        2        8
16  10026            7           10        3       10
17  10029            2            2        2        2
18  10030            6            8        3        8
19  10036            8            6        2        8
20  10037            8            8        3        8
21  10038            6            4        4        6
22  10039            8           10        3       10
23  10041            3            1        3        3
24  10044           10           10        3       10
25  10045            4            4        2        4
26  10049            8            8        4        8
# ℹ 11,778 more rows

To be more straightforward, let’s compare the counts of high_exp in data_small and dataf_small_cut

data_small |> count(high_exp)

# A tibble: 11 × 2
   high_exp     n
      <dbl> <int>
 1        1     3
 2        2   516
 3        3    62
 4        4   482
 5        5    59
 6        6  2177
 7        7   120
 8        8  3380
 9        9   112
10       10  4893
11       NA 11699

data_small_cut |> count(high_exp)

# A tibble: 10 × 2
   high_exp     n
      <dbl> <int>
 1        1     3
 2        2   516
 3        3    62
 4        4   482
 5        5    59
 6        6  2177
 7        7   120
 8        8  3380
 9        9   112
10       10  4893

Yay, all NAs are gone!
Is there any other way to remove NAs?
- You know what, there is a drop_na() function!

Simply Dropping NAs

The tidyr package in tidyverse has a handy function drop_na()
- Without arguments in (), it will remove any row that has NA in any column
- Specify columns to remove rows that have NA in the designated columns

data_small_drop <- data_small |> 
  drop_na(x1stuedexpct, x1paredexpct)

Let’s compare our two ways of removing NAs and see if they end up being the same
- What’s the function to check if things are equal (we used it when introducing |> )?

## compare two ways of dropping NAs
all.equal(data_small_cut, data_small_drop)

[1] TRUE

Summarizing Data with `summarize()`

Okay, so we have our data selected, we made our high_exp variable, and we’ve done some work to handle missing data
For our final task today, we are going to make some summary tables using summarize() from the tidyverse
summarize() allows us to apply a summary statistic (mean, sd, median, etc.) to a column in our data
summarize() takes an entire data frame as an input, and spits out a small data frame with the just the summary variables
- Note: for this reason, you rarely ever want to assign <- the output of summarize() back to the main data frame object, as you’ll overwrite it
  - You can either spit the summary tables out into the console without assignment (which we will do) or if you need to use them for something else, assign them to a new object

## get average (without storing)
data_small_cut |> summarize(mean(high_exp))

# A tibble: 1 × 1
  `mean(high_exp)`
             <dbl>
1             7.99

See, the output is a 1x1 table with the mean expectation mean(high_exp) of 7.99, almost about completing a master’s degree
- Note: if we want to name the summary variable, we can name it just like we did earlier in mutate() with a single =

data_small_cut |> summarize(mean_exp = mean(high_exp))

# A tibble: 1 × 1
  mean_exp
     <dbl>
1     7.99

But, that wasn’t quite the question we were asked
- We were asked if it varied by region…
  - For time’s sake, we’re going to tell you the region variable is x1region and splits the US in 4 Census regions

Grouping Data with `group_by()`

The group_by() function, following the tidyverse principle of intuitive naming, groups the data and outputs by the variable(s) you say
- So, since we want to calculate the average high expectation by region, we group_by(x1region)
  - Since we just want it for our summarize(), we just add it to the pipe
    - If you wanted to save the data in it’s group_by()-ed state, you could assign it to something

## get grouped average
data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp))

# A tibble: 4 × 2
  x1region mean_exp
     <dbl>    <dbl>
1        1     8.13
2        2     7.88
3        3     8.06
4        4     7.86

Success! While expectations are similar across the country, there’s some variance by region
- While there are few things we could do to make this a little fancier (e.g., changing the region numbers to names, formatting the table, etc.) we have answered our question, and have clear documentation of how we got here, so let’s call that a win!

Saving Data with `write_csv()`

Sometimes we want to be able to access objects from scripts without having to re-run the whole thing
- Remember: one of the main advantages of R is the data we read in is untouched
To do this, we want to write_ a new csv() file, containing our modified data
- unlike read_csv() which only needed a file name/path, write_csv() needs to know what you’re trying to save and the file name/path you want to save it to
  - The only way you can overwrite or change the original data is by saving to the same file name as the original data, so NEVER do that!
Since we didn’t assign our summary table to anything, we can just add write_csv() to the end of the pipe and add a path to where you want to save it
- If you want to save a data frame you already assigned to an object write_csv(<object>, "path") would work just fine!

## write with useful name

data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp)) |>
  write_csv("data/region-expects.csv")

Phew!

Appendix: All at Once

We went through that piece by piece to demonstrate each function, but, there’s no reason we can’t just |> pipe it all together

## Let's redo the analysis above, but with a fully chained set of
## functions.

## start with original dataset
data |>
  ## select columns we want
  select(stu_id, x1stuedexpct, x1paredexpct, x1region) |>
  ## If expectation is -8, -9. or 11, make it NA
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% list(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% list(-8, -9, 11), NA, x1paredexpct)) |>
  ## Make a new variable called high_exp that is the higher or parent and student exp
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct)) |>
  ## Drop if either or both parent or student exp is NA
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct)) |> 
  ## Group the results by region
  group_by(x1region) |>
  ## Get the mean of high_exp (by region)
  summarize(mean_exp = mean(high_exp)) |>
  ## Write that to a .csv file
  write_csv("data/region-expects-chain.csv")

To double check, let’s just check these are the same…

non_chain <- read_csv("data/region-expects.csv")

Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

chain <- read_csv("data/region-expects-chain.csv")

Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

all.equal(non_chain, chain)

[1] TRUE

WooHoo!

Final notes

This rather lengthy lesson has thrown you in the (medium) deep end of the coding pool
- By no means are you expected to get everything we just did
  - We will continue to revisit all these commands throughout the class, by the end of the semester, they will be second nature!
We also saw how to use code to answer a realistic question we might be asked in a data management job, a translation skill that will prove invaluable later on!
- We had to plan out steps and then make some adjustments along the way (e.g., our NA issues), that’s all part of the process!

Assignment Template

Use the hsls_small.csv data set from the lesson answer the following questions

Throughout, you should account for missing values by dropping them

Hint: You’re going to need the code book to identify the necessary variables

Question One

a) What is the average (mean) standardized math score?

b) How does this differ by gender?

Question Two

a) Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (Include a description what that category represents)

Question Three

a) Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?

b) How does this differ by region?

Question Four

a) What percentage of students ever attended a post-secondary institution? (as of the data collection in february 2016)

b) Give the cross tabulation for both family incomes above/below $35,000 and region

This means you should have percentages for 8 groups: above/below $35,000 within each region

Hint: group_by() can be given more than one group

Submission

Once complete turn in the .qmd file (it must render/run) and the PDF rendered to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.

Solution

R Solution Code

## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Data Wrangling I Solution]
##' [INIT: Jan 28 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------

## note to matt

setwd(this.path::here())

## ---------------------------
##' [Libraries]
## ---------------------------

library(tidyverse)

## ---------------------------
##' [Input]
## ---------------------------

df <- read_csv(file.path("data", "hsls-small.csv"))

## ---------------------------
##' [Q1]
## ---------------------------

## Part One
df |>
  filter(!x1txmtscor %in% c(-8, -9)) |>
  summarize(mean = mean(x1txmtscor))

## Part Two
df |>
  filter(!x1txmtscor %in% c(-8, -9),
         x1sex != -9) |>
  group_by(x1sex) |>
  summarize(mean = mean(x1txmtscor))

## ---------------------------
##' [Q2]
## ---------------------------

df |>
  filter(x1poverty185 == 1,
         !x1famincome %in% c(-8,-9)) |>
  summarize(med_inc_cat = median(x1famincome))

print("Median income category in 2, which represents Family income > $15,000 and <= $35,000")

## ---------------------------
##' [Q3]
## ---------------------------

## Simplified
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  count(x4hscompstat) |>
  mutate(perc = n / sum(n) * 100)

## Part One
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  summarize(ged = sum(x4hscompstat == 2),
            total = sum(x4hscompstat %in% c(1,2)),
            perc = ged/total*100)

## Part Two
## Simplified
df |>
  filter(x4hscompstat %in% c(1,2)) |>
  group_by(x1region) |>
  count(x4hscompstat) |>
  mutate(perc = n / sum(n) * 100) |>
  filter(x4hscompstat == 2)

df |>
  filter(x4hscompstat %in% c(1,2)) |>
  group_by(x1region) |>
  summarize(ged = sum(x4hscompstat == 2),
            total = sum(x4hscompstat %in% c(1,2)),
            perc = ged/total*100)

## ---------------------------
##' [Q4]
## ---------------------------

## Part One
df |>
  filter(x4evratndclg != -8) |>
  count(x4evratndclg) |>
  mutate(perc = n / sum(n) * 100)


df |>
  filter(x4evratndclg != -8) |>
  summarize(college = sum(x4evratndclg == 1),
            total = sum(x4evratndclg %in% c(0, 1)),
            perc = college/total*100)

## Part Two
df |>
  filter(x4evratndclg != -8,
         !x1famincome %in% c(-8, -9)) |>
  mutate(below_35k = ifelse(x1famincome %in% c(1,2), 1, 0)) |>
  group_by(x1region, below_35k) |>
  count(x4evratndclg) |>
  mutate(perc = n / sum(n) * 100) |>
  filter(x4evratndclg == 1)



df |>
  filter(x4evratndclg != -8,
         !x1famincome %in% c(-8, -9)) |>
  mutate(below_35k = ifelse(x1famincome %in% c(1,2), 1, 0)) |>
  group_by(x1region, below_35k) |>
  summarize(college = sum(x4evratndclg == 1),
            total = sum(x4evratndclg %in% c(0, 1)),
            perc = college/total*100)

## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------

Re: Urgent Data Question from the Provost

Check working directory

Reading Data in with read_csv()

Native Pipe Operator |> in R

A Brief History of the R Pipe

How the Pipe Works

Without |>

With |>

Without |>

With |>

Without |>

With |>

Assigning Output from Pipes

Basic Tidyverse Commands

Selecting Variables/Columns with select()

Understanding our data

x1stuedexpct

x1paredexpct

Exploring Catagorical Variables with count()

Conditional Values with ifelse()

Modifying an Existing Variable with mutate()

Being Efficient with %in% and c()

Creating a New Variable with mutate()

Sloppy Mistake 1 (doesn’t matter here)

Sloppy Mistake 2 (does matter here)

Dealing With Missing Values with is.na(), &, and !

Keeping Rows Based on a Condition with filter()

Simply Dropping NAs

Summarizing Data with summarize()

Grouping Data with group_by()

Saving Data with write_csv()

Appendix: All at Once

Final notes

Question One

Question Two

Question Three

Question Four

Submission

Solution

Reading Data in with `read_csv()`

Native Pipe Operator `|>` in R

Without `|>`

With `|>`

Without `|>`

With `|>`

Without `|>`

With `|>`

Selecting Variables/Columns with `select()`

`x1stuedexpct`

`x1paredexpct`

Exploring Catagorical Variables with `count()`

Conditional Values with `ifelse()`

Modifying an Existing Variable with `mutate()`

Being Efficient with `%in%` and `c()`

Creating a New Variable with `mutate()`

Dealing With Missing Values with `is.na()`, `&`, and `!`

Keeping Rows Based on a Condition with `filter()`

Summarizing Data with `summarize()`

Grouping Data with `group_by()`

Saving Data with `write_csv()`