Today we take our first full dive into R. Though every data wrangling / analysis project is unique, the best way to learn and practice R is to answer a question using data, which is how this lesson is set up
By the end of this lesson, you will have read in a data set, lightly cleaned it, produced results, and saved your findings in a new file
As part of this process, you will have practiced translating a research question into data analysis steps, which is a skill every bit as important as technical sophistication with a statistical language like R
Re: Urgent Data Question from the Provost
Through today’s lesson, we will explore some of the basics of data wrangling
But to make it concrete, we will do so by answering a realistic question you may be asked by your advisor or supervisor
Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.
A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
Instead, it’s up to you to
translate a research question into the discrete coding steps necessary to provide an answer, and then
translate the answer such that everyone understands what you’ve found
What we need to do is some combination of the following:
Read in the data
Select the variables we need
Mutate a new value that’s the higher of student and parental degree expectations
Filter out observations with missing degree expectation values
Summarize the data within region to get average degree expectation values
Write out the results to a file so we have it for later
Let’s do it!
Throughout this lesson (and class), we are going to lean heavily on the tidyverse collection of packages
If you don’t already have this installed, use install.packages("tidyverse")
library(tidyverse)
Warning: package 'lubridate' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Check working directory
This script — like the one from the organizing lesson — assumes that the class folder is the working directory and that the required data file is in the data sub-directory
If you need a refresher on setting the working directory, see the prior lesson.
Notice that we’re not setting (i.e., hard coding) the working directory in the script. That would not work well for sharing the code. Instead, we rely on relative paths once you have navigated to where you need to be
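If you are unsure whether you are in the right place, a quick sanity check (our suggestion, not part of the original lesson) is:

```r
## where does R think it is?
getwd()

## can R see the data file from here? TRUE means you're in the right place
file.exists("data/hsls-small.csv")
```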
Nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016
Students followed throughout secondary and postsecondary years
Surveys of students, their parents, math and science teachers, school administrators, and school counselors
A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades
10 state-representative data sets
If you are interested in using HSLS09 for future projects, DO NOT rely on this subset. Be sure to download the full data set with all relevant variables and weights if that’s the case. But for our purposes in this lesson, it will work just fine.
## data are CSV, so we use read_csv() from the readr library
data <- read_csv("data/hsls-small.csv")
Rows: 23503 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): stu_id, x1sex, x1race, x1stdob, x1txmtscor, x1paredu, x1hhnumber, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## alternatively, you can also use read_csv(file.path("data", "hsls-small.csv"))
You may notice that read_csv() prints out information about the data just read in. Nothing is wrong! The read_csv() function, like many other functions in the tidyverse, assumes you’d rather have more information than less and acts accordingly
Here we assign our data to an object called data
You can call it whatever you want
Up to now, this should all seem fairly consistent with last week’s lesson. This is just the set up, now it’s time to dive in!
Native Pipe Operator |> in R
The pipe is one of the things that makes R code more intuitive than other programming languages, as it allows us to write code in the order we think about it, passing data from one function to the next, rather than nesting function calls as traditional code would be written
A Brief History of the R Pipe
The pipe was originally a tidyverse invention using the %>% symbol, which is probably still the most common pipe you’ll see “in the wild”
The pipe we are using, |> (the “native pipe”), was brought into base (“vanilla”) R a few years ago, in version 4.1.0
The reasons for the change are beyond the scope of this class; you just need to know that |> and %>% are essentially the same thing
The default shortcut for pipe (ctrl + shift + m) will generate %>%, but you can change it to |> by going to Tools > Global Options > Code > Use native pipe operator, |>
How the Pipe Works
For this example don’t worry about the actual processes (we will go over them more below), just look at how much more intuitive the code is with |>s.
First, let’s say I want to take the data we just read in and select the x1txmtscor (math test scores) column
Without |>
## Without |>
select(data, x1txmtscor)
With |>
## With |>
data |> select(x1txmtscor)
Neither is that confusing… But, what if we want to take that output and select only students with a math score above 50?
Without |>
## Without |>
filter(select(data, x1txmtscor), x1txmtscor > 50)
With |>
## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50)
See how the non-piped version is getting messy? Let’s add one more level to really make the point, creating a new variable that is the square root of the test score
Without |>
## Without |>
mutate(filter(select(data, x1txmtscor), x1txmtscor > 50), square_root = sqrt(x1txmtscor))
With |>
## With |>
data |> select(x1txmtscor) |> filter(x1txmtscor > 50) |> mutate(square_root = sqrt(x1txmtscor))
As we are getting longer, let’s use a new line for each pipe, just to make it even clearer (it makes no difference to R)
## Best to use a new line for each pipe when code gets longer
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))
Even though we haven’t covered any of these commands yet, with the |> version it’s pretty easy to get a rough sense of what’s going on, whereas the traditional nested way gets really tricky beyond a couple of commands.
Of course, if you wanted, you could do each step separately, constantly assigning and overwriting an object like below
## Without the |>, we could technically break it down step by step
temp <- select(data, x1txmtscor)
temp <- filter(temp, x1txmtscor > 50)
temp <- mutate(temp, square_root = sqrt(x1txmtscor))
temp
But it’s less intuitive than simply piping the results.
If we do want to see step-by-step output in a piped command, you can either
Run the code as you write it line-by-line
Highlight sections of the piped code to run up to a point
Assigning Output from Pipes
Assigning output from pipes works the same as we have covered a few times: we use <- to pass the result backwards to an object (in this case we called that object data_backward_pass)
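The backward-assignment version (a reconstruction to match the data_backward_pass object this section compares against) would look like this:

```r
## assign the result of the pipe "backwards" to an object
data_backward_pass <- data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor))
```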
If you want to think of it this way, we are effectively continuing to pass it forward like this…
## You can think of the assignment as a continuation of the pipe like this
## but don't write it this way, it's then hard to find what you called something later
data |>
  select(x1txmtscor) |>
  filter(x1txmtscor > 50) |>
  mutate(square_root = sqrt(x1txmtscor)) -> data_forward_pass
That’s how pipes and assignment work together, and you can see the outputs are all.equal()
## Checking they are the same
all.equal(data_backward_pass, data_forward_pass)
[1] TRUE
However, you really shouldn’t write code like this, as although it runs, it’s then hard to later find where you created data_forward_pass.
Starting the string of code with where you store the result is MUCH clearer.
But hopefully it will help some of you understand how |> and <- work together
Basic Tidyverse Commands
Now we have the pipe |> covered, it’s time to dig into some basic data wrangling commands
Selecting Variables/Columns with select()
Often data sets contain hundreds, thousands, or even tens of thousands of variables when we are only interested in a handful. Our first tidyverse command, select(), helps us deal with this by, as you may have guessed, select()-ing the variables we want
Since we are going to pipe our R commands, we start with our data then pipe it into the select() command
In the select() command, we list out the variables we want
stu_id, x1stuedexpct, x1paredexpct, x1region
Notice: in this (and most tidyverse) command(s) we don’t have to use “quotes” around the variable names. A common error message when you did need to put something in quotes is Error: object <thing you should have "quoted"> not found
data |>
  select(stu_id, x1stuedexpct, x1paredexpct, x1region)
Quick question: if we want to use this reduced dataset going forward, what should we do?
That’s right, assign it to an object! Let’s call that data_small
data_small <- data |>
  select(stu_id, x1stuedexpct, x1paredexpct, x1region)
Our next step is to create a variable that’s the higher of student and parent expectations. Sounds simple enough, but first, we have a problem…
Can anyone guess what it is?
Understanding our data
First things first, however, we need to check the code book to see what the numerical values for our two education expectation variables represent. To save time, we’ve copied them here:
x1stuedexpct
How far in school 9th grader thinks he/she will get
value   label
1       Less than high school
2       High school diploma or GED
3       Start an Associate’s degree
4       Complete an Associate’s degree
5       Start a Bachelor’s degree
6       Complete a Bachelor’s degree
7       Start a Master’s degree
8       Complete a Master’s degree
9       Start Ph.D/M.D/Law/other prof degree
10      Complete Ph.D/M.D/Law/other prof degree
11      Don’t know
-8      Unit non-response
x1paredexpct
How far in school parent thinks 9th grader will go
value   label
1       Less than high school
2       High school diploma or GED
3       Start an Associate’s degree
4       Complete an Associate’s degree
5       Start a Bachelor’s degree
6       Complete a Bachelor’s degree
7       Start a Master’s degree
8       Complete a Master’s degree
9       Start Ph.D/M.D/Law/other prof degree
10      Complete Ph.D/M.D/Law/other prof degree
11      Don’t know
-8      Unit non-response
-9      Missing
The good news is that the categorical values are the same for both variables (meaning we can make an easy comparison) and move in a logical progression
The bad news is that we have three values — -8, -9, and 11 — that we need to deal with so that the averages we compute later represent what we mean
Exploring Categorical Variables with count()
First, let’s see how many observations are affected by these values using count()
Notice that we don’t assign to a new object; this means we’ll see the result in the console, but nothing in our data or object will change
## see unique values for student expectation
data_small |>
  count(x1stuedexpct)
Dealing with -8 and -9 is straightforward — we’ll convert them to missing.
In R, missing values are technically stored as NA.
Not all statistical software uses the same values to represent missing values (for example, Stata uses a dot .)
NCES has decided to represent missing values as a limited number of negative values. In this case, -8 and -9 represent missing values
Note: how to handle missing values is a very important topic, one we could spend all semester discussing
For now, we are just going to drop observations with missing values; but be forewarned that how you handle missing values can have real ramifications for the quality of your final results
In real research, a better approach is usually to impute missing values, but that is beyond our scope right now
Deciding what to do with 11 is a little trickier. While it’s not a missing value per se, it also doesn’t make much sense in its current ordering, that is, to be “higher” than completing a professional degree
For now, we’ll make a decision to convert these to NA as well, effectively deciding that an answer of “I don’t know” is the same as missing an answer
So first step: convert -8, -9, and 11 in both variables to NA. For this, we’ll use the mutate() and ifelse() functions
Conditional Values with ifelse()
ifelse() is a really common command in R and has three parts
statement that can be TRUE or FALSE
What to return if the statement is TRUE
else what to return when the statement is FALSE
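As a tiny stand-alone illustration (toy numbers, not our data set), ifelse() checks the statement for every element of a vector and returns one of the two values each time:

```r
scores <- c(45, 60, 52)

## statement: is each score above 50? TRUE -> "pass", FALSE -> "fail"
ifelse(scores > 50, "pass", "fail")
## returns "fail" "pass" "pass"
```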
Modifying an Existing Variable with mutate()
When we want to add variables and change existing ones, we can use the mutate() function
The basic idea of mutate commands is mutate(<where to go> = <what to go there>)
This is probably the trickiest function we cover today to understand
New variables are created if you provide <where to go> a new variable name (or nothing)
Variables are modified if you provide <where to go> an existing variable name
<what to go there> can be as simple as a single number or as complex as a chain of piped functions, so long as there’s a clear answer for every row
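A toy example (made-up data, not HSLS) showing both behaviors in one mutate() call:

```r
library(tidyverse)

toy <- tibble(x = c(1, 2, 3))

## new name -> new column is created; existing name -> column is modified
toy |>
  mutate(double_x = x * 2,   ## creates double_x
         x = x + 10)         ## modifies x
```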
In this case, we want to modify x1stuedexpct to be NA when x1stuedexpct is -8, -9, or 11
Now, we have three values we want to convert to NA, so we could do them one at a time, like below
Assign the results back to data_small (which will overwrite our previous data_small)
Take data_small and pipe |> it into mutate()
Inside mutate() assign our results to x1stuedexpct (which will modify the existing variable)
Modify x1stuedexpct with an ifelse() statement, which remember has 3 parts
the statement asks “is x1stuedexpct == -8?” Notice that == means “is equal to”, while = means “assign to”. Yes, it’s confusing, but you’ll get it over time!
if that statement is true, make it NA
else (if the statement is false) return the original variable x1stuedexpct
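Assembling those steps, the one-at-a-time version for -8 presumably looked like this:

```r
## convert -8 in x1stuedexpct to NA (just the one value for now)
data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct == -8, NA, x1stuedexpct))
```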
Okay, make sense? Let’s see what we just did (look at row 26)
The code now works just as above, but instead of asking if x1stuedexpct is equal to -8, it asks if it’s in the list of -8, -9, and 11, then does the same for parental expectations!
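Written out (mirroring the appendix code at the end of this lesson, which uses %in% list(-8, -9, 11); c() works identically here and is a bit more idiomatic):

```r
## if expectation is -8, -9, or 11, make it NA, for both variables
data_small <- data_small |>
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% c(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% c(-8, -9, 11), NA, x1paredexpct))
```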
Let’s view those first 26 rows again to see what we did
print(data_small, n = 26)
# A tibble: 23,503 × 4
stu_id x1stuedexpct x1paredexpct x1region
<dbl> <dbl> <dbl> <dbl>
1 10001 8 6 2
2 10002 NA 6 1
3 10003 10 10 4
4 10004 10 10 3
5 10005 6 10 3
6 10006 10 8 3
7 10007 8 NA 1
8 10008 8 6 1
9 10009 NA NA 3
10 10010 8 6 1
11 10011 8 6 3
12 10012 NA NA 2
13 10013 8 10 3
14 10014 2 6 3
15 10015 NA 10 3
16 10016 4 6 2
17 10018 6 7 2
18 10019 8 NA 2
19 10020 8 8 4
20 10021 8 NA 2
21 10022 10 8 4
22 10024 6 8 2
23 10025 8 NA 4
24 10026 7 10 3
25 10027 NA 6 1
26 10028 NA NA 3
# ℹ 23,477 more rows
So, with that tangent out of the way, let’s get back to our original task, creating a new variable that is the highest of parental and student expectations
To make a new variable which is the highest of two variables, we can use our friends mutate() and ifelse() some more
If x1stuedexpct is higher, take that; if not, take x1paredexpct
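That translates to the following (matching the appendix code):

```r
## high_exp: take the student value when it's larger, else the parent value
data_small <- data_small |>
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct))
```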
There’s two things we haven’t fully accounted for though…
One doesn’t actually matter here, but might in other circumstances
One definitely matters here
Without scrolling past the duck, does anyone know what they might be?
Sloppy Mistake 1 (doesn’t matter here)
We were a little sloppy with the statement piece; we just asked if x1stuedexpct was greater than x1paredexpct or not
If we were being more careful, we might have said “greater than or equal to”
Why doesn’t this matter in this context, and when might it matter?
Sloppy Mistake 2 (does matter here)
Now let’s check our data frame to see the one that does matter
print(data_small, n = 26)
# A tibble: 23,503 × 5
stu_id x1stuedexpct x1paredexpct x1region high_exp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10001 8 6 2 8
2 10002 NA 6 1 NA
3 10003 10 10 4 10
4 10004 10 10 3 10
5 10005 6 10 3 10
6 10006 10 8 3 10
7 10007 8 NA 1 NA
8 10008 8 6 1 8
9 10009 NA NA 3 NA
10 10010 8 6 1 8
11 10011 8 6 3 8
12 10012 NA NA 2 NA
13 10013 8 10 3 10
14 10014 2 6 3 6
15 10015 NA 10 3 NA
16 10016 4 6 2 6
17 10018 6 7 2 7
18 10019 8 NA 2 NA
19 10020 8 8 4 8
20 10021 8 NA 2 NA
21 10022 10 8 4 10
22 10024 6 8 2 8
23 10025 8 NA 4 NA
24 10026 7 10 3 10
25 10027 NA 6 1 NA
26 10028 NA NA 3 NA
# ℹ 23,477 more rows
Hmm, that seems odd, why would R consider NA to be greater than 6?
Any thoughts?
Generally, R is overly cautious when dealing with NAs to ensure you don’t accidentally drop them without realizing it
For example, if you were asked what the mean(c(5, 6, 4, NA)) would be, you’d probably say 5, right?
R is never going to just ignore the NA values like that unless we tell it to
mean(c(5, 6, 4, NA))
[1] NA
See, we have to explicitly tell it to remove the NA values
mean(c(5, 6, 4, NA), na.rm = T)
[1] 5
So in our case of trying to get the highest expectation, R doesn’t want us to forget we have NA values, so it returns NA whenever either value is NA.
Dealing with missing values is a huge topic in data analysis, and there are many ways to handle them, which is beyond the scope of this lesson
For now, let’s remove rows that have NA values in either x1stuedexpct or x1paredexpct or both
Dealing With Missing Values with is.na(), &, and !
To do this, we will add another couple of code helpers:
is.na(x1stuedexpct) simply asks whether x1stuedexpct is NA or not (R doesn’t let you just say x1stuedexpct == NA)
! is a really helpful tool that negates or inverts a command, so !is.na(x1stuedexpct) returns the opposite of is.na(x1stuedexpct), telling us the column is not NA
& can be useful inside conditional statements, as it means both sides must be TRUE (FYI: | means or)
So !is.na(x1stuedexpct) & !is.na(x1paredexpct) makes sure that both x1stuedexpct and x1paredexpct are not NA
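Putting those helpers inside filter() (a reconstruction based on the appendix code; the data_small_cut name comes from the surrounding text):

```r
## keep only rows where neither expectation variable is NA
data_small_cut <- data_small |>
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct))
```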
Now what does filter() do here?
Keeping Rows Based on a Condition with filter()
The filter() command from tidyverse works by only keeping observations that meet the condition(s) we set
As in, to make it through the filter, a row must answer “yes” to “does it meet this condition?”
So in this case, we are keeping all rows where x1stuedexpct and x1paredexpct are not NA
Notice, instead of overwriting data_small we assigned this to a new object data_small_cut
Generally, when making substantial changes to a data set, like dropping observations, we might want to be able to double-check what we did, which is easier if we make a new data object
Quick Question: A common confusion from this lesson is between filter() and select(). Can someone explain when you’d use select() over filter()?
Let’s check the counts of our x1stuedexpct and x1paredexpct again to see if filter() worked
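That check presumably looked like this; counting again, we should no longer see -8, -9, 11, or NA:

```r
## re-check unique values after filtering
data_small_cut |>
  count(x1stuedexpct)

data_small_cut |>
  count(x1paredexpct)
```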
Let’s compare our two ways of removing NAs and see if they end up being the same
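The second way is most likely tidyr’s drop_na() (an assumption on our part, chosen to match the data_small_drop name used in the comparison below):

```r
## drop rows with NA in either expectation column using tidyr's drop_na()
data_small_drop <- data_small |>
  drop_na(x1stuedexpct, x1paredexpct)
```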
What’s the function to check if things are equal? (We used it when introducing |>.)
## compare two ways of dropping NAs
all.equal(data_small_cut, data_small_drop)
[1] TRUE
Summarizing Data with summarize()
Okay, so we have our data selected, we made our high_exp variable, and we’ve done some work to handle missing data
In our week one data talk we discussed how massive tables of data are not particularly helpful for someone to read or interpret
We will cover how to make graphs of our data in a few weeks
For our final task today, we are going to make some summary tables using summarize() from the tidyverse
summarize() allows us to apply a summary statistic (mean, sd, median, etc.) to a column in our data
summarize() takes an entire data frame as an input and spits out a small data frame with just the summary variables
Note: for this reason, you rarely ever want to assign <- the output of summarize() back to the main data frame object, as you’ll overwrite it
You can either spit the summary tables out into the console without assignment (which we will do) or if you need to use them for something else, assign them to a new object
## get average (without storing)
data_small_cut |>
  summarize(mean(high_exp))
# A tibble: 1 × 1
`mean(high_exp)`
<dbl>
1 7.99
See, the output is a 1x1 table with the mean expectation mean(high_exp) of 7.99, which is just about “Complete a Master’s degree” (8) on our scale
Note: if we want to name the summary variable, we can name it just like we did earlier in mutate() with a single =
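To get per-region numbers, we add group_by() before summarize() and give the summary variable a name (this mirrors the appendix code):

```r
## mean of high_exp within each region, named mean_exp
data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp))
```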
Success! While expectations are similar across the country, there’s some variance by region
While there are a few things we could do to make this a little fancier (e.g., changing the region numbers to names, formatting the table, etc.), we have answered our question and have clear documentation of how we got here, so let’s call that a win!
Saving Data with write_csv()
Sometimes we want to be able to access objects from scripts without having to re-run the whole thing
Remember: one of the main advantages of R is that the original data file we read in remains untouched
To do this, we want to write_ a new csv() file, containing our modified data
Unlike read_csv(), which only needed a file name/path, write_csv() needs to know what you’re trying to save and the file name/path you want to save it to
The only way you can overwrite or change the original data is by saving to the same file name as the original data, so NEVER do that!
Since we didn’t assign our summary table to anything, we can just add write_csv() to the end of the pipe and add a path to where you want to save it
If you want to save a data frame you already assigned to an object write_csv(<object>, "path") would work just fine!
## write with useful name
data_small_cut |>
  group_by(x1region) |>
  summarize(mean_exp = mean(high_exp)) |>
  write_csv("data/region-expects.csv")
Phew!
Appendix: All at Once
We went through that piece by piece to demonstrate each function, but there’s no reason we can’t just |> pipe it all together
## Let's redo the analysis above, but with a fully chained set of
## functions.

## start with original dataset
data |>
  ## select columns we want
  select(stu_id, x1stuedexpct, x1paredexpct, x1region) |>
  ## If expectation is -8, -9, or 11, make it NA
  mutate(x1stuedexpct = ifelse(x1stuedexpct %in% list(-8, -9, 11), NA, x1stuedexpct),
         x1paredexpct = ifelse(x1paredexpct %in% list(-8, -9, 11), NA, x1paredexpct)) |>
  ## Make a new variable called high_exp that is the higher of parent and student exp
  mutate(high_exp = ifelse(x1stuedexpct > x1paredexpct, x1stuedexpct, x1paredexpct)) |>
  ## Drop if either or both parent or student exp is NA
  filter(!is.na(x1stuedexpct) & !is.na(x1paredexpct)) |>
  ## Group the results by region
  group_by(x1region) |>
  ## Get the mean of high_exp (by region)
  summarize(mean_exp = mean(high_exp)) |>
  ## Write that to a .csv file
  write_csv("data/region-expects-chain.csv")
To double check, let’s just check these are the same…
non_chain <- read_csv("data/region-expects.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chain <- read_csv("data/region-expects-chain.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x1region, mean_exp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all.equal(non_chain, chain)
[1] TRUE
WooHoo!
Final notes
This rather lengthy lesson has thrown you in the (medium) deep end of the coding pool
By no means are you expected to get everything we just did
We will continue to revisit all these commands throughout the class; by the end of the semester, they will be second nature!
We also saw how to use code to answer a realistic question we might be asked in a data management job, a translation skill that will prove invaluable later on!
We had to plan out steps and then make some adjustments along the way (e.g., our NA issues), that’s all part of the process!
Use the hsls-small.csv data set from the lesson to answer the following questions
Throughout, you should account for missing values by dropping them
Hint: You’re going to need the code book to identify the necessary variables
Question One
a) What is the average (mean) standardized math score?
b) How does this differ by gender?
Question Two
a) Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (Include a description of what that category represents)
Question Three
a) Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?
b) How does this differ by region?
Question Four
a) What percentage of students ever attended a post-secondary institution (as of the data collection in February 2016)?
b) Give the cross tabulation for both family incomes above/below $35,000 and region
This means you should have percentages for 8 groups: above/below $35,000 within each region
Hint: group_by() can be given more than one group
Submission
Once complete turn in the .qmd file (it must render/run) and the PDF rendered to Canvas by the due date (usually Tuesday 12:00pm following the lesson). Assignments will be graded before next lesson on Wednesday in line with the grading policy outlined in the syllabus.