## ---------------------------
## libraries
## ---------------------------
## NONE
III: Data Wrangling I Redux: Vanilla R
This supplemental lesson is a companion to the lesson on data wrangling with the tidyverse.
While the tidyverse has fundamentally changed the way many people work with R, you can still use base or “vanilla” R for your data wrangling tasks.
You may find it useful to perform an example data analysis without the support of the tidyverse.
If you work with people who learned to code without the tidyverse, it also helps to be able to read and follow base R code.
Because much of the material is otherwise the same, we won’t go into the same depth in this supplemental lesson (you’ll even see some repeated text).
We’ll simply work to answer the research question posed in the other lesson.
Re: Urgent Data Question from the Provost
- Through today’s lesson, we will explore some of the basics of data wrangling
- But to make it more concrete, we will be doing so by answering a realistic question you may be asked by your advisor or supervisor
Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.
- A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
- Instead, it’s up to you to
  - translate a research question into the discrete coding steps necessary to provide an answer, and then
  - translate the answer such that everyone understands what you’ve found
What we need to do is some combination of the following:
- Read in the data
- Select the variables we need
- Mutate a new value that’s the higher of student and parental degree expectations
- Filter out observations with missing degree expectation values
- Summarize the data within region to get average degree expectation values
- Write out the results to a file so we have it for later
Let’s do it!
NOTE: Since we’re only using vanilla R, we don’t need to load any packages
Check working directory
- Before we get started, make sure your working directory is set to your class folder
## Check working directory is correct
setwd(this.path::here())
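If you want to double-check where R is now pointed, you can print the working directory and list what it can see; this optional check assumes your class folder contains the data subfolder used throughout the lesson.
## optional check: where are we, and can R see the data folder?
getwd()
list.files("data")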
Read in data
- For this lesson, we’ll use a subset of the High School Longitudinal Study of 2009 (HSLS09), an IES/NCES data set that features:
- Nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016
- Students followed throughout secondary and postsecondary years
- Surveys of students, their parents, math and science teachers, school administrators, and school counselors
- A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades
- 10 state representative data sets
If you are interested in using HSLS09 for future projects, DO NOT rely on this subset. Be sure to download the full data set with all relevant variables and weights if that’s the case. But for our purposes in this lesson, it will work just fine.
Throughout, we’ll need to consult the code book. An online version can be found at this link.
Quick exercise
Follow the code book link above in your browser and navigate to the HSLS09 code book.
## ---------------------------
## input
## ---------------------------
## data are CSV, so we use read.csv(), which is a base R function
df <- read.csv(file.path("data", "hsls-small.csv"))
- Unlike the `read_csv()` function we’ve used before, `read.csv()` doesn’t print anything
  - Notice the difference: a `.` instead of an `_`
- So that we can see our data, we’ll print it to the console. BUT before we do that…
  - `read.csv()` returns a base R `data.frame()` rather than the special data frame or `tibble()` that the tidyverse uses
  - It’s mostly the same, but one difference is that whereas R will only print the first 10 rows of a tibble, it will print the entire `data.frame`
  - We don’t need to see the whole thing, so we’ll use the `head()` function to print only the first 10 rows
## show first 10 rows
head(df, n = 10)
stu_id x1sex x1race x1stdob x1txmtscor x1paredu x1hhnumber x1famincome
1 10001 1 8 199502 59.3710 5 3 10
2 10002 2 8 199511 47.6821 3 6 3
3 10003 2 3 199506 64.2431 7 3 6
4 10004 2 8 199505 49.2690 4 2 5
5 10005 1 8 199505 62.5897 4 4 9
6 10006 2 8 199504 58.1268 3 6 5
7 10007 2 8 199409 49.4960 2 2 4
8 10008 1 8 199410 54.6249 7 3 7
9 10009 1 8 199501 53.1875 2 3 4
10 10010 2 8 199503 63.7986 3 4 4
x1poverty185 x1ses x1stuedexpct x1paredexpct x1region x4hscompstat
1 0 1.5644 8 6 2 1
2 1 -0.3699 11 6 1 1
3 0 1.2741 10 10 4 1
4 0 0.5498 10 10 3 1
5 0 0.1495 6 10 3 1
6 0 1.0639 10 8 3 -8
7 0 -0.4300 8 11 1 1
8 0 1.5144 8 6 1 1
9 0 -0.3103 11 11 3 1
10 0 0.0451 8 6 1 -8
x4evratndclg x4hs2psmos
1 1 3
2 1 3
3 1 4
4 0 -7
5 0 -7
6 -8 -8
7 1 2
8 1 3
9 1 8
10 -8 -8
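If you also want a quick sense of the data frame’s size and column types before going further, base R’s dim() and str() functions are handy; this is an optional aside, not a step in the lesson.
## optional: dimensions and structure of the full data frame
dim(df)  # number of rows and columns
str(df)  # each column name, type, and first few values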
Select variables (columns)
- Data frames are like special matrices
  - They have rows and columns
  - You can access these rows and columns using square bracket notation (`[]`)
- Because data frames have two dimensions, you use a comma inside the square brackets to indicate what you mean (`[,]`): `df[<rows>, <cols>]`
- At its most basic, you can use numbers to represent the index of the cell or cells you’re interested in
- For example, if you want to access the value of the cell in row 1, column 4, you can use:
## show value at row 1, col 4
df[1, 4]
[1] 199502
- Because data frames have column names (the variable names in our data set), we can also refer to them by name
  - The fourth column is the student date of birth variable, `x1stdob`
  - We can use that instead of `4` (notice the quotation marks `""`):
## show value at row 1, x1stdob column
1, "x1stdob"] df[
[1] 199502
- If we want to see more than one column, we can put the names in a concatenated vector using the `c()` function:
## show values at row 1, stu_id & x1stdob column
df[1, c("stu_id", "x1stdob")]
stu_id x1stdob
1 10001 199502
- So far, we’ve not assigned these results to anything, so they’ve just printed to the console.
- However, we can assign them to a new object
- If we want to slice our data so that we only have selected columns, we can leave the rows section blank (meaning we want all rows) and include all the columns we want to keep in our new data frame object.
## -----------------
## select
## -----------------
## select columns we need and assign to new object
df_tmp <- df[, c("stu_id", "x1stuedexpct", "x1paredexpct", "x1region")]
## show 10 rows
head(df_tmp, n = 10)
stu_id x1stuedexpct x1paredexpct x1region
1 10001 8 6 2
2 10002 11 6 1
3 10003 10 10 4
4 10004 10 10 3
5 10005 6 10 3
6 10006 10 8 3
7 10007 8 11 1
8 10008 8 6 1
9 10009 11 11 3
10 10010 8 6 1
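A small optional aside: bracket selection also accepts column positions, so the same selection could be written numerically using the column order shown in the head() output above. Names are usually safer, though, since positions break if the columns are ever reordered.
## same selection by position (1 = stu_id, 11 = x1stuedexpct,
## 12 = x1paredexpct, 13 = x1region); column names are less fragile
df_tmp_pos <- df[, c(1, 11, 12, 13)]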
Mutate data into new forms
Changing existing variables (columns)
- To conditionally change a variable, we’ll once again use the bracket notation to target our changes
- This time, however, we do a couple of things differently:
  - include square brackets on the LHS of the assignment
  - use conditions in the `<rows>` part of the bracket
- As before, we need to account for the fact that our two expectation variables, `x1stuedexpct` and `x1paredexpct`, have values that need to be converted to `NA`: `-8`, `-9`, and `11`
- First, let’s look at the unique values using the `table()` function
  - This is somewhat similar to `count()` in the tidyverse
- So that we see any missing values, we’ll include an extra argument, `useNA = "ifany"`
  - This just means we will see counts for `NA`s if there are any
## -----------------
## mutate
## -----------------
## see unique values for student expectation
table(df_tmp$x1stuedexpct, useNA = "ifany")
-8 1 2 3 4 5 6 7 8 9 10 11
2059 93 2619 140 1195 115 3505 231 4278 176 4461 4631
## see unique values for parental expectation
table(df_tmp$x1paredexpct, useNA = "ifany")
-9 -8 1 2 3 4 5 6 7 8 9 10 11
32 6715 55 1293 149 1199 133 4952 76 3355 37 3782 1725
- Notice that we use a dollar sign, `$`, to call the column name from the data frame
  - Unlike with the tidyverse, we cannot just use the bare column name
  - Base R will look for that column name not as a column in a data frame, but as its own object
  - It probably won’t find it (or worse, you’ll have another object in memory that it will find and you’ll get the wrong thing!)
- To modify a variable when it has a certain value, we can use the `[]` square brackets in a more advanced way
  - Start by identifying the column you’d like, e.g., `df_tmp$x1stuedexpct`
  - Then add the `[]` square brackets after it
  - Inside them, we can add a condition, such as when the column is equal to -8, e.g., `df_tmp$x1stuedexpct == -8`
    - Think of this a bit like `filter()` from the tidyverse
  - If we just print this, you’ll see a load of `-8`s, not that useful…
## This will just print a bunch of -8s
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == -8]
[1] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[25] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
... [output truncated: 2,059 values, all -8] ...
- But, instead of printing it, we can assign `NA` to it, which will replace all those `-8`s with `NA`
  - We can do the same for `11` while we are at it
## replace student expectation values
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == -8] <- NA
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == 11] <- NA
- If you think back to our previous lesson, we can be a little more slick than this though
  - If we change the statement to `%in% c(-8, -9, 11)`, it will do it all at once
## replace parent expectation values
df_tmp$x1paredexpct[df_tmp$x1paredexpct %in% c(-8, -9, 11)] <- NA
Let’s confirm using `table()` again. The values that were in `-8`, `-9`, and `11` should now be summed under `NA`.
## see unique values for student expectation (confirm changes)
table(df_tmp$x1stuedexpct, useNA = "ifany")
1 2 3 4 5 6 7 8 9 10 <NA>
93 2619 140 1195 115 3505 231 4278 176 4461 6690
## see unique values for parental expectation (confirm changes)
table(df_tmp$x1paredexpct, useNA = "ifany")
1 2 3 4 5 6 7 8 9 10 <NA>
55 1293 149 1199 133 4952 76 3355 37 3782 8472
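If you prefer a programmatic check to eyeballing the tables, sum(is.na()) returns the missing count directly; the numbers should match the <NA> columns just above (this is an optional extra, not a lesson step).
## count NAs directly: should be 6690 for students and 8472 for parents
sum(is.na(df_tmp$x1stuedexpct))
sum(is.na(df_tmp$x1paredexpct))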
Adding new variables (columns)
- Adding a new variable to our data frame is just like modifying an existing column
- The only difference is that instead of putting an existing column name after the `$` sign, we’ll make up a new name
  - This tells R to add a new column to our data frame
- As with the tidyverse version, we’ll use the `ifelse()` function to create a new variable that is the higher of student or parental expectations
## add new column
df_tmp$high_expct <- ifelse(df_tmp$x1stuedexpct > df_tmp$x1paredexpct, # test
                            df_tmp$x1stuedexpct,                       # if TRUE
                            df_tmp$x1paredexpct)                       # if FALSE
## show first 10 rows
head(df_tmp, n = 10)
stu_id x1stuedexpct x1paredexpct x1region high_expct
1 10001 8 6 2 8
2 10002 NA 6 1 NA
3 10003 10 10 4 10
4 10004 10 10 3 10
5 10005 6 10 3 10
6 10006 10 8 3 10
7 10007 8 NA 1 NA
8 10008 8 6 1 8
9 10009 NA NA 3 NA
10 10010 8 6 1 8
- Just like in the original lesson, it doesn’t handle `NA` values how we want it to
  - Look at student `10002` in the second row:
    - While the student doesn’t have an expectation (or said “I don’t know”), the parent does
    - However, our new variable records `NA`. Let’s fix it with this test:
If `high_expct` **is missing** **and** `x1stuedexpct` **is not missing**, replace with that; otherwise replace with itself (leave alone). Repeat, but for `x1paredexpct`. If still `NA`, then we can assume both student and parent expectations were missing.
Translating the bold words to R code:
- is missing: `is.na()`
- and: `&`
- is not missing: `!is.na()` (`!` means NOT)
we get:
## correct for NA values
## NB: We have to include [is.na(df_tmp$high_expct)] each time so that
## everything lines up
## step 1 student
df_tmp$high_expct[is.na(df_tmp$high_expct)] <- ifelse(
  ## test
  !is.na(df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)]),
  ## if TRUE do this...
  df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)],
  ## ... else do that
  df_tmp$high_expct[is.na(df_tmp$high_expct)]
)

## step 2 parent
df_tmp$high_expct[is.na(df_tmp$high_expct)] <- ifelse(
  ## test
  !is.na(df_tmp$x1paredexpct[is.na(df_tmp$high_expct)]),
  ## if TRUE do this...
  df_tmp$x1paredexpct[is.na(df_tmp$high_expct)],
  ## ... else do that
  df_tmp$high_expct[is.na(df_tmp$high_expct)]
)
- That’s a lot of text!
- What’s happening is that we are trying to replace a vector of values with another vector of values, which need to line up and be the same length
  - That’s why we start with `df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)]`
  - When our `high_expct` column has missing values, we want to replace them with the non-missing `x1stuedexpct` values in the same rows
    - That means we also need to subset that column to only include values in rows that have missing `high_expct` values
    - Because we must do this each time, our script gets pretty long and unwieldy (see the short aside at the end of this subsection for a more compact alternative)
Let’s check to make sure it worked as intended.
## show first 10 rows
head(df_tmp, n = 10)
stu_id x1stuedexpct x1paredexpct x1region high_expct
1 10001 8 6 2 8
2 10002 NA 6 1 6
3 10003 10 10 4 10
4 10004 10 10 3 10
5 10005 6 10 3 10
6 10006 10 8 3 10
7 10007 8 NA 1 8
8 10008 8 6 1 8
9 10009 NA NA 3 NA
10 10010 8 6 1 8
- Looking at the second observation again, it looks like we’ve fixed our `NA` issue
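As an aside (not how this lesson proceeds), base R’s pmax() function can collapse all of the ifelse() work above into a single expression: it takes the element-wise maximum and, with na.rm = TRUE, keeps the non-missing value when only one side is NA. Printing its first 10 values should match the high_expct column we just built.
## aside: element-wise max that ignores a single NA and returns NA only when
## both values are missing; compare against the head() output above
pmax(df_tmp$x1stuedexpct, df_tmp$x1paredexpct, na.rm = TRUE)[1:10]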
Filter observations (rows)
- Let’s check the counts of our new variable:
## -----------------
## filter
## -----------------
## get summary of our new variable
table(df_tmp$high_expct, useNA = "ifany")
1 2 3 4 5 6 7 8 9 10 <NA>
71 2034 163 1282 132 4334 191 5087 168 6578 3463
- Since we can’t use the missing values, we’ll drop those observations from our data frame
- Just like when we selected columns above, we’ll use the `[]` square bracket notation
  - As with dplyr’s `filter()`, we want to filter in what we want (i.e., when it’s not `NA`)
- Since we want to filter rows, we set this condition before the comma in the square brackets
- Because we want all the columns, we leave the space after the comma blank
## filter in values that aren't missing
df_tmp <- df_tmp[!is.na(df_tmp$high_expct), ]
## show first 10 rows
head(df_tmp, n = 10)
stu_id x1stuedexpct x1paredexpct x1region high_expct
1 10001 8 6 2 8
2 10002 NA 6 1 6
3 10003 10 10 4 10
4 10004 10 10 3 10
5 10005 6 10 3 10
6 10006 10 8 3 10
7 10007 8 NA 1 8
8 10008 8 6 1 8
10 10010 8 6 1 8
11 10011 8 6 3 8
- It looks like we’ve dropped the rows with missing values in our new variable (or, more technically, kept those without missing values)
- Since we haven’t removed rows until now, to double check, we can compare the number of rows in the original data frame, `df`, to what we have now
## is the original # of rows minus the current # of rows == the NA count?
nrow(df) - nrow(df_tmp)
[1] 3463
- Comparing the difference, we can see it’s the same as the number of missing values in our new column
  - While not a formal test, it does support what we expected
    - In other words, if the number were different, we’d definitely want to go back and investigate
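If you want the script itself to flag a mismatch rather than relying on a visual comparison, you could wrap the check in stopifnot(); this is optional, and the 3463 comes from the table output above.
## optional: error out if the dropped-row count doesn't equal the NA count
stopifnot(nrow(df) - nrow(df_tmp) == 3463)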
Summarize data
Now we’re ready to get the average of expectations that we need. For an overall average, we can just use the `mean()` function.
## -----------------
## summarize
## -----------------
## get average (without storing)
mean(df_tmp$high_expct)
[1] 7.272705
- Overall, we can see that students and parents have high post-secondary expectations on average: to earn some graduate credential beyond a bachelor’s degree
- However, this isn’t what we want. We want the values across census regions.
## check our census regions
table(df_tmp$x1region, useNA = "ifany")
1 2 3 4
3128 5312 8177 3423
- We’re not missing any census data, which is good!
- To calculate our average expectations, we need to use the `aggregate()` function
  - This function allows us to compute a `FUN`ction `by` a group
  - We’ll use it to get our summary
## get average (assigning this time)
df_tmp <- aggregate(df_tmp["high_expct"],                # var of interest
                    by = list(region = df_tmp$x1region), # by group
                    FUN = mean)                          # function to run
## show
df_tmp
region high_expct
1 1 7.389066
2 2 7.168110
3 3 7.357833
4 4 7.125329
- Success! Expectations are similar across the country, but not the same by region.
Write out updated data
- We can use this new data frame as a table in its own right or to make a figure
- For now, however, we’ll simply save it using the opposite of `read.csv()`: `write.csv()`
## write with useful name
write.csv(df_tmp, file.path("data", "high_expct_mean_region.csv"))
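One optional tweak: by default, write.csv() also writes the data frame’s row names as an unnamed first column in the file. If you’d rather the CSV contain only our two columns, you can pass row.names = FALSE.
## same write, without the extra row-name column
write.csv(df_tmp, file.path("data", "high_expct_mean_region.csv"),
          row.names = FALSE)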
- And with that, we’ve met our task: we can show average educational expectations by region
- To be very precise, we can show the higher of student and parental educational expectations among those who answered the question by region
- This caveat doesn’t necessarily make our analysis less useful, but rather sets its scope.
- Furthermore, we’ve kept our original data as is (we didn’t overwrite it) for future analyses while saving the results of this analysis for quick reference
- Using the `hsls-small.csv` data set, answer the following questions
  - Hint: You’re going to need the code book to identify the necessary variables
- For this assignment, you cannot use the `tidyverse` or any other packages
  - Although it’s now part of vanilla R, you can’t use a `|>` pipe either
  - The purpose of this assignment is to learn how to write more traditional-style computer code
- Although you’ve already done this task with the tidyverse, this is probably still the hardest extra credit, good luck!
Hint
Calculating the percentages is the trickiest part of this assignment, with and without `tidyverse`, as we don’t directly talk about it in the lesson. The output of the `table()` function should give you the information you need; remember that you can access specific elements of any object (things you assign) with `[]`.
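As a generic illustration of that hint (using a made-up vector, not the assignment variables), pulling a percentage out of table() output looks something like this:
## toy example: what percentage of x is "b"?
x <- c("a", "b", "b", "c")
counts <- table(x)
round(counts["b"] / sum(counts) * 100, 2)  # 50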
Questions
- What is the average standardized math test score?
- How does this differ by gender?
- Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (include what that category represents)
- Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?
- How does this differ by region?
- What percentage of students ever attended a post-secondary institution by February 2016?
- Optional: Give the cross tabulation for both family incomes above/below $35,000 and region
- This means you should have percentages for 8 groups: above/below $35k within each region
Once complete, turn in the .R script (no data etc.) to Canvas by the due date (Sunday 11:59pm following the final lesson). Good faith efforts (as determined by the instructor) at extra credit assignments will earn full credit if submitted on time.
Solution
## -----------------------------------------------------------------------------
##
##' [PROJ: 7916]
##' [FILE: Data Wrangling I Redux: Vanilla R Solution]
##' [INIT: February 5th 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------
setwd(this.path::here())
## ---------------------------
##' [Libraries]
## ---------------------------
## None!
## ---------------------------
##' [Input]
## ---------------------------
df <- read.csv(file.path("data", "hsls-small.csv"))
## ---------------------------
##' [Q1]
## ---------------------------
## Part One
df$x1txmtscor[df$x1txmtscor == -8] <- NA
df_math <- na.omit(df) # No other NAs, so omit works fine

mean(df_math$x1txmtscor)

## Part Two
aggregate(df_math["x1txmtscor"],          # var of interest
          by = list(sex = df_math$x1sex), # by group
          FUN = mean)                     # function to run
## ---------------------------
##' [Q2]
## ---------------------------
df_pov <- df[df$x1poverty185 == 1, ]

paste("The median household income cat is", median(df_pov$x1famincome),
      "which represents incomes between $15k and $35k")
## ---------------------------
##' [Q3]
## ---------------------------
##'[Part I]
df_hs <- df[df$x4hscompstat %in% c(1, 2), ]

hs_counts <- table(df_hs$x4hscompstat)
hs_counts["total"] <- hs_counts["1"] + hs_counts["2"]
percent <- round(hs_counts["2"] / hs_counts["total"] * 100, 2)
paste(percent, "% of those with HS credential have a GED")

##'[Part II]
hs_counts_region <- aggregate(df_hs["x4hscompstat"],              # var of interest
                              by = list(region = df_hs$x1region), # by group
                              FUN = table)

hs_counts_region["total"] <- hs_counts_region$x4hscompstat[, "1"] + hs_counts_region$x4hscompstat[, "2"]
hs_counts_region["percent"] <- hs_counts_region$x4hscompstat[, "2"] / hs_counts_region["total"] * 100

hs_counts_region[c("region", "percent")]
## ---------------------------
##' [Q4]
## ---------------------------
## Part One
df_col <- df[df$x4evratndclg != -8, ]

col_counts <- table(df_col$x4evratndclg)
college <- col_counts["1"]
total <- col_counts["0"] + col_counts["1"]
percent <- round(college / total * 100, 2)

paste(percent, "% of students ever attended college")

## Part Two
## Drop any missing household income and make new variable
df_col <- df_col[!df_col$x1famincome %in% c(-8, -9), ]
df_col$below_35k <- ifelse(df_col$x1famincome %in% c(1, 2), 1, 0)

## Now, we can use the aggregate function again with the table function we used
## above, but inside the "by" list we can put both columns, a bit like we put
## two variables in tidyverse's group_by()
col_counts <- aggregate(df_col["x4evratndclg"],            # var of interest
                        by = list(b35k = df_col$below_35k, # by group(s)
                                  region = df_col$x1region),
                        FUN = table)                       # function to run

## The output of this is a little messy, so we can clean it up
col_counts$attend <- col_counts$x4evratndclg[, "1"]
col_counts$total <- col_counts$x4evratndclg[, "1"] + col_counts$x4evratndclg[, "0"]
col_counts$perc <- col_counts$attend / col_counts$total * 100

print(col_counts)
## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------