III: Data Wrangling I Redux: Vanilla R

R Code

  • This supplemental lesson is a companion to the lesson on data wrangling with Tidyverse

  • While the tidyverse has fundamentally changed the way many people work with R, you can still use base or “vanilla” R for your data wrangling tasks

  • You may find it useful to perform an example data analysis without the support of the tidyverse

  • If you work with people who learned to code without the tidyverse,

  • Because much of the material is otherwise the same, we won’t go into the same depth in this supplemental lesson (you’ll even see some repeated text)

  • We’ll simply work to answer the research question posed in the other lesson.

Re: Urgent Data Question from the Provost

  • Through today’s lesson, we will explore some of the basics of data wrangling
    • But to make it more realistic, we will be doing so to answer a realistic question you may be asked by your advisor or supervisor

Using HSLS09 data, figure out average differences in college degree expectations across census regions; for a first pass, ignore missing values and use the higher of student and parental expectations if an observation has both.

  • A primary skill (often unremarked upon) in data analytic work is translation. Your advisor, IR director, funding agency director — even collaborator — won’t speak to you in the language of R
    • Instead, it’s up to you to
      1. translate a research question into the discrete steps coding steps necessary to provide an answer, and then
      2. translate the answer such that everyone understands what you’ve found

What we need to do is some combination of the following:

  1. Read in the data
  2. Select the variables we need
  3. Mutate a new value that’s the higher of student and parental degree expectations
  4. Filter out observations with missing degree expectation values
  5. Summarize the data within region to get average degree expectation values
  6. Write out the results to a file so we have it for later

Let’s do it!

NOTE: Since we’re not using the vanilla R, we don’t need to load any packages

## ---------------------------
## libraries
## ---------------------------

## NONE

Check working directory

  • Before we get started, make sure your working directory is set to your class folder
## Check working directory is correct
setwd(this.path::here())

Read in data

  • Nationally representative, longitudinal study of 23,000+ 9th graders from 944 schools in 2009, with a first follow-up in 2012 and a second follow-up in 2016
  • Students followed throughout secondary and postsecondary years
  • Surveys of students, their parents, math and science teachers, school administrators, and school counselors
  • A new student assessment in algebraic skills, reasoning, and problem solving for 9th and 11th grades
  • 10 state representative data sets
  • If you are interested in using HSLS09 for future projects, DO NOT rely on this subset. Be sure to download the full data set with all relevant variables and weights if that’s the case. But for our purposes in this lesson, it will work just fine.

  • Throughout, we’ll need to consult the code book. An online version can be found at this link.

Quick exercise

Follow the code book link above in your browser and navigate to the HSLS09 code book.

## ---------------------------
## input
## ---------------------------

## data are CSV, so we use read.csv(), which is base R function
df <- read.csv(file.path("data", "hsls-small.csv"))
  • Unlike the read_csv() function we’ve used before, read.csv() doesn’t print anything
    • notice the difference: a . instead of an _
  • So that we can see our data, well print to the console. BUT before we do that…

-read.csv() returns a base R data.frame() rather than the special data frame or tibble() that the tidyverse uses. - It’s mostly the same, but one difference is that whereas R will only print the first 10 rows of a tibble, it will print the entire data.frame - We don’t need to see the whole thing, so we’ll use the head() function to print only the first 10 rows.

## show first 10 rows
head(df, n = 10)
   stu_id x1sex x1race x1stdob x1txmtscor x1paredu x1hhnumber x1famincome
1   10001     1      8  199502    59.3710        5          3          10
2   10002     2      8  199511    47.6821        3          6           3
3   10003     2      3  199506    64.2431        7          3           6
4   10004     2      8  199505    49.2690        4          2           5
5   10005     1      8  199505    62.5897        4          4           9
6   10006     2      8  199504    58.1268        3          6           5
7   10007     2      8  199409    49.4960        2          2           4
8   10008     1      8  199410    54.6249        7          3           7
9   10009     1      8  199501    53.1875        2          3           4
10  10010     2      8  199503    63.7986        3          4           4
   x1poverty185   x1ses x1stuedexpct x1paredexpct x1region x4hscompstat
1             0  1.5644            8            6        2            1
2             1 -0.3699           11            6        1            1
3             0  1.2741           10           10        4            1
4             0  0.5498           10           10        3            1
5             0  0.1495            6           10        3            1
6             0  1.0639           10            8        3           -8
7             0 -0.4300            8           11        1            1
8             0  1.5144            8            6        1            1
9             0 -0.3103           11           11        3            1
10            0  0.0451            8            6        1           -8
   x4evratndclg x4hs2psmos
1             1          3
2             1          3
3             1          4
4             0         -7
5             0         -7
6            -8         -8
7             1          2
8             1          3
9             1          8
10           -8         -8

Select variables (columns)

  • Data frames are like special matrices
    • They have rows and columns
    • You can access these rows and columns using square bracket notation ([])
    • Because data frames have two dimensions, you use a comma inside the square brackets to indicate what you mean ([,]):
      • df[<rows>,<cols>]
  • At it’s most basic, you can use numbers to represent the index of the cell or cells you’re interested in
    • For example, if you want to access the value of the cell in row 1, column 4, you can use:
## show value at row 1, col 4
df[1, 4]
[1] 199502
  • Because data frames have column names (the variable names in our data set), we can also refer to them by name
    • The fourth column is the student date of birth variable, x1stdob
      • We can use that instead of 4 (notice the quotation marks ""):
## show value at row 1, x1stdob column
df[1, "x1stdob"]
[1] 199502
  • If we want to see more than one column, we can put the names in a concatenated vector using the c() function:
## show values at row 1, stu_id & x1stdob column
df[1, c("stu_id", "x1stdob")]
  stu_id x1stdob
1  10001  199502
  • So far, we’ve not assigned these results to anything, so they’ve just printed to the console.
    • However, we can assign them to a new object
    • If we want to slice our data so that we only have selected columns, we can leave the rows section blank (meaning we want all rows) and include all the columns we want to keep in our new data frame object.
## -----------------
## select
## -----------------

## select columns we need and assign to new object
df_tmp <- df[, c("stu_id", "x1stuedexpct", "x1paredexpct", "x1region")]

## show 10 rows
head(df_tmp, n = 10)
   stu_id x1stuedexpct x1paredexpct x1region
1   10001            8            6        2
2   10002           11            6        1
3   10003           10           10        4
4   10004           10           10        3
5   10005            6           10        3
6   10006           10            8        3
7   10007            8           11        1
8   10008            8            6        1
9   10009           11           11        3
10  10010            8            6        1

Mutate data into new forms

Changing existing variables (columns)

  • To conditionally change a variable, we’ll once again use the bracket notation to target our changes
  • This time, however, we do a couple of things differently:
    • include square brackets on the LHS of the assignment
    • use conditions in the <rows> part of the bracket
  • As before, we need to account for the fact that our two expectation variables, x1stuedexpct and x1paredexpct, have values that need to be converted to NA: -8, -9, and 11
  • First, let’s look at the unique values using the table() function
    • This somewhat similar to count() in tidyverse
  • So that we see any missing values, we’ll include an extra argument useNA = "ifany"
    • This just means we will see counts for NAs if there are any
## -----------------
## mutate
## -----------------

## see unique values for student expectation
table(df_tmp$x1stuedexpct, useNA = "ifany")

  -8    1    2    3    4    5    6    7    8    9   10   11 
2059   93 2619  140 1195  115 3505  231 4278  176 4461 4631 
## see unique values for parental expectation
table(df_tmp$x1paredexpct, useNA = "ifany")

  -9   -8    1    2    3    4    5    6    7    8    9   10   11 
  32 6715   55 1293  149 1199  133 4952   76 3355   37 3782 1725 
  • Notice that we use a dollar sign, $, to call the column name from the data frame
    • Unlike with the tidyverse, we cannot just use the column name
      • Base R will look for that column name not as a column in a data frame, but as its own object
      • It probably won’t find it (or worse, you’ll have another object in memory that it will find and you’ll get the wrong thing!).
  • To modify a variable when it’s a certain value, we can use the [] square brackets in a more advanced way
    • Start by identifying the column you’d like
      • e.g, df_tmp$x1stuedexpct -Then add the [] square brackets and inside them
    • Inside them we can add a condition to them, such as when a column is equal to -8
      • e.g., df_tmp$x1stuedexpct == -8
        • Think of this a bit like filter() from the tidyverse
    • If we just print this, you’ll see a load of -8, not that useful…
## This will just print a bunch of -8s
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == -8]
   [1] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
  [25] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
  [49] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
  [73] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
  [97] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [121] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [145] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [169] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [193] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [217] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [241] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [265] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [289] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [313] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [337] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [361] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [385] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [409] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [433] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [457] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [481] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [505] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [529] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [553] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [577] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [601] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [625] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [649] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [673] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [697] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [721] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [745] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [769] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [793] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [817] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [841] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [865] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [889] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [913] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [937] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [961] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
 [985] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1009] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1033] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1057] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1081] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1105] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1129] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1153] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1177] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1201] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1225] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1249] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1273] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1297] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1321] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1345] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1369] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1393] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1417] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1441] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1465] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1489] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1513] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1537] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1561] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1585] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1609] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1633] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1657] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1681] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1705] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1729] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1753] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1777] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1801] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1825] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1849] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1873] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1897] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1921] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1945] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1969] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[1993] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[2017] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
[2041] -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
  • But, instead of printing it, we can assign NA to it, which will replace all those -8s with NA
    • We can do the same for 11 to while we are at it
## replace student expectation values
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == -8] <- NA
df_tmp$x1stuedexpct[df_tmp$x1stuedexpct == 11] <- NA
  • If you think back to our previous lesson, we can be a little more slick than this though
    • If we change the statement to %in% c(-8, -9, 11) it will do it all at once
## replace parent expectation values
df_tmp$x1paredexpct[df_tmp$x1paredexpct %in% c(-8, -9, 11)] <- NA

Let’s confirm using table() again. The values that were in -8, -9, and 11 should now be summed under NA.

## see unique values for student expectation (confirm changes)
table(df_tmp$x1stuedexpct, useNA = "ifany")

   1    2    3    4    5    6    7    8    9   10 <NA> 
  93 2619  140 1195  115 3505  231 4278  176 4461 6690 
## see unique values for parental expectation (confirm changes)
table(df_tmp$x1paredexpct, useNA = "ifany")

   1    2    3    4    5    6    7    8    9   10 <NA> 
  55 1293  149 1199  133 4952   76 3355   37 3782 8472 

Adding new variables (columns)

  • Adding a new variable to our data frame is just like modifying an existing column

  • The only difference is that instead of putting an existing column name after the first $ sign, we’ll make up a new name

  • This tells R to add a new column to our data frame

  • As with the tidyverse version, we’ll use the ifelse() function to create a new variable that is the higher of student or parental expectations

## add new column
df_tmp$high_expct <- ifelse(df_tmp$x1stuedexpct > df_tmp$x1paredexpct, # test
                            df_tmp$x1stuedexpct,                       # if TRUE
                            df_tmp$x1paredexpct)                       # if FALSE

## show first 10 rows
head(df_tmp, n = 10)
   stu_id x1stuedexpct x1paredexpct x1region high_expct
1   10001            8            6        2          8
2   10002           NA            6        1         NA
3   10003           10           10        4         10
4   10004           10           10        3         10
5   10005            6           10        3         10
6   10006           10            8        3         10
7   10007            8           NA        1         NA
8   10008            8            6        1          8
9   10009           NA           NA        3         NA
10  10010            8            6        1          8
  • Just like in the original lesson, it doesn’t handle NA values how we want it to
    • Look at student 10002 in the second row:
      • While the student doesn’t have an expectation (or said “I don’t know”), the parent does.
        • However, our new variable records NA. Let’s fix it with this test:

If high_expct is missing and x1stuedexpct is not missing, replace with that; otherwise replace with itself (leave alone). Repeat, but for x1paredexpct. If still NA, then we can assume both student and parent expectations were missing.

Translating the bold words to R code:

  • is missing: is.na()
  • and: &
  • is not missing: !is.na() (! means NOT)

we get:

## correct for NA values

## NB: We have to include [is.na(df_tmp$high_expct)] each time so that
## everything lines up

## step 1 student
df_tmp$high_expct[is.na(df_tmp$high_expct)] <- ifelse(
    ## test
    !is.na(df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)]), 
    ## if TRUE do this...
    df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)],
    ## ... else do that
    df_tmp$high_expct[is.na(df_tmp$high_expct)]
)

## step 2 parent
df_tmp$high_expct[is.na(df_tmp$high_expct)] <- ifelse(
    ## test
    !is.na(df_tmp$x1paredexpct[is.na(df_tmp$high_expct)]),
    ## if TRUE do this...
    df_tmp$x1paredexpct[is.na(df_tmp$high_expct)],
    ## ... else do that
    df_tmp$high_expct[is.na(df_tmp$high_expct)]
)
  • That’s a lot of text!
  • What’s happening is that we are trying to replace a vector of values with another vector of values, which need to line up and be the same length
    • That’s why we start with df_tmp$x1stuedexpct[is.na(df_tmp$high_expct)]
      • When our high_expct column has missing values, we want to replace with non-missing x1stuedexpct values in the same row
        • That means we also need to subset that column to only include values in rows that have missing high_expct values
          • Because we must do this each time, our script gets pretty long and unwieldy.

Let’s check to make sure it worked as intended.

## show first 10 rows
head(df_tmp, n = 10)
   stu_id x1stuedexpct x1paredexpct x1region high_expct
1   10001            8            6        2          8
2   10002           NA            6        1          6
3   10003           10           10        4         10
4   10004           10           10        3         10
5   10005            6           10        3         10
6   10006           10            8        3         10
7   10007            8           NA        1          8
8   10008            8            6        1          8
9   10009           NA           NA        3         NA
10  10010            8            6        1          8
  • Looking at the second observation again, it looks like we’ve fixed our NA issue

Filter observations (rows)

  • Let’s check the counts of our new variable:
## -----------------
## filter
## -----------------

## get summary of our new variable
table(df_tmp$high_expct, useNA = "ifany")

   1    2    3    4    5    6    7    8    9   10 <NA> 
  71 2034  163 1282  132 4334  191 5087  168 6578 3463 
  • Since we’re can’t use the missing values we’ll drop those observations from our data frame
  • Just like when we selected columns above, we’ll use the [] square brackets notation
    • As with dplyr’s filter(), we want to filter in what we want (i.e., when it’s not NA)
  • Since we want to filter rows, we set this condition before the comma in the square brackets
    • Because we want all the columns, we leave the space after the comma blank
## filter in values that aren't missing
df_tmp <- df_tmp[!is.na(df_tmp$high_expct),]

## show first 10 rows
head(df_tmp, n = 10)
   stu_id x1stuedexpct x1paredexpct x1region high_expct
1   10001            8            6        2          8
2   10002           NA            6        1          6
3   10003           10           10        4         10
4   10004           10           10        3         10
5   10005            6           10        3         10
6   10006           10            8        3         10
7   10007            8           NA        1          8
8   10008            8            6        1          8
10  10010            8            6        1          8
11  10011            8            6        3          8
  • It looks like we’ve dropped the rows with missing values in our new variable (or, more technically, kept those without missing values)
  • Since we haven’t removed rows until now, to double check, we can compare the number of rows in the original data frame, df, to what we have now
## is the original # of rows - current # or rows == NA in count?
nrow(df) - nrow(df_tmp)
[1] 3463
  • Comparing the difference, we can see it’s the same as the number of missing values in our new column
    • While not a formal test, it does support what we expected
      • In other words, if the number were different, we’d definitely want to go back and investigate

Summarize data

Now we’re ready to get the average of expectations that we need. For an overall average, we can just use the mean() function.

## -----------------
## summarize
## -----------------

## get average (without storing)
mean(df_tmp$high_expct)
[1] 7.272705
  • Overall, we can see that students and parents have high post-secondary expectations on average: to earn some graduate credential beyond a bachelor’s degree
  • However, this isn’t what we want. We want the values across census regions.
## check our census regions
table(df_tmp$x1region, useNA = "ifany")

   1    2    3    4 
3128 5312 8177 3423 
  • We’re not missing any census data, which is good!
  • To calculate our average expectations, we need to use the aggregate function
  • This function allows to compute a FUNction by a group
    • We’ll use it to get our summary.
## get average (assigning this time)
df_tmp <- aggregate(df_tmp["high_expct"],                # var of interest
                    by = list(region = df_tmp$x1region), # by group
                    FUN = mean)                          # function to run

## show
df_tmp
  region high_expct
1      1   7.389066
2      2   7.168110
3      3   7.357833
4      4   7.125329
  • Success! Expectations are similar across the country, but not the same by region.

Write out updated data

  • We can use this new data frame as a table in its own right or to make a figure
  • For now, however, we’ll simply save it using the opposite of read.csv()write.csv()
## write with useful name
write.csv(df_tmp, file.path("data", "high_expct_mean_region.csv"))
  • And with that, we’ve met our task: we can show average educational expectations by region
  • To be very precise, we can show the higher of student and parental educational expectations among those who answered the question by region
    • This caveat doesn’t necessarily make our analysis less useful, but rather sets its scope.
  • Furthermore, we’ve kept our original data as is (we didn’t overwrite it) for future analyses while saving the results of this analysis for quick reference
  • Using the hsls_small.csv data set answer the following questions
    • Hint: You’re going to need the code book to identify the necessary variables
  • For this assignment, you cannot use the tidyverse or any other packages
  • Although it’s now part of vanilla R, you can’t use a |> pipe either
  • The purpose of this assignment is learn how to write more traditional-style computer code
  • Although you’ve already done this task with tidyverse, this is probably still the hardest extra credit, good luck!
Hint

Calculating the percentages is the trickiest part of this assignment, with and without tidyverse, as we don’t directly talk about it in the lesson. The output of the table() function should give you the information you need, remember you can access specifc elements of any objects (things you assign) with [].

Questions

  1. What is the average standardized math test score?
  2. How does this differ by gender?
  1. Among those students who are under 185% of the federal poverty line in the base year of the survey, what is the median household income category? (include what that category represents)
  1. Of the students who earned a high school credential (traditional diploma or GED), what percentage earned a GED or equivalency?
  2. How does this differ by region?
  1. What percentage of students ever attended a post-secondary institution by February 2016?
  2. Optional: Give the cross tabulation for both family incomes above/below $35,000 and region
    • This means you should have percentages for 8 groups: above/below $35k within each region

Once complete, turn in the .R script (no data etc.) to Canvas by the due date (Sunday 11:59pm following the final lesson). Good faith efforts (as determined by the instructor) at extra credit assignments will earn full credit if submitted on time.

Solution

R Solution Code

## -----------------------------------------------------------------------------
##
##' [PROJ: 7916]
##' [FILE: Data Wrangling I Redux: Vanilla R Solution]
##' [INIT: February 5th 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------

setwd(this.path::here())

## ---------------------------
##' [Libraries]
## ---------------------------

## None!

## ---------------------------
##' [Input]
## ---------------------------

df <- read.csv(file.path("data", "hsls-small.csv"))

## ---------------------------
##' [Q1]
## ---------------------------

## Part One
df$x1txmtscor[df$x1txmtscor == -8] <- NA
df_math <- na.omit(df) # No other NAs, so omit works fine

mean(df_math$x1txmtscor)

## Part Two
aggregate(df_math["x1txmtscor"],          # var of interest
          by = list(sex = df_math$x1sex), # by group
          FUN = mean)                # function to run

## ---------------------------
##' [Q2]
## ---------------------------

df_pov <- df[df$x1poverty185 == 1,]

paste("The median household income cat is", median(df_pov$x1famincome),
      "which represents incomes between $15k and $35k")

## ---------------------------
##' [Q3]
## ---------------------------

##'[Part I]

df_hs <- df[df$x4hscompstat %in% c(1,2),]

hs_counts <- table(df_hs$x4hscompstat)
hs_counts["total"] <- hs_counts["1"] + hs_counts["2"]
percent <- round(hs_counts["2"]/hs_counts["total"]*100, 2)
paste(percent, "% of those with HS credential have a GED")


##'[Part II]
hs_counts_region <- aggregate(df_hs["x4hscompstat"],          # var of interest
                              by = list(region = df_hs$x1region), # by group
                              FUN = table)

hs_counts_region["total"] <- hs_counts_region$x4hscompstat[, "1"] + hs_counts_region$x4hscompstat[, "2"]
hs_counts_region["percent"] <- hs_counts_region$x4hscompstat[, "2"]/hs_counts_region["total"]*100

hs_counts_region[c("region", "percent")]

## ---------------------------
##' [Q4]
## ---------------------------

## Part One

df_col <- df[df$x4evratndclg != -8,]

col_counts <- table(df_col$x4evratndclg)
college <- col_counts["1"]
total <- col_counts["0"] + col_counts["1"]
percent <- round(college/total*100, 2)

paste(percent, "% of student ever attended college")

## Part Two

## Drop any missing household income and make new variable
df_col <- df_col[!df_col$x1famincome %in% c(-8,-9),]
df_col$below_35k <- ifelse(df_col$x1famincome %in% c(1,2), 1, 0)

## Now, we can use the aggregate function again, but use the table function
## we used 
## but inside the "by" list
## we can put both columns in there, a bit like we put two variables in
## tidyverse's group by
col_counts <- aggregate(df_col["x4evratndclg"],          # var of interest
                        by = list(b35k = df_col$below_35k, # by group(s)
                                  region = df_col$x1region), 
                        FUN = table)                # function to run

## The ouput of this is a little messy, so we can clean it up
col_counts$attend <- col_counts$x4evratndclg[,"1"]
col_counts$total <- col_counts$x4evratndclg[,"1"] + col_counts$x4evratndclg[,"0"]
col_counts$perc <- col_counts$attend/col_counts$total * 100

print(col_counts)

## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------