III: Working with strings & dates

R Code

The data we’ve used so far in this course have been almost entirely numerical. Even when the field represented an expected level of education, for example, we didn’t see “complete a Bachelor’s degree.” Instead, we saw the number 6 — sometimes with a label and sometimes without (meaning we had to look up what 6 stood for). In the few cases in which we’ve seen strings or dates, the values have been very regular.

Much education-related data, however, are not this uniform. Particularly when using administrative data files, you are likely to read in columns that contain unstructured strings: names, addresses, dates, etc. Why are they unstructured? Almost always the answer is that the person who initially keyed in the data neither had a dropdown menu of options to choose from nor separate fields for each part of the data element (e.g., first name, last name). Instead, they had a blank field in which they typed:

Enter name: Isaiah Berlin

Why is this a problem? With an open field, the variations are (often) unlimited:

  • I. Berlin
  • isaiah berlin
  • Berlin, Isaiah

Similarly, the same date can be written any number of ways:

Enter date: February 11, 2020

  • 11 February 2020
  • 11 Feb 2020
  • Feb. 11, 2020
  • 2/11/2020 (American)
  • 11/2/2020 (most everyone else)
  • 2/11/20

To be clear, this is not to impugn those who enter the data. Rather, it’s an acknowledgment that the original uses of the data we analyze may differ from our own, e.g., data input to meet compliance with an administrative task versus data input for statistical analysis.

Now, imagine gaining access to an administrative data set with these two irregular columns, name and date, and thousands or even millions of rows. To complete your analytic task, you must clean the data so that for each observation you keep only the last name and convert the date into a format that allows you to easily calculate time periods between dates. With so many observations, it’s an impossible task to do by hand. But because the data are irregular, you can’t, for example, just select the second word from the name (what if the last name comes first in some rows?) or the second number after a forward slash, / (what about when the date uses hyphens, a month name, or a different order?).

You won’t always need to work with strings and dates, but when you do, having a few specialty tools in your toolbox will be greatly beneficial. Sometimes they can mean the difference between being able to answer your question and not. In this lesson, we’ll discuss two: regular expressions and dates.

Setup

As before, we’ll continue working within the tidyverse. We’ll focus, however, on using two specific libraries: stringr and lubridate.

You may have noticed already that when we load the tidyverse library with library(tidyverse), the stringr library is already loaded. The lubridate library, though part of the tidyverse, is not. We need to load it separately.

## ---------------------------
## libraries
## ---------------------------

## NB: The stringr library is loaded with tidyverse, but
## lubridate is not, so we need to load it separately

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

NB: As we have done in the past few lessons, we’ll run this script assuming that our working directory is set to the scripts directory.

Part 1: Working with strings

To practice working with strings, we’ll use data from the Integrated Postsecondary Education Data System (IPEDS):

The National Center for Education Statistics (NCES) administers the Integrated Postsecondary Education Data System (IPEDS), which is a large-scale survey that collects institution-level data from postsecondary institutions in the United States (50 states and the District of Columbia) and other U.S. jurisdictions. IPEDS defines a postsecondary institution as an organization that is open to the public and has the provision of postsecondary education or training beyond the high school level as one of its primary missions. This definition includes institutions that offer academic, vocational and continuing professional education programs and excludes institutions that offer only avocational (leisure) and adult basic education programs. Definitions for other terms used in this report may be found in the IPEDS online glossary.

NCES annually releases national-level statistics on postsecondary institutions based on the IPEDS data. National statistics include tuition and fees, number and types of degrees and certificates conferred, number of students applying and enrolled, number of employees, financial statistics, graduation rates, student outcomes, student financial aid, and academic libraries.

You can find more information about IPEDS here. As a higher education scholar, you may often turn to IPEDS data as a valuable resource (I do).

We’ll use one file (which can be found here) that covers institutional characteristics for one year:

  • Directory information, 2007 (hd2007.csv)
## ---------------------------
## input
## ---------------------------

## read in data and lower all names using rename_all(tolower)
df <- read_csv(file.path("data", "hd2007.csv")) |>
    rename_all(tolower)
Rows: 7052 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): INSTNM, ADDR, CITY, STABBR, ZIP, CHFNM, CHFTITLE, EIN, OPEID, WEBA...
dbl (43): UNITID, FIPS, OBEREG, GENTELE, OPEFLAG, SECTOR, ICLEVEL, CONTROL, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Finding: str_detect()

So far, we’ve filtered data using dplyr’s filter() verb. When matching a string, we have used == (or != for negative match). For example, if we wanted to limit our data to only those institutions in Florida, we could filter using the stabbr column:

## filter using state abbreviation (not saving, just viewing)
df |>
    filter(stabbr == "FL")
# A tibble: 316 × 59
   unitid instnm    addr  city  stabbr zip    fips obereg chfnm chftitle gentele
    <dbl> <chr>     <chr> <chr> <chr>  <chr> <dbl>  <dbl> <chr> <chr>      <dbl>
 1 132268 Wyotech-… 470 … Ormo… FL     32174    12      5 Stev… Preside… 3.86e12
 2 132338 The Art … 1799… Fort… FL     3331…    12      5 Char… Preside… 9.54e13
 3 132374 Atlantic… 4700… Coco… FL     3306…    12      5 Robe… Director 7.54e 9
 4 132408 The Bapt… 5400… Grac… FL     32440    12      5 Thom… Preside… 8.50e 9
 5 132471 Barry Un… 1130… Miami FL     3316…    12      5 Sist… Preside… 8.01e 9
 6 132523 Gooding … 615 … Pana… FL     32401    12      5 Dr. … CRNA Ph… 8.51e 9
 7 132602 Bethune-… 640 … Dayt… FL     3211…    12      5 Dr T… Preside… 3.86e 9
 8 132657 Lynn Uni… 3601… Boca… FL     3343…    12      5 Kevi… Preside… 5.61e 9
 9 132666 Bradento… 5505… Brad… FL     34209    12      5 A. P… CEO      9.42e 9
10 132675 Bradford… 609 … Star… FL     32091    12      5 Rand… Director 9.05e 9
# ℹ 306 more rows
# ℹ 48 more variables: ein <chr>, opeid <chr>, opeflag <dbl>, webaddr <chr>,
#   adminurl <chr>, faidurl <chr>, applurl <chr>, sector <dbl>, iclevel <dbl>,
#   control <dbl>, hloffer <dbl>, ugoffer <dbl>, groffer <dbl>, fpoffer <dbl>,
#   hdegoffr <dbl>, deggrant <dbl>, hbcu <dbl>, hospital <dbl>, medical <dbl>,
#   tribal <dbl>, locale <dbl>, openpubl <dbl>, act <chr>, newid <dbl>,
#   deathyr <dbl>, closedat <chr>, cyactive <dbl>, postsec <dbl>, …

This works well because the stabbr column, even though it uses strings, is regular. But what happens when the strings aren’t so regular? For example, let’s look at the different titles chief college administrators take.

## see first few rows of distinct chief titles
df |>
    distinct(chftitle)
# A tibble: 556 × 1
   chftitle          
   <chr>             
 1 Commandant        
 2 President         
 3 Chancellor        
 4 Interim President 
 5 CEO               
 6 Acting President  
 7 Director          
 8 President/CEO     
 9 Interim Chancellor
10 President/COO     
# ℹ 546 more rows

We find over 500 unique titles. Just looking at the first 10 rows, we see that some titles are pretty similar — President vs. CEO vs. President/CEO — but not exactly the same. Let’s look again, but this time get counts of each distinct title and arrange from most common to least.

## return the most common titles
df |>
    ## get counts of each type
    count(chftitle) |>
    ## arrange in descending order so we see most popular at top
    arrange(desc(n))
# A tibble: 556 × 2
   chftitle               n
   <chr>              <int>
 1 President           3840
 2 Director             560
 3 Chancellor           265
 4 Executive Director   209
 5 Owner                164
 6 Campus President     116
 7 Superintendent       105
 8 CEO                   90
 9 <NA>                  85
10 Interim President     75
# ℹ 546 more rows

Quick exercise

What do you notice about the data frames returned by distinct() and count()? What’s the same? What does count() do that distinct() does not?

Getting our counts and arranging, we can see that President is by far the most common title. That said, we also see Campus President and Interim President (and before we saw Acting President as well).

If your research question asked how many chief administrators use the title of “President,” regardless of the various iterations, you can’t really use a simple == filter any more. In theory, you could inspect your data, find the unique versions, get counts of each of those using ==, and then sum them up — but that’s a lot of work and likely to be error prone!

Instead, we can use the stringr function str_detect(), which looks for a pattern in a vector of strings:

str_detect(< vector of strings >, < pattern >)

Going item by item in the vector, it compares what it sees to the pattern. If it matches, then it returns TRUE; if not, then FALSE. Here’s a toy example:

## string vector example
fruits <- c("green apple", "banana", "red apple")

## search for "apple", which should be true for the first and third item
str_detect(fruits, "apple")
[1]  TRUE FALSE  TRUE

We can use str_detect() inside filter() to select only certain rows in our data frame. In our case, we want only those observations in which the title "President" occurs in the chftitle column. Because we’re only detecting, as long as "President" occurs anywhere in the title, we’ll get that row back.

## how many use some form of the title president?
df |>
    ## still starting with our count
    count(chftitle) |>
    ## ...but keeping only those titles that contain "President"
    filter(str_detect(chftitle, "President")) |>
    ## arranging as before
    arrange(desc(n))
# A tibble: 173 × 2
   chftitle              n
   <chr>             <int>
 1 President          3840
 2 Campus President    116
 3 Interim President    75
 4 President/COO        47
 5 President/CEO        46
 6 School President     31
 7 Vice President       29
 8 President and CEO    17
 9 College President    15
10 President & CEO      14
# ℹ 163 more rows

Now we’re seeing many more versions. We can even more clearly see a few titles that are almost certainly the same title, but were just inputted differently — President/CEO vs. President and CEO vs. President & CEO.

Quick exercise

Ignoring the sub-counts of the various versions, how many chief administrators have the word “President” in their title?

Seeing the different versions of basically the same title should have us stopping to think: since it seems that this data column contains free-form input (e.g. Input chief administrator title:), maybe we should allow for typos? The easiest check: is there any reason to assume that “President” will always be capitalized?

Quick exercise

What happens if we search for “president” with a lowercase “p”?

Ah! We find a few stragglers. How can we restructure our filter so that we get these, too? There are at least two solutions.

1. Use regular expressions

Regular expressions (aka regex) are strings that use a special syntax to create patterns that can be used to match other strings. They are very useful when you need to match strings that have some general form, but may differ in specifics.

We already used this technique in a prior lesson when we matched columns in all_schools_wide.csv with contains("19") so that we could pivot_longer(). Instead of naming all the columns specifically, we recognized that each column took the form of <test>_19<YY>. This is a type of regular expression.

In the tidyverse, some of the stringr and tidyselect helper functions abstract away some of the nitty-gritty behind regular expressions. Still, knowing a little about regular expression syntax, particularly how it is used in R, can go a long way.

In our first case, we can match strings that contain either capital-P "President" or lowercase-p "president" using square brackets ([]). If we want either “P” or “p”, then we can use the regex [Pp] in place of the first character: "[Pp]resident". This will match either "President" or "president".
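
Here’s a quick sketch of the bracket pattern on a toy vector. Note that the all-caps version still fails, since the rest of the pattern remains lowercase:

## toy example: [Pp] only makes the first letter case-insensitive
str_detect(c("President", "president", "PRESIDENT"), "[Pp]resident")
[1]  TRUE  TRUE FALSE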

## solution 1: look for either P or p
df |>
    count(chftitle) |>
    filter(str_detect(chftitle, "[Pp]resident")) |>
    arrange(desc(n))
# A tibble: 175 × 2
   chftitle              n
   <chr>             <int>
 1 President          3840
 2 Campus President    116
 3 Interim President    75
 4 President/COO        47
 5 President/CEO        46
 6 School President     31
 7 Vice President       29
 8 President and CEO    17
 9 College President    15
10 President & CEO      14
# ℹ 165 more rows

Though we don’t see the new observations in the abbreviated output, we note that the number of rows has increased by two. This means that there are at least two title formats in which "president" is lowercase and that we weren’t picking up when we only used the uppercase version of "President" before.

2. Put everything in the same case and match with that case

Another solution, which is probably much easier in this particular case, is to set all potential values in chftitle to the same case and then match using that case. In many situations, this is preferable since you don’t need to guess cases up front.

We won’t change the values in chftitle permanently — only while filtering. To compare apples to apples (rather than "Apples" to "apples"), we’ll wrap our column name with the function str_to_lower(), which will make all characters lowercase, and match using lowercase "president".
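
As a minimal sketch of the idea on a toy vector:

## toy example: str_to_lower() collapses case differences
str_to_lower(c("President", "PRESIDENT", "president"))
[1] "president" "president" "president"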

## solution 2: make everything lowercase so that case doesn't matter
df |>
    count(chftitle) |>
    filter(str_detect(str_to_lower(chftitle), "president")) |>
    arrange(desc(n))
# A tibble: 177 × 2
   chftitle              n
   <chr>             <int>
 1 President          3840
 2 Campus President    116
 3 Interim President    75
 4 President/COO        47
 5 President/CEO        46
 6 School President     31
 7 Vice President       29
 8 President and CEO    17
 9 College President    15
10 President & CEO      14
# ℹ 167 more rows

We recover another two titles when using this second solution. Clearly, our first solution didn’t account for other cases (perhaps “PRESIDENT”?).

In general, I find it’s a good idea to try a solution like the second one before a more complicated one like the first. But because every problem is different, so too are the solutions. You may find yourself using a combination of the two.
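
As an aside, stringr also offers a third option that combines the two ideas: wrapping the pattern in regex() with ignore_case = TRUE makes the match itself case-insensitive. A minimal sketch, which should return the same rows as our second solution:

## aside: case-insensitive matching with the regex() modifier
df |>
    count(chftitle) |>
    filter(str_detect(chftitle, regex("president", ignore_case = TRUE))) |>
    arrange(desc(n))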

Not-so-quick exercise

Another chief title that was high on the list was “Owner.” How many institutions have an “Owner” as their chief administrator? Of these, how many are private, for-profit institutions (control == 3)? How many have the word “Beauty” in their name?

Replace using string position: str_sub()

In addition to filtering data, we sometimes need to create new variables from pieces of existing variables. For example, let’s look at the zip code values that are included in the file.

## show first few zip code values
df |>
    select(unitid, zip)
# A tibble: 7,052 × 2
   unitid zip       
    <dbl> <chr>     
 1 100636 36112-6613
 2 100654 35762     
 3 100663 35294-0110
 4 100690 36117-3553
 5 100706 35899     
 6 100724 36101-0271
 7 100733 35401     
 8 100751 35487-0166
 9 100760 35010     
10 100812 35611     
# ℹ 7,042 more rows

We can see that we have both regular 5-digit zip codes as well as those that include the extra 4 digits (ZIP+4). Let’s say we don’t need those last four digits for our analysis (particularly because not every school uses them anyway). Our task is to create a new column that pulls out only the main part of the zip code. It has to work both for zip values that include the additional hyphen and 4 digits and for those that only have the primary 5 digits to begin with.

One solution in this case is to take advantage of the fact that zip codes — minus the sometimes-extra 4 digits — should be regular: 5 digits. If we want a sub-part of a string and that sub-part is always in the same spot, we can use the function str_sub(), which takes a string or column name first and has arguments for the starting and ending characters that mark the sub-string of interest.

In our case, we want the first 5 digits, so we set start = 1 and end = 5:

## pull out first 5 digits of zip code
df <- df |>
    mutate(zip5 = str_sub(zip, start = 1, end = 5))

## show (use select() to subset so we can see new columns)
df |>
    select(unitid, zip, zip5)
# A tibble: 7,052 × 3
   unitid zip        zip5 
    <dbl> <chr>      <chr>
 1 100636 36112-6613 36112
 2 100654 35762      35762
 3 100663 35294-0110 35294
 4 100690 36117-3553 36117
 5 100706 35899      35899
 6 100724 36101-0271 36101
 7 100733 35401      35401
 8 100751 35487-0166 35487
 9 100760 35010      35010
10 100812 35611      35611
# ℹ 7,042 more rows

A quick visual inspection of the first few rows shows that our str_sub() function performed as expected (for a real analysis, you’ll want to do more formal checks).
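
Side note: str_sub() also accepts negative start and end values, which count from the end of the string. A minimal sketch on a toy value:

## toy example: negative positions count from the string's end
str_sub("32605-1234", start = 1, end = 5)
[1] "32605"
str_sub("32605-1234", start = -4)
[1] "1234"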

Replace using regular expressions: str_replace()

We can also use a more sophisticated regex pattern with the function str_replace(). The pieces of our regex pattern, "([0-9]+)(-[0-9]+)?", translate as follows:

  • [0-9] := any digit, 0 1 2 3 4 5 6 7 8 9
  • + := match the preceding one or more times
  • ? := match the preceding zero or one time (i.e., the preceding is optional)
  • () := subexpression

Put together, we have:

  • ([0-9]+) := first, look for 1 or more digits
  • (-[0-9]+)? := second, look for a hyphen and one or more digits, but you may not find any of that

Because we used parentheses, (), to separate our subexpressions, we can call them using their numbers (in order) in the last argument of str_replace():

  • "\\1" := return the first subexpression

So what’s happening? If given a zip code that is "32605", the regex pattern will collect each digit — "3" "2" "6" "0" "5" — into the first subexpression because it never sees a hyphen. That first subexpression, "\\1", is returned: "32605". That’s what we want.

If given "32605-1234", it will collect the first 5 digits in the first subexpression, but will stop adding characters there when it sees the hyphen. From then on, it adds everything it sees to the second subexpression: "-" "1" "2" "3" "4". But because str_replace() only returns the first subexpression, we still get the same answer: "32605". This is what we want.
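
Here’s that logic as a minimal sketch on two toy strings, one of each format:

## toy example: only the first subexpression is kept either way
str_replace(c("32605", "32605-1234"), "([0-9]+)(-[0-9]+)?", "\\1")
[1] "32605" "32605"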

Let’s try it on the data.

## drop last four digits of extended zip code if they exist
df <- df |>
    mutate(zip5_v2 = str_replace(zip, "([0-9]+)(-[0-9]+)?", "\\1"))

## show (use select() to subset so we can see new columns)
df |>
    select(unitid, zip, zip5, zip5_v2)
# A tibble: 7,052 × 4
   unitid zip        zip5  zip5_v2
    <dbl> <chr>      <chr> <chr>  
 1 100636 36112-6613 36112 36112  
 2 100654 35762      35762 35762  
 3 100663 35294-0110 35294 35294  
 4 100690 36117-3553 36117 36117  
 5 100706 35899      35899 35899  
 6 100724 36101-0271 36101 36101  
 7 100733 35401      35401 35401  
 8 100751 35487-0166 35487 35487  
 9 100760 35010      35010 35010  
10 100812 35611      35611 35611  
# ℹ 7,042 more rows

Quick exercise

What if you wanted to get the last 4 digits (after the hyphen)? Which bit (or two bits) of the code above would you change so that you can store the last 4 digits without including the hyphen? Make a new variable called zip_plus4 and store these values. HINT Look at the help file for str_replace().

Let’s compare our two versions: do we get the same results?

## check if both versions of new zip column are equal
identical(df$zip5, df$zip5_v2)
[1] FALSE

No! Let’s see where they are different:

## filter to rows where zip5 != zip5_v2 (not storing...just looking)
df |>
    filter(zip5 != zip5_v2) |>
    select(unitid, zip, zip5, zip5_v2)
# A tibble: 4 × 4
  unitid zip        zip5  zip5_v2   
   <dbl> <chr>      <chr> <chr>     
1 108199 90015--350 90015 90015--350
2 113953 92113--191 92113 92113--191
3 431707 06360--709 06360 06360--709
4 435240 551012595  55101 551012595 

Quick exercise

What happened? In this scenario, which string subsetting technique worked better?

Depending on the task, regular expressions can either feel like a blessing or a curse. To be honest, I’ve spent more time cursing than thanking them. That said, regular expressions are often the only way to perform a data wrangling task on unstructured string data. They are also a cornerstone of natural language processing techniques, which are increasingly of interest to education researchers.

We’ve only scratched the surface of what regular expressions can do. If you face string data in the future, taking a little time to craft a regular expression can be well worth it.

Part 2: Working with dates

  • Much like names, dates are often saved as text/strings, and can be messy and formatted differently

    • For example, “2026-07-04”, “July 4th 2026”, “4th July 26”, and “04/07/26” all refer to the United States’ upcoming 250th birthday, but how do we make the computer understand that they all mean the same day?
  • What’s trickier is that even once we have dealt with the formatting, we sometimes need to be able to make calculations with dates

    • For example, if we have students’ SAT scores from multiple attempts, someone might be interested in knowing how many days passed between attempts

      • This involves knowing how many days are in each month (something I can never remember), if that year was a leap year, etc.
  • In our IPEDS data, we can see which institutions closed in 2007 and 2008 in the closedat column

    • -2 means the institution didn’t close in this period, so let’s drop them
## subset to schools who closed during this period
df <- df |>
  filter(closedat != -2) |>
  select(unitid, instnm, closedat)

df
# A tibble: 83 × 3
   unitid instnm                                                  closedat
    <dbl> <chr>                                                   <chr>   
 1 103440 Sheldon Jackson College                                 6/29/07 
 2 104522 DeVoe College of Beauty                                 3/29/08 
 3 105242 Mundus Institute                                        Sep-07  
 4 105880 Long Technical College-East Valley                      3/31/07 
 5 119711 New College of California                               Jan-08  
 6 136996 Ross Medical Education Center                           7/31/07 
 7 137625 Suncoast II the Tampa Bay School of Massage Therapy LLC 5/31/08 
 8 141583 Hawaii Business College                                 Sep-07  
 9 150127 Ball Memorial Hospital School of Radiologic Technology  May-07  
10 160144 Pat Goins Shreveport Beauty School                      3/1/08  
# ℹ 73 more rows

parse-ing Strings into Dates

  • So, we can see closedat is a <chr> or string variable

    • From a combination of looking at the values and knowing this is a US document, we can see most of these dates are month/day/year
  • We are really only going to use one function from lubridate: the parse_date_time() function

    • There are other ways you can handle dates, some in base R, some others in lubridate, but this is a way that works well in most situations

    • The function takes two main arguments

      • x: The string you are trying to turn into a date

      • orders: The format(s) the string date is written in

        • This takes one or more date formats to try

        • Since we think most of our dates are “month, day, year” this will be "mdy"

        • You can see a full list of options on the lubridate reference page

        • It tries these in order, seeing if it can “parse” or “figure out” the date, looking for anything that divides dates up (" ", "/", "-") and/or spelled-out date elements like “August”

          • If it can work out the date with the first format, it will move on; if not, it will try any others you gave it (in order), and if it can’t make the string fit into any of those formats, it will “fail to parse”
    • It then outputs a <dttm> (date time) type variable, which we are going to assign to a new variable clean_date

      • This involves a lot of well thought out code on the back-end, but makes our lives so much easier

      • In case you’re interested, R actually stores every date-time behind the scenes as the number of seconds since 00:00:00 UTC on January 1st 1970

  • Let’s give this a go!
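
Before running it on the whole column, here’s a minimal sketch on a single value taken from the data; the as.numeric() call shows the behind-the-scenes storage mentioned above:

## toy example: parse one string first
d <- parse_date_time("6/29/07", orders = "mdy")
d
[1] "2007-06-29 UTC"
as.numeric(d) # seconds since 00:00:00 UTC on January 1st 1970
[1] 1183075200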

## create a new clean_date column 
df <- df |>
    mutate(clean_date = parse_date_time(closedat,
                                        orders = "mdy"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `clean_date = parse_date_time(closedat, orders = "mdy")`.
Caused by warning:
!  35 failed to parse.
## show
df
# A tibble: 83 × 4
   unitid instnm                                    closedat clean_date         
    <dbl> <chr>                                     <chr>    <dttm>             
 1 103440 Sheldon Jackson College                   6/29/07  2007-06-29 00:00:00
 2 104522 DeVoe College of Beauty                   3/29/08  2008-03-29 00:00:00
 3 105242 Mundus Institute                          Sep-07   NA                 
 4 105880 Long Technical College-East Valley        3/31/07  2007-03-31 00:00:00
 5 119711 New College of California                 Jan-08   NA                 
 6 136996 Ross Medical Education Center             7/31/07  2007-07-31 00:00:00
 7 137625 Suncoast II the Tampa Bay School of Mass… 5/31/08  2008-05-31 00:00:00
 8 141583 Hawaii Business College                   Sep-07   NA                 
 9 150127 Ball Memorial Hospital School of Radiolo… May-07   NA                 
10 160144 Pat Goins Shreveport Beauty School        3/1/08   2008-03-01 00:00:00
# ℹ 73 more rows
  • Okay, that worked for the majority of our colleges, but we have some, like Sep-07, that it didn’t like

Quick Discussion

Why wasn’t it able to “parse” Sep-07?

Do we know enough about when this institution closed? Why or why not?

What might we be able to do with the information we have?

  • Now, remember orders can take more than one value, so maybe we could try another date format that can pick up some of these dates

    • Looking through the lubridate reference page, I can see that "my" (month year) is a format it will take as well, so let’s try that
## Try adding another date format
df <- df |>
    mutate(clean_date = parse_date_time(closedat,
                                        orders = c("mdy", "my")))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `clean_date = parse_date_time(closedat, orders = c("mdy",
  "my"))`.
Caused by warning:
!  7 failed to parse.
## show
df
# A tibble: 83 × 4
   unitid instnm                                    closedat clean_date         
    <dbl> <chr>                                     <chr>    <dttm>             
 1 103440 Sheldon Jackson College                   6/29/07  2007-06-29 00:00:00
 2 104522 DeVoe College of Beauty                   3/29/08  2008-03-29 00:00:00
 3 105242 Mundus Institute                          Sep-07   2007-09-01 00:00:00
 4 105880 Long Technical College-East Valley        3/31/07  2007-03-31 00:00:00
 5 119711 New College of California                 Jan-08   2008-01-01 00:00:00
 6 136996 Ross Medical Education Center             7/31/07  2007-07-31 00:00:00
 7 137625 Suncoast II the Tampa Bay School of Mass… 5/31/08  2008-05-31 00:00:00
 8 141583 Hawaii Business College                   Sep-07   2007-09-01 00:00:00
 9 150127 Ball Memorial Hospital School of Radiolo… May-07   2007-05-01 00:00:00
10 160144 Pat Goins Shreveport Beauty School        3/1/08   2008-03-01 00:00:00
# ℹ 73 more rows
  • Okay, this time only 7 failed to parse, so that’s a lot better

  • Take a look at what it did with Sep-07

    • Do we like this or not? Is it trustworthy?
  • Let’s take a look at the 7 that didn’t go through this time

df |>
  filter(is.na(clean_date))
# A tibble: 7 × 4
  unitid instnm                                     closedat clean_date
   <dbl> <chr>                                      <chr>    <dttm>    
1 200794 Akron Machining Institute Inc              07/200   NA        
2 231572 Braxton School                             2007     NA        
3 262013 New York Institute of Technology-Central … 2007     NA        
4 381343 CET-Reno                                   2007     NA        
5 394846 CET-Santa Ana                              2007     NA        
6 404532 Sharps Academy of Hair Styling             2007     NA        
7 436793 Warren Woods Vocational Adult Education    2007     NA        
  • Hmm, those might be a little hard to approximate, so let’s drop them
df <- df |>
  drop_na(clean_date)

Working with parse-ed Date Objects

  • We have successfully turned as many of the close dates as possible into date objects that R can understand, but why?

    • We can now make comparisons and calculations so much more easily

      • As we go through these examples, just imagine having to do this by hand
Numerical Calculations
  • Let’s say we wanted to quickly find the earliest date

    • We wouldn’t have gotten far with the <chr> string date, but R can understand our <dttm> object and tell us its min() value
df |> filter(clean_date == min(clean_date))
# A tibble: 1 × 4
  unitid instnm           closedat clean_date         
   <dbl> <chr>            <chr>    <dttm>             
1 365912 City College Inc 3/23/07  2007-03-23 00:00:00

Quick Exercise

Find the school with the most recent closure date

Comparing to Reference Dates

  • Or, let’s say we want to see how many schools closed before Christmas Day 2007

    • We can use parse_date_time() again to store December 25th 2007 as a date time object christmas_07

    • Then we can just filter schools whose date is less than that

christmas_07 <- parse_date_time("Dec 25 2007", "mdy")

df |> filter(clean_date < christmas_07)
# A tibble: 60 × 4
   unitid instnm                                    closedat clean_date         
    <dbl> <chr>                                     <chr>    <dttm>             
 1 103440 Sheldon Jackson College                   6/29/07  2007-06-29 00:00:00
 2 105242 Mundus Institute                          Sep-07   2007-09-01 00:00:00
 3 105880 Long Technical College-East Valley        3/31/07  2007-03-31 00:00:00
 4 136996 Ross Medical Education Center             7/31/07  2007-07-31 00:00:00
 5 141583 Hawaii Business College                   Sep-07   2007-09-01 00:00:00
 6 150127 Ball Memorial Hospital School of Radiolo… May-07   2007-05-01 00:00:00
 7 161624 Accutech Career Institute                 Sep-07   2007-09-01 00:00:00
 8 170824 Marquette General Hospital                Aug-07   2007-08-01 00:00:00
 9 180124 College of Coiffure Art Ltd               6/25/07  2007-06-25 00:00:00
10 186849 Harrison Career Institute-Vineland        Sep-07   2007-09-01 00:00:00
# ℹ 50 more rows
  • What about within 30 days of Christmas?

    • For this we need to use interval(clean_date, christmas_07) to look between two dates

    • Then time_length(..., "day") converts that interval to a number of days

    • Then abs() to get the absolute value (otherwise schools that closed long after Christmas 2007 would be kept, since their intervals are negative, e.g., -63 days)

    • This gets a little nested, so we can use more pipes |> inside our filter() statement if that’s easier
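
For intuition, here’s a minimal sketch of those pieces on a single date (December 1st is 24 days before Christmas):

## toy example: interval() then time_length() gives days between dates
time_length(interval(parse_date_time("Dec 1 2007", "mdy"), christmas_07), "day")
[1] 24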

## Nested version
df |> filter(abs(time_length(interval(clean_date, christmas_07), "day")) < 30)
# A tibble: 9 × 4
  unitid instnm                             closedat clean_date         
   <dbl> <chr>                              <chr>    <dttm>             
1 119711 New College of California          Jan-08   2008-01-01 00:00:00
2 191870 Interboro Institute                12/21/07 2007-12-21 00:00:00
3 225867 Austin Business College            12/28/07 2007-12-28 00:00:00
4 243902 Gaither and Company Beauty College 12/5/07  2007-12-05 00:00:00
5 415561 Cortiva Institute-Colorado         12/31/07 2007-12-31 00:00:00
6 436748 New Hampshire Career Institute     12/20/07 2007-12-20 00:00:00
7 445586 Banner Institute-Chicago           Dec-07   2007-12-01 00:00:00
8 446871 The Bryman School-East             Dec-07   2007-12-01 00:00:00
9 447500 Salter School-Cambridge Campus     12/31/07 2007-12-31 00:00:00
## Internal pipes version
df |> filter(interval(clean_date, christmas_07) |>
               time_length("day") |>
               abs() < 30)
# A tibble: 9 × 4
  unitid instnm                             closedat clean_date         
   <dbl> <chr>                              <chr>    <dttm>             
1 119711 New College of California          Jan-08   2008-01-01 00:00:00
2 191870 Interboro Institute                12/21/07 2007-12-21 00:00:00
3 225867 Austin Business College            12/28/07 2007-12-28 00:00:00
4 243902 Gaither and Company Beauty College 12/5/07  2007-12-05 00:00:00
5 415561 Cortiva Institute-Colorado         12/31/07 2007-12-31 00:00:00
6 436748 New Hampshire Career Institute     12/20/07 2007-12-20 00:00:00
7 445586 Banner Institute-Chicago           Dec-07   2007-12-01 00:00:00
8 446871 The Bryman School-East             Dec-07   2007-12-01 00:00:00
9 447500 Salter School-Cambridge Campus     12/31/07 2007-12-31 00:00:00

Quick Exercise

  • Did any schools close within a week of Christmas?

Extracting Info From Dates

  • What if we want to see which fiscal quarter the most schools closed in?

    • The handy quarter() function tells us that
df |> 
  mutate(quarter = quarter(clean_date)) |>
  count(quarter)
# A tibble: 4 × 2
  quarter     n
    <int> <int>
1       1    11
2       2    20
3       3    32
4       4    13
  • What about the day of the week they closed on?

    • Similarly, wday() can tell us that

      • We need to say label = TRUE to get “Sun” instead of 1
df |>
  mutate(day = wday(clean_date, label = TRUE)) |>
  count(day)
# A tibble: 7 × 2
  day       n
  <ord> <int>
1 Sun       6
2 Mon       9
3 Tue       7
4 Wed       7
5 Thu       6
6 Fri      15
7 Sat      26

Quick Question

  • Can we 100% trust these week day counts? Why or why not?

Summary

  • In this lesson we’ve looked at working with both generic strings and with dates

    • For both of these we have only begun to scratch the surface, but it should have given you enough of an idea to go out and get your hands dirty

    • If you found this interesting, Dr. Jinnie Shin teaches a class on Natural Language Processing in the Fall, which gets into how to make the computer begin to understand text

  • These are both messy types of data and you’re often going to have to make subjective decisions (e.g., how to handle Sep-07)

    • What’s most important is that you document/comment what you did and why you did it
  • As you will see over the next few weeks, while this stuff can be tricky, working with strings efficiently can be really powerful in more advanced programming

Use the IPEDS data sets, hd2007.csv and ic2007mission.csv, to answer the questions below. You may need to look up and download the data dictionaries for each file, which you can find on this page. You can also use the lesson on getting higher education data for help.

For each question, show your data work and then answer the question in a short (1-2 sentence(s)) comment.

Questions

NB To answer the questions, you will need to join the two IPEDS data sets using the common unitid key. Note that column names in hd2007.csv are uppercase (UNITID) while those in ic2007mission.csv are lowercase (unitid). There are a few ways to join when the keys don’t exactly match. One is to set all column names to the same case. If you want to use left_join() starting with hd2007.csv, you can first use the dplyr verb rename_all(tolower) in your chain to lower all column names. See the help file for left_join() for other ways to join by different variable names.
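
As a minimal sketch of that approach (assuming, as in the lesson, the files live in a data directory below your working directory):

## sketch: lower hd2007's column names, then join on the now-matching key
df <- read_csv(file.path("data", "hd2007.csv")) |>
    rename_all(tolower) |>
    left_join(read_csv(file.path("data", "ic2007mission.csv")), by = "unitid")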

I also find these cheat sheets extremely helpful

  1. How many chief administrator names start with “Dr.”?
    NB Many chief administrators are listed on more than one line due to branch campuses. Make sure to take this into account by keeping only distinct names.
  2. BONUS How many chief administrator names end with the title “PH.D.” or some variant?
  3. Among those schools that give their mission statement:
    1. How many repeat their institutional name in their mission statement?
    2. How many use the word civic?
    3. Which top 3 states have the most schools with mission statements that use the word future?
    4. Which type of schools (public, private-non-profit, private-for-profit) are most likely to use the word skill in their mission statement?
  4. Among the schools that closed in 2007 or 2008 and give a date with at least a month and year:
    1. Which has been closed for the longest time? How many months has it been from its close date to the beginning of this current month (1 February 2020)?
    2. How many days were there between the first school to close and the last?

Once complete, turn in the .R script (no data etc.) to Canvas by the due date (Sunday 11:59pm following the lesson). Assignments will be graded on the following Monday (time permitting) in line with the grading policy outlined in the syllabus.

Solution

R Solution Code

## -----------------------------------------------------------------------------
##
##' [PROJ: EDH 7916]
##' [FILE: Data Wrangling III Solution]
##' [INIT: March 18 2024]
##' [AUTH: Matt Capaldi] @ttalVlatt
##
## -----------------------------------------------------------------------------

setwd(this.path::here())

## ---------------------------
##' [Libraries]
## ---------------------------

library(tidyverse)
library(lubridate)

## ---------------------------
##' [Input]
## ---------------------------

df_hd <- read_csv("data/hd2007.csv")
df_mission <- read_csv("data/ic2007mission.csv")


df <- df_hd |>
  left_join(df_mission, by = c("UNITID" = "unitid")) # Could also just rename column to lower case

## ---------------------------
##' [Q1]
## ---------------------------

df |>
  mutate(chfnm_lower = str_to_lower(CHFNM)) |> # lower case the name
  filter(str_detect(chfnm_lower, "(^|\\s)dr(a)?(,|\\.|\\s)")) |> # keep anything that after either the start (^) or a " " has "dr" then maybe "a" (for Spanish dra) followed by either "," "." or " " 
  group_by(chfnm_lower, STABBR) |> # group by unique chief admin names (due to branch campuses) also group by state, to minimize chance of two different people with the same name being counted as one. Is there a better way to check for branch campuses?
  slice(1) |> # slice the one row of these (to remove duplicates)
  ungroup() |> # remove the grouping as we only wanted slice() to be grouped
  select(chfnm_lower) |> # keep only the chief admin name
  count(chfnm_lower) |> # get counts of the names (should be a column of ones)
  summarize(sum(n)) # count up the number ones

## Below is how I compared our in class regex with Ben's and my final answer

inclass_names <- df |>
  mutate(chfnm_lower = str_to_lower(CHFNM)) |> # lower case the name
  filter(str_detect(chfnm_lower, "dr\\.?\\s")) |> # keep anything "dr" maybe a "." then " "
  pull(chfnm_lower)

bens_names <- df |>
  mutate(chfnm_lower = str_to_lower(CHFNM)) |> # lower case the name
  filter(str_detect(chfnm_lower, "^dr\\.?[^ew]")) |> # keep anything that starts (^) "dr" maybe a "." then anything but "ew" (to exclude drew or andrew)
  pull(chfnm_lower)

bens_names[!bens_names %in% inclass_names]

matts_names <- df |>
  mutate(chfnm_lower = str_to_lower(CHFNM)) |> # lower case the name
  filter(str_detect(chfnm_lower, "(^|\\s)dr(a)?(,|\\.|\\s)")) |> # keep anything that after either the start (^) or a " " has "dr" then maybe "a" (for Spanish dra) followed by either "," "." or " "
  pull(chfnm_lower)

matts_names[!matts_names %in% bens_names]

## The differences between Ben's and my answers get to an important
## research point: technically, Ben's answer is what was asked, but in many
## situations where you were given that question, they would want "rev. dr"
## and "rabbi dr" included

## ---------------------------
##' [Q2]
## ---------------------------

df |>
  mutate(chfnm_lower = str_to_lower(CHFNM)) |> # lower case the name
  filter(str_detect(chfnm_lower, "ph\\.?\\s?d\\.?\\s?$")) |> # keep names ending ($) with "ph" then "d", each optionally followed by "." and/or a space
  group_by(chfnm_lower, STABBR) |> # group by unique chief admin names (due to branch campuses) also group by state to minimize chance of two different people with the same name (ideally, you'd be more sophisticated and check for branch campuses directly)
  slice(1) |> # slice the one row of these (to remove duplicates)
  ungroup() |> # remove the grouping as we only wanted slice() to be grouped
  count(chfnm_lower) |> # get counts of the names (should be a column of ones)
  summarize(sum(n)) # count up the number ones

## ---------------------------
##' [Q3]
## ---------------------------

##'[i]

df |>
  mutate(mission_lower = str_to_lower(mission), # lower case the mission
         instnm_lower = str_to_lower(INSTNM)) |> # lower case the institution name
  filter(str_detect(mission_lower, instnm_lower)) |> # search for the name in the mission
  count(mission_lower) |> # get counts of the missions (should be a column of ones)
  summarize(sum(n)) # count up the number ones
  
##'[ii]

df |>
  mutate(mission_lower = str_to_lower(mission)) |> # lower case the mission
  filter(str_detect(mission_lower, "civic")) |> # look for the word "civic"
  count(mission_lower) |> # get counts of the missions (should be a column of ones)
  summarize(sum(n)) # count up the number ones

##'[iii]

df |>
  mutate(mission_lower = str_to_lower(mission)) |> # lower case mission
  filter(str_detect(mission_lower, "future")) |> # look for the word "future"
  group_by(STABBR) |> # before we count this time, group by state
  count(mission_lower) |> # get counts of the missions (should be a column of ones)
  summarize(n = sum(n)) |> # count up the number ones
  arrange(desc(n)) |> # arrange in descending order of the count
  slice_head(n = 3) # keep the top three rows

##'[iv]

df |>
  mutate(mission_lower = str_to_lower(mission)) |> # lower case the mission
  filter(str_detect(mission_lower, "skill")) |> # look for the word skill
  group_by(CONTROL) |> # before we count, group by control
  count(mission_lower) |> # get counts of the missions (should be a column of ones)
  summarize(n = sum(n)) |> # count up the number ones
  arrange(desc(n)) # arrange in descending order

## ---------------------------
##' [Q4]
## ---------------------------

df_close <- df |>
  select(UNITID, INSTNM, CLOSEDAT) |> # Keep only the columns we need
  filter(CLOSEDAT != "-2") |> # Get rid of any colleges that haven't closed
  mutate(clean_date = parse_date_time(CLOSEDAT, ## Create clean_date by turning CLOSEDAT into a date_time object
                                      orders = c("mdy", "my"))) ## Try month/day/year format first, then just month/year (assumes 1st of month)

## Print out the 7 that "failed to parse"
df_close |>
  filter(is.na(clean_date))

## After inspection, we can't approximate the 7 that failed to parse accurately, so drop them
df_close <- df_close |>
  drop_na(clean_date)

## Now the clean_date is a lubridate object, we can just treat it like a numeric variable

##'[i]
##' Option One: Use the date written
df_close |>
  mutate(time_from_today = parse_date_time("Feb 1 2020", "mdy") - clean_date) |> # same logic as above, create a date_time object for Feb 1 2020 (format is month/day/year) get the difference from clean_date 
  slice_max(time_from_today) # slice off the row with the most days from that date

## Option Two: Use the date/time right now
df_close |>
  mutate(time_from_today = now() - clean_date) |> # now() just gets date/time when run
  slice_max(time_from_today) # slice off the row with the most days from today

##'[ii]
df_close |>
  summarize(answer = max(clean_date) - min(clean_date)) # We can also just use max() and min() like any other variable


## -----------------------------------------------------------------------------
##' *END SCRIPT*
## -----------------------------------------------------------------------------