IPEDtaS Tutorial

  • This page will walk you through the basics of how to use IPEDtaS to automagically retrieve labelled IPEDS .dta files

What IPEDtaS Does

  • NCES provides all the information needed to create labelled IPEDS data
    • However, it can be cumbersome to put all the pieces together manually
  • If you’ve worked with IPEDS before, you’ve likely seen this screen

Screenshot of IPEDS Data Center

  • When I first started working with IPEDS, I assumed the “Stata Data File” would be a nicely labelled data set, like you can get with other NCES data sets
    • However, it’s not
    • Instead, it’s just a plain .csv data set that is designed to be combined with the “STATA Program File” which applies the labels
      • These program files are also full of issues
        • Such as file paths hard-coded to the computer of the NCES worker who wrote the script, which you would have to update for each data file
        • And error-causing line breaks, which you’d have to fix before the script would run
    • Between downloading two files, fixing the issues, and then running them together, it’s a lot of work, so most people don’t bother
    • Plus, they only work in Stata, so R users without a Stata license are left out…
  • The R and Stata IPEDtaS scripts do all this heavy lifting for you so you can use nicely labelled IPEDS data without the extra effort!

Why Would I Want Labels Anyway?

You will see more of this in the applied example below, but, in short, they make data analysis much easier and reduce the amount you will have to look at the dictionary/codebook!

  • Look at the difference from these simple data checks on the number of colleges in each region
  • Without labels, we just get the numeric code for each region which we’d have to look up
data_without_labels |> count(obereg)
# A tibble: 10 × 2
   obereg     n
    <dbl> <int>
 1      0     7
 2      1   337
 3      2  1025
 4      3   891
 5      4   491
 6      5  1536
 7      6   665
 8      7   236
 9      8   953
10      9   148
  • With labels, we get a description of each region too
data_with_labels |> count(obereg)
# A tibble: 10 × 2
   obereg                                                             n
   <dbl+lbl>                                                      <int>
 1 0 [U.S. Service schools]                                           7
 2 1 [New England (CT, ME, MA, NH, RI, VT)]                         337
 3 2 [Mid East (DE, DC, MD, NJ, NY, PA)]                           1025
 4 3 [Great Lakes (IL, IN, MI, OH, WI)]                             891
 5 4 [Plains (IA, KS, MN, MO, NE, ND, SD)]                          491
 6 5 [Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)]  1536
 7 6 [Southwest (AZ, NM, OK, TX)]                                   665
 8 7 [Rocky Mountains (CO, ID, MT, UT, WY)]                         236
 9 8 [Far West (AK, CA, HI, NV, OR, WA)]                            953
10 9 [Other U.S. jurisdictions (AS, FM, GU, MH, MP, PR, PW, VI)]    148
  • First, this makes life easier by reducing the amount we have to look back and forth to the code book
  • Second, it makes it easier to spot accidental errors when checking our work, leading to more reliable analyses
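
To see what a labelled column looks like under the hood, here is a minimal sketch that builds a tiny one by hand with haven::labelled(). The three-row column is made up for illustration; the two region codes and their labels are copied from the output above, and the variable label is the one IPEDS uses for obereg.

```r
library(haven)

# Toy column (made up for illustration): three colleges coded by region
obereg <- labelled(
  c(1, 7, 7),
  labels = c(
    "New England (CT, ME, MA, NH, RI, VT)" = 1,
    "Rocky Mountains (CO, ID, MT, UT, WY)" = 7
  ),
  label = "Bureau of Economic Analysis (BEA) regions"
)

# The original numeric codes are still there
region_codes <- as.numeric(obereg)

# The value labels travel with the column as an attribute
region_labels <- attr(obereg, "labels")

# The variable label is stored separately
region_var_label <- attr(obereg, "label")
```

Printing obereg in the console shows the codes with their labels in square brackets, which is the same dbl+lbl display as in the table above.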

System Requirements

First things first, let’s consider what you will need on your computer to get started

For R users:

  1. An up-to-date version of R
  2. An up-to-date tidyverse package
    • In your R console type install.packages("tidyverse") to get the latest version
  3. If you want to download all of IPEDS, up to 12 GB of permanent space and 36 GB of temporary space

For Stata users:

  1. An up-to-date and licensed copy of Stata version 16.0 or higher (BE or Basic is sufficient)
  2. An up-to-date installation of Python
    • Don’t fret, you don’t have to use Python, but the .do file uses PyStata to clean up the scripts, so Python just needs to be on your machine
    • You can see if you already have Python by typing python search into your Stata command box
    • You can install a copy of Python from https://www.python.org/downloads/
  3. If you want to download all of IPEDS, up to 4 GB of permanent space and 12 GB of temporary space

Setting up Your Project Folder

This part is identical for both Stata and R users. The main points to note are:

  1. Download either the Stata or R version of the script from the links at the top of this page
    • Hint: If the file is opening in your browser, use “download linked file” on macOS, or “save link as” on Windows, to save the file (you can also just copy and paste)
  2. The script is designed to treat wherever you place the IPEDtaS.do or IPEDtaS.R file as the “working directory”
    • Check the working directory is set to your current project before doing anything else
      • If you download the script, save it in your project folder, then open it, you will often get the correct working directory by default, but it’s always best to check
      • When the script runs it will store output in ./data and ./dictionaries folders
        • Caution: Anything you have in folders with those names will be overwritten
        • This also applies to ./zip-data, ./zip-do-files, ./zip-dictionaries, ./unzip-data, ./unzip-do-files, ./unzip-dictionaries, which are folders used temporarily behind the scenes
  3. Personally, I set up my projects with scripts in the top level of the project folder (as in, not in a sub-folder), so that is how IPEDtaS was designed
    • If you need everything in a sub-folder for sanity reasons, either:
      1. Place IPEDtaS in your data folder (e.g., ./data/IPEDtaS.do), which will place the data in ./data/data/hd2022.dta
      2. Place IPEDtaS in your ./scripts folder and go through adjusting all the relative paths by adding ../ to back out one level
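
Before running anything, it can help to confirm the working directory and see whether any of the folders IPEDtaS writes to already exist. The sketch below is one way to do that in base R. The helper name existing_ipedtas_folders() is hypothetical and not part of IPEDtaS; the folder names are taken from the list above.

```r
# Folders IPEDtaS writes to (names taken from the list above)
ipedtas_folders <- c(
  "data", "dictionaries",
  "zip-data", "zip-do-files", "zip-dictionaries",
  "unzip-data", "unzip-do-files", "unzip-dictionaries"
)

# Hypothetical helper: which of those folders already exist in a project?
existing_ipedtas_folders <- function(project_dir = getwd()) {
  ipedtas_folders[dir.exists(file.path(project_dir, ipedtas_folders))]
}

# Confirm where R thinks the project lives, then list any folders at risk
getwd()
existing_ipedtas_folders()
```

If the second call returns anything you care about, move or rename those folders before running the script.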

File Selection

The only real change you have to make to the scripts is selecting which files you want to download

  • By default the scripts are set to download directory information (HD) and enrollment data (EFFY) for the 2023 reporting cycle
    • To change these you need to update the list to the files you want
    • Note: at the bottom of the script there is a list with every single IPEDS file in it; if you want the entire dataset, you can just copy and paste that longer list to the top of the script and edit as needed
  • To edit the list you basically just need to follow the list rules for each language
  • For the R version, selected_files <- c() needs to be a valid list of IPEDS file names
    • Each line/entry must end in a comma , except the final one

Here are some short examples of file selection

  1. A simple list
selected_files <- c("HD2021", "EFFY2022", "SFA2122")

## OR

selected_files <- c(
  "HD2021",
  "EFFY2022",
  "SFA2122"
)
  2. You can also comment out files you don’t want from a longer list
selected_files <- c(
  "HD2021",
  # "IC2022",
  # "IC2022_AY",
  "EFFY2022"
  # "EFFY2022_DIST",
)

That’s about it. When you run the script, the files you put in selected_files will be downloaded
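
If you edit a long list by hand, a quick sanity check can catch typos before the script starts downloading. This is a rough sketch only: check_selected_files() is a hypothetical helper, not part of IPEDtaS, and the pattern it uses (capital letters, digits, and underscores) is an assumption based on the file names shown in this tutorial.

```r
# Hypothetical helper, not part of IPEDtaS: basic checks on the file list
check_selected_files <- function(files) {
  stopifnot(is.character(files), length(files) > 0, !anyNA(files))
  # Assumption: names use only capital letters, digits, and underscores
  malformed <- files[!grepl("^[A-Z0-9_]+$", files)]
  # Entries that appear more than once
  duplicated_names <- unique(files[duplicated(files)])
  list(malformed = malformed, duplicated = duplicated_names)
}

selected_files <- c("HD2021", "EFFY2022", "SFA2122")
check_result <- check_selected_files(selected_files)
check_result
```

Two empty entries in the result mean nothing looked off; it does not confirm that each name is a real IPEDS file.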

  • For the Stata version local selected_files needs to be a valid list of IPEDS file names
    • Each line in the list must end in /// except the final one

Here are some short examples of file selection

  1. A simple list
local selected_files ///
  "HD2021" ///
  "EFFY2022" ///
  "SFA2122"
  2. Use multi-line comments to comment out files you don’t want
    • Stata has both single-line comments (// or *) and multi-line comments, which start with /* and end with */
    • Because of the way the list is structured, we have to use multi-line comments here (even to comment out a single line), which have 3 rules
      1. To work in the list, the line before a multi-line comment must be /// and nothing else
      2. Below this, start the first line of the multi-line comment with /*
      3. Close the multi-line comment with */ wherever you want it to end
local selected_files ///
  "HD2021" ///
  ///
  /*
  "IC2022" ///
  "IC2022_AY" ///
  */ ///
  "EFFY2022" ///
  ///
  /*
  "EFFY2022_DIST" ///
  */

That’s about it. When you run the script, the files you put in local selected_files will be downloaded

Running the Script

Once you have the file selection set, simply save the script and hit run/do!

If you’re using this tool as part of a reproducible research project, you might want to include running it as part of your analysis code

  • However, you don’t want to run it every time you run your code, only if the data isn’t already downloaded
  • The below code blocks will do exactly that if you include them at the start of your analysis code
    • Just change hd2021.dta to a file you download
if(!file.exists("data/hd2021.dta")) { source("IPEDtaS.R") }
if (!fileexists("data/hd2021.dta")) do "IPEDtaS.do"
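
If you download several files, you may prefer to check for all of them instead of just one. The sketch below is one way to do that in R. The helper missing_ipeds_files() and the object files_needed are hypothetical names; the sketch relies on the pattern seen in this tutorial, where a file such as HD2021 is saved as data/hd2021.dta.

```r
# Hypothetical helper: return the .dta paths that are not on disk yet
missing_ipeds_files <- function(files, data_dir = "data") {
  paths <- file.path(data_dir, paste0(tolower(files), ".dta"))
  paths[!file.exists(paths)]
}

files_needed <- c("HD2021", "EFFY2022", "SFA2122")

# Only run the download script if something is missing (and the script is here)
if (length(missing_ipeds_files(files_needed)) > 0 && file.exists("IPEDtaS.R")) {
  source("IPEDtaS.R")
}
```

The extra file.exists("IPEDtaS.R") check just keeps the block from erroring when it is run outside the project folder.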

Applied Example Using Labelled IPEDS Data

Okay, now we have our labelled IPEDS data, let’s walk through a simple descriptive analysis using

  • HD2021 (institutional characteristics as of Fall 2021)
  • EFFY2022 (enrollment for 2021-2022 school year)
  • SFA2122 (financial aid for 2021-2022 school year)

Two things to note:

  1. These examples are just meant to illustrate how the labels can help in your work; they are not meant to be ground-breaking, informative analyses. If you’re feeling adventurous, play around and swap out different variables as you follow along
  2. To understand some of the code, you’ll need a decent understanding of R and the tidyverse, but again, the point is just to see how the labels can help

1. Running IPEDtaS, Reading Data, & Joining Data

1. Create a new folder on your computer, download a copy of IPEDtaS.R, and place it in the folder

2. Adjust selected_files <- c() in IPEDtaS.R to download the 3 files we want like below

selected_files <- c(
  "HD2021",
  "EFFY2022",
  "SFA2122"
)

3. Select the whole IPEDtaS.R script and hit “Run”

4. Start a new R script in that same folder

5. Load tidyverse, haven (part of tidyverse, but requires loading separately), labelled (what haven uses behind the scenes), and gtsummary (to easily create output tables)

library(tidyverse)
library(haven)
library(labelled)
library(gtsummary)

6. Read our data in

data_info <- read_dta("data/hd2021.dta")
data_enroll <- read_dta("data/effy2022.dta")
data_aid <- read_dta("data/sfa2122.dta")

Okay, now, take a look at the enrollment data we just read in (click on data_enroll in the environment in the top right)

Screenshot of Enrollment Data Showing Variable Labels

Notice the descriptions under each variable name

  • If you’re familiar with IPEDS data, you won’t be used to seeing those
  • They’re the variable labels we added, super useful for quick questions without having to open the code book!

7. Now we want to join our data together

data <- left_join(data_info, data_enroll, by = "unitid") |>
  left_join(data_aid, by = "unitid")

1. Create a new folder on your computer, download a copy of IPEDtaS.do, and place it in the folder

2. Adjust local selected_files in IPEDtaS.do to download the 3 files we want like below

local selected_files ///
  "HD2021" ///
  "EFFY2022" ///
  "SFA2122"

3. Select the whole IPEDtaS.do script and hit “Run”

4. Start a new Stata do file in that same folder

5. Load our first data set, hd2021

use "data/hd2021.dta", clear

Okay, now, take a look at the variables panel (by default in right hand panel)

  • Each of the variables has a label that describes what the variable means
  • If you’re familiar with standard IPEDS data, you won’t be used to seeing those
  • They’re the variable labels we added, super useful for quick questions without having to open the code book!

Screenshot Showing Variable Labels

6. Join in our other data sets in a “left join” style (i.e., all observations in the first data set are kept even if they don’t have a match in the second)

joinby unitid using "data/sfa2122.dta", unmatched(master) _merge(sfa)
joinby unitid using "data/effy2022.dta", unmatched(master) _merge(effy)

2. Data Cleaning with Labels

Now that we have everything read in, the advantage of the labels will truly begin to show!

8. Some of you may have noticed that our data has become extremely “long”

  • As in, our data now has many more observations than we originally had
  • This means we probably have a little light data-wrangling to do
  • Let’s check how many observations our data set now contains
nrow(data)
[1] 116480
  • I have a hunch that the data might be “long” by the variable effylev, so, let’s look at how many observations we have for each value of effylev
data |> count(effylev)
# A tibble: 5 × 2
  effylev                                        n
  <dbl+lbl>                                  <int>
1 -2 [Not applicable, undergraduate detail] 102471
2  1 [All students total]                     5953
3  2 [Undergraduate]                          5680
4  4 [Graduate]                               2040
5 NA                                           336

Once again, if you’re used to IPEDS data, you wouldn’t usually see the information in the [square brackets]

  • These are our value labels, again, super useful for quick questions without having to open the code book!
  • One thing I really like about using labels is you get the best of both worlds
    • We still have the original values to check with the code book (which you don’t get with some tools we will discuss later)

The labels help us quickly identify what the different values of effylev mean. If we are interested in undergraduate figures (which, for now, we are), we want to keep rows where effylev == 2

data <- data |> filter(effylev == 2)
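
To see why the join made the data “long”, here is a toy sketch with made-up numbers. The directory data has one row per college, but the enrollment data has one row per college per level, so left_join() repeats each college once for every matching enrollment row. The objects info, enroll, joined, and undergrad are illustrative names only.

```r
library(dplyr)

# Toy data (made up): one row per college
info <- tibble(unitid = c(1, 2), name = c("College A", "College B"))

# Several enrollment rows per college, one for each level
enroll <- tibble(
  unitid  = c(1, 1, 1, 2, 2),
  effylev = c(1, 2, 4, 1, 2),
  n       = c(150, 100, 50, 80, 80)
)

# Each college is repeated once per matching enrollment row
joined <- left_join(info, enroll, by = "unitid")
nrow(joined)

# Keeping a single level gets us back to one row per college
undergrad <- joined |> filter(effylev == 2)
nrow(undergrad)
```

The same logic is what filtering on effylev == 2 does to the real data above.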

7. Some of you may have noticed that our data has become extremely “long”

  • As in, our data now has many more observations than we originally had
  • This means we probably have a little light data-wrangling to do
  • Let’s check how many observations our data set now contains
count
  116,480
  • I have a hunch that the data might be “long” by the variable effylev, so, let’s look at how many observations we have for each value of effylev
tabulate effylev
  Undergraduate or graduate level of |
                             student |      Freq.     Percent        Cum.
Not applicable, undergraduate detail |    102,471       88.23       88.23
                  All students total |      5,953        5.13       93.35
                       Undergraduate |      5,680        4.89       98.24
                            Graduate |      2,040        1.76      100.00
                               Total |    116,144      100.00

Once again, if you’re used to IPEDS data, you would usually see a bunch of numbers in the left-hand column, but now we see informative labels

  • These are our value labels, again, super useful for quick questions without having to open the code book!
    • If these are ever unclear, the data still contains the original values to check with the code book (which you don’t get with some tools we will discuss later)
    • You can use the command labelbook to check these
labelbook label_effylev
Value label label_effylev 

      Values                                    Labels
       Range:  [-2,4]                    String length:  [8,36]
           N:  4                 Unique at full length:  yes
        Gaps:  yes                 Unique at length 12:  yes
  Missing .*:  0                           Null string:  no
                               Leading/trailing blanks:  no
                                    Numeric -> numeric:  no
          -2   Not applicable, undergraduate detail
           1   All students total
           2   Undergraduate
           4   Graduate

   Variables:  effylev

The labels help us quickly identify what the different values of effylev mean. If we are interested in undergraduate figures (which, for now, we are), we want to keep rows where effylev == 2

keep if effylev == 2
(110,800 observations deleted)

3. Tables with Labels

9. Now let’s explore some trends in our data to show how labels can help. How does the percent of students paying out-of-state tuition vary by region?

data |>
  group_by(obereg) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))
# A tibble: 10 × 2
   obereg                                                 median_perc_out_of_s…¹
   <dbl+lbl>                                                               <dbl>
 1 0 [U.S. Service schools]                                                    0
 2 1 [New England (CT, ME, MA, NH, RI, VT)]                                    5
 3 2 [Mid East (DE, DC, MD, NJ, NY, PA)]                                       5
 4 3 [Great Lakes (IL, IN, MI, OH, WI)]                                        2
 5 4 [Plains (IA, KS, MN, MO, NE, ND, SD)]                                     6
 6 5 [Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN,…                      5
 7 6 [Southwest (AZ, NM, OK, TX)]                                              4
 8 7 [Rocky Mountains (CO, ID, MT, UT, WY)]                                   19
 9 8 [Far West (AK, CA, HI, NV, OR, WA)]                                       4
10 9 [Other U.S. jurisdictions (AS, FM, GU, MH, MP, PR, …                      0
# ℹ abbreviated name: ¹​median_perc_out_of_state

Notice how again the labels make our analysis instantly more informative

  • We know what obereg 7 means without going to the code book
  • Now, if we want to use just the labels in the column, haven has a handy tool for that as well: as_factor()
    • This converts a column with value labels to a factor using the label as the value
data |>
  group_by(as_factor(obereg)) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))
# A tibble: 10 × 2
   `as_factor(obereg)`                                    median_perc_out_of_s…¹
   <fct>                                                                   <dbl>
 1 U.S. Service schools                                                        0
 2 New England (CT, ME, MA, NH, RI, VT)                                        5
 3 Mid East (DE, DC, MD, NJ, NY, PA)                                           5
 4 Great Lakes (IL, IN, MI, OH, WI)                                            2
 5 Plains (IA, KS, MN, MO, NE, ND, SD)                                         6
 6 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA…                      5
 7 Southwest (AZ, NM, OK, TX)                                                  4
 8 Rocky Mountains (CO, ID, MT, UT, WY)                                       19
 9 Far West (AK, CA, HI, NV, OR, WA)                                           4
10 Other U.S. jurisdictions (AS, FM, GU, MH, MP, PR, PW,…                      0
# ℹ abbreviated name: ¹​median_perc_out_of_state
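
As a small aside, as_factor() also has a levels argument that controls what ends up as the factor levels. The sketch below uses a toy labelled vector (made up, with two region labels copied from the output above) to show the options side by side.

```r
library(haven)

# Toy labelled vector (values made up; labels copied from the output above)
region <- labelled(
  c(7, 1, 7),
  labels = c(
    "New England (CT, ME, MA, NH, RI, VT)" = 1,
    "Rocky Mountains (CO, ID, MT, UT, WY)" = 7
  )
)

# Default: use the value labels as the factor levels
region_labels_only <- as_factor(region)

# Keep the original codes as the levels instead
region_values_only <- as_factor(region, levels = "values")

# Or combine the code and the label in each level
region_both <- as_factor(region, levels = "both")

as.character(region_labels_only)
```

The "both" option can be handy when you want readable output but still need the original code visible for checking against the code book.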

8. Now let’s explore some trends in our data to show how labels can help. How does the percent of students paying out-of-state tuition vary by region?

tabstat scfa13p, s(median) by(obereg)
Summary for variables: scfa13p
Group variable: obereg (Bureau of Economic Analysis (BEA) regions)

          obereg |       p50
U.S. Service sch |         0
New England (CT, |         5
Mid East (DE, DC |         5
Great Lakes (IL, |         2
Plains (IA, KS,  |         6
Southeast (AL, A |         5
Southwest (AZ, N |         4
Rocky Mountains  |        19
Far West (AK, CA |         4
Other U.S. juris |         0
           Total |         4

Notice how again the labels make our analysis instantly more informative

  • We know which region has 19% of students paying out-of-state tuition without going to the code book (it would previously just have said 7)

4. Plots with Labels

10. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?

  • This wouldn’t work in a table, so, let’s look at a simple scatter plot
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  facet_wrap(~obereg)
Figure 1

Okay… But what do those variables mean?

  • Without labels, the plot is hard to understand

So, let’s add labels

  • The first step is to change facet_wrap(~obereg) to facet_wrap(~as_factor(obereg))
    • This is the same as we did in the table above, using a new version of the column that uses the value labels as the value
  • The second step involves pulling out the variable labels to go on the x and y axis
    • This is a little more manual, but we can set our x and y labels using the labs() function as normal
      • But instead of putting something like x = "my x axis label", we use var_label() from the labelled package
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  labs(x = var_label(data$efytotlt),
       y = var_label(data$scfa13p)) +
  facet_wrap(~as_factor(obereg))
Figure 2

Well, that’s more informative, but a little messy

  • With a couple of tweaks to allow longer labels to wrap around, we now have a much better looking plot
    • y = str_wrap(var_label(data$scfa13p), 40) says to make a new line every 40 characters on the y axis
    • labeller = label_wrap_gen(multi_line = TRUE) inside our facet_wrap() allows the facet labels to wrap onto multiple lines
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  labs(x = var_label(data$efytotlt),
       y = str_wrap(var_label(data$scfa13p), 40)) +
  facet_wrap(~as_factor(obereg),
             labeller = label_wrap_gen(multi_line = TRUE))
Figure 3

9. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?

  • This wouldn’t work in a table, so, let’s look at a simple scatter plot
scatter scfa13p efytotlt if efytotlt < 50000, by(obereg, col(2))
quietly graph export scatter.svg, replace

  • See how by default the x, y, and by/facet labels use the labels and not the variable names/values?
    • This instantly makes your plots more intuitive
  • I don’t typically use Stata for plotting, so I’m not sure how to get the longer labels to wrap, but I’m sure there’s a way

5. Models with Labels

11. Lastly, let’s look at how labels can show up in modeling. Let’s see if the percentage of students paying out of state changes by the level of the institution (4 year, 2 year, Less than 2 year)

model <- lm(scfa13p ~ factor(iclevel),
            data = data)
tbl_regression(model)

Characteristic       Beta    95% CI¹       p-value
factor(iclevel)
    1
    2                -8.7    -9.9, -7.4    <0.001
    3                -14     -21, -7.6     <0.001
¹ CI = Confidence Interval

Without using labels, the regression output needs the code book to interpret

  • What is iclevel 2?

Remember from above, using as_factor() rather than factor() tells R to use the labels as the levels

model <- lm(scfa13p ~ as_factor(iclevel),
            data = data)
tbl_regression(model)

Characteristic                             Beta    95% CI¹       p-value
as_factor(iclevel)
    Four or more years
    At least 2 but less than 4 years       -8.7    -9.9, -7.4    <0.001
    Less than 2 years (below associate)    -14     -21, -7.6     <0.001
¹ CI = Confidence Interval

Okay, now it is much clearer what is going on!

  • as_factor(iclevel) is still a bit messy though
  • Similarly to the plot above, using variable labels is a little more tricky, but we can do it using the var_label() function again alongside the label = argument in tbl_regression()
tbl_regression(model,
               label = list(`as_factor(iclevel)` = var_label(data$iclevel)))

Characteristic                             Beta    95% CI¹       p-value
Level of institution
    Four or more years
    At least 2 but less than 4 years       -8.7    -9.9, -7.4    <0.001
    Less than 2 years (below associate)    -14     -21, -7.6     <0.001
¹ CI = Confidence Interval

10. Lastly, let’s look at how labels can show up in modeling. Let’s see if the percentage of students paying out of state changes by the level of the institution (4 year, 2 year, Less than 2 year)

regress scfa13p i.iclevel
      Source |       SS           df       MS      Number of obs   =     1,572
-------------+----------------------------------   F(2, 1569)      =     98.14
       Model |  30433.0627         2  15216.5313   Prob > F        =    0.0000
    Residual |  243262.532     1,569  155.043041   R-squared       =    0.1112
-------------+----------------------------------   Adj R-squared   =    0.1101
       Total |  273695.595     1,571  174.217438   Root MSE        =    12.452

            scfa13p | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
            iclevel |
At least 2 but l..  |  -8.666799   .6320687   -13.71   0.000    -9.906587   -7.427011
Less than 2 yea..)  |  -14.21038    3.35951    -4.23   0.000    -20.79999    -7.62078
              _cons |   14.21038   .4602254    30.88   0.000     13.30766     15.1131

As you can see, the value labels automatically show up in our regression output

  • Before, you would have seen 2 and 3 in the iclevel column, but now you get informative labels
  • With the labels, the output is easier to interpret, which makes your work easier to read and makes it less likely that you will get mixed up and report the wrong value!

Removing Value Labels

It’s rare, but, there are occasions where you might need to remove value labels from your data

  • For instance, in R certain advanced analysis packages get confused when you have value labels
  • You may also spot an error in labelling and need to get rid of it
  • Luckily, it’s pretty easy!

In R, just use the zap_labels() function from haven to create an unlabelled version of your data

data |> count(iclevel)
# A tibble: 3 × 2
  iclevel                                     n
  <dbl+lbl>                               <int>
1 1 [Four or more years]                   2436
2 2 [At least 2 but less than 4 years]     1575
3 3 [Less than 2 years (below associate)]  1669
data_unlabelled <- zap_labels(data)

data_unlabelled |> count(iclevel)
# A tibble: 3 × 2
  iclevel     n
    <dbl> <int>
1       1  2436
2       2  1575
3       3  1669
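
One detail worth knowing is that haven treats value labels and variable labels separately: zap_labels() removes the value labels, while zap_label() (singular) removes the variable label. The sketch below shows the difference on a toy column. The data is made up, with the labels copied from the iclevel output above.

```r
library(haven)

# Toy data (made up); labels copied from the iclevel output above
toy <- tibble::tibble(
  iclevel = labelled(
    c(1, 2, 3),
    labels = c(
      "Four or more years" = 1,
      "At least 2 but less than 4 years" = 2,
      "Less than 2 years (below associate)" = 3
    ),
    label = "Level of institution"
  )
)

# zap_labels() drops the value labels but keeps the variable label
no_value_labels <- zap_labels(toy)
attr(no_value_labels$iclevel, "labels")  # NULL
attr(no_value_labels$iclevel, "label")   # still "Level of institution"

# zap_label() drops the variable label but keeps the value labels
no_variable_label <- zap_label(toy)
attr(no_variable_label$iclevel, "label") # NULL
```

Which one you want depends on what the downstream package is tripping over; in many cases removing the value labels alone is enough.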

In Stata, simply type label drop _all

tabulate iclevel
               Level of institution |      Freq.     Percent        Cum.
                 Four or more years |      2,436       42.89       42.89
   At least 2 but less than 4 years |      1,575       27.73       70.62
Less than 2 years (below associate) |      1,669       29.38      100.00
                              Total |      5,680      100.00

label drop _all

tabulate iclevel
   Level of |
institution |      Freq.     Percent        Cum.
          1 |      2,436       42.89       42.89
          2 |      1,575       27.73       70.62
          3 |      1,669       29.38      100.00
      Total |      5,680      100.00

Getting Capitalized Variable Names

  • By default, IPEDtaS gives you lower-case variable names (which is the default for Stata-style data)
  • Usually, this is going to be easier to work with
    • However, sometimes you might need to keep the original upper-case variable names, such as if you’re adding this to an existing project that already uses upper-case variable names
  • To do this, you just need to add a single line near the end of the IPEDtaS script

In the R script, add this line

data_file <- data_file |> dplyr::rename_all(stringr::str_to_upper)

directly above (near end of script)

haven::write_dta(data_file, dta_name)
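
As a side note, dplyr marks rename_all() as superseded in favour of rename_with(). The line above still works, but if you prefer the newer function, an equivalent could look like the sketch below. The toy_file data frame is made up and simply stands in for data_file.

```r
library(dplyr)

# Toy data frame standing in for data_file (values made up for illustration)
toy_file <- tibble(
  unitid = c(1, 2),
  instnm = c("College A", "College B")
)

# rename_with() applies a function to every column name
upper_file <- toy_file |> rename_with(toupper)
names(upper_file)
```

Either version only changes the column names; the values and labels in each column are left as they are.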

In the Stata script, add this line

rename *, upper

directly above (near end of script)

save ../data/`dta_name'