This page will walk you through the basics of how to use IPEDtaS to automagically retrieve labelled IPEDS .dta files
What IPEDtaS Does
NCES provides all the information needed to create labelled IPEDS data
However, it can be cumbersome to put all the pieces together manually
If you’ve worked with IPEDS before, you’ve likely seen this screen
When I first started working with IPEDS, I assumed the “Stata Data File” would be a nicely labelled data set, like you can get with other NCES data sets
However, it’s not
Instead, it’s just a plain .csv data set that is designed to be combined with the “STATA Program File” which applies the labels
These program files are also full of issues
Such as hard-coded file paths from the computer of the NCES worker who wrote the script, which you would have to update for each data file
And error-causing line breaks which you’d have to fix before the script would run
Between downloading two files, fixing the issues, then running them together, it’s a lot of work so most people don’t bother
Plus, they only work in Stata, so R users without a Stata license are left out…
The R and Stata IPEDtaS scripts do all this heavy lifting for you so you can use nicely labelled IPEDS data without the extra effort!
Why Would I Want Labels Anyway?
You will see more of this in the applied example below, but, in short, they make data analysis much easier and reduce how often you will have to look at the dictionary/codebook!
Look at the difference from these simple data checks on the number of colleges in each region
Without labels, we just get the numeric code for each region which we’d have to look up
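To make the contrast concrete, here is a minimal sketch in R on a tiny made-up data set rather than real IPEDS data. The column name obereg_plain and the three region codes shown are illustrative assumptions; labelled() from haven is what attaches value labels to numeric codes.

```r
library(haven)
library(dplyr)

# Toy stand-in for IPEDS directory data (values are made up for illustration)
toy <- tibble(
  unitid = 1:5,
  # Plain numeric region codes, as you would get from an unlabelled .csv
  obereg_plain = c(1, 1, 7, 8, 8),
  # The same codes with value labels attached
  obereg = labelled(
    c(1, 1, 7, 8, 8),
    labels = c("New England" = 1, "Rocky Mountains" = 7, "Far West" = 8)
  )
)

# Without labels: just codes we would have to look up
counts_plain <- toy |> count(obereg_plain)

# With labels: each code prints alongside its meaning
counts_labelled <- toy |> count(obereg)

print(counts_plain)
print(counts_labelled)
```

The counts are identical in both versions; only the labelled one tells you which region each code refers to.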
If you want to download all of IPEDS, you will need up to 4GB of permanent space and 12GB of temporary space
Setting up Your Project Folder
This part is identical for both Stata and R users. The main points to note are:
Download either the Stata or R version of the script from the links at the top of this page
Hint: If the file is opening in your browser, use “download linked file” on macOS, or “save link as” on Windows, to save the file (you can also just copy and paste)
The script is designed to treat wherever you place the IPEDtaS.do or IPEDtaS.R file as the “working directory”
Check the working directory is set to your current project before doing anything else
If you download the script, save it in your project folder, then open it, you will often get the correct working directory by default, but it’s always best to check
When the script runs it will store output in ./data and ./dictionaries folders
Caution: Anything you have in folders with those names will be overwritten
This also applies to ./zip-data, ./zip-do-files, ./zip-dictionaries, ./unzip-data, ./unzip-do-files, ./unzip-dictionaries which are folders used temporarily behind-the-scenes
Personally, I set up my projects with scripts in the top-level of the project folder (as in, not in a sub-folder), so that is how IPEDtaS was designed
If you need everything in a sub-folder for sanity reasons, either:
Place IPEDtaS in your data folder (e.g., ./data/IPEDtaS.do), which will place the data in ./data/data/hd2022.dta
Place IPEDtaS in your ./scripts folder and go through adjusting all the relative paths by adding ../ to back out one level
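Before running the script, the working-directory and overwrite cautions above can be checked with a few lines of base R. This is a small sketch, not part of IPEDtaS itself; the vector name ipedtas_dirs is made up here, and the folder names are the ones listed above.

```r
# Output and temporary folders named in the notes above
ipedtas_dirs <- c(
  "data", "dictionaries",
  "zip-data", "zip-do-files", "zip-dictionaries",
  "unzip-data", "unzip-do-files", "unzip-dictionaries"
)

# Confirm this is your project folder before running IPEDtaS
message("Working directory: ", getwd())

# Find any of those folders that already exist in the working directory
existing <- ipedtas_dirs[dir.exists(ipedtas_dirs)]

if (length(existing) > 0) {
  message(
    "These folders already exist and may be overwritten: ",
    paste(existing, collapse = ", ")
  )
} else {
  message("None of the IPEDtaS folders exist here yet")
}
```

If any folder is reported, move or rename its contents first, or run IPEDtaS from a fresh project folder.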
File Selection
The only real change you have to make to the scripts in the whole process is selecting which files you want to download
By default the scripts are set to download directory information (HD) and enrollment data (EFFY) for the 2023 reporting cycle
To change these you need to update the list to the files you want
Note: At the bottom of the script there is a list with every single IPEDS file in it. If you want the entire dataset, you can just copy and paste that longer list to the top of the script and edit as needed
To edit the list you basically just need to follow the list rules for each language
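In R, the "list rules" amount to a character vector: each file name in quotes, separated by commas, with no comma after the last entry. The sketch below assumes the R script stores its selection in an object called selected_files, mirroring the Stata local of the same name; check the top of your copy of IPEDtaS.R for the actual name.

```r
# Assumption: the object name selected_files mirrors the Stata local;
# the real name in IPEDtaS.R may differ
selected_files <- c(
  "HD2021",
  "EFFY2022",
  "SFA2122"   # no trailing comma after the final entry
)

# Quick sanity checks before running the rest of the script
stopifnot(is.character(selected_files))
stopifnot(!anyDuplicated(selected_files))

print(selected_files)
```

A missing quote or comma is the most common reason an edited list stops the script from running, so it is worth running these few lines on their own first.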
Okay, now we have our labelled IPEDS data, let’s walk through a simple descriptive analysis using:
HD2021 (institutional characteristics as of Fall 2021)
EFFY2022 (enrollment for 2021-2022 school year)
SFA2122 (financial aid for 2021-2022 school year)
Two things to note:
These examples are just meant to illustrate how the labels can help in your work; they are not meant to be ground-breaking, informative analyses. If you’re feeling adventurous, play around and swap out different variables as you follow along
To be able to understand some of the code, you’d need a decent understanding of R and the tidyverse, but again, the point is just to see how the labels can help
Okay, now, take a look at the enrollment data we just read in (click on data_enroll in the environment in the top right)
Notice the descriptions under each variable name
If you’re familiar with IPEDS data, you won’t be used to seeing those
They’re the variable labels we added, super useful for quick questions without having to open the code book!
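If you prefer the console to the data viewer, variable labels can also be read directly. The sketch below uses a made-up two-row data set; the point is that haven stores each variable label in the column's "label" attribute.

```r
library(haven)
library(dplyr)

# Toy data: one labelled column and one column with no variable label
toy <- tibble(
  unitid = c(100, 200),
  effylev = labelled(
    c(1, 2),
    labels = c("All students total" = 1, "Undergraduate" = 2),
    label = "Undergraduate or graduate level of student"
  )
)

# The variable label for a single column
attr(toy$effylev, "label", exact = TRUE)

# Collect the label for every column (NA where a column has none)
var_labels <- vapply(
  toy,
  function(col) {
    lbl <- attr(col, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else lbl
  },
  character(1)
)

print(var_labels)
```

Using exact = TRUE matters here because the value labels live in a similarly named "labels" attribute.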
7. Now we want to join our data together
data <- left_join(data_info, data_enroll, by = "unitid") |>
  left_join(data_aid, by = "unitid")
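To see what a left join does to the number of rows, here is a small sketch on made-up data. The tables info and enroll are hypothetical stand-ins for the directory and enrollment files, with the enrollment table holding more than one row per institution.

```r
library(dplyr)

# One row per institution
info <- tibble(
  unitid = c(100, 200, 300),
  name = c("College A", "College B", "College C")
)

# Several rows per institution (one per enrollment level), none for unitid 300
enroll <- tibble(
  unitid = c(100, 100, 200, 200),
  effylev = c(1, 2, 1, 2),
  efytotlt = c(500, 400, 900, 700)
)

# Every row of info is kept; matching rows of enroll are attached
joined <- left_join(info, enroll, by = "unitid")

print(joined)
nrow(joined)
```

The result has more rows than info because each institution is repeated once per matching enrollment row, while the unmatched institution is kept with missing enrollment values.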
1. Create a new folder on your computer, download a copy of IPEDtaS.do, and place it in the folder
2. Adjust local selected_files in IPEDtaS.do to download the 3 files we want like below
local selected_files ///
    "HD2021" ///
    "EFFY2022" ///
    "SFA2122"
3. Select the whole IPEDtaS.do script and hit “Run”
4. Start a new Stata do file in that same folder
5. Load our first data set, hd2021
use "data/hd2021.dta", clear
Okay, now, take a look at the variables panel (by default in the right-hand panel)
Each of the variables has a label that describes what the variable means
If you’re familiar with standard IPEDS data, you won’t be used to seeing those
They’re the variable labels we added, super useful for quick questions without having to open the code book!
6. Join in our other data sets in a “left join” style (i.e., all observations in the first data set are kept even if they don’t have a match in the second)
8. Some of you may have noticed that our data has become extremely “long”
As in, our data now has many more observations than we originally had
This means we probably have a little light data-wrangling to do
Let’s check how many observations our data set now contains
nrow(data)
[1] 116480
I have a hunch that the data might be “long” by the variable effylev, so, let’s look at how many observations we have for each value of effylev
data |> count(effylev)
# A tibble: 5 × 2
  effylev                                          n
  <dbl+lbl>                                    <int>
1 -2 [Not applicable, undergraduate detail]   102471
2  1 [All students total]                       5953
3  2 [Undergraduate]                            5680
4  4 [Graduate]                                 2040
5 NA                                             336
Once again, if you’re used to IPEDS data, you wouldn’t usually see the information in the [square brackets]
These are our value labels, again, super useful for quick questions without having to open the code book!
One thing I really like about using labels is you get the best of both worlds
We still have the original values to check with the code book (which you don’t get with some tools we will discuss later)
The labels help us quickly identify what the different values of effylev mean, and that if we are interested in undergraduate figures (which, for now, we are), we want to keep rows where effylev == 2
data <- data |> filter(effylev == 2)
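The same filter can be tried on a small made-up data set to confirm two things: the comparison uses the underlying numeric code, and the value labels survive the filter. The data values below are invented for illustration.

```r
library(haven)
library(dplyr)

# Toy long-format data: several enrollment levels per institution
toy <- tibble(
  unitid = c(100, 100, 100, 200, 200),
  effylev = labelled(
    c(1, 2, 4, 1, 2),
    labels = c("All students total" = 1, "Undergraduate" = 2, "Graduate" = 4)
  ),
  efytotlt = c(900, 700, 200, 400, 400)
)

# Filter on the numeric code, exactly as with unlabelled data
undergrad <- toy |> filter(effylev == 2)

# Back to one row per institution
nrow(undergrad) == n_distinct(undergrad$unitid)

# The value labels are still attached after filtering
attr(undergrad$effylev, "labels")
```

Because the labels sit on top of the original codes, nothing about your filtering code needs to change.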
7. Some of you may have noticed that our data has become extremely “long”
As in, our data now has many more observations than we originally had
This means we probably have a little light data-wrangling to do
Let’s check how many observations our data set now contains
count
116,480
I have a hunch that the data might be “long” by the variable effylev, so, let’s look at how many observations we have for each value of effylev
tabulate effylev
Undergraduate or graduate level of |
student | Freq. Percent Cum.
-------------------------------------+-----------------------------------
Not applicable, undergraduate detail | 102,471 88.23 88.23
All students total | 5,953 5.13 93.35
Undergraduate | 5,680 4.89 98.24
Graduate | 2,040 1.76 100.00
-------------------------------------+-----------------------------------
Total | 116,144 100.00
Once again, if you’re used to IPEDS data, you would usually see a bunch of numbers in the left-hand column, but now we see informative labels
These are our value labels, again, super useful for quick questions without having to open the code book!
If these are ever unclear, the data still contains the original values to check with the code book (which you don’t get with some tools we will discuss later)
You can use the command labelbook to check these
labelbook label_effylev
Value label label_effylev
--------------------------------------------------------------------------------------
Values Labels
Range: [-2,4] String length: [8,36]
N: 4 Unique at full length: yes
Gaps: yes Unique at length 12: yes
Missing .*: 0 Null string: no
Leading/trailing blanks: no
Numeric -> numeric: no
Definition
-2 Not applicable, undergraduate detail
1 All students total
2 Undergraduate
4 Graduate
Variables: effylev
The labels help us quickly identify what the different values of effylev mean and that if we are interested in undergraduate figures (which for now, we are) we want to keep rows that are effylev == 2
Notice how again the labels make our analysis instantly more informative
We know what obereg 7 means without going to the code book
Now, if we want to just use the labels in the column, haven has a handy tool for that as well: as_factor()
This converts a column with value labels to a factor using the label as the value
data |>
  group_by(as_factor(obereg)) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))
# A tibble: 10 × 2
`as_factor(obereg)` median_perc_out_of_s…¹
<fct> <dbl>
1 U.S. Service schools 0
2 New England (CT, ME, MA, NH, RI, VT) 5
3 Mid East (DE, DC, MD, NJ, NY, PA) 5
4 Great Lakes (IL, IN, MI, OH, WI) 2
5 Plains (IA, KS, MN, MO, NE, ND, SD) 6
6 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA… 5
7 Southwest (AZ, NM, OK, TX) 4
8 Rocky Mountains (CO, ID, MT, UT, WY) 19
9 Far West (AK, CA, HI, NV, OR, WA) 4
10 Other U.S. jurisdictions (AS, FM, GU, MH, MP, PR, PW,… 0
# ℹ abbreviated name: ¹median_perc_out_of_state
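One optional tidy-up is to name the grouping column inside group_by(), so the output column is not called `as_factor(obereg)`. The sketch below shows this on made-up numbers; the name region is an arbitrary choice.

```r
library(haven)
library(dplyr)

# Toy data with two labelled regions (values are illustrative only)
toy <- tibble(
  obereg = labelled(
    c(1, 1, 7, 7),
    labels = c("New England" = 1, "Rocky Mountains" = 7)
  ),
  scfa13p = c(4, 6, 18, 20)
)

# Naming the grouping expression gives a clean column name in the result
by_region <- toy |>
  group_by(region = as_factor(obereg)) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))

print(by_region)
```

The region column is now an ordinary factor, which most plotting and modeling functions handle without any knowledge of haven.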
8. Now let’s explore some trends in our data to show how labels can help. How does the percent of students paying out-of-state tuition vary by region?
tabstat scfa13p, s(median) by(obereg)
Summary for variables: scfa13p
Group variable: obereg (Bureau of Economic Analysis (BEA) regions)
obereg | p50
-----------------+----------
U.S. Service sch | 0
New England (CT, | 5
Mid East (DE, DC | 5
Great Lakes (IL, | 2
Plains (IA, KS, | 6
Southeast (AL, A | 5
Southwest (AZ, N | 4
Rocky Mountains | 19
Far West (AK, CA | 4
Other U.S. juris | 0
-----------------+----------
Total | 4
----------------------------
Notice how again the labels make our analysis instantly more informative
We know which region has 19% of students paying out-of-state tuition without going to the code book (it would previously just have said 7)
10. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?
This wouldn’t work in a table, so, let’s look at a simple scatter plot
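A rough sketch of how such a plot could be built with ggplot2 is shown here on made-up data. It is not the code behind the original figure: the variable-label wording is assumed, and the labels are read with attr() rather than a helper function. as_factor() supplies the facet titles and the variable labels supply the axis titles.

```r
library(haven)
library(dplyr)
library(ggplot2)

# Toy data (values invented for illustration)
toy <- tibble(
  obereg = labelled(
    c(1, 1, 7, 7),
    labels = c("New England" = 1, "Rocky Mountains" = 7)
  ),
  efytotlt = c(1200, 8000, 3000, 15000),
  scfa13p = c(4, 6, 18, 20)
)

# Attach illustrative variable labels (wording assumed, not copied from IPEDS)
attr(toy$efytotlt, "label") <- "Total enrollment"
attr(toy$scfa13p, "label") <- "Percent paying out-of-state tuition"

# Pull the variable labels out for use as axis titles
x_lab <- attr(toy$efytotlt, "label", exact = TRUE)
y_lab <- attr(toy$scfa13p, "label", exact = TRUE)

p <- ggplot(toy, aes(x = efytotlt, y = scfa13p)) +
  geom_point() +
  facet_wrap(~ as_factor(obereg)) +  # value labels become the panel titles
  labs(x = x_lab, y = y_lab)         # variable labels become the axis titles

# print(p) draws the plot in an interactive session
```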
9. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?
This wouldn’t work in a table, so, let’s look at a simple scatter plot
scatter scfa13p efytotlt if efytotlt < 50000, by(obereg, col(2))
quietly graph export scatter.svg, replace
See how by default the x, y, and by/facet labels use the labels and not the variable names/values?
This instantly makes your plots more intuitive
I don’t typically use Stata for plotting, so I’m not sure how to get the longer labels to wrap, but I’m sure there’s a way
11. Lastly, let’s look at how labels can show up in modeling. Let’s see if the percentage of students paying out of state changes by the level of the institution (4 year, 2 year, Less than 2 year)
model <- lm(scfa13p ~ factor(iclevel), data = data)
tbl_regression(model)
Characteristic     Beta   95% CI¹       p-value
factor(iclevel)
    1              —      —
    2              -8.7   -9.9, -7.4    <0.001
    3              -14    -21, -7.6     <0.001
¹ CI = Confidence Interval
Without using labels, the regression output needs the code book to interpret
What is iclevel 2?
Remember from above, using as_factor() rather than factor() tells R to use the labels as the levels
model <- lm(scfa13p ~ as_factor(iclevel), data = data)
tbl_regression(model)
Characteristic                             Beta   95% CI¹       p-value
as_factor(iclevel)
    Four or more years                     —      —
    At least 2 but less than 4 years       -8.7   -9.9, -7.4    <0.001
    Less than 2 years (below associate)    -14    -21, -7.6     <0.001
¹ CI = Confidence Interval
Okay, now it is much clearer what is going on!
as_factor(iclevel) is still a bit messy though
Similarly to the plot above, using variable labels is a little more tricky, but, we can do it using the var_label() function again alongside the label = argument in tbl_regression
10. Lastly, let’s look at how labels can show up in modeling. Let’s see if the percentage of students paying out of state changes by the level of the institution (4 year, 2 year, Less than 2 year)
regress scfa13p i.iclevel
Source | SS df MS Number of obs = 1,572
-------------+---------------------------------- F(2, 1569) = 98.14
Model | 30433.0627 2 15216.5313 Prob > F = 0.0000
Residual | 243262.532 1,569 155.043041 R-squared = 0.1112
-------------+---------------------------------- Adj R-squared = 0.1101
Total | 273695.595 1,571 174.217438 Root MSE = 12.452
-------------------------------------------------------------------------------------
scfa13p | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------------+----------------------------------------------------------------
iclevel |
At least 2 but l.. | -8.666799 .6320687 -13.71 0.000 -9.906587 -7.427011
Less than 2 yea..) | -14.21038 3.35951 -4.23 0.000 -20.79999 -7.62078
|
_cons | 14.21038 .4602254 30.88 0.000 13.30766 15.1131
-------------------------------------------------------------------------------------
As you can see, the value labels automatically show up in our regression output
Before, you would have seen 2 and 3 in the iclevel column, but now you get informative labels
With the labels, the output is easier to interpret, which makes your work easier to read and makes it less likely you will get mixed up and report the wrong value!
Removing Value Labels
It’s rare, but, there are occasions where you might need to remove value labels from your data
For instance, in R certain advanced analysis packages get confused when you have value labels
You may also spot an error in labelling and need to get rid of it
In R, just use the zap_labels() function from haven to create an unlabelled version of your data
data |> count(iclevel)
# A tibble: 3 × 2
  iclevel                                       n
  <dbl+lbl>                                 <int>
1 1 [Four or more years]                     2436
2 2 [At least 2 but less than 4 years]       1575
3 3 [Less than 2 years (below associate)]    1669
# A tibble: 3 × 2
  iclevel     n
    <dbl> <int>
1       1  2436
2       2  1575
3       3  1669
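The two outputs above can be reproduced on a small made-up data set with the sketch below, which shows zap_labels() applied to a whole data frame. The object name unlabelled is arbitrary.

```r
library(haven)
library(dplyr)

# Toy data with a labelled institution-level column
toy <- tibble(
  iclevel = labelled(
    c(1, 2, 3, 1),
    labels = c(
      "Four or more years" = 1,
      "At least 2 but less than 4 years" = 2,
      "Less than 2 years (below associate)" = 3
    )
  )
)

# zap_labels() strips the value labels from every labelled column
unlabelled <- toy |> zap_labels()

toy |> count(iclevel)         # <dbl+lbl> column with labels in brackets
unlabelled |> count(iclevel)  # plain <dbl> column with the original codes
```

Saving the result under a new name keeps the labelled version available for anything that benefits from it.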
In Stata, simply type label drop _all
tabulate iclevel
label drop _all
tabulate iclevel
Level of institution | Freq. Percent Cum.
------------------------------------+-----------------------------------
Four or more years | 2,436 42.89 42.89
At least 2 but less than 4 years | 1,575 27.73 70.62
Less than 2 years (below associate) | 1,669 29.38 100.00
------------------------------------+-----------------------------------
Total | 5,680 100.00
Level of |
institution | Freq. Percent Cum.
------------+-----------------------------------
1 | 2,436 42.89 42.89
2 | 1,575 27.73 70.62
3 | 1,669 29.38 100.00
------------+-----------------------------------
Total | 5,680 100.00
Getting Capitalized Variable Names
By default, IPEDtaS gives you lower-case variable names (which is the default for Stata-style data)
Usually, this is going to be easier to work with
However, sometimes you might need to keep the original upper-case variable names, such as if you’re adding this to an existing project that already uses upper-case variable names
To do this, you just need to add a single line near the end of the IPEDtaS script
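The exact line to add to IPEDtaS is not reproduced here. As a general R sketch of the same idea, once a data set is loaded you can upper-case every variable name with rename_with() from dplyr. The data below are made up for illustration.

```r
library(dplyr)

# Toy data with lower-case names, as IPEDtaS produces by default
toy <- tibble(
  unitid = c(100, 200),
  instnm = c("College A", "College B")
)

# Apply toupper() to every column name
toy_upper <- toy |> rename_with(toupper)

names(toy_upper)
```

This only changes the names, so any labels attached to the columns are left untouched.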