Final Project

This assignment should be submitted as a text entry directly on Canvas consisting of;

  1. A paragraph describing your project:
    • What will you be investigating/exploring/predicting?
    • Why is it interesting?
  2. A description of where you will find this data.
  3. A few lines describing your main outcome variable in detail
    • How is it coded/what scale is it on?
    • How can you interpret it?

This assignment is worth 5 points, full points will be awarded once satisfactorily completed, multiple re-submissions may be required.

This should be submitted to Canvas by the due date listed.

NOTE: For your initial analyses, the most important thing is that you submit code that sources/renders in full, this assignment will not be successfully completed until that happens, you may have to resubmit multiple times.

Submit the following in either a cleanly formatted R (.R) script or Quarto (.qmd) file:

  1. Comments (if an R script) or text (if a Quarto script) that describe
    • Where your data is from (a link is preferable)
    • How to download it
    • Where to save it in order for your code to run
    • E.g., For this project I used two .csv data files from IPEDS survey year 2019, institutional characteristics HD2019 and public finance F1819_F1A. These can be downloaded from here by clicking on the named files name under “Data Files”. To run this code, save these files in a sub-folder called “data” that sits in the same folder as this .qmd file.
  2. Code that:
    • Reads in the data set that contains (at least) your dependent variable (must read in EXACTLY what downloads from following your instructions above)
    • If appropriate
      • Converts missing values to NA
      • Reshapes the data wider or longer
      • Joins in additional data files
  3. Code that creates at least three of the following:
    • A plot that shows the overall distribution of your dependent/outcome variable

      • Hint: A histogram or density plot might be a good option here
    • A plot that shows the distribution of your dependent/outcome variable grouped by a variable in your data

      • Hint: A histogram or density plot with fill might be a good option here
    • A plot that shows the median, interquartile range, and potential outliers of your dependent/outcome variable grouped by a variable in your data

      • Hint: A box plot with x and/or fill might be a good option here
    • A plot that shows how your dependent/outcome variable changes by another continuous variable in your data

      • Hint: A scatter plot might be a good option here

This assignment is worth 10 points, full points will be awarded once satisfactorily completed, multiple re-submissions may be required.

This should be submitted to Canvas by the due date listed.

  • You will present the results of your report in class during the penultimate week of the semester (see date in Canvas)
  • The presentation format is up to you, previous students have
    • Presented an image of one figure they created
    • Created a short PowerPoint presentation
    • Created presentations using Quarto
  • The primary rule for this presentation is that it is to be 3-5 mins long (read min 3 mins, max 5 mins, ideal 4 mins)
    • As this is meant to replicate the you presenting your report to senior administrators, this is a hard time-limit, you will be stopped if you go over 5 mins

This assignment is worth 5 points, your grade will be determined by:

  • 3 points: Did you present a plot/table/finding from your report that tells a story about your topic?

  • 1 point: Did you present the information in a professional and engaging manner?

  • 1 point: Did you finish within the allotted time limit of 3-5 mins?

  • Your final report is a 3-5 page (single-spaced, not including citations) document that summarizes the analysis you have done using plenty of figures and summary tables along the way

    • This is NOT a traditional academic paper, it is meant to be concise report intended for a university administrator or policymaker

    • 5 pages is a hard limit, anything beyond the 5th page won’t be graded

    • NOTE: You can submit a draft (for feedback only) by the due date on Canvas

  • You will submit the report as Quarto (.qmd) file

    • The output format: should be either docx, pdf (traditional way using LaTeX), or typst (brand new way to create a .pdf)

    • Optional: If your analysis code becomes long, you might want to submit accompanying .R scripts that are source()-ed as discussed in the Quarto Lesson

Required Report Content

  • Before the introduction of your document, a section called “Instructions to Run” that states

    • Where your data is from (a link if preferable)
    • How to download it
    • Where to save it in order for the code to run
    • E.g., For this project I used two .csv data files from IPEDS survey year 2019, institutional characteristics HD2019 and public finance F1819_F1A. These can be downloaded from here by clicking on the named files name under “Data Files”. To run this code, save these files in a sub-folder called “data” that sits in the same folder as this .qmd file.
  • Well commented code (either in the .qmd file or a source()-ed .R script) that:

    • Reads in all your raw data (must read in EXACTLY what downloads from following your instructions above)
    • Performs all data wrangling tasks to clean, join, and reshape your data as necessary for your project
    • Creates:
      • Required: 3 or more plots with ggplot2
      • Required: 1 or more overall descriptive statistics table(s) made with summarize
      • Required: 1 or more other summary table(s) made with summarize
      • Optional: Basic statistics like t.test() or lm()
  • Written text that should be clearly structured with subheadings and describe:

    • Why this is interesting and/or important
      • This should be a single concise but convincing paragraph, think an “elevator pitch” argument as to why this matters
      • There should NOT be any lengthy literature review in this assignment, this is it
    • Why your chose your data source and what the data represents
    • What analysis you did and why, in layman’s terms (not an R or stats expert)
    • What each individual plot and table shows
    • What you found overall
    • Any limitations or future research

Rubric

Report Element Criteria Points
Does it run?
  • If I hit render, does the code run and produce the output report without errors?
4
Data Wrangling: Reading & Cleaning
  • Are the “instructions to run” provided and correct?

  • Is the data read in correctly?

  • Is the data joined and/or reshaped correctly where necessary?

  • Is the data cleaned where necessary with reasonable justification provided for subjective decisions?

4
Data Wrangling: Analysis
  • Is there at least 1 overall descriptive statistics summarize table in the report?

  • Is there at least 1 additional summarize table in the report?

  • Do the summary tables show interesting and relevant information to the report topic?

  • Is the code to produce the summary tables free of mistakes?

4
Data Visualization
  • Are there at least 3 ggplot2 plots in the report?

  • Do the plots show interesting and relevant information to the report topic?

  • Are the plots aesthetically pleasing with good plot design, color choices, and labels?

  • Is the code to produce the summary tables free of mistakes?

4
Written Content
  • Does the written content address the points outlined above?

  • Does the report make a convincing argument?

  • Is the text of the report generally well written and in layman’s terms?

  • Is the text of the report well structured with subheadings?

4