HW07: Machine learning

Overview

Due by 11:59pm on November 1st.

Accessing the hw07 repository

Go here and find your copy of the hw07 repository. It follows the naming convention hw07-<USERNAME>. Clone the repository to your computer.

Part 1: Student debt

Median student debt in the United States has increased substantially over the past twenty years.

Figure: Median federal debt for students has increased since 2006. Source: Federal Reserve Bank of St. Louis

rcis::scorecard includes the variable debt, which reports the median debt of students after leaving school in 2019.

For all models, exclude unitid and name as predictors. These serve as id variables in the data set and uniquely identify each observation; they are not useful for predicting an outcome of interest. A sketch of one possible tidymodels workflow for these steps appears after the numbered list below.
  1. Split scorecard into training and test sets with 75% allocated to training and 25% allocated to testing.
  2. Estimate a basic linear regression model to predict debt as a function of all the other variables in the dataset except for state. Report the RMSE for the model.1
  3. Estimate the same linear regression model, but this time implement 10-fold cross-validation. Report the RMSE for the model.
  4. Estimate a decision tree model to predict debt using 10-fold cross-validation. Use the rpart engine. Report the RMSE for the model.
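
The steps above follow a standard tidymodels pipeline. Below is a minimal sketch of one way to structure it, assuming the tidymodels and rcis packages are installed; the seed value, the object names (scorecard_split, lm_wf, folds, and so on), and the choice to compute the step 2 RMSE on the training data are illustrative, not required.

library(tidymodels)
library(rcis)

set.seed(123)  # any seed works; set one so the split and folds are reproducible

# 1. 75/25 training/test split
scorecard_split <- initial_split(scorecard, prop = 0.75)
scorecard_train <- training(scorecard_split)
scorecard_test <- testing(scorecard_split)

# Drop the id variables (and state, per step 2) before modeling
train_sub <- select(scorecard_train, -unitid, -name, -state)

# 2. Basic linear regression predicting debt from everything else
lm_wf <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(debt ~ .)

lm_fit <- fit(lm_wf, data = train_sub)

# RMSE (computed here on the training data; use the test set instead if
# that is how you interpret the question)
bind_cols(train_sub, predict(lm_fit, new_data = train_sub)) %>%
  rmse(truth = debt, estimate = .pred)

# 3. Same model with 10-fold cross-validation
folds <- vfold_cv(train_sub, v = 10)

fit_resamples(lm_wf, resamples = folds) %>%
  collect_metrics()

# 4. Decision tree with the rpart engine, evaluated on the same folds
tree_wf <- workflow() %>%
  add_model(
    decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("regression")
  ) %>%
  add_formula(debt ~ .)

fit_resamples(tree_wf, resamples = folds) %>%
  collect_metrics()

For steps 3 and 4, collect_metrics() reports the RMSE averaged across the ten assessment folds, which is the value to report.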

For those looking to stretch themselves

Estimate one or more models that utilize some aspect of feature engineering or model tuning. Discuss the process you used to estimate each model and report on its performance.

Part 2: Predicting attitudes towards racist college professors

The General Social Survey is a biennial survey of the American public.2

rcis::gss contains a selection of variables from the 2012 GSS. You are going to predict attitudes towards racist college professors. Specifically, each respondent was asked “Should a person who believes that Blacks are genetically inferior be allowed to teach in a college or university?” Given the kerfuffle over Richard J. Herrnstein and Charles Murray’s The Bell Curve and the ostracization of Nobel Prize laureate James Watson over his controversial views on race and intelligence, this analysis will provide further insight into the public debate over this issue.

The outcome of interest colrac is a factor variable coded as either "ALLOWED" (respondent believes the person should be allowed to teach) or "NOT ALLOWED" (respondent believes the person should not be allowed to teach).

Use the gss data frame, not gss_colrac. To ensure you have the correct data frame loaded, you can run:

data("gss", package = "rcis")

For all models, exclude id and wtss as predictors. These serve as id variables in the data set and uniquely identify each observation; they are not useful for predicting an outcome of interest. A sketch of one possible pipeline for these steps appears after the numbered list below.
  1. Split gss into training and test sets with 75% allocated to training and 25% allocated to testing.

  2. Estimate a logistic regression model to predict colrac as a function of age, black, degree, partyid_3, sex, and south. Implement 10-fold cross-validation. Report the accuracy of the model.

  3. Estimate a random forest model to predict colrac as a function of all the other variables in the dataset (except id and wtss). In order to do this, you need to impute missing values for all the predictor columns. This means replacing missing values (NA) with plausible values given what we know about the other observations.

    • Remove rows with an NA for colrac - we want to omit observations with missing values for outcomes, not impute them
    • Use median imputation for numeric predictors
    • Use modal imputation for nominal predictors

    Implement 10-fold cross-validation. Report the accuracy of the model.

  4. Estimate a $5$-nearest neighbors model to predict colrac. Use recipes to prepare the data set for training this model (e.g. normalizing and scaling variables, ensuring all predictors are numeric). Also perform the same imputation preprocessing as for the random forest model, and make sure the recipe steps are in a sensible order. Implement 10-fold cross-validation. Report the accuracy of the model.

  5. Estimate a ridge logistic regression model to predict colrac.3 Use the same recipe as for the $5$-nearest neighbors model and implement 10-fold cross-validation. Report the accuracy of the model.

  6. Select the best performing model. Train that recipe/model using the full training set and report the accuracy on the held-out test set.
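
The sketch below shows one way these six steps can fit together with tidymodels. It assumes the tidymodels and rcis packages, uses ranger as the random forest engine and kknn for the nearest neighbors model (the assignment does not mandate particular engines), and all object names and the seed are illustrative.

library(tidymodels)
library(rcis)

data("gss", package = "rcis")

set.seed(123)

# Drop the id variables and rows with a missing outcome
gss_sub <- gss %>%
  select(-id, -wtss) %>%
  drop_na(colrac)

# 1. 75/25 split (stratifying on the outcome is optional) and 10 folds
gss_split <- initial_split(gss_sub, prop = 0.75, strata = colrac)
gss_train <- training(gss_split)
gss_test <- testing(gss_split)

gss_folds <- vfold_cv(gss_train, v = 10)

# 2. Logistic regression on a handful of predictors
logit_wf <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(colrac ~ age + black + degree + partyid_3 + sex + south)

fit_resamples(logit_wf, resamples = gss_folds,
              metrics = metric_set(accuracy)) %>%
  collect_metrics()

# 3. Random forest: impute missing predictor values, keep factors as factors
rf_rec <- recipe(colrac ~ ., data = gss_train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())

rf_wf <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rand_forest() %>% set_engine("ranger") %>% set_mode("classification"))

fit_resamples(rf_wf, resamples = gss_folds,
              metrics = metric_set(accuracy)) %>%
  collect_metrics()

# 4. k-nearest neighbors: same imputation, then dummy-encode and normalize
knn_rec <- recipe(colrac ~ ., data = gss_train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(nearest_neighbor(neighbors = 5) %>%
              set_engine("kknn") %>%
              set_mode("classification"))

fit_resamples(knn_wf, resamples = gss_folds,
              metrics = metric_set(accuracy)) %>%
  collect_metrics()

# 5. Ridge logistic regression: same recipe, different model spec
ridge_wf <- knn_wf %>%
  update_model(logistic_reg(penalty = .01, mixture = 0) %>% set_engine("glmnet"))

fit_resamples(ridge_wf, resamples = gss_folds,
              metrics = metric_set(accuracy)) %>%
  collect_metrics()

# 6. Refit the best performer on the full training set and evaluate it once
#    on the held-out test set (swap in whichever workflow won)
last_fit(rf_wf, gss_split, metrics = metric_set(accuracy)) %>%
  collect_metrics()

The order of the recipe steps matters: imputation comes first so that modal imputation still sees factor columns, step_dummy then converts those factors to numeric indicator columns, and step_normalize runs last so every predictor ends up on a comparable scale for the distance-based $5$-nearest neighbors model.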

For those looking to stretch themselves

Estimate one or more additional models that utilize some aspect of feature engineering or model tuning. Discuss the process you used to estimate each model and report on its performance.

Documentation for the other predictors (for variables whose coding is not self-explanatory) can be viewed here. You can also run ?gss to open a documentation file in R.

Submit the assignment

Your assignment should be submitted as a set of two Quarto documents using the gfm (GitHub Flavored Markdown) format. Follow instructions on homework workflow.

Rubric

Needs improvement: Cannot get code to run or is poorly documented. No documentation in the README file. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment.

Satisfactory: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Excellent: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible. Writes a user-friendly README file. Implements appropriate visualization techniques for the statistical model. Results are presented in a clear and intuitive manner.


  1. View the documentation for yardstick to find the appropriate function for RMSE. ↩︎

  2. Conducted by NORC at the University of Chicago. ↩︎

  3. logistic_reg(penalty = .01, mixture = 0) ↩︎

Benjamin Soltoff
Lecturer in Information Science