Data wrangling
Computer programming as a form of problem solving
library(tidyverse)
library(palmerpenguins)

Figure: Professor X from X-Men (the Patrick Stewart version, not James McAvoy)
Figure: Computer Problems. XKCD.

Computers are not mind-reading machines. They are very efficient at certain tasks, and can perform calculations thousands of times faster than any human.
dplyr in brief
library(tidyverse)
library(nycflights13)

Figure: Data science workflow

Rarely will your data arrive in exactly the form you require in order to analyze it appropriately. As part of the data science workflow you will need to transform your data in order to analyze it.
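As a rough illustration of what such a transformation can look like, the sketch below chains a few dplyr verbs on a small table. The `trips` data and its column names are made up for this example; they are not the nycflights13 data.

```r
library(tidyverse)

# Hypothetical example data, invented for illustration
trips <- tribble(
  ~carrier, ~distance, ~air_time,
  "AA",     500,       60,
  "AA",     1000,      120,
  "UA",     750,       100,
  "UA",     250,       0
)

speeds <- trips %>%
  # drop rows where a speed cannot be computed
  filter(air_time > 0) %>%
  # distance is in miles and air_time in minutes, so this is miles per hour
  mutate(speed = distance / air_time * 60) %>%
  # collapse to one row per carrier
  group_by(carrier) %>%
  summarize(mean_speed = mean(speed), n = n()) %>%
  # fastest carrier first
  arrange(desc(mean_speed))

speeds
```

Each verb takes a data frame and returns a data frame, which is what makes the steps easy to chain with the pipe.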
Practice transforming college education (data)
library(tidyverse)

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/data-transformation")

The Department of Education collects annual statistics on colleges and universities in the United States.
Relational data: a quick review
Relational data consists of multiple tables that, when combined, answer research questions. What matters is the relations between the tables, not just the individual tables. A relation is defined between a pair of tables; more complex structures can be built up from these pairwise relations across more than two tables.
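A minimal sketch of a relation between a pair of tables, assuming two small made-up tables (`students` and `enrollments`) that share a `student_id` key. The table and column names are hypothetical and chosen only for illustration.

```r
library(tidyverse)

# Hypothetical example tables, invented for illustration
students <- tribble(
  ~student_id, ~name,
  1,           "Ana",
  2,           "Ben",
  3,           "Cy"
)

enrollments <- tribble(
  ~student_id, ~course,
  1,           "Statistics",
  1,           "Programming",
  2,           "Statistics"
)

# The shared key student_id defines the relation between the two tables.
# left_join() keeps every student; course is NA where there is no match.
all_students <- left_join(students, enrollments, by = "student_id")

# inner_join() keeps only students with at least one enrollment.
enrolled_only <- inner_join(students, enrollments, by = "student_id")

# Neither table alone can answer "how many courses is each student taking?"
course_counts <- all_students %>%
  group_by(name) %>%
  summarize(n_courses = sum(!is.na(course)))

course_counts
```

The choice between `left_join()` and `inner_join()` decides whether unmatched rows (here, the student with no courses) survive the join.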
Practice using relational data
library(tidyverse)
library(nycflights13)
theme_set(theme_minimal())

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/data-wrangling-relational-data-and-factors")

For each exercise, use your knowledge of relational data and joining operations to compute a table or graph that answers the question.
Importing data into R
library(tidyverse)
library(here)
theme_set(theme_minimal())

# set seed for reproducibility
set.seed(1234)

readr vs. base R

One of the main advantages of readr functions over base R functions is that they are typically much faster.
Practice transforming and visualizing factors
library(tidyverse)
library(rcis)
theme_set(theme_minimal())

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/data-wrangling-relational-data-and-factors")

# load the data
data("gun_deaths")
gun_deaths

## # A tibble: 100,798 × 10
##       id  year month intent       police sex     age race        place education
##    <dbl> <dbl> <chr> <chr>         <dbl> <chr> <dbl> <chr>       <chr> <fct>
##  1     1  2012 Jan   Suicide           0 M        34 Asian/Paci… Home  BA+
##  2     2  2012 Jan   Suicide           0 F        21 White       Stre… Some col…
##  3     3  2012 Jan   Suicide           0 M        60 White       Othe… BA+
##  4     4  2012 Feb   Suicide           0 M        64 White       Home  BA+
##  5     5  2012 Feb   Suicide           0 M        31 White       Othe… HS/GED
##  6     6  2012 Feb   Suicide           0 M        17 Native Ame… Home  Less tha…
##  7     7  2012 Feb   Undetermined      0 M        48 White       Home  HS/GED
##  8     8  2012 Mar   Suicide           0 M        41 Native Ame… Home  HS/GED
##  9     9  2012 Feb   Accidental        0 M        50 White       Othe… Some col…
## 10    10  2012 Feb   Suicide           0 M        NA Black       Home  <NA>
## # … with 100,788 more rows

Convert month into a factor column

# create a character vector with all month values
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# or use the built-in constant month.
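A small sketch of why the conversion matters, using base R's `factor()` on a few made-up month values instead of the full `gun_deaths` data. The `months_chr` vector is hypothetical; `month_levels` repeats the vector defined in the exercise so the block runs on its own.

```r
# Month abbreviations in calendar order, as defined in the exercise
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# A few hypothetical month values stored as character strings
months_chr <- c("Mar", "Jan", "Dec", "Feb")

# Character vectors sort alphabetically, which scrambles calendar order
sorted_chr <- sort(months_chr)

# Supplying levels tells R the intended order of the categories
months_fct <- factor(months_chr, levels = month_levels)

# Factors sort by level order, so the result follows the calendar
sorted_fct <- sort(months_fct)

sorted_chr
sorted_fct
```

The same level ordering is what plotting functions rely on when arranging a factor along an axis, so setting it once on the column avoids reordering in every plot.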
Tidy data
library(tidyverse)

Most data analysts and statisticians analyze data in a spreadsheet or tabular format. This is not the only way to store information; however, in the social sciences it has been the paradigm for many decades.
Practice tidying data
library(tidyverse)

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("cis-ds/data-wrangling-tidy-data")

For each exercise, tidy the data frame. Before you write any code, examine the structure of the data frame and mentally (or with pen-and-paper) sketch out what you think the tidy data structure should be.
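As one possible warm-up before the exercises, the sketch below tidies a small made-up table in which year values are spread across column names. The `untidy` data frame and its column names are hypothetical and are not taken from the exercise data.

```r
library(tidyverse)

# Hypothetical untidy table: the year variable is stored in the column names
untidy <- tribble(
  ~country, ~`2019`, ~`2020`,
  "A",      10,      12,
  "B",      20,      18
)

# Target structure sketched first: one row per country-year,
# with columns country, year, cases
tidy <- untidy %>%
  pivot_longer(
    cols = c(`2019`, `2020`),
    names_to = "year",
    values_to = "cases"
  ) %>%
  # column names arrive as character strings, so convert them to integers
  mutate(year = as.integer(year))

tidy
```

Writing down the target columns before calling `pivot_longer()` makes it easier to check that the result matches the structure you sketched.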