class: center, middle, inverse, title-slide .title[ # Data transformation ] .author[ ### INFO 5940
Cornell University ] --- class: inverse, middle # Computational problem-solving --- class: middle <img src="/img/xmen_xavier.jpg" width="80%" style="display: block; margin: auto;" /> --- class: middle <img src="/img/xkcd_computer_problems.png" width="80%" style="display: block; margin: auto;" /> .footnote[Source: [xkcd](https://xkcd.com/722/)] --- ## `penguins` <img src="/img/lter_penguins.png" width="80%" style="display: block; margin: auto;" /> --- ## `penguins` ```r penguins ``` ``` ## # A tibble: 344 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## 7 Adelie Torgersen 38.9 17.8 181 3625 ## 8 Adelie Torgersen 39.2 19.6 195 4675 ## 9 Adelie Torgersen 34.1 18.1 193 3475 ## 10 Adelie Torgersen 42 20.2 190 4250 ## # … with 334 more rows, and 2 more variables: sex <fct>, year <int> ``` --- ## What is the average body mass of an Adelie penguin? .panelset[ .panel[.panel-name[Conceptual] 1. First we need to identify the **input**, or the data we're going to analyze. 1. Next we need to **select** only the observations which are Adelie penguins. 1. Finally we need to calculate the average value, or **mean**, of `body_mass_g`. ] .panel[.panel-name[Code] ```r data("penguins") penguins_adelie <- filter(.data = penguins, species == "Adelie") summarize(.data = penguins_adelie, avg_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## avg_mass ## <dbl> ## 1 3701. ``` ] ] --- ## What is the average body mass of a penguin for each species?
02
:
00
-- ```r data("penguins") penguins_species <- group_by(.data = penguins, species) summarize(.data = penguins_species, avg_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## species avg_mass ## <fct> <dbl> ## 1 Adelie 3701. ## 2 Chinstrap 3733. ## 3 Gentoo 5076. ``` --- ## What is the average bill length and body mass for each Adelie penguin by sex?
02
:
00
-- .panelset[ .panel[.panel-name[Filter, then group by] ```r data("penguins") penguins_adelie <- filter(.data = penguins, species == "Adelie") penguins_adelie_sex <- group_by(.data = penguins_adelie, sex) summarize( .data = penguins_adelie_sex, bill = mean(bill_length_mm, na.rm = TRUE), avg_mass = mean(body_mass_g, na.rm = TRUE) ) ``` ``` ## # A tibble: 3 × 3 ## sex bill avg_mass ## <fct> <dbl> <dbl> ## 1 female 37.3 3369. ## 2 male 40.4 4043. ## 3 <NA> 37.8 3540 ``` ] .panel[.panel-name[Group by, then filter] ```r data("penguins") penguins_sex <- group_by(.data = penguins, sex) penguins_sex_adelie <- filter(.data = penguins_sex, species == "Adelie") summarize( .data = penguins_sex_adelie, bill = mean(bill_length_mm, na.rm = TRUE), avg_mass = mean(body_mass_g, na.rm = TRUE) ) ``` ``` ## # A tibble: 3 × 3 ## sex bill avg_mass ## <fct> <dbl> <dbl> ## 1 female 37.3 3369. ## 2 male 40.4 4043. ## 3 <NA> 37.8 3540 ``` ] ] --- <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/dplyr_wrangling.png" title="Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled 'dplyr', lassoing a group of angry / unruly looking creatures labeled 'data.'" alt="Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled 'dplyr', lassoing a group of angry / unruly looking creatures labeled 'data.'" width="70%" style="display: block; margin: auto;" /> .footnote[Source: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)] --- ## Verbiage for data transformation 1. The first argument is a data frame 1. Subsequent arguments describe what to do with the data frame 1. The result is a new data frame --- ## Key functions in `dplyr` `function()` | Action performed --------------|-------------------------------------------------------- `filter()` | Subsets observations based on their values `arrange()` | Changes the order of observations based on their values `select()` | Selects a subset of columns from the data frame `rename()` | Changes the name of columns in the data frame `mutate()` | Creates new columns (or variables) `group_by()` | Changes the unit of analysis from the complete dataset to individual groups `summarize()` | Collapses the data frame to a smaller number of rows which summarize the larger data --- ## American vs. British English .pull-left[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The holy grail: "For consistency, aim to use British (rather than American) spelling." <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> <a href="http://t.co/7qQSWIowcl">http://t.co/7qQSWIowcl</a>. Colour is right!</p>— Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/405707093770244097">November 27, 2013</a></blockquote> <script async src="http://platform.twitter.com/widgets.js" charset="utf-8"></script> ] -- .pull-right[ <img src="https://media.giphy.com/media/8KnfG3knLExpu/giphy.gif" width="80%" style="display: block; margin: auto;" /> ] -- * `summarize()` = `summarise()` * `color()` = `colour()` --- ## Saving transformed data (`<-`) .panelset[ .panel[.panel-name[Printed, but not saved] ```r filter(diamonds, cut == "Ideal") ``` ``` ## # A tibble: 21,551 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78 ## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75 ## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76 ## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44 ## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72 ## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63 ## # … with 21,541 more rows ``` ] .panel[.panel-name[Saved, but not printed] ```r diamonds_ideal <- filter(.data = diamonds, cut == "Ideal") ``` ] .panel[.panel-name[Saved and printed] ```r (diamonds_ideal <- filter(.data = diamonds, cut == "Ideal")) ``` ``` ## # A tibble: 21,551 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78 ## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75 ## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76 ## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44 ## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72 ## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63 ## # … with 21,541 more rows ``` ] ] --- .pull-left[ ### Syntactic column names ```r count(x = diamonds, color) ``` ``` ## # A tibble: 7 × 2 ## color n ## <ord> <int> ## 1 D 6775 ## 2 E 9797 ## 3 F 9542 ## 4 G 11292 ## 5 H 8304 ## 6 I 5422 ## 7 J 2808 ``` ] -- .pull-right[ ### Non-syntactic names * `Social conservative` * `7-point ideology` * `_id` ] --- ## Referring to non-syntactic names .pull-left[ ### Backticks ```r count(x = diamonds, `color`) ``` ``` ## # A tibble: 7 × 2 ## color n ## <ord> <int> ## 1 D 6775 ## 2 E 9797 ## 3 F 9542 ## 4 G 11292 ## 5 H 8304 ## 6 I 5422 ## 7 J 2808 ``` ] -- .pull-right[ ### Don't use quotation marks! ```r count(x = diamonds, "color") ``` ``` ## # A tibble: 1 × 2 ## `"color"` n ## <chr> <int> ## 1 color 53940 ``` ] --- ## Piping (`%>%`) .panelset.sideways[ .panel[.panel-name[No pipes] ```r by_dest <- group_by( .data = flights, dest ) delays <- summarise( .data = by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delays <- filter( .data = delays, count > 20, dest != "HNL" ) ``` ] .panel[.panel-name[With pipes] ```r delays <- flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter( count > 20, dest != "HNL" ) ``` ] ] --- ## Errors to avoid with pipes ```r delays <- flights %>% by_dest <- group_by(dest) %>% delay <- summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% delay <- filter(count > 20, dest != "HNL") ``` ``` Error: bad assignment: summarize(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>% delay <- filter(count > 20, dest != "HNL") ``` --- ## Errors to avoid with pipes ```r delays <- flights %>% group_by(.data = flights, dest) %>% summarize(.data = flights, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(.data = flights, count > 20, dest != "HNL") ``` ``` ## Error in `filter()`: ## ! Problem while computing `..1 = .`. ## Caused by error in `summarize()`: ## ! Problem while computing `..1 = .`. ## Caused by error in `group_by()`: ## ! Must group by variables found in `.data`. ## x Column `.` is not found. ``` --- ## Practice transforming data <img src="https://media.giphy.com/media/xUPOqBZFMaxCj9vP44/giphy.gif" width="65%" style="display: block; margin: auto;" />