class: center, middle, inverse, title-slide

.title[
# Text analysis: classification and topic modeling
]
.author[
### INFO 5940
Cornell University ]

---

# Supervised learning

1. Hand-code a small set of documents `\(N = 1,000\)`
1. Train a statistical learning model on the hand-coded data
1. Evaluate the effectiveness of the statistical learning model
1. Apply the final model to the remaining set of documents `\(N = 1,000,000\)`

---

# `USCongress`

```
## Rows: 4,449
## Columns: 7
## $ ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ cong     <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
## $ billnum  <dbl> 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 4508, 4…
## $ h_or_sen <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "…
## $ major    <dbl> 18, 18, 18, 18, 5, 21, 15, 18, 18, 18, 18, 16, 18, 12, 2, 3, …
## $ text     <chr> "To suspend temporarily the duty on Fast Magenta 2 Stage.", "…
## $ label    <fct> "Foreign trade", "Foreign trade", "Foreign trade", "Foreign t…
```

```
## [1] "To suspend temporarily the duty on Fast Magenta 2 Stage."
## [2] "To suspend temporarily the duty on Fast Black 286 Stage."
## [3] "To suspend temporarily the duty on mixtures of Fluazinam."
## [4] "To reduce temporarily the duty on Prodiamine Technical."
## [5] "To amend the Immigration and Nationality Act in regard to Caribbean-born immigrants."
## [6] "To amend title 38, United States Code, to extend the eligibility for housing loans guaranteed by the Secretary of Veterans Affairs under the Native American Housing Loan Pilot Program to veterans who are married to Native Americans."
```

---

# Split the data set

```r
set.seed(123)

# convert response variable to a factor; levels and labels must be unique,
# so take the first occurrence of each major code and its matching label
congress <- congress %>%
  mutate(major = factor(x = major, levels = unique(major), labels = unique(label)))

# split into training and testing sets
congress_split <- initial_split(data = congress, strata = major, prop = .8)
congress_split
## <Analysis/Assess/Total>
## <3558/891/4449>

congress_train <- training(congress_split)
congress_test <- testing(congress_split)

# generate cross-validation folds
congress_folds <- vfold_cv(data = congress_train, strata = major)
```

---

# Class imbalance

<img src="index_files/figure-html/major-topic-dist-1.png" width="80%" style="display: block; margin: auto;" />

---

# Preprocessing the data frame

```r
congress_rec <- recipe(major ~ text, data = congress_train)
```

```r
library(textrecipes)
library(themis) # provides step_downsample()

congress_rec <- congress_rec %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_downsample(major)
```

---

# Define the model

```r
tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("C5.0")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---

# Train the model

```r
tree_wf <- workflow() %>%
  add_recipe(congress_rec) %>%
  add_model(tree_spec)
```

```r
set.seed(123)

tree_cv <- fit_resamples(
  tree_wf,
  congress_folds,
  control = control_resamples(save_pred = TRUE)
)
```

```r
tree_cv_metrics <- collect_metrics(tree_cv)
tree_cv_predictions <- collect_predictions(tree_cv)

tree_cv_metrics
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.455    10 0.00807 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.772    10 0.00809 Preprocessor1_Model1
```

---

# Confusion matrix

<img src="index_files/figure-html/tree-confusion-1.png" width="80%" style="display: block; margin: auto;" />

---

# Name That Tune!
.pull-left[
<div style="width:100%;height:0;padding-bottom:54%;position:relative;"><iframe src="https://giphy.com/embed/10JbbHzFsBpg40" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>
]

.pull-right[
<div style="width:100%;height:0;padding-bottom:100%;position:relative;"><iframe src="https://giphy.com/embed/7SKWbnycqb2Pze62Zk" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>
]
15:00
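---

# Another model specification

The decision tree above is only one option: the same recipe and cross-validation folds can be reused with a different model. As a sketch (not from the original deck), a lasso-penalized multinomial regression via the `glmnet` engine; the penalty value here is an illustrative assumption, and in practice it would be tuned.

```r
library(tidymodels)

# lasso-penalized multinomial regression (penalty value is illustrative)
lasso_spec <- multinom_reg(penalty = 0.01, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# reuse the existing recipe and resampling folds
lasso_wf <- workflow() %>%
  add_recipe(congress_rec) %>%
  add_model(lasso_spec)

set.seed(123)
lasso_cv <- fit_resamples(
  lasso_wf,
  congress_folds,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(lasso_cv)
```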
---

# Topic modeling

* Themes
* Probabilistic topic models
* Latent Dirichlet allocation

---

# Example documents

1. I ate a banana and spinach smoothie for breakfast.
1. I like to eat broccoli and bananas.
1. Chinchillas and kittens are cute.
1. My sister adopted a kitten yesterday.
1. Look at this cute hamster munching on a piece of broccoli.

---

# LDA document structure

* Decide on the number of words `\(N\)` the document will have
    * [Dirichlet probability distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
    * Fixed set of `\(k\)` topics
* Generate each word in the document:
    * Pick a topic
    * Generate the word
* LDA backtracks from this assumption

---

# `appa`

<img src="../../../../../../../../img/appa-avatar.gif" width="65%" style="display: block; margin: auto;" />

---

# `appa`

```r
remotes::install_github("averyrobbins1/appa")
```

```r
library(appa)
data("appa")

glimpse(appa)
## Rows: 13,385
## Columns: 12
## $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ book              <fct> Water, Water, Water, Water, Water, Water, Water, Wat…
## $ book_num          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ chapter           <fct> "The Boy in the Iceberg", "The Boy in the Iceberg", …
## $ chapter_num       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ character         <chr> "Katara", "Scene Description", "Sokka", "Scene Descr…
## $ full_text         <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ character_words   <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
## $ scene_description <list> <>, <>, "[Close-up of the boy as he grins confident…
## $ writer            <chr> "Michael Dante DiMartino, Bryan Konietzko, Aaron Eha…
## $ director          <chr> "Dave Filoni", "Dave Filoni", "Dave Filoni", "Dave F…
## $ imdb_rating       <dbl> 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.…
```

---

# Create the recipe

```r
appa_rec <- recipe(~ id + character_words, data = appa) %>%
  step_tokenize(character_words) %>%
  step_stopwords(character_words, stopword_source = "smart") %>%
  step_ngram(character_words, num_tokens = 5, min_num_tokens = 1) %>%
  step_tokenfilter(character_words, max_tokens = 5000) %>%
  step_tf(character_words)
```

---

# Bake the recipe

```r
appa_prep <- prep(appa_rec)
appa_df <- bake(appa_prep, new_data = NULL)

appa_df %>%
  slice(1:5)
## # A tibble: 5 × 5,000
##      id tf_cha…¹ tf_ch…² tf_ch…³ tf_ch…⁴ tf_ch…⁵ tf_ch…⁶ tf_ch…⁷ tf_ch…⁸ tf_ch…⁹
##   <int>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1        0       0       0       0       0       0       0       0       0
## 2     2        0       0       0       0       0       0       0       0       0
## 3     3        0       0       0       0       0       0       0       0       0
## 4     4        0       0       0       0       0       0       0       0       0
## 5     5        0       0       0       0       0       0       0       0       0
## # … with 4,990 more variables: tf_character_words_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_aang_aang_aang_aang <dbl>,
## #   tf_character_words_aang_airbending <dbl>,
## #   tf_character_words_aang_avatar <dbl>, tf_character_words_aang_back <dbl>,
## #   tf_character_words_aang_big <dbl>, tf_character_words_aang_coming <dbl>,
## #   tf_character_words_aang_dad <dbl>, …
## # ℹ Use `colnames()` to see all variable names
```

---

# Convert to document-term matrix

.tiny[

```r
library(tidytext) # provides cast_dtm()

appa_dtm_prep <- appa_df %>%
  pivot_longer(
    cols = -id,
    names_to = "token",
    values_to = "n"
  ) %>%
  filter(n != 0) %>%
  # clean the token column so it just includes the token
  mutate(
    token = str_remove(string = token, pattern = "tf_character_words_")
  )

# id must be consecutive with no gaps
appa_new_id <- appa_dtm_prep %>%
  distinct(id) %>%
  mutate(new_id = row_number())
appa_dtm <- left_join(x = appa_dtm_prep, y = appa_new_id) %>%
  cast_dtm(document = new_id, term = token, value = n)

appa_dtm
## <<DocumentTermMatrix (documents: 8822, terms: 4999)>>
## Non-/sparse entries: 40408/44060770
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
```

]

---

# `\(k=4\)`

```r
library(topicmodels) # provides LDA()

appa_lda4 <- LDA(appa_dtm, k = 4, control = list(seed = 123))
```

<img src="index_files/figure-html/appa-4-topn-1.png" width="70%" style="display: block; margin: auto;" />

---

# `\(k=12\)`

<img src="index_files/figure-html/appa-12-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

# Perplexity

* A statistical measure of how well a probability model predicts a sample
* Compares the theoretical word distributions represented by the topics to the actual distribution of words in the documents
* Perplexity for the LDA model with 12 topics: 1373.95

---

# Perplexity

<img src="index_files/figure-html/appa_lda_compare_viz-1.png" width="80%" style="display: block; margin: auto;" />

---

# `\(k=100\)`

<img src="index_files/figure-html/appa-100-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

# LDAvis

* Interactive visualization of LDA model results

1. What is the meaning of each topic?
1. How prevalent is each topic?
1. How do the topics relate to each other?
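---

# Extracting top terms

The per-topic term plots shown for each `\(k\)` can be rebuilt from a fitted model with `tidytext::tidy()`, which returns the per-topic word probabilities (`beta`). A minimal sketch; the number of terms shown and the plot styling are assumptions, not the deck's original code:

```r
library(tidytext)
library(dplyr)
library(ggplot2)

# per-topic word probabilities (beta) from the fitted LDA model
appa_topics <- tidy(appa_lda4, matrix = "beta")

# top 10 terms within each topic
appa_top_terms <- appa_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()

# one panel per topic, terms reordered within each panel
ggplot(appa_top_terms, aes(x = beta, y = reorder_within(term, beta, topic))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(facets = vars(topic), scales = "free_y")
```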