# Topic modeling
## Basic idea
<span class="emph">Topic modeling</span> as typically conducted is a tool for much more than text. The primary technique of <span class="emph">Latent Dirichlet Allocation</span> (LDA) should be as much a part of your toolbox as principal components and factor analysis. It can be seen merely as a dimension reduction approach, but it can also be used for its rich interpretative quality as well. The basic idea is that we'll take a whole lot of features and boil them down to a few 'topics'. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics.
In the standard setting, to be able to conduct such an analysis from text one needs a <span class="emph">document-term matrix</span>, where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any <span class="emph">n-gram</span> of interest.
Outside of text analysis, terms could represent bacterial composition, genetic information, or whatever the researcher is interested in. Likewise, documents can be people, geographic regions, etc. The gist is that, despite the common text-based application, what constitutes a document or term depends on the research question, and LDA can be applied in a variety of research settings.
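As a quick illustration of the data structure, here is a minimal sketch of a toy document-term matrix built with the <span class="pack">quanteda</span> package (the two 'documents' are made up for the example; quanteda is also used later in this chapter).

```{r tm_chapter_dtm_toy, eval=FALSE}
library(quanteda)

docs = c(doc1 = 'love looks not with the eyes but with the mind',
         doc2 = 'the course of true love never did run smooth')

toy_dtm = dfm(tokens(docs))
toy_dtm  # rows are documents, columns are terms, cells are counts
```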
## Steps
When it comes to text analysis, most of the time in topic modeling is spent on processing the text itself: importing or scraping it, dealing with capitalization and punctuation, removing stopwords, handling encoding issues, and removing other miscellaneous common words. It is a highly iterative process, such that once you get to the document-term matrix, you're just going to find the stuff that was missed before and repeat the process with new 'cleaning parameters' in place. So getting to the analysis stage is the hard part. See the [Shakespeare section][Shakespeare Start to Finish], which comprises 5 acts, of which the first four and some additional scenes represent all the processing needed to get to the final scene of topic modeling. In what follows we'll start at the end of that journey.
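As a rough sketch, one such cleaning pass with <span class="pack">tidytext</span> might look like the following. This assumes a hypothetical data frame <span class="objclass">raw_text</span> with columns `id` and `text`; adjust the names to your own data.

```{r tm_chapter_cleaning_sketch, eval=FALSE}
library(dplyr)
library(stringr)
library(tidytext)

doc_term_counts = raw_text %>%
  unnest_tokens(word, text) %>%            # lowercase and tokenize
  anti_join(stop_words, by = 'word') %>%   # drop common stopwords
  filter(!str_detect(word, '[0-9]')) %>%   # drop tokens containing digits
  count(id, word)

# inspect what's left; anything odd here means another cleaning pass
doc_term_counts %>% count(word, wt = n, sort = TRUE)

# once satisfied, cast to a document-term matrix
cleaned_dtm = doc_term_counts %>%
  cast_dtm(document = id, term = word, value = n)
```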
## Topic Model Example
### Shakespeare
In this example, we'll look at Shakespeare's plays and poems, using a topic model with 10 topics. We'll use the <span class="pack">topicmodels</span> package for the analysis itself, and other packages mostly for post-processing. Due to the large number of terms, this could take a while to run depending on your machine (maybe a minute or two). We can also see how things compare with the academic classifications for the texts.
```{r tm_chapter_topic_model, eval=F}
load('data/shakes_dtm_stemmed.RData')
library(topicmodels)
library(quanteda)  # provides convert(); results vary by run unless a seed is set
shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10)
```
#### Examine Terms within Topics
One of the first things to do is attempt to interpret the topics, and we can start by seeing which terms are most probable for each topic.
```{r tm_chapter_tm10_results, eval=FALSE}
get_terms(shakes_10, 20)
```
```{r tm_chapter_tm10_results_pretty_terms, echo=FALSE}
load('data/shakespeare_topic_model.RData')
library(topicmodels)
get_terms(shakes_10, 20) %>%
DT::datatable(options=list(dom='tp')) %>%
DT::formatStyle(
0, # the column argument is ignored when target = 'row'
target='row',
backgroundColor = 'transparent'
)
```
<br>
We can see a lot of overlap in top terms across these topics. Just looking at the top 10, *love* occurs in all of them, *god* and *heart* as well, but we could have guessed this just by looking at how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance a term's probability of occurrence within a document against its *exclusivity*, or how likely a term is to occur in only one particular topic. See the [Shakespeare section][Shakespeare Start to Finish] for some examples of those.
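As a minimal sketch of the exclusivity idea (a simpler cousin of measures like FREX, not the formal definition), one can ask what share of a term's probability mass belongs to a single topic:

```{r tm_chapter_exclusivity_sketch, eval=FALSE}
beta = posterior(shakes_10)$terms          # topic x term probability matrix
excl = sweep(beta, 2, colSums(beta), '/')  # each column now sums to 1 across topics

# terms most exclusive to topic 1
head(sort(excl[1, ], decreasing = TRUE), 10)
```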
#### Examine Document-Topic Expression
Next we can look at which documents are more likely to express each topic.
```{r tm_chapter_tm10_results_topic_classification, eval=FALSE}
t(topics(shakes_10, 2))
```
```{r tm_chapter_tm10_results_pretty_topic_classification, echo=FALSE}
t(topics(shakes_10, 2)) %>%
data.frame %>%
rename_all(str_replace, 'X', 'Top Topic ') %>%
DT::datatable(options=list(dom='t',
scrollY='400px',
scrollCollapse=T,
pageLength=40,
autoWidth=T,
align='center',
columnDefs=list(list(width='150px', targets=0),
list(width='100px', targets=1:2),
list(className = 'dt-center', targets = 1:2))),
width='500') %>%
DT::formatStyle(
0, # the column argument is ignored when target = 'row'
target='row',
backgroundColor = 'transparent'
)
```
<br>
For example, based just on term frequency, Hamlet is most likely to be associated with Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet',1]`. That topic is affiliated with the (stemmed words) `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]]`. Sounds about right for Hamlet.
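If the full set of probabilities is of interest rather than just the top topics, it can be pulled from the fitted model directly; a minimal sketch (assuming the document names carry over as row names):

```{r tm_chapter_gamma_sketch, eval=FALSE}
gamma = posterior(shakes_10)$topics   # document x topic probability matrix
round(gamma['Hamlet', ], 3)
```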
<!-- Hamlet is also one that is actually a decent mix, with its second topic expressed being Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]`, with common terms `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]]`. They both have `r intersect(topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]], topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]])` among their top 20 terms. -->
The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I've also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem.
<br>
```{r tm_chapter_viz_topics, echo=FALSE, out.height='700px', out.width='700px', fig.width=11, fig.height=8.5, fig.align='center'}
# Note: the latex figure settings can actually limit the plot size, so several
# chunk options, possibly multiple package options, and css can all work against
# each other for a single visual.
# Assumption: fig.width=8.5 will be whatever the width of the div is.
library(quanteda)
load('data/shakespeare_classification.RData')
load('data/shakes_words_df.RData')
shakes_dtm = shakes_dtm %>%
dfm_wordstem()
suppressPackageStartupMessages(library(dendextend))
# see proxy::pr_DB %>% data.frame() for details on the distance/similarity
# metrics that quanteda draws on (its functions don't note where they come from):
# proxy::pr_DB %>% data.frame() %>% select(distance, formula, description, reference) %>% DT::datatable()
suppressPackageStartupMessages(library(heatmaply))
# note: cosine 'distance' (1 - similarity) is not a proper metric distance
row_dend =
(1-textstat_simil(dfm_weight(shakes_dtm, 'relMaxFreq'),
margin = "documents",
method = "cosine")) %>%
as.dist() %>%
hclust(method="complete") %>%
as.dendrogram %>%
set("branches_k_color", k = 4) %>%
set("branches_lwd", c(.5,.5)) %>%
ladderize
# An earlier attempt built this with heatmapr() + heatmaply() and explicit
# width/height, row_side_colors, etc., but ran into numerous argument-handling
# bugs (arguments ignored, arguments required but then unused, sizing settings
# dropped), so it was abandoned in favor of the single heatmaply() call below.
shakes_10@gamma %>%
round(3) %>%
heatmaply::heatmaply(Rowv=row_dend,
Colv=F,
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
row_side_colors = data.frame(shakes_types$class),
row_side_palette = plasma,
k_row= 4,
# RowSideColors = 'Set2',
labRow=rev(labels(row_dend)),
labCol=paste0('Topic ', 1:10),
hide_colorbar=T,
grid_gap=2,
plot_method='plotly'
) %>%
layout(showlegend=F) %>% # showing the legend will screw up the colorbar and any associated options
config(displayModeBar = F) %>%
theme_plotly()
```
<br>
A couple of things stand out. To begin with, most works are associated with one topic[^howmanytopics]. In terms of the discovered topics, traditional classification arguably only holds for the <span class="" style="color:#9C179E">historical</span> works, as they cluster together as expected (except for Henry VIII, possibly due to it being a collaborative work). Furthermore, <span class="" style="color:#F0F921">tragedies</span> and <span class="" style="color:#0D0887">comedies</span> might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the <span class="" style="color:#ED7953">poems</span> (love, beauty). If we take four clusters from the cluster analysis, the result boils down to *Phoenix* (on its own), standard poems, a mixed bag of more love-oriented works and the remaining poems, then everything else.
Alternatively, one could simply classify the works based on their most probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization orders the works by their most probable topic, with the topics themselves ordered by how prevalent they are across all documents.
<br>
```{r tm_chapter_cluster_topics, echo=FALSE, fig.align='center', fig.width=11, fig.height=5, out.height='800px', out.width='650px'}
topic_class = shakes_10@gamma %>%
round(3) %>%
data.frame() %>%
rename_all(function(x) str_replace(x, 'X', 'Topic ')) %>%
mutate(text = shakes_10@documents,
class = shakes_types$class)
order_topics = order(colSums(shakes_10@gamma), decreasing=T)
topic_class = topic_class %>%
# select(-text, -class) %>%
select(order_topics, text, class) %>%
arrange_at(vars(contains('Topic')), desc)
topic_class %>%
select(-text, -class) %>%
heatmaply::heatmaply(Rowv=NA,
Colv=NA,
labRow=rev(topic_class$text),
labCol=apply(get_terms(shakes_10, 10), 2, paste0, collapse='\n')[order_topics],
column_text_angle=0,
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
# subplot_widths=c(1),
plot_method = 'plotly',
fontsize_row=8,
fontsize_col=8,
hide_colorbar = T) %>%
layout(showlegend=F) %>% # layout()'s width/height are deprecated in plotly; maybe heatmaply will conform by then
config(displayModeBar = F) %>%
theme_plotly()
```
So topic modeling can also be used to group the documents themselves, putting together those most likely to express the same sorts of topics.
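A minimal way to get such a grouping from the fitted model:

```{r tm_chapter_topic_groups, eval=FALSE}
# group the works by their single most probable topic
split(names(topics(shakes_10)), topics(shakes_10))
```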
## Extensions
There are extensions of LDA used in topic modeling that will allow your analysis to go even further.

- Correlated Topic Models: standard LDA does not estimate topic correlations as part of the process; CTM does (see the sketch following this list).
- Supervised LDA: in this scenario, topics can be used for prediction, e.g. the classification of tragedy, comedy, etc. (similar to principal components regression).
- Structural Topic Models: here we want to find relevant covariates that can explain the topics (e.g. year written, author sex, etc.).
- Other: there are still other extensions, e.g. dynamic topic models that track how topics change over time.
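For instance, a correlated topic model can be fit with the same <span class="pack">topicmodels</span> package. A minimal sketch using the document-term matrix from before (fitting may be slow; the `Sigma` slot assumes the default VEM fit):

```{r tm_chapter_ctm_sketch, eval=FALSE}
library(topicmodels)
library(quanteda)  # for convert()

shakes_ctm = CTM(convert(shakes_dtm, to = 'topicmodels'), k = 10)

get_terms(shakes_ctm, 10)
cov2cor(shakes_ctm@Sigma)  # topic correlations from the logistic normal covariance
```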
## Topic Model Exercise
### Movie reviews
Perform a topic model on the [Cornell Movie review data](http://www.cs.cornell.edu/people/pabo/movie-review-data/). I've done some initial cleaning (e.g. removing stopwords, punctuation, etc.), and have both a tidy data frame and a document-term matrix for you to use. The former is provided in case you want to do additional processing; otherwise, just use the <span class="pack">topicmodels</span> package and perform your own analysis on the DTM. You can compare to [this result](https://ldavis.cpsievert.me/reviews/reviews.html).
```{r eval=FALSE, echo=FALSE}
# devtools::install_github("cpsievert/LDAvisData")
data(reviews, package = "LDAvisData")
library(tm)
# stop_words <- stopwords("SMART")
# pre-processing:
reviews <- gsub("'", "", reviews) # remove apostrophes
reviews <- gsub("[[:punct:]]", " ", reviews) # replace punctuation with space
reviews <- gsub("[[:cntrl:]]", " ", reviews) # replace control characters with space
reviews <- gsub("^[[:space:]]+", "", reviews) # remove whitespace at beginning of documents
reviews <- gsub("[[:space:]]+$", "", reviews) # remove whitespace at end of documents
reviews <- tolower(reviews) # force to lowercase
# tokenize on space and output as a list:
doc.list <- strsplit(reviews, "[[:space:]]+")
# # compute the table of terms:
# term.table <- table(unlist(doc.list))
# term.table <- sort(term.table, decreasing = TRUE)
#
# # remove terms that are stop words or occur fewer than 5 times:
# del <- names(term.table) %in% stop_words | term.table < 5
# term.table <- term.table[!del]
# vocab <- names(term.table)
#
# # now put the documents into the format required by the lda package:
# get.terms <- function(x) {
# index <- match(x, vocab)
# index <- index[!is.na(index)]
# rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
# }
# documents <- lapply(doc.list, get.terms)
docdf = stack(doc.list)  # long data frame: one row per token occurrence
library(tidytext)
library(dplyr)
library(stringr)  # for str_detect below
reviews_df = docdf %>%
rename(word=values,
review=ind) %>%
anti_join(stop_words) %>%
filter(!str_detect(word, pattern='^[0-9]+')) %>%
group_by(review) %>%
count(word)
reviews_dtm = reviews_df %>%
cast_dtm(term=word, document=review, value=n)
reviews_dtm
save(reviews_df, reviews_dtm, file='data/movie_reviews.RData')
```
```{r load_reviews, eval=FALSE}
load('data/movie_reviews.RData')
library(topicmodels)
```
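A minimal sketch of a starting point, with arbitrary choices for `k` and the seed:

```{r reviews_lda_sketch, eval=FALSE}
reviews_lda = LDA(reviews_dtm, k = 10, control = list(seed = 1234))

get_terms(reviews_lda, 10)   # top terms per topic
t(topics(reviews_lda, 2))    # top two topics per review
```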
### Associated Press articles
Do some topic modeling on articles from the Associated Press data from the First Text Retrieval Conference in 1992. The following will load the DTM, so you are ready to go. See how your result compares with that of [Dave Blei](http://www.cs.columbia.edu/~blei/lda-c/ap-topics.pdf), based on 100 topics.
```{r ap_data, eval=FALSE}
library(topicmodels)
data("AssociatedPress")
```
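A minimal sketch to get started, including a rough perplexity check in the spirit of the footnote on choosing the number of topics (`k = 100` will take a while to fit; both the `k` values and the seed are arbitrary here):

```{r ap_lda_sketch, eval=FALSE}
ap_lda = LDA(AssociatedPress, k = 100, control = list(seed = 1234))
get_terms(ap_lda, 10)

# rough comparison across candidate k; a more rigorous approach would use
# held-out documents. Lower perplexity is 'better', but it is no guarantee
# of human-interpretable topics.
ks = c(10, 25, 50, 100)
fits = lapply(ks, function(k) LDA(AssociatedPress, k = k, control = list(seed = 1234)))
sapply(fits, perplexity)
```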
[^howmanytopics]: There isn't a lot to go on when it comes to choosing an 'optimal' number of topics, but I investigated it via a measure called <span class="emph">perplexity</span>, which bottomed out at around 50 topics. Usually such an approach is done through cross-validation. However, the chosen solution has no guarantee of producing human-interpretable topics.