-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathword-embeddings.rmd
251 lines (171 loc) · 12.1 KB
/
word-embeddings.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
# Word Embeddings
A key idea in the examination of text concerns representing words as numeric quantities. There are a number of ways to go about this, and we've actually already done so. In the sentiment analysis section words were given a sentiment score. In topic modeling, words were represented as frequencies across documents. Once we get to a numeric representation, we can then run statistical models.
Consider topic modeling again. We take the document-term matrix, and reduce the dimensionality of it to just a few topics. Now consider a <span class="emph">co-occurrence matrix</span>, where if there are $k$ words, it is a $k$ x $k$ matrix, where the diagonal values tell us how frequently word<sub>i</sub> occurs with word<sub>j</sub>. Just like in topic modeling, we could now perform some matrix factorization technique to reduce the dimensionality of the matrix[^wordsinenglish]. Now for each word we have a vector of numeric values (across factors) to represent them. Indeed, this is how some earlier approaches were done, for example, using principal components analysis on the co-occurrence matrix.
Newer techniques such as <span class="emph">word2vec</span> and <span class="emph">GloVe</span> use neural net approaches to construct word vectors. The details are not important for applied users to benefit from them. Furthermore, applications have been made to create sentence and other vector representations[^avgw2v]. In any case, with vector representations of words we can see how similar they are to each other, and perform other tasks based on that information.
A tired example from the literature is as follows:
$$\mathrm{king - man + woman = queen}$$
So a woman-king is a queen.
Here is another example:
$$\mathrm{Paris - France + Germany = Berlin}$$
Berlin is the Paris of Germany.
The idea is that with vectors created just based on co-occurrence we can recover things like analogies. Subtracting the **man** vector from the **king** vector and adding **woman**, the most similar word to this would be **queen**. For more on why this works, take a look [here](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).
## Shakespeare example
We start with some already tokenized data from the works of Shakespeare. We'll treat the words as if they just come from one big Shakespeare document, and only consider the words as tokens, as opposed to using n-grams. We create an iterator object for <span class="pack">text2vec</span> functions to use, and with that in hand, create the vocabulary, keeping only those that occur at least 5 times. This example generally follows that of the package [vignette](http://text2vec.org/glove.html), which you'll definitely want to spend some time with.
```{r text2vec_init, echo=-(4:5), eval=-3}
load('data/shakes_words_df_4text2vec.RData')
library(text2vec)
shakes_words
DT::datatable(shakes_words[1:100,], options=list(dom='t', scrollY=T), rownames = F, width = '40%')
htmltools::br()
shakes_words_ls = list(shakes_words$word)
it = itoken(shakes_words_ls, progressbar = FALSE)
shakes_vocab = create_vocabulary(it)
shakes_vocab = prune_vocabulary(shakes_vocab, term_count_min = 5)
```
Let's take a look at what we have at this point. We've just created word counts, that's all the vocabulary object is.
```{r text2vec_vocab, echo=-2}
shakes_vocab
# DT::datatable(shakes_vocab, options=list(dom='dt'), rownames = F, width = '25%')
```
The next step is to create the token co-occurrence matrix (TCM). The definition of whether two words occur together is arbitrary. Should we just look at previous and next word? Five behind and forward? This will definitely affect results so you will want to play around with it.
```{r text2vec_tcm, eval=FALSE, echo=-1}
# for reasons unknown, knitr consistently fails on this chunk; but since glove already processed no need to here
# maps words to indices
vectorizer = vocab_vectorizer(shakes_vocab)
# use window of 10 for context words
shakes_tcm = create_tcm(it, vectorizer, skip_grams_window = 10)
```
```{r load_tcm, echo=FALSE, eval=FALSE}
load('data/text2vec_shakes_tcm.RData')
```
Note that such a matrix will be extremely sparse. Most words do not go with other words in the grand scheme of things. So when they do, it usually matters.
Now we are ready to create the word vectors based on the GloVe model. Various options exist, so you'll want to dive into the associated help files and perhaps [the original articles](http://nlp.stanford.edu/projects/glove/) to see how you might play around with it. The following takes roughly a minute or two on my machine. I suggest you start with `n_iter = 10` and/or `convergence_tol = 0.001` to gauge how long you might have to wait.
In this setting, we can think of our word of interest as the <span class="emph">target</span>, and any/all other words (within the window) as the <span class="emph">context</span>. Word vectors are learned for both.
```{r text2vec_model, eval=FALSE, echo=-14}
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = shakes_vocab, x_max = 10)
shakes_wv_main = glove$fit_transform(shakes_tcm, n_iter = 1000, convergence_tol = 0.00001)
# dim(shakes_wv_main)
shakes_wv_context = glove$components
# dim(shakes_wv_context)
# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
shakes_word_vectors = shakes_wv_main + t(shakes_wv_context)
save(shakes_wv_main, shakes_word_vectors, file='data/glove_result.RData')
```
Now we can start to play. The measure of interest in comparing two vectors will be <span class="emph">cosine similarity</span>, which, if you're not familiar, you can think of it similarly to the standard correlation[^shakescosine]. Let's see what is similar to Romeo.
```{r text2vec_results, echo=-c(1,7)}
load('data/glove_result.RData')
rom = shakes_word_vectors["romeo", , drop = F]
# ham = shakes_word_vectors["hamlet", , drop = F]
cos_sim_rom = sim2(x = shakes_word_vectors, y = rom, method = "cosine", norm = "l2")
# head(sort(cos_sim_rom[,1], decreasing = T), 10)
head(cos_sim_rom[order(cos_sim_rom, decreasing = T),1], 10) %>%
round(2) %>%
t %>%
data.frame() %>%
kable_df()
```
Obviously Romeo is most like Romeo, but after that comes the rest of the crew in the play. As this text is somewhat raw, it is likely due to names associated with lines in the play. As such, one may want to narrow the window[^romeo5]. Let's try **love**.
```{r text2vec_results_love, echo=-5}
love = shakes_word_vectors["love", , drop = F]
cos_sim_rom = sim2(x = shakes_word_vectors, y = love, method = "cosine", norm = "l2")
# head(sort(cos_sim_rom[,1], decreasing = T), 10)
head(sort(cos_sim_rom[,1], decreasing = T), 10) %>%
kable_df(digits=2)
```
The issue here is that love is so commonly used in Shakespeare, it's most like other very common words. What if we take Romeo, subtract his friend Mercutio, and add Nurse? This is similar to the analogy example we had at the start.
```{r text2vec_results_rommerc, echo=-5}
test = shakes_word_vectors["romeo", , drop = F] -
shakes_word_vectors["mercutio", , drop = F] +
shakes_word_vectors["nurse", , drop = F]
cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
# head(sort(cos_sim_test[,1], decreasing = T), 10)
head(sort(cos_sim_test[,1], decreasing = T), 3) %>%
kable_df(digits=2)
```
It looks like we get Juliet as the most likely word (after the ones we actually used), just as we might have expected. Again, we can think of this as Romeo is to Mercutio as Juliet is to the Nurse. Let's try another like that.
```{r text2vec_results_romjulcleo, echo=-5}
test = shakes_word_vectors["romeo", , drop = F] -
shakes_word_vectors["juliet", , drop = F] +
shakes_word_vectors["cleopatra", , drop = F]
cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
# head(sort(cos_sim_test[,1], decreasing = T), 3)
head(sort(cos_sim_test[,1], decreasing = T), 3) %>%
kable_df(digits=2)
```
One can play with stuff like this all day. For example, you may find that a Romeo without love is a Tybalt!
```{r text2vec_results_other, echo=F, eval=F}
# test = shakes_word_vectors["romeo", , drop = F] +
# shakes_word_vectors["juliet", , drop = F]
# test = shakes_word_vectors["death", , drop = F] +
# shakes_word_vectors["war", , drop = F]
# test = shakes_word_vectors["war", , drop = F] +
# shakes_word_vectors["honour", , drop = F] +
# shakes_word_vectors["death", , drop = F] +
# shakes_word_vectors["blood", , drop = F]
# test = shakes_word_vectors["king", , drop = F] +
# shakes_word_vectors["lear", , drop = F]
# test = shakes_word_vectors["juliet", , drop = F] -
# shakes_word_vectors["romeo", , drop = F] +
# shakes_word_vectors["mark", , drop = F]
test = shakes_word_vectors["romeo", , drop = F] -
shakes_word_vectors["love", , drop = F]
test = shakes_word_vectors["rome", , drop = F] -
shakes_word_vectors["caesar", , drop = F] +
shakes_word_vectors["richard", , drop = F]
cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
# head(sort(cos_sim_test[,1], decreasing = T), 3)
head(sort(cos_sim_test[,1], decreasing = T), 50)
```
## Wikipedia
The following shows the code for analyzing text from Wikipedia, and comes directly from the <span class="pack">text2vec</span> vignette. Note that this is a relatively large amount of text (100MB), and so will take notably longer to process.
```{r wiki_code, eval=FALSE}
text8_file = "data/texts_raw/text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "data/text8.zip")
unzip("data/text8.zip", files = "text8", exdir = "data/texts_raw/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
tokens = space_tokenizer(wiki)
it = itoken(tokens, progressbar = FALSE)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5L)
vectorizer = vocab_vectorizer(vocab)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 100, convergence_tol = 0.001)
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
```
Let's try our Berlin example.
```{r echo=FALSE}
# this will load the cosine result
load('data/wiki_berlin_queen.RData')
```
```{r berlin, eval=FALSE}
berlin = word_vectors["paris", , drop = FALSE] -
word_vectors["france", , drop = FALSE] +
word_vectors["germany", , drop = FALSE]
berlin_cos_sim = sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5)
```
```{r berlin_eval, echo=FALSE}
head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5)
```
Success! Now let's try the queen example.
```{r queen, eval=FALSE}
queen = word_vectors["king", , drop = FALSE] -
word_vectors["man", , drop = FALSE] +
word_vectors["woman", , drop = FALSE]
queen_cos_sim = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
```
```{r queen_eval, echo=FALSE}
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
```
Not so much, though it is still a top result. Results are of course highly dependent upon the data and settings you choose, so keep the context in mind when trying this out.
Now that words are vectors, we can use them in any model we want, for example, to predict sentimentality. Furthermore, extensions have been made to deal with sentences, paragraphs, and even [lda2vec](https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=)! In any event, hopefully you have some idea of what word embeddings are and can do for you, and have added another tool to your text analysis toolbox.
[^wordsinenglish]: You can imagine how it might be difficult to deal with the English language, which might be something on the order of 1 million words.
[^avgw2v]: Simply taking the average of the word vector representations within a sentence to represent the sentence as a vector is surprisingly performant.
[^shakescosine]: It's also used in the [Shakespeare Start to Finish][Similarity] section.
[^romeo5]: With a window of 5, Romeo's top 10 includes others like Troilus and Cressida.