```{r, textmining, include = FALSE}
eval_mining <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_mining <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
```{r, eval = eval_mining, include = FALSE}
library(wordcloud2)
library(sparklyr)
library(dplyr)
```
# Text mining with `sparklyr`
For this example, two files will be analyzed: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain. The files were downloaded from the [Project Gutenberg](https://www.gutenberg.org/) site via the `gutenbergr` package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.
```{r, eval = eval_mining}
readLines("/usr/share/class/books/arthur_doyle.txt", 30)
```
## Data Import
*Read the book data into Spark*
1. Load the `sparklyr` library
```{r, eval = eval_mining}
library(sparklyr)
```
2. Open a Spark session
```{r, eval = eval_mining}
sc <- spark_connect(master = "local")
```
3. Use the `spark_read_text()` function to read the **mark_twain.txt** file, assign it to a variable called `twain`
```{r, eval = eval_mining}
twain <- spark_read_text(sc, "twain", "/usr/share/class/books/mark_twain.txt")
```
4. Use the `spark_read_text()` function to read the **arthur_doyle.txt** file, assign it to a variable called `doyle`
```{r, eval = eval_mining}
doyle <- spark_read_text(sc, "doyle", "/usr/share/class/books/arthur_doyle.txt")
```
## Tidying data
*Prepare the data for analysis*
1. Load the `dplyr` library
```{r}
library(dplyr)
```
2. Add a column to `twain` named `author` with a value of "twain". Assign it to a new variable called `twain_id`
```{r, eval = eval_mining}
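# One possible solution: mutate() adds the constant author column
twain_id <- twain %>%
  mutate(author = "twain")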
```
3. Add a column to `doyle` named `author` with a value of "doyle". Assign it to a new variable called `doyle_id`
```{r, eval = eval_mining}
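# Same pattern as above, for the Doyle text
doyle_id <- doyle %>%
  mutate(author = "doyle")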
```
4. Use `sdf_bind_rows()` to append the two files together in a variable called `both`
```{r, eval = eval_mining}
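# sdf_bind_rows() row-binds Spark data frames, like dplyr's bind_rows()
both <- sdf_bind_rows(twain_id, doyle_id)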
```
5. Preview `both`
```{r, eval = eval_mining}
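# Printing the tbl_spark shows the first rows without collecting everything
both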
```
6. Filter out empty lines into a variable called `all_lines`
```{r, eval = eval_mining}
# nchar() translates to Spark SQL's length()
all_lines <- both %>%
  filter(nchar(line) > 0)
```
7. Use Hive's *regexp_replace* to remove punctuation, assign it to the same `all_lines` variable
```{r, eval = eval_mining}
all_lines <- all_lines %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
## Transform the data
*Use feature transformers to make additional preparations*
1. Use `ft_tokenizer()` to separate each word in the line. Set the `output_col` to "word_list". Assign to a variable called `word_list`
```{r, eval = eval_mining}
word_list <- all_lines %>%
  ft_tokenizer(input_col = "line", output_col = "word_list")
```
2. Remove "stop words" with the `ft_stop_words_remover()` transformer. Set the `output_col` to "wo_stop_words". Assign to a variable called `wo_stop`
```{r, eval = eval_mining}
wo_stop <- word_list %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
3. Un-nest the tokens inside *wo_stop_words* using `explode()`. Assign to a variable called `exploded`
```{r, eval = eval_mining}
# explode() is a Hive function that dbplyr passes through to Spark SQL
exploded <- wo_stop %>%
  mutate(word = explode(wo_stop_words))
```
4. Select the *word* and *author* columns, and remove any word with fewer than 3 characters. Assign to `all_words`
```{r, eval = eval_mining}
all_words <- exploded %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
```
5. Cache the `all_words` variable using `compute()`
```{r, eval = eval_mining}
all_words <- all_words %>%
compute("all_words")
```
## Data Exploration
*Use word clouds to explore the data*
1. Create a variable with the word count by author, name it `word_count`
```{r, eval = eval_mining}
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))
```
2. Filter `word_count` to only retain "twain", assign it to `twain_most`
```{r, eval = eval_mining}
twain_most <- word_count %>%
filter(author == "twain")
```
3. Use `wordcloud` to visualize the top 50 words used by Twain
```{r, eval = eval_mining}
twain_most %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9"))
)
```
4. Filter `word_count` to only retain "doyle", assign it to `doyle_most`
```{r, eval = eval_mining}
doyle_most <- word_count %>%
  filter(author == "doyle")
```
5. Use `wordcloud` to visualize the top 50 words used by Doyle that have more than 5 characters
```{r, eval = eval_mining}
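# One way to do it, mirroring the Twain cloud above
doyle_most %>%
  filter(nchar(word) > 5) %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9"))
  )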
```
6. Use `anti_join()` to figure out which words are used by Doyle but not Twain. Order the results by word count.
```{r, eval = eval_mining}
doyle_unique <- doyle_most %>%
  anti_join(twain_most, by = "word") %>%
  arrange(desc(n))
```
7. Use `wordcloud` to visualize the top 50 records from the previous step
```{r, eval = eval_mining}
doyle_unique %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9"))
)
```
8. Find out how many times Twain used the word "sherlock"
```{r, eval = eval_mining}
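# twain_most already holds Twain's word counts, so filter for "sherlock"
twain_most %>%
  filter(word == "sherlock")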
```
9. Against the `twain` variable, use Hive's *lower* to make every line lowercase, and then use *instr* to look for "sherlock" in the line
```{r, eval = eval_mining}
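# lower() and instr() are Hive SQL functions passed through by dbplyr;
# instr() returns the match position, or 0 when "sherlock" is absent
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)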
```
10. Close Spark session
```{r, eval = eval_mining}
spark_disconnect(sc)
```
Most of these lines are from a short story by Mark Twain called [A Double Barrelled Detective Story](https://www.gutenberg.org/files/3180/3180-h/3180-h.htm#link2H_4_0008). According to the [Wikipedia](https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story) page about this story, it is a satire by Twain on the mystery novel genre, published in 1902.