```{r, textmining, include = FALSE}
eval_mining <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_mining <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```
```{r, eval = eval_mining, include = FALSE}
library(wordcloud2)
library(sparklyr)
library(dplyr)
```
# Text mining with `sparklyr`
For this example, two files will be analyzed: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain. The files were downloaded from the [Project Gutenberg](https://www.gutenberg.org/) site via the `gutenbergr` package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.
```{r, eval = eval_mining}
readLines("/usr/share/class/books/arthur_doyle.txt", 30)
```
## Data Import
*Read the book data into Spark*
1. Load the `sparklyr` library
```{r, eval = eval_mining}
library(sparklyr)
```
2. Open a Spark session
```{r, eval = eval_mining}
sc <- spark_connect(master = "local")
```
3. Use the `spark_read_text()` function to read the **mark_twain.txt** file, assign it to a variable called `twain`
```{r, eval = eval_mining}
twain <- spark_read_text(sc, "twain", "/usr/share/class/books/mark_twain.txt")
```
4. Use the `spark_read_text()` function to read the **arthur_doyle.txt** file, assign it to a variable called `doyle`
```{r, eval = eval_mining}
doyle <- spark_read_text(sc, "doyle", "/usr/share/class/books/arthur_doyle.txt")
```
## Tidying data
*Prepare the data for analysis*
1. Load the `dplyr` library
```{r}
library(dplyr)
```
2. Add a column to `twain` named `author` with a value of "twain". Assign it to a new variable called `twain_id`
```{r, eval = eval_mining}
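# One possible solution: mutate() adds the constant author column
twain_id <- twain %>%
  mutate(author = "twain")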
```
3. Add a column to `doyle` named `author` with a value of "doyle". Assign it to a new variable called `doyle_id`
```{r, eval = eval_mining}
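# Same pattern as above, for the Doyle text
doyle_id <- doyle %>%
  mutate(author = "doyle")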
```
4. Use `sdf_bind_rows()` to append the two files together in a variable called `both`
```{r, eval = eval_mining}
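# sdf_bind_rows() row-binds Spark data frames, like dplyr's bind_rows()
both <- sdf_bind_rows(twain_id, doyle_id)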
```
5. Preview `both`
```{r, eval = eval_mining}
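# Printing the tbl_spark shows the first rows without collecting everything
both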
```
6. Filter out empty lines into a variable called `all_lines`
```{r, eval = eval_mining}
# nchar() translates to Spark SQL's length()
all_lines <- both %>%
  filter(nchar(line) > 0)
```
7. Use Hive's *regexp_replace* to remove punctuation, assign it to the same `all_lines` variable
```{r, eval = eval_mining}
all_lines <- all_lines %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
## Transform the data
*Use feature transformers to make additional preparations*
1. Use `ft_tokenizer()` to separate each word in the line. Set the `output_col` to "word_list". Assign to a variable called `word_list`
```{r, eval = eval_mining}
word_list <- all_lines %>%
  ft_tokenizer(input_col = "line", output_col = "word_list")
```
2. Remove "stop words" with the `ft_stop_words_remover()` transformer. Set the `output_col` to "wo_stop_words". Assign to a variable called `wo_stop`
```{r, eval = eval_mining}
wo_stop <- word_list %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
3. Un-nest the tokens inside *wo_stop_words* using `explode()`. Assign to a variable called `exploded`
```{r, eval = eval_mining}
# explode() is a Hive function that dbplyr passes through to Spark SQL
exploded <- wo_stop %>%
  mutate(word = explode(wo_stop_words))
```
4. Select the *word* and *author* columns, and remove any word with fewer than 3 characters. Assign to `all_words`
```{r, eval = eval_mining}
all_words <- exploded %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
```
5. Cache the `all_words` variable using `compute()`
```{r, eval = eval_mining}
all_words <- all_words %>%
compute("all_words")
```
## Data Exploration
*Use word clouds to explore the data*
1. Create a variable with the word count by author, name it `word_count`
```{r, eval = eval_mining}
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))
```
2. Filter `word_count` to only retain "twain", assign it to `twain_most`
```{r, eval = eval_mining}
twain_most <- word_count %>%
filter(author == "twain")
```
3. Use `wordcloud` to visualize the top 50 words used by Twain
```{r, eval = eval_mining}
twain_most %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9"))
)
```
4. Filter `word_count` to only retain "doyle", assign it to `doyle_most`
```{r, eval = eval_mining}
doyle_most <- word_count %>%
  filter(author == "doyle")
```
5. Use `wordcloud` to visualize the top 50 words used by Doyle that have more than 5 characters
```{r, eval = eval_mining}
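# One way to do it, mirroring the Twain cloud above
doyle_most %>%
  filter(nchar(word) > 5) %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9"))
  )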
```
6. Use `anti_join()` to figure out which words are used by Doyle but not Twain. Order the results by word count.
```{r, eval = eval_mining}
doyle_unique <- doyle_most %>%
  anti_join(twain_most, by = "word") %>%
  arrange(desc(n))
```
7. Use `wordcloud` to visualize the top 50 records from the previous step
```{r, eval = eval_mining}
doyle_unique %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9"))
)
```
8. Find out how many times Twain used the word "sherlock"
```{r, eval = eval_mining}
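# twain_most already holds Twain's word counts, so filter for "sherlock"
twain_most %>%
  filter(word == "sherlock")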
```
9. Against the `twain` variable, use Hive's *lower* to make every line lowercase, and then use *instr* to look for "sherlock" in the line
```{r, eval = eval_mining}
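# lower() and instr() are Hive SQL functions passed through by dbplyr;
# instr() returns the match position, or 0 when "sherlock" is absent
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)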
```
10. Close Spark session
```{r, eval = eval_mining}
spark_disconnect(sc)
```
Most of these lines are from a short story by Mark Twain called [A Double Barrelled Detective Story](https://www.gutenberg.org/files/3180/3180-h/3180-h.htm#link2H_4_0008). According to the [Wikipedia](https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story) page about this story, it is a satire by Twain on the mystery novel genre, published in 1902.