The main focus of this project is the application of topic modeling to short documents collected from the social media platform Twitter. The algorithm used for this purpose is Latent Dirichlet Allocation (LDA), one of the simplest topic models. The Apache Spark engine, together with the underlying Hadoop Distributed File System (HDFS), is used to distribute work across all nodes/machines.
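For orientation, the core of such a pipeline can be expressed with Spark's ml API roughly as in the sketch below; the object name LdaSketch, the column names, the vocabulary size, and the topic count are illustrative assumptions rather than the project's actual code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, CountVectorizer}
import org.apache.spark.ml.clustering.LDA

// Hypothetical standalone sketch of an LDA pipeline over tweets; the real project code may differ.
object LdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("topic-modelling-sketch").getOrCreate()
    import spark.implicits._

    // Load raw tweets, assuming one tweet per line at the given path (e.g. an HDFS URL)
    val tweets = spark.read.textFile(args(0)).toDF("text")

    // Tokenize on non-word characters and drop English stop words
    val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens").setPattern("\\W+")
    val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
    val tokens = remover.transform(tokenizer.transform(tweets))

    // Turn each tweet into a term-count vector, which is the input LDA expects
    val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(10000)
    val vectors = vectorizer.fit(tokens).transform(tokens)

    // Fit LDA with an illustrative number of topics and print the top terms per topic
    val model = new LDA().setK(10).setMaxIter(50).fit(vectors)
    model.describeTopics(5).show(truncate = false)

    spark.stop()
  }
}
```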
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- SBT 0.13.12
- Apache Spark 2.1.0
- Scala 2.11.0
- Hadoop 2.7.3
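For reference, a minimal build.sbt matching these versions might look like the following sketch; the exact dependency list and artifact versions are assumptions and may differ from the project's actual build definition.

```scala
// Hypothetical build.sbt; the real project file may declare different settings and versions.
name := "topic-modelling"

version := "0.1"

scalaVersion := "2.11.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"       % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"        % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mllib"      % "2.1.0" % "provided",
  "edu.stanford.nlp"  % "stanford-corenlp" % "3.7.0",
  "edu.stanford.nlp"  % "stanford-corenlp" % "3.7.0" classifier "models",
  "org.scalatest"    %% "scalatest"        % "3.0.1" % "test"
)
```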
The dataset has been collected from the Twitter platform using the TwitterCollector script.
- Run the following commands for the initial project setup:
git clone http://github.com/arajski/topic-modelling
cd topic-modelling
- Edit the submit-spark.sh file to make sure it contains correct paths for Hadoop and Apache Spark (the file contains a sample configuration).
- To run the application and submit it to the Apache Spark cluster, execute the following script with an HDFS URL as a parameter. The URL should point to the directory where the data files are stored.
./submit-spark.sh hdfs_url
The first run will download all dependencies, including the Stanford CoreNLP library, compile the solution, and run the test suites.
To run the test suites, simply run sbt test. Test cases are available in the src/test/scala directory.
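For illustration, a suite under that directory might look like the sketch below; the TokenizerSpec name, its tokenization logic, and the assumption that ScalaTest is on the test classpath are hypothetical rather than taken from the project.

```scala
import org.scalatest.{FlatSpec, Matchers}

// Hypothetical test suite illustrating the structure of tests under src/test/scala
class TokenizerSpec extends FlatSpec with Matchers {

  "A tokenizer" should "split a tweet into lower-cased terms" in {
    // Simple stand-in tokenization: lower-case and split on non-word characters
    val tokens = "Spark makes LDA easy".toLowerCase.split("\\W+").toSeq
    tokens should contain allOf ("spark", "lda")
  }
}
```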
- Stanford CoreNLP - Library used for Natural Language Processing (a brief usage sketch follows this list)
- Apache Spark - Data processing and task distribution engine
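As a rough illustration of how the CoreNLP dependency can be used for text preprocessing, the snippet below lemmatizes a sample text; the annotator choice and the surrounding code are assumptions, not the project's actual preprocessing step.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

// Hypothetical lemmatization sketch using Stanford CoreNLP
object LemmatizeSketch {
  def main(args: Array[String]): Unit = {
    // Build a CoreNLP pipeline with the annotators required for lemmatization
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)

    // Annotate a sample text and collect the lemma of each token
    val doc = new Annotation("Running Spark jobs over collected tweets")
    pipeline.annotate(doc)
    val lemmas = doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(_.get(classOf[CoreAnnotations.LemmaAnnotation]))

    // Print the lemmatized tokens of the sample text
    println(lemmas.mkString(" "))
  }
}
```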
This project is licensed under the MIT License - see the LICENSE file for details