MiniProject 3 TextMining #9
base: master (changes from 3 commits)
Project Write-up and Reflection.ipynb
@@ -0,0 +1,87 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I used the Project Gutenberg as my data source. I picked the book <i> The Adventure of Sherlock Holmes </i> by Arthur Conan Doyle. I analyzed the source by figuring out the word frequencies and computing the summary statistics such as the total number of words, total different number of words, the most common words, etc. I also processed the source through sentiment analysis. Through this mini-project, I hoped to learn how to retrieve information from the internet and process that information through different techniques and come up with a valid and resonable conclusion." | ||
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I started this project by first requesting the data file from the internet, which is Project Gutenberg in this case. However, as I realized that Project Gutenberg places a limit on how many times we can pull the file within 24 hours, I decided to pickle the data instead of pulling the text from the website everytime I run the program to avoid any future impediments imposed by such limit. Before the text was pickled and saved to my disk, I needed to modify the text by stripping out the irrelevant information such as the preamble. I did so by spliting the text to a list and selecting the appropriate list of actual text. After appropriate text was pickled to my disk, I commented out the pickle dump and had the pickle load in the beginning of the program. \n", | ||
"\n", | ||
"With the text file ready, I started analyzing the text. In order to find out the frequency of each word, I could use the dictionary to do so. But I had to get rid of all the punctuation and whitespace and keep all words lowercase. So, in order to process the entire text, I needed to process by line first. It is like figuring out how to tackle the issue from the big picture first and see how to make the big picture work by making the smaller components of that big picture work. After I had a dictionary of all the words, which contained the value and key,I could easily compute the most common words by sorting the keys.\n", | ||
"\n", | ||
"Another analysis I did to the text was the sentiment analysis. I imported the vadar sentiment analyzer and tried to print the result of the sentiment for the text. However, I ran into the issue that the function only takes strings. So I had to make the entire text one big string so I would have one sentiment result rather than one sentiment result for each line. " | ||
]
},
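{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the counting step described above (the helper name `word_counts` is just for illustration; the project script implements the same idea in `process_line`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"def word_counts(text):\n",
"    # Map each cleaned, lowercased word to how often it appears.\n",
"    hist = {}\n",
"    for word in text.replace('-', ' ').split():\n",
"        word = word.strip(string.punctuation + string.whitespace).lower()\n",
"        if word:\n",
"            hist[word] = hist.get(word, 0) + 1\n",
"    return hist\n",
"\n",
"hist = word_counts('The dog -- the very same dog -- barked.')\n",
"# Sort (count, word) pairs so the most frequent words come first.\n",
"print(sorted(((c, w) for w, c in hist.items()), reverse=True)[:3])"
]
},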
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After I did some text analysis, I had a number of summary statistics. I found that \"the total number of words in the Sherlock text is 105358\" and 'the book is consisted of a total number of 8019 different words in this book. This is very interesting finding because it proves that we do not have to know majority of words to write a book. Each word is on average repeated about 13 times in the book. As matter of fact, based on the second edition of the Oxford English Dictionary, which contains full entries for 171,476 words in current use, and 47,156 obsolete words, we only need less than 4% of English vocab to write a book.\n", | ||
"\n", | ||
"The next analysis I did was to find out the top 10 most common words. I got the result of \"the\",\"and\",\"i\",\"to\",\"of\",\"a\",\"in\",\"that\",\"it\",\"you\". These words are typically high frequent words in English outside of this context. Despite the fact that this book only used 8019 different words, it still provides a very good representation of most common used words in English. I also set a function to ramdonly select 10 words out of the 8017 different words each time the program is run to see if I could find any interesting discovery from it. Yet, so far, I had not yet found any unique things that stood out to me. Last but not least, I ran the sentiment analysis on the full text and got the follwoing result: {'neg': 0.078, 'neu': 0.83, 'pos': 0.092, 'compound': 1.0}. From this analysis, we can conclude that most of the book is neutral with a slightly positive sentiment over negative sentiment. " | ||
]
},
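{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the arithmetic above (the word totals come from this analysis; the dictionary counts are the cited OED figures):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total_words = 105358\n",
"distinct_words = 8019\n",
"oed_entries = 171476 + 47156  # current-use plus obsolete OED entries\n",
"\n",
"print(total_words / distinct_words)  # about 13.1 occurrences per distinct word\n",
"print(distinct_words / oed_entries)  # about 0.037, i.e. under 4% of the vocabulary"
]
},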
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reflection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From a process point of view, I think I learned a lot through this mini-project. Before this project, I never thought of applying a text analysis on things we do daily, such as looking up information on twitter, facebook, wikipedia, etc. While doing this project, I learned how to pull data from internet using pickling. I really like pickling as it prevents the system from blocking me out. However, I could improve my text analysis by applying more techniques, such as graphs. I was not very familar with matplot. So I did not use it on this project. I wish I knew a little more about it before this project. But I will learn more on that and hopefully applying that to my future projects. I will definitely apply text analysis in the future to get some quality information from the data that is available to everyone online." | ||
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
README.md
@@ -1,3 +1,13 @@
# TextMining

This is the base repo for the text mining and analysis project for Software Design at Olin College.

## Open your terminal on your computer
Type the command:

    $ pip install nltk requests vaderSentiment

## Run the TextMining python file
If you want to recreate sherlock.txt, open the Python script and uncomment lines 10-16. You can re-comment those lines after sherlock.txt has been recreated.
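A typical invocation afterwards looks like this (the exact script filename is an assumption here; use whatever the `.py` file in this repo is named):

    $ python textmining.py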

## Project Write-up and Reflection
Please check my **Project Write-up and Reflection.ipynb** in the TextMining repo.
If you are unable to locate the file, you can access it here: https://github.com/leoliuuu/TextMining/blob/master/Project%20Write-up%20and%20Reflection.ipynb
@@ -0,0 +1,84 @@
import requests
import pickle
import string
import random
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# The data file is downloaded once; the text has been saved through pickle.
# The text needs to be re-pickled when a new user runs the analysis:
# uncomment lines 10-16, run the file, then comment that part out again
# after the text is pickled to the user's computer.

# the_adventure_of_sherlock_holmes = requests.get('http://www.gutenberg.org/ebooks/1661.txt.utf-8').text
# line_split = the_adventure_of_sherlock_holmes.split('\n')
# modified_sherlock_text = line_split[57:12682]
#
# f = open('sherlock_texts.pickle', 'wb')
# pickle.dump(modified_sherlock_text, f)
# f.close()

input_file = open('sherlock_texts.pickle', 'rb')
reloaded_copy_of_texts = pickle.load(input_file)  # reload the pickled file
input_file.close()

# Open a writable text file and convert the pickled list back into a string
sherlock_texts = open('sherlock.txt', 'w')
sherlock_texts.write('\n'.join(reloaded_copy_of_texts))
sherlock_texts.close()

def process_line(line, hist):
    """Add each cleaned, lowercased word in line to the histogram hist.

    Hyphens are treated as word separators, surrounding punctuation is
    stripped, and counting is case-insensitive.
    """
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        if word:  # skip tokens that were pure punctuation
            hist[word] = hist.get(word, 0) + 1

> **Review comment:** Please add docstrings for all the functions you write. You can take a look at the Google style used here: https://github.com/sd17fall/GeneFinder/blob/formatted/gene_finder.py. I cannot figure out what this function is doing unless I go read every line. When the code gets longer, having no docstrings is really frustrating for the reviewer.

def process_text(text_file):
    """Return a dictionary mapping each word in text_file to its frequency."""
    hist = dict()
    with open(text_file) as fp:
        for line in fp:
            process_line(line, hist)
    return hist

def most_common(hist):
    """Return (frequency, word) pairs sorted from most to least common."""
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort()
    t.reverse()
    return t

def total_words(hist):
    """Return the total number of words counted in hist."""
    return sum(hist.values())


def different_words(hist):
    """Return the number of distinct words in hist."""
    return len(hist)

def random_word(hist):
    """Return a word chosen uniformly at random from the distinct words in hist."""
    return random.choice(list(hist.keys()))


def main():
    hist = process_text('sherlock.txt')
    print('\nThe total number of words in the Sherlock text is {}.'.format(total_words(hist)))

    print('\nThe book consists of a total of {} different words.'.format(different_words(hist)))

    t = most_common(hist)
    print('\nThe most common words are:')
    for freq, word in t[:10]:
        print(word, '\t', freq)

    print("\nHere are some random words from the book:")
    for i in range(10):
        print(random_word(hist))

    # Join the whole book into one string so VADER returns a single score
    data = open('sherlock.txt').read().replace('\n', ' ')
    analyzer = SentimentIntensityAnalyzer()
    print(analyzer.polarity_scores(data))


if __name__ == '__main__':
    main()
> **Review comment:** You can use """ comments """ instead of putting # for every line of code. This is called a header comment. It usually contains a brief description of the code, just like what you've written here.