This Python project uses the Natural Language Toolkit (NLTK) library to implement an N-gram language model. The code relies on the following packages:
### Packages
1. **Pandas:** A data manipulation library used for handling and analyzing tabular data.
2. **NLTK (Natural Language Toolkit):**
   - `bigrams`: Function for extracting bigrams from a sequence of words.
   - `lm.preprocessing.pad_both_ends`: Function for padding both ends of a sequence.
   - `tokenize.WordPunctTokenizer`: Tokenizer for breaking text into words using punctuation and whitespace.
   - `lm.Vocabulary`: Class for constructing a vocabulary from a given text corpus.
   - `lm.Laplace`: Class implementing Laplace (add-one) smoothing for language modeling.
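Pulled together, the imports look like this (a minimal sketch against NLTK 3.x and pandas):

```python
import pandas as pd

from nltk import bigrams
from nltk.lm import Laplace, Vocabulary
from nltk.lm.preprocessing import pad_both_ends
from nltk.tokenize import WordPunctTokenizer
```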
## 🛠️ Description
### N-gram Language Model Project
This Python project implements a text prediction system using a bigram language model with Laplace smoothing. The goal is to predict the next word of a sentence from a given prefix. The project uses the Natural Language Toolkit (NLTK) library to process and model natural-language data.
### How It Works
1. **Data Preprocessing:**
   - The project starts by reading a CSV file (`train.csv`) containing text data.
   - HTML tags are removed from the `Body` column of the dataset using a function called `remove_html_tags`.
   - The text is tokenized using the `WordPunctTokenizer` from NLTK.
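   A sketch of this step (the regex inside `remove_html_tags` and the lowercasing are assumptions; only the function name, the file, and the `Body` column come from the project):

   ```python
   import re
   import pandas as pd
   from nltk.tokenize import WordPunctTokenizer

   def remove_html_tags(text):
       # Assumed implementation: strip anything shaped like an HTML tag.
       return re.sub(r"<[^>]+>", "", str(text))

   df = pd.read_csv("train.csv")
   df["Body"] = df["Body"].apply(remove_html_tags)

   # One token list per row; lowercasing is an assumption to keep the vocabulary small.
   tokenizer = WordPunctTokenizer()
   corpus = [tokenizer.tokenize(body.lower()) for body in df["Body"]]
   ```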
2. **N-gram Model Building:**
   - Each tokenized sentence is padded on both ends with the special symbols `<s>` and `</s>` using the `pad_both_ends` function.
   - Bigrams are extracted from the padded sentences using the `bigrams` function.
   - The vocabulary is constructed using the `Vocabulary` class from NLTK.
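   Continuing from the preprocessing sketch above (the `unk_cutoff` value is an assumption):

   ```python
   from itertools import chain

   from nltk import bigrams
   from nltk.lm import Vocabulary
   from nltk.lm.preprocessing import pad_both_ends

   # Pad each tokenized sentence with <s>/</s>, then take its bigrams.
   padded = [list(pad_both_ends(sent, n=2)) for sent in corpus]
   train_bigrams = [list(bigrams(sent)) for sent in padded]

   # Build the vocabulary over all padded tokens (unk_cutoff=1 keeps every word).
   vocab = Vocabulary(chain.from_iterable(padded), unk_cutoff=1)
   ```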
3. **Laplace Smoothing Model:**
   - The Laplace smoothing model is implemented using the `Laplace` class from NLTK's language modeling module.
   - The model is trained on the bigram data.
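   A sketch of this step (keyword arguments as in NLTK 3.x):

   ```python
   from nltk.lm import Laplace

   # Order-2 model with add-one (Laplace) smoothing over the bigram counts.
   model = Laplace(2, vocabulary=vocab)
   model.fit(train_bigrams)
   ```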
4. **Next-word Prediction:**
   - The user is prompted for a prefix whose next word should be predicted.
   - The Laplace model scores each word in the vocabulary by how likely it is to follow the given prefix.
   - The top three predictions, with their scores, are displayed.
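   A sketch of this step (the prompt text and the choice to use only the prefix's last token as the bigram context are assumptions):

   ```python
   # Use the last token of the prefix as the bigram context.
   prefix = input("Enter a prefix: ")
   context = [tokenizer.tokenize(prefix.lower())[-1]]

   # Score every vocabulary word after that context and keep the best three.
   scores = {word: model.score(word, context) for word in model.vocab}
   for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
       print(f"{word}: {score:.4f}")
   ```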