
Commit 9f5d175

committed
the project is finished and I added a complete README.md
1 parent a1b5a17 commit 9f5d175

File tree

5 files changed: +89 −8 lines changed

Text_Predication/README.md

+69
@@ -0,0 +1,69 @@
# Text Prediction Using an N-gram Language Model

<p align="center">
  <img src="assets/predict.jpg" width=40% height=40%>
</p>

## ⚙️ Languages or Frameworks Used

This Python project uses the Natural Language Toolkit (NLTK) library to implement an N-gram language model. The code relies on the following packages (a short example of how they fit together appears after the list):
### Packages

1. **Pandas:** A data manipulation library used for handling and analyzing tabular data.

2. **NLTK (Natural Language Toolkit):**
   - `bigrams`: Function for extracting bigrams from a sequence of words.
   - `lm.preprocessing.pad_both_ends`: Preprocessing function for padding both ends of a sequence.
   - `tokenize.WordPunctTokenizer`: Tokenizer for breaking text into words using punctuation and whitespace.
   - `lm.Vocabulary`: Class for constructing a vocabulary from a given text corpus.
   - `lm.Laplace`: Class implementing Laplace smoothing for language modeling.
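To make the roles of these pieces concrete, here is a minimal sketch (an illustrative snippet with a made-up sentence, not code from this repository) showing what the tokenizer, `pad_both_ends`, and `bigrams` produce:

```python
from nltk import bigrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.tokenize import WordPunctTokenizer

# Tokenize on punctuation and whitespace, pad the sentence, then take bigrams.
tokens = WordPunctTokenizer().tokenize("How do I read a CSV file?")
padded = list(pad_both_ends(tokens, n=2))   # ['<s>', 'How', 'do', ..., '?', '</s>']
pairs = list(bigrams(padded))               # [('<s>', 'How'), ('How', 'do'), ...]
print(pairs)
```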
## 🛠️ Description

### N-gram Language Model Project

This Python project implements a text prediction system using a Laplace-smoothed bigram model. The goal is to predict the next word in a sentence based on a prefix provided by the user. The project uses the Natural Language Toolkit (NLTK) library for processing and modeling natural language data.

### How It Works
1. **Data Preprocessing:**
   - The project starts by reading a CSV file (`train.csv`) containing text data.
   - HTML tags are removed from the 'Body' column of the dataset using a function called `remove_html_tags`.
   - The text is tokenized using the `WordPunctTokenizer` from NLTK.

2. **N-gram Model Building:**
   - Each tokenized sentence is padded at both ends with the special symbols `<s>` and `</s>` using the `pad_both_ends` function.
   - Bigrams are extracted from the padded sentences using the `bigrams` function.
   - The vocabulary is constructed using the `Vocabulary` class from NLTK.

3. **Laplace Smoothing Model:**
   - The Laplace smoothing model is implemented using the `Laplace` class from NLTK's language modeling module.
   - The model is trained on the bigram data.

4. **Next-word Prediction:**
   - The user is prompted for a prefix to predict the next word from.
   - The Laplace model scores each word in the vocabulary by how likely it is to follow the given prefix.
   - The top three predictions are displayed with their scores (see the sketch after this list).
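The four steps above can be sketched end to end as follows. This is a minimal, self-contained example: it uses a tiny in-memory corpus instead of `train.csv`, and names such as `toy_corpus` and `prefix` are illustrative rather than part of the project code.

```python
from nltk import bigrams
from nltk.lm import Laplace, Vocabulary
from nltk.lm.preprocessing import pad_both_ends
from nltk.tokenize import WordPunctTokenizer

# A tiny stand-in for the 'Body' column of train.csv.
toy_corpus = [
    "how do i read a csv file in pandas",
    "how do i parse json in python",
    "how to merge two dataframes in pandas",
]

# 1. Preprocessing: tokenize each sentence.
tokenizer = WordPunctTokenizer()
tokenized = [tokenizer.tokenize(text) for text in toy_corpus]

# 2. N-gram building: pad with <s>/</s>, extract bigrams, build the vocabulary.
padded = [list(pad_both_ends(sent, n=2)) for sent in tokenized]
train_bigrams = [list(bigrams(sent)) for sent in padded]
voc = Vocabulary([word for sent in padded for word in sent], unk_cutoff=1)

# 3. Laplace smoothing: train a bigram model; score(w, [u]) = (count(u, w) + 1) / (count(u) + V).
model = Laplace(2, vocabulary=voc)
model.fit(train_bigrams)

# 4. Next-word prediction: score every vocabulary word after the prefix, keep the top three.
prefix = "in"
scores = {w: model.score(w, [prefix]) for w in voc if w not in ("<s>", "<UNK>")}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word}: {score:.4f}")
```

Because Laplace smoothing adds one to every bigram count, even words never seen after the prefix still receive a small non-zero probability, so the model can always rank candidates.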
## 🌟 How to Run

1. **Install** the required dependencies:

```bash
pip install -r requirements.txt
```

2. **Run** the code:

```bash
python text_prediction.py
```

## 📺 Demo

![](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExN3BndnM1M2tnaWhlbjkxczJmcndzenh2bnlhaWFkZWR2YWhqNDg0ZSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/NIQIoC9vc7xBEPOCPY/giphy.gif)

## 🤖 Author

Links: [louisbau](https://github.com/louisbau)

Text_Predication/assets/demo.PNG

5.13 KB

Text_Predication/assets/predict.jpg

170 KB
File renamed without changes.

Text_Predication/text_prediction.py

+20 −8
```diff
@@ -1,29 +1,28 @@
-import math
-
 import pandas as pd
 from nltk import bigrams
 from nltk.lm.preprocessing import pad_both_ends
-
 from nltk.tokenize import WordPunctTokenizer
-from nltk.probability import FreqDist
 from nltk.lm import Vocabulary
 from nltk.lm import Laplace


 def remove_html_tags(text):
+    # Function to remove HTML tags from text
     import re
     clean = re.compile('<.*?>')
     return re.sub(clean, '', text)


 def remap_corpus(path):
+    # Read CSV file, preprocess the 'Body' column, and tokenize the text
     df_corpus = pd.read_csv(path)
     df_corpus['Body'] = df_corpus['Body'].apply(lambda x: remove_html_tags(x))
     df_corpus['Body_tokenized'] = df_corpus['Body'].apply(lambda x: WordPunctTokenizer().tokenize(x))
     return df_corpus


 def padding_corpus(corpus):
+    # Pad both ends of each sentence in the corpus
     corpus_padding = []
     for sentence in corpus:
         corpus_padding.append(
@@ -33,13 +32,15 @@ def padding_corpus(corpus):


 def remap_bigram(corpus):
+    # Extract bigrams from each sentence in the corpus
     corpus_bigram = []
     for sentence in corpus:
         corpus_bigram.append(list(bigrams(sentence)))
     return corpus_bigram


 def vocab(corpus):
+    # Create a vocabulary list from the corpus
     voc_list = []
     for sentence in corpus:
         for word in sentence:
@@ -48,6 +49,7 @@ def vocab(corpus):


 def prediction(train, prefix):
+    # Perform next-word prediction using Laplace smoothing
     train = padding_corpus(train)
     voc = vocab(train)
     LaplaceModel = Laplace(2, vocabulary=voc)
@@ -62,10 +64,20 @@ def prediction(train, prefix):


 if __name__ == '__main__':
-    path_train = "./train.csv"
+    # Main execution
+    print("Text Prediction using Laplace Smoothing")
+    print("--------------------------------------")
+    print("Reading training data...")
+
+    path_train = "corpus/train.csv"
     corpus_train = remap_corpus(path_train)['Body_tokenized']
+    print("Training data is ready!")
+    print("--------------------------------------")
     user_input = input("Enter a prefix for next-word prediction: ")
-    prediction = prediction(corpus_train, user_input)
-    for i in prediction:
-        print(f"Next word predictionc can be : {i}")
+    print("--------------------------------------")
+    print("Performing next-word prediction...")

+    prediction_result = prediction(corpus_train, user_input)
+    print("------- Result of Prediction ---------")
+    for i in prediction_result:
+        print(f"Next word prediction can be : {i}")
```
