
Commit b9964a4

Authored: Jul 27, 2023
GH-696: A bug when using word_tokenize (#697)
* add VLSP2013_WTK dataset
* update UTS_WTK dataset
* retrain model with VLSP2013_WTK dataset
1 parent ae5505d commit b9964a4

30 files changed: +241 / -249 lines

‎.gitignore

+2 / -1

@@ -71,4 +71,5 @@ target/
 wandb
 
 node_modules
-.vscode
+.vscode
+.DS_Store

‎.gitmodules

+3

@@ -10,3 +10,6 @@
 [submodule "datasets/UTS_Dictionary"]
 	path = datasets/UTS_Dictionary
 	url = https://huggingface.co/datasets/undertheseanlp/UTS_Dictionary
+[submodule "datasets/VLSP2013_WTK"]
+	path = datasets/VLSP2013_WTK
+	url = https://huggingface.co/datasets/undertheseanlp/VLSP2013_WTK

‎datasets/.gitignore

+1 / -1

@@ -1 +1 @@
-datasets
+.DS_Store

‎datasets/UTS_Dictionary

Submodule UTS_Dictionary updated from 72deff8 to 8d70336

‎datasets/UTS_WTK

Submodule UTS_WTK updated from 91d7b1f to 5289438

‎datasets/VLSP2013_WTK

Submodule VLSP2013_WTK added at 85435e0

‎examples/word_tokenize/.gitignore

+1

@@ -0,0 +1 @@
outputs

‎examples/word_tokenize/README.md

+27 / -1

@@ -1,3 +1,29 @@
 # Word Tokenization
 
-* [Google Colab](https://colab.research.google.com/drive/1NR9NlHJDj5_wywRze7yQizw1DgICKL72?usp=sharing)
+* [Google Colab](https://colab.research.google.com/drive/1NR9NlHJDj5_wywRze7yQizw1DgICKL72?usp=sharing)
+* [Technical Report](technical_report.md)
+
+## Usage
+
+### Training
+
+Train and Evaluate Model:
+
+```
+python train.py ++'dataset_extras.include_test=False'
+python train.py +dataset=vlsp2013_wtk
+```
+
+Train the Final Model (Including Test Split):
+
+```
+python train.py ++'dataset_extras.include_test=True'
+```
+
+### Inference
+
+Generate labels with the trained model:
+
+```
+python predict.py +'output_dir="tmp/ws_20230725/"' +'text="Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"'
+```
+7 (new file; path not shown in this view, presumably examples/word_tokenize/conf/config.yaml given train.py's config_path="conf/" and config_name="config")

@@ -0,0 +1,7 @@
defaults:
  - dataset/uts_wtk
  - train

dataset_extras:
  train_samples: -1
  include_test: false
+4 (new file; path not shown in this view, presumably examples/word_tokenize/conf/dataset/uts_wtk.yaml)

@@ -0,0 +1,4 @@
name: undertheseanlp/UTS_WTK
subset: large
params:
  revision: 1.0
+4 (new file; path not shown in this view, presumably examples/word_tokenize/conf/dataset/vlsp2013_wtk.yaml)

@@ -0,0 +1,4 @@
name: undertheseanlp/VLSP2013_WTK
subset: null
params:
  revision: 1.0.0
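Aside (not part of the commit): a minimal sketch of how train.py, shown further below, turns a dataset config like the ones above into a `datasets.load_dataset` call. The literal values are copied from the UTS_WTK config; passing `revision` as a string, and checking the subset with `is not None` rather than for key presence, are simplifications for illustration.

```python
from datasets import load_dataset

name = "undertheseanlp/UTS_WTK"   # cfg.dataset.name
subset = "large"                  # cfg.dataset.subset (null for VLSP2013_WTK)
params = {"revision": "1.0"}      # cfg.dataset.params, forwarded as keyword arguments

# train.py passes the subset positionally when one is configured,
# and falls back to the plain call otherwise.
if subset is not None:
    dataset = load_dataset(name, subset, **params)
else:
    dataset = load_dataset(name, **params)
```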
+9 (new file; path not shown in this view, presumably examples/word_tokenize/conf/train.yaml)

@@ -0,0 +1,9 @@
train:
  output_dir: tmp/ws_20230727
  params:
    c1: 1.0
    c2: 1e-3
    max_iterations: 1000
    feature:
      possible_transitions: true
      possible_states: true

‎examples/word_tokenize/predict.py

+5 / -2

@@ -1,9 +1,12 @@
 from os.path import dirname, join
 from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger
+from underthesea.pipeline.word_tokenize.regex_tokenize import tokenize
 
-output_dir = join(dirname(__file__), "tmp/ws_20220222")
+output_dir = join(dirname(__file__), "tmp/ws_202307270300")
 sentence = "Quỳnh Như tiết lộ với báo Bồ Đào Nha về hành trình làm nên lịch sử"
-tokens = sentence.split()
+sentence = "Thời Trần, những người đứng đầu xã được gọi là Xã quan."
+sentence = "Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức (chiếm 61% dân số và 64% lãnh thổ)."
+tokens = tokenize(sentence)
 tokens_ = [[token] for token in tokens]
 
 model = FastCRFSequenceTagger()
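Aside (not part of the commit): the switch from `sentence.split()` to `tokenize(sentence)` above is easy to motivate with the new example sentence. A plain whitespace split leaves punctuation glued to the neighbouring syllables, while a word-level tokenizer separates it first. The regex below is only an illustration, not underthesea's actual `regex_tokenize` implementation.

```python
import re

sentence = "Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức (chiếm 61% dân số và 64% lãnh thổ)."

# Whitespace split keeps punctuation attached: ..., '(chiếm', '61%', ..., 'thổ).'
print(sentence.split())

# A simple Unicode-aware tokenizer splits punctuation off: ..., '(', 'chiếm', '61', '%', ..., 'thổ', ')', '.'
print(re.findall(r"\w+|[^\w\s]", sentence))
```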
+3 (new file; path not shown in this view, presumably examples/word_tokenize/requirements.txt)

@@ -0,0 +1,3 @@
seqeval
datasets
hydra-core
+72 (new file; presumably examples/word_tokenize/technical_report.md, the "Technical Report" linked from the README)

@@ -0,0 +1,72 @@
# Vietnamese Word Segmentation with underthesea

```
Author: Vu Anh
Date: July 27, 2023
```

Vietnamese Word Segmentation (VWS) plays a pivotal role in many Natural Language Processing (NLP) tasks for the Vietnamese language. Segmentation is the process of splitting a sequence of characters into meaningful chunks, or "words". It is particularly challenging for Vietnamese because a single word often spans several space-separated syllables, so word boundaries cannot be read off the whitespace without context.

Over the years, numerous models have been proposed for this task, with Conditional Random Fields (CRF) among the most prominent because they take the surrounding context into account when making segmentation decisions. This report presents our experiments with a CRF model for VWS on the `UTS_WTK` and `VLSP2013_WTK` datasets.

## Methods

### Datasets

For our experiments, we used two datasets: `UTS_WTK` and `VLSP2013_WTK`.

Both are tailored for Vietnamese Word Segmentation and cover a wide range of textual domains, which allows a rigorous appraisal of the model's performance. Using both datasets gives a more comprehensive evaluation and a more robust validation of the methodology.

### Conditional Random Fields (CRF)

Conditional Random Fields (CRFs) are a statistical modeling method for structured prediction. Introduced by Lafferty et al. (2001), they are widely used for sequence labeling. For Vietnamese Word Segmentation, the CRF models the sequence of labels (whether each token starts a new word or continues the previous one) given the sequence of input tokens, and it conditions on the surrounding context when predicting the label for each token.
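Aside (not part of the commit): a minimal sketch of this sequence-labeling view. The BI-style tags and the decoding loop are illustrative only; the actual tag set used in the UTS_WTK and VLSP2013_WTK corpora may differ.

```python
# Each syllable gets a tag: B-W starts a word, I-W continues the previous word.
tokens = ["Quỳnh", "Như", "tiết", "lộ", "với", "báo", "Bồ", "Đào", "Nha"]
labels = ["B-W", "I-W", "B-W", "I-W", "B-W", "B-W", "B-W", "I-W", "I-W"]

# Decoding: group consecutive syllables until the next B-W tag.
words, current = [], []
for token, label in zip(tokens, labels):
    if label == "B-W" and current:
        words.append(" ".join(current))
        current = []
    current.append(token)
if current:
    words.append(" ".join(current))

print(words)  # ['Quỳnh Như', 'tiết lộ', 'với', 'báo', 'Bồ Đào Nha']
```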
**Feature Engineering**

For our CRF model, we leveraged a range of token-level features. The full list of feature templates is given below:

| Feature Type | Features |
| ----------------- | ---------------------------------------------------------------------------- |
| Unigram | `T[-2]`, `T[-1]`, `T[0]`, `T[1]`, `T[2]` |
| Bigram | `T[-2,-1]`, `T[-1,0]`, `T[0,1]`, `T[1,2]`, `T[-2,0]`, `T[-1,1]`, `T[0,2]` |
| Lowercase Unigram | `T[-2].lower`, `T[-1].lower`, `T[0].lower`, `T[1].lower`, `T[2].lower` |
| Lowercase Bigram | `T[-2,-1].lower`, `T[-1,0].lower`, `T[0,1].lower`, `T[1,2].lower` |
| Is Digit | `T[-1].isdigit`, `T[0].isdigit`, `T[1].isdigit` |
| Is Title | `T[-2].istitle`, `T[-1].istitle`, `T[0].istitle`, `T[1].istitle`, `T[2].istitle`, `T[0,1].istitle`, `T[0,2].istitle` |
| Is in Dictionary | `T[-2].is_in_dict`, `T[-1].is_in_dict`, `T[0].is_in_dict`, `T[1].is_in_dict`, `T[2].is_in_dict`, `T[-2,-1].is_in_dict`, `T[-1,0].is_in_dict`, `T[0,1].is_in_dict`, `T[1,2].is_in_dict`, `T[-2,0].is_in_dict`, `T[-1,1].is_in_dict`, `T[0,2].is_in_dict` |
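Aside (not part of the commit): a rough sketch of what these templates mean at a single position `i` of a sentence. The real extractor lives in `underthesea.transformer.tagged_feature`; the function, sentinel values, and toy dictionary below are illustrative only.

```python
def extract_features(tokens, i, dictionary):
    def tok(j):
        # Out-of-range positions fall back to sentinel values.
        if i + j < 0:
            return "BOS"
        if i + j >= len(tokens):
            return "EOS"
        return tokens[i + j]

    def span(j, k):
        # T[j,k]: the syllables from relative position j to k, space-joined.
        return " ".join(tok(p) for p in range(j, k + 1))

    return {
        "T[0]": tok(0),
        "T[-1]": tok(-1),
        "T[-1,0]": span(-1, 0),
        "T[0,1]": span(0, 1),
        "T[0].lower": tok(0).lower(),
        "T[0].isdigit": tok(0).isdigit(),
        "T[0].istitle": tok(0).istitle(),
        "T[0].is_in_dict": tok(0).lower() in dictionary,
        "T[0,1].is_in_dict": span(0, 1).lower() in dictionary,
    }

dictionary = {"tiết lộ", "bồ đào nha"}
tokens = ["Quỳnh", "Như", "tiết", "lộ"]
print(extract_features(tokens, 2, dictionary))
# {'T[0]': 'tiết', 'T[-1]': 'Như', ..., 'T[0,1]': 'tiết lộ', ..., 'T[0,1].is_in_dict': True}
```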
## Results

The table below summarizes the results of the Vietnamese Word Segmentation task using the Conditional Random Fields (CRF) model:

| Dataset | Model | F1 Score |
|:----------------|:-----------|---------:|
| UTS_WTK (1.0.0) | CRF | 0.977 |
| VLSP2013_WTK | CRF | 0.973 |

## Conclusion

Our experiments with Conditional Random Fields for Vietnamese Word Segmentation on the `UTS_WTK` and `VLSP2013_WTK` datasets yield promising results. The feature engineering described above captures the nuances of the Vietnamese language effectively, as the achieved F1 scores show.

## Integration

After training on the full `VLSP2013_WTK` dataset, we obtained a model checkpoint dated `20230727`. This model was integrated into the `underthesea` toolkit in version `6.6.0`.
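Aside (not part of the commit): once the checkpoint ships with underthesea 6.6.0, end users reach it through the library's public `word_tokenize` API. The exact segmentation shown in the comments is illustrative, not a recorded output.

```python
from underthesea import word_tokenize

text = "Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"
print(word_tokenize(text))
# e.g. ['Chưa', 'tiết lộ', 'lịch trình', 'tới', 'Việt Nam', 'của', 'Tổng thống', 'Mỹ', 'Donald Trump']
print(word_tokenize(text, format="text"))
# e.g. 'Chưa tiết_lộ lịch_trình tới Việt_Nam của Tổng_thống Mỹ Donald_Trump'
```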
## References

```
@misc{UTS_WTK,
  title={UTS_WTK: a Vietnamese Word Segmentation Dataset},
  author={Vu Anh},
  year={2022}
}

@inproceedings{Lafferty2001ConditionalRF,
  title={Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data},
  author={John D. Lafferty and Andrew McCallum and Fernando Pereira},
  booktitle={International Conference on Machine Learning},
  year={2001}
}
```

‎examples/word_tokenize/train.py

+59 / -69

@@ -1,81 +1,71 @@
-from os.path import dirname, join
+import hydra
+from hydra.utils import get_original_cwd
+from omegaconf import DictConfig, OmegaConf
+
+from os.path import join
 from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger
 from underthesea.trainers.crf_trainer import CRFTrainer
 from underthesea.transformer.tagged_feature import lower_words as dictionary
 from datasets import load_dataset
 from underthesea.utils.preprocess_dataset import preprocess_word_tokenize_dataset
 
-features = [
-    # word unigram and bigram and trigram
-    "T[-2]",
-    "T[-1]",
-    "T[0]",
-    "T[1]",
-    "T[2]",
-    "T[-2,-1]",
-    "T[-1,0]",
-    "T[0,1]",
-    "T[1,2]",
-    "T[-2,0]",
-    "T[-1,1]",
-    "T[0,2]",
-    "T[-2].lower",
-    "T[-1].lower",
-    "T[0].lower",
-    "T[1].lower",
-    "T[2].lower",
-    "T[-2,-1].lower",
-    "T[-1,0].lower",
-    "T[0,1].lower",
-    "T[1,2].lower",
-    "T[-1].isdigit",
-    "T[0].isdigit",
-    "T[1].isdigit",
-    "T[-2].istitle",
-    "T[-1].istitle",
-    "T[0].istitle",
-    "T[1].istitle",
-    "T[2].istitle",
-    "T[0,1].istitle",
-    "T[0,2].istitle",
-    "T[-2].is_in_dict",
-    "T[-1].is_in_dict",
-    "T[0].is_in_dict",
-    "T[1].is_in_dict",
-    "T[2].is_in_dict",
-    "T[-2,-1].is_in_dict",
-    "T[-1,0].is_in_dict",
-    "T[0,1].is_in_dict",
-    "T[1,2].is_in_dict",
-    "T[-2,0].is_in_dict",
-    "T[-1,1].is_in_dict",
-    "T[0,2].is_in_dict",
-]
-model = FastCRFSequenceTagger(features, dictionary)
 
-pwd = dirname(__file__)
-output_dir = join(pwd, "tmp/ws_20220224")
-training_params = {
-    "output_dir": output_dir,
-    "params": {
-        "c1": 1.0,  # coefficient for L1 penalty
-        "c2": 1e-3,  # coefficient for L2 penalty
-        "max_iterations": 1000,
-        # include transitions that are possible, but not observed
-        "feature.possible_transitions": True,
-        "feature.possible_states": True,
-    },
-}
+@hydra.main(version_base=None, config_path="conf/", config_name="config")
+def train(cfg: DictConfig) -> None:
+    wd = get_original_cwd()
+    print(OmegaConf.to_yaml(cfg))
+
+    features = [
+        # word unigram and bigram and trigram
+        "T[-2]", "T[-1]", "T[0]", "T[1]", "T[2]",
+        "T[-2,-1]", "T[-1,0]", "T[0,1]", "T[1,2]", "T[-2,0]",
+        "T[-1,1]", "T[0,2]",
+        "T[-2].lower", "T[-1].lower", "T[0].lower", "T[1].lower", "T[2].lower",
+        "T[-2,-1].lower", "T[-1,0].lower", "T[0,1].lower", "T[1,2].lower",
+        "T[-1].isdigit", "T[0].isdigit", "T[1].isdigit",
+        "T[-2].istitle", "T[-1].istitle", "T[0].istitle", "T[1].istitle", "T[2].istitle",
+        "T[0,1].istitle", "T[0,2].istitle",
+        "T[-2].is_in_dict", "T[-1].is_in_dict", "T[0].is_in_dict", "T[1].is_in_dict", "T[2].is_in_dict",
+        "T[-2,-1].is_in_dict", "T[-1,0].is_in_dict",
+        "T[0,1].is_in_dict", "T[1,2].is_in_dict", "T[-2,0].is_in_dict",
+        "T[-1,1].is_in_dict", "T[0,2].is_in_dict",
+    ]
+    model = FastCRFSequenceTagger(features, dictionary)
+
+    training_params = {
+        "output_dir": join(wd, cfg.train.output_dir),
+        "params": {
+            "c1": cfg.train.params.c1,  # coefficient for L1 penalty
+            "c2": cfg.train.params.c2,  # coefficient for L2 penalty
+            "max_iterations": cfg.train.params.max_iterations,
+            # include transitions that are possible, but not observed
+            "feature.possible_transitions": cfg.train.params.feature.possible_transitions,
+            "feature.possible_states": cfg.train.params.feature.possible_states,
+        },
+    }
+
+    dataset_name = cfg.dataset.name
+    dataset_params = cfg.dataset.params
+
+    # Check if subset exists in the config and load the dataset accordingly
+    if "subset" in cfg.dataset:
+        dataset_subset = cfg.dataset.subset
+        dataset = load_dataset(dataset_name, dataset_subset, **dataset_params)
+    else:
+        dataset = load_dataset(dataset_name, **dataset_params)
 
+    corpus = preprocess_word_tokenize_dataset(dataset)
 
-dataset = load_dataset("undertheseanlp/UTS_WTK", "base")
-corpus = preprocess_word_tokenize_dataset(dataset)
+    train_dataset = corpus["train"]
+    test_dataset = corpus["test"]
+    if cfg.dataset_extras.include_test:
+        train_dataset = train_dataset + test_dataset
+    print("Train dataset", len(train_dataset))
+    print("Test dataset", len(test_dataset))
 
-train_dataset = corpus["train"]
-test_dataset = corpus["test"]
-print("Train dataset", len(train_dataset))
-print("Test dataset", len(test_dataset))
+    trainer = CRFTrainer(model, training_params, train_dataset, test_dataset)
+    trainer.train()
 
-trainer = CRFTrainer(model, training_params, train_dataset, test_dataset)
 
-trainer.train()
+if __name__ == "__main__":
+    train()

‎tests/pipeline/chunking/test_chunk.py

+1 / -1

@@ -13,4 +13,4 @@ def test_simple_cases(self):
     def test_accuracy(self):
         output = chunk(
             u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
-        self.assertEqual(len(output), 13)
+        self.assertEqual(len(output), 11)

‎tests/pipeline/ipa/test_ipa.py

+1 / -3

@@ -11,11 +11,9 @@ def test_1(self):
         self.assertEqual(expected, actual)
 
     def test_2(self):
-        # text = "cún"
         text = "chật"
         actual = viet2ipa(text)
-        # expected = "kun³⁴"
-        expected = "ʨɤ̆t¹⁰ˀ"
+        expected = "tɕət²¹ˀ"
         self.assertEqual(expected, actual)
 
     def test_3(self):

‎tests/pipeline/ipa/tests.txt

+4 / -4

@@ -4,10 +4,10 @@ bói,ɓɔj²⁴
 xoong,sɔːŋ³³
 oong,ʔɔːŋ³³
 giếng,ziəŋ²⁴
-hư,³³
-hương,hɯəŋ³³
-vơ,³³
-tơ,³³
+hư,³³
+hương,hɨəŋ³³
+vơ,³³
+tơ,³³
 em,ʔɛm³³
 goá,ɣʷa³⁴
 hoa,hʷa³³

‎tests/pipeline/pos_tag/test_pos_tag.py

+2 / -2

@@ -11,5 +11,5 @@ def test_simple_cases(self):
         self.assertEqual(actual, expected)
 
     def test_accuracy(self):
-        output = pos_tag(u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
-        self.assertEqual(len(output), 13)
+        actual = pos_tag(u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
+        self.assertEqual(len(actual), 11)

‎tests/pipeline/pos_tag/test_pos_tag_v2.py

+3 / -3

@@ -13,6 +13,6 @@ def test_simple_cases(self):
     def test_accuracy(self):
         text = "Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa"
         output = pos_tag(text, model="v2.0")
-        self.assertEqual(len(output), 13)
-        self.assertEqual(output[4][0], "để")
-        self.assertEqual(output[4][1], "E")
+        self.assertEqual(len(output), 11)
+        self.assertEqual(output[4][0], "tay")
+        self.assertEqual(output[4][1], "N")
