
Commit b9964a4

Authored: Jul 27, 2023
GH-696: A bug when using word_tokenize (#697)
* add VLSP2013_WTK dataset
* update UTS_WTK dataset
* retrain model with VLSP2013_WTK dataset
1 parent ae5505d commit b9964a4

30 files changed: +241 / -249 lines

‎.gitignore

+2 / -1

@@ -71,4 +71,5 @@ target/
 wandb
 
 node_modules
-.vscode
+.vscode
+.DS_Store

‎.gitmodules

+3

@@ -10,3 +10,6 @@
 [submodule "datasets/UTS_Dictionary"]
 	path = datasets/UTS_Dictionary
 	url = https://huggingface.co/datasets/undertheseanlp/UTS_Dictionary
+[submodule "datasets/VLSP2013_WTK"]
+	path = datasets/VLSP2013_WTK
+	url = https://huggingface.co/datasets/undertheseanlp/VLSP2013_WTK

‎datasets/.gitignore

+1 / -1

@@ -1 +1 @@
-datasets
+.DS_Store

‎datasets/UTS_Dictionary

Submodule UTS_Dictionary updated from 72deff8 to 8d70336

‎datasets/UTS_WTK

Submodule UTS_WTK updated from 91d7b1f to 5289438

‎datasets/VLSP2013_WTK

Submodule VLSP2013_WTK added at 85435e0

‎examples/word_tokenize/.gitignore

+1

@@ -0,0 +1 @@
outputs

‎examples/word_tokenize/README.md

+27 / -1

@@ -1,3 +1,29 @@
 # Word Tokenization
 
-* [Google Colab](https://colab.research.google.com/drive/1NR9NlHJDj5_wywRze7yQizw1DgICKL72?usp=sharing)
+* [Google Colab](https://colab.research.google.com/drive/1NR9NlHJDj5_wywRze7yQizw1DgICKL72?usp=sharing)
+* [Technical Report](technical_report.md)
+
+## Usage
+
+### Training
+
+Train and Evaluate Model:
+
+```
+python train.py ++'dataset_extras.include_test=False'
+python train.py +dataset=vlsp2013_wtk
+```
+
+Train the Final Model (Including Test Split):
+
+```
+python train.py ++'dataset_extras.include_test=True'
+```
+
+### Inference
+
+Generate labels with the trained model:
+
+```
+python predict.py +'output_dir="tmp/ws_20230725/"' +'text="Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"'
+```
+7 (new file; path not shown in this view, presumably examples/word_tokenize/conf/config.yaml given train.py's config_path="conf/" and config_name="config")

@@ -0,0 +1,7 @@
defaults:
  - dataset/uts_wtk
  - train

dataset_extras:
  train_samples: -1
  include_test: false
+4 (new file; path not shown in this view, presumably examples/word_tokenize/conf/dataset/uts_wtk.yaml)

@@ -0,0 +1,4 @@
name: undertheseanlp/UTS_WTK
subset: large
params:
  revision: 1.0
+4 (new file; path not shown in this view, presumably examples/word_tokenize/conf/dataset/vlsp2013_wtk.yaml)

@@ -0,0 +1,4 @@
name: undertheseanlp/VLSP2013_WTK
subset: null
params:
  revision: 1.0.0
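Aside (not part of the commit): a minimal sketch of how train.py, shown further below, turns a dataset config like the ones above into a `datasets.load_dataset` call. The literal values are copied from the UTS_WTK config; passing `revision` as a string, and checking the subset with `is not None` rather than for key presence, are simplifications for illustration.

```python
from datasets import load_dataset

name = "undertheseanlp/UTS_WTK"   # cfg.dataset.name
subset = "large"                  # cfg.dataset.subset (null for VLSP2013_WTK)
params = {"revision": "1.0"}      # cfg.dataset.params, forwarded as keyword arguments

# train.py passes the subset positionally when one is configured,
# and falls back to the plain call otherwise.
if subset is not None:
    dataset = load_dataset(name, subset, **params)
else:
    dataset = load_dataset(name, **params)
```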
+9 (new file; path not shown in this view, presumably examples/word_tokenize/conf/train.yaml)

@@ -0,0 +1,9 @@
train:
  output_dir: tmp/ws_20230727
  params:
    c1: 1.0
    c2: 1e-3
    max_iterations: 1000
    feature:
      possible_transitions: true
      possible_states: true

‎examples/word_tokenize/predict.py

+5 / -2

@@ -1,9 +1,12 @@
 from os.path import dirname, join
 from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger
+from underthesea.pipeline.word_tokenize.regex_tokenize import tokenize
 
-output_dir = join(dirname(__file__), "tmp/ws_20220222")
+output_dir = join(dirname(__file__), "tmp/ws_202307270300")
 sentence = "Quỳnh Như tiết lộ với báo Bồ Đào Nha về hành trình làm nên lịch sử"
-tokens = sentence.split()
+sentence = "Thời Trần, những người đứng đầu xã được gọi là Xã quan."
+sentence = "Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức (chiếm 61% dân số và 64% lãnh thổ)."
+tokens = tokenize(sentence)
 tokens_ = [[token] for token in tokens]
 
 model = FastCRFSequenceTagger()
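Aside (not part of the commit): the switch from `sentence.split()` to `tokenize(sentence)` above is easy to motivate with the new example sentence. A plain whitespace split leaves punctuation glued to the neighbouring syllables, while a word-level tokenizer separates it first. The regex below is only an illustration, not underthesea's actual `regex_tokenize` implementation.

```python
import re

sentence = "Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức (chiếm 61% dân số và 64% lãnh thổ)."

# Whitespace split keeps punctuation attached: ..., '(chiếm', '61%', ..., 'thổ).'
print(sentence.split())

# A simple Unicode-aware tokenizer splits punctuation off: ..., '(', 'chiếm', '61', '%', ..., 'thổ', ')', '.'
print(re.findall(r"\w+|[^\w\s]", sentence))
```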
+3 (new file; path not shown in this view, presumably examples/word_tokenize/requirements.txt)

@@ -0,0 +1,3 @@
seqeval
datasets
hydra-core
+72 (new file; presumably examples/word_tokenize/technical_report.md, the "Technical Report" linked from the README)

@@ -0,0 +1,72 @@
# Vietnamese Word Segmentation with underthesea

```
Author: Vu Anh
Date: July 27, 2023
```

Vietnamese Word Segmentation (VWS) plays a pivotal role in many Natural Language Processing (NLP) tasks for the Vietnamese language. Segmentation is the process of splitting a sequence of characters into meaningful chunks, or "words". It is particularly challenging for Vietnamese because a single word often spans several space-separated syllables, so word boundaries cannot be read off the whitespace without context.

Over the years, numerous models have been proposed for this task, with Conditional Random Fields (CRF) among the most prominent because they take the surrounding context into account when making segmentation decisions. This report presents our experiments with a CRF model for VWS on the `UTS_WTK` and `VLSP2013_WTK` datasets.

## Methods

### Datasets

For our experiments, we used two datasets: `UTS_WTK` and `VLSP2013_WTK`.

Both are tailored for Vietnamese Word Segmentation and cover a wide range of textual domains, which allows a rigorous appraisal of the model's performance. Using both datasets gives a more comprehensive evaluation and a more robust validation of the methodology.

### Conditional Random Fields (CRF)

Conditional Random Fields (CRFs) are a statistical modeling method for structured prediction. Introduced by Lafferty et al. (2001), they are widely used for sequence labeling. For Vietnamese Word Segmentation, the CRF models the sequence of labels (whether each token starts a new word or continues the previous one) given the sequence of input tokens, and it conditions on the surrounding context when predicting the label for each token.
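Aside (not part of the commit): a minimal sketch of this sequence-labeling view. The BI-style tags and the decoding loop are illustrative only; the actual tag set used in the UTS_WTK and VLSP2013_WTK corpora may differ.

```python
# Each syllable gets a tag: B-W starts a word, I-W continues the previous word.
tokens = ["Quỳnh", "Như", "tiết", "lộ", "với", "báo", "Bồ", "Đào", "Nha"]
labels = ["B-W", "I-W", "B-W", "I-W", "B-W", "B-W", "B-W", "I-W", "I-W"]

# Decoding: group consecutive syllables until the next B-W tag.
words, current = [], []
for token, label in zip(tokens, labels):
    if label == "B-W" and current:
        words.append(" ".join(current))
        current = []
    current.append(token)
if current:
    words.append(" ".join(current))

print(words)  # ['Quỳnh Như', 'tiết lộ', 'với', 'báo', 'Bồ Đào Nha']
```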
**Feature Engineering**

For our CRF model, we leveraged a range of token-level features. The full list of feature templates is given below:

| Feature Type | Features |
| ----------------- | ---------------------------------------------------------------------------- |
| Unigram | `T[-2]`, `T[-1]`, `T[0]`, `T[1]`, `T[2]` |
| Bigram | `T[-2,-1]`, `T[-1,0]`, `T[0,1]`, `T[1,2]`, `T[-2,0]`, `T[-1,1]`, `T[0,2]` |
| Lowercase Unigram | `T[-2].lower`, `T[-1].lower`, `T[0].lower`, `T[1].lower`, `T[2].lower` |
| Lowercase Bigram | `T[-2,-1].lower`, `T[-1,0].lower`, `T[0,1].lower`, `T[1,2].lower` |
| Is Digit | `T[-1].isdigit`, `T[0].isdigit`, `T[1].isdigit` |
| Is Title | `T[-2].istitle`, `T[-1].istitle`, `T[0].istitle`, `T[1].istitle`, `T[2].istitle`, `T[0,1].istitle`, `T[0,2].istitle` |
| Is in Dictionary | `T[-2].is_in_dict`, `T[-1].is_in_dict`, `T[0].is_in_dict`, `T[1].is_in_dict`, `T[2].is_in_dict`, `T[-2,-1].is_in_dict`, `T[-1,0].is_in_dict`, `T[0,1].is_in_dict`, `T[1,2].is_in_dict`, `T[-2,0].is_in_dict`, `T[-1,1].is_in_dict`, `T[0,2].is_in_dict` |
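Aside (not part of the commit): a rough sketch of what these templates mean at a single position `i` of a sentence. The real extractor lives in `underthesea.transformer.tagged_feature`; the function, sentinel values, and toy dictionary below are illustrative only.

```python
def extract_features(tokens, i, dictionary):
    def tok(j):
        # Out-of-range positions fall back to sentinel values.
        if i + j < 0:
            return "BOS"
        if i + j >= len(tokens):
            return "EOS"
        return tokens[i + j]

    def span(j, k):
        # T[j,k]: the syllables from relative position j to k, space-joined.
        return " ".join(tok(p) for p in range(j, k + 1))

    return {
        "T[0]": tok(0),
        "T[-1]": tok(-1),
        "T[-1,0]": span(-1, 0),
        "T[0,1]": span(0, 1),
        "T[0].lower": tok(0).lower(),
        "T[0].isdigit": tok(0).isdigit(),
        "T[0].istitle": tok(0).istitle(),
        "T[0].is_in_dict": tok(0).lower() in dictionary,
        "T[0,1].is_in_dict": span(0, 1).lower() in dictionary,
    }

dictionary = {"tiết lộ", "bồ đào nha"}
tokens = ["Quỳnh", "Như", "tiết", "lộ"]
print(extract_features(tokens, 2, dictionary))
# {'T[0]': 'tiết', 'T[-1]': 'Như', ..., 'T[0,1]': 'tiết lộ', ..., 'T[0,1].is_in_dict': True}
```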
## Results

The table below summarizes the results of the Vietnamese Word Segmentation task using the Conditional Random Fields (CRF) model:

| Dataset | Model | F1 Score |
|:----------------|:-----------|---------:|
| UTS_WTK (1.0.0) | CRF | 0.977 |
| VLSP2013_WTK | CRF | 0.973 |

## Conclusion

Our experiments with Conditional Random Fields for Vietnamese Word Segmentation on the `UTS_WTK` and `VLSP2013_WTK` datasets yield promising results. The feature engineering described above captures the nuances of the Vietnamese language effectively, as the achieved F1 scores show.

## Integration

After training on the full `VLSP2013_WTK` dataset, we obtained a model checkpoint dated `20230727`. This model was integrated into the `underthesea` toolkit in version `6.6.0`.
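Aside (not part of the commit): once the checkpoint ships with underthesea 6.6.0, end users reach it through the library's public `word_tokenize` API. The exact segmentation shown in the comments is illustrative, not a recorded output.

```python
from underthesea import word_tokenize

text = "Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"
print(word_tokenize(text))
# e.g. ['Chưa', 'tiết lộ', 'lịch trình', 'tới', 'Việt Nam', 'của', 'Tổng thống', 'Mỹ', 'Donald Trump']
print(word_tokenize(text, format="text"))
# e.g. 'Chưa tiết_lộ lịch_trình tới Việt_Nam của Tổng_thống Mỹ Donald_Trump'
```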
## References

```
@misc{UTS_WTK,
  title={UTS_WTK: a Vietnamese Word Segmentation Dataset},
  author={Vu Anh},
  year={2022}
}

@inproceedings{Lafferty2001ConditionalRF,
  title={Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data},
  author={John D. Lafferty and Andrew McCallum and Fernando Pereira},
  booktitle={International Conference on Machine Learning},
  year={2001}
}
```

‎examples/word_tokenize/train.py

+59 / -69

@@ -1,81 +1,71 @@
-from os.path import dirname, join
+import hydra
+from hydra.utils import get_original_cwd
+from omegaconf import DictConfig, OmegaConf
+
+from os.path import join
 from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger
 from underthesea.trainers.crf_trainer import CRFTrainer
 from underthesea.transformer.tagged_feature import lower_words as dictionary
 from datasets import load_dataset
 from underthesea.utils.preprocess_dataset import preprocess_word_tokenize_dataset
 
-features = [
-    # word unigram and bigram and trigram
-    "T[-2]",
-    "T[-1]",
-    "T[0]",
-    "T[1]",
-    "T[2]",
-    "T[-2,-1]",
-    "T[-1,0]",
-    "T[0,1]",
-    "T[1,2]",
-    "T[-2,0]",
-    "T[-1,1]",
-    "T[0,2]",
-    "T[-2].lower",
-    "T[-1].lower",
-    "T[0].lower",
-    "T[1].lower",
-    "T[2].lower",
-    "T[-2,-1].lower",
-    "T[-1,0].lower",
-    "T[0,1].lower",
-    "T[1,2].lower",
-    "T[-1].isdigit",
-    "T[0].isdigit",
-    "T[1].isdigit",
-    "T[-2].istitle",
-    "T[-1].istitle",
-    "T[0].istitle",
-    "T[1].istitle",
-    "T[2].istitle",
-    "T[0,1].istitle",
-    "T[0,2].istitle",
-    "T[-2].is_in_dict",
-    "T[-1].is_in_dict",
-    "T[0].is_in_dict",
-    "T[1].is_in_dict",
-    "T[2].is_in_dict",
-    "T[-2,-1].is_in_dict",
-    "T[-1,0].is_in_dict",
-    "T[0,1].is_in_dict",
-    "T[1,2].is_in_dict",
-    "T[-2,0].is_in_dict",
-    "T[-1,1].is_in_dict",
-    "T[0,2].is_in_dict",
-]
-model = FastCRFSequenceTagger(features, dictionary)
 
-pwd = dirname(__file__)
-output_dir = join(pwd, "tmp/ws_20220224")
-training_params = {
-    "output_dir": output_dir,
-    "params": {
-        "c1": 1.0,  # coefficient for L1 penalty
-        "c2": 1e-3,  # coefficient for L2 penalty
-        "max_iterations": 1000,
-        # include transitions that are possible, but not observed
-        "feature.possible_transitions": True,
-        "feature.possible_states": True,
-    },
-}
+@hydra.main(version_base=None, config_path="conf/", config_name="config")
+def train(cfg: DictConfig) -> None:
+    wd = get_original_cwd()
+    print(OmegaConf.to_yaml(cfg))
+
+    features = [
+        # word unigram and bigram and trigram
+        "T[-2]", "T[-1]", "T[0]", "T[1]", "T[2]",
+        "T[-2,-1]", "T[-1,0]", "T[0,1]", "T[1,2]", "T[-2,0]",
+        "T[-1,1]", "T[0,2]",
+        "T[-2].lower", "T[-1].lower", "T[0].lower", "T[1].lower", "T[2].lower",
+        "T[-2,-1].lower", "T[-1,0].lower", "T[0,1].lower", "T[1,2].lower",
+        "T[-1].isdigit", "T[0].isdigit", "T[1].isdigit",
+        "T[-2].istitle", "T[-1].istitle", "T[0].istitle", "T[1].istitle", "T[2].istitle",
+        "T[0,1].istitle", "T[0,2].istitle",
+        "T[-2].is_in_dict", "T[-1].is_in_dict", "T[0].is_in_dict", "T[1].is_in_dict", "T[2].is_in_dict",
+        "T[-2,-1].is_in_dict", "T[-1,0].is_in_dict",
+        "T[0,1].is_in_dict", "T[1,2].is_in_dict", "T[-2,0].is_in_dict",
+        "T[-1,1].is_in_dict", "T[0,2].is_in_dict",
+    ]
+    model = FastCRFSequenceTagger(features, dictionary)
+
+    training_params = {
+        "output_dir": join(wd, cfg.train.output_dir),
+        "params": {
+            "c1": cfg.train.params.c1,  # coefficient for L1 penalty
+            "c2": cfg.train.params.c2,  # coefficient for L2 penalty
+            "max_iterations": cfg.train.params.max_iterations,
+            # include transitions that are possible, but not observed
+            "feature.possible_transitions": cfg.train.params.feature.possible_transitions,
+            "feature.possible_states": cfg.train.params.feature.possible_states,
+        },
+    }
+
+    dataset_name = cfg.dataset.name
+    dataset_params = cfg.dataset.params
+
+    # Check if subset exists in the config and load the dataset accordingly
+    if "subset" in cfg.dataset:
+        dataset_subset = cfg.dataset.subset
+        dataset = load_dataset(dataset_name, dataset_subset, **dataset_params)
+    else:
+        dataset = load_dataset(dataset_name, **dataset_params)
 
+    corpus = preprocess_word_tokenize_dataset(dataset)
 
-dataset = load_dataset("undertheseanlp/UTS_WTK", "base")
-corpus = preprocess_word_tokenize_dataset(dataset)
+    train_dataset = corpus["train"]
+    test_dataset = corpus["test"]
+    if cfg.dataset_extras.include_test:
+        train_dataset = train_dataset + test_dataset
+    print("Train dataset", len(train_dataset))
+    print("Test dataset", len(test_dataset))
 
-train_dataset = corpus["train"]
-test_dataset = corpus["test"]
-print("Train dataset", len(train_dataset))
-print("Test dataset", len(test_dataset))
+    trainer = CRFTrainer(model, training_params, train_dataset, test_dataset)
+    trainer.train()
 
-trainer = CRFTrainer(model, training_params, train_dataset, test_dataset)
 
-trainer.train()
+if __name__ == "__main__":
+    train()

‎tests/pipeline/chunking/test_chunk.py

+1 / -1

@@ -13,4 +13,4 @@ def test_simple_cases(self):
     def test_accuracy(self):
         output = chunk(
             u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
-        self.assertEqual(len(output), 13)
+        self.assertEqual(len(output), 11)

‎tests/pipeline/ipa/test_ipa.py

+1 / -3

@@ -11,11 +11,9 @@ def test_1(self):
         self.assertEqual(expected, actual)
 
     def test_2(self):
-        # text = "cún"
         text = "chật"
         actual = viet2ipa(text)
-        # expected = "kun³⁴"
-        expected = "ʨɤ̆t¹⁰ˀ"
+        expected = "tɕət²¹ˀ"
         self.assertEqual(expected, actual)
 
     def test_3(self):

‎tests/pipeline/ipa/tests.txt

+4 / -4

@@ -4,10 +4,10 @@ bói,ɓɔj²⁴
 xoong,sɔːŋ³³
 oong,ʔɔːŋ³³
 giếng,ziəŋ²⁴
-hư,³³
-hương,hɯəŋ³³
-vơ,³³
-tơ,³³
+hư,³³
+hương,hɨəŋ³³
+vơ,³³
+tơ,³³
 em,ʔɛm³³
 goá,ɣʷa³⁴
 hoa,hʷa³³

‎tests/pipeline/pos_tag/test_pos_tag.py

+2 / -2

@@ -11,5 +11,5 @@ def test_simple_cases(self):
         self.assertEqual(actual, expected)
 
     def test_accuracy(self):
-        output = pos_tag(u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
-        self.assertEqual(len(output), 13)
+        actual = pos_tag(u"Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa")
+        self.assertEqual(len(actual), 11)

‎tests/pipeline/pos_tag/test_pos_tag_v2.py

+3 / -3

@@ -13,6 +13,6 @@ def test_simple_cases(self):
     def test_accuracy(self):
         text = "Tổng Bí thư: Ai trót để tay nhúng chàm thì hãy sớm tự gột rửa"
         output = pos_tag(text, model="v2.0")
-        self.assertEqual(len(output), 13)
-        self.assertEqual(output[4][0], "để")
-        self.assertEqual(output[4][1], "E")
+        self.assertEqual(len(output), 11)
+        self.assertEqual(output[4][0], "tay")
+        self.assertEqual(output[4][1], "N")
