This repository was archived by the owner on May 4, 2021. It is now read-only.

Commit cd6378d

Adding Phase 1 description to readme
1 parent 091dd86 commit cd6378d

1 file changed: +14 -0 lines changed

README.md

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,17 @@
 # DataCollection
 
+Collecting data for machine translation training from CommonCrawl is a two-phase process, illustrated in the following diagram:
+
 ![CommonCrawl process diagram](/common_crawl_process.png?raw=true "CommonCrawl data collection process")
+
+## Phase 1: Language annotation, building a meta-data database and monolingual data extraction
+
+The first phase detects the language of each web page in the crawl and collects other meta-data. A database is built from this data and can be accessed via a RESTful web API.
+
+In this phase, monolingual data for language model training can also be generated. The data for some of the CommonCrawl crawls and some languages can be found at:
+
+* http://statmt.org/ngrams/
+* http://www.statmt.org/wmt16/translation-task.html
+
+For more details on the monolingual data see [ModernMT Deliverable 2.1](http://www.modernmt.eu/deliverables/mmt-d2-1-report-on-data-repository/).
+
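
The README text in this commit says the Phase 1 meta-data database can be accessed via a RESTful web API, but it does not document any endpoints. Purely as an illustration, the sketch below shows how a client might build a lookup URL for a single page. The base URL, the `/query` path, the `url` and `crawl` parameter names, and the crawl identifier in the demo are all hypothetical placeholders, not the project's actual API.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the README does not name the real server.
BASE_URL = "http://localhost:8080"


def build_metadata_query(page_url, crawl=None, base_url=BASE_URL):
    """Build a GET URL asking the (assumed) meta-data API about one page.

    The '/query' path and the 'url' / 'crawl' parameter names are
    illustrative assumptions, not taken from the project.
    """
    params = {"url": page_url}
    if crawl is not None:
        # Optionally restrict the lookup to a single crawl.
        params["crawl"] = crawl
    # urlencode percent-encodes the values so the page URL is safe to embed.
    return f"{base_url}/query?{urlencode(params)}"


if __name__ == "__main__":
    # "2015_27" is a made-up crawl identifier used only for this demo.
    print(build_metadata_query("http://example.com/", crawl="2015_27"))
```

Keeping URL construction in its own function means the request itself can be sent with whatever HTTP client is preferred once the real endpoint and parameters are known.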
