_be/tokenization.md
+13-14
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,6 @@
1
1
---
2
2
layout: base
3
3
title: 'Tokenization'
4
-
permalink: be/overview/tokenization.html
5
4
---
6
5
7
6
# Tokenization
@@ -11,18 +10,18 @@ The low-level tokenization of the Belarusian UD treebank generally adopts the RN
11
10
* In general, tokens are delimited by whitespace. The regexp [А-zА-яЁёУўі\-]+ usually corresponds to one token.
12
11
* Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
13
12
* Each punctuation mark is treated as a single token, e.g. the following sequence: <b>)", -</b> becomes four tokens, <b>)</b> , <b>"</b>, <b>,</b>, and <b>-</b>. Exceptions are conventional multi-character punctuation marks: <b>--</b> , <b>...</b> , <b>?!</b> , etc., and emojis and smileys: <b>:)</b> , <b>^_^</b>, etc.
14
-
* Conventional non-cyrillic multi-character terms are tokenized as single tokens: <b>°С</b>, <b>км2</b>.
15
-
16
-
Some special cases worth mentioning:
17
-
* Numerical expressions including decimal numbers, such as <b>245</b>, <b>3,14</b>, are treated as single tokens.
18
-
* Time expressions like <b>20:55</b> are split into separate tokens (in this case, three { <b>20</b> , <b>:</b> , <b>55</b> }).
19
-
* Dates like <b>20.04.2012</b> are split into separate tokens (in this case, five { <b>20</b> , <b>.</b> , <b>04</b> , <b>.</b> , <b>2012</b> }).
20
-
* Special symbols before and after numerical expressions, as in <b>$500</b> , <b>2,67%</b> , <b>+27°С</b> , are tokenised separately (so, the tokens are { <b>$</b> , <b>500</b> } , { <b>2,67</b> , <b>%</b> } , { <b>+</b> , <b>27</b> , <b>°С</b> }).
21
-
* Numerical expressions with hyphen and cyrillic endings (e.g. <b>1-ый</b> “1st”, <b>3-м</b> “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. <b>79-гадовы</b> “79 year old”, <b>500-годдзе</b> “500th anniversary”) are treated as single tokens.
22
-
* Other words with hyphen are treated as single tokens, except for the cases when the first part is inflected. Examples: { <b>з-за</b> } “because of”, { <b>зялёна-шэрых</b> } “green-gray”, { <b>Санкт-Пецярбург</b> } “St. Petersburg”, but { <b>Ростове</b> , <b>-</b> , <b>на</b> , <b>-</b> , <b>Дону</b>} “(in) Rostov on Don”.
13
+
* Conventional non-cyrillic multi-character terms are tokenized as single tokens: <b>°С</b>, <b>км2</b>.
14
+
15
+
Some special cases worth mentioning:
16
+
* Numerical expressions including decimal numbers, such as <b>245</b>, <b>3,14</b>, are treated as single tokens.
17
+
* Time expressions like <b>20:55</b> are split into separate tokens (in this case, three { <b>20</b> , <b>:</b> , <b>55</b> }).
18
+
* Dates like <b>20.04.2012</b> are split into separate tokens (in this case, five { <b>20</b> , <b>.</b> , <b>04</b> , <b>.</b> , <b>2012</b> }).
19
+
* Special symbols before and after numerical expressions, as in <b>$500</b> , <b>2,67%</b> , <b>+27°С</b> , are tokenised separately (so, the tokens are { <b>$</b> , <b>500</b> } , { <b>2,67</b> , <b>%</b> } , { <b>+</b> , <b>27</b> , <b>°С</b> }).
20
+
* Numerical expressions with hyphen and cyrillic endings (e.g. <b>1-ый</b> “1st”, <b>3-м</b> “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. <b>79-гадовы</b> “79 year old”, <b>500-годдзе</b> “500th anniversary”) are treated as single tokens.
21
+
* Other words with hyphen are treated as single tokens, except for the cases when the first part is inflected. Examples: { <b>з-за</b> } “because of”, { <b>зялёна-шэрых</b> } “green-gray”, { <b>Санкт-Пецярбург</b> } “St. Petersburg”, but { <b>Ростове</b> , <b>-</b> , <b>на</b> , <b>-</b> , <b>Дону</b>} “(in) Rostov on Don”.
23
22
* Abbreviations are treated as single tokens; whitespace splits abbreviations.
24
23
* Abbreviations marked by a period, as in <b>стр.</b> “p. (page)”, <b>П.</b> “P. (for Peter)”, are treated as single tokens. If the period overlaps with the end of sentence period then it is written once as a separate token (denoting end-of-sentence), e.g. { <b>1914</b> , <b>г</b> , <b>.</b> } “year 1914”.
25
-
* Abbreviations cannot contain a period inside, i.e. the patterns like <b>і т.д.</b> “and so on”, <b>да т.п.</b> “and so forth” are split into three tokens: { <b>i</b> , <b>т.</b> , <b>д.</b> }, { <b>да</b> , <b>т.</b> , <b>п.</b> }.
24
+
* Abbreviations cannot contain a period inside, i.e. the patterns like <b>і т.д.</b> “and so on”, <b>да т.п.</b> “and so forth” are split into three tokens: { <b>i</b> , <b>т.</b> , <b>д.</b> }, { <b>да</b> , <b>т.</b> , <b>п.</b> }.
26
25
* Email addresses, URLs, and tweet-style names are treated as single tokens: {[email protected]}, {https://github.com}, {@anna_li}
27
26
28
27
The Belarusian UD treebank does not contain multiword tokens.
@@ -35,11 +34,11 @@ The Belarusian UD treebank does not contain multiword tokens.
* reflexive verbs are kept as a single token (orthographic rule): <b>з'яўляецца</b> “is”.
37
+
* reflexive verbs are kept as a single token (orthographic rule): <b>з'яўляецца</b> “is”.
39
38
* the forms of subjunctive mood, analytical passive, complex future tense, complex comparatives, etc. are split
40
-
according to the orthographic principle: { <b>маглі</b> , <b>б</b> } “would be able, could”, { <b>былі</b> , <b>зафіксаваныя</b> } “were recorded”, { <b>будзе</b> , <b>трымацца</b> } “will be held”, { <b>больш</b> , <b>сур'ёзныя</b> } “more serious”
39
+
according to the orthographic principle: { <b>маглі</b> , <b>б</b> } “would be able, could”, { <b>былі</b> , <b>зафіксаваныя</b> } “were recorded”, { <b>будзе</b> , <b>трымацца</b> } “will be held”, { <b>больш</b> , <b>сур'ёзныя</b> } “more serious”
41
40
* <b>не</b> and <b>ні</b> used as negation markers with verbs, pronouns and other words are tokenized according to the orthographic rules: { <b>не</b> , <b>рэагуючы</b> } “not reacting”, { <b>ні</b> , <b>ў</b> , <b>каго</b> } “at no one”, but { <b>непапраўнай</b> } “irrecoverable”, { <b>незавершаны</b> } “not finished”, { <b>ніякіх</b> } “no one”.
42
-
* паў- and напаў- “half” are never kept separate: <b>паўбеспрацоўны</b> “half-unemployed”, <b>напаўзабыты</b> “half-forgotten”.
41
+
* паў- and напаў- “half” are never kept separate: <b>паўбеспрацоўны</b> “half-unemployed”, <b>напаўзабыты</b> “half-forgotten”.
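A few of the Belarusian rules above can be approximated with an ordered regular expression. The following is only a minimal sketch, not the treebank's actual tokenizer: the names `CYR`, `WORD`, `TOKEN_RE` and `tokenize` are hypothetical, and the pattern deliberately leaves out rules that need lexical or morphological knowledge (abbreviations such as <b>т.</b>, inflected first parts as in <b>Ростове-на-Дону</b>, email addresses, smileys).

```python
import re

# Hypothetical helper patterns; the character set is an assumption
# covering the Belarusian Cyrillic alphabet.
CYR = "А-Яа-яЁёЎўІі"
WORD = rf"[{CYR}][{CYR}']*"          # a word may contain an apostrophe

# Alternatives are tried left to right, so more specific rules come first.
TOKEN_RE = re.compile(
    "|".join([
        r"https?://\S+",             # URLs as single tokens
        r"@\w+",                     # tweet-style names
        r"°[\u0421C]",               # degree sign + Cyrillic or Latin C
        rf"\d+-{WORD}",              # digits + hyphen + Cyrillic part
        r"\d+,\d+",                  # decimal numbers with a comma
        r"\d+",                      # plain integers
        rf"{WORD}(?:-{WORD})*",      # words, including hyphenated ones
        r"\.\.\.|--|\?!",            # conventional multi-character punctuation
        r"\S",                       # any other symbol is its own token
    ])
)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is dropped)."""
    return TOKEN_RE.findall(text)
```

Because plain integers are matched before any punctuation, `tokenize("20:55")` yields three tokens and `tokenize("20.04.2012")` yields five, in line with the time and date rules, while `3,14` and `79-гадовы` stay whole.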
_bg/specific-syntax.md
+1-2
@@ -1,12 +1,11 @@
1
1
---
2
2
layout: base
3
3
title: 'Syntax'
4
-
permalink: bg/overview/specific-syntax.html
5
4
---
6
5
7
6
# Specific constructions
8
7
9
-
## Yes-No question particle
8
+
## Yes-No question particle
10
9
11
10
In Bulgarian the Yes-No questions are formed with the question particle ли (li). At the moment this particle is annotated with the [cs-dep/discourse]() relation.
_cop/morphology.md
+3-4
@@ -1,7 +1,6 @@
1
1
---
2
2
layout: base
3
3
title: 'Morphology'
4
-
permalink: cop/overview/morphology.html
5
4
udver: '2'
6
5
---
7
6
@@ -13,7 +12,7 @@ In keeping with other Universal Dependency treebanks, the Coptic dependency tree
13
12
14
13
|Coptic Scriptorium | Universal Tags|
15
14
|--------------------- |:---------------------|
16
-
|AAOR | AUX |
15
+
|AAOR | AUX |
17
16
|ACAUS | AUX |
18
17
|ACOND | SCONJ |
19
18
|ACONJ | AUX |
@@ -63,12 +62,12 @@ In keeping with other Universal Dependency treebanks, the Coptic dependency tree
63
62
64
63
**Notes**
65
64
66
-
The Universal POS tags do not map well onto Coptic tags in several cases; in all instances, the attempt has been made to choose the nearest category, especially with syntactic function in mind. The objective is to create dependency trees that connect similar categories to those of other languages.
65
+
The Universal POS tags do not map well onto Coptic tags in several cases; in all instances, the attempt has been made to choose the nearest category, especially with syntactic function in mind. The objective is to create dependency trees that connect similar categories to those of other languages.
67
66
68
67
Most tripartite conjugation bases have been mapped to auxiliaries (`AUX`) if they are main clause conjugations (past auxiliary APST, aorist AAOR, etc.) or are not the main conjugation morpheme (e.g. future marker FUT, which may join a durative conjugation or irrealis preterit). For the subordinate conjugations (APREC, ALIM), the universal tag `SCONJ` (subordinating conjunction) is used.
69
68
70
69
The category IMOD is cast as a form of `ADV`. While the alternatives of `ADP` (adposition) or `PART` (particle) are semantically appealing, the mapping to `ADV` best represents their sentential function and parallels the dependency label advmod. Note that this results in some adverbs carrying determiners, which is rather odd in terms of underlying categories for the syntax trees. It is perhaps similar to some extent to situations with the Stanford Typed label npadvmod, with the distinction that Coptic IMODs only attach to pronouns, never nouns.
71
70
72
-
The existential predicates (EXIST) have been mapped as `VERB`, whereas the copula (COP) is mapped to `PTC`, since unlike in the case of existence, it does not contain the actual predicate, and is also absent in the interlocutive patterns.
71
+
The existential predicates (EXIST) have been mapped as `VERB`, whereas the copula (COP) is mapped to `PTC`, since unlike in the case of existence, it does not contain the actual predicate, and is also absent in the interlocutive patterns.
73
72
74
73
Finally the converters have been treated similarly to conjugation bases, although they co-occur with the bases. Subordinate converters (CCIRC, CREL) are treated as `SCONJ`, while (potentially) main clause converters (CFOC, CPRET) are tagged as `AUX`. In all cases, we stress that these are not ideal tag assignments, but ones that aim to stay closest to the limited universal tag set’s behavior. For all new projects we recommend using Scriptorium tags and converting automatically to universal tags if necessary.
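The automatic conversion recommended in the notes can be pictured as a simple lookup table. This is a sketch under stated assumptions: it contains only the mappings spelled out in the table excerpt and notes above (the full table is longer and the copula is left out), and the names `SCRIPTORIUM_TO_UPOS` and `to_upos`, as well as the fallback tag `X`, are hypothetical choices for illustration.

```python
# Partial mapping, restricted to the tags named in the text above.
SCRIPTORIUM_TO_UPOS = {
    # conjugation bases
    "AAOR": "AUX", "ACAUS": "AUX", "ACONJ": "AUX", "APST": "AUX", "FUT": "AUX",
    "ACOND": "SCONJ", "APREC": "SCONJ", "ALIM": "SCONJ",
    # inflected modifiers and existential predicates
    "IMOD": "ADV", "EXIST": "VERB",
    # converters
    "CCIRC": "SCONJ", "CREL": "SCONJ", "CFOC": "AUX", "CPRET": "AUX",
}


def to_upos(tag, default="X"):
    """Look up a Scriptorium tag; unknown tags fall back to `default` (an assumption)."""
    return SCRIPTORIUM_TO_UPOS.get(tag, default)
```

Keeping the Scriptorium tag as the primary annotation and deriving the universal tag through such a table means the coarser tag set can be regenerated whenever the mapping is revised.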
_cop/specific-syntax.md
+3-4
@@ -1,7 +1,6 @@
1
1
---
2
2
layout: base
3
3
title: 'Syntax'
4
-
permalink: cop/overview/specific-syntax.html
5
4
udver: '2'
6
5
---
7
6
@@ -22,7 +21,7 @@ csubj(ⲉⲝⲉⲥⲧⲓ, ⲁⲁ)
22
21
Greek conjunctions and particles that are non-coordinating (i.e. not meaning ‘and/or’) are labeled as `advmod` to their associated predicate, as in the following example:
23
22
24
23
~~~sdparse
25
-
ⲙⲏ ⲁⲣⲁ ⲉ ⲓ ⲟⲩⲏϩ ⲟⲛ ϩⲓϫⲛ ⲧ ⲙⲏⲧⲉ ⲛ ϫⲱ ⲕ \n After all do I still sit upon the middle of your head?
24
+
ⲙⲏ ⲁⲣⲁ ⲉ ⲓ ⲟⲩⲏϩ ⲟⲛ ϩⲓϫⲛ ⲧ ⲙⲏⲧⲉ ⲛ ϫⲱ ⲕ \n After all do I still sit upon the middle of your head?
26
25
27
26
advmod(ⲟⲩⲏϩ, ⲙⲏ)
28
27
~~~
@@ -34,7 +33,7 @@ Inverted modifiers of the type ⲛⲟϭ ⲛϭⲟⲙ ‘great power’ (lit. a
34
33
~~~sdparse
35
34
ⲡⲓ ⲛⲟϭ ⲛ ⲃⲁⲣⲟⲥ \n this great burden
36
35
37
-
det(ⲛⲟϭ, ⲡⲓ)
36
+
det(ⲛⲟϭ, ⲡⲓ)
38
37
nmod(ⲛⲟϭ, ⲃⲁⲣⲟⲥ)
39
38
case(ⲃⲁⲣⲟⲥ, ⲛ)
40
39
~~~
@@ -43,7 +42,7 @@ case(ⲃⲁⲣⲟⲥ, ⲛ)
43
42
44
43
The independent possessive pronoun ‘that, which is of X, belongs to X’ is analyzed as the head of the phrase, and the possessor is attached as nmod to this:
_cop/tokenization.md
+1-2
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,6 @@
1
1
---
2
2
layout: base
3
3
title: 'Tokenization'
4
-
permalink: cop/overview/tokenization.html
5
4
udver: '2'
6
5
---
7
6
@@ -17,7 +16,7 @@ For portmanteau tags, tokens which carry a fused portmanteau POS tag receive bot
17
16
18
17
*Pure universal dependencies*
19
18
20
-
When using pure dependencies, more ‘lexical’ functions trump more ‘grammatical’ ones, so that examples like ACOND_PPERS are still labeled nsubj, omitting the aux label entirely. This preserves the pure universal dependency tag set.
19
+
When using pure dependencies, more ‘lexical’ functions trump more ‘grammatical’ ones, so that examples like ACOND_PPERS are still labeled nsubj, omitting the aux label entirely. This preserves the pure universal dependency tag set.
21
20
22
21
Alternatively, if the intended application of the annotation project supports sub-tokenization, the CoNLL-U format can be used as follows, specifying subtokens/supertokens for fused units:
_el/specific-syntax.md
+6-7
Original file line number
Diff line number
Diff line change
@@ -1,43 +1,42 @@
1
1
---
2
2
layout: base
3
3
title: 'Syntax'
4
-
permalink: el/overview/specific-syntax.html
5
4
---
6
5
7
6
### Free relatives
8
7
9
-
Free relative clauses are marked as [ccomp](el-dep/ccomp), [csubj](el-dep/csubj), [advcl](el-dep/advcl) and [advcl](el-dep/advcl), depending on their relation to their verbal or nominal head.
8
+
Free relative clauses are marked as [ccomp](el-dep/ccomp), [csubj](el-dep/csubj), [advcl](el-dep/advcl) and [advcl](el-dep/advcl), depending on their relation to their verbal or nominal head.