Commit 14d0ea4

Removed broken permalinks.
1 parent c3314f2 commit 14d0ea4

78 files changed, with 183 additions and 296 deletions

_ar/syntax.md

-1
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: ar/overview/syntax.html
 ---

 # Syntax

_ar/tokenization.md

-1
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Tokenization'
-permalink: ar/overview/tokenization.html
 ---

 # Tokenization

_be/syntax.md

-1
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: be/overview/syntax.html
 ---

 # Syntax

_be/tokenization.md

+13-14
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Tokenization'
-permalink: be/overview/tokenization.html
 ---

 # Tokenization
@@ -11,18 +10,18 @@ The low-level tokenization of the Belarusian UD treebank generally adopts the RN
 * In general, tokens are delimited by whitespace. The regexp [А-zА-яЁёУўі\-]+ usually corresponds to one token.
 * Punctuation (recognized by the corresponding Unicode property) that is conventionally written adjacent to the preceding or following word is separated during tokenization.
 * Each punctuation mark is treated as a single token, e.g. the following sequence: <b>)", -</b> becomes four tokens, <b>)</b> , <b>"</b>, <b>,</b>, and <b>-"</b>. Exceptions are conventional multi-character punctuation marks: <b>--</b> , <b>...</b> , <b>?!</b> , etc., and emojis and smileys: <b>:)</b> , <b>^_^</b>, etc.
-* Conventional non-cyrillic multi-character terms are tokenized as single tokens: <b>°С</b>, <b>км2</b>.
-
-Some special cases worth mentioning:
-* Numerical expressions including decimal numbers, such as <b>245</b>, <b>3,14</b>, are treated as single tokens.
-* Time expressions like <b>20:55</b> are splitted into separate tokens (in this case, three { <b>20</b> , <b>:</b> , <b>55</b> }).
-* Dates like <b>20.04.2012</b> are splitted into separate tokens (in this case, five { <b>20</b> , <b>.</b> , <b>04</b> , <b>.</b> , <b>2012</b> }).
-* Special symbols before and after numerical expressions, as in <b>$500</b> , <b>2,67%</b> , <b>+27°С</b> , are tokenised separately (so, the tokens are { <b>$</b> , <b>500</b> } , { <b>2,67</b> , <b>%</b> } , { <b>+</b> , <b>27</b> , <b>°С</b> }).
-* Numerical expressions with hyphen and cyrillic endings (e.g. <b>1-ый</b> “1st”, <b>3-м</b> “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. <b>79-гадовы</b> “79 year old”, <b>500-годдзе</b> “500th anniversary”) are treated as single tokens.
-* Other words with hyphen are treated as single tokens, except for the cases then the first part is inflected. Examples: { <b>з-за</b> } “because of”, { <b>зялёна-шэрых</b> } “green-gray”, { <b>Санкт-Пецярбург</b> } “St. Petersburg”, but { <b>Ростове</b> , <b>-</b> , <b>на</b> , <b>-</b> , <b>Дону</b>} “(in) Rostov on Don”.
+* Conventional non-cyrillic multi-character terms are tokenized as single tokens: <b>°С</b>, <b>км2</b>.
+
+Some special cases worth mentioning:
+* Numerical expressions including decimal numbers, such as <b>245</b>, <b>3,14</b>, are treated as single tokens.
+* Time expressions like <b>20:55</b> are splitted into separate tokens (in this case, three { <b>20</b> , <b>:</b> , <b>55</b> }).
+* Dates like <b>20.04.2012</b> are splitted into separate tokens (in this case, five { <b>20</b> , <b>.</b> , <b>04</b> , <b>.</b> , <b>2012</b> }).
+* Special symbols before and after numerical expressions, as in <b>$500</b> , <b>2,67%</b> , <b>+27°С</b> , are tokenised separately (so, the tokens are { <b>$</b> , <b>500</b> } , { <b>2,67</b> , <b>%</b> } , { <b>+</b> , <b>27</b> , <b>°С</b> }).
+* Numerical expressions with hyphen and cyrillic endings (e.g. <b>1-ый</b> “1st”, <b>3-м</b> “3rd.Ins”) as well as adjectives and other non-numerals which contain digits (e.g. <b>79-гадовы</b> “79 year old”, <b>500-годдзе</b> “500th anniversary”) are treated as single tokens.
+* Other words with hyphen are treated as single tokens, except for the cases then the first part is inflected. Examples: { <b>з-за</b> } “because of”, { <b>зялёна-шэрых</b> } “green-gray”, { <b>Санкт-Пецярбург</b> } “St. Petersburg”, but { <b>Ростове</b> , <b>-</b> , <b>на</b> , <b>-</b> , <b>Дону</b>} “(in) Rostov on Don”.
 * Abbreviations are treated as single tokens, whitespaces split the abbreviations.
 * Abbreviations marked by a period, as in <b>стр.</b> “p. (page)”, <b>П.</b> “P. (for Peter)”, are treated as single tokens. If the period overlaps with the end of sentence period then it is written once as a separate token (denoting end-of-sentence), e.g. { <b>1914</b> , <b>г</b> , <b>.</b> } “year 1914”.
-* Abbreviations can not contain a period inside, i.e. the patterns like <b>і т.д.</b> “and so on”, <b>да т.п.</b> “and so forth” are splitted into three tokens: { <b>i</b> , <b>т.</b> , <b>д.</b> }, { <b>да</b> , <b>т.</b> , <b>п.</b> }.
+* Abbreviations can not contain a period inside, i.e. the patterns like <b>і т.д.</b> “and so on”, <b>да т.п.</b> “and so forth” are splitted into three tokens: { <b>i</b> , <b>т.</b> , <b>д.</b> }, { <b>да</b> , <b>т.</b> , <b>п.</b> }.
 * Email addresses, URLs, and tweet-style names are treated as single tokens: {[email protected]}, {https://github.com}, {@anna_li}

 The Belarusian UD treebank does not contain multiword tokens.
@@ -35,11 +34,11 @@ The Belarusian UD treebank does not contain multiword tokens.

 ### Verb forms, analytical grammatical forms, negation

-* reflexive verbs are kept as a single token (orthographic rule): <b>з'яўляецца</b> “is”.
+* reflexive verbs are kept as a single token (orthographic rule): <b>з'яўляецца</b> “is”.
 * the forms of subjunctive mood, analytical passive, complex future tense, complex comparatives, etc. are splitted
-according to the orthographic principle: { <b>маглі</b> , <b>б</b> } “would be able, could”, { <b>былі</b> , <b>зафіксаваныя</b> } “were recorded”, { <b>будзе</b> , <b>трымацца</b> } “will be held”, { <b>больш</b> , <b>сур'ёзныя</b> } “more serious”
+according to the orthographic principle: { <b>маглі</b> , <b>б</b> } “would be able, could”, { <b>былі</b> , <b>зафіксаваныя</b> } “were recorded”, { <b>будзе</b> , <b>трымацца</b> } “will be held”, { <b>больш</b> , <b>сур'ёзныя</b> } “more serious”
 * <b>не</b> and <b>ни</b> used as negation markers with verbs, pronouns and other words are tokenized according to the orthographic rules: { <b>не</b> , <b>рэагуючы</b> } “not reacting”, { <b>ні</b> , <b>ў</b> , <b>каго</b> } “at no one”, but { <b>непапраўнай</b> } “irrecoverable”, { <b>незавершаны</b> } “not finished”, { <b>ніякіх</b> } “no one”.
-* паў- and напаў- “half” are never kept separate: <b>паўбеспрацоўны</b> “half-unemployed”, <b>напаўзабыты</b> “half-forgotten”.
+* паў- and напаў- “half” are never kept separate: <b>паўбеспрацоўны</b> “half-unemployed”, <b>напаўзабыты</b> “half-forgotten”.

 ### Character set
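
Reviewer note: the tokenization rules quoted in the `_be/tokenization.md` hunks can be approximated with one ordered regular expression. The sketch below is an illustration only, not the treebank's actual tooling: the names `TOKEN_RE` and `tokenize` are hypothetical, and the ordering of the alternatives is an assumption. It deliberately leaves out rules that need a lexicon or morphological knowledge (period-marked abbreviations such as <b>т.</b>, emoticons, units such as <b>км2</b>, and the inflected-first-part exception for <b>Ростове-на-Дону</b>).

```python
import re

# Hypothetical sketch: alternatives are tried left to right, so the more
# specific patterns (URLs, e-mail, handles, numerals) come before the
# catch-all single-character rule at the end.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                          # URLs are kept whole
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+          # e-mail addresses are kept whole
  | @\w+                                  # tweet-style names, e.g. @anna_li
  | \d+-[^\W\d_]+                         # 1-ый, 79-гадовы: digits, hyphen, letters
  | \d+(?:,\d+)?                          # 245 and decimal-comma numbers like 3,14
  | °[CС]                                 # degree unit (Latin or Cyrillic letter)
  | [^\W\d_]+(?:['’-][^\W\d_]+)*          # words, incl. з-за and з'яўляецца
  | \.\.\. | -- | \?!                     # conventional multi-character punctuation
  | \S                                    # any other symbol is its own token
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text`; whitespace only separates tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["20:55", "20.04.2012", "$500", "2,67%", "79-гадовы", "з-за"]:
        print(sample, "->", tokenize(sample))
```

Because times and dates are not given a pattern of their own, they fall apart into digits and single punctuation marks, which matches the splitting described in the hunk.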

_bg/morphology.md

-1
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Morphology'
-permalink: bg/overview/morphology.html
 ---

 # Morphology

_bg/specific-syntax.md

+1-2
@@ -1,12 +1,11 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: bg/overview/specific-syntax.html
 ---

 # Specific constructions

-## Yes-No question particle
+## Yes-No question particle

 In Bulgarian the Yes-No questions are formed with the question particle ли (li). At the moment this particle is annotated with the [cs-dep/discourse]() relation.

_ckb/specific-syntax.md

+1-2
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: ckb/overview/specific-syntax.html
 ---

 # Specific constructions
@@ -14,6 +13,6 @@ We do not split off possessive inflection.

 ~~~ sdparse

-Mindalakanim \n my-children
+Mindalakanim \n my-children

 ~~~

_cop/morphology.md

+3-4
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Morphology'
-permalink: cop/overview/morphology.html
 udver: '2'
 ---

@@ -13,7 +12,7 @@ In keeping with other Universal Dependency treebanks, the Coptic dependency tree

 |Coptic Scriptorium | Universal Tags|
 |--------------------- |:---------------------|
-|AAOR | AUX |
+|AAOR | AUX |
 |ACAUS | AUX |
 |ACOND | SCONJ |
 |ACONJ | AUX |
@@ -63,12 +62,12 @@ In keeping with other Universal Dependency treebanks, the Coptic dependency tree

 **Notes**

-The Universal POS tags do not map well onto Coptic tags in several cases; in all instances, the attempt has been made to choose the nearest category, especially with syntactic function in mind. The objective is to create dependency trees that connect similar categories to those of other languages.
+The Universal POS tags do not map well onto Coptic tags in several cases; in all instances, the attempt has been made to choose the nearest category, especially with syntactic function in mind. The objective is to create dependency trees that connect similar categories to those of other languages.

 Most tripartite conjugation bases have been mapped to either auxiliaries (`AUX`), if they are main clause conjugations (past auxiliary APST, aorist AAOR, etc.) or not the main conjugation morpheme (e.g. future marker FUT, which may join a durative conjugation or irrealis preterit). For the subordinate conjugations (APREC, ALIM), the universal tag `SCONJ` (subordinating conjunction) is used.

 The category IMOD is cast as a form of `ADV`. While the alternatives of `ADP` (adposition) or `PART` (particle) are semantically appealing, the mapping to `ADV` best represents their sentential function and parallels the dependency label advmod. Note that this results in some adverbs carrying determiners, which is rather odd in terms of underlying categories for the syntax trees. It is perhaps similar to some extent to situations with the Stanford Typed label npadvmod, with the distinction that Coptic IMODs only attach to pronouns, never nouns.

-The existential predicates (EXIST) have been mapped as `VERB`, whereas the copula (COP) is mapped to `PTC`, since unlike in the case of existence, it does not contain the actual predicate, and is also absent in the interlocutive patterns.
+The existential predicates (EXIST) have been mapped as `VERB`, whereas the copula (COP) is mapped to `PTC`, since unlike in the case of existence, it does not contain the actual predicate, and is also absent in the interlocutive patterns.

 Finally the converters have been treated similarly to conjugation bases, although they co-occur with the bases. Subordinate converters (CCIRC, CREL) are treated as `SCONJ`, while (potentially) main clause converters (CFOC, CPRET) are tagged as `AUX`. In all cases, we stress that these are not ideal tag assignments, but ones that aim to stay closest to the limited universal tag set’s behavior. For all new projects we recommend using Scriptorium tags and converting automatically to universal tags if necessary.
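
Reviewer note: the closing recommendation in `_cop/morphology.md` (annotate with Scriptorium tags, then convert automatically) could be sketched as a lookup table. This is a hedged illustration: `SCRIPTORIUM_TO_UNIVERSAL` and `to_universal` are hypothetical names, and the table contains only the pairs visible in the hunks above, so it is incomplete. The label for COP is copied verbatim from the notes (`PTC`).

```python
# Hypothetical lookup built only from the tag pairs shown in the diff;
# the full table in the file is longer and is not reproduced here.
SCRIPTORIUM_TO_UNIVERSAL = {
    # rows visible in the table hunk
    "AAOR": "AUX",
    "ACAUS": "AUX",
    "ACOND": "SCONJ",
    "ACONJ": "AUX",
    # pairs stated in the Notes paragraphs
    "APST": "AUX",
    "FUT": "AUX",
    "APREC": "SCONJ",
    "ALIM": "SCONJ",
    "IMOD": "ADV",
    "EXIST": "VERB",
    "COP": "PTC",  # label exactly as written in the notes
    "CCIRC": "SCONJ",
    "CREL": "SCONJ",
    "CFOC": "AUX",
    "CPRET": "AUX",
}


def to_universal(tag: str) -> str:
    """Map a Scriptorium tag to the universal tag recorded above.

    Unknown tags raise KeyError instead of being guessed, because this
    sketch does not know the rest of the mapping.
    """
    try:
        return SCRIPTORIUM_TO_UNIVERSAL[tag]
    except KeyError:
        raise KeyError(f"no mapping recorded in this sketch for tag {tag!r}") from None


if __name__ == "__main__":
    print([to_universal(t) for t in ["APST", "ACOND", "IMOD"]])
```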

_cop/specific-syntax.md

+3-4
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: cop/overview/specific-syntax.html
 udver: '2'
 ---

@@ -22,7 +21,7 @@ csubj(ⲉⲝⲉⲥⲧⲓ, ⲁⲁ)
 Greek conjunctions and particles that are non-coordinating (i.e. not meaning ‘and/or’) are labeled as `advmod` to their associated predicate, as in the following example:

 ~~~ sdparse
-ⲙⲏ ⲁⲣⲁ ⲉ ⲓ ⲟⲩⲏϩ ⲟⲛ ϩⲓϫⲛ ⲧ ⲙⲏⲧⲉ ⲛ ϫⲱ ⲕ \n After all do I still sit upon the middle of your head?
+ⲙⲏ ⲁⲣⲁ ⲉ ⲓ ⲟⲩⲏϩ ⲟⲛ ϩⲓϫⲛ ⲧ ⲙⲏⲧⲉ ⲛ ϫⲱ ⲕ \n After all do I still sit upon the middle of your head?

 advmod(ⲟⲩⲏϩ, ⲙⲏ)
 ~~~
@@ -34,7 +33,7 @@ Inverted modifiers of the type ⲛⲟϭ ⲛϭⲟⲙ ‘great power’ (lit. a
 ~~~ sdparse
 ⲡⲓ ⲛⲟϭ ⲛ ⲃⲁⲣⲟⲥ \n this great burden

-det(ⲛⲟϭ, ⲡⲓ)
+det(ⲛⲟϭ, ⲡⲓ)
 nmod(ⲛⲟϭ, ⲃⲁⲣⲟⲥ)
 case(ⲃⲁⲣⲟⲥ, ⲛ)
 ~~~
@@ -43,7 +42,7 @@ case(ⲃⲁⲣⲟⲥ, ⲛ)

 The independent possessive pronoun ‘that, which is of X, belongs to X’ is analyzed as the head of the phrase, and the possessor is attached as nmod to this:

-~~~ sdparse
+~~~ sdparse
 ⲛⲁ ⲡⲉ ⲭⲣⲓⲥⲧⲟⲥ \n that which is Christ's

 nmod(ⲛⲁ, ⲭⲣⲓⲥⲧⲟⲥ)

_cop/tokenization.md

+1-2
@@ -1,7 +1,6 @@
 ---
 layout: base
 title: 'Tokenization'
-permalink: cop/overview/tokenization.html
 udver: '2'
 ---

@@ -17,7 +16,7 @@ For portmanteau tags, tokens which carry a fused portmanteau POS tag receive bot

 *Pure universal dependencies*

-When using pure dependencies, more ‘lexical’ functions trump more ‘grammatical’ ones, so that examples like ACOND_PPERS are still labeled nsubj, omitting the aux label entirely. This preserves the pure universal dependency tag set.
+When using pure dependencies, more ‘lexical’ functions trump more ‘grammatical’ ones, so that examples like ACOND_PPERS are still labeled nsubj, omitting the aux label entirely. This preserves the pure universal dependency tag set.

 Alternatively, if the intended application of the annotation project supports sub-tokenization, the CoNLL-U format can be used as follows, specifying subtokens/supertokens for fused units:

_el/specific-syntax.md

+6-7
@@ -1,43 +1,42 @@
 ---
 layout: base
 title: 'Syntax'
-permalink: el/overview/specific-syntax.html
 ---

 ### Free relatives

-Free relative clauses are marked as [ccomp](el-dep/ccomp), [csubj](el-dep/csubj), [advcl](el-dep/advcl) and [advcl](el-dep/advcl), depending on their relation to their verbal or nominal head.
+Free relative clauses are marked as [ccomp](el-dep/ccomp), [csubj](el-dep/csubj), [advcl](el-dep/advcl) and [advcl](el-dep/advcl), depending on their relation to their verbal or nominal head.

 ~~~ sdparse
-Για να εντυπωσιάζετε όποιον γνωρίζετε
+Για να εντυπωσιάζετε όποιον γνωρίζετε
 ccomp(εντυπωσιάζετε, γνωρίζετε)
 dobj(γνωρίζετε, όποιον)
 ~~~

 ~~~ sdparse
-Όποιος έφυγε , έχασε
+Όποιος έφυγε , έχασε
 csubj(έχασε, έφυγε)
 nsubj(έφυγε, Όποιος)
 ~~~

 ~~~ sdparse
-Τιμώρησε όποιον μαθητή τον ενοχλούσε
+Τιμώρησε όποιον μαθητή τον ενοχλούσε
 ccomp(Τιμώρησε, ενοχλούσε)
 nsubj(ενοχλούσε, μαθητή)
 dobj(ενοχλούσε, τον)
 det(μαθητή, όποιον)
 ~~~

 ~~~ sdparse
-η ενασχόληση με οποιοδήποτε θέμα σε ενδιαφέρει
+η ενασχόληση με οποιοδήποτε θέμα σε ενδιαφέρει
 acl(ενασχόληση, ενδιαφέρει)
 dobj(ενδιαφέρει, σε)
 nsubj(ενδιαφέρει, θέμα)
 det(θέμα, οποιοδήποτε)
 ~~~

 ~~~ sdparse
-Έλα όποτε ευκαιρήσεις
+Έλα όποτε ευκαιρήσεις
 advcl(Έλα, ευκαιρήσεις)
 advmod(ευκαιρήσεις, όποτε)
 ~~~
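
Reviewer note: the `sdparse` blocks in this file pair a sentence line with `relation(head, dependent)` lines. A small reader for that layout might look like the sketch below. It is an assumption-laden illustration, not the parser used by the documentation site: `Dependency`, `DEP_RE` and `parse_sdparse` are hypothetical names, the first non-relation line is simply taken as the sentence, and the head-first argument order is inferred from the examples above.

```python
import re
from typing import NamedTuple, Optional


class Dependency(NamedTuple):
    relation: str
    head: str
    dependent: str


# Matches lines such as "csubj(έχασε, έφυγε)"; anything else is treated as text.
DEP_RE = re.compile(r"^([\w:]+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)$")


def parse_sdparse(block: str) -> tuple[Optional[str], list[Dependency]]:
    """Split an sdparse body into its sentence line and its relation lines."""
    sentence: Optional[str] = None
    dependencies: list[Dependency] = []
    for raw_line in block.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate the parts
        match = DEP_RE.match(line)
        if match:
            dependencies.append(Dependency(*match.groups()))
        elif sentence is None:
            sentence = line  # first non-relation line is the sentence
    return sentence, dependencies


if __name__ == "__main__":
    example = "Όποιος έφυγε , έχασε\ncsubj(έχασε, έφυγε)\nnsubj(έφυγε, Όποιος)"
    sentence, deps = parse_sdparse(example)
    print(sentence)
    for dep in deps:
        print(dep)
```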

_el/syntax.md

-12
This file was deleted.
