Using a simple language, tokens of a .conllu
-file can be edited if a condition is met
bin/replace.sh rules.txt input.conllu [--nostrict] > output.conllu
The rule file as line per rule
condition > new_values
condition
is a logical expression which is evaluated for each word, and if true the new values are set to the token which satisfies the condition.
The condition
is a set of key:values
, operators like and
, or
or !
(not) and parentheses. The condition may contain whitespaces:
Examples:
Upos:ADP and !Deprel:case
: true if the current token hasADP
as UPOS and its deprel is notcase
. Available keys:Upos:
(Values:[A-Z]+
)Xpos:
(Values: string of any character except whitespaces,)
and&
)Lemma:
(Values: string of any character except whitespaces,)
and&
)Form:
(Values: string of any character except whitespaces,)
and&
)Deprel:
(Values:[a-z]+
, optionally followed by:
and a string of any character except whitespaces,)
and&
)HeadId:
(Values:[+-][0-9]+
(relative from current head) or[0-9]+
(absolute head id), true if the head of the current token matchesEUD:
(Values:[+-][0-9]+
:
, deprel, if EudHeadId is*
any head position is accepted, without-
or+
the EudHead is interpreted as an absolute value))Feat:
(Values: FeatureName=Value or FeatureName:Value. The Featurename must match[A-Za-z_\[\]]+
, the Value[A-Za-z0-9]+
)Misc:
(Values: MiscName=Value or MiscName:Value. MiscName must match[A-Za-z_]+
, the Value can be any string without whitespaces,)
and&
)Id:
(Values: integer)MWT:
(Values: length of the multi-word token[2-9]
)IsEmpty
(no value, true if the current node is empty)IsMWT
(no value, true if the current node is empty)
Form:
, Lemma:
and Xpos:
can contain simple regular expression (only the character ')' cannot be used
EUD
cannot deal (yet) with empty word ids (n.m
)
Lemma
and Form
can have either a regex as argument or a filename of a file which contains a list of forms or lemmas:
Lemma:sing.* > misc:"Value=Sing"
Lemma:#mylemmas.txt > misc:"Value=Sing"
(if the filemylemmas.txt
does not exist, the condition is false)
In addition to key keys listed above, four functions are available to take the context of the token into account:
child()
child of current tokenhead()
head of current tokenprec()
preceding tokennext()
following token
For example:
head(head(Upos:VERB and Feat:Tense=Past))
: true if the current token has a head who has a head with UPOSVERB and the feature
Tense=Past`child(Upos:VERB && Feat:VerbForm=Part) and child(Upos:DET)
: true if the current token has a dependant with UPOSVERB
and a featureVerbForm=Part
and another child with UPOSDET
.
Functions can be nested (eventhough child(head())
does not make sense, does it :-)
The same language is used for complex search and replace
For more information check the formal grammar for conditions.
new_values
is a whitespace separated list of targeted_colum:value
which modify the tokens matched the condition.
The targeted_column
indicates which column of the word a new value is assigned to:
Possible keys
:
Form
Lemma
Upos
Xpos
Deprel
HeadId
Feat
Eud
Misc
(theId
column cannot be changed).
value
is a combination (using +
) of strings or functions which give access to other columns of the current word or it's head. Strings must be included
in double quotes "NOUN"
.
column_name
to retrieve a value from can be:
Form
Lemma
Upos
Xpos
Feat_<FeatureName>
Deprel
Misc_<KeyName>
HeadId
Available functions are:
this(<column_name>)
value of the given column of the current tokenhead(<column_name>)
value of the given column of the head of the current tokenhead(head(<column_name>)
value of the given column of the head's head of the current tokensubstring(this()/head(), start, end)
take the substring of the this/head expression fromstart
toend
substring(this()/head(), start)
take the substring of the result of the this/head expression fromstart
until the end of the stringupper(this()/head())
uppercase the result of the this/head expressionlower(this()/head())
lowercase the result of the this/head expressioncap(this()/head())
capitalize (first character uppercase, rest lowercase) the result of the this/head expressionreplace(this()/head(), regex, newstring) replaces the
regexof the result fo the this/head expression by
newstring`
If a token has a head 0, it's deprel will always be root
unless the option --nostrict
is used with replace.sh
Upos:"NOUN"
set Upos toNOUN
- `Eud:"+2:dep" add a enhanced UD relation "dep" using the current id + 2 (must be a negative or positive integer without 0 (if resulting head id is out of the sentence, the head id is not modified)
Eud:head(HeadId)+":"+head(Deprel)
set EUD to head and deprel of the headword- `HeadId:"+2" set head to current ud + 2 (must be a negative or positive integer without 0 (if resulting head id is out of the sentence, the head id is not modified)
- `HeadId:"-1" set head to current ud - 1
- `HeadId:"5" set head to 5 (n must be 0 or a positive integer)
- `HeadId:head(Headid) set head to the headid of head node
Feat:"Number=Sing"
adds a featureNumber=Sing
(Number: deletes the feature)Lemma:this(Form)
set lemma to the form of current tokenLemma:this(Misc_Translit)
set lemma to the keyTranslit
of theMisc
columnLemma:this(Form)+"er"
set lemma to the form + "er"Lemma:"de"+token(Form)
set lemma to "de" + formFeat:"Featname"+this(Lemma)
set the feature Featname to the value of LemmaFeat:"Gender"+this(Misc_Special)
set the feature Gender to the value of the Misc SpecialMisc:"Keyname"+head(head(Upos))
set the key "Keyname" ofMisc
column to the Upos of the head of the headLemma:substring(this(Form),1,3)
set lemma to the substring (1 - 3) of the formLemma:substring(this(Form),1)
set lemma to the substring (1 - end) ofthe formForm:replace(this(Form),"é","e")
replace all occurrances ofé
in the form bye
N.B. no white spaces allowed in a value expression!
therefore Lemma:substring(this(Form), 1, 3)
or Lemma:this(Form) + "er"are invalid, use
Lemma:substring(this(Form),1,3) or Lemma:this(Form)+"er"
instead.
In order to empty a column, just set it to "_"
: Feat:"_"
, Xpos:"_"
, Eud:"_"
etc.
For more information check the formal grammar for replacements (the part after the first :
).