Commit d542892

Initial checkin.
0 parents  commit d542892

4 files changed

+17507
-0
lines changed

NOTICE.txt

+197
@@ -0,0 +1,197 @@
1+
This project uses/adapts the following components to build
2+
a German decompounder for Apache Lucene / Apache Solr / Elasticsearch:
3+
4+
=============================================================================
5+
6+
The de_DR.xml file was taken from offo-hyphenation v1.2:
7+
https://sourceforge.net/projects/offo
8+
9+
Hyphenation patterns for new German orthography.
10+
11+
Constructed by Carlos Villegas from TeX's dehyphn.tex file
12+
obtained from http://www.ctan.org/tex-archive/language/hyphenation
13+
14+
This file may be used, distributed and modified only according to
15+
LaTeX Project Public License:
16+
17+
ftp://ctan.tug.org/tex-archive/fonts/mathpazo/lppl.txt
18+
19+
Please report errors in this file to the following address:
20+
[email protected] [or yours] and not to the address of the
21+
original authors. You are not allowed to distribute this file under its
22+
original name in the TeX distribution."
23+
24+
=============================================================================
25+
26+
The dictionary file (dictionary-de.txt) was created based on the data
27+
by Björn Jacke: https://www.j3e.de/ispell/igerman98/
28+
29+
According to LibreOffice's website, the dictionary is provided
30+
under LGPL-v3+ (GNU Lesser General Public License Version 3 or later) license.
31+
https://extensions.libreoffice.org/extensions/german-de-de-igerman98-dictionaries
32+
33+
GNU LESSER GENERAL PUBLIC LICENSE
34+
Version 3, 29 June 2007
35+
36+
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
37+
Everyone is permitted to copy and distribute verbatim copies
38+
of this license document, but changing it is not allowed.
39+
40+
41+
This version of the GNU Lesser General Public License incorporates
42+
the terms and conditions of version 3 of the GNU General Public
43+
License, supplemented by the additional permissions listed below.
44+
45+
0. Additional Definitions.
46+
47+
As used herein, "this License" refers to version 3 of the GNU Lesser
48+
General Public License, and the "GNU GPL" refers to version 3 of the GNU
49+
General Public License.
50+
51+
"The Library" refers to a covered work governed by this License,
52+
other than an Application or a Combined Work as defined below.
53+
54+
An "Application" is any work that makes use of an interface provided
55+
by the Library, but which is not otherwise based on the Library.
56+
Defining a subclass of a class defined by the Library is deemed a mode
57+
of using an interface provided by the Library.
58+
59+
A "Combined Work" is a work produced by combining or linking an
60+
Application with the Library. The particular version of the Library
61+
with which the Combined Work was made is also called the "Linked
62+
Version".
63+
64+
The "Minimal Corresponding Source" for a Combined Work means the
65+
Corresponding Source for the Combined Work, excluding any source code
66+
for portions of the Combined Work that, considered in isolation, are
67+
based on the Application, and not on the Linked Version.
68+
69+
The "Corresponding Application Code" for a Combined Work means the
70+
object code and/or source code for the Application, including any data
71+
and utility programs needed for reproducing the Combined Work from the
72+
Application, but excluding the System Libraries of the Combined Work.
73+
74+
1. Exception to Section 3 of the GNU GPL.
75+
76+
You may convey a covered work under sections 3 and 4 of this License
77+
without being bound by section 3 of the GNU GPL.
78+
79+
2. Conveying Modified Versions.
80+
81+
If you modify a copy of the Library, and, in your modifications, a
82+
facility refers to a function or data to be supplied by an Application
83+
that uses the facility (other than as an argument passed when the
84+
facility is invoked), then you may convey a copy of the modified
85+
version:
86+
87+
a) under this License, provided that you make a good faith effort to
88+
ensure that, in the event an Application does not supply the
89+
function or data, the facility still operates, and performs
90+
whatever part of its purpose remains meaningful, or
91+
92+
b) under the GNU GPL, with none of the additional permissions of
93+
this License applicable to that copy.
94+
95+
3. Object Code Incorporating Material from Library Header Files.
96+
97+
The object code form of an Application may incorporate material from
98+
a header file that is part of the Library. You may convey such object
99+
code under terms of your choice, provided that, if the incorporated
100+
material is not limited to numerical parameters, data structure
101+
layouts and accessors, or small macros, inline functions and templates
102+
(ten or fewer lines in length), you do both of the following:
103+
104+
a) Give prominent notice with each copy of the object code that the
105+
Library is used in it and that the Library and its use are
106+
covered by this License.
107+
108+
b) Accompany the object code with a copy of the GNU GPL and this license
109+
document.
110+
111+
4. Combined Works.
112+
113+
You may convey a Combined Work under terms of your choice that,
114+
taken together, effectively do not restrict modification of the
115+
portions of the Library contained in the Combined Work and reverse
116+
engineering for debugging such modifications, if you also do each of
117+
the following:
118+
119+
a) Give prominent notice with each copy of the Combined Work that
120+
the Library is used in it and that the Library and its use are
121+
covered by this License.
122+
123+
b) Accompany the Combined Work with a copy of the GNU GPL and this license
124+
document.
125+
126+
c) For a Combined Work that displays copyright notices during
127+
execution, include the copyright notice for the Library among
128+
these notices, as well as a reference directing the user to the
129+
copies of the GNU GPL and this license document.
130+
131+
d) Do one of the following:
132+
133+
0) Convey the Minimal Corresponding Source under the terms of this
134+
License, and the Corresponding Application Code in a form
135+
suitable for, and under terms that permit, the user to
136+
recombine or relink the Application with a modified version of
137+
the Linked Version to produce a modified Combined Work, in the
138+
manner specified by section 6 of the GNU GPL for conveying
139+
Corresponding Source.
140+
141+
1) Use a suitable shared library mechanism for linking with the
142+
Library. A suitable mechanism is one that (a) uses at run time
143+
a copy of the Library already present on the user's computer
144+
system, and (b) will operate properly with a modified version
145+
of the Library that is interface-compatible with the Linked
146+
Version.
147+
148+
e) Provide Installation Information, but only if you would otherwise
149+
be required to provide such information under section 6 of the
150+
GNU GPL, and only to the extent that such information is
151+
necessary to install and execute a modified version of the
152+
Combined Work produced by recombining or relinking the
153+
Application with a modified version of the Linked Version. (If
154+
you use option 4d0, the Installation Information must accompany
155+
the Minimal Corresponding Source and Corresponding Application
156+
Code. If you use option 4d1, you must provide the Installation
157+
Information in the manner specified by section 6 of the GNU GPL
158+
for conveying Corresponding Source.)
159+
160+
5. Combined Libraries.
161+
162+
You may place library facilities that are a work based on the
163+
Library side by side in a single library together with other library
164+
facilities that are not Applications and are not covered by this
165+
License, and convey such a combined library under terms of your
166+
choice, if you do both of the following:
167+
168+
a) Accompany the combined library with a copy of the same work based
169+
on the Library, uncombined with any other library facilities,
170+
conveyed under the terms of this License.
171+
172+
b) Give prominent notice with the combined library that part of it
173+
is a work based on the Library, and explaining where to find the
174+
accompanying uncombined form of the same work.
175+
176+
6. Revised Versions of the GNU Lesser General Public License.
177+
178+
The Free Software Foundation may publish revised and/or new versions
179+
of the GNU Lesser General Public License from time to time. Such new
180+
versions will be similar in spirit to the present version, but may
181+
differ in detail to address new problems or concerns.
182+
183+
Each version is given a distinguishing version number. If the
184+
Library as you received it specifies that a certain numbered version
185+
of the GNU Lesser General Public License "or any later version"
186+
applies to it, you have the option of following the terms and
187+
conditions either of that published version or of any later version
188+
published by the Free Software Foundation. If the Library as you
189+
received it does not specify a version number of the GNU Lesser
190+
General Public License, you may choose any version of the GNU Lesser
191+
General Public License ever published by the Free Software Foundation.
192+
193+
If the Library as you received it specifies that a proxy can decide
194+
whether future versions of the GNU Lesser General Public License shall
195+
apply, that proxy's public statement of acceptance of any version is
196+
permanent authorization for you to choose that version for the
197+
Library.

README.md

+101
@@ -0,0 +1,101 @@
1+
# Data files for German Decompounder for Apache Lucene / Apache Solr / Elasticsearch #
2+
3+
This project was started to offer German decompounding out of the box for users
4+
of Apache Lucene, Apache Solr, or Elasticsearch. The problem with the data files is
5+
their license, so be careful when packaging them. Apache Lucene is Apache v2.0
6+
licensed, so the data files cannot be shipped together with the distribution.
7+
8+
For decompounding German words, the recommended approach is the following:
9+
10+
* First, use a hyphenator to split the input tokens into syllables. On its own this splits
11+
far too aggressively: if we indexed syllables, queries would match many unrelated words.
12+
The hyphenation rules are used in many word processing programs (e.g., OpenOffice or
13+
LaTeX). They are provided here as an XML file for Apache FOP (Formatting
14+
Objects Processor), taken from https://sourceforge.net/projects/offo/. Those files
15+
can be read by Lucene's `HyphenationCompoundWordTokenFilter` to do the hyphenation.
16+
* The second step is therefore to take the syllables and form words out of them again.
17+
The Lucene `HyphenationCompoundWordTokenFilter` can do this based on a dictionary.
18+
This project here mainly provides the dictionary to do this (see below).
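
The two steps above can be sketched in a few lines of Python. This is only a simplified illustration, not Lucene's actual implementation: the `decompound` function, its parameter names, and the hand-picked break points are assumptions made for this example. In the real filter the candidate break points come from the hyphenation patterns in the XML file.

```python
def decompound(token, break_points, dictionary,
               min_subword_size=4, only_longest_match=True):
    """Return dictionary words found between candidate break points.

    break_points are character offsets where a hyphenator would allow a
    split (hypothetical input for this sketch).
    """
    # Always consider the start and the end of the token as boundaries.
    points = sorted(set(break_points) | {0, len(token)})
    parts = []
    for i, start in enumerate(points[:-1]):
        # Collect every dictionary word that begins at this boundary.
        matches = []
        for end in points[i + 1:]:
            candidate = token[start:end]
            if len(candidate) >= min_subword_size and candidate in dictionary:
                matches.append(candidate)
        if not matches:
            continue
        if only_longest_match:
            parts.append(max(matches, key=len))
        else:
            parts.extend(matches)
    return parts


# Toy dictionary of compound parts; "fuss|ball|pum|pe" are the assumed syllables.
words = {"fuss", "ball", "fussball", "pumpe"}
print(decompound("fussballpumpe", [4, 8, 11], words))
# ['fussball', 'ball', 'pumpe']
```

With `only_longest_match=False` the shorter match `fuss` would be emitted as well, which is why the examples below enable the longest-match option.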
19+
20+
The dictionary provided here (dictionary-de.txt) was created based on the data
21+
by Björn Jacke: https://www.j3e.de/ispell/igerman98/
22+
23+
I used his large and high-quality dictionary to make a dictionary file only containing
24+
the parts of German compounds. The dictionary is therefore not large: it contains
25+
about 17,000 words that are commonly used to form compounds. The dictionary does
26+
*not* contain the compounds, only the parts that are used to create them.
27+
The dictionary was lowercased and the umlauts normalized.
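
If you want to contribute words, they need to be brought into the same shape first. The following Python sketch shows one way to do that. The exact normalization rules are an assumption here (folding ä, ö, ü to their base vowels and ß to "ss"); check them against existing entries in the dictionary before relying on them. The helper name `normalize_entry` is made up for this example.

```python
# Assumed folding rules; the README only says that umlauts were normalized.
UMLAUT_FOLDING = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def normalize_entry(word):
    """Lowercase a word and fold umlauts, as assumed for dictionary entries."""
    return word.strip().lower().translate(UMLAUT_FOLDING)


print(normalize_entry("Füße"))  # fusse
```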
28+
29+
Keep in mind: The files provided here are for *new* German orthography (since 1998)!
30+
31+
## Apache Solr example ##
32+
33+
Here is a config example for Apache Solr. To use it, put the two data files
34+
into the `lang` subfolder of the core's config directory. After that you can add the
35+
following definition to your Solr schema:
36+
37+
```xml
<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="lang/de_DR.xml"
            dictionary="lang/dictionary-de.txt" onlyLongestMatch="true" minSubwordSize="4"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```
50+
51+
Important: Use the analyzer for both indexing and searching!
52+
53+
## Elasticsearch example ##
54+
55+
Here is a config example for Elasticsearch. To use it, put the two data files
56+
into the `${ES_HOME}/config/analysis` directory of your ES node and add
57+
the following settings to your index. After that you can use the
58+
`german_decompound` analyzer in your mapping.
59+
60+
```json
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/dictionary-de.txt",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "only_longest_match": true,
          "min_subword_size": 4
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german_decompound": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_decompounder",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}
```
91+
92+
Important: Use the analyzer for both indexing and searching!
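
To check that the analyzer splits compounds as expected, you can query Elasticsearch's `_analyze` API on an index that has the settings above. Here is a minimal Python sketch using only the standard library. The helper names, the node URL, and the index name `my_index` are placeholders chosen for this example, and the tokens you get back depend on your data files and Elasticsearch version.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, text, analyzer="german_decompound"):
    """Build (but do not send) a POST request to the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(payload):
    """Extract the token strings from a parsed _analyze response."""
    return [entry["token"] for entry in payload["tokens"]]


def analyze(base_url, index, text):
    """Send the request to a running node and return the produced tokens."""
    request = build_analyze_request(base_url, index, text)
    with urllib.request.urlopen(request) as response:
        return tokens_from_response(json.load(response))
```

For example, `analyze("http://localhost:9200", "my_index", "Fußballschuhe")` returns the list of tokens the `german_decompound` analyzer produces for that word, assuming a node is reachable at that address.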
93+
94+
## Help Out! ##
95+
96+
If you have suggestions for improving the German dictionary, please send
97+
a pull request, thanks! Be sure to only send "plain words", no compounds!
98+
99+
## License ##
100+
101+
See [NOTICE.txt](NOTICE.txt) for more information!
