1- .. _how-to-add-new-license-detection-rule:
1+ .. _how-to-add-modify-license-detection-rule:
22
3- How to add new license rules for enhanced detection
4- ===================================================
3+ How to add or modify license rules for enhanced detection
4+ ==========================================================
55
66ScanCode relies on license rules to detect licenses. A rule is a simple text
77file containing a license text or notice or mention with YAML frontmatter with data
88attributes that tells ScanCode which license expression to report when the text
99is detected, and other properties.
1010
11- See the :ref: ` faq ` for a high level description of adding license detection rules.
11+ .. _adding_a_new_rule:
1212
1313How to add a new license detection rule?
1414----------------------------------------
@@ -39,7 +39,6 @@ better detection.
3939For a simple `mit AND apache-2.0` license expression detection, here is an example
4040rule file::
4141
42-
4342 ---
4443 license_expression: mit AND apache-2.0
4544 is_license_notice: yes
@@ -51,36 +50,236 @@ rule file::
5150 ## License
5251 The MIT License (MIT) + Apache 2.0. Read [LICENSE](LICENSE).
5352
54- See the `` src/licensedcode/data/rules/ `` directory for many examples.
53+ .. note::
5554
56- More (advanced) rules options:
55+ Add rules in a local development installation and run `scancode-reindex-licenses`
56+ to reindex the rules, which also validates them.
5757
58- - you can use a `notes ` text field to document this rule and explain where you
59- found it first.
58+ See the ``src/licensedcode/data/rules/`` directory for many examples.
6059
61- - if no license should be detected for your .RULE text, do not add a license expression,
62- just add a ``notes `` field.
60+ The mandatory rule options are:
6361
64- - Each rule needs to have one flag to describe the type of license rule. The options are:
62+ - Every rule (except `is_false_positive` rules) must have a `license_expression`
63+ field. The license keys used in the license expression must also be
64+ present in the ScanCode LicenseDB as a `.LICENSE` file, and these license
65+ keys can be joined by the `OR`, `AND` and `WITH` operators.
66+
67+ - Each rule needs to have one flag to describe the type of license rule.
68+ You cannot use more than one flag for a rule. The options are:
6569
6670 - `is_license_notice`
6771 - `is_license_text`
6872 - `is_license_tag`
6973 - `is_license_reference`
7074 - `is_license_intro`
75+ - `is_license_clue`
76+ - `is_false_positive`
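The two mandatory options above can be checked mechanically. The following is only a minimal sketch, assuming the rule frontmatter has already been loaded into a Python mapping. The function name `check_mandatory_fields` is hypothetical and this is not how ScanCode itself validates rules (that happens when running `scancode-reindex-licenses`).

```python
# Hypothetical helper for illustration; not part of ScanCode.
RULE_TYPE_FLAGS = (
    "is_license_notice",
    "is_license_text",
    "is_license_tag",
    "is_license_reference",
    "is_license_intro",
    "is_license_clue",
    "is_false_positive",
)


def check_mandatory_fields(rule_data):
    """Return a list of error messages for a mapping of rule frontmatter data."""
    errors = []

    # Exactly one rule type flag must be set.
    flags = [flag for flag in RULE_TYPE_FLAGS if rule_data.get(flag)]
    if len(flags) != 1:
        errors.append(
            f"expected exactly one rule type flag, found {len(flags)}: {flags}"
        )

    # A license expression is required, except for false positive rules,
    # which must not carry one.
    has_expression = bool(rule_data.get("license_expression"))
    if rule_data.get("is_false_positive"):
        if has_expression:
            errors.append("false positive rules must not have a license_expression")
    elif not has_expression:
        errors.append("license_expression is required")

    return errors
```

For the example rule shown earlier, `check_mandatory_fields({"license_expression": "mit AND apache-2.0", "is_license_notice": True})` returns an empty list.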
77+
78+ Some more optional rule data fields:
79+
80+ - `minimum_coverage` is the percentage of the rule text which must be present for
81+ this rule to match a piece of text. Setting it is highly recommended for
82+ rules which are very similar to rules of another license.
83+
84+ - `relevance` is a license rule data field which signifies how relevant/important a
85+ piece of rule text is, and eventually contributes to the match score.
86+ A lower relevance means lower confidence that the `license_expression`
87+ set for this rule is correct for this rule text. This is set to `100`
88+ as a default if the rule has at least 18 words, unless
89+ otherwise set explicitly. Relevance should be set when the text does not
90+ completely represent the given license expression for the rule. For
91+ example, some rules reference just `gpl` and not a specific version
92+ of the GPL license, or some online references to a license might be
93+ modified or outdated in some cases.
7194
72- - There can also be false positive rules, which if detected in the file scanned, will not
73- be present in the result license detections. These just have the license text and a
74- `is_false_positive ` flag set to True.
95+ - if a rule is being deprecated, it should be marked by setting the `is_deprecated`
96+ data field to `True`. This can be because the license expression
97+ is adjusted/changed for the rule or the rule is promoted to being a proper
98+ license text.
7599
76- - you can specify required phrases by surrounding one or more words between the `{{ `
77- and `}} ` tags. Key phrases are words that **must ** be matched/present in order
78- for a RULE to be considered a match.
100+ .. note::
101+
102+ A rule should never be deleted entirely, only deprecated with the
103+ `is_deprecated` data field, as older versions of ScanCode could still
104+ use and link to a particular rule and this is useful to debug license
105+ detections. If a rule is deleted and its identifier is later assigned
106+ to another rule text, the same rule identifier would refer to different
107+ text in different versions of ScanCode, which is inconsistent data.
108+
109+ - you can use a `notes` text field to document this rule and explain where you
110+ found it first.
111+
112+ - if no license should be detected for your .RULE text (`is_false_positive` cases),
113+ do not add a license expression, just add a ``notes`` field.
114+
115+ - `is_continuous` should be set to `True` if a rule can only be matched as a
116+ continuous piece of text and not as approximate or partial matches.
117+
118+ - `language` should be set to a two-letter ISO 639-1 language code if the rule
119+ text is not in English. See https://en.wikipedia.org/wiki/ISO_639-1
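To make the meaning of `minimum_coverage` more concrete, here is a rough sketch of a coverage check. It is a deliberately simplified bag-of-words approximation: the function names are hypothetical, and ScanCode's real matching works on indexed token sequences rather than on this naive word lookup.

```python
# Simplified illustration only; not ScanCode's matching algorithm.
def coverage_percent(rule_text, candidate_text):
    """Return the percentage of rule words that also occur in candidate_text."""
    rule_words = rule_text.lower().split()
    if not rule_words:
        return 0.0
    candidate_words = set(candidate_text.lower().split())
    matched = sum(1 for word in rule_words if word in candidate_words)
    return 100.0 * matched / len(rule_words)


def passes_minimum_coverage(rule_text, candidate_text, minimum_coverage):
    """Return True if enough of the rule text is present in candidate_text."""
    return coverage_percent(rule_text, candidate_text) >= minimum_coverage
```

With a seven-word rule text of which four words are found in the scanned text, the coverage is about 57%, so the rule would pass with `minimum_coverage: 50` but not with `minimum_coverage: 80`.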
79120
80121See the ``src/licensedcode/models.py`` file for a list of all possible values
81122and other options.
82123
83- .. note ::
124+ False positive rules and license clues
125+ --------------------------------------
84126
85- Add rules in a local developement installation and run `scancode-reindex-licenses `
86- to make sure we reindex the rules and this validates the new licenses.
127+ `false-positive` rules
128+ ^^^^^^^^^^^^^^^^^^^^^^^
129+
130+ There can also be false positive rules, which if detected in the file scanned, will not
131+ be present in the result license detections. These just have the license text and a
132+ `is_false_positive` flag set to True. You must add some notes documenting where this
133+ false positive rule was found, as false positive rules often have a specific origin.
134+
135+ False positive rules must be very specific and should contain as many words as
136+ possible in the rule text, before and after the words which were matched wrongly. This
137+ ensures that we do not discard potentially correct matches. For example,
138+ `gpl` or other three-letter license names are sometimes detected wrongly
139+ in code, where such short strings are likely to appear. In this case we have to add a
140+ false-positive rule with the entire symbol (like a function/variable name) or
141+ entire lines of code, potentially with lines before/after.
142+
143+ `license-clue` rules
144+ ^^^^^^^^^^^^^^^^^^^^^
145+
146+ License clues are pieces of license text which are not directly related to
147+ what the license is exactly for that piece of code, but a clue to what the
148+ license terms could be.
149+
150+ Some cases of license clues are:
151+
152+ - generic permissive terms related to a license, which cannot be matched to
153+ a particular license
154+ - references to non-legal entities/names which have certain license conditions
155+ - certain statements which indicate that a license text/notice is present
156+ elsewhere, but do not say anything about what this license is
157+
158+ If a rule is categorized as a `license-clue`, the effects of this are:
159+
160+ - This license key is not represented in the `detected_license_expression`
161+ for this file
162+ - The license match is not present in the file-level or top-level
163+ `license_detections` data mapping, but present in a separate file-level
164+ `license_clues` data mapping
165+
166+ But if these license clues are present in a package-specific context, like
167+ in a file/data mapping where package licenses are declared, this is detected
168+ and reported as-is like other license detections.
169+
170+ Selecting `is-false-positive` vs `is-license-clue`
171+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
172+
173+ If a piece of text/words is not related to the license of the
174+ particular file/package it was found in at all, then this is
175+ a `false-positive` rule. If this could be related but we cannot
176+ be sure from just the text, this is a `license-clue`.
177+
178+ If a piece of license text references some code/module/package
179+ which might or might not be present in the codebase, or license
180+ conditions which might be optionally relevant, this could be
181+ useful and therefore a `license-clue`.
182+
183+ For categorizing a rule as a `false-positive` license rule, we must
184+ be sure that this piece of text cannot ever be related to the license
185+ of the code where it could be found, otherwise this is a `license-clue`
186+ or even other types of license rules.
187+
188+ See examples in the ScanCode LicenseDB of rules with these flags for
189+ more details on these data fields.
190+
191+ Required phrases in rules
192+ -------------------------
193+
194+ Required phrases are words that **must** be matched/present
195+ in order for a RULE to be considered a match.
196+
197+ Required phrases are the most important parts of a license rule text.
198+ In case of a partial match, the absence of the required phrases from the
199+ matched part of the text will in most cases result in a wrong match
200+ (or a false positive in some cases). In other words, to partially match
201+ a piece of text with a license rule, we must check the presence of the
202+ required phrases of the rule in that piece of text.
203+
204+ For example, consider the following text:
205+
206+ This program is free software: you can redistribute it and/or modify it
207+ under the terms of the {{GNU AGPL v3 License}} as published by
208+ the Free Software Foundation, version 3 of the License.
209+
210+ Here the text ``GNU AGPL v3 License`` must be present exactly in
211+ the text for a correct match. Otherwise the rule can match partially with
212+ something which is almost the same text, but an entirely different license,
213+ like ``GNU GPL v3 License``, which differs by only a character/word.
214+
215+ You can specify required phrases by surrounding one or more words between
216+ the `{{` and `}}` tags. See the example above for a required phrase
217+ marked in a rule.
218+
219+ Here are some guidelines on marking required phrases in a rule:
220+
221+ - Mark the entire essential part of the license text as a required phrase
222+ - Always include numerical versions or distinguishing parts of the license text
223+ in the required phrase
224+ - Required phrases are usually license names, alias names or other license references
225+ - License references like named local files and links to webpages which contain the
226+ license name should also be marked as required phrases
227+ - If there are multiple occurrences of the distinguishing parts or the
228+ license names, we must mark all of them as required phrases.
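The effect of required phrases can be illustrated with a small sketch. This is not ScanCode's implementation: the helper names are hypothetical, and plain substring matching stands in for ScanCode's token-based matching. It only shows how phrases marked with `{{` and `}}` could be pulled out of a rule text and checked against a candidate text.

```python
import re

# Matches the text between "{{" and "}}" markers, possibly spanning lines.
REQUIRED_PHRASE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def required_phrases(rule_text):
    """Return the required phrases marked in rule_text, whitespace-normalized."""
    return [" ".join(phrase.split()) for phrase in REQUIRED_PHRASE.findall(rule_text)]


def has_required_phrases(rule_text, candidate_text):
    """Return True if every required phrase of the rule occurs in candidate_text."""
    normalized = " ".join(candidate_text.lower().split())
    return all(phrase.lower() in normalized for phrase in required_phrases(rule_text))


rule_text = (
    "This program is free software: you can redistribute it and/or modify it\n"
    "under the terms of the {{GNU AGPL v3 License}} as published by\n"
    "the Free Software Foundation, version 3 of the License."
)
```

With this rule text, a candidate text mentioning the `GNU AGPL v3 License` passes the check, while the nearly identical `GNU GPL v3 License` wording does not.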
229+
230+
231+ Marking required phrases automatically
232+ --------------------------------------
233+
234+ Required phrases present in larger license texts are used in multiple ways on
235+ their own in scancode:
236+
237+ - To mark the same required phrase present in other license texts
238+ - As a separate license detection step for partial/unknown matches
239+ - To determine the license expression of a new piece of license text/rule to
240+ be added to the LicenseDB
241+
242+ For these reasons there are the following console scripts to automatically:
243+
244+ - mark required phrases in rules propagated from other rules
245+ - create individual new rules which are marked as required phrase in larger
246+ texts
247+
248+ See :ref:`cli-required-phrases` for the available options.
249+
250+
251+ Helper scripts to add many license rules together
252+ -------------------------------------------------
253+
254+ Adding many license rules at once from a single file with a script
255+ is beneficial because:
256+
257+ - you don't need to create separate files for each rule
258+ - there is a numerical part to differentiate rules for the same
259+ license key, and this doesn't need to be determined
260+ - rule validations (checking for inconsistent data fields) are
261+ performed and violations are displayed all at once
262+ - ignorables (copyrights, references etc) are added automatically
263+ - rules which are already present will be skipped automatically
264+
265+ For the same reasons, this can be beneficial even if you're adding only one
266+ or a couple of rules.
267+
268+ These are the locations of the rule template and script from the
269+ root of the scancode source directory:
270+
271+ - the script: `etc/scripts/licenses/buildrules.py`
272+ - the template: `etc/scripts/licenses/buildrules-template.txt`
273+ - an example template file: `etc/scripts/licenses/buildrules-example.txt`
274+
275+ These are the steps to execute the script and create rules:
276+
277+ - Start from an activated ScanCode development virtualenv.
278+ See :ref:`install-scancode-from-source`
279+ - Populate the template file with rules.
280+ See :ref:`adding_a_new_rule` for more info on adding rules
281+ - Run the script from the activated virtualenv with:
282+ `python etc/scripts/licenses/buildrules.py etc/scripts/licenses/buildrules-template.txt`
283+ - If there are any errors, fix them in the rule template and run the script again
284+ - Run `scancode-reindex-licenses` to check if the rules are being indexed properly.
285+ See :ref:`cli-scancode-reindex-licenses` for more details