Commit 3d54648

Improve how-to-guides related to license data updates
Signed-off-by: Ayan Sinha Mahapatra <asmahapatra@aboutcode.org>
1 parent 022ddc8 commit 3d54648

6 files changed

+307
-24
lines changed


docs/source/how-to-guides/how-to-add-new-license-detection-rule.rst

Lines changed: 220 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,14 @@
1-
.. _how-to-add-new-license-detection-rule:
1+
.. _how-to-add-modify-license-detection-rule:
22

3-
How to add new license rules for enhanced detection
4-
===================================================
3+
How to add or modify license rules for enhanced detection
4+
==========================================================
55

66
ScanCode relies on license rules to detect licenses. A rule is a simple text
77
file containing a license text or notice or mention with YAML frontmatter with data
88
attributes that tells ScanCode which license expression to report when the text
99
is detected, and other properties.
1010

11-
See the :ref:`faq` for a high level description of adding license detection rules.
11+
.. _adding_a_new_rule:
1212

1313
How to add a new license detection rule?
1414
----------------------------------------
@@ -39,7 +39,6 @@ better detection.
3939
For a simple `mit AND apache-2.0` license expression detection, here is an example
4040
rule file::
4141

42-
4342
---
4443
license_expression: mit AND apache-2.0
4544
is_license_notice: yes
@@ -51,36 +50,236 @@ rule file::
5150
## License
5251
The MIT License (MIT) + Apache 2.0. Read [LICENSE](LICENSE).
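
To make the layout of a rule file concrete, here is a minimal sketch that splits a rule into its data attributes and its text. It assumes the frontmatter only contains simple ``key: value`` lines between two ``---`` lines; the ``split_rule`` helper is hypothetical and is not part of ScanCode, which uses its own loader and a real YAML parser.

```python
def split_rule(text):
    """Split rule file content into (data, rule_text).

    Assumes the layout shown above: a '---' line, simple 'key: value'
    lines, a closing '---' line, then the rule text. Values are kept as
    plain strings; anything richer needs a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("rule must start with a '---' line")
    # Find the line that closes the frontmatter.
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        raise ValueError("frontmatter is missing its closing '---' line")
    data = {}
    for line in lines[1:end]:
        if line.strip():
            key, _, value = line.partition(":")
            data[key.strip()] = value.strip()
    return data, "\n".join(lines[end + 1:]).strip()


# Only the fields visible in the example above are used here.
EXAMPLE_RULE = """---
license_expression: mit AND apache-2.0
is_license_notice: yes
---

## License
The MIT License (MIT) + Apache 2.0. Read [LICENSE](LICENSE).
"""

data, rule_text = split_rule(EXAMPLE_RULE)
```

Here ``data`` holds the attributes that tell ScanCode what to report, and ``rule_text`` is the text that is matched against scanned files.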
5352

54-
See the ``src/licensedcode/data/rules/`` directory for many examples.
53+
.. note::
5554

56-
More (advanced) rules options:
55+
Add rules in a local development installation and run `scancode-reindex-licenses`
56+
to reindex the rules, which also validates them.
5757

58-
- you can use a `notes` text field to document this rule and explain where you
59-
found it first.
58+
See the ``src/licensedcode/data/rules/`` directory for many examples.
6059

61-
- if no license should be detected for your .RULE text, do not add a license expression,
62-
just add a ``notes`` field.
60+
The mandatory rules options are:
6361

64-
- Each rule needs to have one flag to describe the type of license rule. The options are:
62+
- Every rule (except `is_false_positive` rules) must have a `license_expression`
63+
field. Each license key used in the license expression must also be
64+
present in the ScanCode LicenseDB as a `.LICENSE` file. License
65+
keys can be joined with the `OR`, `AND` and `WITH` operators.
66+
67+
- Each rule needs to have one flag to describe the type of license rule.
68+
You cannot use more than one flag for a rule. The options are:
6569

6670
- `is_license_notice`
6771
- `is_license_text`
6872
- `is_license_tag`
6973
- `is_license_reference`
7074
- `is_license_intro`
75+
- `is_license_clue`
76+
- `is_false_positive`
77+
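
The two mandatory constraints above can be sketched as a small check. This is an illustration only: ``check_rule_type`` is a hypothetical helper, the rule data is assumed to be a plain dict with boolean flags, and ScanCode performs its own validation when the rules are reindexed.

```python
# Rule-type flags listed above; exactly one must be set per rule.
RULE_TYPE_FLAGS = (
    "is_license_notice",
    "is_license_text",
    "is_license_tag",
    "is_license_reference",
    "is_license_intro",
    "is_license_clue",
    "is_false_positive",
)


def check_rule_type(data):
    """Return a list of error messages for a rule data mapping (hypothetical)."""
    errors = []
    set_flags = [flag for flag in RULE_TYPE_FLAGS if data.get(flag)]
    if len(set_flags) != 1:
        errors.append(f"expected exactly one rule-type flag, found: {set_flags}")

    is_false_positive = "is_false_positive" in set_flags
    has_expression = bool(data.get("license_expression"))
    # Every rule except false positives needs a license expression.
    if not is_false_positive and not has_expression:
        errors.append("license_expression is required")
    # False positive rules carry notes, not a license expression.
    if is_false_positive and has_expression:
        errors.append("is_false_positive rules must not have a license_expression")
    return errors
```

For example, ``check_rule_type({"license_expression": "mit", "is_license_notice": True})`` returns an empty list, while a rule with two flags set returns one error.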
78+
Some more optional rule data fields:
79+
80+
- `minimum_coverage` is the percentage of the rule text that must be present for
81+
this rule to match a piece of text. Setting it is highly recommended for
82+
rules that are highly similar to rules of another license.
83+
84+
- `relevance` is a rule data field that signifies how relevant/important a
85+
piece of rule text is, and eventually contributes to the match score.
86+
A lower relevance means lower confidence that the `license_expression`
87+
set for this rule is correct for this rule text. This is set to `100`
88+
by default if the rule has at least 18 words, unless
89+
set explicitly otherwise. Relevance should be set when the text does not
90+
completely represent the given `license_expression` for the rule. For
91+
example, some rules reference just `gpl` and not a specific version
92+
of the GPL license, or some online references to a license might be
93+
modified or outdated.
7194

72-
- There can also be false positive rules, which if detected in the file scanned, will not
73-
be present in the result license detections. These just have the license text and a
74-
`is_false_positive` flag set to True.
95+
- If a rule is being deprecated, mark it by setting the `is_deprecated`
96+
data field to `True`. This can be because the `license_expression`
97+
is adjusted/changed for the rule or the rule is promoted to being a proper
98+
license text.
7599

76-
- you can specify required phrases by surrounding one or more words between the `{{`
77-
and `}}` tags. Key phrases are words that **must** be matched/present in order
78-
for a RULE to be considered a match.
100+
.. note::
101+
102+
A rule should never be deleted entirely, only deprecated with the
103+
`is_deprecated` data field, because older versions of ScanCode could still
104+
use and link to that rule, which is useful when debugging license
105+
detections. If a rule is deleted and its identifier is later assigned
106+
to another rule text, the same identifier would refer to different
107+
texts in different versions of ScanCode, making the data inconsistent.
108+
109+
- you can use a `notes` text field to document this rule and explain where you
110+
found it first.
111+
112+
- if no license should be detected for your .RULE text (`is_false_positive` cases),
113+
do not add a license expression, just add a ``notes`` field.
114+
115+
- `is_continuous` should be set to `True` if a rule can only be matched as a
116+
continuous piece of text and not as an approximate or partial match.
117+
118+
- `language` should be set to a two-letter ISO 639-1 language code if the rule
119+
text is not in English. See https://en.wikipedia.org/wiki/ISO_639-1
79120

80121
See the ``src/licensedcode/models.py`` file for a list of all possible values
81122
and other options.
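
As an illustration of the `minimum_coverage` field described above, the following sketch shows the kind of check it implies. Counting plain words is an assumption made for this example, and ``meets_minimum_coverage`` is a hypothetical helper; ScanCode computes coverage on its own tokens during matching.

```python
def meets_minimum_coverage(matched_words, rule_words, minimum_coverage):
    """Return True if the matched share of the rule reaches minimum_coverage.

    matched_words: number of rule words found in the scanned text
    rule_words: total number of words in the rule text
    minimum_coverage: required percentage, from 0 to 100
    """
    if rule_words <= 0:
        raise ValueError("rule_words must be positive")
    coverage = 100 * matched_words / rule_words
    return coverage >= minimum_coverage
```

With ``minimum_coverage`` set to 80, a match covering 8 of 10 rule words is kept and a match covering 7 of 10 is discarded.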
82123

83-
.. note::
124+
False positive rules and license clues
125+
--------------------------------------
84126

85-
Add rules in a local developement installation and run `scancode-reindex-licenses`
86-
to make sure we reindex the rules and this validates the new licenses.
127+
`false-positive` rules
128+
^^^^^^^^^^^^^^^^^^^^^^^
129+
130+
There can also be false positive rules, which if detected in the file scanned, will not
131+
be present in the result license detections. These just have the license text and a
132+
`is_false_positive` flag set to True. You must add some notes documenting where this
133+
false positive rule was found as false positive rules often have a specific origin.
134+
135+
False positive rules must be very specific, and should contain as many words in the
136+
rule text as possible, before and after the words which were matched wrongly. This
137+
ensures that we do not discard potentially correct matches. For example,
138+
`gpl` or other three-letter license names are sometimes wrongly detected
139+
in code, where such short strings are likely to appear. In this case we add a
140+
false-positive rule with the entire symbol (like a function/variable name) or
141+
entire lines of code, potentially with lines before/after.
142+
143+
`license-clue` rules
144+
^^^^^^^^^^^^^^^^^^^^^
145+
146+
License clues are pieces of license text which are not directly related to
147+
what the license is exactly for that piece of code, but a clue to what the
148+
license terms could be.
149+
150+
Some cases of license clues are:
151+
152+
- generic permissive terms related to the license, but cannot be matched to
153+
a particular license
154+
- references to non-legal entities/names which have certain license conditions
155+
- certain statements which indicate that a license text/notice is present
156+
elsewhere, but does not say anything about what this license is
157+
158+
If a rule is categorized as a `license-clue` the effects of this are:
159+
160+
- This license key is not represented in the `detected_license_expression`
161+
for this file
162+
- The license match is not present in the file-level or top-level
163+
`license_detections` data mapping, but present in a separate file-level
164+
`license_clues` data mapping
165+
166+
But if these license clues are present in a package-specific context, like
167+
in a file/data mapping where package licenses are declared, this is detected
168+
and reported as-is like other license detections.
169+
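
The reporting behavior described above can be sketched as follows. The match dictionaries and the ``partition_matches`` helper are hypothetical and only mirror the field names used in this guide; they are not ScanCode's internal data structures.

```python
def partition_matches(matches, in_package_context=False):
    """Split matches into (license_detections, license_clues).

    matches: list of dicts with an optional 'is_license_clue' flag
    (an assumed shape for this example).
    in_package_context: when True, clues are reported like any other
    detection, as happens where package licenses are declared.
    """
    detections, clues = [], []
    for match in matches:
        if match.get("is_license_clue") and not in_package_context:
            clues.append(match)
        else:
            detections.append(match)
    return detections, clues


matches = [
    {"license_expression": "mit", "is_license_clue": False},
    # 'some-clue-key' is a made-up license key for illustration.
    {"license_expression": "some-clue-key", "is_license_clue": True},
]
file_detections, file_clues = partition_matches(matches)
package_detections, package_clues = partition_matches(matches, in_package_context=True)
```

In the file-level case the clue ends up only in ``file_clues``; in the package context both matches are reported as detections.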
170+
Selecting `is_false_positive` vs `is_license_clue`
171+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
172+
173+
If a piece of text/words is not related to the license of the
174+
particular file/package it was found in at all, then this is
175+
a `false-positive` rule. If this could be related but we cannot
176+
be sure from just the text, this is a `license-clue`.
177+
178+
If a piece of license text references some code/module/package
179+
which might or might not be present in the codebase, or license
180+
conditions which might be optionally relevant, this could be
181+
useful and therefore a `license-clue`.
182+
183+
For categorizing a rule as a `false-positive` license rule, we must
184+
be sure that this piece of text cannot ever be related to the license
185+
of the code where it could be found, otherwise this is a `license-clue`
186+
or even other types of license rules.
187+
188+
See examples in scancode licenseDB of rules with these tags for
189+
more details on these data fields.
190+
191+
Required phrases in rules
192+
-------------------------
193+
194+
Required phrases are words that **must** be matched/present
195+
in order for a RULE to be considered a match.
196+
197+
Required phrases are the most important parts of a license rule text,
198+
and in case of a partial match, absence of the required phrases in the
199+
matched part of the text in most cases will result in a wrong match
200+
(or a false positive in some cases). In other words, to partially match
201+
a piece of text with a license rule, we must check the presence of the
202+
required phrases of the rule in that piece of text.
203+
204+
For example consider the following text:
205+
206+
This program is free software: you can redistribute it and/or modify it
207+
under the terms of the {{GNU AGPL v3 License}} as published by
208+
the Free Software Foundation, version 3 of the License.
209+
210+
Here the text ``GNU AGPL v3 License`` must be present exactly in
211+
the text for a correct match. Otherwise the rule can partially match
212+
text that is almost the same but refers to an entirely different license,
213+
like ``GNU GPL v3 License``, which differs by only one character.
214+
215+
You can specify required phrases by surrounding one or more words between
216+
the `{{` and `}}` tags. See the example above for a required phrase
217+
marked in a rule.
218+
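
To illustrate the `{{` and `}}` markers, here is a minimal sketch that extracts required phrases from a rule text and checks whether a candidate text contains them. It uses a regular expression and case-insensitive substring comparison, which is an assumption made for this example; ScanCode works on tokens, and both helper functions are hypothetical.

```python
import re

# Text between '{{' and '}}', matched non-greedily so several phrases
# in one rule are captured separately.
REQUIRED_PHRASE = re.compile(r"\{\{(.+?)\}\}", re.DOTALL)


def required_phrases(rule_text):
    """Return the required phrases of a rule, with whitespace normalized."""
    return [" ".join(phrase.split()) for phrase in REQUIRED_PHRASE.findall(rule_text)]


def contains_required_phrases(rule_text, candidate):
    """Return True if every required phrase of the rule occurs in candidate."""
    normalized = " ".join(candidate.split()).lower()
    return all(phrase.lower() in normalized for phrase in required_phrases(rule_text))


RULE_TEXT = """This program is free software: you can redistribute it and/or modify it
under the terms of the {{GNU AGPL v3 License}} as published by
the Free Software Foundation, version 3 of the License."""

agpl_text = "you can modify it under the terms of the GNU AGPL v3 License as published"
gpl_text = "you can modify it under the terms of the GNU GPL v3 License as published"

agpl_ok = contains_required_phrases(RULE_TEXT, agpl_text)
gpl_ok = contains_required_phrases(RULE_TEXT, gpl_text)
```

The AGPL text satisfies the rule, while the nearly identical GPL text does not, which is the distinction a required phrase is meant to protect.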
219+
Here are some guidelines on marking required phrases in a rule:
220+
221+
- Mark the entire essential part of the license text as a required phrase
222+
- Always include numerical versions or distinguishing parts of the license text
223+
in the required phrase
224+
- Required phrases are usually license names, alias names or other license references
225+
- License references like named local files and links to webpages which contain the
226+
license name should also be marked as required phrases
227+
- If there are multiple occurrences of the distinguishing parts or the
228+
license names, we must mark all of them as required phrases.
229+
230+
231+
Marking required phrases automatically
232+
--------------------------------------
233+
234+
Required phrases present in larger license texts are used in multiple ways on
235+
their own in scancode:
236+
237+
- To mark the same required phrase present in other license texts
238+
- Used as a separate license detection step for partial/unknown matches
239+
- Determine the license expression of a new piece of license text/rule to
240+
be added to licensedb
241+
242+
For these reasons there are the following console scripts to automatically:
243+
244+
- mark required phrases in rules propagated from other rules
245+
- create individual new rules which are marked as required phrase in larger
246+
texts
247+
248+
See :ref:`cli-required-phrases` for the available options.
249+
250+
251+
Helper scripts to add many license rules together
252+
-------------------------------------------------
253+
254+
Adding many license rules at once from a single file with a script
255+
is beneficial because:
256+
257+
- you don't need to create separate files for each rule
258+
- there is a numerical part to differentiate rules for the same
259+
license key, and this doesn't need to be determined
260+
- rule validations (checking for inconsistent data fields) are
261+
performed and violations are displayed all at once
262+
- ignorables (copyrights, references etc) are added automatically
263+
- rules which are already present will be skipped automatically
264+
265+
For the same reasons, this is useful even if you are adding only one or
266+
two rules.
267+
268+
These are the locations of the rule template and script from the
269+
root of the scancode source directory:
270+
271+
- the script: `etc/scripts/licenses/buildrules.py`
272+
- the template: `etc/scripts/licenses/buildrules-template.txt`
273+
- an example template file: `etc/scripts/licenses/buildrules-example.txt`
274+
275+
These are the steps to execute the script and create rules:
276+
277+
- Start from an activated ScanCode development virtualenv
278+
See :ref:`install-scancode-from-source`
279+
- Populate the template file with rules
280+
See :ref:`adding_a_new_rule` for more info on adding rules
281+
- Run the script from the activated virtualenv with:
282+
`python etc/scripts/licenses/buildrules.py etc/scripts/licenses/buildrules-template.txt`
283+
- If there are any errors, fix them in the rule template and run the script again
284+
- Run `scancode-reindex-licenses` to check if the rules are being indexed properly
285+
See :ref:`cli-scancode-reindex-licenses` for more details
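
The last steps above can be sketched as a small driver that runs the script and reindexes only if the script succeeded. The two commands are the ones given in the steps; the ``add_rules`` wrapper is hypothetical, and it assumes the script exits with a non-zero status when it reports errors, which is not confirmed here.

```python
import subprocess
import sys

# Commands from the steps above, run from the root of the source checkout
# inside the activated virtualenv.
BUILD_RULES = [
    sys.executable,
    "etc/scripts/licenses/buildrules.py",
    "etc/scripts/licenses/buildrules-template.txt",
]
REINDEX = ["scancode-reindex-licenses"]


def add_rules(run=subprocess.run):
    """Build the rules, then reindex; stop at the first failing command.

    The 'run' parameter exists so the sketch can be exercised without
    executing anything. Returns True if both commands succeeded.
    """
    for command in (BUILD_RULES, REINDEX):
        result = run(command)
        if result.returncode != 0:
            return False
    return True
```

Calling ``add_rules()`` with the default ``run`` executes both commands in order, so it should only be done from a development checkout.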

docs/source/how-to-guides/how-to-install-new-license-plugin.rst

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ Finally, the same directory containing the class definition must also contain th
7070
licenses and/or rules. Licenses must be contained in a directory called ``licenses`` and rules
7171
must be contained in a directory called ``rules``.
7272

73-
See :ref:`how-to-add-new-license` and :ref:`how-to-add-new-license-detection-rule` to understand
73+
See :ref:`how-to-add-new-license` and :ref:`how-to-add-modify-license-detection-rule` to understand
7474
the structure of license and rule files, respectively.
7575

7676
After creating this plugin, you can upload it to PyPI so that others can use it, or you can

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ Learn via practical step-by-step guides.
5757
Helps you accomplish things.
5858

5959
- :ref:`how-to-add-new-license`
60-
- :ref:`how-to-add-new-license-detection-rule`
60+
- :ref:`how-to-add-modify-license-detection-rule`
6161
- :ref:`how-to-install-new-license-plugin`
6262
- :ref:`how-to-generate-attribution-docs`
6363

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
.. _cli-required-phrases:
2+
3+
scancode required phrases related CLI commands
4+
==============================================
5+
6+
Required phrases are words that **must** be matched/present
7+
in order for a RULE to be considered a match.
8+
9+
Due to the large number of rules present and a large
10+
volume of rules being added regularly to the licensedb, these
11+
CLI options automatically mark and propagate required phrases
12+
in rules and add these required phrases as individual rules
13+
to be used in scancode license detection.
14+
15+
**Quick Reference**
16+
-------------------
17+
18+
Usage: add-required-phrases [OPTIONS]
19+
20+
Update license detection rules with new "required phrases" to improve rules
21+
detection accuracy.
22+
23+
Options:
24+
-o, --from-other-rules Propagate existing required phrases from
25+
other rules to all selected rules. Mutually
26+
exclusive with --from-license-attributes.
27+
-a, --from-license-attributes Propagate license attributes as required
28+
phrases to all selected rules. Mutually
29+
exclusive with --from-other-rules.
30+
-l, --license-expression STRING
31+
Optional license expression filter. If
32+
provided, only consider the rules that are
33+
using this expression. Otherwise, process
34+
all rules. Example: `apache-2.0`.
35+
--validate Validate that all rules and licenses and
36+
rules are consistent, for all rule
37+
languages. For this validation, run a mock
38+
indexing. The regenerated index is not saved
39+
to disk.
40+
-r, --reindex Recreate and cache the licenses index with
41+
updated rules at the end.
42+
-w, --write-phrase-source In modified rule files, write the source
43+
field to trace the source of required
44+
phrases applied to that rule.
45+
-d, --delete-phrase-source In rule files, delete the source extra debug
46+
data used to trace source of phrases.
47+
--dry-run Do not save rules.
48+
-v, --verbose Print verbose logging information.
49+
-h, --help Show this message and exit.
50+
51+
Usage: gen-new-required-phrases-rules [OPTIONS]
52+
53+
Create new license detection rules from "required phrases" in existing
54+
rules. Also update existing rules with "is_required_phrase" if they are
55+
"required phrases" but are not tagged as such.
56+
57+
Options:
58+
-l, --license-expression STRING
59+
Optional license expression filter. If
60+
provided, only consider the rules that are
61+
using this expression. Otherwise, process
62+
all rules. Example: `apache-2.0`.
63+
--max-count INT Optional maximum count of rules to process.
64+
If provided as a non-zero value, stop after
65+
processing this count of rules.
66+
-r, --reindex Recreate and cache the licenses index with
67+
updated rules at the end.
68+
--validate Validate that all rules and licenses and
69+
rules are consistent, for all rule
70+
languages. For this validation, run a mock
71+
indexing. The regenerated index is not saved
72+
to disk.
73+
--min-tokens INT Minimum number of tokens in the text used to
74+
generate a 'good' new rule.
75+
--min-single-token-len INT Minimum length of the token in a single-word
76+
rule text used to generate a 'good' new
77+
rule.
78+
--update-only Do not create new rules, only update
79+
existing rules.
80+
-v, --verbose Print verbose logging information.
81+
-h, --help Show this message and exit
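
As a usage illustration for ``add-required-phrases``, the sketch below assembles an argument list from the options documented above and enforces the documented mutual exclusion between ``--from-other-rules`` and ``--from-license-attributes``. The ``add_required_phrases_args`` helper is hypothetical and only builds the command; it does not reflect how the CLI itself is implemented.

```python
def add_required_phrases_args(
    from_other_rules=False,
    from_license_attributes=False,
    license_expression=None,
    dry_run=True,
):
    """Build an 'add-required-phrases' argument list (hypothetical helper)."""
    if from_other_rules and from_license_attributes:
        raise ValueError(
            "--from-other-rules and --from-license-attributes are mutually exclusive"
        )
    args = ["add-required-phrases"]
    if from_other_rules:
        args.append("--from-other-rules")
    if from_license_attributes:
        args.append("--from-license-attributes")
    if license_expression:
        # Restrict processing to rules using this license expression.
        args += ["--license-expression", license_expression]
    if dry_run:
        # Do not save rules; useful for a first trial run.
        args.append("--dry-run")
    return args


command = add_required_phrases_args(from_other_rules=True, license_expression="apache-2.0")
```

Defaulting to ``--dry-run`` in this sketch is a deliberate choice so that a first run does not modify rule files.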

docs/source/reference/scancode-cli/index.rst

Lines changed: 1 addition & 0 deletions
@@ -89,3 +89,4 @@ For more details into the post-scan CLI options, see :ref:`cli-post-scan-options
8989
cli-scancode-reindex-licenses
9090
cli-scancode-license-data
9191
cli-scancode-train-gibberish-model
92+
cli-required-phrases

src/licensedcode/models.py

Lines changed: 3 additions & 1 deletion
@@ -1353,7 +1353,9 @@ class BasicRule:
13531353
default=True,
13541354
repr=False,
13551355
metadata=dict(
1356-
help='Flag set to True if this is a builtin standard rule.')
1356+
help='Flag set to True if this is a builtin standard rule and '
1357+
'not a synthetic rule created for a specific matched text.'
1358+
)
13571359
)
13581360

13591361
# The is_license_xxx flags below are an indication of what this rule
