+ Overview
+ ========
+
+ This ``licensedcode`` module has utilities to accurately detect a vast array
+ of open-source and proprietary licenses. It manages a comprehensive
+ database of license texts, patterns, and rules, enabling ScanCode to
+ perform scans and provide precise license conclusions.
+
+ Key Functionality
+ -----------------
+
+ * License Rule Management: Stores and manages a large collection of
+   license rules, including full texts, snippets, and regular expressions.
+
+ * Pattern Matching: Implements sophisticated algorithms for matching
+   detected code against known license patterns and texts.
+
+ * License Detection Logic: Contains the core logic for processing scan
+   input, applying rules, and determining the presence and type of
+   licenses.
+
+ * Rule-based Detection: Utilizes a robust system of rules to identify
+   licenses even when only fragments or variations of license texts are
+   present.
+
+ * License Expression Parsing: Supports the parsing and interpretation of
+   complex license expressions (e.g., "MIT AND Apache-2.0").
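A minimal sketch of that parsing step for a flat expression. The helper name and the splitting approach are illustrative only; ScanCode itself relies on a dedicated license-expression parser that also handles nesting and `WITH` clauses.

```python
import re

def parse_expression(expression):
    """Split a flat license expression into license keys and operators.

    Hypothetical helper, not ScanCode's actual parser: splits on AND/OR
    keeping the operators via a capturing group.
    """
    tokens = re.split(r"\s+(AND|OR)\s+", expression.strip())
    keys = tokens[0::2]       # license keys: "MIT", "Apache-2.0", ...
    operators = tokens[1::2]  # operators between them: "AND", "OR", ...
    return keys, operators

keys, operators = parse_expression("MIT AND Apache-2.0")
# keys == ["MIT", "Apache-2.0"], operators == ["AND"]
```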
+
+
+ How It Works (High-Level)
+ -------------------------
+
+ At a high level, the ``licensedcode`` module operates by:
+
+ 1. Loading License Data: It initializes by loading a curated set of
+    license texts, short license identifiers, and detection rules from its
+    internal data store.
+
+ 2. Scanning Input: When ScanCode processes a file or directory, the
+    content is converted into an internal representation (a "query").
+
+ 3. Applying Rules: The module then applies its extensive set of rules and
+    patterns to the input content through a multi-stage pipeline, looking
+    for matches.
+
+ 4. Reporting Detections: Upon successful matches, it reports the
+    identified licenses, their confidence levels, and the exact locations
+    (lines, characters) where they were found.
+
+ For a more in-depth understanding of the underlying technical principles
+ and the detection pipeline, please refer to the sections below.
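The four steps above can be reduced to a toy end-to-end sketch. The names, the single rule, and the word-overlap matching are illustrative assumptions, not the module's real API or algorithm.

```python
import re

# Hypothetical rule store mapping a license key to a reference text.
RULES = {"mit": "permission is hereby granted free of charge"}

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def detect(scanned_text):
    """Report (license_key, percent_of_rule_matched) for each rule hit."""
    query = set(tokenize(scanned_text))
    detections = []
    for license_key, rule_text in RULES.items():
        rule = tokenize(rule_text)
        matched = sum(1 for token in rule if token in query)
        if matched:
            detections.append((license_key, round(100.0 * matched / len(rule), 1)))
    return detections

detect("Permission is hereby granted, free of charge, to any person")
# -> [("mit", 100.0)]
```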
+
+
+
ScanCode license detection overview and key design elements
- ===========================================================
+ -----------------------------------------------------------

License detection involves identifying commonalities between the text of a
scanned query file and the indexed license and rule texts. The process
@@ -17,7 +71,7 @@ rather than strings.


Rules and licenses
- ------------------
+ ^^^^^^^^^^^^^^^^^^

The detection uses an index of reference license texts along with a set
of "rules" that are common notices or mentions of these licenses. One
@@ -28,7 +82,7 @@ resemblance and containment of the matched texts.


Words as integers
- -----------------
+ ^^^^^^^^^^^^^^^^^

A dictionary that maps words to a unique integer is used to transform the
words of a scanned text "query", as well as the words in the indexed license
@@ -78,7 +132,7 @@ With integers, we can be faster:
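A simplified sketch of that word-to-integer mapping (the real index builds its vocabulary from all rule texts; `None` for unknown words is an assumption here):

```python
def build_vocabulary(texts):
    """Assign each distinct word a unique integer id, in first-seen order."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

def to_token_ids(text, vocabulary):
    """Convert a text to token ids; None marks words unknown to the index."""
    return [vocabulary.get(word) for word in text.lower().split()]

vocab = build_vocabulary(["permission is hereby granted"])
to_token_ids("permission is granted now", vocab)
# -> [0, 1, 3, None]
```

Comparing small integers is much cheaper than comparing strings, which is what makes the downstream matching fast.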


Common/junk tokens
- ------------------
+ ^^^^^^^^^^^^^^^^^^

The quality and speed of detection are supported by classifying each word as
either good/discriminant or common/junk. Junk tokens are either very
@@ -89,7 +143,7 @@ referred to as low (junk) tokens and high (good) tokens.
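A sketch of such a split, assuming a small hypothetical stop-word list stands in for the frequency statistics the real index computes:

```python
# Hypothetical set of very common words treated as junk tokens.
COMMON_WORDS = {"the", "of", "and", "or", "to", "a", "is", "in"}

def split_tokens(words):
    """Partition words into low (junk) and high (good) sets."""
    low = {w for w in words if w in COMMON_WORDS}
    high = set(words) - low
    return low, high

low, high = split_tokens(["the", "license", "is", "permissive"])
# low == {"the", "is"}, high == {"license", "permissive"}
```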


Query processing
- ----------------
+ ^^^^^^^^^^^^^^^^

When a file is scanned, it is first converted to a query object, which is a list of
integer token ids. A query is further broken down into slices (a.k.a. query runs) based
@@ -101,7 +155,7 @@ matching.
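That slicing can be sketched as follows, assuming (as an illustration) that a run breaks after a gap of several consecutive unknown tokens, represented by `None` ids:

```python
def query_runs(token_ids, max_gap=2):
    """Break a token id sequence into runs at long gaps of unknown tokens."""
    runs, current, gap = [], [], 0
    for token_id in token_ids:
        if token_id is None:
            gap += 1
            if gap > max_gap and current:
                runs.append(current)
                current = []
        else:
            current.append(token_id)
            gap = 0
    if current:
        runs.append(current)
    return runs

query_runs([5, 7, None, None, None, 9, 4])
# -> [[5, 7], [9, 4]]
```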


Matching pipeline
- -----------------
+ ^^^^^^^^^^^^^^^^^

The matching pipeline consists of:

@@ -133,14 +187,16 @@ The matching pipeline consists of:
significantly. The number of multiple local sequence alignments that are required
in this step is also made much smaller by the pre-matching done using sets.
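The set-based pre-matching can be sketched as a Jaccard resemblance filter over token id sets; the function names and the 0.5 threshold are illustrative assumptions:

```python
def resemblance(query_ids, rule_ids):
    """Jaccard resemblance between two token id collections."""
    query_set, rule_set = set(query_ids), set(rule_ids)
    union = query_set | rule_set
    return len(query_set & rule_set) / len(union) if union else 0.0

def candidate_rules(query_ids, rules, threshold=0.5):
    """Keep only rules resembling the query enough to warrant alignment."""
    return [name for name, ids in rules.items()
            if resemblance(query_ids, ids) >= threshold]

candidate_rules([1, 2, 3, 4], {"mit": [1, 2, 3, 9], "gpl": [7, 8, 9]})
# -> ["mit"]
```

Only the surviving candidates go through the expensive local sequence alignment step.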

- - finally all the collected matches are merged, refined and filtered to yield the
- final results. The merging considers the ressemblance, containment and overlap
- between scanned texts and the matched texts and several secondary factors.
- Filtering is based on the density and length of matches as well as the number of
- good or frequent tokens matched.
- Last, each match receives a score which base on the length of the rule text
- and how of this rule text was matched. Optionally we can also collect the exact
- matched texts and identify which portions were not matched for each instance.
+ - finally, all the collected matches are merged, refined and filtered to
+   yield the final results. The merging considers the resemblance,
+   containment and overlap between scanned texts and the matched texts and
+   several secondary factors. Filtering is based on the density and length
+   of matches as well as the number of good or frequent tokens matched.
+   Lastly, each match receives a score calculated based on the sum of the
+   underlying match scores, weighted by the length of the match relative to
+   the overall detection length. Optionally, we can also collect the exact
+   matched texts and identify which portions were not matched for each
+   instance.
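The length-weighted scoring described above can be sketched as follows, assuming each match is reduced to a (score, matched_length) pair:

```python
def detection_score(matches):
    """Average per-match scores weighted by each match's length.

    matches: list of (score, length) pairs; an illustrative reduction of
    the real match objects.
    """
    total_length = sum(length for _, length in matches)
    if not total_length:
        return 0.0
    weighted = sum(score * length for score, length in matches)
    return weighted / total_length

detection_score([(100.0, 60), (80.0, 20)])
# -> 95.0
```

A long, high-scoring match thus dominates the detection score over a short, weak one.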



Comparison with other tools approaches