ScanCode license detection overview and key design elements
===========================================================

License detection involves identifying commonalities between the text of a
scanned query file and the indexed license and rule texts. The process
prioritizes accuracy over speed.

Ideally, we want to find the best alignment possible between two texts so we
know exactly where they match: the scanned text and one or more of the many
license texts. We settle for good alignments rather than optimal alignments,
while still returning accurate and correct matches in a reasonable amount of
time.

Correctness is essential but efficiency matters too: both in terms of speed
and memory usage. One key to efficient matching is to process whole words
instead of characters, and to represent words internally using integers
rather than strings.


Rules and licenses
------------------

The detection uses an index of reference license texts along with a set of
"rules" that are common notices or mentions of these licenses. One challenge
in detection is that a license reference can be very short, as in "this is
GPL", or very long, as in a full license text for the GPLv3. To cope with
this, we use different matching strategies and also compute both the
resemblance and containment of the matched texts.
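
To make the resemblance and containment idea concrete, here is a minimal
sketch (the function names are illustrative, not ScanCode's actual API) that
models each text as a set of tokens:

```python
# Minimal sketch, not ScanCode's actual API: texts are modeled as sets of
# tokens so that resemblance and containment become simple set ratios.

def resemblance(query_tokens, rule_tokens):
    """Jaccard resemblance: |Q & R| / |Q | R|."""
    if not query_tokens or not rule_tokens:
        return 0.0
    return len(query_tokens & rule_tokens) / len(query_tokens | rule_tokens)

def containment(query_tokens, rule_tokens):
    """Fraction of the rule tokens found in the query: |Q & R| / |R|."""
    if not rule_tokens:
        return 0.0
    return len(query_tokens & rule_tokens) / len(rule_tokens)

# A short mention such as "this is GPL" can be fully contained in a larger
# scanned text (containment of 1.0) while resembling it only weakly.
query = {"this", "is", "gpl", "with", "many", "other", "unrelated", "words"}
rule = {"this", "is", "gpl"}
```

Containment is what makes a tiny notice detectable inside a big file, while
resemblance helps rank near-complete matches of long license texts.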


Words as integers
-----------------

A dictionary that maps each word to a unique integer is used to transform
the words of a scanned text "query", as well as the words of the indexed
license texts and rules, to numbers. This is possible because we have a
limited number of words across all the license texts (about 15K). We further
assign these ids to words such that very common words have a low id, while
less frequent, more distinctive words have a higher id. A threshold is
defined for this id range such that very common words below the threshold
cannot, by themselves, form a valid license text or reference.
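
The id assignment can be sketched as follows (a simplified illustration, not
ScanCode's actual indexing code; the sample texts and the junk threshold rule
are made up): sort words by descending frequency so that the most common
words get the lowest ids, then record the threshold below which a token id is
considered junk.

```python
from collections import Counter

# Simplified sketch of id assignment (not ScanCode's actual code): the most
# frequent words get the lowest ids, and every id below len_junk is junk.

texts = [
    "this software is licensed under the gpl",
    "licensed under the mit license",
    "the gpl is a license",
]
frequencies = Counter(word for text in texts for word in text.split())

# Most common words first -> lowest ids.
words_by_frequency = [w for w, _count in frequencies.most_common()]
token_ids = {word: tid for tid, word in enumerate(words_by_frequency)}

# Hypothetical junk threshold: here, any word occurring more than once.
len_junk = sum(1 for _w, count in frequencies.most_common() if count > 1)

def tokenize(text):
    """Convert a text to the list of its token ids."""
    return [token_ids[w] for w in text.split() if w in token_ids]
```

From here on, both the query and the indexed rules are plain sequences of
small integers.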

Once that mapping is applied, the detection process deals only with integers
in two dimensions:

- the token ids (and whether they are in the high or low range).
- their positions in the query (qpos) and the indexed rule (ipos).

We also use an integer id for a rule.

From this point, all operations are performed on lists, arrays or sets of
integers in defined ranges.

Matches are reduced to sets of integers referred to as "Spans":

- matched positions on the query side
- matched positions on the index side
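
As an illustration (hypothetical values; ScanCode's actual Span is a richer
structure), a pair of such position sets and their typical operations look
like this:

```python
# Hypothetical sketch: a match keeps one set of positions per side.
qspan = {4, 5, 6, 7}    # matched token positions in the query (qpos)
ispan = {0, 1, 2, 3}    # matched token positions in the rule (ipos)

# Overlap between two query-side matches is just set intersection.
other_qspan = {6, 7, 8, 9}
overlap = qspan & other_qspan

# Match length is simply the set size.
matched_length = len(qspan)
```

Merging, overlap and containment checks on matches all reduce to these
integer set operations.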

By using integers within known ranges throughout the process, several
operations are simplified to comparisons and intersections of integers,
integer sets, or lists. These operations are faster and more easily
optimized.

With integers, we use less memory:

- we can use arrays of unsigned 16 bits ints that store each number on two
  bytes rather than bigger lists of ints.
- we can replace dictionaries by sparse lists or arrays where the index is
  an integer key.
- we can use succinct, bit level representations (e.g. bitmaps) of integer
  sets.
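
For instance, Python's standard library ``array`` module packs unsigned
16-bit ints (the "H" typecode) into two bytes each, versus a pointer plus a
full int object per element in a plain list:

```python
from array import array

# Unsigned 16-bit ints ("H" typecode) occupy exactly two bytes each, while
# a plain Python list stores a pointer to a full int object per element.
token_ids = array("H", [0, 5, 17, 1024, 65535])

packed_payload = len(token_ids) * token_ids.itemsize  # 5 values * 2 bytes

# The same values as a list of Python ints use far more memory overall.
as_list = list(token_ids)
```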

Smaller data structures also mean faster processing, as processors need to
move less data in memory.

With integers, we can be faster:

- a dict key lookup is slower than a list or array index lookup.
- processing large lists of small structures (such as bitmaps) is faster.
- we can leverage libraries that speed up integer set operations.


Common/junk tokens
------------------

The quality and speed of detection is supported by classifying each word as
either good/discriminant or common/junk. Junk tokens are either very
frequent tokens or tokens that, even combined, cannot form a valid license
mention or notice. When a numeric id is assigned to a token during initial
indexing, junk tokens are assigned a lower id than good tokens. These are
referred to as low (junk) tokens and high (good) tokens.
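
With ids assigned this way, the junk test reduces to a single integer
comparison. A minimal sketch (the ``len_junk`` value and function names are
illustrative, not ScanCode's actual API):

```python
# Illustrative threshold: ids below len_junk are junk/low, the rest are
# good/high.
len_junk = 100

def is_junk(token_id):
    """A token is junk (low) when its id falls below the junk threshold."""
    return token_id < len_junk

def is_matchable(token_ids):
    """A run of tokens is matchable only if it has at least one good token:
    junk tokens alone cannot form a valid license mention."""
    return any(not is_junk(tid) for tid in token_ids)
```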


Query processing
----------------

When a file is scanned, it is first converted to a query object which is a
list of integer token ids. A query is further broken down into slices
(a.k.a. query runs) based on heuristics.

While the query is processed, a set of matched and matchable positions for
high and low token ids is kept to track what is left to do in matching.
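
This bookkeeping can be sketched with plain integer sets (a simplified
illustration, not ScanCode's actual query code):

```python
# Simplified sketch: track which query positions are still available for
# matching; each confirmed match consumes its positions.
query_token_ids = [12, 7, 310, 455, 7, 98]
matchable = set(range(len(query_token_ids)))  # all positions to start
matched = set()

def record_match(positions):
    """Consume matched query positions so later rounds skip them."""
    matched.update(positions)
    matchable.difference_update(positions)

record_match({2, 3})  # say positions 2 and 3 matched a rule
```

When ``matchable`` is empty (or holds only junk positions), nothing is left
to match and the pipeline can stop early.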


Matching pipeline
-----------------

The matching pipeline consists of:

- we start by matching the whole query at once against hashes on the whole
  text, looked up in a mapping from hash to license rule. The process exits
  if a match is found.

- then we match the whole query for exact matches using an automaton
  (Aho-Corasick). The process exits if a match is found.

- then each query run is processed in sequence:

  - the best potentially matching rules are found with two rounds of
    approximate "set" matching. This set matching uses a "bag of words"
    approach where the scanned text is transformed into a vector of integers
    based on the presence or absence of a word. It is then compared against
    the index of vectors. This is conceptually similar to a traditional
    inverted index search for information retrieval. The best matches are
    ranked using a resemblance and containment comparison. A second round is
    performed on the best matches using multisets, which are sets where the
    number of occurrences of each word is also taken into account. The best
    matches are ranked again using a resemblance and containment comparison,
    which is more accurate than the previous set matching.

  - using the ranked potential candidate matches from the two previous
    rounds, we then perform a pair-wise local sequence alignment between
    these candidates and the query run. This sequence alignment is
    essentially an optimized diff working on integer sequences and takes
    advantage of the fact that some very frequent words are considered less
    discriminant: this speeds up the sequence alignment significantly. The
    number of multiple local sequence alignments that are required in this
    step is also made much smaller by the pre-matching done using sets.

  - finally all the collected matches are merged, refined and filtered to
    yield the final results. The merging considers the resemblance,
    containment and overlap between the scanned texts and the matched texts,
    and several secondary factors. Filtering is based on the density and
    length of matches as well as the number of good or frequent tokens
    matched. Last, each match receives a score which is based on the length
    of the rule text and how much of this rule text was matched. Optionally
    we can also collect the exact matched texts and identify which portions
    were not matched for each match.
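
The earliest and cheapest step of this pipeline, whole-text hash matching,
can be sketched like this (a simplified illustration using stdlib hashing;
the rule names and index layout are hypothetical, not ScanCode's actual
implementation):

```python
import hashlib

# Simplified sketch of the whole-query hash step: a mapping from the hash
# of a full normalized token sequence to a rule identifier gives O(1)
# exact whole-text matches before any expensive alignment runs.
def text_hash(tokens):
    """Hash a normalized token sequence."""
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()

# Hypothetical tiny index: full rule texts keyed by their hash.
rules_by_hash = {
    text_hash(["this", "is", "gpl"]): "gpl_mention",
    text_hash(["licensed", "under", "mit"]): "mit_notice",
}

def match_hash(query_tokens):
    """Return the rule id on a whole-text exact match, else None."""
    return rules_by_hash.get(text_hash(query_tokens))
```

Only queries that fail this cheap exact lookup proceed to automaton, set and
sequence matching.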


Comparison with other tools approaches
--------------------------------------

reassemble possible matches. They tend to suffer from the same issues as a pure
based approach and require an intimate knowledge of the license texts and how
they relate to each other.

Some tools use pair-wise comparisons like ScanCode. But in doing so, they
usually perform poorly because a multiple local sequence alignment is an
expensive computation. Say you scan 1000 files and you have 1000 reference
texts. You would need to perform multiple rounds of comparison, 1000 per
scanned file, quickly reaching the equivalent of 100 million diffs or more
to process all the files. Because of the progressive matching pipeline used
in ScanCode, sequence alignments are often unnecessary in common cases, and
when they are required, only a few are needed.

See also this list: https://wiki.debian.org/CopyrightReviewTools