Commit 313e78f

Improve wording and flow in licensedcode/README #2246
Signed-off-by: Chin Yeung Li <[email protected]>
1 parent 01fb718 commit 313e78f

1 file changed: +62 −57 lines

src/licensedcode/README.rst


ScanCode license detection overview and key design elements
===========================================================

License detection involves identifying commonalities between the text of a
scanned query file and the indexed license and rule texts. The process
prioritizes accuracy over speed.

Ideally, we want to find the best alignment possible between two texts so we know
exactly where they match: the scanned text and one or more of the many license texts.
We settle for good alignments rather than optimal alignments by still returning
accurate and correct matches in a reasonable amount of time.

Correctness is essential but efficiency matters too: both in terms of speed
and memory usage. One key to efficient matching is to process whole words
instead of characters, and to represent words internally using integers
rather than strings.


Rules and licenses
------------------

The detection uses an index of reference license texts along with a set of
"rules" that are common notices or mentions of these licenses. One challenge
in detection is that a license reference can be very short, as in "this is
GPL", or very long, such as the full license text of the GPLv3. To cope with
this, we use different matching strategies and also compute both the
resemblance and containment of the matched texts.


Words as integers
-----------------

A dictionary that maps each word to a unique integer is used to transform the
words of a scanned text "query", as well as the words in the indexed license
texts and rules, to numbers. This is possible because we have a limited
number of words across all the license texts (about 15K). We further assign
these ids to words such that very common words have a low id, while less
frequent, more distinctive words have a higher id. A threshold is defined
for this id range such that very common words below the threshold cannot,
by themselves, form a valid license text or reference.
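
As a rough illustration only (a minimal sketch, not ScanCode's actual code;
the function names and the ``junk_threshold`` split are hypothetical), such a
mapping could be built like this::

    from collections import Counter

    def build_token_ids(all_rule_texts):
        """Map every word seen in the rule texts to an integer id."""
        frequencies = Counter()
        for text in all_rule_texts:
            frequencies.update(text.lower().split())
        # most_common() yields words from most to least frequent, so very
        # common words receive the smallest ids.
        return {word: tid for tid, (word, _) in enumerate(frequencies.most_common())}

    def tokenize(text, token_ids):
        """Turn a text into the integer token ids used for matching."""
        return [token_ids[w] for w in text.lower().split() if w in token_ids]

    # hypothetical split: ids below junk_threshold form the common/junk range
    junk_threshold = 1000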

Once that mapping is applied, the detection process deals only with integers in two
dimensions:

- the token ids (and whether they are in the high or low range).
- their positions in the query (qpos) and the indexed rule (ipos).

We also use an integer id for a rule.

From this point, all operations are performed on lists, arrays or sets of
integers in defined ranges.
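
For example, with purely illustrative values::

    # token ids for a tiny query and a tiny indexed rule; the list index is
    # the position (qpos on the query side, ipos on the index side).
    query_tokens = [7, 812, 94, 2051, 33]
    rule_tokens = [812, 94, 2051]

    JUNK_THRESHOLD = 100   # hypothetical: ids below this are "low"/junk tokens
    high_qpos = [qpos for qpos, tid in enumerate(query_tokens) if tid >= JUNK_THRESHOLD]
    print(high_qpos)       # [1, 3]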

Matches are reduced to sets of integers referred to as "Spans":

- matched positions on the query side
- matched positions on the index side

By using integers within known ranges throughout the process, several
operations are simplified to comparisons and intersections of integers,
integer sets, or lists. These operations are faster and more easily
optimized.
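
For example (a minimal sketch using plain Python sets and a hypothetical
helper name; the actual Span implementation is more elaborate), containment
reduces to a set intersection::

    def containment(matched_positions, rule_positions):
        """Fraction of the rule positions covered by the matched positions."""
        if not rule_positions:
            return 0.0
        return len(matched_positions & rule_positions) / len(rule_positions)

    query_side = set(range(10, 40))   # positions matched on the query side
    index_side = set(range(0, 35))    # positions of the indexed rule text
    print(containment(query_side, index_side))  # 25 of 35 positions -> ~0.71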

With integers, we use less memory:

- we can use arrays of unsigned 16-bit ints that store each number on two bytes
  rather than bigger lists of ints (see the sketch after this list).
- we can replace dictionaries by sparse lists or arrays where the index is an integer key.
- we can use succinct, bit level representations (e.g. bitmaps) of integer sets.
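
A small standalone Python illustration of the first point (not ScanCode code;
exact sizes vary by interpreter)::

    import sys
    from array import array

    token_ids = list(range(10000))      # a plain list of Python ints
    compact = array('H', token_ids)     # unsigned 16-bit ints, two bytes each

    print(sys.getsizeof(token_ids))          # tens of kilobytes of pointers alone
    print(compact.itemsize * len(compact))   # exactly 20000 bytes of payload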

Smaller data structures also mean faster processing, as processors need to move
less data in memory.

With integers, we can be faster:

- a dict key lookup is slower than a list or array index lookup.
- processing large lists of small structures is faster (such as bitmaps, etc.).
- we can leverage libraries that speed up integer set operations.
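
The first point can be checked directly with ``timeit`` (an illustrative
micro-benchmark only, not part of ScanCode)::

    import timeit

    as_dict = {i: i for i in range(10000)}
    as_list = list(range(10000))

    print(timeit.timeit(lambda: as_dict[5000], number=1_000_000))
    print(timeit.timeit(lambda: as_list[5000], number=1_000_000))  # usually faster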


Common/junk tokens
------------------

The quality and speed of detection is supported by classifying each word as
either good/discriminant or common/junk. Junk tokens are either very frequent
tokens or tokens that, even combined, cannot form a valid license mention or
notice. When a numeric id is assigned to a token during initial indexing,
junk tokens are assigned a lower id than good tokens. These are referred to
as low (junk) tokens and high (good) tokens.


Query processing
----------------

When a file is scanned, it is first converted to a query object which is a list of
integer token ids. A query is further broken down in slices (a.k.a. query runs) based
on heuristics.

While the query is processed, a set of matched and matchable positions for
high and low token ids is kept to track what is left to do in matching.
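
A hypothetical sketch of that bookkeeping (illustrative names only, not
ScanCode's actual classes)::

    def build_query(text, token_ids):
        """Convert a scanned text into token ids plus the set of matchable positions."""
        tokens = [token_ids.get(word) for word in text.lower().split()]
        matchable = {pos for pos, tid in enumerate(tokens) if tid is not None}
        return tokens, matchable

    def record_match(matchable, matched_positions):
        """Shrink the matchable set so later strategies only see what is left."""
        matchable.difference_update(matched_positions)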


Matching pipeline
-----------------

The matching pipeline consists of the following steps (a simplified sketch of
the per-run matching follows the list):

- we start by matching the whole query at once, using a hash of the whole
  text looked up in a mapping from hash to license rule. The process exits
  if a match is found.

- then we match the whole query for exact matches using an automaton (Aho-Corasick).
  The process exits if a match is found.

- then each query run is processed in sequence:

  - the best potentially matching rules are found with two rounds of approximate
    "set" matching. This set matching uses a "bag of words" approach where the
    scanned text is transformed into a vector of integers based on the presence or
    absence of a word. It is then compared against the index of vectors. This is
    conceptually similar to a traditional inverted index search for information
    retrieval. The best matches are ranked using a resemblance and containment
    comparison. A second round is performed on the best matches using multisets,
    which are sets where the number of occurrences of each word is also taken
    into account. The best matches are ranked again using a resemblance and
    containment comparison, which is more accurate than the previous set matching.

  - using the ranked potential candidate matches from the two previous rounds, we
    then perform a pair-wise local sequence alignment between these candidates and
    the query run. This sequence alignment is essentially an optimized diff working
    on integer sequences and takes advantage of the fact that some very frequent
    words are considered less discriminant: this speeds up the sequence alignment
    significantly. The number of multiple local sequence alignments that are required
    in this step is also made much smaller by the pre-matching done using sets.

- finally all the collected matches are merged, refined and filtered to yield the
  final results. The merging considers the resemblance, containment and overlap
  between the scanned texts and the matched texts, and several secondary factors.
  Filtering is based on the density and length of matches as well as the number of
  good or frequent tokens matched.
  Last, each match receives a score based on the length of the rule text
  and how much of this rule text was matched. Optionally we can also collect the
  exact matched texts and identify which portions were not matched for each instance.
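
The per-run part can be sketched roughly as follows (a simplified, hypothetical
illustration: ``rules`` is assumed to be a plain dict mapping a rule id to its
token-id sequence, and ``difflib`` stands in for the optimized alignment; this
is not ScanCode's actual code)::

    import difflib
    from collections import Counter

    def set_score(query_tokens, rule_tokens):
        """Round 1: resemblance and containment over plain sets ("bag of words")."""
        q, r = set(query_tokens), set(rule_tokens)
        if not r:
            return 0.0
        resemblance = len(q & r) / len(q | r)
        containment = len(q & r) / len(r)
        return max(resemblance, containment)

    def multiset_score(query_tokens, rule_tokens):
        """Round 2: the same idea over multisets, so occurrence counts matter."""
        q, r = Counter(query_tokens), Counter(rule_tokens)
        inter = sum((q & r).values())
        union = sum((q | r).values())
        if not union or not r:
            return 0.0
        return max(inter / union, inter / sum(r.values()))

    def match_query_run(query_run, rules, top=10):
        """Rank candidates with two approximate rounds, then align the survivors."""
        round1 = sorted(rules, key=lambda rid: set_score(query_run, rules[rid]), reverse=True)
        round2 = sorted(round1[:top * 5],
                        key=lambda rid: multiset_score(query_run, rules[rid]), reverse=True)
        matches = []
        for rule_id in round2[:top]:
            # difflib stands in for the optimized local alignment on integer sequences.
            sm = difflib.SequenceMatcher(a=query_run, b=rules[rule_id], autojunk=False)
            matches.append((rule_id, sm.get_matching_blocks()))
        return matches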


Comparison with other tools approaches
--------------------------------------

[...] reassemble possible matches. They tend to suffer from the same issues as a pure
based approach and require an intimate knowledge of the license texts and how they
relate to each other.

Some tools use pair-wise comparisons like ScanCode. But in doing so, they
usually perform poorly because a multiple local sequence alignment is an
expensive computation. Say you scan 1000 files and you have 1000 reference
texts. You would need to perform multiple rounds of comparison, 1000 per
file, resulting in the equivalent of 100 million diffs or more to process
all files. Because of the progressive matching pipeline used in ScanCode,
sequence alignments are often unnecessary in common cases, and when they
are required, only a few are needed.

See also this list: https://wiki.debian.org/CopyrightReviewTools
