Commit 56d19e9

Add/reorganize sections #2246
* I've also updated score text in README based on https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/licensedcode/detection.py#L404 Signed-off-by: Chin Yeung Li <[email protected]>
1 parent 313e78f commit 56d19e9

1 file changed

+70
-14
lines changed

src/licensedcode/README.rst

Lines changed: 70 additions & 14 deletions
@@ -1,5 +1,59 @@
1+
Overview
2+
========
3+
4+
This `licensedcode` module has utilities to accurately detect a vast array
5+
of open-source and proprietary licenses. It manages a comprehensive
6+
database of license texts, patterns, and rules, enabling ScanCode to
7+
perform scans and provide precise license conclusions.
8+
9+
Key Functionality
10+
-----------------
11+
12+
* License Rule Management: Stores and manages a large collection of
13+
license rules, including full texts, snippets, and regular expressions.
14+
15+
* Pattern Matching: Implements sophisticated algorithms for matching
16+
detected code against known license patterns and texts.
17+
18+
* License Detection Logic: Contains the core logic for processing scan
19+
input, applying rules, and determining the presence and type of
20+
licenses.
21+
22+
* Rule-based Detection: Utilizes a robust system of rules to identify
23+
licenses even when only fragments or variations of license texts are
24+
present.
25+
26+
* License Expression Parsing: Supports the parsing and interpretation of
27+
complex license expressions (e.g., "MIT AND Apache-2.0").
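
As a rough illustration of what interpreting a license expression involves, the sketch below splits an expression such as "MIT AND Apache-2.0" into tokens and collects the license keys it mentions. The names `tokenize_expression` and `license_keys` are hypothetical and use only the standard library; this is a toy version that skips validation and is not how ScanCode itself parses expressions.

```python
import re

# Assumption: only these three operators are treated as keywords in this toy example.
OPERATORS = {"AND", "OR", "WITH"}

# A token is a parenthesis or a run of characters allowed in a license key.
TOKEN_RE = re.compile(r"\(|\)|[A-Za-z0-9.+\-]+")


def tokenize_expression(expression):
    """Split a license expression into parentheses, operators and keys."""
    return TOKEN_RE.findall(expression)


def license_keys(expression):
    """Return the unique license keys of an expression, in order of appearance."""
    keys = []
    for token in tokenize_expression(expression):
        if token in ("(", ")") or token.upper() in OPERATORS:
            continue
        if token not in keys:
            keys.append(token)
    return keys
```

For example, `license_keys("MIT AND Apache-2.0")` yields the two keys `MIT` and `Apache-2.0`, leaving the operator out.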
28+
29+
30+
How It Works (High-Level)
31+
-------------------------
32+
33+
At a high level, the `licensedcode` module operates by:
34+
35+
1. Loading License Data: It initializes by loading a curated set of
36+
license texts, short license identifiers, and detection rules from its
37+
internal data store.
38+
39+
2. Scanning Input: When ScanCode processes a file or directory, the
40+
content is converted into an internal representation (a "query").
41+
42+
3. Applying Rules: The module then applies its extensive set of rules and
43+
patterns to the input content through a multi-stage pipeline, looking
44+
for matches.
45+
46+
4. Reporting Detections: Upon successful matches, it reports the
47+
identified licenses, their confidence levels, and the exact locations
48+
(lines, characters) where they were found.
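
The four steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the module's implementation: the `RULES` dictionary, the function names, and the exact-sequence matching are all assumptions made for the example, and the real pipeline uses several matching stages and also reports scores.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Step 1 (assumed data): a tiny in-memory stand-in for the rule store.
RULES = {
    "mit": "permission is hereby granted free of charge",
    "apache-2.0": "licensed under the apache license version 2 0",
}


def build_index(rules):
    """Map every rule word to an integer id and store rules as id lists."""
    vocabulary = {}
    indexed = {}
    for key, text in rules.items():
        indexed[key] = [
            vocabulary.setdefault(word, len(vocabulary)) for word in tokenize(text)
        ]
    return vocabulary, indexed


def build_query(text, vocabulary):
    """Step 2: turn a scanned text into (token_id, line_number) pairs.

    Words unknown to the index get the id -1.
    """
    query = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in tokenize(line):
            query.append((vocabulary.get(word, -1), line_number))
    return query


def detect(text, rules=RULES):
    """Steps 3 and 4: find exact rule occurrences and report their lines."""
    vocabulary, indexed = build_index(rules)
    query = build_query(text, vocabulary)
    ids = [token_id for token_id, _ in query]
    detections = []
    for key, rule_ids in indexed.items():
        size = len(rule_ids)
        for start in range(len(ids) - size + 1):
            if ids[start:start + size] == rule_ids:
                detections.append({
                    "license": key,
                    "start_line": query[start][1],
                    "end_line": query[start + size - 1][1],
                })
    return detections
```

Calling `detect()` on a source file header returns one dictionary per detected rule, with the license key and the first and last matched line.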
49+
50+
For a more in-depth understanding of the underlying technical principles
51+
and the detection pipeline, please refer to the sections below.
52+
53+
54+
155
ScanCode license detection overview and key design elements
2-
===========================================================
56+
-----------------------------------------------------------
357

458
License detection involves identifying commonalities between the text of a
559
scanned query file and the indexed license and rule texts. The process
@@ -17,7 +71,7 @@ rather than strings.
1771

1872

1973
Rules and licenses
20-
------------------
74+
^^^^^^^^^^^^^^^^^^
2175

2276
The detection uses an index of reference license texts along with a set
2377
of "rules" that are common notices or mentions of these licenses. One
@@ -28,7 +82,7 @@ resemblance and containment of the matched texts.
2882

2983

3084
Words as integers
31-
-----------------
85+
^^^^^^^^^^^^^^^^^
3286

3387
A dictionary that maps words to a unique integer is used to transform a
3488
scanned text "query" words, as well as the words in the indexed license
@@ -78,7 +132,7 @@ With integers, we can be faster:
78132

79133

80134
Common/junk tokens
81-
------------------
135+
^^^^^^^^^^^^^^^^^^
82136

83137
The quality and speed of detection is supported by classifying each word as
84138
either good/discriminant or common/junk. Junk tokens are either very
@@ -89,7 +143,7 @@ referred to as low (junk) tokens and high (good) tokens.
89143

90144

91145
Query processing
92-
----------------
146+
^^^^^^^^^^^^^^^^
93147

94148
When a file is scanned, it is first converted to a query object which is a list of
95149
integer token ids. A query is further broken down in slices (a.k.a. query runs) based
@@ -101,7 +155,7 @@ matching.
101155

102156

103157
Matching pipeline
104-
-----------------
158+
^^^^^^^^^^^^^^^^^
105159

106160
The matching pipeline consists of:
107161

@@ -133,14 +187,16 @@ The matching pipeline consist of:
133187
significantly. The number of multiple local sequence alignments that are required
134188
in this step is also made much smaller by the pre-matching done using sets.
135189

136-
- finally all the collected matches are merged, refined and filtered to yield the
137-
final results. The merging considers the ressemblance, containment and overlap
138-
between scanned texts and the matched texts and several secondary factors.
139-
Filtering is based on the density and length of matches as well as the number of
140-
good or frequent tokens matched.
141-
Last, each match receives a score which base on the length of the rule text
142-
and how of this rule text was matched. Optionally we can also collect the exact
143-
matched texts and identify which portions were not matched for each instance.
190+
- finally all the collected matches are merged, refined and filtered to
191+
yield the final results. The merging considers the resemblance,
192+
containment and overlap between scanned texts and the matched texts and
193+
several secondary factors. Filtering is based on the density and length
194+
of matches as well as the number of good or frequent tokens matched.
195+
Lastly, each match receives a score calculated based on the sum of the
196+
underlying match scores, weighted by the length of the match relative to
197+
the overall detection length. Optionally we can also collect the exact
198+
matched texts and identify which portions were not matched for each
199+
instance.
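
The length-weighted score described in the last step can be illustrated with a small sketch. It assumes that the overall detection length is the sum of the individual match lengths and that the result is rounded and capped at 100; both are assumptions made for this example, and the function name `detection_score` and the `(score, length)` pair layout are hypothetical.

```python
def detection_score(matches):
    """Combine match scores into one score, weighting each by its length.

    ``matches`` is a list of (score, length) pairs, with score in 0..100
    and length counted in matched tokens.
    """
    # Assumption: the detection length is the sum of all match lengths.
    total_length = sum(length for _, length in matches)
    if not total_length:
        return 0.0
    weighted = sum(score * (length / total_length) for score, length in matches)
    # Assumption: round to two decimals and cap at 100.
    return min(round(weighted, 2), 100.0)
```

With this weighting, a long high-scoring match dominates a short weak one: a 30-token match scoring 100 and a 10-token match scoring 50 combine to 100 × 0.75 + 50 × 0.25 = 87.5.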
144200

145201

146202
Comparison with other tools approaches
