
Commit 9e6ec75

Merge branch 'develop' into support-py-314
2 parents fa7051a + bb72b83 commit 9e6ec75

62 files changed

+128964
-117
lines changed

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -116,3 +116,7 @@ selenium
 rpmdb.sqlite-*
 /.ruff_cache/
 .env
+
+# Ignore gibberish detector model, this is trained on configure and build and
+# should not be committed
+gib_model.pki

CHANGELOG.rst

Lines changed: 5 additions & 2 deletions
@@ -22,8 +22,12 @@ v3.5.0 - 2026-01-15
 
 https://github.com/aboutcode-org/scancode-toolkit/pull/4606
 
-- Detect gibberish in copyrights to improve detection
+- Add gibberish detection to copyright scanning. This is done using a
+2-character Markov chain. A new CLI command,
+``scancode-train-gibberish-model``, has been added to regenerate the model
+used by the detector.
 https://github.com/aboutcode-org/scancode-toolkit/pull/4610
+https://github.com/aboutcode-org/scancode-toolkit/issues/2402
 
 - Use a cython implementation of sequence matcher to improve
 license detection performance
@@ -73,7 +77,6 @@ v3.5.0 - 2026-01-15
 https://github.com/aboutcode-org/scancode-toolkit/pull/4666
 https://github.com/aboutcode-org/scancode-toolkit/pull/4622
 
-
 v32.4.1 - 2025-07-23
 --------------------
 

Dockerfile

Lines changed: 5 additions & 3 deletions
@@ -38,11 +38,13 @@ WORKDIR /scancode-toolkit
 COPY . /scancode-toolkit
 
 # Initial configuration using ./configure, scancode-reindex-licenses to build
-# the base license index and scancode-reindex-package-patterns to build the
-# package patterns cache
+# the base license index, scancode-reindex-package-patterns to build the package
+# patterns cache, and scancode-train-gibberish-model to train the Markov chain
+# model used for gibberish detection.
 RUN ./configure \
 && ./venv/bin/scancode-reindex-licenses \
-&& ./venv/bin/scancode-reindex-package-patterns
+&& ./venv/bin/scancode-reindex-package-patterns \
+&& ./venv/bin/scancode-train-gibberish-model
 
 # Add scancode to path
 ENV PATH=/scancode-toolkit:$PATH

configure

Lines changed: 1 addition & 1 deletion
@@ -319,6 +319,6 @@ find_python
 create_virtualenv "$VIRTUALENV_DIR"
 install_packages "$CFG_REQUIREMENTS"
 . "$CFG_BIN_DIR/activate"
-
+"$CFG_BIN_DIR/scancode-train-gibberish-model"
 
 set +e

configure.bat

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ if %ERRORLEVEL% neq 0 (
 %CFG_QUIET% ^
 %PIP_EXTRA_ARGS% ^
 %CFG_REQUIREMENTS%
-
+"%CFG_BIN_DIR%\scancode-train-gibberish-model"
 
 @rem ################################
 :create_bin_junction
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+.. _cli-scancode-train-gibberish-model:
+
+ScanCode train gibberish model
+==============================
+
+ScanCode uses a 2-character Markov chain to perform gibberish detection on text.
+At a high level, it detects gibberish strings by seeing if a sequence of letters
+is part of a word or a whole word, two letters at a time. It does this by checking
+how likely it is to go from one letter to another. The probabilities of going from
+one letter to another are determined by a model that has been trained on a large
+set of valid text, where it counts each transition between letters and computes
+a probability from those counts. These probabilities and thresholds are stored
+in a model that is saved to a Python pickle.
+
+The training corpus for the gibberish detector can be found in
+``src/textcode/data/gibberish/``.
+
+``big.txt`` contains the main source of valid words that the gibberish detector
+model is trained on.
+
+``good.txt`` and ``bad.txt`` are used to determine the average threshold, where
+any string whose average transition probability falls below this
+threshold is classified as gibberish.
+
+
+Usage: ``scancode-train-gibberish-model [OPTIONS]``
+
+Quick Reference
+---------------
+
+--big FILE Text file containing main training corpus for the gibberish
+detector
+--good FILE Text file containing text considered to be not gibberish (good)
+--bad FILE Text file containing text considered to be gibberish (bad)
+-h, --help Show this message and exit.
+
+----
+
+.. _cli-scancode-train-gibberish-model-big-option:
+
+``--big`` option
+^^^^^^^^^^^^^^^^
+
+The ``--big`` option allows the user to use a different text file to train the
+gibberish detector model.
+
+.. _cli-scancode-train-gibberish-model-good-option:
+
+``--good`` option
+^^^^^^^^^^^^^^^^^
+
+The ``--good`` option allows the user to use a different text file containing
+strings considered to be valid copyrights. This option is used to adjust the
+average transition probability threshold that determines whether or not a string
+is gibberish.
+
+.. _cli-scancode-train-gibberish-model-bad-option:
+
+``--bad`` option
+^^^^^^^^^^^^^^^^
+
+The ``--bad`` option allows the user to use a different text file containing
+strings considered to be invalid copyrights. This option is used to adjust the
+average transition probability threshold that determines whether or not a string
+is gibberish.
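
Reviewer note: the documentation added above describes the detector only in prose. The block below is a minimal sketch of a 2-character Markov chain detector of that general kind, not the code from this commit. The alphabet (lowercase ASCII letters plus space), the additive smoothing value, the midpoint rule for the threshold, and every function name and dictionary key are assumptions made for illustration. Only the overall idea (count letter transitions in a large corpus, derive a threshold from good and bad samples, store both in a pickle) comes from the documentation.

```python
import math
import pickle

# Assumed alphabet: lowercase ASCII letters plus space; other characters are dropped.
ACCEPTED = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ACCEPTED)}


def bigrams(text):
    """Return consecutive character pairs over the accepted characters of text."""
    chars = [ch for ch in text.lower() if ch in INDEX]
    return zip(chars, chars[1:])


def train(corpus_lines, smoothing=10):
    """Count letter-to-letter transitions and return a matrix of log-probabilities.

    Every cell starts at `smoothing` (an assumed value) so that unseen
    transitions get a small, non-zero probability.
    """
    size = len(ACCEPTED)
    counts = [[smoothing] * size for _ in range(size)]
    for line in corpus_lines:
        for a, b in bigrams(line):
            counts[INDEX[a]][INDEX[b]] += 1
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


def average_transition_probability(text, log_probs):
    """Geometric mean of the transition probabilities along text."""
    pairs = list(bigrams(text))
    if not pairs:
        # No transitions to judge: treat as maximally plausible.
        return 1.0
    total = sum(log_probs[INDEX[a]][INDEX[b]] for a, b in pairs)
    return math.exp(total / len(pairs))


def build_model(big_lines, good_lines, bad_lines):
    """Train on the main corpus, then place the threshold between good and bad samples."""
    log_probs = train(big_lines)
    min_good = min(average_transition_probability(s, log_probs) for s in good_lines)
    max_bad = max(average_transition_probability(s, log_probs) for s in bad_lines)
    # Assumed rule: midpoint between the worst good sample and the best bad sample.
    return {"log_probs": log_probs, "threshold": (min_good + max_bad) / 2}


def is_gibberish(text, model):
    """A string is gibberish when its average transition probability is below the threshold."""
    return average_transition_probability(text, model["log_probs"]) < model["threshold"]


# Tiny stand-ins for big.txt, good.txt and bad.txt.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "copyright the authors and contributors",
] * 50
model = build_model(corpus, ["copyright the authors"], ["zxqvj kkqzx"])

# The model is a plain dict, so it can be pickled as the documentation describes.
restored = pickle.loads(pickle.dumps(model))
print(is_gibberish("copyright the authors", restored))
print(is_gibberish("zxqvj kkqzx", restored))
```

Working in log-probabilities keeps long strings from underflowing, and taking the geometric mean makes the score independent of string length, which is why a single threshold can serve strings of different sizes.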

docs/source/reference/scancode-cli/index.rst

Lines changed: 1 addition & 0 deletions
@@ -88,3 +88,4 @@ For more details into the post-scan CLI options, see :ref:`cli-post-scan-options
 cli-extractcode
 cli-scancode-reindex-licenses
 cli-scancode-license-data
+cli-scancode-train-gibberish-model

etc/release/scancode-create-pypi-wheel.sh

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ set -e
 ./configure --dev
 venv/bin/scancode-reindex-licenses
 venv/bin/scancode-reindex-package-patterns
+venv/bin/scancode-train-gibberish-model
 
 python_tag=$( python -c "import platform;print(f\"cp{''.join(platform.python_version_tuple()[:2])}\")" )
 

etc/release/scancode-create-release-app-linux.sh

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ cp -r etc/thirdparty $release_dir/etc
 ./configure --dev
 venv/bin/scancode-reindex-licenses
 venv/bin/scancode-reindex-package-patterns
+venv/bin/scancode-train-gibberish-model
 venv/bin/python setup.py --quiet bdist_wheel --python-tag cp$python_version
 
 cp -r \

etc/release/scancode-create-release-app-macos.sh

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ cp -r etc/thirdparty $release_dir/etc
 ./configure --dev
 venv/bin/scancode-reindex-licenses
 venv/bin/scancode-reindex-package-patterns
+venv/bin/scancode-train-gibberish-model
 venv/bin/python setup.py --quiet bdist_wheel --python-tag cp$python_version
 
 cp -r \
