Commit 2577ba0

update federatedcode related commands (#800)
* Get debian packages from federated repos for now #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Set download_url using purl2url #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* create packages right after cloning repos for given package type #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Add logging #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update federate_packages.py #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update requirements #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update help text and argument #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Add missing argument #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Move scanpipe import #668
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Break long line
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update broken links in docs
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update expected test results
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Run `make postgres` in ci
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Update service section
  Signed-off-by: Jono Yang <jyang@nexb.com>
* set packagedb host in testing envfile
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Revert test setup changes
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Set db name in test envfile
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Pin scancode.io requirement to 35.5.0 #668
    * There is a postgres update from 13 to 17 in scancode.io that we will handle in purldb later
  Signed-off-by: Jono Yang <jyang@nexb.com>
* Set POSTGRES_DB
  Signed-off-by: Jono Yang <jyang@nexb.com>
---------
Signed-off-by: Jono Yang <jyang@nexb.com>
1 parent 6be2304 commit 2577ba0

15 files changed

Lines changed: 577 additions & 172 deletions

azure-pipelines.yml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ resources:
 - container: postgres
   image: postgres:13
   env:
+    POSTGRES_DB: packagedb
     POSTGRES_USER: postgres
     POSTGRES_PASSWORD: postgres
   ports:

docs/source/contributing.rst

Lines changed: 2 additions & 2 deletions
@@ -124,7 +124,7 @@ overlooked. We value any suggestions to improve

 .. tip::
     Our documentation is treated like code. Make sure to check our
-    `writing guidelines <https://scancode-toolkit.readthedocs.io/en/latest/contribute/contrib_doc.html>`_
+    `writing guidelines <https://scancode-toolkit.readthedocs.io/en/stable/contribute/contrib_doc.html>`_
     to help guide new users.

 Other Ways
@@ -140,7 +140,7 @@ questions, and interact with us and other community members on
 Helpful Resources
 -----------------

-- Review our `comprehensive guide <https://scancode-toolkit.readthedocs.io/en/latest/contribute/index.html>`_
+- Review our `comprehensive guide <https://scancode-toolkit.readthedocs.io/en/stable/contribute/index.html>`_
   for more details on how to add quality contributions to our codebase and documentation
 - Check this free resource on `how to contribute to an open source project on github <https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github>`_
 - Follow `this wiki page <https://aboutcode.readthedocs.io/en/latest/contributing/writing_good_commit_messages.html>`_

docs/source/how-to-guides/installation.rst

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -175,8 +175,9 @@ Make sure those are installed before attempting the ScanCode.io installation::
175175
bzip2 xz-utils zlib1g libxml2-dev libxslt1-dev libpopt0 \
176176
libgpgme11 libdevmapper1.02.1 libguestfs-tools
177177

178-
See also `ScanCode-toolkit Prerequisites <https://scancode-toolkit.readthedocs.io/en/
179-
latest/getting-started/install.html#prerequisites>`_ for more details.
178+
See also `ScanCode-toolkit Prerequisites
179+
<https://scancode-toolkit.readthedocs.io/en/stable/getting-started/install.html#prerequisites>`_
180+
for more details.
180181

181182

182183
Clone and Configure

docs/source/purldb/rest_api.rst

Lines changed: 2 additions & 1 deletion
@@ -906,7 +906,8 @@ Also each package can have list of ``addon_pipelines`` to run on the package.
 Find all addon pipelines `here. <https://scancodeio.readthedocs.io/en/latest/built-in-pipelines.html>`_


-If the ``reindex`` flag is set to True, existing package will be rescanned and all the non existing package will be indexed.
+If the ``reindex`` flag is set to True, existing package will be rescanned and
+all the non existing package will be indexed.
 If the ``reindex_set`` flag is set to True, then all the package in the same set will be rescanned.


etc/scripts/utils_requirements.py

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ def split_req(req):
     if not req:
         raise ValueError("req is required")
     # do not allow multiple constraints and tags
-    if not any(c in req for c in ",;"):
+    if any(c in req for c in ",;"):
         raise Exception(f"complex requirements with : or ; not supported: {req}")
     req = "".join(req.split())
     if not any(c in req for c in comparators):
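
The one-line change above inverts the guard so that complex requirements are rejected rather than simple ones. The following is a minimal sketch of that behavior, not the actual `utils_requirements.py` code: the `COMPARATORS` tuple and everything after the guard are assumptions made only so the example runs on its own.

```python
# Hypothetical, simplified stand-in for split_req.
# COMPARATORS is an assumed list; the real module defines its own `comparators`.
COMPARATORS = ("===", "~=", "!=", "==", "<=", ">=", ">", "<")


def split_req(req):
    """Split a simple requirement string into (name, comparator, version)."""
    if not req:
        raise ValueError("req is required")
    # Reject multiple constraints (",") and environment markers (";").
    # With the old `if not any(...)` test, every simple requirement was rejected.
    if any(c in req for c in ",;"):
        raise ValueError(f"complex requirements with , or ; not supported: {req}")
    # Remove all whitespace before looking for a comparator.
    req = "".join(req.split())
    if not any(c in req for c in COMPARATORS):
        return req, "", ""
    # Longer comparators come first so ">=" is matched before ">".
    for comparator in COMPARATORS:
        if comparator in req:
            name, version = req.split(comparator)
            return name, comparator, version


simple = split_req("foo == 1.0")
bare = split_req("foo")
```

With the corrected guard, a simple pin such as `foo == 1.0` is split normally, while a string such as `foo>=1,<2` raises.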

minecode/management/commands/defederate_packages.py

Lines changed: 78 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -8,17 +8,23 @@
88
#
99

1010
import logging
11+
import os
1112
import sys
12-
import tempfile
13+
from pathlib import Path
14+
from urllib.parse import urljoin
15+
16+
from django.conf import settings
17+
18+
import requests
19+
import saneyaml
1320

14-
import os
15-
from commoncode.fileutils import walk
1621
from aboutcode.federated import DataFederation
17-
from minecode_pipelines import pipes
22+
from commoncode import fileutils
23+
from minecode.management import federatedcode
1824
from minecode.management.commands import VerboseCommand
1925
from packagedb import models as packagedb_models
2026
from packageurl import PackageURL
21-
import saneyaml
27+
from packageurl.contrib import purl2url
2228

2329
"""
2430
Utility command to find license oddities.
@@ -32,27 +38,38 @@
 logger.setLevel(logging.DEBUG)


-def yield_purls_from_yaml_files(location):
-    for root, _, files in walk(location):
+def yield_purl_strs_from_yaml_files(location):
+    for root, _, files in fileutils.walk(location):
+        if "purls.yml" not in files:
+            continue
         for file in files:
-            if not (file == "purls.yml"):
-                continue
             fp = os.path.join(root, file)
             with open(fp) as f:
                 purl_strs = saneyaml.load(f.read()) or []
-            for purl_str in purl_strs:
-                yield PackageURL.from_string(purl_str)
+            yield from purl_strs


 class Command(VerboseCommand):
-    help = "Find packages with an ambiguous declared license."
+    help = "Create Packages from FederatedCode repos"

     def add_arguments(self, parser):
-        parser.add_argument("-i", "--input", type=str, help="Define the input file name")
+        parser.add_argument(
+            "-d",
+            "--working-directory",
+            type=str,
+            required=False,
+            help="Directory where FederatedCode repos will be cloned",
+        )

     def handle(self, *args, **options):
         logger.setLevel(self.get_verbosity(**options))
-        working_path = tempfile.mkdtemp()
+        working_dir = options.get("working_directory")
+        if working_dir:
+            working_path = Path(working_dir)
+        else:
+            working_path = Path(fileutils.get_temp_dir())
+
+        account_url = f"{settings.FEDERATEDCODE_GIT_ACCOUNT_URL}/"

         # Clone data and config repo
         data_federation = DataFederation.from_url(
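
The trailing slash appended to `account_url` in this hunk matters for the later `urljoin(account_url, repo_name)` call. A small sketch of why, using a made-up account URL (the real value comes from `settings.FEDERATEDCODE_GIT_ACCOUNT_URL` and is not shown in this commit):

```python
from urllib.parse import urljoin

# Hypothetical account URL, used only for illustration.
account = "https://example.org/some-account"

# Without a trailing slash, urljoin replaces the last path segment.
without_slash = urljoin(account, "repo-name")

# With a trailing slash, the repo name is appended under the account path.
with_slash = urljoin(f"{account}/", "repo-name")

print(without_slash)  # https://example.org/repo-name
print(with_slash)     # https://example.org/some-account/repo-name
```

This is standard relative-reference resolution: the base URL is treated as a document path unless it ends in `/`.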
@@ -62,17 +79,50 @@ def handle(self, *args, **options):
         data_cluster = data_federation.get_cluster("purls")

         checked_out_repos = {}
-        for purl_type, data_repository in data_cluster._data_repositories_by_purl_type.items():
-            repo_name = data_repository.name
-            checked_out_repos[repo_name] = pipes.init_local_checkout(
-                repo_name=repo_name,
-                working_path=working_path,
-                logger=logger,
-            )
-
-        # iterate through checked out repos and import data
-        for repo_name, repo_data in checked_out_repos.items():
-            repo = repo_data.get("repo")
-            for purl in yield_purls_from_yaml_files(repo.working_dir):
-                # TODO: use batch create for efficiency
-                package = packagedb_models.Package.objects.create(**purl.to_dict())
+        for package_type, data_repositories in data_cluster._data_repositories_by_purl_type.items():
+            for data_repository in data_repositories:
+                repo_name = data_repository.name
+                repo_url = urljoin(account_url, repo_name)
+                if requests.get(repo_url).ok:
+                    clone_path = working_path / package_type / repo_name
+                    checked_out_repos[repo_name] = federatedcode.clone_repository(
+                        repo_url=repo_url,
+                        clone_path=clone_path,
+                        logger=logger.log,
+                    )
+                else:
+                    break
+
+            # iterate through checked out repos and import data
+            packages_to_write = []
+            for repo_name, repo in checked_out_repos.items():
+                logger.log(f"Creating Packages from {repo_name}")
+                for i, purl_str in enumerate(
+                    yield_purl_strs_from_yaml_files(repo.working_dir), start=1
+                ):
+                    purl = PackageURL.from_string(purl_str)
+                    if packages_to_write and not i % 5000:
+                        packagedb_models.Package.objects.bulk_create(packages_to_write)
+                        logger.log(f"Created {i} Packages from {repo_name}")
+                        packages_to_write.clear()
+                    package = packagedb_models.Package(
+                        type=purl.type,
+                        namespace=purl.namespace,
+                        name=purl.name,
+                        version=purl.version,
+                        qualifiers=purl.qualifiers,
+                        subpath=purl.subpath or "",
+                        download_url=purl2url.get_download_url(purl_str),
+                        repository_download_url=purl2url.get_repo_download_url(purl_str),
+                    )
+                    packages_to_write.append(package)
+
+                if packages_to_write:
+                    packagedb_models.Package.objects.bulk_create(packages_to_write)
+                    logger.log(f"Created {i} Packages from {repo_name}")
+                    packages_to_write.clear()
+
+            # clean up
+            package_type_clone_path = working_path / package_type
+            fileutils.delete(package_type_clone_path)
+            checked_out_repos.clear()
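
The switch from one `objects.create()` per purl to periodic `bulk_create()` calls follows a general accumulate-and-flush pattern. The sketch below shows that pattern without Django. `write_in_batches` and its `write_batch` callback are hypothetical names, with the callback standing in for `Package.objects.bulk_create`. Unlike the diff, which checks the counter before appending the current item, this variant flushes right after appending, so every full batch holds exactly `batch_size` items.

```python
def write_in_batches(items, write_batch, batch_size=5000):
    """Pass items to write_batch in lists of at most batch_size.

    Returns the total number of items written.
    """
    batch = []
    total = 0
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            write_batch(batch)
            total += len(batch)
            # Start a fresh list so the callback may keep the old one.
            batch = []
    # Flush whatever is left after the loop ends.
    if batch:
        write_batch(batch)
        total += len(batch)
    return total


# Usage: collect the batches in memory instead of writing to a database.
written = []
total = write_in_batches(range(12), written.append, batch_size=5)
print(total)    # 12
print(written)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

Counting items as they are flushed, rather than reusing the loop index, keeps the reported total equal to what was actually written.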

minecode/management/commands/federate_packages.py

Lines changed: 36 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -9,14 +9,15 @@
99

1010
import logging
1111
import sys
12+
from pathlib import Path
1213

1314
from aboutcode.federated import DataFederation
14-
from scanpipe.pipes import federatedcode
15-
from minecode_pipelines import pipes
15+
from commoncode import fileutils
16+
from minecode.management import federatedcode
1617
from minecode.management.commands import VerboseCommand
18+
from minecode_pipelines import pipes
1719
from packagedb import models as packagedb_models
1820

19-
2021
"""
2122
Utility command to find license oddities.
2223
"""
@@ -42,21 +43,32 @@ def commit_message(commit_batch, total_commit_batch="many"):
     return f"""\
 Save PackageURLs from PurlDB ({commit_batch}/{total_commit_batch})

-Tool: {tool_name}@v{VERSION}
+Tool: {tool_name}@v{settings.PURLDB_VERSION}
 Reference: https://{settings.ALLOWED_HOSTS[0]}

 Signed-off-by: {author_name} <{author_email}>
 """


 class Command(VerboseCommand):
-    help = "Find packages with an ambiguous declared license."
+    help = "Save and commit purls from PackageDB to FederatedCode repos."

     def add_arguments(self, parser):
-        parser.add_argument("-i", "--input", type=str, help="Define the input file name")
+        parser.add_argument(
+            "-d",
+            "--working-directory",
+            type=str,
+            required=False,
+            help="Directory where FederatedCode repos will be cloned",
+        )

     def handle(self, *args, **options):
         logger.setLevel(self.get_verbosity(**options))
+        working_dir = options.get("working_directory")
+        if working_dir:
+            working_path = Path(working_dir)
+        else:
+            working_path = Path(fileutils.get_temp_dir())

         # Clone data and config repo
         data_federation = DataFederation.from_url(
@@ -68,9 +80,17 @@ def handle(self, *args, **options):
         # TODO: do something more efficient
         files_to_commit = []
         commit_batch = 1
-        files_per_commit = PACKAGE_BATCH_SIZE
-        for package in packagedb_models.Package.objects.all():
-            package_repo, datafile_path = data_cluster.get_datafile_repo_and_path(purl=package.purl)
+        for i, package in enumerate(
+            packagedb_models.Package.objects.all().iterator(chunk_size=PACKAGE_BATCH_SIZE), start=1
+        ):
+            package_repo_name, datafile_path = data_cluster.get_datafile_repo_and_path(
+                purl=package.purl
+            )
+            _, package_repo = federatedcode.get_or_create_repository(
+                repo_name=package_repo_name,
+                working_path=working_path,
+                logger=logger.log,
+            )
             purl_file = pipes.write_packageurls_to_file(
                 repo=package_repo,
                 relative_datafile_path=datafile_path,
@@ -80,13 +100,14 @@ def handle(self, *args, **options):
             if purl_file not in files_to_commit:
                 files_to_commit.append(purl_file)

-            if len(files_to_commit) == files_per_commit:
+            if len(files_to_commit) == PACKAGE_BATCH_SIZE:
                 federatedcode.commit_and_push_changes(
                     commit_message=commit_message(commit_batch),
                     repo=package_repo,
                     files_to_commit=files_to_commit,
-                    logger=logger,
+                    logger=logger.log,
                 )
+                logger.log(f"Committed {i} purls to {package_repo_name}")
                 files_to_commit.clear()
                 commit_batch += 1

@@ -95,5 +116,8 @@ def handle(self, *args, **options):
                 commit_message=commit_message(commit_batch),
                 repo=package_repo,
                 files_to_commit=files_to_commit,
-                logger=logger,
+                logger=logger.log,
             )
+            logger.log(f"Committed {i} purls to {package_repo_name}")
+            files_to_commit.clear()
+            commit_batch += 1
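
Both commands now accept an optional `-d/--working-directory` argument and fall back to a temporary directory when it is omitted. A minimal standalone sketch of that fallback, assuming plain `argparse` in place of the Django command parser and `tempfile.mkdtemp()` in place of commoncode's `fileutils.get_temp_dir()`; `resolve_working_path` is a hypothetical helper name:

```python
import argparse
import tempfile
from pathlib import Path


def resolve_working_path(argv):
    """Return the clone directory chosen on the command line, or a new temp dir."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-d",
        "--working-directory",
        type=str,
        required=False,
        help="Directory where FederatedCode repos will be cloned",
    )
    # argparse stores "--working-directory" under the key "working_directory".
    options = vars(parser.parse_args(argv))
    working_dir = options.get("working_directory")
    if working_dir:
        return Path(working_dir)
    # Stand-in for fileutils.get_temp_dir(), which this sketch does not import.
    return Path(tempfile.mkdtemp())


explicit = resolve_working_path(["-d", "/tmp/federatedcode-clones"])
fallback = resolve_working_path([])
```

Passing an explicit directory can make it easier to inspect or reuse clones between runs, while the fallback keeps the previous temporary-directory behavior.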
