Commit 09f1c39

Kaggle CLI: Dataset image upload via datasets metadata --update (#959)
Changes:
- Adds functionality to specify an image file on disk to upload and set for the dataset, using default crops.
- Updates documentation about new parameters

Local testing:
- [screen](http://screen/78g4ZpYK96QUUh8)
- [screencast](https://screencast.googleplex.com/cast/NDc2NjU5MjU3ODYxNzM0NHwwMmFjMTc4Yy1jZg)

http://b/500108129
1 parent 1dca025 commit 09f1c39

2 files changed

Lines changed: 117 additions & 6 deletions

docs/datasets_metadata.md

Lines changed: 47 additions & 2 deletions
@@ -54,6 +54,7 @@ Here's an example containing file metadata:
   ],
   "expectedUpdateFrequency": "monthly",
   "userSpecifiedSources": "World Bank and OECD ([link](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD))",
+  "image": "relative/path/to/new/image.png"
 }
 ```

@@ -90,6 +91,10 @@ The following metadata is currently supported:
         * `title`: Field description
         * `type`: Field type. A best-effort list of types will be kept at the bottom of this page, but new types may be added that are not documented here.
   * `keywords`: Contains an array of strings that correspond to an existing tag on Kaggle. If a specified tag doesn't exist, the upload will continue, but that specific tag won't be added.
+* `kaggle datasets metadata --update` (update metadata for an existing Dataset) supports all fields mentioned above for `kaggle datasets version`, and additionally:
+  * `expectedUpdateFrequency`: How often you expect to update your dataset with new versions. See [section below](#expected-update-frequencies) for possible values.
+  * `userSpecifiedSources`: An explanation of the source(s) of your dataset. Most basic markdown features are supported for this string.
+  * `image`: A relative file path to a new image file you want to use for your dataset. The path should be relative to the location of the dataset-metadata.json file. See [section below](#images) for more specifics about file types and expected image size.

 We will add further metadata processing in upcoming versions of the API.

@@ -170,5 +175,45 @@ You can specify the following values for `expectedUpdateFrequency`:
 * `daily`
 * `hourly`

-## Sources
-You can report your dataset sources in a markdown string for `userSpecifiedSources`. Most basic markdown features are supported.
+## Images
+You can update your dataset image by providing a relative path from your `datasets-metadata.json` to an image file, using the `image` property.
+
+For example, if your metadata file and image are located at:
+- `/some/path/dataset-metadata.json`
+- `/some/path/image.png`
+
+This property should be specified as:
+```
+"image": "image.png"
+```
+
+If instead, your files were located at:
+- `/some/path/dataset-metadata.json`
+- `/some/path/alternative/path/to/other-image.jpg`
+
+This property should be specified as:
+```
+"image": "alternative/path/to/other-image.jpg"
+```
+
+### Supported image file types and expected dimensions
+
+The following file types are supported:
+
+* `.png`
+* `.jpg`
+* `.jpeg`
+* `.webp`
+
+The image needs to have a minimum width of 560px and a minimum height of 280px.
+
+The same image file will be used for two different crops:
+
+- Header, 2:1 ratio
+  - Crop rectangle: width: 560px, height: 280px, top: 0, left: 0
+  - For an image with dimensions 560px x 280px, this will be the entire rectangular image.
+- Thumbnail, 1:1 ratio
+  - Crop rectangle: width: 280px, height: 280px, top: 0, left: 140px
+  - For an image with dimensions 560px x 280px, this will be a centered 280px square.
+
+While you can upload a larger image than 560px x 280px, the crops as specified above will be applied, and this may not look good. These crops can always be edited in the UI on kaggle.com on the settings page for your dataset.
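
As a reader's aid (not part of this commit), the path and size rules from the new documentation can be sketched in Python. The names `resolve_image_path`, `check_image`, and `CropRect` are hypothetical; only the extensions, the 560px x 280px minimum, and the two crop rectangles come from the docs above. Reading the real pixel dimensions from the file is left to the caller.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Values taken from the documentation in this commit.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}
MIN_WIDTH, MIN_HEIGHT = 560, 280


@dataclass(frozen=True)
class CropRect:
    """Hypothetical stand-in for a crop rectangle (pixels)."""
    title: str
    left: int
    top: int
    width: int
    height: int


# The two default crops described in the docs.
DEFAULT_CROPS = (
    CropRect("header", left=0, top=0, width=560, height=280),
    CropRect("thumbnail", left=140, top=0, width=280, height=280),
)


def resolve_image_path(metadata_file: str, image_value: str) -> Path:
    """Resolve the `image` value relative to the folder holding the metadata file."""
    return Path(metadata_file).parent / image_value


def check_image(path: Path, width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means the documented rules are met."""
    problems = []
    # Exact suffix match, mirroring the check added in this commit.
    if path.suffix not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported extension: %s" % path.suffix)
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        problems.append("image is smaller than %dx%d" % (MIN_WIDTH, MIN_HEIGHT))
    return problems
```

For a 560px x 280px image the header crop covers the whole image, and the thumbnail's `left` of 140 equals `(560 - 280) // 2`, which is why it is centered only at that exact width.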

src/kaggle/api/kaggle_api_extended.py

Lines changed: 70 additions & 4 deletions
@@ -25,6 +25,7 @@
 import json  # Needed by mypy.
 import logging
 import os
+from pathlib import Path

 import re  # Needed by mypy.
 import shutil
@@ -38,6 +39,7 @@
 from random import random

 import bleach
+import mimetypes
 import requests
 import urllib3.exceptions as urllib3_exceptions
 from requests import RequestException
@@ -82,6 +84,8 @@
     SubmissionSortBy,
 )

+from kagglesdk.common.types.cropped_image_upload import CroppedImageUpload, CroppedImageRectangle
+
 from kagglesdk.datasets.types.dataset_api_service import (
     ApiListDatasetsRequest,
     ApiListDatasetFilesRequest,
@@ -639,7 +643,6 @@ def with_retry(
         retry_multiplier: float = 1.7,
         randomness_factor: float = 0.5,
     ) -> Callable[[KaggleObject], KaggleObject]:
-
         def retriable_func(*args):
             for i in range(1, max_retries + 1):
                 try:
@@ -1928,6 +1931,13 @@ def dataset_metadata_update(self, dataset, path):
             expected_update_frequency = metadata.get("expectedUpdateFrequency")
             if expected_update_frequency:
                 update_settings.expected_update_frequency = expected_update_frequency
+
+            effective_relative_path_to_image = metadata.get("image")
+            if effective_relative_path_to_image:
+                cropped_image_upload = self._upload_dataset_image_file(effective_path, effective_relative_path_to_image)
+                if cropped_image_upload:
+                    update_settings.image = cropped_image_upload
+
             request = ApiUpdateDatasetMetadataRequest()
             request.owner_slug = owner_slug
             request.dataset_slug = dataset_slug
@@ -1938,6 +1948,53 @@ def dataset_metadata_update(self, dataset, path):
                     [print(error_message) for error_message in response.errors]
                     exit(1)

+    def _upload_dataset_image_file(
+        self, metadata_file_path, relative_image_file_path, quiet=False
+    ) -> CroppedImageUpload:
+        image_full_path = os.path.join(metadata_file_path, relative_image_file_path)
+        ext = Path(image_full_path).suffix
+        if ext not in [".jpg", ".jpeg", ".png", ".webp"]:
+            raise ValueError("Image file requires an extension of .jpg, .jpeg, .png, or .webp: %s" % image_full_path)
+
+        if not os.path.isfile(image_full_path):
+            raise ValueError("Image file was not found: %s" % image_full_path)
+
+        file_name = os.path.basename(image_full_path)
+        # Best guess for MIME type based on filename is ok, given we don't trust MIME type in the backend.
+        content_type, _ = mimetypes.guess_type(file_name)
+        with ResumableUploadContext() as upload_context:
+            upload_file = self._upload_file(
+                file_name,
+                image_full_path,
+                ApiBlobType.INBOX,
+                upload_context,
+                quiet,
+                resources=None,
+                content_type=content_type,
+            )
+            if not upload_file:
+                raise ValueError("Error uploading image file: %s" % image_full_path)
+
+        header_image_rect = CroppedImageRectangle()
+        header_image_rect.title = "cover image"
+        header_image_rect.top = 0
+        header_image_rect.left = 0
+        header_image_rect.width = 560
+        header_image_rect.height = 280
+
+        thumbnail_rect = CroppedImageRectangle()
+        thumbnail_rect.title = "thumbnail"
+        thumbnail_rect.top = 0
+        thumbnail_rect.left = 140
+        thumbnail_rect.width = 280
+        thumbnail_rect.height = 280
+
+        cropped_image_upload = CroppedImageUpload()
+        cropped_image_upload.token = upload_file.token
+        cropped_image_upload.crop_rectangles = [header_image_rect, thumbnail_rect]
+
+        return cropped_image_upload
+
     @staticmethod
     def _new_license(name):
         l = SettingsLicense()
@@ -2244,7 +2301,12 @@ def dataset_download_cli(
             self.dataset_download_file(dataset, file_name, path=path, force=force, quiet=quiet, licenses=licenses)

     def _upload_blob(
-        self, path: str, quiet: bool, blob_type: ApiBlobType, upload_context: ResumableUploadContext
+        self,
+        path: str,
+        quiet: bool,
+        blob_type: ApiBlobType,
+        upload_context: ResumableUploadContext,
+        content_type: Optional[str] = None,
     ) -> ResumableFileUpload | str | None:
         """Uploads a file.

@@ -2253,6 +2315,7 @@ def _upload_blob(
             quiet (bool): Suppress verbose output (default is False).
             blob_type (ApiBlobType): The entity to which the file/blob refers.
             upload_context (ResumableUploadContext): The context for resumable uploads.
+            content_type (str): Optional MIME content type, e.g. "text/plain", "image/png"

         Returns:
             Union[ResumableFileUpload, str, None]: A ResumableFileUpload object, a string, or None.
@@ -2266,9 +2329,10 @@ def _upload_blob(
         start_blob_upload_request.name = file_name
         start_blob_upload_request.content_length = content_length
         start_blob_upload_request.last_modified_epoch_seconds = last_modified_epoch_seconds
+        if content_type:
+            start_blob_upload_request.content_type = content_type

         file_upload = upload_context.new_resumable_file_upload(path, start_blob_upload_request)
-
         for i in range(0, self.MAX_UPLOAD_RESUME_ATTEMPTS):
             if file_upload.upload_complete:
                 return file_upload
@@ -4902,6 +4966,7 @@ def _upload_file(
         upload_context: ResumableUploadContext,
         quiet: bool,
         resources: Optional[List[Dict[str, Union[str, Dict[str, List[Dict[str, str]]]]]]],
+        content_type: Optional[str] = None,
     ) -> Union[UploadFile, None]:
         """A helper function to upload a single file.

@@ -4912,6 +4977,7 @@ def _upload_file(
             upload_context (ResumableUploadContext): The context for resumable uploads.
             quiet (bool): Suppress verbose output.
             resources (Optional[List[Dict[str, Union[str, Dict[str, List[Dict[str, str]]]]]]]): Optional file metadata.
+            content_type (str): Optional MIME content type, e.g. "text/plain", "image/png"

         Returns:
             Union[UploadFile, None]: An UploadFile object if the upload was successful, otherwise None.
@@ -4921,7 +4987,7 @@ def _upload_file(
             print("Starting upload for file " + file_name)

         content_length = os.path.getsize(full_path)
-        token = self._upload_blob(full_path, quiet, blob_type, upload_context)
+        token = self._upload_blob(full_path, quiet, blob_type, upload_context, content_type)
         if token is None:
             if not quiet:
                 print("Upload unsuccessful: " + file_name)
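
A possible review note on the content-type plumbing above, sketched as standalone Python. `_upload_dataset_image_file` compares `Path(...).suffix` as-is, so a file named `COVER.PNG` would be rejected, and `mimetypes.guess_type` can return `None`, which `_upload_blob` handles with `if content_type:`. The helper name `guess_image_content_type` and the lowercasing step are assumptions for illustration, not behavior of this commit.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Same extensions the commit accepts.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def guess_image_content_type(file_name: str) -> Optional[str]:
    """Validate the extension, then make a best-effort MIME guess from the name.

    Assumption: the extension is lowercased before the check, unlike the
    case-sensitive comparison in the commit.
    """
    ext = Path(file_name).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError("Image file requires an extension of .jpg, .jpeg, .png, or .webp: %s" % file_name)
    # guess_type returns (type, encoding); type is None when the extension is unknown
    # to the local MIME tables, so callers should treat the result as optional.
    content_type, _encoding = mimetypes.guess_type(file_name)
    return content_type
```

Because the guess depends on the MIME tables available to the running Python, some extensions may come back as `None`; the optional `content_type` parameter added in this commit already allows for that.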
