Error restoring JSONs to MongoDB in 0 - Load Data #2

@garandria

I am trying to run the notebook 0 - Load Data and hit an issue in the Restore JSONs to MongoDB cell. I have already downloaded repositories.json.gz and runs.json.gz as mentioned in the README. More details below:

gha-dataset/jupyter $ ls
'0 - Load data.ipynb'  '1 - Dataset metrics.ipynb'   repositories.json.gz   runs.json.gz
gha-dataset/jupyter $ file repositories.json.gz
repositories.json.gz: gzip compressed data, from Unix, original size modulo 2^32 69208463
gha-dataset/jupyter $ du -sh repositories.json.gz
67M     repositories.json.gz
gha-dataset/jupyter $ file runs.json.gz
runs.json.gz: gzip compressed data, from Unix, original size modulo 2^32 1063390519 gzip compressed data, reserved method, has CRC, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 1063390519
gha-dataset/jupyter $ du -sh runs.json.gz
964M    runs.json.gz

Now, when I run the cell mentioned above, I get the following encoding error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[11], line 2
      1 with gzip.open("repositories.json.gz", "rt") as fd:
----> 2     for line in tqdm(fd):
      3         mongo_repositories.insert_one(json.loads(fd.read()))
      5 with gzip.open("runs.json.gz", "rt") as fd:

File ~/.py-venv/lib/python3.11/site-packages/tqdm/notebook.py:250, in tqdm_notebook.__iter__(self)
    248 try:
    249     it = super().__iter__()
--> 250     for obj in it:
    251         # return super(tqdm...) will not catch exception
    252         yield obj
    253 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/.py-venv/lib/python3.11/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
   1178 time = self._time
   1180 try:
-> 1181     for obj in iterable:
   1182         yield obj
   1183         # Update and possibly print the progressbar.
   1184         # Note: does not call self.update(1) for speed optimisation.

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any idea?
(cc @YuTeruya @kanalsop)
