I am trying to run the notebook "0 - Load data" and ran into an issue in the "Restore JSONs to MongoDB" cell. I have already downloaded repositories.json.gz and runs.json.gz as described in the README. Details below:
gha-dataset/jupyter $ ls
'0 - Load data.ipynb' '1 - Dataset metrics.ipynb' repositories.json.gz runs.json.gz
gha-dataset/jupyter $ file repositories.json.gz
repositories.json.gz: gzip compressed data, from Unix, original size modulo 2^32 69208463
gha-dataset/jupyter $ du -sh repositories.json.gz
67M repositories.json.gz
gha-dataset/jupyter $ file runs.json.gz
runs.json.gz: gzip compressed data, from Unix, original size modulo 2^32 1063390519 gzip compressed data, reserved method, has CRC, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 1063390519
gha-dataset/jupyter $ du -sh runs.json.gz
964M runs.json.gz
Now, when I run the cell mentioned above, I get the following encoding error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[11], line 2
1 with gzip.open("repositories.json.gz", "rt") as fd:
----> 2 for line in tqdm(fd):
3 mongo_repositories.insert_one(json.loads(fd.read()))
5 with gzip.open("runs.json.gz", "rt") as fd:
File ~/.py-venv/lib/python3.11/site-packages/tqdm/notebook.py:250, in tqdm_notebook.__iter__(self)
248 try:
249 it = super().__iter__()
--> 250 for obj in it:
251 # return super(tqdm...) will not catch exception
252 yield obj
253 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt
File ~/.py-venv/lib/python3.11/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
1178 time = self._time
1180 try:
-> 1181 for obj in iterable:
1182 yield obj
1183 # Update and possibly print the progressbar.
1184 # Note: does not call self.update(1) for speed optimisation.
File <frozen codecs>:322, in decode(self, input, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Any idea?
(cc @YuTeruya @kanalsop)