# source{d} mining software for the DockerHub dataset
This repository contains the scripts and utilities used to produce a dataset of the libraries found in Docker images, by analyzing each image's filesystem in depth.
Requirements:
- Docker installed
- IPython installed
Usage:

```shell
ipython main.py ./images.txt ./packages
```

`./images.txt` contains the list of images to analyze, one per line. If no tag is specified, `latest` is used.
Example `images.txt`:

```
amancevice/superset
ubuntu:18.04
express-gateway
alpine/node
archmageinc/node-web-dev
```
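The tag-defaulting rule can be sketched as follows. This is only an illustration of the behavior, not the actual parsing code in `main.py`; the function name `parse_images` is hypothetical, and the sketch ignores registry hosts with ports (e.g. `localhost:5000/img`), where a `:` can appear outside the tag.

```python
def parse_images(lines):
    """Parse images.txt lines, defaulting to the :latest tag."""
    images = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # A ':' separates the image name from its tag;
        # without one, the :latest tag is assumed.
        if ":" not in line:
            line += ":latest"
        images.append(line)
    return images
```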
`./packages` is the directory where the results will be written.
The output directory structure matches the DockerhubMetadata dataset: the top-level directory is the first two letters of the image name, and the inner directories follow the full image name, including any `/`. The `:latest` tag is stripped from file names. Examples: the configuration for `tensorflow/tensorflow:2.0.0b0` is written to `te/tensorflow/tensorflow:2.0.0b0.json`, and the one for `mongo:latest` to `mo/mongo.json`.
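The path scheme just described can be expressed as a short sketch (illustrative only; `output_path` is not a function defined in this repo):

```python
import os

def output_path(image: str) -> str:
    """Map an image reference to its JSON path in the output directory."""
    # ":latest" is stripped so the default tag never appears in file names.
    if image.endswith(":latest"):
        image = image[: -len(":latest")]
    # Top-level directory: the first two letters of the image name;
    # any "/" inside the name becomes a nested directory.
    return os.path.join(image[:2], image + ".json")
```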
`show_count` is a bash script that reports the number of already-fetched images on source{d}'s typos{1-4} nodes. It is of no use outside the organization and should be removed before the repo is made public; it is kept here as documentation of the ongoing tasks.