Commit c1e1329

recreate project as dirhash
0 parents  commit c1e1329

14 files changed

Lines changed: 2589 additions & 0 deletions

.coveragerc

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[run]
branch = True
source = dirhash

.gitignore

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# Pycharm
.idea/


# Project specific
benchmark/test_cases/*

.travis.yml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
language: python
dist: trusty
python:
- "2.7"
- "3.6"
install:
- pip install -e .
- pip install pytest-cov codecov
- pip freeze
script:
- py.test --cov-config=.coveragerc --cov=dirhash tests/
after_success:
- coverage report
- codecov

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2019 Anders Huss

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
[![Build Status](https://travis-ci.com/andhus/dirhash.svg?branch=master)](https://travis-ci.com/andhus/dirhash)
[![codecov](https://codecov.io/gh/andhus/dirhash/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash)

# dirhash
A lightweight python module and tool for computing the hash of any
directory based on its files' structure and content.
- Supports any hashing algorithm of Python's built-in `hashlib` module
- `.gitignore` style "wildmatch" patterns for expressive filtering of files to
include/exclude.
- Multiprocessing for up to [6x speed-up](#performance)

## Installation
```commandline
git clone git@github.com:andhus/dirhash.git
pip install dirhash/
```

## Usage
Python module:
```python
from dirhash import dirhash

dirpath = 'path/to/directory'
dir_md5 = dirhash(dirpath, 'md5')
filtered_sha1 = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])
pyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])
```
CLI:
```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a sha1 -i ".* .*/ *.pyc"
dirhash path/to/directory -a sha3_512 -m "*.py"
```
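
To make the idea of hashing "structure and content" concrete, here is a minimal Python 3 sketch. It is not the `dirhash` implementation: the names `file_digest` and `sketch_dirhash`, the chunk size, and the way paths and digests are combined are assumptions made for illustration, and the result will not match the output of `dirhash`.
```python
import hashlib
import os


def file_digest(path, algorithm, chunk_size=2 ** 20):
    """Hash one file's content, reading it in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def sketch_dirhash(root, algorithm="md5"):
    """Illustrative only: combine sorted (relative path, content digest) pairs."""
    entries = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Use "/" separators so the result does not depend on the platform.
            relpath = os.path.relpath(path, root).replace(os.sep, "/")
            entries.append((relpath, file_digest(path, algorithm)))
    hasher = hashlib.new(algorithm)
    for relpath, digest in sorted(entries):
        # Including the relative path means a rename or move changes the hash.
        hasher.update(("%s\0%s\n" % (relpath, digest)).encode("utf-8"))
    return hasher.hexdigest()
```
Sorting the entries keeps the result independent of the order in which the file system lists files. Empty directories do not affect the result in this sketch.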

## Why?
If you (or your application) need to verify the integrity of a set of files as well
as their names and locations, you might find this useful. Use-cases range from
verification of your image classification dataset (before spending GPU-$$$ on
training your fancy Deep Learning model) to validation of generated files in
regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out
there (see e.g. these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)
and [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python))
but I couldn't find one that is properly tested (there are some gotchas to cover!)
and documented with a compelling user interface. `dirhash` was created with this as
the goal.

[checksumdir](https://github.com/cakepietoast/checksumdir) is another python
module/tool with similar intent (that inspired this project) but it lacks much of the
functionality offered here (most notably including file names/structure in the hash)
and lacks tests.

## Performance
Python's `hashlib` implementations of common hashing algorithms are highly
optimised. `dirhash` mainly parses the file tree, pipes data to `hashlib` and
combines the output. Reasonable measures have been taken to minimize the overhead
and, for common use-cases, the majority of time is spent reading data from disk
and executing `hashlib` code.

The main effort to boost performance is support for multiprocessing, where the
reading and hashing are parallelized over individual files.
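
As a rough illustration of parallelizing over individual files, the Python 3 sketch below hashes each file in a worker process using the standard library's `multiprocessing.Pool`. The names `hash_file` and `hash_files_parallel`, the chunk size and the default worker count are assumptions for this example; it does not show how `dirhash` itself distributes the work.
```python
import hashlib
import os
from multiprocessing import Pool


def hash_file(path, algorithm="md5", chunk_size=2 ** 20):
    """Read one file in chunks and return its hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_files_parallel(paths, workers=4):
    """Return {path: hexdigest}; each file is read and hashed in a worker process."""
    paths = sorted(paths)
    with Pool(processes=workers) as pool:
        # pool.map preserves input order, so digests line up with paths.
        digests = pool.map(hash_file, paths)
    return dict(zip(paths, digests))


if __name__ == "__main__":
    # Hash every file below the current directory (hypothetical usage).
    files = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(".")
        for name in names
    ]
    for path, digest in hash_files_parallel(files).items():
        print(digest, path)
```
Because each file is an independent task, this approach helps most when there are many files; a single very large file would still be hashed by one worker.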

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash/blob/master/dirhash/cli.py)
with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

which is the top answer for the SO-question:
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)

Results for two test cases are shown below. Both have 1 GiB of random data: in
"flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in
"nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories
in a binary tree of depth 8.

Implementation      | Test Case       | Time (s) | Speed up
------------------- | --------------- | -------: | -------:
shell reference     | flat_1k_1MB     | 2.27     | -> 1.0
`dirhash`           | flat_1k_1MB     | 1.67     | 1.36
`dirhash`(8 workers)| flat_1k_1MB     | 0.48     | **4.73**
shell reference     | nested_32k_32kB | 6.82     | -> 1.0
`dirhash`           | nested_32k_32kB | 3.43     | 2.00
`dirhash`(8 workers)| nested_32k_32kB | 1.14     | **6.00**

The benchmark was run on a MacBook Pro (2018); further details and source code are available [here](https://github.com/andhus/dirhash/tree/master/benchmark).

## Documentation
Please refer to `dirhash -h` and the python [source code](https://github.com/andhus/dirhash/blob/master/dirhash/__init__.py).

benchmark/README.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Benchmark

As a reference, the performance of `dirhash` is benchmarked against the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

(top answer for the SO-question:
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents))

Each test case contains 1 GiB of random data, split equally into 8, 1k or 32k files,
in a flat or nested (binary tree of depth 8) structure.
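
A test case of this shape could be generated along the following lines. This is a Python 3 sketch under stated assumptions, not the code in `run.py`: the names `leaf_dirs` and `make_test_case`, the `0`/`1` directory names and the `file_<i>` naming are hypothetical.
```python
import os


def leaf_dirs(root, depth):
    """Yield the 2**depth leaf directories of a binary tree below root."""
    if depth == 0:
        yield root
        return
    for branch in ("0", "1"):
        for leaf in leaf_dirs(os.path.join(root, branch), depth - 1):
            yield leaf


def make_test_case(root, total_bytes, n_files, depth=0):
    """Write n_files files of random data, spread evenly over the leaf directories.

    depth=0 gives a flat structure; depth=8 gives 256 leaf directories.
    """
    leaves = list(leaf_dirs(root, depth))
    file_size = total_bytes // n_files
    for i in range(n_files):
        directory = leaves[i % len(leaves)]
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, "file_%d" % i), "wb") as f:
            f.write(os.urandom(file_size))
```
For example, `make_test_case("nested_32k_32kB", 2 ** 30, 32 * 1024, depth=8)` would write 32 KiB files into 256 leaf directories, 128 files per directory.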

For a fair comparison, *the CLI version* of `dirhash` was used (including startup
time for loading of python modules etc.).

For full details and reproducibility, see or run the `run.py` script; its output is
found in `results.csv`. These results were generated on a MacBook Pro (2018):
- 2.2 GHz Intel Core i7 (`sysctl -n hw.physicalcpu hw.logicalcpu` -> 6, 12)
- 16 GB 2400 MHz DDR4
- APPLE SSD AP0512M

## Sample results:

Implementation      | Test Case       | Time (s) | Speed up
------------------- | --------------- | -------: | -------:
shell reference     | flat_1k_1MB     | 2.27     | -> 1.0
`dirhash`           | flat_1k_1MB     | 1.67     | 1.36
`dirhash`(8 workers)| flat_1k_1MB     | 0.48     | **4.73**
shell reference     | nested_32k_32kB | 6.82     | -> 1.0
`dirhash`           | nested_32k_32kB | 3.43     | 2.00
`dirhash`(8 workers)| nested_32k_32kB | 1.14     | **6.00**

benchmark/results.csv

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
,test_case,implementation,algorithm,workers,t_best,t_median,speed-up (median)
0,flat_8_128MB,shell reference,md5,1,2.014,2.02,1.0
1,flat_8_128MB,dirhash,md5,1,1.602,1.604,1.2593516209476308
2,flat_8_128MB,dirhash,md5,2,0.977,0.98,2.061224489795918
3,flat_8_128MB,dirhash,md5,4,0.562,0.569,3.5500878734622145
4,flat_8_128MB,dirhash,md5,8,0.464,0.473,4.2706131078224105
5,flat_1k_1MB,shell reference,md5,1,2.263,2.268,1.0
6,flat_1k_1MB,dirhash,md5,1,1.662,1.667,1.3605278944211157
7,flat_1k_1MB,dirhash,md5,2,0.978,0.983,2.3072227873855544
8,flat_1k_1MB,dirhash,md5,4,0.57,0.58,3.910344827586207
9,flat_1k_1MB,dirhash,md5,8,0.476,0.48,4.725
10,flat_32k_32kB,shell reference,md5,1,6.711,6.721,1.0
11,flat_32k_32kB,dirhash,md5,1,3.329,3.354,2.003875968992248
12,flat_32k_32kB,dirhash,md5,2,2.067,2.074,3.240597878495661
13,flat_32k_32kB,dirhash,md5,4,1.345,1.362,4.934654919236417
14,flat_32k_32kB,dirhash,md5,8,1.09,1.094,6.143510054844606
15,nested_1k_1MB,shell reference,md5,1,2.296,2.306,1.0
16,nested_1k_1MB,dirhash,md5,1,1.713,1.714,1.3453908984830805
17,nested_1k_1MB,dirhash,md5,2,0.996,1.009,2.285431119920714
18,nested_1k_1MB,dirhash,md5,4,0.601,0.602,3.8305647840531565
19,nested_1k_1MB,dirhash,md5,8,0.499,0.505,4.566336633663366
20,nested_32k_32kB,shell reference,md5,1,6.814,6.818,1.0
21,nested_32k_32kB,dirhash,md5,1,3.376,3.426,1.9900758902510214
22,nested_32k_32kB,dirhash,md5,2,2.147,2.153,3.166744078030655
23,nested_32k_32kB,dirhash,md5,4,1.414,1.416,4.814971751412429
24,nested_32k_32kB,dirhash,md5,8,1.137,1.138,5.991212653778559
25,flat_8_128MB,shell reference,sha1,1,2.181,2.196,1.0
26,flat_8_128MB,dirhash,sha1,1,1.214,1.225,1.7926530612244898
27,flat_8_128MB,dirhash,sha1,2,0.768,0.774,2.8372093023255816
28,flat_8_128MB,dirhash,sha1,4,0.467,0.474,4.632911392405064
29,flat_8_128MB,dirhash,sha1,8,0.47,0.477,4.603773584905661
30,flat_1k_1MB,shell reference,sha1,1,2.221,2.229,1.0
31,flat_1k_1MB,dirhash,sha1,1,1.252,1.263,1.7648456057007127
32,flat_1k_1MB,dirhash,sha1,2,0.774,0.777,2.8687258687258685
33,flat_1k_1MB,dirhash,sha1,4,0.471,0.477,4.672955974842767
34,flat_1k_1MB,dirhash,sha1,8,0.378,0.478,4.663179916317992
35,flat_32k_32kB,shell reference,sha1,1,4.178,4.224,1.0
36,flat_32k_32kB,dirhash,sha1,1,2.921,3.008,1.4042553191489362
37,flat_32k_32kB,dirhash,sha1,2,1.888,1.892,2.232558139534884
38,flat_32k_32kB,dirhash,sha1,4,1.266,1.275,3.3129411764705887
39,flat_32k_32kB,dirhash,sha1,8,1.072,1.079,3.914735866543096
40,nested_1k_1MB,shell reference,sha1,1,2.236,2.26,1.0
41,nested_1k_1MB,dirhash,sha1,1,1.308,1.314,1.719939117199391
42,nested_1k_1MB,dirhash,sha1,2,0.797,0.8,2.8249999999999997
43,nested_1k_1MB,dirhash,sha1,4,0.501,0.509,4.4400785854616895
44,nested_1k_1MB,dirhash,sha1,8,0.499,0.503,4.493041749502981
45,nested_32k_32kB,shell reference,sha1,1,4.383,4.406,1.0
46,nested_32k_32kB,dirhash,sha1,1,3.041,3.05,1.4445901639344263
47,nested_32k_32kB,dirhash,sha1,2,1.943,1.965,2.242239185750636
48,nested_32k_32kB,dirhash,sha1,4,1.329,1.334,3.3028485757121433
49,nested_32k_32kB,dirhash,sha1,8,1.14,1.149,3.8346388163620535
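
The `speed-up (median)` column is consistent with dividing the shell reference's `t_median` by each row's `t_median` for the same test case and algorithm. The sketch below reproduces that arithmetic for three of the rows above; the helper name `speed_up` is an assumption and is not taken from `run.py`.
```python
def speed_up(reference_median, median):
    """Speed-up relative to the reference; values above 1.0 mean faster."""
    return reference_median / median


# (implementation, workers, t_median) for flat_1k_1MB with md5, copied from the CSV.
rows = [
    ("shell reference", 1, 2.268),
    ("dirhash", 1, 1.667),
    ("dirhash", 8, 0.48),
]

reference = rows[0][2]
for implementation, workers, t_median in rows:
    ratio = speed_up(reference, t_median)
    print("%-16s workers=%d speed-up=%.3f" % (implementation, workers, ratio))
```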
