
Commit 4c584a2

Merge pull request #595 from qlustered/dev
9.1.0
2 parents 2db8427 + f4e58b5 commit 4c584a2

35 files changed

Lines changed: 4402 additions & 859 deletions

.github/workflows/main.yaml

Lines changed: 3 additions & 3 deletions
@@ -15,10 +15,10 @@ jobs:
         architecture: ['x64']
 
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v5
 
       - name: Setup Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
           architecture: ${{ matrix.architecture }}
@@ -50,7 +50,7 @@ jobs:
 
       - name: Upload coverage
         if: ${{ matrix.python-version == '3.14' }}
-        uses: codecov/codecov-action@v4
+        uses: codecov/codecov-action@v5
         with:
           token: ${{ secrets.CODECOV_TOKEN }}
           file: coverage.xml

AUTHORS.md

Lines changed: 2 additions & 0 deletions
@@ -86,3 +86,5 @@ Authors in order of the timeline of their contributions:
 - [srini047](https://github.com/srini047) for fixing README typo.
 - [Nagato-Yuzuru](https://github.com/Nagato-Yuzuru) for colored view tests.
 - [akshat62](https://github.com/akshat62) for adding Fraction numeric support.
+- [akshat62](https://github.com/akshat62) for adding wildcard/glob pattern support for `exclude_paths` and `include_paths`.
+- [mgorny](https://github.com/mgorny) for adding missing files to sdist and removing obsolete `MANIFEST.in`.

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,5 +1,19 @@
 # DeepDiff Change log
 
+- v9-1-0
+    - Added multiprocessing support for DeepDiff: parallel distance computation and parallel subtree diffing with aggregated worker stats, deterministic ordering, and automatic fallback to serial when unsafe (e.g. `custom_operators`, `*_obj_callback`, `ignore_order_func`)
+    - Added wildcard/glob pattern support for `exclude_paths` and `include_paths` thanks to [akshat62](https://github.com/akshat62)
+    - Reimplemented internal cache for improved performance
+    - Memoized `GlobPathMatcher` to remove exponential-time matching cliff
+    - Comprehensive type-hint corrections across `deephash.py`, `helper.py`, `delta.py`, `diff.py`, `distance.py`, `path.py`, and `serialization.py` (also fixed real bugs: misplaced paren in `path._guess_type` call, and `len(other.indexes > 1)` → `len(other.indexes) > 1` in `diff._compare_in_order`)
+    - Security: Delta dunder-attribute traversal in `check_elem()` now raises immediately instead of going through `_raise_or_log()`, with full-path preflight validation in `_get_elements_and_details()` so the `set_item_added` path cannot silently skip malicious dunder paths
+    - Fixed nested NamedTuple set/frozenset Delta updates dropping the outer container
+    - Fixed tuple Deltas using iterable opcodes silently doing nothing for insert/delete-only changes
+    - Fixed Delta with both moved and added iterable items mutating the Delta's own internal diff data
+    - Fixed crash during path sorting when removing multiple dictionary items with complex keys
+    - Packaging: added missing files to sdist and removed obsolete `MANIFEST.in` thanks to [mgorny](https://github.com/mgorny)
+    - Updated GitHub Actions workflows and dependencies
+
 - v9-0-0
     - migration note:
         - `to_dict()` and `to_json()` now accept a `verbose_level` parameter and always return a usable text-view dict. When the original view is `'tree'`, they default to `verbose_level=2` for full detail. The old `view_override` parameter is removed. To get the previous results, you will need to pass the explicit verbose_level to `to_json` and `to_dict` if you are using the tree view.
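
Reviewer note on the "Memoized `GlobPathMatcher`" entry: the "exponential-time matching cliff" is a general hazard of backtracking glob matchers, where several multi-segment wildcards can retry the same (pattern position, path position) pair many times. The sketch below is not DeepDiff's `GlobPathMatcher`. The function name `glob_match`, the tuple-of-segments representation, and the `*` / `**` semantics are all assumptions chosen for illustration. It only shows how caching each state keeps the work bounded by pattern length times path length.

```python
from functools import lru_cache
from typing import Tuple


def glob_match(pattern: Tuple[str, ...], path: Tuple[str, ...]) -> bool:
    """Match path segments against a pattern (hypothetical semantics).

    '*'  matches exactly one segment.
    '**' matches zero or more segments.
    Any other token must equal the segment literally.
    """

    @lru_cache(maxsize=None)
    def match(pi: int, si: int) -> bool:
        # Pattern exhausted: succeed only if the path is exhausted too.
        if pi == len(pattern):
            return si == len(path)
        token = pattern[pi]
        if token == "**":
            # Either '**' matches nothing, or it swallows one segment
            # and stays active. Without the cache these two branches
            # revisit the same (pi, si) states over and over.
            return match(pi + 1, si) or (si < len(path) and match(pi, si + 1))
        if si == len(path):
            return False
        if token == "*" or token == path[si]:
            return match(pi + 1, si + 1)
        return False

    return match(0, 0)


# An adversarial input: many '**' tokens followed by a literal that never
# appears. With memoization only len(pattern) * len(path) states are visited.
adversarial = glob_match(("**",) * 20 + ("x",), ("a",) * 40)
```

The cache is created per call, so it never outlives the pattern and path it was built for. A long-lived matcher object would need some way to bound or clear its cache instead.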

MANIFEST.in

Lines changed: 0 additions & 20 deletions
This file was deleted.

README.md

Lines changed: 16 additions & 20 deletions
@@ -1,4 +1,4 @@
-# DeepDiff v 9.0.0
+# DeepDiff v 9.1.0
 
 ![Downloads](https://img.shields.io/pypi/dm/deepdiff.svg?style=flat)
 ![Python Versions](https://img.shields.io/pypi/pyversions/deepdiff.svg?style=flat)
@@ -21,29 +21,25 @@
 
 Tested on Python 3.10+ and PyPy3.
 
-- **[Documentation](https://zepworks.com/deepdiff/9.0.0/)**
+- **[Documentation](https://zepworks.com/deepdiff/9.1.0/)**
 
 ## What is new?
 
 Please check the [ChangeLog](CHANGELOG.md) file for the detailed information.
 
-DeepDiff 9-0-0
-- migration note:
-    - `to_dict()` and `to_json()` now accept a `verbose_level` parameter and always return a usable text-view dict. When the original view is `'tree'`, they default to `verbose_level=2` for full detail. The old `view_override` parameter is removed. To get the previous results, you will need to pass the explicit verbose_level to `to_json` and `to_dict` if you are using the tree view.
-- Dropping support for Python 3.9
-- Support for python 3.14
-- Added support for callable `group_by` thanks to @echan5
-- Added `FlatDeltaDict` TypedDict for `to_flat_dicts` return type
-- Fixed colored view display when all list items are removed thanks to @yannrouillard
-- Fixed `hasattr()` swallowing `AttributeError` in `__slots__` handling for objects with `__getattr__` thanks to @tpvasconcelos
-- Fixed `ignore_order=True` missing int-vs-float type changes
-- Fixed Delta producing phantom entries when items both move and change values with `iterable_compare_func` thanks to @devin13cox
-- Fixed `_convert_oversized_ints` failing on NamedTuples
-- Fixed orjson `TypeError` for integers exceeding 64-bit range
-- Fixed parameter bug in `to_flat_dicts` where `include_action_in_path` and `report_type_changes` were not being passed through
-- Fixed `ignore_keys` issue in `detailed__dict__` thanks to @vitalis89
-- Fixed logarithmic similarity type hint thanks to @ljames8
-- Added `Fraction` numeric support thanks to @akshat62
+DeepDiff 9-1-0
+- Added multiprocessing support for DeepDiff: parallel distance computation and parallel subtree diffing with aggregated worker stats, deterministic ordering, and automatic fallback to serial when unsafe (e.g. `custom_operators`, `*_obj_callback`, `ignore_order_func`)
+- Added wildcard/glob pattern support for `exclude_paths` and `include_paths` thanks to @akshat62
+- Reimplemented internal cache for improved performance
+- Memoized `GlobPathMatcher` to remove exponential-time matching cliff
+- Comprehensive type-hint corrections across `deephash.py`, `helper.py`, `delta.py`, `diff.py`, `distance.py`, `path.py`, and `serialization.py` (also fixed real bugs: misplaced paren in `path._guess_type` call, and `len(other.indexes > 1)` → `len(other.indexes) > 1` in `diff._compare_in_order`)
+- Security: Delta dunder-attribute traversal in `check_elem()` now raises immediately instead of going through `_raise_or_log()`, with full-path preflight validation in `_get_elements_and_details()` so the `set_item_added` path cannot silently skip malicious dunder paths
+- Fixed nested NamedTuple set/frozenset Delta updates dropping the outer container
+- Fixed tuple Deltas using iterable opcodes silently doing nothing for insert/delete-only changes
+- Fixed Delta with both moved and added iterable items mutating the Delta's own internal diff data
+- Fixed crash during path sorting when removing multiple dictionary items with complex keys
+- Packaging: added missing files to sdist and removed obsolete `MANIFEST.in` thanks to @mgorny
+- Updated GitHub Actions workflows and dependencies
 
 ## Installation
 
@@ -77,7 +73,7 @@ Please take a look at the [CHANGELOG](CHANGELOG.md) file.
 
 # Survey
 
-:mega: **Please fill out our [fast 5-question survey](https://forms.gle/E6qXexcgjoKnSzjB8)** so that we can learn how & why you use DeepDiff, and what improvements we should make. Thank you! :dancers:
+:mega: **Please fill out our [fast 10-question survey](https://tally.so/r/J98MPY)** so that we can learn how & why you use DeepDiff, and what improvements we should make. Thank you! :dancers:
 
 # Local dev
 
benchmarks/__init__.py

Whitespace-only changes.

benchmarks/multiprocessing_bench.py

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
+"""Benchmarks for the internal multiprocessing mode (Subticket #7).
+
+Goal: provide a reproducible "is multiprocessing actually faster?" check for
+the workloads multi_processing.md flags as the primary targets — the
+``ignore_order=True`` distance loop, paired-subtree diffs, and large lists of
+nested dicts. Each workload runs serial first, then parallel at a few worker
+counts; we print a single results table.
+
+Usage::
+
+    source ~/.venvs/deep/bin/activate
+    python -m benchmarks.multiprocessing_bench
+
+    # Smaller, faster sweep:
+    python -m benchmarks.multiprocessing_bench --quick
+
+    # Just one workload:
+    python -m benchmarks.multiprocessing_bench --only paired_subtree
+
+The script also asserts that the parallel result equals the serial result for
+every workload — a benchmark that produces wrong answers is worse than no
+benchmark at all. If any pair diverges the script exits non-zero.
+
+The numbers here are not committed; they're meant to inform threshold tuning
+(see DEFAULT_THRESHOLD in deepdiff/_multiprocessing.py) and to expose
+regressions when the hot path changes. Re-run on your hardware before drawing
+conclusions — process spawn overhead and IPC pickle cost vary wildly across
+machines.
+"""
+
+import argparse
+import os
+import sys
+import time
+from typing import Any, Callable, Dict, List, Tuple
+
+# Make the package importable when the script is run from a checkout.
+HERE = os.path.dirname(os.path.abspath(__file__))
+ROOT = os.path.dirname(HERE)
+if ROOT not in sys.path:
+    sys.path.insert(0, ROOT)
+
+from deepdiff import DeepDiff  # noqa: E402
+
+
+# ---------------------------------------------------------------------------
+# Workloads.
+#
+# Each builder returns ``(t1, t2, kwargs)`` where ``kwargs`` is the DeepDiff
+# constructor arguments common to both the serial and parallel runs.
+# Multiprocessing parameters are added by the runner; workloads should not set
+# them.
+# ---------------------------------------------------------------------------
+
+
+def workload_paired_subtree(scale: int) -> Tuple[Any, Any, Dict[str, Any]]:
+    """Heavy paired-subtree diff path.
+
+    Each item is a small dict whose nested ``data`` differs by one element;
+    pairing kicks in for every item, so the subtree-parallel path runs.
+    """
+    n = scale
+    t1 = [{"id": i, "data": {"x": i, "y": [i, i + 1, i + 2]}} for i in range(n)]
+    t2 = [{"id": i, "data": {"x": i, "y": [i, i + 1, i + 3]}} for i in range(n)]
+    return t1, t2, {"ignore_order": True, "cutoff_intersection_for_pairs": 1}
+
+
+def workload_distance_loop(scale: int) -> Tuple[Any, Any, Dict[str, Any]]:
+    """Heavy added-vs-removed distance grid.
+
+    All ids are disjoint between t1 and t2, so every t2 item is "added" and
+    every t1 item is "removed". The candidate distance grid is N*N, which is
+    where the distance worker pool earns its keep.
+    """
+    n = scale
+    t1 = [{"id": i, "v": [i, i, i]} for i in range(n)]
+    t2 = [{"id": i + 10_000, "v": [i, i, i + 1]} for i in range(n)]
+    return t1, t2, {"ignore_order": True, "cutoff_intersection_for_pairs": 1}
+
+
+def workload_large_nested_dicts(scale: int) -> Tuple[Any, Any, Dict[str, Any]]:
+    """Large list of moderately-deep dicts with one mutation each.
+
+    The shape mirrors the JSON-like blobs the doc calls out: each item is
+    several layers deep with a mix of strings, ints, and nested lists.
+    """
+    n = scale
+
+    def make(i: int, mutate: int) -> Dict[str, Any]:
+        return {
+            "id": i,
+            "name": "name-%d" % i,
+            "tags": ["t%d" % (i + j) for j in range(5)],
+            "details": {
+                "score": i + mutate,
+                "history": [{"step": j, "value": j * 2 + mutate} for j in range(4)],
+                "meta": {"created_at": "2024-01-%02d" % ((i % 28) + 1),
+                         "owner": "user-%d" % (i % 17)},
+            },
+        }
+
+    t1 = [make(i, 0) for i in range(n)]
+    t2 = [make(i, 1 if i % 7 == 0 else 0) for i in range(n)]
+    return t1, t2, {"ignore_order": True, "cutoff_intersection_for_pairs": 1}
+
+
+WORKLOADS: Dict[str, Callable[[int], Tuple[Any, Any, Dict[str, Any]]]] = {
+    "paired_subtree": workload_paired_subtree,
+    "distance_loop": workload_distance_loop,
+    "large_nested_dicts": workload_large_nested_dicts,
+}
+
+
+# ---------------------------------------------------------------------------
+# Runner.
+# ---------------------------------------------------------------------------
+
+
+def _time(fn: Callable[[], Any]) -> Tuple[float, Any]:
+    start = time.perf_counter()
+    result = fn()
+    return time.perf_counter() - start, result
+
+
+def run_one(name: str, scale: int, worker_counts: List[int]) -> List[Dict[str, Any]]:
+    """Run one workload serial + parallel and return one row per worker count.
+
+    The serial result is computed once and reused as the correctness reference
+    for every parallel run.
+    """
+    t1, t2, kwargs = WORKLOADS[name](scale)
+    print(f"\n=== {name} (scale={scale}) ===")
+    print(f"input shape: t1 has {len(t1)} items, t2 has {len(t2)} items")
+
+    serial_time, serial_result = _time(lambda: DeepDiff(t1, t2, **kwargs))
+    print(f"serial: {serial_time:.3f}s")
+
+    rows: List[Dict[str, Any]] = [{
+        "workload": name, "scale": scale,
+        "mode": "serial", "workers": 1,
+        "time_s": serial_time, "speedup": 1.0,
+        "ok": True,
+    }]
+
+    for workers in worker_counts:
+        parallel_time, parallel_result = _time(lambda: DeepDiff(
+            t1, t2,
+            multiprocessing=True,
+            multiprocessing_workers=workers,
+            multiprocessing_threshold=0,
+            **kwargs,
+        ))
+        ok = parallel_result == serial_result
+        speedup = serial_time / parallel_time if parallel_time > 0 else float("inf")
+        marker = "" if ok else " !! RESULT MISMATCH !!"
+        print(f"parallel(workers={workers}): {parallel_time:.3f}s "
+              f"speedup={speedup:.2f}x{marker}")
+        rows.append({
+            "workload": name, "scale": scale,
+            "mode": "parallel", "workers": workers,
+            "time_s": parallel_time, "speedup": speedup,
+            "ok": ok,
+        })
+    return rows
+
+
+def print_table(rows: List[Dict[str, Any]]) -> None:
+    """Compact summary table at the end of the run."""
+    print("\n=== summary ===")
+    header = ("workload", "scale", "mode", "workers", "time_s", "speedup", "ok")
+    print("%-22s %6s %-9s %7s %10s %9s %4s" % header)
+    print("-" * 72)
+    for r in rows:
+        print("%-22s %6d %-9s %7d %10.3f %9.2f %4s" % (
+            r["workload"], r["scale"], r["mode"],
+            r["workers"], r["time_s"], r["speedup"],
+            "yes" if r["ok"] else "NO",
+        ))
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__,
+                                     formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument(
+        "--only", choices=list(WORKLOADS), action="append", default=None,
+        help="run only the named workload(s); may be repeated. Default: all.",
+    )
+    parser.add_argument(
+        "--workers", type=int, action="append", default=None,
+        help="explicit worker count to test; may be repeated. "
+             "Default: 2 and min(4, cpu_count).",
+    )
+    parser.add_argument(
+        "--scale", type=int, default=None,
+        help="override per-workload scale (number of items). Larger = more "
+             "wall time. Default: a per-workload value below.",
+    )
+    parser.add_argument(
+        "--quick", action="store_true",
+        help="use small scales for a fast sanity-check run.",
+    )
+    args = parser.parse_args()
+
+    workloads = args.only or list(WORKLOADS)
+    cpu = os.cpu_count() or 1
+    workers_list = args.workers or [2, min(4, cpu)]
+    # Deduplicate while preserving order — repeated --workers flags shouldn't
+    # cause duplicate rows.
+    workers_list = list(dict.fromkeys(workers_list))
+
+    # Default scales tuned so each row takes a few seconds serially. Override
+    # via --scale or --quick. These are starting points, not gospel.
+    default_scales = {
+        "paired_subtree": 200,
+        "distance_loop": 120,
+        "large_nested_dicts": 200,
+    }
+    quick_scales = {
+        "paired_subtree": 60,
+        "distance_loop": 40,
+        "large_nested_dicts": 60,
+    }
+    scales = quick_scales if args.quick else default_scales
+    if args.scale is not None:
+        scales = {name: args.scale for name in workloads}
+
+    print("DeepDiff multiprocessing benchmark")
+    print(f"cpu_count={cpu} workers tested={workers_list}")
+
+    all_rows: List[Dict[str, Any]] = []
+    for name in workloads:
+        all_rows.extend(run_one(name, scales[name], workers_list))
+
+    print_table(all_rows)
+
+    # Non-zero exit if any parallel run produced a different result than its
+    # serial reference — that's the one regression mode this script must catch.
+    if any(not r["ok"] for r in all_rows):
+        print("\nFAIL: at least one parallel run did not match its serial reference.")
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
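
Reviewer note: the core pattern of the benchmark above (time a reference run once, time each candidate, compare results for equality, and fail on any mismatch) can be tried without DeepDiff installed. The sketch below is a standard-library-only reduction of that pattern. `timed`, `compare`, `slow_sum`, `closed_form`, and `buggy` are hypothetical stand-ins and are not part of the commit; the stand-in functions replace the serial and parallel `DeepDiff(...)` calls.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[float, Any]:
    """Return (elapsed seconds, result) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return time.perf_counter() - start, result


def compare(
    reference: Callable[[], Any],
    candidates: Dict[str, Callable[[], Any]],
) -> List[Dict[str, Any]]:
    """Time the reference once, then check every candidate against it."""
    ref_time, ref_result = timed(reference)
    rows: List[Dict[str, Any]] = [
        {"mode": "reference", "time_s": ref_time, "speedup": 1.0, "ok": True}
    ]
    for label, fn in candidates.items():
        elapsed, result = timed(fn)
        rows.append({
            "mode": label,
            "time_s": elapsed,
            # Guard against a zero reading from the timer.
            "speedup": ref_time / elapsed if elapsed > 0 else float("inf"),
            "ok": result == ref_result,
        })
    return rows


# Stand-in workloads: a slow reference, a fast equivalent, and a wrong one.
def slow_sum(n: int) -> int:
    return sum(i * i for i in range(n))


def closed_form(n: int) -> int:
    return (n - 1) * n * (2 * n - 1) // 6


def buggy(n: int) -> int:
    return closed_form(n) + 1


rows = compare(
    lambda: slow_sum(10_000),
    {"closed_form": lambda: closed_form(10_000), "buggy": lambda: buggy(10_000)},
)
# Mirror the benchmark's exit policy: any mismatch is a failure.
exit_code = 1 if any(not r["ok"] for r in rows) else 0
```

Computing the reference once and reusing it for every candidate keeps the correctness check cheap, at the cost of timing the reference on a single sample. Timings from a run like this are noisy and are best read as rough indications.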
