Commit 5101463

Combine multiple strategies for watermark removal
1 parent 5f45eb9 commit 5101463

3 files changed

Lines changed: 299 additions & 14 deletions

README.md

Lines changed: 32 additions & 12 deletions
@@ -182,9 +182,8 @@ docs/b.pdf|Test3|AppB|||||2-5|200,200,500,500|||
 + `-pp [password]` - The password if the PDF files protected
 + `-pn` - Preserve original directory test names when specifying pages
 + `-nf` - Normalize all PDF fonts to Helvetica 12pt before rendering. See [Font Normalization](#font-normalization) below.
-+ `-rw [text]` - Remove text watermarks matching the given string from PDFs before rendering (case-insensitive, exact match after trim). See [Watermark Removal](#watermark-removal) below.
-+ `-rwauto` - Auto-detect a vector watermark across PDFs. ImageTester groups PDFs by their containing directory and fingerprints each group separately, so a parent folder with one subfolder per environment (e.g. `pre/`, `uat/`) cleans correctly in one run. Each group needs at least 2 PDFs. See [Watermark Removal](#watermark-removal).
-+ `-rwo [dir]` - Standalone output mode. Write cleaned PDFs to the given directory and exit without uploading to Applitools. Combine with `-rw` or `-rwauto`.
++ `-rwauto` - Auto-detect and remove a watermark (a stamped outline like a diagonal "UAT - Proof"). ImageTester groups PDFs by their containing folder, detects the watermark's fill color shared across each group, then strips only the paths drawn in that color — all other content is left intact. Each group needs at least 2 PDFs **from the same source** (all carrying the same watermark); a parent folder with one subfolder per environment (e.g. `pre/`, `uat/`) is cleaned correctly in one run. See [Watermark Removal](#watermark-removal).
++ `-rwo [dir]` - Standalone output mode. Write cleaned PDFs to the given directory and exit without uploading to Applitools. Combine with `-rwauto`.
 
 ### Font Normalization
 Font changes (family swaps, weight tweaks, kerning differences) are one of the most common sources of
@@ -217,28 +216,49 @@ captured without normalization. Plan for a baseline refresh when rolling this ou
 
 ### Watermark Removal
 
-Strips pre-production watermarks ("DRAFT", "PRE-Proof", etc.) from PDFs in memory before uploading
-to Eyes. Original PDFs on disk are never modified.
+Pre-production watermarks ("DRAFT", "PRE-Proof", etc.) make every page diff against a
+clean baseline. The `-rwauto` flag strips them from PDFs in memory before uploading to Eyes.
+**The original PDFs on disk are never modified.**
 
-**Run it:**
+ImageTester groups your PDFs by their containing folder, detects the watermark's fill color shared
+across each group, then removes only the paths drawn in that color — everything else is untouched.
+It works whether the watermark is stamped identically in every PDF or restamped at a different
+position in each one.
+
+Requirements:
+- **At least 2 PDFs per folder, all from the same source** (same template, all carrying the same
+  watermark). Detection works by comparing the PDFs against each other, so a folder must not mix
+  unrelated documents or include a PDF that has no watermark.
+- Subfolders per environment (`pre/`, `uat/`, `staging/`)? Point at the parent — each subfolder is
+  detected and cleaned independently.
 
 ```
 java -jar ImageTester.jar -k YOUR_API_KEY -f pdfs/ -rwauto -a YourApp -fb YourBatch
 ```
 
-If your PDFs are organized into subfolders by environment (`pre/`, `uat/`, `staging/`), point at the
-parent — each subfolder is handled independently.
+On success ImageTester logs the color it found, e.g.
+`[uat] Watermark color rgb(179, 179, 179) detected across 4 PDF(s)`.
+
+#### Preview locally before uploading
 
-**Preview locally before uploading:**
+Add `-rwo <dir>` to write the cleaned **PDFs** to a folder and exit without contacting
+Eyes:
 
 ```
 java -jar ImageTester.jar -f pdfs/ -rwauto -rwo cleaned/
 ```
 
-Open the cleaned PDFs to verify, then re-run without `-rwo` to upload.
+Open the cleaned PDFs to confirm the watermark is gone and nothing else changed, then re-run without
+`-rwo` to upload.
+
+#### Troubleshooting
 
-**Single PDF, or watermark still visible after running** — ImageTester prints a notice with next
-steps. If unclear, contact Applitools support (support@applitools.com) with a sample PDF.
+- **"you're testing one PDF on its own" / nothing removed (`-rwauto`)** — auto mode needs at least 2
+  same-source PDFs in the folder. Add another report/invoice/email from the same system and re-run.
+- **Watermark still visible after `-rwauto`** — the folder probably mixes documents from different
+  sources, or includes a PDF that has no watermark. Make sure each folder holds only same-source PDFs
+  that all carry the watermark. If it still doesn't work, contact Applitools support
+  (support@applitools.com) with a sample PDF.
 
 **Note:** cleaning changes what Eyes sees, so it invalidates baselines captured before cleaning was
 enabled. Plan a baseline refresh on rollout.
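
The README text in this hunk describes two rules: PDFs are grouped by their containing folder, and a group needs at least 2 PDFs before anything shared can be detected. The Python sketch below only illustrates those two rules. It is not ImageTester's implementation, which is not part of this diff; `group_by_folder`, `shared_colors`, and the sample color sets are hypothetical names and made-up data.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_folder(pdf_paths):
    """Group PDF paths by their containing folder (hypothetical helper)."""
    groups = defaultdict(list)
    for path in pdf_paths:
        groups[PurePosixPath(path).parent.as_posix()].append(path)
    return dict(groups)


def shared_colors(colors_per_pdf):
    """Return the fill colors that appear in every PDF of one group.

    Returns an empty set for fewer than 2 PDFs, mirroring the README's
    "at least 2 PDFs per folder" requirement.
    """
    if len(colors_per_pdf) < 2:
        return set()
    return set.intersection(*(set(colors) for colors in colors_per_pdf))


paths = ["pdfs/pre/a.pdf", "pdfs/pre/b.pdf", "pdfs/uat/c.pdf"]
groups = group_by_folder(paths)

# Made-up fill colors per PDF; a real tool would read them from page content.
gray = (179, 179, 179)
pre_colors = [{gray, (0, 0, 0)}, {gray, (255, 0, 0)}]
candidates = shared_colors(pre_colors)
```

In this toy data only the gray survives the intersection. Real documents usually share body-text colors such as black as well, so a plain intersection would not be enough on its own to single out the watermark.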

src/main/java/com/applitools/imagetester/ImageTester.java

Lines changed: 2 additions & 2 deletions
@@ -697,8 +697,8 @@ private static Options getOptions() {
 
         options.addOption(Option.builder("rwo")
                 .longOpt("removeWatermarkOut")
-                .desc("Standalone mode: render watermark-cleaned PDFs to PNG files " +
-                        "in the given directory and exit. Requires -rw. No upload to Applitools.")
+                .desc("Standalone mode: write watermark-cleaned PDFs to the given " +
+                        "directory and exit. Combine with -rw or -rwauto. No upload to Applitools.")
                 .hasArg()
                 .argName("dir")
                 .build());

stamp_watermark.py

Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
+"""Convert an .xlsx to PDF and stamp Excel header/footer watermarks back on.
+
+LibreOffice silently drops VML header/footer pictures during xlsx -> pdf
+conversion. This wrapper resolves the watermark image directly out of the
+.xlsx package (sheet -> legacyDrawingHF -> VML -> media), runs soffice
+headlessly, then overlays the image on every page of the resulting PDF
+so the visual-test pipeline sees the watermark the customer expects.
+
+Usage:
+    python stamp_watermark.py input.xlsx output.pdf
+
+Requires: pypdf, reportlab, Pillow, LibreOffice (soffice) on PATH.
+"""
+
+import argparse
+import io
+import re
+import shutil
+import subprocess
+import sys
+import tempfile
+import zipfile
+from pathlib import Path
+from xml.etree import ElementTree as ET
+
+from PIL import Image
+from pypdf import PdfReader, PdfWriter
+from reportlab.lib.utils import ImageReader
+from reportlab.pdfgen import canvas
+
+SOFFICE_TIMEOUT_SECONDS = 120
+WATERMARK_SCALE_TO_FIT_RATIO = 0.75
+
+NS_RELATIONSHIPS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
+NS_PACKAGE_RELS = "http://schemas.openxmlformats.org/package/2006/relationships"
+NS_SHEETML = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
+NS_VML = "urn:schemas-microsoft-com:vml"
+NS_VML_OFFICE = "urn:schemas-microsoft-com:office:office"
+
+VML_STYLE_DIMENSION = re.compile(r"(width|height)\s*:\s*([\d.]+)\s*pt", re.IGNORECASE)
+
+
+def main():
+    args = parse_args()
+    xlsx_path = Path(args.xlsx)
+    output_path = Path(args.pdf)
+
+    if not xlsx_path.is_file():
+        sys.exit(f"Input not found: {xlsx_path}")
+
+    with tempfile.TemporaryDirectory(prefix="stamp-watermark-") as tmp:
+        tmp_dir = Path(tmp)
+        pdf_path = convert_to_pdf(xlsx_path, tmp_dir)
+        watermark = extract_header_footer_watermark(xlsx_path)
+
+        if watermark is None:
+            shutil.copyfile(pdf_path, output_path)
+            print(f"No header/footer watermark found; wrote plain PDF to {output_path}")
+            return
+
+        image_bytes, dims = watermark
+        page_count = stamp_pdf(pdf_path, image_bytes, dims, output_path)
+        print(f"Stamped watermark on {page_count} page(s): {output_path}")
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Convert an .xlsx to PDF and stamp Excel header/footer "
+                    "watermarks back on (LibreOffice drops them)."
+    )
+    parser.add_argument("xlsx", help="Input .xlsx file")
+    parser.add_argument("pdf", help="Output PDF path")
+    return parser.parse_args()
+
+
+def convert_to_pdf(xlsx_path, out_dir):
+    """Run LibreOffice headless to convert xlsx -> pdf. Return the produced PDF path."""
+    command = [
+        "soffice",
+        "--headless", "--norestore", "--nolockcheck", "--nofirststartwizard",
+        "--nologo", "--nodefault",
+        "--convert-to", "pdf",
+        "--outdir", str(out_dir),
+        str(xlsx_path),
+    ]
+    result = subprocess.run(
+        command, capture_output=True, text=True, timeout=SOFFICE_TIMEOUT_SECONDS
+    )
+    if result.returncode != 0:
+        raise RuntimeError(
+            f"soffice failed (exit {result.returncode}):\n{result.stderr}"
+        )
+
+    produced = out_dir / (xlsx_path.stem + ".pdf")
+    if not produced.exists() or produced.stat().st_size == 0:
+        raise RuntimeError(f"soffice produced no PDF at {produced}")
+    return produced
+
+
+def extract_header_footer_watermark(xlsx_path):
+    """Return (image_bytes, (width_pt, height_pt)) for the first
+    header/footer picture in the workbook, or None if none is present."""
+    with zipfile.ZipFile(xlsx_path) as zf:
+        sheet_paths = [
+            name for name in zf.namelist()
+            if name.startswith("xl/worksheets/") and name.endswith(".xml")
+        ]
+        for sheet_path in sheet_paths:
+            found = resolve_watermark_for_sheet(zf, sheet_path)
+            if found is not None:
+                return found
+    return None
+
+
+def resolve_watermark_for_sheet(zf, sheet_path):
+    """For a single sheet XML, walk legacyDrawingHF -> VML -> media. Return
+    (image_bytes, dims) or None if the sheet has no header/footer picture."""
+    with zf.open(sheet_path) as f:
+        sheet_root = ET.parse(f).getroot()
+
+    header_footer = sheet_root.find(f"{{{NS_SHEETML}}}headerFooter")
+    legacy_hf = sheet_root.find(f"{{{NS_SHEETML}}}legacyDrawingHF")
+    if header_footer is None or legacy_hf is None:
+        return None
+
+    if not has_graphic_token(header_footer):
+        return None
+
+    legacy_rid = legacy_hf.get(f"{{{NS_RELATIONSHIPS}}}id")
+    if not legacy_rid:
+        return None
+
+    sheet_dir = posix_dir(sheet_path)
+    sheet_rels_path = f"{sheet_dir}/_rels/{Path(sheet_path).name}.rels"
+    vml_path = resolve_relationship(zf, sheet_rels_path, legacy_rid, sheet_dir)
+    if vml_path is None:
+        return None
+
+    image_rid, dims = parse_vml(zf.read(vml_path))
+    if image_rid is None:
+        return None
+
+    vml_dir = posix_dir(vml_path)
+    vml_rels_path = f"{vml_dir}/_rels/{Path(vml_path).name}.rels"
+    image_path = resolve_relationship(zf, vml_rels_path, image_rid, vml_dir)
+    if image_path is None:
+        return None
+
+    return zf.read(image_path), dims
+
+
+def has_graphic_token(header_footer):
+    """True if any child of <headerFooter> contains the &G picture token."""
+    for child in header_footer:
+        if child.text and "&G" in child.text:
+            return True
+    return False
+
+
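
To make the sheet-level checks concrete, the following sketch parses a hand-written worksheet fragment and applies the same two tests that `resolve_watermark_for_sheet` performs first: a `&G` token in a `<headerFooter>` child, and a relationship id on `<legacyDrawingHF>`. The XML is an illustrative sample, not taken from a real workbook, and `has_graphic_token` is restated compactly here so the block runs on its own.

```python
from xml.etree import ElementTree as ET

NS_SHEETML = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_RELATIONSHIPS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Illustrative worksheet fragment: a centered header picture (&C&G) plus the
# legacyDrawingHF element that points at the VML drawing via r:id.
SAMPLE_SHEET = f"""
<worksheet xmlns="{NS_SHEETML}" xmlns:r="{NS_RELATIONSHIPS}">
  <sheetData/>
  <headerFooter><oddHeader>&amp;C&amp;G</oddHeader></headerFooter>
  <legacyDrawingHF r:id="rId2"/>
</worksheet>
"""


def has_graphic_token(header_footer):
    """True if any child of <headerFooter> contains the &G picture token."""
    return any(child.text and "&G" in child.text for child in header_footer)


root = ET.fromstring(SAMPLE_SHEET)
header_footer = root.find(f"{{{NS_SHEETML}}}headerFooter")
legacy_hf = root.find(f"{{{NS_SHEETML}}}legacyDrawingHF")

token_present = has_graphic_token(header_footer)
legacy_rid = legacy_hf.get(f"{{{NS_RELATIONSHIPS}}}id")
```

Both signals are needed: the token says a picture is placed in the header or footer, and the relationship id is what the script follows to reach the VML part.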
+def resolve_relationship(zf, rels_path, rid, base_dir):
+    """Resolve relationship Id `rid` in `rels_path` against `base_dir`."""
+    try:
+        with zf.open(rels_path) as f:
+            rels_root = ET.parse(f).getroot()
+    except KeyError:
+        return None
+
+    for rel in rels_root.findall(f"{{{NS_PACKAGE_RELS}}}Relationship"):
+        if rel.get("Id") == rid:
+            return normalize_zip_path(base_dir, rel.get("Target", ""))
+    return None
+
+
+def normalize_zip_path(base_dir, target):
+    """Resolve a relationship Target against its rels file's directory.
+    Pure string handling so .. segments work the same on every OS."""
+    parts = base_dir.split("/") + target.split("/")
+    out = []
+    for segment in parts:
+        if segment in ("", "."):
+            continue
+        if segment == "..":
+            if out:
+                out.pop()
+        else:
+            out.append(segment)
+    return "/".join(out)
+
+
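
A few worked calls show what `normalize_zip_path` returns for typical relative targets. The function body is copied from the diff so the block is self-contained; the part names are illustrative. `resolve_target` is a hypothetical wrapper that is not part of the commit: it sketches one way to handle targets written with a leading slash, which some producers emit and which the committed function would otherwise append to `base_dir`.

```python
def normalize_zip_path(base_dir, target):
    """Same logic as in the diff: join, then collapse '.' and '..' segments."""
    parts = base_dir.split("/") + target.split("/")
    out = []
    for segment in parts:
        if segment in ("", "."):
            continue
        if segment == "..":
            if out:
                out.pop()
        else:
            out.append(segment)
    return "/".join(out)


def resolve_target(base_dir, target):
    """Hypothetical wrapper: treat a leading '/' as relative to the package root."""
    if target.startswith("/"):
        return normalize_zip_path("", target)
    return normalize_zip_path(base_dir, target)


# Sheet rels -> VML drawing, then VML rels -> media image (illustrative names).
vml = normalize_zip_path("xl/worksheets", "../drawings/vmlDrawing1.vml")
image = normalize_zip_path("xl/drawings", "../media/image1.png")

# Leading-slash target: resolved from the package root by the wrapper.
absolute = resolve_target("xl/drawings", "/xl/media/image1.png")
```

Here `vml` is `xl/drawings/vmlDrawing1.vml`, and both `image` and `absolute` are `xl/media/image1.png`.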
+def posix_dir(zip_path):
+    return Path(zip_path).parent.as_posix()
+
+
+def parse_vml(vml_bytes):
+    """Return (first imagedata relid, (width_pt, height_pt)) from a VML drawing,
+    or (None, (None, None)) if no shape with imagedata is present."""
+    root = ET.fromstring(vml_bytes)
+    for shape in root.iter(f"{{{NS_VML}}}shape"):
+        imagedata = shape.find(f"{{{NS_VML}}}imagedata")
+        if imagedata is None:
+            continue
+        relid = imagedata.get(f"{{{NS_VML_OFFICE}}}relid")
+        if not relid:
+            continue
+        dims = parse_vml_dimensions(shape.get("style", ""))
+        return relid, dims
+    return None, (None, None)
+
+
+def parse_vml_dimensions(style):
+    found = {}
+    for key, value in VML_STYLE_DIMENSION.findall(style):
+        found[key.lower()] = float(value)
+    return found.get("width"), found.get("height")
+
+
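
The style parsing can be tried in isolation. The regex and function below are copied from the diff; the two style strings are made-up examples of what a VML shape's `style` attribute might contain.

```python
import re

VML_STYLE_DIMENSION = re.compile(r"(width|height)\s*:\s*([\d.]+)\s*pt", re.IGNORECASE)


def parse_vml_dimensions(style):
    """Return (width_pt, height_pt); either is None if not declared in points."""
    found = {}
    for key, value in VML_STYLE_DIMENSION.findall(style):
        found[key.lower()] = float(value)
    return found.get("width"), found.get("height")


# Illustrative style string with both dimensions given in points.
style = "position:absolute;margin-left:0;margin-top:0;width:467.25pt;height:311.5pt;z-index:1"
dims = parse_vml_dimensions(style)

# A width in inches does not match the pattern, so nothing is extracted.
missing = parse_vml_dimensions("position:absolute;width:6in")
```

`dims` is `(467.25, 311.5)` and `missing` is `(None, None)`. Because the pattern only accepts the `pt` unit, a shape sized in other units yields no dimensions, and `watermark_size` then falls back to its fit-to-page branch.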
+def stamp_pdf(pdf_path, image_bytes, dims, output_path):
+    """Overlay watermark on every page. Return the number of pages stamped."""
+    reader = PdfReader(str(pdf_path))
+    writer = PdfWriter()
+    image = Image.open(io.BytesIO(image_bytes)).convert("RGBA")
+
+    count = 0
+    for page in reader.pages:
+        page_w = float(page.mediabox.width)
+        page_h = float(page.mediabox.height)
+        wm_w, wm_h = watermark_size(image, dims, page_w, page_h)
+        overlay = build_overlay_page(image, page_w, page_h, wm_w, wm_h)
+        page.merge_page(overlay)
+        writer.add_page(page)
+        count += 1
+
+    with open(output_path, "wb") as f:
+        writer.write(f)
+    return count
+
+
+def watermark_size(image, dims, page_w, page_h):
+    """Pick the watermark draw size in points. Prefer the dimensions declared in
+    the VML (matches Excel's intent); otherwise fit-to-page preserving aspect."""
+    wm_w, wm_h = dims
+    if wm_w and wm_h:
+        return wm_w, wm_h
+    scale = min(
+        (page_w * WATERMARK_SCALE_TO_FIT_RATIO) / image.width,
+        (page_h * WATERMARK_SCALE_TO_FIT_RATIO) / image.height,
+    )
+    return image.width * scale, image.height * scale
+
+
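
The sizing rule is easiest to follow with numbers. The sketch below copies `watermark_size` from the diff and calls it for a US Letter page (612 x 792 pt). The 1024 x 512 image is a made-up example, and `SimpleNamespace` stands in for the PIL image because the function only reads `.width` and `.height`.

```python
from types import SimpleNamespace

WATERMARK_SCALE_TO_FIT_RATIO = 0.75


def watermark_size(image, dims, page_w, page_h):
    """Prefer declared VML dimensions; otherwise fit within 75% of the page."""
    wm_w, wm_h = dims
    if wm_w and wm_h:
        return wm_w, wm_h
    scale = min(
        (page_w * WATERMARK_SCALE_TO_FIT_RATIO) / image.width,
        (page_h * WATERMARK_SCALE_TO_FIT_RATIO) / image.height,
    )
    return image.width * scale, image.height * scale


# Stand-in for a PIL image: only .width and .height are read.
image = SimpleNamespace(width=1024, height=512)

# Declared dimensions win and are returned unchanged.
declared = watermark_size(image, (467.25, 311.5), 612.0, 792.0)

# No declared dimensions: width is the limiting side (459 / 1024 < 594 / 512).
fitted = watermark_size(image, (None, None), 612.0, 792.0)
```

With no declared size the image is scaled by 459 / 1024, giving `(459.0, 229.5)`: it fills 75% of the page width and keeps its 2:1 aspect ratio.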
+def build_overlay_page(image, page_w, page_h, wm_w, wm_h):
+    """Create a single-page PDF the size of the target page with the watermark
+    centered, and return that page for merging."""
+    buf = io.BytesIO()
+    c = canvas.Canvas(buf, pagesize=(page_w, page_h))
+    x = (page_w - wm_w) / 2
+    y = (page_h - wm_h) / 2
+    c.drawImage(ImageReader(image), x, y, width=wm_w, height=wm_h, mask="auto")
+    c.save()
+    buf.seek(0)
+    return PdfReader(buf).pages[0]
+
+
+if __name__ == "__main__":
+    main()
