Skip to content

Commit 63decae

Browse files
committed
Escape XML special characters in GDALMetadata writer (#1614)
`_build_gdal_metadata_xml` embedded caller-supplied dict keys and values into the GDALMetadata XML payload using plain f-strings, with no XML escaping. Any string containing `& < > " '` corrupted the document and round-tripped through `_parse_gdal_metadata` as `{}`; a crafted key like `foo" malicious="bar` injected extra attributes into the `<Item>` element. Route every text slot through `xml.sax.saxutils.escape` and every attribute slot through `quoteattr`, mirroring the helpers added in `_vrt.py` by #1607. Sample indices stay int-cast literals. Reachable from `to_geotiff`, `_write_vrt_tiled`, and `write_geotiff_gpu` whenever the user supplies `attrs['gdal_metadata']` as a dict. Found by the geotiff security sweep (Cat 5, MEDIUM). Same bug class as #1607 on a separate code path. Existing `test_features.TestGDALMetadata::test_build_gdal_metadata_xml` still passes; new tests pin escape behaviour for every XML special character and guard against attribute / element injection.
1 parent 0088dfe commit 63decae

3 files changed

Lines changed: 150 additions & 3 deletions

File tree

.claude/sweep-security-state.csv

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ fire,2026-04-25,,,,,"Clean. Despite the module's size hint, fire.py is purely pe
1818
flood,2026-05-03,1437,MEDIUM,3,,Re-audit 2026-05-03. MEDIUM Cat 3 fixed in PR #1438 (travel_time and flood_depth_vegetation now validate mannings_n DataArray values are finite and strictly positive via _validate_mannings_n_dataarray helper). No remaining unfixed findings. Other categories clean: every allocation is same-shape as input; no flat index math; NaN propagation explicit in every backend; tan_slope clamped by _TAN_MIN; no CUDA kernels; no file I/O; every public API calls _validate_raster on DataArray inputs.
1919
focal,2026-04-27,1284,HIGH,1,,"HIGH (fixed PR #1286): apply(), focal_stats(), and hotspots() accepted unbounded user-supplied kernels via custom_kernel(), which only checks shape parity. The kernel-size guard from #1241 (_check_kernel_memory) only ran inside circle_kernel/annulus_kernel, so a (50001, 50001) custom kernel on a 10x10 raster allocated ~10 GB on the kernel itself plus a much larger padded raster before any work -- same shape as the bilateral DoS in #1236. Fixed by adding _check_kernel_vs_raster_memory in focal.py and wiring it into apply(), focal_stats(), and hotspots() after custom_kernel() validation. All 134 focal tests + 19 bilateral tests pass. No other findings: 10 CUDA kernels all have proper bounds + stencil guards; _validate_raster called on every public entry point; hotspots already raises ZeroDivisionError on constant-value rasters; _focal_variety_cuda uses a fixed-size local buffer (silent truncation but bounded); _focal_std_cuda/_focal_var_cuda clamp the catastrophic-cancellation case via if var < 0.0: var = 0.0; no file I/O."
2020
geodesic,2026-04-27,1283,HIGH,1,,"HIGH (fixed PR #1285): slope(method='geodesic') and aspect(method='geodesic') stack a (3, H, W) float64 array (data, lat, lon) before dispatch with no memory check. A large lat/lon-tagged raster passed to either function would OOM. Fixed by adding _check_geodesic_memory(rows, cols) in xrspatial/geodesic.py (mirrors morphology._check_kernel_memory): budgets 56 bytes/cell (24 stacked float64 + 4 float32 output + 24 padded copy + slack) and raises MemoryError when > 50% of available RAM; called from slope.py and aspect.py inside the geodesic branch before dispatch. No other findings: 6 CUDA kernels all have bounds guards (e.g. _run_gpu_geodesic_aspect at geodesic.py:395), custom 16x16 thread blocks avoid register spill, no shared memory, _validate_raster runs upstream in slope/aspect, all backends cast to float32, slope_mag < 1e-7 flat threshold prevents arctan2 NaN propagation, curvature correction uses hardcoded WGS84 R."
21-
geotiff,2026-05-11,1607,MEDIUM,5,,"MEDIUM (Cat 5 XML injection, filed #1607): write_vrt in _vrt.py embedded the caller-supplied crs_wkt and source filenames into VRT XML via plain f-strings, so values carrying XML special chars (< > & "" ') could break the document or inject elements like <Metadata><MDI key='AREA_OR_POINT'>Point</MDI></Metadata> (verified the injection flipped raster_type from 'area' to 'point' on a round trip). Fix: route every text slot through xml.sax.saxutils.escape via new _xml_text/_xml_attr helpers; numeric fields stay int/float literal interpolation. Other categories clean: parse_header / parse_ifd enforce MAX_IFD_ENTRY_COUNT, MAX_IFD_ENTRY_BYTES, MAX_IFDS. _safe_xml.safe_fromstring rejects DOCTYPE. read_to_array gates dims via MAX_PIXELS_DEFAULT (1B px) and HTTP per-tile via XRSPATIAL_COG_MAX_TILE_BYTES (256 MiB). validate_tile_layout runs before both CPU and GPU tile decoders (#1219). All CUDA kernels in _gpu_decode.py guard tile_idx >= tile_offsets.shape[0] and write loops gate on out_pos < dst_len. LZW/deflate decoders are called only with expected_size > 0 from _decode_strip_or_tile, so the decompression-bomb cap fires. _FileSource uses os.path.realpath; _vrt.parse_vrt canonicalises SourceFilename via realpath. No int32 overflow paths."
21+
geotiff,2026-05-11,1614,MEDIUM,5,,"MEDIUM (Cat 5 XML injection, filed #1614): _build_gdal_metadata_xml in _geotags.py used plain f-strings to embed caller-supplied keys and values into the GDALMetadata XML payload (tag 42112), so a key or value carrying XML special chars (< > & "" ') silently produced malformed XML (ParseError on read -> attrs round-tripped as {}) or let a crafted key inject attributes into <Item> (e.g. name='foo"" malicious=""bar'). Fix mirrors #1607: route every text slot through xml.sax.saxutils.escape and every attribute slot through quoteattr; sample indices are emitted from int() casts. Same bug class as #1607 but on a separate code path the earlier sweep did not cover. Other categories clean (rest of geotiff already hardened by #1607/#1579/#1584/#1219/#1196/#1189). Reachable from to_geotiff, _write_vrt_tiled, and write_geotiff_gpu whenever a caller passes attrs['gdal_metadata'] as a dict."
2222
glcm,2026-04-24,1257,HIGH,1,,"HIGH (fixed #1257): glcm_texture() validated window_size only as >= 3 and distance only as >= 1, with no upper bound on either. _glcm_numba_kernel iterates range(r-half, r+half+1) for every pixel, so window_size=1_000_001 on a 10x10 raster ran ~10^14 loop iterations with all neighbors failing the interior bounds check (CPU DoS). On the dask backends depth = window_size // 2 + distance drove map_overlap padding, so a huge window also caused oversize per-chunk allocations (memory DoS). Fixed by adding max_val caps in the public entrypoint: window_size <= max(3, min(rows, cols)) and distance <= max(1, window_size // 2). One cap covers every backend because cupy and dask+cupy call through to the CPU kernel after cupy.asnumpy. No other HIGH findings: levels is already capped at 256 so the per-pixel np.zeros((levels, levels)) matrix in the kernel is bounded to 512 KB. No CUDA kernels. No file I/O. Quantization clips to [0, levels-1] before the kernel and NaN maps to -1 which the kernel filters with i_val >= 0. Entropy log(p) and correlation p / (std_i * std_j) are both guarded. All four backends use _validate_raster and cast to float64 before quantizing. MEDIUM (unfixed, Cat 1): the per-pixel np.zeros((levels, levels)) allocation inside the hot loop is a perf issue (levels=256 -> 512 KB alloc+free per pixel) but not a security issue because levels is bounded. Could be hoisted out of the loop or replaced with an in-place clear, but that is an efficiency concern, not security."
2323
gpu_rtx,2026-04-29,1308,HIGH,1,,"HIGH (fixed #1308 / PR #1310): hillshade_rtx (gpu_rtx/hillshade.py:184) and viewshed_gpu (gpu_rtx/viewshed.py:269) allocated cupy device buffers sized by raster shape with no memory check. create_triangulation (mesh_utils.py:23-24) adds verts (12 B/px) + triangles (24 B/px) = 36 B/px; hillshade_rtx adds d_rays(32) + d_hits(16) + d_aux(12) + d_output(4) = 64 B/px (100 B/px total); viewshed_gpu adds d_rays(32) + d_hits(16) + d_visgrid(4) + d_vsrays(32) = 84 B/px (120 B/px total). A 30000x30000 raster asked for 90-108 GB of VRAM before cupy surfaced an opaque allocator error. Fixed by adding gpu_rtx/_memory.py with _available_gpu_memory_bytes() and _check_gpu_memory(func_name, h, w) helpers (cost_distance #1262 / sky_view_factor #1299 pattern, 120 B/px budget covers worst case, raises MemoryError when required > 50% of free VRAM, skips silently when memGetInfo() unavailable). Wired into both entry points after the cupy.ndarray type check and before create_triangulation. 9 new tests in test_gpu_rtx_memory.py (5 helper-unit + 4 end-to-end gated on has_rtx). All 81 existing hillshade/viewshed tests still pass. Cat 4 clean: all CUDA kernels (hillshade.py:25/62/106, viewshed.py:32/74/116, mesh_utils.py:50) have bounds guards; no shared memory, no syncthreads needed. MEDIUM not fixed (Cat 6): hillshade_rtx and viewshed_gpu do not call _validate_raster directly but parent hillshade() (hillshade.py:252) and viewshed() (viewshed.py:1707) already validate, so input validation runs before the gpu_rtx entry point - defense-in-depth, not exploitable. MEDIUM not fixed (Cat 2): mesh_utils.py:64-68 cast mesh_map_index to int32 in the triangle index buffer; overflows at H*W > 2.1B vertices (~46341x46341+) but the new memory guard rejects rasters that large first - documentation/clarity item rather than exploitable. MEDIUM not fixed (Cat 3): mesh_utils.py:19 scale = maxDim / maxH divides by zero on an all-zero raster, propagating inf/NaN into mesh vertex z-coords; separate follow-up. LOW not fixed (Cat 5): mesh_utils.write() opens user-supplied path without canonicalization but its only call site (mesh_utils.py:38-39) sits behind if False: in create_triangulation, not reachable in production."
2424
hillshade,2026-04-27,,,,,"Clean. Cat 1: only allocation is the output np.empty(data.shape) at line 32 (cupy at line 165) and a _pad_array with hardcoded depth=1 (line 62) -- bounded by caller, no user-controlled amplifier. Azimuth/altitude are scalars and don't drive size. Cat 2: numba kernel uses range(1, rows-1) with simple (y, x) indexing; numba range loops promote to int64. Cat 3: math.sqrt(1.0 + xx_plus_yy) is always >= 1.0 (no neg sqrt, no div-by-zero); NaN elevation propagates correctly through dz_dx/dz_dy -> shaded -> output (the shaded < 0.0 / shaded > 1.0 clamps don't fire on NaN). Azimuth validated to [0, 360], altitude to [0, 90]. Cat 4: _gpu_calc_numba (line 107) guards both grid bounds and 3x3 stencil reads via i > 0 and i < shape[0]-1 and j > 0 and j < shape[1]-1; no shared memory. Cat 5: no file I/O. Cat 6: hillshade() calls _validate_raster (line 252) and _validate_scalar for both azimuth (253) and angle_altitude (254); all four backend paths cast to float32; tests parametrize int32/int64/float32/float64."

xrspatial/geotiff/_geotags.py

Lines changed: 21 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -199,15 +199,34 @@ def _build_gdal_metadata_xml(meta: dict) -> str:
199199
200200
Accepts the same dict format that _parse_gdal_metadata produces:
201201
string keys for dataset-level, (name, band) tuples for per-band.
202+
203+
Every caller-supplied text and attribute slot is routed through
204+
``xml.sax.saxutils.escape`` / ``quoteattr`` so a key or value
205+
containing XML special characters (``& < > " '``) cannot break
206+
the document or inject extra elements. Sample indices are emitted
207+
from an ``int(...)`` cast and need no escaping. See issue #1614.
202208
"""
209+
from xml.sax.saxutils import escape as _xml_escape, quoteattr as _xml_quoteattr
210+
211+
def _text(v) -> str:
212+
if v is None:
213+
return ""
214+
return _xml_escape(str(v), {'"': "&quot;", "'": "&apos;"})
215+
216+
def _attr(v) -> str:
217+
if v is None:
218+
return '""'
219+
return _xml_quoteattr(str(v))
220+
203221
lines = ['<GDALMetadata>']
204222
for key, value in meta.items():
205223
if isinstance(key, tuple):
206224
name, sample = key
207225
lines.append(
208-
f' <Item name="{name}" sample="{sample}">{value}</Item>')
226+
f' <Item name={_attr(name)} sample="{int(sample)}">'
227+
f'{_text(value)}</Item>')
209228
else:
210-
lines.append(f' <Item name="{key}">{value}</Item>')
229+
lines.append(f' <Item name={_attr(key)}>{_text(value)}</Item>')
211230
lines.append('</GDALMetadata>')
212231
return '\n'.join(lines) + '\n'
213232

Lines changed: 128 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,128 @@
1+
"""Tests for #1614: _build_gdal_metadata_xml escapes XML special chars.
2+
3+
The serialiser previously interpolated keys and values with plain
4+
f-strings, which corrupted the document on any value containing
5+
``& < > " '`` and let a caller-crafted key inject extra attributes
6+
into the ``<Item>`` element. This file pins the escaped behaviour:
7+
all five XML special characters round-trip through write/read
8+
without loss, and attribute slots refuse injection attempts.
9+
"""
10+
from __future__ import annotations
11+
12+
import xml.etree.ElementTree as ET
13+
14+
import numpy as np
15+
import pytest
16+
17+
from xrspatial.geotiff._geotags import (
18+
_build_gdal_metadata_xml,
19+
_parse_gdal_metadata,
20+
)
21+
22+
23+
class TestBuildGdalMetadataXMLEscape:
24+
"""Verify each XML special character round-trips intact."""
25+
26+
@pytest.mark.parametrize(
27+
"payload",
28+
[
29+
"A & B",
30+
"value < 5",
31+
"value > 5",
32+
'quote "x"',
33+
"apos 'x'",
34+
"&<>\"'", # all five at once
35+
"no specials here", # control: plain text still works
36+
],
37+
)
38+
def test_value_round_trip_for_special_chars(self, payload):
39+
meta = {"NOTE": payload}
40+
xml = _build_gdal_metadata_xml(meta)
41+
# Document parses as well-formed XML
42+
ET.fromstring(xml)
43+
# Round-trip via the package's own parser reproduces the value
44+
assert _parse_gdal_metadata(xml) == meta
45+
46+
@pytest.mark.parametrize(
47+
"payload",
48+
[
49+
"A & B",
50+
"value < 5",
51+
'quote "x"',
52+
],
53+
)
54+
def test_per_band_value_round_trip(self, payload):
55+
meta = {("STAT", 0): payload}
56+
xml = _build_gdal_metadata_xml(meta)
57+
ET.fromstring(xml)
58+
assert _parse_gdal_metadata(xml) == meta
59+
60+
def test_attribute_injection_via_name_blocked(self):
61+
"""A key containing a quote cannot inject extra attributes."""
62+
injection_key = 'foo" malicious="bar'
63+
meta = {injection_key: "value"}
64+
xml = _build_gdal_metadata_xml(meta)
65+
66+
root = ET.fromstring(xml)
67+
items = root.findall("Item")
68+
assert len(items) == 1
69+
# Only the legitimate ``name`` attribute exists; the injection
70+
# attempt did not create a second attribute.
71+
assert set(items[0].attrib.keys()) == {"name"}
72+
assert items[0].attrib["name"] == injection_key
73+
assert items[0].text == "value"
74+
# Round-trip reproduces the original key intact.
75+
assert _parse_gdal_metadata(xml) == meta
76+
77+
def test_element_injection_via_value_blocked(self):
78+
"""A value with closing-tag syntax cannot inject a sibling Item."""
79+
meta = {"OUTER": "</Item><Item name=\"X\">stuff</Item><!--"}
80+
xml = _build_gdal_metadata_xml(meta)
81+
82+
root = ET.fromstring(xml)
83+
items = root.findall("Item")
84+
# Only the one Item we wrote -- the injection payload sits inside
85+
# its text content, not as a new element.
86+
assert len(items) == 1
87+
assert items[0].attrib == {"name": "OUTER"}
88+
assert _parse_gdal_metadata(xml) == meta
89+
90+
def test_unicode_value_unaffected(self):
91+
"""Non-ASCII text passes through unchanged (only XML specials escape)."""
92+
meta = {"DESC": "élévation ☃"}
93+
xml = _build_gdal_metadata_xml(meta)
94+
ET.fromstring(xml)
95+
assert _parse_gdal_metadata(xml) == meta
96+
97+
def test_existing_clean_string_format_unchanged(self):
98+
"""Plain keys/values still emit the original double-quoted form."""
99+
xml = _build_gdal_metadata_xml({"DataType": "Generic"})
100+
# quoteattr picks double quotes when the string has no double
101+
# quote of its own, so the legacy assertion ``name="DataType"``
102+
# still holds and downstream tools see the same output.
103+
assert '<Item name="DataType">Generic</Item>' in xml
104+
105+
106+
class TestToGeotiffMetadataRoundTrip:
107+
"""End-to-end: special-char metadata survives a write/read cycle."""
108+
109+
def test_to_geotiff_special_chars_round_trip(self, tmp_path):
110+
import xarray as xr
111+
from xrspatial.geotiff import open_geotiff, to_geotiff
112+
113+
arr = np.ones((4, 4), dtype=np.float32)
114+
da = xr.DataArray(arr, dims=["y", "x"])
115+
da.attrs["gdal_metadata"] = {
116+
"NOTE": "value < 5 & ok",
117+
("STAT", 0): 'quote "x"',
118+
}
119+
120+
path = tmp_path / "meta.tif"
121+
to_geotiff(da, str(path))
122+
123+
opened = open_geotiff(str(path))
124+
round_tripped = opened.attrs.get("gdal_metadata")
125+
assert round_tripped is not None, opened.attrs
126+
# Both dataset-level and per-band entries survive intact.
127+
assert round_tripped.get("NOTE") == "value < 5 & ok"
128+
assert round_tripped.get(("STAT", 0)) == 'quote "x"'

0 commit comments

Comments
 (0)