Skip to content

Commit 0682de9

Browse files
authored
Escape XML special characters in write_vrt output (#1607) (#1610)
write_vrt in _vrt.py emitted the caller-supplied crs_wkt and the source filenames into the VRT XML via plain f-strings. A value containing one of the predefined XML entities (& < > " ') either broke the document or opened the door to element injection: a WKT carrying "</SRS><Metadata><MDI key='AREA_OR_POINT'>Point</MDI></Metadata><SRS>" flipped the parsed VRT's raster_type from "area" to "point" on a round trip. Route every text slot through xml.sax.saxutils.escape / quoteattr via new _xml_text and _xml_attr helpers. Numeric fields (offsets, sizes, pixel scales) stay as int/float literal interpolation because they cannot carry markup. Added test_vrt_xml_escape_1607.py covering: WKT round-trip with each predefined entity; the headline injection case (raster_type stays "area"); ampersand inside a source filename; XML well-formedness of the written bytes.
1 parent c80d0f3 commit 0682de9

3 files changed

Lines changed: 158 additions & 13 deletions

File tree

.claude/sweep-security-state.csv

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ fire,2026-04-25,,,,,"Clean. Despite the module's size hint, fire.py is purely pe
1818
flood,2026-05-03,1437,MEDIUM,3,,Re-audit 2026-05-03. MEDIUM Cat 3 fixed in PR #1438 (travel_time and flood_depth_vegetation now validate mannings_n DataArray values are finite and strictly positive via _validate_mannings_n_dataarray helper). No remaining unfixed findings. Other categories clean: every allocation is same-shape as input; no flat index math; NaN propagation explicit in every backend; tan_slope clamped by _TAN_MIN; no CUDA kernels; no file I/O; every public API calls _validate_raster on DataArray inputs.
1919
focal,2026-04-27,1284,HIGH,1,,"HIGH (fixed PR #1286): apply(), focal_stats(), and hotspots() accepted unbounded user-supplied kernels via custom_kernel(), which only checks shape parity. The kernel-size guard from #1241 (_check_kernel_memory) only ran inside circle_kernel/annulus_kernel, so a (50001, 50001) custom kernel on a 10x10 raster allocated ~10 GB on the kernel itself plus a much larger padded raster before any work -- same shape as the bilateral DoS in #1236. Fixed by adding _check_kernel_vs_raster_memory in focal.py and wiring it into apply(), focal_stats(), and hotspots() after custom_kernel() validation. All 134 focal tests + 19 bilateral tests pass. No other findings: 10 CUDA kernels all have proper bounds + stencil guards; _validate_raster called on every public entry point; hotspots already raises ZeroDivisionError on constant-value rasters; _focal_variety_cuda uses a fixed-size local buffer (silent truncation but bounded); _focal_std_cuda/_focal_var_cuda clamp the catastrophic-cancellation case via if var < 0.0: var = 0.0; no file I/O."
2020
geodesic,2026-04-27,1283,HIGH,1,,"HIGH (fixed PR #1285): slope(method='geodesic') and aspect(method='geodesic') stack a (3, H, W) float64 array (data, lat, lon) before dispatch with no memory check. A large lat/lon-tagged raster passed to either function would OOM. Fixed by adding _check_geodesic_memory(rows, cols) in xrspatial/geodesic.py (mirrors morphology._check_kernel_memory): budgets 56 bytes/cell (24 stacked float64 + 4 float32 output + 24 padded copy + slack) and raises MemoryError when > 50% of available RAM; called from slope.py and aspect.py inside the geodesic branch before dispatch. No other findings: 6 CUDA kernels all have bounds guards (e.g. _run_gpu_geodesic_aspect at geodesic.py:395), custom 16x16 thread blocks avoid register spill, no shared memory, _validate_raster runs upstream in slope/aspect, all backends cast to float32, slope_mag < 1e-7 flat threshold prevents arctan2 NaN propagation, curvature correction uses hardcoded WGS84 R."
21-
geotiff,2026-05-11,1579,HIGH,5,,"HIGH (fixed #1579, PR pending): xml.etree.ElementTree.fromstring in _vrt.parse_vrt and _geotags._parse_gdal_metadata expanded internal XML entities (CWE-776 billion-laughs). A crafted .vrt file or a hostile GDALMetadata tag (42112) embedded in a TIFF could OOM the reader. Fixed by adding _safe_xml.safe_fromstring which refuses DOCTYPE declarations (where entities live) before parsing and prefers defusedxml when installed. GDALMetadata path falls through to empty dict on rejection (non-essential metadata); VRT path raises ValueError (file is required). 9 new tests added under TestVRTXMLEntityExpansion, TestGDALMetadataXMLEntityExpansion, TestSafeXMLHelper. No other CRITICAL/HIGH/MEDIUM findings: existing guards already cover Cat 1 (MAX_PIXELS_DEFAULT + per-tile/strip checks + VRT _check_dimensions + HTTP tile byte cap via XRSPATIAL_COG_MAX_TILE_BYTES, 256 MiB default), Cat 2 (header uses int64 offsets, MAX_IFD_ENTRY_COUNT=100k, MAX_IFDS=256, MAX_IFD_ENTRY_BYTES=256 KiB), Cat 4 (every CUDA kernel inspected has bounds guard: _byte_swap_lanes_kernel:91, _lzw_decode_tiles_kernel:168, _predictor_decode_kernel_*:725-771, _fp_predictor_decode_kernel:792, _assemble_tiles_kernel:841; shared-memory arrays are 4096-cell fixed allocations sized for LZW table, no overflow path), Cat 5 (VRT SourceFilename canonicalised via realpath, _writer atomic-rename via mkstemp in dirname-of-abspath), Cat 6 (open_geotiff promotes integer + nodata to float64 with NaN; _validate_dtype_cast rejects float->int casts). Decompression-bomb guards already enforced via _DECOMPRESS_MARGIN=1.05 cap in _compression.deflate_decompress."
21+
geotiff,2026-05-11,1607,MEDIUM,5,,"MEDIUM (Cat 5 XML injection, filed #1607): write_vrt in _vrt.py embedded the caller-supplied crs_wkt and source filenames into VRT XML via plain f-strings, so values carrying XML special chars (< > & "" ') could break the document or inject elements like <Metadata><MDI key='AREA_OR_POINT'>Point</MDI></Metadata> (verified the injection flipped raster_type from 'area' to 'point' on a round trip). Fix: route every text slot through xml.sax.saxutils.escape via new _xml_text/_xml_attr helpers; numeric fields stay int/float literal interpolation. Other categories clean: parse_header / parse_ifd enforce MAX_IFD_ENTRY_COUNT, MAX_IFD_ENTRY_BYTES, MAX_IFDS. _safe_xml.safe_fromstring rejects DOCTYPE. read_to_array gates dims via MAX_PIXELS_DEFAULT (1B px) and HTTP per-tile via XRSPATIAL_COG_MAX_TILE_BYTES (256 MiB). validate_tile_layout runs before both CPU and GPU tile decoders (#1219). All CUDA kernels in _gpu_decode.py guard tile_idx >= tile_offsets.shape[0] and write loops gate on out_pos < dst_len. LZW/deflate decoders are called only with expected_size > 0 from _decode_strip_or_tile, so the decompression-bomb cap fires. _FileSource uses os.path.realpath; _vrt.parse_vrt canonicalises SourceFilename via realpath. No int32 overflow paths."
2222
glcm,2026-04-24,1257,HIGH,1,,"HIGH (fixed #1257): glcm_texture() validated window_size only as >= 3 and distance only as >= 1, with no upper bound on either. _glcm_numba_kernel iterates range(r-half, r+half+1) for every pixel, so window_size=1_000_001 on a 10x10 raster ran ~10^14 loop iterations with all neighbors failing the interior bounds check (CPU DoS). On the dask backends depth = window_size // 2 + distance drove map_overlap padding, so a huge window also caused oversize per-chunk allocations (memory DoS). Fixed by adding max_val caps in the public entrypoint: window_size <= max(3, min(rows, cols)) and distance <= max(1, window_size // 2). One cap covers every backend because cupy and dask+cupy call through to the CPU kernel after cupy.asnumpy. No other HIGH findings: levels is already capped at 256 so the per-pixel np.zeros((levels, levels)) matrix in the kernel is bounded to 512 KB. No CUDA kernels. No file I/O. Quantization clips to [0, levels-1] before the kernel and NaN maps to -1 which the kernel filters with i_val >= 0. Entropy log(p) and correlation p / (std_i * std_j) are both guarded. All four backends use _validate_raster and cast to float64 before quantizing. MEDIUM (unfixed, Cat 1): the per-pixel np.zeros((levels, levels)) allocation inside the hot loop is a perf issue (levels=256 -> 512 KB alloc+free per pixel) but not a security issue because levels is bounded. Could be hoisted out of the loop or replaced with an in-place clear, but that is an efficiency concern, not security."
2323
gpu_rtx,2026-04-29,1308,HIGH,1,,"HIGH (fixed #1308 / PR #1310): hillshade_rtx (gpu_rtx/hillshade.py:184) and viewshed_gpu (gpu_rtx/viewshed.py:269) allocated cupy device buffers sized by raster shape with no memory check. create_triangulation (mesh_utils.py:23-24) adds verts (12 B/px) + triangles (24 B/px) = 36 B/px; hillshade_rtx adds d_rays(32) + d_hits(16) + d_aux(12) + d_output(4) = 64 B/px (100 B/px total); viewshed_gpu adds d_rays(32) + d_hits(16) + d_visgrid(4) + d_vsrays(32) = 84 B/px (120 B/px total). A 30000x30000 raster asked for 90-108 GB of VRAM before cupy surfaced an opaque allocator error. Fixed by adding gpu_rtx/_memory.py with _available_gpu_memory_bytes() and _check_gpu_memory(func_name, h, w) helpers (cost_distance #1262 / sky_view_factor #1299 pattern, 120 B/px budget covers worst case, raises MemoryError when required > 50% of free VRAM, skips silently when memGetInfo() unavailable). Wired into both entry points after the cupy.ndarray type check and before create_triangulation. 9 new tests in test_gpu_rtx_memory.py (5 helper-unit + 4 end-to-end gated on has_rtx). All 81 existing hillshade/viewshed tests still pass. Cat 4 clean: all CUDA kernels (hillshade.py:25/62/106, viewshed.py:32/74/116, mesh_utils.py:50) have bounds guards; no shared memory, no syncthreads needed. MEDIUM not fixed (Cat 6): hillshade_rtx and viewshed_gpu do not call _validate_raster directly but parent hillshade() (hillshade.py:252) and viewshed() (viewshed.py:1707) already validate, so input validation runs before the gpu_rtx entry point - defense-in-depth, not exploitable. MEDIUM not fixed (Cat 2): mesh_utils.py:64-68 cast mesh_map_index to int32 in the triangle index buffer; overflows at H*W > 2.1B vertices (~46341x46341+) but the new memory guard rejects rasters that large first - documentation/clarity item rather than exploitable. MEDIUM not fixed (Cat 3): mesh_utils.py:19 scale = maxDim / maxH divides by zero on an all-zero raster, propagating inf/NaN into mesh vertex z-coords; separate follow-up. LOW not fixed (Cat 5): mesh_utils.write() opens user-supplied path without canonicalization but its only call site (mesh_utils.py:38-39) sits behind if False: in create_triangulation, not reachable in production."
2424
hillshade,2026-04-27,,,,,"Clean. Cat 1: only allocation is the output np.empty(data.shape) at line 32 (cupy at line 165) and a _pad_array with hardcoded depth=1 (line 62) -- bounded by caller, no user-controlled amplifier. Azimuth/altitude are scalars and don't drive size. Cat 2: numba kernel uses range(1, rows-1) with simple (y, x) indexing; numba range loops promote to int64. Cat 3: math.sqrt(1.0 + xx_plus_yy) is always >= 1.0 (no neg sqrt, no div-by-zero); NaN elevation propagates correctly through dz_dx/dz_dy -> shaded -> output (the shaded < 0.0 / shaded > 1.0 clamps don't fire on NaN). Azimuth validated to [0, 360], altitude to [0, 90]. Cat 4: _gpu_calc_numba (line 107) guards both grid bounds and 3x3 stencil reads via i > 0 and i < shape[0]-1 and j > 0 and j < shape[1]-1; no shared memory. Cat 5: no file I/O. Cat 6: hillshade() calls _validate_raster (line 252) and _validate_scalar for both azimuth (253) and angle_altitude (254); all four backend paths cast to float32; tests parametrize int32/int64/float32/float64."

xrspatial/geotiff/_vrt.py

Lines changed: 47 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -7,11 +7,34 @@
77

88
import os
99
from dataclasses import dataclass, field
10+
from xml.sax.saxutils import escape as _xml_escape, quoteattr as _xml_quoteattr
1011

1112
import numpy as np
1213

1314
from ._safe_xml import safe_fromstring
1415

16+
17+
def _xml_text(value) -> str:
18+
"""Escape *value* for safe inclusion as XML element text.
19+
20+
Handles the five XML predefined entities (``& < > " '``). Returns the
21+
empty string when ``value`` is ``None``.
22+
"""
23+
if value is None:
24+
return ""
25+
return _xml_escape(str(value), {'"': "&quot;", "'": "&apos;"})
26+
27+
28+
def _xml_attr(value) -> str:
29+
"""Quote *value* for use as an XML attribute value.
30+
31+
Wraps in matching quotes and escapes the predefined entities. Returns
32+
``'""'`` when ``value`` is ``None``.
33+
"""
34+
if value is None:
35+
return '""'
36+
return _xml_quoteattr(str(value))
37+
1538
# Lazy imports to avoid circular dependency
1639
_DTYPE_MAP = {
1740
'Byte': np.uint8,
@@ -491,17 +514,25 @@ def write_vrt(vrt_path: str, source_files: list[str], *,
491514
vrt_dir = os.path.dirname(os.path.abspath(vrt_path))
492515
n_bands = first['bands']
493516

494-
# Build XML
495-
lines = [f'<VRTDataset rasterXSize="{total_w}" rasterYSize="{total_h}">']
517+
# Build XML. Every interpolated text value is run through _xml_text
518+
# (or _xml_attr for attribute slots) before concatenation so that a
519+
# caller-supplied CRS WKT or a source filename containing XML
520+
# special characters (``< > & " '``) cannot break the document or
521+
# inject extra elements. Numeric fields (offsets, sizes, pixel
522+
# scales) are emitted from int / float literals and need no
523+
# escaping. See issue #1607.
524+
lines = [f'<VRTDataset rasterXSize="{int(total_w)}" rasterYSize="{int(total_h)}">']
496525
if srs:
497-
lines.append(f' <SRS>{srs}</SRS>')
526+
lines.append(f' <SRS>{_xml_text(srs)}</SRS>')
498527
lines.append(f' <GeoTransform>{mosaic_x0}, {res_x}, 0.0, '
499528
f'{mosaic_y_top}, 0.0, {res_y}</GeoTransform>')
500529

501530
for band_num in range(1, n_bands + 1):
502-
lines.append(f' <VRTRasterBand dataType="{vrt_dtype_name}" band="{band_num}">')
531+
lines.append(
532+
f' <VRTRasterBand dataType={_xml_attr(vrt_dtype_name)} '
533+
f'band="{int(band_num)}">')
503534
if nd is not None:
504-
lines.append(f' <NoDataValue>{nd}</NoDataValue>')
535+
lines.append(f' <NoDataValue>{_xml_text(nd)}</NoDataValue>')
505536

506537
for m in sources_meta:
507538
t = m['transform']
@@ -521,13 +552,17 @@ def write_vrt(vrt_path: str, source_files: list[str], *,
521552
pass # different drives on Windows
522553

523554
lines.append(' <SimpleSource>')
524-
lines.append(f' <SourceFilename relativeToVRT="{rel_attr}">'
525-
f'{fname}</SourceFilename>')
526-
lines.append(f' <SourceBand>{band_num}</SourceBand>')
527-
lines.append(f' <SrcRect xOff="0" yOff="0" '
528-
f'xSize="{m["width"]}" ySize="{m["height"]}"/>')
529-
lines.append(f' <DstRect xOff="{dst_x_off}" yOff="{dst_y_off}" '
530-
f'xSize="{m["width"]}" ySize="{m["height"]}"/>')
555+
lines.append(
556+
f' <SourceFilename relativeToVRT="{rel_attr}">'
557+
f'{_xml_text(fname)}</SourceFilename>')
558+
lines.append(f' <SourceBand>{int(band_num)}</SourceBand>')
559+
lines.append(
560+
f' <SrcRect xOff="0" yOff="0" '
561+
f'xSize="{int(m["width"])}" ySize="{int(m["height"])}"/>')
562+
lines.append(
563+
f' <DstRect xOff="{int(dst_x_off)}" '
564+
f'yOff="{int(dst_y_off)}" '
565+
f'xSize="{int(m["width"])}" ySize="{int(m["height"])}"/>')
531566
lines.append(' </SimpleSource>')
532567

533568
lines.append(' </VRTRasterBand>')
Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,110 @@
1+
"""Regression tests for issue #1607: write_vrt must XML-escape
2+
caller-supplied text (CRS WKT, source filenames) so a value carrying
3+
XML special characters cannot break the generated VRT or inject extra
4+
elements that change how the VRT parses when read back.
5+
"""
6+
from __future__ import annotations
7+
8+
import os
9+
import numpy as np
10+
import pytest
11+
import xarray as xr
12+
13+
from xrspatial.geotiff import to_geotiff
14+
from xrspatial.geotiff._vrt import write_vrt, parse_vrt
15+
16+
17+
@pytest.fixture
18+
def sample_tif(tmp_path):
19+
"""Write a tiny GeoTIFF the VRT writer can introspect for metadata."""
20+
arr = np.zeros((4, 4), dtype=np.float32)
21+
y = np.linspace(1.0, 0.0, 4)
22+
x = np.linspace(0.0, 1.0, 4)
23+
da = xr.DataArray(
24+
arr, dims=['y', 'x'],
25+
coords={'y': y, 'x': x},
26+
attrs={'nodata': -9999.0},
27+
)
28+
path = str(tmp_path / 'src.tif')
29+
to_geotiff(da, path)
30+
return path
31+
32+
33+
def test_crs_wkt_with_xml_special_chars_round_trips(sample_tif, tmp_path):
34+
"""A WKT containing ``& < > " '`` must round-trip through write_vrt /
35+
parse_vrt unchanged (the entities are escaped on the way out and
36+
decoded on the way in)."""
37+
nasty_wkt = 'GEOGCS["spec & <chars> with \"quotes\" and \'apostrophes\'"]'
38+
vrt_path = str(tmp_path / 'mosaic.vrt')
39+
write_vrt(vrt_path, [sample_tif], crs_wkt=nasty_wkt)
40+
41+
with open(vrt_path, 'r') as fh:
42+
text = fh.read()
43+
44+
parsed = parse_vrt(text, vrt_dir=str(tmp_path))
45+
assert parsed.crs_wkt == nasty_wkt
46+
47+
48+
def test_crs_wkt_injection_does_not_change_raster_type(sample_tif, tmp_path):
49+
"""The headline #1607 case: a crafted WKT trying to close ``<SRS>``
50+
and inject ``<Metadata><MDI key="AREA_OR_POINT">Point</MDI>...``
51+
must NOT change ``raster_type`` from its default 'area' value."""
52+
injection = (
53+
'</SRS><Metadata><MDI key="AREA_OR_POINT">Point</MDI>'
54+
'</Metadata><SRS>'
55+
)
56+
vrt_path = str(tmp_path / 'evil.vrt')
57+
write_vrt(vrt_path, [sample_tif], crs_wkt=injection)
58+
59+
with open(vrt_path, 'r') as fh:
60+
text = fh.read()
61+
62+
parsed = parse_vrt(text, vrt_dir=str(tmp_path))
63+
assert parsed.raster_type == 'area'
64+
# And the injection round-trips as literal text inside <SRS>.
65+
assert parsed.crs_wkt == injection
66+
67+
68+
def test_source_filename_with_ampersand_round_trips(tmp_path):
69+
"""A source filename containing ``&`` must produce a VRT whose
70+
``<SourceFilename>`` element decodes back to the original on-disk
71+
path (no double-escape, no corruption)."""
72+
# Build a TIFF on disk whose filename has an ampersand.
73+
arr = np.zeros((4, 4), dtype=np.float32)
74+
da = xr.DataArray(
75+
arr, dims=['y', 'x'],
76+
coords={'y': np.linspace(1, 0, 4), 'x': np.linspace(0, 1, 4)},
77+
attrs={'nodata': -9999.0},
78+
)
79+
src = str(tmp_path / 'a&b.tif')
80+
to_geotiff(da, src)
81+
82+
vrt_path = str(tmp_path / 'mosaic.vrt')
83+
write_vrt(vrt_path, [src])
84+
85+
with open(vrt_path, 'r') as fh:
86+
text = fh.read()
87+
# The on-disk text must carry the escaped form, not the raw '&'.
88+
assert '&amp;' in text
89+
assert '<a&b' not in text # no raw ampersand inside a tag context
90+
91+
parsed = parse_vrt(text, vrt_dir=str(tmp_path))
92+
# Exactly one source recorded, pointing at the original path
93+
# (after the parser's realpath canonicalisation).
94+
assert len(parsed.bands) == 1
95+
assert len(parsed.bands[0].sources) == 1
96+
assert os.path.basename(parsed.bands[0].sources[0].filename) == 'a&b.tif'
97+
98+
99+
def test_written_vrt_is_well_formed_xml(sample_tif, tmp_path):
100+
"""Sanity check: the bytes written by write_vrt always parse cleanly
101+
as XML, even when crs_wkt carries every XML predefined entity."""
102+
nasty = '< & > " \''
103+
vrt_path = str(tmp_path / 'wf.vrt')
104+
write_vrt(vrt_path, [sample_tif], crs_wkt=nasty)
105+
106+
# Use stdlib ElementTree (not parse_vrt) so we exercise the
107+
# well-formedness check directly without the DOCTYPE-rejection layer.
108+
import xml.etree.ElementTree as ET
109+
with open(vrt_path, 'r') as fh:
110+
ET.fromstring(fh.read()) # raises on malformed XML

0 commit comments

Comments
 (0)