Skip to content

add opt-in XML character sanitization for package save#2064

Open
SimonCropp wants to merge 2 commits intodotnet:mainfrom
SimonCropp:XmlTextSanitization
Open

add opt-in XML character sanitization for package save#2064
SimonCropp wants to merge 2 commits intodotnet:mainfrom
SimonCropp:XmlTextSanitization

Conversation

@SimonCropp
Copy link
Copy Markdown
Contributor

@SimonCropp SimonCropp commented Apr 15, 2026

Introduces IXmlWriterFactoryFeature and the UseXmlTextSanitization() extension on OpenXmlPackage. When enabled, invalid XML 1.0 characters (control chars, orphaned surrogates, non-characters) are silently removed from text content, attribute values, CDATA, comments, and PI bodies before the underlying XmlWriter sees them. Element/attribute names and WriteRaw payloads are untouched.

Previously, saving a document with text like "Hello\u0002World" — common when exporting data straight from legacy databases — threw deep inside System.Xml.XmlUtf8RawTextWriter. The feature is pay-for-play: packages that don't opt in pay nothing.

Plumbing centralizes XmlWriter creation behind the feature so that OpenXmlPartWriter, OpenXmlPartRootElement.Save, and every XmlDOMTextWriter call site (including InnerXml/OuterXml paths) route through it, inherited transparently from package → part → element.

Covers Word, Excel, and PowerPoint round-trip tests plus unit tests for the sanitizer, the decorator, and the feature registration.

fixes #1532

@SimonCropp SimonCropp changed the title add opt-in XML character sanitization for package save (#1532) add opt-in XML character sanitization for package save Apr 15, 2026
@SimonCropp
Copy link
Copy Markdown
Contributor Author

IMO this should be the default behaviour.

@github-actions
Copy link
Copy Markdown

Test Results

    66 files   -   3      66 suites   - 3   1h 39m 24s ⏱️ - 1m 39s
 2 105 tests + 37   2 102 ✅ + 37   3 💤 ±0  0 ❌ ±0 
38 540 runs  +467  38 498 ✅ +467  42 💤 ±0  0 ❌ ±0 

Results for commit 25a58a1. ± Comparison against base commit 0cb2f6b.

Copy link
Copy Markdown
Member

@twsouthwick twsouthwick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation!

}

public XmlWriter Create(Stream stream, XmlWriterSettings? settings)
=> new SanitizingXmlWriter(DefaultXmlWriterFactoryFeature.Instance.Create(stream, settings));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't just use the default - it should pass in the existing and wrap it


internal static class XmlCharacterSanitizer
{
public static bool IsValidXmlChar(char c)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to not use XmlConvert.IsXmlChar here?


if (package.Features.Get<IXmlWriterFactoryFeature>() is not SanitizingXmlWriterFactoryFeature)
{
package.Features.Set<IXmlWriterFactoryFeature>(SanitizingXmlWriterFactoryFeature.Instance);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap the existing one here

_xmlWriter = ResolveFactory(openXmlPart).Create(partStream, xmlWriterSettings);
}

private static IXmlWriterFactoryFeature ResolveFactory(OpenXmlPart openXmlPart)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to just use GetRequiredFeature - there's a default one already so we don't need it here as well

_writer = Create(stream);
}

public XmlDOMTextWriter(Stream stream, IFeatureCollection? features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass in a IXmlWriterFactoryFeature here instead of a full collection

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in feature to sanitize invalid XML 1.0 characters during package/part serialization by centralizing XmlWriter creation behind a new IXmlWriterFactoryFeature, with a built-in sanitizing implementation that decorates the writer.

Changes:

  • Introduces IXmlWriterFactoryFeature plus default and sanitizing factory implementations, and OpenXmlPackage.UseXmlTextSanitization() to enable it.
  • Routes key SDK serialization paths (OpenXmlPartWriter, OpenXmlPartRootElement.Save, XmlDOMTextWriter call sites) through the factory.
  • Adds unit and round-trip tests for sanitizer behavior across Word/Excel/PowerPoint, and updates public API + docs.
Show a summary per file
File Description
test/DocumentFormat.OpenXml.Framework.Tests/Features/XmlTextSanitizationTests.cs Adds unit tests for sanitizer + sanitizing writer behavior.
test/DocumentFormat.OpenXml.Framework.Tests/Features/XmlTextSanitizationTests.Word.cs Word round-trip tests validating opt-in sanitization behavior.
test/DocumentFormat.OpenXml.Framework.Tests/Features/XmlTextSanitizationTests.Excel.cs Excel round-trip tests validating opt-in sanitization behavior.
test/DocumentFormat.OpenXml.Framework.Tests/Features/XmlTextSanitizationTests.Powerpoint.cs PowerPoint round-trip tests validating opt-in sanitization behavior.
src/DocumentFormat.OpenXml.Linq/Feature/OpenXmlPartRootXElementFeature.cs Uses XmlDOMTextWriter overload that can consume part features (factory).
src/DocumentFormat.OpenXml.Framework/XmlDOMTextWriter.cs Adds feature-aware constructors and resolves writer factory from IFeatureCollection.
src/DocumentFormat.OpenXml.Framework/PublicAPI/PublicAPI.Unshipped.txt Declares new public surface area (IXmlWriterFactoryFeature, extension).
src/DocumentFormat.OpenXml.Framework/OpenXmlPartWriter.cs Switches part-based writer creation to use the resolved writer factory.
src/DocumentFormat.OpenXml.Framework/OpenXmlPartRootElement.cs Routes Save(Stream) writer creation through XmlDOMTextWriter + features.
src/DocumentFormat.OpenXml.Framework/OpenXmlLeafTextElement.cs Routes InnerXml writing through feature-aware XmlDOMTextWriter.
src/DocumentFormat.OpenXml.Framework/OpenXmlElement.cs Routes InnerXml/OuterXml writing through feature-aware XmlDOMTextWriter.
src/DocumentFormat.OpenXml.Framework/OpenXmlCompositeElement.cs Routes InnerXml wrapper writing through feature-aware XmlDOMTextWriter.
src/DocumentFormat.OpenXml.Framework/Features/XmlTextSanitizationExtensions.cs Adds UseXmlTextSanitization() opt-in extension for packages.
src/DocumentFormat.OpenXml.Framework/Features/XmlCharacterSanitizer.cs Implements core XML 1.0 character filtering logic.
src/DocumentFormat.OpenXml.Framework/Features/SanitizingXmlWriterFactoryFeature.cs Factory feature that wraps writers with SanitizingXmlWriter.
src/DocumentFormat.OpenXml.Framework/Features/SanitizingXmlWriter.cs XmlWriter decorator that sanitizes text-ish writes while leaving WriteRaw untouched.
src/DocumentFormat.OpenXml.Framework/Features/IXmlWriterFactoryFeature.cs New public feature interface for centralized writer creation.
src/DocumentFormat.OpenXml.Framework/Features/DefaultXmlWriterFactoryFeature.cs Default factory mapping to System.Xml.XmlWriter.Create.
src/DocumentFormat.OpenXml.Framework/Features/DefaultFeatures.cs Registers IXmlWriterFactoryFeature in the default feature set.
docs/Features.md Documents the new feature and the opt-in sanitization behavior.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

src/DocumentFormat.OpenXml.Framework/Features/SanitizingXmlWriter.cs:204

  • WriteCharsAsync has the same parameter-validation gap as the sync path: it only checks buffer != null before iterating/sanitizing. Validate index/count (and index + count overflow) up front so invalid arguments produce consistent XmlWriter-style exceptions instead of IndexOutOfRangeException.
    public override Task WriteCharsAsync(char[] buffer, int index, int count)
    {
        if (buffer is null)
        {
            throw new ArgumentNullException(nameof(buffer));
        }
  • Files reviewed: 20/20 changed files
  • Comments generated: 1

{
throw new ArgumentNullException(nameof(buffer));
}

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteChars validates only buffer != null. If index/count are negative or out of range, the sanitizer loop will throw IndexOutOfRangeException (or worse on overflow) rather than the ArgumentOutOfRangeException/ArgumentException behavior callers expect from XmlWriter. Add explicit bounds validation (including index + count overflow) before calling TrySanitizeSpan/iterating, and mirror the inner writer’s exception types.

This issue also appears on line 199 of the same file.

Suggested change
if (index < 0)
{
throw new ArgumentOutOfRangeException(nameof(index));
}
if (count < 0)
{
throw new ArgumentOutOfRangeException(nameof(count));
}
if (index > buffer.Length - count)
{
throw new ArgumentException("The sum of index and count is larger than the buffer length.");
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request for Open XML SDK - Exclude unsupported XML characters during file generation

3 participants