fix: single-quote YAML strings with backslashes for safe round-tripping #11160

Open
NIK-TIGER-BILL wants to merge 2 commits into deepset-ai:main from NIK-TIGER-BILL:fix/yaml-regex-escape

@NIK-TIGER-BILL

Related Issues

Proposed Changes:

When serializing pipelines containing regex patterns (e.g. `\b\w+\b`) or file paths with backslashes, PyYAML emits plain scalars that can be misinterpreted as YAML escape sequences on load. On Python 3.13+ this produces `SyntaxWarning`, and on some configurations it causes `ReaderError: unacceptable character #x0008`.

Fix: Override `YamlDumper.represent_str` to emit single-quoted YAML scalars for any string containing a backslash. In single-quoted YAML scalars, no escape sequences are interpreted, so the round-trip is always safe.

Before:

After:

Strings without backslashes continue to be emitted as plain scalars (no change).
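
The approach described above can be sketched roughly as follows. This is a minimal illustration rather than the actual diff: it assumes the dumper derives from PyYAML's `yaml.SafeDumper`, and the names `BackslashQuotingDumper` and `represent_str_quoting_backslashes` are hypothetical.

```python
import yaml


class BackslashQuotingDumper(yaml.SafeDumper):
    """Hypothetical dumper that single-quotes strings containing a backslash."""


def represent_str_quoting_backslashes(dumper, data):
    # Request single-quoted style only when a backslash is present;
    # style=None lets PyYAML pick its usual style for all other strings.
    style = "'" if "\\" in data else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style=style)


# Register the representer on the subclass only, leaving SafeDumper untouched.
BackslashQuotingDumper.add_representer(str, represent_str_quoting_backslashes)

pattern = r"\b\w+\b"
text = yaml.dump({"pattern": pattern, "name": "cleaner"}, Dumper=BackslashQuotingDumper)
loaded = yaml.safe_load(text)

print(text)
assert loaded["pattern"] == pattern
```

Registering the representer on a subclass keeps the change local to pipeline serialization, so other users of `yaml.SafeDumper` in the same process are unaffected.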

How did you test it?

  • Added 6 new unit tests in `test/marshal/test_yaml.py`:
    • Single backslash sequence round-trip
    • Complex regex round-trip
    • Windows path round-trip
    • No-backslash strings unchanged
    • Single-quote style verification
    • Full Pipeline round-trip with DocumentCleaner regex parameters
  • All 9 tests pass (3 existing + 6 new)

Notes for the reviewer

This is a minimal, targeted fix. An alternative approach would be to always quote all strings, but that would make the YAML output significantly more verbose for no additional safety benefit.
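
For comparison, the always-quote alternative could be approximated with PyYAML's `default_style` argument. The sketch below uses made-up sample data and only illustrates the trade-off; it is not code from this PR.

```python
import yaml

# Hypothetical sample data resembling serialized component parameters.
data = {"name": "cleaner", "remove_regex": r"\b\w+\b", "top_k": 3}

# PyYAML's default style selection versus forcing single quotes everywhere.
default_text = yaml.safe_dump(data)
all_quoted_text = yaml.safe_dump(data, default_style="'")

print(default_text)
print(all_quoted_text)

# Quoting everything still round-trips, at the cost of longer output.
assert yaml.safe_load(all_quoted_text) == data
assert len(all_quoted_text) > len(default_text)
```

With `default_style` set, mapping keys are quoted as well, and non-string scalars typically gain explicit tags, which is where most of the extra verbosity comes from.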

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I have used conventional commit type `fix:` for my PR title.
  • I have documented my code.
  • I have added a release note file.
  • I have run pre-commit hooks and fixed any issue.

NIK-TIGER-BILL added 2 commits April 20, 2026 23:37
When serializing pipelines containing regex patterns (e.g. \b, \w) or
file paths with backslashes, PyYAML may emit plain scalars that are
misinterpreted on load as YAML escape sequences, causing ReaderError
(#x0008) or SyntaxWarning on Python 3.13+.

Fix: override YamlDumper.represent_str to emit single-quoted scalars
for any string containing a backslash. In single-quoted YAML scalars,
no escape sequences are interpreted, so the round-trip is always safe.

Closes deepset-ai#11093

Signed-off-by: NIK-TIGER-BILL <nik.tiger.bill@github.com>
Signed-off-by: NIK-TIGER-BILL <nik.tiger.bill@github.com>
@NIK-TIGER-BILL NIK-TIGER-BILL requested a review from a team as a code owner April 20, 2026 23:38
@NIK-TIGER-BILL NIK-TIGER-BILL requested review from bogdankostic and removed request for a team April 20, 2026 23:38
@vercel

vercel Bot commented Apr 20, 2026

Someone is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


NIK-TIGER-BILL seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions Bot added the topic:tests and type:documentation labels Apr 20, 2026
@sjrl sjrl requested review from anakin87 and removed request for bogdankostic April 21, 2026 06:04
@sjrl
Contributor

sjrl commented Apr 21, 2026

@anakin87 reassigning to you for review since you're assigned the issue

Member

@anakin87 anakin87 left a comment

@NIK-TIGER-BILL please sign the CLA, then I'll review

Please also take a look at failing workflows, related to format

Labels

topic:tests, type:documentation

Development

Successfully merging this pull request may close these issues.

Invalid escape sequences in regex in the pipeline YAML

4 participants