
bodyfile: extend character escaping for special Unicode and non-Unicode characters #77

@joachimmetz


Certain file systems allow file names to contain characters that have a special meaning in Unicode, such as U+d800 (a surrogate code point), and/or non-Unicode characters.

The extended bodyfile 3 format currently does not specify how to handle these characters. The proposal is to escape such characters as "\u####" or "\U########", preferring the short form over the long form where possible.
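A minimal sketch of the proposed escaping, in Python. The function names and the policy for which values get escaped (surrogates, values above U+10ffff, and the backslash itself so that the output stays unambiguous) are assumptions for illustration and are not part of the bodyfile 3 specification. Input is a sequence of non-negative integer code points rather than a `str`, since a Python `str` cannot hold values above U+10ffff.

```python
from typing import Iterable


def needs_escaping(code_point: int) -> bool:
    """Assumed policy: escape surrogates and values outside the Unicode range."""
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    is_out_of_range = code_point > 0x10FFFF
    return is_surrogate or is_out_of_range


def escape_code_points(code_points: Iterable[int]) -> str:
    """Builds an escaped string from non-negative integer code points."""
    parts = []
    for code_point in code_points:
        if code_point == 0x5C:
            # Assumption: a literal backslash is doubled so that a file name
            # containing the text "\ud800" differs from an escaped U+d800.
            parts.append("\\\\")
        elif not needs_escaping(code_point):
            parts.append(chr(code_point))
        elif code_point <= 0xFFFF:
            # Short form "\u####" is preferred where the value fits.
            parts.append(f"\\u{code_point:04x}")
        else:
            # Long form "\U########" for everything else.
            parts.append(f"\\U{code_point:08x}")
    return "".join(parts)


print(escape_code_points([0x61, 0xD800, 0x62]))  # a\ud800b
print(escape_code_points([0x110000]))            # \U00110000
```

Ordinary supplementary characters such as U+1f600 pass through unescaped under this policy; only values that cannot be represented as valid Unicode text are rewritten.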

Open questions

  • What about "Unicode compatibility characters"?
  • What about values in the range U+110000-U+ffffffff, which lie outside the Unicode code space?
  • What if the original path uses a specific codepage (encoding) and is converted to Unicode, but the Unicode string can be encoded back into multiple byte sequences in the original encoding, e.g. encoding U+2252 to cp932? What if 2 paths decode to the same string? How can the original path best be preserved?
  • What if a file name contains a path segment separator (e.g. \ or /)? If not escaped, this leads to ambiguity, e.g. if / is the path segment separator, is 'test/1234' a single file name or a path?
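The path segment separator ambiguity can be illustrated with a short Python sketch. The `join_segments` helper and the choice of a backslash as the escape character are hypothetical and not taken from the bodyfile format; the point is only that escaping the separator inside a segment keeps the two cases distinguishable.

```python
from typing import Sequence


def join_segments(segments: Sequence[str], separator: str = "/") -> str:
    """Joins path segments, escaping separators that occur inside a segment.

    Assumption: a backslash is the escape character, so literal backslashes
    are doubled first to keep the result unambiguous.
    """
    escaped = []
    for segment in segments:
        segment = segment.replace("\\", "\\\\")
        segment = segment.replace(separator, "\\" + separator)
        escaped.append(segment)
    return separator + separator.join(escaped)


# A single file named 'test/1234' versus a file '1234' in directory 'test'.
single_file = join_segments(["test/1234"])
nested_file = join_segments(["test", "1234"])

print(single_file)  # /test\/1234
print(nested_file)  # /test/1234
```

Without the escaping step both inputs would produce `/test/1234`, and a reader of the bodyfile could not tell them apart.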

Related discussion: dfxml-working-group/dfxml_schema#34

Also consider whether the format should be extended with a header that specifies its encoding.

Labels: enhancement (New feature or request)