Decoding Invalid Code Sequence Consistency #41

@informatimago

We want to know when a buffer has accumulated enough bytes to decode a character in the current encoding.
babel doesn't provide a convenient (efficient) API to test that, so I hoped to use OCTETS-TO-STRING for it.
Unfortunately, the different encodings do not handle incomplete code sequences consistently:

cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-8)
"¶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-8)
"�"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-16)
"슶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-16)
> Debug: Failed assertion: (= babel-encodings::i babel-encodings::end)
> While executing: (:internal swank::invoke-default-debugger), in process new-repl-thread(1481).
> Type cmd-/ to continue, cmd-. to abort, cmd-\ for a list of available restarts.
> If continued: test the assertion again.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<simple-error #x302006CBABDD>.
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 2 :encoding :eucjp)
"こ"
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 1 :encoding :eucjp)
> Debug: Illegal :eucjp character starting at position 0.
> While executing: (:internal swank::invoke-default-debugger), in process repl-thread(3921).
> Type cmd-. to abort, cmd-\ for a list of available restarts.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<babel-encodings:end-of-input-in-character #x302006CA4EAD>.
cl-user>

I suggest adding a keyword parameter to specify what to do in such a case:

| :on-invalid-code substitution-character | would insert the given substitution-character in place of the code. |
| :on-invalid-code :ignore                | would ignore the code and go on.                                    |
| :on-invalid-code :error                 | would signal a babel-encodings:character-decoding-error condition.  |
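
The three policies in the table can be illustrated with a minimal, self-contained sketch. It does not use babel: DECODE-ASCII and INVALID-CODE-ERROR are hypothetical names invented for the illustration, and the decoder only handles 7-bit ASCII, treating any octet >= 128 as an invalid code. A real implementation would signal babel-encodings:character-decoding-error instead of the local condition.

```lisp
;; Hypothetical sketch of the proposed :on-invalid-code policies.
;; Nothing here is part of babel; the condition and function are local stand-ins.

(define-condition invalid-code-error (error)
  ((index :initarg :index :reader invalid-code-index))
  (:report (lambda (condition stream)
             (format stream "Invalid code at position ~D."
                     (invalid-code-index condition)))))

(defun decode-ascii (octets &key (on-invalid-code :error))
  "Decode OCTETS as 7-bit ASCII. Octets >= 128 are invalid codes and are
handled according to ON-INVALID-CODE: a character (substituted), :IGNORE
(skipped), or :ERROR (an INVALID-CODE-ERROR is signaled)."
  (with-output-to-string (out)
    (loop for octet across octets
          for index from 0
          do (cond ((< octet 128)
                    (write-char (code-char octet) out))
                   ;; Substitution character: insert it in place of the code.
                   ((characterp on-invalid-code)
                    (write-char on-invalid-code out))
                   ;; :ignore: drop the code and go on.
                   ((eq on-invalid-code :ignore))
                   ;; :error: signal a condition carrying the position.
                   ((eq on-invalid-code :error)
                    (error 'invalid-code-error :index index))
                   (t
                    (error "Unknown :on-invalid-code policy: ~S"
                           on-invalid-code))))))
```

With the octets #(72 200 105), passing #\? as the policy yields "H?i", :ignore yields "Hi", and :error signals the condition at position 1.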

I also propose providing an efficient function to query the length of the code sequence for the next character:

(babel:decode-character bytes &key start end encoding)
--> character ;
    sequence-valid-p ;
    length
  • character: the decoded character if one can be decoded, otherwise NIL.

  • sequence-valid-p: NIL if the code sequence is definitely invalid, otherwise T. Notably, T should be returned when the sequence is merely too short but could become a valid code sequence once completed.

  • length: if a character is decoded and returned, the length of the decoded code sequence; otherwise, if sequence-valid-p is true, the minimal length of a valid code sequence starting with the given prefix; otherwise, the minimal length of a valid code sequence.

| character | sequence-valid-p | length                                                         |
|-----------+------------------+----------------------------------------------------------------|
| ch        | T                | length of the decoded sequence                                 |
| ch        | NIL              | --impossible--                                                 |
| NIL       | T                | minimal length of a valid code sequence with the given prefix. |
| NIL       | NIL              | minimal length of a valid code sequence.                       |

For example, in the NIL T len case, if len <= (- end start), the given code sequence is complete and well-formed, but the decoded code is not the code of a character. E.g. #(#xED #xA0 #x80) is the UTF-8 byte pattern for 55296 (U+D800, a surrogate), but (code-char 55296) --> NIL.

(babel:decode-character (coerce #(65 32 66) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> #\A
    T
    1

(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> #\À
    T
    2

(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 1 :encoding :utf-8)
--> NIL
    T
    2

(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 1 :encoding :utf-8)
--> NIL
    T
    2

(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
                        :start 0 :end 2 :encoding :utf-8)
--> NIL
    NIL
    1

(babel:decode-character (coerce #(#xED #xA0 #x80) '(vector (unsigned-byte 8)))
                        :start 0 :end 3 :encoding :utf-8)
--> NIL
    T
    3
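
To make the proposed contract concrete, here is a minimal sketch for UTF-8 only, written without babel. DECODE-UTF8-CHARACTER and UTF8-SEQUENCE-LENGTH are hypothetical names, not existing babel functions, and the handling of overlong forms and out-of-range codes (reported as definitely invalid) is an assumption the proposal does not spell out. Surrogates are reported as NIL T len, as in the last example above.

```lisp
;; Hypothetical sketch of the proposed three-value contract, UTF-8 only.
;; Not babel's API: both function names are invented for illustration.

(defun utf8-sequence-length (lead)
  "Total sequence length announced by lead byte LEAD, or NIL if LEAD
cannot start a UTF-8 sequence."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun decode-utf8-character (bytes &key (start 0) (end (length bytes)))
  "Return (values character sequence-valid-p length) for the code
sequence beginning at START in BYTES."
  (when (>= start end)
    ;; Empty input: nothing is wrong yet, at least one byte is needed.
    (return-from decode-utf8-character (values nil t 1)))
  (let* ((lead (aref bytes start))
         (len (utf8-sequence-length lead)))
    (unless len
      (return-from decode-utf8-character (values nil nil 1)))
    ;; Every continuation byte that is already available must be 10xxxxxx.
    (loop for i from (1+ start) below (min end (+ start len))
          unless (= (logand (aref bytes i) #xC0) #x80)
            do (return-from decode-utf8-character (values nil nil 1)))
    ;; Valid prefix, but too short: report how many bytes are needed.
    (when (< (- end start) len)
      (return-from decode-utf8-character (values nil t len)))
    (let ((code (if (= len 1)
                    lead
                    (loop with acc = (logand lead (ash #xFF (- (1+ len))))
                          for i from (1+ start) below (+ start len)
                          do (setf acc (logior (ash acc 6)
                                               (logand (aref bytes i) #x3F)))
                          finally (return acc)))))
      (cond
        ;; Assumption: overlong forms and codes above #x10FFFF are invalid.
        ((or (and (= len 3) (< code #x800))
             (and (= len 4) (or (< code #x10000) (> code #x10FFFF))))
         (values nil nil 1))
        ;; Surrogates: complete sequence, but no character for the code.
        ((<= #xD800 code #xDFFF)
         (values nil t len))
        (t
         (values (and (< code char-code-limit) (code-char code)) t len))))))
```

A caller filling a buffer can keep reading while the result is NIL T len with len greater than (- end start), and can consume len bytes as soon as a character is returned.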
