scitbx: fix flex.*_from_byte_str truncation of non-ASCII Unicode input #1169

Draft
bkpoon wants to merge 1 commit into master from fix_byte_str

bkpoon commented Jun 6, 2026 (Member)

  • shared_from_byte_str (flex.int_from_byte_str, double_from_byte_str, size_t_from_byte_str, ...) re-encodes a unicode argument to UTF-8 bytes but took its length from len(byte_str), which is the code-point count. For non-ASCII input, that count is smaller than the size of the UTF-8 buffer, so the resulting array was silently truncated
  • take the length from the byte buffer actually used (PyBytes_GET_SIZE), keeping the pointer and size consistent
  • own the re-encoded UTF-8 bytes with handle<> so that they are released; the unicode branch previously leaked the PyUnicode_AsUTF8String reference
  • add exercise_from_byte_str_unicode regression test
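
The length mismatch described in the first bullet can be reproduced in plain Python. This is only an illustrative sketch, not the scitbx C++ code; the helper name `lengths_seen` is hypothetical. It compares the code-point count that `len()` reports for a `str` with the size of its UTF-8 encoding.

```python
def lengths_seen(text):
    """Return (code_point_count, utf8_byte_count) for a str.

    The first value is what len() reports for the str; the second is the
    size of the re-encoded UTF-8 buffer that would actually be read.
    """
    encoded = text.encode("utf-8")
    return len(text), len(encoded)


ascii_text = "abcd"
# Four code points (U+00E9), each of which takes two bytes in UTF-8.
non_ascii_text = "\u00e9" * 4

print(lengths_seen(ascii_text))      # (4, 4): the two lengths agree
print(lengths_seen(non_ascii_text))  # (4, 8): len() undercounts the buffer
```

For pure-ASCII input the two values agree, which would explain why the truncation only shows up with non-ASCII input.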

bkpoon requested a review from phyy-nx June 6, 2026 22:20

bkpoon commented Jun 6, 2026 (Member, Author)

Like 5c7d953 but for byte strings. Do detector data use byte strings?

phyy-nx left a comment (Contributor)

Ran the dials and dxtbx tests on this branch, no problems. Did a search in both for "from_byte_str", didn't find it.

Found two references to it in iotbx.detector, in a function called 'slice_callback_with_high_performance_http_data' which I believe was used by the original spotfinder server (dials has a new one). Both follow this kind of pattern:

    BYU = self.object.read()
    linearintdata = flex.int_from_byte_str(BYU)
    provisional_size = linearintdata.size()
    assert provisional_size == self.size1*self.size2

To me, this looks like a self-consistency check that would already have alerted the user to a problem here, so this change set seems reasonable.
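
As a rough sketch of why such a size assertion would catch a truncated array, the following plain-Python stand-in unpacks 4-byte integers with the standard `struct` module instead of calling `flex.int_from_byte_str`. The helper `ints_from_bytes`, the 4-byte element size, and the values of `size1` and `size2` are assumptions made for illustration only.

```python
import struct


def ints_from_bytes(buf, n_bytes):
    """Unpack 4-byte ints from the first n_bytes of buf (hypothetical helper)."""
    count = n_bytes // 4
    return list(struct.unpack("=%di" % count, buf[:count * 4]))


size1, size2 = 2, 3                       # assumed image dimensions
payload = struct.pack("=6i", *range(6))   # 24 bytes holding six ints

# Using the real buffer length yields every element.
full = ints_from_bytes(payload, len(payload))

# Simulate a length that is too short, as in the truncation case.
truncated = ints_from_bytes(payload, len(payload) - 8)

assert len(full) == size1 * size2
# The same comparison on the truncated array would make the assert fail.
size_check_fails = len(truncated) != size1 * size2
```

With a too-short length the array has fewer elements than `size1*size2`, so a check of this kind fails loudly rather than letting the truncation pass unnoticed.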
