
Add RawValue to save original bencodes bytes without deserializing. #10

Open
canoriz wants to merge 1 commit into bluk:trunk from canoriz:raw



canoriz commented Apr 4, 2026

In the BitTorrent specification, the info_hash is the SHA-1 of the bencoded Info struct. The specification explicitly says that the info_hash should be calculated directly from the original bencoded bytes, not from bencode that was decoded and then encoded again.

For a long time, I calculated the info_hash from the decoded and then re-encoded bencode. But recently I encountered a torrent file whose info field has a private: 0 field, which I don't support yet. Because my Info struct dropped that field, the re-encoded bytes differed from the original and this resulted in a wrong info_hash.
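To make the failure concrete, here is a minimal sketch using only the standard library. The `Info` struct and its hand-written `to_bencode` method are hypothetical stand-ins for a real decoder and encoder; they only illustrate that dropping an unknown field changes the bytes that would be hashed.

```rust
/// Hypothetical, minimal Info struct that only knows two fields.
struct Info {
    length: u64,
    name: String,
}

impl Info {
    /// Hand-written bencode encoder for this toy struct (keys in sorted order).
    fn to_bencode(&self) -> Vec<u8> {
        format!(
            "d6:lengthi{}e4:name{}:{}e",
            self.length,
            self.name.len(),
            self.name
        )
        .into_bytes()
    }
}

fn main() {
    // Original info bytes as they might appear in a torrent file, including `private`.
    let original: &[u8] = b"d6:lengthi10e4:name1:a7:privatei0ee";

    // What a decoder that ignores unknown fields would hand back.
    let decoded = Info { length: 10, name: "a".to_string() };
    let reencoded = decoded.to_bencode();

    // The `private` entry is gone, so a hash over `reencoded` cannot match
    // a hash over `original`.
    assert_ne!(reencoded.as_slice(), original);
}
```

Any hash computed over `reencoded` would therefore differ from the one other peers compute over the original bytes.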

I understand I can add a private: Option<u8> field to my Info struct and set a serde attribute such as skip_serializing_if = "Option::is_none", and this will work perfectly. But I would like to add a RawValue type, which has some advantages.

Advantages

  1. Can use RawValue to calculate the correct info_hash, no matter what unsupported fields a torrent file has.
  2. Can handle malformed bencode (map not in alphabetical order) in torrent file, though I've never seen one before.
  3. In the metadata extension, we need to send info to other peers. We can now send the RawValue out directly, with no need to encode the Info struct every time. This is possible without RawValue too: we can always keep the Info and the bencoded Info at the same time.

Implementation details

I mostly followed the approach of RawValue in serde_json, with one small difference. serde_json has two versions of RawValue, borrowed and owned, represented by &'a RawValue and Box<RawValue>. I only implemented the owned version, as a simple RawValue, because I don't see many use cases for a borrowed RawValue in BitTorrent.

However, we might (I'm not sure yet) need a breaking change if we ship only the owned version now and want to add a borrowed version of RawValue in the future.

Owner

bluk commented Apr 7, 2026

> Can handle malformed bencode (map not in alphabetical order) in torrent file, though I've never seen one before.

This does occur in a random sampling of "real world" torrent files.

Overall, I like having a dedicated RawValue type, but the amount and complexity of the code is relatively substantial for essentially one use case. It would probably not be behind a feature cfg flag because most users of the library would require it.

Did you try using ByteString as the info's type like in https://github.com/bluk/bt_bencode/blob/trunk/tests/de.rs#L50 ? This fulfills the functionality that you are asking for in the first and second "advantage". The ByteString will contain the value as a non-decoded byte array. You can also just use a borrowed &[u8]. The intent is that you have a dedicated Info structure to decode the data your application needs, and that you then decode a second time using a ByteString type to get the info hash, in a two-pass decoding flow.

On the third "advantage", if the info data does not change, I would just encode the data once into a regular byte buffer and re-use that byte buffer. I do not believe RawValue provides any functionality here.
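As a rough illustration of the "encode once, re-use the buffer" idea, the sketch below keeps the bencoded info bytes in a shared buffer and hands out slices of it. The `InfoBytes` type and its methods are hypothetical names, not part of bt_bencode; the 16 KiB block size is the one used by the metadata extension (BEP 9).

```rust
use std::sync::Arc;

/// Block size used by the metadata extension (BEP 9): 16 KiB.
const METADATA_PIECE_LEN: usize = 16 * 1024;

/// Hypothetical holder for the bencoded info dictionary, captured or encoded once.
/// Cloning is cheap because the bytes are shared behind an `Arc`.
#[derive(Clone)]
struct InfoBytes(Arc<[u8]>);

impl InfoBytes {
    fn new(raw: Vec<u8>) -> Self {
        InfoBytes(Arc::from(raw))
    }

    /// Number of metadata pieces needed to cover the buffer.
    fn piece_count(&self) -> usize {
        (self.0.len() + METADATA_PIECE_LEN - 1) / METADATA_PIECE_LEN
    }

    /// Borrowed view of one metadata piece, or `None` if `index` is out of range.
    fn piece(&self, index: usize) -> Option<&[u8]> {
        let start = index.checked_mul(METADATA_PIECE_LEN)?;
        if start >= self.0.len() {
            return None;
        }
        let end = (start + METADATA_PIECE_LEN).min(self.0.len());
        Some(&self.0[start..end])
    }
}
```

Serving a peer's request then only slices the stored buffer, so nothing is re-encoded per request.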

If I were to re-implement the raw value functionality (instead of the current ByteString workaround), I would write a non-Serde-based implementation. Since the data would have to at least be decoded twice if you used a Serde based decoding library, I would have preferred a cleaner and more straightforward API to read the data.
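One possible shape for such a non-Serde reader is sketched below, using only the standard library. It is not the library's API: the function names are hypothetical, the scanner does no strict validation (for example, integer digits are not checked), and it recurses on nested values, so it is only meant to show how the raw bytes of the `info` value could be located without building any decoded structure.

```rust
/// Parses a "<len>:" string header at `pos`; returns (payload start, payload length).
fn parse_str_header(buf: &[u8], pos: usize) -> Option<(usize, usize)> {
    let colon = pos + buf.get(pos..)?.iter().position(|&b| b == b':')?;
    let len = std::str::from_utf8(&buf[pos..colon]).ok()?.parse::<usize>().ok()?;
    Some((colon + 1, len))
}

/// Returns the index one past the end of the bencode value starting at `pos`.
fn skip_value(buf: &[u8], pos: usize) -> Option<usize> {
    match *buf.get(pos)? {
        // Integer: i<digits>e
        b'i' => {
            let end = buf[pos..].iter().position(|&b| b == b'e')?;
            Some(pos + end + 1)
        }
        // List or dictionary: skip children until the closing 'e'.
        b'l' | b'd' => {
            let mut p = pos + 1;
            while *buf.get(p)? != b'e' {
                p = skip_value(buf, p)?;
            }
            Some(p + 1)
        }
        // Byte string: <len>:<payload>
        b'0'..=b'9' => {
            let (start, len) = parse_str_header(buf, pos)?;
            let end = start.checked_add(len)?;
            if end <= buf.len() { Some(end) } else { None }
        }
        _ => None,
    }
}

/// Finds the raw, still-encoded bytes of the value stored under `key`
/// in the top-level dictionary of `buf`.
fn raw_value_for_key<'a>(buf: &'a [u8], key: &[u8]) -> Option<&'a [u8]> {
    if *buf.first()? != b'd' {
        return None;
    }
    let mut p = 1;
    while *buf.get(p)? != b'e' {
        let (key_start, key_len) = parse_str_header(buf, p)?;
        let key_end = key_start.checked_add(key_len)?;
        let found = buf.get(key_start..key_end)?;
        let value_end = skip_value(buf, key_end)?;
        if found == key {
            return Some(&buf[key_end..value_end]);
        }
        p = value_end;
    }
    None
}
```

Calling `raw_value_for_key(torrent_bytes, b"info")` would yield the slice to hash, regardless of unknown fields or key ordering inside the info dictionary.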

Author

canoriz commented Apr 7, 2026

> Did you try using ByteString as the info's type like in https://github.com/bluk/bt_bencode/blob/trunk/tests/de.rs#L50 ?

This works. I never thought of it before implementing the RawValue, probably because I checked serde_json and found they have a RawValue.

> It would probably not be behind a feature cfg flag because most users of the library would require it.

I put it behind a raw_value feature flag because serde_json does so, probably because the raw value feature causes a slight overhead even if you don't use it. I will change this if you don't want a feature flag.

> I would write a non-Serde-based implementation. Since the data would have to at least be decoded twice if you used a Serde based decoding library

I'm not sure what you mean by writing a non-Serde one. To me, serde_json is a widely used library and I think they handled RawValue well. As for the "decoded twice" point, if you mean decoding first to RawValue and then from RawValue to T, maybe we can add a WithRaw<T> which has the raw data and T as fields at the same time.
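A rough sketch of what the WithRaw<T> idea could look like, using only the standard library. Everything here is hypothetical: `decode_with` takes the decoder as a closure, and `decode_int` is a toy stand-in for a real bencode decoder. It also only covers a top-level value; capturing the raw bytes of a nested field such as info is the harder part and is not addressed here.

```rust
/// Hypothetical wrapper pairing a decoded value with the bytes it came from.
#[derive(Debug, PartialEq)]
struct WithRaw<T> {
    raw: Vec<u8>,
    value: T,
}

impl<T> WithRaw<T> {
    /// Decodes `bytes` with `decode` and keeps an owned copy of the input.
    fn decode_with<E>(
        bytes: &[u8],
        decode: impl FnOnce(&[u8]) -> Result<T, E>,
    ) -> Result<Self, E> {
        let value = decode(bytes)?;
        Ok(WithRaw { raw: bytes.to_vec(), value })
    }
}

/// Toy decoder standing in for a real bencode decoder: parses `i<digits>e`.
fn decode_int(bytes: &[u8]) -> Result<i64, String> {
    let s = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let digits = s
        .strip_prefix('i')
        .and_then(|rest| rest.strip_suffix('e'))
        .ok_or("not an integer")?;
    digits.parse::<i64>().map_err(|e| e.to_string())
}
```

With this shape, `WithRaw::decode_with(b"i42e", decode_int)` returns both the decoded value and the exact input bytes in one call.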

To me, it seems RawValue does not have much use now, since ByteString is exactly what I want. The one real thing ByteString can't do is serializing, but I don't see any use for serializing a RawValue right now.
