Replaced BZip2 with Multi Threaded BZip2 implementation by drone1400 · Pull Request #1062 · adamhathcock/sharpcompress

drone1400 · 2025-12-06T15:15:15Z

Changes Summary

Replaces BZip2 implementation with a more performant multi-threaded one

Issues

- This is not a COMPLETE replacement: it does not replace the BZip2 implementation used in the LZMA's Registry.cs, because the current multi-threaded implementation has no built-in concatenated stream support (this should be easy enough to resolve if necessary).
- Due to some internal logic, it causes the BZip2_Reader_Factory test to fail. This test expects an InvalidOperationException to be thrown when opening an invalid BZ2 stream with a ReaderFactory. The problem is as follows:
  - The original BZ2 implementation checks the BZ2 Header, the BZ2 Block Header magic byte sequence, AND the BZ2 Block Header CRC in the constructor. This causes an exception to be thrown when the ReaderFactory is evaluating whether the given stream is a valid Tar.Bz2 archive.
  - With the multi-threaded implementation, the ReaderFactory evaluates whether the stream is a valid Tar.Bz2 archive, determines that it is not, and moves on without throwing an exception. A different exception is thrown later, when another archive type is evaluated.
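
For readers unfamiliar with the format, here is a rough sketch of the kind of eager check described above, written in Python rather than C# so it is easy to run. The function name `check_bzip2_header` is made up for illustration and is not part of SharpCompress. It only looks at the stream header and the first magic sequence, and it does not attempt the block CRC check. The two magic values come from the BZip2 format itself.

```python
import bz2

# Magic sequences defined by the BZip2 format (not taken from the PR).
BLOCK_MAGIC = bytes.fromhex("314159265359")  # start of a compressed block
END_MAGIC = bytes.fromhex("177245385090")    # end-of-stream marker (empty stream)


def check_bzip2_header(data: bytes) -> None:
    """Raise ValueError unless `data` begins like a BZip2 stream.

    Hypothetical helper: validates eagerly, the way a constructor-time
    check would, instead of waiting for the first read.
    """
    if len(data) < 10:
        raise ValueError("too short to be a BZip2 stream")
    # Stream header: 'B', 'Z', 'h', then a block-size digit '1'..'9'.
    if data[:3] != b"BZh" or not (ord("1") <= data[3] <= ord("9")):
        raise ValueError("bad BZip2 stream header")
    # The first block header follows immediately and is byte-aligned.
    if data[4:10] not in (BLOCK_MAGIC, END_MAGIC):
        raise ValueError("bad BZip2 block header magic")


check_bzip2_header(bz2.compress(b"some payload"))  # passes silently
```

A caller that validates this way fails fast on an invalid stream, which is the behaviour the BZip2_Reader_Factory test relies on.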

Context
A few years ago, I started using SharpCompress in a personal hobby project but was frustrated by how slow BZip2 Compression was.

I ended up implementing my own multi-threaded BZip2 compressor by modifying an existing single-threaded BZip2 implementation.
My implementation: https://github.com/drone1400/bzip2
The original single-threaded implementation it was based on: https://github.com/jaime-olivares/bzip2

Anyway, I used it until this year without issue, but as I was working on various updates on my personal hobby project, I ended up wondering why I never got around to implementing a multi-threaded DECOMPRESSOR too at the time...

So now I went and did just that! I finished implementing a multi threaded decompressor, cleaned up the code and made some improvements in the process, and now I figured I might as well submit a pull request too.

Benchmarks
Here is a comparison for compressing/decompressing a few MB tar archive with both my multi-threaded implementation and the existing SharpCompress BZip2 one.

CST = Single Thread Compression
CMT = Multi Thread Compression
DST = Single Thread Decompression
DMT = Multi Thread Decompression
CSC = Compression using SharpCompress BZip2 implementation
DSC = Decompression using SharpCompress BZip2 implementation

- currently causes BZip2_Reader_Factory test to fail due to some implementation details, to figure out later...
- have not replaced original BZip2 implementation in LZMA Registry.cs due to BZip2MT implementation not having built-in concatenated stream support

drone1400 · 2025-12-06T17:07:00Z

Oops! I was too hasty by half! While the multi-thread streams themselves work fine, I have actually run into an issue where calling ReaderFactory.Open on a very large tar.bz2 file takes forever. It seems this was not noticeable in the tests. I will look into it and report back on what's going on.

EDIT: Issue located and fixed already.


adamhathcock · 2025-12-08T11:09:10Z

Thank you for this!

I will have to think about it, but BZip2 can be multi-threaded without seeking on a stream? I was working on allowing multi-threading on files, as a lot of algorithms can be parallelized just by acting on a seekable stream.

I wasn't including BZip2 in this as it's a single stream of bytes (I think)

I think you need to run csharpier to format this.

drone1400 · 2025-12-08T15:45:07Z

> I will have to think about it, but BZip2 can be multi-threaded without seeking on a stream? I was working on allowing multi-threading on files, as a lot of algorithms can be parallelized just by acting on a seekable stream.
>
> I wasn't including BZip2 in this as it's a single stream of bytes (I think)

While the compressed data ends up being a single stream of bytes (or rather, a single stream of bits since it's not even byte aligned), the blocks themselves are independent, so as long as you can split the input stream into blocks, you can have multiple threads working on different blocks.

The compression part is easier, since you can just split the input stream into blocks of fixed size, process multiple blocks at the same time and then just put them back together in the same order that you read them in.
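
As a rough illustration of that split, compress, and reassemble idea, here is a small Python sketch (not the C# code from this PR). It makes one simplifying assumption that differs from the PR: each chunk is written as its own complete BZip2 stream, so the output is a series of concatenated streams rather than several blocks inside one stream. The names `compress_parallel`, `block_size` and `workers` are invented for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_parallel(data: bytes, block_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress fixed-size chunks on several threads, keeping input order.

    Simplification: every chunk becomes a separate BZip2 stream, and the
    streams are concatenated in the order the chunks were read.
    """
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not chunks:
        return bz2.compress(b"")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, regardless of which
        # worker happens to finish first.
        return b"".join(pool.map(bz2.compress, chunks))


payload = bytes(range(256)) * 2000
packed = compress_parallel(payload, block_size=100_000)
# Python's bz2.decompress accepts concatenated streams.
assert bz2.decompress(packed) == payload
```

The important part is the ordering guarantee: workers may finish in any order, but the pieces are written out in the order they were read.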

The decompression is a bit trickier since you have to find the 48-bit start sequence within a bitstream. So what I'm doing there is reading the input stream bit by bit, copying it into a temporary buffer at the same time. When I find a 48-bit start sequence, I send the temporary buffer to a worker thread and continue splitting off the next block while waiting for the worker threads to finish the block I'm waiting for.

Actually, now that I think about it, there is a potential problem there. If the 48-bit start sequence were to appear in the middle of a block, it would mess up the decompression, as the block would be cut short and could not be decompressed.

I'm not sure if this is actually possible though, and so far I have not encountered any such issue with the few hundred real files and thousands of randomly generated files I tested... But it's something that should be considered.
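
To make the bit-level scan concrete, here is a minimal Python sketch of the search step only (again, not the PR's C# code). It slides a 48-bit window over the data one bit at a time and records every bit offset where the block start sequence appears. The value 0x314159265359 is the block magic from the BZip2 format, and `find_block_starts` is a name made up for this example. Note that the scan has no way of telling a real block header from a chance match inside compressed data, which is exactly the concern raised above.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit block start sequence (BZip2 format)
MASK_48 = (1 << 48) - 1


def find_block_starts(data: bytes) -> list[int]:
    """Return the bit offsets at which the 48-bit block magic occurs."""
    offsets = []
    window = 0      # the most recent 48 bits seen
    bits_seen = 0
    for byte in data:
        # BZip2 packs bits most-significant-bit first.
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


starts = find_block_starts(bz2.compress(b"hello world" * 100))
# The first block follows the 4-byte stream header, i.e. at bit 32.
assert starts[0] == 32
```

In a splitter, each pair of consecutive offsets would delimit the bits handed to one worker. Later blocks generally do not start on a byte boundary, which is why the search has to work on bits rather than bytes.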

> I think you need to run csharpier to format this.

Oops, my bad!

adamhathcock · 2026-01-05T16:46:17Z

The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.

Also, how is parallelism controlled? I can't see much in a diff

drone1400 · 2026-01-13T10:02:27Z

Apologies for the late reply.

> The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.

As the code in this branch currently is, there is no way for the users to select which implementation to use.

> Also, how is parallelism controlled? I can't see much in a diff

What do you mean by this exactly?

adamhathcock · 2026-01-13T11:21:37Z

> Apologies for the late reply.
>
> > The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.
>
> As the code in this branch currently is, there is no way for the users to select which implementation to use.

Can that be added? I wouldn't want everyone to change to use this just yet.

> > Also, how is parallelism controlled? I can't see much in a diff
>
> What do you mean by this exactly?

Where are ThreadStart or a TaskScheduler used to fire Threads/Tasks? Or something similar?

drone1400 · 2026-01-15T04:16:23Z

> Can that be added? I wouldn't want everyone to change to use this just yet

Yes, that should be possible. I'll see about doing that in the next few days.

> Where are ThreadStart or a TaskScheduler used to fire Threads/Tasks? or something similar?

What I am currently doing is using ThreadPool.QueueUserWorkItem when calling ReadByte and Read for the BZip2ParallelInputStream class, and WriteByte and Write for the BZip2ParallelOutputStream class.

I do have a way to limit the maximum number of blocks that get queued up: in BZip2ParallelInputStream there's _mtMaxPendingBlocks, which is initialized to a value provided via the stream constructor or defaults to Environment.ProcessorCount.

Huh, come to think of it, I did forget to do the same thing with _mtMaxPendingBlocks for the BZip2ParallelOutputStream. I should probably fix that.
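
For anyone following along, the general pattern of capping the number of queued blocks while still returning results in order can be sketched as below. This is a Python illustration under my own assumptions, not the code in BZip2ParallelInputStream: `process_in_order` and `max_pending` are invented names, with `max_pending` standing in for the role of _mtMaxPendingBlocks and `os.cpu_count()` for Environment.ProcessorCount.

```python
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def process_in_order(blocks, work, max_pending=None):
    """Run `work` on each block in a thread pool, yielding results in
    input order while keeping at most `max_pending` blocks queued."""
    limit = max_pending or os.cpu_count() or 1
    pending = deque()  # futures, oldest first
    with ThreadPoolExecutor(max_workers=limit) as pool:
        for block in blocks:
            if len(pending) >= limit:
                # Queue is full: wait for the oldest block before
                # submitting another one.
                yield pending.popleft().result()
            pending.append(pool.submit(work, block))
        # Input exhausted: drain whatever is still in flight.
        while pending:
            yield pending.popleft().result()


squares = list(process_in_order(range(10), lambda n: n * n, max_pending=3))
assert squares == [n * n for n in range(10)]
```

Because the generator only pulls a new block after handing back the oldest result, memory use stays bounded by the limit no matter how large the input is.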

Fixed oversight in BZip2ParallelInputStream.cs causing very slow reads when reading very few bytes (f8daa08)

Reformatted modified files with csharpier (1f2195e)

drone1400 wants to merge 3 commits into adamhathcock:master from drone1400:multi-thread-bzip2
