
Replaced BZip2 with Multi Threaded BZip2 implementation#1062

Open
drone1400 wants to merge 3 commits into
adamhathcock:master from
drone1400:multi-thread-bzip2


@drone1400

Changes Summary

  • Replaces BZip2 implementation with a more performant multi-threaded one

Issues

  • This is not a complete replacement: it does not replace the BZip2 implementation used in LZMA's Registry.cs, because the current multi-threaded implementation has no built-in concatenated stream support (this should be easy enough to resolve if necessary)
  • Due to some internal logic, it causes the BZip2_Reader_Factory test to fail. This test expects an InvalidOperationException to be thrown when opening an invalid BZ2 stream with a ReaderFactory. The problem is as follows:
    • the original BZ2 implementation checks the BZ2 header, the BZ2 block header magic byte sequence, and the BZ2 block header CRC in the constructor. This causes an exception to be thrown while the ReaderFactory is evaluating whether the given stream is a valid Tar.Bz2 archive.
    • with the multi-threaded implementation, the ReaderFactory evaluates whether the stream is a valid Tar.Bz2 archive, determines that it is not, and moves on without throwing an exception. A different exception is then thrown later, when evaluating another archive type.
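
To make the difference concrete, here is a minimal Python sketch (not the C# code of either implementation) of the kind of eager check described above: validating the stream header and the first 48-bit magic before any decompression starts. The function name `looks_like_bzip2` is hypothetical; the magic values are the ones defined by the bzip2 format. It only covers the header and magic bytes, not the block CRC.

```python
import bz2

# 48-bit magic values defined by the bzip2 format.
BLOCK_MAGIC = bytes.fromhex("314159265359")  # starts every compressed block
EOS_MAGIC = bytes.fromhex("177245385090")    # end-of-stream marker (an empty stream has no blocks)


def looks_like_bzip2(prefix: bytes) -> bool:
    """Cheap sanity check on the first 10 bytes of a candidate bzip2 stream.

    Hypothetical helper for illustration only.
    """
    if len(prefix) < 10:
        return False
    if prefix[:3] != b"BZh":
        return False
    # The fourth byte is the block size level, an ASCII digit from '1' to '9'.
    if not (ord("1") <= prefix[3] <= ord("9")):
        return False
    # The first block (or the end-of-stream marker) is byte-aligned right after the header.
    return prefix[4:10] in (BLOCK_MAGIC, EOS_MAGIC)


print(looks_like_bzip2(bz2.compress(b"hello")))
print(looks_like_bzip2(b"not a bzip2 stream"))
```

Whether a failed check should throw (as the existing test expects) or just return false and let the factory move on is a design decision for the library; the sketch only shows the detection step.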

Context
A few years ago, I started using SharpCompress in a personal hobby project but was frustrated by how slow BZip2 Compression was.

I ended up implementing my own multi-threaded BZip2 compressor by modifying an existing single-threaded BZip2 implementation.
My implementation: https://github.com/drone1400/bzip2
The original single-threaded implementation it was based on: https://github.com/jaime-olivares/bzip2

Anyway, I used it until this year without issue, but as I was working on various updates on my personal hobby project, I ended up wondering why I never got around to implementing a multi-threaded DECOMPRESSOR too at the time...

So now I went and did just that! I finished implementing a multi-threaded decompressor, cleaned up the code and made some improvements in the process, and figured I might as well submit a pull request too.

Benchmarks
Here is a comparison of compressing and decompressing a tar archive of a few MB with both my multi-threaded implementation and the existing SharpCompress BZip2 one.

[Benchmark chart: benchmark-fb3-v2]
Legend:
  • CST = Single Thread Compression
  • CMT = Multi Thread Compression
  • DST = Single Thread Decompression
  • DMT = Multi Thread Decompression
  • CSC = Compression using SharpCompress BZip2 implementation
  • DSC = Decompression using SharpCompress BZip2 implementation

- currently causes BZip2_Reader_Factory test to fail due to some implementation details, to figure out later...
- have not replaced original BZip2 implementation in LZMA Registry.cs due to BZip2MT implementation not having a built in concatenated stream support
@drone1400
Author

drone1400 commented Dec 6, 2025

Oops! I was too hasty by half! While the multi-threaded streams themselves work fine, I have run into an issue where calling ReaderFactory.Open on a very large tar.bz2 file takes forever. This does not seem to have been noticeable in the tests. I will look into it and report back on what's going on.

EDIT: Issue located and fixed already.

@adamhathcock
Owner

Thank you for this!

I will have to think about it, but can BZip2 be multi-threaded without seeking on a stream? I was working on allowing multi-threading on files, as a lot of algorithms can be parallelized just by acting on a seekable stream.

I wasn't including BZip2 in this as it's a single stream of bytes (I think)

I think you need to run csharpier to format this.

@drone1400
Author

> I will have to think about it, but can BZip2 be multi-threaded without seeking on a stream? I was working on allowing multi-threading on files, as a lot of algorithms can be parallelized just by acting on a seekable stream.
>
> I wasn't including BZip2 in this as it's a single stream of bytes (I think)

While the compressed data ends up being a single stream of bytes (or rather, a single stream of bits, since it's not even byte-aligned), the blocks themselves are independent. As long as you can split the input stream into blocks, you can have multiple threads working on different blocks.

The compression part is easier, since you can just split the input stream into blocks of fixed size, process multiple blocks at the same time and then just put them back together in the same order that you read them in.
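
As a rough illustration of the split / compress in parallel / reassemble in order idea, here is a minimal Python sketch. It is a deliberately simpler variant than what is described above: each chunk is compressed as its own complete bzip2 stream and the streams are concatenated, rather than joining blocks at the bit level into a single stream. The reader therefore needs concatenated-stream support, which Python's `bz2.decompress` has. The function name and the default chunk size are assumptions made for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def compress_parallel(data: bytes, chunk_size: int = 900_000,
                      workers: Optional[int] = None) -> bytes:
    """Compress fixed-size chunks in parallel and join them in input order.

    Hypothetical helper: the result is a sequence of concatenated bzip2
    streams, not a single stream with bit-packed blocks.
    """
    if not data:
        return bz2.compress(data)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in submission order, so the output
        # order matches the input order no matter which chunk finishes first.
        return b"".join(pool.map(bz2.compress, chunks))


original = bytes(range(256)) * 4_000
packed = compress_parallel(original, chunk_size=300_000)
print(len(original), len(packed), bz2.decompress(packed) == original)
```

Producing one stream instead requires shifting each compressed block to the current bit position of the output and computing the combined stream CRC, which is the part this sketch leaves out.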

Decompression is a bit trickier, since you have to find the 48-bit block start sequence within a bitstream. What I'm doing there is reading the input stream bit by bit while copying it into a temporary buffer. When I find a 48-bit start sequence, I send the temporary buffer to a worker thread and continue splitting off the next block while waiting for the worker threads to finish the block I need next.

Actually, now that I think about it, there is a potential problem there: if the 48-bit start sequence were to appear in the middle of a block's compressed data, it would break decompression, because the block would be cut short and could not be decompressed.

I'm not sure whether this is actually possible, though. So far I have not encountered any such issue in the few hundred real files and thousands of randomly generated files I tested, but it's something that should be considered.
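
The bit-level scan can be sketched in a few lines of Python. This is only an illustration of the technique, not the C# code from this branch: it slides a 48-bit window over the data, most significant bit first, and records every bit offset where a magic value appears. The helper name `find_magic_offsets` is hypothetical; the two magic values come from the bzip2 format. Like any scan of this kind, it cannot tell a real block boundary from a coincidental match inside compressed data.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # marks the start of each compressed block
EOS_MAGIC = 0x177245385090    # end-of-stream marker after the last block
MASK_48 = (1 << 48) - 1


def find_magic_offsets(data: bytes, magic: int) -> list:
    """Return every bit offset at which the 48-bit magic value begins."""
    offsets = []
    window = 0     # the most recent 48 bits seen
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):  # bzip2 packs bits MSB first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == magic:
                offsets.append(bits_seen - 48)
    return offsets


# Level 1 uses the smallest block size, so this input spans several blocks.
payload = b"".join(b"%d\n" % i for i in range(50_000))
stream = bz2.compress(payload, compresslevel=1)
block_starts = find_magic_offsets(stream, BLOCK_MAGIC)
stream_end = find_magic_offsets(stream, EOS_MAGIC)
print(block_starts, stream_end)
```

Only the first block is guaranteed to start on a byte boundary (right after the 4-byte stream header); later offsets are generally not multiples of 8, which is why each block has to be copied out and realigned before a worker can decode it.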

> I think you need to run csharpier to format this.

Oops, my bad!

@adamhathcock
Owner

The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.

Also, how is parallelism controlled? I can't see much in a diff

@drone1400
Author

drone1400 commented Jan 13, 2026

Apologies for the late reply.

> The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.

As the code in this branch currently stands, there is no way for users to select which implementation to use.

> Also, how is parallelism controlled? I can't see much in a diff

What do you mean by this exactly?

@adamhathcock
Owner

> Apologies for the late reply.
>
> > The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done.
>
> As the code in this branch currently is, there is no way for the users to select which implementation to use.

Can that be added? I wouldn't want everyone to change to use this just yet

> > Also, how is parallelism controlled? I can't see much in a diff
>
> What do you mean by this exactly?

Where are ThreadStart or a TaskScheduler used to fire Threads/Tasks? Or something similar?

@drone1400
Author

> Can that be added? I wouldn't want everyone to change to use this just yet

Yes, that should be possible. I'll see about doing that in the next few days.

> Where are ThreadStart or a TaskScheduler used to fire Threads/Tasks? or something similar?

What I am currently doing is calling ThreadPool.QueueUserWorkItem from ReadByte and Read in the BZip2ParallelInputStream class, and from WriteByte and Write in the BZip2ParallelOutputStream class.

I do have a way to limit the maximum number of blocks that get queued up: in BZip2ParallelInputStream there's _mtMaxPendingBlocks, which is initialized to a value provided via the stream constructor or defaults to Environment.ProcessorCount.

Huh, come to think of it, I forgot to do the same thing with _mtMaxPendingBlocks for BZip2ParallelOutputStream. I should probably fix that.
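
For readers unfamiliar with the pattern, here is a minimal Python sketch of a bounded queue of pending blocks whose results are consumed in order. It is not the code from this branch: `process_in_order` and `max_pending` are hypothetical names, with `max_pending` playing a role similar to `_mtMaxPendingBlocks` and defaulting to the processor count.

```python
import bz2
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Optional


def process_in_order(blocks: Iterable, work: Callable,
                     max_pending: Optional[int] = None) -> Iterator:
    """Run work(block) on a thread pool, yielding results in input order.

    At most `limit` blocks are queued or running at any time, which bounds
    memory use when the producer is faster than the workers.
    """
    limit = max_pending or os.cpu_count() or 1
    pending = deque()  # futures, oldest first
    with ThreadPoolExecutor(max_workers=limit) as pool:
        for block in blocks:
            if len(pending) >= limit:
                # Wait for the oldest block before queueing another one.
                yield pending.popleft().result()
            pending.append(pool.submit(work, block))
        while pending:
            yield pending.popleft().result()


chunks = [bytes([i]) * 1000 for i in range(8)]
compressed = list(process_in_order(chunks, bz2.compress, max_pending=2))
print(len(compressed))
```

Waiting on the oldest pending block, rather than on whichever finishes first, keeps the output order identical to the input order at the cost of occasionally idling a worker.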
