Replaced BZip2 with Multi Threaded BZip2 implementation #1062
- currently causes BZip2_Reader_Factory test to fail due to some implementation details, to figure out later...
- have not replaced original BZip2 implementation in LZMA Registry.cs due to BZip2MT implementation not having built-in concatenated stream support
Oops! I was too hasty by half! While the multi-threaded streams themselves work fine, I have actually run into an issue with calling
EDIT: Issue located and fixed already.
…s when reading very few bytes
Thank you for this!
I will have to think about it, but can BZip2 be multi-threaded without seeking on a stream? I was working on allowing multi-threading on files, as a lot of algorithms can be parallelized just by acting on a seekable stream. I wasn't including BZip2 in this, as it's a single stream of bytes (I think).
I think you need to run csharpier to format this.
The compression part is easier, since you can just split the input stream into blocks of fixed size, process multiple blocks at the same time, and then put them back together in the same order that you read them in.
The decompression is a bit trickier, since you have to find the 48-bit start sequence within a bitstream. So what I'm doing there is reading the input stream bit by bit, copying it into a temporary buffer at the same time. When I find a 48-bit start sequence, I send the temporary buffer to a worker thread and continue splitting off the next block while waiting for the worker threads to finish the block I'm waiting for.
Actually, now that I think about it, there is a potential problem there: if the 48-bit start sequence were to appear in the middle of a block, it would mess up the decompression, as the block would be incomplete and could not be decompressed. I'm not sure if this is actually possible though, and so far I have not encountered any such issue with the few hundred real files and thousands of randomly generated files I tested. But it's something that should be considered.
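For readers following along, here is a minimal sketch of the bit-level scan described above. It is not the code from this PR: it is written in Python for brevity, the function name `find_block_starts` is made up for illustration, and it only locates candidate block boundaries without handing them to worker threads. It relies on the bzip2 block header magic being the 48-bit value 0x314159265359, packed most significant bit first.

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block header magic
MASK48 = (1 << 48) - 1


def find_block_starts(data):
    """Return the bit offsets at which the 48-bit block magic appears.

    Hypothetical helper for illustration; a real splitter would also copy
    the bits between two offsets into a buffer for a worker to decompress.
    """
    starts = []
    window = 0    # sliding window holding the last 48 bits seen
    consumed = 0  # number of bits read so far
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            consumed += 1
            # Require 48 real bits so the zero-filled initial window
            # cannot produce a match with a negative offset.
            if consumed >= 48 and window == BLOCK_MAGIC:
                starts.append(consumed - 48)
    return starts


# Usage: incompressible input at level 1 (100k blocks) spans several blocks.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(250_000))
compressed = bz2.compress(payload, compresslevel=1)
starts = find_block_starts(compressed)
```

The first offset reported is 32, immediately after the 4-byte stream header. Note that the scan cannot tell a genuine block header from the same 48 bits occurring by chance inside compressed data, which is exactly the concern raised above.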
The only worry I have with this is the ability for users to control which BZip2 implementation to use. I'm not sure how that should be done. Also, how is parallelism controlled? I can't see much in a diff.
Apologies for the late reply.
As the code in this branch currently is, there is no way for the users to select which implementation to use.
What do you mean by this exactly?
Can that be added? I wouldn't want everyone to change to use this just yet
Where are ThreadStart or a TaskScheduler used to fire threads/tasks, or something similar?
Yes, that should be possible. I'll see about doing that next few days.
I do have a way to limit the maximum number of blocks that get queued up, in
Huh, come to think of it, I did forget to do the same thing with
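As an illustration of capping how many blocks are queued while still emitting results in input order, here is a rough Python sketch. It is not the PR's implementation: the name `compress_blocks_in_order` and the parameters `workers` and `max_pending` are invented for this example, and for simplicity each chunk is compressed as its own complete bzip2 stream and the streams are concatenated, whereas the PR works on blocks inside a single stream.

```python
import bz2
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def compress_blocks_in_order(chunks, workers=4, max_pending=8):
    """Yield compressed chunks in input order with bounded look-ahead.

    Hypothetical helper: at most `max_pending` chunks are submitted but
    not yet consumed, which bounds memory use for large inputs.
    """
    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunks:
            if len(pending) >= max_pending:
                # Block on the oldest chunk before queueing another one.
                yield pending.popleft().result()
            pending.append(pool.submit(bz2.compress, chunk))
        while pending:
            yield pending.popleft().result()


# Usage: split the input into fixed-size chunks, compress, and round-trip.
data = bytes(range(256)) * 2000
chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
output = b"".join(compress_blocks_in_order(chunks, workers=3, max_pending=2))
restored = bz2.decompress(output)  # Python's bz2 reads concatenated streams
```

Because the oldest pending result is always consumed first, output order matches input order regardless of which worker finishes first.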
Changes Summary
Issues
BZip2_Reader_Factory test to fail. This test expects an InvalidOperationException to be thrown when opening an invalid BZ2 stream with a ReaderFactory. The problem is thusly:
Context
A few years ago, I started using SharpCompress in a personal hobby project but was frustrated by how slow BZip2 Compression was.
I ended up implementing my own multi-threaded BZip2 compressor here by modifying an existing single threaded BZip2 implementation.
My implementation: https://github.com/drone1400/bzip2
The original single-threaded implementation that it was based on: https://github.com/jaime-olivares/bzip2
Anyway, I used it until this year without issue, but as I was working on various updates on my personal hobby project, I ended up wondering why I never got around to implementing a multi-threaded DECOMPRESSOR too at the time...
So now I went and did just that! I finished implementing a multi threaded decompressor, cleaned up the code and made some improvements in the process, and now I figured I might as well submit a pull request too.
Benchmarks
Here is a comparison for compressing/decompressing a few MB tar archive with both my multi-threaded implementation and the existing SharpCompress BZip2 one.