Final changes before feedback iteration

fierg · fierg · commit b18e7ef0d49e · 2020-01-26T23:23:57.000+01:00
diff --git a/doc/thesis/src/abstract.tex b/doc/thesis/src/abstract.tex
@@ -1,5 +1,3 @@
 \chapter*{Abstract}
 %% ==============================
-This is the place for a short summary (abstract) of the thesis. Present briefly and precisely the aim and subject of the thesis, the methods applied, and the results. Keep the first points rather short and focus on the results. Also evaluate the results and place them in context.
-
-The abstract should be at most one page long.
+The paper reviews a combination of preprocessing methods applied to the coding scheme Run Length Encoding (RLE) and analyzes their effects on the compression ratio. It is shown that the implemented preprocessing steps enable RLE to achieve reasonable compression on two corpora at an acceptable speed. Furthermore, the results are discussed in detail and the implementation is described.
diff --git a/doc/thesis/src/einleitung.tex b/doc/thesis/src/einleitung.tex
@@ -17,14 +17,14 @@ \section{Problem statement}
 %% ==============================
 \label{ch:Introduction:sec:Problem statement}
 \par{
-Some strings like \emph{aaaabbbb} achieve a very good compression rate because the string only has two different characters and they repeat more than twice. Therefore it can be compressed to $a^4b^4$ so from 8 byte down to 4 bytes if you encode it properly. On the other hand, if the input is highly mixed characters with few or no repetitions at all like \emph{abcdefgh}, the run length encoding of the string is $a^1b^1c^1d^1e^1f^1g^1h^1$ which needs up to 16 bytes depending on the implementation. So the inherent problem with run length encoding is obviously the possible explosion in size, due to missing repetitions in the input string. Expanding the string to twice the original size is rather undesirable worst case behavior for a compression algorithm so one has to make sure the input data is fitted for RLE as compression scheme. One goal is to improve the compression ratio on data currently not suited for run length encoding and perform better than the originally proposed RLE, in order for it to work on arbitrary data. Another goal should be to minimize the increase in size in the worst case scenario.}
+Some strings like \emph{aaaabbbb} achieve a very good compression rate because the string has only two different characters and each of them repeats more than twice. It can therefore be compressed to $a^4b^4$, that is, from 8 bytes down to 4 bytes if it is encoded properly. On the other hand, if the input consists of highly mixed characters with few or no repetitions, like \emph{abcdefgh}, the run length encoding of the string is $a^1b^1c^1d^1e^1f^1g^1h^1$, which needs up to 16 bytes, depending on the implementation. The inherent problem with run length encoding is thus the possible explosion in size due to missing repetitions in the input string. Expanding the string to twice its original size is a rather undesirable worst case behavior for a compression algorithm, so one has to make sure the input data is suited to RLE as a compression scheme. One goal is to improve the compression ratio on data currently not suited for run length encoding and to perform better than the originally proposed RLE, so that it works on arbitrary data.}
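The size behavior described in the problem statement can be reproduced with a minimal byte-wise RLE sketch. This is an illustration only, not the thesis implementation: the names `rle_encode` and `rle_decode`, the (count, value) pair layout and the `max_run` cap of 255 are assumptions made for this example.

```python
def rle_encode(data: bytes, max_run: int = 255) -> bytes:
    """Encode each run as two bytes: the run length, then the byte value."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte matches and the count fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < max_run:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode by expanding every (count, value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)


if __name__ == "__main__":
    for sample in (b"aaaabbbb", b"abcdefgh"):
        packed = rle_encode(sample)
        assert rle_decode(packed) == sample
        print(sample, len(sample), "->", len(packed), "bytes")
```

With this pair layout, `aaaabbbb` shrinks from 8 to 4 bytes, while `abcdefgh` grows from 8 to 16 bytes, which is the worst case doubling mentioned above.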
 
 %% ==============================
 \section{Main Objective}
 %% ==============================
 \label{ch:Introduction:sec:Main Objective}
 \par{
-The main objectives that derives from the problem statement, is to achieve an improved compression ratio compared to regular run length encoding on strings or files that are currently not suited for the method. Additionally it is desirable to further increase its performance in cases it is already reasonable. To unify the measurements, the compression ratio is calculated by encoding all files listed in the Calgary corpus which will be presented in Section \ref{tab:t05 The Calgary Corpus}. Since most improvements like permutations on the input, for example a reversible Burros-Wheeler-Transformation to increase the number of consecutive symbols or a different way of reading the byte stream take quite some time, encoding and decoding speed will decrease with increasing preprocessing effort.
+The main objective that derives from the problem statement is to achieve an improved compression ratio compared to regular run length encoding on strings or files that are currently not suited for the method. Additionally, it is desirable to further increase its performance in cases where it is already reasonable. To unify the measurements, the compression ratio is calculated by encoding all files listed in the Calgary corpus, which is presented in Table \ref{tab:t05 The Calgary Corpus}. Since most improvements, such as permutations of the input (for example a reversible Burrows-Wheeler-Transformation to increase the number of consecutive symbols) or a different way of reading the byte stream, take quite some time, encoding and decoding speed will decrease with increasing preprocessing effort. A secondary goal is to keep encoding and decoding speed at a reasonable pace.
 }
 %% ==============================
 \section{Structure of this work}
diff --git a/doc/thesis/src/entwurf.tex b/doc/thesis/src/entwurf.tex
@@ -184,7 +184,7 @@ \section{Applying the Burrows-Wheeler-Transformation}
 }
 
 \par{
-In general a Burrows-Wheeler-Transformation should also increase the runs in the implementation of Section \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:vertReading} and \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:byteRemapping} so those preprocessing steps were also applied in combination. To do so, it was first swapped against an sufficient implementation provided by a paper from M. Burrows and D. J. Wheeler \cite{Burrows94} from 1994. Their method is also the one described in Section \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:bwt} and could handle arbitrary input but it also had some downsides like the additional index $I$ of the transformation, which had to be persisted as well. The major downside of this implementation is the at least quadratic time complexity which made it still rather slow with increasing sizes of chunks, so again the input had to be spliced into small parts. If chunks exceeded a length of more than one kilobyte it became unacceptably slow even though it strongly improved the transformation results so most of the time and in table \ref{tab:t12 Burrows Wheeler Transformation on byte wise RLE} and ref\ref{tab:t13 Modified Burrows Wheeler Transformation on byte wise RLE} the transformation was performed on chunks of size 512 byte. To overcome this degradation of the original algorithm and the necessity of saving  additional indices, the implementation had to be swapped once more against one that was first described by \cite{Burrows-linear-time} in 2009 which claimed to perform in linear time complexity.
+In general, a Burrows-Wheeler-Transformation should also increase the runs in the implementations of Sections \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:vertReading} and \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:byteRemapping}, so those preprocessing steps were also applied in combination. To do so, the implementation was first swapped for a sufficient one based on the 1994 paper by M. Burrows and D. J. Wheeler \cite{Burrows94}. Their method is also the one described in Section \ref{ch:Analysis:sec:Improvements by Preprocessing:subSec:bwt} and could handle arbitrary input, but it also had some downsides, like the additional index $I$ of the transformation, which had to be persisted as well. The major downside of this implementation is its at least quadratic time complexity, which still made it rather slow with increasing chunk sizes, so again the input had to be split into small parts. If chunks exceeded a length of one kilobyte it became unacceptably slow, even though larger chunks strongly improved the transformation results. Therefore, most of the time, and in Tables \ref{tab:t12 Burrows Wheeler Transformation on byte wise RLE} and \ref{tab:t13 Modified Burrows Wheeler Transformation on byte wise RLE}, the transformation was performed on chunks of 512 bytes. To overcome this degradation of the original algorithm and the necessity of saving additional indices, the implementation had to be swapped once more for one first described in \cite{Burrows-linear-time} in 2009, which claimed to run in linear time.
 }
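To make the role of the additional index $I$ concrete, here is a naive sketch of the rotation-sorting transformation and its inverse. It is an assumption-laden illustration, not the implementation used in the thesis and not the bijective BWTS variant: the function names `bwt` and `inverse_bwt` are hypothetical, and sorting full rotations is deliberately simple and slow, which mirrors why small chunks were needed.

```python
from typing import Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Return the last column of the sorted rotation matrix and the index I
    of the row that holds the original input."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    rotations = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in rotations)
    return last_column, rotations.index(0)


def inverse_bwt(last_column: bytes, index: int) -> bytes:
    """Rebuild the input from the last column and the stored index I."""
    n = len(last_column)
    if n == 0:
        return b""
    # A stable sort of positions by symbol links each row of the first
    # column to the row holding the same symbol occurrence in the last column.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = index
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


if __name__ == "__main__":
    transformed, idx = bwt(b"banana")
    print(transformed, idx)
    assert inverse_bwt(transformed, idx) == b"banana"
```

For the input `banana` the sketch groups equal symbols into runs in the output, and the index has to be stored next to the transformed chunk so that the inverse can find the starting row.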
 \par{
 In the form of the C library \href{https://code.google.com/archive/p/libdivsufsort}{libdivsufsort}, a working implementation of BWTS was found, the bijective Burrows-Wheeler-Scott-Transformation described in \cite{DBLP:journals/corr/abs-1201-3077}. This kind of Burrows-Wheeler-Transformation does not require additional information, neither start and stop symbols nor an index of the original position. Briefly, it does not construct a matrix of all cyclic rotations; instead it is computed with a suffix array sorted with DivSufSort \cite{LibDivSufSort}, described in more detail in the paper \cite{DBLP:journals/corr/abs-1710-01896}, which is the fastest currently known method of constructing the transformation. To use it properly the code was ported to Kotlin, but ports of this library to Java and Go are also available. These are recommended because the original code is neither documented nor readable, and the functionality can easily be used via a dependency.
@@ -281,7 +281,7 @@ \section{Applying the Burrows-Wheeler-Transformation}
 	\end{tikzpicture}
 	%}
 	%\end{scaletikzpicturetowidth}
-	\caption{Byte mapping and varying maximum run lengths}
+	\caption{Byte mapping and varying maximum run lengths, all preprocessing steps}
 	\label{fig:3:Different run lengths with and without transformations}
 \end{figure}
 
diff --git a/doc/thesis/src/grundlagen.tex b/doc/thesis/src/grundlagen.tex
@@ -14,7 +14,7 @@ \section{Compression and Encoding fundamentals}
 %% ==============================
 \label{ch:Principles of compression:sec:Compression}
 \par{
-	The basic idea of compression is to remove redundancy in data, since all non random data contains redundant information. Pattern or structure identification and exploitation enables storing the original data in less space. Compression can be broken down into two broad categories: Lossless and lossy compression. Lossless compression makes it possible to reproduce the original data exactly while lossy compression allows the some degradation in the encoded data to gain even higher compression at the cost of losing some of the original information. To understand compression, one first has to understand some basic principles of information theory like entropy and different approaches to compress different types of data with different encoding. We will also show the key differences between probability coding and dictionary coding.}
+	The basic idea of compression is to remove redundancy in data, since all non-random data contains redundant information. Identifying and exploiting patterns or structure enables storing the original data in less space. Compression can be broken down into two broad categories: lossless and lossy compression. Lossless compression makes it possible to reproduce the original data exactly, while lossy compression allows some degradation in the encoded data to gain even higher compression at the cost of losing some of the original information. To understand the applied methods, one first has to understand some basic principles of information theory, like entropy, and different approaches to compressing different types of data with different encodings. We will also show the key differences between probability coding and dictionary coding.}
 
 \subsection{Information Theory and Entropy}
 \par{
@@ -51,9 +51,6 @@ \subsection{Information Theory and Entropy}
 \end{tabular}
 	\caption{Entropy in relation to the Probability of Symbols}
 \end{table}
-
-
-TODO: describe some relations between probability and entropy of information source
 }
 \subsection{General Analysis}
 \par{
@@ -260,7 +257,7 @@ \section{State of the Art}
 %% ==============================
 \label{ch:Principles of compression:sec:SOTA}
 \par{
-The current state of the art at the time of writing is depicted in the table below, all algorithms used highest possible compression scheme available. In general there is always a balance between compressed size and compression or decompression speed, where the faster algorithms use mostly some form of dictionary coding sometimes in combination with a Huffman coder to be able to output variable length codes. The more advanced ones like PPMd or ZPAQ use complex context modeling or context mixing approaches where they generate a probability distribution for the next symbol based on just read symbols. In general these complex methods achieve better compression at the expense of using more space or time. The currently modern and most used algorithms have been executed with the Calgary corpus and the results are shown in table \ref{tab:t20 stat of the art}, all files were processed separately. Algorithms like \textit{compress}, \textit{gzip} and \textit{ZIP} use a dictionary based method like LZ77 or some derivative and achieve reasonable compression at very fast rates. \textit{bzip2} and \textit{p7zip} use a Burrows-Wheeler-Transformation (described in \ref{ch:Principles of compression:sec:Other:subSec:bwt}) and then a combination of different techniques to improve performance. They sometimes even offer explicit choosing of a method, like \textit{7zip} which achieves even better compression when using an advanced probabilistic method called Prediction by partial matching, also implemented but with even better results by \textit{ZPAQ} which performs even better while still maintaining reasonable speeds. In the last decade methods like \textins{zstandard} developed by Facebook or \textit{brotli} developed by Google arose, mostly implementing ANS and other methods arose but they are either desired to be just faster or designed for a different task like \textit{brotli} for HTML and JSON compression and do not achieved best results on this corpus. 
In the context of the \href{http://prize.hutter1.net/}{hutter prize}, a competitive compression challenge, other advanced probabilistic methods were developed like \textit{paq8hp1} to \textit{paq8hp12}, a series of algorithms by Alexander Rhatushnyak \cite{mahoney2011large} who won the challenge. A more recent fork of \textit{paq8} from 2014 is the current leader, \textit{cmix}. It is still the best performing algorithm on this list and achieved by far the best results on the Calgary corpus at expense of almost 3 hours of computing and 32GB of needed ram, displayed in \ref{tab:t100benchmark}. Unfortunately it was not possible to run the paq8hp* variants due to an major bug in the only available releases. Most algorithms are available from the default repositories of your Linux distribution, or in the case of Fedora and CentOS, pre-installed with your operating system. The paq8hp* algorithms can be found on the Hutter price website and cmix is developed on \href{https://github.com/byronknoll/cmix}{github}.
+The state of the art at the time of writing is shown in the table below; all algorithms were run with the highest compression setting available. In general there is always a trade-off between compressed size and compression or decompression speed. The faster algorithms mostly use some form of dictionary coding, sometimes in combination with a Huffman coder to be able to output variable length codes. The more advanced ones like PPMd or ZPAQ use complex context modeling or context mixing approaches, in which they generate a probability distribution for the next symbol based on the symbols just read. In general these complex methods achieve better compression at the expense of more space or time. The most widely used current algorithms were run on the Calgary corpus and the results are shown in Table \ref{tab:t20 stat of the art}; all files were processed separately. Algorithms like \textit{compress}, \textit{gzip} and \textit{ZIP} use a dictionary based method like LZ77 or some derivative and achieve reasonable compression at very fast rates. \textit{bzip2} and \textit{p7zip} use a Burrows-Wheeler-Transformation (described in \ref{ch:Principles of compression:sec:Other:subSec:bwt}) followed by a combination of different techniques to improve performance. Some even offer an explicit choice of method, like \textit{7zip}, which achieves even better compression when using an advanced probabilistic method called prediction by partial matching. This method is also implemented by \textit{ZPAQ}, which achieves even better results while still maintaining reasonable speeds. In the last decade, methods like \textit{zstandard}, developed by Facebook, or \textit{brotli}, developed by Google, arose, mostly implementing ANS and other methods. They are either designed to be just faster or designed for a different task, like \textit{brotli} for HTML and JSON compression, and do not achieve the best results on this corpus. 
In the context of the \href{http://prize.hutter1.net/}{Hutter Prize}, a competitive compression challenge, other advanced probabilistic methods were developed, like \textit{paq8hp1} to \textit{paq8hp12}, a series of algorithms by Alexander Rhatushnyak \cite{mahoney2011large}, who won the challenge. A more recent fork of \textit{paq8} from 2014, \textit{cmix}, is the current leader. It is the best performing algorithm on this list and achieved by far the best results on the Calgary corpus, at the expense of almost 3 hours of computing time and 32 GB of RAM, as displayed in \ref{tab:t100benchmark}. Unfortunately it was not possible to run the paq8hp* variants due to a major bug in the only available releases. Most algorithms are available from the default repositories of common Linux distributions or, in the case of Fedora and CentOS, come pre-installed with the operating system. The paq8hp* algorithms can be found on the Hutter Prize website and cmix is developed on \href{https://github.com/byronknoll/cmix}{GitHub} \cite{cmix}.
 	\begin{table}[h]
 		\begin{tabular}{r|r|r|r|r}
 			method & options &  size in bytes & compression & \textit{bps}\\
diff --git a/doc/thesis/thesis.bib b/doc/thesis/thesis.bib
@@ -349,7 +349,7 @@ @misc{LibDivSufSort
 	year = {2015},
 	version = {2.0.1},
 	journal = {GitHub repository},
-	howpublished = {\url{https://github.com/y-256/libdivsufsort}}
+	howpublished = {\href{https://github.com/y-256/libdivsufsort}{github}}
 }
 
 @misc{kanzi,
@@ -358,9 +358,20 @@ @misc{kanzi
 	year = {2019},
 	version = {1.6},
 	journal = {GitHub repository},
-	howpublished = {\url{https://github.com/flanglet/kanzi}}
+	howpublished = {\href{https://github.com/flanglet/kanzi/releases}{github}}
 }
 
+@misc{cmix,
+	title = {cmix compression algorithm},
+	author = {Byron Knoll},
+	year = {2019},
+	version = {v18},
+	journal = {GitHub repository},
+	howpublished = {\href{https://github.com/byronknoll/cmix/releases}{github}}
+}
+
+
+
 @proceedings{rle-patent,
 	editor={A. Harry {Robinson} and C. {Cherry}},
 	journal={Proceedings of the IEEE},
diff --git a/doc/thesis/thesis.pdf b/doc/thesis/thesis.pdf