-
Notifications
You must be signed in to change notification settings - Fork 1
Sample ~ Data Encoding
If you haven't read the introductory tutorial already, it's a good time to do so, as this benchmark description assumes familiarity with the ildl-plugin.
The benchmark files described in the reminder of this presentation are available in the ildl-plugin repository, in the tests/benchmarks/src/ildl/benchmark/gcd directory.
The current benchmark is the Gaussian integer greatest common divisor, which is explained on http://math.stackexchange.com/questions/82350/gcd-of-gaussian-integers.
Complex numbers with integers as both their real and imaginary parts are known as Gaussian integers, and are a countable subset of all complex numbers. The operations defined on Gaussian integers are similar to complex number operations, with one exception: division is not precise, as it rounds the result to the nearest Gaussian integer, meaning that both the real and imaginary axes will contain integers.
Let's take a look at the following snippet. It encodes the GCD algorithm for complex numbers in a tail recursive function.
def gcd_direct(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
@annotation.tailrec
def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
val remainder = n1 % n2
if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
}
if (n1.norm >= n2.norm)
gcd_inner(n1, n2)
else
gcd_inner(n2, n1)
}The performance of this code isn't optimal. If we run this algorithm without any compiler and virtual machine optimizations, for n1=544+185i and n2=131+181i each GCD operation takes around 130 microseconds. Tuples, genericity and boxing put a lot of pressure in the execution flow of this algorithm, stressing the garbage collector and burning CPU cycles that we could save. How can we improve the memory footprint and the performance of our code by changing the layout representations of our data types without altering the source code?
Several factors influence the overall speedup that we wish to obtain. The main data-structure is a pair of integers. The pair class has a pair of generic fields which we would like to specialize in order to avoid an increasing number of heap objects. Additionally, we proceed with the observation that a Long is better than two Integers in terms of memory footprint. Finally, boxing should be avoided by any costs.
It would be interesting if we would be able to write the code above (which is a straightforward encoding of the algorithm from mathematics) but execute code similar to the snippet below, with the desired semantics (our encoding of two Integers to one Long doesn't mean that the compiler will auto-adjust):
def gcd_direct(n1: Long, n2: Long): Long = {
@annotation.tailrec
def gcd_inner(n1: Long, n2: Long): (Int, Int) = {
val remainder = n1 % n2
if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
}
if (n1.norm >= n2.norm)
gcd_inner(n1, n2)
else
gcd_inner(n2, n1)
}In the following diagram, we added intermediate steps to the transformation, in order to isolate the individual factors influencing the overall speedup.
The diagram follows the transformation of the main data type in the program: the pair of integers. The top part explains the transformation, the middle shows the updated types and the bottom part shows the exact transformation description objects used for the adrt scopes:
+--- class specialization: the tuple class is already specialized in the Scala backend,
| so the two integers are stored in the unboxed format. However,
| a limitation in the Scala specialization (fixed by the
| miniboxing transformation) dictates that even the specialized
| class has a pair of generic fields. By adding our
| hand-specialized alternative to tuples, we eliminate those
| fields, saving on the heap memory footprint and thus on the
| GC cycles.
|
| +--> encoding: instead of having a class with two fields we have the
| | java.lang.Long which encodes the two fields in one. We
| | expect to get a small performance hit, as the real and
| | imaginary components have to be encoded.
| |
| | +--> unboxing: we need to convert j.l.Long to
| | | scala.Long and the backend unboxes it
| | | without our intervention.
| | |
(3, 4) ===> Complex(3, 4) ===> java.lang.Long ===> scala.Long (compiled to Java's unboxed long)
\ ^ ^ ^
\___________/ / /
\ step1.IntPairAsGaussianInt.../ /
\ / /
\ / /
\________________________/ /
\ step2.IntPairAsGaussianInteger /
\ /
\ /
\_________________________________/
step3.IntPairAsGaussianInteger
Using adrt we manage to encode all three optimizations in one TransformationDescriptionObject that includes all the necessary information to adjust our encoding with the semantics of the operators that are used in this block of code (i.e., the mod operation (%) between Long integers is not the usual modulo operation as the Scala compiler knows it, it is a mod operation between two Ints encoded in one Long).
Finally using adrt we flip the switch to on for this transformation, without messing the beautiful and straightforward code.
adrt(IntPairAsGaussianInteger) {
def gcd_adrt(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
@annotation.tailrec
def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
val remainder = n1 % n2
if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
}
if (n1.norm >= n2.norm)
gcd_inner(n1, n2)
else
gcd_inner(n2, n1)
}
}We include three parts of transformation in the repository. In this section we will present the description object for one transformation. Imagine that using adrt would enable only one of the three optimizations. We use that device for exposition in this wiki by showing only one optimization (in the repository you can find all three optimizations separated into three files (ildl.benchmark.gcd.step[1-3]). This, simpler transformation, uses a manually specialized tuple class.
object IntPairAsGaussianInteger extends TransformationDescription {
// Here we're un-importing scala.Long and switching to Complex
// but it's ultimately the same code
import scala.{ Long => _ }
case class Complex(_1: Int, _2: Int)
// coercions:
def toRepr(pair: (Int, Int)): Complex @high = pack(pair._1, pair._2)
def toHigh(l: Complex @high): (Int, Int) = (re(l), im(l))
// constructor:
def ctor_Tuple2(_1: Int, _2: Int): Complex @high = pack(_1, _2)
// interface: (no need to expose everything)
def implicit_IntPairAsGaussianIntegerImplicit_%(n1: Complex @high, n2: Complex @high): Complex @high = %(n1, n2)
def implicit_IntPairAsGaussianIntegerImplicit_norm(n: Complex @high): Int = norm(n)
// extension methods:
def extension_==(receiver: Complex @high, other: Complex @high): Boolean = receiver == other
def extension_toString(receiver: Complex @high): String = toHigh(receiver).toString
// implementation:
private def pack(re: Int, im: Int): Complex @high = Complex(re, im)
private def re(l: Complex @high): Int = l._1
private def im(l: Complex @high): Int = l._2
private def norm(l: Complex @high): Int = re(l)^2 + im(l)^2
private def c(l: Complex @high): Complex @high = pack(re(l), -im(l))
private def +(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) + re(n2), im(n1) + im(n2))
private def -(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) - re(n2), im(n1) - im(n2))
private def *(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) * re(n2) - im(n1) * im(n2), re(n1) * im(n2) + im(n1) * re(n2))
private def *(n1: Complex @high, n2: Int): Complex @high = pack(re(n1) * n2, im(n1) * n2)
private def /(n1: Complex @high, n2: Complex @high): Complex @high = {
val denom = *(n2, c(n2))
val numer = *(n1, c(n2))
assert(im(denom) == 0)
pack(math.round(re(numer).toFloat / re(denom)), math.round(im(numer).toFloat / re(denom)))
}
private def %(n1: Complex @high, n2: Complex @high): Complex @high = pos(this.-(n1,*(/(n1, n2), n2)))
private def pos(n1: Complex @high): Complex @high = if (re(n1) < 0) pack(-re(n1), -im(n1)) else n1
private def bits(n1: Complex @high): String = "<not applicable>"
}Note that a Complex type definition exists to pack the integers, instead of a tuple. This case class contains two Ints explicitly. All relevant operations are redefined.
To run the benchmarks in a local installation, do:
$ cd ildl-plugin
$ sbt
...
> ildl-benchmarks/runMain ildl.benchmark.gcd.BenchmarkRunner
| Benchmark | Running Time | Speedup |
|---|---|---|
| 10000 GCD runs, original code | 28.1 ms | none |
| 10000 GCD runs, step 1 transformation | 12.5 ms | 2.2x |
| 10000 GCD runs, step 2 transformation | 15.0 ms | 1.9x |
| 10000 GCD runs, step 3 transformation | 2.2 ms | 12.7x |
Conclusion: the speedups observed range between 2.2x and 12.7x.
- to continue reading the samples and benchmarks series, jump to the efficient collections optimization
- get back to the home page
The `ildl-plugin` is a meta-programming technique aimed at allowing safe, custom transformations across library boundaries. Using `ildl`-based transformations, we were able to obtain speedups in excess of 20x and have optimized code across a wide range of use-cases. **Return to the main page** or **return to the OOPSLA Step by Step Guide**
Some of the pages of this wiki are in flux. If you can't find what you are looking for, please [file a bug](https://github.com/miniboxing/ildl-plugin/issues).