Skip to content

Sample ~ Data Encoding

Vlad Ureche edited this page Jun 11, 2015 · 37 revisions

If you haven't read the [[introduction|Tutorial--Introduction]] and [[first example|Tutorial--Example-(Part-1)]] already, it's a good time to do so, as this benchmark description assumes familiarity with the ildl-plugin.

We will show a benchmark that uses the transformation developed during the example section. You will finally see why we spent so much energy defining the large transformation description object in the tutorial: it brings speedups of up to 12.7x.

The benchmark files are available in the ildl-plugin repository, in the tests/benchmarks/src/ildl/benchmark/gcd directory. If you have the ildl-plugin projects in the Scala IDE, this benchmark is available under the ildl-benchnarks project, in the src directory, in package ildl.benchmark.gcd.

Problem

Have you noticed...?

Have you noticed that, in the introductory example, we defined complex number with integer components instead of floating-point ones, as it is usually done? Did it seem peculiar?

Complex numbers with integers as both their real and imaginary parts are known as Gaussian integers, and are a countable subset of all complex numbers. The operations defined on Gaussian integers are similar to complex number operations, with one exception: division is not precise, as it rounds the result to the nearest Gaussian integer. This matches the behaviour of integers:

scala> 5/2
res0: Int = 2

The interesting thing about Gaussian integers is that we can define the greatest common divisor, which is computed exactly as for integers. This following stackexchange page explains more: http://math.stackexchange.com/questions/82350/gcd-of-gaussian-integers.

Solution

Let's take a look at the following snippet. It encodes the GCD algorithm for complex numbers in a tail recursive function.

def gcd_direct(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
  @annotation.tailrec
  def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
    val remainder = n1 % n2
    if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
  }
  if (n1.norm >= n2.norm)
    gcd_inner(n1, n2)
  else
    gcd_inner(n2, n1)
}

The performance of this code isn't optimal. If we run this algorithm without any compiler and virtual machine optimizations, for n1=544+185i and n2=131+181i each GCD operation takes around 130 microseconds. Tuples, genericity and boxing put a lot of pressure in the execution flow of this algorithm, stressing the garbage collector and burning CPU cycles that we could save. How can we improve the memory footprint and the performance of our code by changing the layout representations of our data types without altering the source code?

Solution

Several factors influence the overall speedup that we wish to obtain. The main data-structure is a pair of integers. The pair class has a pair of generic fields which we would like to specialize in order to avoid an increasing number of heap objects. Additionally, we proceed with the observation that a Long is better than two Integers in terms of memory footprint. Finally, boxing should be avoided by any costs.

It would be interesting if we would be able to write the code above (which is a straightforward encoding of the algorithm from mathematics) but execute code similar to the snippet below, with the desired semantics (our encoding of two Integers to one Long doesn't mean that the compiler will auto-adjust):

def gcd_direct(n1: Long, n2: Long): Long = {
  @annotation.tailrec
  def gcd_inner(n1: Long, n2: Long): (Int, Int) = {
    val remainder = n1 % n2
    if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
  }
  if (n1.norm >= n2.norm)
    gcd_inner(n1, n2)
  else
    gcd_inner(n2, n1)
}

Automating the Solution

In the following diagram, we added intermediate steps to the transformation, in order to isolate the individual factors influencing the overall speedup.

The diagram follows the transformation of the main data type in the program: the pair of integers. The top part explains the transformation, the middle shows the updated types and the bottom part shows the exact transformation description objects used for the adrt scopes:

       +--- class specialization: the tuple class is already specialized in the Scala backend,
        |                          so the two integers are stored in the unboxed format. However,
        |                          a limitation in the Scala specialization (fixed by the
        |                          miniboxing transformation) dictates that even the specialized
        |                          class has a pair of generic fields. By adding our
        |                          hand-specialized alternative to tuples, we eliminate those
        |                          fields, saving on the heap memory footprint and thus on the
        |                          GC cycles.
        |
        |                  +--> encoding: instead of having a class with two fields we have the
        |                  |              java.lang.Long which encodes the two fields in one. We
        |                  |              expect to get a small performance hit, as the real and
        |                  |              imaginary components have to be encoded.
        |                  |
        |                  |                   +--> unboxing: we need to convert j.l.Long to
        |                  |                   |              scala.Long and the backend unboxes it
        |                  |                   |              without our intervention.
        |                  |                   |
(3, 4) ===> Complex(3, 4) ===> java.lang.Long ===> scala.Long (compiled to Java's unboxed long)
   \             ^                    ^                ^
    \___________/                    /                /
     \ step1.IntPairAsGaussianInt.../                /
      \                            /                /
       \                          /                /
        \________________________/                /
         \ step2.IntPairAsGaussianInteger        /
          \                                     /
           \                                   /
            \_________________________________/
              step3.IntPairAsGaussianInteger

Using adrt we manage to encode all three optimizations in one TransformationDescriptionObject that includes all the necessary information to adjust our encoding with the semantics of the operators that are used in this block of code (i.e., the mod operation (%) between Long integers is not the usual modulo operation as the Scala compiler knows it, it is a mod operation between two Ints encoded in one Long).

Finally using adrt we flip the switch to on for this transformation, without messing the beautiful and straightforward code.

adrt(IntPairAsGaussianInteger) {
  def gcd_adrt(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
    @annotation.tailrec
    def gcd_inner(n1: (Int, Int), n2: (Int, Int)): (Int, Int) = {
      val remainder = n1 % n2
      if (remainder.norm == 0) n2 else gcd_inner(n2, remainder)
    }
    if (n1.norm >= n2.norm)
      gcd_inner(n1, n2)
    else
      gcd_inner(n2, n1)
  }
}

Description Object

We include three parts of transformation in the repository. In this section we will present the description object for one transformation. Imagine that using adrt would enable only one of the three optimizations. We use that device for exposition in this wiki by showing only one optimization (in the repository you can find all three optimizations separated into three files (ildl.benchmark.gcd.step[1-3]). This, simpler transformation, uses a manually specialized tuple class.

object IntPairAsGaussianInteger extends TransformationDescription {

  // Here we're un-importing scala.Long and switching to Complex
  // but it's ultimately the same code
  import scala.{ Long => _ }

  case class Complex(_1: Int, _2: Int)

  // coercions:
  def toRepr(pair: (Int, Int)): Complex @high = pack(pair._1, pair._2)
  def toHigh(l: Complex @high): (Int, Int) = (re(l), im(l))

  // constructor:
  def ctor_Tuple2(_1: Int, _2: Int): Complex @high = pack(_1, _2)

  // interface: (no need to expose everything)
  def implicit_IntPairAsGaussianIntegerImplicit_%(n1: Complex @high, n2: Complex @high): Complex @high = %(n1, n2)
  def implicit_IntPairAsGaussianIntegerImplicit_norm(n: Complex @high): Int = norm(n)

  // extension methods:
  def extension_==(receiver: Complex @high, other: Complex @high): Boolean = receiver == other
  def extension_toString(receiver: Complex @high): String = toHigh(receiver).toString

  // implementation:
  private def pack(re: Int, im: Int): Complex @high = Complex(re, im)
  private def re(l: Complex @high): Int = l._1
  private def im(l: Complex @high): Int = l._2
  private def norm(l: Complex @high): Int = re(l)^2 + im(l)^2
  private def c(l: Complex @high): Complex @high = pack(re(l), -im(l))
  private def +(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) + re(n2), im(n1) + im(n2))
  private def -(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) - re(n2), im(n1) - im(n2))
  private def *(n1: Complex @high, n2: Complex @high): Complex @high = pack(re(n1) * re(n2) - im(n1) * im(n2), re(n1) * im(n2) + im(n1) * re(n2))
  private def *(n1: Complex @high, n2: Int): Complex @high = pack(re(n1) * n2, im(n1) * n2)
  private def /(n1: Complex @high, n2: Complex @high): Complex @high = {
    val denom = *(n2, c(n2))
    val numer = *(n1, c(n2))
    assert(im(denom) == 0)
    pack(math.round(re(numer).toFloat / re(denom)), math.round(im(numer).toFloat / re(denom)))
  }
  private def %(n1: Complex @high, n2: Complex @high): Complex @high = pos(this.-(n1,*(/(n1, n2), n2)))
  private def pos(n1: Complex @high): Complex @high = if (re(n1) < 0) pack(-re(n1), -im(n1)) else n1
  private def bits(n1: Complex @high): String = "<not applicable>"
}

Note that a Complex type definition exists to pack the integers, instead of a tuple. This case class contains two Ints explicitly. All relevant operations are redefined.

Benchmark

To run the benchmarks in a local installation, do:

$ cd ildl-plugin
$ sbt
...
> ildl-benchmarks/runMain ildl.benchmark.gcd.BenchmarkRunner
Benchmark Running Time Speedup
10000 GCD runs, original code 28.1 ms none
10000 GCD runs, step 1 transformation 12.5 ms 2.2x
10000 GCD runs, step 2 transformation 15.0 ms 1.9x
10000 GCD runs, step 3 transformation 2.2 ms 12.7x

Conclusion: the speedups observed range between 2.2x and 12.7x.

From here:

Frog Work Ahead: Some of the pages of this wiki are in flux. If you can't find what you are looking for, please [file a bug](https://github.com/miniboxing/ildl-plugin/issues).

Clone this wiki locally