Mistakes in micro benchmarks

time to read 5 min | 900 words

So on my last post I showed a bunch of small micro benchmark, and aside from the actual results, I wasn’t really sure what was going on there. Luckily, I know a few perf experts, so I was able to lean on them.

In particular, the changes that were recommended were:

  • Don’t make just a single tiny operation, it is easy to get too much jitter in the setup for the call if the op is too cheap.
  • Pay attention to potential data issues, the compiler / jit can decide to put something on a register, in which case you are benching the CPU directly, which won’t be the case in the real world.

I also learned how to get the actual assembly being run, which is great. All in all, we get the following benchmark code:

[BenchmarkTask(platform: BenchmarkPlatform.X86,
            jitVersion: BenchmarkJitVersion.RyuJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X86,
            jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
{
    byte* p1, p2, p3, p4;
    FooHeader* h1, h2,h3,h4;
    public ToCastOrNotToCast()
    {
        p1 = (byte*)Marshal.AllocHGlobal(1024);
        p2 = (byte*)Marshal.AllocHGlobal(1024);
        p3 = (byte*)Marshal.AllocHGlobal(1024);
        p4 = (byte*)Marshal.AllocHGlobal(1024);
        h1 = (FooHeader*)p1;
        h2 = (FooHeader*)p2;
        h3 = (FooHeader*)p3;
        h4 = (FooHeader*)p4;
    }

    [Benchmark]
    [OperationsPerInvoke(4)]
    public void NoCast()
    {
        h1->PageNumber++;
        h2->PageNumber++;
        h3->PageNumber++;
        h4->PageNumber++;
    }

    [Benchmark]
    [OperationsPerInvoke(4)]
    public void Cast()
    {
        ((FooHeader*)p1)->PageNumber++;
        ((FooHeader*)p2)->PageNumber++;
        ((FooHeader*)p3)->PageNumber++;
        ((FooHeader*)p4)->PageNumber++;
    }
}

And the following results:

          Method | Platform |       Jit |   AvrTime |    StdDev |             op/s |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X64 | LegacyJit | 0.2135 ns | 0.0113 ns | 4,683,511,436.74 |
          NoCast |      X64 | LegacyJit | 0.2116 ns | 0.0017 ns | 4,725,696,633.67 |
            Cast |      X64 |    RyuJit | 0.2177 ns | 0.0038 ns | 4,593,221,104.97 |
          NoCast |      X64 |    RyuJit | 0.2097 ns | 0.0006 ns | 4,769,090,600.54 |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X86 | LegacyJit | 0.7465 ns | 0.1743 ns | 1,339,630,922.79 |
          NoCast |      X86 | LegacyJit | 0.7474 ns | 0.1320 ns | 1,337,986,425.19 |
            Cast |      X86 |    RyuJit | 0.7481 ns | 0.3014 ns | 1,336,808,932.91 |
          NoCast |      X86 |    RyuJit | 0.7426 ns | 0.0039 ns | 1,346,537,728.81 |

Interestingly enough, the NoCast approach is faster in pretty much all setups.

Here is the assembly code for LegacyJit in x64:

image

For RyuJit, the code is identical for the cast code, and the only difference in the no casting code is that the mov edx, ecx is mov rdx,rcx in RyuJit.

As an aside, X64 assembly code is much easier to read than x86 assembly code.

In short, casting or not casting has a very minor performance difference, but not casting allows us to save a pointer reference in the object, which means it will be somewhat smaller, and if we are going to have a lot of them, then that can be a pretty nice space saving.