Mistakes in micro benchmarks

Oct 19 2015

So in my last post I showed a bunch of small micro benchmarks, and aside from the actual results, I wasn’t really sure what was going on there. Luckily, I know a few perf experts, so I was able to lean on them.

In particular, the changes that were recommended were:

Don’t benchmark just a single tiny operation. If the op is too cheap, the jitter in the setup for the call can easily swamp it.
Pay attention to potential data issues. The compiler / JIT can decide to keep a value in a register, in which case you are benchmarking the CPU directly, which won’t be the case in the real world.

I also learned how to get the actual assembly being run, which is great. All in all, we get the following benchmark code:

using System.Runtime.InteropServices; // for Marshal.AllocHGlobal

// Note: the definition of FooHeader is not shown in this post.
[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.RyuJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
{
    // Raw pointers to four separate unmanaged allocations.
    byte* p1, p2, p3, p4;
    // The same four addresses, already cast to FooHeader*.
    FooHeader* h1, h2, h3, h4;

    public ToCastOrNotToCast()
    {
        p1 = (byte*)Marshal.AllocHGlobal(1024);
        p2 = (byte*)Marshal.AllocHGlobal(1024);
        p3 = (byte*)Marshal.AllocHGlobal(1024);
        p4 = (byte*)Marshal.AllocHGlobal(1024);
        h1 = (FooHeader*)p1;
        h2 = (FooHeader*)p2;
        h3 = (FooHeader*)p3;
        h4 = (FooHeader*)p4;
    }

    // Four increments per call, so the reported time is per increment.
    [Benchmark]
    [OperationsPerInvoke(4)]
    public void NoCast()
    {
        h1->PageNumber++;
        h2->PageNumber++;
        h3->PageNumber++;
        h4->PageNumber++;
    }

    // Same work, but the cast from byte* happens at each use.
    [Benchmark]
    [OperationsPerInvoke(4)]
    public void Cast()
    {
        ((FooHeader*)p1)->PageNumber++;
        ((FooHeader*)p2)->PageNumber++;
        ((FooHeader*)p3)->PageNumber++;
        ((FooHeader*)p4)->PageNumber++;
    }
}

And the following results:

          Method | Platform |       Jit |   AvrTime |    StdDev |             op/s |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X64 | LegacyJit | 0.2135 ns | 0.0113 ns | 4,683,511,436.74 |
          NoCast |      X64 | LegacyJit | 0.2116 ns | 0.0017 ns | 4,725,696,633.67 |
            Cast |      X64 |    RyuJit | 0.2177 ns | 0.0038 ns | 4,593,221,104.97 |
          NoCast |      X64 |    RyuJit | 0.2097 ns | 0.0006 ns | 4,769,090,600.54 |
            Cast |      X86 | LegacyJit | 0.7465 ns | 0.1743 ns | 1,339,630,922.79 |
          NoCast |      X86 | LegacyJit | 0.7474 ns | 0.1320 ns | 1,337,986,425.19 |
            Cast |      X86 |    RyuJit | 0.7481 ns | 0.3014 ns | 1,336,808,932.91 |
          NoCast |      X86 |    RyuJit | 0.7426 ns | 0.0039 ns | 1,346,537,728.81 |

Interestingly enough, the NoCast approach is marginally faster in three of the four setups; the exception is X86 with LegacyJit, where Cast is ahead by about 0.0009 ns.
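
One rough way to read the table is to compare the gap between the two averages with the reported StdDev. The sketch below does just that. NoiseCheck and WithinNoise are hypothetical names, not part of the benchmark library, and this is a crude heuristic rather than a proper significance test.

```csharp
using System;

// Hypothetical helper, not part of any benchmarking library.
public static class NoiseCheck
{
    // Returns true when the gap between the two averages is no larger
    // than the larger of the two standard deviations.
    public static bool WithinNoise(double avgA, double stdDevA,
                                   double avgB, double stdDevB)
    {
        return Math.Abs(avgA - avgB) <= Math.Max(stdDevA, stdDevB);
    }
}
```

With the X64 LegacyJit rows, WithinNoise(0.2135, 0.0113, 0.2116, 0.0017) returns true, since the 0.0019 ns gap is well below 0.0113 ns. With the X64 RyuJit rows, WithinNoise(0.2177, 0.0038, 0.2097, 0.0006) returns false, since the 0.0080 ns gap is larger than either StdDev, although it is still a tiny difference in absolute terms.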

Here is the assembly code for LegacyJit in x64:

With RyuJit, the Cast code is identical, and the only difference in the NoCast code is that mov edx, ecx becomes mov rdx, rcx.

As an aside, x64 assembly code is much easier to read than x86 assembly code.

In short, casting or not casting makes only a very minor performance difference. That means we can keep just the byte* and cast on demand, rather than also storing a FooHeader*, which saves a pointer field in the object. The object will be somewhat smaller, and if we are going to have a lot of them, that can be a pretty nice space saving.
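
To make the space saving concrete: the idea, as clarified in the comments, is to keep a single byte* field and cast to FooHeader* only when the header is needed. The following is a minimal sketch under assumptions. CastOnDemandPage and IncrementPageNumber are hypothetical names, and the FooHeader layout is a stand-in, since the real struct is not shown in this post.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in: the real FooHeader layout is not shown in the post.
public struct FooHeader
{
    public long PageNumber;
}

// Hypothetical wrapper that stores only the raw pointer.
public sealed unsafe class CastOnDemandPage : IDisposable
{
    // The single pointer field; no separate FooHeader* is kept.
    private byte* _base;

    // size must be at least the size of FooHeader.
    public CastOnDemandPage(int size)
    {
        _base = (byte*)Marshal.AllocHGlobal(size);
        // AllocHGlobal does not zero the memory, so initialize explicitly.
        Header->PageNumber = 0;
    }

    // The cast happens on every access instead of being cached in a field.
    private FooHeader* Header
    {
        get { return (FooHeader*)_base; }
    }

    // Increments the page number in place and returns the new value.
    public long IncrementPageNumber()
    {
        return ++Header->PageNumber;
    }

    public void Dispose()
    {
        if (_base == null) return;
        Marshal.FreeHGlobal((IntPtr)_base);
        _base = null;
    }
}
```

Compared with a class that holds both a byte* and a FooHeader*, each instance carries one pointer fewer (4 bytes on x86, 8 bytes on x64), which is where the saving comes from when there are many instances.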

Comments

19 Oct 2015
10:33 AM

OmariO

Can you explain "but not casting allows us to save a pointer reference in the object"? They are both pointers, aren't they?

19 Oct 2015
11:02 AM

Oliver Hallam

I'm confused by the performance difference you've observed. The differences are well within a standard deviation so it might just be noise.

The only difference there is with the offsets, which should be down to the order of the fields in your test class. I suspect if you swapped around your FooHeaders and your bytes at the top of the class then you'd see the generated assembly for the two tests swap round. It would be interesting to benchmark field order alone just to check.

I agree with your conclusion though that there's only a trivial difference between the two, and holding an additional copy of the pointer to avoid the cast is not beneficial.

19 Oct 2015
13:56 PM

Oren Eini

OmariO, Because there isn't any difference in performance, there is no incentive to have both a byte* and FooHeader* fields. So the size of the class can be smaller

19 Oct 2015
13:57 PM

Oren Eini

Oliver, I actually tested it both ways, and I don't see any meaningful difference between them

Oren Eini

CEO of RavenDB