Production postmortem: Houston, we have a problem


As you read this post, you might also want to consider letting this play in the background. We had a UDP port leak in RavenDB. We squashed it like a bug, but somehow it kept repeating.


We found one cause of it (and fixed it), finally. That was after several rounds of looking at the code and fixing a few “this error condition can lead to the socket not being properly disposed” issues.

Finally, we pushed to our own internal systems, monitored things, and saw that it was good. But the bloody bug kept repeating. Now, instead of manifesting as thousands of UDP ports, we had just a dozen or so, but they were (very) slowly increasing. And it drove us nuts. We had logging there, and we could see that we didn’t have the kind of problems that we had before. And everything looked good.

A full reproduction of the issue can be found here, but the relevant piece of code is here:

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

Timer timer = new Timer(async state =>
{
    try
    {
        // Resolve the time server and target the NTP port (123)
        var addresses = await Dns.GetHostAddressesAsync("time.nasa.gov");
        var endPoint = new IPEndPoint(addresses[0], 123);

        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(endPoint);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Client.SendTimeout = 100;
            await udpClient.SendAsync(new byte[] {12, 32, 43}, 3);
            await udpClient.ReceiveAsync();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        Environment.Exit(-1);
    }
});
timer.Change(500, 500);
Console.ReadLine(); // keep the process alive so the timer keeps firing

As you can see, we issue a request to a time server, wrap the usage of the UDP socket in a using statement, make sure we have proper error handling, set up the proper timeouts, the works.

Our real code is actually awash with logging and detailed error handling, and we pored over it for a crazy amount of time trying to figure out what was going on.

If you run this code and watch the number of used UDP ports, you’ll see a very curious issue: it is always increasing. What is worse, there are no errors, nothing. It just goes into a black hole in the sky and doesn’t work.
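One easy way to see the leak, assuming you are on a platform that supports IPGlobalProperties (this is just a sketch, not the monitoring we actually used), is to poll the machine’s active UDP listener count while the timer above is running:

using System;
using System.Net.NetworkInformation;
using System.Threading;

// Prints the number of active UDP listeners once a second; with the timer
// above firing, this count just keeps climbing.
while (true)
{
    var udpListeners = IPGlobalProperties.GetIPGlobalProperties().GetActiveUdpListeners();
    Console.WriteLine($"Active UDP ports: {udpListeners.Length}");
    Thread.Sleep(1000);
}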

In this case, I’ve explicitly created a malformed request, so it is expected that the remote server will not reply to me. That allows us to reliably reproduce the scenario. In production, of course, we send the right value, and we typically get the right result, so we didn’t see this.
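For context, a well-formed request to an (S)NTP server is a 48-byte packet whose first byte encodes the leap indicator, protocol version, and mode. Here is a minimal sketch of what a correct request looks like (generic SNTP, not necessarily the exact packet our production code sends):

// 48-byte SNTP client request; 0x1B = leap indicator 0, version 3, mode 3 (client).
// Sent instead of the malformed {12, 32, 43} payload, the server will actually reply.
var request = new byte[48];
request[0] = 0x1B;
await udpClient.SendAsync(request, request.Length);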

The error we had was in the timeout values. The documentation quite clearly states that they apply to the synchronous methods only, and since it doesn’t say a word about the async methods, they simply do not apply there. Given how UDP works, that makes perfect sense: to support a timeout on the async methods, the UdpClient would need to start a timer internally. However, given the API, it is very easy to see how we kept missing this.
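To make the difference concrete, here is a small sketch (reusing the endPoint from the snippet above) contrasting the two calls: the synchronous Receive honors Client.ReceiveTimeout and throws a SocketException after roughly 100ms, while ReceiveAsync ignores that setting entirely and simply never completes if no datagram arrives.

using (var udpClient = new UdpClient())
{
    udpClient.Connect(endPoint);
    udpClient.Client.ReceiveTimeout = 100;

    IPEndPoint remote = null;
    try
    {
        // Synchronous receive: throws SocketException (TimedOut) after ~100ms.
        udpClient.Receive(ref remote);
    }
    catch (SocketException)
    {
        // Expected when nothing arrives within the timeout.
    }

    // Asynchronous receive: ReceiveTimeout is ignored, so if no reply ever
    // arrives, this task never completes and the port stays open.
    var pending = udpClient.ReceiveAsync();
}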

The real issue is that when we make a request to a server and, for whatever reason, the UDP reply packet is dropped, we just hang in an async manner. That is, we have an async call that will never return. That call holds the UDP port open, and over time that shows up as a leak. That is pretty horrible, but the good thing is that once we knew what the problem was, fixing it was trivial.
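For illustration, here is a minimal sketch of one way to fix it (not necessarily the exact code we ended up with in RavenDB, and reusing endPoint and request from the snippets above): race the pending receive against a delay, and let disposal of the client release the port if the delay wins.

using (var udpClient = new UdpClient())
{
    udpClient.Connect(endPoint);
    await udpClient.SendAsync(request, request.Length);

    var receiveTask = udpClient.ReceiveAsync();
    var completed = await Task.WhenAny(receiveTask, Task.Delay(100));
    if (completed != receiveTask)
    {
        // No reply within 100ms; leaving the using block disposes the client,
        // which releases the UDP port instead of leaking it.
        throw new TimeoutException("No reply from the time server within 100ms");
    }
    var reply = receiveTask.Result;
}

The important part is that the lifetime of the port is now bounded by the delay, not by whether the remote server ever bothers to answer.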