Production postmortem: Houston, we have a problem
As you read this post, you might want to also consider letting this play in the background. We had a UDP port leak in RavenDB. We squashed it like a bug, but somehow it kept repeating.
We found one cause of it (and fixed it), finally. That was after several rounds of looking at the code and fixing a few “this error condition can lead to the socket not being properly disposed” issues.
Finally, we pushed to our own internal systems, monitored things, and saw that it was good. But the bloody bug kept repeating. Now, instead of manifesting as thousands of UDP ports, we had just a dozen or so, but they were (very) slowly increasing. And it drove us nuts. We had logging there, and we could see that we didn’t have the kind of problems that we had before. And everything looked good.
A full reproduction of the issue can be found here, but the relevant piece of code is here:
Timer timer = new Timer(async state =>
{
    try
    {
        var addresses = await Dns.GetHostAddressesAsync("time.nasa.gov");
        var endPoint = new IPEndPoint(addresses[0], 123);

        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(endPoint);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Client.SendTimeout = 100;

            await udpClient.SendAsync(new byte[] { 12, 32, 43 }, 3);
            await udpClient.ReceiveAsync();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        Environment.Exit(-1);
    }
});
timer.Change(500, 500);
As you can see, we issue a request to a time server, wrap the usage of the UDP socket in a using statement, make sure to have proper error handling, set up the proper timeouts, the works.
Our real code is actually awash with logging and detailed error handling, and we pored over it for a crazy amount of time to figure out what was going on.
If you run this code and watch the number of used UDP ports, you’ll see a very curious issue. It is always increasing. What is worse, there are no errors, nothing. It just goes into a black hole in the sky and doesn’t work.
In this case, I’ve explicitly created a malformed request, so it is expected that the remote server will not reply to me. That allows us to generate the proper scenario. In production, of course, we send the right value, and we typically get the right result, so we didn’t see this.
The error we had was with the timeout values. The documentation quite clearly states that they apply to the synchronous methods only, and since it doesn’t say a word about the async methods, they simply do not apply there. Given how UDP works, that makes perfect sense: to support a timeout on the async methods, the UdpClient would need to start a timer internally. However, given the API, it is very easy to see how we kept missing this.
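To see the distinction the documentation is describing, compare the synchronous and asynchronous receive calls against an endpoint that never replies. This is just a minimal sketch (the silent loopback endpoint and the tiny payload are made up for illustration): the synchronous Receive honors Client.ReceiveTimeout and throws a SocketException, while ReceiveAsync ignores it and simply never completes.

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class TimeoutDemo
{
    // Assumption: nothing is listening on this port, so no reply ever arrives.
    static readonly IPEndPoint Silent = new IPEndPoint(IPAddress.Loopback, 9999);

    static async Task Main()
    {
        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(Silent);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Send(new byte[] { 1, 2, 3 }, 3);

            IPEndPoint remote = null;
            try
            {
                // Synchronous: throws SocketException (TimedOut) after ~100ms.
                udpClient.Receive(ref remote);
            }
            catch (SocketException)
            {
                Console.WriteLine("Synchronous Receive timed out, as documented");
            }

            // Asynchronous: ReceiveTimeout is ignored, so this await never
            // completes unless a datagram actually arrives.
            await udpClient.ReceiveAsync();
        }
    }
}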
The real issue is that when we make a request to a server, and for whatever reason, the UDP reply packet is dropped, we just hang in an async manner. That is, we have an async call that will never return. That call holds the UDP port open, and over time, that shows up as a leak. That is pretty horrible, but the good thing is that once we knew what the problem was, fixing it was trivial.
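For completeness, here is a minimal sketch of what such a fix can look like; the exact code in RavenDB may differ, and the method name and the 500ms timeout are just placeholders. The key is that racing ReceiveAsync against Task.Delay lets the method leave the using block even when the reply never arrives, so the UdpClient gets disposed and the port is released.

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

static async Task<byte[]> QueryWithTimeoutAsync(IPEndPoint endPoint, byte[] request)
{
    using (var udpClient = new UdpClient())
    {
        udpClient.Connect(endPoint);
        await udpClient.SendAsync(request, request.Length);

        var receiveTask = udpClient.ReceiveAsync();
        var completed = await Task.WhenAny(receiveTask, Task.Delay(TimeSpan.FromMilliseconds(500)));
        if (completed != receiveTask)
        {
            // The reply was lost; throwing here exits the using block, which
            // disposes the UdpClient and releases the port that the still
            // pending ReceiveAsync was holding open.
            throw new TimeoutException("No reply from the time server within 500ms");
        }

        return (await receiveTask).Buffer;
    }
}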
Comments
When I saw the code, I immediately suspected a timeout issue... It's weird that ReceiveAsync doesn't have an overload to specify the timeout; actually, calling it without a timeout is almost certainly a mistake, since there's no guarantee that you'll receive anything, so there should only be an overload with a timeout.
You say the fix is trivial. But just in case it's not obvious to everyone, would you mind sharing it?
I've gone and taken a look at the UdpClient class and I can't see anything obvious on it that would allow you to determine when a response is available in a timely manner without polling - so is there something I'm missing on it, or are you dropping down to the Socket level and ignoring this class, or something else?
Damien, It is something like:
Task.WhenAny(udpClient.ReceiveAsync(), Task.Delay(100))
var receiveTask = udpClient.ReceiveAsync();
if (await Task.WhenAny(receiveTask, Task.Delay(TimeSpan.FromMilliseconds(500))) != receiveTask)
{
    throw new TimeoutException("your message here");
}
You could spare the massive logging, if all you have is a leak, by injecting the tracing code at runtime and hooking into the respective APIs, as shown in Generic Resource Leak Detection with ETW and EasyHook: http://geekswithblogs.net/akraus1/archive/2016/01/30/172079.aspx. Not sure if you wade manually through your logs and call stacks to find the excess UDP ports, but with this approach any resource imbalance will pop up in WPA immediately.
Alois, Seeing the resource imbalance wasn't the issue. Figuring out why was the problem.
Alois, That said, those look like very interesting tools.
I second Ayende's comments about those tools Alois - that's something quite impressive you've built. I've bookmarked it to give it a try next time I'm trying to track down a leak (just last week I found one in a third party library that kept thread-static data around until it was manually cleared - that wasn't called out in the docs - I now have a couple of GB of address space back!)
I have looked at the problem repro. That is indeed nasty. Except for many pinned async handles there are no threads piling up. Only more and more file handles are leaked, which point to \Device\Afd and are indeed socket handles. It looks like a nice memory leak with no indication of who is not releasing the handles.
ETW tracing in conjunction with memory dumps can give you enough information to solve otherwise pretty much unsolvable issues. Today, for example, I found that the Intel IPP library allocates and frees a memory block, while on a different thread another piece of external software calls HeapSetInformation on that just-freed memory block. With just a memory dump you would never have seen this. Although I do not have the source code for either one, I know that the handle passed to HeapSetInformation was never a valid heap handle, so it is certainly a programming bug in the other external software. That race condition showed up only 1/10 of the time, and only on some machines, which made it rather hard to repro. I have held trainings for ETW several times and I still find few people really stepping in because of the high learning curve. But things are getting better, albeit at a slower rate than I would like.
First off @Oren: thank you for taking the time to post this postmortem. I know that is something everyone says they will do when they encounter a problem, but it usually gets swept under the rug or simply forgotten.
@Alois: wow. Awesome blog post and perspective. I am a C# developer and I have spent a considerable amount of time learning ETW and am still perplexed by it. I would like to see more integration/tooling around this... and also the elephant in the room now is that it is not cross-platform (or can it even work with .NET Core?).
I am currently investigating Serilog (and Seq) for my logging needs. It just seems like there should be something as robust/awesome as what Alois demonstrates in his blog post to help fix the problems that are demonstrated in this blog post. It would be awesome if there was a standard logging tool such as what ETW provides that is easy to set up/configure and use in .NET.
I think the bottom line here is... a more comprehensive logging/instrumentation solution for .NET is sorely needed. :P
Two additional thoughts (sorry, I cannot edit/append my previous comment -- please consider using Disqus for your comments section :)): 1) Looks like Seq is close to what I was saying, but it doesn't look like it gets into the detail of what @Alois is describing in his blog post. More here: http://getseq.net 2) I've referenced this post in a comment within an outstanding vote I have to improve the .NET Asynchronous Programming Model. You can find that here: http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/9126493-improve-asynchronous-programming-model