Production postmortem: Houston, we have a problem


As you read this post, you might also want to consider letting this play in the background. We had a UDP port leak in RavenDB. We squashed it like a bug, but somehow it kept repeating.


We found one cause of it (and fixed it), finally. That was after several rounds of looking at the code and fixing a few “this error condition can lead to the socket not being properly disposed” issues.

Finally, we pushed to our own internal systems, monitored things, and saw that it was good. But the bloody bug kept repeating. Now, instead of manifesting as thousands of UDP ports, we had just a dozen or so, but they were (very) slowly increasing. And it drove us nuts. We had logging there, and we could see that we didn’t have the kind of problems that we had before. And everything looked good.

A full reproduction of the issue can be found here, but the relevant piece of code is here:

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

Timer timer = new Timer(async state =>
{
    try
    {
        // Resolve the time server and target the NTP port (123)
        var addresses = await Dns.GetHostAddressesAsync("time.nasa.gov");
        var endPoint = new IPEndPoint(addresses[0], 123);

        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(endPoint);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Client.SendTimeout = 100;
            await udpClient.SendAsync(new byte[] {12, 32, 43}, 3);
            await udpClient.ReceiveAsync();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        Environment.Exit(-1);
    }
});
timer.Change(500, 500);
Console.ReadLine(); // keep the process alive so the timer keeps firing

As you can see, we issue a request to a time server, wrap the usage of the UDP socket in a using statement, make sure we have proper error handling, set up the proper timeouts, the works.

Our real code is actually awash with logging and detailed error handling, and we pored over it for a crazy amount of time trying to figure out what was going on.

If you run this code and watch the number of used UDP ports, you’ll see a very curious issue: it is always increasing. What is worse, there are no errors, nothing. It just goes into a black hole in the sky and doesn’t work.
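One easy way to see the leak, assuming you are on a platform that supports IPGlobalProperties (this is just a sketch, not the monitoring we actually used), is to poll the machine’s active UDP listener count while the timer above is running:

using System;
using System.Net.NetworkInformation;
using System.Threading;

// Prints the number of active UDP listeners once a second; with the timer
// above firing, this count just keeps climbing.
while (true)
{
    var udpListeners = IPGlobalProperties.GetIPGlobalProperties().GetActiveUdpListeners();
    Console.WriteLine($"Active UDP ports: {udpListeners.Length}");
    Thread.Sleep(1000);
}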

In this case, I’ve explicitly created a malformed request, so it is expected that the remote server will not reply to me. That allows us to reliably reproduce the scenario. In production, of course, we send the right value, and we typically get the right result, so we didn’t see this.
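For context, a well-formed request to an (S)NTP server is a 48-byte packet whose first byte encodes the leap indicator, protocol version, and mode. Here is a minimal sketch of what a correct request looks like (generic SNTP, not necessarily the exact packet our production code sends):

// 48-byte SNTP client request; 0x1B = leap indicator 0, version 3, mode 3 (client).
// Sent instead of the malformed {12, 32, 43} payload, the server will actually reply.
var request = new byte[48];
request[0] = 0x1B;
await udpClient.SendAsync(request, request.Length);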

The error we had was in the timeout values. The documentation quite clearly states that they apply to the synchronous methods only, and since it doesn’t say a word about the async methods, they simply do not apply there. Given how UDP works, that makes perfect sense: to support a timeout on the async methods, the UdpClient would need to start a timer internally. However, given the API, it is very easy to see how we kept missing this.
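To make the difference concrete, here is a small sketch (reusing the endPoint from the snippet above) contrasting the two calls: the synchronous Receive honors Client.ReceiveTimeout and throws a SocketException after roughly 100ms, while ReceiveAsync ignores that setting entirely and simply never completes if no datagram arrives.

using (var udpClient = new UdpClient())
{
    udpClient.Connect(endPoint);
    udpClient.Client.ReceiveTimeout = 100;

    IPEndPoint remote = null;
    try
    {
        // Synchronous receive: throws SocketException (TimedOut) after ~100ms.
        udpClient.Receive(ref remote);
    }
    catch (SocketException)
    {
        // Expected when nothing arrives within the timeout.
    }

    // Asynchronous receive: ReceiveTimeout is ignored, so if no reply ever
    // arrives, this task never completes and the port stays open.
    var pending = udpClient.ReceiveAsync();
}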

The real issue is that when we make a request to a server and, for whatever reason, the UDP reply packet is dropped, we just hang in an async manner. That is, we have an async call that will never return. That call holds the UDP port open, and over time that shows up as a leak. That is pretty horrible, but the good thing is that once we knew what the problem was, fixing it was trivial.
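For illustration, here is a minimal sketch of one way to fix it (not necessarily the exact code we ended up with in RavenDB, and reusing endPoint and request from the snippets above): race the pending receive against a delay, and let disposal of the client release the port if the delay wins.

using (var udpClient = new UdpClient())
{
    udpClient.Connect(endPoint);
    await udpClient.SendAsync(request, request.Length);

    var receiveTask = udpClient.ReceiveAsync();
    var completed = await Task.WhenAny(receiveTask, Task.Delay(100));
    if (completed != receiveTask)
    {
        // No reply within 100ms; leaving the using block disposes the client,
        // which releases the UDP port instead of leaking it.
        throw new TimeoutException("No reply from the time server within 100ms");
    }
    var reply = receiveTask.Result;
}

The important part is that the lifetime of the port is now bounded by the delay, not by whether the remote server ever bothers to answer.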