Production Postmortem: The Razor Suicide


Unlike previous posts in this series, this is actually something that happened to our own production server today. It resulted in our website being inaccessible for a couple of hours, and like most such stories, its root cause is tremendously simple; through a series of unfortunate accidents, it escalated into a major issue.

First, this post is dedicated to this book, which should be read by any self-respecting developer whose code is expected to hit production.

[Cover image: Release It!]

This post was also written only a few hours after the incident was resolved, before we actually implemented anything except a temporary workaround. I’ll probably have another post in a couple of days to talk about the steps we are going to take to prevent a repeat of this incident.

The incident started innocently enough, when one of the guys on the team discovered that the startup time of a certain instance had jumped by a lot. Investigating why, he realized that the issue was extremely slow responses from our server. That was actually a cause for triple concern. First, why are we accepting such slow responses instead of time-limiting them? Second, why are we making a remote synchronous call during startup? And third, why on Earth is our server so slow?
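The first of those concerns is cheap to address. Here is a minimal sketch of time-limiting such a call, assuming the startup check uses HttpClient; the method name and the URL are illustrative, not our actual code:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class StartupChecks
{
    // Illustrative sketch only: the real check and endpoint are not shown in this post.
    public static async Task<string> FetchLatestVersionAsync()
    {
        using (var client = new HttpClient())
        {
            // Fail fast instead of letting a slow server hold up startup indefinitely.
            client.Timeout = TimeSpan.FromSeconds(5);
            return await client.GetStringAsync("https://ravendb.net/latest-version");
        }
    }
}
```

On timeout, GetStringAsync throws, and the caller can fall back to a cached value instead of blocking startup.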

Logging into the server, it didn’t take long to see what the problem was. The www.RavenDB.net website (the code that runs the RavenDB website, not RavenDB itself) was consuming a lot of CPU and quite a bit of memory. In a bid to restore the other services that reside on the same box, we reset the process. Our main concern at the time was to restore service as soon as possible; we planned to investigate further through the logs.

However, within a few minutes, the www.RavenDB.net website started consuming more and more resources. At that point, we started considering a DoS attack of some sort and looked at the logs. The logs did show a very high number of requests, much more than I would expect. But looking further into them, it appeared that they were mostly bots of various kinds indexing our site.

Considering that this might be a case of Google hammering us, we configured a robots.txt on the site and waited to see if it would have an impact. It didn’t.
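For reference, a minimal robots.txt along these lines is the usual first attempt (the exact rules we deployed are not shown here). It is purely advisory, which goes some way toward explaining why it had no effect:

```
# Advisory only: well-behaved crawlers may honor this, abusive bots will not.
User-agent: *
Crawl-delay: 10
```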

The next step was to take a dump of the process and then analyze it. During this period, we had to shut down www.RavenDB.net, because it was killing all the other services running on the server.

Looking at the dump in WinDBG, we started with the obvious commands.

  • !runaway – to find out the thread CPU times
  • switching to the busiest threads
  • !clrstack – to see what each one is doing
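In a WinDBG session, that sequence looks roughly like this (SOS needs to be loaded first; the thread number is illustrative):

```
$$ Load the SOS extension so managed-stack commands are available
.loadby sos clr
$$ List user-mode CPU time consumed by each thread
!runaway
$$ Switch to the busiest thread (thread 42 is just an example)
~42s
$$ Dump the managed call stack of the current thread
!clrstack
```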

Honestly, this is a much nicer way of looking at this, though:

[Screenshot: thread stacks from the dump, showing the busy threads inside Razor template parsing]

As you can see, the threads are actually busy parsing a Razor template, and seem to be doing so on a fairly continuous basis, consuming all system resources.

At that point, I started getting concerned for the well-being of the poor guy’s inbox as a result of this code. That was also the point where we did what should probably have been our first action, and looked at the error log of the website.

Previously, we had looked at the live metrics and the request log, but hadn’t considered looking into the error log for the system. The error log for the website, for today alone, was 6GB in size, and it was pretty much full of errors such as:

- error: (221, 88) 'HibernatingRhinos.Orders.Common.EmailProcessing.EmailTemplates.RavenDBWebsite.Models.DownloadQuestionMailInput' does not contain a definition for 'Unsubscribe' and no extension method 'Unsubscribe' accepting a first argument of type

And at that point, we had enough to point a suspicious finger. We have an email that we send out, and we used to have a valid template for it. At some point, the code was changed and the Unsubscribe property was removed. Nothing broke, because the template is just a file, not actually compiled code. However, in production, when we tried to send the email, Razor would parse the text, fail compilation because of the missing member, and basically throw a hissy fit.
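To make the failure mode concrete, here is a sketch of what such a template looks like. The file name and the Name property are illustrative; the model type is the one from the error above:

```
@* DownloadQuestion.cshtml - illustrative file name *@
@model HibernatingRhinos.Orders.Common.EmailProcessing.EmailTemplates.RavenDBWebsite.Models.DownloadQuestionMailInput

Hello @Model.Name,
...
@* The model class no longer has an Unsubscribe property, but since this template
   is just a file, the build still succeeds. Razor only compiles it when the email
   is actually sent, so the compilation failure happens at runtime, in production. *@
Click here to unsubscribe: @Model.Unsubscribe
```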

Update: We investigated further, and it looks like the following was the actual “solution” to the outage:

[Screenshot: the change that served as the “solution” to the outage]

The “solution” is in quotes because, while it fixes the problem, we still need to implement steps to ensure that something like this doesn’t happen again.

Unfortunately, at that point, we would consider this email as failed and move on to the next one. That next one would also fail, and so would the next one, and so on. Because all of them failed, they would all get picked up again on the next run.
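Schematically, the processing loop behaved something like this (a sketch with made-up helper names, not our actual code):

```csharp
// Sketch of the failure amplification, not our actual code. The helpers
// (GetPendingEmails, RenderTemplate, Send, MarkAsSent, Log) are hypothetical.
public void ProcessPendingEmails()
{
    foreach (var email in GetPendingEmails())
    {
        try
        {
            // Razor parses and compiles the template on every single attempt;
            // with the missing Unsubscribe member, this always throws.
            var body = RenderTemplate(email);
            Send(email, body);
            MarkAsSent(email);
        }
        catch (Exception e)
        {
            Log.Error(e); // hence the 6GB error log in a single day
            // The email stays pending, so the next run picks it up again,
            // and the set of always-failing emails only grows.
        }
    }
}
```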

Once we knew where the problem was, the workaround was to deploy a version with no email sending. For this weekend, that will work. But come Sunday, someone is going to go over this piece of code with a very fine-toothed comb. I’ll post more about it once that actually rolls around.

More posts in "Production Postmortem" series:

  1. (12 Dec 2023) The Spawn of Denial of Service
  2. (24 Jul 2023) The dog ate my request
  3. (03 Jul 2023) ENOMEM when trying to free memory
  4. (27 Jan 2023) The server ate all my memory
  5. (23 Jan 2023) The big server that couldn’t handle the load
  6. (16 Jan 2023) The heisenbug server
  7. (03 Oct 2022) Do you trust this server?
  8. (15 Sep 2022) The missed indexing reference
  9. (05 Aug 2022) The allocating query
  10. (22 Jul 2022) Efficiency all the way to Out of Memory error
  11. (18 Jul 2022) Broken networks and compressed streams
  12. (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
  13. (12 Jul 2022) The data corruption in the node.js stack
  14. (11 Jul 2022) Out of memory on a clear sky
  15. (29 Apr 2022) Deduplicating replication speed
  16. (25 Apr 2022) The network latency and the I/O spikes
  17. (22 Apr 2022) The encrypted database that was too big to replicate
  18. (20 Apr 2022) Misleading security and other production snafus
  19. (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
  20. (13 Dec 2021) The memory leak that only happened on Linux
  21. (17 Sep 2021) The Guinness record for page faults & high CPU
  22. (07 Jan 2021) The file system limitation
  23. (23 Mar 2020) high CPU when there is little work to be done
  24. (21 Feb 2020) The self signed certificate that couldn’t
  25. (31 Jan 2020) The slow slowdown of large systems
  26. (07 Jun 2019) Printer out of paper and the RavenDB hang
  27. (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
  28. (25 Dec 2018) Handled errors and the curse of recursive error handling
  29. (23 Nov 2018) The ARM is killing me
  30. (22 Feb 2018) The unavailable Linux server
  31. (06 Dec 2017) data corruption, a view from INSIDE the sausage
  32. (01 Dec 2017) The random high CPU
  33. (07 Aug 2017) 30% boost with a single line change
  34. (04 Aug 2017) The case of 99.99% percentile
  35. (02 Aug 2017) The lightly loaded trashing server
  36. (23 Aug 2016) The insidious cost of managed memory
  37. (05 Feb 2016) A null reference in our abstraction
  38. (27 Jan 2016) The Razor Suicide
  39. (13 Nov 2015) The case of the “it is slow on that machine (only)”
  40. (21 Oct 2015) The case of the slow index rebuild
  41. (22 Sep 2015) The case of the Unicode Poo
  42. (03 Sep 2015) The industry at large
  43. (01 Sep 2015) The case of the lying configuration file
  44. (31 Aug 2015) The case of the memory eater and high load
  45. (14 Aug 2015) The case of the man in the middle
  46. (05 Aug 2015) Reading the errors
  47. (29 Jul 2015) The evil licensing code
  48. (23 Jul 2015) The case of the native memory leak
  49. (16 Jul 2015) The case of the intransigent new database
  50. (13 Jul 2015) The case of the hung over server
  51. (09 Jul 2015) The case of the infected cluster