Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,520
|
Comments: 51,141
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 765 words

I’m currently deep in the process of modifying the internals of Voron, trying to eke out more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t even been looked at for a long time.

In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.

We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

  • Change something.
  • Verify that it works better than before.
  • Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation.

I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.

One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.

I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.

And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.

Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.

My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.

It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing for my Fields medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.

As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff).

The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.

time to read 1 min | 122 words

I’m trying to pay a SaaS bill online, and I run into the following issue. I have insufficient permissions to pay the invoice on the account. No insufficient funds, which is something that you’ll routinely run into when dealing with payment processing. But insufficient permissions!

 

Is… paying something an act that requires permissions? That something that happens? Can I get more vulnerabilities like that? When I get people to drive-by pay for my bills?

I can’t think of a scenario where you are prevented from paying to the provider. That is… weird.

And now I’m in this “nice” position where I have to chase after the provider to give them money, because otherwise they’ll close the account.

time to read 2 min | 369 words

RavenDB is a .NET application, written in C#. It also has a non trivial amount of unmanaged memory usage. We absolutely need that to get the proper level of performance that we require.

With managing memory manually, there is also the possibility that we’ll mess it up. We run into one such case, when running our full test suite (over 10,000 tests) we would get random crashes due to heap corruption. Those issues are nasty, because there is a big separation between the root cause and the actual problem manifesting.

I recently learned that you can use the gflags tool on .NET executables. We were able to narrow the problem to a single scenario, but we still had no idea where the problem really occurred. So I installed the Debugging Tools for Windows and then executed:

 &"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\gflags.exe" /p /enable C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe

What this does is enable a special debug heap at the executable level, which applies to all operations (managed and native memory alike).

With that enabled, I ran the scenario in question:

PS C:\Work\ravendb-6.0\test\Tryouts>  C:\Work\ravendb-6.0\test\Tryouts\bin\release\net7.0\Tryouts.exe
42896
Starting to run 0
Max number of concurrent tests is: 16
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.
         To attach debugger to test process (x64), use proc-id: 42896. Url http://127.0.0.1:51595
Ignore request for setting processor affinity. Requested cores: 3. Number of cores on the machine: 32.  License limits: A: 3/32. Total utilized cores: 3. Max licensed cores: 1024
http://127.0.0.1:51595/studio/index.html#databases/documents?&database=Should_correctly_reduce_after_updating_all_documents_1&withStop=true&disableAnalytics=true
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
    at Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Encode(System.ReadOnlySpan`1<Byte>, System.Span`1<Byte>)
    at Sparrow.Server.Compression.HopeEncoder`1[[Sparrow.Server.Compression.Encoder3Gram`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], Sparrow.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]].Encode(System.ReadOnlySpan`1<Byte> ByRef, System.Span`1<Byte> ByRef)
    at Voron.Data.CompactTrees.PersistentDictionary.ReplaceIfBetter[[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593],[Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server, Version=6.0.0.0, Culture=neutral, PublicKeyToken=37f41c7f99471593]](Voron.Impl.LowLevelTransaction, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Raven.Server.Documents.Indexes.Persistence.Corax.CoraxDocumentTrainEnumerator, Voron.Data.CompactTrees.PersistentDictionary)
    at Raven.Server.Documents.Indexes.Persistence.Corax.CoraxIndexPersistence.Initialize(Voron.StorageEnvironment)

That pinpointed things so I was able to know exactly where we are messing up.

I was also able to reproduce the behavior on the debugger:

image (3)

This saved me hours or days of trying to figure out where the problem actually is.

time to read 9 min | 1666 words

I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all passed, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. I did… not like the numbers I got back. It makes sense, since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output.

When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue.

If I index the data and query on it, I get the results I expect. If I restart the process and run the same query, I get an ExecutionEngineException. Trying to debug those is a PITA. In this case, I’m 100% at fault, we are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way. But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes about 20 minutes. In debug mode, it takes a lot longer.

Trying to find an error there, especially one that can only happen after you restart the process is going to be a challenging task. But this isn’t my first rodeo. Part of good system design is knowing how to address just these sorts of issues.

The indexing process inside RavenDB is single-threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.

Here is what this looks like in terms of code:

In this case, you can see that this is a development only feature, so it is really bare-bones one. What it does is capture the indexing and commit operations on the system and write them to a file. I have another piece of similarly trivial code that reads and applies them, as shown below. Don’t bother to dig into that, the code itself isn’t really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.

The code itself isn’t much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(), I was able to reproduce the crash using this code.

That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million.

The key aspect of this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and start / stop the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode:

1 With 0
2 With 10000
3 With 10000
4 With 10000
5 With 10000
6 With 10000
7 With 10000
8 With 10000
9 With 10000
10 With 10000
11 With 10000
Fatal error. Internal CLR error. (0x80131506)

So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:

1 With 0
2 With 10000
Process terminated. Assertion failed.
   at Voron.Data.Containers.Container.Get(Low...

So now I have a repro in 30,000 items, what is even better, a debug assertion was fired, so I have a really good lead into what is going on.

The key challenge in this bug is that it is probably triggered as a result of a commit and an index of the next batch. There is a bunch of work that we do around batch optimizations that likely cause this sort of behavior. By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000 and have a lead into what is going on.

With that, I can now start applying more techniques to narrow down what is going on. But by far the most important aspect as far as I’m concerned is the feedback cycle. I can now hit F5 to run the code and encounter the problem in a few seconds.

image

It looks like we hit a debug assertion because we keep a reference to an item that was already freed. That is really interesting, and now I can find out which item and then figure out why this is the case. And at each point, I can simply go one step back in the investigation, and reproduce the state, I have to hit F5 and wait a bit. This means that I can be far more liberal in how I figure out this bug.

This is triggered by a query on the indexed data, and if I follow up the stack, I have:

image

This is really interesting, I wonder… what happens if I query before I restart the database? With this structure, this is easy to do.

image

This is actually a big relief. I had no idea why restarting the database would cause us to expose this bug.

Another thing to note is that when I ran into the problem, I reproduced this on a query that sorted on a single field. In the test code, I’m testing on all fields, so that may be an asset in exposing this faster.

Right now the problem reproduces on the id field, which is unique. That helps, because it removes a large swath of code that deals with multiple terms for an entry. The current stage is that I can now reproduce this issue without running the queries, and I know exactly where it goes wrong.

image

And I can put a breakpoint on the exact location where this entry is created:

image

By the way, note that I’m modifying the code instead of using a conditional breakpoint. This is because of the performance difference. For a conditional breakpoint, the debugger has to stop execution, evaluate the condition and decide what to do. If this is run a lot, it can have a huge impact on performance. Easier to modify the code. The fact that I can do that and hit F5 and get to the same state allows me to have a lot more freedom in the ergonomics of how I work.

This allows me to discover that the entry in question was created during the second transaction. But the failure happens during the third, which is really interesting. More to the point, it means that I can now do this:

image

With the idea that this will trigger the assert on the exact entry that cause the problem. This is a good idea, and I wish that it worked, but we are actually doing a non-trivial amount of work during the commit process, so now we have a negative feedback and another clue. This is happening in the commit phase of the indexing process. It’s not a big loss, I can do the same in the commit process as well. I have done just that and now I know that I have a problem when indexing the term: “tweets/1212163952102137856”. Which leads to this code:

image

And at this point, I can now single step through this and figure out what is going on, I hope.

When working on complex data structures, one of the things that you need to do is to allow to visualize them. Being able to manually inspect the internal structure of your data structures can save you a lot of debugging. As I mentioned, this isn’t my first rodeo. So when I narrowed it down to a specific location, I started looking into exactly what is going on.

Beforehand, I need to explain a couple of terms (pun intended):

  • tweets/1212163952102137856 – this is the entry that triggers the error.
  • tweets/1212163846623727616 – this is the term that should be returned for 1679560

Here is what the structure looks like at the time of the insert:

image

You can notice that the value here for the last page is the same as the one that we are checking for 1679560.

To explain what is going on will take us down a pretty complex path that you probably don’t care about, but the situation is that we are keeping track of the id in two locations. Making sure to add and remove it in both locations as appropriate. However, at certain points, we may decide to shuffle things around inside the tree, and we didn’t sync that up properly with the rest of the system, leading to a dangling reference.

Now that I know what is going on, I can figure out how to fix it. But the story of this post was mostly about how I figured it out, not the bug itself.

The key aspect was to get to the point where I can reproduce this easily, so I can repeat it as many times that is needed to slowly inch closer to the solution.

time to read 4 min | 741 words

I’m doing a pretty major refactoring inside of RavenDB right now. I was able to finish a bunch of work and submitted things to the CI server for testing. RavenDB has several layers of tests, which we run depending on context.

During development, we’ll usually run the FastTests. About 2,300 tests are being run to validate various behaviors for RavenDB, and on my machine, they take just over 3 minutes to complete. The next tier is the SlowTests, which run for about 3 hours on the CI server and run about 26,000 tests. Beyond that we actually have a few more layers, like tests that are being run only on the nightly builds and stress tests, which can take several minutes each to complete.

In short, the usual process is that you write the code and write the relevant tests. You also validate that you didn’t break anything by running the FastTests locally. Then we let CI pick up the rest of the work. At the last count, we had about 9 dedicated machines as CI agents and given our workload, an actual full test run of a PR may complete the next day.

I’m mentioning all of that to explain that when I reviewed the build log for my PR, I found that there were a bunch of tests that failed. That was reasonable, given the scope of my changes. I sat down to grind through them, fixing them one at a time. Some of them were quite important things that I didn’t take into account, after all. For example, one of the tests failed because I didn’t account for sorting on a dynamic numeric field. Sorting on a numeric field worked, and a dynamic text field also worked. But dynamic numeric field didn’t. It’s the sort of thing that I would never think of, but we got the tests to cover us.

But when I moved to the next test, it didn’t fail. I ran it again, and it still didn’t fail. I ran it in a loop, and it failed on the 5th iteration. That… sucked. Because it meant that I had a race condition in there somewhere. I ran the loop again, and it failed again on the 5th. In fact, in every iteration I tried, it would only fail on the 5th iteration.

When trying to isolate a test failure like that, I usually run that in a loop, and hope that with enough iterations, I’ll get it to reproduce. Having it happen constantly on the 5th iteration was… really strange. I tried figuring out what was going on, and I realized that the test was generating 1000 documents using a random. The fact that I’m using Random is the reason it is non-deterministic, of course, except that this is the code inside my test base class:

image

So this is properly initialized with a seed, so it will be consistent.

Read the code again, do you see the problem?

image

That is a static value. So there are two problems here:

  • I’m getting the bad values on the fifth run in a consistent manner because that is the set of results that reproduce the error.
  • This is a shared instance that may be called from multiple tests at once, leading to the wrong result because Random is not thread safe.

Before fixing this issue so it would run properly, I decided to use an ancient debugging technique. It’s called printf().

In this case, I wrote out all the values that were generated by the test and wrote a new test to replay them. That one failed consistently.

The problem was that it was still too big in scope. I iterated over this approach, trying to end up with a smaller section of the codebase that I could invoke to repeat this issue. That took most of the day. But the end result is a test like this:

As you can see, in terms of the amount of code that it invokes, it is pretty minimal. Which is pretty awesome, since that allowed me to figure out what the problem was:

image

I’ve been developing software professionally for over two decades at this point. I still get caught up with things like that, sigh.

time to read 5 min | 916 words

I’m working on improving the performance of Corax, RavenDB’s new search engine. Along the way, I introduced a bug, a fairly nasty one. At a random location, while indexing a ~50 million documents corpus, we are getting an access violation exception. That means that I messed something up.

That makes sense, given that my changes were mostly about making things lower-level. Working directly with pointers and avoiding length checks. At our speed, even the use of Span can be a killer for performance, and we want to be as close to the raw metal as possible. The particular changeset that I was working on was able to improve the indexing speed from 90,000 per second to 120,000 per second. That is a change that I absolutely want to keep, so I started investigating it.

I mentioned that it is a fairly nasty problem. A truly nasty problem would be heap corruption that is discovered after the fact and is very hard to trace. In this case, it was not consistent, which is really strange. One of the important aspects of Corax is that it is single-threaded, which means that a lot of complexity is out the window. It means that for the same input, we always have the same behavior. If there is any variance, such as not crashing all the time, it means that there are external factors involved.

At any rate, given that it happened at least half the time, I was able to attach WinDBG to the process and wait for the exception to happen, this is what I got:

(5e20.1468): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
Corax!Corax.IndexWriter.AddEntriesToTermResultViaSmallPostingList+0x953:
00007ffa`24dcea53 c4e261902411    vpgatherdd xmm4,dword ptr [rcx+xmm2],xmm3 ds:0000026d`516514e7=????????

Now, look at the last line, that is an interesting one, we use the VPGATHERDD assembly instruction. It is gathering packed DWORD values, in C#, this is generated using the Avx2.GatherVector128() method. We are using that to do some bit packing in this case, so this makes a lot of sense.

Next, let’s see what we get from the exception:

0:074> .exr -1
ExceptionAddress: 00007ffafc2bfe7c (KERNELBASE!RaiseException+0x000000000000006c)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000080
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 0000026d51650000
Attempt to read from address 0000026d51650000

All of this points to an out-of-bounds read, but why is that? The call we have for GatherVector128() is used inside a method named: ReadAvx2(). And this method is called like this:

private unsafe static ulong Read(int stateBitPos, byte* inputBufferPtr, int bitsToRead, int inputBufferSize, out int outputStateBit)
{
    if ((stateBitPos + bitsToRead) / 8 >= inputBufferSize)
        throw new ArgumentOutOfRangeException();
    if ( Avx2.IsSupported)
    {
        return ReadAvx2(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
    }
    return ReadScalar(stateBitPos, inputBufferPtr, bitsToRead, out outputStateBit);
}

It is an optimized approach to read some bits from a buffer, I’ll skip the details on exactly how this works. As you can see, we have a proper bounds check here, ensuring that we aren’t reading past the end of the buffer.

Except…

That we aren’t actually checking this. What we are doing is checking that we can access the bytes range, but consider the following scenario:

image

We have a memory page and a buffer that is located toward the end of it.  We are now trying to access the last bit in the buffer, using ReadAvx2(). If we’ll check the actual bytes range, it will pass, we are trying to access the last byte.

However, we are going to call GatherVector128(), which means that we’ll actually access 16 bytes(!), and only the first byte is in the valid memory range, the rest is going to be read from the next page, which isn’t mapped.

This also explains why we are not always failing. If the next page is valid (which is subject to the decisions of the operating system allocator), it will pass. So that is why we didn’t have 100% reproduction. In fact, this is the sort of bug that is very easy to hide for a very long time in the system, given that it is dependent on the actual memory structure of the application.

Once we figured out what was going on, it was pretty easy to understand, but the fact that the AVX instructions will read after the end of the buffer was really confusing. Because even when we used Span, and its range checks, it would be completely ignored. Makes total sense, given that those aren’t really methods, but compiler intrinsics that are translated to direct machine instructions.

Amusingly enough, now that we found the problem, we ran into something very similar a long while ago. Then it was the wrong instruction being used (loading a word, instead of a byte), that would fail, but the same overal setup. It will sometimes fail, depending on the state of the next page in the memory.

We actually built some tooling around managing that, we call that electric fence memory. We allocate memory so any out-of-band access would always hit invalid memory, stopping us in our tracks. That means that I can get easy reproduction of those sorts of issues, and once we have that, the rest isn’t really that interesting, to be honest. It’s just a normal bug fix. It’s the hunt for the root cause that is both incredibly frustrating and quite rewarding.

time to read 2 min | 393 words

A customer reported a scenario where RavenDB was using stupendous amounts of memory. In the orders of tens of GB on a system that didn’t have that much load.

Our first suspicion was that this is an issue with reading the metrics, since RavenDB will try to keep as much of the data in memory, which sometimes leads users to worry. I spoke about this at length in the past.

In this case, that wasn’t the case. We were able to drill down into the exact cause of the memory usage and we found out that RavenDB was using an abnormally high amount of memory. The question was why that was, exactly.

We looked into the common operations on the server, and we found a suspicious query, it looked something like this:

from index 'Sales/Actions'
where endsWith(WorkflowStage, '/Final')

The endsWith query was suspicious, so we looked into that further. In general, endsWith requires us to scan all the unique terms for a particular field, but in most cases, there aren’t that many unique values for a field. In this case, however, that wasn’t the case, here are some of the values for WorkflowStage:

  • Workflows/3a1af12a-b5d2-4c96-9348-177ebaacab6c/Step-2
  • Workflows/6aacc86c-2f28-4b8b-8dee-1024314d5add/Final

In total, there were about 250 million sales in the database, each one of them with a unique WorflowStage value.

What does this mean, in terms of RavenDB query execution? Well, the fields are indexed, but we need to effectively do:

This isn’t the actual code, but it will show you what is going on.

In other words, in order to process this query, we have to scan (and materialize) all 250 million unique terms for this field. Obviously that is going to consume a lot of memory.

But what is the solution to that? Instead of doing an expensive endsWith query, we can move the computation from the query time to the index time.

In other words, instead of indexing the WorkflowStage field  as is, we’ll extract the information we want from it. The index would have one of those:

IsFinalWorkFlowStage = doc.WorkflowStage.EndsWith(“/Final”),

WorkflowStagePostfix = doc.WorkflowStage.Split(‘/’).Last()

The first one will check whether the value is final or not, while the second just gets the (one of hopefully a few) postfixes for the field. We can then query using equality instead of endsWith, leading to far better performance and greatly reduced memory usage, since we don’t need to materialize any values during the query.

time to read 2 min | 277 words

image A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.

However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.

The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).

After verifying the situation, the on-call engineer escalated the issue. The problem was weird. There was enough memory, for sure, but for some reason RavenDB would be unable to run properly.

An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.

Once we figured that out, it was actually easy to understand what is going on.

There are actually quite a few limits that you have to take into account. I talked about them here. In that post, the issue was the maximum number of tasks defined by the system. After which, you can no longer create new threads.

In this case, the suspect was: vm.max_map_count.

Beyond just total memory, Linux has a limit on the number of memory mappings that a process may have. And RavenDB uses Voron, which is based on mmap(), and each database and each index typically have multiple maps going on.

Given the number of databases involved…

The solution was to increase the max_map_count and add a task for us, to give a warning to the user ahead of time when they are approaching the system's limits.

time to read 11 min | 2005 words

imageA user reported that they observed nodes in the cluster “going dark”. Basically, they would stop communicating with the rest of the cluster, but would otherwise appear functional. Both the internal and external metrics were all fine, the server would just stop responding to anything over the network. The solution for the problem was to restart the service (note, the service, not the whole machine), but the problem would happen every few days.

As you can imagine, we are taking this sort of thing very seriously, so we looked into the problem. And we came up short. The problem made absolutely no sense. The problem occurred on a (minor) version migration, but there was absolutely nothing related to this that could cause it. What was really weird was that the service itself continue to work. We could see log entries being written and it was able to execute scheduled backups, for example. It would just refuse to talk to us over the network.

That was super strange, since the network itself was fine. All the monitoring systems were green, after all. For that matter, the user was able to SSH into the system to restart the service. This didn’t match with any other issue we could think of. Since the user worked around the problem by restarting the server, we didn’t have a lead.

Then we noticed the exact same problem in one of our cloud instances, and there we have much better diagnostic capabilities. Once we had noticed a problematic server, we were able to SSH into that and try to figure out what was going on.

Here is what we found out:

  • The server will not respond to HTTP(s) communication either from outside the machine or by trying to connect from inside the machine.
  • The server will respond to SNMP queries both from inside the machine and outside of it (which is how we typically monitor the system).

When we designed RavenDB, we implemented a “maintenance hatch” for such scenarios, in addition to using HTTP(s) for communication, RavenDB also exposes a named pipe that allows you to connect to the server without going through the network at all. This ensures that if you have administrator privileges on the server, you are able to connect even if there are network issues, certificate problems, etc.

Here is the kicker. Under this particular situation, we could not activate this escape hatch. That is not supposed to be possible. Named pipes on Linux, where we run into the problem, are basically Unix Sockets. A network issue such as a firewall problem or something similar isn’t going to affect them.

At the same time, we were able to communicate with the process using SNMP. What is the problem?

Lacking any other options, we dumped the process, restarted the service, and tried to do the analysis offline. We couldn’t find any problem. All the details we looked at said that everything was fine, the server was properly listening to new connections and it should work. That was… weird.

And then it happened again, and we did the same analysis, and it came back the same. We were clueless. One of the things that we updated between versions was the .NET runtime that we were using, so we opened an issue to see if anyone ran into the same problem.

And then it happened again. This time, we knew that just looking at the dump wouldn’t help us, so we tried other avenues. Linux has a pretty rich set of knobs and dials that you can look at to see what was going on. We suspected that this may be an issue with running out of file descriptors, running out of memory, etc.

We tried looking into what is going on inside the process using strace, and everything was fine. The trace clearly showed that the server was processing requests and was able to send and receive data properly.

Wait, go through that statement again please!

It is fine? But the reason we are using strace is that there is a problem. It looks like the problem fixed itself. That was annoying, because we were hoping to use the trace to figure out what is going on. We added more monitoring along the way, which would let us know if the server found itself isolated. And we waited.

The next time we ran into the problem, the first thing we did was run strace, we needed to get the root cause as soon as possible, and we were afraid that it would fix itself before we had a chance to get to the root cause. The moment we used strace, the server got back online, continuing as if there was never any issue.

Over the next few instances of this issue, we were able to confirm the following observations:

  1. The service would stop responding to TCP and Unix Sockets entirely.
  2. There were no firewall or network issues.
  3. The service was up and functional, tailing the log showed activity.
  4. We could query the server state using SNMP.
  5. Running strace on the service process would fix the problem.

There are a few more things, the actual trigger for the fix wasn’t strace itself. It was the ptrace() call, which it uses. That would cause the service to start responding again. The ptrace() call is basically the beginning and the end of debugging under Linux. Everything uses it.

If you want to dump a memory process, you start with ptrace(). You want to trace the calls, ptrace(). You want to debug the process? GDB will start by calling ptrace(), etc.

And doing that would alleviate the problem.

That was… quite annoying.

We still had absolutely no indication of what the root cause even was.

We suspected it may be something inside Kestrel that was causing a problem. But that wouldn’t affect the named pipes / Unix sockets that we also saw.

Networking worked, because SNMP did. We thought that this may be because SNMP uses UDP instead of TCP, and looked into that, but we couldn’t figure out how that would be any different.

Looking at this further, we found that we have this in the code dumps:

      ~~~~ 5072
         1 Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
         1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
         1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)

As you can see, we are waiting for this in the .NET Sockets thread. The SNMP, on the other hand, looked like:

Thread (0x559):
   [Native Frames]
   System.Net.Sockets!System.Net.Sockets.SocketPal.SysReceive()
   System.Net.Sockets!System.Net.Sockets.SocketPal.TryCompleteReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketAsyncContext.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketPal.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.Socket.ReceiveFrom()
   SharpSnmpLib.Engine!Lextm.SharpSnmpLib.Pipeline.ListenerBinding.AsyncReceive()

That was really interesting, since it meant that for sockets (both HTTP and Unix), we were always using async calls, but for SNMP, we were using the synchronous API. We initially suspected that this may be something related to the thread pool. Maybe we had something that blocked it, but it turns out to be a lot more interesting. Here is the code that is actually handling the SNMP:

var count = _socket.ReceiveFrom(buffer, ref remote);

Task.Factory.StartNew(() => HandleMessage(buffer, count, (IPEndPoint)remote));

In other words, we are actually reading from the socket in a blocking manner, but then processing the actual message using the thread pool. So being able to get results via SNMP meant the thread pool was well.

At this point we resulted to hair pulling, rubber ducking and in some instances, shaking our fists to heaven.

I reminded myself that I’m an adult with a bit of experience solving problems, and dug deeper. We started looking into how .NET is actually handling sockets in async mode. This end up here, doing a system call:

while ((numEvents = epoll_wait(port, events, *count, -1)) < 0 && errno == EINTR);

Reading through the man page for epoll_wait() I learned how epoll() works, that it is complex and that we need to be aware of level-triggered and edge-triggered options. Since .NET uses edge-triggered events (EPOLLET, which I keep reading as electronic chicken), we focused on that.

There are a lot of edge cases and things to cover, but everything we checked was handled properly. We finally had a good smoking gun. For some reason, we weren’t getting notifications from epoll(), even though we should. Using strace() or friends somehow fixes that.

We actually found the exact scenario we saw in StackOverflow, but without any idea what the issue was. Truly, there is an XKCD for everything.

Our current understanding of the issue:

  • All async sockets in .NET are going through the same socket engine, and are using epoll() under the covers.
  • SNMP is using synchronous calls, so it wasn’t using epoll().

That covers both of the weird things that we are seeing. So what is the issue?

It is not in .NET. Given the size & scope of .NET, we wouldn’t be the only ones seeing that. Below .NET, there is the kernel, so we looked into that. The machines we were running that on were using kernel 5.4.0-azure-1095, so we looked into that.

And it looked like it is a kernel bug, which was fixed in the next updated kernel. A race condition inside the kernel would cause us to miss wakeups, and then we would basically just stall without anything to wake us up.

We dug deeper to understand a bit more about this situation, and we got this:

       Some system calls return with EINTR if a signal was sent to a
       tracee, but delivery was suppressed by the tracer.  (This is very
       typical operation: it is usually done by debuggers on every
       attach, in order to not introduce a bogus SIGSTOP).  As of Linux
       3.2.9, the following system calls are affected (this list is
       likely incomplete): epoll_wait(2), and read(2) from an inotify(7)
       file descriptor.  The usual symptom of this bug is that when you
       attach to a quiescent process with the command

           strace -p <process-ID>

       then, instead of the usual and expected one-line output such as

           restart_syscall(<... resuming interrupted call ...>_

       or

           select(6, [5], NULL, [5], NULL_

       ('_' denotes the cursor position), you observe more than one
       line.  For example:

               clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
               epoll_wait(4,_

       What is not visible here is that the process was blocked in
       epoll_wait(2) before strace(1) has attached to it.  Attaching
       caused epoll_wait(2) to return to user space with the error
       EINTR.  In this particular case, the program reacted to EINTR by
       checking the current time, and then executing epoll_wait(2)
       again.  (Programs which do not expect such "stray" EINTR errors
       may behave in an unintended way upon an strace(1) attach.)

And.. that is exactly what is happening. On attaching, the epoll_wait() will return with EINTR, which will cause .NET to retry the command, and that “fixes” the issue.

It makes total sense now, and concludes the discovery process of a pretty nasty bug.

Now, if you’ll excuse me, I need to go and apologize to a rubber duck.

image

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (75):
    01 Jul 2024 - Efficient snapshotable state
  2. Recording (14):
    19 Jun 2024 - Building a Database Engine in C# & .NET
  3. re (33):
    28 May 2024 - Secure Drop protocol
  4. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  5. Production Postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}