Ayende @ Rahien

filter by tags archive

architecture (623) rss
bugs (451) rss
community (382) rss
databases (481) rss
design (899) rss
development (654) rss
hibernating-practices (72) rss
miscellaneous (592) rss
performance (397) rss
programming (1104) rss
raven (1471) rss
ravendb.net (557) rss
reviews (184) rss

2025
- October (3)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB Workshops - Deep dive into practical use of Document Data Modeling

Jul 05 2013

Optimized, but way out of line

time to read 2 min | 263 words

Tweet Share Share 6 comments

Tags:

raven
bugs

Yesterday I showed the following profiler results.

The first place someone would look at optimizing is the CaseInsensitiveComparator.Compare call, right? That is clearly taking a really long time.

In fact, we re-wrote that method to use pointer magic in order to get it to work faster. But that isn’t the issue. Look at the numbers. Compare is called 68 million times. That means that under the profiler, it is still completing an operation in 0.000073 ms. I am not even sure what time unit that is.

However, we can see that the major caller for Compare was FindGreaterOrEqual, right? And that is called only 11 thousand times. What does this tells you?

That tells me we have a Schlemiel the Painter's algorithm. So I double checked, and SkipList insert perf should be O(logN) on average. But what we were seeing is O(N) costs .Indeed, it appeared that we had a bug in which our skip list implementation would never do skips, essentially turning it into a sorted linked list, with insert perf of O(N). Once we fixed that bug… well, things got better, a lot better.

As you can see, the cost of this went down really hard. And now we can focus on other stuff.

Jul 04 2013

Optimize this, dude!

time to read 1 min | 55 words

Tweet Share Share 29 comments

Tags:

raven
bugs

So, we have a perf problem in the managed leveldb implementation. I pulled out dotTrace and looked at the numbers:

Do you see where the problem is?

Jul 02 2013

Spot the bug…

time to read 1 min | 39 words

Tweet Share Share 11 comments

Tags:

bugs

Do you see it? It took me ages to find it.

Jun 18 2013

TPL and the case of the !@#(*@! hung process

time to read 5 min | 814 words

Tweet Share Share 11 comments

Tags:

raven
bugs

So, here I am writing some really fun code, when I found out that I am running into dead locks in the code. I activate emergency protocols and went into deep debugging mode.

After being really through in figuring out several possible causes, I was still left with what is effectively a WTF @!(*!@ DAMN !(@*#!@* YOU !@*!@( outburst and a sudden longing for something to repeatedly hit.

Eventually, however, I figure out what was going on.

I have the following method: Aggregator.AggregateAsync(), inside which we have a call to the PulseAll method. That method will then go and execute the following code:

   1: public void PulseAll()

   2: {

   3:     Interlocked.Increment(ref state);

   4:     TaskCompletionSource<object> result;

   5:     while (waiters.TryDequeue(out result))

   6:     {

   7:         result.SetResult(null);

   8:     }

   9: }

After that, I return from the method. In another piece of the code (Aggregator.Dispose) I am waiting for the task that is running the AggregateAsync method to complete.

Nothing worked! It took me a while before I figured out that I wanted to check the stack, where I found this:

Basically, I had a dead lock because when I called SetResult on the completion source (which freed the Dispose code to run), I actually switched over to that task and allowed it to run. Still in the same thread, but in a different task, I run through the rest of the code and eventually got to the Aggregator.Dispose(). Now, I could only get to it if it the PulseAll() method was called. But, because we are on the same thread, that task hasn’t been completed yet!

In the end, I “solved” that by introducing a DisposeAsync() method, which allowed us to yield the thread, and then the AggregateAsync task was completed, and then we could move on.

But I am really not very happy about this. Any ideas about proper way to handle async & IDisposable?

Jun 17 2013

Deadlocking with the TPL, how to

time to read 21 min | 4088 words

Tweet Share Share 21 comments

Tags:

bugs

As I mentioned, I run into a very nasty issue with the TPL. I am not sure if it is me doing things wrong, or an actual issue.

Let us look at the code, shall we?

We start with a very simple code:

   1: public class AsyncEvent

   2: {

   3:     private volatile TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();

4:

   5:     public Task WaitAsync()

   6:     {

   7:         return tcs.Task;

   8:     }

9:

  10:     public void PulseAll()

  11:     {

  12:         var taskCompletionSource = tcs;

  13:         tcs = new TaskCompletionSource<object>();

  14:         taskCompletionSource.SetResult(null);

  15:     }

  16: }

This is effectively an auto reset event. All the waiters will be released when the PulseAll it called. Then we have this runner, which just execute work:

   1: public class Runner : IDisposable

   2: {

   3:     private readonly ConcurrentQueue<TaskCompletionSource<object>> items =

   4:         new ConcurrentQueue<TaskCompletionSource<object>>();

   5:     private readonly Task<Task> _bg;

   6:     private readonly AsyncEvent _event = new AsyncEvent();

   7:     private volatile bool _done;

8:

   9:     public Runner()

  10:     {

  11:         _bg = Task.Factory.StartNew(() => Background());

  12:     }

13:

  14:     private async Task Background()

  15:     {

  16:         while (_done == false)

  17:         {

  18:             TaskCompletionSource<object> result;

  19:             if (items.TryDequeue(out result) == false)

  20:             {

  21:                 await _event.WaitAsync();

  22:                 continue;

  23:             }

24:

  25:             //work here, note that we do NOT use await!

26:

  27:             result.SetResult(null);

  28:         }

  29:     }

30:

  31:     public Task AddWork()

  32:     {

  33:         var tcs = new TaskCompletionSource<object>();

  34:         items.Enqueue(tcs);

35:

  36:         _event.PulseAll();

37:

  38:         return tcs.Task;

  39:     }

40:

  41:     public void Dispose()

  42:     {

  43:         _done = true;

  44:         _event.PulseAll();

  45:         _bg.Wait();

  46:     }

  47: }

And finally, the code that causes the problem:

   1: public static async Task Run()

   2: {

   3:     using (var runner = new Runner())

   4:     {

   5:         await runner.AddWork();

   6:     }

   7: }

So far, it is all pretty innocent, I think you would agree. But this cause hangs with a dead lock. Here is why:

Because tasks can share threads, we are in the Background task thread, and we are trying to wait on that background task completion.

Result, deadlock.

If I add:

   1: await Task.Yield();

Because that forces this method to be completed in another thread, but that looks more like something that you add after you discover the bug, to be honest.

May 27 2013

ChallengeThe problem of locking down tasks…

time to read 18 min | 3449 words

Tweet Share Share 7 comments

Tags:

The following code has very subtle bug:

   1: public class AsyncQueue

   2: {

   3:     private readonly Queue<int> items = new Queue<int>();

   4:     private volatile LinkedList<TaskCompletionSource<object>> waiters = new LinkedList<TaskCompletionSource<object>>();

5:

   6:     public void Enqueue(int i)

   7:     {

   8:         lock (items)

   9:         {

  10:             items.Enqueue(i);

  11:             while (waiters.First != null)

  12:             {

  13:                 waiters.First.Value.TrySetResult(null);

  14:                 waiters.RemoveFirst();

  15:             }

  16:         }

  17:     }

18:

  19:     public async Task<IEnumerable<int>> DrainAsync()

  20:     {

  21:         while (true)

  22:         {

  23:             TaskCompletionSource<object> taskCompletionSource;

  24:             lock (items)

  25:             {

  26:                 if (items.Count > 0)

  27:                 {

  28:                     return YieldAllItems();

  29:                 }

  30:                 taskCompletionSource = new TaskCompletionSource<object>();

  31:                 waiters.AddLast(taskCompletionSource);

  32:             }

  33:             await taskCompletionSource.Task;

  34:         }

  35:     }

36:

  37:     private IEnumerable<int> YieldAllItems()

  38:     {

  39:         while (items.Count > 0)

  40:         {

  41:             yield return items.Dequeue();

  42:         }

43:

  44:     }

  45: }

I’ll even give you a hint, try to run the following client code:

   1: for (int i = 0; i < 1000 * 1000; i++)

   2: {

   3:     q.Enqueue(i);

   4:     if (i%100 == 0)

   5:     {

   6:         Task.Factory.StartNew(async () =>

   7:             {

   8:                 foreach (var result in await q.DrainAsync())

   9:                 {

  10:                     Console.WriteLine(result);

  11:                 }

  12:             });

  13:     }

14:

  15: }

Can you figure out what the problem is?

May 01 2013

This is debugging, old school

time to read 1 min | 37 words

Tweet Share Share 5 comments

Tags:

raven
bugs

Dear god in heaven, how much I did not miss that.

Feb 18 2013

Bug hunting in a massively multi threaded environment

time to read 8 min | 1597 words

Tweet Share Share 5 comments

Tags:

We got a really nasty bug report from a user. Sometimes, out of the blue, RavenDB would throw an error:

System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at Raven.Json.Linq.RavenJArray.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJArray.cs:line 174
   at Raven.Json.Linq.RavenJObject.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJObject.cs:line 275
   at Raven.Json.Linq.RavenJObject.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJObject.cs:line 275

Unfortunately, this error only happened once in a while, usually when the system was under load. But they weren’t able to provide a repro for that.

Luckily, they were able to tell us that they suspected that this is related to the replication support. I quickly setup a database with replication and wrote the following code to try to reproduce this:

using(var store = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "hello"
}.Initialize())
{
    using(var session = store.OpenSession())
    {
        session.Store(new ReadingList
        {
            UserId = "test/1",
            Id = "lists/1",
            Books = new List<ReadingList.ReadBook>()
        });
        session.SaveChanges();
    }
    Parallel.For(0, 100, i =>
    {
        while (true)
        {
            try
            {
                using (var session = store.OpenSession())
                {
                    session.Advanced.UseOptimisticConcurrency = true;
                    session.Load<ReadingList>("lists/1")
                            .Books.Add(new ReadingList.ReadBook
                            {
                                ReadAt = DateTime.Now,
                                Title = "test " + i
                            });
                    session.SaveChanges();
                }
                break;
            }
            catch (ConcurrencyException)
            {
                
            }
        }
    });
}

And that reproduced the bug! Hurrah! Done deal, we can move on, right?

Except that the bug was only there when we have massive amount of threads hitting the db at once, and trying to figure out what is actually going on there was next to impossible using standard debugging tools. Instead, I reached down to my tracing toolbelt and starting pulling stuff out. First, we identified that the problem occurred when iterating over RavenJArray, which is our own object, so we added the following:

        ConcurrentQueue<StackTrace>  addStackTraces = new ConcurrentQueue<StackTrace>();

        public void Add(RavenJToken token)
        {
            if (isSnapshot)
                throw new InvalidOperationException("Cannot modify a snapshot, this is probably a bug");

            addStackTraces.Enqueue(new StackTrace(true));

            Items.Add(token);
        }

And this one (where the exception is raised):

public override void WriteTo(JsonWriter writer, params JsonConverter[] converters)
{
    writer.WriteStartArray();

    if (Items != null)
    {
        try
        {
            foreach (var token in Items)
            {
                token.WriteTo(writer, converters);
            }
        }
        catch (InvalidOperationException e)
        {
            foreach (var stackTrace in addStackTraces)
            {
                Console.WriteLine(stackTrace);
            }
            throw;
        }
    }

    writer.WriteEndArray();
}

With the idea that we would actually be able to get what is going on there. By tracking down who added items to this particular instance, I hoped that I would be able to figure out why we have an instance that is shared among multiple threads.

When I had that, it was pretty easy to see that it was indeed the replication bundle that was causing the issue. The problem was that the replication bundle was modifying an inner array inside the document metadata. We protected the root properties of the metadata from concurrent modifications, and most of the time, it works just fine. But the problem was that now we had a bundle that was modifying a nested array, which wasn’t protected.

This is one of those bugs that are really hard to catch:

My understanding of the code said that this is not possible, since I believed that we protected the nested properties as well*.
This bug will only surface if and only if:

You have the replication bundle enabled.
You have a great deal of concurrent modifications (with optimistic concurrency enabled) to the same document.
You are unlucky.

I was grateful that the user figured out the replication connection, because I already sat on that bug previously, and there was no way I could figure out what is going on unless I had the trace to point me to where the actual problem was.

Nov 16 2012

Production issue: ASP.Net Cache kills the application

time to read 1 min | 140 words

Tweet Share Share 33 comments

Tags:

bugs

In one of our production deployments, we occasionally get a complete server process crash. Investigating the event log, we have this:

Exception: System.InvalidOperationException

Message: Collection was modified; enumeration operation may not execute.

StackTrace:    at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()

   at System.Web.Hosting.ObjectCacheHost.TrimCache(Int32 percent)

   at System.Web.Hosting.HostingEnvironment.TrimCache(Int32 percent)

   at System.Web.Hosting.HostingEnvironment.TrimCache(Int32 percent)

   at System.Web.Hosting.ApplicationManager.TrimCaches(Int32 percent)

   at System.Web.Hosting.CacheManager.CollectInfrequently(Int64 privateBytes)

   at System.Web.Hosting.CacheManager.PBytesMonitorThread(Object state)

   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)

   at System.Threading._TimerCallback.PerformTimerCallback(Object state)

As you can see, this is a case of what appears to be a run of the mill race condition, translated to a process killing exception because it was thrown from a separate thread.

This thread, by the way, is the ASP.Net Cache cleanup thread, and we have no control whatsoever over that. To make things worse, this application doesn’t even use the ASP.NET Cache in any way shape or form.

Any ideas how to resolve this would be very welcome.

Jul 25 2012

Silverlight streaming - the race condition is already included

time to read 8 min | 1539 words

Tweet Share Share 22 comments

Tags:

bugs

I am not sure how to call this issue, except maddening. For a simple repro, you can check this github repository.

The story is quite simple, let us assume that you need to send a set of values from the server to the client. For example, they might be tick values, or updates, or anything of this sort.

You can do this by keeping an HTTP connection open and sending data periodically. This is a well known technique and it works quite well. Except in Silverlight, where it works, but only if you put the appropriate Thread.Sleep() in crucial places.

Here is an example of the server behavior:

var listener = new HttpListener
{
    Prefixes = {"http://+:8080/"}
};
listener.Start();
while (true)
{
    var ctx = listener.GetContext();
    using (var writer = new StreamWriter(ctx.Response.OutputStream))
    {
        writer.WriteLine("first\r\nsecond");
        writer.Flush();
    }
    Console.ReadKey();
}

In this case, note that we are explicitly flushing the response, then just wait. If you look at the actual network traffic, you can see that this will actually be sent, the connection will remain open, and we can actually send additional data as well.

But how do you consume such a thing in Silverlight?

var webRequest = (HttpWebRequest)WebRequestCreator.ClientHttp.Create(new Uri("http://localhost:8080/"));
webRequest.AllowReadStreamBuffering = false;
webRequest.Method = "GET";

Task.Factory.FromAsync<WebResponse>(webRequest.BeginGetResponse, webRequest.EndGetResponse, null)
    .ContinueWith(task =>
    {
        var responseStream = task.Result.GetResponseStream();
        ReadAsync(responseStream);
    });

We start by making sure that we disable read buffering, then we get the response and start reading from it. The read method is a bit complex, because is has to deal with partial response, but it should still be fairly obvious what is going on:

byte[] buffer = new byte[128];
private int posInBuffer;
private void ReadAsync(Stream responseStream)
{
    Task.Factory.FromAsync<int>(
        (callback, o) => responseStream.BeginRead(buffer, posInBuffer, buffer.Length - posInBuffer, callback, o),
        responseStream.EndRead, null)
        .ContinueWith(task =>
        {
            var read = task.Result;
            if (read == 0) 
                throw new EndOfStreamException();
            // find \r\n in newly read range

            var startPos = 0;
            byte prev = 0;
            bool foundLines = false;
            for (int i = posInBuffer; i < posInBuffer + read; i++)
            {
                if (prev == '\r' && buffer[i] == '\n')
                {
                    foundLines = true;
                    // yeah, we found a line, let us give it to the users
                    var data = Encoding.UTF8.GetString(buffer, startPos, i - 1 - startPos);
                    startPos = i + 1;
                    Dispatcher.BeginInvoke(() =>
                    {
                        ServerResults.Text += data + Environment.NewLine;
                    });
                }
                prev = buffer[i];
            }
            posInBuffer += read;
            if (startPos >= posInBuffer) // read to end
            {
                posInBuffer = 0;
                return;
            }
            if (foundLines == false)
                return;

            // move remaining to the start of buffer, then reset
            Array.Copy(buffer, startPos, buffer, 0, posInBuffer - startPos);
            posInBuffer -= startPos;
        })
        .ContinueWith(task =>
        {
            if (task.IsFaulted)
                return;
            ReadAsync(responseStream);
        });
}

While I am sure that you could find bugs in this code, that isn’t the crucial point.

If we run the server, then run the SL client, we could see that we get just one lousy byte, and that is it. Now, reading about this, it appears that in some versions of some browsers, you need to send 4KB of data to get the connection going. But that isn’t what I have observed. I tried sending 4KB+ of data, and I still saw the exact same behavior, we got called for the first byte, and nothing else.

Eventually, I boiled it down to the following non working example:

writer.WriteLine("first");
writer.Flush();
writer.WriteLine("second");
writer.Flush();

Versus this working example:

writer.WriteLine("first");
writer.Flush();
Thread.Sleep(50);
writer.WriteLine("second");
writer.Flush();

Yes, you got it right, if I put the thread sleep in the server, I’ll get both values in the client. Without the Thread.Sleep, we get only the first byte. It seems like it isn’t an issue of size, but rather of time, and I am at an utter loss to explain what is going on.

Oh, and I am currently awake for 27 hours straight, most of them trying to figure out what the )(&#@!)(DASFPOJDA(FYQ@YREQPOIJFDQ#@R(AHFDS:OKJASPIFHDAPSYUDQ)(RE is going on.

Oren Eini

Oren Eini

CEO of RavenDB

Optimized, but way out of line

Optimize this, dude!

Spot the bug…

TPL and the case of the !@#(*@! hung process

Deadlocking with the TPL, how to

ChallengeThe problem of locking down tasks…

This is debugging, old school

Bug hunting in a massively multi threaded environment

Production issue: ASP.Net Cache kills the application

Silverlight streaming - the race condition is already included

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed