Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 15 min | 2973 words

Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.

I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:

using System.Collections.Generic;
using System.IO;

public class TrivialCsvParser
{
    private readonly string _path;

    public TrivialCsvParser(string path)
    {
        _path = path;
    }

    // Lazily yields one string[] per line; no quoting or escaping is handled.
    public IEnumerable<string[]> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                var fields = line.Split(',');
                yield return fields;
            }
        }
    }
}

This ran in 8.65 seconds (with a no-op action) and kept the memory utilization at about 7MB.

The next thing to try was just reading through the file without doing any parsing. So I wrote this:

using System.Collections.Generic;
using System.IO;

public class NoopParser
{
    private readonly string _path;

    public NoopParser(string path)
    {
        _path = path;
    }

    // Reads the file in 1KB chunks and ignores the bytes entirely.
    public IEnumerable<object> Parse()
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var result = stream.Read(buffer, 0, buffer.Length);
                if (result == 0)
                    break;
                yield return null; // noop
            }
        }
    }
}

Note that this isn’t actually doing anything. But this took 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different. Both use about 7MB, probably because we aren’t actually holding on to any of the data in any meaningful way.

I ran the tests using a release build, and I ran them multiple times, so the file is probably all in the OS cache and the I/O cost is pretty minimal here. However, note that we aren’t doing a lot of the work that TrivialCsvParser does, such as searching for line breaks and splitting each line into fields. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds.
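
The 3.55 second figure is for reading lines without splitting them. The post does not show that variant, so the following is only a sketch of what it might look like. `LineOnlyParser` and `CountFields` are hypothetical names, and none of the timings above were measured with this code.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical variant of TrivialCsvParser with the Split call removed.
public class LineOnlyParser
{
    private readonly string _path;

    public LineOnlyParser(string path)
    {
        _path = path;
    }

    // Yields each line as-is, so no string[] and no per-field strings
    // are allocated. The line string itself is still allocated by ReadLine.
    public IEnumerable<string> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Counts comma-separated fields by scanning the characters,
    // without creating any new strings.
    public static int CountFields(string line)
    {
        var count = 1;
        foreach (var ch in line)
        {
            if (ch == ',')
                count++;
        }
        return count;
    }
}
```

`CountFields` returns the same number that `line.Split(',').Length` would, which is one way to keep a sanity check on the data while skipping the allocations that `Split` makes.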

time to read 1 min | 87 words

Well, tomorrow I’ll be 0x20. Leaving aside the fact that I am just entering my twenties (finally, it feels like I was a 0xTeenager for over a decade), there is the tradition to uphold.

Therefore, we have a 32% discount until the end of the year*.

You can use coupon code: 0x20-twentysomething

This applies to:

* Limited to the first 0x32, and not applicable if you have to ask why you get a 32% discount.

time to read 4 min | 652 words

Writing in C (and using only the C std lib as building blocks, which explicitly excludes C++ and all its stuff), generate 1 million unique random numbers.

For reference, here is the code in C#:

using System;
using System.Collections.Generic;
using System.Diagnostics;

var random = new Random();
var set = new HashSet<int>();

var sp = Stopwatch.StartNew();

// Keep drawing until the set holds one million distinct values.
while (set.Count < 1000 * 1000)
{
    set.Add(random.Next(0, int.MaxValue));
}

Console.WriteLine(sp.ElapsedMilliseconds);

It is a brute force approach, I’ll admit, but it completes in about 150ms on my end. The solution must run in under 10 seconds.

This question may look stupid, but it can actually tell you quite a bit about the developer.

time to read 92 min | 18267 words

This post was written at 5:30AM. I ran into this while doing research for another post, and I couldn’t really let it go.

XML as a text-based format is really wasteful in space. But that wasn’t what really made it lose its shine. That was when it became so complex that it stopped being human readable. For example, I give you:

<?xml version="1.0" encoding="UTF-8" ?>
 <SOAP-ENV:Envelope
  xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/1999/XMLSchema"
  xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   <SOAP-ENV:Body>
       <ns1:getEmployeeDetailsResponse
        xmlns:ns1="urn:MySoapServices"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
           <return xsi:type="ns1:EmployeeContactDetail">
               <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
               <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
               <tempPhoneNumber
                xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
                xsi:type="ns2:Array"
                ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37060</startDate>
                       <endDate xsi:type="xsd:int">37064</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
                   </item>
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37074</startDate>
                       <endDate xsi:type="xsd:int">37078</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
                   </item>
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37088</startDate>
                       <endDate xsi:type="xsd:int">37092</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
                   </item>
               </tempPhoneNumber>
           </return>
       </ns1:getEmployeeDetailsResponse>
   </SOAP-ENV:Body>
 </SOAP-ENV:Envelope>

After XML was thrown out of the company of respectable folks, we had JSON show up and entertain us. It is smaller and more concise than XML, and so far has resisted the efforts to make it into some sort of an uber-complex enterprisey tool.

But today I ran into quite a few efforts to do strange things to JSON. I am talking about things like JSON DB (a compressed JSON format, not an actual JSON database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.

Let us take an example. The following is a JSON document representing one of the RavenDB builds:

{
  "BuildName": "RavenDB Unstable v2.5",
  "IsUnstable": true,
  "Version": "2509-Unstable",
  "PublishedAt": "2013-02-26T12:06:12.0000000",
  "DownloadsIds": [],
  "Changes": [
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
      "Href": "/app/rest/changes/id:21039",
      "TeamCityId": 21039,
      "Username": "david walker",
      "Comment": "Do not save Has-Api-Key header to metadata\n",
      "Date": "2013-02-20T23:22:43.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
      "Href": "/app/rest/changes/id:21040",
      "TeamCityId": 21040,
      "Username": "david walker",
      "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
      "Date": "2013-02-20T23:23:59.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "5919521286735f50f963824a12bf121cd1df4367",
      "Href": "/app/rest/changes/id:21035",
      "TeamCityId": 21035,
      "Username": "ayende rahien",
      "Comment": "Better disposal\n",
      "Date": "2013-02-26T10:16:45.0000000",
      "Files": [
        "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
      "Href": "/app/rest/changes/id:21036",
      "TeamCityId": 21036,
      "Username": "ayende rahien",
      "Comment": "Will ensure that the value is always positive or zero (never negative).\nWhen using numeric calc, will div by 1,024 to get more concentration into buckets.\n",
      "Date": "2013-02-26T10:17:23.0000000",
      "Files": [
        "Raven.Database/Indexing/IndexingUtil.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "7bf51345d39c3993fed5a82eacad6e74b9201601",
      "Href": "/app/rest/changes/id:21037",
      "TeamCityId": 21037,
      "Username": "ayende rahien",
      "Comment": "Fixing a bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
      "Date": "2013-02-26T10:53:01.0000000",
      "Files": [
        "Raven.Database/Indexing/MapReduceIndex.cs",
        "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
        "Raven.Database/Storage/IMappedResultsStorageAction.cs",
        "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
        "Raven.Tests/Issues/RavenDB_784.cs",
        "Raven.Tests/Storage/MappedResults.cs",
        "Raven.Tests/Views/ViewStorage.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
      "Href": "/app/rest/changes/id:21038",
      "TeamCityId": 21038,
      "Username": "ayende rahien",
      "Comment": "No need for so many repeats\n",
      "Date": "2013-02-26T11:27:49.0000000",
      "Files": [
        "Raven.Tests/Bugs/MultiOutputReduce.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
      "Href": "/app/rest/changes/id:21041",
      "TeamCityId": 21041,
      "Username": "ayende rahien",
      "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
      "Date": "2013-02-26T11:41:39.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    }
  ],
  "ResolvedIssues": [],
  "Contributors": [
    {
      "FullName": "Ayende Rahien",
      "Email": "ayende@ayende.com",
      "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
    },
    {
      "FullName": "David Walker",
      "Email": "david@davidwalker.org",
      "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
    }
  ],
  "BuildTypeId": "bt8",
  "Href": "/app/rest/builds/id:588",
  "ProjectName": "RavenDB",
  "TeamCityId": 588,
  "ProjectId": "project3",
  "Number": 2509
}

This document is 4.52KB in size. Running this through JSONH gives us the following:

[
    14,
    "BuildName",
    "IsUnstable",
    "Version",
    "PublishedAt",
    "DownloadsIds",
    "Changes",
    "ResolvedIssues",
    "Contributors",
    "BuildTypeId",
    "Href",
    "ProjectName",
    "TeamCityId",
    "ProjectId",
    "Number",
    "RavenDB Unstable v2.5",
    true,
    "2509-Unstable",
    "2013-02-26T12:06:12.0000000",
    [
    ],
    [
        {
            "Commiter": {
                "Email": "david@davidwalker.org",
                "Name": "David Walker"
            },
            "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
            "Href": "/app/rest/changes/id:21039",
            "TeamCityId": 21039,
            "Username": "david walker",
            "Comment": "Do not save Has-Api-Key header to metadata\n",
            "Date": "2013-02-20T23:22:43.0000000",
            "Files": [
                "Raven.Abstractions/Extensions/MetadataExtensions.cs"
            ]
        },
        {
            "Commiter": {
                "Email": "david@davidwalker.org",
                "Name": "David Walker"
            },
            "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
            "Href": "/app/rest/changes/id:21040",
            "TeamCityId": 21040,
            "Username": "david walker",
            "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
            "Date": "2013-02-20T23:23:59.0000000",
            "Files": [
                "Raven.Abstractions/Extensions/MetadataExtensions.cs"
            ]
        },
        {
            "Commiter": {
                "Email": "ayende@ayende.com",
                "Name": "Ayende Rahien"
            },
            "Version": "5919521286735f50f963824a12bf121cd1df4367",
            "Href": "/app/rest/changes/id:21035",
            "TeamCityId": 21035,
            "Username": "ayende rahien",
            "Comment": "Better disposal\n",
            "Date": "2013-02-26T10:16:45.0000000",
            "Files": [
                "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
            ]
        },
        {
            "Commiter": {
                "Email": "ayende@ayende.com",
                "Name": "Ayende Rahien"
            },
            "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
            "Href": "/app/rest/changes/id:21036",
            "TeamCityId": "...bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
            "Date": "2013-02-26T10:53:01.0000000",
            "Files": [
                "Raven.Database/Indexing/MapReduceIndex.cs",
                "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
                "Raven.Database/Storage/IMappedResultsStorageAction.cs",
                "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
                "Raven.Tests/Issues/RavenDB_784.cs",
                "Raven.Tests/Storage/MappedResults.cs",
                "Raven.Tests/Views/ViewStorage.cs"
            ]
        },
        {
            "Commiter": {
                "Email": "ayende@ayende.com",
                "Name": "Ayende Rahien"
            },
            "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
            "Href": "/app/rest/changes/id:21038",
            "TeamCityId": 21038,
            "Username": "ayende rahien",
            "Comment": "No need for so many repeats\n",
            "Date": "2013-02-26T11:27:49.0000000",
            "Files": [
                "Raven.Tests/Bugs/MultiOutputReduce.cs"
            ]
        },
        {
            "Commiter": {
                "Email": "ayende@ayende.com",
                "Name": "Ayende Rahien"
            },
            "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
            "Href": "/app/rest/changes/id:21041",
            "TeamCityId": 21041,
            "Username": "ayende rahien",
            "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
            "Date": "2013-02-26T11:41:39.0000000",
            "Files": [
                "Raven.Abstractions/Extensions/MetadataExtensions.cs"
            ]
        }
    ],
    [
    ],
    [
        {
            "FullName": "Ayende Rahien",
            "Email": "ayende@ayende.com",
            "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
        },
        {
            "FullName": "David Walker",
            "Email": "david@davidwalker.org",
            "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
        }
    ],
    "bt8",
    "/app/rest/builds/id:588",
    "RavenDB",
    588,
    "project3",
    2509
]

It reduced the document size to 2.93KB! Awesome, more than a third of the size is gone. Except: this is actually generating an utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on?

I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I threw this through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I have with the JSONH stuff.

To be frank, I just don’t get efforts like this. JSON is a text-based, human-readable format. If you lose the human-readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient and you don’t lose anything by it.

And if you want to compress your data, it is probably better to use something like a compression tool. HTTP compression, for example, is practically free, since all servers and clients should be able to consume it now, and any tool that you use should be able to inspect through it. It is also likely to generate much better results on your JSON documents than if you try a clever format like this.
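
To make the comparison concrete, here is a minimal sketch of running a JSON string through gzip with the standard `System.IO.Compression.GZipStream` class, which is the same algorithm HTTP compression typically uses. `JsonCompression`, `Gzip` and `Gunzip` are hypothetical names chosen for this illustration, and the 1.42KB figure above came from a zip tool, not from this code.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class JsonCompression
{
    // Compresses the UTF-8 bytes of the given text with gzip.
    public static byte[] Gzip(string text)
    {
        var raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream writes out the final compressed block
            return output.ToArray();
        }
    }

    // Reverses Gzip, returning the original text.
    public static string Gunzip(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Comparing `JsonCompression.Gzip(json).Length` against `Encoding.UTF8.GetByteCount(json)` shows the saving for a given document, and the original JSON stays fully readable on either side of the wire.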

time to read 3 min | 413 words

So, I just finished interviewing a candidate. His CV states that he has been working professionally for about 6 years or so. The initial interview went pretty well, and the candidate was able to talk well about his past experience. I tend to do a generic “who are you?” section, then give them a couple of questions to solve in front of Visual Studio, an architecture question, and then a set of technical questions that test how much the candidate knows.

Mostly, I am looking to get an impression about the candidate, since that is all I usually have a chance to do in the span of the interview. The following is a section from the code exercise that this candidate has completed:

for (int i = 0; i < sortedArrLst.Count; i++)
{
    if (sortedArrLst[i].Contains(escapeSrt[0]))
    {
        if (sortedArrLst[i].IndexOf(escapeSrt[0]) == 0)
        {
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[0].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[0]);
        }
        
    }
    if (sortedArrLst[i].Contains(escapeSrt[1]))
    {
        if (sortedArrLst[i].IndexOf(escapeSrt[1]) == 0)
        {
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[1].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[1]);
        }
    }
}

Thank you, failure to use loops will get you disqualified from working with us.
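
For contrast, here is a sketch of the same logic with the two copy-pasted blocks folded into an inner loop. The declarations are not shown in the candidate's code, so the types here are assumptions (`List<string>`, `string[]` and `Dictionary<string, string>`), and `EscapePrefixes.Strip` is a hypothetical name. It also assumes that the `Contains` plus `IndexOf(...) == 0` pair was meant as an ordinal "starts with" check.

```csharp
using System;
using System.Collections.Generic;

public static class EscapePrefixes
{
    // For every entry that starts with one of the escape strings, strips
    // that prefix (plus the one character following it, as the original
    // does) and records which escape string was removed.
    public static void Strip(
        List<string> sortedArrLst,
        string[] escapeSrt,
        Dictionary<string, string> escapeStrDic)
    {
        for (int i = 0; i < sortedArrLst.Count; i++)
        {
            foreach (var escape in escapeSrt)
            {
                if (!sortedArrLst[i].StartsWith(escape, StringComparison.Ordinal))
                    continue;

                sortedArrLst[i] = sortedArrLst[i].Remove(0, escape.Length + 1);
                escapeStrDic.Add(sortedArrLst[i], escape);
            }
        }
    }
}
```

Besides being shorter, this version keeps working when a third escape string is added to the array, which the hand-unrolled original does not.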

Then there were the gems such as “mutex is a kind of state machine” and “binary search trees are about recursion” or the “I’ll use perfmon to solve a high CPU usage problem in production”.

Then again, the next candidate after that was quite good. Only 4 – 6 to go now.
