Ayende @ Rahien

Nov 05 2017

Random perf results that make me happy

time to read 2 min | 212 words

Michael Yarichuk is one of the core developers of RavenDB. He is going to give a talk and a workshop at Oredev this week, and I just got his latest slides for review.

His talk is about how you can reduce your GC load and improve performance and it includes the following slide:

On the left you have RavenDB 4.0 and on the right RavenDB 3.5 running the same load under a profiler. Leaving aside that RavenDB 4.0 is much faster overall, look at the numbers. The 3.5 version spent a lot of time in GC, and a lot of that was blocking GC calls. The 4.0 version barely did any GC, and all of that was in the background.

This scenario wasn’t part of any performance work; it was meant to show the result of about two years of work. It is amazing to look back and see such a concrete example of the results so clearly.

Michael will be talking about some of the techniques we use to get there, so I highly recommend you come to his talk. He’ll also be doing a full day workshop on modeling data with documents.

Nov 03 2017

The bare minimum a distributed system developer should know about: Certificates

time to read 5 min | 848 words

After explaining all the ways that trust can be subverted by DNS, CA and a random wave to wipe away the writing in the sand, let us get down to actual details about what matters here.

HTTPS / SSL / TLS, whatever it is called this week, provides confidentiality over the wire for the messages you are sending. What it doesn’t provide is confidentiality about who you talked to. This may seem non-obvious at first, because the entire communication is encrypted, so how can a 3rd party know who I’m talking to?

Well, there are two main ways. It can happen through a DNS query. If you need to go to “http://my-awesome-service”, you need to know what the IP of that is, and for that you need to do a DNS query. There are DNS systems that are encrypted, but they aren’t widely deployed and in general you can assume that people can listen to your DNS and figure out what you are doing. If you go to “that-bad-place”, it is probably visible on someone’s logs somewhere.

But the other way that someone can know who you are talking to is that you told them so. How did you do that?

Well, let’s consider one of the primary reasons we have HTTPS. A user has to validate that the hostname they used matches the hostname on the certificate. That seems pretty reasonable, right? But that single requirement pretty much invalidates the notion of confidentiality of who I’m talking to.

Consider the following steps:

I go to “https://my-awesome-service”
This is resolved to IP address 28.23.155.123
I’m starting an SSL connection to that IP, at port 443. Initially, of course, the connection is not encrypted, but I’ve just initiated the SSL connection.

At that point, any outside observer that can listen to the raw network traffic knows what site you have visited. But how can this be? Well, at this point, the server needs to return a reply, and it needs to do that using a certificate.

Let us go with the “secure” option and say that we are simply sending over the wire “open ssl connection to 28.23.155.123”. What does this tell the listener? Well, since at this point the server doesn’t know what the client wants, it must reply with a certificate. That certificate must be the same for all such connections, and the user will abort the connection if the certificate does not match the expected hostname.

What are the implications there? Well, even assuming that I don’t have a database of matching IP addresses to their hostnames (which I would most assuredly do), I can just connect myself to the remote server and get the certificate. At this point, I can just inspect the hostname from the certificate and know what site the user wanted to visit. This is somewhat mitigated by the fact that a certificate may contain multiple hostnames or even wildcards, but even that match gives me quite a lot of information about who you are talking to.
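As a rough sketch of what such an observer could do, the Python below connects to an IP address without naming any host and lists the DNS names on whatever certificate comes back. The function names are my own, and it assumes the server presents a certificate that chains to a root your machine trusts (otherwise the handshake fails before there is anything to read).

```python
import socket
import ssl


def dns_names(cert: dict) -> list[str]:
    """Pull the DNS entries out of the dict returned by SSLSocket.getpeercert()."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def names_served_at(ip: str, port: int = 443, timeout: float = 5.0) -> list[str]:
    """Connect to an IP without sending a hostname and list the names on its certificate."""
    ctx = ssl.create_default_context()
    # There is no hostname to check against; we only want to read the certificate.
    ctx.check_hostname = False
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        # No server_hostname is passed, so no SNI is sent and the server
        # has to answer with its default certificate.
        with ctx.wrap_socket(sock) as tls:
            return dns_names(tls.getpeercert())
```

Calling `names_served_at("28.23.155.123")` would be the observer's move from the steps above; the address is just the example IP from this post, not a real service.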

However, not sending who I want to talk to over the initial connection has a huge cost associated with it. If the server doesn’t know who you want, this means that each IP address may serve only a single hostname (otherwise we may reply with the wrong certificate). Indeed, one of the reasons HTTPS was expensive was this tying of a whole IP address to a single hostname. On the other hand, if we sent the hostname we were interested in, the server would be able to host multiple HTTPS websites on the same machine, and select the right certificate at handshake time.
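To make the "select the right certificate at handshake time" part concrete, here is a minimal sketch of the lookup a server might do once it knows the requested hostname. The mapping, the file names and the single-label wildcard rule are assumptions for illustration, not how any particular server implements it.

```python
def matches(pattern: str, hostname: str) -> bool:
    """Match a certificate name against a hostname; '*' covers exactly one leftmost label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    first_label, dot, rest = hostname.partition(".")
    return bool(first_label) and dot == "." and rest == pattern[2:]


def select_certificate(requested_name, certificates, default):
    """Pick the certificate for the name the client asked for, else the default one."""
    if requested_name is None:
        # The client did not say who it wants: only one answer is possible.
        return default
    for pattern, certificate in certificates.items():
        if matches(pattern, requested_name):
            return certificate
    return default


# Hypothetical hostnames and certificate files sharing one IP address.
CERTIFICATES = {
    "my-awesome-service.example": "awesome.pem",
    "*.shop.example": "shop-wildcard.pem",
}
```

Without a requested name, every client gets `default`, which is exactly the one-hostname-per-IP limitation described above.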

There are two TLS extensions that matter here. One is SNI (Server Name Indication), which is basically a field in the SSL connection handshake that says what the hostname is. The other is ALPN (Application-Layer Protocol Negotiation), which allows you to select which protocol you want to speak to the server. This can be very useful if you want one client to connect to the server using HTTP/1.1 and another using HTTP/2. These have totally different semantics, so routing based on ALPN can make things much easier.

At this point, the server can make all sorts of interesting decisions with regards to the connection. For example, based on the SNI field, it may forward the connection to another machine, either as the raw SSL stream or by stripping the SSL and sending the unencrypted data to the final destination. The first case, of forwarding the raw SSL stream is the more interesting scenario, because we can do that without having the certificate. We just need to inspect the raw stream header and extract the SNI value, at which point we route that to the right location and send the connection on its merry way.
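Here is a rough sketch of that "inspect the raw stream header" step in Python. `extract_sni` walks the first TLS record of a connection and pulls out the SNI value without any certificate or key. It assumes the whole ClientHello arrives in a single record and returns None for anything it cannot parse; `sample_client_hello` is only a helper I added to produce real handshake bytes locally, with no network involved.

```python
import ssl
import struct


def extract_sni(record: bytes):
    """Return the SNI hostname from a raw TLS ClientHello record, or None."""
    try:
        # Record header: content type (1), version (2), length (2); 0x16 = handshake.
        if len(record) < 5 or record[0] != 0x16:
            return None
        pos = 5
        # Handshake header: message type (1), length (3); 0x01 = ClientHello.
        if record[pos] != 0x01:
            return None
        pos += 4
        pos += 2 + 32                       # client version + random
        pos += 1 + record[pos]              # session id
        (suites_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + suites_len               # cipher suites
        pos += 1 + record[pos]              # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:               # server_name extension
                # list length (2), name type (1), name length (2), name bytes
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                if name_type == 0:          # host_name
                    return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def sample_client_hello(hostname: str) -> bytes:
    """Produce the first bytes a TLS client would send for this hostname."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()
```

A router would read the first bytes from the socket, call `extract_sni` on them, pick a backend by name, and then forward those same bytes and everything after them untouched.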

I might do more posts like this, but I would really appreciate feedback, both on whether the content is good and on what additional topics you would like me to cover.

Nov 02 2017

RavenDB 4.0 book update is available

time to read 2 min | 388 words

A new update to the Inside RavenDB book is available. I’m up to chapter 9 (although Chapter 8 is just a skeleton). You can read it here.

In particular, the details about running RavenDB in a cluster and the distributed technologies and approaches it uses are now fully covered. I still have to get back to discussing ETL strategies, but there are two full chapters discussing how RavenDB clusters and replication work in detail. I would dearly appreciate any feedback on that part.

This is a complex topic, and I want to get additional eyes on this to make sure that it is understandable, especially if you are new to distributed system design and how such systems work.

Another major advantage is that we have now had a professional editor go through chapters 1 through 7, so the usage of the English language probably leveled up at least twice. Errors, awkward phrasing and outright mistakes remain my own, and I would love to hear about any issues you find.

Also new in this drop is a full chapter talking about how to query RavenDB and dive into the new RQL language. There is still a lot to cover about indexes, and this chapter hasn’t been edited yet, but I think that this should give a good insight into how we are actually doing things and what you can do with the new query language.

In addition to that, we are ramping up documentation work as we start closing things down toward the actual final release. We are currently aiming for the end of the year, so it is right around the corner. I would also like to remind people that we are currently giving a 30% discount on purchases of RavenDB licenses for the duration of the Release Candidate. This offer will go away after the RTM release.

Another source of confusion seems to be the community license. I wanted to clarify that you can absolutely use the community license for production usage, including using features such as high availability and running in a cluster.

So grab a license, or just grab the bits and run with them. But most importantly, grab the book (https://github.com/ravendb/book/releases) and let me know what you think.

Nov 01 2017

The bare minimum a distributed system developer should know about: Transport level security

time to read 6 min | 1182 words

Transport level security, also known as TLS or SSL, is a way to secure a connection from prying eyes. This is done using math, so we know that this is good. I can also count to 20 if I take my shoes off, so you know you can trust me on that.

In a very slightly more serious mode, SSL/TLS gives us one half of the security requirements. We can negotiate a secure cipher and a key and rest assured that no outside source can listen to what we are saying to the other side.

Notice that I said one half? The other half is knowing who is on the other side. This is usually done using certificates, which provide the public / private keys for the connection, and the signer of the certificate is what provides the identity of the remote connection. In other words, when I’m using SSL/TLS, I also need to know who I am going to be talking to, and then verify in some manner, using the certificate that they provide me, that they are indeed who they say they are.

Let us deconstruct the simplest of operations, GET https://my-awesome-service:

First, we need to find the IP of my-awesome-service.
Then, we negotiate a secured connection with this IP.
Profit?

This would seem like the end of things, but we need to dig a bit deeper. I’m contacting my-awesome-service, but before I can do that, I need to first check what IP maps to that name. To do that I need to do a DNS query. DNS is usually unsecured, so anyone can see what kind of host names you are asking for. What is more interesting, there is absolutely nothing that prevents a Bad Guy from spoofing DNS responses to you. In fact, this has been a very fruitful area of attacks.

There is DNSSEC, which will protect you from forged responses in the last mile, but less than 15% of the world wide records are actually signed using DNSSEC, so you can usually assume that you won’t be using that. In fact, even if the domain is signed, because so many domains aren’t, most systems will be configured to assume that an unsigned response is valid by default, instead of the other way around. This makes things fun in security circles, I’m sure. But for our purposes, you should know that DNS is great, but you probably shouldn’t rely on it. Errors, mistakes and outright forgery are possible.

If you want to see a simple example, head over to “/etc/hosts” on Linux or “%windir%/system32/drivers/etc/hosts” on Windows and add some fake entries there. You can have fun with pointing stackoverflow.com to lmgtfy.com, for example.

You can do it like so:

54.243.173.79 stackoverflow.com

With 54.243.173.79 being the IP address of lmgtfy.com. Once you have done that, requests that you think are going to stackoverflow.com will be sent to lmgtfy.com, with hilarity soon to follow.
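If you want to see what the resolver effectively ends up with, here is a small sketch that parses hosts-file text into a name to IP mapping. The function name is made up, and keeping the first entry for a repeated name is an assumption about resolver behavior rather than something I checked on every platform.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Turn hosts-file text into a {hostname: ip} mapping."""
    mapping = {}
    for line in text.splitlines():
        # Everything after '#' is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Assumption: the first entry for a name wins.
            mapping.setdefault(name.lower(), ip)
    return mapping


overrides = parse_hosts("""
# local overrides
54.243.173.79 stackoverflow.com
127.0.0.1     localhost
""")
print(overrides["stackoverflow.com"])
```

Nothing in that file says whether 54.243.173.79 really belongs to stackoverflow.com, which is the whole point: name resolution by itself proves nothing about identity.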

Oh, except that this won’t work. StackOverflow is using HTTPS, and they are also using HSTS (HTTP Strict Transport Security). Basically, this means that once you have visited StackOverflow even once, your browser will remember that this domain requires HTTPS to work, and will outright refuse to access the site without it.
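The "browser will remember" part is driven by the `Strict-Transport-Security` response header. Below is a simplified sketch of parsing it and deciding whether a remembered policy still applies. The helper names and the elapsed-seconds bookkeeping are mine, and real browsers handle more edge cases (preload lists, malformed values) than this does.

```python
def parse_hsts(header: str):
    """Parse a Strict-Transport-Security value into (max_age, include_subdomains)."""
    max_age = None
    include_subdomains = False
    for directive in header.split(";"):
        name, _, value = directive.strip().partition("=")
        name = name.strip().lower()
        if name == "max-age":
            max_age = int(value.strip().strip('"'))
        elif name == "includesubdomains":
            include_subdomains = True
    return max_age, include_subdomains


def must_use_https(max_age, seconds_since_seen: int) -> bool:
    """True while a previously seen policy has not expired yet."""
    return max_age is not None and 0 <= seconds_since_seen < max_age


max_age, include_subdomains = parse_hsts("max-age=31536000; includeSubDomains")
```

With a `max-age` of a year, a single visit is enough for the browser to refuse plain HTTP for that site for the next twelve months.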

But what is the problem? HSTS is great, but it just requires HTTPS. So if I managed to spoof your DNS somehow (if I could modify the hosts file, I’m already admin and own the box, but assume that I haven’t gotten there), all I would really need to do is to make sure that the websites I spoof give you a certificate. But here the second half of SSL comes into play. The client making the request is going to validate that the hostname it provided is present in the certificate that the server provided. So far, that makes sense. But the server could just generate whatever certificate it wants, no?

In order to prevent that, there is a chain of trust. Basically, you need to have a list of root certificates that you trust, and you verify that the certificate that you got from the remote server was directly (or indirectly, in some cases) signed by one of them, presumably after some level of verification. Reading the actual list of trusted roots is interesting.
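To show the shape of that check, here is a deliberately toy model that only follows issuer names from the server's certificate up to a trusted root. It does NOT verify any signatures, validity dates or extensions, which real validation must do; the `Cert` type and every name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str  # who the certificate is for
    issuer: str   # who signed it


def chain_to_trusted_root(leaf, intermediates, trusted_roots, max_depth=8):
    """Follow issuer links from the leaf; return the path of names, or None."""
    by_subject = {cert.subject: cert for cert in intermediates}
    path = [leaf.subject]
    current = leaf
    for _ in range(max_depth):
        if current.issuer in trusted_roots:
            return path + [current.issuer]   # signed directly or indirectly by a root
        parent = by_subject.get(current.issuer)
        if parent is None:
            return None                      # nobody we know vouches for this issuer
        path.append(parent.subject)
        current = parent
    return None                              # too long or circular: give up


leaf = Cert("my-awesome-service", "Example Intermediate CA")
intermediates = [Cert("Example Intermediate CA", "Example Root CA")]
print(chain_to_trusted_root(leaf, intermediates, {"Example Root CA"}))
```

A certificate the server generated for itself has its own name as issuer, so the walk ends without ever reaching a trusted root and the function returns None.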

The Mozilla list has about 160 root certificates and includes such entities as the Government of Turkey, where all journalists will tell you that the government is free & fair (all those who would say otherwise are not there). On my Windows machine, there are about 50 root certificates, and at least at one point that included Equifax, who we know can be trusted. On a work machine, you can be fairly certain that there are additional root certificates (from the domain, for example). But for now, we’ll ignore the possibility of a bad trusted root certificate and assume that the system is working as it is meant to. And to be fair, any violations are punished by revocation of the root certificate. This is the current state of the Equifax root certificate on my machine, for example: it has been revoked.

Another mitigation here is that there is an ongoing process to encourage certificate issuance transparency. That means that a domain can specify which CAs are allowed to issue certificates for it (called key pinning). Of course, this is distributed via DNS, and we have already seen that this ain’t too hot either, but it is a matter of defense in depth. Key pinning also creates some fun ransomware options. If I can get control over your DNS records in some manner, including by spoofing them, I can set key pinning to a key that only I have, resulting in a large number of users being unable to access your site because it is not using the “correct” key. But I’m digressing. There is also the notion that a browser can use something called OCSP (Online Certificate Status Protocol), which basically means that a user can query the CA about whether the certificate is valid. The catch is that if the CA doesn’t answer (vs. answering that the cert is invalid), the certificate is assumed to be valid. This is done because a CA going down might otherwise take down significant parts of the internet, leaving aside such concerns as the latency that this check adds.
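The OCSP catch is easier to see as code. This is a sketch of the decision only, with a made-up `accept_certificate` function and no actual OCSP request. Treating an "unknown" answer as a rejection is my assumption; clients differ on that.

```python
from enum import Enum
from typing import Optional


class OcspStatus(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


def accept_certificate(response: Optional[OcspStatus], hard_fail: bool = False) -> bool:
    """Decide whether to keep going; response is None when the CA did not answer."""
    if response is None:
        # Soft fail: no answer is treated as "fine" unless hard_fail is requested.
        return not hard_fail
    return response is OcspStatus.GOOD


print(accept_certificate(None))                # CA is down or blocked
print(accept_certificate(OcspStatus.REVOKED))  # CA answered: revoked
```

Note what this means for an attacker who can already sit on the network: blocking the OCSP request is enough to make a revoked certificate look acceptable under the soft-fail default.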

If you think the notion of a rogue trusted root is fantasy, there have been multiple cases of false certificates (DigiNotar, Symantec, TrustWave, etc.), each with hundreds of certificates being issued (or even blank check certificates, which can be used to generate any certificate you wish for). To combat that, there is now an effort to implement Certificate Transparency. Basically, in order to trust a certificate, it must show up in a public list. That allows admins to check that no one issued certificates for their domains.

This post has gotten quite long, so I’ll leave you with this worrisome ending and continue talking about how this applies to distributed systems in the next post.

Oren Eini

CEO of RavenDB
