Optimizing facets query performance in Corax
RavenDB allows you to query your data freely and cheaply. It is one of those things that makes or breaks a database, after all.
After over a decade of working with Lucene as our backend indexing engine, we built Corax, a new querying & indexing engine that offers far better performance.
Building an indexing engine is a humongous task. It took us close to ten years from the first line of code to Corax actually shipping. But I’m really happy with the way it turned out.
Building a query engine is a big task, and we focused primarily on making the most common queries fast. The issue at hand is that RavenDB has many features, and we don’t have infinite time. So for the less common features, we typically implemented them as a straightforward port from whatever Lucene is doing.
One such feature is facets. Let’s say that I want to buy a jacket. There are way too many choices, so I can use a faceted query to help me narrow it down. Here is what this looks like in code:
from Products
where search(Description, "suit jacket")
select facet(Brand),
facet(Price < 200,
Price between 200 and 400,
Price between 400 and 800,
Price > 800)
And here is what this looks like as a website:
I mentioned that we implemented some features as a straightforward port from Lucene, right?
We did that because RavenDB offers very rich querying semantics, and we couldn’t spend the time to craft every single bit upfront. The idea was that we would get Corax out the door and be faster in most common scenarios, and at least at parity with everything else.
It works for most scenarios, but not all of them. We recently got a query similar to the one above that was slower in Corax than in Lucene. That is usually good news since we have far more optimization opportunities in Corax. Lucene (and especially our usage of it) has already been through the wringer so many times that it is really hard to eke out any more meaningful performance gains. Corax’s architecture, on the other hand, gives us many more chances to do so.
In the case of facets, the way Lucene handles that is roughly similar to this:
def brand_facet(matches: List[int]):
facet = dict()
for term, docsForTerm in reader.terms("Brand"):
facet[term] = count_intersect(matches,docsForTerm)
Given the results of the query, run over all the terms for a particular field and intersect the documents for every term with the matches for the query. Lucene is able to do that efficiently because it materializes all its data into managed memory. That has costs associated with it:
- Higher managed memory usage (and associated GC costs)
- Slower initial queries
The benefit of this approach is that many operations are simple, which is great. Corax, on the other hand, does not materialize all its data into managed memory. It uses persistent data structures on disk (leading to reduced memory usage and faster responses on the first query).
The advantage we have with Corax is that the architecture allows us to optimize a lot more deeply. In this case, however, it turned out to be unnecessary, as we are already keeping track of all the relevant information. We just needed to re-implement faceted search in a Corax-native manner.
You can see the changes here. But here is the summary. For a dataset with 10,000,000 records, with hundreds of brands to facet on, we get:
Yes, that isn’t a mistake. Corax is so fast here that you can barely observe it 🙂.
Comments
this seems to be not real, how can it be? what's the secret sauce? i don't understand where this performance gain came from, memory? sorted-data stucture? SIMD instructions? something here looks like too good to be true
Lines changed in the PR: +783 −104. Stuff that you can achieve only if you abstract things right and keep in mind how the data you operate on are structured. Well done 💪
Just log in and begin the show or use free porn cams.
Крипто босс: официальный сайт для всех любителей криптовалют http://islamdini.kg/index.php?subaction=userinfo&user=itewyhog
Comment preview
Join the conversation...