This blog recently got a nice new feature, a recommended reading section (you can find the one for this blog post at the bottom of the text). From a visual perspective, it isn’t much. Here is what it looks like for the RavenDB 7.1 release announcement:
At least, that is what it shows right now. The beauty of the feature is that this isn’t a list that was computed once and is now done; it is a much bigger feature than that. Let me explain it in detail, so you can see why I’m excited about it.
What you are actually seeing here is me using several different new features in RavenDB to achieve something that is really quite nice. We have an embedding generation task that automatically processes the blog posts whenever I post or update them.
Here is what the configuration of that looks like:
We are generating embeddings for the Posts’ Body field and stripping out all the HTML, so we are left with just the content. We do that in chunks of 2K tokens each (because I have some very long blog posts).
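To make concrete what the embedding generation task takes off our hands at this step, here is a minimal sketch of the strip-and-chunk preprocessing. It is not RavenDB's implementation: `ChunkingSketch` and its methods are hypothetical names, the regex is a stand-in for real HTML handling, and treating one whitespace-separated word as one token is an assumption, since real tokenizers are model-specific.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only (not part of RavenDB).
public static class ChunkingSketch
{
    // Remove tags, decode entities, and collapse whitespace.
    // A regex is enough for a sketch; it is not a full HTML parser.
    public static string StripHtml(string html)
    {
        var noTags = Regex.Replace(html, "<[^>]+>", " ");
        var decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Split text into chunks of at most maxTokens "tokens".
    // Assumption: one whitespace-separated word counts as one token.
    public static List<string> Chunk(string text, int maxTokens)
    {
        if (maxTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxTokens));

        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        for (var i = 0; i < words.Length; i += maxTokens)
        {
            var count = Math.Min(maxTokens, words.Length - i);
            chunks.Add(string.Join(" ", words, i, count));
        }
        return chunks;
    }
}
```

Each chunk would then be sent to the embedding model on its own, which is why a very long post ends up with several vectors rather than one.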
The reason we want to generate those embeddings is that we can then run vector searches for semantic similarity. This is handled using a vector search index, defined like this:
public class Posts_ByVector : AbstractIndexCreationTask<Post>
{
    public Posts_ByVector()
    {
        SearchEngineType = SearchEngineType.Corax;
        Map = posts =>
            from post in posts
            where post.PublishAt != null
            select new
            {
                Vector = LoadVector("Body", "posts-by-vector"),
                PublishAt = post.PublishAt,
            };
    }
}
This index uses the vectors generated by the previously defined embedding generation task. With this setup complete, we are now left with writing the query:
var related = RavenSession.Query<Posts_ByVector.Query, Posts_ByVector>()
    .Where(p => p.PublishAt < DateTimeOffset.Now.AsMinutes())
    .VectorSearch(x => x.WithField(p => p.Vector), x => x.ForDocument(post.Id))
    .Take(3)
    .Skip(1) // skip the current post, always the best match :-)
    .Select(p => new PostReference { Id = p.Id, Title = p.Title })
    .ToList();
What you see here is a query that will fetch all the posts that were already published (so it won’t pick up future posts), and use vector search to match the current blog post embeddings to the embeddings of all the other posts.
In other words, we are doing a “find me all posts that are similar to this one”, but we use the embedding model’s notion of what is similar. As you can see above, even this very simple implementation gives us a really good result with almost no work.
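For intuition about what “similar” means here, the following is a brute-force sketch of similarity ranking over embeddings. It is an illustration under stated assumptions, not how RavenDB executes the query: `SimilaritySketch` and its methods are hypothetical names, cosine similarity is assumed as the metric, each post is assumed to have a single vector (a chunked post would have several), and a real vector index avoids scanning every vector the way this does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only (not part of RavenDB).
public static class SimilaritySketch
{
    // Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones.
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0)
            return 0; // treat a zero vector as similar to nothing
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Rank every other post by similarity to the current one and keep the best matches.
    // Assumption: one embedding per post id.
    public static List<string> MostSimilar(
        string currentId,
        IReadOnlyDictionary<string, float[]> embeddings,
        int take)
    {
        var current = embeddings[currentId];
        return embeddings
            .Where(kvp => kvp.Key != currentId) // same role as skipping the best match in the query
            .OrderByDescending(kvp => Cosine(current, kvp.Value))
            .Take(take)
            .Select(kvp => kvp.Key)
            .ToList();
    }
}
```

Excluding the current post by id plays the same role as skipping the first result in the query: a post is always the closest match to itself.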
- The embedding generation task is in charge of generating the embeddings - we get automatic embedding updates whenever a post is created or updated.
- The vector index will pick up any new vectors created for those posts and index them.
- The query doesn’t even need to load or generate any embeddings; everything happens directly inside the database.
- A new post that is relevant to old content will show up automatically in the older posts’ recommendations.
Beyond just the feature itself, I want to bring your attention to the fact that we are now done. In most other systems, you’d now need to deal with chunking and rate limits yourself, then figure out how to deal with updates and new posts (I asked an AI model how to deal with that, and it started to write a Kafka architecture to process it; I noped out fast), handle caching to avoid repeated expensive model calls, etc.
In my eyes, beyond the actual feature itself, the beauty is in all the code that isn’t there. All of those capabilities are already in the box in RavenDB - the new part is just that we have now applied them to my blog. Hopefully, it is an interesting feature, and you should be able to see some good additional recommendations right below this text for further reading.