UX Tricks: perceived responsiveness for high-latency operations / failure modes

time to read 4 min | 725 words

We recently had to deal with an interesting problem in an unreliable application. To make it simple, I'll use this blog as an example, since that is much easier.

Let us assume that you are reading this post, and have a sudden desire to post a comment. Now, you can do so, we'll refresh the page, and you can see your new comment. So far, pretty simple and easy. That is the way we have been writing web applications since the nineties.

However, let us say that for whatever reason, we decided to put the actual content of the page inside a CDN, like Amazon CloudFront. The minimum amount of time we can store a page in a CDN is 5 minutes or so. The reason we want to use a CDN is so users won't be impacted when the site is down for whatever reason.

Now, there are two types of user interactions in the blog. There is me, writing articles and posting them. And then there is the reader, who reads the post and then sends a scathing comment about it. The software in question suffers from occasional outages, so we needed a way to cover that without the users noticing. In this case, we divided the operations into two categories: rare operations, which require the software to be fully operational (in the example above, posting a new blog post), and common ones, such as a user posting a comment, which should work even in reduced availability mode.

The solution was to utilize the CDN both for caching and for reliability. The blog post itself is going to be hosted on a CDN, so if the core software fails, the user doesn't see that. But what about the actual interactions? As it turns out, while we want the user to think that everything is fine, we are actually okay with delaying some things. In this case, the perceived operation is actually more important than the actual operation.

Let us consider the standard system behavior, when there is no failure. The user goes to a particular page and gets the cached results from the CDN. At the same time, we contact the live site to try to get anything that happened after the cached version was generated. In the blog post example, the Ajax call would be something like "give me all the comments after 123". This takes us halfway there: even though we use a CDN, we now have all the updated data when the user loads the page.
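To make that concrete, here is a minimal sketch of that delta query and the client-side merge. The function names and the comment shape are my assumptions for illustration, not the actual code:

```javascript
// Sketch of the "give me all the comments after 123" delta logic.
// Comments are assumed to carry monotonically increasing numeric ids.
function commentsAfter(comments, lastSeenId) {
  return comments.filter((c) => c.id > lastSeenId);
}

// On page load, the client merges the CDN-cached comments with whatever
// the live site returns, skipping anything the cached page already has.
function mergeComments(cached, fresh) {
  const seen = new Set(cached.map((c) => c.id));
  return cached.concat(fresh.filter((c) => !seen.has(c.id)));
}
```

If the live site is unreachable, the merge simply runs with an empty `fresh` array and the reader sees the cached page, nothing more.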

Except that we can take it further. Instead of doing an Ajax call, we can open a web socket, and ask the server "stream me everything after comment 123". As new comments come in, they will show up on the page automatically. If there is an outage, the user is going to see the latest version cached in the CDN, and won't be aware that they aren't getting the new comments. When the site is up again, we'll have all the new comments show up, just like magic.
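A rough sketch of what that streaming subscription might look like, with the socket factory injected so the page can pass the real `WebSocket` constructor. The URL and the message format are assumptions:

```javascript
// Sketch: stream comments newer than the last cached one over a web socket.
// "connect" is whatever creates the socket (e.g. (url) => new WebSocket(url)).
function streamCommentsAfter(lastSeenId, connect, onComment) {
  const socket = connect(`/comments/stream?after=${lastSeenId}`);
  socket.onmessage = (event) => {
    const comment = JSON.parse(event.data);
    if (comment.id > lastSeenId) {
      lastSeenId = comment.id;
      onComment(comment); // append the new comment to the page
    }
  };
  // If the site goes down, retry later; in the meantime the reader still
  // sees the CDN-cached version and is none the wiser.
  socket.onclose = () =>
    setTimeout(() => streamCommentsAfter(lastSeenId, connect, onComment), 5000);
  return socket;
}
```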

This takes care of the user's reading habits. But what happens if the user is actually commenting? In the good case, the system is up and running, accepts the new comment, distributes it to all the web sockets listening to the page's comments, and everything is good. But what happens in the failure case?

In this case, we are going to accept the user's comment and write it to local storage. From local storage, we will display it to the user, so as far as the user is concerned, the comment was successfully posted. Naturally, we need to retry in the background, and eventually we'll send the comment to the server successfully.
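Here is a sketch of that local-storage-backed retry queue, with the storage and the send function injected so it stays testable. All the names here are assumptions, not the real code:

```javascript
// Optimistic comment path: show the comment immediately, persist it
// locally, and keep flushing it to the server in the background.
function makeCommentQueue(storage, send) {
  const KEY = "pending-comments";
  const load = () => JSON.parse(storage.getItem(KEY) || "[]");
  const save = (items) => storage.setItem(KEY, JSON.stringify(items));

  return {
    post(comment) {
      save(load().concat(comment)); // survives refreshes and outages
      return comment; // render this locally right away
    },
    async flush() {
      const remaining = [];
      for (const comment of load()) {
        try {
          await send(comment); // e.g. POST to the live site
        } catch {
          remaining.push(comment); // still down, keep it for next time
        }
      }
      save(remaining);
    },
  };
}
```

In the browser, `storage` would be `window.localStorage` and `flush` would run on a timer; the queue survives a page refresh, so the user keeps seeing their own comment even while the site is down.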

And yes, this is only relevant for cases where we can actually afford to have the user believe that an operation has been successful when it hasn't been. But for stuff like comments, this is a really good system to make sure that everyone has a good user experience.

Oh, and this also has the benefit of giving us an easy way to handle abuse. You can avoid sending the comment to the server, but still show it to the user. This way, the spammers think that they have successfully posted, but they are just spamming themselves.
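That routing decision is tiny; here is a sketch, where `looksLikeSpam` is a hypothetical predicate standing in for whatever abuse detection you actually run:

```javascript
// Everyone sees their own comment immediately, but only comments that
// pass the spam check ever leave the browser.
function routeComment(comment, looksLikeSpam, sendToServer, showOnPage) {
  showOnPage(comment); // the spammer gets the same optimistic UI as everyone
  if (!looksLikeSpam(comment)) {
    sendToServer(comment);
  }
}
```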