How to gracefully handle cache expiration

I received a few inquiries asking for a technical rundown of how my Memcached wrapper handles the expiration of keys.

When I first read about this concept, it was pretty hard to understand since most of the sources were way too technical (the fact that English is my second language probably did not help either) for somebody who had just entered the world of ‘holy crap! you can store stuff in memory’.

I will try my best to explain the problem with naive caching and one of the ways you can avoid it, without using too much technical jargon.

Concept of Caching

To a less experienced developer, caching is viewed as a tool that solves all performance problems. When Memcache was first introduced, developers would simply wrap a chunk of code in a block that does the following:

  1. Look up a key from cache
  2. If key does not exist in cache, call a random method to generate data that needs to be associated with that key 
  3. Store that data under the key in cache

And when the same code block was executed again, it would find the data associated with that key within the cache pool and hand it back… extremely fast.
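In PHP terms (the wrapper I talk about later sits on top of the Memcached extension), that naive block usually looks something like the sketch below; fetch_articles() here is just a made-up stand-in for whatever slow code you are hiding:

```php
<?php
// Naive cache-aside: look up a key, rebuild and store it on a miss.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$articles = $cache->get('articles');
if ($articles === false) {
    // Cache miss: run the slow code and keep the result for 10 minutes
    $articles = fetch_articles();
    $cache->set('articles', $articles, 600);
}
```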

To a developer, that’s a wow factor by itself. I remember how impressed I was when I implemented something like this for the first time.

Such a reaction is enough to simply put your tools down and call it a job well done. You can go home, celebrate your achievement and dream about handling thousands of users at any given time, because this caching thing is awesome.

Rain on Your Parade

While you are still thinking that you have solved all of the scalability problems you will ever have, your website gets a tsunami of new visitors from a TV PR campaign your marketing team launched.

Everything looks great, you are smirking at how well the cache is working.

Then the site goes down.

You scramble to find a solution; it can’t be the caching you just implemented. It’s just too fast to fail.

Usually (from what I’ve seen) people will point fingers at the database or whatever complex/slow pieces of code you were hiding behind the cache.

(And while the database might be the slowest part of your application, that’s not the reason why your web site went down.)

Since the cache is so fast, and your logs are telling you that your database simply died as soon as it started to see a little bit of traffic, you naturally assume that’s the main issue and that it should be addressed.

So in a moment of panic you scream for more database slaves, better tuning of your cluster and perhaps a crazy last-second sharding implementation.

That works out great, you are back in business and things are looking up.

Paradise in a Desert

Just like a paradise you might discover after walking under a burning sun in a desert, the solution you put in place to prevent another downtime due to a massive amount of traffic was simply an illusion.

The web site will still go down under the same conditions (unless of course you invested in a small data center that hosts dozens of database clusters, in which case it’s debatable).

The reason for such mistakes is overlooking the fact that once you cache something that takes more than x seconds to execute, it will eventually expire from cache.

Either from having a short time to live (TTL) or by being pushed out of the caching pool to make space for fresher data.

Breaking it Down

Let’s say you have 35 queries on your web site that you put behind a caching layer. You request the page and it flies. Absolutely no issues. 

Even when the cache expires and you grab the page, it still loads pretty damn fast; you can add 10 more concurrent visitors and there will be no problems. The database picks it all up easily and puts it back in the cache.

Now multiply those 10 concurrent requests by, let’s say, 20. As long as all of the queries are cached, the cache pool can process 200 requests without any issues.

But once the data in the cache expires, all of a sudden all 200 requests to the cache pool come back with ‘resource not found’ and every one of them is sent directly to the database at once.

And then the alarms go off.

The chain reaction is usually something along these lines:

  1. The database becomes overloaded and clients stall waiting for data to be returned from it.
  2. As the clients wait for a possibly dead database, the HTTP server keeps them in its pool while attempting to serve new clients with whatever resources it has left.
  3. The HTTP server runs out of resources, since it can’t process clients as fast as it usually does because everything is being used up by requests that are waiting on the database.

That sucks, right? If only there was a way to prevent this from happening.

Caching Just Got Smarter

One of the approaches to this problem is fairly simple. The concept is to wrap the original resource you are caching in an array that also carries a timestamp, set to a time that is just a few minutes short of when the item is actually due to expire.

And when your application unpacks the cached resource, it will check that timestamp; if the current time is greater, it means the item it just retrieved is going to expire relatively soon.

Once it knows the item is about to expire, it will update the cached record with a new one that contains exactly the same data it just received, but with a longer expiration time.

Essentially telling anybody else who is pulling the data that the item is not going to expire any time soon.

Then you simply lie to your application and tell it that this request did not get anything back from the cache. It will now execute the database query you were caching and save the result back into the cache with a new expiration date.

All of the new requests will now have an updated version of cached data.

While it’s possible that more than one request will slip through the floodgates, this is usually really rare. When I tested with 200 concurrent connections, the key was updated by a single request.

Since you are caching everything, your database and Apache should be pumping out requests fairly fast, so you can probably afford more than a single person slipping through this barrier once in a while without creating an unrecoverable request queue of death.

Walk the Talk

Let’s get a little bit more technical and try to implement this solution in our code. First we need to come up with a time interval that we will subtract from the original expiration date. 

This time interval depends on your caching strategy; as a basic rule of thumb, take the median expiration interval of the most important keys in your cache, the ones that represent data from really slow database queries.

If that number is, let’s say, an hour, and you can guarantee that on average, during a relatively busy day, you get more than one request every 10 minutes, it’s safe to set the interval to 10 minutes.

If you have cached data that needs to be updated at a faster rate, always make sure that there is guaranteed to be at least one request between the fake expiration time and the original expiration time (i.e. your safety net).

So, since we now have a number in mind, let’s write a simple wrapper for the Memcached extension so we can override the set() method and wrap the resource we are caching in an array containing our ‘fake’ expiration date:
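Something along these lines will do; the class name, the DELAY constant and the array keys below are placeholders I picked for this sketch, and it assumes the expiration is always passed in as a non-zero TTL in seconds rather than a unix timestamp:

```php
<?php
class Cache extends Memcached
{
    /** Seconds we shave off to create the 'fake' expiration date (10 minutes) */
    const DELAY = 600;

    public function set($key, $value, $expiration = 0)
    {
        // Store the wrapped array under the real expiration time
        return parent::set($key, $this->wrap($value, $expiration), $expiration);
    }

    /**
     * Wrap the resource in an array that carries a timestamp set
     * DELAY seconds short of the real expiration time.
     */
    protected function wrap($value, $expiration)
    {
        return array(
            'expires' => time() + $expiration - self::DELAY,
            'data'    => $value,
        );
    }
}
```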

As you can see, we simply intercept the set() method of the original extension so we can call the wrap() method on the resource we are caching. In turn, that method takes the original expiration time we are attempting to set and subtracts 10 minutes from it prior to adding it to our array.

Now we need to intercept the get() method so we can unwrap the previously wrapped data and check the fake expiration date we set, in order to determine whether we should pretend the result is no longer cached.

To do so, let’s add the following methods to our wrapper:
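A sketch of those methods, following the same assumptions as the previous snippet (they belong inside the same Cache class):

```php
// These methods go inside the Cache class sketched above.

public function get($key, $cache_cb = null, $flags = 0)
{
    return $this->unwrap($key, parent::get($key));
}

/**
 * Check the fake expiration date of a wrapped item. If it is about to
 * expire, push a copy with a longer lifetime back into the pool and
 * report a miss so this single request regenerates the data.
 */
protected function unwrap($key, $wrapped)
{
    // Misses (and anything we never wrapped) pass straight through
    if (!is_array($wrapped) || !array_key_exists('expires', $wrapped)) {
        return $wrapped;
    }

    if ($wrapped['expires'] <= time()) {
        // Everybody else now sees a fresh-looking copy...
        parent::set($key, $this->wrap($wrapped['data'], self::DELAY * 2), self::DELAY * 2);

        // ...while this request is told there was nothing in the cache
        return false;
    }

    return $wrapped['data'];
}
```

From the calling code’s point of view nothing changes: you still call get(), fall back to the slow query on a miss and set() the fresh result.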

You now have pretty solid protection that stops a random flood of requests from bypassing your caching layer at the same time.

As part of my Memcached wrapper, I included a simple proof of concept script that you can use to test this scenario yourself.
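If you want to put together something similar yourself, a bare-bones test page could look roughly like the following (the file names, the key and the sleep() stand-in for a slow query are all made up for illustration); point ab at it with 200 concurrent requests (ab -n 200 -c 200 http://localhost/test.php), once with the wrapper and once with the stock Memcached class:

```php
<?php
// test.php: a bare-bones page to exercise the wrapper under concurrency.
require 'Cache.php'; // the wrapper sketched above

$cache = new Cache();
$cache->addServer('127.0.0.1', 11211);

$data = $cache->get('expensive-report');
if ($data === false) {
    // Only the request(s) that actually miss should end up here
    error_log('rebuilding expensive-report');
    sleep(2); // stand-in for a slow database query
    $data = array('generated_at' => time());
    $cache->set('expensive-report', $data, 3600);
}

echo json_encode($data);
```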

When I benchmarked the script in question with 200 concurrent requests, the results did all the talking:

Using the technique we implemented, only a single request out of 200 got through to query the database and update the cache.

Using the raw get/set methods, Apache could no longer keep up with the requests.

Fin.
