How Drupal's cron is killing you in your sleep + a simple cache warmer

A lot of what's written about performance tuning for Drupal is focused on large sites, and benchmarking is often done by requesting the same page over and over in an attempt to maximize the number of requests per second (à la ab). Unfortunately, this differs from the real world in two key ways:

  1. Most of our projects aren't regularly driving traffic at millions of hits per day.
  2. Our users request a lot of different pages.

In this post I'll model a site with 500 nodes, and 1200 hits per day. That's fewer than 1 request per minute, yet for many small businesses this would be a fairly healthy traffic flow. In this case, it might at first seem that high performance doesn't matter. A clever hacker could probably manage to install Linux on the office coffee maker and get acceptable HTTP throughput. However, latency still matters a great deal, even for small sites:

User impatience is measured in units of 1 tenth of a second starting at 200 milliseconds or so.

Drupal's page cache is capable of delivering blazing fast response times, but only when the cache is warm. And the reality for most small to mid-sized sites is that the page cache is being cleared out far quicker than it's regenerated.

[Figure: hit rate]

[Figure: histogram]

These graphs show what happens to a site that runs cron.php every six hours. Remember, system_cron() calls cache_clear_all(), and every time the cache is cleared the hit rate crashes to zero. The traffic follows a Pareto distribution: a few pages are very popular with a "long tail" of less common requests. Initially the hit rate jumps back up as the popular pages get cached. But the "long tail" pages never really gain an acceptable hit rate before it starts all over again. To make matters worse, I've seen site operators that run cron far more often – imagine if cron were run every fifteen minutes; there would be almost no page caching at all. So what can be done?
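To get a rough feel for this effect, here is a minimal sketch. It is not the script that produced the graphs. It assumes that every cron run empties the page cache, that requests arrive evenly spaced, and that page popularity follows a Zipf law (a discrete stand-in for the Pareto distribution) over the 500 nodes and 1200 hits per day of the model site. The function name simulate_hit_rate and the fixed seed are made up for illustration.

```shell
#!/bin/sh
# simulate_hit_rate CRON_HOURS [DAYS]
# Prints the overall page cache hit rate (0..1) for a toy site.
# Assumptions (not from Drupal itself): 500 pages, 1200 hits per day,
# Zipf-distributed popularity, and a full cache wipe on every cron run.
simulate_hit_rate() {
  awk -v interval_hours="$1" -v days="${2:-8}" 'BEGIN {
    srand(42)                       # fixed seed so repeated runs agree
    pages = 500
    hits_per_day = 1200
    # Cumulative Zipf weights: page k is requested in proportion to 1/k.
    total = 0
    for (k = 1; k <= pages; k++) { total += 1 / k; cum[k] = total }
    per_interval = int(hits_per_day * interval_hours / 24)
    if (per_interval < 1) per_interval = 1
    requests = days * hits_per_day
    hits = 0
    for (r = 0; r < requests; r++) {
      if (r % per_interval == 0) split("", cached)   # cron run: empty the cache
      x = rand() * total
      # Binary search for the first page whose cumulative weight reaches x.
      lo = 1; hi = pages
      while (lo < hi) {
        mid = int((lo + hi) / 2)
        if (cum[mid] >= x) hi = mid; else lo = mid + 1
      }
      if (lo in cached) hits++; else cached[lo] = 1
    }
    printf "%.3f\n", hits / requests
  }'
}

# Usage: compare a six-hour cron interval with a daily one.
#   simulate_hit_rate 6
#   simulate_hit_rate 24
```

The exact numbers depend on the awk implementation's random number generator, so treat the output as a trend rather than a measurement.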

Run cron.php less often

Somewhere between six hours and one day is likely adequate for most sites. If you have tasks that need to run more often (such as notifications), consider breaking up your cron runs with Elysia cron or perhaps a drush script. Still, as the graph shows, it can take a long time for the page cache to kick in no matter what the frequency. Furthermore, the cache is also cleared during other operations such as node edits or comment posting.

Run a crawler

Almost every smart Drupal developer I've discussed this problem with has the same answer: run a crawler. I'm not thrilled about this solution because in some ways it seems very inefficient, but some testing shows the impact can be minimal. In my tests 500 nodes took less than a minute to regenerate from a cold cache, and less than 5 seconds when fetching from the cache. This assumes you have the XML Sitemap module installed. While not required, a sitemap certainly makes wget's job easier.

wget --quiet http://example.com/sitemap.xml --output-document - |\
perl -n -e 'print if s#</?loc>##g' | wget -q --delete-after -i -

Invent a preemptive cache handler

It seems the ideal cache handler would regenerate cache entries before they expire. Sadly, as far as I know, this doesn't exist yet for Drupal 6. (There's Pressflow Preempt for D5, but it seems it never made it to D6.) If you have ideas about this please let me know!

Comments

But cache_clear_all does a

But cache_clear_all does a check. If NULL is passed as the $cid then it does a check on some variables. It makes sure that cache_clear_all is only run on the tables once per "period" defined by the cache_lifetime settings.

Check your site's admin/settings/performance. Is the minimum cache lifetime set to "none"?

But even if it is set to none, it only deletes items which are expired...

Boost for 6.x is the

Boost for 6.x is the answer!
It has a crawler built in so your site will never have an expired page. It can regenerate the cache before it expires. It solves all the issues you brought up. I can have a cache that lasts several weeks using Boost.

Also consider setting your

Also consider setting your 'Minimum cache lifetime' setting in admin/settings/performance. That way if you run cron every hour it doesn't have to clear your caches every single run.

dylan's picture

Minimum cache lifetime isn't

Minimum cache lifetime isn't much help for small sites. What it's really good for is preventing the caches from being cleared too often in cases where you have large numbers of writes. (As in thousands of comments per day). In the site charted above, one would need to set the lifetime to 12 hours or more to make a significant difference, and even then the caches will still expire eventually.

Regarding Boost: I have only tinkered with it, so I can't comment in detail on how it handles cache expiration. The built-in crawler is a good idea. So far, the thing that has kept me off Boost is that it's not actually a cache handler – so it's completely unaware of cache clear events. (Correction: see mikeytown2's comment below.) In a follow-up post, I'll try to compare Boost to cacherouter with the file backend.

Actually Boost is a cache

Actually Boost is a cache handler; it has some of the most tightly integrated cache expiration logic ever seen in any CMS. By default Boost ignores the cache_clear_all cron call, but you can make it honor that call by setting "Ignore cache flushing" to Disabled.

Boost will clear or expire the cached page under the following hooks - Nodeapi; comments; voting api.
In addition to clearing that node it can also clear the views containing the node; cck node reference fields; taxonomy term pages containing that node; and the menu item above it (parent), next to it (siblings), and sub items below it (children).

You can also set custom expiration times by content container (view, node, taxonomy, panels, etc.), content type (page, story, etc.) or by ID (nid, tid, view display, etc.). In short, Boost allows you to spend a lot of time tinkering with the cache to get it exactly how you want it to work. It is very powerful and there is a reason it is used by a lot of sites out there.

dylan's picture

mikeytown2, Thanks for the

mikeytown2,

Thanks for the correction! I was thinking that because of Moshe's comment in this issue, and the fact that it does not include a cache.inc handler, (a.k.a. $conf['cache_inc'] = boost.inc) that it didn't respond to things like node_save().

I will definitely be giving Boost a closer look. I wonder if the project page could be updated to clarify that it is bypassing cache.inc and cache_clear_all() in order to provide more fine-grained control.

Thanks for this post. I have

Thanks for this post. I have my cron set to run every 5 minutes and it never occurred to me that there would be any harm in it. I'm going to investigate the options in the comments.

Thanks again!

Michelle

Thanks for the post Dylan,

Thanks for the post Dylan, and to all who commented! This will definitely help, as a good portion of the sites we build are in the low-medium traffic range, and performance is always a concern.

This is incorrect. This call

This is incorrect. This call is pretty innocuous; it just prunes expired and unused items from the cache tables.

What's called during system_cron() is cache_clear_all(NULL, $table). It is easy to assume that this wipes the whole table. But not so - $cid is NULL in cache_clear_all() and thus empty($cid) is TRUE and thus only expired items get deleted. Follow the (twisted) logic of cache_clear_all and you will see it. The comments for this function are pretty confusing. Here is the relevant section:

function cache_clear_all($cid = NULL, $table = NULL, $wildcard = FALSE) {
  global $user;
 
  if (!isset($cid) && !isset($table)) {
    // Clear the block cache first, so stale data will
    // not end up in the page cache.
    cache_clear_all(NULL, 'cache_block');
    cache_clear_all(NULL, 'cache_page');
    return;
  }
 
  if (empty($cid)) {
    if (variable_get('cache_lifetime', 0)) {
      // We store the time in the current user's $user->cache variable which
      // will be saved into the sessions table by sess_write(). We then
      // simulate that the cache was flushed for this user by not returning
      // cached data that was cached before the timestamp.
      $user->cache = time();
 
      $cache_flush = variable_get('cache_flush_'. $table, 0);
      if ($cache_flush == 0) {
        // This is the first request to clear the cache, start a timer.
        variable_set('cache_flush_'. $table, time());
      }
      else if (time() > ($cache_flush + variable_get('cache_lifetime', 0))) {
        // Clear the cache for everyone, cache_lifetime seconds have
        // passed since the first request to clear the cache.
        db_query("DELETE FROM {". $table ."} WHERE expire != %d AND expire < %d", CACHE_PERMANENT, time());
        variable_set('cache_flush_'. $table, 0);
      }
    }
    else {
      // No minimum cache lifetime, flush all temporary cache entries now.
      db_query("DELETE FROM {". $table ."} WHERE expire != %d AND expire < %d", CACHE_PERMANENT, time());
    }
  }
  // The remaining else branch, which handles a non-empty $cid, is not shown in this excerpt.
}

only if 'minimum cache lifetime' is set

if 'minimum cache lifetime' is none, then cache_page is emptied on every cron run.

that's sensical.

it probably makes more sense (semantically) to set a larger cache_lifetime to prevent cache clears than to run cron less frequently.

Please read page_set_cache source code

firebus, I'm sorry, but you're just plain wrong.
(and so are most folks who think they know how this works)

page_set_cache clearly uses the Magic Constant CACHE_TEMPORARY which does NOT honor cache_lifetime in any way, shape or form.
EVERY page in EVERY D6/D7 site is nuked from the cache EVERY time you run cron.

Period.

The fact that everybody who reads the text on the Performance admin page, and who reads the cache.inc source above gets this wrong should tell us just how non-intuitive this is.

The D6/D7 cache is probably doing wonderful things for performance for various portions of your content/site. But it is most definitely NOT caching the HTML output for the lifetime you select in your Performance page. It *IS* caching it until the next cron run, if you have cache enabled, and that is all it is doing.

PS
Do not take this missive amiss.
I suffered under the same mis-conception for 18 months before I dug into this after endless complaints of our sites being "slow".

I was initially baffled by disbelief that anybody would do such a thing to an HTML cache, and added error_log() statements to verify.

My patch may not pass 'poll' UnitTest, and it may not be suitable for sites relying on the buggy behaviour. But it works fine for me, and loadtests have been running much better for weeks on our QA boxes.
It's also in PRODUCTION for a couple microsites.
One had un-related issues of a directory rename confusing theme registry.

none taken

you're right! i'm shocked. the expire column in cache_page is set to "-1" no matter what the minimum cache lifetime is set to, so cron empties the cache on every run.

moshe is also totally mistaken in his parent comment.

You're misunderstanding what

You're misunderstanding what cache_lifetime does. It doesn't set a minimum TTL for a particular cached item - it determines a minimum amount of time that an entire cache bin should be allowed to continue before it can be wiped by cache_clear_all(). There are plenty of legitimate inefficiencies about the cache system, but this isn't the bug you think it is.

Nobody Understands

I defy any core Drupal developer to correctly document D6/D7 caching.

For starters, the docs and the words on the Performance tab reference the page cache and claim cache_lifetime applies to it.

If you read the source of page_set_cache, you will see that this is patently false.

And, Les, I'm sorry, but you are wrong. The cache_page bin does *NOT* survive for a minimum amount of time in D6/D7. It lives until cron runs, however often you have set that up. Period.

If you want to re-write the docs to clarify that the cache_lifetime applies to the cache_page bin as a whole, feel free to do so. They will still be wrong, because that's not what it DOES, but at least the intent will be more clear.

Richard, it is you who does

Richard, it is you who does not understand. The page cache is not emptied on every cron run unless the minimum cache lifetime has been exceeded. The code is right there in front of you if you just scroll up a bit. It was explained to you 2 years ago by Moshe and still you can't understand it. Saying it repeatedly does not make it true.

Understood

I stand corrected.

It only nukes the cached pages when cache_lifetime has been exceeded, using logic that takes hours to comprehend.

The GUI only lets you put in up to 1 Day lifetime.

Attempting to exceed that by much, in settings.php or with a custom module, will have the CSS and JS nuked out from under the cached page, resulting in an un-styled non-JS-functional site.

I've done it.

It managed to get to production that way.

It was not pretty.

Bottom Line:

The Page Cache has a maximum lifetime of 1 Day + your cron interval (or maybe 2 X your cron interval). Hard to tell from the code.

So the D6/D7 cache is effective only for high-write sites like blogs, forums, news, RSS, comments, etc.

It's completely useless for organizational CMS semi-static sites getting a long-term cache on pages for rare writes many reads.

It's a cache that only fits that "frequent write" use case, rather than an abstract architecture "cache" that can be utilized for multiple use cases.

I kept trying to force it to fit my Use Case, and it simply doesn't.

I'm stuck with the common work-around of the masses: Set up your cron jobs to hit cron.php and then crawl your site to prime the pump on the cache.

I do appreciate all the input, and apologize that the interaction of CACHE_TEMPORARY magic number being abused as an additive to a time-stamp ended up confusing me, but only if you've set a non-zero cache_lifetime. Whew.

and if this *is* the intent, it's not a good intent

what use case does setting a minimum lifetime for a bin solve?

if my goal, in caching, is to improve performance by reducing the number of times a page is built, how does setting a minimum lifetime for the bin solve this? if we're talking about the page cache, the unit you want to address with any minimum lifetime settings is the page.

Intent

I don't think there is a coherent strategy (intent) to caching in D6. Almost every Drupal expert I know is surprised to find out that an HTML document only lasts until the next cron run.
Many mistakenly refer to the source includes/cache.inc and comments and documentation, without spotting that those are all clearly wrong and out of sync with the reality of the page_set_cache function in includes/common.inc.

The notion that the intent is for the lifetime of a BIN as a whole is certainly a new twist. Makes one wonder why the cache_page table would have an expire column for each page, since that's certainly not normalized.
But I cannot say with any certainty what IS the intent, because what the docs say, what the source comments say, and what everybody claims are at such odds with each other.

Almost everybody I've spoken with THOUGHT the idea was for a page to have a minimum lifetime. And that's my interpretation of the wordage in Performance page and the docs and the comments within cache.inc

And it's what I would expect out of a cache to improve performance for a site.

It's quite disappointing that the Drupal community doesn't really have a handle on such a basic feature.

Mikeytown2 - is there some

Mikeytown2 - is there some cool video or tutorial that we can check out regarding how to best set up Boost to avoid all these issues?

While Boost can work wonders,

While Boost can work wonders, I've run into problems with it (on a D5 site) where the caching files were not clearing correctly, so buyer beware.

Memcache is definitely the way to go when possible, along with opcode caching. Beyond that I'm hoping to get a custom caching engine a co-worker built released to d.o to see what the feedback is.

In Drupal 7, the cron run has

In Drupal 7, the cron run has 2 processes to it: One is for "main tasks" (like the old cron hook), and the other is to run any cron entries queued up in the job queue. The latter you probably want to run fairly quickly, the former not so much.

So what if for Drupal 8 we split cron into 2 requests: One that can run on poormanscron or a separate cron run and handles the queued tasks, and has no cache clear. (That would be things like sending out mass emails, aggregator updates, etc.) The second could be set to run every 6-24 hours instead of every hour like most sites do, and handle things like log rotation, update_status updates, and other non-time-sensitive stuff, and then clear the cache.

That would keep high-volume tasks from piling up while keeping the wipe part of cron rare.

dylan's picture

Moshe, Can you clarify a bit

Moshe,

Can you clarify a bit more? As I understand it, the page cache doesn't have an expiration date – it's always set with CACHE_TEMPORARY. Even with a minimum cache lifetime (which I was surprised to see uses a timer rather than {cache_page}.created) the page cache will still be wiped every second cron run (assuming your cache lifetime is shorter than the cron period).
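The timer behaviour described here can be traced against the cache_clear_all() excerpt quoted earlier in this thread. Below is a minimal shell sketch of only that timer, under stated assumptions: cron is the only caller of cache_clear_all(NULL, $table), runs happen at a fixed period, the per-user $user->cache bookkeeping is ignored, and page cache entries are CACHE_TEMPORARY so the DELETE removes them whenever it fires. The function name wiping_runs is hypothetical.

```shell
#!/bin/sh
# wiping_runs CRON_PERIOD_SECONDS CACHE_LIFETIME_SECONDS RUNS
# Prints the 1-based numbers of the cron runs on which the DELETE would fire.
# This only mirrors the branch structure of the quoted excerpt; it does not
# talk to Drupal or a database.
wiping_runs() {
  period=$1
  lifetime=$2
  runs=$3
  flush_timer=0        # stands in for the cache_flush_<table> variable
  now=0
  n=1
  out=""
  while [ "$n" -le "$runs" ]; do
    now=$((now + period))
    if [ "$lifetime" -eq 0 ]; then
      out="$out $n"                  # no minimum lifetime: delete on every run
    elif [ "$flush_timer" -eq 0 ]; then
      flush_timer=$now               # first clear request: only start the timer
    elif [ "$now" -gt $((flush_timer + lifetime)) ]; then
      out="$out $n"                  # lifetime has passed since the timer started
      flush_timer=0
    fi
    n=$((n + 1))
  done
  echo "${out# }"
}

# Usage: six-hour cron with a one-hour minimum cache lifetime, six runs.
#   wiping_runs 21600 3600 6
```

Under these assumptions a lifetime shorter than the cron period makes the delete fire on every second run, while a lifetime of zero makes it fire on every run.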

dylan's picture

cron.php in URL path

To be clear, the graph shows cron runs, and does not observe cache clears directly. The input to the graph was just a list of URLs and responses. Each response was classified as a hit, miss, or cron run. A hit or miss is 100% or 0% respectively, and then the data were smoothed by convolution with a Bartlett window. In any case, cron.php is run on a schedule, so the timing can be known in advance.
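For readers who want to reproduce this kind of smoothing, here is a rough sketch using awk. It is not the code used for the graph. It assumes one value per input line (1 for a hit, 0 for a miss), an odd window width of at least 3, triangular Bartlett weights that fall to zero at both ends of the window, and renormalisation at the edges of the data; the edge handling in particular is a choice made for this sketch. The name smooth_bartlett is made up.

```shell
#!/bin/sh
# smooth_bartlett WIDTH
# Reads one number per line on stdin and prints the Bartlett-weighted
# moving average of each point. WIDTH must be odd and at least 3.
smooth_bartlett() {
  awk -v m="$1" '
    { x[NR] = $1 }
    END {
      half = (m - 1) / 2
      # Triangular weights: 1 at the window centre, 0 at both ends.
      for (j = 0; j < m; j++) {
        d = j - half; if (d < 0) d = -d
        w[j] = 1 - d / half
      }
      for (i = 1; i <= NR; i++) {
        num = 0; den = 0
        for (j = 0; j < m; j++) {
          idx = i + j - half
          # Skip window positions that fall outside the data.
          if (idx >= 1 && idx <= NR) { num += w[j] * x[idx]; den += w[j] }
        }
        printf "%.3f\n", num / den
      }
    }'
}

# Usage: smooth a file of 0/1 responses with a 101-sample window.
#   smooth_bartlett 101 < responses.txt
```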

Update to cron cache warmer

With the newest xmlsitemap the cache warming command was failing. I switched out perl for egrep and I have it working. The command I use is

wget --quiet http://example.com/sitemap.xml --output-document - | egrep -o "http://example.com[^<]+" | wget -q --delete-after -i -

Please note you have to change both instances of example.com to your site name.
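A possible variation, sketched here under assumptions and not tested against any particular xmlsitemap release: match the <loc> elements themselves instead of the host name, so only the sitemap URL has to be changed. The function names extract_locs and warm_cache are hypothetical, and the sketch assumes a grep that supports -o (GNU and BSD grep do). It does not decode XML entities such as &amp; in URLs.

```shell
#!/bin/sh
# extract_locs: read sitemap XML on stdin and print each <loc> URL on its
# own line. Works even when the whole sitemap is on a single line.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's#<loc>##' -e 's#</loc>##'
}

# warm_cache SITEMAP_URL: request every URL listed in the sitemap and
# discard the downloaded pages, so that Drupal rebuilds its page cache.
warm_cache() {
  wget --quiet --output-document - "$1" |
    extract_locs |
    wget --quiet --delete-after --input-file -
}

# Usage (replace the URL with your own sitemap):
#   warm_cache http://example.com/sitemap.xml
```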

Expire module

I'm opting to set my minimum cache lifetime to 1 day... then use the Expire module to allow nodes to be expired when updated. Dealing with the patch/submodule right now to expire core cache. http://drupal.org/node/1308252#comment-5305656

Attempt to time this 1 day cron clear at night after daily backup.

Expire comes with rules integration as well for doing non-node expirations, Cache Actions may also do the trick.

BTW: Noticed this module which may be relevant for this discussion-- http://drupal.org/project/cache_graceful

Expanded solution

I've expanded the cache warmer solution so that it can also be used to handle updates to the site, and not only cron runs. Take a look at my blog post.

Expanded solution

Replying to myself to clarify... With "updates to the site", I mean that the cache is not only cleared during each cron run, but also when any page is changed, or when a comment is added to any page (even if it is just queued for moderation!). So just running a cache warming script after each cron run is not sufficient.

What about not expiring the page cache - a semi-permanent cache

The Varnish module lets you avoid expiring the page cache on cron runs when you use its cache backend. Combine this with the Cache Actions module and Rules to clear a node from Drupal's internal and external cache (both handled by Varnish in this case) when the node is updated. Caching for Views and Panels for things like latest news can also be expired on node saves and node updates. This way you have smart caching and smart cache clearing, and are not wasting cycles on regenerating the same pages over and over.

Sadly this is not supported on Acquia Cloud hosting as they do not allow communicating with Varnish via Telnet to expire the cache.

I'd like to see the ability to not clear the page cache on cron runs implemented in other caching backends too. May have to write some patches!
