How Drupal's cron is killing you in your sleep + a simple cache warmer

A lot of what's written about performance tuning for Drupal is focused on large sites, and benchmarking is often done by requesting the same page over and over in an attempt to maximize the number of requests per second (à la ab). Unfortunately, this differs from the real world in two key ways:

  1. Most of our projects aren't regularly driving traffic at millions of hits per day.
  2. Our users request a lot of different pages.

In this post I'll model a site with 500 nodes, and 1200 hits per day. That's fewer than 1 request per minute, yet for many small businesses this would be a fairly healthy traffic flow. In this case, it might at first seem that high performance doesn't matter. A clever hacker could probably manage to install Linux on the office coffee maker and get acceptable HTTP throughput. However, latency still matters a great deal, even for small sites:

User impatience is measured in units of 1 tenth of a second starting at 200 milliseconds or so.

Drupal's page cache is capable of delivering blazing fast response times, but only when the cache is warm. And the reality for most small to mid-sized sites is that the page cache is cleared out far quicker than it's regenerated.

[Figure: hit rate]

[Figure: histogram]

These graphs show what happens to a site that runs cron.php every six hours. Remember, system_cron() calls cache_clear_all(), and every time the cache is cleared the hit rate crashes to zero. The traffic follows a Pareto distribution: a few pages are very popular with a "long tail" of less common requests. Initially the hit rate jumps back up as the popular pages get cached. But the "long tail" pages never really gain an acceptable hit rate before it starts all over again. To make matters worse, I've seen site operators that run cron far more often – imagine if cron were run every fifteen minutes; there would be almost no page caching at all. So what can be done?
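The shape of the hit-rate curve can be approximated without any traffic logs. The sketch below is not the model used to produce the graphs; it assumes that page i of N is requested with probability proportional to 1/i (a Zipf-like stand-in for the Pareto traffic), that requests are independent, and that nothing leaves the cache between clears. The function name hit_rate is made up for illustration.

```shell
#!/bin/sh
# Expected page-cache hit rate after a given number of requests have been
# served since a full cache clear.
# Assumption (not from the post's data): page i of N is requested with
# probability proportional to 1/i, and requests are independent.
hit_rate() {
  # $1 = number of pages, $2 = requests served since the cache was cleared
  awk -v pages="$1" -v served="$2" 'BEGIN {
    for (i = 1; i <= pages; i++) norm += 1 / i
    for (i = 1; i <= pages; i++) {
      p = (1 / i) / norm                    # chance a request is for page i
      rate += p * (1 - (1 - p) ^ served)    # weighted by: is page i cached yet?
    }
    printf "%.2f\n", rate
  }'
}

# 1200 hits per day is 50 per hour; print the expected hit rate at a few
# points during a six-hour cron period.
for hours in 0 1 3 6; do
  echo "after ${hours}h: $(hit_rate 500 $((hours * 50)))"
done
```

Changing the list of hours or the page count shows how slowly the long tail fills in compared with the handful of popular pages.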

Run cron.php less often

Somewhere between six hours and one day is likely adequate for most sites. If you have tasks that need to run more often (such as notifications), consider breaking up your cron runs with Elysia Cron or perhaps a drush script. Still, as the graph shows, it can take a long time for the page cache to kick in no matter what the frequency. Furthermore, the cache is also cleared during other operations such as node edits or comment posting.

Run a crawler

Almost every smart Drupal developer I've discussed this problem with has the same answer: run a crawler. I'm not thrilled about this solution because in some ways it seems very inefficient, but some testing shows the impact can be minimal. In my tests, 500 nodes took less than a minute to regenerate from a cold cache, and less than 5 seconds when fetching from the cache. This assumes you have the XML Sitemap module installed. While not required, a sitemap certainly makes wget's job easier.

wget --quiet http://example.com/sitemap.xml --output-document - |\
perl -n -e 'print if s#</?loc>##g' | wget -q --delete-after -i -
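The one-liner works when each <loc> element sits on its own line of the sitemap. A slightly more defensive variant is sketched below. It assumes a grep that supports -o (GNU and BSD grep do), and the function names sitemap_urls and warm_cache are invented here, not part of any module.

```shell
#!/bin/sh
# Print every URL found in <loc> elements of sitemap XML read on stdin,
# one per line, even when the whole sitemap arrives on a single line.
sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' |
    sed -e 's#<loc>##' -e 's#</loc>##' -e 's#&amp;#\&#g'  # also unescape &amp;
}

# Fetch each URL once and throw the response away; the only goal is to make
# Drupal render the page so that it lands in the page cache.
# $1 = sitemap URL
warm_cache() {
  wget --quiet --output-document - "$1" |
    sitemap_urls |
    wget --quiet --delete-after --input-file -
}

# Usage (example.com is a placeholder):
#   warm_cache http://example.com/sitemap.xml
```

Running it shortly after each cron run means the pages are regenerated by the crawler rather than by the first visitor to request them.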

Invent a preemptive cache handler

It seems the ideal cache handler would regenerate cache entries before they expire. Sadly, as far as I know, this doesn't exist yet for Drupal 6. (There's Pressflow Preempt for D5, but it seems it never made it to D6.) If you have ideas about this, please let me know!

Comments

But cache_clear_all does a check. If NULL is passed as the $cid then it does a check on some variables. It makes sure that cache_clear_all is only run on the tables once per "period" defined by the cache_lifetime settings.

Check your site's admin/settings/performance. Is the minimum cache lifetime set to "none"?

But even if it is set to none, it only deletes items which are expired...

Boost for 6.x is the answer!
It has a crawler built in so your site will never have an expired page. It can regenerate the cache before it expires. It solves all the issues you brought up. I can have a cache that lasts several weeks using Boost.

Also consider setting 'Minimum cache lifetime' in admin/settings/performance. That way, if you run cron every hour, it doesn't have to clear your caches on every single run.

Minimum cache lifetime isn't much help for small sites. What it's really good for is preventing the caches from being cleared too often in cases where you have large numbers of writes. (As in thousands of comments per day). In the site charted above, one would need to set the lifetime to 12 hours or more to make a significant difference, and even then the caches will still expire eventually.

Regarding Boost: I have only tinkered with it, so I can't comment in detail on how it handles cache expiration. The built-in crawler is a good idea. So far, the thing that has kept me off Boost is that it's not actually a cache handler – so it's completely unaware of cache clear events. (Correction: see mikeytown2's comment below.) In a follow-up post, I'll try to compare Boost to cacherouter with the file backend.

Actually Boost is a cache handler; it has some of the most tightly integrated cache expiration logic ever seen in any CMS. By default Boost ignores the cache_clear_all cron call, but you can make it honor that call by setting "Ignore cache flushing" to Disabled.

Boost will clear or expire the cached page under the following hooks - Nodeapi; comments; voting api.
In addition to clearing that node it can also clear the views containing the node; cck node reference fields; taxonomy term pages containing that node; and the menu item above it (parent), next to it (siblings), and sub items below it (children).

You can also set custom expiration times by content container (view, node, taxonomy, panels, etc.); content type (page, story, etc.); or by ID (nid, tid, view display, etc.). In short, Boost allows you to spend a lot of time tinkering with the cache to get it working exactly how you want. It is very powerful, and there is a reason it is used by a lot of sites out there.

mikeytown2,

Thanks for the correction! I was thinking that because of Moshe's comment in this issue, and the fact that it does not include a cache.inc handler, (a.k.a. $conf['cache_inc'] = boost.inc) that it didn't respond to things like node_save().

I will definitely be giving Boost a closer look. I wonder if the project page could be updated to clarify that it is bypassing cache.inc and cache_clear_all() in order to provide more fine-grained control.

Thanks for this post. I have my cron set to run every 5 minutes and it never occurred to me that there would be any harm in it. I'm going to investigate the options in the comments.

Thanks again!

Michelle

Thanks for the post Dylan, and to all who commented! This will definitely help, as a good portion of the sites we build are in the low-medium traffic range, and performance is always a concern.

This is incorrect. This call is pretty innocuous; it just prunes expired and unused items from the cache tables.

What's called during system_cron() is cache_clear_all(NULL, $table). It is easy to assume that this wipes the whole table, but it does not: $cid is NULL in cache_clear_all(), so empty($cid) is TRUE and only expired items get deleted. Follow the (twisted) logic of cache_clear_all() and you will see it. The comments for this function are pretty confusing. Here is the relevant section:

function cache_clear_all($cid = NULL, $table = NULL, $wildcard = FALSE) {
  global $user;
 
  if (!isset($cid) && !isset($table)) {
    // Clear the block cache first, so stale data will
    // not end up in the page cache.
    cache_clear_all(NULL, 'cache_block');
    cache_clear_all(NULL, 'cache_page');
    return;
  }
 
  if (empty($cid)) {
    if (variable_get('cache_lifetime', 0)) {
      // We store the time in the current user's $user->cache variable which
      // will be saved into the sessions table by sess_write(). We then
      // simulate that the cache was flushed for this user by not returning
      // cached data that was cached before the timestamp.
      $user->cache = time();
 
      $cache_flush = variable_get('cache_flush_'. $table, 0);
      if ($cache_flush == 0) {
        // This is the first request to clear the cache, start a timer.
        variable_set('cache_flush_'. $table, time());
      }
      else if (time() > ($cache_flush + variable_get('cache_lifetime', 0))) {
        // Clear the cache for everyone, cache_lifetime seconds have
        // passed since the first request to clear the cache.
        db_query("DELETE FROM {". $table ."} WHERE expire != %d AND expire < %d", CACHE_PERMANENT, time());
        variable_set('cache_flush_'. $table, 0);
      }
    }
    else {
      // No minimum cache lifetime, flush all temporary cache entries now.
      db_query("DELETE FROM {". $table ."} WHERE expire != %d AND expire < %d", CACHE_PERMANENT, time());
    }
  }
 

Mikeytown2 - is there some cool video or tutorial that we can check out regarding how to best set up Boost to avoid all these issues?

While Boost can work wonders, I've run into problems with it (on a D5 site) where the caching files were not clearing correctly, so buyer beware.

Memcache is definitely the way to go when possible, along with opcode caching. Beyond that, I'm hoping to get a custom caching engine a co-worker built released to d.o to see what the feedback is.

In Drupal 7, the cron run has 2 processes to it: One is for "main tasks" (like the old cron hook), and the other is to run any cron entries queued up in the job queue. The latter you probably want to run fairly quickly, the former not so much.

So what if for Drupal 8 we split cron into 2 requests: One that can run on poormanscron or a separate cron run and handles the queued tasks, and has no cache clear. (That would be things like sending out mass emails, aggregator updates, etc.) The second could be set to run every 6-24 hours instead of every hour like most sites do, and handle things like log rotation, update_status updates, and other non-time-sensitive stuff, and then clear the cache.

That would keep high-volume tasks from piling up while keeping the wipe part of cron rare.

Moshe,

Can you clarify a bit more? As I understand it, the page cache doesn't have an expiration date – it's always set with CACHE_TEMPORARY. Even with a minimum cache lifetime (which I was surprised to see uses a timer rather than {cache_page}.created) the page cache will still be wiped every second cron run (assuming your cache lifetime is shorter than the cron period).
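The "every second cron run" pattern can be traced through the timer logic in the cache_clear_all() excerpt quoted earlier in the thread. The sketch below replays only that branch (the cache_flush_<table> variable and the cache_lifetime check) for a series of cron runs. It works in whole hours rather than seconds, and the function name simulate_cron is made up for illustration. "delete" here means the DELETE query runs; as I read it, rows stored with CACHE_TEMPORARY match its WHERE clause.

```shell
#!/bin/sh
# Replay the cache_lifetime timer from cache_clear_all() for a series of
# cron runs and print what each run would do to temporary cache entries.
# Times are whole hours for readability; Drupal itself works in seconds.
simulate_cron() {
  period=$1      # hours between cron runs
  lifetime=$2    # minimum cache lifetime in hours (0 = none)
  runs=$3        # number of cron runs to replay
  flush=0        # stands in for the cache_flush_<table> variable
  run=1
  while [ "$run" -le "$runs" ]; do
    now=$((run * period))
    if [ "$lifetime" -eq 0 ]; then
      # No minimum lifetime: every run deletes temporary entries.
      echo "run $run at ${now}h: delete"
    elif [ "$flush" -eq 0 ]; then
      # First request to clear the cache: only start the timer.
      flush=$now
      echo "run $run at ${now}h: start timer"
    elif [ "$now" -gt $((flush + lifetime)) ]; then
      # Timer has run longer than the lifetime: delete and reset it.
      flush=0
      echo "run $run at ${now}h: delete"
    else
      echo "run $run at ${now}h: keep"
    fi
    run=$((run + 1))
  done
}

# Cron every 6 hours with a 1-hour minimum cache lifetime.
simulate_cron 6 1 4
```

With a six-hour cron period and a one-hour lifetime, the runs alternate between starting the timer and deleting, which is the behaviour described above.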
