Drupal's Big Data Problem

When we in the Drupal community talk about scalability, it's most often in terms of handling high numbers of visitors. An equally important dimension, one that we often overlook to our detriment, is scaling with larger datasets.

One of the biggest problems I see is a pattern of loading all of a module's data at once, regardless of size. Two examples:

Drupal core has a built-in assumption that menus and vocabularies are small. For most projects it's a valid assumption, but if you push the limits on these components the results can be disastrous.

This one is particularly bad. The time needed to load /admin/structure/menu/manage/%menu is Θ(N²) in the number of links, because of the way Drupal's tabledrag system works. The practical limit is around 500 menu items.
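If you want to see the quadratic growth for yourself, the simplest way is to generate links programmatically and time the overview page at increasing sizes. A rough Drupal 7 sketch (the menu name and the count are arbitrary choices):

```php
<?php
// Rough D7 sketch: bulk-create menu links, then time the admin
// overview page as the menu grows. 'main-menu' and the count of
// 500 are arbitrary; adjust for your own benchmark.
for ($i = 0; $i < 500; $i++) {
  $link = array(
    'menu_name'  => 'main-menu',
    'link_path'  => 'node',
    'link_title' => 'Benchmark link ' . $i,
  );
  menu_link_save($link);
}
// Now load /admin/structure/menu/manage/main-menu and record the time,
// repeating at a few different sizes to observe the Θ(N²) curve.
```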

Similarly, memory consumption on /admin/structure/taxonomy/tags grows linearly with vocabulary size. The page load time is impacted less, because terms are listed with pagination. In my experience performance problems will start to crop up beyond about 20,000 terms.
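To find where your own setup falls over, you can bulk-create terms with the core API. A Drupal 7 sketch (the vocabulary ID and term count are assumptions; adjust for your site):

```php
<?php
// Rough D7 sketch: bulk-create taxonomy terms to probe memory limits.
// $vid = 1 is an assumed vocabulary ID; change it for your site.
$vid = 1;
for ($i = 0; $i < 20000; $i++) {
  $term = (object) array(
    'name' => 'Benchmark term ' . $i,
    'vid'  => $vid,
  );
  taxonomy_term_save($term);
}
// Then watch memory usage on /admin/structure/taxonomy pages, or when
// anything calls taxonomy_get_tree($vid).
```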

Are these limits reasonable? For perspective, here are a few published ontologies:

  • Library of Congress Subject Headings (LCSH): 250,000 items
  • Species in the NCBI taxonomy: 357,590 items
  • Medical Subject Headings (MeSH): 177,000 items


Solutions

Unfortunately, there's not much the typical site builder can do about these problems. The best advice I can offer is to plan your project carefully, and if you anticipate any “big” datasets then do some benchmarking before committing fully.

For module authors, I urge you to think ahead when designing your code. Try to think in extremes. How will your program behave if installed on a site with a million nodes or users? Or 1,000 fields?

  • Use pagination where appropriate.
  • Don't write API functions that load more data than needed.
  • Avoid scenarios where datasets are multiplied together. An example is field formatters; use formatter settings instead of creating individual formatters for all possible option combinations.
  • Use AJAX to lazy-load data.
  • Benchmark your code with more “stuff” than you think is reasonable. Document the scalability limits where you find them.
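As a concrete illustration of the first point, Drupal 7's query builder makes pagination nearly free via the PagerDefault extender. A sketch (the table and the 50-per-page limit are arbitrary choices):

```php
<?php
// D7 sketch: fetch terms one page at a time instead of all at once.
// The page size of 50 is an arbitrary choice.
$query = db_select('taxonomy_term_data', 't')
  ->fields('t', array('tid', 'name'))
  ->orderBy('t.name')
  ->extend('PagerDefault')
  ->limit(50);
$terms = $query->execute()->fetchAllKeyed();
// Render the page links with theme('pager').
```

The key design point is that memory use stays constant no matter how large the vocabulary grows, because only one page of rows is ever loaded.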

Comments

_block_rehash()

I'd also add blocks to this list. I believe it's been addressed in D8, but in D7 (and D6?) every cron run calls _block_rehash(), which loops through each code-declared block for each active theme and performs a DB insert/update on it.

Pressflow?

This looks like a job for Pressflow! :) Thanks for the nice writeup!

Good to know limits

Great post!

I think it's really important for any developer to think long term and look at the big picture. It's also helpful to have some of these "limits" in mind. In edge cases that DO require such large menu trees or taxonomies, it seems that it would make sense to come up with a custom solution geared to that particular site.

Questions:
- Would using Mongo help with these limits?
- Does the "functional limit" of menu items refer to ALL menu items or the number of items per menu?

thanks!

Overlooked? Not so fast

The menu_override_parent_selector variable in menu_parent_options() is provided exactly because "the menu_links table can be practically any size and we need a way to allow contrib modules to provide more scalable pattern choosers." So if you have that many links, use hook_menu_alter() to replace the overview page with something better than tabledrag, set this variable, and perhaps use HS or something similar to provide a better parent selector.
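In other words, something along these lines (a rough D7 sketch; the replacement page callback is hypothetical):

```php
<?php
// Tell menu_parent_options() to return nothing, so our module can
// supply its own, more scalable parent selector.
variable_set('menu_override_parent_selector', TRUE);

/**
 * Implements hook_menu_alter().
 */
function mymodule_menu_alter(&$items) {
  // Swap the tabledrag-based overview for a scalable replacement.
  // 'mymodule_menu_overview_page' is a hypothetical callback.
  $items['admin/structure/menu/manage/%menu']['page callback'] = 'mymodule_menu_overview_page';
}
```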

Taxonomy Edge

Also just stumbled across the Taxonomy Edge module:
http://drupal.org/project/taxonomy_edge/

"Taxonomy Edge optimizes tree functions for taxonomies. It provides a data model for easily managing hierarchical terms. It was created to avoid recursive functions when fetching all children, and to avoid the not-quite-so-scalable memory footprint of taxonomy_get_tree()."

Menu

I have a patch for core that speeds up certain aspects of the menu system.
http://drupal.org/node/1710656

In short, I found that menu items were being loaded even if the menu link was disabled. A test case is in the issue link.


Not just Drupal

While it's always great to acknowledge Drupal's defects and not sweep them under the rug, I think it's also important to note that this same problem exists in most systems. User interfaces are typically designed around some assumptions in scope, and they need to be changed/adapted as the scope changes. Performance aside, the drag-and-drop mechanism for sorting menus within Drupal would need to be rethought to support hundreds or thousands of items. Drupal is certainly not alone here. WordPress, Concrete5, Joomla, and even closed-source proprietary systems will all need to be adapted at some point to support edge cases, if those edge cases become important enough to warrant the investment.
