Using the Drupal Batch API

Recently I was working on a site for a library that had a lot of data to import into Drupal as nodes. Each book title, e-book, DVD, etc. needed to be a node in their Drupal 7 site. Not only that, but the database that held this data would add new records and occasionally update or remove existing ones. This meant roughly 300,000 to 400,000 nodes that had to be created and kept synchronized with the library's internal database. In this post I'll outline how I used Drush and the Batch API to import the dataset into Drupal from a terminal.

Bringing the data in

My first challenge was to import the dataset into Drupal. I had quite a bit of data to work with, so I had to use the Batch API. The Batch API lets you run one or more methods over a large set of data without worrying about PHP timeouts, and it can report progress as the operation runs. I had already created a module to handle importing and updating the library data. To create the batch queue, you build an array and pass it to batch_set():

function mymodule_setup_batch($start = 1, $stop = 100000) {
  //  ...
  //  Populate $lots_of_data from record $start to record $stop.
  //  ...

  // Break the data into chunks so each process does not time out.
  $chunks = array_chunk($lots_of_data, 20);
  $operations = array();
  $count_chunks = count($chunks);

  // For every chunk, assign some method(s) to run on that chunk of data.
  $i = 0;
  foreach ($chunks as $chunk) {
    $i++;
    $operations[] = array('mymodule_method_to_work_on_a_small_part', array($chunk, t('(Importing chunk @chunk of @count)', array('@chunk' => $i, '@count' => $count_chunks))));
    $operations[] = array('mymodule_another_method', array($chunk));
  }

  // Put all that information into our batch array.
  $batch = array(
    'operations' => $operations,
    'title' => t('Import batch'),
    'init_message' => t('Initializing'),
    'error_message' => t('An error occurred'),
    'finished' => 'mymodule_finished_method',
  );

  // Get the batch process all ready!
  batch_set($batch);
  $batch =& batch_get();

  // Because we are doing this on the back end, we set progressive to FALSE.
  $batch['progressive'] = FALSE;

  // Start processing the batch operations.
  drush_backend_batch_process();
}

You'll also need to write the operation methods themselves. Each will be called with the parameters set up above; in this case both methods work on the same chunk of data, one right after the other.

function mymodule_method_to_work_on_a_small_part($chunk, $operation_details, &$context) {
  // Do something to $chunk -- maybe create a node?
  $context['message'] = $operation_details; // Will show which chunk we're on.
}

function mymodule_another_method($chunk, &$context) {
  // Do some more work.
  $context['message'] = t('We have done a second thing to a chunk of data');
}

We also need the 'finished' callback, which runs once all of the operations are complete:

function mymodule_finished_method($success, $results, $operations) {
  if ($success) {
    // Let the user know we have finished!
    print t('Finished importing!');
  }
  else {
    print t('An error occurred during the import.');
  }
}

Drushing data

I have always enjoyed using Drush, but I had never created my own Drush commands. It turns out to be a very easy process. I decided to make an import command so I could kick off the batch process and import a section of the entire dataset from the terminal. I placed the above code into a file named mymodule.drush.inc and created the following methods:

function mymodule_drush_command() {
  $items = array();
  $items['myimport'] = array(
    'callback'    => 'mymodule_setup_batch',
    'description' => dt('Import records from the internal database.'),
    'arguments'   => array(
      'start'     => dt('The record to start at.'),
      'stop'      => dt('The record to stop at.'),
    ),
  );
  return $items;
}
 
function mymodule_drush_help($section) {
  switch ($section) {
    case 'drush:myimport':
      return dt("Import items from the internal database: myimport [start record] [end record].");
  }
}

It was that simple to create a new Drush command! Now I could open a terminal, type drush myimport 100 2000, and watch Drupal import a batch of records. The Batch API comes in very handy when dealing with large amounts of data that may take an undetermined length of time to process, such as a massive import or upgrade.
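For reference, the terminal session looks something like the following. Clearing the caches first is worth noting: Drush only discovers a new mymodule.drush.inc after a cache clear (the exact output of the import will of course depend on your data).

```shell
# Clear caches so Drush picks up the new command file.
drush cc all

# Import records 100 through 2000 from the internal database.
drush myimport 100 2000
```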

Good luck with your batch processes and happy coding!


I think the reason I went this route had to do with the import logic already existing in another module, which then had to be extended to a large dataset. To be honest, I'm not too familiar with Migrate yet, but it seems like for this project I'd have to extend it to use the library's source. Thanks for pointing it out to me, it could be very useful in the future :]

Cool writeup. Especially enjoyed the drush documentation.

Also, +1 to the 'migrate' module. Small learning curve, but amazing doesn't begin to describe it.

Nice post. At first I thought that mymodule_setup_batch was a hook, but then I realized it was a function you were setting up to call in Drush. The way you named it sort of falls in line with hook naming conventions.

Once I sorted this out in my head it gave me exactly what I needed for my own run of batches. Thanks!

When dealing with consistent data imports (by this I mean a CSV with a permanent structure), Batch won over Migrate 2 for us, as it allowed us to build a custom import page where we could nicely wrap everything up for the client.

Great post btw, thank you.

Hi and thanks for a great read on Batch API.

Would there be any way to call a batch function, creating some entities, from a submit handler so that it could return the ids of the newly created entities?
I have almost given up on this.

Offhand I'm not entirely certain the Batch API allows for this directly. It might (and it sounds like it would and should), but I'd have to do some testing around that to be sure. If anything, you could store the ids (as you process them) in an array in something like Drupal's caching system, or even in a Drupal variable via variable_set().
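A minimal sketch of that suggestion, assuming Drupal 7's variable_set()/variable_get(): collect the new ids in $context['results'] as the batch runs, then persist them in the 'finished' callback so the submit handler (or any later request) can read them back. The function and variable names here are made up for illustration.

```php
function mymodule_create_entities_op($chunk, &$context) {
  foreach ($chunk as $item) {
    // ... create the entity, yielding $entity_id ...
    $context['results'][] = $entity_id; // Collect each new id.
  }
}

function mymodule_ids_finished($success, $results, $operations) {
  if ($success) {
    // Persist the ids so other code can retrieve them after the batch:
    // $ids = variable_get('mymodule_created_ids', array());
    variable_set('mymodule_created_ids', $results);
  }
}
```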

Thanks for the quick response, Chris. If you find an answer please share, as it would without doubt be better than my current solution, which is creating a custom 'mapper entity' in the submit handler, where its id is stored in the node as an entity reference, and where the id is also sent along to the batch function and finally loaded, updated with the newly created entities, and then saved. I'm so ashamed :)

Hi, great article, thank you.

Like posted by Rohit earlier on, I get the following error:

Call to undefined function drush_backend_batch_process()

I get this on a working environment with drush installed.

How is Drupal supposed to know where to find the drush_backend_batch_process() function when invoked via cron, for instance?

Thanks!

Thanks for writing this.
It helped me.
The Drupal documentation on this one is not as good as it could be.

Thanks for your post, Chris. I'd like to translate this blog into Chinese and post it on my site to share it with Drupalers in China. Please let me know if you don't like it, and I will remove it.

Thanks again.
mail: dustise at gmail dot com
