Saturday 13 April 2013

Feed Sources and Self Nodes and Bears (oh my)

Seems it's been nearly a year since I last posted something. I can only say that I have been tied up with Drupal 6 projects and uninspiring Drupal 7 ones. However, something interesting came up in a project of my own over the last day or so. So here we are:

The project in question involves creating a meta-search site on a specific subject with data gathered from various sites. Some of those sites have RSS feeds, some do not.

The Feeds module is fairly awesome if you need to get structured data from somewhere and turn it into nodes or other entities. If you're reading from, say, an RSS feed, it works out of the box, and I must admit I was very impressed with what it can do. There's even a Feeds Crawler module that lets you scrape data.

I appreciate that scraping may make some people feel awkward, but this site does nothing except fetch information, allow someone to search and then direct the viewer to the original site. No more than Google or any other search engine does.

However, for this project there was an issue: even the RSS feed does not contain all the information required to fill the nodes created. To do that properly we have to go to the original page and scrape the relevant sections - for example, to get the full description and the tags that have been used.

There is another module, Feeds Self Node Processor, which allows a node to become its own feed: you create an importer that pulls in the initial information and creates the nodes, and then each of these new nodes can fetch its own specific information from the target URL.

I'm leaving out a lot of detail here but I hope this is enough to be understood.

Fetches can be scheduled to run during cron or switched off completely, so that's all fine. Except for one little thing:

There is no option to do a once-only fetch. Now imagine you've imported 10,000 data items: the feeds_source table is filled with 10,000 self-node rows. If you set the fetch frequency to the maximum (4 weeks), the first self-node fetch won't happen for 4 weeks. But worse: if you set it to "as often as possible", cron starts to cycle through the existing 10,000 nodes but (as far as I can tell) doesn't do it right and keeps repeating the same content. And new content gets ignored.
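
If you want to see the build-up for yourself, a quick count per importer does the trick. This is just an illustrative snippet (run it with something like drush php-eval), but the columns match the feeds_source schema:

// Count how many feeds_source rows each importer has accumulated.
$counts = db_query('SELECT id, COUNT(*) AS total FROM {feeds_source} GROUP BY id')->fetchAllKeyed();
foreach ($counts as $importer => $total) {
  print $importer . ': ' . $total . "\n";
}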

Oh dear. The initial joy of getting the feeds process functioning for three different sites was slowly eroded by the realisation that this was just not going to work.

I spent two days playing with various options. The Feeds module provides a wide array of hooks and lots of potential for customisation; however, none of them helped.

Of course I would not be writing this unless I had found a solution.

It became clear that what was needed was to delete the relevant row from the feeds_source table. There's a nice class wrapper and method to achieve this, which ensures all the related information is also deleted (like the entry in the job_scheduler table).
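
Concretely, that wrapper is the FeedsSource class, reached through the feeds_source() function. Deleting a single source boils down to something like this (the importer name and node id are placeholders):

// $importer_id is the importer's machine name, $nid the node that acts
// as its own feed source. delete() also removes the job_scheduler entry.
feeds_source($importer_id, $nid)->delete();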

And there's a "post import" hook. Theoretically, just performing $source->delete() should do the job. Unfortunately, doing that in the "post import" hook doesn't work: the entries are simply recreated, because the source data is saved after the hook is run. You can't push the "last imported" timestamp into the distant future either, and simply deleting the relevant job_scheduler row is equally futile.
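
So the obvious version below - and I should stress this is the approach that does not work - sees its deletion undone as soon as Feeds saves the source again:

/**
 * Implements hook_feeds_after_import() - the naive version that does NOT work.
 */
function YOURMODULE_feeds_after_import(FeedsSource $source) {
  // Pointless: Feeds saves the source after this hook runs, so the
  // feeds_source row (and its job_scheduler entry) simply come back.
  $source->delete();
}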

What was needed was a method of deleting the feeds_source row at some point after everything else had been done. The next page load was an attractive choice initially - I even toyed with the thought of using $_SESSION, but only for a few seconds.

The answer is that under-used hardly-mentioned-anywhere feature of Drupal 7: Queues. And here it is:


/**
 * Implements hook_feeds_after_import().
 *
 * @param $source
 *  FeedsSource object that describes the source that has been imported.
 */
function YOURMODULE_feeds_after_import(FeedsSource $source) {
  // Identify this as a feeds source to be killed. I use a naming convention:
  // all my self-node importers are named "reprocess_[something]".
  if (strpos($source->id, 'reprocess_') === 0) {
    // Get the queue (create if not existent)
    $queue = DrupalQueue::get('killJobQueue');

    // Build the required job data
    $job = array(
      'type' => $source->id,
      'id' => $source->feed_nid,
    );

    // And put it in the queue
    $queue->createItem($job);
  }
}

/**
 * Implements hook_cron_queue_info().
 */
function YOURMODULE_cron_queue_info() {
  return array(
    'killJobQueue' => array(
      'worker callback' => '_YOURMODULE_kill_source',
      'time' => 5,
    ),
  );
}

/**
 * Queue worker callback: deletes the feeds_source row (and with it the
 * related job_scheduler entry) for the given importer and feed node.
 */
function _YOURMODULE_kill_source($job) {
  feeds_source($job['type'], $job['id'])->delete();
}

We intercept the hook after the import and determine whether this is a feeds source we want to kill. I did this by using a naming convention: all my self-node importers are named "reprocess_[something]". We build the job data and add it to the queue.

It's assumed that many queue-using applications will want to process the queue during hook_cron(), so the Queue API provides that functionality for you - apart from the bit that does the actual work.
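
For what it's worth, what cron does with that declaration boils down to roughly the loop below - a simplified sketch using the standard DrupalQueue methods, which is also handy if you want to drain the queue by hand while testing:

// Roughly what cron does with the 'killJobQueue' declared above; useful
// for exercising the worker manually, e.g. via drush php-eval.
$queue = DrupalQueue::get('killJobQueue');
while ($item = $queue->claimItem()) {
  // Hand the stored job data to the worker callback...
  _YOURMODULE_kill_source($item->data);
  // ...and remove the item once it has been processed.
  $queue->deleteItem($item);
}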

Now it doesn't matter whether my cron function runs before or after the Feeds cron, because somewhere between crons the feeds source rows will be correctly deleted. And it works. Apart from anything else, it prevents the feeds_source and job_scheduler tables from getting clogged up with useless data.