
Saturday, 13 April 2013

Feed Sources and Self Nodes and Bears (oh my)

Seems it's been nearly a year since I last posted something. I can only say that I have been tied up with Drupal 6 projects and uninspiring Drupal 7 ones. However something interesting came up with a project of my own over the last day or so. So here we are:

The project in question involves creating a meta-search site on a specific subject with data gathered from various sites. Some of those sites have RSS feeds, some do not.

The Feeds module is fairly awesome if you need to get structured data from somewhere and turn it into nodes or other entities. If you're reading from, say, an RSS feed, it works out of the box and I must admit I was very impressed with what it can do. There's even a feeds crawler module that lets you scrape data.

I appreciate that may make some people feel awkward but this site does nothing except fetch information, allow someone to search and then direct the viewer to the original site. No more than Google or any other search engine.

However for this project there was an issue: even the RSS feed does not contain all the information required to fill the nodes created. To do that properly we have to go to the original page and scrape the relevant sections - for example, to get the full description and the tags that have been used.

There is another module, Feeds Self Node Processor, which allows a node to become its own feed. This means you create an importer for the initial information and create the nodes; then each of these new nodes can fetch its own specific information from the target URL.

I'm leaving out a lot of detail here but I hope this is enough to be understood.

Fetches can be scheduled to be performed during cron jobs or switched off completely so that's all fine. Except for one little thing:

There is no option to do a once-only fetch. Now imagine you've imported 10,000 data items: the feeds_source table is now filled with 10,000 self-node rows. If you set the fetch frequency to the maximum (4 weeks), the first self-node fetch won't happen for 4 weeks. But worse: if you set it to "as often as possible", cron starts to cycle through the existing 10,000 nodes but (as far as I can tell) doesn't do it right and keeps repeating the same content. And new content gets ignored.

Oh dear. The initial joy of getting the feeds process functioning for three different sites was slowly eroded by the realisation that this was just not going to work.

I spent two days playing with various options. The Feeds module provides a wide array of hooks and lots of potential for customisation; however, none of them helped.

Of course I would not be writing this unless I had found a solution.

It became clear what was needed was to delete the relevant row from the feeds_source table. There's a nice class wrapper and method to achieve this which ensures all relevant information is also deleted (like the entry in the job_scheduler table).

And there's a "post import" hook. Theoretically, just calling $source->delete() should do the job; unfortunately doing that in the "post import" hook doesn't work - the entries are simply recreated, because the data is saved after the hook has run. And you can't touch the "last imported" timestamp to push it into the distant future. Simply deleting the relevant job_scheduler row is equally futile.

What was needed was a way of deleting the feeds_source row at some point after everything else had been done. The next page load was an attractive choice initially - I even toyed with the thought of using $_SESSION, but only for a few seconds.

The answer is that under-used, hardly-mentioned-anywhere feature of Drupal 7: queues. And here it is:


/**
 * Implements hook_feeds_after_import().
 *
 * @param $source
 *  FeedsSource object that describes the source that has been imported.
 */
function YOURMODULE_feeds_after_import(FeedsSource $source) {
  // Identify this as a feeds source to be killed; all my self-node
  // importers follow the naming convention "reprocess_[something]".
  if (strpos($source->id, 'reprocess_') === 0) {
    // Get the queue (it's created automatically if it doesn't exist yet).
    $queue = DrupalQueue::get('killJobQueue');

    // Build the required job data
    $job = array(
      'type' => $source->id,
      'id' => $source->feed_nid,
    );

    // And put it in the queue
    $queue->createItem($job);
  }
}

/**
 * Implements hook_cron_queue_info().
 */
function YOURMODULE_cron_queue_info() {
  return array(
    'killJobQueue' => array(
      'worker callback' => '_YOURMODULE_kill_source',
      'time' => 5,
    ),
  );
}

/**
 * Queue worker callback: deletes the feeds_source row for the job.
 */
function _YOURMODULE_kill_source($job) {
  feeds_source($job['type'], $job['id'])->delete();
}

We intercept the hook after the import and determine whether this is a feeds source we want to kill. I did this by using a naming convention: all my self-node importers are named "reprocess_[something]". Then we build the job data and add it to the queue.

It's assumed that many queue-using applications will want to process their queues during hook_cron(), so the Queue API provides that plumbing for you - apart from the bit that does the actual work.

Now it doesn't matter whether my cron function runs before or after the feeds cron, because somewhere between crons the feeds source rows will be correctly deleted. And it works. Apart from anything else it prevents the feeds_source and job_scheduler tables from getting clogged up with useless data.



Wednesday, 7 September 2011

Reducing the footprint

Sorry I haven't blogged in a while, I have been working very hard on commercial D7 sites in my day job but haven't really come across anything spectacularly D7 I wanted to talk about.

So here's something unspectacular but quite important, and applies almost as much to D6 as D7.

Every module eats memory. Every x.module file (for enabled modules) gets loaded for every access (well, let's ignore caching). So anything that can be done to reduce the files loaded (size in this case) can only be a good thing. (Back in the day I wrote code for machines with 128K of memory of which 20K might be used for the screen, and another 32K for the OS - in those days you really had to think about keeping code as compact as possible.)

Drupal uses hooks to gather information from modules, and then it stores this information in the database caches (or other tables), after which it no longer calls these hooks - unless the caches get cleared and it has to build the information again.

Hooks like hook_menu(), hook_theme(), hook_permission() and so on need to be available in the x.module file - and yet they are only called infrequently.

Solution? Put a minimal hook in the x.module file which loads another file (say x.registry.inc) and then calls the actual hook code which is in there. Like this:

In x.module:

function x_menu() {
  module_load_include('registry.inc', 'x');
  return _x_menu();
}

And in x.registry.inc:

function _x_menu() {
  $items = array();

  $items['my/path'] = array(
    // my/path data
  );

  return $items;
}
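Incidentally, the reason passing 'registry.inc' as the $type argument works is that module_load_include() defaults the $name argument to the module name and then loads "$name.$type". A minimal sketch of that filename logic (the helper name is mine, not Drupal's):

```php
// Sketch of how Drupal 7's module_load_include() composes the filename:
// $name defaults to the module name, and the file loaded is "$name.$type".
// (Helper name is hypothetical - for illustration only.)
function _filename_for_include($type, $module, $name = NULL) {
  if (!isset($name)) {
    $name = $module;
  }
  return "$name.$type";
}

echo _filename_for_include('registry.inc', 'x'); // prints "x.registry.inc"
```

So module_load_include('registry.inc', 'x') picks up x.registry.inc without needing the third argument at all.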

If a hook is called on pretty much every page there is no advantage in doing this. But if it's a hook that's called infrequently it's definitely worth it.

This is something I'd played with back in my D6 days but then stopped using. But then I saw another contributed module that was using it and thought "yep, I should be doing that". So I am and here it is.

hook_hook_info()

There is another solution using hook_hook_info(), which allows you to declare a file in which a group of hooks can be found. This does not work for the majority of core hooks because they are invoked with customised hook-calling methods. Only the token hooks have it defined in core, so token code can be located in x.tokens.inc.
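For completeness, here's a sketch of what such a declaration looks like - it mirrors what core declares for the token hooks (the module name x is a placeholder):

```php
/**
 * Implements hook_hook_info().
 *
 * Declares that this module's token hooks live in x.tokens.inc, so
 * Drupal loads that file only when those hooks are actually invoked.
 */
function x_hook_info() {
  return array(
    'token_info' => array('group' => 'tokens'),
    'tokens' => array('group' => 'tokens'),
  );
}
```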

However the Block module uses only standard hook-calling functions, so you could specify a file for block management using hook_hook_info_alter(); I wrote something about this here.