Saturday, 13 April 2013

Feed Sources and Self Nodes and Bears (oh my)

Seems it's been nearly a year since I last posted something. I can only say that I have been tied up with Drupal 6 projects and uninspiring Drupal 7 ones. However, something interesting came up with a project of my own over the last day or so. So here we are:

The project in question involves creating a meta-search site on a specific subject with data gathered from various sites. Some of those sites have RSS feeds, some do not.

The Feeds module is fairly awesome if you need to get structured data from somewhere and turn it into nodes or other entities. If you're reading from, say, an RSS feed, it works out of the box, and I must admit I was very impressed with what it can do. There's even a Feeds Crawler module that lets you scrape data.

I appreciate that may make some people feel awkward but this site does nothing except fetch information, allow someone to search and then direct the viewer to the original site. No more than Google or any other search engine.

However for this project there was an issue: even the RSS does not contain all the required information to fill the nodes created. To do that properly we have to go to the original page and scrape the relevant sections - for example, to get the full description and the tags that have been used.

There is another module, Feeds Self Node Processor, which allows a node to become its own feed. This means that you create an importer for the initial information and create the nodes, then each of these new nodes can fetch its own specific information from the target URL.

I'm leaving out a lot of detail here but I hope this is enough to be understood.

Fetches can be scheduled to be performed during cron jobs or switched off completely so that's all fine. Except for one little thing:

There is no option to do a once-only fetch. Now imagine you've imported 10,000 data items: the feeds_source table is now filled with 10,000 self-node rows. If you set the frequency of fetch to the maximum (4 weeks), the first self-node fetch won't happen for 4 weeks. But worse: if you set it to "as often as possible", cron starts to cycle through the existing 10,000 nodes but (as far as I can tell) doesn't do it right and keeps repeating the same content. And new content gets ignored.

Oh dear. The initial joy of getting the feeds process functioning for three different sites was slowly eroded by the realisation that this was just not going to work.

I spent two days playing with various options. The Feeds module provides a wide array of hooks and lots of potential for customisation; however, none of them helped.

Of course I would not be writing this unless I had found a solution.

It became clear what was needed was to delete the relevant row from the feeds_source table. There's a nice class wrapper and method to achieve this which ensures all relevant information is also deleted (like the entry in the job_scheduler table).

And there's a "post import" hook. Theoretically just performing $source->delete() should do the job. Unfortunately, doing that in the "post import" hook doesn't work: the entries are simply recreated because the data is saved after the hook is run. And you can't touch the "last imported" time stamp to set it into the distant future. Simply deleting the relevant job_scheduler row is equally futile.

What was needed was a method of deleting the feeds_source row at some point after everything else had been done. The next page load was an attractive choice initially - I even toyed with the thought of using $_SESSION, but only for a few seconds.

The answer is that under-used hardly-mentioned-anywhere feature of Drupal 7: Queues. And here it is:


/**
 * Implements hook_feeds_after_import().
 *
 * @param $source
 *  FeedsSource object that describes the source that has been imported.
 */
function YOURMODULE_feeds_after_import(FeedsSource $source) {
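  // Only kill sources whose importer follows the "reprocess_" naming
  // convention; adapt this test to however you identify your own.
  if (strpos($source->id, 'reprocess_') === 0) {
    // Get the queue (created if it doesn't exist yet)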
    $queue = DrupalQueue::get('killJobQueue');

    // Build the required job data
    $job = array(
      'type' => $source->id,
      'id' => $source->feed_nid,
    );

    // And put it in the queue
    $queue->createItem($job);
  }
}

/**
 * Implements hook_cron_queue_info().
 */
function YOURMODULE_cron_queue_info() {
  return array(
    'killJobQueue' => array(
      'worker callback' => '_YOURMODULE_kill_source',
      'time' => 5,
    ),
  );
}

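/**
 * Queue worker callback: deletes the feeds source described by a job.
 *
 * @param $job
 *  Array with 'type' (importer ID) and 'id' (feed node ID), as built in
 *  YOURMODULE_feeds_after_import().
 */
function _YOURMODULE_kill_source($job) {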
  feeds_source($job['type'], $job['id'])->delete();
}

We intercept the hook after the import and determine whether this is a feeds source we want to kill. I did this by using a naming convention: all my self-node importers are named "reprocess_[something]". We build the job data and add it to the queue.

It's assumed that many queue-using applications will want to process the queue during hook_cron() so the Queue API provides the functionality for you - apart from the bit that does the actual work.
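To make that division of labour concrete, here is a rough sketch of the loop cron runs for each queue declared in hook_cron_queue_info(). It is not core code: the helper name _YOURMODULE_drain_queue() is made up for illustration, and it takes the queue and the callback as arguments so that it could also be called by hand (from a drush script, say) instead of waiting for cron. The claimItem() and deleteItem() methods are the standard Drupal 7 queue methods.

```php
/**
 * Processes queued items until the queue is empty or time runs out.
 *
 * Hypothetical helper (not part of Drupal or Feeds); it mirrors what cron
 * does with the 'worker callback' and 'time' values.
 *
 * @param $queue
 *  A queue object, e.g. DrupalQueue::get('killJobQueue').
 * @param $callback
 *  The worker callback, e.g. '_YOURMODULE_kill_source'.
 * @param $seconds
 *  Time budget in seconds.
 *
 * @return
 *  The number of items processed.
 */
function _YOURMODULE_drain_queue($queue, $callback, $seconds = 5) {
  $end = time() + $seconds;
  $done = 0;
  // claimItem() returns FALSE when there is nothing left to claim.
  while (time() < $end && ($item = $queue->claimItem())) {
    // The job array given to createItem() comes back as $item->data.
    call_user_func($callback, $item->data);
    // Remove the item so it is not handed out again.
    $queue->deleteItem($item);
    $done++;
  }
  return $done;
}
```

Called as _YOURMODULE_drain_queue(DrupalQueue::get('killJobQueue'), '_YOURMODULE_kill_source'), this would clear pending jobs immediately rather than on the next cron run.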

Now it doesn't matter whether my cron function runs before or after the feeds cron, because somewhere between crons the feeds source rows will be correctly deleted. And it works. Apart from anything else it prevents the feeds_source and job_scheduler tables from getting clogged up with useless data.



Wednesday, 25 July 2012

Form API and AJAX callbacks


Here's a piece of information that's well hidden.

If you're using Drupal's AJAX functionality with forms, but you don't actually want to change the form and instead want to do other stuff on the page, you may run into trouble, because there's something that's not clearly explained about what the AJAX callback function in your PHP code should return.

You know you can add this:

$form['tickbox'] = array(
  '#type' => 'checkbox',
  '#title' => t('Change something on the page'),
  '#ajax' => array(
    'wrapper' => 'id-of-some-div-on-the-page',
    'callback' => 'mymodule_form_ajax_callback',
  ),
);

And when you click the box your function in the PHP gets called:

function mymodule_form_ajax_callback($form, $form_state) {
  return '<div id="id-of-some-div-on-the-page">Hello!</div>';
}

Okay that just does a straight replace. But what if you want to append it instead? The documentation says that this function can return AJAX commands instead. So I can do this, right?

function mymodule_form_ajax_callback($form, $form_state) {
  return ajax_command_append(NULL, 'Hello!');
}

Nope. The documentation says I can return an array of commands. So I can do this, right?

function mymodule_form_ajax_callback($form, $form_state) {
  return array(
    ajax_command_prepend(NULL, 'Hello!'),
    ajax_command_append(NULL, 'Goodbye!'),
  );
}

Nope.

The answer is hidden around line 219 of includes/ajax.inc. This will work:

function mymodule_form_ajax_callback($form, $form_state) {
  return array(
    '#type' => 'ajax',
    '#commands' => array(
      ajax_command_prepend(NULL, 'Hello!'),
      ajax_command_append(NULL, 'Goodbye!'),
    ),
  );
}

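The return value needs to be a renderable array of '#type' => 'ajax' with the commands under '#commands'. Anything else is treated as content to be rendered and inserted into the wrapper, which is why the bare command and the plain array of commands don't do what you'd expect.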
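If the aim is to change something other than the wrapper, the commands can be given an explicit jQuery selector instead of NULL. The following is only a sketch: the selectors '#status-box' and '#sidebar' are invented for the example, and mymodule_ajax_response() is a made-up convenience wrapper, while ajax_command_append() and ajax_command_invoke() are core functions.

```php
/**
 * Wraps a list of AJAX commands in the render array Drupal expects.
 *
 * Hypothetical helper; only the '#type' and '#commands' keys matter.
 */
function mymodule_ajax_response(array $commands) {
  return array(
    '#type' => 'ajax',
    '#commands' => $commands,
  );
}

/**
 * AJAX callback that leaves the form alone and updates other elements.
 *
 * The selectors used here are assumptions made for this sketch.
 */
function mymodule_form_ajax_callback($form, $form_state) {
  return mymodule_ajax_response(array(
    // Append markup to an element outside the form.
    ajax_command_append('#status-box', '<p>Hello!</p>'),
    // Call a jQuery method on another element.
    ajax_command_invoke('#sidebar', 'addClass', array('is-highlighted')),
  ));
}
```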

Sunday, 22 July 2012

jQuery publish/subscribe custom events

There seems to be a fixed idea about custom events in jQuery: that you must bind the custom event and its handler to a specific DOM object (like 'document' or 'body') and trigger the event on that object, which then tells whatever objects might want to know about the event using another event.


There's an example of this here: http://stackoverflow.com/questions/399867/custom-events-in-jquery and another here http://jamiethompson.co.uk/web/2008/06/17/publish-subscribe-with-jquery/ (okay, that's four years ago but it's top of the Google results on this subject).


I may be being stupid (it's been known) but that seems a completely unnecessary step - possibly inherited from OOP coding in a non-HTML environment where it may be necessary to have two stages.


So, here I am building a new carousel system for Drupal 7 - not because I'm a glutton for punishment but because none of the existing options does what I need for my current contract - and come up against this issue and something is nagging me. I've been here before.


If you imagine, a modern carousel has its little indicator buttons to show which slide we're currently on and allow the user to select a slide to view with a click. It also has forward and back arrows (which may be hidden or displayed if there's a slide to go forward or back to). And there may be an auto-change option if the user isn't selecting slides manually.


Now each of those indicators, arrows and auto-scroll items is an object which has behaviours attached. Let's call them carousel "tools". And they need to know what's going on.


Let's say the last indicator is clicked by the user, the system must scroll to the last item and then a check must be made to see if the "next" arrow should be hidden, the old indicator unhighlighted, the new indicator highlighted and the auto-scroll switched off (maybe with another timer started so the auto-scroll restarts after a period of time of user inactivity).


Or, if it's the auto-scroll in action, similar actions must be taken when a new slide is displayed.


Now you could hard-code all this into the slide function but we all know that's naughty tight coupling and will be difficult to write without lots of bugs and to maintain for anyone else. Since each object has its own behaviours this problem is ripe for proper OOP implementation.


So we could encapsulate the tools and then keep a list of those tools and hard-code a function to call in each one of them when the slide changes. Okay, that's better functionality, looser coupling but it can be done better.


Instead what we intuitively want to do is send a custom event saying "this carousel has changed its slide" to every object that needs to know (and don't forget we might have more than one carousel on a page so we also need to distinguish between the tools belonging to each carousel).


Okay, so we could bind a function for the "slideChange" event to the root carousel DOM element and then have that element triggered by the slideChange function (with data including the old and new slide IDs, plus whether this is the first slide or the last slide in the list - so that the previous/next arrows know whether to hide themselves).


But why do we have to use the root carousel element at all?


We don't. What about this:


$('.carousel-tool').trigger('slideChange.' + myCarouselID, { /* ...slide change data... */ });


And in the set-up for each tool we can have this:


$(this).bind('slideChange.' + myCarouselID, function(e, data) {
  var me = $(this);
  // ... process the slideChange event
});


And in the HTML every tool has a "carousel-tool" class. If we do this we are completely encapsulating the actions of the tools. The slideChange function can reference every tool, without actually having to know who they are.


This uses the custom event namespacing feature available in jQuery to ensure that only the tools that belong to a specific carousel have the event triggered when that carousel slide changes. You could namespace the HTML class but that's less efficient in some respects.


Or you could modify the trigger line:


$('#' + myCarouselID).find('.carousel-tool').trigger('slideChange', { /* ...slide change data... */ });


Actually this is probably the most efficient option even if it's not the most elegant, and note you wouldn't use the namespacing in the binding either if you do it this way.


In case you think that's an odd way to do the selection process, it's quicker to have the single $('#id') search on its own and then do a find() from there, than it is to combine the two. (See http://24ways.org/2011/your-jquery-now-with-less-suck.)


So there you are: how to do publish/subscribe properly with jQuery. (In my opinion.)


UPDATE: One caveat: the events propagate up through the DOM tree, which means that if you have a handler for a custom event higher up the tree, it will get called as many times as there are handlers lower down the tree. You can avoid this using the e.stopPropagation() method.



Monday, 9 July 2012

Search API, Facet API and Display Suite

Just spent most of the day tracking down a nasty little issue involving these three modules.

Let's face it: Search API with Facet API (and all the Search API support modules) is brilliant. The core Search is very difficult to customise, and you usually end up with nasty core hacks if you want to do anything clever.

Search API, on the other hand, is lovely, and Facet API is just great, with almost everything extensible. And, of course, you can display search results as a view, which adds another level of delightfulness.

Display Suite is also awesome (I may have already mentioned this).

But if you have all three together - displaying search results and facet blocks on a Display Suite configured page, you may run into the problem of the facet blocks refusing to appear.

The reason is really simple: there are no facets to display.

But, you say (well, I screamed in my head), I'm displaying search results through a view, I'm looking at them right now and I know they have facets.

I finally, eventually, lit upon this issue in Facet API, http://drupal.org/node/1392288, which explains the problem but doesn't come up with any solid solution. The issue is that if the page displays the blocks before the search query has been run, they will be empty, because search results are needed before the facets can be calculated.

Simple? Yes. Solvable? Not easily. There's no way of telling Display Suite what order you want the blocks and content rendered in (it would be a nice touch, but not worth it for just this problem; or maybe a way to defer rendering of blocks to the end). There is talk of having the facet run the query if it's not been run, but that's in the future, if it ever happens.

The solution is legal but ugly.

The way to ensure the query is run first is to do it in hook_init(), in my case by rendering the view and saving the output, then adding the saved output to the page when required. Yucky, but it works.
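A minimal sketch of that workaround might look like the following. The names are all placeholders: 'my_search_view' and 'page' stand in for the real view and display, the path check on 'search' is an assumption about where the results page lives, and mymodule_search_view_output() is a made-up helper. views_embed_view() is the standard Views function for rendering a view display.

```php
/**
 * Renders the search view once per request and remembers the output.
 *
 * 'my_search_view' and 'page' are placeholder view and display names.
 */
function mymodule_search_view_output() {
  static $output = NULL;
  static $rendered = FALSE;
  if (!$rendered) {
    // Rendering the view executes the search query, which is what the
    // facet blocks are waiting for.
    $output = views_embed_view('my_search_view', 'page');
    $rendered = TRUE;
  }
  return $output;
}

/**
 * Implements hook_init().
 */
function mymodule_init() {
  // Assumed path check: only pre-render on the (hypothetical) search page.
  if (arg(0) == 'search') {
    mymodule_search_view_output();
  }
}
```

Wherever the results are finally placed on the page, calling mymodule_search_view_output() again returns the saved markup without running the view a second time.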

UPDATE: Typical really: what I hadn't done was fully explore the DS Extras module, which allows Views pages to be configured. This is very handy, but the issue above still remains.

Wednesday, 4 July 2012

Changing view_mode mid-stream

In my current contract I have to display a hierarchy of complex taxonomy terms - each one as a page with relevant data hanging off it. There are three levels to the taxonomy: Level 1 displays one set of data and links, Level 2 is never displayed on its own (which is something I'll have to take care of) and Level 3 has a different page structure again.

Different page layouts means Display Suite (http://drupal.org/project/ds) and that's great. It works fine when you want to configure a different layout for each node type (or entity bundle) - as long as every node of that type uses the same layout.

DS works fine for taxonomy terms in just the same way as for nodes. It is one of my favourite modules.

But I needed to be able to change the view_mode on the fly: to check what level the taxonomy term is and change the view_mode based on that. Which is when I ran into trouble - and it's not DS's fault. I never had to do it myself, but apparently this was not a problem in D6; in D7 there is a bug which makes it tricky to change view_mode. You can read all about it here: http://drupal.org/node/1154382

This issue gives some hints as to the solution (hook_entity_prepare_view() is not it), and there is a link to this blog here. The solution described is for changing build mode for nodes based on the current theme (so you can change things if you're using a mobile theme).

My solution is the same except slightly more generic: instead of intercepting hook_node_...() hooks, I intercept hook_entity_...() hooks.

I'm just going to throw my code at you because I know you can work out what to do in your own situation.


/**
 * Implements hook_entity_prepare_view().
 *
 * We have to play silly games to change the view_mode, first we intercept
 * hook_entity_prepare_view() and establish what view_mode we want, and
 * save it. But there's a bug which means this has no effect on the output
 * so...
 */
function page_g2g_entity_prepare_view($entities, $entity_type, $langcode) {
  if ($entity_type!='taxonomy_term') {
    return;
  }


  foreach ($entities as $id => $entity) {
    if ($entity->vocabulary_machine_name!='gtg_tags' || $entity->view_mode!='full') {
      // wrong vocabulary or not a "full page"
      continue;
    }


    // Change the display dependent on the number of parents
    switch (count(taxonomy_get_parents_all($entity->tid))) {

      case 1:

        $entity->view_mode = 'g2g_level_1';
        break;

      case 2:

        // Hm. Need to do something else here, maybe
        // do a redirect to the parent term.
        break;

      case 3:

        $entity->view_mode = 'g2g_level_3';
        break;
    }
  }
}


/**
 * Implements hook_entity_view_alter().
 *
 * ...we intercept before the full build is enacted. You can test and see that
 * even though we changed the view_mode in the term itself, it hasn't transferred
 * to the build theme. So having verified we want to do it with this entity
 * we transfer it. And now it gets changed.
 */
function page_g2g_entity_view_alter(&$build) {
  if (isset($build['#entity_type']) && $build['#entity_type']=='taxonomy_term') {
    $build['#view_mode'] = $build['#term']->view_mode;
  }
}


Arguably you don't need the first hook; you could do it all in the second. But it's a matter of elegance and splitting actions into their appropriate locations.
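For comparison, a single-hook variant could be sketched as below. It is an alternative to the two functions above, not an addition to them (it reuses the same hook name), and the helper page_g2g_view_mode_for_depth() is invented for this example. The vocabulary and view mode names are the ones used above.

```php
/**
 * Maps the number of terms in the parent chain to a view mode.
 *
 * Hypothetical helper; returns NULL when the view mode should stay as it is.
 */
function page_g2g_view_mode_for_depth($depth) {
  $modes = array(
    1 => 'g2g_level_1',
    3 => 'g2g_level_3',
  );
  return isset($modes[$depth]) ? $modes[$depth] : NULL;
}

/**
 * Implements hook_entity_view_alter().
 *
 * Single-hook variant: decide on and apply the view mode in one place.
 */
function page_g2g_entity_view_alter(&$build, $type) {
  if ($type != 'taxonomy_term' || !isset($build['#term'])) {
    return;
  }
  $term = $build['#term'];
  if ($term->vocabulary_machine_name != 'gtg_tags' || $build['#view_mode'] != 'full') {
    // Wrong vocabulary or not a "full page".
    return;
  }
  // taxonomy_get_parents_all() includes the term itself, so a top-level
  // term gives a count of 1.
  $depth = count(taxonomy_get_parents_all($term->tid));
  $mode = page_g2g_view_mode_for_depth($depth);
  if ($mode !== NULL) {
    $build['#view_mode'] = $mode;
  }
}
```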

Tuesday, 29 May 2012

Then three come all at once

I have been doing some work on the field_extract module: a few minor fix-ups, most of which wouldn't be noticed, plus added support for the entityreference field. You can find this module here: http://drupal.org/project/field_extract

Someone I worked with recently has put out my "deeplink" module, which allows otherwise hidden content to be made available on a specific trackable URL. My version was Drupal 6 (that's what I was working on at the time); he's doing the D7 upgrade. You can find it here: http://drupal.org/project/deeplink

But deeplink needs my Controls module, and that's a baby that needs explanation. That explanation is available on the project page, http://drupal.org/project/controls - there's both a D6 and a D7 version, both written by lil ole me.

Briefly: Controls is an API module which provides a similar function to CTools plugins (they have a lot in common), but requires virtually no setting up and is much easier to use. Now I always say that to people but I did seriously wonder whether it was true, so for the last commercial project I worked on I didn't use Controls, I went back to using CTools plugins instead. I desperately wished I hadn't.

So I'll stick by my statement - I think Controls is easier to use and in some ways more versatile than CTools plugins.


However, it is also more easily abused. It's something for developers working on an end-client's website. Anyway, I'll let you be the judge.


Thursday, 12 April 2012

Entities vs Nodes

For the last six months I've been working on a Drupal 6 site professionally, but developing a couple of personal Drupal 7 sites. However I haven't been going into D7 in any great detail - except for developing entities.

Now I've moved contracts and I'm developing a new Drupal 7 site to replace an older non-Drupal site with lots of additional facilities. And the Drupal implementation is entirely up to me.

Being an OOP person at heart I love the concept of entities but when working with a commercial website you have to make some serious decisions. My personal preferences have to give way to the reality of building a website that delivers the spec and can be maintained and extended by other developers in the future.

So how do you decide what should be a node and what should be an entity?

There's a line in this blog post which says:
"You can now actually create data structures specific to your application domain with their own database table (or any other storage mechanism really) plus a standardised way to add fields to them. No need to turn nodes into something they are not."

That is technically accurate but does not always provide clear guidance, so here's my step-by-step analysis method to decide whether a specific data structure should be a node or not:

1. Is it content? Is the item definitely stuff that gets turned into HTML and displayed for the user? If you were building a review site, a review would definitely be content, so that's a node.

2. Does it need to have revisions? The node module provides revisions and the associated modules make it easy and powerful. Building revisions into custom entities is hard. So if it has to have revisions then it has to be a node.

3. Is there additional "property" (as opposed to "field") information? I had a situation where I wanted to define a "proxy", and a proxy needs a web address, a port number, maybe a username and password. These are fundamental properties of a proxy. You could add these as fields for a node, of course, but it makes more sense for them to be properties of a proxy entity. So this should be considered as an entity. (Another way of looking at this is: is there a need for a new table containing information specific to this data structure? If so, think entity.)

4. Is the structure normally invisible to the user? That should be an entity.

5. Would using an entity instead of a node obscure the function? Perhaps this is tricky to answer; after all, dividing functionality out to a new object should never make things more complex. But it's worth asking the question.
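To illustrate point 3, the proxy example could be declared roughly as follows. This is only a sketch under assumptions: the module name mymodule, the table and entity name mymodule_proxy, and the column names and sizes are all invented here. hook_schema() would go in the module's .install file and hook_entity_info() in the .module file.

```php
/**
 * Implements hook_schema().
 *
 * One column per proxy property; names and sizes are illustrative.
 */
function mymodule_schema() {
  $schema['mymodule_proxy'] = array(
    'description' => 'Stores proxy entities.',
    'fields' => array(
      'pid' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
      'address' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      'port' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 8080),
      'username' => array('type' => 'varchar', 'length' => 64, 'not null' => FALSE),
      // Stored as-is in this sketch; a real site would want to protect it.
      'password' => array('type' => 'varchar', 'length' => 128, 'not null' => FALSE),
    ),
    'primary key' => array('pid'),
  );
  return $schema;
}

/**
 * Implements hook_entity_info().
 */
function mymodule_entity_info() {
  return array(
    'mymodule_proxy' => array(
      'label' => t('Proxy'),
      'base table' => 'mymodule_proxy',
      'entity keys' => array('id' => 'pid'),
      // Properties live in the base table; fields can still be attached.
      'fieldable' => TRUE,
    ),
  );
}
```

The properties (address, port and so on) are plain columns in the entity's own table, which is the distinction from fields drawn in point 3.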

Any other ways to help make the decision?

Remember: there is virtually no overhead in creating a new entity. And there are huge advantages in additional functionality that the core, Entity API, Views, Features and other entity-related modules can give you.