
November 05, 2010

Incorporating the NBII Thesaurus into Drupal

Here's an email from a colleague (i.e., I deserve no credit for this, but I want to post it because it's relevant):

Hi folks,

"Integrate the NBII Biocomplexity Thesaurus" (integrate in our Drupal work, that is)

The task was stated clearly in both our NSF supplement and the scope of work of our LTER / NBII cooperative agreement.

We have been exploring this and related ideas every now and then, but upon hitting stumbling blocks, we moved on to more pressing matters.

Yesterday I spent some time revisiting the taxonomy_xml module, at the risk of experiencing that insidious déjà vu feeling again. Kyle knows what I mean; I explain this a bit later.

Using the taxonomy_xml dev version, and being open minded, I found a winning strategy to at least transform and integrate NBII's thesaurus into the powerful Drupal taxonomy structure.

The workflow is actually simple: install the module and work with the particular CSV format that the module imports successfully (see ISO 2788... well, not really; better to see the snippet below). NBII's thesaurus is also served as an XML dump, following the Library of Congress schema for thesauri (I believe). I wrote a small Perl script to transform it into the importable CSV format, and voilà.

50,381 triplets. Here is a snippet:
Zoledronic acid, Related Terms, Risedronic acid
Zoledronic acid, Related Terms, Tiludronic acid
Zona pellucida, Broader Terms, Ova
Zonal distribution, Broader Terms, Geographical distribution
Zonation, Used for, Ecological zonation
Zonation, Related Terms, Benthos
Zonation, Related Terms, Biogeography
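
The Perl script itself is not included in the email. As a rough illustration of the flattening step only, here is a minimal Python sketch. It assumes a hypothetical XML layout (a `concept` element holding a `term` plus `BT`, `RT` and `UF` children); the real NBII dump may well use different element names, so the mapping table would need adjusting. It is not the script that produced the file above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tag names (an assumption, not the real NBII schema),
# mapped to the relation labels seen in the triplet snippet.
RELATION_LABELS = {
    "BT": "Broader Terms",
    "RT": "Related Terms",
    "UF": "Used for",
}


def flatten_thesaurus(xml_text):
    """Return (term, relation, other term) triplets from the assumed layout."""
    root = ET.fromstring(xml_text)
    triplets = []
    for concept in root.iter("concept"):
        term = (concept.findtext("term") or "").strip()
        if not term:
            continue  # a concept without a term cannot be expressed as a triplet
        for child in concept:
            label = RELATION_LABELS.get(child.tag)
            if label and child.text and child.text.strip():
                triplets.append((term, label, child.text.strip()))
    return triplets


def write_triplets(triplets, out):
    """Write one triplet per line; csv quotes any term that contains a comma."""
    csv.writer(out, lineterminator="\n").writerows(triplets)


# Illustrative input in the assumed layout, reusing terms from the snippet.
SAMPLE = """<thesaurus>
  <concept>
    <term>Zonation</term>
    <UF>Ecological zonation</UF>
    <RT>Benthos</RT>
    <RT>Biogeography</RT>
  </concept>
  <concept>
    <term>Zona pellucida</term>
    <BT>Ova</BT>
  </concept>
</thesaurus>"""

if __name__ == "__main__":
    buffer = io.StringIO()
    write_triplets(flatten_thesaurus(SAMPLE), buffer)
    print(buffer.getvalue(), end="")
```

Note that `csv.writer` separates fields with a bare comma, whereas the snippet above shows a comma followed by a space; whether taxonomy_xml cares about that difference is not something this sketch settles.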

This promising taxonomy_xml module (here is where the déjà vu comes in) fails to deliver most of its stated functionality. During a careful module hunt and exploration, it is frustrating when most of the "stuff" does not work as expected. Part of the problem is that web services change; the risk of third-party dependencies is felt again.

After finding the right formula, all I had to do was flatten the NBII thesaurus XML file that Lisa Zolly provides (Giri too) and then ingest it. In doing so, I delved some more into the little-used Drupal taxonomies; I had no clear idea that you can actually express a thesaurus in a Drupal taxonomy while losing little of its relational structure. I had forgotten about the "Related term". The "Synonym" feature took us closer, but it is the ability to encode tighter relations between terms that adds to the well-known Drupal taxonomical (hierarchical) structure.

Finally, I had to break the triplets file into 5,000-line chunks to avoid PHP script timeouts. The verbosity of the script is surprisingly enlightening: it questions terms, etc. Fun.
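
The chunking step is easy to reproduce. The following is a minimal Python sketch, not the method actually used for the import; the function name, the output directory and the numbered file-name pattern are all assumptions made for illustration.

```python
from itertools import islice
from pathlib import Path


def split_into_chunks(source, out_dir, lines_per_chunk=5000):
    """Split a text file into numbered files of at most lines_per_chunk lines."""
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open(encoding="utf-8") as handle:
        index = 1
        while True:
            # islice pulls the next batch of lines without loading the whole file
            lines = list(islice(handle, lines_per_chunk))
            if not lines:
                break
            # Hypothetical naming scheme: triplets_001.csv, triplets_002.csv, ...
            chunk_path = out_dir / f"{source.stem}_{index:03d}{source.suffix}"
            chunk_path.write_text("".join(lines), encoding="utf-8")
            written.append(chunk_path)
            index += 1
    return written
```

With 5,000 lines per chunk, the 50,381 triplets would come out as 11 files: ten full ones and a final file of 381 lines. Each file can then be imported separately so that no single request runs long enough to time out.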

I installed the NBII thesaurus in a vanilla Drupal instance (http://inigo.lternet.edu/vanilla). Before moving it somewhere closer to production, I'm asking myself how to use this monster vocabulary effectively. Is it too large, a tangled net of 50,000 terms? Should we ingest only the preferred terms to provide that functionality (through views or autocompletes, for example)?

Should we do an NBII-lite? How do we pack this vocabulary so that it actually provides functionality and not outlandish suggestions?
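
One possible reading of "NBII-lite" is to keep only the preferred terms. The Python sketch below shows how that list could be pulled out of the triplets file before deciding what to ingest. It assumes the usual thesaurus convention that the term on the right of a "Used for" triplet is a non-preferred synonym and that every other term is preferred; whether that holds across the whole NBII file is an assumption, and the function names are made up for this example.

```python
import csv


def load_triplets(lines):
    """Parse 'term, relation, other term' lines, skipping malformed ones."""
    rows = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) == 3:
            rows.append(tuple(field.strip() for field in row))
    return rows


def preferred_terms(triplets):
    """Collect every term except those that appear only as a 'Used for' target."""
    preferred = set()
    for term, relation, other in triplets:
        preferred.add(term)
        if relation != "Used for":
            preferred.add(other)
    return sorted(preferred)


# The seven triplets quoted earlier in the email.
SNIPPET = """Zoledronic acid, Related Terms, Risedronic acid
Zoledronic acid, Related Terms, Tiludronic acid
Zona pellucida, Broader Terms, Ova
Zonal distribution, Broader Terms, Geographical distribution
Zonation, Used for, Ecological zonation
Zonation, Related Terms, Benthos
Zonation, Related Terms, Biogeography
"""

if __name__ == "__main__":
    for term in preferred_terms(load_triplets(SNIPPET.splitlines())):
        print(term)
```

Running the same function over the full file would give a first idea of how much smaller a preferred-terms-only vocabulary would be.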

If any of you have some time to wrap your head around any of these questions, please, I'm all ears. I will proceed to do some functional tweaks too, but the road for NBII knb Thes. is now open; we just need to figure out how to navigate it better.


I couldn't help but reply:

Great stuff, Inigo. After playing with a different type of hierarchical vocabulary (species taxa info, in this case) a few things became clear to me.

1) We're selling ourselves short if we deploy vocabularies that aren't hierarchically structured because a lot of functionality within Drupal is immediately lost to us.

2) There are 250+ modules that extend Drupal's taxonomy system, meaning the potential is nearly limitless, but an up-front investment is needed to locate the best options (and identify what it is we want to accomplish). The good news is that there are several use-cases for Drupal's Taxonomy system, and a well-thought-out solution should work for most or all of them.

Another question is how to deal with the need for multiple keyword vocabularies (i.e., NBII, LTER Keyword Dictionary and site-specific terms)? It may be possible to keep them within one Vocabulary where the ultimate parent term is the keyword source (e.g. NBII). I think terms can have multiple parents, meaning a term could exist within the LTER Dictionary and the NBII simultaneously. Of course, if that is true, it compounds the problem of having an absolutely huge Vocabulary...
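
To make the single-vocabulary idea concrete, here is a small Python model of terms with multiple parents, where each keyword source sits at the top of the tree. This is only a sketch of the data structure, not Drupal code. The class and method names are hypothetical, and the example placement of "Biogeography" under both sources is invented for illustration.

```python
from collections import defaultdict


class Vocabulary:
    """Toy model of a vocabulary in which a term may have several parents."""

    def __init__(self):
        self.parents = defaultdict(set)

    def add_term(self, term, parent):
        self.parents[term].add(parent)
        self.parents.setdefault(parent, set())  # register the parent as a term too

    def sources(self, term):
        """Return the top-level terms (keyword sources) reachable from term."""
        roots, stack, seen = set(), [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            ups = self.parents.get(current, set())
            if not ups:
                roots.add(current)  # no parents: this is a top-level source
            stack.extend(ups)
        return roots


if __name__ == "__main__":
    vocab = Vocabulary()
    # Made-up placements, for illustration only.
    vocab.add_term("Biogeography", "NBII")
    vocab.add_term("Biogeography", "LTER Keyword Dictionary")
    vocab.add_term("Ova", "NBII")
    print(sorted(vocab.sources("Biogeography")))
```

The trade-off noted above shows up here too: every source added this way enlarges the one shared vocabulary.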

Either way, I noticed that taxonomy_xml throws an error on the Status Report page if the RDF module isn't installed. I can't say installing it will help the imports, but it probably won't hurt.



Posted by kkwaiser at November 5, 2010 03:32 PM

