Posted by & filed under Books, IT & the Internet, July 5 2007.

I’ve mentioned metadata in a few of my previous postings, and a closely related concept is the ‘controlled vocabulary’. The term is widely misunderstood, so I’m going to try to define what it is and where it sits in the grand scheme of things, drawing on my own experience.

Previously, when I was writing about metadata, I was strictly speaking writing about metadata schemas. These define the structure and arrangement of metadata elements (such as title, description and address), but often say nothing about the formal syntax and definition of the content that actually goes into these elements. For example, a metadata schema may contain an element whose purpose is to uniquely identify the data in question (an ‘identifier’), but many different things could provide the content of that element: an ISBN for a book, or a serial number for a product in a sales inventory. Likewise, many metadata schemas have elements called things like ‘keyword’, ‘subject’, ‘discipline’, ‘theme’ or ‘topic’ – what sort of conceptual system would provide the content for these?
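To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration (the element names in SCHEMA, the terms in SUBJECT_TERMS, the sample records and the check_record function); it is not taken from any real metadata standard. The schema only says which elements may exist, while the vocabulary says what may go into one of them.

```python
# Hypothetical schema: it names the elements but says nothing about content.
SCHEMA = ("identifier", "title", "description", "subject")

# Hypothetical controlled vocabulary for the 'subject' element only.
SUBJECT_TERMS = {"botany", "zoology", "geography"}


def check_record(record):
    """Return a list of problems found in a metadata record (a dict)."""
    problems = []
    # Structural check: every element must be declared in the schema.
    for element in record:
        if element not in SCHEMA:
            problems.append(f"unknown element: {element}")
    # Content check: 'subject' must come from the controlled vocabulary.
    subject = record.get("subject")
    if subject is not None and subject not in SUBJECT_TERMS:
        problems.append(f"subject not in vocabulary: {subject}")
    return problems


# Two made-up records: the identifier content differs (a book number versus
# a serial number), yet both fit the same schema element.
book = {"identifier": "example-isbn", "title": "An example book", "subject": "botany"}
widget = {"identifier": "example-serial", "title": "An example widget", "subject": "gadgets"}

print(check_record(book))
print(check_record(widget))
```

Both records are structurally valid against the schema; only the second fails, and it fails on content rather than structure.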

The concept of metadata comes largely from the bibliographic world (where metadata is more often referred to as catalogue information), and these sorts of elements have in that world traditionally been defined by bibliographic classification systems like the Dewey Decimal Classification system or the Library of Congress Classification system. These classification systems are actually sophisticated controlled vocabularies and are more properly defined as knowledge domain ontologies – they are trying to categorise or classify areas of knowledge with labels (or ‘terms’) agreed upon by all interested parties within that domain.

Ontologies are often hierarchical in nature, and are then also referred to as taxonomies. Another example of this sort of ontology from a narrower domain than bibliography is the Linnaean taxonomical system of botanical and zoological names.
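A hierarchical vocabulary of this kind can be sketched as a simple child-to-parent mapping. The Python below is only an illustration: the PARENT table holds one well-known zoological lineage, and the lineage function is a hypothetical helper, not part of any real taxonomic software.

```python
# Each term points to its broader (parent) term; the root has no entry.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
}


def lineage(term):
    """Walk from a term up to the root of the hierarchy."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(" < ".join(lineage("Homo sapiens")))
```

Because every term has at most one parent, the structure is a tree, which is what makes it a taxonomy rather than a more general network of terms.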

Discussion of controlled vocabularies and classification systems often focuses exclusively on thematically-based library-style keywords, but controlled vocabularies can be used to define almost any descriptive concept that a metadata element exists to provide for. For instance, material that can be used for pedagogical purposes (in the form of ‘learning objects’) may have an intended target audience. This target audience could be defined by a controlled vocabulary consisting of terms like ‘pre-school’, ‘primary’, ‘higher’, ‘further’ and so on. Also, my own recent work has involved data that is geographical in nature, and describing this sort of data adequately requires placename metadata elements – the set of placenames used comprises a controlled vocabulary or, in this particular context, a gazetteer.
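Both of these examples map naturally onto small data structures. In the sketch below, the Audience enumeration uses the four terms mentioned above; the GAZETTEER table and the preferred_placename function are hypothetical and deliberately tiny, just enough to show a gazetteer resolving variant names to one preferred form.

```python
from enum import Enum


class Audience(Enum):
    """Closed set of target-audience terms; nothing else is accepted."""
    PRE_SCHOOL = "pre-school"
    PRIMARY = "primary"
    HIGHER = "higher"
    FURTHER = "further"


# Hypothetical gazetteer: variant spellings mapped to one preferred placename.
GAZETTEER = {
    "edinburgh": "Edinburgh",
    "glasgow": "Glasgow",
    "glaschu": "Glasgow",  # Gaelic name resolved to the preferred entry
}


def preferred_placename(name):
    """Return the preferred placename, or None if the name is not listed."""
    return GAZETTEER.get(name.strip().lower())


print(Audience("primary"))
print(preferred_placename(" Glaschu "))
print(preferred_placename("Atlantis"))
```

Looking up a value that is not in the enumeration, such as Audience("secondary"), raises a ValueError, which is exactly the kind of restriction a controlled vocabulary is meant to impose.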

Classifying things in this way causes all sorts of practical as well as philosophical problems. It assumes that reality can be split up into granular chunks, that a label applied to a chunk (a word like ‘forest’) adequately describes that chunk of reality, and that anyone who understands the language used by the system will agree on what should be classified under that label.

The actual situation is not like this of course. Librarians and knowledge domain specialists argue endlessly over classification definitions and semantics (why use ‘forest’, when ‘wood’ or even ‘ecosystem’ might be more appropriate?), and librarians constantly have to make fudges and ad-hoc compromises so that they can offer a functional service.

Many people see these sorts of things as ‘metadata’ and use that term when talking about classification systems, but this misses the subtle (but important!) distinction between metadata schemas and the controlled vocabularies used to populate certain elements within a metadata schema.

And why is all this important to a software engineer? Because increasingly these ideas are being used to bring order to the chaos and gigantic scale of data on the Internet (see my previous posting ‘The curse of metadata’). The distinction between a librarian and a software engineer working on Internet-based applications is becoming increasingly blurred. We are all data specialists now.

Controlled vocabularies are essential to the concept of ‘browsing’ for data: the user selects a ‘keyword’ from a controlled vocabulary, and that keyword is then matched against data that has been ‘tagged’ with it.
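As a rough sketch of this browsing model, assume a small vocabulary and a handful of tagged records. All of the names and data here (VOCABULARY, RECORDS, build_index, browse) are made up for illustration and do not describe any particular cataloguing system.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary and hand-tagged records.
VOCABULARY = {"forest", "river", "boundary"}

RECORDS = [
    {"title": "Woodland survey", "keywords": {"forest"}},
    {"title": "Parish map", "keywords": {"boundary", "river"}},
    {"title": "Flood report", "keywords": {"river"}},
]


def build_index(records):
    """Map each vocabulary keyword to the titles tagged with it."""
    index = defaultdict(list)
    for record in records:
        for keyword in record["keywords"]:
            if keyword in VOCABULARY:
                index[keyword].append(record["title"])
    return index


def browse(index, keyword):
    """Return titles for a keyword; only vocabulary terms may be browsed."""
    if keyword not in VOCABULARY:
        raise ValueError(f"not a vocabulary term: {keyword}")
    return index.get(keyword, [])


index = build_index(RECORDS)
print(browse(index, "river"))
```

The user never types a query; they can only pick from the terms that the vocabulary offers, and a record is found only if somebody tagged it with that exact term.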

This concept is central to the way that traditional computer-based library cataloguing systems work – but crucially, it is not the way that web-based search engines like Google work. Google harvests and then indexes the textual content of the data itself (in the form of web pages) and then matches this with a user-supplied string pattern that can be absolutely anything – a user is not restricted to a controlled vocabulary and they can use the entire scope of the language they happen to be fluent in. And what is more, Google does not require that the data it searches be described by metadata to do this.
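The contrast with free-text search can be shown with a toy inverted index. This is emphatically not how Google is implemented; it is a minimal sketch under my own assumptions, with invented page content and hypothetical function names (tokenize, build_text_index, search), showing only that the index is built from the text itself and that any word may be used as a query.

```python
import re
from collections import defaultdict

# Invented page content; no metadata or tags are attached to it.
PAGES = {
    "page1": "A walk through the forest beside the river",
    "page2": "The parish boundary follows the river",
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_text_index(pages):
    """Map every word that occurs in the text to the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages containing every word of a free-text query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


text_index = build_text_index(PAGES)
print(search(text_index, "forest river"))
print(search(text_index, "ecosystem"))
```

A query for a word that appears nowhere simply returns nothing; it is not rejected, because there is no vocabulary to reject it against.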

Other web-based search engines like Yahoo! and AltaVista (now incorporated into Yahoo!) also used to offer taxonomical cataloguing and keyword-based searches, but the crucial thing is that no-one wants to use this sort of search now, and this method of searching the web seems to have quietly vanished from sight (Yahoo! still has the ‘Yahoo! Directory’ search but it’s not prominent). It is a fact of life that searching the web via a free text field using Google is what people want to do, and more often than not they are satisfied with the results. There are difficulties with searching non-textual data (such as images), but no-one seems to really care about this when Google seems to work so well (and it does, much to the despair of traditional information specialists such as librarians and archivists).

My own opinion is that controlled vocabularies are too unwieldy and inflexible for the requirements of the Internet. Crucially, the concept does not scale to the volume of data involved. The only way that order can be imposed in a future of ever-expanding data volume on the Internet is to embrace the concept of user-generated descriptive labels (or ‘folksonomies’) – something that often brings cries of outrage from librarians and knowledge specialists such as scientists.

Folksonomies have many disadvantages – anonymous Internet users are often not specialists or experts in a particular field, and mass ‘tagging’ of data results in anomalies such as the terms ‘boundary’ and ‘boundaries’ being treated by dumb software as labels for two distinct concepts, rather than what intelligent humans understand them to be (the singular and plural forms of the same concept). However, the advantages of a folksonomy (or ‘social bookmarking’) approach outweigh these problems. A web-based service like Wikipedia contains so much information that a strict and unchanging controlled vocabulary would severely limit its utility; instead, Wikipedia allows users to supply their own entry classifications in a taxonomy of themes. eBay uses a taxonomy that cannot be changed by users, but it is constantly revised to reflect the content on eBay and conforms to no standard.
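The ‘boundary’ and ‘boundaries’ anomaly is easy to reproduce, and a crude fix is easy to sketch. The normalize_tag function below is a hypothetical, deliberately naive rule of my own (lower-case the tag and strip a simple English plural); a real system would more likely use a proper stemmer or lemmatiser, and this rule will get plenty of words wrong.

```python
from collections import Counter


def normalize_tag(tag):
    """Crudely fold case and simple English plurals (an assumption, not a stemmer)."""
    tag = tag.strip().lower()
    if tag.endswith("ies") and len(tag) > 4:
        return tag[:-3] + "y"  # boundaries -> boundary
    if tag.endswith("s") and not tag.endswith("ss") and len(tag) > 3:
        return tag[:-1]  # maps -> map
    return tag


# Invented tags as they might be typed by different users.
raw_tags = ["boundary", "Boundaries", "maps", "map", "boundaries"]

raw_counts = Counter(raw_tags)
merged_counts = Counter(normalize_tag(tag) for tag in raw_tags)

print(len(raw_counts), "distinct tags before normalising")
print(dict(merged_counts))
```

Before normalising, the software sees five unrelated labels; afterwards it sees two concepts, which is closer to what the people doing the tagging actually meant.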

The future of the Internet (and perhaps knowledge itself) is heading towards this: the task of dividing up the universe and labelling the things within it will not lie in the hands of scientists, librarians, philosophers or theologians, but will be consensus-based and will reflect what the entire human community (or at least that part of it that uses the Internet) agrees upon. This is some way from the biblical tale of Adam alone giving names to the animals within Eden and perhaps it’s a little scary to contemplate, but advancement towards this future is unstoppable whilst the Internet exists.

Unfortunately, the academic world is still wedded to controlled vocabularies. Projects like HILT exist to enhance web-based services by overcoming the multiplicity of controlled vocabularies in use in disparate data domains, but they will ultimately fail due to the scalability issues I’ve touched upon here.

4 Responses to “Controlled vocabularies and why you should be interested in them”

  1. Eddie

    Hardly any of this blog posting is original thought on my part – I guess I’m just reinforcing what Clay Shirky has already said about ontologies here.

  2. Heimo Hänninen

    Very good read – thank you!

    “Ontologies are often hierarchical in nature, and are then also referred to as taxonomies.”

    As data models, I’d rather say ontologies are network structures, similar to entity-relationship (ER) models: any entity (a.k.a. topic, object, subject, instance, node) can relate to any other entity within certain defined constraints. Typically entities and relationships are typed and equipped with the necessary attributes.
    A taxonomy is a hierarchical categorization structure, i.e. a tree as a data structure. Should you add some controlled associations to your taxonomy, such as narrower or broader terms, the result is often called a thesaurus.

    The most powerful model is the ontology, since taxonomies, thesauruses and flat lists of controlled terms can be automatically derived from the richer network model.

    I agree that controlled vocabularies alone cannot fulfill the somewhat contradictory requirements of business control and usability. You need to seamlessly combine “folksonomy style” free terms and controlled terms, so that recommended terms are easy to tag with (for consistency and business relevancy) but new concepts can be easily proposed to the “dynamic library of terms”.

    The last part of the puzzle is to bring external information sources close to your company terms. Say, your term tagging device also shows common related terms from Wikipedia. My final word is about definitions. A definition that clarifies a term’s meaning in a given context is mandatory too. Users might want to see the definition as a mouse-over tooltip on terms all over the desktop and tools.

