The curse of metadata

Posted by Eddie & filed under IT & the Internet, March 7 2007.

For the last seven years of my professional life, one issue has dominated above all others, and that is metadata. Metadata is a simple notion really, that of describing things in a summarised fashion so that they can be discovered by searching catalogues and then used in a practical way. A library book index is an example of metadata, as is a telephone directory.

Given that the Internet and the Web are all about the distribution of information, and that the structure and arrangement of that information has been up until now incredibly chaotic, it was inevitable that metadata would be seen as the answer and the way to impose order. This is the thinking behind the Semantic Web, and underlies many of the projects I work on.

For metadata to work on the Internet, where computers communicate directly with each other via ‘m2m’ (machine to machine, as opposed to a human being interpreting the information) interfaces, standardisation is essential for the interoperability of separate components. And this is where the fun really starts.

There is really no issue about the syntax and encoding formats of this metadata as it is carried over HTTP; XML and UTF-8 are relatively straightforward, have become widely used and do their job well. The problem comes with the semantic part of all this: what words are used to describe data, what order are they in, and what do they mean?

RDF is one attempt to standardise a ‘model’ of metadata arrangement but there are others, from simple name-value lists (or ‘hash tables’) to massively hierarchical and recursive structures. As for the actual words (or terms, or elements, or keywords) used, the number of standards and specifications is almost endless.
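To make the difference between those models concrete, here is a rough sketch of the same record expressed both ways. The field names and values are invented for illustration and are not taken from any real standard.

```python
# Hypothetical record describing the same dataset in two different "models".

# 1. A flat name-value list (a hash table): easy to build, easy to search.
flat = {
    "title": "River levels 2006",
    "creator": "Example Agency",
    "keyword": "hydrology",
}

# 2. A hierarchical, recursive structure: more expressive, harder to consume.
nested = {
    "identification": {
        "citation": {"title": "River levels 2006"},
        "pointOfContact": {"organisation": {"name": "Example Agency"}},
        "keywords": [{"thesaurus": "local", "terms": ["hydrology"]}],
    }
}


def flatten(node, path=""):
    """Walk a nested structure and yield (dotted path, value) pairs."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{path}[{index}]")
    else:
        yield path, node


# Every leaf value now sits behind a long path that a consumer has to know.
paths = dict(flatten(nested))
```

The title that a flat list exposes as `title` ends up at `identification.citation.title` in the nested form. Any two systems exchanging the record have to agree on every one of those paths.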

In my experience, for a metadata standard to be useful it must have three components:

An information model, i.e. the arrangement and detailed definitions of the metadata elements that make up the standard
A ‘binding’, i.e. how the information model is to be encoded for transport across the Internet – this is almost always via an XML schema
Developer-friendly implementation guidelines
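As a minimal sketch of how the first two components relate, assume a toy standard with three elements. The `Record` class, the `md` prefix and the `http://example.org/metadata/1.0` namespace are all made up for this example; the information model is the class definition and the binding is the pair of functions that turn it into XML and back.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


# Information model: which elements exist and what they mean.
@dataclass
class Record:
    title: str    # human-readable name of the resource
    creator: str  # person or organisation responsible for it
    keywords: list = field(default_factory=list)  # free-text subject terms


# Binding: how the model is encoded for transport (namespace URI is invented).
NS = "http://example.org/metadata/1.0"


def to_xml(record: Record) -> str:
    """Serialise a Record as namespaced XML."""
    ET.register_namespace("md", NS)
    root = ET.Element(f"{{{NS}}}record")
    ET.SubElement(root, f"{{{NS}}}title").text = record.title
    ET.SubElement(root, f"{{{NS}}}creator").text = record.creator
    for word in record.keywords:
        ET.SubElement(root, f"{{{NS}}}keyword").text = word
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Record:
    """Parse XML produced by to_xml back into a Record."""
    root = ET.fromstring(text)
    return Record(
        title=root.findtext(f"{{{NS}}}title"),
        creator=root.findtext(f"{{{NS}}}creator"),
        keywords=[e.text for e in root.findall(f"{{{NS}}}keyword")],
    )
```

The third component, the implementation guidelines, is whatever documentation lets another developer write a compatible `from_xml` without reading this code.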

Probably the most widely used metadata standard across the Internet is Dublin Core. This was originally used for bibliographic applications, but has become widely used for other things. It’s a lot simpler than other standards but doesn’t fulfil the requirements of describing more complex data.

Two fields that I have worked in that have quite complex data structures are Learning Technology and Geographic Information. There are many standards of all sorts that have been developed from within these two fields, but two stand out as examples of what has made working with metadata so frustrating: the IMS ‘Learning Object’ metadata specification (learning technology) and ISO 19115 and the related OGC Catalogue Service specification (geographic information).

Both of these specifications (they are not ‘standards’ in the strictest sense) are incredibly hard to implement. The whole point of standardising something is so that separately developed applications and tools can interoperate with each other via agreed interfaces. But there are many barriers to members of an interested community (or ‘domain’) doing this with these two standards, the most important being their overwhelming complexity (due to huge numbers of elements and a very complex structure), but also the fact that obtaining the full documentation describing these standards requires payment.

These standards were intended to promote global data interoperability across the Internet, but the complexity of the standards has caused the development of ‘application profiles’. Essentially these are secondary standards that a non-global community with a more localised agenda has adapted from the primary standards, and may use a different national language, or a different set of vocabularies (e.g. discipline-specific thematic keywords) or a simpler subset of the metadata structure.

Sometimes these application profiles are so different from the primary metadata standard that they in effect become new, separate standards. They use a different information model which means a different XML schema, and have different semantics. They become uninteroperable, and cause the creation of a data ‘island’ on the Internet, the exact opposite of the original intention. This, despite the new standard being proudly proclaimed as being fully interoperable with the rest of the world whilst at the same time fulfilling the needs of a community with a narrower focus. An unfortunate example of this is the Gemini standard.

What seems to have been lost in the drive to develop more varied and more complex metadata standards is that they have to be used by software systems, not people. And for two software systems to exchange data (or talk) over the Internet of the early 21st century, they must communicate using XML schemas. If they use different XML schemas, then they are at best talking in different dialects, and all too often in entirely different languages.
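A small sketch of what ‘different languages’ means in practice. Both namespaces and both element names below are invented; the point is only that a parser written against one schema finds nothing at all in a record written against another, even when the two describe the same thing.

```python
import xml.etree.ElementTree as ET

# Both namespaces and element names are invented for illustration.
PRIMARY_NS = "http://example.org/primary"
PROFILE_NS = "http://example.org/profile"

# A record as written by a system using the primary schema.
record = f'<record xmlns="{PRIMARY_NS}"><title>River levels 2006</title></record>'


def read_title(xml_text, namespace, element):
    """Return the text of `element` in `namespace`, or None if it is absent."""
    root = ET.fromstring(xml_text)
    return root.findtext(f"{{{namespace}}}{element}")


# A reader built for the primary schema finds the title.
same_schema = read_title(record, PRIMARY_NS, "title")

# A reader built for a profile that renamed the element finds nothing.
other_schema = read_title(record, PROFILE_NS, "name")
```

Here `same_schema` holds the title and `other_schema` is `None`. The second system does not fail loudly; it simply behaves as if the record had no title.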

This mess is due to political and financial considerations gaining too much attention over the technical issues that need to be addressed. It’s worth stating here what I consider to be the true test of whether a metadata standard, or indeed an application profile of that standard, is truly interoperable and therefore something worth investing time and effort in (perhaps I should call it Eddie’s First Law of Metadata):

A metadata record (i.e. an XML file containing the metadata elements describing a piece of data) that has been created by a software system that claims to support metadata standard ‘x’, can be imported into an entirely separately developed second software system that claims to support an application profile of metadata standard ‘x’ (or the primary standard itself), and viewed or parsed by that second system in a fashion that allows it to fulfil its own requirements, with no loss of metadata and with no requirement to change the system code to map the imported metadata elements to other elements.
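The ‘no loss of metadata’ part of that test can be sketched as an automated check. This is only an illustration under simplifying assumptions: the second system is represented by a hypothetical `importer` function that takes XML text and returns the XML it would write back out, and ‘loss’ is judged purely by whether every element value survives. It says nothing about whether the second system can actually meet its own requirements with the record.

```python
import xml.etree.ElementTree as ET


def leaf_values(xml_text):
    """Collect (tag, text) for every element that carries non-blank text."""
    root = ET.fromstring(xml_text)
    return {
        (el.tag, el.text.strip())
        for el in root.iter()
        if el.text and el.text.strip()
    }


def passes_first_law(exported_xml, importer):
    """True if the importing system keeps every value the exporter wrote.

    `importer` stands in for the second system, used as shipped with no
    mapping code added for this particular exporter.
    """
    return leaf_values(exported_xml) <= leaf_values(importer(exported_xml))


# Two hypothetical second systems.
def faithful_importer(xml_text):
    """Keeps everything it is given."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


def lossy_importer(xml_text):
    """Implements a profile that only understands <title>."""
    root = ET.fromstring(xml_text)
    for child in list(root):
        if child.tag != "title":
            root.remove(child)
    return ET.tostring(root, encoding="unicode")


exported = (
    "<record><title>River levels 2006</title>"
    "<creator>Example Agency</creator></record>"
)
```

With these two stand-ins, `passes_first_law(exported, faithful_importer)` is true and `passes_first_law(exported, lossy_importer)` is false, because the creator is silently dropped.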

Unfortunately as a software engineer, it has all too often been me who has had to build software applications that implement these metadata standards and make the claims of interoperability a reality. I’ve done this with varying degrees of success over the years but always with serious gnashing of teeth and complaints. Almost nothing I’ve worked with passes the test above.

The only way things will work in the future is if software applications are allowed to work with metadata that has been developed with two guiding principles in mind: simplicity and clarity. Two things I constantly strive for in myself and search for in others but all too often never get.

So, faced with the issue of the chaotic data environment that exists on the Internet, what are the potential ways an organisation or a software engineer can approach this? The way I see it there are four:

Implement globally-supported and easily-implemented metadata standards, whilst compromising on full descriptive capabilities.
Muddle through, remaining non-committal, whilst attempting to appease as many of the powerful financial and political players in your corner of the Internet as you can, even though they may have conflicting and incompatible aims. Pay lip-service to the idea of working with the ‘big’ metadata standards so that you can get invites to any parties that are happening.
Use no standards at all: support the radical new model of allowing any user (not just discipline-specific data experts) the ability to supply their own metadata in the format they desire, e.g. del.icio.us and Connotea. This approach has yet to acquire a widely used buzzword but is known as ‘social bookmarking’ and ‘tagging’.
Do nothing and go to the pub.

And if you want more preaching about metadata from someone with even more bile than me have a look here.
