Monday, August 22, 2005

OK... so what's a document?

What is a Document?
If we’re going to discuss document (or content) management systems of various sorts, we have to frame the discussion with what may seem like a trivial question: What is a document? Most of us have moved beyond the idea that a document is only paper, lambskin, or papyrus, but once you expand the definition to include electronic content, the answer to what constitutes a document gets fuzzy. So here’s how I define documents and document content.

First, a document is any piece of meaningful content, whether public or private, often with some of the look-and-feel of physical renditions like paper books. Books are merely physical instances of electronic documents. “Meaningful content” extends beyond mere words and includes video and audio. If you had to take a chance that every iPod download would be random static, you probably wouldn’t bother buying the device or repeating the download. And these days, multimedia content usually comes with its own metadata: information about the content, such as its author, when it was created, and so on.

But wait, you ask: Catalogs are books, and catalog content these days is usually found in a database. So are databases documents too? No, although I assert that the reverse is true in a way: “documents are data too,” a nice little bumper sticker thought.

I see document content as existing on a spectrum, ranging from databases (the most highly structured form of content) to highly structured content (increasingly expressed in XML) to forms to documents that are more-or-less subtly structured. Note that I don’t use the term “unstructured documents” as many others do. To me, “unstructured document” is an oxymoron: Either documents have structure of some sort (and therefore have use and meaning) or they do not. “Unstructured” means “random,” and randomness is not useful to purveyors of content. Even a well-laid-out advertisement has structure. Although the advertisement’s electronic format may not make it easy to discern its structure and pitch, human eyeballs readily do (or the sponsor won’t re-run the ad).

Why the brouhaha about “subtle” versus “un”-structured? Because if you don’t understand the difference, you give up any chance of using whatever structuring mechanisms are available to you, such as styles. Or you shrug and say “hey, it’s unstructured, no wonder I can’t repurpose or transform the document to be or do something else.” And that would be a pity. We invest a lot in our documents and deserve to get the most out of them that we can.
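To make that concrete, here is a minimal Python sketch of what “using styles as structure” might look like. It assumes a document has already been reduced to a list of (style name, text) pairs; the style names, the mapping table, and the `styled_to_xml` helper are all hypothetical, invented for illustration rather than taken from any particular word processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from word-processor style names to XML element names.
STYLE_TO_TAG = {
    "Heading 1": "title",
    "Body Text": "para",
    "Price": "price",
}

def styled_to_xml(paragraphs, root_tag="menu"):
    """Turn (style_name, text) pairs into a small XML tree."""
    root = ET.Element(root_tag)
    for style, text in paragraphs:
        # Paragraphs with an unrecognized style fall back to a generic element.
        tag = STYLE_TO_TAG.get(style, "para")
        ET.SubElement(root, tag).text = text
    return root

# A "subtly structured" restaurant menu: the styles carry the structure.
menu = [
    ("Heading 1", "Lunch Specials"),
    ("Body Text", "Grilled cheese with tomato soup"),
    ("Price", "6.50"),
]

xml_text = ET.tostring(styled_to_xml(menu), encoding="unicode")
print(xml_text)
```

Once the styles are treated as structure, the same menu can be repurposed (a price list, a web page, a database load) without anyone retyping it. Apply no styles at all, and every paragraph collapses into the generic fallback element, which is about as close to “unstructured” as a useful document gets.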

So to summarize, documents are book-like containers of information, regardless of their format. Databases aren’t books (but can be used to produce books). The organization of documents ranges from subtly (or loosely) structured (like restaurant menus or a child’s book) to highly structured (like a form). Moreover, the binary format of a document, most commonly that of a word processor or page layout system, does not easily yield clues to its structure, but structure it has or it wouldn’t be useful. The structure that word processors and page layout programs impose is pretty darned loose, however, and that is one reason why the publishing world (that would be all of us) is ever-so-slowly moving to a more explicit form of structure: XML.

Interestingly, Microsoft may actually be putting its cash hoard where its mouth (support for XML) is. Press teasers coming out of Redmond suggest that the next release of Office will not only replace its default binary file formats with XML-based ones, but may in fact begin supporting arbitrary XML for the content itself (not just the look-and-feel).

So curmudgeon point number 1: No document is unstructured. All documents (and document formats) have some structure, ranging from subtle or loose to very rigid. So please stop using the “unstructured” adjective, OK?