Monday, July 09, 2007

Office Suites and XML - Vendor feedback

In my latest Info Insider column, I mentioned contacting two vendors to get their take on the impact of the two major office suites, OpenOffice/StarOffice 8 (ODF) and Microsoft Office 2007 (OOXML), using XML internally. The vendors I contacted were Altova and MarkLogic. Here are the questions I asked them, followed by their responses.

Now that OpenOffice and Office 2007 both use XML natively, what new opportunities are there for analyzing or transforming Office documents?

Do you have any examples of customers using your products (or those of your technology partners) to analyze or transform OpenOffice/StarOffice or MS Office 2007 documents, leveraging their use of XML?

In essence, both vendors seem poised to provide ways for customers to extract extra value from their document repositories, although the current state is a “chicken and egg” problem. For now, there are no repositories of XML office documents, so there is no rush to buy new products to extract this value. However, sooner or later the enterprise chickens will be forced to lay the XML eggs (see below).


Following are the responses from MarkLogic, specifically John Kreisa, Director of Product Marketing for MarkLogic. Regarding opportunities for analyzing or transforming Office documents (whether ODF or OOXML), John says:

"Microsoft’s choice of XML as a core form for Office 2007 means that everybody using Office will be authoring directly in XML – Office becomes a direct means for creating XML content. We believe there is a significant opportunity for customers to leverage the ever-increasing amount of XML content by combining Office 2007 with an XML content server, like MarkLogic. Doing so will allow users to exploit the XML within the content in two ways. First they can combine all their content into one common repository, which is the first step to getting more value from the content. Then second, they can build content applications to repurpose the content, dynamically publish the content in new ways, and perform analytic functions they haven’t been able to do before.

Loading all of their content into a content server lets organizations analyze their entire content in new ways including understanding the term frequency, word counts, page counts etc, and understand the relationships within the content like citation analysis between articles and many other areas of analysis. What we typically see is that once organizations take a platform approach to their content they immediately find new ways to exploit it and generate new business opportunities."

Of course, this raises the question: “When will there be enough XML content to put into a repository?” Adoption rates are currently low, although as users upgrade to OOXML or switch to ODF, they will generate documents for this repository. And in the case of OOXML, users who decide to stick with Microsoft will have no choice but to upgrade, since sooner or later Microsoft will stop releasing free security patches for its earlier Office products.

Kreisa confirmed the problem of the current adoption rate in his response to my second question, which asked for examples of customers using MarkLogic products (or those of its technology partners) to analyze or transform OpenOffice/StarOffice or MS Office 2007 documents, leveraging their use of XML:

"While Mark Logic does not currently have any customers using MarkLogic Server with MS Office 2007, we do anticipate that as adoption of Office 2007 increases, our customers will leverage the XML content they create with Office 2007 by combining it with MarkLogic to create new content, repurpose existing content into multiple formats, and republish this content, and to mine the content to find previously undiscovered information.

Our senior VP of products demonstrated our Office 2007 related capabilities in a general session at our User Conference in May, and the audience were very impressed – lots of nodding and clapping. When people see what we can do it generates interest in upgrading to Office 2007.

We have not heard much from our customer base regarding OpenOffice. However Mark Logic’s fundamental value proposition remains the same. We can load, query, manipulate and render the XML from StarOffice in the same manner we do for Microsoft Office 2007.

In response to your question about how presentational XML facilitates text analytics in Microsoft Office, it really depends on the goal of the user. Highly marked up XML can complicate or confuse tools that are not capable of handling this kind of deep XML. MarkLogic Server, on the other hand, can easily handle this kind of content and separate the markup from the text. For example, if a user wants to know how many places a certain word is in bold or how many words are tagged as <title1> style, we can help with that kind of analysis. We see this as potentially relevant for technical documentations organizations, for example, who want to make sure that they have consistency across their different documents."
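As a rough illustration of the "how many places is a certain word in bold" question Kreisa describes, here is a minimal Python sketch that uses only the standard library. It is not MarkLogic's method, just a way to see what the presentational markup allows. The helper names `count_bold_occurrences` and `make_sample_docx` and the sample text are hypothetical, and the sketch assumes the main document part lives at the conventional path `word/document.xml`. It only sees bold applied directly to a run, not bold inherited from a style, and it matches whole whitespace-separated words.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Office Open XML word-processing documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def count_bold_occurrences(docx_bytes: bytes, word: str) -> int:
    """Count how often `word` appears inside directly bolded runs of a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Assumption: the main document part is at the conventional path.
        root = ET.fromstring(package.read("word/document.xml"))

    total = 0
    for run in root.iter(W + "r"):
        props = run.find(W + "rPr")
        bold = props.find(W + "b") if props is not None else None
        # <w:b/> means bold unless its w:val attribute switches it off.
        if bold is None or bold.get(W + "val") in ("0", "false", "off"):
            continue
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        total += text.lower().split().count(word.lower())
    return total


def make_sample_docx() -> bytes:
    """Build a tiny stand-in package holding only word/document.xml (made-up content)."""
    document = (
        '<w:document xmlns:w="' + W_NS + '"><w:body><w:p>'
        "<w:r><w:rPr><w:b/></w:rPr><w:t>open formats matter</w:t></w:r>"
        "<w:r><w:t> and open standards too</w:t></w:r>"
        '<w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>open</w:t></w:r>'
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    print(count_bold_occurrences(make_sample_docx(), "open"))
```

The sample package is not a complete Word file (it lacks the content-types and relationship parts), but it is enough to show that the formatting question reduces to a walk over `w:r` elements.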


Altova is the vendor that created the well-known XMLSpy product line, which provides many ways to create, analyze, and manipulate XML on desktop PCs. Here are the responses to the same questions from Alexander Falk, President, CEO and Co-Founder of Altova.

"Organizations save vast amounts of information in Microsoft Word documents and Microsoft Excel spreadsheets, but until now, that content could not be re-used in an extensible, programmatic way. With the Open XML document formats, that data is now standards-based; and the new capabilities in Altova XMLSpy allow developers to extract, edit, query, and transform XML data from within documents that use Office Open XML Formats - the new file type used by the 2007 Microsoft Office release - to make the data highly interoperable and easy to process. This provides huge advantages to business people and application developers.

Because XML Spy's support for Office Open XML was released only a few weeks ago, it's too early to provide feedback."

I followed up to ask about the issue of XML quality in the two office suites, and whether one offers greater potential for leveraging the new XML internals. Office 2007's XML is almost exclusively presentational, while OpenOffice goes beyond that with support for additional standards: Scalable Vector Graphics (SVG), MathML, and XForms.

"Yes, that is an old argument. In an ideal world, the content authors would be motivated to create content with semantically meaningful tagging, e.g. Docbook or DITA. But the reality is that in today’s world most content is created in Office documents, so it is better to be able to extract and process that content with Office Open XML, than to continue to wait until all content creators use semantically meaningful tags. Furthermore, the Office 2007 Open Office XML formats are not just for Word documents. Extracting data from the millions of Excel spreadsheets that get created and processing it further in XML opens the door to a huge opportunity for information reuse and repurposing."
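To make the spreadsheet point concrete, here is a small Python sketch of pulling cell values out of the XML inside an .xlsx package, again with only the standard library. This is not Altova's tooling. The function names `read_cells` and `make_sample_xlsx` and the sample data are made up for illustration, and the sketch assumes the conventional part names `xl/worksheets/sheet1.xml` and `xl/sharedStrings.xml` rather than resolving them through the package relationships. Numeric values come back as the raw strings stored in the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by Office Open XML workbooks.
S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
S = "{" + S_NS + "}"


def read_cells(xlsx_bytes: bytes, sheet_part: str = "xl/worksheets/sheet1.xml") -> dict:
    """Return {cell reference: value as text} for one worksheet part."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        sheet = ET.fromstring(package.read(sheet_part))
        shared = []
        # Text cells usually store an index into the shared string table.
        if "xl/sharedStrings.xml" in package.namelist():
            table = ET.fromstring(package.read("xl/sharedStrings.xml"))
            for item in table.iter(S + "si"):
                shared.append("".join(t.text or "" for t in item.iter(S + "t")))

    cells = {}
    for cell in sheet.iter(S + "c"):
        value = cell.find(S + "v")
        if value is None or value.text is None:
            continue
        if cell.get("t") == "s":  # shared-string cell: look the index up
            cells[cell.get("r")] = shared[int(value.text)]
        else:  # number or other inline value: keep the stored text
            cells[cell.get("r")] = value.text
    return cells


def make_sample_xlsx() -> bytes:
    """Build a minimal stand-in package with one sheet and a shared string table."""
    strings = (
        '<sst xmlns="' + S_NS + '">'
        "<si><t>Region</t></si><si><t>Sales</t></si><si><t>East</t></si>"
        "</sst>"
    )
    sheet = (
        '<worksheet xmlns="' + S_NS + '"><sheetData>'
        '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
        '<row r="2"><c r="A2" t="s"><v>2</v></c><c r="B2"><v>1200</v></c></row>'
        "</sheetData></worksheet>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("xl/sharedStrings.xml", strings)
        package.writestr("xl/worksheets/sheet1.xml", sheet)
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_cells(make_sample_xlsx()))
```

Once the cells are in a plain dictionary like this, they can be fed into whatever reuse or repurposing step comes next.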

So there you have it. OOXML will likely have the largest installed base. In fact, the Massachusetts Information Technology Division (ITD), the agency that essentially stuck its finger in Microsoft’s eye, has released a new draft of its Enterprise Technical Reference Model. This draft now includes OOXML as an acceptable open format. The discussion period will end on 20 July 2007, but I’m betting the draft will be approved. For an expert insight into the issues with the Massachusetts ITD, go to:

And there are still persuasive arguments that OOXML is fundamentally inferior to ODF. How that plays out over the next several years would be abstractly fascinating to watch, if only the future of our office document content weren’t so important. I’ve got my opinions on the XML quality issue, expressed in my Information Insider columns at EContent Magazine for some time. Here is O’Reilly’s take on the issue.

It is right for both of the above vendors to profess no preference for one format over the other, since both suites use XML and their products can and will work with each. Still, quality and openness matter. We’ll see how this plays out.