Monday, July 24, 2006

Secret to Content Longevity?

The Wrapper, not the Gum
One goal of FDsys is to preserve content for future access and repurposing. So naturally the questions are: 1) Are you using XML? and 2) If so, what DTD or schema? I was expecting to hear "WordML" (Microsoft Word's XML standard, which is really more an XML expression of its RTF or Rich Text Format); I was also hoping to hear OpenDocument, the rich XML office standard on which OpenOffice is constructed. The answers surprised me, but in retrospect should not have. FDsys's plan is to take the content in whatever format it arrives --preferably in a reasonably small number of common formats-- and to concentrate on the metadata wrapper itself, for accessibility. Here's what Mike Wash said.

"We have developed requirements for the information packages that will
exist in FDsys. FDsys architecture is based on the Open Archival
Information System (OAIS) model which develops the concept of
submission, archival and dissemination packages. The excerpts from the
Requirements Document will help you understand our approach to
structuring submission packages and dissemination packages."

And now the details, obviously too much for my 800-word Information Insider column. By "RD" Wash means the FDsys "Requirements Document."

"Page 31 in the RD 2.0 Document Submission Information Packages (SIP)
This section specifies the packaging details for the Submission
Information Package (SIP), and describes how digital content and its
associated metadata are logically packaged for submission to FDsys.
A SIP contains the target digital object(s) and associated descriptive and
administrative metadata. It will be the vehicle whereby content packages
are submitted to FDsys by Content Originators. The concept of the SIP in
the OAIS (Open Archival Information System) model provides a starting
point for the specification of content and associated metadata, but it does
not specify how it is packaged. It is necessary that a SIP follow prespecified
rules so that FDsys can validate and accept the content for

Associated with the SIP are three types of information:
* Content Information (digital object(s) and Representation Information),
* Packaging Information, and
* Descriptive Information.
Packaging Information is the information that binds or encapsulates the
Content Information. To accomplish this, a SIP will include a binding
metadata file (sip.xml) that relates the digital objects and metadata
together to form a system-compliant SIP. The Metadata Encoding and
Transmission Standard (METS) schema shall be adopted as the encoding
standard for the sip.xml file, and GPO will specify profiles for METS to
drive its implementation for FDsys.

Descriptive Information is the metadata that allows users to discover the
Content Information in the system.

All file components of the SIP will be populated within a structured file
system directory hierarchy and are then aggregated into a single file or
entity for transmission and ingest into the system."
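
To make the wrapper idea concrete, here is a minimal sketch in Python of what a binding sip.xml and its single-file package might look like. This is my own illustration, not GPO's METS profile (which the excerpt says is still to be specified): the function names, the "content/" folder, the Dublin Core title, the SHA-1 checksums, and the choice of ZIP as the "single file" are all assumptions. Only the METS element names and the sip.xml file name come from the METS schema and the excerpt above.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("dc", DC_NS)


def q(ns, tag):
    """Build a namespace-qualified name in ElementTree's {uri}tag form."""
    return "{%s}%s" % (ns, tag)


def build_sip_manifest(title, files):
    """Return METS XML binding the files (dict: package path -> bytes).

    Hypothetical helper: the layout is an assumption, not GPO's profile.
    """
    mets = ET.Element(q(METS_NS, "mets"), {"LABEL": title})

    # Descriptive Information: a single Dublin Core title (assumed).
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title

    # Content Information: one <file> entry per digital object.
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "content"})

    # Packaging Information: the structMap ties files and metadata together.
    struct = ET.SubElement(mets, q(METS_NS, "structMap"))
    div = ET.SubElement(struct, q(METS_NS, "div"), {"DMDID": "DMD1"})

    for number, (path, data) in enumerate(sorted(files.items()), start=1):
        file_id = "FILE%d" % number
        entry = ET.SubElement(group, q(METS_NS, "file"), {
            "ID": file_id,
            "SIZE": str(len(data)),
            "CHECKSUM": hashlib.sha1(data).hexdigest(),
            "CHECKSUMTYPE": "SHA-1",
        })
        ET.SubElement(entry, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): path})
        ET.SubElement(div, q(METS_NS, "fptr"), {"FILEID": file_id})

    return ET.tostring(mets, encoding="unicode")


def package_sip(title, files):
    """Aggregate sip.xml plus the content files into one ZIP (assumed format)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sip.xml", build_sip_manifest(title, files))
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


if __name__ == "__main__":
    demo = {"content/report.pdf": b"%PDF-1.4 demo", "content/notes.txt": b"hi"}
    print(build_sip_manifest("Demo submission", demo))
```

Notice that nothing in the manifest cares whether the content is a PDF, a Word file, or plain text; the wrapper only records where each object is, how big it is, and how it relates to the descriptive metadata.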

Wash elaborates further:
"Page 42 in the RD 2.0 Document Dissemination Information Package (DIP)
Dissemination Information Packages (DIPs) are transient copies of digital
objects, associated content metadata, and business process information
that are delivered from the system to fulfill End User requests and Content
Originator orders. As necessary, DIPs should follow the concept of a DIP
as outlined in the OAIS (Open Archival Information System) model.

The DIP is created as part of delivery processing and digital objects may
be adjusted based on orders and requests to support the delivery of hard
copy output, electronic presentation, and digital media.

The DIP should include all digital objects and/or metadata necessary to
fulfill requests and orders. The DIP may also include a binding metadata
file that relates the digital objects and metadata together to form a
package. The Metadata Encoding and Transmission Standard (METS)
schema has been adopted for the SIP and AIP and may be used as the
encoding standard for the binding metadata file, if a binding metadata file
is created."

Standardized, format neutral, and concentrating on the information about the content rather than the content itself. That is the long view, because when you are dealing with a very large (and unpredictable) number of format types, you have to concentrate on the access and delivery of these things.

More Q&A to follow soon.

Thursday, July 20, 2006

Future Digital Systems - FDSys... Complete Q&A

Cutting Room Floor -- Mike Wash Q&A - GPO, FDSys
My next InfoInsider column describes an initiative at the US Government Printing Office that surprised me by its breadth, vision, and implementation pace. That initiative is called Future Digital System (or FDsys). FDsys began with strategic planning in July 2004 and developed a strategic vision for the 21st Century. This vision lays out a plan to provide printing and electronic delivery services to the three branches of federal government, 1250 Federal Depository libraries (providing protection from disastrous losses), and the general public. FDsys is packaged into six phases, is currently mid-way through phase 4 (implementation planning), and expects a full system implementation in October of 2007.

For the past several months I've posed questions and received responses from Mike Wash, GPO's Chief Technical Officer. Due to the size constraints of my column in eContent Magazine, I could only summarize my questions and Mike's answers. If you've read this far, I assume you'd like more details. Here, in this and succeeding posts, are the details of my interactions with Mike.

Question: Ever the IT guy, I asked "What are your broad systems acquisition strategies:
a. Best of Breed versus integrated systems?

b. Proprietary versus Open Source."
Since we're talking about essentially loosely structured content, by "Proprietary" you can easily infer "Microsoft." By "Open Source" you can equally infer "OpenOffice" or "StarOffice 8."
Answer: Here were Mike's answers.
"FDsys will be focused on meeting customer needs;
therefore, GPO is taking a best of breed approach to acquiring
and integrating the technology components that will comprise
FDsys." and "FDsys is a standards based system."

My comment: Since the federal government is "by the people," --all of us-- I think he did a pretty good job of stating a preference for standards while not specifying exactly which standards he was referring to. OpenDocument, the format OpenOffice uses, was approved as an ISO standard in May.

Monday, June 26, 2006

More on the Enterprise Search Summit

ESS - more thoughts

Expect more vendor consolidation, and there are many instances of it already:

* Autonomy bought Verity (an oil-and-water team, IMHO).
* Oracle bought TripleHop (great product that combined Autonomy’s statistical search with Verity’s keyword approach).

Not only that, but due to the "camel's nose in the tent" phenomenon, if you've already picked a major vendor that you trust for a major collection of services (Microsoft for ASP/email/Visual Source Safe... or Oracle for databases....) you may be tempted to go with that vendor's search solution --for better or for worse. Doing that will lock you in for a long time. Vendors may see "search" as a way to lock you into their other more profitable solutions.

Sunday, June 25, 2006

Enterprise Search Symposium

Well, it's been about a month since I attended this conference in NYC. I wanted to let my first impressions sink in before relaying my conclusions about enterprise search and this conference. After a month, I have to say I am as ambivalent in many of my conclusions as I was in May. Here are some of those conclusions.

First, the conference was surprisingly well attended. I'd estimate there were 800 to 900 attendees, way above the total last year (I'm told). So maybe this finally is the year of Enterprise search. On the other hand, several conference presenters reminded us that IBM developed enterprise search software in the mid 1960s --that's right, about 40 years ago-- and the fundamental capabilities haven't changed a whole lot. Moreover, the market for enterprise search is less than a billion dollars, a relatively small size. So is this the year Enterprise search finally takes off? Or is this a little like Lucy's football?

More thoughts on the way this week.

Sunday, April 02, 2006

MS Office Irritants

Microsoft Office has been with us for a long time. In particular, Microsoft Word's DOS version was available at least in the early 90s. I will give the latest version of Office 2007 (aka "Office 12") a fair and impartial review when there is a stable version to review. In the meantime though, Microsoft, listen up. Although I'm writing this blog, believe me there are many other folks in the "silent majority" of users who feel as I do but just suffer in silence.

I like MS Access and Excel, but there are lots of things I hate about Word and PowerPoint. I hope you'll endeavor to fix these in Office 12 and not simply tart up the old products with a new interface. First, here are some particularly annoying things about Word. Let's start with stability. This product has been around for 15 or so years, and it still is buggy -- as in "Word has experienced a problem" and then the whole thing hangs. Maybe you'll be able to save what you've done, maybe not. But shouldn't this product be bullet proof?

Then there's the infamous "Do you want to merge changes" message when you open a document attached to an email. Everyone's first (second, third) reaction is "huh?" Yes, I understand how to get around that, but I shouldn't have to.

Here's another bogus feature: Whenever you print a Word document --nothing else but print, mind you-- and exit the document, you get the message "Do you want to save the changes?" Again, a big "huh?" First, second, third reactions are "I don't think I made any changes, but I guess I'd better save it anyway." The message comes because you've inadvertently associated a printer with the document. If this warning is such a great idea, then why don't you apply it consistently with PowerPoint and Excel too?

Here's another: Unhelpful HELP. If you want to know how to print to fit a paper width or a certain number of pages, Word's help says "If your work doesn't fit exactly on the number of printed pages you want, you can adjust, or scale, your printed work to fit on more or fewer pages than it would at normal size. You can also specify that you want to print your work on a certain number of pages." Great. And how do we do that? Your HELP system, like Word itself, has grown lazy and bloated.

And don't even get me started with security and "leaking" of personal information inadvertently.

These are just a few examples of simple things Microsoft could have done long ago to improve this product. I've got hundreds more but I'm getting carpal tunnel just from resaving my Word documents after printing them...

Getting more value from Office Documents -- XQuery

Since documents by definition are created by and for the world-wide masses, there is a wide variation in the value of these documents. By value I mean both the quality of what they contain (are they true? accurate? interesting?) and their value as assets that can be transformed or reused. Microsoft Office 2007 will be XML-based; OpenOffice and StarOffice 8 are XML-based. You can argue whose XML is richer and more useful (so far I give that award to OO and SO8 for a variety of reasons that I'll explain later). But it is still hard to do much with that XML unless you use the right tools. I've used Altova's "Spy" for some time as a suite for XML management tasks like schema development and analysis, and I've been generally happy with it. Increasingly, however, styling and transforming the content is becoming the best way to derive value from investments in XML content. That means using XSLT and XQuery, and I increasingly believe that for serious use of those standards, you need a different tool: Stylus Studio. To learn more about this alternative suite, check out Larry Kim's Stylus Studio blog.
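
For a taste of what "deriving value" means in practice, here is a small sketch that pulls the heading outline out of an OpenDocument-style content.xml. It uses Python's standard library as a stand-in for XQuery (the path expression `.//text:h` plays the role an XQuery `//text:h` would), so it is an illustration of the idea rather than actual XQuery. The sample document and the `list_headings` name are made up for the example; the namespaces and the `text:h` / `text:outline-level` names are OpenDocument's.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces used by content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# A tiny, hand-written stand-in for a real content.xml (assumption).
SAMPLE = """\
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">Overview</text:h>
      <text:p>Some body text.</text:p>
      <text:h text:outline-level="2">Details</text:h>
    </office:text>
  </office:body>
</office:document-content>
"""


def list_headings(content_xml):
    """Return (outline level, heading text) pairs in document order."""
    root = ET.fromstring(content_xml)
    level_attr = "{%s}outline-level" % NS["text"]
    headings = []
    for heading in root.iterfind(".//text:h", NS):
        level = int(heading.get(level_attr, "1"))
        text = "".join(heading.itertext()).strip()
        headings.append((level, text))
    return headings


if __name__ == "__main__":
    print(list_headings(SAMPLE))  # [(1, 'Overview'), (2, 'Details')]
```

Because the headings are tagged as headings (with a level), and not just as bold 14-point text, a few lines are enough to rebuild a table of contents from the document.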

Watch this space... StarOffice 8 Review in earnest

Well I've gotten the green light to review StarOffice 8 in eContent magazine, with a very short timeframe. So watch this space for details that will be unavailable in the review, or even too "edgy" for a printed publication. In this blog I can be as candid and opinionated as I wish... after all, I am a curmudgeon.

Tuesday, January 17, 2006

Star Office 8 - Writer - More Findings

More StarOffice 8 Findings

I’m using StarOffice 8 (SO8) as my default word processor now, and two general impressions about StarOffice Writer continue to amaze me:

1) How easy it is to learn how to use, its familiar feel for MS Word users.
2) How it has improved many of the annoying shortcomings of MS Word.

Although I’ve read the reviewer’s guide, I find myself generally not needing the HELP system. It is as though I’m interacting with a twin –not an identical twin, and not the evil twin, but with the better twin.

Little annoying things with MS Word that SO8 has fixed:

1) First and foremost, the table model. You really can join (merge) cells vertically, and it isn’t just a “fake join” (one where the cell boundary has simply been hidden, as in MS Word); you really can vertically center text within a vertically merged cell.
2) Second, when you use the format painter to paint table cell attributes, you paint not only the text font attributes but also the cell attributes (like borders and cell shading). This should be an easy fix for MS Word. We’ll see.

So though I’m still in the honeymoon phase, do I notice any blemishes on SO8’s Writer? Well a couple. Here’s what I’ve found so far:

1) Unexpected hang while initializing the templates for first-time use. I sent a message to SO8 support about this. This has happened only once.
2) And speaking of the format painter, sometimes it doesn’t seem to paint formats. Several times, especially with text copied from a web page into a Writer document, you can try as you will to format paint but it doesn’t seem to work. I’m not sure why. I’ll bet I could edit the XML text contents to fix the problem (as I might have done with WordPerfect reveal codes a decade ago), but haven’t gone to try that yet.
3) There is a lovely little anticipated text completion feature as you type into Writer, and sometimes it cleverly guesses the right word, but I can’t figure out how to accept its suggested completion. I’m sure I’ll figure that out; the feature is too nice not to be documented there.
4) And speaking of XML, I examined the “content.xml” piece of a saved document, and then I opened it in Altova XML Spy to see if the document was valid. It wasn’t (although it was well-formed). I’m not sure why; there are schema references; perhaps they’ve changed. Anyway, SO8 comes so close and I wish there were a way to get that validity check to work.
5) Last and minor to some folks but not to me, I dearly wish that the text analysis functions were a bit beefier. I find the thesaurus to be somewhat better than MS Word’s, but there is no way to run a reading level analysis the way there is in Word. I’m sure that will come; maybe there is even an add-on somewhere that I don’t know about. But I do miss that reading level analyzer.
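
On item 4, the well-formed half of that check is easy to reproduce without XML Spy. Here is a minimal sketch in Python, assuming the saved document is an OpenDocument package (a ZIP archive with a content.xml inside). The function names are mine, and the demo builds a tiny fake package in memory instead of reading a real Writer file. It checks well-formedness only: Python's standard library does not validate against a schema, so validity (the part that failed for me) would still need a schema-aware tool.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def content_is_well_formed(odf_file):
    """Parse content.xml from an OpenDocument package.

    odf_file may be a path or a file-like object.
    Returns (True, root tag) on success, (False, parser message) on failure.
    This tests well-formedness only, not validity against a schema.
    """
    with zipfile.ZipFile(odf_file) as package:
        data = package.read("content.xml")
    try:
        root = ET.fromstring(data)
    except ET.ParseError as error:
        return False, str(error)
    return True, root.tag


def make_demo_package(content_xml):
    """Hypothetical helper: build an in-memory ZIP holding only content.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    good = "<doc><p>Hello</p></doc>"
    bad = "<doc><p>Hello</doc>"  # mismatched tag, so not well-formed
    print(content_is_well_formed(make_demo_package(good)))
    print(content_is_well_formed(make_demo_package(bad)))
```

To try it on a real document, pass the path of a saved Writer file in place of the in-memory package.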

As I said, working with Writer has an uncannily familiar feel, so trying to figure out what to test next is in some ways difficult. It is almost like the experience of saving all your laptop Windows settings before getting the thing re-imaged, then using the updated laptop with its settings. You tend to find little differences and things missing as you work.

What do I plan to check next in Writer? Much as I dislike MS Word’s “fields” capabilities for a variety of reasons, I’ve grown comfortable with the way they work. One feature I especially like (and need) is the ability to display a “last date and time saved” string in a document footer. I use this even in document management systems, since you never know which version the printed copy you’re looking at is.

Tuesday, January 10, 2006

StarOffice 8 - Installation and first impressions begin

Installation (and registration) went smooth as silk. Unlike typical MS applications, StarOffice8 offered (but didn't default) to be my choice for opening MS Word, PowerPoint, and Excel.

Jumped right into the Writer, created a table, and tried right away to do a couple of things that always bugged me in MS Word:
1) Using the style painter to copy cell background from one cell to another... worked like a charm.
2) Joined cells horizontally and vertically... worked fine.

#2 is something that WordPerfect for DOS in the early 90s could do, but MS Word hasn't mastered it yet (fakes the vertical join, but doesn't really do it).

Bonus surprise: I could save the table in native Writer or many other formats... including DocBook. Fantastic!

How to Eat the Review Elephant?

Well the StarOffice8 CD arrived yesterday, along with lots of reviewer hints and overviews. How to go about assessing this office suite? What criteria can I also apply to Office 12? Here are SOME of the review dimensions that I will be considering; can you suggest anything to add?

Overall Package Considerations


Licensing options, including List price/ street price, support options, ease of installation, disk space consumed, memory required. Overall value.

Ease of Use
Intuitiveness of screens, HELP, click efficiency, importing/exporting other formats?

Accuracy, robustness (e.g., styles included?), complexity to perform. Interoperability with MS Office. Evaluate:

Automates the analysis of documents to identify potential migration risks?
Calculates the cost of migration
Compare migration options in different Office Package editions (Migration Partner, Enterprise Edition in StarOffice 8)
How well does it migrate Macros?


Ease of use, ability to uninstall.


How long does it take to perform simple and complex procedures (such as updating a TOC or index, inserting a graphic).

Data Management

OLE? Live data from databases? DB queries as source (which ones)? XML Sources?

XML Support
Schema and DTD?
Format versus meaning?

Tags for formatting only?
Support for external XML models?

Forms Support?
Which forms schema? XForms.

Extensibility? Allow use of "alien attributes"?


How robust is the office package? What does it contain? Consider these modules:

Word Processing (emphasized here)
Drawing packages (vector/raster)
Others – Chart, Math, integrated tools

Word Processing Capabilities

Supported in presentation tool? Robustness in WP?

Robustness of table model

Technical document Support

Desktop Publishing (layout intensive) Support


Tables Of Contents, Indexes, Hyperlinks, Cross References

Writer Support

Spell Check, Thesaurus, Word Count, Reading Level

PDF Support
Which Acrobat version compatibility?