Monday, January 18, 2010

Justifying eDiscovery Systems

As I said in my Information Insider October 2009 column, "The landmark 2006 Federal Rules of Civil Procedures Rule 26 and its updates make all electronic stored information (ESI) subject to legal discovery, and ESI continues its unbridled growth." Given the nation's increasing litigiousness, and the exploding amount of electronic information everywhere that could be subject to 2006 FRCP Rule 26, I am surprised how little we've heard about such litigation. Is it simply that our attention is elsewhere (whether the US Health Care debate, 2 wars, Global Warming –or is it Global Cooling?, the earthquake in Haiti…)? Or is eDiscovery yet another ticking time bomb that will burst onto the news when we least expect it? Well, the vendors supplying eDiscovery solutions have plenty to say about that.

And what is special about eDiscovery? Why not just buy the very best search system available, and use it to do all the "e-lectronic" discovery that you want? After all, isn't it all about "search"? I spoke with Ursula Talley, VP Marketing of StoredIQ, to gather expert opinions on this subject. Here are excerpts from her comments about this, which I find pretty illuminating.

First, "Enterprise Search and eDiscovery Search technology do share a set of core capabilities, specifically crawling, indexing and searching data across a multitude of various applications and storage systems. Enterprise Search is designed to assist knowledge workers with information access and retrieval. The end result is that a user can find some files with information that can help that user complete a task." So what's the difference? Ursula went on to say "eDiscovery Search is designed to support a workflow that can be legally defended in court. The end result is a set of data files that is preserved (saved to a new, target location without any changes to the metadata and recording every system and location [where] each data file was originally located)." This kind of quarantining of content goes over and above what you can do with any enterprise search system. Moreover, she says that search performed by eDiscovery systems must also be very robust. Such eDiscovery searching can require queries with between 25 and 300 search terms. Moreover (for those of you who have ever posed a complex query on an enterprise search system, and then gone for a cup of coffee while you waited for the result to return) eDiscovery search must be able to copy large volumes of content that has been found, "if necessary hundreds to thousands of gigabytes, without disrupting user productivity."

While it's at it, robust eDiscovery systems such as those from StoredIQ can provide de-duplication of email and user files (saving space and attorney time poring over the same redundant files), while keeping a record of every location where those items originally resided – in case the judge asks. Lastly, searching just email systems can be a real pain, since they are so big and are threaded. Even the best search often is like sorting through low-grade ore, tons of it. eDiscovery systems also can extract both metadata and content from email and export this into a database format that can be queried and reused in legal document review applications.
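To make the de-duplication idea concrete, here is a minimal Python sketch. It is an illustration only, not how StoredIQ or any other product works: the function names, the use of SHA-256 content hashes to detect duplicates, and the target file naming are all assumptions of mine. It copies one instance of each distinct file to a target folder and keeps a record of every original location.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_deduplicated(source_root: Path, target_root: Path) -> Dict[str, List[str]]:
    """Copy one instance of each distinct file (hypothetical helper).

    Returns a mapping of content digest -> every original path, so the
    record of where each item lived survives the de-duplication.
    """
    target_root.mkdir(parents=True, exist_ok=True)
    locations: Dict[str, List[str]] = {}
    # Build the file list up front so newly copied files are never re-scanned.
    for path in sorted(p for p in source_root.rglob("*") if p.is_file()):
        digest = sha256_of(path)
        if digest not in locations:
            # copy2 keeps modification times; it does not preserve every
            # kind of metadata (ownership or ACLs, for example).
            shutil.copy2(path, target_root / f"{digest}{path.suffix}")
        locations.setdefault(digest, []).append(str(path))
    return locations
```

A real preservation workflow would also need an audit trail and fuller metadata handling; the point here is only that the location record is kept separately from the single preserved copy.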

So how do you go about justifying the purchase of an eDiscovery system? Not by claiming you can add features to an existing or new Enterprise Search system. Instead, focus on the other features that you'll need if a lawsuit comes a-calling. Unfortunately, getting your eDiscovery house in order may be like getting your electronic records management house in order – really hard to justify until after the lawsuit. Still, at least you can avoid the trap of thinking that Enterprise Search can do all you need to find and quarantine your information for a credible eDiscovery defense.

Tuesday, October 06, 2009

And Now For Something Completely Different

... actually 8 things. What is different is that I normally use this blog for details that I couldn’t squeeze into my eContent Magazine column, Info Insider. The eight things I’m referring to are in AIIM’s recent (free) e-book describing the eight reasons you need a strategy for managing information.



John Mancini has a knack for writing simply, and this e-book (free for the downloading here) is well done. Although it is 95 pages long, don’t be put off by that; the pages are small ;-). Not only that, but the content, distilled from various “8 things” blogs, provides truly useful perspectives on Information Management. Here’s one gem from the section “Tidal Wave of Information.”

“A study by IDC a few years back concluded that there are currently 281 billion exabytes of information in the Digital Universe. So how much is this? Well…an exabyte is a million million megabytes. Thanks a lot. To put it in a bit of perspective, a small novel contains about a megabyte of information. So in other words, the Digital Universe is equal to 12 stacks of novels (fewer if the chosen novel is a big fat one like Harry Potter 6 or one of those Ken Follett Pillars of the Earth deals) stretching from the earth to the sun. So it's a big number, whatever it is.”
Go ahead, download a copy and enjoy the read.

Tuesday, May 05, 2009

It's TAXonomy Time


TAXonomy Time – Why the interest in Taxonomies?

I'm hearing the word “taxonomy” more and more often in ECM projects, often uttered by business people in the same sentence as “metadata.” Can it be that business people are becoming comfortable with these terms? If you know you've got a serious information overload problem, where do you start with taxonomies to tame and organize your content? Everybody starts with Excel for metadata and Visio or similar graphical tools to sketch out taxonomies. Those tools are available, sometimes free, and well understood. But they are fundamentally static. Do you need more? What are some best practices and alternatives?

As part of my latest column “It's TAXonomy Time” in EContent Magazine, I spoke with Carol Hert, PhD, Chief Taxonomist and Consultant for SchemaLogic Inc., to get her take on trends in taxonomy projects. Here are my questions and Hert's responses.

1) What is the state of client awareness of the value and urgency of developing taxonomies? What is the trend – use the Gartner “hype cycle” stages if you’d like. Do you see increasing interest in taxonomies, and –if so—why? Is the “information explosion” itself motivating this interest?

We typically work with large corporations that have already developed and deployed multiple taxonomies across their organizations. These companies are well aware of the cost and limitations of trying to manage these taxonomies in a dynamic environment that includes many consuming systems. Some of the organizations we work with are focused on taxonomy harmonization: integrating single-use taxonomies into one or several related taxonomies that can be utilized enterprise-wide.

We continue to see increased interest in taxonomies with the further proliferation of SharePoint and other collaboration systems, the need to increase the efficiency of the information worker, and the continued interest in enterprise information findability. Also the need to meet compliance requirements for large amounts of unstructured information continues to increase the need to govern and manage information more effectively.

2) What are typical approaches to taxonomy development:

a. Use an existing taxonomy only

b. Build on existing taxonomies

c. Enterprise versus single-application (tactical) approach

d. Use tools not available from current application vendors (e.g., EMC Documentum) for possible use with multiple vendors, or vendor-specific tools?

Our customers usually have multiple taxonomies deployed across their organizations. They have issues with managing and coordinating multiple taxonomies, especially in a dynamic environment. The first thing we do is to collect these multiple taxonomies and model them in our metadata management platform. We can then work with the customer to connect and optimize these taxonomies and then extend them as well. Some of our customers approach this from an enterprise wide perspective, while others choose to focus on a single department, function or business process and then expand.

Because complexity increases as the number of business stakeholders expands, most organizations are working to achieve a balance between the optimal goal of enterprise-wide taxonomies and single-application taxonomies. All our customers use SchemaLogic’s metadata management platforms to build and manage their taxonomies. Our systems are designed to allow customers to model enterprise-wide taxonomies and publish those taxonomies to multiple applications such as SharePoint and Documentum as well as to search engines such as FAST or auto-classification systems such as Teragram.

3) What trends do you see in the evolution of taxonomy development? In supporting technologies (such as SOA or SaaS)?

There continues to be a need to manage taxonomies in a more dynamic way. The need to collaborate across the enterprise, locate and share information, and improve information governance at the same time is putting pressure on organizations to develop a more flexible approach to managing information. The distributed nature of SOA and SaaS architectures puts further pressure on companies to establish an enterprise-wide taxonomy that can be accessed by multiple applications.

4) What are best practices for developing taxonomies? What are some approaches to avoid?

Books could be (and have been) written on this topic, but a short list of Best Practices might include:

  • Understand the ultimate uses to which the taxonomies will be put (there is no one perfect taxonomy).
  • Incorporate business and technical stakeholders in the development process to assure that the final product will meet requirements.
  • Conduct a “taxonomy audit” prior to developing any new taxonomies to understand what already exists and might be leveraged.
  • Consider taxonomy maintenance and governance during development processes to assure that the taxonomy is able to be maintained and there are clear lines of responsibility.
  • Look for externally available taxonomies but be cautious as they have not been designed for the particular goals of the organization in question. Participate in industry-wide organizations where taxonomy development efforts might be occurring.

5) Are there any emerging or existing standards other than ISO 2788 for developing or expressing taxonomies? Is ISO 2788 relevant (I gather it is oriented towards human indexers) and who tends to use it?

ISO 2788 is relevant in terms of providing extensive guidance into term forms, and other such matters. Since most organizations work in networked environments and want to transfer taxonomic information electronically, most will need to explore approaches to structuring taxonomic data for electronic transmission. Some of the standards to be aware of are RDF, OWL, Topic Maps, and SKOS. Additionally, since taxonomies might reside in metadata repositories, standards such as ISO 11179 may be relevant.

6) What are some common exports from taxonomy tools (e.g., Excel)? Are there any common formats for importing existing taxonomies or developing them in taxonomy tools? For example, are there XML DTDs or Schemas?

CSV is a good common base line as some organizations still manage a number of their taxonomies in Excel. Some taxonomy management vendors have XML formats (such as we do) but these may be proprietary and need some translation into an XML format another application could use. Standards such as RDF, OWL, and Topic Maps might be used in this context as well.

7) Can you provide client case studies?

Yes. We have published several customer case studies and would be happy to work with you on additional case studies in the future.
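As a small illustration of the CSV baseline mentioned in the answer to question 6, here is a Python sketch that reads a taxonomy kept as a two-column spreadsheet export and prints it as an indented hierarchy. The column names (`term`, `broader`) and the sample terms are my own assumptions for the example; they are not a SchemaLogic format or any standard layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: each row names a term and its broader (parent) term.
SAMPLE_CSV = """term,broader
Documents,
Contracts,Documents
Invoices,Documents
Leases,Contracts
"""


def load_children(csv_text):
    """Map each parent term to its narrower terms; '' holds top-level terms."""
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        children[row["broader"].strip()].append(row["term"].strip())
    return children


def outline(children, parent="", depth=0):
    """Return the hierarchy as indented lines, depth-first."""
    lines = []
    for term in children.get(parent, []):
        lines.append("  " * depth + term)
        lines.extend(outline(children, term, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(load_children(SAMPLE_CSV))))
```

A flat parent/child table like this is easy to keep in Excel, but it captures only the hierarchy. Richer relationships between terms (related terms, synonyms) are where formats such as SKOS come in.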

Now About Tools

1) What are typical costs for acquiring and implementing taxonomy products?

The costs of taxonomy products vary greatly based on the particular application. Simple taxonomy modeling tools can cost less than $1000, while enterprise-wide taxonomy management and governance systems can cost over $500,000. These larger systems provide highly scalable modeling capability, complete change management and governance, integration to full suites of enterprise applications and metadata compliance monitoring. We have deployed systems that range in price from less than $50,000 to over $1M.

2) What are three key features in taxonomy tools; what are three unique features in yours?

Three key features:

1. Support for a variety of relationships between terms (should at least be able to support the term relationship types specified by ISO 2788).

2. Allow unlimited hierarchical structures.

3. Provide import and export features.

Three unique features in ours:

1. Extensive change management component that enables changes in taxonomies to be automatically subjected to governance.

2. Set of productized connectors that automatically can provide updated taxonomy information to consuming applications. In addition, the ability to create custom connectors.

3. Ability for end-user administrators of the interface to create custom properties on terms and taxonomies.

3) How would you assess the current state of the art for automatic classification features?

Auto-classification systems continue to improve, but still lack the precision and accuracy provided by a managed taxonomy. Taxonomies have been found to be useful frameworks upon which an auto-classification system can be developed rather than have the auto-classification tool start from scratch. A combination of taxonomy management to provide structure and manage term relationships combined with auto-classification methods has proven to be the most effective solution.

4) Do you provide “connectors” to work with enterprise content management systems such as EMC Documentum and Microsoft SharePoint?

We provide connectors that allow our customers to publish taxonomies out to subscribing systems such as Documentum and SharePoint. We also publish taxonomic metadata to search engines, auto-classification systems, portals, and other enterprise applications.

---
So there you have it from an expert. And if you happen to use -- or be interested in using -- Documentum or SharePoint (or both), here's a way to move beyond graphical tools and spreadsheets to manage and leverage your taxonomies.

Sunday, March 08, 2009

CMIS - EMC's role and vision for the future

First off, what on earth does CMIS stand for and why should any content management person care? Here's the easy part, what it stands for: "Content Management Interoperability Services." What it promises is a way for customers (and vendors, and others) to begin sharing content usefully between different vendors' repositories. That is a huge thing, since right now most companies have several, maybe hundreds, of different document repositories under their enterprise roof (and maybe they don't even know how many).

To write my column on this subject ("Building Content Bridges") I interviewed EMC and Day Software. The former is one of the original writers of the specification; the latter is a vendor that is keenly supportive of content management standards. The following notes are taken from my EMC interview.

On the 23rd of October, 2008, I spoke with two representatives from EMC about the emerging standard CMIS: Patricia Anderson, Sr. Marketing Manager, Documentum Platform Marketing, Content Management & Archiving and Dr. David Choy, Sr. Consultant. "CC" below refers to my comment on statements in the interview -- "Content Curmudgeon."

I was curious about the timeline for CMIS to be implemented (assuming it succeeds), and why CMIS is important either to EMC or to the content management space in general. Following are my notes from that interview.

Dr. Choy: Nobody knows how long the process will take, but about a year or more for a full-fledged standard. There were eight companies participating in validating the current version of the CMIS spec for interoperability (IBM, EMC, Microsoft and five others). The eight proved that the spec could be used to assure interoperability. After that the team sent the proposed standard to OASIS. The formal process for discussing the standard takes time, but in the meantime at EMC we intend to make the prototype available for the public to play with.


Security has administrative issues (mechanisms proprietary to each vendor) and also in the runtime space; security policies reign. CMIS security and access control is out of scope at this point. Each vendor has its own security model. In the near term, that is outside the scope of CMIS. Security policy is now reduced to the lowest common denominator (CRUD), but every vendor supports those.

---

CC: By CRUD, Dr. Choy means the basic four operations, Create, Read, Update or Delete. Every content management system provides at minimum those same operations. How they determine who can do those things is a separate issue, and CMIS assumes each system manages its own security in its own way. If the administrator of a CMIS-compliant system gives you one of these rights, then from your own CMIS-compliant system you can access and perform operations on content in that system.
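To picture that division of labor, here is a minimal Python sketch. It is emphatically not the CMIS API: the `Repository` class, its `grants` table and the method names are hypothetical, invented only to show four common operations sitting on top of a permission check that each repository handles in its own way.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Repository:
    """Hypothetical repository exposing only the four CRUD operations."""
    name: str
    # Which operations each user has been granted by this repository's
    # administrator. How grants are decided is the vendor's own business.
    grants: Dict[str, Set[str]] = field(default_factory=dict)
    documents: Dict[str, str] = field(default_factory=dict)

    def _check(self, user: str, operation: str) -> None:
        if operation not in self.grants.get(user, set()):
            raise PermissionError(f"{user} may not {operation} in {self.name}")

    def create(self, user: str, doc_id: str, content: str) -> None:
        self._check(user, "create")
        self.documents[doc_id] = content

    def read(self, user: str, doc_id: str) -> str:
        self._check(user, "read")
        return self.documents[doc_id]

    def update(self, user: str, doc_id: str, content: str) -> None:
        self._check(user, "update")
        self.documents[doc_id] = content

    def delete(self, user: str, doc_id: str) -> None:
        self._check(user, "delete")
        del self.documents[doc_id]
```

A caller sees the same four operations on every repository; whether a given call succeeds depends entirely on what that repository's administrator has granted.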

---


Patricia: One of the questions is “What caused the need for this standard in the first place?” Organizations would set up more than one repository platform, perhaps by department or as the result of M&As. We realized that it was difficult getting to this other information. This also hampered development that was cross-divisional or cross-platform. Then with Web 2.0 mashups, it became even more difficult to leverage use of information. ECM folks realized that it was a hindrance that affected all vendors. We looked at different standards but wanted a standard that was platform-agnostic and services-based, to unlock information in different repositories. Serious discussions began in October 2006. Other committees like iECM tried to develop such standards, but they needed to start fresh.


David: iECM is an AIIM consortium that tried to create something similar to CMIS. That group wasn’t set up for highly technical interoperability standards. Very few concrete results emerged. iECM is still looking at best practices and standards, not technical areas.

---

CC: Clearly you need both and without either there is no bridge between the repositories.

---

Patricia: For users, CMIS can expand the available applications and open the market for developers to write cross-repository applications. It is an open protocol and supports all repositories that support the standard. This provides customers lots of investment protection.


David: Enterprise Content Integrated Services is an example of an application that can facilitate cross repository work. Federated search, mashups, business process workflows across repositories.


Patricia: This is the first and only web services standard. An insurance company could have separate subsidiaries across the world, and writing to a standard would enable access and update to the repository information. A distributed environment such as a franchise would also facilitate sharing of information outside each organization.


Patricia: The 3 originals were the first tier; then we included others such as Alfresco (participated), Oracle, SAP, OpenText; now Day Software. This standard is comparable to what SQL did for databases years ago.


David: The importance is how widely a standard is adopted. The spec is publicly available. Interested parties (after technical committee is formed) can send comments to the technical committee. They’d need to join the technical committee. Enterprise customers (the first group) can benefit from CMIS and need to tap into different repositories. The second group is between repositories and vendors, allowing them to access each other’s content. The third group interested in CMIS is Independent Software Vendors.


Patricia: Another way customers benefit is from having a broad suite of applications for their vertical markets, since a developer could develop for all.


David: Road maps for CMIS are difficult because CMIS is not a full-fledged standard yet. My rough guesstimate would be about a year, after the standard is released. We do intend to make prototypes available for the public before then, and those would be built on Doc Foundation Services. So those interfaces are close.


Patricia: This proposed specification is already 2 years in development and vendors have done interoperability testing. We didn't just send paper to OASIS; we sent working prototypes. “What should I do today?” When you are evaluating the specification, when you go to your next purchase or RFI, ask if vendors support that standard.




Saturday, January 10, 2009

Enterprise Search Summit Program

Do any of you feel like you can't keep up with the latest trends in search, or you just feel like you could wring more value out of your investment but aren't sure how? Or maybe you don't get the connection between Web 2.0 and Search? Whether you are responsible for your Intranet, your commercial site, or the various repositories inside your firewall, I heartily recommend the annual Enterprise Search Summit to be held this May in NYC.

I've attended this in the past, as a paid attendee (my "day job" employer considered it that worthwhile!), not gratis as a columnist for eContent magazine which is part of the Information Today Inc. portfolio. Michelle is the editor for eContent and designs/runs the Search Summit. I like this conference a lot. To learn more, click here.

Sunday, September 07, 2008

XML 10th Anniversary

In an upcoming Information Insider column, I invite XML to an intimate party where we can celebrate its 10th anniversary. I also invited Alexander Falk, CEO of Altova, and an XML aficionado if ever there were one (here's his blog: http://www.xmlaficionado.com/). Here are some of the questions I asked Alexander as background for the column. I hope you'll find this interview interesting. After all, celebrating a "double digits" anniversary doesn't happen often. Alexander's responses to my questions are shown in blue text.

Question: The XML Recommendation is now 10 years old. XML led to hundreds of additional specifications, yet its adoption rate in publishing and word processing software (and XHTML in web pages) seems slow. What is your assessment of XML adoption, and what do you see for the next 10 years?

Ten years is a mighty long time to make forecasts for – my crystal ball is only rated for 2-3 years max…
What we’ve seen with XML over the last 10 years is a huge adoption in all areas that are data-centric, rather than content-centric. XML has become the lingua franca of data exchange and interchange and has made a whole class of enterprise applications possible, because you can now move data fairly freely between disparate systems.

The benefits of XML in a pure content-creation scenario – be it publishing, word processing, Web design – are only realizable if you have a large amount of content and use it with some content management system. That is not something that most small- or medium-size businesses would do, and that has, I believe, led to a somewhat slower rate of adoption in those areas.

Question: OOXML is essentially “rich text format” expressed as XML rather than leveraging existing XML standards such as MathML. MS Office is expensive; OpenOffice (based on ODF that leverages other XML standards) is free. MS Office maintains office share. What gives?

This is an interesting conundrum. From a purely academic perspective I would agree with your statement that leveraging existing XML standards is desirable. But the reality is that 95% of the world’s office documents are MS Office documents today, and people want to continue working with those documents – and want to reuse the content that exists in those documents in other applications, and by opening the file format up and having them be XML-based rather than binary format, such reuse is now possible. I can tell you from our experience that we have received countless requests from our customers that they want to be able to work with OOXML documents, and not a single request for ODF. Also, when I look at e-mail that I receive from others, I have yet to encounter a single e-mail that came with an ODF attachment. I don’t necessarily like Microsoft’s near-monopoly on the office market, but to deny its existence and standardize on a file format like ODF that nobody actually uses in the real world doesn’t make much sense either.

Here we disagree a bit; my question to Alexander followed by his response.

Question: OOXML (which today looks like it will become an ISO Standard) is still essentially just an XML expression of Microsoft’s internal word processing format, “Rich text format.” What value does such a use of XML provide to potential applications?

Actually, I need to disagree on that one. OOXML is not just RTF in disguise. OOXML includes separate and distinct markup languages for expressing word processing documents, spreadsheets, and presentations. The wordprocessingML is somewhat related to RTF because it is based on a similar concept (runs of characters with styles applied to them), but that is where the similarity ends. We found that it is very easy to use XSLT (or XQuery) to extract content from either wordprocessingML or spreadsheetML documents in OOXML that were created in Office 2007 (or other OOXML compatible apps), and likewise it is very easy for us to generate OOXML content in both of those formats from our applications. For example, our data mapping tool MapForce makes it very easy for people to map data from a variety of data sources (including EDI, databases, Web services, XML, etc.) into spreadsheetML documents that they can then open with Excel 2007. Likewise, our stylesheet design tool StyleVision, makes it very easy for people to produce stylesheets that render reports from XML or database data not just in HTML or PDF, but now also in wordprocessingML for use in Word 2007.

Still, what is new in OOXML that didn't exist in earlier editions as Rich Text Format? And if 2007 simply uses XML as a replacement for RTF, I don't see the added value. Sure, you can search for table captions (if you want), but the richness of ODF is not there and won't be (can't be, due to compatibility with earlier versions).
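For readers curious what extracting content from wordprocessingML looks like in practice, here is a minimal sketch. Alexander talks about XSLT and XQuery; I have swapped in Python's standard library purely for brevity. The sample document is a hand-written fragment of my own, not output from Word 2007, and the function name is hypothetical. In a real .docx file this markup lives in word/document.xml inside the zip package.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML main namespace used by OOXML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample: paragraphs (w:p) hold runs (w:r), and each run
# holds optional formatting (w:rPr) plus its text (w:t).
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t>Quarterly </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>results</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(xml_text):
    """Return the plain text of each paragraph, joining its runs in order."""
    root = ET.fromstring(xml_text)
    return [
        "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        for p in root.iterfind(".//w:p", NS)
    ]


if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE):
        print(text)
```

The "runs of characters with styles applied" model is visible in the sample: the bold flag sits on the second run, and the text is recovered by concatenating runs within each paragraph.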

Question: HTML 5 seems like a step backward from XML and XHTML. Is this a sign of eroding support for XML? One reason for HTML 5 (to quote the W3C) is “new elements are introduced based on research into prevailing authoring practices.” Wasn’t XHTML sufficient, or maybe too difficult for “prevailing authoring practices”?

I’m afraid that the reality is that a lot of HTML is still created by hand: people creating some HTML in Web-tools like Dreamweaver or other HTML editors and then going into the HTML and messing around in it in text editing mode. Since those tools have been very slow to enforce XHTML compliance, people continue to generate sloppy HTML pages, and so there is unfortunately a real need out there to at least standardize on what authoring practices exist in the real world.

The much better approach is, of course, to generate XHTML by means of an XSLT stylesheet from XML source pages, which is what we do, e.g., for the http://www.altova.com/ Web site.

Question: XQuery is a standard co-developed by the developers of SQL. What’s your prediction for widespread adoption and use of XQuery?

I initially thought that XQuery had a lot of promise, too, which is why Altova was very quick to provide an implementation of XQuery in our products, including an XQuery editor, debugger, and the ability in our mapping tool to produce XQuery code. However, we’ve found that the adoption of XQuery in the real world is happening much slower than we and many others had anticipated. I think that one of the issues is that there isn’t yet a clear and consistent XQuery implementation level and API across all database systems that people can rely on. The beautiful thing about SQL is that – for the most part – you can throw the same SQL query against an Oracle, IBM DB2, SQL Server, or even MySQL database, and you will get back the same result. The same is not true for XQuery yet, and until we reach that level of wide-spread adoption in the database servers, it has no chance to be as widely adopted by database users and application developers.

The reality is that we see a lot more interest in XSLT 2.0 from our customers than XQuery.

Sad but true, Alexander. I had high hopes for XQuery but I don't hear much about it these days.

Question: Will XBRL be one of the “next big things” leading to a major use of XML by investors via a new set of prosumer applications? Enterprise processes and financial systems? What role will XQuery provide in these contexts?

I do indeed see XBRL as being the next big thing. The fact that both the Europeans and the SEC are mandating XBRL for financial reports from publicly listed companies will be a huge driver of XBRL adoption on a global scale. I am convinced that XBRL will be essential in financial systems and will find its way into enterprise applications fairly swiftly. When it comes to the use of XBRL by investors as prosumer applications, I’m a little bit more skeptical. It is certainly clear that investment professionals will use XBRL to better compare data between different companies in a certain market and to derive some key financial figures much easier than before, because the financial reports don’t have to be re-keyed into their systems. But I don’t think that this effect will transcend the investment professionals and become easily available for consumers anytime soon. As to what role XQuery will play: it might play some role, but I’m thinking of XBRL more as a standardized data transport mechanism and am expecting investment firms to map the XBRL into their internal decision-making and analysis applications and do the querying there.

On this we agree. This might be XML's first great opportunity to transform significant amounts of content -- and the processes to generate that content -- outside the tech doc arena.

Question: I know some subscribers to online financial services are wondering if they will be able to supplement (or even skip) certain of these services by analyzing sets of XBRL files themselves. What are the practical limitations to such analysis? Is there an inherent limitation to max numbers of XBRL files that can be XQueried at once?

There aren’t really any limitations that I’m aware of. The problem is more one of: how will you use the data? An investor who is very accounting-savvy can probably easily use XBRL to extract some key financial indicators for a company and compare several possible investment candidates in an industry group. But most investors I know rather want the key financial indicators automatically calculated by somebody else rather than directly work with the raw XBRL data. So I am skeptical that individual investors will be able to skip their subscriptions. Augmenting them is, however, a possibility and I indeed see the ability for some people to get a more in-depth look at some numbers than what they can currently get from Bloomberg or similar services.


Saturday, February 16, 2008

Update on Office 2007 Compatibility etc.

Julie ("funnybroad") has updated her slide show about her Office 2007 compatibility findings. Here is an excerpt from what she said:

I've replaced my original Office 2007 Compatibility Mode Confusion paper on slideshare.net with an updated version. I had to delete and re-create the existing one, so the link to it from your blog is now broken (click here for Julie's updated info)....everything has been re-tested with Service Pack 1, and sadly, compatibility still sucks. So go to the new link, not the older one.

-----

While I'm on the subject of Office 2007, when I tested and reviewed the product I was happy to see a weird longstanding behavior removed: You print a document, then exit and are asked if you want to save changes. Most people simply say "yes," fearing they forgot whatever change they'd made and don't want to lose it. Others say "no," thinking they made a change inadvertently and don't want it to stick. Well, I was happy to see that dumb "feature" removed, but recently --several automated patch upgrades later, I guess-- I see the "feature" is back. So we've got compatibility with pre-2007 suites, but this is one compatibility feature they could have dropped and it would have made the product better.