The Outlook
VOLUME 4 ISSUE 1  
The Wave Front
Unstructured Data: Tapping the Knowledge of the Enterprise

Unstructured data and the ability to access it have long presented a challenge to enterprises. Outlook Ventures has been particularly focused on this area and is actively evaluating startups that purport to have a solution to this problem. In this article, Outlook Ventures' Managing Director Carl Nichols and analyst Josh Hohman focus on areas of challenge and opportunity within the unstructured data space and take a look at the startups and incumbents working on solutions.

The bulk of enterprise knowledge is stored away in documents "somewhere on the network" and outside the reach of current business analytical tools, Google searches, or database queries. Tapping into this wealth of enterprise knowledge has proven to be one of the biggest remaining challenges facing the enterprise today, but as yet no one has developed the killer app to take advantage of this largely untapped resource. As a result of this emerging opportunity, Outlook Ventures is actively looking to invest in start-ups that can overcome the numerous storage, search, and analytical challenges inherent in unstructured data to provide breakthrough solutions to a nagging enterprise challenge.

DEFINITION OF UNSTRUCTURED DATA

Unstructured data means different things to different people, so we begin with a definition. Unstructured data is defined loosely as any data that does not fit neatly into rows and columns within a database or spreadsheet. This includes most of the documents and files a typical office worker encounters throughout the course of a normal day, such as email and Word documents, as well as numerous other sources including PDFs, web pages, X-ray images and other scanned documents, archived customer statements, web logs, and other documents that are loosely structured, as well as video files, picture files, PowerPoint presentations, and other sources of data not stored within databases. Even within the rigid structure of a database, much data is often stored in fields called "BLOBs" where free text is entered. This type of information is especially common within call-center applications where notes are taken for each call. This type of data is often defined as semi-structured data, since the data is being stored in a database, but lacks an easy method of analysis. For the purposes of this article, and as is common in technical literature on the subject, the term unstructured data will be used to describe both unstructured data and semi-structured data.

SCOPE OF THE UNSTRUCTURED DATA PROBLEM

According to Merrill Lynch, more than 85% of all information within an enterprise exists as some form of unstructured data - meaning that the data is document-related, rather than being neatly sorted in a database or spreadsheet. The amount of semi-structured data generated annually is staggering. A study conducted by the School of Information Management at the University of California at Berkeley revealed that instant messaging generates around 5 billion messages a day, or 274 Terabytes of data a year, while email adds another 400,000 Terabytes annually. The result of all this unstructured data is that the bulk of knowledge resources within an enterprise remain relatively hidden and underutilized. The key to tapping underutilized knowledge resources within the enterprise lies with better software solutions that address the challenges of storing, searching, and analyzing this vast nebula of data.
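As a rough sanity check on the instant messaging and email figures quoted above, the arithmetic can be sketched as follows. The message and terabyte counts come from the article; the use of decimal terabytes and a 365-day year are assumptions made for illustration.

```python
# Figures quoted in the article (UC Berkeley study)
MESSAGES_PER_DAY = 5_000_000_000      # instant messages per day
IM_TERABYTES_PER_YEAR = 274           # instant messaging volume per year
EMAIL_TERABYTES_PER_YEAR = 400_000    # email volume per year

# Assumptions, not from the article: decimal terabytes and a 365-day year
BYTES_PER_TERABYTE = 10**12
DAYS_PER_YEAR = 365


def implied_bytes_per_message(messages_per_day, terabytes_per_year):
    """Average message size implied by a daily count and a yearly volume."""
    messages_per_year = messages_per_day * DAYS_PER_YEAR
    return terabytes_per_year * BYTES_PER_TERABYTE / messages_per_year


im_bytes = implied_bytes_per_message(MESSAGES_PER_DAY, IM_TERABYTES_PER_YEAR)
email_to_im_ratio = EMAIL_TERABYTES_PER_YEAR / IM_TERABYTES_PER_YEAR

print(f"Implied size of an instant message: about {im_bytes:.0f} bytes")
print(f"Email volume relative to instant messaging: about {email_to_im_ratio:.0f}x")
```

Under these assumptions the two instant messaging figures are consistent with short text messages of roughly 150 bytes each, and email accounts for well over a thousand times the instant messaging volume.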

SPECIFIC PROBLEMS WITH UNSTRUCTURED DATA

The problem of unstructured data within the enterprise is particularly difficult to solve because it spans several distinct areas, each of which is a critical piece of an aggregate solution. Any comprehensive enterprise solution for unstructured data must, at a minimum, overcome challenges in the areas of storage, search, and analytics.

Storage

A fundamental problem of unstructured data is scalability and storage. Video, audio, picture, and other unstructured data files are typically much larger than text stored within a database, so making efficient use of this information can be a daunting task for large organizations. Large amounts of unstructured data can degrade the performance of mission-critical applications such as ERP and CRM systems. Many companies are avoiding the issues of data storage by moving unstructured data to less expensive (and less accessible) storage formats. The problem is that once the data is moved to storage not easily accessed by users, the benefit of having it is greatly diminished. Any solution in the storage space will need to keep unstructured data easily accessible to employees while minimizing performance hits to mission-critical enterprise applications.

Enterprise Search

Most unstructured data is not easily interpreted by conventional search algorithms. Typically, enterprise search solutions for unstructured data fall into two categories: applying meta-data (i.e. "tags") to unstructured data, or developing a way to automatically interpret unstructured data without the use of tags. Each approach has its own challenges.

Applying meta-data tags to each unstructured document is the current method of choice given the relative simplicity of applying tags to documents compared to actually analyzing the contents and meaning of a document. However, given the volume of unstructured data, an automated process of tagging is needed to make the process feasible. Since automation requires software to determine the meaning of the document before an accurate tag can be attached, the problem is somewhat of a chicken-or-the-egg dilemma. A hybrid solution is automated taxonomy creation and management that sorts documents into categories without a comprehensive understanding of the document contents. The current, though somewhat impractical, approach to this problem involves software automation with manual intervention to complete the tagging or classification process.
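The hybrid approach described above, in which software sorts documents into categories without fully understanding them and leaves the remainder for manual intervention, can be illustrated with a minimal keyword-matching sketch. The taxonomy, category names, and the "needs-manual-review" label are hypothetical and chosen only for illustration; they do not represent any vendor's product.

```python
import re

# Hypothetical taxonomy: category name -> keywords that suggest the category
TAXONOMY = {
    "billing": {"invoice", "payment", "refund", "statement"},
    "support": {"error", "crash", "install", "password"},
    "legal": {"contract", "litigation", "patent", "compliance"},
}


def tokenize(text):
    """Lowercase the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def suggest_tags(text, taxonomy=TAXONOMY, min_hits=1):
    """Return categories whose keywords appear in the text, best match first."""
    words = tokenize(text)
    scores = {cat: len(words & keywords) for cat, keywords in taxonomy.items()}
    matches = [cat for cat, score in scores.items() if score >= min_hits]
    # Sort by descending score, then alphabetically for a stable order
    return sorted(matches, key=lambda cat: (-scores[cat], cat))


def classify(text):
    """Tag automatically when possible; otherwise route to a person."""
    tags = suggest_tags(text)
    return tags if tags else ["needs-manual-review"]


print(classify("Customer requested a refund after the payment failed with an error"))
print(classify("Lunch menu for Friday"))
```

The second document matches no keywords and falls through to manual review, which is the part of the process the article describes as impractical at enterprise scale.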

Developing a robust enterprise search engine requires an algorithm that can accurately and quickly determine the contents and meaning of hundreds of different file and media formats, ranging from emails to pictures to video and audio files. Given the sheer number of different file formats in a typical organization, this is no easy task.

Analytics

Existing Business Intelligence software allows managers to analyze data within structured formats, but this technology does not yet apply to the vast majority of data within unstructured sources. Most problems of unstructured data analytics stem from the difficulty in deriving the meaning of documents via software. The problem is compounded by the sheer number of differing file formats found in a typical enterprise. In the coming years, start-ups that can develop new and better ways to analyze the resources of the enterprise will find a home within virtually all of the Fortune 500 companies. Analytics may prove to be the killer app within unstructured data.

TECHNOLOGICAL APPROACHES TO THESE PROBLEMS

To date, most technological approaches to the unstructured data problem have focused on a single aspect of the problem, or have sought to overcome a specific business problem within the areas of storage, enterprise search, and business analytics. The following sections outline some of the more interesting approaches within each of the three areas, with a particular focus on analytics - which has the potential to make the biggest impact on the enterprise during the next few years.

Storage

The storage problem within unstructured data is currently under attack from several angles by a number of different start-ups. The solutions range from converting unstructured data into more manageable structured data formats, to the creation of hardware to complement existing storage devices, to the creation of entirely new storage hardware. Many of the solutions are still early in the development phase, so the opportunity for improved storage solutions should make the space interesting over the next few years.

IBM is taking a unique approach to the problem of truly unstructured data by automating the conversion of video and audio files to text. Though still relatively early in development, IBM's ViaVoice software converts call-center voice data into text on a real-time basis. The data, once converted to text, is parsed into a more structured format within a CRM's relational database structure.

Another approach to the storage challenge is to create hardware to complement the existing storage infrastructure within the enterprise. The new hardware aggregates the unstructured data throughout the enterprise and enables greater transparency of data sources across all storage devices. Acopia is one such early-stage company. Acopia's Adaptive Resource data switches sit on top of the existing storage infrastructure, which reduces costs, and allow users to see an aggregated view of unstructured data in the enterprise. Acopia claims to eliminate "islands of storage" through its switching hardware.

Another approach is to move unstructured data to storage devices built specifically for unstructured data. This approach has the positive side effect of offering the possibility of increased customer service levels for many B2C companies. For example, many telecom companies have moved archived phone bills to specialized storage devices which are accessible to both employees and customers. Customers can access the files via a web interface, which improves customer service without sacrificing the performance of internal enterprise applications.

Enterprise Search

One solution to alleviate the problem of semi-structured data within CRM applications is a better enterprise search mechanism. By having the ability to quickly search, and obtain information from, email, CRM BLOB data, and other semi-structured sources, managers can address one of the most pressing needs within CRM unstructured data.

Documentum (purchased by EMC) has a solution that creates a virtual repository for unstructured data. Once a file is in the repository, the software automatically tags it so that it can be searched and tracked throughout its lifecycle. The software basically aims to add structure to unstructured data through the use of virtual repositories and meta-data tags. Documentum's solution is a key step towards enabling enterprise search of unstructured data.

Similar to Documentum's approach, another potential solution to this problem is the automated management of taxonomies within the enterprise. Solutions in this space are aimed at reducing the time and cost of organizing the knowledge assets of an enterprise. Convera is one company focusing on automated taxonomy management, and has developed industry-specific solutions in genetics, financial services, high-tech, and government.

A better enterprise search engine for unstructured data can be the basis for web-based customer self-service, which can reduce costly customer service calls. One vendor with a successful solution in this space is Attensity, whose application allows customers to use a natural language search interface to find answers to their post-sale support needs. Attensity has found that close to 50% of visitors to the application find the answer they were looking for, avoiding the need for a call to technical support. According to Gartner, the cost of a web visit is a few dollars, compared to upwards of $60 for a customer service call. Please refer to the July 2004 Wave Front article for a more comprehensive discussion of current start-ups focused on enterprise search.

Analytics

Emerging software solutions to the challenge of unstructured data analytics fit nicely with Outlook Ventures' focus on enterprise software. As a result, the remainder of this article will focus on current approaches to unstructured data analytics, as well as specific applications to a number of different industry verticals.

Email and XML Document Analytics:

Several companies are attempting to derive analytical data from the most common sources of unstructured data within the enterprise: email and XML-formatted documents. Given the volume of email and XML documents within a typical organization, solutions in this space should deliver the most bang for the buck on the unstructured data problem in the short run. One such company, Skytide, is addressing this problem with an analytical application platform called XOLAP, which attempts to bring analytical tools similar to those currently available for structured data to the unstructured world. Email and XML documents are semi-structured and therefore are the low-hanging fruit of the unstructured world. However, given the volume of email and XML, adept analytical tools for these formats would go a long way towards meeting the analytical needs of today's enterprise.

CRM and Call-Center Applications:

Analytical tools for unstructured data will give managers a better picture of customer interactions during call-center visits. Typically, a manager has a good idea of the volume of calls coming into a call center, as well as the actions taken as a result of each call. What is often missing is the reasoning for the action taken, which is more than likely stored in notes fields within the CRM application. Through better analysis of this valuable data, managers can get a better picture of the customer experience and identify trends among groups of customers. This data could be used to analyze anything from marketing efforts to early detection of product defects, all based on notes historically hidden within CRM databases.
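To make the idea of mining notes fields concrete, the following is a minimal sketch that groups free-text call notes by the action taken and counts the terms that recur within each group. The call records, stopword list, and function name are invented for illustration; a production tool would need far more robust language processing than simple word counts.

```python
import re
from collections import Counter, defaultdict

# Hypothetical call records: (action taken, free-text note from the CRM)
CALLS = [
    ("refund", "Customer says battery overheats while charging"),
    ("refund", "Battery overheats, unit returned"),
    ("replacement", "Screen cracked in shipping"),
    ("refund", "Did not like the color"),
]

# Small illustrative stopword list; a real list would be much longer
STOPWORDS = {"the", "a", "in", "did", "not", "says", "while", "like",
             "customer", "unit"}


def reasons_by_action(calls):
    """Count, per action, how many notes mention each non-stopword term."""
    counts = defaultdict(Counter)
    for action, note in calls:
        # Use a set so a word counts once per note, however often it repeats
        words = set(re.findall(r"[a-z]+", note.lower())) - STOPWORDS
        counts[action].update(words)
    return counts


refund_terms = reasons_by_action(CALLS)["refund"]
for term, note_count in refund_terms.most_common(2):
    print(f"{term}: mentioned in {note_count} refund notes")
```

In this toy data the terms "battery" and "overheats" surface as the most frequent reasons behind refunds, which is the kind of early defect signal the article describes.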

Relationship Discovery in the Medical and Legal Industries:

Some of the most interesting recent developments in the unstructured analytics space have been in the area of relationship discovery. Relationship discovery is the use of analytical software to analyze vast amounts of commonly available public information to uncover previously unknown relationships. The recent application of these techniques to medical literature is particularly fascinating. Early work in this area includes software called Arrowsmith, which was able to find a previously unknown relationship between migraines and magnesium-deficiencies by finding a common set of shared words between articles on both subjects.
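The shared-word idea can be sketched in a few lines: build a vocabulary for each body of literature and look at the terms they have in common. This is only a toy illustration of the concept, not the Arrowsmith implementation; the sample sentences below are invented for the example and are not real article titles.

```python
import re

# Small illustrative stopword list; a real list would be much longer
STOPWORDS = {"a", "and", "are", "can", "in", "of", "the", "with"}


def vocabulary(articles):
    """Collect the distinct non-stopword terms used across a set of texts."""
    words = set()
    for text in articles:
        words |= set(re.findall(r"[a-z]+", text.lower()))
    return words - STOPWORDS


def linking_terms(literature_a, literature_b):
    """Terms that appear in both literatures and may hint at a connection."""
    return vocabulary(literature_a) & vocabulary(literature_b)


# Invented sample text for each literature (not real citations)
MIGRAINE = [
    "Migraine attacks are associated with vascular spasm",
    "Spreading depression observed in migraine patients",
]
MAGNESIUM = [
    "Magnesium deficiency can cause vascular spasm",
    "Magnesium inhibits spreading depression in the cortex",
]

shared = sorted(linking_terms(MIGRAINE, MAGNESIUM))
print(shared)
```

The shared terms are only candidate links. A researcher still has to judge whether a term that bridges the two literatures points to a meaningful relationship.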

More recent work by Christian Blaschke and Alfonso Valencia utilized a program called Suiseki to read through abstracts to uncover previously undiscovered protein interactions. Most of what Suiseki uncovered was already known, but a few interactions were brand new. Similar work by Professor Don Swanson used software to analyze Medline abstracts based on a topic of interest. The software produces a set of heuristics to guide the user to uncover relationships to the topic of interest that were not easily discerned from either set of data alone.

Medical researchers searching online scientific documents are overwhelmed by the amount of data. One database, Medline, contains over 10 million abstracts and grows by another 7,500 per week. As yet, no company has capitalized on a commercial version of the software, but the possibilities appear to be endless for anyone able to produce a commercially viable version of this technology in the emerging space of "Bibliomics". SPSS is working on a scaled-down toolkit solution called "LexiQuest Mine" to aid pharmaceutical companies in reducing time to market for pharmaceutical research. In addition, ClearForest, Convera, Recommind, and Ingenuity Systems are all working on the bibliomics problem in one form or another. Bibliomics, and in particular relationship discovery, could be a lucrative space in the coming years.

Application of this concept to the legal industry can provide similarly dramatic results in helping to build evidence from a mountain of publicly available case documentation. Law firms often sort through thousands of documents in preparing for litigation, intellectual property, or M&A activities for clients. This has typically been a very manual (not to mention expensive) process for both the clients and the law firms. Several technology companies are working on solutions to address this specific industry vertical. Stratify helps litigation teams accelerate case assessment through its "Legal Discovery Service", which extracts useful information from unstructured data sources. While Stratify is the current leader in the legal industry, the space remains largely untapped. The coming years should see vast improvements in both legal and medical relationship discovery.

Analytics of Web logs to Derive Marketing Intelligence:

As Web logs become more and more popular, a growing set of useful data remains hidden. Web logs are a particularly interesting source of data for a number of different parties, including marketers and corporate strategists. As a marketing manager, imagine being able to scan and analyze thousands of specific web logs to see how your product (or a competitor's product) is being perceived and discussed. Marketers could analyze "buzz" in web logs for consumer products to home in on specific demographics, or analyze the effectiveness of particular marketing campaigns on brand awareness. Aurora WDC and Moreover Technologies are working on just such a solution to provide competitive intelligence on trends in the market. IBM is also a player in this space with its WebFountain application, which analyzes web log data to identify emerging trends in product buzz, brand awareness and brand reputation.

SUMMARY

Despite the vast amount of unstructured data within the enterprise, as yet no one has been able to find the Holy Grail application to overcome the numerous challenges of unstructured data and tap this under-utilized resource. Current approaches to unstructured data are starting to chip away at the problem, and the coming years should see a significant breakthrough on the magnitude of Google within search. We at Outlook Ventures believe the next several years will see the emergence of unstructured data as a lucrative investment opportunity and are actively seeking investments in this space.


This article was researched and written by Carl Nichols and Josh Hohman with valuable input from several sources, including analyst organizations such as Gartner Research, and technology websites such as DMReview.com and Unstruct.org.
Published by Outlook™ Ventures
Copyright © 2005 Outlook Ventures. All rights reserved.