During its eight-year history, Outlook Ventures has been an active participant in the area of search. Outlook Ventures (previously called Iminds Ventures) was an early participant in the creation of goto.com (renamed Overture in 2001), and General Partner Randy Haykin was a member of the start-up team at Yahoo! Since the days of Yahoo! and Overture, the search industry has evolved considerably. In this article Outlook Associate Shazia Makhdumi focuses on areas of opportunity within both Web and enterprise search, and takes a look at the startups currently challenging the incumbents in the sector.
Search is arguably the killer application on the Internet. It has spawned cottage industries of search engine marketing and search engine optimization, and caused the verb 'to Google' to enter everyday language. Search technology has been a catalyst for new advertising models. It is transforming how companies market their brands, products and services and changing how enterprises create, store and share information on PCs, mobile devices and corporate networks. Search has grown into a multi-billion dollar industry in which established companies are competing alongside scrappy startups.
Search technology has made inroads into both the consumer and corporate worlds. However, there are many differences between Web search and enterprise search. For example, in a Web search a user is typically looking for the most relevant document within the returned set, whereas in enterprise search a user is typically looking for something specific that he or she has seen before, or knows of and wants to find quickly. The criteria of relevance within Web search and enterprise search are also very different, given the nature of the content repositories on the Web and within the enterprise. Consequently, the challenges and opportunities within Web search and enterprise search are extremely different. This article identifies technological advances made by incumbents and startups that are expected to vastly improve the search experience on the Web and within the enterprise.
WEB SEARCH
Most Web search today is offered either through search-only sites such as Google that use algorithmic search (complex formulas to determine the "best" results for any given keywords) or through content sites or portals, including Microsoft's MSN and Yahoo!, many of which also use algorithmic techniques and/or indexing of sites by category. However, there is still a lot of work to be done in shortening the distance between searchers and what they search for. Next-generation search technologies are being developed in areas such as personalization, localization and mining the deep Web, all of which will help users access more personally relevant information faster.
RELEVANCE
One of the main complaints of Web searchers is that they are served with a multitude of results, most of which have nothing to do with what they are interested in. As the number of Web pages increases, algorithms and technologies are being developed to help users navigate through Web content and get at what they need faster.
However, achieving contextually relevant results quickly is difficult because the two requirements pull in opposite directions. One of the main issues is the context of search terms. Users are accustomed to entering queries as keywords, and the results do not always correspond to the context they had in mind, which is extremely frustrating. They prefer to have the search engine display an array of content types that might be relevant to their query instead of guessing what their intent was. iBoogie helps users get more relevant results by categorizing the results around different contexts. Several firms, however, are moving away from text-based result display and are finding ways to use visuals to help increase relevance and provide context, for example by using special icons to help users quickly identify more relevant sub-searches. Kartoo visually represents relationships between search results, while tools from Groxis present results categorized visually, with the ability to drill down deeper into each category until users get the information they are looking for.
Advances in data structures and searching open up the possibility of querying with more than just keywords. New technologies that will improve relevance include those with the ability to divine context within the content of a Web page. A few companies are developing automated linguistic analysis techniques to help rank Web pages through a contextual understanding of the content. Quigo Technologies has developed artificial-intelligence software that "reverse matches" searches to better understand how people actually find what they're looking for, helping companies develop better advertising strategies. RDF (Resource Description Framework) makes it possible to add descriptive metadata to Web content, which allows search tools to understand not only the words in content but also context, meaning and relationships among the content. However, adoption of this technology has been slow.
Other companies, such as Teoma (acquired by AskJeeves), have taken a different approach by ranking results according to their standing among recognized authorities on a topic. Before the Teoma engine presents the results for a given set of keywords, it identifies the associated communities and looks for the "authorities" within them (pages that community members' websites point to most often) and ranks search results according to how often each page is cited by authority pages. Unlike companies that have tried this and not been able to do searches in an acceptable timeframe, Teoma has developed proprietary technology that allows sub-second searches.
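The authority idea can be illustrated with a small sketch. It is not Teoma's proprietary method, only the general approach described above: find the pages that members of a topic community link to most often, then rank results by how many of those authorities cite them. The link graph, the page names and the `top_n` cutoff are all invented for the example.

```python
from collections import Counter

def rank_by_authority(links, community, candidates, top_n=2):
    """links maps each page to the list of pages it points to."""
    # Authorities: the pages that community members point to most often.
    inbound = Counter(
        target for page in community for target in links.get(page, [])
    )
    authorities = [page for page, _ in inbound.most_common(top_n)]

    # Score each candidate by the number of authority pages that cite it.
    scores = {
        c: sum(1 for a in authorities if c in links.get(a, []))
        for c in candidates
    }
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    return ranked, authorities

# Hypothetical link graph: three community sites and two hub pages.
links = {
    "fan1": ["hubA", "hubB"],
    "fan2": ["hubA", "hubB"],
    "fan3": ["hubA"],
    "hubA": ["page1", "page2"],
    "hubB": ["page2"],
}
ranked, authorities = rank_by_authority(
    links, ["fan1", "fan2", "fan3"], ["page1", "page2", "page3"]
)
print(authorities)  # ['hubA', 'hubB']
print(ranked)       # ['page2', 'page1', 'page3']
```

The expensive part in practice is identifying the community and its authorities at query time, which is where the sub-second claim matters; the sketch simply takes the community as given.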
PERSONALIZATION
Personalization is considered to be one of the main areas of improvement for search technology. Decisions on what was considered relevant were first made by computers, then by paid editors and are now made by webmasters. The next logical step is that users decide what is relevant based on their knowledge and experiences. One of the next big opportunities and challenges in search engines is the ability for them to learn and adapt results based on user behavior, giving personalized results by better anticipating the intent of the user. However, users do not always know what they like or want and, even when they do, have a hard time articulating it. Link analysis and other "off-the-page" ranking criteria have played an increasing role in relevancy algorithms over the past years. Monitoring navigation behavior at a user level could conceivably be the basis for developing an understanding of users' individual interests over time: in essence, personalizing the equivalent of Google's PageRank scores. Technologies need to be developed that can indirectly find out what people like and consider relevant at any given moment in time. These can involve techniques such as analyzing user clicks and other information on their desktop.
Startups such as Eurekster have taken an interesting approach to personalization, by integrating 'word of mouth' and recommendations from like-minded individuals to filter information on the Internet. The company has married search technology with social networking technology to create a result set based on recommendations from friends and associates. Eurekster makes use of its own SearchMemory™ technology, which remembers the sites a user finds useful and presents them higher in the results the next time they search, as well as presenting sites their friends and contacts found useful higher. By monitoring what users and their friends click on, it can increasingly understand more about a user's needs and can tailor searches to that. However, this approach may suffer from the "reverse network effect", whereby the more the network grows, the less useful the recommendations from those in the network become.
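A rough sketch of this kind of click-based re-ranking follows. It is an illustration only and makes no claim about how Eurekster's technology works internally; the function name, the weights and the sample sites are assumptions made for the example.

```python
def personalize(results, my_clicks, friend_clicks,
                own_weight=2.0, friend_weight=1.0):
    """Re-rank results using sites the user and their contacts clicked before."""
    def boost(url):
        return (own_weight * my_clicks.get(url, 0)
                + friend_weight * friend_clicks.get(url, 0))
    # sorted() is stable, so results nobody clicked keep the engine's order.
    return sorted(results, key=boost, reverse=True)

# Hypothetical result list and click histories.
engine_results = ["a.example", "b.example", "c.example", "d.example"]
my_clicks = {"c.example": 1}
friend_clicks = {"d.example": 1}

personalized = personalize(engine_results, my_clicks, friend_clicks)
print(personalized)  # ['c.example', 'd.example', 'a.example', 'b.example']
```

Weighting the user's own clicks above those of contacts is one way to soften the "reverse network effect": as the network grows, distant contacts add noise, so their contribution is kept smaller.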
As well as customizing results for individuals, search companies are focusing on customizing results for groups of individuals who exhibit similar characteristics, for example those who frequently visit certain sites. IP-sniffing technology might take search engines a step closer to personalizing search results without requiring users to compromise on very personal information. IP-analytic software associates Internet-connected devices with geographic areas, domains (.com, .edu, and .gov), ISPs, connection speeds and browser types with some level of confidence. Analyzing click popularity at an aggregate level along IP-associated parameters could be leveraged to extrapolate personalized ranking for clusters of users exhibiting similar behaviors.
Some startups are using a combination of psychology, software and neural networks to create a ranking algorithm that better intuits what users are looking for. Mooter analyzes the potential meanings and permutations of the starting keywords and ranks the relevance of the resulting Web pages within broad categories called clusters. To develop a more precise understanding of what the user is probably looking for, the Mooter engine notes which clusters and links get clicked and uses that information to improve future responses. A refined set of results appears on every page; the engine continues to adjust the rankings based on the user's behavior.
LOCALIZATION
Local search - delivering results based on location, somewhat like merging yellow pages with search - is an emerging opportunity with significant commercial potential. A great number of yellow page advertisers, including small businesses, do not use online advertising today due to a variety of reasons including economics and complexity. Instead, they rely on yellow-page listings, word of mouth and local advertising (a market estimated at $18B per year). Search features like localization and personalization provide impetus to these businesses to move advertising online.
Geo-targeting Web search content, both organic and paid, requires search engines to infer local intent by extracting geo-signals and leveraging implicit and explicit user profiles. Although established search engines are already serving country-specific results, technologies are being developed to do geo-targeting at a more granular level to provide more value to local advertisers as well as serving more relevant content to users. Companies such as Quova allow websites to know approximately where a user is, though without revealing a specific address (for privacy reasons).
The effects of localized search are also expected to be meaningful to the mobile search engines of the future. These will take into account a user's precise location when serving results, as users are more likely to be looking for directions, local news, sports etc. It seems like an almost natural progression to send local content such as yellow page listings, directions, maps and business ratings to mobile devices and be able to search on it. Personalization features, geo-based services, faster networks, better handset resolution and color displays should significantly improve the user experience over time. The navigation schema, whether search or browse modes, will be critical to make cellular phones a viable platform for both end-users and advertisers. About 90% of mobile phones will be Web-enabled by 2006, making the mobile phone a more attractive platform for content providers, developers, and information architects to invest time in.
DEEP WEB
Currently search engines crawl over the billions of pages posted on websites and open to browsing, which is considered the visible Web. There is much more information in the deep Web, which is estimated to be 92 petabytes worldwide or around 500 times the volume of the visible Web. However, crawlers have difficulty indexing content protected behind sign-up forms or stored in databases, such as product catalogues and legal/medical archives, that is assembled into Web pages at the moment users request it. A user can currently get to information in a database by filling out a form that goes to the database and gets the information, which is put together onto a Web page on the fly. However, search crawlers cannot fill in that form, and therefore cannot get at the database. Search companies are spending time and money to figure out how to index information that comes out in response to queries and is not in HTML form. Some companies have developed tools that allow websites to supply information from their database in a format that crawlers can understand. However, that is a lot of work for the website owner, and it is relatively easy for the website owner to fool the engine and doctor results. There needs to be a way to get that information from the database while keeping results honest. Dipsie is building a more nimble crawler that indexes website databases in advance, which allows it to send users in a single click to a Web page that would otherwise take many clicks to get to. The company claims it has an index of 10 billion documents, triple the current size of the Google index. Companies such as Endeca are making inroads into the deep Web by creating solutions that can search structured information such as UPC codes in product catalogs.
Access to proprietary databases requires other enabling technologies and gives rise to new business models. There has been some talk of creating a network of proprietary databases, which can be searched by crawlers. These could involve getting users to pay for data access, once they deem it relevant. However, technologies will need to be developed for the extraction and formatting of data that lives in these heterogeneous sources, as well as to ensure security and performance.
Other content not currently accessible to search engines is content that is available only by subscription. Technologies are being developed to somehow find that content; Google is working on a way to display a brief description of it so that the user knows it exists but still has to go to the site and pay to get it. Another category of content that is not always accessible through conventional search queries is information contained in blogs. Blog search engines such as Technorati are not always able to search the rest of the Web. Blogs have become an important niche on the Web, to the extent that even Google is considering the development or purchase of blog search capability. Another part of the deep Web that currently cannot be reached by search queries is content available in books. Amazon is addressing this issue through its "Search inside the book" feature, which lets visitors use a search engine to peer inside the pages of more than 100,000 books. The company also announced A9, a separately run subsidiary that will develop e-commerce search tools for use by both Amazon and licensees.
DATA AGGREGATION AND FILTERING
Although crawlers are going through the Web and indexing amazing amounts of information every day, there is still a challenge to keep information fresh. To that end, there needs to be a way for websites to alert and update search engines on information that has changed. Currently the only way to refresh that information is for the site to be crawled again. Most large enterprises are willing to pay to get crawled more frequently. However, it would be much more efficient and lead to reduced bandwidth consumption if all websites were able to automatically post their changes, maybe through a different port.
SPAMMING
One of the key challenges that search engines face is recognizing when they are being spammed and dealing with it. Spammers try to understand the ranking mechanism of a search engine and then figure out ways to enhance the placement of their site. This has evolved from peppering the site with 'white on white' keywords, for algorithms that rank results based on the number of keywords found, to creating fake links to a site to increase its PageRank in Google. There are companies working on solutions to make search engines more resilient to spamming. These include adding parameters to relevance, such as monitoring how long people spend on the page and what they do after clicking a link on that page.
An open source search engine called Nutch, funded partly by Overture, has been developed to ensure that search engines for their part remain honest. It uses ranking algorithms similar to Google's, but with a twist: each search result is accompanied by a link labeled "Explain" that produces a detailed accounting of the various scores and weights that gave the result its rank. It allows people to compare results from commercial search engines to Nutch to determine whether any of the search engines are biasing results towards their advertisers.
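To make the idea of an explainable ranking concrete, here is a minimal sketch of a scorer that returns its own breakdown alongside the total. The scoring formula, the field names and the weights are assumptions chosen for illustration; they are not Nutch's actual ranking function.

```python
def score_with_explanation(doc, query_terms, weights):
    """Return (total_score, breakdown) for one document."""
    body_words = doc["text"].lower().split()
    title_words = doc["title"].lower().split()
    breakdown = {}
    for term in query_terms:
        # Each component is recorded separately so it can be shown to the user.
        breakdown[f"body:{term}"] = weights["body"] * body_words.count(term)
        breakdown[f"title:{term}"] = weights["title"] * title_words.count(term)
    breakdown["inbound_links"] = weights["link"] * doc["inbound_links"]
    return sum(breakdown.values()), breakdown

# Hypothetical document and weights.
doc = {
    "title": "Open source search",
    "text": "Nutch is an open source search engine",
    "inbound_links": 3,
}
weights = {"body": 1.0, "title": 2.0, "link": 0.5}

total, explanation = score_with_explanation(doc, ["search"], weights)
print(total)        # 4.5
print(explanation)  # {'body:search': 1.0, 'title:search': 2.0, 'inbound_links': 1.5}
```

Because every component of the score is exposed, an outside observer can check that nothing other than the stated factors (for instance, an advertiser relationship) contributed to a result's position.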
ENTERPRISE SEARCH
Search technology is not only changing the Web and online advertising, it's also changing the way companies store, organize, and share information. Enterprise search encompasses both employees accessing the Intranet as well as external constituents such as customers and partners accessing corporate databases for self service etc. Although employees, customers and partners seek Web-like experiences in the enterprise, the Internet and enterprise domains differ fundamentally in the nature of the content, user behavior, and economic motivations. Traditional Web search technologies do not apply well within enterprise search and companies are exploiting opportunities to optimize enterprise search around desktop PCs and corporate databases with improvements in relevance, unstructured document search, federation etc.
RELEVANCE
The main issues around enterprise search are heterogeneous content repositories and poor intra-document linking. Intranet content is created for disseminating information, rather than attracting and holding the attention of any specific group of users. The link structure on an intranet is very different from the one on the Internet. Content in heterogeneous repositories (for example, e-mail systems and content management systems) typically is not cross-referenced via hyperlinks. The popular PageRank and HITS algorithms are thus not as effective on an intranet as on the Internet. Therefore, other techniques have to be employed to improve search relevance on an intranet. Since intranets are essentially spam-free (because of the lack of incentives for spamming), anchor text and title words are reliable sources of information for ranking documents.
Most Intranet searches are done to retrieve information that the user knows exists somewhere, and remembers some attribute pertaining to that information. Search engines must provide a way of specifying attributes in conjunction with the query. Sorting on an attribute also helps to locate information quickly. The perceived relevance of a result can be dramatically changed by providing better titles (using techniques to create titles automatically if none exists), dynamic summaries, category information, and so forth. Within the enterprise usability is tightly coupled to relevance.
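The combination of a keyword with remembered attributes can be sketched as follows. The document fields (`title`, `author`, `modified`) and the `search` function are hypothetical, and a real engine would use an index rather than scanning every document.

```python
from datetime import date

def search(docs, text, filters=None, sort_by=None):
    """Match a keyword against titles, apply attribute filters, then sort."""
    filters = filters or {}
    hits = [
        d for d in docs
        if text.lower() in d["title"].lower()
        and all(d.get(attr) == value for attr, value in filters.items())
    ]
    if sort_by:
        # Newest (or largest) attribute value first.
        hits.sort(key=lambda d: d[sort_by], reverse=True)
    return hits

# Hypothetical intranet documents.
docs = [
    {"title": "Q3 budget draft", "author": "lee", "modified": date(2004, 7, 1)},
    {"title": "Q3 budget final", "author": "lee", "modified": date(2004, 9, 15)},
    {"title": "Q3 budget notes", "author": "kim", "modified": date(2004, 8, 2)},
]

hits = search(docs, "budget", filters={"author": "lee"}, sort_by="modified")
print([d["title"] for d in hits])  # ['Q3 budget final', 'Q3 budget draft']
```

A user who remembers only "the budget document Lee wrote recently" gets the right file at the top, even though a keyword-only search would treat all three documents as equally relevant.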
Organizations have a large amount of content, and search has been focused on leveraging that organizational content. The next area of focus within enterprises is to deploy technologies that go beyond finding documents to processing the statements in the content and extracting the useful meaning embedded in the sentences of the documents in a collection. This has to be done for single documents as well as across entire collections, in order to detect patterns.
Although intranet search does not normally return millions of documents as in Internet search, a result set may contain a large number of documents. Sifting through a long list is very tedious. In this case, on-the-fly result list clustering is desirable to help users navigate the results. Tools from companies like Vivisimo enable real time result list clustering which organizes search results by query-dependent topics that are dynamically generated from search results.
While search relevance is an important yardstick, there are other key characteristics that make for effective search, such as navigation, classification, entity extraction, recommendation, summarization, query language, and semantics. Systems that incorporate user behavior will become the norm, yielding higher relevance, better personalization, and higher utilization of human assets and tacit information. Enterprise systems typically cannot compile the large-scale statistics that Internet search engines use to weed out noisy data, and other techniques will be employed to address the special needs of the enterprise.
PERSONALIZATION
Information usage patterns can be analyzed to discover the patent and latent relationships between the people in an organization and the documents they create, modify, access, search, and organize. Given the high degree of confidence around interactions within an enterprise, startups are focusing on tools that analyze communications patterns between people and the dynamic usage of information in the enterprise. The goal is to deliver a richer personalized experience, based on a combination of content and context, to individuals and groups. The social network can be exploited to augment content analysis with the historical behavior of users, changing result ranking. This adaptive ranking could be simplistic (boost a document's rank if a previous user elected to view/rate it after issuing the same search) or more sophisticated (boost rank if selected from the second results page for a similar query). Incorporating dynamic feedback also allows for the infusion of new terms to document representations, allowing relevant information to be returned for query terms that do not even exist in the content (concept-based retrieval).
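The simpler of the two adaptive schemes mentioned above (boost a document if a previous user viewed it after issuing the same search) might look like the sketch below. The class name, the boost value and the base scores are assumptions made for the example.

```python
from collections import defaultdict

class AdaptiveRanker:
    """Boosts documents that earlier users viewed after the same query."""

    def __init__(self, boost=0.5):
        self.boost = boost
        self.views = defaultdict(int)  # (query, doc_id) -> number of views

    def record_view(self, query, doc_id):
        self.views[(query.lower(), doc_id)] += 1

    def rerank(self, query, scored_docs):
        """scored_docs is a list of (doc_id, base_score) pairs."""
        q = query.lower()
        adjusted = [
            (doc_id, score + self.boost * self.views.get((q, doc_id), 0))
            for doc_id, score in scored_docs
        ]
        return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

ranker = AdaptiveRanker()
base = [("policy.doc", 1.0), ("faq.html", 0.8)]

# A previous user searched for "vacation policy" and opened the FAQ.
ranker.record_view("vacation policy", "faq.html")

reranked = [doc_id for doc_id, _ in ranker.rerank("Vacation Policy", base)]
print(reranked)  # ['faq.html', 'policy.doc']
```

The more sophisticated variant would weight a view by where the result appeared (a click on the second results page is stronger evidence than a click on the top hit) and would match similar rather than identical queries.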
UNSTRUCTURED INFORMATION
The overwhelming majority of information in an enterprise is unstructured, i.e., it is not resident in relational databases but exists in the form of HTML pages, documents in proprietary formats, and other forms (e.g., paper and media objects). Together with information in relational and proprietary databases, these documents constitute the enterprise information ecosystem. Since structured information is most valuable to enterprises, they are seeking to enhance the value of their unstructured information by adding structure to it. Creating, aggregating, capturing, managing, retrieving, and delivering this information are core elements in an enterprise content infrastructure. A typical query is a combination of a text query and a parametric query on structured fields, the latter usually being handled by relational databases. However, to do this within an acceptable timeframe requires classification and extraction tools that augment information with attributes for improved search and navigation. It is essential to provide high-performance parametric search that allows the user to navigate information through a flexible combination of structured and unstructured data.
Enterprise search users want not just a matching document but an answer. Application-specific search engines, such as Kanisa, help companies' customers find useful information within corporate databases. The ability to dynamically construct virtual documents that can consist of relevant portions of many documents will be critical. The increased adoption of XML makes it possible to search and retrieve specific portions of documents, but there is a lot of legacy information that will have to be tagged, classified and accessed in such a fashion. An individual's role in an enterprise dictates what documents can be accessed. Sophisticated enterprises demand a more stringent notion of security in which search result lists are filtered to display only the documents accessible to the user. Doing this in conjunction with the native security of the repositories is a particularly difficult challenge.
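Filtering a result list by what the user is allowed to see can be sketched as below. The access-control table, group names and file names are hypothetical, and the sketch sidesteps the hard part noted above, which is keeping such a table consistent with each repository's native security.

```python
def trim_results(results, user_groups, acl):
    """Keep only the documents that at least one of the user's groups may read."""
    # Documents with no ACL entry are treated as inaccessible (fail closed).
    return [doc for doc in results if acl.get(doc, set()) & user_groups]

# Hypothetical access-control list: document -> groups allowed to read it.
acl = {
    "salaries.xls": {"hr"},
    "handbook.pdf": {"hr", "engineering"},
    "roadmap.ppt": {"engineering"},
}
results = ["salaries.xls", "handbook.pdf", "roadmap.ppt", "untagged.txt"]

visible = trim_results(results, {"engineering"}, acl)
print(visible)  # ['handbook.pdf', 'roadmap.ppt']
```

Hiding inaccessible documents entirely, rather than listing them and refusing access on click, matters because even a title or summary in a result list can leak confidential information.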
DATA AGGREGATION AND FILTERING
As in Web search, enterprise data crawlers need to accumulate and index information before it can be searched. This requires knowledge of where the critical information is, access to those repositories (which can be secure) as well as a way to pull information at a rate commensurate with the rate at which it is changing. Adaptive refresh of indexes is required, which involves more sophisticated change-detection mechanisms. Most crawlers use a pull model, which is hard on network resources and on the target repositories. Future crawlers will take greater advantage of triggers and targeted crawling to be able to refresh only that information which has been changed.
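One simple change-detection mechanism is to keep a fingerprint of each document from the previous crawl and re-index only the documents whose fingerprint has changed. The sketch below assumes an in-memory repository and is only one possible approach; it still has to read every document, so it reduces indexing work rather than network load, which is what triggers pushed by the repository would address.

```python
import hashlib

def changed_documents(repository, fingerprints):
    """Return the ids whose content differs from the last indexed version.

    repository maps doc id -> current text. fingerprints maps doc id -> the
    hash stored at the previous crawl, and is updated in place.
    """
    changed = []
    for doc_id, text in repository.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprints.get(doc_id) != digest:
            changed.append(doc_id)
            fingerprints[doc_id] = digest
    return changed

# Hypothetical repository with two documents.
repository = {"a": "one", "b": "two"}
fingerprints = {}

first_crawl = changed_documents(repository, fingerprints)   # everything is new
repository["b"] = "two, revised"
second_crawl = changed_documents(repository, fingerprints)  # only "b" changed
third_crawl = changed_documents(repository, fingerprints)   # nothing changed

print(first_crawl, second_crawl, third_crawl)  # ['a', 'b'] ['b'] []
```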
Enterprises that are indexing external content will also need to be able to clean up the data before ranking algorithms can act upon it. Techniques such as link-density analysis can be used to detect the differences between content and link-rich menus on Web pages. Entity-extraction techniques can be used to add relevant information before indexing occurs. Stripping out advertisements, menus, and so forth will be required to improve the quality of the data.
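Link-density analysis can be illustrated with Python's standard `html.parser` module: the fraction of a block's text that sits inside links is close to 1 for a navigation menu and close to 0 for body content. The helper names and sample HTML are invented for the example, and any cutoff used to separate menus from content would need tuning on real pages.

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Counts characters of text inside and outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <a> tags currently open
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth:
            self.link_chars += n

def link_density(html_block):
    """Fraction of the block's text that is link text (0.0 if no text)."""
    parser = LinkDensity()
    parser.feed(html_block)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars

menu = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
article = '<p>Search has grown into a large industry. <a href="/more">More</a></p>'

print(link_density(menu))     # 1.0
print(link_density(article))  # well below 0.2
```

Blocks with a high density can be dropped before indexing so that menu text repeated on every page does not dilute the ranking signals of the actual content.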
FEDERATION
Enterprise search involves accessing information that is not always indexed (it is on the Web, it is not indexed due to security and legal considerations, it resides in an esoteric application/data repository). In such cases, federated search is used since it provides a single point of access to data from multiple sources, including enterprise repositories and applications, Web search, as well as external subscription sources and real time feeds. The key challenge here is to merge sets of results from all sources for unified presentation, which is harder when sets have no documents in common and employ different scoring and ranking schemes. Verity is a market leader in this space, searching an enterprise network as Google does the Web, utilizing information in multiple applications and databases, some of which may not even have been published on any site, to come up with an answer in a single query of multiple data sources. There is an opportunity for systems to add further value via ranking, filtering, duplicate detection, dynamic classification, and real time clustering of results from disparate sources that may not be under the jurisdiction of the enterprise. Database vendors provide federation across disparate relational databases, but federated search for unstructured data provides different challenges such as ranking across independent systems, classification and clustering.
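The merging problem can be sketched as follows. Min-max normalization is used here as one simple way to put scores from different sources on a common scale; it is an assumption of this example rather than the method any particular vendor uses, and the source names and scores are invented.

```python
def normalize(results):
    """Min-max scale one source's scores into the range 0..1."""
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [(doc, (score - low) / span if span else 1.0)
            for doc, score in results]

def federate(*sources):
    """Merge ranked (doc, score) lists from several sources into one list."""
    best = {}
    for results in sources:
        if not results:
            continue
        for doc, score in normalize(results):
            # Duplicate detection: keep the best score seen for each document.
            best[doc] = max(score, best.get(doc, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Two hypothetical sources that score on very different scales.
intranet = [("handbook", 9), ("memo", 3)]
web = [("press release", 120), ("handbook", 80), ("blog post", 40)]

merged = federate(intranet, web)
print([doc for doc, _ in merged])
# ['handbook', 'press release', 'memo', 'blog post']
```

A known weakness of this normalization is that the lowest-scoring result of every source ends up with a score of zero regardless of its real quality, which is one reason ranking across independent systems remains an open opportunity.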
PC SEARCH
Advances in search are not just being made for the Web but also for content on other devices such as PCs. There are many occasions where information is stored somewhere on the hard drive, in notes or email or somewhere else that is hard to find. Companies like X1 and Filehand have developed tools that search for information in multiple applications and formats within a user's computer. Google, which has released an email service incorporating advanced search technology, is currently working on a tool to search both the Web and the computer's insides, while Microsoft and Apple are racing to include that functionality within the operating system. Apple is expected to release Spotlight, a built-in search engine based on the iTunes categorization and identification technology, within its O/S next year. Spotlight will allow users to comb files, music and photos, while also looking at the content and metadata within files. Meanwhile Microsoft has stepped up plans to release an improved search tool for files and emails, which was initially expected to be incorporated within Longhorn, Microsoft's next operating system, by next year. Although Microsoft and Apple will enjoy the benefits of providing an integrated desktop search solution, they will need to ensure that indexes are refreshed at an acceptable rate without draining system resources.
SUMMARY
The holy grail of search is the ability to talk directly to the Web and corporate databases through a computer or any other device, ask a question in English or any other language and get a correct and relevant answer, regardless of how it is phrased. A search product of the future will have access to everything ever written or recorded, know everything the user ever worked on and saved to his or her personal hard drive, and understand his or her tastes, friends and predilections. Eventually, search will be like a direct connection between a person's brain and the entire world's information. It will grasp so much about a user and their immediate circumstance that it will often know exactly what he or she needs, perhaps even before they do.
Although technologies are being developed to help achieve that experience, there is still a long way to go. However, the search experience is expected to change dramatically over the next 3-5 years (and some actually think that Google is the last word in search!). Users will be able to access search databases from sources other than the keyboard (maybe with voice recognition technology), and on different platforms (such as the GPS in cars). Eventually, all these elements will meld together: the visible Web, the deep Web, localization, one's stored content and info about one's life. The final piece will be software that can understand what you're typing or reading and constantly look for related content. A search engine of 2010 will know who you are, where you are and what you're doing, and will look across every form of information to automatically find what will help you. Achieving that, however, will require major advances in search technology.
This article was researched and written by Shazia Makhdumi, Associate with Outlook Ventures. Special thanks to Waqar Hasan (dbWizards), Micki Seibel and James Speer (AskJeeves), Gene Feroglia (Otopy) and Ashok Chandra (ex-Verity).