
The Secret World of Advanced Search Technology


The third futuristic knowledge-representation technology is centered around the needs of The Writer. It consists of the (most probably still secret) knowledge representation research done at major internet search providers, especially Google. This technology is singlemindedly concerned with organizing information in sophisticated ways, but remains highly focused on placing little or no demand on the author to fully encode the information he/she is generating.

This singleminded focus is obvious if one looks at the history of search engines in the internet era: Early providers, such as Excite, AltaVista, Northern Light, etc., all had a grand vision of building indexing systems that understood what was being indexed. I suspect that the main reason the search results in these early systems were so disappointing was that the research in these companies was driven by academics who were too future-oriented and were not focused enough on generating high-quality search results from more simple-minded designs.




Once Google came into use, they obliterated these lofty researchers with a brilliant Guy-In-The-Garage search philosophy: "We'll just show you pages that have exactly the words you asked for- No fancy guessing. Plus, we'll sort the results based on site popularity." So far, Google has managed to maintain a perfect balance between pragmatism and science in their technology- By treating web pages mostly as just strings of words, they were able to remain pragmatic and outperform other search engines by being more comprehensive and consistent than their competitors. But now they have the leisure to concentrate on more advanced knowledge systems- That's why it is fairly likely that they (and other survivors of the search wars, like Yahoo and MSN) are returning to more exotic KR systems.
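
To make this philosophy concrete, here is a tiny Python sketch of an exact-word-match engine that sorts purely by a popularity score. The pages, addresses, and scores are invented for illustration- This is only a caricature of the idea, not Google's actual code:

    # A minimal sketch of "match the exact words, then sort by popularity".
    # The page texts and popularity scores below are made-up examples.
    pages = {
        "apple-hardware.example.com": ("apple computer hardware reviews", 0.9),
        "apple-recipes.example.com": ("apple pie baking tips and recipes", 0.4),
        "fruit-stand.example.com": ("fresh apples oranges and plums", 0.2),
    }

    def search(query):
        """Return pages containing every query word, sorted by popularity."""
        words = query.lower().split()
        hits = [(url, popularity)
                for url, (text, popularity) in pages.items()
                if all(w in text.split() for w in words)]
        # No guessing about meaning: the ranking is purely the popularity score.
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    print(search("apple"))      # both "apple" pages, most popular first
    print(search("apple pie"))  # only the page containing both exact words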

At the most basic level, the main problems of a standard search engine revolve around synonyms and homonyms. For instance, if I perform an internet search for the term "Apple", there is no way for the computer system to know whether I intend to find information about the fruit or about the company Apple Computers, as these are homonyms- Two different meanings for the same word. Even if I search in a way that avoids this ambiguity (by searching for "Apple Computers" using quote marks, for instance) it is likely that many of the web pages that refer to the computer company will use only the vaguer word 'Apple' and will not be captured by my query. Synonyms (two different words that mean the same thing) pose similar problems.
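
A small example makes the homonym problem concrete. In the Python sketch below (the documents are made up), a plain keyword search for "apple" mixes both senses together, while a quoted phrase search for "apple computers" misses a page that is clearly about the company:

    # A minimal sketch of how homonyms foil plain keyword search. The documents
    # are invented; the point is that the bare word "apple" gives the engine no
    # way to tell the fruit from the company.
    docs = {
        "doc1": "our apple orchard sells fresh apple cider every fall",
        "doc2": "apple announced a new laptop at its computer hardware event",
        "doc3": "buy refurbished apple computers at a discount",
    }

    def keyword_search(query):
        words = query.lower().split()
        return [name for name, text in docs.items()
                if all(w in text.split() for w in words)]

    def phrase_search(phrase):
        return [name for name, text in docs.items() if phrase.lower() in text]

    print(keyword_search("apple"))           # all three documents: both senses mixed
    print(phrase_search("apple computers"))  # only doc3; doc2 is missed even though
                                             # it is about the company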

As I already indicated, Google has always been a simple brute-force text searching engine at its heart (augmented with some clever ranking technology) that can be easily foiled by synonyms and homonyms... but do we know this for sure? Since the guys/gals at Google are reportedly pretty smart, can we find any evidence yet that there is already some more advanced technology under the hood of the Google search engine?

I decided to run a simple test to see if I could "peek under the hood" of Google with the hope of finding some evidence of more intelligent textual analysis: Is the Google engine, perchance, able to determine the context of a page in an automated fashion to disambiguate homonyms, for example?

To test this idea, I placed two small hyperlinks at the bottom of my website www.lisperati.com that pointed to two carefully designed pages, containing two small fragments of text:

Page #1:

roxfumb take asparagus and grind it in a lemon peeler. Preheat oven to a million degrees, basting all turkeys as you dice an egg. Apple slices, plums, oranges and other fruit make great spices. Spread butter on a can of soda and slice it into generous figs. After you have finished grilling, eat your pie with a warm glass of beer! You can find great recipies on your computer.

Page #2:

roxfumb the USB standard programming protocol requires file partitions in the operating system. Log in to the screen saver. Apple, IBM, and Microsoft search infrastructure network together seamlessly. The word processor has Font kerning issues. The spreadsheet program has macro support and pie charts for circuit analysis. Use a Pentium processor to power your computer.

As you can see, these text fragments are basically gibberish without any discernible meaning. However, any human can immediately discern that the first fragment primarily consists of instructions for cooking food, whereas the second fragment is a discussion on computer technology. Also, both pages contain a nonsense word, roxfumb, which is unique throughout the entire internet (at least prior to publication of this website :) as well as three identical vocabulary words: "apple", "computer", and "pie", in roughly identical positions in the text.

So here is my question: Once Google has incorporated these pages in its index, how would it behave on the following two queries:

Query #1:

apple computer roxfumb

Query #2:

apple pie roxfumb

Since both pages contain all these words, both pages would be retrieved by these queries. However, if Google were "smart", it should return the technology-themed page as the top-ranked result on Query #1 (as this most likely refers to Apple Computers) and the food-themed page as the top-ranked result on Query #2 (since this most likely refers to the fruit). When I say "smart", what I mean is that a human researcher, when presented with the same information, would clearly rank results based on which page makes the most sense, given the context.

My hypothesis was that Google would indeed obey these rules when ranking its results. The null hypothesis would be that it would treat both queries as identical and return the same results for each, as one would expect from a brute-force text search.
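
To show what obeying these rules might look like mechanically, here is a Python sketch of a context-aware ranker. The two tiny topic word lists and the abbreviated page texts are my own assumptions, meant only to illustrate the kind of disambiguation a "smart" engine could perform- It is not a claim about how Google actually works:

    # A minimal sketch of context-aware ranking. The topic word lists and the
    # abbreviated page texts are assumptions made up for this illustration.
    FOOD_WORDS = {"asparagus", "oven", "turkeys", "egg", "fruit", "butter",
                  "figs", "grilling", "beer", "recipes", "pie"}
    TECH_WORDS = {"usb", "protocol", "ibm", "microsoft", "network", "processor",
                  "font", "spreadsheet", "macro", "pentium", "computer"}

    def topic_score(words, lexicon):
        return len(set(words) & lexicon)

    def rank(query, pages):
        qwords = query.lower().split()
        q_food = topic_score(qwords, FOOD_WORDS)  # which topic does the query lean toward?
        q_tech = topic_score(qwords, TECH_WORDS)
        scored = []
        for name, text in pages.items():
            pwords = text.lower().replace(",", " ").replace(".", " ").split()
            if all(w in pwords for w in qwords):  # brute-force word match comes first
                # Context bonus: agreement between the query's topic and the page's topic.
                bonus = (q_food * topic_score(pwords, FOOD_WORDS)
                         + q_tech * topic_score(pwords, TECH_WORDS))
                scored.append((name, bonus))
        return sorted(scored, key=lambda hit: hit[1], reverse=True)

    pages = {
        "page1_food": "roxfumb take asparagus ... apple slices ... eat your pie "
                      "... recipes on your computer",
        "page2_tech": "roxfumb the usb standard ... apple ibm and microsoft "
                      "... pie charts ... pentium processor to power your computer",
    }

    print(rank("apple computer roxfumb", pages))  # technology page ranks first
    print(rank("apple pie roxfumb", pages))       # food page ranks first

A real engine would obviously need something far more statistical than two hand-made word lists, but the ranking behavior is the one my hypothesis predicted.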

Below are the results of the test (after the pages had been published for 6 months):

[Screenshots of the Google results for the two queries appeared here.]

My hypothesis proved incorrect: In this (admittedly rather feeble) attempt to discover more sophisticated analysis within the Google engine, none was found. Since you cannot, of course, prove a negative, this example doesn't mean that such technology doesn't exist within Google in some instances- Possibly, someone who makes a more sophisticated attempt might be able to find out more.

So, if it is the case that Google can perform semantic analysis of unstructured information, it is not yet easily discernible in their public products. I find it very likely, however, that inside the Google/Yahoo/MSN skunkworks such technology is indeed being developed- Quite possibly, they are developing systems similar to symbolic A.I. systems, able to extract description logics or other highly structured knowledge representations out of raw unstructured text- At some point, it is in the interest of these companies to be able to extract structure out of raw text... and the sheer volume of data on the internet makes it impossible for simpler, more pragmatic methods that require some human involvement to accomplish this by themselves.
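
As an illustration of what extracting structure out of raw text could mean at its very simplest, here is a Python sketch that pulls crude is-a triples out of free text using a single pattern. It is purely my own toy example, not a description of any real system at these companies:

    # A minimal sketch of turning raw text into structured facts: one naive
    # "X is a Y" pattern converted into subject/relation/object triples, the
    # kind of structure a description-logic or RDF system could reason over.
    import re

    TEXT = ("Apple is a technology company. Asparagus is a vegetable. "
            "A Pentium is a processor.")

    PATTERN = re.compile(r"(?:An?\s+)?([A-Z][a-z]+)\s+is\s+an?\s+([a-z]+(?:\s+[a-z]+)?)")

    def extract_triples(text):
        """Return (subject, 'is-a', object) triples found by the naive pattern."""
        return [(subject, "is-a", obj) for subject, obj in PATTERN.findall(text)]

    for triple in extract_triples(TEXT):
        print(triple)
    # ('Apple', 'is-a', 'technology company')
    # ('Asparagus', 'is-a', 'vegetable')
    # ('Pentium', 'is-a', 'processor')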


