How To Tell Stuff To A Computer

Any discussion of KR, and of the role science will play in the future of computerized knowledge, first needs to make a convincing argument that new advances are still possible and that the current tools can't handle the job. In other words, can the Guys in the Garage and the Writers already solve all the interesting problems in developing information applications on their own? Do we really need science to improve our software? I believe that the methods used successfully by the Guys in the Garage in the past can't continue to work in the future. To see why, let's look at two popular ways software developers represent information, RDBMSes (Relational Database Management Systems) and XML, and see what they teach us about the current state of computerized knowledge.

Relational databases

Most of the public first became aware of the effect the computer revolution was having on their lives when they encountered computers at their local bank or travel agent, used to look up an account or to plan a trip efficiently. This revolution was driven in large part by the advent of the relational database model.

A relational database system allows data in tables, typically with rows of a fixed length, to be searched and cross-referenced efficiently. Having tables containing different kinds of data that can be queried against each other may sound like a simple task, but it turns out to be very difficult when dealing with very large tables. The breakthrough came when computer scientists began to represent data in a way that could be queried using formal mathematical systems such as relational algebra and tuple calculus. Because a query is expressed in this mathematical form, a complex query involving multiple tables can be rewritten into an equivalent but cheaper form and then executed efficiently. Whereas the ENIAC could only analyze census records, for instance, one at a time, this new relational database design made it possible to analyze data in an arbitrary order and allowed easy cross-referencing.

Here is an example of how knowledge is typically stored in a relational database:


   TABLE CUSTOMERS

          Customer#   Name       Phone
   ROW           12   Bob Smith  772-4500
   ROW           13   MegaCorp   971-0504

   TABLE WIDGET_ORDERS

         Order#   Widget Type    Customer# Supplier#
   ROW      100   Yellow         12        1033
   ROW      101   Red            12        2022
   ROW      102   Blue           13        1033

   TABLE SUPPLIERS

         Supplier# Name                 Phone
   ROW   1033      Acme Widget Company  232-3450
   ROW   2022      Widgets'R'Us         122-3992

Here you see three tables, with "links" between them: each row of WIDGET_ORDERS refers to a customer through the Customer# column and to a supplier through the Supplier# column.
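To make the cross-referencing concrete, here is a minimal sketch using Python's built-in sqlite3 module. The lowercase table and column names (customer_no, order_no, and so on) are my own stand-ins for the headings above, since "#" is awkward in SQL identifiers; the data is the same.

```python
import sqlite3

# In-memory database holding the three example tables from the text.
# Column names are assumed stand-ins for Customer#, Order#, Supplier#.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_no INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE suppliers (supplier_no INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE widget_orders (
    order_no    INTEGER PRIMARY KEY,
    widget_type TEXT,
    customer_no INTEGER REFERENCES customers(customer_no),
    supplier_no INTEGER REFERENCES suppliers(supplier_no)
);
INSERT INTO customers VALUES (12, 'Bob Smith', '772-4500'), (13, 'MegaCorp', '971-0504');
INSERT INTO suppliers VALUES (1033, 'Acme Widget Company', '232-3450'),
                             (2022, 'Widgets''R''Us', '122-3992');
INSERT INTO widget_orders VALUES (100, 'Yellow', 12, 1033),
                                 (101, 'Red', 12, 2022),
                                 (102, 'Blue', 13, 1033);
""")

def orders_report(connection):
    """Cross-reference all three tables: who ordered which widget from whom."""
    query = """
        SELECT o.order_no, c.name, o.widget_type, s.name
        FROM widget_orders AS o
        JOIN customers AS c ON c.customer_no = o.customer_no
        JOIN suppliers AS s ON s.supplier_no = o.supplier_no
        ORDER BY o.order_no
    """
    return connection.execute(query).fetchall()

report = orders_report(conn)
for row in report:
    print(row)
```

The two JOIN clauses are the "links": each one matches a number stored in an order row against the unique number of a row in another table.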

There are several common idioms that are adhered to when designing a relational database system. First of all, every row in a table will typically be given a unique number (in this case Customer#, Order#, and Supplier#). This number is typically a kind of "throw away" number that is never, ever shown to the user. If the user of the program needs to be able to look at an order number, this number is usually created separately with a special name (like ExternalOrder#, maybe), to distinguish it from the unique internal Order#. Why is this? Simply because relational database systems depend so heavily on these numbers to keep track of linkages between tables that the numbers may conflict with things the user wants to do with an Order#. For instance, a user may want to change the Order# or synchronize Order#s between databases, something that cannot easily be allowed if the Order# functions as a link to other tables.
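A small sketch of this idiom, again with sqlite3. The shipments table and the external_order_no column are hypothetical names I made up for illustration; the point is only that the user-visible number can change while the internal one keeps the link intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widget_orders (
    order_no          INTEGER PRIMARY KEY,   -- internal key, never shown to users
    external_order_no TEXT UNIQUE,           -- the number users see and may change
    widget_type       TEXT
);
CREATE TABLE shipments (                     -- hypothetical table that links to orders
    shipment_no INTEGER PRIMARY KEY,
    order_no    INTEGER REFERENCES widget_orders(order_no)
);
INSERT INTO widget_orders VALUES (100, 'A-0001', 'Yellow');
INSERT INTO shipments VALUES (1, 100);
""")

# The user renumbers the order; only the user-facing column changes.
conn.execute(
    "UPDATE widget_orders SET external_order_no = ? WHERE external_order_no = ?",
    ("B-2024", "A-0001"),
)

# The shipment still finds its order through the untouched internal key.
linked = conn.execute("""
    SELECT s.shipment_no, o.external_order_no
    FROM shipments AS s JOIN widget_orders AS o ON o.order_no = s.order_no
""").fetchall()
print(linked)
```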

Another common idiom is to never have the same piece of data duplicated in several places in the database. Data duplicated in several places is called "denormalized" data. By using a method called "database normalization", such duplicates are usually removed by creating new tables specifically to hold the duplicated data once. In the example above, for instance, a supplier's name and phone number live only in SUPPLIERS rather than being repeated on every order.
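Here is a rough sketch of what normalization does, written in plain Python rather than SQL. The flat_orders list is an assumed "before" picture that I invented by copying the supplier details onto each order; the function splits them back out.

```python
# Denormalized rows: the supplier's name and phone are repeated on every order.
flat_orders = [
    {"order_no": 100, "widget": "Yellow", "supplier_no": 1033,
     "supplier_name": "Acme Widget Company", "supplier_phone": "232-3450"},
    {"order_no": 101, "widget": "Red", "supplier_no": 2022,
     "supplier_name": "Widgets'R'Us", "supplier_phone": "122-3992"},
    {"order_no": 102, "widget": "Blue", "supplier_no": 1033,
     "supplier_name": "Acme Widget Company", "supplier_phone": "232-3450"},
]

def normalize(rows):
    """Split repeated supplier details into their own table, keyed by supplier_no."""
    suppliers = {}
    orders = []
    for row in rows:
        suppliers[row["supplier_no"]] = {
            "name": row["supplier_name"],
            "phone": row["supplier_phone"],
        }
        orders.append({
            "order_no": row["order_no"],
            "widget": row["widget"],
            "supplier_no": row["supplier_no"],
        })
    return orders, suppliers

orders, suppliers = normalize(flat_orders)
print(len(suppliers), "supplier rows for", len(orders), "orders")
```

After the split, changing Acme's phone number means touching one row instead of hunting down every order that mentions it.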

Relational database systems turned out to be perfect for business applications, for several reasons:

Companies operate under the principles of a business model, which typically dictates that the company sells only a fixed number of different services or products: whether it's a McDonald's or Pfizer, a business limits itself in the products it offers in order to maintain consistency and benefit from efficiencies of scale. This "fixedness" of business data allows it to be comfortably mapped onto the fixed table rows that a relational database requires.
The relational database model allows data to be stored relatively efficiently, because the data that is stored is free of "metadata", which is a term for data that describes other data. The fixed length of the fields allows the program to "know" what each piece of data means from its position alone, without needing extra descriptive information to say "what" each piece of data is. Since large companies tend to have incredibly large amounts of data in their databases, and since computer storage in the 60s, 70s, and 80s was very expensive, this made relational models ideally suited for business.
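The "no metadata" point can be sketched with Python's struct module. The layout below (a 4-byte number, a 20-byte name, an 8-byte phone) is purely an assumption for illustration, not how any particular database stores its rows.

```python
import struct

# Hypothetical fixed layout: 4-byte unsigned customer number, 20-byte name, 8-byte phone.
# "<" means little-endian with no padding between fields.
RECORD = struct.Struct("<I20s8s")

def pack_customer(number, name, phone):
    """Encode one row; struct pads short strings with zero bytes to the fixed width."""
    return RECORD.pack(number, name.encode("ascii"), phone.encode("ascii"))

def unpack_customer(raw):
    """Decode one row; meaning comes purely from byte position, not from labels."""
    number, name, phone = RECORD.unpack(raw)
    return number, name.rstrip(b"\x00").decode("ascii"), phone.rstrip(b"\x00").decode("ascii")

table = pack_customer(12, "Bob Smith", "772-4500") + pack_customer(13, "MegaCorp", "971-0504")

# Because every row has the same size, row n starts at byte n * RECORD.size.
second = unpack_customer(table[RECORD.size:2 * RECORD.size])
print(RECORD.size, second)
```

Nothing in the stored bytes says "this is a phone number"; the program knows only because the phone always sits at the same offset in every row.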

XML and other Markup languages

What is the least amount of "structure" data has to have in order to describe an arbitrarily complicated piece of data in an "understandable" fashion? One answer is to break the data up in a tree-like manner: each branch is enclosed in a pair of parentheses, the first item in the branch names its type, and the rest of the branch holds the actual data (or further branches):

(customer "Bob Smith" "772-4500"
          (widget "yellow" (supplier# 1033))
          (widget "red" (supplier# 2022)))
(customer "MegaCorp" "971-0504"
          (widget "blue" (supplier# 1033)))
(supplier "Acme Widget Company" "232-3450" (supplier# 1033))
(supplier "Widgets'R'Us" "122-3992" (supplier# 2022))

In this case, the "types" for the branches are "supplier", "customer", and "widget". As you can see, the data is represented as a tree instead of a set of tables. Since related pieces of data are stored together in the same place, we need fewer weird "unique ids" to link data (only the supplier# is still needed, because one supplier is shared by several widgets). Additionally, the layout of the data is natural for a human to understand, since items that are related to each other are not in completely different places, as they would be in a relational database system. This type of format is structurally very similar to HTML, which was a central part of the internet revolution. The purpose of HTML (which is the language of Web pages) was to allow the representation of arbitrarily complicated book/magazine/newspaper-like documents, containing different text styles, pictures, tables and other layouts that human authors prefer for organizing information for other humans to read. In HTML, however, the type information is denoted by putting it in angle brackets, so a piece of text that is meant to be in bold has <B> in front of it and </B> at the end of it.
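Part of the appeal of the parenthesized format is how little machinery it takes to read it. The following is a minimal sketch of a reader in Python, not a full Lisp parser: the parse function and its token rules are my own simplification, and it assumes strings never contain a double quote.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare symbols such as supplier#.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def parse(text):
    """Parse parenthesized text into nested Python lists (one list per branch)."""
    stack = [[]]                      # stack[0] collects the top-level branches
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])          # open a new branch
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            branch = stack.pop()      # close the branch and attach it to its parent
            stack[-1].append(branch)
        elif tok.startswith('"'):
            stack[-1].append(tok[1:-1])   # string data, quotes stripped
        elif tok.isdigit():
            stack[-1].append(int(tok))
        else:
            stack[-1].append(tok)     # a symbol, e.g. the branch type
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]

tree = parse('''
(customer "MegaCorp" "971-0504"
          (widget "blue" (supplier# 1033)))
''')
branch = tree[0]
print(branch[0], branch[1:])
```

The first element of each nested list is the branch type and everything after it is data, which is exactly the convention described above.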

A more general sibling of HTML is XML, which takes the ideas of HTML and applies them to any type of data, not just web-like documents. For instance, the same data described above can be represented in XML as follows:

<customer>Bob Smith <phone>772-4500</phone>
                                <WIDGET>yellow <supplier#>1033</supplier#> </WIDGET>
                                <WIDGET>red <supplier#>2022</supplier#></WIDGET>
</customer>
<customer>MegaCorp <phone>971-0504</phone>
                                <WIDGET>blue<supplier#>1033</supplier#></WIDGET>
</customer>
<supplier>Acme Widget Company<phone>232-3450</phone><supplier#>1033</supplier#></supplier>
<supplier>Widgets'R'Us<phone>122-3992</phone><supplier#>2022</supplier#></supplier>

As you can see, this format is analogous to the parenthesized "symbolic expression" (S-expression) format shown earlier, which was invented back at the dawn of the A.I. era. (XML, however, does add innumerable extra flourishes such as header info linking to extra descriptive files called DTDs, foreign character support through Unicode, support for namespaces, internal linkages, etc.) These newer data formats, in general called "markup languages", comprise the second revolution in knowledge representation described in this primer.
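For comparison, here is a minimal sketch of reading the same kind of data with Python's standard xml.etree.ElementTree module. Two things are my own assumptions rather than part of the example above: I wrapped everything in a single orders root element, and I renamed supplier# to supplier_no, because real XML parsers require one root element and do not accept "#" in a tag name.

```python
import xml.etree.ElementTree as ET

# Assumptions: a single <orders> root is added, and supplier# becomes supplier_no,
# since XML requires one root element and does not allow '#' in tag names.
DOC = """
<orders>
  <customer>Bob Smith <phone>772-4500</phone>
    <widget>yellow <supplier_no>1033</supplier_no></widget>
    <widget>red <supplier_no>2022</supplier_no></widget>
  </customer>
  <customer>MegaCorp <phone>971-0504</phone>
    <widget>blue <supplier_no>1033</supplier_no></widget>
  </customer>
</orders>
"""

def widgets_by_customer(xml_text):
    """Walk the tree and collect (customer, widget colour, supplier number) triples."""
    root = ET.fromstring(xml_text)
    found = []
    for customer in root.findall("customer"):
        name = (customer.text or "").strip()      # text before the first child tag
        for widget in customer.findall("widget"):
            colour = (widget.text or "").strip()
            supplier = int(widget.findtext("supplier_no"))
            found.append((name, colour, supplier))
    return found

triples = widgets_by_customer(DOC)
print(triples)
```

Notice that the widget belongs to its customer simply by being nested inside it; no Customer# column is needed to make that link.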

Now that we've looked briefly at XML, let's look at some of the things that these knowledge representation systems can't handle very well.

What RDBMSes, XML, and Other Commonly Used KR Systems Can't Do Well

Since I work in the medical software field, I am exposed every day to the limitations imposed on medical informatics by the constraints of these common representation methods. It may be surprising to an outsider that these limitations are so hard to overcome. After all, isn't medicine just "another business"? Wouldn't the same representational systems that proved so effective for business software work equally well in medicine? I believe the answer is no.

The fact is that medicine is qualitatively different from any regular business, and I don't just mean this in a mushy "because it involves human lives" kind of way (although that's a good reason too). It really is something different. Remember what we said about typical businesses: because they have a well-defined, finite business model, it is relatively straightforward to translate the rules of the business into a computer information system.

However, in medicine there can really be no such business model: a patient may walk into a clinic with diabetes, osteoporosis, hypertension, or any other medical condition, and the clinic needs to be able to address it. Every human disease has quirks that require many unique data representations that are hard to fit neatly into tables. Information involving patients is therefore very difficult to store in fixed-length database rows. Although this limitation can always be overcome by adding additional tables for new types of data, this strategy eventually becomes unwieldy: after all, medicine is filled with innumerable exceptions and idiosyncrasies.
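To give a feel for why this gets unwieldy, here is a rough sketch in which the schema grows whenever a new kind of finding turns up. The patient_notes table, the record function, and the finding names are all hypothetical, and growing a table by columns is just one stand-in for "adding more structure"; a real system would differ in many ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient_notes (patient_no INTEGER)")

def record(connection, patient_no, **findings):
    """Store findings, adding a column for any finding type not seen before."""
    existing = {row[1] for row in connection.execute("PRAGMA table_info(patient_notes)")}
    for column in findings:
        if not column.isidentifier():
            raise ValueError(f"unsafe column name: {column!r}")
        if column not in existing:
            # The schema itself changes at data-entry time.
            connection.execute(f'ALTER TABLE patient_notes ADD COLUMN "{column}" TEXT')
    names = ["patient_no", *findings]
    marks = ", ".join("?" for _ in names)
    quoted = ", ".join(f'"{n}"' for n in names)
    connection.execute(
        f"INSERT INTO patient_notes ({quoted}) VALUES ({marks})",
        [patient_no, *findings.values()],
    )

record(conn, 1, blood_glucose="high")
record(conn, 2, bone_density="low", blood_pressure="150/95")

columns = [row[1] for row in conn.execute("PRAGMA table_info(patient_notes)")]
rows = conn.execute("SELECT * FROM patient_notes ORDER BY patient_no").fetchall()
print(columns)
print(rows)
```

After only two patients the table already has three condition-specific columns, and most cells are empty. Multiply that by every quirk of every disease and the "fixed row" stops being a comfortable fit.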

Because of the empirical nature of medicine, it is very difficult for software developers to anticipate, and encode inside a software application, all the structured parts of medical practice. Yet anticipating the structure is exactly how business software is typically designed: the operator of the program can create structured data only at the points the developers provided for (such as pressing a button to add a new widget to the database, for instance). An end user would never, for instance, create a new RDBMS database column or a new XML tag, because these are too unwieldy for a domain expert without computer expertise to work with directly.

But since medicine is so unpredictable and not driven by a pre-determined structure, this is exactly what would be needed for a truly powerful medical system: The doctor would need to be able to enter fully structured information directly into the system in a manner that cannot be predicted ahead of time by any software developers.

In a way, an ideal medical software entry system would need to allow a clinician to create new RDBMS tables or new XML tags on the fly, something that is not practical with the current tools available to the Guys in the Garage. However, some of the concepts developed by The Scientist, which we will be discussing in the next few sections, are able to offer some possible solutions to this dilemma. This is why I think science needs to become more critical in solving the many remaining IT software problems than it has been in the past.
