Monday, November 10, 2003
By Paul Ford
Clay Shirky, a well-regarded thinker on the social and economic effects of Internet technologies, has published an essay called “The Semantic Web, Syllogism, and Worldview,” a critical appraisal which claims, in essence, that the Semantic Web is a technological pipe dream: an over-specified solution in search of a problem.
As someone who has spent long hours attempting to fathom the standards which define the Semantic Web (see August 2009: How Google beat Amazon and Ebay to the Semantic Web and Web Pidgin), I can empathize with Shirky's frustration, particularly his frustration with the loftier claims of the Semantic Web evangelists.
That said, I believe that there is much of value in the Semantic Web framework which can be applied to real-world problems, and I find many of Shirky's arguments to be misguided attacks against Semantic Web straw men.
After summarizing several claims made concerning the Semantic Web by various parties, Shirky defines the Semantic Web as “a machine for creating syllogisms.” This is an over-simplification. The Semantic Web cannot “create,” any more than the current Web can create. Humans create data, and computer programs may process that data in order to create new data, but to assign agency to the Semantic Web is a mistake. Nor is the Semantic Web associated with any pre-defined process, so it is false to call it a “machine.”
Most notably, nearly all automated reasoning on the Semantic Web is accomplished by means other than syllogistic reasoning, which has hardly been used since Descartes. In languages based on first-order predicate logic, the method used is usually resolution reasoning (as in Prolog), and in description logics, like OWL, it is tableau reasoning.
In his opening paragraph, Shirky links to, and subsequently dismisses, the World Wide Web Consortium's definition of the Semantic Web. That definition, in full, is:
The Semantic Web is the representation of data on the World Wide Web. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.
I think the phrase “representation of data on the World Wide Web” is confusing, and the focus on XML puts the emphasis on specific syntax, which is misleading. A simpler way to say it might be:
The Semantic Web is a framework that rigidly defines a means for creating statements of the form “Subject, Predicate, Object” or “triples,” in a machine-readable format, where each of Subject, Predicate, Object is a URI.
The means of “rigid definition” is a series of standards published by the World Wide Web Consortium, namely RDF, RDF Schema, and OWL.
Shirky writes, “Despite their appealing simplicity, syllogisms don't work well in the real world, because most of the data we use is not amenable to such effortless recombination. As a result, the Semantic Web will not be very useful either.”
It is true that very few of us, before we kiss our lovers, or mothers, say things like “I only like to kiss living women; this woman is alive; therefore I shall kiss her.” Much of life is lived in a place that is not easily captured by first-order predicate logic. But logical reasoning does work well in the real world. It's just not identified as such, because it often appears in mundane places, like library card catalogs and book indices, and because we've been trained to automatically deduce certain assumptions from signifiers which do not much resemble the (S,P,O) form.
Let's say you're reading the book Defenders of the Truth, which is sociologist Ullica Segerstråle's intellectual history of the debate over sociobiology. Interested in finding what the book has to say about Stephen Jay Gould, you turn to the index, and find:
Gould, S.J.
- and adaptationism 117-18
- and Darwinian Fundamentalists 328
- and Dawkins 129-31
- Ever Since Darwin (1978) 118
- and IQ testing 229-31
- Marxism 195, 226
- unit of selection dispute 129
(p 482, example much abridged)
If you are an experienced reader with some knowledge of the field of sociobiology, you can make a variety of deductions using the index. Take the third item, “and Dawkins 129-31.” Looking at this statement, and drawing on your memory, you could deduce:
|Dawkins||Is a synonym for||Richard Dawkins|
|Stephen Jay Gould||Interacted in some way with||Richard Dawkins|
|Information on Stephen Jay Gould's interactions with Richard Dawkins||Can be found on||Pages 129, 130, and 131|
|Pages 129, 130, and 131||Are found in||The book Defenders of the Truth|
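Written as data, those four deductions might look like the sketch below. It is only an illustration: plain strings stand in for the URIs a real Semantic Web system would use, and the predicate names (`synonym_for`, `found_on`, and so on) are invented here and come from no standard vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement of the form (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical predicate names; strings stand in for URIs.
index_facts = [
    Triple("Dawkins", "synonym_for", "Richard Dawkins"),
    Triple("Stephen Jay Gould", "interacted_with", "Richard Dawkins"),
    Triple("Gould-Dawkins interactions", "found_on", "pages 129-131"),
    Triple("pages 129-131", "found_in", "Defenders of the Truth"),
]


def about(subject, facts):
    """Return every fact whose subject matches exactly."""
    return [fact for fact in facts if fact.subject == subject]


for fact in about("Stephen Jay Gould", index_facts):
    print(fact.subject, fact.predicate, fact.obj)
```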
And so forth. Internally, you wouldn't go to so much effort; if you had to think using predicate logic, it'd be hard to get out of the house in the mornings. As Shirky writes:
When we have to make a decision based on this information, we guess, extrapolate, intuit, we do what we did last time, we do what we think our friends would do or what Jesus or Joan Jett would have done, we do all of those things and more, but we almost never use actual deductive logic.
And this is true, if you say, as Shirky seems to, that “deductive logic” is the conscious explication of logical facts which lead, via syllogistic reasoning, to a logically valid conclusion. But come back to our book index: good indexes are the product of quite a bit of craft and expertise, and the result of quite a bit of logical thinking. Professional indexers take a long block of narrative text (a book), identify the subjects the text describes (Stephen Jay Gould, Richard Dawkins), formalize the names for those topics (Gould, S.J.; Dawkins, R.), and then specify narrower subjects which relate to that subject, cross-indexing items where they feel it will be valuable to the inquisitive reader.
As such a reader, I follow a very logical process in deciding which page to turn to. When I look up Stephen Jay Gould, the index returns a list of sub-topics related to that subject. I then choose none, one, or several of those sub-topics, seeking those which are relevant to my interests, and turn to the page which corresponds to the given sub-topic. The book presents data using a formal semantics: in this case the semantics of index structure and typography, much of which is reliant on alphabetization. I establish a goal, seek within the index's data set, and refine my goal based on my preliminary search results, then obtain my result in the form of a page number. I perform these acts in a given sequence, a sequence to which I've become accustomed over time.
Indexes only work because human beings are comfortable with the logical conventions of a book. We are taught how to use books in elementary school, learning about alphabetization, indices, and how to find words in a dictionary. As we advance as readers, we come to understand concepts like sections, headers, page numbers, and topics. None of these things have any meaning unto themselves; rather, we learn to interpret their semantics. The large, bold type that appears after a blank page means we have come into a new chapter. The little number after a line corresponds to a footnote. The text bounded by squiggles is a quotation.
Indexes are a kind of taxonomy, a classification of the ideas in the book into topics and subtopics, and indexes work because readers can be counted on to understand their layout and function, to interpret the symbols contained in an index to mean “if I am interested in Stephen Jay Gould's ideas regarding adaptationism, I should turn to page 117.” In order to function, they are very dependent on a human being's ability to perform acts of reasoning. They lend themselves to deduction: from reading the index above, I can deduce that Stephen Jay Gould had an opinion on adaptationism, IQ testing, the unit of selection dispute, and Marxism, had some kind of relationship with Dawkins and/or Dawkins's ideas regarding sociobiology, and is somehow involved with the book Ever Since Darwin, which was published in 1978.
Imagine that 30 or 40 books about sociobiology are available on the web. Since these books cover similar subject matter, it would be ideal if they could all be indexed together: that is, rather than have multiple indices spread across dozens of books, why not have a single index for all of them? For a serious researcher, this would be extremely useful. And in fact, indices of periodical literature, a staple of the library orientation required of most college freshmen, perform this exact function: they take a large number of the journals and magazines published over time and create a master index, so that, were you looking for information on Stephen Jay Gould, you could look up his name and find all of the articles that discuss him or his work that were published during the span of time covered by the index.
Those multi-volume meta-indices have real precedents: prior to the age of the CD-ROM and the Internet, the best way to distribute periodical reviews of literature was to issue annual volumes, and the serious scholar would go year by year, looking up her topic. Occasionally expensive volumes appear with annotated bibliographies covering a single topic. I once helped a librarian at my alma mater format his bibliography on Iceland, which described hundreds of books on that country, organized into chapters that covered history, geography, political systems, and so on. But now, given the ubiquity of computers and the ease with which large databases can be created, such works are increasingly being created and distributed digitally. No longer is the index of periodical literature divided by year, as the constraints of book technology required; rather, a search returns results for perhaps 100 years, organized by date.
To put this data into a computer, developers had to translate the conventions of typography into a set of semantic boundaries that corresponded to the meaning of the datum in question. It's not enough to simply italicize a book title and assume that the computer can make sense of it. A computer is an unsubtle beast, and needs help knowing how to sort data so that when it is asked a question, it can make sense of it and answer in a meaningful way. So you create a BOOK database with fields like TITLE, AUTHOR, PUBLICATION_DATE, SUBJECT, and so forth, and put your book data in there. Now you can state a question, specified in a database query language such as SQL, that asks, in effect: “show me all the books that have the subject of 'Iceland',” and the computer will produce a list of matches as a result.
What do you put in the AUTHOR field, though? If you have an author named “John Smith,” you'll run into a problem: there is likely to be more than one author named John Smith. So instead, you make a totally different table called AUTHOR and you put John Smith in there, and the system creates a unique identifier for him, usually a number like “103.” Now, in the AUTHOR field of the BOOK database, you insert that number instead of his name. Then, when you have another John Smith to add, you create a new record for him, and receive another number, and use that in the BOOK database. Now, when you're looking at the record for the book about Iceland written by John Smith, and you say “show me all the books by the author who wrote this book,” the system doesn't simply go out and get any book written by any John Smith; it only gets those written by the John Smith who wrote the book on Iceland. While you're at it, you should create a table called SUBJECT, and give Iceland a special ID. Because if suddenly there's a pop group called Iceland, and books are written about that pop group, you'll have the same problem you had with John Smith.
This is, more or less, how relational databases work, and if you work out from what I've described, adding many layers of complexity, you'll arrive at web sites like Amazon.com or eBay.com, which store all of their information this way.
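The BOOK, AUTHOR, and SUBJECT arrangement described above can be tried out with SQLite, which ships with Python. This is a minimal sketch, not a real catalog schema: the book titles, the second author's ID (104), and the column names are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id),
        subject_id INTEGER NOT NULL REFERENCES subject(id)
    );
    -- Two different people, and two different things, share each name.
    INSERT INTO author (id, name) VALUES (103, 'John Smith'), (104, 'John Smith');
    INSERT INTO subject (id, name) VALUES (1, 'Iceland'), (2, 'Iceland');
    -- Invented titles: two about the country, one about the pop group.
    INSERT INTO book (title, author_id, subject_id) VALUES
        ('A History of Iceland', 103, 1),
        ('Icelandic Geography', 103, 1),
        ('Iceland: The Band on Tour', 104, 2);
""")


def books_by_same_author(title):
    """Titles by whoever wrote `title`, matched on author ID rather than name."""
    rows = conn.execute(
        """
        SELECT other.title
        FROM book AS this
        JOIN book AS other ON other.author_id = this.author_id
        WHERE this.title = ?
        ORDER BY other.title
        """,
        (title,),
    ).fetchall()
    return [row[0] for row in rows]


print(books_by_same_author("A History of Iceland"))
```

Because the join runs on `author_id`, the book by the other John Smith never shows up, even though the names are identical.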
These database-backed sites operate on the principle that human beings are capable of logical reasoning, and they use basic web links to express logical relationships between different resources (namely, web pages). When you go to Amazon, you are presented with a search box. Enter the word Ulysses into the search box, and you will see 8590 results. Too many. However, you know that Ulysses is a book, so you narrow your search to “Books,” then search for Ulysses, and see a much more manageable list of three items.
If you click on the first result, you're taken to a web page that tells you that Ulysses is by James Joyce. Clicking on the link to Joyce gives you another list, of the books written by him.
In this sequence of events, you've made quite a few assumptions which are logical in nature:
Searching for “Ulysses” returns 8590 results, which is too many, and a limited search will return fewer results:
- Amazon allows me to limit my searches to “Books.”
- Ulysses is a book.
- Therefore, I should limit my search to books.
I would like to know more about James Joyce:
- The Ulysses page contains the text “by James Joyce.”
- Links to authors on Amazon provide a list of the works written by that author.
- Therefore, if I click on the link I will see a list of the works written by James Joyce.
I had to learn these processes when I first started using Amazon, just as I had to learn how to use book indices. I trusted that Amazon was arranged according to some logical principles, and learned the semantics of its different links. If I clicked on James Joyce and was taken to a page on NASCAR racing, it would be surprising, and illogical.
Taking this a little further, I think that many links, search boxes, and other interface elements on the Web have semantics: the text of the link and the context in which it appears indicate the sort of resource to which it links. The semantics of these links are quite arbitrary and vary from site to site. One link to James Joyce on a site might show me a list of the books he wrote, but the same link on a different site might show me his biography. On eBay the same link will take me to James Joyce-related items up for auction.
As an Amazon user, I have come to understand that a link to an author shows me a listing of that author's works. That's fine if I want to find a list of works by James Joyce. Amazon sells books, and I understand that. I would be surprised if it offered me blow-up James Joyce dolls when I clicked on a James Joyce author link. But even in the domain of books, this is narrow: authors can always be subjects, so if you wanted to find the biography of James Joyce, you'd have to back up and search in a different way. In the system defined by Amazon, subjects exist in another database table from authors, and never the twain shall meet. But there's no reason why this should be. It's simply not part of Amazon's design, but it's totally feasible if you stop thinking in terms of relational databases. Instead of just a listing of books, the James Joyce page could present a list of:
- Books by James Joyce
- Books about James Joyce
- Books about the books of James Joyce
If it went deeper, I might see:
- Books that were influenced by James Joyce
- Writers who worked with or knew James Joyce
- Books that cover the same subject matter as James Joyce
It doesn't do this because the James Joyce ID is defined purely in terms of the author. In the model of the world that Amazon used to build its database:
|James Joyce||Is an||Author|
|James Joyce||Is a||Subject|
But those two James Joyces are completely different things, as far as the computer is concerned. If you've built relational databases, you'll understand how this happens: in general, a unique identifier is unique only within its own table. Every subject is unique among subjects, and every author is unique among authors.
Shirky, addressing this problem space, says that the Semantic Web does not offer an answer.
Is your "Person Name = John Smith" the same person as my "Name = John Q. Smith"? Who knows? Not the Semantic Web. The processor could "think" about this til the silicon smokes without arriving at an answer.
But the Semantic Web is designed to address this exact issue. So how would you make a complete, interlinked, data-rich James Joyce page like the one I described above? The answer is in creating a unique, independent identifier for James Joyce. You can't use numerical unique IDs, because my #205 might be a lampshade, and yours might be Genghis Khan. So you use URIs, the addresses that allow us to point to different resources on the Web. URIs give us namespaces—they give us a way to be very specific so that our chickens and our Genghis Khans don't get mixed up. For Joyce, you can create a URI like so:
http://amazon.com/authors#JamesJoyce
And if there was another James Joyce, you might call him:
That URI doesn't mean anything unto itself. Like our numeric IDs, it's just a convenient way to say “this is a unique thing, even if it is described by the same words as another thing.” James Joyce is no longer a single datum inside of a database of authors, and another datum in a database of subjects; rather, he is a free-floating, unique identifier that exists outside of any specific database, called “http://amazon.com/authors#JamesJoyce” (let's call that #JamesJoyce for short). Now take that a step further, and let's say that Ulysses has the unique identifier:
and Richard Ellmann's biography of Joyce, James Joyce, is
Now our database, instead of a table, is a set of logical statements like so:
And so on, for quite a while. Now, when we want to build our James Joyce page, instead of saying “show me all the books written by James Joyce,” our query is something like:
Make a list of all the triples where James Joyce is an object or a subject, and sort them by predicate. Then, taking each predicate in turn, perform an operation that displays something useful to the user. If you need to, go back to the database and get information on the subjects or objects, as the case may be, in the triple.
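As a rough illustration of that query, here is a sketch over an in-memory list of triples. Only the James Joyce URI comes from the text; the other URIs and the predicate names (`wrote`, `is_about`, `has_title`) are placeholders invented for this example, and a real system would use an RDF store and query language rather than a Python list.

```python
from collections import defaultdict

JOYCE = "http://amazon.com/authors#JamesJoyce"  # the URI given in the text

# Placeholder URIs, invented for this sketch.
ULYSSES = "http://example.org/books#Ulysses"
BIOGRAPHY = "http://example.org/books#JamesJoyceBiography"
ELLMANN = "http://example.org/authors#RichardEllmann"

# Hypothetical predicate names; a real vocabulary would use URIs here too.
triples = [
    (JOYCE, "wrote", ULYSSES),
    (ELLMANN, "wrote", BIOGRAPHY),
    (BIOGRAPHY, "is_about", JOYCE),
    (ULYSSES, "has_title", "Ulysses"),
]


def triples_by_predicate(resource, store):
    """Group every triple mentioning `resource` as subject or object,
    with the groups ordered alphabetically by predicate."""
    grouped = defaultdict(list)
    for s, p, o in store:
        if resource in (s, o):
            grouped[p].append((s, p, o))
    return dict(sorted(grouped.items()))


for predicate, matches in triples_by_predicate(JOYCE, triples).items():
    for s, _, o in matches:
        role = "subject" if s == JOYCE else "object"
        print(f"{predicate}: Joyce is the {role} of ({s}, {o})")
```

A page builder could then pick a display routine per predicate: a book list for `wrote`, a "books about" list for `is_about`, and so on.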
That's a lot, but in practice it's much easier to ask that question of the computer using a standard RDF query language, and it can lead to some pleasant results. Take the Chinese Room Thought Experiment page on this web site, for example. Scroll down a bit, to where it says “Links Related To The Chinese Room Thought Experiment” and take a look at everything after that. None of that content is actually part of the piece. It is culled from the small database of facts that is automatically derived from Ftrain. The links come from different parts of the site, and are automatically pulled in and sorted by date. The text at the end of the piece is created by traversing another set of triples. And after that, the list of semantic relationships is created from the same set of triples: the source, author, related subjects, place in hierarchy, and so forth, are all pulled out of a Subject-Predicate-Object database.
If, on that page, under the link “The Turing Test Doesn't Work,” you click on the link to Nova Spivack, you'll be taken to the Nova Spivack page. As you can see, the text under “The Turing Test Doesn't Work” is now highlighted, because that was the link you clicked to get to this page. If you scroll down a bit more, you'll see that Nova Spivack is a human being. Clicking on that link brings you to a list of all the human beings on the site, organized by subcategories.
I've become very fond of that sort of inter-linking. I'm still figuring out what to do with it, but I think it's worth pursuing. So when Shirky quotes a particularly perplexing syllogism and says:
[This syllogism] illustrates the kind of world we would have to live in for this form of reasoning to work, a world where language is merely math done with words.
I disagree entirely. I am a writer by avocation and trade, and I am finding real pleasure in using Semantic Web technologies to mark up my ideas, creating pages that link together. What I do is not math done with words. It's links done with semantics, and it forces me to think in new ways about the things I'm writing.
Let me give you another example, so that you might get a better sense of the power of the system. Let's say that every week, I published a summary of the week's news. But instead of just writing the news up, sentence after sentence, I marked up the news as a series of events, with the times they occurred, and linked the different events to the topics they discussed. Here's a faked-up paragraph, from November 14, 2000 (or so).
The results of the election between George W. Bush and Al Gore remained uncertain. It appeared that the presence of Ralph Nader in the electoral race cost Gore a clear majority. Bush began to refer to Laura Bush as “First Lady Bush.”
Now, you can't see this, but each one of those three sentences is marked up as an “Event.” If you click on the link to Al Gore, it will show you a timeline-sorted list of the two events that relate to Gore, and if you click on the link to George Bush, it'll do the same. Working out from here, it's easy to imagine how you could take all of your weekly reports, build one master database and publish it like “The Timelines of History.” You could issue queries like “show me all the events that involve 'George Bush' and 'Iraq.'” Doing this the old-fashioned way, with a relational database, is a true pain. Rolling the database yourself, like I did, is very difficult and no fun at all, and writing it in a language like XSLT, which is also something I did, is about as dumb as it gets. Nope, having tried everything, I'm increasingly of the opinion that the right way to do it, as far as I can see, is to turn your events into Semantic Web-friendly RDF statements, store them in an RDF database, and query them there. When you've got a big pile of semantically tagged interlinked data as a nail, the Semantic Web framework is the best hammer around.
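To make the event markup concrete, here is a small sketch that treats each sentence of the faked-up paragraph as an event record with a date and a set of linked topics. The field names (`when`, `text`, `topics`) and the exact date are assumptions for illustration; this is not the markup actually used on this site.

```python
from datetime import date

# One record per sentence: a date, the text, and the topics it links to.
events = [
    {
        "when": date(2000, 11, 14),
        "text": "The results of the election between George W. Bush "
                "and Al Gore remained uncertain.",
        "topics": {"George W. Bush", "Al Gore"},
    },
    {
        "when": date(2000, 11, 14),
        "text": "It appeared that the presence of Ralph Nader in the "
                "electoral race cost Gore a clear majority.",
        "topics": {"Ralph Nader", "Al Gore"},
    },
    {
        "when": date(2000, 11, 14),
        "text": "Bush began to refer to Laura Bush as “First Lady Bush.”",
        "topics": {"George W. Bush", "Laura Bush"},
    },
]


def events_involving(*topics):
    """Events linked to every one of the given topics, oldest first."""
    wanted = set(topics)
    matches = [event for event in events if wanted <= event["topics"]]
    return sorted(matches, key=lambda event: event["when"])


for event in events_involving("Al Gore"):
    print(event["when"].isoformat(), event["text"])
```

Asking for `events_involving("George W. Bush", "Iraq")` returns an empty list here, since no event in this tiny sample links to Iraq.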
Just having a bunch of linked events isn't the answer. Those links need some place to point to. So we create a sort of meta-index, which doesn't just contain subjects and sub-topics, but also defines relationships between them (which you can do using triples). As opposed to being events with links, this meta-index, or ontology, is far more formally specified. It has relationships like:
|George W. Bush||Is president of||the United States (from 2001-?)|
|The Middle East||Contains||Iraq, Iran, Israel, etc.|
You might also add facts about Bill Clinton, George Bush, Sr., and other presidents. You also need to tell the system what the “Contains” predicate means: if something relates to the thing that is contained, then it also relates to the container. For instance, if there is war in Israel, then there is war in the Middle East.
Given a database of such facts, you might want to ask a question like: “show me all the events that involve a president of the United States and the Middle East.” Because you have an ontology, your system can reason something like what follows:
- The set of things contained by the Middle East includes Iraq, Iran, Israel, and so forth.
- The set of presidents includes George W. Bush, Bill Clinton, and so forth.
- Therefore every event which refers to at least one of the set of presidents, and to at least one country contained by the Middle East, can be said to answer our question.
The system then goes ahead and finds all of those matches, via the dark arts of resolution reasoning, applied graph theory, and set theory, and returns a big list of events as a result. What you do with these events is up to you: you might organize them in a timeline, for instance. As above, doing this with relational databases is a pain. A storage layer that understands logic is far preferable.
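The three reasoning steps above can be imitated with ordinary sets. The sketch below is deliberately naive: it uses plain set operations, not resolution reasoning or a real logic-aware store, and the event names and topic sets are invented for the example.

```python
# A tiny "ontology": what contains what, and who counts as a president.
contains = {"The Middle East": {"Iraq", "Iran", "Israel"}}
presidents = {"George W. Bush", "Bill Clinton"}

# Hypothetical events, each with the topics it links to.
events = [
    ("event-1", {"George W. Bush", "Iraq"}),
    ("event-2", {"Bill Clinton", "Israel"}),
    ("event-3", {"George W. Bush", "Texas"}),
    ("event-4", {"Iran"}),
]


def expand(region):
    """Return the region plus everything it contains, directly or indirectly."""
    found, pending = set(), [region]
    while pending:
        place = pending.pop()
        if place not in found:
            found.add(place)
            pending.extend(contains.get(place, ()))
    return found


def presidential_events(region):
    """Names of events that mention a president and a place within the region."""
    places = expand(region)
    return [
        name for name, topics in events
        if topics & presidents and topics & places
    ]


print(presidential_events("The Middle East"))
```

The "Contains" rule lives entirely in `expand`: an event about Israel counts as an event about the Middle East because Israel is in the expanded set.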
Some of this sort of linking and programming is easy, but a good bit of it is mind-bending. I find it hard to code up my own ontologies. A geography ontology is a good example: I don't want to encode every region, country, state, prefecture, and so forth. I'd much rather take someone else's ontology of all the countries, states, and so forth (like the one my pal Jack Rusher is using on his web site). So what I'll do instead, when I want that view of the world, is get it from Jack, or from where Jack got it, in RDF format, and drop it into my triple store, and address my links to the unique URIs specified within that small geographic ontology.
What's good about that is that now, if we want to, Jack and I can take all of our web pages, spit them out in RDF, and merge them together. All the attributions stay the same, but if we've both written something about Italy, the Italy page will contain links to each of our pieces. It can do this because we shared the ontology of nations. So when Shirky says:
No one who has ever dealt with merging databases would use the word 'simply'. If making a thesaurus of field names were all there was to it, there would be no need for the Semantic Web; this process would work today.
It's hard to see where he's coming from. That is the point of the Semantic Web. Merging RDF databases is not as easy as, say, drinking chilled white wine in the summertime, but it's definitely not as hard as unifying multiple relational databases. The OWL language, which allows you to define ontologies, has all manner of trickery for saying which URIs are synonyms of each other, and how they relate. So what you can do, if you choose, is merge your databases, and then write up a series of OWL statements explaining how the different databases relate. Then an OWL-aware system (of which there are admittedly remarkably few, though more are on the way) glues the databases together for you. What the Semantic Web framework does is admit that it is really hard to unify databases, and give you a language for unifying them that doesn't require you to muck around too much in the details. You can focus on the semantics, not the actual formatting of the data, and approach the problem quite strategically.
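A toy version of that merge is sketched below. The `same_as` mapping is a much-simplified stand-in for OWL's `owl:sameAs` statements, and every URI and predicate name is a made-up placeholder; a real OWL-aware system does far more than rewrite identifiers.

```python
def merge(store_a, store_b, same_as):
    """Union two sets of triples after rewriting aliased URIs
    to the canonical URI named in `same_as`."""

    def canonical(term):
        seen = set()
        # Follow alias chains, stopping if a cycle is detected.
        while term in same_as and term not in seen:
            seen.add(term)
            term = same_as[term]
        return term

    return {
        tuple(canonical(term) for term in triple)
        for triple in store_a | store_b
    }


# Hypothetical URIs for two sites that each named Italy differently.
MY_ITALY = "http://example.org/paul/places#Italy"
JACKS_ITALY = "http://example.com/jack/geo#Italy"

mine = {("http://example.org/paul/essay-1", "is_about", MY_ITALY)}
jacks = {("http://example.com/jack/post-9", "is_about", JACKS_ITALY)}

# One statement explaining how the two databases relate.
same_as = {MY_ITALY: JACKS_ITALY}

merged = merge(mine, jacks, same_as)
about_italy = sorted(s for s, p, o in merged if o == JACKS_ITALY)
print(about_italy)
```

If both sites had used the same geographic ontology from the start, `same_as` would be empty and the merge would reduce to a plain set union.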
This situation is much more friendly than the one Shirky describes:
But [the meta-data we generate] is being designed a bit at a time, out of self-interest and without regard for global ontology. It is also being adopted piecemeal, and it will bring with it all the incompatibilities and complexities that implies.
This assumes that people would prefer to endlessly re-create ontologies that describe well-known subjects, like geography, or authors, or books. Personally, I like my way better: if there's a nice-looking ontology about geography, and I can get my hands on it, I'll just plug that thing into my site and start using it.
There are many other points in Shirky's essay that I disagree with, and I originally set out to refute them point by point, but essentially, I disagree with every one of his major conclusions, and find them to be based on an incomplete understanding of what the Semantic Web is and how its researchers work. If you search Citeseer for papers on RDF, the Semantic Web, and related technologies, you'll find a wide variety of prior art that addresses many of the issues he discusses, and you'll also find that the Semantic Web community is nowhere near as ignorant of the problems he describes as he suggests. Quite a bit of work has been done on trust metrics, semantic disambiguation, ontology exchange, triple storage, and query semantics. Some of it is doubtless going down the wrong path, but some is equally likely to prove worthwhile. Fifty years of AI research has not given us a computer that thinks, but it hasn't been wasted time, either. Neural nets, Bayesian algorithms, and the other fancy stuff that is trapping spam, girding up search engines, and performing other useful tasks are a direct result of the long years of research into AI. The Semantic Web is a classic AI project, but on a much larger, less predictable scale than ever before. By sneering at a few researchers, Shirky maligns the patient, methodical work of hundreds of others.
For every quote he presents that shows the Semantic Web community as glassy-eyed, out-of-touch individuals suffering from “cluelessness,” I could give a list of many other individuals doing work that is relevant to real-world issues, who have pinned their successful careers on the concepts of the Semantic Web, sometimes because they feel it is going to be the next big thing, but also because of sheer intellectual excitement. The work being done at the UMD Mindswap lab, which employs two friends of mine, Bijan Parsia (well, Bijan is more of a likeable nemesis, but anyway) and Kendall Clark, whom I can personally vouch for as down-to-earth individuals keenly aware of the limits of computing, is definitely worth noting. Companies like Radar Networks are working on a truly usable Semantic Web platform. Sensible individuals see the Semantic Web as an enabling technology for all manner of applications. Individuals like Edd Dumbill of XML.com (which publishes me from time to time), Dave Beckett of Semantic Web Action Development Europe, who has put together the promising Workshop on Semantic Web Storage and Retrieval, happening this week in Amsterdam, and many, many others are pragmatic technologists who share information freely and believe strongly that building a Semantic Web is a worthwhile pursuit. My money's on them. They know what they're talking about, and aren't afraid to admit what they don't know.
Postscript: on December 1, on this site, I'll describe a site I've built for a major national magazine of literature, politics, and culture. The site is built entirely on a primitive, but useful, Semantic Web framework, and I'll explain why using this framework was in the best interests of both the magazine and the readers, and how its code base allows it to re-use content in hundreds of interesting ways.