
The Road to the Semantic Web
2007-07-28 11:00
Written by Alex Iskold and edited by Richard MacManus / November 14, 2006
John Markoff's recent article in The New York Times has generated an interesting discussion about Web 3.0 being the long-promised Semantic Web. For instance, a short post on Fred Wilson's blog drew many lengthy comments attempting to define Web 1.0, Web 2.0 and Web 3.0. Some people think that the Semantic Web is about AI, some claim that it is more about semantics, while others say that it is about data annotation. All agree, however, that we will all be wonderfully more productive and simply happier when it arrives. Let's take a look at the ingredients, definitions and approaches to the Semantic Web so that we can recognize it when it is finally here.
What is the Semantic Web?
Wikipedia defines the Semantic Web as a project that intends to create a universal medium for information exchange by putting documents with computer-processable meaning (semantics) on the World Wide Web. The core idea is to create metadata describing the data, which will enable computers to process the meaning of things. Once computers are equipped with semantics, they will be capable of solving complex semantic optimization problems. For example, as John Markoff describes in his article, a computer will be able to instantly return relevant search results if you tell it to find a vacation on a 3K budget.
In order for computers to be able to solve problems like this one, the information on the web needs to be annotated with descriptions and relationships. The most basic form of semantics is categorizing an object and listing its attributes. For example, books fall into a Books category where each object has attributes such as the author, the number of pages and the publication date. The basic example of a relationship comes from the various social networks that we are part of. In one network the relationship might be "friend of", in another "family member" and in another "works with".
RDF, OWL and the mathematical approach to annotation
There are billions of fairly unstructured HTML pages which contain no annotations or metadata. The fundamental engineering question is: how can we go from today's unstructured web to one rich with semantic information? The W3C's specifications for RDF (Resource Description Framework) and OWL (Web Ontology Language) attempt to enable the collective capture and description of information, along with its ontology and its relationships with other pieces of information, in a rigorous, mathematical way.
RDF is a data model, most commonly serialized as XML (RDF/XML), which describes relationships via predicates. Each statement is a triple of subject, predicate and object. Wikipedia explains: The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion "The sky has the color blue" in RDF is as a triple of specially formatted strings: a subject denoting "the sky", a predicate denoting "has the color", and an object denoting "blue".
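The triple idea can be sketched in a few lines of Python. This is only an illustration of the subject/predicate/object shape, not RDF itself: the `ex:` identifiers, the `Triple` type and the `match` helper are all made up for the example, and real RDF would use full URIs and a proper RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers; real RDF would use full URIs here.
TRIPLES = [
    Triple("ex:sky", "ex:hasColor", "blue"),
    Triple("ex:grass", "ex:hasColor", "green"),
    Triple("ex:alice", "ex:friendOf", "ex:bob"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "What color is the sky?" becomes a pattern with the object left open.
for t in match(TRIPLES, subject="ex:sky", predicate="ex:hasColor"):
    print(t.obj)
```

Because every fact has the same three-part shape, one generic matcher can answer questions about colors and friendships alike, which is much of the appeal of the model.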
OWL, which is also typically serialized as XML, is a language for describing and reasoning about ontologies. In a nutshell, OWL facilitates semantic descriptions such as "Dog is an animal" or "Dog has four legs". There are three flavors of OWL: OWL Lite, OWL DL and OWL Full, each striking a different trade-off between expressiveness and computability. This RDF/OWL framework is comprehensive, but it is difficult for people without a background in mathematics and computer science to understand. Given that this is a bottom-up approach, it is clear that if it is to succeed, there needs to be an automated mechanism that takes existing HTML content and turns it into RDF and OWL metadata. This, however, is a chicken-and-egg problem: if we could already do this, the problem would not be there to begin with. Still, we can envision tooling which does 80% of the work automatically and then interacts with a person to complete the other 20%.
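To give a feel for what "reasoning about ontologies" means, here is a toy sketch in Python. It is not OWL and covers only a sliver of what an OWL reasoner does: it follows a subclass chain and inherits properties along it. The `Mammal` class, the `alive` property and the function names are assumptions added for illustration, and the hierarchy is assumed to be acyclic with a single parent per class.

```python
# Toy ontology: class -> parent class. "Mammal" is an illustrative addition.
SUBCLASS_OF = {"Dog": "Mammal", "Mammal": "Animal"}

# Properties asserted directly on a class ("alive" is hypothetical).
PROPERTIES = {"Dog": {"legs": 4}, "Animal": {"alive": True}}


def ancestors(cls):
    """Walk up the subclass chain, nearest parent first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, target):
    """True if cls is target or a (transitive) subclass of it."""
    return cls == target or target in ancestors(cls)


def inherited_properties(cls):
    """Merge properties from the root down, so subclasses override parents."""
    merged = {}
    for c in reversed([cls] + ancestors(cls)):
        merged.update(PROPERTIES.get(c, {}))
    return merged


print(is_a("Dog", "Animal"))        # Dog -> Mammal -> Animal
print(inherited_properties("Dog"))  # own properties plus inherited ones
```

Nobody stated "Dog is an animal" directly; the program derived it from two smaller facts. Richer ontology languages allow far more kinds of derivation, which is where the trade-off between expressiveness and computability comes from.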
Microformats
Recognizing the complexity of RDF and OWL, a group of people are trying a different approach called Microformats. The goal of microformats is to embed basic semantics right into HTML pages. The approach is not yet as expressive as RDF and OWL, but it is very compact and uses existing XHTML facilities to add semantics to pages. For example, there is a microformat for describing contact information called hCard. Using hCard it is possible to annotate the HTML so that a microformat-aware browser or search engine can deduce information about a person, such as first and last name, company or phone number. Another mature microformat, called hCalendar, enables page authors to describe events. Many popular event sites, such as Facebook and Yahoo! Local, use this format to annotate events in their HTML pages.
Leaving the aesthetics of the representation aside, the microformats approach is clearly simpler than RDF and OWL. And even though it is less powerful, it is becoming very popular. Many site authors are starting to embed microformats into their HTML pages. We are also seeing some early examples of search engines based on microformats, like this one from Technorati. The simple gain from using microformats in search is the removal of ambiguity. In a way, it is similar to a vertical search engine, which knows which vertical you are searching. With microformats inside the pages, the data is no longer ambiguous either, so the search results are more precise.
Still, there are some issues with microformats. The first one is the same as with the previous bottom-up approach: people have to do the work to annotate the pages. The good news is that since the format is simpler, more can be done via reverse engineering and automation. The second issue is that the current set of microformats does not cover many things that we encounter online. For example, we are not aware of a format that would help represent a book or a movie. Many more formats need to be created before they can really "cover" the web.
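As a rough sketch of how a microformat-aware tool might read an hCard, the Python below scans markup for the hCard class names `vcard`, `fn`, `org` and `tel` using only the standard library. The sample markup, the person and the `HCardParser` class are invented for illustration, only three hCard fields are handled, and the parser assumes XHTML-style markup in which every element is explicitly closed.

```python
from html.parser import HTMLParser

# A small subset of hCard property class names.
HCARD_FIELDS = {"fn", "org", "tel"}

# Hypothetical page fragment annotated with hCard classes.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Doe</span>
  <span class="org">Example Corp</span>
  <span class="tel">555-0100</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collects one dict per element carrying class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._depth = 0      # nesting depth inside the current vcard
        self._field = None   # hCard field of the element being read

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self._depth == 0:
            if "vcard" in classes:
                self.cards.append({})
                self._depth = 1
            return
        self._depth += 1
        found = classes & HCARD_FIELDS
        if found:
            self._field = found.pop()

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            self._field = None

    def handle_data(self, data):
        if self._depth and self._field and data.strip():
            self.cards[-1][self._field] = data.strip()


parser = HCardParser()
parser.feed(SAMPLE)
print(parser.cards)
```

The page still renders as ordinary HTML for people, while the class names give a program enough structure to know that one string is a name and another a phone number.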
Semantic Web is Personalized Web
The problem of annotating data is very complex and is far from being solved completely. But let’s leave it aside for a moment and think about what we could do once all the data is annotated. The promise is that we will be doing less of what we are doing now, namely sifting through piles of irrelevant information. Given that the amount of information is growing exponentially and our tolerance is shrinking, this is a very intriguing proposition. If the computer can return relevant results instantly, we can potentially save a ton of time.
But having semantics and knowing all relationships between the data is not enough to do that. Take the simple example of a travel agency. When you show up there for the first time, the agent does not know what to offer you, even though she knows the semantics of travel, the relationships between things and the prices of everything. In order to be effective, she needs to know where you've been already and what kind of destinations you like. That’s why she asks you questions. All services that we receive work this way and the results are better and more refined over time, because service people have time to learn what you like.
So the second important ingredient of the Semantic Web, the one that will facilitate productivity, is a set of persistent personal preferences. Once the computer knows your preferences and has a semantic representation of them online, it can run an algorithm to deliver precise, personalized results. To put it differently, your personal preferences are the filter that needs to be applied to the results that the computer returns in response to: Find a vacation for under 3K. And when this happens, then we can claim that the Semantic Web has arrived.
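The idea of preferences acting as a filter can be sketched as follows. Everything here is hypothetical: the `Vacation` and `Preferences` types, the sample offers and the ranking rule (drop places already visited, put liked trip types first, then sort by price) are one plausible reading of the travel agent example, not a description of any real system.

```python
from dataclasses import dataclass


@dataclass
class Vacation:
    destination: str
    kind: str    # e.g. "beach" or "city"
    price: int   # total price in dollars


@dataclass
class Preferences:
    liked_kinds: set  # kinds of trips this person enjoys
    visited: set      # destinations they have already been to


def personalized(offers, budget, prefs):
    """Keep offers within budget and not yet visited; liked kinds rank first."""
    candidates = [
        o for o in offers
        if o.price <= budget and o.destination not in prefs.visited
    ]
    # False sorts before True, so liked kinds come first, then cheaper trips.
    return sorted(candidates, key=lambda o: (o.kind not in prefs.liked_kinds, o.price))


# Made-up offers and preferences for illustration.
OFFERS = [
    Vacation("Lisbon", "city", 1800),
    Vacation("Maui", "beach", 2900),
    Vacation("Cancun", "beach", 2200),
    Vacation("Tokyo", "city", 3400),
]
PREFS = Preferences(liked_kinds={"beach"}, visited={"Cancun"})

for offer in personalized(OFFERS, budget=3000, prefs=PREFS):
    print(offer.destination, offer.price)
```

The semantic annotations make the first step possible (knowing that a page describes a vacation with a price), while the stored preferences do what the travel agent's questions do: they narrow and order the results for one particular person.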
Conclusion
So will the 'Web 3.0' be the Semantic Web? Probably. But are we there yet? Not quite. It will take some time to annotate the world's information and then to capture personal information in the right way, to enable the kinds of applications that we have discussed. We are certainly getting close and it will be interesting to see how things unfold over the next few years.
Incidentally, if you would like us to write more about the Semantic Web please let us know and we will do follow up posts.

