This is the final installment of my interview with Michael Bergman. The interview started here. Today Michael talks about federated search, the semantic Web, and his work with Zitgist.
17. What do you see as the pros and cons of real-time federated search vs. crawling, indexing and harvesting?
I mostly see these two approaches filling very different needs with different approaches.
A number of years back I wrote a white paper on the inadequacy of standard search alone for business purposes. It basically contrasted standard search, which I called discovery, with purposeful capture or harvest.
Search engines work best in the discovery phase, when searching is a fast, give-and-take, contact sport. Real-time performance is important, and the user mode is one of interaction and testing. I frankly feel deep Web search is not terribly useful or helpful in this phase. Identifying candidate searchable databases can be very important here, but that can be accomplished from a search engine for databases such as CompletePlanet or the DQM rather than going to the site directly (reserving deep Web search for the purposeful harvest mode).
Once the researcher has a good bead on their capture requirements, harvesting and the deep Web come to the fore. But this can be scheduled, and need not meet a real-time criterion.
18. What do you see as major challenges with real-time federated search?
I guess I really do not see the need for real-time federated search involving the deep Web; federated search, yes, but real time, no. Again, I would put that into the purposeful harvesting category, and not subject to the discovery threshold of being real-time.
19. What do you think the landscape of federated search will look like in ten years? This blog is sponsoring a writing contest to predict the future of federated search. Perhaps you’d like to enter?
No, thank you (smile), I think I’ll pass on any writing contests.
As for ten years out, I think it is important to distinguish quick search, look-up, or discovery from purposeful information gathering (which implies thoroughness and the harvesting of good candidates). Of course, these same distinctions occur today.
I would guess that a source like Google (or perhaps an emerging competitor) will continue to do an excellent job of “indexing everything” for quick search and discovery. This will be supplemented by much metadata and support for faceted search, or basic “slicing-and-dicing” for better organization of results. High-value content formerly in the deep Web will become exposed as well, either through direct content-owner relationships or harvesting techniques at the scale recently announced by Alon Halevy and Google.
For purposeful information gathering I obviously believe the trend will be inexorably to structured representation of text and the addition of semantics. This is what my current company, Zitgist, calls linked data. It responds to the area of purposeful information gathering and analysis that is at the core of an information economy.
20. Tim Berners-Lee, credited with inventing the World Wide Web, has been talking about the importance and value of the Semantic Web for years yet common folks don’t see much evidence of the Semantic Web gaining traction. Is there substance to the Semantic Web? What’s happening with it now and what does its future look like?
Wow, in 10,000 words or less?
No, actually, this is a very good question. As things go, I am a relative newbie to the semantic Web, only having studied and followed it closely since about 2005. I’m sure my perspective in coming later to the party may not be shared by those at the beginning, which dates to the mid-1990s as Berners-Lee’s vision naturally progressed from a Web of documents, as most of us currently know the Web, to a Web of data.
I think there is indeed incredibly important substance to the semantic Web. But, as I have written elsewhere, the semantic Web is more of a vision than a discernible point in time or a milestone.
The basic idea of the semantic Web is to shift the focus from documents to data. Give data a unique Web address. Characterize that data with rich metadata. Describe how things are related to one another so that relationships and connections can be traced. Provide defined structures for what these things and relationships “mean”; this is what provides the semantics, with the structures and their defined vocabularies known as “ontologies” (which in one analog can be seen as akin to a relational database schema).
As these structures and definitions get put in place, the Web itself then becomes the infrastructure for relating information from everywhere and anywhere on any given topic or subject. While this vision may sound grandiose, just think back to what the Web itself has done for us and documents over the past decade or so. This same architecture and infrastructure can and should be extended to the actual information in those documents, the data. And, oh, by the way, conventional databases can now join this party as well. The vision is very powerful and very cool.
Progress has indeed been slow. Many advocates fairly point to how long it takes to get standards in place, and for a while people spoke of the “chicken-and-egg” problem: getting over the threshold of having enough structured data available to make it worthwhile to create the tools, applications and showcases that consume that data.
From my perspective, the early visions of the semantic Web were too abstract, a bit off perhaps. First, there was the whole idea of artificial intelligence and machines using the data, as opposed to better ways for humans to draw use from the data at hand. The fundamental and exciting engine underneath the semantic Web — the RDF (Resource Description Framework) data model — was not initially treated on its own. It got admixed with XML, which made understanding difficult and distinctions vague. There is and remains too much academia and not enough pragmatics driving the bus.
But that is changing and fast.
There is now an immediate and practical “flavor” of the semantic Web called linked data. It has three simple bases:
(1) RDF as the simple but adaptable data model that can represent any information — structured or unstructured — as the basic “triple” statement of subject-predicate-object. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It sounds like a kindergarten reader, but that is how data can be easily represented and built up into more complex structures and stories;
(2) Give all objects a unique Web identifier. Unique identifiers are common to any database; in linked data, we just make sure those identifiers conform to the same URIs we see constantly in the address bar of our Web browsers, and:
(3) Post and expose this stuff as accessible on the Web (namely, HTTP).
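To make the triple idea concrete, here is a minimal sketch in Python: plain tuples and a pattern-matching helper rather than a real RDF library, with hypothetical example.org URIs standing in for real identifiers.

```python
# A toy illustration of the linked data "triple" model: each fact is a
# (subject, predicate, object) statement, with Web-style URIs serving as
# the unique identifiers. All URIs below are hypothetical examples.

triples = [
    ("http://example.org/person/dick", "http://example.org/vocab/sees",
     "http://example.org/person/jane"),
    ("http://example.org/thing/ball", "http://example.org/vocab/shape",
     "round"),
    ("http://example.org/person/jane", "http://example.org/vocab/name",
     "Jane"),
]

def match(triples, s=None, p=None, o=None):
    """Return every triple matching the given pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Who does Dick see?" -- query by fixing subject and predicate:
answers = match(triples,
                s="http://example.org/person/dick",
                p="http://example.org/vocab/sees")
```

Real linked data systems would use an RDF library and SPARQL for this kind of pattern matching; the sketch only shows the shape of the data and why URIs as identifiers let statements from different sources link up.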
My company adds some essential “spice” to these flavors with respect to reference structures and concepts to give the information context, but these simple bases remain the foundation.
These are really not complex steps. They are really no different than the early phases of posting documents on the Web. Only now, we are exposing data.
More importantly, we can forget the chicken-and-egg problem. Each new data link we make brings value, much as adding a node to a network brings value according to Metcalfe’s Law. Only with linked data, we already have the nodes — the data — and we are just establishing the link connections (the verbs, predicates or relations) to flesh out the network graph. Same principle, only our focus is now to connect what is there rather than to add more nodes. (Of course, adding more linked nodes helps as well!)
The absolutely amazing thing about our current circumstance as Web users is that we truly now have simple and readily deployable mechanisms available to finally overcome the decades of enterprise stovepipes. The whole answer is so simple it can be mistaken as snake oil when first presented and not inspected a bit.
For an industry accustomed to hype, and cynical about so much of it, I only ask that your readers check out these assertions for themselves and suspend their normal and expected disbelief. For me, in a career of more than 30 years focused on information and access, I feel we finally have the tools, data model and architecture at hand to actually achieve data interoperability.
21. Can you speak to the intersection of federated search and the Semantic Web? Is anyone aggregating semantically tagged content using a federated search approach?
I appreciate the focus on federated search in these questions, and I don’t mean to be arrogant or dismissive. But from my personal perspective, getting basic semantic Web principles and frameworks in place holds trump here. There may be (and likely are) federated search efforts also occurring in this area, but I don’t know much about them personally.
There have been a couple of deep Web research workshops initiated in conjunction with appropriate academic meetings. They are probably the better places to look on this question. My current focus is linked data, not the deep Web or federated search.
22. Can you explain what you’re doing at Zitgist? What problem is Zitgist addressing? What are its products and services? Who are its customers?
My main job at Zitgist, of course, is to build the business and secure revenues and customers.
Another aspect is that what we are doing with linked data is somewhat new and needs explanation to our enterprise markets. A traditional VC would say they do not want Zitgist involved in “educating the market”. But, from my perspective, that education is unavoidable for us and also part of my current job description. Zitgist is still early in the market; there is much that needs to be communicated. And for that, I thank Federated Search Blog for these questions and this opportunity.
But, apart from administrative stuff, the real problem and thus opportunity that Zitgist is addressing is:
How to apply Web-responsive ways to provide data interoperability across the enterprise and with the global environment?
In other words, Zitgist is addressing the problem of breaking down data barriers and bridging across stovepipes.
The combination of RDF and linked data offers, finally in my opinion, the first viable approach for overcoming the data silos that have been the abiding legacy of enterprises of all stripes over the past 30 years. My professional career has been embedded in the cost and frustration arising from dysfunctional data and systems. It is a travesty that now has an escape route to success.
Zitgist has some consumer things coming out soon with (I think) an innovative revenue model. The uncertainty with that, however, is that there really are few consumer revenue models on the Web aside from advertising. At minimum, we think our consumer offerings will increase visibility and serve as a demo framework, but to be brutally honest there are too many unknowns and too much serendipity in a consumer play to bet Zitgist on it. So we are pursuing a consumer angle because it is innovative and could be a real home run, but we just don’t really know.
Thus, the main focus we are relying on in our plan is enterprise services to support the transition to this linked data future. There is real pain and real need here — and has been in enterprises for decades. Enterprise services are a more reliable basis for the business model on which to build Zitgist. We have some interesting licensing and pricing models around that — as well as synergies with open source offerings — but that is a story for another day.
23. Is there a next big project or venture?
Absolutely. Stay tuned for our commercial announcements.
However, related to Zitgist’s core mission of delivering linked data in context, we have developed and recently released the free, open source UMBEL project. UMBEL (Upper Mapping and Binding Exchange Layer) is firstly a set of 20,000 reference subject concepts to help describe or tag what Web content “is about”. It is also a semantic Web “backbone” ontology for aiding the tie-in and interoperability of other information structures on the Web. And, lastly, UMBEL is a lightweight vocabulary that is an extension of SKOS (Simple Knowledge Organization System) from the World Wide Web Consortium (W3C) standards-setting body. As a vocabulary, UMBEL is very useful for constructing more specific, domain-oriented vocabularies (such as for a single enterprise or industry) designed from the outset to interoperate with the Web.
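As a rough illustration of what a SKOS-style vocabulary looks like in practice, here is a toy concept scheme sketched in Python. The concepts and their broader/narrower links are invented for the example and are not actual UMBEL subject concepts.

```python
# Toy SKOS-style concept scheme: each concept has a preferred label and
# optional "broader" links, loosely mirroring skos:prefLabel and
# skos:broader. These concepts are invented examples, not UMBEL entries.

concepts = {
    "Animal": {"prefLabel": "Animal", "broader": []},
    "Mammal": {"prefLabel": "Mammal", "broader": ["Animal"]},
    "Dog":    {"prefLabel": "Dog",    "broader": ["Mammal"]},
}

def ancestors(concept_id, concepts):
    """Walk broader links transitively to collect all ancestor concepts."""
    seen = []
    stack = list(concepts[concept_id]["broader"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(concepts[parent]["broader"])
    return seen
```

A domain vocabulary built this way can point its top-level concepts at shared reference concepts, which is the interoperability role UMBEL is designed to play.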
We have posted a large set of APIs and Web services for various aspects of UMBEL (see http://umbel.zitgist.com). Depending on the publication date for this interview, we are also releasing the free scones tagging Web service. scones (subject concept or named entities) output can also be obtained in RDFa form for those who want to embed a linked data description of what their content “is about” on their Web pages.
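For readers unfamiliar with RDFa, a fragment like the following shows the general idea of embedding an “is about” statement directly in a page. This is a hypothetical example using the Dublin Core dc:subject property, not actual scones output.

```html
<!-- Hypothetical RDFa fragment: states that the page at the given URI
     is about "Global climate change", using Dublin Core's dc:subject. -->
<div xmlns:dc="http://purl.org/dc/terms/"
     about="http://example.org/my-article">
  <span property="dc:subject" content="Global climate change">
    This article discusses global climate change.
  </span>
</div>
```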
We have put a full year of effort behind UMBEL because it is a central piece of infrastructure for Zitgist’s own services. We have made it fully open source because our self-interest as a company is served by having a network effect of reference concepts with which we and our customers can interoperate.
Finally, I should mention Zitgist is working with a small group of sister companies and leading experts in global climate change to bring semantic Web and linked data principles to that community. Nothing has been formally announced because we want to have real tangible stuff when we do. But both communities — climate change and semantic Web — see real synergies in working with the other. It is exciting to think we might help promote data interoperability amongst the current 100,000 or more datasets related to global warming. More info on this should be announced in early 2009.
I hope you’ve enjoyed this interview. Stay tuned for interviews with more luminaries.