Viewing entries tagged with 'conference'
I gave a presentation at the K-CAP 2007 conference. It, along with all my other papers, can be found in the publications section of this website.
You can view the research paper, the slides of my presentation, and an animated movie of the presentation (slides + audio). To view the movie, click the image below to download it (QuickTime, H.264):
The research gives techniques and a methodology for multi-user editing of ontologies. It suggests a technique for locking segments of description logic ontologies for multi-user editing. This technique fits into a methodology for ontology editing in which multiple ontology engineers concurrently lock, extract, modify, error-check and re-merge individual segments of a large ontology. The technique aims to provide a pragmatic compromise between a very restrictive approach, which might offer complete error protection but make useful multi-user interactions impossible, and a wide-open, anything-goes editing paradigm, which offers little to no protection.
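The lock/extract/modify/re-merge cycle could be sketched roughly as follows. This is my own illustration, not the paper's actual algorithm: the class name and the simple "no overlapping segments" conflict rule are invented for the example.

```python
class SegmentLockManager:
    """Toy lock manager for segments of a concept hierarchy.

    The ontology is modelled minimally as {concept: set of parent concepts}.
    Locking a segment claims the whole sub-hierarchy under a root concept,
    so two editors can never hold overlapping segments at once.
    """

    def __init__(self, ontology):
        self.ontology = ontology   # {concept: set(parent concepts)}
        self.locks = {}            # concept -> editor holding the lock

    def lock_segment(self, root, editor):
        """Lock the sub-hierarchy under `root`; fail on any overlap."""
        segment = self._descendants(root)
        if any(c in self.locks for c in segment):
            raise RuntimeError("segment under %r is already locked" % root)
        for c in segment:
            self.locks[c] = editor
        return segment             # the editor extracts and edits this set

    def release(self, editor):
        """Drop all locks held by `editor` (i.e. after re-merging)."""
        self.locks = {c: e for c, e in self.locks.items() if e != editor}

    def _descendants(self, root):
        """Collect `root` plus every concept reachable downwards from it."""
        found = {root}
        frontier = [root]
        while frontier:
            node = frontier.pop()
            for concept, parents in self.ontology.items():
                if node in parents and concept not in found:
                    found.add(concept)
                    frontier.append(concept)
        return found
```

In this sketch, a second editor trying to lock any concept inside an already-locked segment gets an error, which is the "restrictive enough to prevent clashes, open enough to allow concurrent work" compromise in miniature.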
The research gives an overview of ten different ontology engineering projects' infrastructures, architectures and workflows. It especially focuses on issues regarding collaborative ontology modeling. The survey leads on to a discussion of the relative advantages and disadvantages of asynchronous and synchronous modalities of multi-user editing. This discussion highlights issues, trends and problems in the field of multi-user ontology development.
Interactive Knowledge Externalization and Combination for SECI model
Yoshinori Hijikata from Osaka University in Japan talked about capturing both tacit and explicit knowledge from two people while they are engaged in a conversation. Stages of the conversation are: socialization, externalization, internalization and combination. The GRAPE interface was used to allow the users to collaboratively build a classification tree as they speak. Four general discussion models were observed: (1) both users understand and agree with each other based upon their individual experiences; (2) one user doesn't have knowledge of the topic being discussed, but understands the other user's explanation; (3) one user doesn't completely understand the other user's explanation, but nevertheless modifies his own understanding, because he trusts the other user's expertise; (4) both users disagree with each other, but one user reluctantly gives in to the other user.
Luis von Ahn talked about the CAPTCHA test he developed. The test is designed to protect a website from being misused by automated computer programs. A computer has trouble passing the test, but a human can pass it with ease. This has led to a whole new industry of "captcha sweat shops", where spam companies employ people in developing countries to solve captcha tests all day long, so they can sign up for free email accounts and use these to send out spam. In total, about 200 million captchas are solved every day. Solving a captcha takes the average human 10 seconds. So, this amounts to a great deal of wasted distributed human processing power.
This led to the development of reCAPTCHA, a game that has all the advantages of a regular captcha, but also helps the OCR process of digitizing all the world's books. A scanned word from a book that a computer could not recognize accurately is offered up as a captcha for the human to interpret.
Luis von Ahn also developed the ESP game, where people have to assign keywords to an image with a partner with whom they can't communicate. If both people guess the same keyword, they win and the keyword gets assigned to that image. The keywording helps services like Google's image search to return more accurate search results.
The scary thing is how much information can be found out about a person just by monitoring them playing the game. After just 15 minutes of game-play, the researchers could predict a person's 5-year age bucket with 85% accuracy and gender with 95% accuracy (only a male would, for example, attempt to label a picture of Britney Spears as "hot"). This is just from a short time anonymously playing an online game, so imagine just how much information Google knows about you based upon what you search for.
Some other new games being developed in Luis von Ahn's lab are: Squigl, a game where two players trace out an image; Verbosity, a game where people are asked to describe a secret word via a template of questions; Tag-A-Tune, a game to label sounds. All these games and more will soon be coming to the Games With A Purpose (GWAP) website.
Maintaining Constraint-based Applications
Tomas Eric Nordlander talked about a brilliant constraint programming system for hospital inventory control. He defined Knowledge Acquisition as: "the transfer and transformation of problem-solving expertise from some knowledge source into a program. The process includes knowledge representation, refining, verification and testing". He went on to define Knowledge Maintenance as: "including adding, updating, verifying, and validating the content; discarding outdated and redundant knowledge and detecting knowledge gaps that need to be filled. The process involves simulating the effect that any maintenance action might have". Knowledge Maintenance is extremely important, but frequently under-appreciated.
The author designed a system named proCAM for Cork University Hospital. This system replaced the hospital's previous manual logistics system. It had to answer three basic questions: What products should be stored? When should the inventory be replenished? How much should be ordered? To answer these questions, proCAM considered: historic demand, service level (risk of being out of stock), space constraints, time constraints, holding cost, ordering cost, current stock level, periodic review time, and more. These can be generalized into physical constraints, policy constraints, guidelines and suggestions (it is nice to order and store two products together that get used together).
proCAM used a combination of operational research algorithms and constraint programming (CSP) to do its magic. It is very easy to use. The users of proCAM only see two values on the display: the order level (the stock level at which a new order should be placed) and the order number (the amount of the product that should be ordered). Behind the scenes, the system takes all constraints and past history into account to calculate the ideal order amounts. It can even detect seasonal variations in stock usage patterns and adjust order amounts accordingly. If someone tries to order a product in a way that violates one of the system's constraints, the violation is highlighted and the user is given the option of overriding the constraint and placing the order anyway, adjusting the constraint, or canceling the attempted order. Constraints can be maintained on-the-fly by hospital staff with this easy-to-use interface. proCAM also supports different sets of constraints between, e.g., the day-shift and the night-shift staff of the hospital.
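The two displayed values presumably build on standard inventory-control formulas. A minimal sketch using the textbook reorder-point and economic-order-quantity equations (this is not proCAM's actual algorithm, which also layers the constraint handling on top):

```python
import math

def reorder_point(daily_demand, demand_std, lead_time_days, z=1.65):
    """Stock level that should trigger a new order.

    z is the service-level factor: z = 1.65 corresponds to roughly a
    95% chance of not running out while waiting for the delivery.
    """
    safety_stock = z * demand_std * math.sqrt(lead_time_days)
    return daily_demand * lead_time_days + safety_stock

def economic_order_quantity(annual_demand, ordering_cost, holding_cost):
    """Classic EOQ: the order size that balances ordering and holding costs."""
    return math.sqrt(2 * annual_demand * ordering_cost / holding_cost)
```

With deterministic demand of 10 units/day and a 5-day lead time, `reorder_point(10, 0, 5)` gives an order level of 50 units; the "order number" would come from the EOQ side.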
One could imagine the same system being adapted to almost any inventory control scenario.
Strategies for Lifelong Knowledge Extraction from the Web
Michele Banko (a student of Oren Etzioni's) talked about the "Alice" system. Like TextRunner, Alice extracts facts from a text corpus, but it also attempts to create logical theories (e.g. citrus fruit = {orange, kiwi}). Alice adds generalized statements and embellishes class hierarchies. It allows lifelong learning. It does bottom-up, incremental acquisition of information: it will extract facts, discover new concepts, generalize these facts and concepts, and repeat this process indefinitely. The output is an updated domain theory.
Alice, when answering a query, does not use exhaustive search, because its data is never assumed to be perfect. Instead, it uses best-first search and search-by-analogy (association search) to navigate its knowledge tree.
Evaluation consisted of assessing the returned knowledge as: true, off-topic (true, but not interesting), vacuous, incomplete, erroneous. The system was 78% accurate. Problems occurred when the best-first search got distracted by going deep down a specific search branch.
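Best-first search of this kind fits in a few lines. The sketch below is generic, not Alice's actual code: the scoring heuristic is just a parameter, and the "distracted down a deep branch" failure mode shows up precisely because the score alone, not the depth, decides what gets expanded next.

```python
import heapq

def best_first_search(start, neighbors, score, is_goal, max_expansions=1000):
    """Expand the most promising node first (lower score = more promising).

    neighbors(node) yields adjacent nodes; is_goal(node) tests for success.
    max_expansions caps the search so a misleading heuristic cannot
    run forever down one deep branch.
    """
    frontier = [(score(start), start)]
    seen = {start}
    while frontier and max_expansions > 0:
        _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node
        max_expansions -= 1
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), nxt))
    return None  # goal not reached within the expansion budget
```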
The next speaker talked about the need to index ontologies for easier and faster search retrieval. Ontologies are different from text documents, so traditional text indexing can't be blindly applied. Ontologies are supposed to be conceptualizations of a domain, so the emphasis of this work was to take advantage of this aspect when indexing ontologies. Existing ontology indexing approaches use flat keyword indexes, human-authored manual indexes or PageRank-like indexing techniques.
The author's semantically enhanced keyword approach works by unfolding all axioms in an ontology until all primitive concepts are extracted. These concept names are then weighted according to whether they are, e.g., negated or not. Finally, because ontologies are conceptualizations of a domain, it should be possible to take advantage of other people's conceptualizations of the same knowledge. So, the approach harvests Wikipedia articles (and other articles linked to from these articles) relevant to the ontology, and then uses latent semantic analysis to further tune the ontology keyword indexes.
Papers and presentations that I found interesting from day 1 of the K-CAP 2007 conference:
Oren Etzioni talked about his TextRunner knowledge extracting search engine. TextRunner gathers large amounts of knowledge from the web. It does this by focusing on semantically tractable sentences, finding the "illuminating phrases" and learning these in a self-supervised manner. It leverages the redundancy of the web, so, if something is said multiple times, it is more likely to be true.
This is all loaded into an SQL server and can be queried by anyone. If you type a query into the search engine it will return all the structured knowledge it knows about that query. For example: "Oren Etzioni is a professor" and "Oren Etzioni has an arm".
Capturing Knowledge About Philosophy
Michele Pasin talked about his PhiloSURFical project to build an ontology of the history of philosophy for the purpose of improving the browsing and searching experience for philosophy students and teachers. His view is that ontology should not be about truth or beauty, but should instead focus on enabling reuse and sharing. Requirements for this tool were that it should support: uncertainty (e.g. of dates), information objects, interpretation of events, contradictory information and different viewpoints. The ontology itself is based upon CIDOC CRM. It captures events such as when one philosopher interprets another's work, teaches a student, or debates with another scholar. The knowledge base contains 440 classes, 15,000 instances, 7,000 individual people, 7,000 events and 500 axioms related to the philosopher Wittgenstein.
Searching Ontologies based on Content: Experiments in the Biomedical Domain
Harith Alani talked about the need for a good system to find existing ontologies on the web. Users need to find ontologies that they can reuse and/or use to bootstrap their own efforts. Existing content-based search tools don't work because, for example, the Foundational Model of Anatomy (FMA) doesn't have an actual class called "anatomy" anywhere in it. So, a search for "anatomy" would not result in this ontology being returned.
The research involved interviewing a number of experts to establish a gold standard. The experts were asked to list the good ontologies on certain topics (anatomy, physiological processes, pathology, histology). However, even the experts only agreed on 24% of answers.
The researchers' new ontology search tool uses Wikipedia to expand the queried concepts (future work involves also using UMLS and WordNet to expand the query). The result was that Swoogle achieved an accuracy f-measure of 27%, while the expanded term search's f-measure was 58%. The conclusion is that more ontology meta-data is necessary.
Capturing and Answering Questions Posed to a Knowledge-based System
Peter Clark from Vulcan, Inc. talked about their Halo project. The project aims to build a knowledge system (using the AURA knowledge authoring toolset) that can pass high-school level exams in physics, biology and chemistry. The system should be able to answer a free-text question such as: "a boulder is dropped off a cliff on a planet with 7 G gravity and takes 23 seconds until it hits the bottom. Disregarding air resistance, how high was the cliff?"
The system enforces a restricted, simplified version of English in which humans express the questions (based upon Boeing's Simplified English for aircraft manuals, modified for the specific domains). The language is both human usable and machine understandable.
Common-sense assumptions need to be made explicit for the system. So, for example, in the question above it must be specified that the drop is straight downwards and not arced. The humans asking questions of the system had to go through the following cycle: read the original question text, re-write it in controlled English, check it with the system and take note of any re-writing tips, allow the system to turn the text into logic, check the paraphrase of the system's understanding, press the answer button and evaluate the system's attempted answer to the question, and retry as necessary.
38% of biology questions were answered correctly with 1.5 re-tries per question (1-5 range).
37.5% of chemistry questions were answered correctly with 6.6 re-tries per question (1-19 range).
19% of physics question were answered correctly with 6.3 re-tries per question (1-18 range).
The researchers considered this to be a huge achievement! The system uses the sweet spot between logic and language to do something no other system before it could come close to. There was no single weak point that caused the system to give the wrong answer. Bad interpretation, bad query formation and missing knowledge all equally caused incorrect answers.
I just had a paper accepted for publication at this year's Knowledge Capture conference (K-CAP 2007). My paper is "A Methodology for Asynchronous Multi-User Editing of Semantic Web Ontologies". It will serve as the basis of my upcoming PhD thesis.
So, see you in Whistler, Canada in October.
I got a paper accepted at the ER2006 conference! That's the 25th International Conference on Conceptual Modeling, to be held in Tucson, Arizona, USA from November 6 - 9, 2006 (ER stands for Entity-Relationship, the age-old method of conceptual modeling in databases).
My paper on "Representing Transitive Propagation in OWL" was accepted in the peer-review process. A total of 158 papers were submitted and only 37 were accepted (a 23.4% acceptance ratio).
I got high marks for Originality and Presentation, but low marks for Significance (when I say "low", that means a "neutral = 4" rating, rather than an "accept = 6" rating; ratings were out of 7). That is fair enough. This research isn't the main, innovative, ground-breaking thrust of my PhD. It is just something interesting that came up as a side-idea.
There was a talk on "The Web Structure of E-Government - Developing a Methodology for Quantitative Evaluation".
The researchers from University College London (UCL) used several statistical measures for evaluating government websites: worst-case strongly connected components, incoming vs. outgoing links, path length between pages, etc. They compared their statistical measures with results from user evaluations. That is, they got a bunch of users together and measured how long it took them to find stuff on various websites (both with and without using Google).
They tested the UK, Australia and USA immigration websites. The results:
- UK is best, both navigating the link structure and searching
- AU is terrible to navigate, but good to search
- USA is bad any way you look at it, but at least search will eventually find you what you are looking for.
Automated statistics don't tell you much.
More info at: www.governmentontheweb.org
This was followed by a talk by Ian Pascal Volz from the Johann Wolfgang Goethe University in Germany. He talked about "the Impact of Online Music Services on the Demand for Stars in the Music Industry".
His main (and interesting!) finding is that people tend to buy music they already know and like from online music stores like the iTunes Music Store. Peer-to-peer file sharing networks, on the other hand, tend to get people to try and discover new music. Virtual communities are somewhere in between the two.
People who buy music will not spend any money on something they don't already know and value. Even $1 per song is too high a price for a casual purchase. If you want people to discover your music and you are unknown, it must be available for free.
On a related topic: when recording lectures on spiritual subject matter, please, please, please don't try to charge for them. No one will pay. Make them available for free. That way to the whole world will benefit.
And so ends the WWW2006 conference. Next stop Banff, Canada for WWW2007.
A presentation by some researchers from Karlsruhe, Germany was very interesting (well presented, too). They talked about their "semantic wikipedia", an extension to the popular MediaWiki that allows authors to express some semantics, i.e. to get at the hidden data within the articles.
The normal wikipedia only has plain links between articles. Nevertheless, it is the 16th most successful website of all time (according to alexa.com). However, in the semantic version every link has a type. Object properties map concepts to concepts and datatype properties map concepts to data values.
Why do it this way? Answers: adding these annotations is cheap and easy (no new UI), they can be added incrementally and there is no need to create a whole new RDF layer on top of the existing content, the annotations are right there in the wiki text.
This simple addition is enough to allow for powerful queries. You can create pages that automatically pull in all articles of a specific category, with a specific title and between a specific date range, for example. Checking for completeness becomes easier too: you can construct a query that tests if every Country has a Capital. If some countries come up that don't, those can be easily fixed.
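In miniature, that completeness check is just a query over typed-link triples. A toy sketch, with invented data and property names (not the wiki's actual query syntax):

```python
# Each typed link is a (subject, property, object) triple.
triples = [
    ("France",  "is a",        "Country"),
    ("Germany", "is a",        "Country"),
    ("France",  "has capital", "Paris"),
]

# Query: which countries have no "has capital" link yet?
countries = {s for s, p, o in triples if p == "is a" and o == "Country"}
with_capital = {s for s, p, o in triples if p == "has capital"}
missing = countries - with_capital   # here: {"Germany"} still needs fixing
```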
The whole thing self-regulates. Each property has its own page in the wiki, so that people can suggest property types and eventually come to a consensus about which properties are the right ones to use.
The wiki can be imported into OWL and vice versa. The template system can also be leveraged to quickly create semantic annotations.
The whole thing is a win-win-quick-quick scenario (bit of an in-joke there).
Over lunch I bumped into John Darlington, the former CEO of Active Navigation, a small company (spin-off from Southampton University) that I worked for a while ago. John is now working for Southampton University as a Business Manager and was involved in organizing the WWW conference.
Active Navigation was a very nice place to work. It had the atmosphere of a small start-up without the killer, passionate, burn-out, no-holds-barred pace.
The company creates a server technology that automatically injects hyperlinks into web pages, pointing to relevant, related pages on the same website. Website navigation can be improved by using these injected links. If, for example, someone creates a web page containing the word "ontology" and someone else has written a web page that also contains the word "ontology", then the server transforms those words into links to each other's web pages. Someone browsing the website could find the two related pages by clicking on the automatically created link.
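The core idea can be sketched in a few lines. This is purely illustrative; the real product is of course far more sophisticated about term indexing, relevance and avoiding double-linking:

```python
import re

def inject_links(html, term_to_url):
    """Rewrite a page so each known term becomes a hyperlink to its page.

    term_to_url maps an indexed term to the URL of the page about it.
    Matching is whole-word and case-insensitive; the original casing
    of the matched text is preserved inside the generated link.
    """
    for term, url in term_to_url.items():
        pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
        html = pattern.sub(
            lambda m: '<a href="%s">%s</a>' % (url, m.group(0)), html)
    return html
```

So a page containing "ontology" would, on its way through the server, pick up a link to the other page that mentions the same word.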
John called me over: "Julian! Wow, great to see you!"
Turning to Nigel Shadbolt next to him: "Julian here worked for me for a while, then disappeared into the ether, as you do, and now: I'm chairing a session (the one on education), look down and who do I see? Julian, asking a question!"
He suggested I might look into digital media production in New Zealand as a possible career path. Ever since Lord of the Rings that has apparently taken off in a big way down-under.
Harith Alani presented his position paper on building ontologies from other online ontologies. He explained how building ontologies is difficult, so it is best to reuse existing knowledge bases, or, even better, completely automate ontology construction. The current state of affairs is that there are quite a few ontology editing tools, but little support for reuse. Furthermore, these tools are built for highly trained computer scientists, not the average web developer.
His idea is to combine three existing research areas:
Ontology libraries (e.g. the DAML library, Ontolingua) and ontology search engines (e.g. Swoogle) can be used to locate ontologies on the Internet.
Ontology segmentation techniques (like mine) can be used to cut smaller pieces out of these ontologies.
Ontology mapping techniques can be used to reassemble the pieces into new ontologies.
Result: instant custom ontology. However, to get this working in practice takes quite a bit of doing. He himself admitted that it was quite an ambitious undertaking. Good idea though.
Mustafa Jarrar (from Belgium) and Paolo Bouquet (from Trento, Italy) presented the next two papers. They talked about a very similar topic. Both were advocating linking ontology terms to dictionary / glossary definitions.
It was interesting to observe these two researchers' presentation styles. Paolo was very fast and frantic, very much unlike Mustafa, who was very slow and relaxed, even when trying to hurry (Vata vs. Kapha, for those knowledgeable in Ayurveda).
Mustafa told of how he built a complex ontology for some lawyers, but, after he had gone through the trouble of carefully constructing this knowledge base, the lawyers found it to be too complicated to understand and threw everything except the glossary part away. However, they did really like and appreciate having a sensible glossary of all kinds of law-related knowledge.
He defined this "gloss" as:
auxiliary informal (but controlled) account for the common sense perception of humans of the intended meaning of a linguistic term
The glosses should be written as propositions, be consistent with the formal definition, focus on the distinguishing characteristics of what is being described, be sufficient, clear and easy to understand, and use supportive examples.
Advantages are that these glosses are highly reusable (very important for his lawyer clients) and that they are very easy to agree upon.
So everyone: link your ontology to WordNet (or something better)!
Paolo picked up the issue and talked about his WordNet Description Logic (WDL), an extension to DL that adds lexical senses to the vocabulary of the logic. It allows for compound meanings. So, UniversityOfMilan is automatically inferred to be a University that hasLocation some Milan.
Using this type of dictionary-link makes it possible to check for errors by comparing the glossary definition to the logical semantics. If they don't match, a potential error can be flagged.
His system also allows for bridging and mapping between ontologies. If two ontology concepts refer to the same dictionary definition, then that is a very good indication that they are describing the same sort of thing.
UK National Health Service (NHS): web-enabled primary care is finally coming, but is still super-clunky. And forget technology use in secondary care; it's non-existent. If only there were a central registry of patients' records. That would be really useful both for patients and for statistical medical research. It would also be very cost effective.
The NHS is spending £6 billion on modernizing its information technology. Unfortunately, despite being only about one year into the project, they are already £1 billion over budget.
I know from first-hand ontology building experience that the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), which is supposed to underlie this whole revamp, is an extremely poorly architected ontology. A disaster just waiting to happen.
USA Health IT: IT in health care could prevent some of the 90,000 avoidable annual deaths due to medical errors. Tests often have to be re-done, because it's cheaper to re-test someone than to find the previous lab results. We need to get rid of the medical clipboard!
Knowledge diffusion is super-slow. It takes 17 years (!) for observed medical evidence to be integrated into actual practice. Empower the consumer (while also providing privacy and data protection). Also, empower homeland security to protect us from the evildoers.
Most practices don't have Electronic Health Records (EHR). Those would enable some degree of data exchange between practices, which would benefit a practice's competitors. The patient would be less tied to one doctor. Less tie-in means less profit. So, in the fiercely competitive market of for-profit health care, there is little reason to go electronic.
However, SNOMED will help (... or so they say).
Now came the chance for up-and-coming semantic web developers to demo their killer applications. The apps that will revolutionize the Internet, on display.
Tim Berners-Lee (who uses a Mac, by the way) showed his Tabulator RDF browser. He gave a brief talk and demo of the app. It gives an "outline" style view of RDF and asynchronously and recursively loads connecting RDF using AJAX technology. It follows 303 redirects, follows # sub-page links, uses the GRDDL protocol on XHTML and smushes on owl:sameAs and inverse functional properties (the killer feature, apparently).
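Smushing itself is essentially a union-find over RDF nodes: merge anything linked by owl:sameAs, and merge any two subjects that share a value for an inverse-functional property (foaf:mbox is the classic example). A rough sketch of the idea, not Tabulator's actual code:

```python
from collections import defaultdict

def smush(triples, inverse_functional):
    """Return a find() function mapping each node to its merged representative.

    triples: iterable of (subject, predicate, object) strings.
    inverse_functional: set of predicates whose value identifies the subject.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    ifp_subjects = defaultdict(list)
    for s, p, o in triples:
        if p == "owl:sameAs":
            union(s, o)                      # explicit identity link
        elif p in inverse_functional:
            ifp_subjects[(p, o)].append(s)   # same (IFP, value) => same thing

    for subjects in ifp_subjects.values():
        for s in subjects[1:]:
            union(subjects[0], s)
    return find
```

After smushing, two nodes denote the same resource exactly when `find` returns the same representative for both.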
Some commented to me afterwards that they thought that no one should ever have to see the RDF of a semantic web application, let alone browse it. Oh well.
Then came DBin. Not just a browser, no, a semantic web rich client! It uses so called brainlets (HTMLS) and a new semantic transport layer (not HTTP) to dynamically query and retrieve RDF using peer-to-peer transfer.
Again, I'm skeptical. It is just (yet another useless) RDF browser that saves bandwidth by sending the data through a peer-to-peer network. But RDF file sizes aren't exactly huge and compression will do far more than peer-to-peer to help with bandwidth. This browser is solving a problem that doesn't exist.
Next up: Rhizome. A Python-based app that allows one to build RDF applications in a wiki style. It uses the Raccoon application server to transform incoming HTTP requests into RDF, evolve them using rules, and validate them using Schematron. In short, it is to RDF what Apache Cocoon is to XML. Or, in more understandable terms: you declaratively build your website using RDF for everything from the layout to the database.
A pity, of course, that no one uses Cocoon, and this Rhizome system looks really complicated, despite being pitched at "non-technical folks".
At this point I left the semantic web demo session. My thinking: these guys are nuts.
The system presented uses a shallow (read: simple) ontology to label areas of a web page according to their functional roles. It also creates a hierarchy of elements inside each area or module. The third component of the system is a Finite State Automaton for moving between functional states of the website.
Putting these three things together allows one to identify common trails of FSA transitions. That is, processes which users tend to perform regularly. Having identified these trails, one can cut out all the modules that do not contribute to the task. All useless clutter is eliminated from each web-page.
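Identifying common trails amounts to counting recurring sub-sequences of FSA transitions across user sessions. A toy sketch with invented session data:

```python
from collections import Counter

def common_trails(sessions, length=2):
    """Count every sub-sequence of `length` consecutive states across sessions."""
    counts = Counter()
    for session in sessions:
        for i in range(len(session) - length + 1):
            counts[tuple(session[i:i + length])] += 1
    return counts

# Hypothetical browsing sessions, each a sequence of functional states.
sessions = [
    ["home", "search", "results", "item"],
    ["home", "search", "results", "checkout"],
    ["home", "about"],
]
```

Trails that recur across many sessions ("home" → "search" → "results" here) mark the task users actually perform; page modules that never appear on those trails are the clutter that can be cut.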
Result: mobile web surfing could be accomplished twice as fast as before, and blind web surfing (using a screen reader) could be performed 4 times faster than before.
Future work: mining for workflows, using web services and analyzing the semantics of web content. Problems: coming up with a standard way to describe the process and concept models. A system for semantic annotation of web content is needed.
I was impressed. It sounds like a really good idea. It takes three relatively simple ideas and combines them into something innovative and powerful. Nice.
I attended a session on computing and education.
Tim Pearson said:
Schools in the UK spend just 1% of their budget on training and information technology. Businesses, in comparison, spend 3%.
Schools like the web. It means less Microsoft, less expensive in-school equipment and easy home access, and it is known to be modern/cool. Web 2.0 is great for lots of applications, but will never completely replace a rich client for: hardware access, serious graphical work, immersive virtual reality and complex process-based assessment.
Learning is transforming into something more self-driven, interactive, open-ended and creative. Teachers will spend less time lecturing and more time mediating.
In terms of administration: schools need to seriously look into getting some decent web-based admin, record keeping and curriculum planning applications. It is crazy that this kind of stuff is still often done by hand, or in an Excel spreadsheet.
Gordon Thomson from Cisco said:
The IQ is dead as a measure of a good student. Better: passion + curiosity!
Cisco is working on innovative teaching solutions such as Telepresence. Imagine having a 3D image of Bill Clinton projected into the classroom to give a speech on global warming. It's like Star Trek.
The laptop is overrated. The $100 laptop, for example, is seen as the panacea to bridge the digital divide. However, in a few years technology will become so omnipresent that it won't matter anymore. What really matters is, first and foremost, that parents are interested in their children's education.
Addressing the challenge of web-based e-assessment, Neil T. Heffernan talked about an online exam system he and his students built. It doesn't just assess students, but also offers hints and advice as students get questions wrong. It can also detect differences in performance over time as students learn. Teachers can use it to monitor their students, see which areas they are struggling with and then invest more time in explaining those in the classroom. Indeed, evaluation showed that student knowledge could be predicted very well.
I thought it was a very interesting and well-designed system. Looked good. It actually made answering math questions on a website kind-of fun.
Finally Elizabeth Brown presented her research on "Reappraising Cognitive Styles in Adaptive Web Applications".
We process information either visually or verbally, globally or sequentially, reflective or impulsive, convergent or divergent, tactile or kinesthetic, field dependent or independent, etc.
Focusing on the visual/verbal issue, she used the WHURLE adaptive hypermedia system to present students with a customized revision plan best suited to their individual learning style. However, after extensive analysis, she had to conclude that the adaptive learning environment made no difference whatsoever to students' performance. It might actually result in less learning, since if a student is only subjected to content that matches his or her individual learning style, then he or she will never learn to adapt to compensate for imperfect information. Students did say they liked the system, however.
The day started with Mary Ann Davidson, the chief security officer at Oracle Corporation and former Navy officer, giving a keynote talk on the critical issue of security.
She quoted the head of the department of homeland security in the USA as saying:
"a few lines of code can be more destructive than a nuclear bomb".
Poor security costs between $22 and $60 billion per year (National Institute of Standards and Technology). People would never accept it if we built bridges as poorly as we build software. Software developers need to be accredited and licensed professionals, like engineers are.
She ended with a quote from Thomas Jefferson (in a letter to George Hammond, 1792):
"A nation, as a society, forms a moral person, and every member of it is personally responsible for his society."
Then Tony Hey (former head of the Computer Science department at Southampton University and now working for Microsoft) talked about e-science, grids and high-performance computing and how Microsoft was getting into it. They would build simple grid services, based upon simple web services protocols. This will result in e-science mash-ups: people combining different services to perform a really useful new task.
His presentation was technically good, but used far too many words on far too many slides. With so much visual clutter, I stopped listening to him half-way through.
In the evening there was a food and drinks reception at the Edinburgh Castle.
The castle was impressive. Very large and imposing. I could literally feel the history of the place. Many, many wars were fought on its mighty walls. The entire city of Edinburgh has a unique ancient feeling to it. Of course, not everything was awe-inspiring. The dog cemetery, for instance, was laughable (sad, sad, sad).
The reception (price of admission = £50) involved pretty waitresses walking around with trays of expensive wine and hors d'oeuvres for everyone's enjoyment and nourishment. However, there was far too much wine and far too little food. Every time a food tray appeared, the poor waitress was jumped upon by a crowd of hungry researchers and raided for all she (or, more accurately, her food tray) was worth.
The food was completely abominable, too. Various varieties of dead animals. The only vegetarian options I saw were plates of deep-fried mushroom balls. Yum. Needless to say, I didn't eat or drink anything, nor did I have much opportunity to.
As the night wore on the who's who of the World Wide Web became more and more drunk. Give famous and powerful innovators, researchers and academics lots of free alcohol and they turn into "high-class" swaying, stammering simpletons. The British are especially renowned for their joy in and expertise at getting themselves utterly and completely drunk. It is, after all, the supreme form of enjoyment.
It was however a good opportunity to meet and rub shoulders with like-minded people from all over the world. I met lots of folks from my alma mater, Southampton University. However, with 1200 delegates attending, it was a bit too overwhelming. With so many people it is difficult to get to know anyone.
Feel free to browse the pictures of this event, as well as the rest of the conference here.
Right after the opening keynote came the ontology research track. This track included my presentation on ontology segmentation.
Peter Patel-Schneider gave the first talk in the session. It was a position paper presenting a new idea. He explained how the open-world semantics of classical logics was better suited to the wide-open world wide web, but the well-known datalog (database) paradigm also had some useful attributes. The idea therefore was to merge the paradigms of datalog and classical description logic into an ideal hybrid web ontology language.
Sounded like a good idea to me. Unfortunately, he didn't see himself actually doing any work to realize this idea anytime in the foreseeable future.
Then came my presentation. I wasn't half as nervous as I thought I would be, despite there being over 100 people in the audience.
Unfortunately, I forgot to record the talk. Fortunately, I re-did the talk and you can view and listen to a quicktime movie of it.
I got lots of questions, though mostly of the "please explain this semantic web thing to me, I don't understand anything" variety. Many people were also taking pictures of my presentation slides with their digital cameras throughout my talk. I was one of the few people to use the Apple Keynote software to give my presentation. This software allowed me to produce slides that were vastly better-looking than the usual death-by-powerpoint variety.
I got lots of feedback afterwards. Here a few things people said to me:
- I liked it and understood it; you explained it well.
- I liked the little story in the middle of your talk.
- Well done, the story didn't go overboard. It lightened the mood, but wasn't overblown and made a good point.
- Well done answering questions about what OWL is. I would have just told him to read the paper.
- I really liked your presentation.
- That was really interesting; this problem you are working on is part of a larger issue of academics who just work on toy examples and never consider large-scale problems.
- I actually understood your talk, unlike the other two talks in the session, thanks.
- I think your algorithm is flawed because of the irregularity in the data on slide number 21! (I proceeded to explain how a depth-first search strategy in my implementation would potentially result in such irregularities and that a breadth-first search would have resulted in a more regular, linear graph) Oh, now I understand; it was a really good talk.
Wendy Hall started off the day talking about the trials and tribulations of organizing the conference. She had to put up a £0.5 million deposit to secure the conference center three years in advance. She could have kissed her career goodbye, if this conference had not been a success.
Next Charles Hughes, the president of the British Computer Society (BCS), spoke. He gave an utterly boring scripted speech about how computing needs to become a respected profession.
Carole Goble then spoke about the paper review process. The conference was super-competitive. 700 papers were submitted, over 2000 reviews issued, and only 84 papers accepted (11% acceptance ratio).
Thereafter came a panel discussion on the next wave of the web. Important people from research and industry talked about the semantic web. Business wants TCO figures, risk measures, abundance of skilled ontology engineers and stuff like that. Academia underestimated the amount of work necessary (and wants more grant money).
Ontologies can be used today: they are especially useful for unstructured information and to organize already structured information in database tables.
Tim Berners-Lee brushed off Web 2.0 as just hype. That's just AJAX and tagging. Folksonomy is not going to fly in the business world. The real, hard-core Semantic Web is where it's at. What's more: we're already there. We've reached critical mass, but just haven't realized it yet. All we need is for the right search engine to "connect the dots" and boom! Instant semantic web via network-effect (or something like that).
The right user interface is going to be the most difficult part. Browsers will need an "Oh yeah? Why?" button to query the RDF and give a justification for any entailment.
"Don't think of the killer app for the semantic web, think of the semantic web as the killer app for the normal web"
The value of the semantic web will be universal interoperability and findability. We have more information than ever before and are spending longer trying to find stuff. The semantic web will help automate some of the "finding stuff". The search engines of today aren't sufficient: try searching for information on Exxon Mobil, for example. That will return millions of hits.
Tim: "search engines make their money making order out of chaos, if you give them order, they don't have a business. That's why they are not interested in the semantic web"
Take home message from the panel:
- "you ain't seen nothing yet"
- "a lot of education still has to go on. It needs to get simpler for the average business person and there needs to be a lot more investment"
- "we can already apply the first results in a business context"
- "it's a great simplifying technology"
My take: they are quite right, we have indeed not seen anything yet ... if nothing else they certainly succeeded in securing the next 5 years of grant money ...
The World Organization of Webmasters tutorial session offered a chance to take an exam to become a certified professional webmaster. I thought, "what the heck": the exam normally costs $195 to take and here at the conference they are offering it for free, so I might as well give it a go.
The exam wasn't easy. One needed to answer 70% of the questions correctly to qualify as a professional webmaster. There were some tough questions. A typical question would be something like:
Which of the following is valid XHTML 1.0 / HTML 4.0 (mark all that apply):
a. <img src="image.gif" alt="the image" height=25 width=25 />
b. <strong><a href="link.html">click here</strong></a>
c. <DIV CLASS="style.css">text</DIV>
d. <img src="picture.jpg" alt="my picture" />
e. <hr><a href="page.html">next page</a><hr>
f. (none of the above)
Bill Cullifer was impressed with the exam results. Most people did extremely well. He commented that the individuals present were obviously the top people in the world in the Internet field.
I passed, of course. I'm now a WOW Certified Professional Webmaster.
Along the same lines as Web 2.0 comes eXtreme Programming. This new philosophy of how to program has 12 basic principles:
- Pair programming: two people to one screen. This is easier than it sounds. Software engineering is a very social activity, so pairing up is only natural. Pairings change naturally over time, sometimes several times a day. This practice helps introduce new people to the team, creates shared knowledge of the codebase and (most important) greatly improves the quality of the resultant code, while only minimally reducing productivity.
- On-site customer: the customer is present throughout the development process. No huge requirements documents that no one reads. This means that the customer must always be reachable to ask about a design decision. A programmer with a question should not have to wait longer than an hour for an answer.
- Test-first development: write the tests first and then create the program until all the tests pass.
- Frequent small releases: the most important principle. Release a working product at some small fixed period. A beta every two weeks, for example. The customer always has something tangible to use and give feedback on. No big-bang integration.
- Simple design through user stories: simple 3x5 cards to capture requirements. These serve as a contract to further discuss the feature with the customer and find out exactly what they want.
- Common code ownership: anyone on the team can change anything in the codebase (relies and builds upon test-first development and pair programming)
- Refactoring: if you need to change something, do it!
- Sustainable pace: work no longer than 40 hours a week. No burn out.
- Coding standards
- Continuous integration
- Planning game
- System metaphor
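The test-first principle above can be sketched in a few lines (the function name and behavior here are made up purely for illustration): the failing test is written first, then just enough implementation is written to make it pass.

```python
# A minimal sketch of test-first development. Step 1: write the test
# before any implementation exists; at this point it would fail.
def test_word_count():
    assert word_count("") == 0
    assert word_count("pair programming works") == 3

# Step 2: write the simplest implementation that makes the test pass.
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

# Step 3: run the test; it now passes, and stays in the suite forever.
test_word_count()
```

With common code ownership, this growing test suite is what lets anyone on the team change anything: if the tests still pass, the change is safe.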
David Leip from IBM and David Shrimpton from the University of Kent talked about Web 2.0. The Web 2.0 phenomenon is exemplified in the difference between mapquest and google maps, ofoto and flickr, britannica online and wikipedia, personal websites and myspace, stickiness and syndication, etc. The value of a website can no longer be measured by how many people visit it. Instead people can subscribe to feeds off the website and get all the benefits without ever actually visiting the site.
Websphere is IBM's Java Enterprise Application Server. Its biggest competition no longer comes from products like BEA WebLogic, but instead from Amazon. Amazon offers people a virtual e-marketplace that handles all the accounting, advertising, searching, buying, selling and refunds. All you have to do is set up the user account and use their APIs. Very easy and very cheap; very Web 2.0.
Another Web 2.0 phenomenon is the perpetual "beta". A product is never finished, but rather is continuously re-evaluated and refined. Updates can be pushed to all users, since the entire application lives on the web.
New applications create buzz by being genuinely fun to use. Google Maps delights its visitors. The wow-factor makes people stay loyal. However, as soon as things start to go wrong, people will very quickly switch to using another service that works. Word of mouth is the way! Google never advertises; they don't have to.
Web 1.0 was all about commerce, Web 2.0 is all about people (what Web 3.0 will be is still written in the stars). The myriad number of WS* standards may be useful and necessary for the enterprise, but any normal person will be totally bewildered by WS*-standards vertigo. Web 2.0 is about the people taking back the Internet.
In the Web 2.0 world accessibility matters. Don't use red and green together on a web page; some people are color blind. Use XHTML and CSS; some people use screen readers.
Exclusive, hierarchical, fixed taxonomies are out. Flexible, flat, multi-tag, emergent folksonomies are in.
Microformats decree: humans first, machines second. They are the lower-case semantic web. They use simple semantics, adding to the stuff that's already there, instead of inventing this hugely complicated description logic stuff (that I'm working on). Microformats are cheap, easy and, as long as people agree on them, they can be just as powerful and interoperable as if you had created a full XML-Schema monster. More at microformats.org and programmableweb.com.
The second day of the WWW2006 conference started with Les Carr saying how super-excited he was about everything in the upcoming conference. Les was one of my former teachers back in Southampton University. He is the one who encouraged me to submit a paper for WWW2006.
Then the first minister of Scotland got on stage and gave a talk, singing the glories of mother Scotland. He talked about how the great country of Scotland, with its devolved parliament and independence from oppressive England was making great strides in the world. No nation is more illustrious!
Sir David Brown, the chairman of Motorola Ltd. gave a speech. He recalled how he estimated ten years ago that there might be 900,000 mobile phones sold every year. Now there are 900,000 mobile phones being sold every 19 hours. He was 46,000% wrong! But at least he was 46,000% wrong in the right direction.
Mobiles are the 4th screen, he said. The computer desktop, the living room, the car and the mobile make up the places where we consume media. The future is personalized content anywhere and anytime. The device formerly known as the mobile phone will be central to this ubiquitous media revolution.
Globalization is good. It's a chance for a positive-sum gain for everyone. Smart countries will use communication technology to combat outsourcing of manufacturing by "insourcing" logistics control. For example, there is no reason that a manufacturing plant in China can't be managed and controlled remotely from the UK.
On to socioeconomics: there will be an estimated 930 million new mobile phones in developing countries by 2008. The proliferation of low-cost mobile devices everywhere will lead to drastically increased economic output from developing nations. Technology innovation will be followed by business innovation, which will be followed by renewed technology innovation, and so on in a spiral of economic growth. More money for everyone! This will create better health, better education, better lifestyle and a better world.
What Sir David does not realize is that with increased economic development there also comes greatly increased suffering, stress, mental illness, pollution and war. As my spiritual master has said: "vaisyas (businessmen) can not be the leaders of any working society, material or spiritual"
Tagging is also being used in the enterprise. IBM has added tagging to its internal contact management system: Fringe Contacts. IBMers are connected by location, projects, position in the organizational hierarchy and now also by the tags they give each other. For example, everyone attending the chi2006 conference might tag themselves, or get tagged with that tag by a co-worker. By collecting all the reverse links one can easily build a list of all attendees, something that would otherwise have been very difficult in such a large organization. No single person has to maintain the list. It is updated organically.
The researchers noticed that the most interesting tags were those used by lots of people on a small number of people. These kinds of tags describe special expertise that few people have. They can be used to identify special skills in the company.
Avaya labs has a similar system. They used to use a system of broad categories (e.g. tech, development, marketing, etc.) and skills. Every employee was tasked with keeping their own user profile up to date. However, inevitably, people got lazy, forgot to update their profiles and the system became useless.
Tagging collects dynamic user categories by the social relationships that already exist in the company. Changes in people's interests and people learning new skills are reflected in the collective tag cloud.
The talk by the lady from Avaya was somewhat difficult to understand. Loads of text on each slide and a virus scanner constantly coming up during the presentation, blocking the view, all made it very difficult to follow what she was saying. The slides might as well have not been there. Lesson for her to learn: less is more.
Mitre corporation created a system called Onomi. This enables social bookmarking, networks of expertise and information sharing. It integrates with del.icio.us, LDAP, email, RSS, SOAP and intranet URIs. They now use it as a replacement for email when telling people about something interesting. 18% of the workforce are using it. Most were attracted by a banner ad on the Intranet, as well as by selective announcements to specific user groups.
Yahoo has developed an AJAX tagging engine that suggests tags. This reduces the overlap between tags. If you tag something with one tag, all related tags will be pushed way down on the selection list. It also helps eliminate tag spam. If you use good tags (those used by many other people) those tags get a higher value (in a mutual reinforcement, HITS-algorithm style). It also rewards original tags. People who introduce tags that later become popular are awarded a higher "importance" score. A further advantage is that users don't need to come up with their own tags.
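The mutual-reinforcement idea can be sketched roughly like this (a toy illustration of the HITS style of scoring, not Yahoo's actual engine): a tag is good if good users use it, and a user is good if they use good tags, iterated to a fixed point.

```python
from collections import defaultdict

def score_tags(taggings, iterations=20):
    """Toy HITS-style mutual reinforcement over (user, tag) pairs:
    tag scores and user scores reinforce each other iteratively."""
    user_tags = defaultdict(set)
    tag_users = defaultdict(set)
    for user, tag in taggings:
        user_tags[user].add(tag)
        tag_users[tag].add(user)

    user_score = {u: 1.0 for u in user_tags}
    tag_score = {t: 1.0 for t in tag_users}
    for _ in range(iterations):
        # a tag's score is the sum of the scores of the users who use it
        tag_score = {t: sum(user_score[u] for u in us)
                     for t, us in tag_users.items()}
        norm = sum(tag_score.values())
        tag_score = {t: s / norm for t, s in tag_score.items()}
        # a user's score is the sum of the scores of the tags they use
        user_score = {u: sum(tag_score[t] for t in ts)
                      for u, ts in user_tags.items()}
        norm = sum(user_score.values())
        user_score = {u: s / norm for u, s in user_score.items()}
    return tag_score
```

A tag used by many (well-scoring) users ends up with a higher value than an idiosyncratic one, which is what lets the engine rank suggestions and damp tag spam.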
Another presentation by Yahoo research was on combining ontology and flickr tags. Tags are like a dynamic, shifting namespace, very different from a static controlled vocabulary. The lack of structure makes it difficult to hunt and search for content, but lends itself well to random browsing and accidental discovery.
Introducing simple subsumption between tags helps highlight that London is in the UK, for example. People will put in hypernyms from the middle band of an ontology themselves, e.g. golden retriever and dog. But the high-level hypernyms are too obvious, so people forget to add them (e.g. London and UK). Luckily, these kinds of high-level relations are well defined in ontologies. A combination of upper-level ontologies with low-level tags seems to be a promising area of research.
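The idea can be made concrete with a tiny sketch (the hypernym table here is hand-made and purely illustrative; in practice the relations would come from an upper-level ontology): expanding each tag along its hypernym chain makes a photo tagged "london" also findable under "uk" and "europe".

```python
# Illustrative hypernym (subsumption) relation, child -> parent.
# In a real system this would come from an upper-level ontology.
HYPERNYMS = {
    "golden retriever": "dog",
    "dog": "animal",
    "london": "uk",
    "uk": "europe",
}

def expand_tags(tags):
    """Return the tag set closed under the hypernym relation."""
    expanded = set(tags)
    frontier = list(tags)
    while frontier:
        parent = HYPERNYMS.get(frontier.pop())
        if parent and parent not in expanded:
            expanded.add(parent)
            frontier.append(parent)
    return expanded
```

This is exactly the piece users forget to do by hand: they tag "golden retriever" and maybe "dog", but the ontology supplies "animal" for free.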
Some people from the steve.museum tagging project gave a talk on how the professional museum curators were very good at describing some things about the museum exhibits and terrible at others. The difference between the professional and layman taggers was staggering.
I attended the World Wide Web 2006 conference in Edinburgh, Scotland last week. It was really interesting. Lots of knowledge on the future of the Internet. Here is what I learnt:
The first day I went to a workshop on tagging organized by Yahoo and RawSugar.
Tagging is the act of annotating something with a keyword. On the Internet anyone can tag. It puts the user in control. Tagging becomes useful when it happens on a large scale. Tags can be aggregated and organized into sets (like in flickr, youTube and technorati). A good tag set will cover as many facets as possible, e.g. music, artists, song, band, etc. People don't think "definition" when they tag. A tag can express an emotion, an insight, a gut reaction, anything. People are willingly telling us how they feel about something. That's part of the power. It's metadata for the masses.
Tagging works because it does not involve the high brain functions of conscious sorting. It does not force people to make a choice (does skiing belong in the "recreation" or "sport" category?), since things can have any number of tags. This kind of free, loose association is cognitively easy and takes less time. However, categories are arguably more memorable than tags, because you have had to make more of a mental effort to add the category.
Tags can also count as opinion votes. Multiple instances of a tag are collected in bags of tags and determine how interesting a webpage, piece of music, photo, or any other taggable resource is (like in lastFM, My Web and delicious).
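The bag-of-tags idea is nothing more than counting (a toy example with made-up users and tags): each occurrence of a tag acts as a vote, the total volume is a crude interest score, and the most common tags form the consensus description.

```python
from collections import Counter

# Everyone's tags for a single photo, as (user, tag) pairs.
taggings = [
    ("alice", "sunset"), ("bob", "sunset"), ("carol", "sunset"),
    ("bob", "beach"), ("dave", "blurry"),
]

bag = Counter(tag for _, tag in taggings)  # the "bag of tags"
interest = sum(bag.values())               # total votes = rough interest score
top_tags = bag.most_common(2)              # consensus description of the photo
```

Here "sunset" wins with 3 votes, so the photo surfaces under that tag first; a lone dissenting tag like "blurry" stays in the long tail.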
Tagging gives a sense of community. Like when playing a massively multiplayer online role-playing game like World of Warcraft, it gives a sense of "alone together". As described in the book The Wisdom of Crowds, this leads to more cognitive diversity and less group-think; it reduces conformity, reduces the correlated effects of individual mistakes, encourages new viewpoints, leads to less herd behavior and encourages participation.
Benefits of tags are:
- better search
- less spam / ability to identify genuine content
- ability to identify trends and trend setters
- a metric of trust
- ability to measure how much attention a resource is getting
- helps filter by interest (really works!)
Tagging is however limited in that people very rarely tag other people's stuff. Most tags are added by the content author. Tags are also often not very prominent, nor identified and collated in one's account.
Tags also lack structure and semantics. They exist in a large cloud, not an ordered hierarchy. Synonyms and polysemy can lead to a vocabulary explosion.
Search is a pull mechanism. Search engines need to go out and crawl the web to index all the content. This can take days. Tagging is push. Blogs notify the search engines when there is something new to be had. Readers can be notified of new content the very second it appears.
It is difficult to add tagging to an existing system. Amazon tried this and failed. There has to be a clear role for tags. They have to provide some tangible benefit. The best tagging systems highlight unique contributions, give users control, allow for smaller tag-related sub-groups and allow for personalization.
Tagging can be described as going for a hike in the woods, or picking berries, while categorization is more like driving a car, or riding a rollercoaster.
Most people would consider this temporary relief from suffering true happiness. Suckers!
Take a look at some of the pictures of the Mother Nature's beautiful artistry in Banff, Canada.
Today, over breakfast, I was at a table with various high-powered researchers. One of them has been up all night writing an "emergency paper" for the boss of a friend. The topic of schmoozing came up.
They enlightened me that it is very important to compliment even the most senior speaker on their keynote presentation. They may seem all-powerful and supremely intelligent, but, in reality, they are just as insecure as everyone else about whether they did a good job and whether people liked their talk. The trick is to boost their ego, become their friend and get them to help you out.
Research is mostly funded by various government agencies (EPSRC and JISC in the UK and DARPA and NASA in the US). At big conferences there are invite-only "brainstorming" sessions where the agency's officers discuss with the researchers what the next big research grant should focus on. This is a chance for the University professors to argue that their line of research is best and should be funded (even if it isn't ... in fact: especially if it isn't).
The key in these brainstorming sessions is to inject one's ideas into as many other people's minds as possible before these meetings. It's a horrible thing to do and one may have to take a shower afterwards to wash off the slime, but the more people argue one's case, the better the chance of getting the money.
However, in the end, all this is somewhat of a pretense. The actual decision is made in the pub after the session. The grant officers will give the contract to their friends. Their friends are their drinking buddies. The really successful researchers are those that manipulate the social scene to make everyone their friend. For example, people like Wendy Hall and Nigel Shadbolt are primarily famous not because they are brilliant researchers (though, of course, that must also be there), but because they know everyone and everyone knows them.
What if you don't drink? Well, better start soon.
It works the same in most industries. Film producers for example spend most of their time in the five year production cycle of a film going to cocktail parties meeting the potential funders, potential actors and potential directors. They negotiate the production crew over a few drinks. Sometimes a key member will pull out of the agreement and they need to go to more parties to recruit new staff.
Ministers in the Greek government spend most of their time at the ministry drinking coffee with one another. They do this because they need to know that they can pick up the phone, talk to a friend, ask for a report and get it delivered to them next morning.
In the UK and USA beer replaces coffee. Each country has its own style.
When one then finally has the grant money one often can't spend it fast enough. If one doesn't spend all of the money one has been granted, then one obviously didn't need it in the first place, so one will get less next time. Some projects therefore need to get very creative in how they can burn money. They will, for example, finance trips overseas for the entire research group. Even then, sometimes one simply cannot spend enough of the government grant money. In such cases one needs to extend the grant due to "staffing issues". In other words, in order to fudge the records one, once again, needs to be in cahoots with the right people.
Today Udo Hahn gave an interesting presentation on a new method of extracting technical terms from a large text corpus. Traditional methods work by statistical analysis of how often a phrase occurs. His new method uses limited paradigmatic modifiability to test the frequency of each single word of a given phrase and thereby compute how likely it is that the phrase is a genuine term and not just a chance combination of frequently used words. The new p-mod method beat the t-test and C-value methods in testing on the UMLS metathesaurus. Supplementary tools used were the GENIA POS tagger, the YAMCHA (support vector machine) chunker and a stop-word filter.
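The intuition behind paradigmatic modifiability can be sketched roughly like this (a deliberate simplification with made-up counts, not the exact published p-mod formula): for every choice of k slots in a candidate phrase, measure how much of the frequency mass the candidate keeps when those k words are allowed to vary. Genuine terms resist substitution; chance word combinations do not.

```python
from itertools import combinations

def p_mod(phrase, ngram_counts):
    """Loose sketch of limited paradigmatic modifiability for an
    n-word candidate term. `ngram_counts` maps word tuples to corpus
    frequencies. For each choice of k "free" slots, the candidate's
    frequency is divided by the total frequency of all phrases that
    agree with it on the remaining fixed slots."""
    words = tuple(phrase.split())
    n = len(words)
    score = 1.0
    for k in range(1, n):                  # let k of the n slots vary
        for free in combinations(range(n), k):
            total = sum(count for p, count in ngram_counts.items()
                        if all(p[i] == words[i]
                               for i in range(n) if i not in free))
            score *= ngram_counts[words] / total
    return score
```

On toy counts, a term-like trigram such as "gene expression profile" (which dominates its substitution classes) scores far higher than a frequent-but-arbitrary sequence like "in the group", which bleeds mass to "of the group", "in a group", and so on.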
Some US Army and IBM researchers were experimenting with ways to detect if a particular speech contained a story. Their vision is to attach small recording devices to every soldier and automatically record the war stories they tell. Stories are the best way to entice people to take up military life, entertain them, keep up their morale and record the "human" side of military service. They used the WEKA toolkit to rapidly try out different machine learning algorithms and ultimately settled on support vector machines with polynomial kernels. The classifier would be used in real time on textual speech data transcribed by IBM ViaVoice 10. Certain kinds of figures of speech indicate a story is being told. The SVM was therefore trained to recognize the structure and grammar of story-speech. Ultimately, they failed in their experiment. The speech recognition was only about 70% accurate, which wasn't high enough to accurately distinguish stories from regular conversation.
Carole Goble from Manchester (the co-leader of the IMG research group) gave the closing keynote presentation. She talked about the Montagues and Capulets, the two families from William Shakespeare's Romeo and Juliet. The Montagues are equivalent to the logicians and knowledge engineers in the realm of research. Ian Horrocks, for example, falls squarely into this camp. They are interested in the cool technology, advanced tools, logical rigor, writing research papers, and solving the interesting (though often not practical) problems. The Capulets, in contrast, are the biomedical researchers such as the people that created the Gene Ontology (GO). They don't care about the theory, but do care about solving practical problems. They also tend to be better at the social engineering necessary to get people to actually use the tools they provide. A third camp is the philosophers (like Barry Smith), who say that everyone else is doing everything completely wrong, but don't offer any practical advice or help in how to do it better. Her conclusion: let's not all kill each other and instead try to work together and have a happy ending.
Need: a seamless ontology authoring and annotation tool that lets people annotate data and extend the ontology at the same time. At the moment we not only need to switch between tools to accomplish this task, we also need to switch between people. Currently only the biologists can do the annotation and only the logicians can build the ontologies.
Jim Hendler's principle: "A little bit of semantics goes a long way". Just using OWL as a common knowledge interchange format is of great benefit to the e-science community.
This evening was the official conference banquet at a restaurant called "the Keg Steakhouse" (groan). The conference organizers had informed them of one vegan guest within the dinner party. One of the waiters asked me if it was me and joked that he wouldn't tell anybody. He considered it quite a ridiculous idea. Nevertheless, they had prepared a special meal for me: tofu in soy sauce appetizer, green salad with tomato and raw peppers, brown rice with little bits of chopped vegetables mixed throughout, no dessert (the idea of a vegan cake/dessert was completely beyond them). These people really need to learn to cook! I guess they specialize in killing innocent animals and distilling poisonous liquids.
More interestingly, I got a chance to talk with a professor from Jena Universität in Germany. He is at the forefront of automated text mining and natural language processing (NLP) research. The next day he gave a very interesting presentation on automatically extracting the important technical terms from a large corpus of text.
The professor was talking about his lifestyle. He loved the isolation of the New Zealand South Island, which he has visited three times. Untouched nature. Not a human in sight for miles.
This is very much in contrast to Tokyo, Japan. In Tokyo everything is grey. You cannot tell where you are. Grey concrete everywhere. He was staying on the eighth floor of a hotel and the motor-highway was just three meters away from his window. How so? In Tokyo, due to lack of space, they stack their highways vertically. Outside his window was the fourth level of a super-highway. A true vertical city. Even at 3am there was continuous traffic on a seven lane highway going into the city. After all, the 36 million people in the world's largest city need to somehow be fed every day. Metropolitan life in the very extreme. I wonder what it does to the people?
Still, he was attached to life in Europe. He would never want to live anywhere but there. The cities have so much more history than anywhere else. Each place has a distinct history and personality.
Life as a professor isn't rosy. He travels around the world presenting his research in so many exotic places, but doesn't have any time to enjoy them. Here he is in Canada, but doesn't have time to enjoy any of the sites, because he is too busy preparing his next presentation. Giving a keynote address at a conference is a great honor, but giving five of them per year very quickly turns into a burden. Then there is reviewing other people's papers. Well known researchers need to review their peer's work. For example, he needs to write an elaborate explanation for each research paper from Asian researchers which doesn't meet the western standard of innovative research. Japanese researchers tend to take a too mechanistic approach to research, which doesn't teach anyone anything new. Then there are the many academic funding committees. He needs to help determine if a particular project gets government research grant money. On top of that comes his own research. He needs to write and publish papers of his own to stay in business. Then, of course, comes the job of teaching his students. PhD and Masters students need to be supervised. Undergraduates need to be lectured to and their exams marked. Sometime between all of that there is (maybe) a little thing called family life.
Still, such a life certainly isn't boring. Discovering truly new things and significantly enhancing the knowledge of humanity has its appeal.
Today Nuno, a researcher from Porto, Portugal, asked me about my distinct hairstyle (sikha) and why I seem so peaceful and relaxed. While asking he was constantly apologizing, thinking that I might be offended. I told him a little bit about Krishna consciousness.
One presentation was about image analysis on 3D cell slices. Matlab's image toolkit is very good for this purpose. The researchers from Amsterdam used RuleML to capture shape classification rules from medical image interpretation experts. However, they suggested using SWRL instead, since RuleML is quite a clunky rules language. Post-presentation questions raised the issue of rules vs. machine learning. Many people preferred the neural net approach, though a few people defended rules as they allow for better provenance, logging and examination.
Pat Hayes presented the COE ontology editor. This was originally a concept map creation tool, but has been expanded into a fully featured graphical OWL ontology editor. The major advantage COE has is that it is very intuitive to use. Like HTML, people can "view source" on ontologies and "steal" other people's designs/modeling tricks. COE doesn't work with ontologies larger than about 2000 classes. This is another area where my segmentation work might come in handy.
There was a panel discussion about machine learning vs. manual knowledge capture. The conclusion was to do both:
- Improve the volume of manual knowledge capture by mass-collaboration
- Automatically capture knowledge and manually clean up any mistakes (in this case it is very important to use codes that indicate where a particular piece of knowledge came from)
- Use manual methods to guide (but not haul) large-scale knowledge acquisition methods
Revolutionary concept: make knowledge capture fun by making the task into a game. Carole Goble in particular was very impressed by this idea from Tim Chklovski from USC. She intends to build this into her bio-annotation tools.
An interesting presentation was about estimating the health of pigs by the consistency of their feces. The researchers worked with veterinarians to build a Bayesian network of external circumstances and pig disease. The interesting part was their use of a combination of statistical data and expert rules of thumb. They used isotonic regression to bias the statistical data to match their expert's intuitions. Ultimately, the graphical structure of the Bayesian network matters much more than the exact probabilities on the nodes.
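My notes don't record the paper's actual data or formulas, but the core trick — forcing noisy statistical estimates to respect an expert's monotonic rule of thumb — can be sketched with the pool-adjacent-violators algorithm that underlies isotonic regression. All numbers and the consistency-score setup below are made up for illustration:

```python
def pav_increasing(y):
    """Pool Adjacent Violators: return the non-decreasing sequence
    closest (in least squares) to the input sequence y."""
    merged = []  # list of [block_mean, block_weight]
    for v in y:
        merged.append([float(v), 1.0])
        # Merge backwards while adjacent blocks violate monotonicity.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            merged.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for mean, w in merged:
        out.extend([mean] * int(w))
    return out

# Hypothetical expert intuition: disease probability should not
# decrease as the feces consistency score (1..8) worsens.
# Raw frequencies from sparse data dip non-monotonically at 0.08 and 0.25:
raw = [0.05, 0.10, 0.08, 0.20, 0.30, 0.25, 0.40, 0.55]
fitted = pav_increasing(raw)
# fitted is now non-decreasing: the violating neighbours are pooled
# into shared averages (0.10/0.08 -> 0.09, 0.30/0.25 -> 0.275).
```

This is only one way to realise "bias the statistics toward the expert's intuition"; the paper presumably did something more sophisticated, but the monotonicity constraint is the essential idea.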
Pat Hayes, a famous AI researcher, started off today's conference day. His keynote, while somewhat entertaining and somewhat insightful, was extremely scattered and altogether gave the impression that he had prepared it the night before (which indeed he had). He talked about his "9 deadly sins of AI". These are as follows (and yes, I know there are only four):
Not wanting to accept that the ship has sunk: some researchers still hang on to trying to make techniques and ideas work that were bad ideas when they were first invented and have caused no end of trouble since.
Worshipping philosophy (or, for that matter, worshipping anything): philosophy is useful, but it is a different field to knowledge representation. Just because something is important in philosophy doesn't mean that we have to pay any attention to it in KR.
Taking paradoxes too seriously: A logical paradox is just a humorous distraction for a Sunday night. Just because Kurt Gödel's incompleteness theorem shook the very foundations of logic and mathematics doesn't mean that a paradox is something we have to worry about in practical systems. Yeah, so OWL-Full allows for paradoxes. Just don't create them and stop complaining about it.
Worshipping logic: (first-order) Logic is attractively simple. Everything in the world can be expressed using AND, NOT and FOR-ALL. However, this is too much of an abstraction from real useful things. It requires too large a framework of axioms on top of it to make it do something useful. We should push more expressivity into the logic layer, thereby bringing it closer to the ontology layer.
Other topics of today:
Nokia and Airbus are working together to shorten their product development feedback cycle. They want to create more mature (useable, useful and acceptable) products more quickly. They aim to achieve this using a system of active documentation. Documentation not just for the sake of it, but in order to involve all project stakeholders in the design, prototype, evaluation and requirements capture processes.
Harith Alani uses four measures for ranking ontologies returned from an ontology search engine:
- Class match: the degree to which the searched for terms are present in the ontology
- Centrality: how close the search terms are to the middle of the is-A hierarchy
- Density: how much information content there is on the search terms (restrictions, etc.)
- Semantic similarity: how many links need to be followed from one search term in order to reach another
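My notes don't capture how Harith combines the four measures into a final ranking; assuming each measure is normalised to [0, 1], a plausible aggregation is a simple weighted sum. The measure values, weights and ontology names below are all invented for illustration:

```python
def rank_score(measures, weights):
    """Combine per-ontology measure scores into one rank score
    as a weighted sum over the four measures."""
    return sum(weights[m] * measures[m] for m in weights)

# Hypothetical weights favouring class match over the other measures.
weights = {"class_match": 0.4, "centrality": 0.2,
           "density": 0.2, "semantic_similarity": 0.2}

# Hypothetical scores for two candidate ontologies from a search.
candidates = {
    "onto_a": {"class_match": 1.0, "centrality": 0.6,
               "density": 0.5, "semantic_similarity": 0.7},
    "onto_b": {"class_match": 0.8, "centrality": 0.9,
               "density": 0.4, "semantic_similarity": 0.3},
}

# Order the ontologies by descending combined score.
ranked = sorted(candidates,
                key=lambda o: rank_score(candidates[o], weights),
                reverse=True)
```

With these made-up numbers, onto_a (0.76) outranks onto_b (0.64) because class match carries the largest weight.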
Harith also mentioned that there is a Java graph library called JUNG. I'll have to check this out for my work.