How will we populate the semantic web on a vast scale? - Tom Mitchell keynote
Three answers: humans will enter structured info; database owners will publish; computers will read unstructured web data.
Read the Web project. Inputs: initial ontology, handful of training examples, the web (!), occasional access to a human trainer. Goals: (1) system running 24x7, each day extract more facts from the web to populate the initial ontology, (2) each day learn to perform #1 better than the day before.
Natural language understanding is hard. Ways to make it more plausible for machines to read:
- leverage redundancy on the web (many facts are repeated often, in different forms)
- target reading to populate a given ontology, restrict focus of attention
- use new semi-supervised learning algorithms
- seed learning from Freebase, DBpedia, etc.
State of project today: ontology of 10^2 classes, 10-20 seed examples of each, 100 million web pages. Running on the Yahoo M45 cluster. Examples include both relations and categories.
All code is open-source, available on web site. Currently XML, working on RDF.
Impressive demo of determining academic fields: 20 input examples, looked like hundreds of learned examples, good quality results. Output includes the learned patterns and alternate interpretations considered. approx 20K entities, approx 40K extracted beliefs
Semi-supervised learning starts to diverge after a few iterations - the task is under-constrained. Making the task apparently more complex by learning many classes and relations simultaneously actually adds constraints. Unlabeled examples become constraints - nested, coupled constraints. For "Krzyzewski coaches for the Devils", have to simultaneously classify the coach name and the team name.
"Luke is mayor of Pittsburgh" - learn functions for classifying Pittsburgh as a city based on (a) "Pittsburgh" and separately (but coupled) (b) "Luke is mayor of"
Information from the ontology provides constraints that couple classifiers together: e.g. disjointness between concepts. Also provides for consistency of relation arguments (domain and range constraints).
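A rough sketch of how such ontology constraints might filter candidate facts. The tiny disjointness table and relation signatures here are invented for illustration, not taken from the project:

```python
# Ontology constraints as data: disjoint category pairs, and per-relation
# (domain, range) type signatures. All entries are illustrative.

DISJOINT = {("city", "team"), ("team", "city")}          # mutual exclusion
SIGNATURE = {"coaches_for": ("coach", "team"),           # (domain, range)
             "mayor_of":    ("person", "city")}

def consistent(existing_labels, new_label):
    """Reject a category label that is disjoint with any label already assigned."""
    return all((old, new_label) not in DISJOINT for old in existing_labels)

def relation_types(relation):
    """Domain/range constraint: the argument types a relation's NPs must satisfy."""
    return SIGNATURE[relation]

print(consistent({"city"}, "team"))   # False: disjointness violated
print(relation_types("mayor_of"))    # ('person', 'city')
```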
Coupled bootstrap learner (CBL). Given ontology O and corpus C: assign positive and negative examples to classifiers (e.g. cities are negative examples of teams); extract candidates (conservatively); filter; train instance and pattern classifiers; assess; promote high-confidence candidates; share examples back to coupled classifiers using the ontology (including the subsumption hierarchy).
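The loop can be compressed into a sketch like the following. The promotion threshold, co-occurrence "classifier", and toy corpus are all assumptions standing in for the real trained classifiers and assessment step:

```python
# Compressed sketch of the coupled bootstrap learner loop: seed each
# category, treat other categories' instances as negatives, extract
# candidates conservatively, and promote only high-confidence ones.

def coupled_bootstrap(ontology, corpus, iterations=2, threshold=2):
    promoted = {cat: set(seeds) for cat, seeds in ontology["seeds"].items()}
    for _ in range(iterations):
        for cat in promoted:
            # Coupling: instances of other categories are negative examples.
            negatives = set().union(*(promoted[c] for c in promoted if c != cat))
            # Candidate extraction (stand-in): NPs co-occurring with known instances.
            counts = {}
            for sentence in corpus:          # a sentence = a tuple of noun phrases
                if any(seed in sentence for seed in promoted[cat]):
                    for np in sentence:
                        if np not in promoted[cat] and np not in negatives:
                            counts[np] = counts.get(np, 0) + 1
            # Assess and promote only high-confidence candidates.
            for np, n in counts.items():
                if n >= threshold:
                    promoted[cat].add(np)
    return promoted

ontology = {"seeds": {"city": {"Pittsburgh"}, "team": {"Devils"}}}
corpus = [("Pittsburgh", "Boston"), ("Pittsburgh", "Boston"), ("Devils", "Steelers")]
result = coupled_bootstrap(ontology, corpus)
print(sorted(result["city"]))  # ['Boston', 'Pittsburgh']
```

Note how "Steelers" is never promoted as a team: it co-occurs with a seed only once, below the confidence threshold - the conservative-promotion idea in miniature.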
Rather than focusing on single sentences in single docs, the system looks across many sentences for co-occurrence statistics. Macro-read many documents, rather than micro-read a single document.
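The macro-reading idea in miniature: aggregate co-occurrence counts of a context pattern with noun phrases across many documents, and keep only pairings backed by enough independent mentions. The snippets and threshold are invented for illustration:

```python
# Macro-reading sketch: no deep parsing of any single document; instead,
# count how often each noun phrase fills a pattern across the whole corpus
# and rely on redundancy to separate signal from noise.

from collections import Counter

documents = [
    "mayor of Pittsburgh", "mayor of Pittsburgh", "mayor of Boston",
    "mayor of the",   # a noisy extraction, seen only once
]

counts = Counter(doc.rsplit("mayor of ", 1)[1] for doc in documents)

# Keep noun phrases backed by enough independent mentions (threshold assumed).
cities = [np for np, n in counts.items() if n >= 2]
print(cities)  # ['Pittsburgh']
```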
Example of IBM learned facts. Rejected candidates might be good input to a human doing manual ontology design.
If some coupling is good, how to get even more? One answer: look at HTML structure, not just plain text. If some cars are li elements in a list, then likely the other li's are cars as well. PhD student Richard Wang at CMU - SEAL system. Combining SEAL and CBL generally gets good results, though performance is poor in some categories (e.g. sports equipment). To address this, extend the ontology to include nearby but distinct categories.
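The SEAL-style intuition - seeds appearing as li items license the sibling li items - can be sketched with the stdlib HTML parser. The page and seed set are invented, and real SEAL learns page-specific wrappers rather than hard-coding li:

```python
# SEAL-style sketch: if known seed instances appear as <li> items in an HTML
# list, hypothesize that the remaining <li> items belong to the same category.

from html.parser import HTMLParser

class ListItems(HTMLParser):
    """Collect the text content of every <li> element on a page."""
    def __init__(self):
        super().__init__()
        self.items, self._in_li = [], False
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False
    def handle_data(self, data):
        if self._in_li and data.strip():
            self.items.append(data.strip())

def expand_seeds(html, seeds):
    parser = ListItems()
    parser.feed(html)
    # If any seed occurs in the list, promote the sibling items too.
    if any(item in seeds for item in parser.items):
        return set(parser.items) - set(seeds)
    return set()

page = "<ul><li>Honda</li><li>Toyota</li><li>Saab</li></ul>"
print(expand_seeds(page, {"Honda", "Toyota"}))  # {'Saab'}
```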
System runs for about a week, before needing restart. Some categories saturate fairly quickly.
Want a system that learns other categories of knowledge. Tries to learn rules by mining the extracted KB. Needs positive examples - gets them from the KB. Where to get negative examples? Not stored in the KB. Get help from the ontology: for restricted-cardinality properties (e.g. functional ones), can infer negative examples.
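The functional-property trick might look like this - if a functional relation already has a value in the KB, any other value is an inferred negative example. The KB contents and relation names are illustrative:

```python
# Inferring negative examples from a functional (cardinality-1) property:
# mayor_of(Pittsburgh) = Luke is in the KB, so mayor_of(Pittsburgh) = Bob
# can be used as a negative training example. Data is invented.

kb = {("mayor_of", "Pittsburgh"): "Luke"}
FUNCTIONAL = {"mayor_of"}   # relations the ontology marks as functional

def is_negative(relation, arg, value):
    """True if the ontology lets us infer this candidate triple is false."""
    known = kb.get((relation, arg))
    return relation in FUNCTIONAL and known is not None and known != value

print(is_negative("mayor_of", "Pittsburgh", "Bob"))   # True: slot already taken
print(is_negative("mayor_of", "Pittsburgh", "Luke"))  # False: matches the KB
```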
Examples of learned rules - conditional horn clauses with weights. Showed some of the failed rules as well, e.g. skewed results due to partial availability of data. Good rules can be very useful, but bad rules are very bad - need human inspection to filter out bad rules.
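A weighted horn clause applied over an extracted KB might look like the following sketch; the rule, the 0.9 weight, and the facts are invented, not rules the talk showed:

```python
# Applying one learned, weighted horn clause over a toy extracted KB:
#   plays_for(X, T) & team_city(T, C) -> lives_in(X, C)   [weight 0.9]

facts = {("plays_for", "Smith", "Steelers"),
         ("team_city", "Steelers", "Pittsburgh")}

def apply_rule(facts):
    """Return inferred (relation, arg1, arg2) triples with the rule's weight."""
    inferred = {}
    for (r1, x, t) in facts:
        if r1 != "plays_for":
            continue
        for (r2, t2, c) in facts:
            if r2 == "team_city" and t2 == t:
                inferred[("lives_in", x, c)] = 0.9
    return inferred

print(apply_rule(facts))  # {('lives_in', 'Smith', 'Pittsburgh'): 0.9}
```

Even this toy shows why bad rules are costly: a single wrong clause fires on every matching pair of facts, flooding the KB with weighted wrong beliefs - hence the human inspection step.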
Future work: add modules to inner loop of extractor, e.g. use morphology of noun phrases. Making independent errors is good! Also: tap in to Freebase and DbPedia to provide many more examples during bootstrap.
Q: can system correct previous mistakes? A: current system marches forward without making any retractions. Internally, evidence of error builds up. Should be possible to correct previously made assertions.
Q: how to deal with ambiguity? e.g. Cleveland is a city and a baseball team. A: current system is a KB of noun phrases, not entities in the world. Knows that things can't be in multiple categories, which leads to inconsistency. Need to change the program to have distinct terms for the word and its separate senses.
Q: what about probabilistic KB? A: currently store probs, but hard part is how to integrate 10^5 probabilistic assertions. How to do prob reasoning at scale? Not known.
Q: can you learn rules with rare exceptions? A: can have exceptions, but not different types. Could understand the counter-examples to an otherwise good rule. Could generate new knowledge (example of 'continuing education students').
Q: how to deal with dynamic changes to the world? A: yes, it's a problem. Second most common cause of errors. Would need to add some temporal reasoning to the KB.
Q: what can we do from semweb to encourage ML researchers to contribute? A: it will happen, if you [sw community] can build big databases. Very excited about DBpedia. Suggest pushing on natural language processing conferences. They are not aware of these [semweb] kinds of resources. Btw, there are other linguistic resources you can draw on as well as wordnet, e.g. verbnet, propnet (?).