Kalyanpur et al - extracting enterprise vocabularies
IBM and Gartner. Enterprises need semantic vocabularies. Can they be generated bottom-up from source documents? Tried NLP tools and off-the-shelf named entity recognizers, but recall was poor (only 50% of the possible terms identified by a domain expert were found).
Summary of solution: an algorithm to discover domain-specific terms and types; techniques to improve the quality and coverage of Linked Open Data (LOD); statistical domain-specific NERs built using LOD.
Discovering domain-specific terms: use a part-of-speech tagger to identify all nouns as candidate terms, filter them using TF-IDF, infer types using LOD, then use the types to filter the terms further. Resulted in 896 terms; the estimated number of probable terms in the full dataset is 3000.
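The four steps of the term-discovery pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the input is assumed to be already POS-tagged with Penn Treebank style tags, `LOD_TYPES` is a hypothetical in-memory stand-in for a real LOD type lookup, and `ALLOWED_TYPES`, the threshold, and the max-over-documents TF-IDF scoring are all assumptions.

```python
import math
from collections import Counter

# Hypothetical stand-in for an LOD type lookup; illustrative entries only.
LOD_TYPES = {
    "websphere": "Software",
    "db2": "Software",
    "gartner": "Company",
    "paris": "City",
}
# Assumed set of types considered relevant to the domain.
ALLOWED_TYPES = {"Software", "Company"}


def candidate_nouns(tagged_doc):
    """Step 1: keep every token whose POS tag marks a noun (tags starting with NN)."""
    return [word.lower() for word, tag in tagged_doc if tag.startswith("NN")]


def tfidf_scores(docs_nouns):
    """Step 2: score each term by its highest TF-IDF value over all documents."""
    n_docs = len(docs_nouns)
    doc_freq = Counter(t for nouns in docs_nouns for t in set(nouns))
    scores = {}
    for nouns in docs_nouns:
        for term, count in Counter(nouns).items():
            score = (count / len(nouns)) * math.log(n_docs / doc_freq[term])
            scores[term] = max(scores.get(term, 0.0), score)
    return scores


def extract_terms(tagged_docs, threshold):
    """Run all four steps and return a {term: type} mapping."""
    docs_nouns = [candidate_nouns(doc) for doc in tagged_docs]
    scores = tfidf_scores(docs_nouns)
    kept = {t for t, s in scores.items() if s > threshold}
    # Step 3: infer a type for each surviving term from the LOD stand-in.
    typed = {t: LOD_TYPES[t] for t in kept if t in LOD_TYPES}
    # Step 4: keep only terms whose inferred type is relevant to the domain.
    return {t: ty for t, ty in typed.items() if ty in ALLOWED_TYPES}
```

In this sketch a noun that appears in every document gets a TF-IDF score of zero and is dropped in step 2, while a term with an off-domain type is dropped in step 4.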
Improving recall: improved type mappings between DBpedia and Freebase using conditional probabilities. New mappings included in the DBpedia download since Aug '09. Improved LOD: add instance types. Get entity disambiguation for free using term URIs. Generate candidate patterns using super-types from the ontology, and let a machine learning system score each candidate.
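One plausible reading of the conditional-probability type mapping is sketched below. The notes do not give the exact formulation, so this is an assumption: for instances that carry both a Freebase type and a DBpedia type, estimate P(DBpedia type | Freebase type) from co-occurrence counts and keep a mapping when the estimate clears a threshold. The function name `type_mappings`, the `min_prob` parameter, and the type names in the usage example are hypothetical.

```python
from collections import Counter


def type_mappings(instances, min_prob=0.7):
    """Estimate P(dbpedia_type | freebase_type) from co-typed instances.

    `instances` is an iterable of (freebase_types, dbpedia_types) pairs,
    one pair per entity present in both datasets. Returns the
    {(freebase_type, dbpedia_type): probability} entries whose
    probability is at least `min_prob`.
    """
    fb_counts = Counter()
    pair_counts = Counter()
    for fb_types, db_types in instances:
        for fb in set(fb_types):
            fb_counts[fb] += 1
            for db in set(db_types):
                pair_counts[(fb, db)] += 1
    mappings = {}
    for (fb, db), count in pair_counts.items():
        prob = count / fb_counts[fb]
        if prob >= min_prob:
            mappings[(fb, db)] = prob
    return mappings


# Usage with made-up type names: three of four co-typed entities agree.
example = [
    (["fb:company"], ["db:Company"]),
    (["fb:company"], ["db:Company"]),
    (["fb:company"], ["db:Company"]),
    (["fb:company"], ["db:Organisation"]),
]
print(type_mappings(example))
```

Raising `min_prob` trades coverage of the mapping for fewer spurious type links.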
Final result: started with precision/recall of 80/23, raised to 78/46 with all improvements (recall doubled at a small cost in precision). Conclusion: using LOD as input for vocabulary extraction has many benefits.