Presentation on the topic: "An Approach to Automatic Construction of Hierarchical Subject Domain for Question Answering Systems. Anna V. Zhdanova and Pavel V. Mankevich, Novosibirsk." Transcript:
An Approach to Automatic Construction of Hierarchical Subject Domain for Question Answering Systems Anna V. Zhdanova and Pavel V. Mankevich Novosibirsk State University and A.P. Ershov Institute of Informatics Systems, Novosibirsk
Contents Goals Research Motivation What Matters when You Deal with Texts? Proposed Algorithm Subject Domain Construction: Example Results
Our Goals To provide a solution to a knowledge management problem… which includes arranging natural language texts in a structure… which is hierarchical, ordered from «easy to understand» to «difficult to understand» texts... and performs best in natural language question answering... Knowledge Management Structure Hierarchy Question Answering
Automatic Construction of Subject Domain for Natural Language Queries Motivation Our assumptions and preconditions Most kinds of information presented in natural language cannot be effectively stored and accessed by widely used databases. Currently, most popular search engines create their own web resource hierarchies and use them in information retrieval. However, even for the largest engines the hierarchies are created manually. Manual construction of a hierarchy is not effective, because – it takes a lot of human effort and time – people make mistakes, but machines don't – the WWW grows too rapidly to process it manually
Automatic Construction of Subject Domain for Natural Language Queries Motivation Automatic construction of a subject domain (i.e., presenting knowledge of a chosen field in the way we do) -- why? Obtaining the structures of text data specifically for satisfying natural language queries. This problem has not had acceptable solutions until today. Generating additional metadata for ontologies (establishing general - particular and simple - complicated relationships between hierarchy units)
Weight function – if the text you read is quite long and contains many rare and complicated words, you will probably stop reading it. This means the text is difficult to understand and its weight function should be high (note Zipf's law here). Similarity measure – if two texts tell about similar things employing nearly the same words, these texts are similar and their similarity measure should be high. They Matter when You Deal with Text Documents and They Matter in the Hierarchy as Well!
Weight Functions Examples Let X be a text document, x_i be the word frequencies in the document, x_i* be the word frequencies in the whole subject domain
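The formulas from this slide did not survive in the transcript. As a rough illustration only, the sketch below implements the two weight functions named later on the recall-precision slide: the number of words in a document, and a weight inversely proportional to the product of word frequencies. The function names, the use of relative domain frequencies for x_i*, the restriction to distinct words, and the logarithmic form are all assumptions made here, not the authors' definitions.

```python
import math
from collections import Counter


def length_weight(doc_tokens):
    # Weight equal to the number of words in the document.
    return len(doc_tokens)


def rarity_weight(doc_tokens, domain_counts):
    # Assumed form: log(1 / product of relative domain frequencies),
    # taken over the distinct words of the document.
    # Working in log space avoids underflow for long documents.
    # Every document word must occur in domain_counts.
    total = sum(domain_counts.values())
    return -sum(math.log(domain_counts[w] / total) for w in set(doc_tokens))
```

Under these assumptions a document made of many rare words gets a high weight, which matches the intuition that such a text is difficult to understand.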
Similarity measures Examples Let X, Y be text documents, x_i and y_i be the word frequencies in documents X and Y. Measures: Jaccard Association, Taxonomic Distance, Cosine Measure
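The slide's own formulas are missing from the transcript. A minimal sketch of two of the named measures, using their textbook definitions (set-based Jaccard coefficient and cosine of word-frequency vectors), might look as follows. Whether the authors used exactly these variants is not known from the transcript, and the taxonomic distance is left out because its definition on the slide cannot be recovered.

```python
import math
from collections import Counter


def jaccard(x_tokens, y_tokens):
    # Textbook Jaccard coefficient on the sets of distinct words:
    # |X ∩ Y| / |X ∪ Y|.
    a, b = set(x_tokens), set(y_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x_tokens, y_tokens):
    # Cosine of the angle between the word-frequency vectors x_i and y_i.
    cx, cy = Counter(x_tokens), Counter(y_tokens)
    dot = sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())
    norm_x = math.sqrt(sum(v * v for v in cx.values()))
    norm_y = math.sqrt(sum(v * v for v in cy.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
```

Both functions return a value between 0 and 1, with higher values for documents that share more of their vocabulary.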
Hierarchy Construction Algorithm Input: natural language texts divided into independent text-units. Output: corresponding hierarchy. (i) Rank the units by their weights using the chosen weight function. (ii) Choose the unit with the lowest weight and place it as the root (i.e., the top node) of the hierarchy. (iii) Choose the unit with the lowest weight among the remaining ones and calculate the chosen similarity measure between this unit and each of the already chosen units. Put the newly chosen unit just below the one with the maximum similarity measure. (iv) Repeat step (iii) until the set of the remaining units is empty.
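Steps (i) to (iv) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the name build_hierarchy, the representation of units as a dictionary of token lists, the parent-map output, and the tie-breaking (first unit in weight order wins) are all choices made here. The weight and similarity functions are passed in as parameters, so any combination can be plugged in.

```python
def build_hierarchy(units, weight, similarity):
    """Build a tree from text units.

    units: dict mapping a unit name to its list of tokens (must be non-empty).
    weight: function(tokens) -> number.
    similarity: function(tokens, tokens) -> number.
    Returns (root_name, parent), where parent maps every non-root
    unit name to the name of the unit it is placed under.
    """
    # (i) Rank the units by weight, lowest first.
    order = sorted(units, key=lambda name: weight(units[name]))
    # (ii) The lowest-weight unit becomes the root.
    root = order[0]
    placed = [root]
    parent = {}
    # (iii)-(iv) Attach each remaining unit, in weight order, below the
    # already placed unit it is most similar to.
    for name in order[1:]:
        best = max(placed, key=lambda p: similarity(units[name], units[p]))
        parent[name] = best
        placed.append(name)
    return root, parent
```

Since each new unit is compared with every unit already placed, the sketch performs on the order of n² similarity computations for n units.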
Construction of Hierarchical Subject Domain Example Step 1: elimination of stop-words Filtration D1 public interface Document The Document is a container for text that serves as the model for swing text components. The goal for this interface is to scale from very simple needs (plain text textfield) to complex needs (HTML or XML documents for example). D3 Content At the simplest level, text can be modeled as a linear sequence of characters. To support internationalization, the Swing text model uses unicode characters. The sequence of characters displayed in a text component is generally referred … D2 Structure Text is rarely represented simply as featureless content. Rather, text typically has some sort of structure associated with it. Exactly what structure is modeled is up to a particular Document implementation. It might be …
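The filtration step can be illustrated with a small sketch. The stop-word list below is a hypothetical, deliberately short one, and the tokenizer (lowercase runs of letters) is an assumption; the transcript does not say which list or tokenizer the authors used.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "as", "for", "that", "to", "of", "or", "from", "this"}


def filter_stop_words(text):
    # Lowercase the text, split it into runs of letters,
    # and drop every token found in the stop-word list.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For instance, filter_stop_words("The Document is a container for text") keeps only the content words document, container and text.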
Construction of Hierarchical Subject Domain Example Step 2: calculating the weight of each document, ordering the document array according to their weights: D1, D3, D2. Hence, document D1 is the root. D1 Document Interface Textfield Container Text Swing Component Model D3 Component Content Data Word Character Swing Text Model Document Sequence D2 Interface Text Document Element Field Content Structure Attribute Unit Model Weight Function Calculation D1 W = 5.14 D3 W = 5.24 D2 W = 5.33
Construction of Hierarchical Subject Domain Example Step 3: adding to the hierarchy D3, indexing D3 D1 Document D3 Content Data Word Character Sequence Document Text Interface Swing Textfield Component Container Model
Construction of Hierarchical Subject Domain Example Step 4: similarity measure calculation between D2 and D1, D2 and D3 D2 D1 D2D3 = = 0.111
Construction of Hierarchical Subject Domain Example Step 5: getting the hierarchy, indexing D2 Document Text Interface Swing Textfield Component Container Model D2 Structure Element Field Content Structure Attribute Unit D1 Document D3 Content Data Word Character Sequence
Results Experiments A system performing automatic construction of a subject domain, information retrieval, and natural language interaction with the user is built on the basis of Java and XML technologies. The main test base lies within the insurance subject domain: English language; 83 text units, i.e., hierarchy nodes; 27 typical natural language questions. 0 : Decreasing TLI 1 : Lowering auto insurance rates 1 : WLI and ULI 2 : VLI, ULI and participating WLI 1 : Having an accident 1 : Buying auto insurance 1 : Amount of life insurance 1 : Mortgage LI and other TLI 1 : Variable universal WLI 2 : Variable WLI 1 : Selling LI 1 : Participating WLI 1 : Physical exam 1 : Rental car 2 : New car 1 : Credit TLI 1 : Adjustable WLI 2 : Universal WLI 2 : Permanent LI 2 : Current assumption WLI 2 : Repairing vehicle 2 : 1035 exchange 3 : Tax issues 1 : Moving to another state 2 : SR-22 form 1 : Combination Policy 2 : Special Auto Policy 1 : Family income insurance 2 : Family Automobile Policy 3 : Automobile Insurance 1 : Liability lingo 2 : Liability Insurance 2 : Packaged policy 2 : Liability limits 1 : Term LI 1 : Underinsured Motorists 1 : Deposit TLI
Results Recall-Precision Curves in terms of question answering. Curves compared: non-hierarchical subject domain; cosine similarity measure and weight function equal to the number of words in a document; cosine similarity measure and weight function inversely proportional to the product of word frequencies.
Results Conclusion Introduction of automatically constructed subject domains substantially improves the performance of question answering systems. The performance of a question answering system depends on the chosen combination of weight function and similarity measure. However, a combination that would be best for all cases has not been found.
Thank you for your attention! Contact us by This presentation was created for the Andrei Ershov Fifth International Conference, Novosibirsk, Akademgorodok, Russia, July 2003.