An Approach to Automatic Construction of Hierarchical Subject Domain for Question Answering Systems Anna V. Zhdanova and Pavel V. Mankevich Novosibirsk.


An Approach to Automatic Construction of Hierarchical Subject Domain for Question Answering Systems Anna V. Zhdanova and Pavel V. Mankevich Novosibirsk State University and A.P. Ershov Institute of Informatics Systems, Novosibirsk

Contents
Goals
Research Motivation
What Matters when You Deal with Texts?
Proposed Algorithm
Subject Domain Construction: Example
Results

Our Goals
To provide a solution for a knowledge management problem…
which includes arranging natural language texts in a structure…
which is hierarchical, ordered from «easy to understand» to «difficult to understand» texts…
and serves best (and is the best) in natural language question answering.
Knowledge Management, Structure, Hierarchy, Question Answering

Automatic Construction of Subject Domain for Natural Language Queries
Motivation
Our assumptions and preconditions
Most kinds of information presented in natural language cannot be effectively stored and accessed by widely used databases.
Currently, most of the popular search engines create their own web resource hierarchies and use them in information retrieval. However, even for the largest engines the hierarchies are created manually.
Manual construction of a hierarchy is not effective, because
–it takes a lot of human effort and time
–people make mistakes, but machines don't
–the WWW grows too rapidly to process it manually

Automatic Construction of Subject Domain for Natural Language Queries
Motivation
Automatic construction of a subject domain (i.e., presenting knowledge of a chosen field in the way we do): why?
Obtaining the structures of text data specifically for satisfying natural language queries. This problem has not had acceptable solutions until today.
Generating additional metadata for ontologies (establishing general-particular and simple-complicated relationships between hierarchy units).

They Matter when You Deal with Text Documents and They Matter in the Hierarchy as Well!
Weight function: if the text you read is quite long and contains many rare and complicated words, you will probably stop reading it. This means the text is difficult to understand and its weight function should be high (note Zipf's law here).
Similarity measure: if two texts tell about similar things employing nearly the same words, these texts are similar and their similarity measure should be high.

Weight Functions: Examples
Let X be a text document, $x_i$ be the word frequencies in the document, and $x_i^*$ be the word frequencies in the whole subject domain.
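The weight formulas themselves did not survive in this transcript. As a minimal sketch only, the two functions below illustrate the weight functions named on the results slide: one equal to the number of words in a document, and one that grows as the product of word frequencies shrinks. The second is written as a sum of negative log relative frequencies, which is an assumed concrete form, not the authors' formula. The names `word_count_weight`, `rarity_weight`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def word_count_weight(doc_words):
    """Weight equal to the number of words in the document."""
    return len(doc_words)


def rarity_weight(doc_words, domain_counts):
    """Assumed form: sum of -log(relative domain frequency) over the words.

    This equals log(1 / product of relative frequencies), so long texts
    made of rare words receive a high weight. Every word in doc_words
    is expected to occur in domain_counts.
    """
    total = sum(domain_counts.values())
    return sum(-math.log(domain_counts[w] / total) for w in doc_words)


# Toy subject domain (hypothetical data, not from the slides).
domain = [["text", "document", "model"],
          ["text", "unicode", "character", "sequence"]]
domain_counts = Counter(w for doc in domain for w in doc)

short_common = ["text", "text"]
long_rare = ["unicode", "character", "sequence"]

print(word_count_weight(long_rare))
print(rarity_weight(short_common, domain_counts) < rarity_weight(long_rare, domain_counts))
```

Both functions follow the intuition stated earlier: a longer text with rarer words is harder to read, so it gets a higher weight.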

Similarity Measures: Examples
Let X, Y be text documents and $x_i$ and $y_i$ be the word frequencies in documents X and Y.
Jaccard Association Taxonomic Distance Cosine Measure
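The formulas for these measures are missing from the transcript. The sketch below uses the standard textbook definitions of the Jaccard coefficient (on word sets) and the cosine measure (on word-frequency vectors); the exact variants on the original slide may differ, and the other listed measures are not sketched. The function names are illustrative.

```python
import math
from collections import Counter


def jaccard(x_words, y_words):
    """Jaccard coefficient: shared distinct words / all distinct words."""
    x, y = set(x_words), set(y_words)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def cosine(x_words, y_words):
    """Cosine of the angle between the two word-frequency vectors."""
    x, y = Counter(x_words), Counter(y_words)
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0


print(jaccard(["a", "b"], ["b", "c"]))
print(cosine(["a", "b"], ["a", "b"]))
```

Both measures are 0 for texts with no words in common and reach 1 for texts with identical word content, matching the requirement that similar texts get a high similarity measure.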

Hierarchy Construction Algorithm
Input: natural language texts divided into independent text-units
Output: corresponding hierarchy
(i) Rank the units by their weights using the chosen weight function.
(ii) Choose the unit with the lowest weight and place it as the root (i.e., the top node) of the hierarchy.
(iii) Choose the unit with the lowest weight among the remaining ones and calculate the chosen similarity measure between this unit and each of the already chosen units. Put the newly chosen unit just below the one with the maximum similarity measure.
(iv) Repeat step (iii) until the set of the remaining units is empty.
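Steps (i) to (iv) can be sketched as follows. This is an illustration under stated assumptions, not the authors' Java implementation: units are given as word lists, the weight is the word count, the similarity is the Jaccard coefficient on word sets, and ties are broken in favor of the unit placed earlier. The names `build_hierarchy`, `jaccard`, and the toy units are hypothetical.

```python
def jaccard(x_words, y_words):
    """Jaccard coefficient on word sets (assumed similarity measure)."""
    x, y = set(x_words), set(y_words)
    return len(x & y) / len(x | y) if (x or y) else 0.0


def build_hierarchy(units, weight, similarity):
    """Return a dict mapping each unit name to its parent (root -> None).

    units: dict mapping unit name -> list of words.
    """
    # Step (i): rank the units by weight, lowest first.
    ranked = sorted(units, key=lambda name: weight(units[name]))
    if not ranked:
        return {}
    # Step (ii): the lowest-weight unit becomes the root.
    parent = {ranked[0]: None}
    # Steps (iii)-(iv): attach each remaining unit below the most
    # similar unit among those already placed.
    for name in ranked[1:]:
        best = max(parent,
                   key=lambda placed: similarity(units[name], units[placed]))
        parent[name] = best
    return parent


# Toy units (hypothetical data).
units = {
    "A": ["text", "document"],
    "B": ["text", "document", "model"],
    "C": ["text", "document", "model", "swing"],
}
hierarchy = build_hierarchy(units, weight=len, similarity=jaccard)
print(hierarchy)  # {'A': None, 'B': 'A', 'C': 'B'}
```

Passing the weight function and similarity measure as parameters reflects the point made in the conclusion that performance depends on the chosen combination of the two.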

Construction of Hierarchical Subject Domain: Example
Step 1: elimination of stop-words (filtration)
D1: public interface Document. The Document is a container for text that serves as the model for Swing text components. The goal for this interface is to scale from very simple needs (plain text textfield) to complex needs (HTML or XML documents for example).
D3: Content. At the simplest level, text can be modeled as a linear sequence of characters. To support internationalization, the Swing text model uses Unicode characters. The sequence of characters displayed in a text component is generally referred …
D2: Structure. Text is rarely represented simply as featureless content. Rather, text typically has some sort of structure associated with it. Exactly what structure is modeled is up to a particular Document implementation. It might be …

Construction of Hierarchical Subject Domain: Example
Step 2: calculating the weight of each document and ordering the documents by weight: D1, D3, D2. Hence, document D1 is the root.
D1: Document Interface Textfield Container Text Swing Component Model
D3: Component Content Data Word Character Swing Text Model Document Sequence
D2: Interface Text Document Element Field Content Structure Attribute Unit Model
Weight function calculation: D1 W = 5.14, D3 W = 5.24, D2 W = 5.33

Construction of Hierarchical Subject Domain: Example
Step 3: adding D3 to the hierarchy, indexing D3
D1 (Document): Document Text Interface Swing Textfield Component Container Model
D3: Content Data Word Character Sequence

Construction of Hierarchical Subject Domain: Example
Step 4: similarity measure calculation between D2 and D1, and between D2 and D3
[Formulas not recovered from the slide; surviving value: 0.111]

Construction of Hierarchical Subject Domain: Example
Step 5: getting the hierarchy, indexing D2
D1 (Document): Document Text Interface Swing Textfield Component Container Model
D2 (Structure): Element Field Content Structure Attribute Unit
D3: Content Data Word Character Sequence
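The slides do not define the indexing step. One reading consistent with this example is that a newly added unit keeps only the words not already indexed at D1: removing D1's words from the Step 2 word lists of D3 and D2 reproduces the index words shown for them. The sketch below checks that reading; the helper `index_words` is hypothetical and the rule itself is an inference from this example, not something the slides state.

```python
# Word lists from Step 2 of the example.
d1 = ["Document", "Interface", "Textfield", "Container",
      "Text", "Swing", "Component", "Model"]
d3 = ["Component", "Content", "Data", "Word", "Character",
      "Swing", "Text", "Model", "Document", "Sequence"]
d2 = ["Interface", "Text", "Document", "Element", "Field",
      "Content", "Structure", "Attribute", "Unit", "Model"]


def index_words(unit_words, already_indexed):
    """Keep the unit's words that are not already indexed, in order."""
    seen = set(already_indexed)
    return [w for w in unit_words if w not in seen]


d3_index = index_words(d3, d1)
d2_index = index_words(d2, d1)
print(d3_index)  # ['Content', 'Data', 'Word', 'Character', 'Sequence']
print(d2_index)  # ['Element', 'Field', 'Content', 'Structure', 'Attribute', 'Unit']
```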

Results
Experiments
A system performing automatic construction of a subject domain, information retrieval, and natural language interaction with the user is built on the basis of Java and XML technologies.
The main test base lies within the insurance subject domain:
English language
83 text units, i.e., hierarchy nodes
27 typical natural language questions
0 : Decreasing TLI
1 : Lowering auto insurance rates
1 : WLI and ULI
2 : VLI, ULI and participating WLI
1 : Having an accident
1 : Buying auto insurance
1 : Amount of life insurance
1 : Mortgage LI and other TLI
1 : Variable universal WLI
2 : Variable WLI
1 : Selling LI
1 : Participating WLI
1 : Physical exam
1 : Rental car
2 : New car
1 : Credit TLI
1 : Adjustable WLI
2 : Universal WLI
2 : Permanent LI
2 : Current assumption WLI
2 : Repairing vehicle
2 : 1035 exchange
3 : Tax issues
1 : Moving to another state
2 : SR-22 form
1 : Combination Policy
2 : Special Auto Policy
1 : Family income insurance
2 : Family Automobile Policy
3 : Automobile Insurance
1 : Liability lingo
2 : Liability Insurance
2 : Packaged policy
2 : Liability limits
1 : Term LI
1 : Underinsured Motorists
1 : Deposit TLI

Results
Recall-Precision Curves in terms of question answering
[Figure: recall-precision curves for three configurations]
Non-hierarchical subject domain
Cosine similarity measure and weight function equal to the number of words in a document
Cosine similarity measure and weight function inversely proportional to the product of word frequencies

Results
Conclusion
Introduction of automatically constructed subject domains substantially improves performance of question answering systems.
Performance of a question answering system depends on the chosen combination of weight function and similarity measure. However, a combination that is best for all cases has not been found.

Thank you for your attention! This presentation was created for the Andrei Ershov Fifth International Conference, Novosibirsk, Akademgorodok, Russia, July 2003.