Classification of E-mail Queries by Topic: Approach Based on Hierarchically Structured Subject Domain Anna V. Zhdanova and Denis V. Shishkin A.P. Ershov.


Classification of E-mail Queries by Topic: Approach Based on Hierarchically Structured Subject Domain
Anna V. Zhdanova and Denis V. Shishkin
A.P. Ershov Institute of Informatics Systems, Russian Academy of Sciences, Novosibirsk, Russia, and Novosibirsk State University, Novosibirsk, Russia
{anna,

Contents
- Basic Terminology, Motivation, Our Goals, and Application Areas
- What is in Our Classifier and How is It Presented There?
- Classification Principles and Examples
- Technical Details of Our Classifier
- Summary and Future Work

Basic Terminology
- An E-mail query is a message written in natural language with a sense of purpose.
- Classification (or categorization) is the task of assigning objects from a universe to classes or categories.
- A category (or class) is a group of objects sharing common distinctive properties.
- A hierarchy of categories is a partially ordered set of categories in which a subcategory has an "is a" or "part of" relationship with its parent category.
- Classification of E-mail queries is an indispensable step in understanding the purpose of an E-mail query.

Motivation
Automatic classification of electronic documents is urgently needed because:
- the flow of data on the World Wide Web is increasing;
- electronic databases are growing;
- a classifier is a substantial part of an automatic answering system, and it saves the resources otherwise spent on manual processing of queries.

The Goals We Attained
We present a CLASSIFIER of E-mail queries that performs TEXT CATEGORISATION BY TOPIC and is PRECISE, COMPREHENSIVE and EFFICIENT, because:
- Using the hierarchically structured subject domain and classification rules, the Classifier's engine assigns an E-mail query to the most relevant category or categories.
- The Classifier improves its precision by taking into account the meaning of words, the word order, and special semantic constructions such as negation words (e.g., texts consisting of the expressions "to be" and "not to be" are assigned to different categories, as they should be).
- The Classifier provides efficient information extraction and effective tools for managing hierarchical data.

Application Areas
- Automatic answering systems, call centers
- Search engines
- Document classification, e.g., spam filtering
- Semantic information retrieval, including on the web, where hierarchical organization of the subject domain is quite common

Categories in Classifier
A category is represented by a list of regular expressions, a label, a set of subcategories, and a textual message in natural language. Each category represents a topic in the classification process. Messages ascribed to the same topic (i.e., category) share common semantic properties.
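The category representation described above can be sketched as a small data structure. This is a minimal illustration in Python, not the original implementation (the tools named in this talk were written in Java). The class name `Category`, its field names, the patterns, and the placeholder message are all assumptions made for the example. Only the three labels come from the classification example later in the talk.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Category:
    """One topic: a label, regular expressions, a message, and subcategories."""
    label: str
    patterns: list[str]                      # regular expressions attributed to the category
    message: str = ""                        # textual message in natural language
    subcategories: list["Category"] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        """True if any of the category's regular expressions occurs in the text."""
        return any(re.search(p, text, re.IGNORECASE) for p in self.patterns)

    def walk(self, depth: int = 0):
        """Yield (depth, category) pairs for this node and everything below it."""
        yield depth, self
        for sub in self.subcategories:
            yield from sub.walk(depth + 1)


# Hypothetical three-level tree; the patterns are illustrative guesses.
old_car = Category("Old car", [r"\bold\w*\s+car\w*"], "Placeholder answer about old cars.")
auto = Category("Automobile Insurance", [r"\bauto\w*"], subcategories=[old_car])
root = Category("Personal Financial Services", [r"\binsurance\b"], subcategories=[auto])

for depth, category in root.walk():
    print("  " * depth + category.label)
```

Keeping the subcategories inside the node makes the "is a" / "part of" relationship explicit: every node generalizes whatever `walk` yields beneath it.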

Hierarchy of Categories
The hierarchical structure of the subject domain is introduced in order to take the semantics of an incoming query into account. Each category in a hierarchy is a generalization of its subcategories ("part of" and "is a" relationships). Here, the trees of categories are constructed manually by experts who are familiar with the banking and insurance domains. English and Russian are used. By employing the hierarchically structured subject domain and classification rules, the Classifier's engine assigns a message to the most relevant category or categories.

Morphology: Retrieved Items and Regular Expressions
A regular expression is an algebraic notation for characterizing a set of strings.
Example: a set of regular expressions attributed to the category "Amount of life insurance". This set represents the words: amount, amounts, cost, costs, price, prices, how much, страховые суммы, страховыми суммами, страховой суммой, страховых сумм, страховая сумма, страховой суммы, сумма, суммы, суммой, суммами, сумм, сумме, суммах, etc. (the Russian items are inflected forms of "insured amount" and "amount").
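The regular expressions for this category do not survive in the transcript, so the pattern below is a reconstruction, not the authors' expression set. It is a hedged Python sketch of how one pattern per language could cover the listed word forms. The names `AMOUNT` and `mentions_amount` are hypothetical.

```python
import re

# Assumed pattern: English alternatives with an optional plural "s",
# plus the Russian stem "сумм" with any ending, optionally preceded
# by an inflected form of "страхов..." (insured).
AMOUNT = re.compile(
    r"\b(?:amounts?|costs?|prices?|how\s+much)\b"
    r"|\b(?:страхов\w+\s+)?сумм\w*",
    re.IGNORECASE,
)

# The word forms listed on the slide.
WORDS = [
    "amount", "amounts", "cost", "costs", "price", "prices", "how much",
    "страховые суммы", "страховыми суммами", "страховой суммой",
    "страховых сумм", "страховая сумма", "страховой суммы",
    "сумма", "суммы", "суммой", "суммами", "сумм", "сумме", "суммах",
]


def mentions_amount(text: str) -> bool:
    """True if the text contains any form covered by the category pattern."""
    return AMOUNT.search(text) is not None


print(all(AMOUNT.fullmatch(w) for w in WORDS))
print(mentions_amount("How much does the policy cost?"))
```

A stem-plus-ending pattern such as `сумм\w*` is compact but over-generates (it also accepts unrelated words built on the same stem), which is one reason a real expression set would likely be more restrictive.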

Classifier's Vocabulary
The Classifier's vocabulary is a list of regular expressions that correspond to the chosen domain. Each regular expression has one or several pointers to the related category or categories in the tree. Functional expressions, such as the expressions for negation (e.g., "no", "isn't", "besides"), are marked by special labels.

Rule-based Classification Algorithm
1) The incoming query is lexically analyzed.
2) Information on the categories is extracted from the Classifier's vocabulary.
3) A list of pointers to categories and negations is generated in the order of word appearance.
4) The classification rules are applied to the list.
5) The decision system produces the output, i.e., defines the category into which the message is classified.
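Steps 1 to 3 of the algorithm above can be sketched as a scan of the query against the vocabulary. This is an illustrative Python sketch under stated assumptions: the vocabulary entries, the `NEG` label, the function name `pointer_list`, and the category names "Life Insurance" and "Permanent Life Insurance" are all hypothetical. Steps 4 and 5 (rule application and the decision system) are not covered here.

```python
import re

NEG = "NEGATION"  # assumed special label for negation expressions

# Hypothetical vocabulary: each regular expression carries one or several
# pointers to categories, or the special negation label.
VOCABULARY = [
    (re.compile(r"\b(?:no|not|isn't|besides)\b", re.IGNORECASE), [NEG]),
    (re.compile(r"\blife\b", re.IGNORECASE), ["Life Insurance"]),
    (re.compile(r"\bpermanent\w*", re.IGNORECASE), ["Permanent Life Insurance"]),
    (re.compile(r"\bauto\w*", re.IGNORECASE), ["Automobile Insurance"]),
    (re.compile(r"\bcars?\b", re.IGNORECASE), ["Automobile Insurance"]),
]


def pointer_list(query: str) -> list[str]:
    """Return category pointers and negation labels in order of word appearance."""
    hits = []
    for pattern, pointers in VOCABULARY:
        for match in pattern.finditer(query):
            hits.append((match.start(), pointers))
    hits.sort(key=lambda hit: hit[0])        # order by position in the query
    return [p for _, pointers in hits for p in pointers]


query = "I would like to insure my life while being on vacations only, not permanently."
print(pointer_list(query))
# ['Life Insurance', 'NEGATION', 'Permanent Life Insurance']
```

Sorting by match position is what preserves word order, so a later rule can tell which category a negation label precedes.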

Classification Rules: Intersection
Query: What does my auto insurance policy cover when I rent a car?
Retrieved items: auto, insurance, rent, car
Answer: In the not-too-distant past, most auto insurance policies would extend coverage to rental cars whenever you rented one. This is not quite true anymore. In most cases, your personal auto insurance policy will cover only vacation car rentals. Many insurance companies no longer extend personal auto insurance coverage for business travel. Find out what rental car coverage you have under your policy by calling your insurance agent/company.

Classification Rules: Negation
Query: I would like to insure my life while being on vacations only, not permanently.
Retrieved items: life, not, permanently
Answer: A specialized individual policy that commonly combines whole life insurance with decreasing term insurance. The whole life insurance portion of the policy is usually paid as a lump sum when the insured dies. The decreasing term insurance portion of the policy provides an income for a predetermined period to help support the insured's family...

Classification Rules: Union
Query: I've recently bought a new car. What would you recommend about buying insurance?
Retrieved items: new car, buying, insurance
Answer: Things you should consider when purchasing automobile insurance include:
- Decide how much liability coverage you want to carry. This is highly subjective. The liability levels you have on your other policies can serve as a guideline. Consult a financial professional if you need more advice.
- Determine which optional coverage you will need to feel protected. For example…

Classification Process
Query: I have an older car whose current market value is very low – do I really need auto insurance?
Retrieved items: older car, auto, insurance
Corresponding regular expressions: / old.._car../, /auto./, /insurance/
Names and positions of the proposed categories (Old car, Automobile Insurance, Personal Financial Services) are retrieved from the Classifier's vocabulary.
The most relevant category, Old car, is left after the application of the rules.
Position in the hierarchy:
Personal Financial Services
  Automobile Insurance
    Old car
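The talk does not spell out which rule leaves only Old car, so the sketch below assumes one plausible reading: a proposed category is dropped when another proposed category lies beneath it in the hierarchy. It is a hedged Python illustration. The `PARENT` map and the function names `ancestors` and `most_specific` are hypothetical.

```python
# Child -> parent links for the three categories in the example (None = root).
PARENT = {
    "Personal Financial Services": None,
    "Automobile Insurance": "Personal Financial Services",
    "Old car": "Automobile Insurance",
}


def ancestors(label: str) -> set[str]:
    """All categories above the given one in the hierarchy."""
    found = set()
    node = PARENT.get(label)
    while node is not None:
        found.add(node)
        node = PARENT.get(node)
    return found


def most_specific(candidates: list[str]) -> list[str]:
    """Keep only candidates that are not an ancestor of another candidate."""
    covered = set()
    for label in candidates:
        covered |= ancestors(label)
    return [label for label in candidates if label not in covered]


proposed = ["Old car", "Automobile Insurance", "Personal Financial Services"]
print(most_specific(proposed))
# ['Old car']
```

Under this assumption, a query that mentions only "insurance" would stay at the general Personal Financial Services level, because no more specific candidate supersedes it.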

Used Tools: TreeBuilder
The main goal of the TreeBuilder program is to help a user create a tree (a hierarchy of categories in the Classifier). The TreeBuilder GUI allows us:
- to easily create tree nodes and their sub-nodes and associate them with runnable Java objects or hard disk data (e.g., files)
- to edit the source code of rules and dictionaries
- to create tests
- to store the hierarchy and its data in XML format
- to copy and paste
- to view debug information

Used Tools: RegExParser
RegExParser is a tool for morphological analysis and information extraction.

Used Tools: LinguaEngine
LinguaEngine is a rule-based engine used by the Classifier.
Input: a list of Java objects (in our case, a list of pointers to categories) and rules.
During execution, a rule replaces a sub-list of objects with another list of objects. The engine makes it possible to unite rules into groups and to introduce a group order. For example, it is sometimes necessary to execute more important rules before less important ones. When none of the rules in the current group can be executed, the engine switches to the rules of the next group according to the group order. Technically, a rule is represented as a Java method. A group of rules is implemented by means of a Java class in which every boolean method represents a rule.
Output: the objects left in the input list after all of the rules have been executed.
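The engine behaviour described above can be sketched in a few lines. This is an illustrative Python analogue, not LinguaEngine itself, which is a Java tool whose API is not shown in the talk. Here a rule is a function that rewrites the list in place and returns a boolean, mirroring the "boolean method" description. The two rules, the `NEG` label, and the choice never to return to an earlier group are assumptions.

```python
NEG = "NEGATION"  # assumed label for a negation expression


def drop_negated(items: list) -> bool:
    """Rule: replace the sub-list [NEG, category] with an empty list."""
    for i in range(len(items) - 1):
        if items[i] == NEG and items[i + 1] != NEG:
            del items[i:i + 2]
            return True
    return False


def merge_duplicates(items: list) -> bool:
    """Rule: replace two equal adjacent objects with a single one."""
    for i in range(len(items) - 1):
        if items[i] == items[i + 1]:
            items[i:i + 2] = [items[i]]
            return True
    return False


def run_engine(items: list, groups: list) -> list:
    """Apply rule groups in order; move on when no rule in a group fires."""
    items = list(items)                       # leave the caller's list untouched
    for group in groups:
        fired = True
        while fired:
            # any() stops at the first rule that fires, then the group restarts.
            fired = any(rule(items) for rule in group)
    return items


pointers = ["Life Insurance", NEG, "Permanent Life Insurance", "Life Insurance"]
print(run_engine(pointers, [[drop_negated], [merge_duplicates]]))
# ['Life Insurance']
```

Swapping the two groups leaves `['Life Insurance', 'Life Insurance']`, because the duplicates only become adjacent after the negated category is removed. This is the kind of dependency that a group order exists to control.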

Classification Scheme

Summary
Our Main Novel Results:
- A rule-based algorithm for text classification that takes the syntax and semantics of an incoming query into consideration and is based on a hierarchically structured subject domain
- Effective tools: 1) LinguaEngine for rule execution 2) RegExParser for morphological analysis and information retrieval 3) TreeBuilder for working with hierarchical data

Our Team and Our Partners
The Classifier has been implemented with the support and collaboration of the companies Exigen and Transnet.
Our team comes from the A.P. Ershov Institute of Informatics Systems (Siberian Division of the Russian Academy of Sciences). Besides the authors of the presented paper, the following specialists worked on and contributed to the Classifier project:
- Farida G. Dinenberg
- Irina S. Kononenko
- David Ya. Levin
- Dmitry V. Petunin
- Alexander L. Semenov

Possible Future Work
- Automatic construction of category trees
- Creating an automatic answering system based on information extraction, the hierarchical structure of the subject domain, and the rule-based engine from the Classifier
- Identifying the genre of an incoming query
- Trying new subject domains and new languages

Thank you for your attention!
This presentation was created for the Third International Conference on Intelligent Data Engineering and Automated Learning, Manchester, UK, August 12-14, 2002.