2 Classification of Queries by Topic: Approach Based on Hierarchically Structured Subject Domain
Anna V. Zhdanova and Denis V. Shishkin
A.P. Ershov Institute of Informatics Systems, Russian Academy of Sciences, Novosibirsk, Russia, and Novosibirsk State University, Novosibirsk, Russia {anna,
3 Contents
- Basic Terminology, Motivation, Our Goals, and Application Areas
- What is in Our Classifier and How is It Presented There?
- Classification Principles and Examples
- Technical Details of Our Classifier
- Summary and Future Work
4 Basic Terminology
- A query is a message written in natural language with a sense of purpose.
- Classification, or categorization, is the task of assigning objects from a universe to classes or categories.
- A category, or class, is a group of objects sharing common distinctive properties.
- A hierarchy of categories is a partially ordered set of categories, where a subcategory has an "is a" or "part of" relationship with its superior category.
- Classification of queries is an indispensable step in understanding the purpose of a query.
5 Motivation
The need for automatic classification of electronic documents is acute nowadays, because:
- the flow of data on the World Wide Web is increasing;
- electronic databases are growing;
- a classifier is an essential part of an automatic answering system, where it saves the resources otherwise spent on manual processing of queries.
6 The Goals We Attained
We present a CLASSIFIER of queries, which performs TEXT CATEGORISATION BY TOPIC and is PRECISE, COMPREHENSIVE, and EFFICIENT, because:
- By using the hierarchically structured subject domain and classification rules, the Classifier's engine assigns a query to the most relevant category or categories.
- The Classifier performs better by taking into account the meaning of words, the word order, and special semantic constructions such as negation words (e.g., texts consisting of the expressions "to be" and "not to be" are assigned to different categories, as they should be).
- The Classifier provides efficient information extraction and effective tools for managing hierarchical data.
7 Application Areas
- Automatic answering systems, call centers
- Search engines
- Document classification, e.g., spam filtering
- Semantic information retrieval, including on the web, where hierarchical organization of a subject domain is quite common
8 Categories in Classifier
A category is represented by a list of regular expressions, its label, a set of its subcategories, and a textual message in natural language. Each category represents a topic in the classification process. The messages ascribed to the same topic (i.e., category) have common semantic properties.
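The category representation described above can be sketched as a small data structure. This is a minimal illustration in Python, not the authors' implementation (the original tools are Java-based); the field names, the `matches` helper, the sample patterns, and the sample messages are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    """One node of the category tree (field names are illustrative)."""
    label: str
    patterns: List[str]                 # regular expressions that signal this topic
    message: str = ""                   # natural-language text attached to the category
    subcategories: List["Category"] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        """Return True if any of the category's regular expressions occurs in text."""
        return any(re.search(p, text, re.IGNORECASE) for p in self.patterns)


# Hypothetical sample data: two categories, one nested inside the other.
old_car = Category("Old car", [r"\bold\w*\s+car\w*"],
                   "Advice for owners of older cars.")
auto = Category("Automobile Insurance", [r"\bauto\w*", r"\binsurance\b"],
                "General information about auto insurance.", [old_car])
```

Keeping the subcategories inside the node makes the tree itself the hierarchy, so a parent is always a generalization of whatever is stored beneath it.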
9 Hierarchy of Categories
- The hierarchical structure of the subject domain is introduced in order to take into account the semantics of an incoming query.
- Each category of a hierarchy is a generalization of its subcategories ("part of" and "is a" relationships).
- Here, the trees of categories are constructed manually by experts who are familiar with the banking and insurance domain. English and Russian are used.
- By employing the hierarchically structured subject domain and classification rules, the Classifier's engine assigns a message to the most relevant category or categories.
10 Morphology Retrieved Items and Regular Expressions
Example: a set of regular expressions attributed to the category Amount of life insurance. This set represents the words: amount, amounts, cost, costs, price, prices, how much, страховые суммы, страховыми суммами, страховой суммой, страховых сумм, страховая сумма, страховой суммы, сумма, суммы, суммой, суммами, сумм, сумме, суммах (Russian inflected forms of "insured sum" and "sum"), etc.
A regular expression is an algebraic notation for characterizing a set of strings.
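To illustrate how a handful of regular expressions can cover many inflected forms, here is a minimal Python sketch. The patterns are hypothetical stand-ins written for this example; they are not the expressions used in the Classifier, and the name `mentions_amount` is an assumption.

```python
import re

# Hypothetical patterns; the Classifier's actual expressions are not reproduced here.
AMOUNT_PATTERNS = [
    r"\bamounts?\b",
    r"\bcosts?\b",
    r"\bprices?\b",
    r"\bhow\s+much\b",
    r"\bстрахов\w+\s+сумм\w*",   # страховая сумма, страховых сумм, ...
    r"\bсумм\w*",                 # сумма, суммы, суммой, сумм, ...
]
AMOUNT_RE = re.compile("|".join(AMOUNT_PATTERNS), re.IGNORECASE)


def mentions_amount(text: str) -> bool:
    """Return True if the text contains any word form covered by the patterns."""
    return AMOUNT_RE.search(text) is not None
```

Matching on a stem plus an open ending (`сумм\w*`) is a blunt tool: it also accepts unrelated words that share the stem, which is one reason real vocabularies are curated by domain experts.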
11 Classifier's Vocabulary
- The Classifier's vocabulary is a list of regular expressions that correspond to the chosen domain.
- Each regular expression has one or several pointers to the related category or categories in the tree.
- Functional expressions, such as expressions for negation (e.g., "no", "isn't", "besides"), are marked by special labels.
12 Rule-based Classification Algorithm
1) The incoming query is lexically analyzed.
2) Information on the categories is extracted from the Classifier's vocabulary.
3) A list of pointers to categories and negations is generated in the order of word appearance.
4) The classification rules are applied to the list.
5) The decision system produces the output, i.e., determines the category into which the message is classified.
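The five steps can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not the Classifier's code: the vocabulary entries, the category names "Life Insurance" and "Permanent Life Insurance", the `<NEG>` label, and the single rule "a negation cancels the pointer that follows it" are all invented for the example. The test query is the negation example from the Classification Rules slides.

```python
import re

NEGATION = "<NEG>"  # assumed special label for negation expressions

# Assumed vocabulary: (regular expression, pointer) pairs.
VOCABULARY = [
    (re.compile(r"^(no|not|isn't|besides)$", re.IGNORECASE), NEGATION),
    (re.compile(r"^life$", re.IGNORECASE), "Life Insurance"),
    (re.compile(r"^permanent(ly)?$", re.IGNORECASE), "Permanent Life Insurance"),
]


def tokenize(query):
    """Step 1: lexical analysis (here: split into lowercase word tokens)."""
    return re.findall(r"[\w']+", query.lower())


def lookup(tokens):
    """Steps 2-3: collect pointers to categories and negations in word order."""
    pointers = []
    for token in tokens:
        for pattern, pointer in VOCABULARY:
            if pattern.match(token):
                pointers.append(pointer)
    return pointers


def apply_negation(pointers):
    """Step 4 (one assumed rule): a negation cancels the pointer that follows it."""
    result, skip_next = [], False
    for pointer in pointers:
        if pointer == NEGATION:
            skip_next = True
        elif skip_next:
            skip_next = False
        else:
            result.append(pointer)
    return result


def classify(query):
    """Step 5: the pointers that survive the rules are the output categories."""
    return apply_negation(lookup(tokenize(query)))
```

For the query "I would like to insure my life while being on vacations only, not permanently." the lookup yields pointers for life, not, and permanently, and the assumed negation rule leaves only the life insurance category.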
13 Classification Rules: Intersection
Query: What does my auto insurance policy cover when I rent a car?
Retrieved items: auto, insurance, rent, car
Answer: In the not-too-distant past, most auto insurance policies would extend coverage to rental cars whenever you rented one. This is not quite true anymore. In most cases, your personal auto insurance policy will cover only vacation car rentals. Many insurance companies no longer extend personal auto insurance coverage for business travel. Find out what rental car coverage you have under your policy by calling your insurance agent/company.
14 Classification Rules: Negation
Query: I would like to insure my life while being on vacations only, not permanently.
Retrieved items: life, not, permanently
Answer: A specialized individual policy that commonly combines whole life insurance with decreasing term insurance. The whole life insurance portion of the policy is usually paid as a lump sum when the insured dies. The decreasing term insurance portion of the policy provides an income for a predetermined period to help support the insured's family...
15 Classification Rules: Union
Query: I've recently bought a new car. What would you recommend about buying insurance?
Retrieved items: new car, buying, insurance
Answer: Things you should consider when purchasing automobile insurance include: Decide how much liability coverage you want to carry. This is highly subjective. The liability levels you have on your other policies can serve as a guideline. Consult a financial professional if you need more advice. Determine which optional coverage you will need to feel protected. For example…
16 Classification Process
Query: I have an older car whose current market value is very low - do I really need auto insurance?
Retrieved items: older car, auto, insurance
Corresponding regular expressions: / old.._car../, /auto./, /insurance/
The names and positions of the proposed categories (Old car, Automobile Insurance, Personal Financial Services) are retrieved from the Classifier's vocabulary.
The most relevant category, Old car, is left after the rules are applied.
Category path: Personal Financial Services > Automobile Insurance > Old car
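One way to read this example is that, among the proposed categories, any category that is an ancestor of another proposed category is dropped, which leaves the most specific one. The Python sketch below shows that reading only; the slides do not spell out the actual rules, and the parent links (Old car under Automobile Insurance under Personal Financial Services) as well as the function names are assumptions.

```python
# Child -> parent links for the three categories in the example (assumed ordering).
PARENT = {
    "Old car": "Automobile Insurance",
    "Automobile Insurance": "Personal Financial Services",
    "Personal Financial Services": None,
}


def ancestors(category):
    """Return all categories above `category` in the tree, nearest first."""
    found = []
    parent = PARENT.get(category)
    while parent is not None:
        found.append(parent)
        parent = PARENT.get(parent)
    return found


def most_specific(candidates):
    """Drop every candidate that is an ancestor of another candidate."""
    covered = {a for c in candidates for a in ancestors(c)}
    return [c for c in candidates if c not in covered]
```

With the three proposed categories as input, only Old car survives, matching the outcome stated on the slide.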
17 Used Tools: TreeBuilder
TreeBuilder's GUI allows us:
- to easily create tree nodes and their sub-nodes and associate them with runnable Java objects or hard disk data (e.g., files)
- to edit the source code of rules and dictionaries
- to create tests
- to store the hierarchy and its data in XML format
- to copy and paste
- to view debug information
The main goal of TreeBuilder is to help a user create a tree (a hierarchy of categories in the Classifier).
18 Used Tools: RegExParser
RegExParser is a tool for morphological analysis and information extraction.
19 Used Tools: LinguaEngine
LinguaEngine is a rule-based engine used by the Classifier.
Input: a list of Java objects (in our case, a list of pointers to categories) and rules.
- During execution, a rule replaces a sub-list of objects with another list of objects.
- The engine makes it possible to unite rules into groups and to introduce a group order. For example, it is sometimes necessary to execute more important rules before less important ones.
- When none of the rules in the current group can be executed, the engine switches to the rules of the next group according to the group order.
- Technically, a rule is a Java method. A group of rules is implemented by means of a Java class, where every boolean method represents a rule.
Output: the objects left in the input list after all of the rules have been executed.
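The grouped rule execution described above can be sketched as follows. This is a Python illustration of the control flow only, not LinguaEngine itself: plain functions returning a boolean stand in for the Java boolean methods, a list of functions stands in for a Java class, and the two sample rules and the `<NEG>` label are assumptions.

```python
NEG = "<NEG>"  # hypothetical label for a negation pointer


def drop_negated(items):
    """Rule: replace the sub-list [NEG, x] with []. Returns True if it fired."""
    for i in range(len(items) - 1):
        if items[i] == NEG:
            del items[i:i + 2]
            return True
    return False


def merge_duplicates(items):
    """Rule: replace the sub-list [x, x] with [x]. Returns True if it fired."""
    for i in range(len(items) - 1):
        if items[i] == items[i + 1]:
            del items[i + 1]
            return True
    return False


def run(items, groups):
    """Run rule groups in order; move to the next group when no rule fires."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for group in groups:
        fired = True
        while fired:
            # any() stops at the first rule that fires, then the group restarts.
            fired = any(rule(items) for rule in group)
    return items
```

For example, `run(["Life", NEG, "Permanent", "Auto", "Auto"], [[drop_negated], [merge_duplicates]])` first exhausts the negation group and only then merges duplicates. Both sample rules shrink the list, so the loop terminates; rules that can grow the list would need an explicit guard.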
20 Classification Scheme
21 Summary
Our Main Novel Results:
Rule-based algorithm for text classification
 - takes the syntax and semantics of an incoming query into consideration
 - is based on a hierarchically structured subject domain
Effective tools:
1) LinguaEngine for rule execution
2) RegExParser for morphological analysis and information retrieval
3) TreeBuilder for working with hierarchical data
22 Our Team and Our Partners
The Classifier was implemented with the support and collaboration of the companies Exigen and Transnet.
Our team comes from the A.P. Ershov Institute of Informatics Systems (Siberian Division of the Russian Academy of Sciences). Besides the authors of the presented paper, the following specialists contributed to the Classifier project:
- Farida G. Dinenberg
- Irina S. Kononenko
- David Ya. Levin
- Dmitry V. Petunin
- Alexander L. Semenov
23 Possible Future Work
- Automatic construction of category trees
- Creating an automatic answering system based on information extraction, the hierarchical structure of the subject domain, and the rule-based engine from the Classifier
- Identifying the genre of an incoming query
- Trying new subject domains and new languages
24 Thank you for your attention!
This presentation was created for the Third International Conference on Intelligent Data Engineering and Automated Learning, Manchester, UK, August 12-14, 2002.