Uncovering Languages from written documents. Nikitas N. Karanikolas, TEI of Athens, nnk@teiath.gr, and Panagiotis Ouranos, TEI of Athens, pouran24@gmail.com.

Transcript:

Uncovering Languages from written documents. Nikitas N. Karanikolas, TEI of Athens, and Panagiotis Ouranos, TEI of Athens. PCI 2014, Athens – Greece, October 2 – 4, 2014

Motivation
The goal is to identify the language used in a machine-processable electronic document. Cases where the input text is composed of several languages are more problematic, and this is a common situation in Web documents. Language identification is a prerequisite for NLP tasks such as full-text indexing, summarization, classification and computer-assisted assessment. PCI 2014, 3-Oct-2014

Background: coding systems, codepages
Automatic language identification of text can be decomposed into coding system identification followed by language identification. Coding systems include ASCII (7-bit), EBCDIC (8-bit), extended ASCII (8-bit) and Unicode. Codepages are used in 8-bit coding systems. They define the location (the code) of the basic graphemes (English letters, punctuation symbols and digits) and the location of other international or region-specific graphemes (characters). Codepage-based files can contain only a small set of languages (e.g. English and Greek).
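
A minimal Python sketch of why the codepage matters: the same three bytes read as Greek, Cyrillic or accented Latin letters depending on the codepage used to decode them, while ASCII bytes stay unchanged. The byte values and the choice of the Windows codepages are illustrative assumptions, not taken from the slides.

```python
# The same bytes above 0x7F map to different graphemes in each codepage.
raw = bytes([0xE1, 0xE2, 0xE3])

as_greek = raw.decode("cp1253")     # Windows 1253 (Greek)
as_cyrillic = raw.decode("cp1251")  # Windows 1251 (Cyrillic)
as_latin1 = raw.decode("cp1252")    # Windows 1252 (Latin 1)

# Bytes in the 7-bit ASCII range decode identically in all three codepages.
ascii_views = {name: b"abc".decode(name) for name in ("cp1253", "cp1251", "cp1252")}

print(as_greek, as_cyrillic, as_latin1)
print(ascii_views)
```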

Background: some codepages
Standard   Informal name            Microsoft's similar
ISO        Latin 1                  Windows 1252
ISO        Latin 2                  Windows 1250
ISO        Latin/Cyrillic           Windows 1251
ISO        Latin/Arabic             Windows 1256
ISO        Latin/Greek              Windows 1253
ISO        Latin/Hebrew             Windows 1255
ISO        Latin-5 or Turkish       Windows 1254
ISO        Latin-7 or Baltic Rim    Windows 1257

Background: Unicode
Unicode is a newer coding system designed to represent text-based data written in any language. Each character is assigned a code point that fits in 32 bits. Unicode can be implemented by different character encodings. UTF-32 is a 32-bit fixed-width encoding, able to encode every Unicode character. UTF-16 is a variable-width encoding that uses either 16 or 32 bits per character and is also able to encode every Unicode character. UTF-8 uses one byte for any ASCII character (the same code in both UTF-8 and ASCII) and up to four bytes for other characters.
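
As a quick illustration of these widths, the following Python sketch measures how many bytes a few sample characters occupy in each encoding. The sample characters are arbitrary choices; the little-endian codec variants are used only so that no byte order mark is counted.

```python
# Bytes needed per character in UTF-8, UTF-16 and UTF-32.
samples = ["A", "Θ", "€", "😀"]

widths = {}
for ch in samples:
    widths[ch] = (
        len(ch.encode("utf-8")),      # 1 to 4 bytes
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )

for ch, (u8, u16, u32) in widths.items():
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```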

Background: the extent of the problem
Identifying the codepage, in the case of single-byte encoding, is already very helpful. For example, identifying the ISO Latin/Greek codepage is enough to know that the text contains only Greek and (possibly) English words. Identifying the ISO Latin/Cyrillic codepage restricts the languages to a few (Bulgarian, Belarusian, Russian, Serbian, Ukrainian and English). In the case of Unicode encoding, the alphabet of each character is directly identified, but distinguishing between languages that share the same alphabet remains the same problem as in the previous example (languages using the Cyrillic alphabet).

Our approach: Assumptions
Language Identification of Multi-lingual Web Documents. We do not assume that a web document is written in a single language that we are trying to identify. We assume that the same document can contain several languages in different segments. Regarding the granularity of segmentation, we assume that each paragraph is written in a single language.

Our approach: Input and Output
Input: a URL. Output: a file with the .mnt file extension. It is plain text encoded in UTF-8, with a language identifier tag before every paragraph. The tag has the form \langxxxx, where xxxx is a decimal number, the same one used in RTF files.
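
A rough Python sketch of producing and reading back text in this shape. The exact layout (tag, one space, paragraph, one paragraph per line) and the helper names format_mnt and parse_mnt are assumptions made for illustration; the slides only state that a \langxxxx tag precedes every paragraph.

```python
import re

def format_mnt(paragraphs):
    """Render (language_id, text) pairs as tagged lines (assumed layout)."""
    return "".join(f"\\lang{lang_id} {text}\n" for lang_id, text in paragraphs)

def parse_mnt(content):
    """Recover (language_id, text) pairs from the assumed layout."""
    pairs = []
    for line in content.splitlines():
        match = re.fullmatch(r"\\lang(\d+) (.*)", line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs

# Identifiers 1032 and 1045 are the ones shown for Greek and Polish in the sample output.
paragraphs = [(1032, "Οι άνθρωποι κάνουν λάθη."), (1045, "nazywany jest Wielkim Tygodniem")]
mnt_text = format_mnt(paragraphs)
print(mnt_text, end="")
# To store it, open the target file with encoding="utf-8" and write mnt_text.
```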

Command
C:\> MultiNationalText html c:\log\test5 1
File test5.mnt
\lang1058 (Ουκρανικά) Для організації саме такої роботи Микола Азаров доручив внести пропозиції щодо бюджетного фінансування сервісного обслуговування дорогої апаратури, забезпечення витратними
\lang1049 (Ρώσικα / Λευκορώσικα) Об этом 17 марта на пресс-конференции в Минске заявил лидер кампании "Говори правду" Владимир Некляев. "Народный референдум" не является коалицией – это политическая кампания. "Мы будем делать все, чтобы в 2015 году оппозиция выдвинула
\lang1049 (Ρώσικα) Сборная Финляндии смогла пробиться в раунд плей-офф турнира благодаря сборной Швейцарии, которая не пустила туда сборную Латвии
\lang1049 (Ρώσικα / Βουλγάρικα) Фирмата провежда специална политика за предоставяне на услугите на преференциални цени за образованието, науката, медицината, армията и полицията
\lang3098 (Σέρβικα) Крунска улица је крајем године указом о категоризацији одређена као улица за подизање вила, али тек после Првог светског рата добиће резиденцијални карактер – када су изграђена
\lang1032 (Ελληνικά) Κατηγορείτε τους άλλους: Οι άνθρωποι κάνουν λάθη. Οι συνάδελφοι δεν κάνουν τη δουλειά τους καλά. Οι courier δεν φέρνουν τα πακέτα σας στην ώρα τους
\lang1029 (Τσέχικα) Nespisovná mluva obyvatelstva je rozlišena územně. V Čechách převládá interdialekt (nadnářeční útvar), zvaný obecná čeština, který se vyvinul na podkladě hlavních rysů nářeční
\lang1045 (Πολωνικά) Poprzedzający Wielkanoc tydzień, w czasie którego Kościół i wierni wspominają najważniejsze dla wiary chrześcijańskiej wydarzenia, nazywany jest Wielkim Tygodniem

Our approach: Algorithm
1. Request site data (allocate and fill RawBuffer)
2. Create output folder
3. Crawl and save external dependencies and main file, using sockets
4. Parse the html head tags and update Global variables (content_type, content_language, isUTF, CodepageID)
5. Remove any html tag and save the result (in TextData)
6. Validate that the values of variables content_type and content_language are acceptable
7. Create array (pParagraphs) with borders of paragraphs (attributes of pParagraphs elements reference into TextData)
8. FOR each item in array pParagraphs
9.     IF it is recognized as a codepage based buffer
10.        Convert paragraph to UTF-8
11.    End IF
12.    Replace Escapes with UTF-8 encodings
13.    Identify the language of item (paragraph)
14.    Save the language identifier
15.    Save (append) paragraph into .mnt file
16. End FOR
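
Steps 5, 7 and 9 to 12 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes that every tag marks a paragraph border, that "Escapes" means HTML character references, and that the codepage is already known; the function name extract_paragraphs is hypothetical.

```python
import html
import re

def extract_paragraphs(raw_page, codepage="utf-8"):
    """Return the paragraphs of an HTML page as decoded strings."""
    # Step 5 (simplified): replace every tag with a line break.
    text_data = re.sub(rb"<[^>]+>", b"\n", raw_page)
    # Step 7 (simplified): each non-empty line is treated as one paragraph.
    p_paragraphs = [line.strip() for line in text_data.split(b"\n") if line.strip()]
    paragraphs = []
    for item in p_paragraphs:                # step 8
        decoded = item.decode(codepage)      # steps 9-10: codepage bytes to Unicode text
        decoded = html.unescape(decoded)     # step 12: resolve character references
        paragraphs.append(decoded)
    return paragraphs

page = "<html><body><p>Καλημέρα &amp; γεια</p><p>ab&#1073;</p></body></html>".encode("cp1253")
result = extract_paragraphs(page, "cp1253")
print(result)
```

Writing each returned string with encoding="utf-8" then yields the UTF-8 output that steps 10 and 12 aim for.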

Our approach: Algorithm – Step 4
Given the html head tags of a page, Step 4 updates the Global variables, for example as follows: content_type:=utf-8; content_language:=el; isUTF:=True; CodepageID:=65001;
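
One way Step 4 could look in Python, using the standard html.parser module. The sample head tags, the class and function names, and the small charset-to-codepage table are hypothetical; only the four variable names and their example values come from the slide.

```python
from html.parser import HTMLParser

# Hypothetical lookup from charset names to Windows codepage identifiers.
CODEPAGE_IDS = {"utf-8": 65001, "windows-1251": 1251, "windows-1253": 1253}

class HeadMetaParser(HTMLParser):
    """Collects charset and language declarations from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.content_type = None
        self.content_language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        http_equiv = attributes.get("http-equiv", "").lower()
        content = attributes.get("content", "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.content_type = content.split("charset=")[1].strip(" ;")
        elif http_equiv == "content-language":
            self.content_language = content.strip()
        elif "charset" in attributes:
            self.content_type = attributes["charset"].strip().lower()

def parse_head(html_text):
    parser = HeadMetaParser()
    parser.feed(html_text)
    return {
        "content_type": parser.content_type,
        "content_language": parser.content_language,
        "isUTF": parser.content_type == "utf-8",
        "CodepageID": CODEPAGE_IDS.get(parser.content_type),
    }

# Hypothetical head section, not the one shown on the original slide.
SAMPLE_HEAD = """<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="el">
</head>"""
settings = parse_head(SAMPLE_HEAD)
print(settings)
```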

Our approach: Algorithm - Step 13
Step 13 checks for character patterns that are unique (or almost unique) to one of the supported languages. Each occurrence of such a pattern increases the weight (the likelihood) of the corresponding language. These patterns can be short words or postfixes (word endings). [positives] They can also be single letters, when these letters are used by only one of the languages that share the same alphabet. [positives] The presence of a letter that a given language has dropped from the alphabet it shares with other languages automatically nullifies the possibility of that language. [negatives]

Our approach: Algorithm – Step 13 – positives
Pattern   Language     Type
Љ         Serbian      Letter
Њ         Serbian      Letter
они       Russian      Short word
ет        Russian      Postfix
ў         Belarusian   Letter
гэта      Belarusian   Short word
да        Bulgarian    Short word
що        Ukrainian    Short word
Θ         Greek        Letter
že        Czech        Short word
się       Polish       Short word
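
The positive patterns above can be turned into a simple scoring routine. The Python sketch below is one possible reading, assuming that each occurrence adds a weight of one, that matching ignores case, and that the highest total wins; the slides do not specify the actual weights, and the function names are hypothetical.

```python
import re
from collections import Counter

# (pattern, language, type) rows taken from the positives table.
POSITIVE_PATTERNS = [
    ("Љ", "Serbian", "letter"), ("Њ", "Serbian", "letter"),
    ("они", "Russian", "word"), ("ет", "Russian", "postfix"),
    ("ў", "Belarusian", "letter"), ("гэта", "Belarusian", "word"),
    ("да", "Bulgarian", "word"), ("що", "Ukrainian", "word"),
    ("Θ", "Greek", "letter"), ("že", "Czech", "word"), ("się", "Polish", "word"),
]

def score_paragraph(text):
    """Count pattern hits per language (assumed weight: 1 per hit)."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    scores = Counter()
    for pattern, language, kind in POSITIVE_PATTERNS:
        p = pattern.lower()
        if kind == "letter":
            hits = lowered.count(p)
        elif kind == "word":
            hits = sum(1 for w in words if w == p)
        else:  # postfix: a longer word ending with the pattern
            hits = sum(1 for w in words if w.endswith(p) and w != p)
        if hits:
            scores[language] += hits
    return scores

def best_language(text):
    scores = score_paragraph(text)
    return scores.most_common(1)[0][0] if scores else None

print(best_language("Οι άνθρωποι κάνουν λάθη."))
print(best_language("Они знают, что он идет"))
```

A paragraph that contains none of the listed patterns gets no score at all in this sketch, which is why adding patterns is the natural way to raise coverage.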

Our approach: Algorithm – Step 13 – negatives
Step 13 also checks for letters that a given language has dropped from the alphabet it shares with other languages. The presence of such a letter automatically nullifies the possibility of that language. For example, the letters Ѥ, Я, Ю, Ѱ and Ъ (which exist in the Cyrillic alphabet) are not used in the Serbian language.
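
The negative rule can be sketched in Python as a veto applied to per-language scores. Only the Serbian letter set comes from the slide; the function name, the score dictionary and the case-insensitive matching are assumptions.

```python
# Letters whose presence rules a language out (Serbian set from the slide).
EXCLUDED_LETTERS = {"Serbian": set("ѤЯЮѰЪ")}

def apply_negatives(scores, text):
    """Zero the score of every language that has an excluded letter in the text."""
    letters_present = set(text.upper())
    return {
        language: 0 if EXCLUDED_LETTERS.get(language, set()) & letters_present else score
        for language, score in scores.items()
    }

# Hypothetical scores for a paragraph before the veto is applied.
scores = {"Serbian": 1, "Russian": 2}
print(apply_negatives(scores, "Я знаю"))            # contains Я, so Serbian is ruled out
print(apply_negatives(scores, "Крунска улица је"))  # no excluded letter, scores unchanged
```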

Utilization
So far, our approach supports eight languages: Belarusian, Bulgarian, Czech, Greek, Polish, Russian, Serbian and Ukrainian. It can be used as a preparatory step before full-text indexing, summarization, classification or any other NLP task. We also intend it to be used for archiving and offline reproduction of web pages. In addition to the .mnt file, it creates an index.htm file with resolved external dependencies.

Evaluation
We have created a rather small set of seven multi-language documents. Each document usually contains seven to nine paragraphs. Each paragraph is written in one of the eight supported languages. Each document usually intermixes four to five of the eight supported languages. In this way, we have a set of fifty-eight paragraphs. Our approach correctly identifies forty of them. Thus the success rate is about sixty-nine per cent (69%).
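
The reported rate follows directly from the two counts, as this short Python check shows (variable names are illustrative).

```python
total_paragraphs = 58      # paragraphs in the seven test documents
correctly_identified = 40  # paragraphs whose language was identified correctly

success_rate = correctly_identified / total_paragraphs
print(f"{success_rate:.1%}")
```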

Conclusions
With a small set of rules and a compact design, we have achieved an acceptable success rate. Increasing this rate by introducing some more patterns seems to be an easy target. The success rate could approach one hundred per cent with the introduction of more elaborate rules (e.g. with collocations). The interesting and naïve point is that our approach identifies the language per paragraph rather than a single language for the whole document. The other interesting point is the introduction of the .mnt file type. This file type can be used for the interchange of multi-language documents.

Uncovering Languages from written documents
Thank you for your attention. We will try to answer your questions. Nikitas N. Karanikolas