
REDUCED N-GRAM MODELS FOR IRISH, CHINESE AND ENGLISH CORPORA Nguyen Anh Huy, Le Trong Ngoc and Le Quan Ha Hochiminh City University of Industry, Ministry of Industry and Trade, Vietnam

CONTENT Introduction to the reduced n-gram approach Reduced n-gram algorithm Reduced n-grams and Zipf's law Methods of testing Perplexity for reduced n-grams Discussion Conclusions

Introduction to the reduced n-gram approach A statistical method to improve language models based on the removal of overlapping phrases. A distortion in the use of phrase frequencies had been observed in the Vodis Corpus The bigram RAIL ENQUIRIES and its super-phrase BRITISH RAIL ENQUIRIES both occur 73 times ENQUIRIES follows RAIL with a very high probability when it is preceded by BRITISH

Introduction to... (cont.) When RAIL is preceded by words other than BRITISH, ENQUIRIES does not occur, but words like TICKET or JOURNEY may The bigram RAIL ENQUIRIES therefore gives a misleading probability that RAIL is followed by ENQUIRIES irrespective of what precedes it The frequency of RAIL ENQUIRIES was reduced by subtracting the frequency of the larger trigram, which gave a probability of zero for ENQUIRIES following RAIL when it was not preceded by BRITISH
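The subtraction described on this slide can be sketched as follows. This is a minimal illustration, not the paper's implementation; the counts and the helper name `reduced_bigram_count` are hypothetical, chosen to mirror the Vodis example where every RAIL ENQUIRIES occurs inside BRITISH RAIL ENQUIRIES.

```python
from collections import Counter

# Hypothetical counts mirroring the Vodis example: every occurrence of
# RAIL ENQUIRIES lies inside the super-phrase BRITISH RAIL ENQUIRIES.
bigram_counts = Counter({("RAIL", "ENQUIRIES"): 73})
trigram_counts = Counter({("BRITISH", "RAIL", "ENQUIRIES"): 73})

def reduced_bigram_count(bigram, bigram_counts, trigram_counts):
    """Subtract the counts of all trigrams that end with this bigram."""
    overlap = sum(c for tri, c in trigram_counts.items() if tri[1:] == bigram)
    return bigram_counts[bigram] - overlap

print(reduced_bigram_count(("RAIL", "ENQUIRIES"), bigram_counts, trigram_counts))  # 0
```

The reduced count of zero is exactly the behaviour the slide describes: ENQUIRIES no longer follows RAIL unless BRITISH precedes it.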

Introduction to... (cont.) The phrase with a new reduced frequency is called a reduced phrase A phrase can occur in a corpus as a reduced n-gram in some places and as part of a larger reduced n-gram in other places In a reduced model, the occurrence of an n-gram is not counted when it is a part of a larger reduced n-gram One algorithm to detect/identify/extract reduced n-grams from a corpus is the so-called reduced n-gram algorithm

Reduced n-gram algorithm The main goal is to produce three main files The PHR file: contains all the complete n-grams appearing at least m times (m ≥ 2) The SUB file: contains all the n-grams appearing as sub-phrases, following the removal of the first word from any other complete n-gram in the PHR file The LOS file: contains any overlapping n-grams that occur at least m times in the SUB file The final list of reduced phrases is the FIN file FIN := PHR + LOS - SUB
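Since PHR, SUB and LOS are lists of phrases with frequencies, the closing formula is multiset arithmetic, which can be sketched with `collections.Counter`. The toy phrases and counts below are hypothetical, chosen only to show the set operation; they are not from the paper's corpora.

```python
from collections import Counter

# Toy multisets of phrases with hypothetical counts.
PHR = Counter({"british rail enquiries": 73, "rail ticket": 12})
SUB = Counter({"rail enquiries": 73, "ticket": 12})
LOS = Counter({"rail enquiries": 73})  # SUB phrases that also occur >= m times

# FIN := PHR + LOS - SUB
FIN = PHR + LOS
FIN.subtract(SUB)   # in-place subtraction, may leave non-positive counts
FIN = +FIN          # unary + drops entries with count <= 0
print(FIN)
```

Here the sub-phrases introduced by stripping the first word cancel out, and FIN keeps only the complete reduced phrases.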

Algorithm (cont.) Implementation: there are two additional files The SOR file: contains all the complete n-grams regardless of m. From SOR to create PHR, words are removed from the right-hand side of each phrase until the resultant phrase appears at least m times The POS file: for any SUB phrase, if one word can be added back on the right-hand side, one POS phrase will exist as the added phrase Thus, if any POS phrase appears at least m times, its original SUB phrase will be an overlapping n-gram in the LOS file
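The SOR-to-PHR trimming and the SUB/POS/LOS step can be sketched in a few lines. This is a loose, in-memory approximation under stated assumptions (the function name `build_files`, the choice to return phrase sets rather than frequency files, and scanning the whole vocabulary for POS extensions are all illustrative); the paper's version works on files with distribution and sort processes.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_files(tokens, max_n, m=2):
    # SOR: all n-grams up to max_n, regardless of m.
    SOR = Counter(g for n in range(1, max_n + 1) for g in ngrams(tokens, n))

    # PHR: trim words from the right until the phrase occurs at least m times.
    PHR = set()
    for g in SOR:
        while g and SOR[g] < m:
            g = g[:-1]
        if g:
            PHR.add(g)

    # SUB: remove the first word of each complete PHR phrase.
    SUB = {g[1:] for g in PHR if len(g) > 1}

    # POS: a SUB phrase with one word added back on the right; if any such
    # POS phrase occurs at least m times, the SUB phrase goes into LOS.
    vocab = set(tokens)
    LOS = {s for s in SUB if any(SOR.get(s + (w,), 0) >= m for w in vocab)}
    return PHR, SUB, LOS
```

On a toy corpus such as `"a b c a b c a b d".split()`, the rare trigram ending in `d` is trimmed back to a frequent prefix before entering PHR.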

Algorithm (cont.) The scope of this algorithm is limited to small, medium and large corpora To work well on very large corpora, it has been implemented using file distribution and sort processes Our previous English/Chinese results: 2005: Chinese TREC syllables of 19 million characters and its compound word version 2006: English Wall Street Journal (WSJ) of 40 million tokens; and North American News Text (NANT) of 500 million tokens 2006: Chinese compound word version of the Mandarin News corpus of 250 million syllables

Reduced n-grams and Zipf's law Wall Street Journal corpus (English)

Reduced n-grams & Zipf's law (cont.) North American News Text corpus (English)

Reduced n-grams & Zipf's law (cont.) Mandarin News compound words (Chinese)

Reduced n-grams & Zipf's law (cont.) Irish is a highly inflected Indo-European Celtic language Both the beginning and the end of words are regularly inflected The Irish corpus is taken from a corpus of 17th- and 18th-century Irish, with 7,122,537 tokens and 449,968 types Royal Irish Academy (RIA) corpus (Irish)
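The rank-frequency curves plotted on these slides can be reproduced with a short sketch. The helper name `zipf_points` is illustrative; under Zipf's law, frequency is roughly proportional to 1/rank, so the points fall near a straight line of slope -1 in log-log coordinates.

```python
import math
from collections import Counter

def zipf_points(tokens):
    """Return (log rank, log frequency) pairs for a rank-frequency plot."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]
```

The same routine applies whether the items counted are word tokens, syllables, or reduced n-grams, which is how the curves for the different corpora are compared.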

Reduced n-grams & Zipf's law (cont.) Royal Irish Academy (RIA) corpus (Irish)

Reduced n-grams & Zipf's law (cont.) Royal Irish Academy (RIA) corpus (Irish)

Methods of testing Weighted average model Compute the probabilities of all of the sentences Take the average perplexity over the sentences
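The two quantities named on this slide can be sketched as follows, assuming per-token log probabilities from some language model are already available. The function names are illustrative, and the slide's weighted average model is not fully specified here, so this shows only the standard per-sentence and whole-corpus perplexity computations.

```python
import math

def sentence_perplexity(log_probs):
    """Perplexity of one sentence from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def corpus_perplexity(sentences_log_probs):
    """Perplexity over all tokens of all sentences."""
    total = sum(sum(lp) for lp in sentences_log_probs)
    n_tokens = sum(len(lp) for lp in sentences_log_probs)
    return math.exp(-total / n_tokens)
```

For a uniform model assigning each token probability 1/4, both routines return a perplexity of 4, the size of the effective vocabulary.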

Perplexity for reduced n-grams Wall Street Journal corpus (English)

Perplexity … (cont.) North American News Text corpus (English)

Perplexity … (cont.) Mandarin News words (Chinese)

Perplexity … (cont.) RIA corpus (Irish)

Discussion Various perplexity improvements over the traditional 3-gram model (41.63% for Irish, 16.95% for Chinese and 4.18% for English) A significant reduction in model size, by factors from 11.2 to almost 15.1 for the Irish, Chinese and English models The Irish reduced model yields a much larger perplexity improvement than the English and Chinese models The reason is that Irish has numerous word inflections, and the meanings of inflected words are closely related

Conclusions The conventional n-gram language model is limited in terms of its ability to represent extended phrase histories To overcome this limitation, we created reduced n-gram models for the English, Chinese and Irish languages Our reduced models semantically contain more complete n-grams than traditional n-gram models They yield good improvements in perplexity and reductions in model size An encouraging step forward, although still very far from the final step in language modelling.

Thank you very much for your attention