Basic strategies in machine translation: Rule-based vs. Non-rule-based MT systems.


The first decisions to be considered in designing an MT system:
- Multilingual or bilingual?
- Which method: direct, transfer or interlingua? This is very important because it affects the whole strategy.
- What computational environment as a whole: a batch system or an interactive system?
- How is lexical data to be organized?

Bilingual systems
Bilingual systems may be:
1. unidirectional (Language1 → Language2) or bidirectional (Language1 ↔ Language2);
2. reversible or non-reversible: in a reversible system the process of language generation is the opposite of language analysis. For example, the English analysis module in an English-German system will mirror the English generation module in a German-English system.
However, nearly all bilingual systems are in effect two unidirectional systems running on the same computer, and the methods of analysis and generation for each language are designed independently. Such a bilingual system is best represented as Language1 → Language2 + Language2 → Language1 rather than Language1 ↔ Language2.

Multilingual systems
- Involve more than two languages.
- Systems covering many languages are rare; one example is the EC's Eurotra project.
- May not cover all language pairs and directions.
- A 'truly' multilingual system is one in which the analysis and generation components for a particular language remain constant (and separate) whatever other languages are involved.

Three basic types of MT systems
- Direct systems
- Transfer systems
- Interlingua systems
Direct systems = first generation; transfer and interlingua systems = second generation.
MT systems divide into direct systems and indirect systems; indirect systems are in turn either transfer systems or interlingua systems.

Direct systems

- Called direct because they lack any kind of intermediate stage in translation.
- Traces of the direct approach are found even in contemporary indirect systems.
- A direct MT system is designed in all details specifically for one particular pair of languages in one direction.
- The source text receives only shallow analysis.

Summary of the direct approach:
1. Morphological analysis: the system identifies word endings and reduces inflected forms to their canonical forms.
2. Dictionary look-up: the results are then input into a large bilingual dictionary look-up program.
3. Local reordering: rules give more acceptable target language output, perhaps moving some adjectives or verb particles.
4. Finally, the target language text is produced.
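The four steps above can be sketched as a toy program. This is only an illustration under stated assumptions: the English-Spanish language pair, the tiny lemma table and lexicon, and the function names are all hypothetical and do not come from any real system.

```python
# Toy direct MT pipeline (English -> Spanish). All data is illustrative.

# Step 1 data: inflected form -> canonical form.
LEMMAS = {"cats": "cat", "houses": "house", "sees": "see"}

# Step 2 data: canonical form -> (target word, part of speech).
LEXICON = {
    "the": ("el", "DET"),
    "black": ("negro", "ADJ"),
    "white": ("blanco", "ADJ"),
    "cat": ("gato", "NOUN"),
    "house": ("casa", "NOUN"),
    "see": ("ve", "VERB"),
}


def analyse(word):
    """Morphological analysis: reduce a word to its canonical form."""
    word = word.lower()
    return LEMMAS.get(word, word)


def lookup(lemma):
    """Dictionary look-up; unknown words are passed through untranslated."""
    return LEXICON.get(lemma, (lemma, "UNK"))


def reorder(tagged):
    """Local reordering: move an adjective after the noun it precedes."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def translate_direct(sentence):
    """Run the steps in order and produce the target text."""
    tagged = [lookup(analyse(w)) for w in sentence.split()]
    return " ".join(word for word, _ in reorder(tagged))


print(translate_direct("The black cat sees the white house"))
```

The sketch prints "el gato negro ve el casa blanco". The adjective order is repaired by the local rule, but the article and adjective do not agree with the feminine noun (Spanish requires "la casa blanca"), because nothing in the pipeline looks at syntactic structure.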

Limitations of direct systems:
- no analysis of syntactic structure or of semantic relationships;
- basically 'word-for-word' translation;
- frequent mistranslations at the lexical level;
- largely inappropriate syntactic structures.

Some examples of the output of an English-Ukrainian direct MT system:
(1) The board of directors discussed some financial proposals.
Direct MT output: *Дошка директорів обговорила деякі фінансові пропозиції. (Here 'дошка' means a wooden board or plank.)
Acceptable translation: На раді директорів обговорили декілька фінансових пропозицій.

(2) Table 1 shows us the growing indices.
Direct MT output: *Стіл 1 показує нам ростучі індекси. (Here 'стіл' means a table as a piece of furniture.)
Acceptable translation: На таблиці 1 наведено показники росту (зростання).

How useful is the direct approach today?
- It continues to some extent in many unidirectional bilingual systems.
- It takes advantage of similarities of structure and vocabulary between SL and TL.
- The designers are then able to concentrate most effort on the areas of grammar and syntax where the languages differ most.

Indirect methods

The failures of direct systems led to the development of intermediate representations, that is, representations of meaning. The system generates the target text from these representations. This is the essence of the indirect method. It has two principal variants:
- Interlingua systems
- Transfer systems

Interlingua systems

Interlingua systems/method
SL text → intermediate representation (a proposition: PREDICATE (node) + ARGUMENTS (agent, object)) → TL text

The main feature: a representation in the middle, called the interlingua.
- The source text is analyzed into the interlingua.
- The target text is generated from the interlingua.
- The interlingua is an abstract representation, neutral between two or more languages.
- Each analysis and generation module is independent and remains the same no matter what the SL or TL is.
- Most attractive for multilingual systems.

Advantage: to add a new language to the system, one needs to create just two new modules: an analysis grammar and a generation grammar.
Disadvantages:
- It is difficult to create an interlingua, even for closely related languages, e.g. the Slavic languages Ukrainian, Belarusian, Polish and Russian.
- A truly 'universal' and language-independent interlingua has not been created so far.

Transfer systems

Transfer systems/method

- Transfer systems use bilingual modules between the intermediate representations of each of the two languages.
- These representations are language-dependent and are, typically, phrase-structure trees.
- SL text is analyzed into SL trees.
- SL trees are converted to TL trees.
- TL text is generated from these trees.
- These representations (trees) are called interface representations.

Procedures:
(1) French analysis (ambiguities are resolved)
(2) French-English transfer (performed by a French-English bilingual module)
(3) English generation (English text generated)
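The three procedures can be illustrated with a minimal sketch for a single French noun phrase. Everything here is a hypothetical simplification: the tuple-based tree format, the three-word lexicon FR_EN and the function names are assumptions made for the example, and a real analysis stage would also resolve ambiguities.

```python
# Toy transfer pipeline for a French noun phrase of the form "det noun adj".

# Hypothetical French-English bilingual lexicon used by the transfer module.
FR_EN = {"le": "the", "chat": "cat", "noir": "black"}


def analyse_fr(text):
    """(1) French analysis: build an SL tree, here a flat NP node."""
    det, noun, adj = text.split()
    return ("NP", ("DET", det), ("N", noun), ("ADJ", adj))


def transfer_fr_en(tree):
    """(2) French-English transfer: convert the SL tree into a TL tree."""
    _, det, noun, adj = tree

    def translate(node):
        label, word = node
        return (label, FR_EN[word])

    # English noun phrases put the adjective before the noun.
    return ("NP", translate(det), translate(adj), translate(noun))


def generate_en(tree):
    """(3) English generation: read the words off the TL tree."""
    return " ".join(word for _, word in tree[1:])


print(generate_en(transfer_fr_en(analyse_fr("le chat noir"))))
```

Only transfer_fr_en knows about both languages; the analysis and generation functions are monolingual, which is the division of labour the transfer method relies on.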

Disadvantages:
- A lot of work to add a new language.
Advantages:
- Transfer modules are easier to devise than an interlingua.
- Analysis and generation are only between two languages in each case, which is easier.
- It is possible to use similarities between the two languages in each pair.
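The cost of adding a language under the two indirect methods can be made concrete by counting modules. The sketch below assumes a system that covers every language pair in both directions, with one analysis and one generation module per language and, for the transfer method, one transfer module per ordered language pair. The function names are illustrative.

```python
def interlingua_modules(n):
    """One analysis and one generation module for each of n languages."""
    return 2 * n


def transfer_modules(n):
    """n analysis + n generation modules, plus one transfer module
    for each ordered pair of distinct languages."""
    return 2 * n + n * (n - 1)


for n in (2, 4, 8):
    print(n, interlingua_modules(n), transfer_modules(n))
```

Under these assumptions, adding one language to an n-language system costs 2 new modules with an interlingua, but 2 + 2n with transfer, since the newcomer needs a transfer module to and from every existing language.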

The MT pyramid

Compare the relative sizes of the three components: analysis, transfer and generation. The apex of the pyramid represents the theoretical interlingual representation. The more the text is analysed, the simpler transfer will be. Early direct systems sit at the bottom: they perform minimal monolingual analysis, and nearly all the work is done in transfer.

Non-rule-based MT
Late 1980s: research at the IBM Research Laboratories.
Result: the first statistical MT system.
Statistical MT (SMT) and Example-Based MT (EBMT) together represent the empirical approaches to MT.

Statistical MT systems
Statistical MT systems rely on probabilistic and statistical models of the translation process, trained on large bilingual corpora. These models include little or no explicit linguistic knowledge, relying instead on the distributional properties of words and phrases to establish their most likely translation.
General idea of SMT: we look for features of a bilingual corpus and see how these features can be used to predict translations.

In a purely statistical method of translation modeling, it is presumed that each word of the TT may, with a certain probability, be a translation of each word of the ST. The essence of the method:
- the alignment of sentences in the two languages, and
- the calculation of the probabilities that any one word in a sentence of one language corresponds to two, one or zero words in the translated sentence in the other language.
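One well-known way to estimate such word translation probabilities is expectation-maximization in the style of IBM Model 1. The sketch below is a simplified illustration, not the method of any particular system: it assumes the sentences are already aligned, it ignores the two/one/zero-word (fertility) modeling mentioned above, and the three-sentence English-German corpus and the function names are invented for the example.

```python
from collections import defaultdict

# Toy sentence-aligned corpus: (source words, target words). Illustrative only.
CORPUS = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]


def train_word_probs(corpus, iterations=10):
    """Estimate prob[(t, s)], the probability that target word t
    translates source word s, by expectation-maximization."""
    src_vocab = {s for src, _ in corpus for s in src}
    tgt_vocab = {t for _, tgt in corpus for t in tgt}
    # Start from a uniform distribution: every pairing is equally likely.
    prob = {(t, s): 1.0 / len(tgt_vocab) for s in src_vocab for t in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for src, tgt in corpus:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts into probabilities.
        for (t, s) in prob:
            prob[(t, s)] = count[(t, s)] / total[s]
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


probs = train_word_probs(CORPUS)
for word in ("the", "house", "book", "a"):
    print(word, "->", best_translation(probs, word))
```

No dictionary is supplied: the pairings emerge only from which words co-occur across the aligned sentences, which is the sense in which such models rely on distributional properties rather than linguistic knowledge.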

Example-based MT
EBMT involves:
1. the alignment of texts,
2. the matching of input sentences against phrases (examples) in the corpus,
3. the selection and
4. extraction of equivalent TL phrases,
5. the adaptation and combining of TL phrases as acceptable output sentences.

In SMT, the core process involves a translation model which takes as input SL words or word sequences (phrases) and produces as output TL words or word sequences. In EBMT, the core process is the selection and extraction of TL fragments corresponding to SL fragments. It is preceded by an analysis stage for the decomposition of input sentences into appropriate fragments (or templates with variables) and their matching against SL fragments (in a database).
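The EBMT core process can be sketched as decomposition of the input into fragments, matching them against a database of examples, and combining the extracted TL fragments. The sketch assumes a tiny hand-made English-French example base and uses exact, greedy longest-match lookup; the database contents and the function name are hypothetical, and the adaptation step of real EBMT systems is omitted.

```python
# Hypothetical example base: SL fragment (as a word tuple) -> TL fragment.
EXAMPLES = {
    ("the", "meeting"): "la réunion",
    ("starts", "at"): "commence à",
    ("nine", "o'clock"): "neuf heures",
    ("the",): "le",
}


def translate_ebmt(sentence, examples=EXAMPLES):
    """Greedily cover the input with the longest matching SL fragments
    and join the corresponding TL fragments."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest fragment starting at position i first.
        for j in range(len(words), i, -1):
            fragment = tuple(words[i:j])
            if fragment in examples:
                output.append(examples[fragment])
                i = j
                break
        else:
            # No example covers this word: pass it through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


print(translate_ebmt("The meeting starts at nine o'clock"))
```

Longest-match selection prefers the two-word example "the meeting" over the single-word entry for "the", so the article comes out in the form that fits the noun in the stored example.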