Presentation on the topic: "Basic strategies in machine translation: Rule-based vs. Non-rule-based MT systems." Transcript:
Basic strategies in machine translation: Rule-based vs. Non-rule-based MT systems
The first decisions to be considered in designing an MT system: Multilingual or bilingual? Which method: direct, transfer or interlingua? (Very important, because it affects the whole strategy.) What is the computational environment as a whole: a batch system or an interactive system? How is lexical data to be organized?
Bilingual systems may be: 1. unidirectional (Language1 → Language2) or bidirectional (Language1 ↔ Language2); 2. reversible or non-reversible: in a reversible system the process of language generation is the inverse of language analysis. For example, the English analysis module in an English → German system will mirror the English generation module in a German → English system. But: nearly all bilingual systems are in effect two unidirectional systems running on the same computer, and the methods of analysis and generation for each of the languages are designed independently. Such a bilingual system is best represented as Language1 → Language2 + Language1 ← Language2 instead of Language1 ↔ Language2.
Multilingual systems: Involve more than two languages. Systems with many languages are rare, but one example is the EC's Eurotra project. They may not cover all the pairs and directions. A 'truly' multilingual system is one in which the analysis and generation components for a particular language remain constant (and separate) whatever other languages are involved.
Three basic types of MT systems: direct systems, transfer systems, interlingua systems. Direct systems = first generation. Transfer and interlingua systems = second generation. Classification: MT systems divide into direct systems and indirect systems; indirect systems divide into transfer systems and interlingua systems.
Direct systems are called direct because they lack any kind of intermediate stage in translation. Traces of the direct approach are found even in contemporary indirect systems. A direct MT system is designed in all details specifically for one particular pair of languages in one direction. It performs only a shallow analysis of the source text.
Summary of the direct approach: 1. Morphological analysis: the system identifies word endings and reduces inflected forms to their canonical forms. 2. Dictionary look-up: the system then feeds the results into a large bilingual dictionary look-up program. 3. Local reordering: rules give more acceptable target language output, perhaps moving some adjectives or verb particles. 4. Finally, the target language text is produced.
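The stages above can be sketched in a few lines of Python. This is only a toy illustration, not a real system: the English-French word list, the suffix-stripping rule, and the single adjective-noun reordering rule are all hypothetical, and agreement and ambiguity are ignored, which is exactly the weakness of the direct approach.

```python
# Toy direct MT pipeline (English -> French).
# All data and rules here are hypothetical illustrations.

# Bilingual dictionary: canonical English form -> (French word, part of speech)
LEXICON = {
    "the": ("le", "DET"),
    "black": ("noir", "ADJ"),
    "cat": ("chat", "NOUN"),
    "sleep": ("dort", "VERB"),
}


def lemmatize(word):
    """Morphological analysis: reduce an inflected form to its canonical form."""
    word = word.lower()
    if word in LEXICON:
        return word
    # Strip a final "s" if the remainder is a known canonical form.
    if word.endswith("s") and word[:-1] in LEXICON:
        return word[:-1]
    return word


def lookup(lemma):
    """Dictionary look-up: unknown words are passed through untranslated."""
    return LEXICON.get(lemma, (lemma, "UNK"))


def reorder(tagged):
    """Local reordering: move an adjective after the noun it precedes."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def translate(sentence):
    """Run the three stages in sequence and produce the target text."""
    lemmas = [lemmatize(w) for w in sentence.split()]
    tagged = [lookup(lemma) for lemma in lemmas]
    return " ".join(word for word, _ in reorder(tagged))
```

Calling `translate("The black cat sleeps")` yields `le chat noir dort`. No syntactic structure is built at any point: each word is handled on its own, and the only structural knowledge is the one local swap rule.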
Limitations of direct systems: no analysis of syntactic structure or of semantic relationships; basically 'word-for-word' translation; frequent mistranslations at the lexical level; largely inappropriate syntactic structures.
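Some examples of the output of an English-Ukrainian direct MT system (an asterisk marks the unacceptable machine output; an acceptable translation follows it): (1) The board of directors discussed some financial proposals. *Дошка директорів обговорила деякі фінансові пропозиції. ('Board' is rendered as 'дошка', a wooden plank.) Acceptable: На раді директорів обговорили декілька фінансових пропозицій.
(2) Table 1 shows us the growing indices. *Стіл 1 показує нам ростучі індекси. ('Table' is rendered as 'стіл', a piece of furniture.) Acceptable: На таблиці 1 наведено показники росту (зростання).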
How useful is the direct approach today? It continues to some extent in many unidirectional bilingual systems. It takes advantage of similarities of structure and vocabulary between the SL and the TL. The designers are then able to concentrate most of their effort on the areas of grammar and syntax where the languages differ most.
Failures of direct systems led to the development of intermediate representations, i.e. representations of meaning. Based on them, the system generates the target text. This is the essence of the indirect method. It has two principal variants: interlingua systems and transfer systems.
Interlingua systems/method: SL text → intermediate representation (a proposition: PREDICATE (node) + ARGUMENTS (agent, object)) → TL text
The main feature: a representation in the middle, called the interlingua. The source text is analyzed into the interlingua. The target text is generated from the interlingua. The interlingua is an abstract representation neutral between two or more languages. Each analysis and generation module is independent and remains the same no matter what the SL or TL is in translation. This makes the method most attractive for multilingual systems.
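A minimal sketch of this architecture, assuming a toy interlingua made of one predicate with an agent and an object. The `Proposition` class, the concept labels, and the three-word article-free "sentences" are hypothetical simplifications; a real analysis module would need full parsing and disambiguation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """Language-neutral representation: a predicate plus its arguments."""
    predicate: str
    agent: str
    obj: str


# One lexicon per language, mapping words to concept labels (hypothetical data).
LEXICONS = {
    "en": {"cat": "CAT", "sees": "SEE", "dog": "DOG"},
    "de": {"Katze": "CAT", "sieht": "SEE", "Hund": "DOG"},
}


def analyze(sentence, lang):
    """Analysis module: SL text -> interlingua (assumes subject-verb-object order)."""
    lexicon = LEXICONS[lang]
    subject, verb, obj = sentence.split()
    return Proposition(lexicon[verb], lexicon[subject], lexicon[obj])


def generate(prop, lang):
    """Generation module: interlingua -> TL text."""
    words = {concept: word for word, concept in LEXICONS[lang].items()}
    return " ".join([words[prop.agent], words[prop.predicate], words[prop.obj]])


def translate(sentence, source, target):
    """Translation is analysis followed by generation; no module knows the other language."""
    return generate(analyze(sentence, source), target)
```

Here `translate("cat sees dog", "en", "de")` gives `Katze sieht Hund`, and the same `analyze` and `generate` functions serve every language pair. In this sketch, supporting another language means adding one entry to `LEXICONS`, without touching the existing ones.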
Advantage: to add a new language to the system, one needs to create just two new modules: an analysis grammar and a generation grammar. Disadvantages: It is difficult to create an interlingua, even for closely related languages, e.g. the Slavic languages Ukrainian, Belarusian, Polish and Russian. A truly 'universal' and language-independent interlingua has not been created so far.
Transfer systems use bilingual modules between the intermediate representations of each of the two languages. These representations are language-dependent and are, typically, phrase-structure trees. SL text is analyzed into SL trees. SL trees are converted to TL trees. TL text is generated from these trees. These representations/trees are called interface representations.
Procedures, e.g. for French-English translation: (1) French analysis (ambiguities are resolved); (2) French-English transfer (performed by a French-English bilingual module); (3) English generation (English text is generated).
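The three stages can be illustrated with a toy French-English example. Everything here is an assumption made for illustration: the tree is a flat noun phrase stored as a `(label, children)` tuple, the lexicons cover three words, and the transfer module has a single structural rule (noun + adjective becomes adjective + noun).

```python
# Toy transfer pipeline (French -> English); all data and rules are hypothetical.

FR_TAGS = {"le": "DET", "chat": "N", "noir": "ADJ"}      # monolingual French lexicon
FR_EN = {"le": "the", "chat": "cat", "noir": "black"}    # bilingual transfer lexicon


def analyze_fr(text):
    """Stage 1, French analysis: text -> SL tree (a flat NP of tagged words)."""
    return ("NP", [(FR_TAGS[word], word) for word in text.split()])


def transfer_fr_en(tree):
    """Stage 2, French-English transfer: SL tree -> TL tree."""
    label, children = tree
    # Lexical transfer: replace each French leaf by its English equivalent.
    children = [(tag, FR_EN[word]) for tag, word in children]
    # Structural transfer: French N + ADJ becomes English ADJ + N.
    out = []
    i = 0
    while i < len(children):
        if (i + 1 < len(children)
                and children[i][0] == "N" and children[i + 1][0] == "ADJ"):
            out += [children[i + 1], children[i]]
            i += 2
        else:
            out.append(children[i])
            i += 1
    return (label, out)


def generate_en(tree):
    """Stage 3, English generation: TL tree -> text."""
    return " ".join(word for _, word in tree[1])
```

Chaining the stages, `generate_en(transfer_fr_en(analyze_fr("le chat noir")))` returns `the black cat`. Only `transfer_fr_en` knows about both languages; the analysis and generation functions are monolingual.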
Disadvantages: It takes a lot of work to add a new language. Advantages: Transfer modules are easier to devise than an interlingua. Analysis and generation are only between two languages in each case, which is easier. It is possible to use similarities between the two languages in each pair.
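The difference in effort can be made concrete by counting modules. The sketch below assumes that every pair and direction is covered, that each language has one analysis and one generation module, and that a transfer system needs one transfer module per ordered language pair; the function names are hypothetical.

```python
def interlingua_modules(n_languages):
    """One analysis and one generation module per language."""
    return 2 * n_languages


def transfer_modules(n_languages):
    """Analysis and generation per language, plus one transfer module per ordered pair."""
    return 2 * n_languages + n_languages * (n_languages - 1)


def cost_of_adding_language(count_modules, n_languages):
    """New modules needed to go from n_languages to n_languages + 1."""
    return count_modules(n_languages + 1) - count_modules(n_languages)
```

Under these assumptions, a four-language interlingua system has 8 modules and a fifth language adds 2, while a four-language transfer system has 20 modules and a fifth language adds 10 (2 monolingual modules plus 8 transfer modules).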
In the pyramid diagram, compare the relative sizes of the three components: analysis, transfer and generation. The apex of the pyramid represents the theoretical interlingual representation. The more the text is analysed, the simpler transfer will be. Early direct systems (at the bottom) perform minimal monolingual analysis, and nearly all the work is done in transfer.
Non-rule-based MT. Late 1980s and early 1990s: research at the IBM Research Laboratories. Result: the first statistical MT system. Statistical MT (SMT), along with Example-Based MT (EBMT), represents the empirical approaches in MT.
Statistical MT systems rely on probabilistic and statistical models of the translation process trained on large bilingual corpora. These models include little or no explicit linguistic knowledge, relying instead on the distributional properties of words and phrases in order to establish their most likely translation. General idea of SMT: we look for features of a bilingual corpus and see how these features can be used to predict translations.
In the purely statistical method of translation modeling, it is presumed that, with a certain probability, each word of the TT may be a translation of each word of the ST. The essence of the method: (1) the alignment of sentences in the two languages, and (2) the calculation of the probabilities that any one word in a sentence of one language corresponds to two, one or zero words in the translated sentence in the other language.
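As an illustration of how such word-translation probabilities can be estimated from aligned sentences, here is a minimal sketch of expectation-maximization in the style of IBM Model 1. This is a deliberately simplified stand-in: it estimates only t(target word | source word), and it does not model how many words a word corresponds to (the "two, one or zero" part). The three-sentence English-German corpus is a toy assumption.

```python
def train_word_probabilities(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] from sentence-aligned token lists."""
    target_vocab = {tw for _, target in pairs for tw in target}
    # Start from a uniform guess for every word pair that co-occurs in a sentence pair.
    t = {}
    for source, target in pairs:
        for sw in source:
            for tw in target:
                t[(tw, sw)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = {key: 0.0 for key in t}   # expected counts of (target, source) links
        total = {}                        # expected counts per source word
        for source, target in pairs:
            for tw in target:
                # E-step: share each target word among the source words of its sentence.
                norm = sum(t[(tw, sw)] for sw in source)
                for sw in source:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] = total.get(sw, 0.0) + fraction
        # M-step: renormalize so the probabilities for each source word sum to 1.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Most probable target word for a given source word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Toy sentence-aligned corpus (hypothetical data).
CORPUS = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
```

With `t = train_word_probabilities(CORPUS)`, `best_translation(t, "book")` returns `Buch`, even though no dictionary was supplied: "book" and "Buch" co-occur in two sentence pairs, and "das" is better explained by "the".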
Example-based MT (EBMT): 1. the alignment of texts; 2. the matching of input sentences against phrases (examples) in the corpus; 3. the selection and 4. the extraction of equivalent TL phrases; 5. the adaptation and combining of TL phrases as acceptable output sentences.
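Steps 2 to 5 can be sketched as follows, assuming that step 1 (alignment) has already produced a database of aligned fragments. The English-French example database and the greedy longest-match strategy are hypothetical choices made for illustration; the adaptation step is reduced here to simple concatenation.

```python
# Aligned example fragments, as if extracted from a bilingual corpus (hypothetical data).
EXAMPLES = {
    ("he", "buys"): ("il", "achète"),
    ("he", "reads"): ("il", "lit"),
    ("a", "book"): ("un", "livre"),
    ("a", "newspaper"): ("un", "journal"),
    ("on", "politics"): ("sur", "la", "politique"),
}


def translate(sentence):
    """Cover the input with the longest matching SL fragments and join their TL sides."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Matching and selection: try the longest fragment starting at position i first.
        for j in range(len(words), i, -1):
            fragment = tuple(words[i:j])
            if fragment in EXAMPLES:
                # Extraction and combination: append the aligned TL fragment.
                output.extend(EXAMPLES[fragment])
                i = j
                break
        else:
            # No example covers this word: copy it through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)
```

For instance, `translate("He reads a book on politics")` returns `il lit un livre sur la politique`, a sentence that appears nowhere in the database but is assembled from three stored fragments.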
In SMT, the core process involves a translation model which takes as input SL words or word sequences (phrases) and produces as output TL words or word sequences. In EBMT, the core process is the selection and extraction of TL fragments corresponding to SL fragments. It is preceded by an analysis stage for the decomposition of input sentences into appropriate fragments (or templates with variables) and their matching against SL fragments (in a database).