The presentation was published 9 years ago by user Наталья Александровна Спицына
1 Creating parallel and comparable corpora for work in domain-specific areas of language
Belinda Maia, FLUP
2 Parallel corpora - definition
A parallel corpus is a collection of texts, each of which is translated into one or more other languages than the original. The simplest case is where two languages only are involved: one of the corpora is an exact translation of the other. The direction of the translation may not even be known.
3 Parallel corpora - uses
Parallel corpora are objects of interest at present because of the opportunity offered to align original and translation and gain insights into the nature of translation. From this work it is hoped that tools to aid translation will be devised. Probabilistic machine translation systems can moreover be trained on such corpora.
4 Comparable corpora - definition
A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.
5 Comparable corpora - uses
The possibilities of a comparable corpus are to compare different languages or varieties in similar circumstances of communication, but avoiding the inevitable distortion introduced by the translations of a parallel corpus.
6 Quotations from:
EAGLES (Expert Advisory Group on Language Engineering Standards) Guidelines, 1996
7 Parallel corpora - alignment & annotation
Most common form of alignment = at sentence level
E.g. text aligners:
–WordSmith: recognizes full stops only
–WinAlign (TRADOS): recognizes a certain amount of formatting, paragraphs, numbers, tagging
Ongoing research to align at:
–term/word level
–tag level
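Sentence-level alignment that relies on full stops only can be sketched in a few lines of Python. This is a minimal illustration, not how WordSmith or WinAlign actually work: the function names are hypothetical, and the pairing rule (i-th sentence with i-th sentence, refusing to align when the counts differ) is an assumption made for the example.

```python
import re


def split_on_full_stops(text):
    """Split text into sentences at full stops only (a deliberately naive rule)."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]


def align_one_to_one(source_text, target_text):
    """Pair the i-th source sentence with the i-th target sentence.

    Returns None when the sentence counts differ, because a 1:1 pairing
    would then be unreliable and would need manual correction.
    """
    src = split_on_full_stops(source_text)
    tgt = split_on_full_stops(target_text)
    if len(src) != len(tgt):
        return None
    return list(zip(src, tgt))


# Invented example sentences, for illustration only.
english = "The corpus is small. It contains legal texts."
portuguese = "O corpus é pequeno. Contém textos jurídicos."
pairs = align_one_to_one(english, portuguese)
for en, pt in pairs:
    print(en, "|", pt)
```

A rule this simple breaks on abbreviations, decimal numbers and any translation that merges or splits sentences, which is one reason aligners that also use formatting and paragraph cues exist.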
8 Parallel corpora - alignment & annotation problems
Different linguistic theories = different annotation schemes
–E.g. morphological, syntactic or semantic?
Different languages = different annotation schemes
–E.g. English / Portuguese / Polish / Finnish / Chinese
Different languages = different types of alignment
–E.g. English / Hebrew / Chinese
9 Parallel corpora - professional uses
Translation memories: aligned collections of repetitive texts in special domains
–Provide previous translations for the translator to consult / copy
–Allow economy in the translation process
–Provide material for probabilistic machine translation
–E.g. EU translation services, Canadian Hansard
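The idea of a translation memory offering a previous translation for the translator to consult can be sketched as a fuzzy lookup. This is a simplified illustration using Python's standard `difflib`. The function name `best_match`, the 0.7 threshold and the sample segments are assumptions made for the example, and commercial translation memory tools use their own matching methods.

```python
from difflib import SequenceMatcher


def best_match(memory, new_segment, threshold=0.7):
    """Return (source, translation, score) for the most similar stored segment,
    or None if no stored segment reaches the threshold."""
    best = None
    for source, translation in memory:
        # ratio() is a character-based similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, source.lower(), new_segment.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# A tiny invented memory of aligned English/Portuguese segments.
memory = [
    ("The meeting is adjourned.", "A sessão está encerrada."),
    ("The vote will take place tomorrow.", "A votação terá lugar amanhã."),
]

hit = best_match(memory, "The vote will take place today.")
if hit is not None:
    print("Suggested translation:", hit[1])
```

The translator still has to adapt the suggestion (here "amanhã" no longer fits), which is why the quality of the stored material matters.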
10 Translation memories – requirements
Garbage in = garbage out!
The original must be of good quality, hence:
–Emphasis on good editing and proofreading > controlled language
–E.g. EU documentation: training people to edit English documents written by non-native speakers
The translation must be of good quality, but it keeps a certain parallel relationship to the original
Therefore: tendency to homogeneity
–(e.g. Eurospeak)
11 Parallel corpora - academic uses
For studying the translation process
For studying translation solutions, e.g.
–INTERSECT: French/English (Brighton)
–English-Norwegian Parallel Corpus Project (Oslo)
–COMPARA/DISPARA: Portuguese/English, available online
For terminology extraction
12 Parallel corpora - requirements
Theory should allow for any original + translation - warts and all!
–Much literary criticism of translation thrives on the warts!
–Useful for the study of errors, translationese etc.
Practical applications require quality:
–Contrastive linguistics
–Pedagogical applications
–Terminology extraction
13 Comparable corpora – Perceived needs
Texts as:
–Examples of natural original text in the source language culture
–E.g. legal texts written according to local conventions
Socially conventional texts: e.g. the deaths column and advertisements for houses and jobs
Academic / scientific texts: different cultural conventions
14 Comparable corpora – Advantages
Availability
–More texts
–Greater variety
Versatility - applications for research in:
–Discourse analysis
–Pragmatics
–Information retrieval
–Knowledge engineering
15 What makes texts / corpora COMPARABLE?
16 EAGLES - quotes
"A comparable corpus is one which selects similar texts in more than one language or variety."
Similar - in more than one language
AND/OR
Similar - in variety
"...similar circumstances of communication..."
17 Similarity – Form/content?
Form
–Size, no. of words, sentences, paragraphs
–Length of texts
–Format: .txt, .doc, .html, .xml
Content
–General language
–Specialised domains
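Similarity of form can be checked mechanically before any linguistic analysis. The sketch below counts paragraphs, sentences and words with simple surface rules and compares two texts by size. The function names, the splitting rules and the sample texts are assumptions made for illustration, and real corpora would need language-aware tokenisation.

```python
import re


def form_profile(text):
    """Count paragraphs, sentences and words using simple surface rules."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences are assumed to end in ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"\w+", text)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }


def size_ratio(profile_a, profile_b, key="words"):
    """Ratio of the smaller count to the larger one (1.0 means equal size)."""
    smaller, larger = sorted([profile_a[key], profile_b[key]])
    return smaller / larger if larger else 1.0


# Invented sample texts.
text_en = "Corpora are useful. They support research.\n\nThey also support teaching."
text_pt = "Os corpora são úteis hoje."

profile_en = form_profile(text_en)
profile_pt = form_profile(text_pt)
print(profile_en, profile_pt, size_ratio(profile_en, profile_pt))
```

A ratio close to 1.0 only says that the texts are similar in size. It says nothing about content, which has to be judged separately.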
18 Similarity – Structure/Function?
Structure
–Formal, carefully constructed texts, e.g. legal texts
–Informal, loosely organized discourse, e.g. transcriptions of conversation
Function
–Social
–Cultural
19 Similarity – Register?
Register
–Field: situation, subject matter etc.
–Tenor: interpersonal relationships, e.g. formal/informal, politeness etc.
–Mode
  Spoken: e.g. speech, formal dialogue, conversation
  Written: e.g. book, essay, instruction manual
  Multimedia: e.g. Encarta, films
20 Similarity – Dialect?
Dialect
–Geographical, e.g. urban/rural areas, developed/developing countries
–Temporal, e.g. historical periods, different age groups
–Social, e.g. social classes, educational backgrounds
21 Comparability in Very Large Corpora
Very Large Corpora are comparable if:
–similar in size
–constructed according to the same criteria
–e.g. quantity and quality of text types
Consider:
–British National Corpus
–Mannheimer Corpora
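One way to make "constructed according to the same criteria" concrete is to compare the share of each text type in two corpora. The sketch below is an illustrative assumption, not a measure proposed in the presentation: the function names, the text-type labels and the word counts are all invented.

```python
from collections import Counter


def type_proportions(corpus):
    """Share of the total word count contributed by each text type.

    `corpus` is a list of (text_type, word_count) pairs.
    """
    totals = Counter()
    for text_type, n_words in corpus:
        totals[text_type] += n_words
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("corpus contains no words")
    return {t: n / grand_total for t, n in totals.items()}


def max_proportion_gap(corpus_a, corpus_b):
    """Largest difference in text-type share between two corpora (0.0 = identical mix)."""
    pa = type_proportions(corpus_a)
    pb = type_proportions(corpus_b)
    return max(abs(pa.get(t, 0.0) - pb.get(t, 0.0)) for t in set(pa) | set(pb))


# Invented compositions, for illustration only.
corpus_a = [("news", 600), ("fiction", 400)]
corpus_b = [("news", 500), ("fiction", 300), ("academic", 200)]

gap = max_proportion_gap(corpus_a, corpus_b)
print(f"largest gap in text-type share: {gap:.2f}")
```

Here the gap comes from the academic texts that only one corpus contains. A small gap would suggest a similar mix of text types, but it cannot capture the quality of the texts.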
22 Comparability in newspaper corpora
Newspaper corpora vary according to:
–Type: quality/popular, general/specialised content
–Time: same day/month/year > concurrent corpora
Consider:
–CETEMPúblico: Portuguese
–Reuters Corpus: English
23 Comparability in literary corpora
Period:
–Medieval, 18th century, post-war
School:
–Romanticism, Realism, Post-modernism
Genre:
–Novel, science fiction, drama, poetry
24 Comparability in technical and scientific corpora - form
Pamphlets
Manuals
Textbooks
Articles and papers
Dissertations, theses
25 Comparability in technical and scientific corpora - content
Everyday information
Encyclopedic information
Instructions
Education
Expert-to-expert communication
26 Constructing comparable corpora - general language
Where does one start?
Very large comparable corpora in 2 or more languages = mega-proposition!
Carefully selected annotated general corpora, like the ICAME corpora (Brown, LOB etc.) = a possibility + limitations
27 Using comparable corpora - general language
Advantages:
–Comparative and contrastive research at all levels
–Particularly useful for lexicographical research and the search for syntactic patterns
Disadvantages:
–Difficult to manage for more delicate analysis
–Unnecessary for certain types of research
28 Constructing comparable corpora – Newspaper texts
Newspaper corpora
–Relatively easy to acquire
–A wide variety of fields
–Similarity in tenor and mode
29 Using comparable corpora – Newspaper texts
Concurrent corpora > extraction of similar news items, e.g.
–War reports
–Politics: election campaigns
–Football during the World Cup
OR > styles of journalism > comparing individual journalists etc.
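Extracting similar news items from concurrent newspaper corpora can be sketched as a filter on date and topic keywords. The record layout (a `date` and a `text` field), the function name and the sample articles are assumptions made for illustration. A real study would need better topic detection than plain keyword matching.

```python
from datetime import date


def concurrent_items(articles, start, end, keywords):
    """Return articles dated within [start, end] whose text mentions any keyword."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for article in articles:
        in_window = start <= article["date"] <= end
        on_topic = any(k in article["text"].lower() for k in wanted)
        if in_window and on_topic:
            selected.append(article)
    return selected


# Invented articles, for illustration only.
english_news = [
    {"date": date(2002, 6, 3), "text": "World Cup: the opening match ended in a draw."},
    {"date": date(2002, 6, 4), "text": "Election campaign enters its final week."},
    {"date": date(2003, 1, 10), "text": "World Cup qualifiers announced."},
]
portuguese_news = [
    {"date": date(2002, 6, 3), "text": "Mundial: o jogo de abertura terminou empatado."},
    {"date": date(2002, 6, 5), "text": "Campanha eleitoral entra na última semana."},
]

start, end = date(2002, 6, 1), date(2002, 6, 30)
en_football = concurrent_items(english_news, start, end, ["world cup"])
pt_football = concurrent_items(portuguese_news, start, end, ["mundial"])
print(len(en_football), len(pt_football))
```

Each language needs its own keyword list, since the same event is named differently in each press.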
30 Constructing comparable corpora – general language + restricted text type
General subject texts of similar text type, e.g. encyclopedia entries, tourism pamphlets
Literary texts of similar period, school or genre
Technical and scientific texts with similar form or function, e.g. textbooks
31 Using comparable corpora – general language + restricted text type
Discourse analysis
Pragmatics
Genre analysis
Sociolinguistic analysis
32 Constructing comparable corpora – specialized language
Special domains at various levels, e.g.
–Geography > population geography > ethnic minorities
–Engineering > mechanical engineering > tribology
–Medicine > oncology > breast cancer
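The nesting of special domains can be represented as a path from the general field down to the narrow topic, so that a sub-corpus can be selected at any level. This is a minimal sketch with hypothetical names (`in_domain`, `select_subcorpus`) and invented text records. The domain labels are the ones given in the examples above.

```python
def in_domain(text_path, domain_prefix):
    """True if the text's domain path starts with the given prefix."""
    return tuple(text_path[:len(domain_prefix)]) == tuple(domain_prefix)


def select_subcorpus(texts, domain_prefix):
    """Keep the texts classified at or below the given domain level."""
    return [t for t in texts if in_domain(t["domain"], domain_prefix)]


# Invented text records, each labelled with a domain path.
texts = [
    {"id": "t1", "domain": ("Medicine", "oncology", "breast cancer")},
    {"id": "t2", "domain": ("Medicine", "oncology")},
    {"id": "t3", "domain": ("Engineering", "mechanical engineering", "tribology")},
]

oncology = select_subcorpus(texts, ("Medicine", "oncology"))
narrow = select_subcorpus(texts, ("Medicine", "oncology", "breast cancer"))
print([t["id"] for t in oncology], [t["id"] for t in narrow])
```

Using the same paths for the texts in each language would let sub-corpora be compared at the same level of specialisation.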
33 Using comparable corpora – specialized language
Genre analysis
Terminology extraction
Information retrieval
Web browsing technology
Knowledge engineering
34 All corpora construction
Must establish:
–Overall general policy in relation to:
  Form: computational structure
  Content of sub-corpora
  Availability to general / restricted public
–Specific objectives of sub-corpora
35 All corpora construction
Must take into account:
–Copyright restrictions
–Effect of external factors on the text
  Idiosyncrasies of the individual author
  Characteristics of writing in a specific cultural / social situation
  Homogenising effect of internationalisation
    –Eurospeak
    –Anglicisation of scientific terminology
36 Linguateca - Porto
More immediate objectives
To construct comparable and parallel corpora in Portuguese and English using:
–Texts in special domains already being investigated
–Corpora added from special domains as and when the opportunity arises
To construct the necessary computational framework for using the corpora for research
To make these corpora as widely available as the respective copyright situation permits
37 Linguateca - Porto
Longer-term objectives
To extend the notion of comparability to:
–genre-specific corpora
–restricted general language corpora
To construct integrated networks of comparable corpora
To extend these objectives to other languages
To contribute to similar projects elsewhere