Creating parallel and comparable corpora for work in domain-specific areas of language. Belinda Maia, FLUP
Parallel corpora - definition A parallel corpus is a collection of texts, each of which is translated into one or more other languages than the original. The simplest case is where two languages only are involved: one of the corpora is an exact translation of the other. The direction of the translation may not even be known.
Parallel corpora - uses Parallel corpora are objects of interest at present because of the opportunity offered to align original and translation and gain insights into the nature of translation. From this work it is hoped that tools to aid translation will be devised. Probabilistic machine translation systems can moreover be trained on such corpora.
Comparable corpora - definition A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.
Comparable corpora - uses The possibilities of a comparable corpus are to compare different languages or varieties in similar circumstances of communication, but avoiding the inevitable distortion introduced by the translations of a parallel corpus.
Quotations from: EAGLES - Expert Advisory Group on Language Engineering Standards Guidelines – 1996 – at: http://www.ilc.pi.cnr.it/EAGLES96/browse.html
Parallel corpora - alignment & annotation Most common form of alignment = at sentence level E.g. Text aligners: –WORDSMITH – recognizes full stops only –WinAlign – TRADOS – recognizes a certain amount of formatting, paragraphs, numbers, tagging Ongoing research to align at: –term/word level –tag level
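A minimal sketch of sentence-level alignment, assuming the simplest possible rule mentioned above: sentences end at full stops only. It is not how WORDSMITH or WinAlign work internally; the function names and the sample sentences are invented for illustration.

```python
import re

def split_on_full_stops(text):
    """Split text into sentences at full stops only (a deliberately naive rule)."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]

def align_one_to_one(source_text, target_text):
    """Pair the i-th source sentence with the i-th target sentence.

    Returns None when the sentence counts differ, because a strict
    1:1 pairing would then be unreliable.
    """
    src = split_on_full_stops(source_text)
    tgt = split_on_full_stops(target_text)
    if len(src) != len(tgt):
        return None
    return list(zip(src, tgt))

# Invented example texts (hypothetical data).
english = "The committee met on Monday. It approved the budget."
portuguese = "A comissão reuniu-se na segunda-feira. Aprovou o orçamento."
pairs = align_one_to_one(english, portuguese)
for en, pt in pairs:
    print(en, "|", pt)
```

The full-stop rule breaks on abbreviations ("Dr. Smith" becomes two sentences), which is one reason aligners that also use formatting, paragraphs and numbers tend to be more robust.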
Parallel corpora - alignment & annotation problems Different linguistic theories = different annotation schemes –E.g. Morphological, syntactic or semantic? Different languages = different annotation schemes –E.g. English / Portuguese / Polish / Finnish / Chinese Different languages = different types of alignment –E.g. English / Hebrew / Chinese
Parallel corpora - professional uses Translation memories – aligned collections of repetitive texts in special domains –Provide previous translations for translator to consult / copy –Allow economy in translation process –Provide material for probabilistic machine translation –E.g. EU translation services, Canadian Hansard
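How a translation memory might offer a previous translation for the translator to consult can be sketched as a fuzzy lookup over aligned segment pairs. This is an illustrative sketch only: the similarity measure (Python's standard `difflib.SequenceMatcher`), the 0.7 threshold, the function name and the sample segments are all assumptions, not a description of any commercial tool.

```python
from difflib import SequenceMatcher

def best_match(memory, new_segment, threshold=0.7):
    """Return (source, translation, score) for the closest stored segment,
    or None if no stored segment reaches the threshold."""
    best = None
    for source, translation in memory:
        score = SequenceMatcher(None, source.lower(), new_segment.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None

# Invented aligned segments (hypothetical data).
memory = [
    ("The Council adopted the regulation.", "O Conselho adoptou o regulamento."),
    ("The meeting is adjourned.", "A sessão está encerrada."),
]

hit = best_match(memory, "The Council adopted the directive.")
miss = best_match(memory, "Football results will follow.")
print(hit)
print(miss)
```

The closer and more repetitive the new texts are to what is already stored, the more often such a lookup finds something usable, which is why translation memories suit repetitive texts in special domains.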
Translation memories – requirements Garbage in = garbage out! Original > good quality – hence –Emphasis on: good editing and proofreading > controlled language –E.g. EU documentation – training people to edit English documents written by non-native speakers Translation > good quality – but with a certain parallel relationship to the original Therefore: tendency to homogeneity –(e.g. Eurospeak)
Parallel corpora - academic uses For studying the translation process For studying translation solutions E.g. –INTERSECT – French/English (Brighton) –English-Norwegian Parallel Corpus Project (Oslo) –COMPARA/DISPARA – Portuguese/English – online at For terminology extraction
Parallel corpora - requirements Theory should allow for any original + translation - warts and all! –Much literary criticism of translation thrives on the warts! –Useful for study of errors, translationese etc Practical applications require quality: –Contrastive linguistics –Pedagogical applications –Terminology extraction
Comparable corpora – Perceived needs Texts as: –Examples of natural original text in the source language culture –E.g. Legal texts written according to local conventions Socially conventional texts: e.g. the deaths column and advertisements for houses and jobs. Academic / scientific texts – different cultural conventions
EAGLES - quotes A comparable corpus is one which selects similar texts in more than one language or variety. Similar - in more than one language AND/OR Similar - in variety...similar circumstances of communication..
Similarity – Form/content? Form –Size, no. of words, sentences, paragraphs –Length of texts –Format – .txt, .doc, .html, .xml Content –General language –Specialised domains
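The formal measures listed above (number of words, sentences, paragraphs) can be computed with very simple surface rules. The sketch below is only one rough way of doing it: the function name and the splitting rules are assumptions, and real corpus tools use more careful tokenisation.

```python
import re

def form_profile(text):
    """Count paragraphs, sentences and words using simple surface rules."""
    # Paragraphs: blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: stretches of text between ., ! or ? marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: runs of letters, digits or underscores.
    words = re.findall(r"\w+", text)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }

# Invented sample text (hypothetical data).
sample = "Corpora vary in size. Some are small.\n\nOthers are very large!"
profile = form_profile(sample)
print(profile)
```

Running the same function over candidate texts in two languages gives a quick first check of whether they are similar in form before looking at content.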
Similarity- Structure/Function? Structure –Formal, carefully constructed texts – e.g. Legal texts –Informal, loosely organized discourse – e.g. transcriptions of conversation Function –Social –Cultural
Similarity- Register? Register –Field – situation, subject matter etc –Tenor – interpersonal relationships e.g. formal/informal, politeness, etc –Mode Spoken: e.g. speech, formal dialogue, conversation Written: e.g. book, essay, instruction manual Multimedia: e.g. Encarta, films
Similarity - Dialect? Dialect –Geographical e.g. urban/rural areas, developed/developing countries –Temporal e.g. historical periods, different age groups –Social e.g. social classes, educational backgrounds
Comparability in Very Large Corpora Very Large Corpora comparable if: –similar in size –constructed according to same criteria –e.g. quantity and quality of text types Consider: –British National Corpus –Mannheimer Corpora
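One possible way to put a number on "constructed according to same criteria" is to compare the proportion of each text type in two corpora. The sketch below uses half the sum of absolute differences between proportions (the total variation distance); this choice of measure, the function names and all figures are assumptions for illustration, not taken from the BNC, the Mannheim corpora or any other real corpus. It measures composition only, so similarity in overall size has to be checked separately.

```python
def type_proportions(word_counts):
    """Convert word counts per text type into proportions of the whole corpus."""
    total = sum(word_counts.values())
    return {text_type: count / total for text_type, count in word_counts.items()}

def composition_distance(corpus_a, corpus_b):
    """Half the sum of absolute differences between text-type proportions.

    0.0 means identical composition; 1.0 means no text type in common.
    """
    props_a = type_proportions(corpus_a)
    props_b = type_proportions(corpus_b)
    all_types = set(props_a) | set(props_b)
    return 0.5 * sum(
        abs(props_a.get(t, 0.0) - props_b.get(t, 0.0)) for t in all_types
    )

# Invented word counts per text type (hypothetical data).
corpus_en = {"newspaper": 500_000, "fiction": 300_000, "academic": 200_000}
corpus_pt = {"newspaper": 250_000, "fiction": 150_000, "academic": 100_000}
corpus_other = {"newspaper": 900_000, "academic": 100_000}

print(composition_distance(corpus_en, corpus_pt))
print(composition_distance(corpus_en, corpus_other))
```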
Comparability in newspaper corpora Newspaper corpora vary according to: –Type: quality/popular, general/specialised content –Time: same day/month/year > concurrent corpora Consider: –CETEMPúblico - Portuguese –Reuters Corpus - English
Comparability in technical and scientific corpora - form Pamphlets Manuals Textbooks Articles and papers Dissertations, theses
Comparability in technical and scientific corpora - content Everyday information Encyclopedic information Instructions Education Expert-to-expert communication
Constructing comparable corpora - general language Where does one start? Very large comparable corpora in 2 or more languages = mega-proposition! Carefully selected annotated general corpora – like ICAME corpora (Brown, LOB etc) = a possibility + limitations
Using comparable corpora - general language Advantages: –Comparative and contrastive research at all levels –Particularly useful for lexicographical research and search for syntactic patterns Disadvantages: –Difficult to manage for more delicate analysis –Unnecessary for certain types of research
Constructing comparable corpora – Newspaper texts Newspaper corpora –Relatively easy to acquire –A wide variety of fields –Similarity in tenor and mode
Using comparable corpora – Newspaper texts Concurrent corpora > extraction of similar news items > e.g. –War reports –Politics – election campaigns –Football during the World Cup OR > styles of journalism > comparing individual journalists etc.
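Extracting similar news items from concurrent corpora can be sketched as pairing articles that share a publication date and mention the same topic. This is a rough illustration under stated assumptions: articles are plain (date, text) tuples, topic matching is a simple keyword search, and the function name, keywords and sample articles are all invented.

```python
from collections import defaultdict
from datetime import date

def pair_concurrent_items(articles_a, articles_b, keywords_a, keywords_b):
    """Pair articles published on the same day that mention the topic.

    Each article is a (publication_date, text) tuple. Keywords are topic
    words in each language, matched case-insensitively as substrings.
    """
    def on_topic(text, keywords):
        lowered = text.lower()
        return any(k.lower() in lowered for k in keywords)

    # Index the on-topic articles of the second corpus by publication day.
    by_day = defaultdict(list)
    for day, text in articles_b:
        if on_topic(text, keywords_b):
            by_day[day].append(text)

    pairs = []
    for day, text in articles_a:
        if on_topic(text, keywords_a):
            for other in by_day.get(day, []):
                pairs.append((day, text, other))
    return pairs

# Invented articles (hypothetical data).
english_news = [
    (date(2002, 6, 1), "The World Cup opens today."),
    (date(2002, 6, 1), "Markets closed higher."),
    (date(2002, 6, 2), "World Cup: first results."),
]
portuguese_news = [
    (date(2002, 6, 1), "O Mundial começa hoje."),
    (date(2002, 6, 3), "Mundial: primeiros resultados."),
]

pairs = pair_concurrent_items(english_news, portuguese_news, ["World Cup"], ["Mundial"])
for day, en, pt in pairs:
    print(day, "|", en, "|", pt)
```

Exact same-day matching is strict; a window of a few days around each date would catch stories reported with a delay, at the cost of more false pairs.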
Constructing comparable corpora – general language + restricted text type General subject texts of similar text type – e.g. Encyclopedia entries, tourism pamphlets Literary texts of similar period, school or genre Technical and scientific texts with similar form or function e.g. textbooks
Using comparable corpora – general language + restricted text type Discourse analysis Pragmatics Genre analysis Sociolinguistic analysis
Constructing comparable corpora – specialized language Special domains at various levels – e.g. –Geography > population geography > ethnic minorities –Engineering > mechanical engineering > tribology –Medicine > oncology > breast cancer
Using comparable corpora – specialized language Genre analysis Terminology extraction Information retrieval Web browsing technology Knowledge engineering
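Terminology extraction from a specialized corpus is often approached by asking which words are unusually frequent in the domain texts compared with general language. The sketch below ranks single words by a smoothed frequency ratio. It is only one simple heuristic among many; the function names, the add-one smoothing, the `min_count` cut-off and the sample texts are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and return runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def candidate_terms(domain_text, reference_text, min_count=2):
    """Rank words by how much more frequent they are in the domain text.

    Score = relative frequency in the domain text divided by the smoothed
    relative frequency in the reference text (add-one smoothing, so words
    unseen in the reference text do not cause division by zero).
    """
    domain = Counter(tokenize(domain_text))
    reference = Counter(tokenize(reference_text))
    domain_total = sum(domain.values())
    reference_total = sum(reference.values())
    vocabulary_size = len(set(domain) | set(reference))

    scores = {}
    for word, count in domain.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        domain_rate = count / domain_total
        reference_rate = (reference[word] + 1) / (reference_total + vocabulary_size)
        scores[word] = domain_rate / reference_rate
    return sorted(scores, key=scores.get, reverse=True)

# Invented texts (hypothetical data): a tribology-like snippet and a general one.
domain_text = (
    "Friction and wear depend on lubrication. "
    "Lubrication reduces friction and wear between the surfaces."
)
reference_text = (
    "The weather and the news depend on the day. "
    "The day and the night are long."
)

terms = candidate_terms(domain_text, reference_text)
print(terms)
```

In this toy example the domain words rank above the function word "and", which is frequent in both texts. Real terminology work also has to handle multi-word terms, which this single-word sketch does not attempt.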
All corpora construction Must establish: –Overall general policy in relation to: Form – computational structure Content of sub-corpora Availability to general / restricted public –Specific objectives of sub-corpora
All corpora construction Must take into account: –Copyright restrictions –Effect of external factors on the text Idiosyncrasies of individual author Characteristics of writing in specific cultural/social situation Homogenising effect of internationalisation –Eurospeak –Anglicisation of scientific terminology
Linguateca - Porto More immediate objectives To construct comparable and parallel corpora in Portuguese and English using: –Texts in special domains already being investigated –Adding corpora from special domains as and when the opportunity arises To construct the necessary computational framework for using the corpora for research To make these corpora as widely available as the respective copyright situation permits
Linguateca - Porto Longer-term objectives To extend the notion of comparability to: –genre-specific corpora –restricted general language corpora To construct integrated networks of comparable corpora To extend these objectives to other languages To contribute to similar projects elsewhere