This presentation was published 10 years ago by user Евгений Иринархов.
1 Types and Tokens Distribution in TITUS (Distribution of word forms in the TITUS corpus). Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main
2 Tokens and Types Distribution in TITUS (Corpus linguistics). Outline:
* TITUS Resource Data
* Peculiarities of TITUS texts
* Tokens and Types calculation in TITUS Resources
* Metadata for Tokens and Types distribution
3 TITUS Resource Data. TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) currently includes 660 texts in 55 languages, with more than 30 million tokens. A token is one concrete occurrence of a linguistic unit; a type bundles the tokens that belong together, for example all occurrences of the same word form.
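The token/type distinction can be illustrated with a minimal Python sketch. It is not the TITUS procedure: the tokenizer (runs of letters, lowercased), the function names, and the sample sentence are assumptions made only for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of Unicode letters, compared in lower case.
LETTERS = re.compile(r"[^\W\d_]+")

def tokenize(text: str) -> list[str]:
    """Split text into lower-cased letter runs (hypothetical tokenizer)."""
    return [match.group(0).lower() for match in LETTERS.finditer(text)]

def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for the given text."""
    tokens = tokenize(text)
    types = Counter(tokens)  # one key per distinct word form
    return len(tokens), len(types)

# Made-up sample, not taken from any TITUS text.
tokens, types = count_tokens_and_types("the cat and the dog and the bird")
print(tokens, types)
```

Each distinct word form counts once as a type, however many tokens realize it.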
4 TITUS Data. Added by J. Gippert, R. Mittmann
5 TITUS Search Engine. The TITUS search engine does not report the number of tokens in a given text, but the number of attestations (quoted passages) of a word.
6 Peculiarities of TITUS texts: Gothic. Biblia Gothica contains additional parallel passages in Latin and Greek. Source: Biblia Gothica.
7 Peculiarities of TITUS texts: Old Church Slavonic. Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in Cyrillic. Source: Codex Marianus.
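When the same text is present in two scripts, a count presumably has to keep the two renderings apart so that words are not counted twice. The following Python sketch shows one possible way to classify a word by script using Unicode character names; the function name and the labels it returns are assumptions, and nothing here is taken from the TITUS tooling.

```python
import unicodedata

def script_of(word: str) -> str:
    """Classify a word as 'Glagolitic', 'Cyrillic', 'other', or 'mixed'.

    Only alphabetic characters are inspected; the decision relies on the
    official Unicode character names (e.g. 'GLAGOLITIC CAPITAL LETTER AZU').
    """
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("GLAGOLITIC"):
            scripts.add("Glagolitic")
        elif name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    return scripts.pop() if len(scripts) == 1 else "mixed"

print(script_of("\u2c00\u2c01"))  # two Glagolitic capital letters
print(script_of("слово"))         # a Cyrillic word
```

Words could then be grouped by the returned label before tokens and types are counted for each script separately.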
8 Peculiarities of TITUS texts: Old Polish. Old Polish texts are displayed with several editions from different periods side by side. Source: Kazania Świętokrzyskie ( kazania/kazan.htm).
9 Peculiarities of TITUS texts: Ossetian. The Ossetian Nart epic is represented in Latin script and in extended Cyrillic. Source: Ossetian: Nart epic ( nart/nart.htm).
10 Peculiarities of TITUS texts: Russian-Low German. Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.
11 Peculiarities of TITUS texts: Old Prussian. The Old Prussian corpus contains at least 21 different languages or language variants (among them Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).
12 Creation. A digitized source consists not only of words in the source language; it also contains information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc. Perl substitutions such as the following delete such material:
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi; #
$zeile =~ s/\d*\s+ //g; #
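The cleanup described on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the Perl lines: it assumes that tags have the form `<...>`, that all digits are editorial numbering, and that every character in a Unicode punctuation category can be dropped. The function name and the sample line are hypothetical.

```python
import re
import unicodedata

TAG = re.compile(r"<[^>]*>")   # assumption: markup looks like <...>
NUMBER = re.compile(r"\d+")    # assumption: digits are numbering, not text

def clean_line(line: str) -> str:
    """Remove tags, numbers, and punctuation; collapse whitespace."""
    line = TAG.sub(" ", line)
    line = NUMBER.sub(" ", line)
    # Replace characters whose Unicode category starts with 'P' (punctuation).
    line = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in line
    )
    return " ".join(line.split())

# Made-up input line, not taken from a TITUS file.
print(clean_line('12 <pb n="3"> In principio, erat verbum.'))
```

Filtering by Unicode category rather than by an ASCII character class leaves letters with diacritics and combining marks untouched, which matters for historical texts.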
13 Examples: Gothic. Gothic Bible, Old Testament fragments. Total: 1629 tokens and 893 types. Table: tokens and types for Gothic, Latin, and Greek.
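A per-language breakdown of this kind can be sketched in Python, assuming the text is already available as segments labeled with their language. The segment format, the function name, and the sample words are assumptions for illustration only; they are not TITUS data.

```python
from collections import defaultdict

def counts_by_language(segments):
    """Return {language: (tokens, types)} for (language, text) pairs.

    Assumption: segments are already cleaned, so splitting on whitespace
    yields the tokens; word forms are compared in lower case.
    """
    token_counts = defaultdict(int)
    type_sets = defaultdict(set)
    for language, text in segments:
        for word in text.lower().split():
            token_counts[language] += 1
            type_sets[language].add(word)
    return {
        language: (token_counts[language], len(type_sets[language]))
        for language in token_counts
    }

# Illustrative segments only.
sample = [
    ("Gothic", "jah qaþ jah"),
    ("Latin", "et dixit"),
    ("Gothic", "qaþ is"),
]
print(counts_by_language(sample))
```

Keeping one set of word forms per language means a form that occurs in two languages is counted as a type in each of them.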
14 Examples: Gothic. Gothic Bible, New Testament books. Table: total tokens and types, and tokens and types for Gothic, Latin, and Greek.
15 Examples: Tönnies Fenne's Manual (17th century). The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.
16 Examples: further application
17 Metadata
* DC – Dublin Core
* TEI – Text Encoding Initiative
* CEI – Corpus Encoding Initiative
* IMDI – ISLE Meta Data Initiative
* OLAC – Open Language Archives Community
* CMDI – Component MetaData Infrastructure
18 CMDI - Component MetaData Infrastructure
19 TITUS Metadata: HTML Format. TITUS Texts: Biblia gothica: Frame
20 New Metadata Set for TITUS
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContact: existing
* ProjectContactOrganisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new
21 Metadata Example for TITUS – XML CMDI
General: Tokens | 893 Types
Language 1_General: 10 Tokens | 9 Types
Language 2_Gothic: 420 Tokens | 240 Types
Language 4_Latin: 572 Tokens | 325 Types
Language 5_Greek: 627 Tokens | 319 Types
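The per-language figures on this slide add up to 1629 tokens and 893 types, the totals given on slide 13 for the Old Testament fragments. The Python sketch below writes such counts into a small XML fragment. The element and attribute names are loosely modeled on the field names of slide 20 and are assumptions; they do not reproduce an actual CMDI profile.

```python
import xml.etree.ElementTree as ET

# Per-language (tokens, types) as shown on the slide.
per_language = {
    "1_General": (10, 9),
    "2_Gothic": (420, 240),
    "4_Latin": (572, 325),
    "5_Greek": (627, 319),
}

def build_wordcount(counts):
    """Build a hypothetical <Wordcount> element from per-language counts."""
    wordcount = ET.Element("Wordcount")
    general = ET.SubElement(wordcount, "General")
    # The general figures are the sums of the per-language figures.
    total_tokens = sum(tokens for tokens, _ in counts.values())
    total_types = sum(types for _, types in counts.values())
    ET.SubElement(general, "Tokens").text = str(total_tokens)
    ET.SubElement(general, "Types").text = str(total_types)
    for name, (tokens, types) in counts.items():
        language = ET.SubElement(wordcount, "Language", {"name": name})
        ET.SubElement(language, "Tokens").text = str(tokens)
        ET.SubElement(language, "Types").text = str(types)
    return wordcount

element = build_wordcount(per_language)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Summing the per-language types to obtain the general type count matches the numbers on the slide; it treats a word form shared by two languages as two types.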
22-24 Metadata for TITUS – Browser
25 Thank you for your attention! Links: ARBIL (metadata editor), CLARIN, CMDI, Dublin Core, IMDI, OLAC, TEI, TITUS
26 Old Prussian Corpus. Tokens General: tokens. Types General: 8390 types.