

Types and Tokens Distribution in TITUS (Распределение словоформ в корпусе TITUS: Distribution of Word Forms in the TITUS Corpus). Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main

Outline: TITUS resource data; peculiarities of TITUS texts; tokens and types calculation in TITUS resources; metadata for tokens and types distribution. Slide footer: Corpus Linguistics (Корпусная лингвистика).

TITUS Resource Data. TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) currently includes 660 texts in 55 languages, with more than 30 million tokens. A token is a concrete occurrence of a linguistic unit; a type bundles the tokens that belong together.
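The token/type distinction can be illustrated with a minimal Python sketch. It assumes that a token is any whitespace-delimited, lowercased word form; the function name and the sample string are illustrative and are not taken from TITUS.

```python
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text."""
    # Assumption: tokens are whitespace-delimited, case-insensitive word forms.
    tokens = text.lower().split()
    # Each distinct word form is one type; the Counter also keeps frequencies.
    types = Counter(tokens)
    return len(tokens), len(types)

sample = "jah qaþ jah gaggiþ jah qaþ"
print(count_tokens_and_types(sample))  # (6, 3)
```

Here six running word forms (tokens) reduce to three distinct forms (types).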

TITUS Data. Added by J. Gippert, R. Mittmann.

TITUS Search Engine. The TITUS search engine does not determine the number of tokens in a given text, but the number of quotations of the word.

Peculiarities of TITUS texts: Gothic. The Biblia Gothica contains additional parallel passages in Latin and Greek. Source: Biblia Gothica.

Peculiarities of TITUS texts: Old Church Slavonic. Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in Cyrillic. Source: Codex Marianus.
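When the same text is stored in two scripts, counting everything together would roughly double the token figure. One way to keep the scripts apart, shown here as a Python sketch and not as the procedure used for TITUS, is to classify each token by the Unicode name of its first letter. The function names and sample tokens are hypothetical.

```python
import unicodedata
from collections import defaultdict

def script_of(token):
    """Guess the script from the Unicode name of the first letter."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            # Character names begin with the script, e.g. "GLAGOLITIC SMALL LETTER AZU".
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"

def counts_by_script(tokens):
    """Map each script to a (tokens, types) pair."""
    buckets = defaultdict(list)
    for tok in tokens:
        buckets[script_of(tok)].append(tok)
    return {s: (len(t), len(set(t))) for s, t in buckets.items()}

# One Glagolitic token followed by the same Cyrillic token twice (illustrative data).
tokens = ["\u2c30\u2c38\u2c4f", "\u0430\u0437\u044a", "\u0430\u0437\u044a"]
print(counts_by_script(tokens))
```

This heuristic only separates scripts. It cannot distinguish languages that share a script, such as Latin and a Latin-script transliteration.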

Peculiarities of TITUS texts: Old Polish. Old Polish texts display, side by side, editions that arose at different times. Source: Kazania Świętokrzyskie ( kazania/kazan.htm).

Peculiarities of TITUS texts: Ossetian. The Ossetian Nart epic is represented in Latin script and in extended Cyrillic. Source: Ossetian: Nart epic ( nart/nart.htm).

Peculiarities of TITUS texts: Russian-Low German. Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variations.

Peculiarities of TITUS texts: Old Prussian. The Old Prussian corpus consists of at least 21 different languages or language variants (among them Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).

Creation. A digitized source consists not only of words of the source language; it also contains information that does not originally belong to the document: numbers, tags, punctuation marks, edition information, etc. Perl substitutions of the following kind strip such material from each line ($zeile):
    $zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi; #
    $zeile =~ s/\d*\s+ //g; #
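The cleaning step can be sketched as follows. This is a Python illustration of the general idea, not a port of the Perl substitutions on the slide, whose exact patterns are not fully preserved. The patterns, the function name, and the sample line are assumptions.

```python
import re

TAG = re.compile(r"<[^>]*>")     # markup tags such as <...>
NUMBER = re.compile(r"\d+")      # verse, line, or page numbers
PUNCT = re.compile(r"[^\w\s]")   # punctuation marks and other symbols

def clean_line(line):
    """Remove tags, numbers and punctuation, then normalize whitespace."""
    line = TAG.sub(" ", line)
    line = NUMBER.sub(" ", line)
    # Caveat: this also removes combining diacritics if the text is in
    # decomposed Unicode form; normalize to NFC first if that matters.
    line = PUNCT.sub(" ", line)
    return " ".join(line.split())

print(clean_line("12 <pb> jah qaþ, imma: gagg!"))  # jah qaþ imma gagg
```

Only the cleaned line would then be passed on to token and type counting.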

Examples: Gothic. Gothic Bible, Old Testament fragments. Total: 1629 tokens and 893 types. [Table: tokens and types for Gothic, Latin and Greek; values not preserved in the transcript.]
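A per-language breakdown of this kind requires that every word form carry a language label, since parallel passages in several languages sit in the same file. The sketch below assumes the input is already a list of (language, word form) pairs. How TITUS marks language internally is not stated on the slide, and the sample data is invented for illustration.

```python
from collections import defaultdict

def counts_by_language(labelled_tokens):
    """Map each language label to a (tokens, types) pair."""
    token_counts = defaultdict(int)
    type_sets = defaultdict(set)
    for lang, form in labelled_tokens:
        token_counts[lang] += 1      # every occurrence is a token
        type_sets[lang].add(form)    # distinct forms are types
    return {lang: (token_counts[lang], len(type_sets[lang])) for lang in token_counts}

sample = [
    ("Gothic", "jah"), ("Gothic", "qaþ"), ("Gothic", "jah"),
    ("Latin", "et"), ("Latin", "dixit"),
    ("Greek", "καὶ"),
]
print(counts_by_language(sample))
# {'Gothic': (3, 2), 'Latin': (2, 2), 'Greek': (1, 1)}
```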

Examples: Gothic. Gothic Bible, New Testament books. Total: tokens and types. [Table: tokens and types for Gothic, Latin and Greek; values not preserved in the transcript.]

Examples: Tönnies Fenne's Manual (17th century). The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.

Examples: further application.

Metadata standards:
* DC: Dublin Core
* TEI: Text Encoding Initiative
* CEI: Corpus Encoding Initiative
* IMDI: ISLE Meta Data Initiative
* OLAC: Open Language Archives Community
* CMDI: Component MetaData Infrastructure

CMDI: Component MetaData Infrastructure.

TITUS Metadata: HTML format. TITUS Texts: Biblia gothica: Frame.

New Metadata Set for TITUS (each field is marked as existing or new):
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContact: existing
* ProjectContactOranisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new

Metadata Example for TITUS (XML, CMDI). General: [token count not preserved] tokens | 893 types. By language:
    Language     Tokens   Types
    1_General        10       9
    2_Gothic        420     240
    4_Latin         572     325
    5_Greek         627     319
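The per-language figures on this slide add up to 1629 tokens and 893 types, which matches the totals reported for the Gothic Old Testament fragments. The Python sketch below serializes such counts as a small XML fragment. The element and attribute names (Wordcount, General, Language, label) are hypothetical, loosely modeled on the Resource.Wordcount field names, and do not reproduce the actual CMDI profile used for TITUS.

```python
import xml.etree.ElementTree as ET

# (tokens, types) per language label, as given on the slide.
COUNTS = {
    "1_General": (10, 9),
    "2_Gothic": (420, 240),
    "4_Latin": (572, 325),
    "5_Greek": (627, 319),
}

def build_wordcount(counts):
    """Build a hypothetical <Wordcount> element with totals and per-language counts."""
    root = ET.Element("Wordcount")
    general = ET.SubElement(root, "General")
    # Assumption: no word form is shared between languages, so types can be summed.
    ET.SubElement(general, "Tokens").text = str(sum(t for t, _ in counts.values()))
    ET.SubElement(general, "Types").text = str(sum(ty for _, ty in counts.values()))
    for label, (tokens, types) in counts.items():
        lang = ET.SubElement(root, "Language", {"label": label})
        ET.SubElement(lang, "Tokens").text = str(tokens)
        ET.SubElement(lang, "Types").text = str(types)
    return root

root = build_wordcount(COUNTS)
print(ET.tostring(root, encoding="unicode"))
```

Summing types is only valid under the stated assumption. If a form occurs in two languages, the general type count would have to be computed from the full text instead.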

Metadata for TITUS: browser view.


Thank you for your attention! Links: ARBIL (metadata editor), CLARIN, CMDI, Dublin Core, IMDI, OLAC, TEI, TITUS.

Old Prussian Corpus. Tokens, general: tokens [count not preserved in the transcript]. Types, general: 8390 types.