Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do Alexei Lavrentiev Alexei.Lavrentev@ens-lsh.fr Ecole Normale.

Transcript:

Processing Textual Sources for Linguistic and Literary Research: What a 'Solitary Scholar' Can Do Alexei Lavrentiev Ecole Normale Supérieure Lettres et Sciences humaines, Lyon, France University of Kentucky, October

Two projects: a scholarly re-edition of an 1861 anonymous folklore collection; a corpus of Medieval French manuscript transcriptions for the study of punctuation

Folklore Project 1/14

Project Team: Vera Kuznetsova (Senior Researcher, Institute of Philology SB RAS; specialist in Russian folklore); Olga Laguta (Professor, Novosibirsk State University; linguist); Alexei Lavrentiev. Folklore Project 2/14

Objectives: verify the authenticity of the folklore texts in the collection; analyze linguistic features of the texts; learn more about the author of the collection; make these texts available to the scholarly community. Folklore Project 3/14

Challenges: encode data in a sustainable format (TEI XML) using available tools (Microsoft Office: Word, Access; XML processing software: XML Spy; Perl); configure the tools for users with virtually no IT experience. Folklore Project 4/14

Workflow (diagram labels): Word documents; Perl script; tokenized XML-TEI documents; XSL stylesheets; Access database; printed edition; lemmatized XML-TEI documents; vocabulary with contexts; linguistic analysis; metadata. Folklore Project 5/14

Word document Folklore Project 6/14

Metadata file [1. File name] chtochelovekzakhochet ; [number] 20 ; [2. Title of the text (in the source)] Что человек захочет, то и сделает (What a man wants, he will do) ; [3. Title of the text (working)] Что человек захочет (What a man wants) ; [4. Collective editor of the electronic version] Russian Language in Siberia Sector, Institute of Philology SB RAS ; [5. Responsible persons] : [role] Text entry and preliminary markup ; [full name] Kuznetsova Vera Stanislavovna, Aleshina Olga Nikolaevna ; [role] Conversion to XML-TEI format, validation ; [full name] Lavrentiev Alexei Mikhailovich. [6. Project information] : Corpus of Russian folklore prose texts (legends) ; [7. Source information] : [Information about editor(s), compiler(s), etc.] : [role] preparation for publication ; [full name] Kuznetsova Vera Stanislavovna ; [role] compiler of the collection ; [full name] anonymous ; [role] author of the recording ; [full name] not indicated. [Place of recording] not indicated ; [Publisher] F. Ivanov printing house ; [Place of publication] Saint Petersburg ; [Year of publication] 1861 ; [ISBN] ????. Folklore Project 7/14
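The bracketed "[label] value ;" layout of the metadata file lends itself to simple pattern matching. The following is a minimal Python sketch of how such a file might be read into label/value pairs. It is not the project's Perl code; the function name `parse_metadata`, the abbreviated sample string, and its English labels are assumptions made for illustration.

```python
import re

# One "[label] value" pair; the value runs up to the next "[" or the end of the text.
FIELD = re.compile(r"\[([^\]]+)\]\s*([^\[]*)")


def parse_metadata(text):
    """Return (label, value) pairs in document order.

    Group headings such as "[5. ...] :" come back with an empty value.
    """
    pairs = []
    for label, value in FIELD.findall(text):
        # Drop the separators (";", ":", ".") and whitespace that close a field.
        # Assumption: values do not themselves end in meaningful punctuation.
        pairs.append((label.strip(), value.strip(" ;:.\n\t")))
    return pairs


# Abbreviated, hypothetical sample in the same layout as the slide.
sample = (
    "[1. File name] chtochelovekzakhochet ; "
    "[Place of publication] Saint Petersburg ; "
    "[Year of publication] 1861 ; [ISBN] ????."
)

for label, value in parse_metadata(sample):
    print(f"{label}: {value}")
```

Keeping the pairs as an ordered list rather than a dictionary preserves repeated labels, which the file uses for each role and name.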

Perl script: takes a Word document saved in HTML (filtered) format; takes the metadata; produces an XML-TEI document (tokenizes and gives ID to and; transforms analytical markup into elements). Folklore Project 8/14
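The tokenization step can be illustrated with a short Python sketch (the project itself used Perl). The element names are not preserved in this transcript, so `w` for words and `pc` for punctuation marks are assumptions here, as are the `xml:id` numbering scheme and the function name `tokenize_to_xml`.

```python
import re
from xml.sax.saxutils import escape

# A token is either a run of word characters or a single punctuation character.
TOKEN = re.compile(r"(?P<w>\w+)|(?P<pc>[^\w\s])")


def tokenize_to_xml(text, prefix="t"):
    """Wrap each token in an element and give it a sequential xml:id.

    Assumed element names: <w> for words, <pc> for punctuation marks.
    """
    parts = []
    for n, match in enumerate(TOKEN.finditer(text), start=1):
        name = match.lastgroup  # "w" or "pc", depending on which branch matched
        parts.append(f'<{name} xml:id="{prefix}{n}">{escape(match.group())}</{name}>')
    return " ".join(parts)


# Transliterated sample sentence, used only for illustration.
print(tokenize_to_xml("Chto chelovek zakhochet, to i sdelaet."))
```

Stable identifiers on every token are what make it possible to link database rows back to positions in the text at later stages.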

XML Document Folklore Project 9/14

XSLT Stylesheets: produce legible text for proofreading; produce tables to be exported to the database. Folklore Project 10/14
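The two outputs named on this slide can be sketched in Python with the standard `xml.etree.ElementTree` module, standing in for the XSLT stylesheets actually used. The element names `w` and `pc`, the `lemma` attribute, and the sample fragment are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def reading_text(root):
    """Rebuild a legible line: words separated by spaces, punctuation attached
    to the preceding word (a simplification that ignores opening quotes)."""
    out = ""
    for el in root.iter():
        if el.tag == "w":
            out += (" " if out else "") + (el.text or "")
        elif el.tag == "pc":
            out += el.text or ""
    return out


def token_rows(root):
    """One (id, form, lemma) row per word, ready to be written to a table."""
    return [(el.get(XML_ID), el.text or "", el.get("lemma", "")) for el in root.iter("w")]


# Hypothetical tokenized fragment (transliterated forms, assumed lemma values).
doc = ET.fromstring(
    '<s><w xml:id="t1" lemma="zakhotet">zakhochet</w>'
    '<pc xml:id="t2">,</pc>'
    '<w xml:id="t3" lemma="to">to</w></s>'
)

print(reading_text(doc))
for row in token_rows(doc):
    print(row)
```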

Access Database Folklore Project 11/14

Access Database Folklore Project 12/14

Access Database Folklore Project 13/14

Results: printed edition (texts, linguistic analysis supplement, indexes); XML-TEI lemmatized text corpus; XSLT stylesheets; Access database (morphological table, forms for lemmatization and dictionary). Problem: no direct connection between the printed edition and the XML texts. Folklore Project 14/14

Challenges: create an adequate representation of linguistically relevant data from a medieval manuscript (multiple visualizations according to various editing traditions); annotate and analyze the use of punctuation marks. Punctuation Project 1/12

Project History : first transcriptions using ASCII special characters; 2001: first annotation using Excel; 2003: XML-TEI (Charrette-style) transcriptions; : XML-TEI (Menota-style) transcriptions. Punctuation Project 2/12

Special data to be encoded: variant character glyphs; abbreviations; large initials; abnormal word spacing. Punctuation Project 3/12

Multiple visualizations. Extract from Ms. Lyon BM, P.A. 77, Queste del saint Graal. Photo: BM Lyon. Transcription: Graal Project.
Normalized Presentation: [ § 7] Endementres qu'il parloient einsi si entra laienz uns vaslez qui dist au roi: « Sire noveles vos aport mout merveilleuses. – Queles ?
Diplomatic Presentation: [ § 7] ENdementres qu'il parloient einsi si entra laienz uns uaslez qui dist au roi. Sire noueles uos aport mout merueilleuses. Queles
Imitative Presentation: [ § 7] E Ndementreſ quıl parloıent eínſı ſı entͣ laıenz unſ uaſlez quı dıſt au roı. Sıre noueleſ uoſ apot mout merueılleuſeſ. Queleſ
XML Transcription (element markup not preserved): Endementres ENdementres E Ndementre&slong; qu
Punctuation Project 4/12

Encoding choices: Menota-style TEI extension (multiple representation at a word level: norm, dipl, facs, pal?); additional elements (punct, mdv_dropcap, mdv_lb…); additional attributes. Punctuation Project 5/12
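Multiple representation at the word level means that each word carries several parallel forms, and a presentation is produced by picking one level throughout. The Python sketch below illustrates the idea under simplifying assumptions: the level names `norm`, `dipl` and `facs` come from the slide, but they are written here as plain child elements of a hypothetical `w` element, without the namespace or wrapper elements a real Menota-style document would use. The two sample words are taken from the extract shown earlier.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical structure: one child element per representation level.
SAMPLE = """<seg>
  <w><norm>Endementres</norm><dipl>ENdementres</dipl><facs>E Ndementre\u017f</facs></w>
  <w><norm>vaslez</norm><dipl>uaslez</dipl><facs>ua\u017flez</facs></w>
</seg>"""


def render(root, level):
    """Join the chosen representation ('norm', 'dipl' or 'facs') of every word."""
    words = []
    for w in root.iter("w"):
        form = w.find(level)
        if form is not None:  # a word may lack a given level
            words.append("".join(form.itertext()))
    return " ".join(words)


root = ET.fromstring(SAMPLE)
for level in ("norm", "dipl", "facs"):
    print(level, "->", render(root, level))
```

Because every level sits inside the same word element, annotations attached to the word apply to all presentations at once.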

Workflow: compact syntax transcription (xml + shortcut characters, cf. Wiki); text description using Access database (Ms description, text typology); expansion to a standard XML format using a Perl script; export to tabular format for annotation; re-integration of annotation into XML documents; export and analysis using Weblex software. Punctuation Project 6/12
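The export and re-integration steps form a round trip: punctuation marks leave the XML as table rows keyed by identifier, are annotated in a spreadsheet, and the annotations are written back by matching on the same identifier. Here is a minimal Python sketch of that round trip, assuming a `punct` element (named in the encoding choices) with an `xml:id`, and a hypothetical `function` attribute and CSV column layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def export_rows(root):
    """One row per punct element: id, mark, and an empty cell for the annotator."""
    return [[p.get(XML_ID), p.text or "", ""] for p in root.iter("punct")]


def to_csv(rows):
    """Serialize the rows with a header line (column names are assumptions)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "mark", "function"])
    writer.writerows(rows)
    return buf.getvalue()


def reintegrate(root, csv_text):
    """Copy each non-empty 'function' cell back onto the punct element with that id."""
    by_id = {p.get(XML_ID): p for p in root.iter("punct")}
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = by_id.get(row["id"])
        if target is not None and row["function"]:
            target.set("function", row["function"])
    return root


doc = ET.fromstring('<p><w>roi</w><punct xml:id="p1">.</punct><w>Sire</w></p>')
table = to_csv(export_rows(doc))
# The annotator fills in the third column; the value here is a made-up label.
annotated = table.replace("p1,.,", "p1,.,sentence-end")
reintegrate(doc, annotated)
print(ET.tostring(doc, encoding="unicode"))
```

Matching on identifiers rather than on position keeps the round trip safe even if rows are sorted or filtered during annotation.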

Compact syntax Punctuation Project 7/12

Manuscript description Punctuation Project 8/12

Expanded XML Punctuation Project 9/12

Annotation Punctuation Project 10/12

Weblex Punctuation Project 11/12

Results: 25 fragments of manuscripts transcribed and described; encoding guidelines; integrated database of text descriptors (editions and transcriptions); Perl scripts for conversions; XSLT stylesheets. Punctuation Project 12/12

Thank You!