Helsinki Corpus TEI XML Edition


















Although several later corpus projects came to adopt some of the practices used in the Helsinki Corpus, such as the time periodisation which was developed specifically for the corpus, the COCOA annotation scheme did not become a universal standard in corpus linguistics. As years went by, new annotation standards appeared on the corpus linguistic scene and, encouragingly, many projects adopted variations of widely used metalanguages such as XML.

The idea of converting the Helsinki Corpus to a more universal encoding format was mentioned every now and again, but for a while practicalities prevented that from happening. VARIENG maintains an ever-growing bibliography of research conducted with the corpus as part of the Helsinki Corpus entry in the Corpus Resource Database (CoRD), and from time to time we hear of other projects making use of the corpus or inspired by it. Our typical reaction was to contact the researcher or team in question, inquire about their work, and wish them success.

One particular type of activity involved conversions to other encoding and annotation formats, some of them XML. Although these efforts were certainly of interest, it turned out that none of the new encoding projects had attempted a comprehensive conversion of the entire corpus and all its rich data.

Our decision to produce an updated version turned from idea to action in the autumn, prompted by two facts: the second funding period of the VARIENG research unit was coming to an end, and the year in which it ended would also mark the 20th anniversary of the original release of the corpus. The latter of these two events had already inspired VARIENG to put the wheels in motion for an anniversary conference called the Helsinki Corpus Festival, to be held in September and October, and it did not take long for us to realise that the conference would provide the perfect opportunity to release the new XML edition to the research community.

With somewhat less than a year to go before the conference, it was time to get busy. As for why the conversion was undertaken, the answer is twofold. Firstly, the main reason for the conversion was to ensure the preservation of all the information encoded into the original corpus.

One of the dangers of proprietary encoding systems, a definition we may take to apply to all standards not in sufficiently wide use to be considered universal, is that over time, as new systems emerge, older ones in limited use are gradually forgotten and the data is rendered effectively inaccessible.

Because XML is de facto the universal markup language in the world of computing today, it only makes sense to convert existing corpora to it rather than produce yet another new markup language. The TEI Guidelines for Electronic Text Encoding and Interchange define an XML schema developed by and for humanities computing that provides an extraordinarily rich set of elements and attributes for all manner of documents.

Corpora are catered for as well, and the wealth of documentation makes TEI particularly attractive as a universal scheme for corpus linguists to adopt. Another distinct benefit of XML is the option of using standoff annotation, which will allow users to add new layers of data without disrupting the base text.
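The idea of standoff annotation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample words, the `w`, `spanGrp` and `span` markup, the `target` and `ana` attribute usage, and the function name `apply_standoff` are hypothetical and are not taken from the corpus itself. The point is only that the annotation layer lives in a separate document and refers to the base text by identifier, so the base text is never modified.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:id attribute to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical base text: each word carries its own identifier.
BASE_TEXT = """<p>
  <w xml:id="w1">Whan</w> <w xml:id="w2">that</w> <w xml:id="w3">Aprill</w>
</p>"""

# Hypothetical standoff layer, kept apart from the base text.
STANDOFF_LAYER = """<spanGrp type="pos">
  <span target="#w1" ana="conjunction"/>
  <span target="#w3" ana="noun"/>
</spanGrp>"""


def apply_standoff(base_xml, layer_xml):
    """Return {word: annotation} by resolving pointers into the base text."""
    base = ET.fromstring(base_xml)
    layer = ET.fromstring(layer_xml)
    # Index every identified element of the base text once.
    by_id = {el.get(XML_ID): el for el in base.iter() if el.get(XML_ID)}
    resolved = {}
    for span in layer.iter("span"):
        target = span.get("target", "").lstrip("#")
        if target in by_id:
            resolved[by_id[target].text] = span.get("ana")
    return resolved


print(apply_standoff(BASE_TEXT, STANDOFF_LAYER))
```

A second layer, for example one recording syntactic information, could be added as another separate document pointing at the same identifiers.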

The monogr element only occurs within a biblStruct. The name element is used exclusively in respStmt elements for identifying the various people who have participated in the compilation of the original Helsinki Corpus and the production of the XML version. The same annotation is used for notes added by the editors of the editions used as the source of the corpus text and notes added by the corpus compilers, the resp attribute being used to distinguish between the two.

Corpus compilers are referred to using the xml:id values of the relevant respStmt elements in the corpus header, while the editor of a source edition is referred to by an XPath expression pointing to the editor element(s) of the relevant bibliographic item.

The lack of a resp attribute on a note element means that the note is in the original source. Contains a series of note elements providing additional information about the XML version of the corpus, its documentation and its source, the original version of the Helsinki Corpus.
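The three-way distinction carried by the resp attribute on note elements can be sketched in Python as follows. The exact form of the attribute values is an assumption: the sketch supposes that a reference to a compiler is written as `#` followed by an xml:id, and that any other value is an XPath expression pointing to an editor. The sample notes, the identifier `compilerA` and the function name `classify_note` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the resp values only imitate the two reference styles.
SAMPLE = """<div>
  <note>written in the margin</note>
  <note resp="#compilerA">sample begins here</note>
  <note resp="//biblStruct[1]//editor">emended by the editor</note>
</div>"""


def classify_note(note):
    """Say who is responsible for a note, judging by its resp attribute."""
    resp = note.get("resp")
    if resp is None:
        return "source"    # no resp: the note is in the original source
    if resp.startswith("#"):
        return "compiler"  # assumed form of a pointer to a respStmt xml:id
    return "editor"        # assumed to be an XPath pointing to an editor


root = ET.fromstring(SAMPLE)
print([classify_note(n) for n in root.iter("note")])
```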

Used only within text divisions of the type letter. The TEI Guidelines version 1. The TEI Guidelines have here been extended to allow paragraphs to be incomplete as well; this extension will also be proposed for the Guidelines themselves.

All paragraphs that were indicated as such in the original Helsinki Corpus, either by three whitespaces at the head of a line or by a preceding empty line, have been annotated using this element. Sections of text delimited by headings but not explicitly annotated as paragraphs in the original have also been treated as paragraphs in this XML version.
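The two paragraph signals of the original format (three whitespaces at the head of a line, or a preceding empty line) can be sketched as a small Python function. This is not the project's conversion script: the function name `split_paragraphs` and the sample lines are hypothetical, and the sketch ignores headings and every other kind of original markup.

```python
def split_paragraphs(lines):
    """Group plain-text lines into paragraphs using the two signals
    described above: a three-space indent or a preceding empty line."""
    paragraphs = []
    current = []
    for line in lines:
        if line.strip() == "":
            # An empty line closes the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = []
            continue
        if line.startswith("   ") and current:
            # A three-space indent opens a new paragraph.
            paragraphs.append(current)
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(current)
    return [" ".join(p) for p in paragraphs]


sample = [
    "   First paragraph, line one",
    "continues here",
    "   Second paragraph",
    "",
    "Third paragraph",
]
print(split_paragraphs(sample))
```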

Value description: the values contain letters, digits and punctuation characters, indicating either a page number, a folio number with recto and verso indicators or a combination of a volume identifier and page number. Following the TEI convention, pb elements appear at the start of the page to which they refer. The global n attribute indicates the number or other value associated with the page which follows. Used only within text divisions of the type letter, following a closer.
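Because a pb element marks the start of the page it refers to, all text up to the next pb belongs to the page named in its n attribute. A minimal Python sketch of that reading, assuming un-namespaced element names and an invented sample (the function name `text_by_page` is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a page break in the middle of a paragraph.
SAMPLE = '<div><pb n="12"/><p>end of one page <pb n="13"/>start of the next</p></div>'


def text_by_page(root):
    """Map each pb/@n value to the text that follows that page break."""
    pages = {}
    current = None

    def walk(el):
        nonlocal current
        if el.tag == "pb":
            # A page break switches the page that receives later text.
            current = el.get("n")
            pages.setdefault(current, "")
        elif el.text and current is not None:
            pages[current] += el.text
        for child in el:
            walk(child)
            # Text after a child (including after a pb) lives in its tail.
            if child.tail and current is not None:
                pages[current] += child.tail

    walk(root)
    # Normalise whitespace for readability.
    return {n: " ".join(t.split()) for n, t in pages.items()}


print(text_by_page(ET.fromstring(SAMPLE)))
```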

Contains a textClass element replicating the classificatory part of the Helsinki Corpus COCOA header, along with the dating and language information for the text encoded in a 'TEI-native' format using the creation and langUsage elements. Value description: the value is an integer, unique to the individual publication place within that biblStruct.

The corpus TEI header contains the full publication information of the corpus, while the headers of individual texts contain merely a reference to the whole corpus.

Each respStmt contains a single resp element describing a specific responsibility and a list of name elements annotating the names of the people responsible for this aspect of the XML conversion. Contains a series of change elements reflecting various stages of the XML conversion process. The series element only occurs within a biblStruct. The most common values used for this attribute refer to the compilers of the original version of the corpus, enumerated in the header.

In addition, values referring to the edition used as a source of the text also occur. In the corpus, the sic element always occurs within a choice element, together with a corr element indicating the corrected reading.
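Pairing sic with corr inside a choice element lets a reader pick either the original or the corrected reading. The Python sketch below shows one way to do so; the sample sentence and the function name `reading` are hypothetical, and element names are assumed to be un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one erroneous form with its correction.
SAMPLE = "<p>he <choice><sic>seyd</sic><corr>seyde</corr></choice> to hem</p>"


def reading(root, corrected=True):
    """Return the text, keeping corr (corrected=True) or sic (False)."""
    drop = "sic" if corrected else "corr"
    parts = []

    def walk(el):
        if el.tag == drop:
            return  # skip the alternative that was not asked for
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(root)
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
print(reading(root))                   # corrected reading
print(reading(root, corrected=False))  # original reading
```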

An individual speech in a performance text, or a passage presented as such in a prose or verse text. If speaker labels are present, they are contained within the sp element, along with all the p or lg elements that make up the content of the speech. Semi-automatically annotated for performance texts and other texts containing unambiguously indicated dialogue in the XML version, based on speaker labels and other cues in the original version, supplemented by manual editing of the annotation.
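As a rough illustration of how the sp structure described above could be used, the following Python sketch lists each speech with its speaker label, if any. The sample dialogue and the function name `speeches` are hypothetical, and element names are assumed to be un-namespaced.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one labelled and one unlabelled speech.
SAMPLE = """<div>
  <sp><speaker>Wife.</speaker><p>Good morrow.</p></sp>
  <sp><p>And to you.</p></sp>
</div>"""


def speeches(root):
    """Return (speaker label or None, speech text) for every sp element."""
    result = []
    for sp in root.iter("sp"):
        speaker = sp.find("speaker")
        label = None
        if speaker is not None:
            label = "".join(speaker.itertext()).strip()
        # The content of a speech is made up of p or lg elements.
        words = " ".join(
            " ".join("".join(el.itertext()).split())
            for el in sp
            if el.tag in ("p", "lg")
        )
        result.append((label, words))
    return result


print(speeches(ET.fromstring(SAMPLE)))
```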

A specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment. Semi-automatically annotated for performance texts in the XML version, based on type changes and other annotated cues in the original version, supplemented by manual editing of the annotation. The editor of a source edition is referred to by an XPath expression pointing to the editor element(s) of the relevant bibliographic item.

As the manual of the original version states, it is used for encoding a variety of things, including italicized expansions of abbreviations by the editor unless they "occur repeatedly and frequently throughout the text, in which case they have been left uncoded", emendations made to the text on the basis of other manuscript versions and text supplied from other manuscript versions and corrections.

When emendations indicated by italics were encoded by the original compilers of the Helsinki Corpus, the emendation code was used to cover the whole word, a practice reflected in this XML version. In the Old English part of the corpus, individual characters supplied or emended within a word were coded on the level of the whole word following the practice of the Toronto Corpus, with the brackets enclosing the whole word, while in the Middle English and Early Modern English parts, only the characters actually supplied or emended were enclosed within the brackets.

These practices are also reflected in the use of the supplied element in the XML version of the corpus. Contains a single category element describing each class of the taxonomy, in addition to a brief description of the categorization.

Consists of a fileDesc containing the bibliographic information for the file and the text contained by it and a profileDesc containing all of the information describing the linguistic and contextual profile of the text.

The header for each text contains all of the information contained in the original COCOA header, along with the bibliographic information contained in the compiler's note following it, in a structured, TEI conformant format.

The text element contains the original Helsinki Corpus text, omitting the COCOA header and the following note containing the bibliographic information. Value description: a unique XML identifier consisting of the xml:id value of the text and the word classification, combined by an underscore.

In the case of multiple text classifications for a single text, each classification after the first has a running number appended to the identifier by an underscore. In cases where errata corrections have been made to the parameter values of the original version, the default textClass element contains the corrected values, while the original values are preserved for reference in a separate textClass element identified as the 'old' version. Used only for the title element contained in the titleStmt of a text.
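The identifier scheme for text classifications can be sketched in Python. The text does not say at which value the running number starts, so the sketch makes that a parameter and assumes 2 for the second classification; the text identifier `sampletext` and the function name `text_class_ids` are likewise hypothetical.

```python
def text_class_ids(text_id, count, first_number=2):
    """Build identifiers for `count` classifications of one text.

    The first is '<text_id>_classification'; each later one has a
    running number appended by an underscore. The starting value of
    that number (first_number) is an assumption, not a documented fact.
    """
    base = f"{text_id}_classification"
    ids = []
    for index in range(count):
        if index == 0:
            ids.append(base)
        else:
            ids.append(f"{base}_{first_number + index - 1}")
    return ids


print(text_class_ids("sampletext", 3))
```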

The value does not yet point to any existing XML element, but provides the basis for the future linking of a bibliographic database to the corpus. Used only for title elements occurring within bibliographic entries. The title element can also occur without any attributes, merely annotating a stretch of text as the title of a work for formatting purposes. For the entire corpus, provides the title and the responsibility statements for the XML version of the corpus.

See copyright statement.

Table of contents: Introduction; Structure of the corpus; The structure of the original Helsinki Corpus; The TEI Header; The Corpus Header; File description; Encoding description; Revision description; Text Headers; Profile description; Annotation; Textual structure; Paragraph division; Line division; Headings; Page breaks and other milestones; Additional structural annotation; Textual and paratextual features; Special characters, accents and punctuation; Abbreviations and superscripts; Type changes and runes; Foreign language; Emendation and notes; Editorial emendation; Comments.

The structure of the original Helsinki Corpus

The topmost structural division of the Helsinki Corpus is its chronological division into three main and eleven subsidiary parts, indicated both by file name prefixes (the main parts) and the COCOA header.

The TEI Header

All of the metadata for both the corpus itself as a whole and for each individual text is contained within teiHeader elements included in the root element of the corpus (teiCorpus) and in each of the TEI elements.

The Corpus Header

The teiHeader for the entire corpus contains three parts, documenting the different aspects of the corpus and its creation process: fileDesc (file description), containing the bibliographic information for the corpus; encodingDesc (encoding description), containing information about the technical aspects of the corpus; and revisionDesc (revision description), containing information on the production process of the TEI XML Edition of the corpus.
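The two levels of header described above can be made concrete with a skeleton document and a few lines of Python. The skeleton is a hypothetical, empty outline rather than an excerpt from the corpus; it assumes the standard TEI namespace, and the function name `header_sections` is invented for the example.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical skeleton: a corpus header plus one text with its own header.
SKELETON = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc/><encodingDesc/><revisionDesc/>
  </teiHeader>
  <TEI>
    <teiHeader><fileDesc/><profileDesc/></teiHeader>
    <text/>
  </TEI>
</teiCorpus>"""


def header_sections(corpus_root):
    """Return the part names of the corpus header and of each text header."""
    def parts(header):
        return [child.tag.replace(TEI, "") for child in header]

    corpus_parts = parts(corpus_root.find(f"{TEI}teiHeader"))
    text_parts = [
        parts(tei.find(f"{TEI}teiHeader"))
        for tei in corpus_root.findall(f"{TEI}TEI")
    ]
    return corpus_parts, text_parts


print(header_sections(ET.fromstring(SKELETON)))
```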

File description

The file description contains the information that not only identifies the corpus, but also documents the various people and institutions involved in its production and their roles in a structured way.

Encoding description

The encoding description contains information related to the construction process of the corpus and the way various things are encoded and annotated.

Revision description

The revision description (revisionDesc) consists of a series of change elements which document changes made to the corpus document since its creation and identify the dates of the changes and the persons responsible for them.

Text Headers

The TEI headers of individual corpus texts are slightly different in structure from the main corpus header.

File description

The file description (fileDesc) contains a bibliographic description of the corpus text.

Profile description

The profile description (profileDesc) element contains all of the classificatory data contained in the COCOA header of the original version.

Annotation

Description of the annotation used in the corpus and how it replicates the original annotation.

Textual structure

The basic textual structure, i.

Line division

Line division in the original corpus is indicated by the lineation of the text file, apart from long lines which are divided onto two lines, the first incomplete one being marked by a line-final hash.

Additional structural annotation

In addition to structural annotation based on the original version of the Helsinki Corpus, certain types of texts, namely verse texts, dramatic texts and letters, have also received additional structural annotation.
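The line-final hash convention described under Line division lends itself to a short Python sketch. It is an illustration rather than the project's conversion script: the function name `join_continued_lines` and the sample lines are hypothetical, and the sketch assumes that the two halves of a divided line are simply concatenated once the hash is removed.

```python
def join_continued_lines(lines):
    """Rejoin long lines that were split in two, where the first,
    incomplete half is marked by a line-final hash."""
    joined = []
    pending = None
    for line in lines:
        if pending is not None:
            # Complete the line begun on the previous row.
            line = pending + line
            pending = None
        if line.endswith("#"):
            pending = line[:-1]  # drop the hash, wait for the rest
        else:
            joined.append(line)
    if pending is not None:
        # A trailing hash with nothing after it: keep what there is.
        joined.append(pending)
    return joined


sample = [
    "a short line",
    "a long line that is #",
    "split in two",
    "the last line",
]
print(join_continued_lines(sample))
```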

Emendation and notes

The original version of the corpus contains several kinds of annotation for editorial and compilatorial intervention, including notes by both the corpus compilers and the editor, and a record of selected editorial emendation.

Editorial emendation

Editorial emendations (corrections, text supplied from other manuscripts, expansions of abbreviations, etc.)

Comments

The original version of the Helsinki Corpus contains two types of comments added to the text: editor's comments and compilers' comments.

Errata corrections

In addition to notes and editorial emendation contained in the original version of the corpus, the creation of the TEI XML Edition also involved the implementation of a large number of errata corrections collected over the years by the original corpus team.

Attributes: xml:id (identifier) provides a unique identifier for the corpus text. Used only within texts.

Remarks: The analytic element only occurs within a biblStruct. Used only within the header.

Remarks: Used only in the revisionDesc element of the corpus header for documenting changes made to the corpus. Value description: The author's name in the original Helsinki Corpus format.

Marttila, Ville. The file names reflect, by and large, the names of authors or texts in the Old and Middle English sections of the Corpus.

In the Early Modern English section the file names are based on the systematic coverage of different text types. The team considered it a matter of utmost importance that any research carried out using the new XML edition be fully compatible with earlier work done using the original corpus. The decision was made that the best approach would be to start with an automated conversion, and then to correct the errors by a combination of manual proofreading and tweaks to the conversion script.

Marttila took responsibility for the scripting, and over the Christmas holidays came up with a script that produced a very promising first run. It soon became apparent, however, that there would be several challenges before the conversion work was finished. A number of discrepancies were discovered between what the original manual stated and the reality, particularly when it came to topics such as the encoding of text structure and representation of samples.

It was becoming clear that considerable amounts of manual work could not be avoided. Many of the challenges came about as a result of the hierarchical principles of TEI XML clashing with the way the corpus was encoded or constructed. In such cases it was not sufficient simply to replace existing codes with new XML-compliant ones; rather, it became necessary to add encoding where none had existed before.

In the majority of cases, this had to be done manually. An important part of the conversion involved adding transparency to the coded header parameters in the original corpus. The new edition spells out parameters and their values, and the manual provides more information about their specific meanings.

Accomplishing this required going back to the original compilers and asking questions about principles followed in the compilation of the corpus. Honkapohja read through the editions and their introductions, and in the case of editions based on single manuscripts, localised them using LAEME.

The great majority of Helsinki Corpus ME texts are based on a single manuscript. Together with Marttila, Kauhanen performed various automated and semi-automated conversion tasks to do with bibliographical data and the annotation of line groups and speaker turns missing in the original corpus. Although sufficient, bibliographic information in the original corpus was not recorded in a structured form, which meant that additional annotation, based on a careful analysis of the structure of the original entries, needed to be added to make the information machine-readable.

A particular problem came about as a result of the annotation of correspondence samples, where a single text in the original corpus could contain more than one sample with partially overlapping header information. One of the major undertakings in the new version was the correction of errata, collected over the preceding two decades and now filling two large cardboard folders. Although the principle behind Phase I of the conversion project remains that the XML edition matches the old Helsinki Corpus, in this one case it was deemed acceptable to replace the original text with a far more accurate edition.

For one thing, we wanted to document the conversion project in order to preserve information on who did what and how. Accordingly, the XML file makes use of responsibility statements and a change log recording every stage of the conversion and the names of the editors responsible. Indeed, one of the many lessons we have learned over the more than two decades of corpus compilation work in Helsinki is the value of such historiography of compilation projects, and we would urge other projects to adopt the habit of keeping records not only of decisions made, but also of who actually performs a given piece of work.


