
How to Encode

Our project presents the documents in the archive in two views: one that preserves, to the extent possible, the document as it appears on the page, and another that presents a modernized version that is easier to read. For each document in the collection, these two versions will be encoded together in a single <text> element, using the <choice> element to layer the transcription and edition together. For each text, we will also encode semantic features such as dates and proper names, so that we can later experiment with options for analysis and display.
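As a rough illustration of this layering, the invented sentence below (not taken from the archive) carries both views at once. The transcription would read the content of <sic>, the edition would read the content of <corr>, and the place and date tags are the semantic features.

```xml
<p>
   <!-- Invented sentence for illustration only -->
   We had some very nice <choice><sic>whether</sic><corr>weather</corr></choice> in
   <name type="place" subtype="city">Jacksonville</name> on
   <date when="1970-04-01">April 1, 1970</date>.
</p>
```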

Element reference for encoding

I. Representing the manuscript in XML

A. Structural elements

<pb n="1" facs="../images/ew_a1_342_001.jpg"/>

Page break. The value of the n attribute should be the page number. If there is no explicit pagination in the document, you can impose it here, numbering sequentially from 1. The value of the facs attribute should be the filename of the corresponding image. This has already been configured for the first page in your file. You should copy, paste, and modify that first <pb> tag for subsequent page breaks.


<p></p>

encloses a paragraph. Please put these elements on their own lines, as follows:

<p>
   This is the text of a paragraph.
</p>



<lb/>

marks the place where a line break occurs in the transcription. When a word is divided across the line break, we use the break attribute to indicate this:

<lb break="no"/>

However, to avoid creating an empty space in the published transcription, do not start a new line in your XML file after <lb break="no"/>. Keep typing on the same line until the next line break that does not divide a word. For instance:

<p>
   This document has several<lb/>
   lines of text. Most lines end<lb/>
   neatly at the end of a word,<lb/>
   but this line quite stubborn<lb break="no"/>ly does not.<lb/>
</p>

If stubbornly were hyphenated in the original, you would not need to transcribe the hyphen here.


<cb/>

marks the beginning of a column on a page that has more than one column. See the example on this page: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-cb.html
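The structural elements in this section might combine as in the following sketch of a two-column page. The text is invented, the facs filename is a guess modeled on the first page's filename, and the n values on the column-break element <cb/> are optional numbering rather than something this guide requires.

```xml
<div>
   <!-- Hypothetical two-column page; the facs filename and the n values on cb are assumptions -->
   <pb n="2" facs="../images/ew_a1_342_002.jpg"/>
   <cb n="1"/>
   <p>
      Text of the left-hand column<lb/>
      continues here.<lb/>
   </p>
   <cb n="2"/>
   <p>
      Text of the right-hand column.<lb/>
   </p>
</div>
```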

B. Deletions, Insertions and Changes of Hand

<del type="strikeout"></del>

encloses any material that has been crossed out.

<del type="overwritten"></del> 

encloses any material that has been written over.  

<add type="caret"></add>

encloses any material that has been added above the line, or between lines, with a caret or other mark indicating its point of insertion in the original.

<add type="no_caret"></add>

encloses any material that has been added without any indication of its point of insertion (meaning you will have to make an educated guess about where it goes).


<note></note>

can be used to transcribe any material you find handwritten in the margin of a document, using the place attribute to indicate which margin:

<note place="marginLeft"></note>
<note place="marginRight"></note>
<note place="marginTop"></note>
<note place="marginBottom"></note>

If the note is in the left or right margin, you should locate this element as close as possible to the place to which it corresponds in the top-to-bottom flow of the text.

<note type="authorial" place="marginTop"></note> 

can also be used to record a heading that runs across the top of the page throughout a document but is separate from the flow of the text (as might occur in a diary, perhaps).  

 <supplied reason="omitted-in-original" cert="high">met</supplied>

can be used to add a word that you believe, according to your editorial discretion, was unintentionally omitted from the text. The reason attribute is used to provide your justification for this addition. The cert attribute indicates your degree of certainty (valid values are low, medium, and high).
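To show how the elements in this section can sit together in running text, here is a short sketch. The sentence and the marginal note are invented for illustration and do not come from the archive.

```xml
<p>
   <!-- Invented text for illustration only -->
   I <del type="strikeout">shall</del><add type="caret">will</add> write again as<lb/>
   <note place="marginLeft">Answered May 3</note>
   soon as I <supplied reason="omitted-in-original" cert="high">have</supplied> any news.<lb/>
</p>
```

The marginal note is placed next to the line it sits beside, following the rule about the top-to-bottom flow of the text.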

C. Representing layout/formatting of original document

When we encode documents with TEI-XML, we are concerned more with content than appearance. Indeed, one of the benefits of using XML is that it separates content from how that content will ultimately be presented. That styling is generally done with XSLT, the stylesheet/transformation language for XML (much as CSS is the stylesheet language for HTML).

However, we are also interested in recording the appearance, or formatting, of the original document itself. This is why, after all, we are using <pb/>, <lb/> and other such structural elements. Here are a few others that may be useful.

If a paragraph is indented in the original, you can indicate that through:

<p style="indent">

(The style attribute is used here and elsewhere to describe how things appear in the original, not how they should be rendered in any eventual output of the XML document. See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#STGAre)

<hi style="center"></hi>

encloses centered text (see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/examples-hi.html).

<hi style="superscript"></hi> 

encloses raised text. Similarly,

<hi style="italic"></hi>

encloses italic text, and

<hi style="underline"></hi>

encloses underlined text.

For text written in all capitals, enclose it in the following:

<emph style="case(allcaps)"></emph>

To adjust the placement of the opener or closer of a text, use the style attribute as shown in the following example:

<opener style="center"></opener>
<closer style="right"></closer>

If headings or labels of any sort appear in the document in all caps, please transcribe them using title case, and use the style attribute to indicate the all caps formatting in the original, as follows:

 <head style="case(allcaps)">Indigent Hospital Patients</head>

(This text appears in the original as "INDIGENT HOSPITAL PATIENTS.") and

 <label style="case(allcaps)">Laws and Rules</label>

(This appears in the original as "LAWS AND RULES".) If other material appears in all caps in the original, please use the following:

 <emph style="case(allcaps)"></emph>
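A short sketch of these formatting attributes used together, assuming an invented heading and sentence:

```xml
<div>
   <!-- Invented text for illustration only -->
   <head style="case(allcaps)">Annual Report</head>
   <p style="indent">
      The <hi style="underline">second</hi> meeting was held on the
      3<hi style="superscript">rd</hi> of <emph style="case(allcaps)">March</emph>.<lb/>
   </p>
</div>
```

In each case the style value describes how the text looks in the original, so "Annual Report" and "March" are typed in ordinary case even though the hypothetical source has them in capitals.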

D. Handling gaps in the text

<gap reason="page missing from document"/> 

can be used to mark a place where there is an unrecoverable gap in the original, using the reason attribute to explain the circumstances. This will often have to do with damage to the original or missing pages. If the gap is small, and you believe you can accurately deduce the missing letters or words, see IV. B. Material you add yourself below.
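For a gap inside a paragraph, the markup might look like the following sketch. The sentence and the wording of the reason value are invented; any short free-text explanation should serve.

```xml
<p>
   <!-- Invented text; the reason value is a hypothetical free-text explanation -->
   The committee agreed to <gap reason="corner of page torn away"/> before<lb/>
   the end of the month.<lb/>
</p>
```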

E. Special characters

Some characters have to be transcribed in XML with character references. These include:

° (the degree sign, as in 98° Fahrenheit), which must be represented in your text as follows:

&#176;


& (the ampersand, meaning “and”), which must be represented as follows:

&amp;


F. Openings and Closings of Letters

These elements are both structural and semantic, but we'll put them in this section. Here are examples of how to mark up the opening and closing of a letter:

 <opener>
    <name type="place" subtype="address">
       132 Main Street
       Jacksonville, FL 32207
    </name>
    Dear <name type="person">Miss White</name> :<lb/>
 </opener>

 <closer>
    <hi style="center">
       <choice><sic>your</sic><corr>Your</corr></choice> sister in Christ.<lb/>
       <name type="person">Sarah Best</name>.<lb/>
    </hi>
 </closer>

These elements do not go inside <p> or <head> elements.  

G. Hierarchy of Headings 

Headings must be nested within <div> tags whenever multiple levels of headings are present and/or when headings appear after the beginning of the document. The <div> tag is used to express a hierarchy of headings: a heading inside one <div> tag is a regular heading, a heading inside two <div> tags is a subheading, a heading inside three <div> tags is a sub-subheading, and so on. The following example illustrates the use of nested tags to create a hierarchy.

<div><head>This is the biggest header.</head></div>
<div><div><head>This is a smaller subheader.</head></div></div>
<div><div><div><head>This is an even smaller sub-subheader.</head></div></div></div>

When using both headings and subheadings, all other elements, such as <p>, <opener>, and <closer>, must also be enclosed in a <div> tag. See the following example:

<div><p>Body text</p></div>
<div><p>Second section text</p></div>
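Putting the two rules together, a document with three levels of headings might be laid out as in this sketch. The headings and text are invented, and this is one reading of the nesting rule rather than a required template. In standard TEI, paragraphs come before any nested <div> within the same parent.

```xml
<div>
   <!-- Invented headings and text for illustration only -->
   <head>Annual Report</head>
   <p>Introductory text of the report.</p>
   <div>
      <head>Finances</head>
      <p>Text of the subsection.</p>
      <div>
         <head>Donations</head>
         <p>Text of the sub-subsection.</p>
      </div>
   </div>
</div>
```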

H. Envelopes

Many of the documents we are editing are letters, and images of the corresponding envelopes are included with those of the letter itself. For now, we will consider an envelope a "page" in the corresponding XML document. TEI-XML P5 does not appear to contemplate this situation, so just encode each address block (sender, receiver), and anything else, as a <p>, with <lb/> at the end of each line.  
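An envelope treated as a page might then be encoded along these lines. The page number, image filename, and sender's address are invented for illustration, and the enclosing <div> is an assumption about where the page sits in your file.

```xml
<div>
   <!-- Hypothetical envelope page; page number, filename, and sender address are invented -->
   <pb n="3" facs="../images/ew_a1_342_003.jpg"/>
   <p>
      Sarah Best<lb/>
      45 Oak Street<lb/>
      Savannah, GA<lb/>
   </p>
   <p>
      Miss White<lb/>
      132 Main Street<lb/>
      Jacksonville, FL 32207<lb/>
   </p>
</div>
```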

II. Regularizing/Modernizing

A. Abbreviations


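<choice><abbr></abbr><expan></expan></choice>

can be used to simultaneously record an abbreviation and provide its resolution, as in:

<choice><abbr>1st</abbr><expan>first</expan></choice>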


If the letters st in this example actually appeared in raised script in the original, we would document that as follows:

<choice><abbr>1<hi style="superscript">st</hi></abbr><expan>first</expan></choice>

As a general rule, we will resolve all abbreviations.  

B. Misspellings


<choice><sic></sic><corr></corr></choice>

can be used to simultaneously record and correct a misspelling (a misused homonym, a word spelled incorrectly, etc.):

We had some very nice <choice><sic>whether</sic><corr>weather</corr></choice> last week.


Lisa's dog is partly <choice><sic>Dauchshound</sic><corr>Dachshund</corr></choice>, I think.

C. Archaic/Obsolete spellings


<choice><orig></orig><reg></reg></choice>

can be used to modernize a correct, but obsolete, spelling. We might not have occasion to do so in the current project. If you encounter a case of which you are unsure, please let the workshop leader know.
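As a sketch of what this could look like, assuming the same <orig> and <reg> pair that this guide uses for punctuation, here is an invented sentence containing the older spelling "shew":

```xml
<p>
   <!-- Invented sentence; "shew" is an older spelling of "show" -->
   He promised to <choice><orig>shew</orig><reg>show</reg></choice> me the letter.<lb/>
</p>
```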

D. Punctuation

We can use this sequence of elements to regularize punctuation:

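removing comma: <choice><orig>,</orig><reg></reg></choice>
removing period: <choice><orig>.</orig><reg></reg></choice>
adding comma: <choice><orig></orig><reg>,</reg></choice>
adding period: <choice><orig></orig><reg>.</reg></choice>
changing comma to period: <choice><orig>,</orig><reg>.</reg></choice>
changing period to comma: <choice><orig>.</orig><reg>,</reg></choice>
replacing semicolon with comma: <choice><orig>;</orig><reg>,</reg></choice>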
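In running text, the <choice> element takes the place of the punctuation mark itself, with no extra space before it. The sentence in this sketch is invented for illustration.

```xml
<p>
   <!-- Invented sentence: a semicolon regularized to a comma, and a final comma to a period -->
   The office was closed on Monday<choice><orig>;</orig><reg>,</reg></choice> so we
   returned the next day<choice><orig>,</orig><reg>.</reg></choice><lb/>
</p>
```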

E. Capital letters

We will respect the use of lower case/upper case letters in our transcription, but will regularize these in the edition, using

<choice><orig></orig><reg></reg></choice>

Here we replace the entire word needing a change in capitalization, not merely the individual letter or letters.
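For instance, assuming the <orig> and <reg> pair used for punctuation above, a place name written without its capital might be handled as in this invented sentence. Note that the whole word is wrapped, not just the first letter.

```xml
<p>
   <!-- Invented sentence for illustration only -->
   We returned to <choice><orig>jacksonville</orig><reg>Jacksonville</reg></choice> last week.<lb/>
</p>
```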

III. Encoding semantic aspects of the text

A. Dates

<date when="YYYY-MM-DD"></date> 

encloses a date, however it is articulated. You might see a standard situation like the following:

She was born on <date when="1970-04-01">April 1, 1970</date>.

If you have only the year, use this format:

<date when="1970">1970</date>

With only the month and year, it would be:

<date when="1980-02">February 1980</date>

A date with only month and day won't validate, so if that is the only information available, use "9999" as a placeholder year.

Sometimes references to dates aren’t explicit, but we can still tag them, as in:

Earlier <date when="1970">that same year</date>, her parents had moved to Florida.
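Applying the placeholder-year rule stated above, a date given only as month and day might be tagged as in this sketch. The sentence is invented.

```xml
<p>
   <!-- 9999 stands in for the year, since the invented text gives none -->
   The picnic will be held on <date when="9999-07-04">July 4</date>.<lb/>
</p>
```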

B. Names

1. Individual people

<name type="person"></name>

encloses the proper name of a person, as in:

<name type="person">Nikolai Vitti</name>

This can also be used to mark common nouns or phrases that refer to a specific, identifiable person, as in:

The <name type="person">current superintendent</name> of the local public school system...

where current superintendent refers, for instance, to Nikolai Vitti.  

2. Groups of people

<name type="person_group"></name>

encloses a proper noun indicating the name of a group of people that has a particular name, such as those of a given nationality or some other category. We would typically write this type of word with an initial capital, but not always. Here are some examples:

the <name type="person_group">Seminole</name> and <name type="person_group">Creek</name>


the <name type="person_group">British</name> and <name type="person_group">French</name>

3. Places

<name type="place"></name> 

encloses the proper names of places of any type. This includes buildings, streets, cities, states (and other political divisions), as well as geographical features like rivers, lakes, etc. For example:

<name type="place">Jacksonville</name>

This can also be used to mark common nouns or phrases that refer to specific, identifiable places, as in:

the level of toxicity in the <name type="place">river</name> has increased...

where river refers, for instance, to the St. John's. The subtype attribute can be used to provide a more specific category for a place, as in:

<name type="place" subtype="river">St. John's River</name>


<name type="place" subtype="city">Jacksonville</name>

Let's handle specific street addresses as follows:

<name type="place" subtype="address">123 Main Street</name> 

For places that can be located on a map, let's include latitude and longitude as follows:

<name type="place">The Clara White Mission<location><geo>30.332632 -81.664020</geo></location></name>

To get this information, search for the place in Google Maps, right click on the location on the map and select "What's here?". A box will pop up showing lat. and long. If you aren't able to copy the numbers from there, click on them, and they will appear in the search box at the left, from where you will be able to copy them.    

4. Companies and organizations

<name type="company"></name>


<name type="organization"></name>

can be used to tag the names of such entities.  

5. Events

<name type="event"></name>

can be used to indicate the name of an event, as in:

<name type="event">The World's Fair</name>
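Several kinds of names can of course appear in one sentence. The following sketch uses an invented sentence to show the person, person_group, place, and event types side by side.

```xml
<p>
   <!-- Invented sentence for illustration only -->
   <name type="person">Miss White</name> and the <name type="person_group">British</name>
   delegates left <name type="place" subtype="city">Jacksonville</name> to attend
   <name type="event">The World's Fair</name>.<lb/>
</p>
```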

C. Titles

<title level="m"></title> 

encloses the title of a monographic ("m") work (a book, primarily).

<title level="a"></title> 

encloses the title of an "analytic" ("a") work (a journal article, a book chapter, etc.)
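A brief sketch of both levels in one sentence. The sentence and the article title are invented for illustration; only the book title is real.

```xml
<p>
   <!-- The book title is real; the sentence and the article title are invented -->
   She quoted from <title level="m">Moby-Dick</title> in her article,
   <title level="a">Laws and Rules</title>.<lb/>
</p>
```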

IV. Doubts, Comments and Editorial Additions/Annotations

A. Documenting your doubts

1. <unclear>

<unclear cert="low" reason="Handwritten signature is difficult to read.">J. Henderson</unclear>

can be used to encode your uncertainty about any part of your transcription. The cert attribute indicates your level of certainty (low, medium, high). Please use the reason attribute to explain the circumstances.

2. XML comments

<!-- text here -->

is the format for an XML comment, which can be used anywhere in your XML file to add documentation. Comments are metadata that will not display in the output of your file. You can use them to record doubts or questions you might want to follow up on later.

B. Material you add yourself

 <supplied reason="text smudged" cert="medium"></supplied> 

can be used to supply letters or words that can't be read. Use the reason attribute to give a text explanation of the circumstances, and the cert attribute to indicate your degree of certainty about the solution you've provided (the acceptable values are high, medium, and low).
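When only part of a word is illegible, the element can wrap just the letters you are supplying, as in this sketch. The sentence is invented, and it assumes the middle letters of "Tuesday" are smudged in the original.

```xml
<p>
   <!-- Invented text: the middle letters of the word are smudged in the hypothetical original -->
   We expect to arrive on T<supplied reason="text smudged" cert="medium">ues</supplied>day.<lb/>
</p>
```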

Once you have completed encoding your XML file, follow the instructions for testing and submitting it.