Basic Encoding

If you would like, you can access the sample XML file I use to demonstrate the markup on this page here.

In this section, you will learn the fundamentals of TEI-XML and carry out markup that deals with representing the structure and appearance of the document. 

Note that we use encoding and markup as synonyms. 

Step 1. Learn about TEI-XML

TEI-XML is an implementation of XML.  

Here are some basic things to know about XML:

  • XML stands for Extensible Markup Language. It is "extensible" because it does not fix a set of elements in advance; it is a language for defining other markup languages, such as TEI-XML.
  • XML is a markup language, like HTML, not a programming language, like JavaScript, Python, etc. In other words, it is used to structure and describe data, not to carry out tasks or do things.
  • XML is platform-independent and human-readable. XML files are plain text files.
  • XML is commonly used to transport or exchange data between systems, including those that may have different internal formats. 
  • To learn more about XML, see the classic tutorial on XML from W3Schools.

TEI-XML was designed to describe written texts. It was developed and is maintained by the Text Encoding Initiative, an international consortium. You can find the full specification for the current standard (P5) on the TEI website. We will be using a subset of the elements available in that standard.

1.1 Of what does TEI-XML consist?

The following explanation draws examples from the TEI standard, but everything that is said here is true of XML in general, as TEI-XML is an implementation of XML.

1.1.1 Elements

TEI-XML is made up of elements, most of which are written as a pair of tags. The opening tag and the closing tag carry the same name, except that the closing tag begins with a forward slash (/). For example:

<name>James Lloyd Jordan</name>

Elements can be nested inside each other, as in this example:

<body>
<p>
My first grade teacher was <name>Mrs. Sampson</name>.
</p>
<p>
My second grade teacher was <name>Mrs. Johnson</name>.
</p>
</body>

Note that we close each nested element before closing the element that contains it.

Some elements do not enclose data or other nested elements, but rather simply mark a spot in the text. We refer to these as empty elements. They consist of a single tag with the forward slash at the end (essentially the opening and closing tags combined into one). For example, <lb/> marks the end of a line of text.
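
As a small illustration of the point above, XML treats the following two forms as equivalent. The first is the shorthand used throughout this page:

<lb/>
<lb></lb>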

1.1.2 Attributes

Elements can have attributes, which go after the name of the element in the opening tag, and are joined to their values by an equals sign (=). The value of an attribute must appear in quotation marks; we use double quotation marks throughout. For example, the element with which we begin a page is as follows:

<pb n="1"/>

The value of the attribute n (short for number) indicates that this is page 1.

The @ character is often used as a shorthand when referring to an attribute in regular text, as in this example:

We use a value for @n that corresponds to the number of the image.

1.1.3 Comments

We can add comments to an XML file in the following format:

<!-- some text here -->

Such comments are visible when we view the file in an XML or plain text editor, but do not appear when viewing the XML file in a web browser. In other words, these internal notes are ignored by the browser when rendering the page, and are thus not seen by the casual viewer of the file. (A user could get to them, of course, by downloading and opening the XML file in an editor, or by viewing the page source in their browser.)

1.2 How is a TEI-XML file structured?

The top-level (root) element in a TEI-XML file is <TEI>, which contains two main items:

  1. <teiHeader>, the TEI Header, which contains metadata about the file itself and the document being edited, and

  2. <text>, which contains the text of the document being edited.

The basic structure of the TEI-XML file, then, is as follows:

<TEI> 
 <teiHeader>
 </teiHeader>
 <text>
 </text>
</TEI>
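
For orientation, here is a slightly fuller sketch of the same skeleton. It assumes the standard TEI namespace and the minimal header contents that the TEI P5 schema requires. The bracketed text is placeholder content, and the header in your own file will contain more detail than this. The markup described in Step 2 goes inside <body>.

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>[Placeholder title]</title>
   </titleStmt>
   <publicationStmt>
    <p>[Placeholder publication information]</p>
   </publicationStmt>
   <sourceDesc>
    <p>[Placeholder description of the source]</p>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <text>
  <body>
   <pb n="1" facs="../images/ew_a1_342_001.jpg"/>
   <p>
    This is the text of a paragraph.<lb/>
   </p>
  </body>
 </text>
</TEI>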

1.3 What does it mean for an XML file to be well formed and valid?

Your XML file must be well formed and valid in order to display in the browser.

To be well formed, your file must have correct syntax. This means that every opening tag has a matching closing tag (or the element is written as an empty element), and that elements are nested properly.
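
To make the nesting rule concrete, here is a sketch of a snippet that is not well formed, followed by a corrected version. In the first, <name> is opened inside <p> but closed after it:

<p>
My first grade teacher was <name>Mrs. Sampson.
</p>
</name>

In the corrected version, <name> is closed before the <p> that contains it:

<p>
My first grade teacher was <name>Mrs. Sampson</name>.
</p>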

To be valid, your XML file must conform to the schema that defines the structure of a TEI-XML file (for example, a DTD, or Document Type Definition). It can't contain any elements that aren't part of that standard, and all elements must be in the places where the standard allows them to be.
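
As an illustration, the following snippet is well formed, since every element is closed and properly nested, but it would not be valid, because <paragraph> is not an element in the TEI standard:

<body>
<paragraph>
This is the text of a paragraph.
</paragraph>
</body>

Replacing <paragraph> with <p> resolves the problem:

<body>
<p>
This is the text of a paragraph.
</p>
</body>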

If you successfully installed the Scholarly XML extension in Setup, VS Code should prompt you when it detects problems related to your file being well formed and valid.

Avoiding Messy Validation Problems

The best way to avoid spending a lot of time making sure your file is free of errors is to fix any problems as you go. It is much easier to solve one problem than multiple, potentially overlapping problems.

Step 2. Carry out basic markup

This section covers the markup needed to represent the original document. Please note that we use the elements and attributes that follow to describe how things appear in the original, not how we believe they should be shown in any eventual rendering of the XML document in a browser or elsewhere.

You may either finish transcribing your document before you begin the steps below, or you may transcribe and encode at the same time (see 1.4 on Transcription).

2.1 Verify page breaks

There should already be a <pb/> element in your document marking the start of each page, with the value of @facs pointing to the corresponding image.

Here's an example of what that should look like:

<pb n="1" facs="../images/ew_a1_342_001.jpg"/>

Make sure that the values of @n increase sequentially from 1. These numbers will correspond to the images themselves, not to any internal page numbers that might occur in the document.
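
For a three-page document, the sequence might look like the sketch below. The second and third file names are assumptions that simply extend the pattern in the example above, and the bracketed lines stand in for your transcription. Use the actual names of your image files.

<pb n="1" facs="../images/ew_a1_342_001.jpg"/>
[text of the first page]
<pb n="2" facs="../images/ew_a1_342_002.jpg"/>
[text of the second page]
<pb n="3" facs="../images/ew_a1_342_003.jpg"/>
[text of the third page]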

2.2 Add paragraph breaks

If you have already transcribed your entire document, wrap each paragraph in these elements:

<p>
 This is the text of a paragraph.
</p>

If you are transcribing and encoding at the same time, you can add these as you go.

Putting <p> and </p> on separate lines from the text itself can simplify solving problems later.

If the paragraph is indented in the original, provide a value of "indent" for @rend:

<p rend="indent">

When a paragraph spans two pages, do not add </p> before the page break. Instead, just let <pb/> appear in the middle of the sentence in question, as in this example:

<p>
I really wrote you a<lb/>
long letter before sending<lb/>
<pb n="2" facs="../images/ew_j2_511_001.jpg"/>
you the pictures.<lb/>
</p>

2.3 Add line breaks

If you have already transcribed your entire document, add a line break element at the end of each line:

<p>
This is a line<lb/>
of text. This is another<lb/>
line of text<lb/>
</p>

When a word is divided across a line break, we use @break to indicate this (<lb break="no"/>):

<p>
This is a very inter-<lb break="no"/>esting
bit of text. This is another fascin-<lb break="no"/>ating
line of text.<lb/>
</p>

Note that when you use <lb break="no"/>, you need to put the second half of the divided word immediately after the element, on the same line and with no space in between. If you do not, a space will be inserted in the middle of the word when you view the file in a browser.
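
To illustrate, compare the two versions below. In the first, the rest of the word starts on a new line in the file, so a space would appear in the middle of the word:

This is a very inter-<lb break="no"/>
esting bit of text.<lb/>

In the second, the rest of the word follows the element directly, so the word displays without a space:

This is a very inter-<lb break="no"/>esting
bit of text.<lb/>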

Also, note that if there are hyphens in the original, we'll leave them as-is for now. 

2.4 Add column breaks

If your document consists of multiple columns,* use <cb/> to mark the beginning of each.

<p>
<cb/>
This is the text of<lb/>
column one which continues<lb/>
over into<lb/>
<cb/>
column two, and then<lb/>
eventually, also continues<lb/>
into<lb/>
<cb/>
column three.<lb/>
</p>

*This is most likely to be the case in multi-column print materials like newspaper clippings, pamphlets, and event programs. This will seldom apply to letters. If any document consists of only one column, you do not need to use this element. 

2.5 Record selected features of text formatting

2.5.1 Superscript

Use <hi rend="superscript"></hi> for any text that is raised in the original.

He arrived on the 1<hi rend="superscript">st</hi> of the month.

2.5.2 Italics and underlining

Use <hi rend="italics"></hi> and <hi rend="underline"></hi> to enclose any text that is italicized or underlined in the original.

We read <hi rend="italics">Along This Way</hi> by James Weldon Johnson.

We read <hi rend="underline">Along This Way</hi> by James Weldon Johnson.

2.5.3 All caps

If you encounter text in all caps in the original, transcribe it in the case in which it should appear in the reading version (sentence case or title case, as appropriate), and enclose it within the following:

<hi rend="allcaps"></hi>

For instance, if you read in the original "INDIGENT HOSPITAL PATIENTS," you could do the following:

<hi rend="allcaps">Indigent Hospital Patients</hi>

2.5.4 Alignment

We assume all text to be left-aligned in the original. To indicate otherwise, wrap the text in <hi> and use a value of "center" or "right" for @rend, as in:

<hi rend="center">This text is centered in the original.</hi>
<hi rend="right">This text is right-aligned in the original.</hi>

2.5.5 Horizontal positioning

We assume that all blocks of text begin at the left margin in the original. When that is not the case, wrap the text in question in a <hi> element, with a value for @rend to indicate the approximate positioning, according to the following options:

<hi rend="one-fourth">This text starts approximately one-fourth of the way across the page in the original.</hi>
<hi rend="mid-page">This text starts approximately halfway across the page in the original.</hi>
<hi rend="three-fourths">This text starts approximately three-fourths of the way across the page in the original.</hi>

If horizontal positioning cannot be approximated in this way, or if you are unsure how to handle a given situation, consult with Dr. McCarl or simply disregard this aspect of the text for now.

2.5.6 Multiple styles

You can apply multiple styles to one section of text by nesting <hi> elements:

<hi rend="center"><hi rend="allcaps">Text is all caps, centered</hi></hi>

<hi rend="underline"><hi rend="center"><hi rend="allcaps">Text is all caps, centered, underlined</hi></hi></hi>

2.6 Account for insertions and deletions

2.6.1 Insertions

If any material has been added above or between lines, use <add>, with @type indicating whether there is a caret or other mark indicating the point of insertion in the original.

The dogs <add type="caret">and the cats</add> are waiting for their dinner.  

In this scenario, it should be clear where the added text belongs.

The dogs <add type="no_caret">and the cats</add> are waiting for their dinner. 

In this scenario, you may have to use your judgement to determine where the added text is supposed to appear.

2.6.2 Deletions

If any text has been struck in the original, indicate this with <del>, using @type to indicate whether the text was crossed out or overwritten.

<del type="strikeout">This is text that was crossed out.</del>
<del type="overwritten">This text had other text written over it later.</del> 

2.7 Record marginal notes

If anyone has written a note of any sort onto the text, use <note>. Use a value of "authorial" for @type, even if it may not have been the original author who wrote the note, and a value for @place that indicates where the note appears, as follows:

<note type="authorial" place="marginLeft">This is a note in the left margin.</note>
 
<note type="authorial" place="marginRight">This is a note in the right margin.</note>
 
<note type="authorial" place="marginTop">This is a note in the top margin.</note>
 
<note type="authorial" place="marginBottom">This is a note in the bottom margin.</note>

Locate the <note> element as close as possible to the place to which it corresponds in the top-to-bottom flow of the text. Ideally, this will be in a break between sentences.
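
For instance, a note written in the left margin beside the second sentence of a paragraph might be placed as in this sketch (the wording is invented for illustration):

<p>
We arrived on Tuesday.<lb/>
<note type="authorial" place="marginLeft">Check this date.</note>
The train was two hours late.<lb/>
</p>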

2.8 Handle special characters

Some characters have to be transcribed in XML with character references. These include the following:

° (the degree sign, as in 98° Fahrenheit) must be represented as &#176;

It was already 85&#176; when I left the house this morning.

& (the ampersand, meaning “and”) must be represented as &amp;

John &amp; I will meet you at the lake.

The en dash is &#8211;.

pages 31&#8211;34

The em dash is &#8212;.

She bought a big television&#8212;the biggest one in the store&#8212;and a comfortable couch on which to sit.

Note that when representing hyphens that separate words divided across lines, we will use the simple hyphen (-).

2.9 Account for gaps 

To mark an unrecoverable gap in the original, use <gap/>, with a value for @reason that explains the circumstances. 

The dogs <gap reason="paper stained"/> loudly all night.

You can use <gap/> within <del> to represent the contents of a struck piece of text, when that text cannot be read:

The cats meow <del type="overwritten"><gap reason="struck text not legible"/></del> when I am trying to sleep.

You can also use this element if you encounter material that you cannot read and for which you need to put a placeholder into the text.
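
A sketch of that placeholder use follows. The value of @reason here is an assumption chosen for illustration; use whatever wording best describes the situation.

We stayed with <gap reason="illegible"/> for the rest of the week.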

The <gap/> element will appear in the interface as [...].

2.10 Indicate your doubts

If you're not sure you have read any text correctly, you can enclose it in <unclear> with a value for @reason explaining the situation. 

<unclear reason="Signature is difficult to read">J. Henderson</unclear>

If you can't read it at all, you can use a <gap/> element within <unclear>:

The house was <unclear reason="text is difficult to read"><gap/></unclear> and old.

2.11 Ignore certain things for now

We will address all of the following scenarios in Intermediate Encoding:

2.11.1 Headings and subheadings

Some documents, including event programs, may have several levels of headings. For now, just encode everything as a paragraph (<p></p>).
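
For example, a program with a title followed by a section heading could be encoded for now as in this sketch (the text is invented for illustration):

<p>
Annual Spring Concert<lb/>
</p>
<p>
Order of Events<lb/>
</p>
<p>
The evening will open with a welcome<lb/>
from the director.<lb/>
</p>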

2.11.2 Letters

Letters have openings, closings, envelopes, and other items. For now, just encode every separate block of information as a paragraph (<p></p>).
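
As a sketch, the opening, body, and closing of a short letter could each become a paragraph for now (the text is invented for illustration):

<p>
Dear Mother,<lb/>
</p>
<p>
I received your letter<lb/>
yesterday.<lb/>
</p>
<p>
Your loving son,<lb/>
James<lb/>
</p>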

2.11.3 Non-contiguous text

Many documents will have text that jumps around in a non-linear fashion or otherwise doesn't flow neatly from top to bottom across all the document images. For now, just transcribe everything in the order that you think it should be read.

 

Can I do this markup with AI?

Partially, perhaps, at least for some of the more basic elements on this page. You can transcribe the text and then ask ChatGPT to mark it up in TEI-XML. You may get the best results if you tell it the specific elements you want it to use. For instance, "Mark up the following text in TEI-XML using the elements <p>, <lb>, [etc.]: [text of transcription]." You could also try a more generic prompt, not specifying elements: "Mark up the following text in TEI-XML: [text of transcription]." In either case, you will then need to make the results conform to the criteria on this page.

If you find a way to do markup effectively with AI, please inform Dr. McCarl.

Step 3. Document any problems and move on

If you come across items that you don't know how to handle, send your questions to Dr. McCarl. In the meantime, place a comment at the relevant point in the file, documenting your concern, such as:

***** <!-- I'm not sure what to do with these asterisks. -->

You do not need to wait to receive answers to all your questions before continuing to step 4 below. It's fine to leave placeholder comments in the file and move on.

Step 4. Make sure your file is well formed and valid

Before you will be able to view it on the server, you must make sure your file is well formed and valid. If you don't recall what this means, refer to 1.3 above.

Make sure that you see "XML is valid" in the lower-left-hand corner of the VS Code window. If you do not, look through the file for the red squiggly lines.

If you get stuck and can't resolve all the errors in your file, you can email it to Dr. McCarl for assistance. If you do so, make no further changes to the file until Dr. McCarl returns it to you. When he does so, make sure that you work from that point in the revised file, not your original copy.

Step 5. Place your file in the OneDrive folder

Once your file is free of problems, save it and then put a copy into the OneDrive folder of files to transfer. Remember that you already have access to this folder, but may need to log in to your UNF email to access it.

Step 6. View your modified file and fix problems

Once your file has been transferred, you will be able to see it on the server. If you already have the page open, you will need to reload it to see your changes. If you don't have it open, remember that you can get to it with the corresponding URL on our document spreadsheet.

You can toggle back and forth between the transcription and reading versions of the file. 

If you see any problems, go back and fix them, using the documentation on this page. Then place your file in the OneDrive folder of files to transfer.

If the XML file does not load in the browser and remains blank, your file probably isn't well formed and/or valid. If that's the case, return to step 4 above, fix any problems, and place it again in the OneDrive folder of files to transfer.