We will build an edition that permits us to view the text in two states:

  1. A diplomatic transcription that records the document as-is. To the best of our ability, we will preserve spelling, punctuation, word spacing, use of capital/lower-case letters, abbreviations, line breaks, and any other formal aspect that we can represent in XML.
  2. A partially regularized reading text. This version is “partially regularized” because our regularization addresses only those aspects of the text that have no phonological significance (in other words, those features whose modernization does not affect the sound of the text). These include the correction of misspellings (but not dialectal usages; see below) and the standardization of word spacing and punctuation. We are not altering the grammar, syntax, morphology, or anything else that would change the way the text might sound when read aloud. When we encounter language that reflects a particular dialect or otherwise departs from “standard” usage of English in the U.S., we will leave the text as-is without attempting to modernize any features. For example, we would not attempt any regularization in the following sentence: “Mr. Kilgo was de fust overseer I member; I was big enuf to tote meat an stuff fum de smokehouse to de kitchen an to tote water in an git wood fur granny to cook de dinner an fur de sucklers who nu’sed de babies, an I ca’ied dinners back to de hans”.

For each document in the collection, these two versions will be encoded in a single <text> element, using the <choice> element to layer the diplomatic transcription and the regularized reading text. For each text, we will also encode semantic features such as dates and proper names, so that we can later experiment with options for analysis and display.
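
As a rough illustration of this layering, the fragment below sketches how a single sentence might be encoded. The sentence itself is invented for the example and does not come from the collection. Only <text> and <choice> are named above; the pairing of <sic>/<corr> for misspellings and <orig>/<reg> for word spacing, along with <lb/>, <persName>, <placeName>, and <date>, are assumptions about which standard TEI elements the project might adopt.

```xml
<!-- Hypothetical sample sentence, invented for illustration only. -->
<text>
  <body>
    <p>
      <!-- Misspelling: the diplomatic view shows <sic>, the reading text shows <corr>. -->
      I <choice><sic>recieved</sic><corr>received</corr></choice> your letter
      <!-- Word spacing: <orig> keeps the source form, <reg> gives the regularized form. -->
      <choice><orig>to day</orig><reg>today</reg></choice> from<lb/>
      <!-- Semantic features: names and dates tagged for later analysis and display. -->
      <persName>John Smith</persName>, dated
      <date when="1865-04-09">April 9th, 1865</date>, at
      <placeName>Richmond</placeName>.
    </p>
  </body>
</text>
```

Under this approach, a stylesheet or script could produce the diplomatic transcription by selecting the <sic> and <orig> children of each <choice>, and the reading text by selecting <corr> and <reg>. Dialect forms would simply be transcribed as plain text, with no <choice> wrapper, since they are not regularized.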