Introduction to Markup, XML, and TEI

Introduction

This page is for anthology leads, editors, and research assistants new to TEI markup. It can also be used in undergraduate and graduate classrooms as a quick introduction to markup, XML languages in general, and TEI-XML in particular.

Why Do We Mark Up Texts?

Unlike past generations of editors, we are producing texts that must be readable by machines before they are rendered and made readable by humans. Thus, virtually every editorial choice must be tagged in such a way that a computer can process it in various ways. We call this level of machine-readable information markup.
Markup has an additional advantage: we can process a marked-up text in many different ways. We can change how we render it. We can give readers the choice to display or suppress corrections, display or normalize the long s, show supplied readings, and expand abbreviations. We can transform the marked-up text into many different types of outputs: HTML pages for display on a website, PDFs, ePubs, camera-ready print copies, actors’ scripts, and other XML languages. We can index it, link to it, generate concordances from it, count things in it (e.g., words, stage directions, lines, speeches), search it, and store it for long-term digital archiving. The effort you put into markup, therefore, makes your text extraordinarily valuable for many users and for diverse purposes.

What is Markup?

Markup is information added to a text in order to say something about a text. As a skilled reader of texts (and possibly already an experienced editor), you already have an incipient understanding of textual markup. White space, paragraph breaks, italicization, punctuation, capitalization, square brackets, and other features of a printed text are all forms of markup that signal something to the reader about the textual content.
We add markup to our own documents as we write: inserting spaces and paragraph breaks, italicizing titles, and punctuating our clauses. We don’t think too much about the fact that italics (for example) can signal multiple things: foreign words, monograph titles, things we want to emphasize, and words that we want to talk about as words. Early modern writers and compositors did likewise. Both now and then, context helps make meaning clear. For example, we recognize monograph titles in bibliographies not just because they are italicized but also because they occupy the place in the sequence of metadata where we usually record titles. Likewise, we recognize stage directions in early modern texts not just because they are often italicized but also because of what they say and where they are located on the page.
As we shall see, TEI markup needs to be more precise because computers are not as good at reading contextual clues as we are. Nonetheless, you have an incipient understanding of markup because you’ve been doing it and interpreting it for your entire reading life.

Terminology

Some terms will come up many times in this tutorial.
Tagging, marking up, and encoding are interchangeable terms.
The information added to a text is markup.
When we add markup to a text, we tag, mark up, or encode the text.
Text with markup is called tagged, marked-up, or encoded text.
You will also see markup (the noun) spelled mark-up or mark up. At LEMDO, we use markup to refer to the tags you will add to your text, and reserve mark up for the action of adding markup to a text.

LEMDO’s Markup Language

LEMDO uses a markup language known as TEI-XML. It is a dialect of XML devised by the Text Encoding Initiative (thus the acronym TEI), a consortium of people who came together to devise a markup language specifically for text-bearing objects (e.g., manuscripts, books, documents, scripts, computer-mediated communication, performance texts).
If you are familiar with the Internet Shakespeare Editions’ Markup Language (IML) that was used to prepare texts for the old ISE platform, you will notice many similarities between IML and TEI. You will also notice many points where TEI is much simpler than IML, and a few points where TEI is more complex.
The value of encoding our editions in TEI is that our texts become usable in multiple platforms and processable by many open-access tools. The value of learning TEI is that you gain a transferable skill. Thousands of digital editions are encoded in TEI. So are legal cases, library records, parliamentary records, account books, and their associated metadata.

What is XML?

XML stands for eXtensible Markup Language. XML itself is not a single language, but a metalanguage prescribing a set of standards for writing XML languages. A first draft of the standard was published in 1996 by the World Wide Web Consortium (W3C), and XML 1.0 became a W3C Recommendation in 1998.
The XML specification is long and we do not recommend that you read it just yet. But if you really get into XML, the specification is hosted at https://www.w3.org/TR/2008/REC-xml-20081126/. In the meantime, you’ll learn more about the basics of XML later in this Quickstart.

What Does Markup Look Like?

Let’s start with an example using italics. In a printed or word-processed text, italics can indicate many different things:
Do you know what the word palimpsest means?
In Measure for Measure, Angelo is the main antagonist.
But since at Wakefield in a battell pitcht ….
These came by degrees, as additamenta honoris.
A human reader can read ambiguous markup. When we see italics, we can infer the meaning of the italics through contextual clues. A computer, however, is not very good at making contextual inferences about italics. As encoders, we do not use ambiguous markup. We use precise markup to describe how the word or phrase functions in our sentence. If we think about the examples above, we can work out that all the examples of italicized text are italicized for different reasons. We can have unique tags for each situation.
Words as words:
<p>Do you know what the word <term>palimpsest</term> means?</p>
Title:
<p>In <title level="m">Measure for Measure</title>, Angelo is the main antagonist.</p>
Italics in source:
<p>But since at <hi rendition="rnd:italic">Wakefield</hi> in a battell pitcht ….</p>
Foreign words:
<p>These came by degrees, as <foreign xml:lang="la">additamenta honoris</foreign>.</p>
These are all examples of descriptive markup. We aren’t saying anything about how we want the words and phrases to appear in our output. We decide later (in conjunction with a designer/developer) how we want the marked-up text to appear on a computer screen or in the print output we generate from our marked-up text. This precision in tagging also allows us to change our rendering in the future, if the MLA Handbook or the Chicago Manual of Style prescribes new treatment of foreign words, or if we need to output our encoded text in a way that conforms to an entirely different style manual.[1]
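To make the separation between description and rendering concrete, here is a minimal sketch of rendering instructions written in XSLT 1.0 (itself an XML language). It is an illustration only, not LEMDO’s actual stylesheet: the choice of HTML output elements is an assumption, and it assumes the four elements sit in the TEI namespace, as they would inside a TEI file.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <xsl:output method="html"/>

  <!-- A TEI paragraph becomes an HTML paragraph. -->
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Monograph titles: output as <cite> (an illustrative choice). -->
  <xsl:template match="tei:title[@level='m']">
    <cite><xsl:apply-templates/></cite>
  </xsl:template>

  <!-- Words as words and foreign words: italic for now. -->
  <xsl:template match="tei:term | tei:foreign">
    <i><xsl:apply-templates/></i>
  </xsl:template>

  <!-- Italics recorded from the source text. -->
  <xsl:template match="tei:hi[@rendition='rnd:italic']">
    <i><xsl:apply-templates/></i>
  </xsl:template>

</xsl:stylesheet>
```

If a style manual later called for foreign words in roman type, only the template that matches tei:foreign would need to change. The encoded text itself would stay exactly as it is.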

Elements, Attributes, and Values

All XML languages have components that we call elements, attributes, and values. You need to understand these terms—and text node, string, opening tag, closing tag, and empty element—before you read any other documentation.
An element is the descriptive tag that identifies an item in the text. In the following example, <title> is the element:
<title>Measure for Measure</title>
You can think of an element as being like a noun. The element describes what something is. In this case, Measure for Measure is a title. By tagging this string of alphanumeric characters with the <title> element, we are ensuring that the computer processor treats the string as a title.
The character, string, word, phrase, or text you are marking up is called the text node. In the previous example, Measure for Measure is the text node. The text node is the thing you are describing, identifying, and/or clarifying with your markup.
Every element has both an opening tag and a closing tag. Elements wrap around the text node, clearly marking the beginning and the end of the text node about which you want to say something. The opening tag is <title> and the closing tag is </title>. Note that the closing tag in XML begins with a forward slash character /.
What if you want to say more about the text that you’ve identified as a title? For example, you might want to indicate that Measure for Measure is the title of a book-like thing. You can add an attribute to your opening element tag:
<title level="m">Measure for Measure</title>
You can think about an attribute as a big category. Attributes are incomplete by themselves. You have to add a specific value that describes the text node. In this example, the value of the @level attribute is "m" (for monograph-like).
To recapitulate using a non-TEI example:
<animal size="big" colour="grey">elephant</animal>
The element describes what the text node is (animal).
The attribute is a category or quality of the element (size, colour).
The value is the precise value of the attribute (big, grey).
Colour is to grey as an attribute is to its value.
Note that the attribute and value go in the opening tag only. The closing tag does not repeat the attribute and value.
Elements can have more than one attribute. In this example, as above, we’re using the recommended TEI attribute and value for <title>. The value "m" of the @level attribute means “monograph or monograph-like work”. The @when attribute allows us to add a date to the title:
<title level="m" when="1603">Measure for Measure</title>
While particular elements, attributes, and values vary from one XML language to another, the structure of an XML element is always the same:
<element attribute="value">text node</element>

Empty Elements

Some elements do not have a text node. These elements are called empty elements, milestone elements, or self-closing elements. Common milestone elements in LEMDOʼs semi-diplomatic texts are <lb> (“line beginning”), <pb> (“page beginning”), and <cb> (“column beginning”). Milestone elements can be written with an opening tag and a closing tag, or you can abbreviate the closing tag by adding a forward slash at the end of an opening tag to indicate that it is both the opening and the closing tag of a content-less element.
<ab>
  <lb type="wln" n="2"/>TO ſing a Song that old was ſung,
  <lb type="wln" n="3"/>From aſhes, auntient Gower is come,
</ab>
Empty elements can still have attributes and values, but they do not wrap around a text node.
<pb n="3|A2v" facs="facs:MV_Q1_BPL_03.jpg"/>
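The two ways of writing an empty element described above can be sketched as follows. Both lines reuse the page-beginning example from this section. An XML parser treats the two forms as the same element. The self-closing form is the one used in the examples on this page; whether a given project permits the long form is a matter of local practice and is not stated here.

```xml
<!-- Self-closing form: one tag that opens and closes the element. -->
<pb n="3|A2v" facs="facs:MV_Q1_BPL_03.jpg"/>

<!-- Long form: an opening tag followed immediately by a closing tag. -->
<pb n="3|A2v" facs="facs:MV_Q1_BPL_03.jpg"></pb>
```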

XML Structure

XML markup languages are hierarchical. Elements are contained by (or nested within) other elements. In order to talk about the relationship between nested elements, we use the terms parent, child, and sibling. We use these terms frequently in our documentation. Make sure you understand them before you read additional documentation.
Parent elements contain child elements. If a parent has multiple child elements, those child elements are siblings of each other. In the following example, <p> is the parent of <list> because it contains the <list> element. <list> is the child of <p> because it is contained within the <p> element. Note that the hierarchy can have multiple levels: <list> is the parent of <item>, and <item> is the child of <list>. We have three sibling <item> elements:
<p>According to the Shakespeare Census Project, copies survive at the following locations: <list>
  <item>British Library</item>
  <item>Huntington Library</item>
  <item>Folger Shakespeare Library</item>
</list> The British Library copy is the control text for the semi-diplomatic transcription.</p>
Note that an element can be both a parent and a child if it is contained by another element and contains an element. To phrase it another way, an element can have both a parent and a child.
An XML document must be well-formed. In a well-formed XML document, all child elements are closed with a closing tag before the parent element is closed. Another way to put this requirement: all tags must be closed in the reverse order they were opened.
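The closing-order rule above can be illustrated with a short sketch that reuses the title example from earlier on this page. The faulty version is placed inside an XML comment so that the example itself remains well-formed.

```xml
<!-- Well-formed: <title> opens inside <p> and closes before <p> closes. -->
<p>In <title level="m">Measure for Measure</title>, Angelo is the main antagonist.</p>

<!-- Not well-formed: <p> is closed while its child <title> is still open,
     so the tags overlap instead of nesting.
     <p>In <title level="m">Measure for Measure</p></title>
-->
```

An XML editor will normally refuse to process the second version and will report an error at the point where the tags overlap.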

What is TEI?

The Text Encoding Initiative is two things: (1) a consortium, and (2) an XML standard. The TEI Consortium describes itself and its work thus:
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI. (https://tei-c.org/)
The TEI standard predates XML, but became XML-compliant after XML was introduced in the late 1990s. TEI is now a global standard published in eight languages (English, French, Spanish, Italian, German, Japanese, Mandarin Chinese, and Korean).
The TEI standard is captured in a robust set of guidelines known as the TEI Guidelines. The first set of guidelines, TEI P1, was released in 1990. The current edition of the guidelines is P5.
The standard is driven by the community of users. The guidelines have evolved continuously since the TEI was first proposed in November 1987. Anyone can post questions and suggestions for the TEI Technical Council to the TEI listserv. There is also an active GitHub channel for ticketing and commenting. The TEI Technical Council[2] meets monthly to respond to community requests and to improve both the standard and the guidelines. The community gathers for an annual meeting and conference, which features keynotes, conference papers, poster sessions, meetings of special interest groups, and business and council meetings.

TEI File Structure

The first line in the file starts with a pointy bracket and a question mark. It will be purple if you have opened this file in the default view in Oxygen:

<?xml version="1.0" encoding="UTF-8"?>
This first line tells the processor that this present file is an XML file. Do not alter the first line at all.
Green text in the file is an XML comment:
<p><!-- This is a comment. --></p>
You can leave XML comments in your file. Note that an XML comment begins with a less-than angle bracket, an exclamation mark, and two hyphens (<!--), usually followed by a space, and ends with two hyphens and a greater-than angle bracket (-->). Everything written inside an XML comment is ignored by the processor. XML comments are how we as encoders and editors can leave notes in the file for ourselves or other editors.
The second line in the file contains the opening tag of the root element. The root element contains the entire file (i.e., it is the biggest container in the document and contains all the other tags). For a TEI-XML document, the root element is <TEI>:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" version="5.0" xml:id="emdRom_edition">
Because all the other elements are nested within the root element, the file ends with the closing tag of the root element, </TEI>. Scroll down to the very bottom of the file to see it.
Note the attributes on the root element. The two you need to understand now are @xmlns and @xml:id.
Namespace: The @xmlns attribute stands for XML namespace. Every XML language has its own namespace to which its elements and attributes belong (which is a bit like saying that French words belong to the French language, English words belong to the English language, and so on). The value of @xmlns is a URL that points to a page on the TEI website. This value tells the processor that we are encoding this file in TEI-XML (i.e., we are using the elements and attributes defined by the TEI instead of elements and attributes defined by another XML language).
The document id: The @xml:id attribute stands for XML identifier. This string of letters and numbers is unique within the project. No other file or piece of data or element may have the XML identifier that we give to this file.[3]
The <TEI> root element has two children in most LEMDO files: a <teiHeader> element and a <text> element. You will put the metadata about the text in the <teiHeader>. The content you are encoding goes in the <text> element.
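The overall shape described above can be sketched as a bare skeleton. This is an illustration under stated assumptions, not a LEMDO template: the xml:id value and all placeholder text are hypothetical, and the elements inside <teiHeader> and <text> follow the minimal header described in the TEI Guidelines rather than anything specified on this page. A real LEMDO file will contain considerably more.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The xml:id value "emdExample_edition" is hypothetical. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0" version="5.0" xml:id="emdExample_edition">
  <!-- First child of the root: metadata about the text. -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication statement.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Second child of the root: the content being encoded. -->
  <text>
    <body>
      <p>The content you are encoding goes here.</p>
    </body>
  </text>
</TEI>
```

Note that <teiHeader> and <text> are siblings, and that the closing </TEI> tag is the last line of the file.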

Further Reading

Read more about markup language in general: Wikipedia.
Read more about XML markup languages in general: Wikipedia.
Read a highly accessible description of TEI: Burnard, Lou. What is the Text Encoding Initiative? How to Add Intelligent Markup to Digital Resources. Marseille: OpenEdition Press, 2014. http://books.openedition.org/oep/426
Go to the TEI Guidelines: P5: Guidelines for Electronic Text Encoding and Interchange. https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html
See how other projects use TEI markup: Melissa Terras, Edward Vanhoutte, and Ron Van den Branden, TEI by Example (https://teibyexample.org).

Notes

1. This affordance is what makes citation managers able to output references in multiple styles, for example. There’s underlying markup that identifies the components of a bibliography entry so precisely that the information can be restyled as needed simply by writing rendering instructions for each different marked-up component.
2. LEMDO Project Director Janelle Jenstad, founding Lead Programmer Joey Takeda, and current Lead Programmer Martin Holmes have all served or currently serve on the TEI Technical Council.
3. It is a specific rule of the LEMDO platform that the xml:id you give to a document and the file name by which you save the document must be the same.

Prosopography

Chloe Mee

Chloe Mee is a research assistant on the LEMDO team who is working as a remediator on Old Spelling texts. She is about to start her second year at UVic in Fall 2022 and is pursuing an Honours degree in English. Currently, she is working on the LEMDO team through a VKURA internship. She loves literature and is enjoying the opportunity to read and encode Shakespeare quartos!

Janelle Jenstad

Janelle Jenstad is a Professor of English at the University of Victoria, Director of The Map of Early Modern London, and Director of Linked Early Modern Drama Online. With Jennifer Roberts-Smith and Mark Kaethler, she co-edited Shakespeare’s Language in Digital Media: Old Words, New Tools (Routledge). She has edited John Stow’s A Survey of London (1598 text) for MoEML and is currently editing The Merchant of Venice (with Stephen Wittek) and Heywood’s 2 If You Know Not Me You Know Nobody for DRE. Her articles have appeared in Digital Humanities Quarterly, Elizabethan Theatre, Early Modern Literary Studies, Shakespeare Bulletin, Renaissance and Reformation, and The Journal of Medieval and Early Modern Studies. She contributed chapters to Approaches to Teaching Othello (MLA); Teaching Early Modern Literature from the Archives (MLA); Institutional Culture in Early Modern England (Brill); Shakespeare, Language, and the Stage (Arden); Performing Maternity in Early Modern England (Ashgate); New Directions in the Geohumanities (Routledge); Early Modern Studies and the Digital Turn (Iter); Placing Names: Enriching and Integrating Gazetteers (Indiana); Making Things and Drawing Boundaries (Minnesota); Rethinking Shakespeare Source Study: Audiences, Authors, and Digital Technologies (Routledge); and Civic Performance: Pageantry and Entertainments in Early Modern London (Routledge). For more details, see janellejenstad.com.

Joey Takeda

Joey Takeda is LEMDO’s Consulting Programmer and Designer, a role he assumed in 2020 after three years as the Lead Developer on LEMDO.

Martin Holmes

Martin Holmes has worked as a developer in UVicʼs Humanities Computing and Media Centre for over two decades, and has been involved with dozens of Digital Humanities projects. He has served on the TEI Technical Council and as Managing Editor of the Journal of the TEI. He took over from Joey Takeda as lead developer on LEMDO in 2020. He is a collaborator on the SSHRC Partnership Grant led by Janelle Jenstad.

Navarra Houldin

Project manager 2022–present. Textual remediator 2021–present. Navarra Houldin (they/them) completed their BA in History and Spanish at the University of Victoria in 2022. During their degree, they worked as a teaching assistant with the University of Victoriaʼs Department of Hispanic and Italian Studies. Their primary research was on gender and sexuality in early modern Europe and Latin America.

Nicole Vatcher

Technical Documentation Writer, 2020–2022. Nicole Vatcher completed her BA (Hons.) in English at the University of Victoria in 2021. Her primary research focus was womenʼs writing in the modernist period.

Tracey El Hajj

Junior Programmer 2019–2020. Research Associate 2020–2021. Tracey received her PhD from the Department of English at the University of Victoria in the field of Science and Technology Studies. Her research focuses on the algorhythmics of networked communications. She was a 2019–2020 President’s Fellow in Research-Enriched Teaching at UVic, where she taught an advanced course on Artificial Intelligence and Everyday Life. Tracey was also a member of the Map of Early Modern London team, between 2018 and 2021. Between 2020 and 2021, she was a fellow in residence at the Praxis Studio for Comparative Media Studies, where she investigated the relationships between artificial intelligence, creativity, health, and justice. As of July 2021, Tracey has moved into the alt-ac world for a term position, while also teaching in the English Department at the University of Victoria.

Orgography

LEMDO Team (LEMD1)

The LEMDO Team is based at the University of Victoria and normally comprises the project director, the lead developer, the project manager, junior developer(s), remediators, encoders, and remediating editors.
