LEMDO: Introduction to Markup, XML, and TEI

Introduction to Markup, XML, and TEI

Introduction

This page is for anthology leads, editors, and research assistants new to TEI markup. It can also be used in undergraduate and graduate classrooms as a quick introduction to markup, XML languages in general, and TEI-XML in particular.

Why Do We Mark Up Texts?

Unlike past generations of editors, we are producing texts that must be readable by machines before they are rendered and made readable by humans. Thus, virtually every editorial choice must be tagged in such a way that a computer can process it in various ways. We call this level of machine-readable information markup.

Markup has an additional advantage: we can process a marked-up text in many different ways. We can change how we render it. We can give readers the choice to display or suppress corrections, display or normalize the long s, show supplied readings, and expand abbreviations. We can transform the marked-up text into many different types of outputs: HTML pages for display on a website, PDFs, ePubs, camera-ready print copies, actors’ scripts, and other XML languages. We can index it, link to it, generate concordances from it, count things in it (e.g., words, stage directions, lines, speeches), search it, and store it for long-term digital archiving. The effort you put into markup, therefore, makes your text extraordinarily valuable for many users and for diverse purposes.

What Is Markup?

Markup is information added to a text in order to say something about a text. As a skilled reader of texts (and possibly already an experienced editor), you already have an incipient understanding of textual markup. White space, paragraph breaks, italicization, punctuation, capitalization, square brackets, and other features of a printed text are all forms of markup that signal something to the reader about the textual content.

We add markup to our own documents as we write, inserting spaces and paragraph breaks, italicizing titles, and punctuating our clauses. We don’t think too much about the fact that italics (for example) can signal multiple things: foreign words, monograph titles, things we want to emphasize, and words that we want to talk about as words. Early modern writers and compositors did likewise. Both now and then, context helps make meaning clear. For example, we recognize monograph titles in bibliographies not just because they are italicized but also because they occupy the place in the sequence of metadata where we usually record titles. Likewise, we recognize stage directions in early modern texts not just because they are often italicized but also because of what they say and where they are located on the page.

As we shall see, TEI markup needs to be more precise because computers are not as good at reading contextual clues as we are. Nonetheless, you have an incipient understanding of markup because you’ve been doing it and interpreting it for your entire reading life.

Terminology

Some terms will come up many times in this tutorial.

Tagging, marking up, and encoding are interchangeable terms.

The information added to a text is markup.

When we add markup to a text, we tag, mark up, or encode the text.

Text with markup is called tagged, marked-up, or encoded text.

You will also see markup (the noun) spelled mark-up or mark up. At LEMDO, we use markup to refer to the tags you will add to your text, and reserve mark up for the action of adding markup to a text.

LEMDO’s Markup Language

LEMDO uses a markup language known as TEI-XML. It is a dialect of XML devised by the Text Encoding Initiative (thus the acronym TEI), a consortium of people who came together to devise a markup language specifically for text-bearing objects (e.g., manuscripts, books, documents, scripts, computer-mediated communication, performance texts).

If you are familiar with the Internet Shakespeare Editions’ Markup Language (IML) that was used to prepare texts for the old ISE platform, you will notice many similarities between IML and TEI. You will also notice many points where TEI is much simpler than IML, and a few points where TEI is more complex.

The value of encoding our editions in TEI is that our texts become usable in multiple platforms and processable by many open-access tools. The value of learning TEI is that you gain a transferable skill. Thousands of digital editions are encoded in TEI. So are legal cases, library records, parliamentary records, account books, and their associated metadata.

What Is XML?

XML stands for eXtensible Markup Language. XML itself is not a single language, but a metalanguage prescribing a set of standards for writing XML languages. The World Wide Web Consortium published the first working draft of the standard in 1996, and XML 1.0 became a W3C Recommendation in 1998.

The XML specification is long and we do not recommend that you read it just yet. But if you really get into XML, the specification is hosted on the World Wide Web Consortium’s website. In the meantime, you’ll learn more about the basics of XML later in this Quickstart.

What Does Markup Look Like?

Let’s start with an example using italics. In a printed or word-processed text, italics can indicate many different things:

Do you know what the word palimpsest means?

In Measure for Measure, Angelo is the main antagonist.

But since at Wakefield in a battell pitcht ….

These came by degrees, as additamenta honoris.

A human reader can read ambiguous markup. When we see italics, we can infer the meaning of the italics through contextual clues. A computer, however, is not very good at making contextual inferences about italics. As encoders, we do not use ambiguous markup. We use precise markup to describe how the word or phrase functions in our sentence. If we think about the examples above, we can work out that all the examples of italicized text are italicized for different reasons. We can have unique tags for each situation.

Words as words:

Do you know what the word <term>palimpsest</term> means?

Title:

In <title level="m">Measure for Measure</title>, Angelo is the main antagonist.

Italics in source:

But since at <hi rendition="rnd:italic">Wakefield</hi> in a battell pitcht ….

Foreign words:

These came by degrees, as <foreign xml:lang="la">additamenta honoris</foreign>.

These are all examples of descriptive markup. We aren’t saying anything about how we want the words and phrases to appear in our output. We decide later (in conjunction with a designer/developer) how we want the marked-up text to appear on a computer screen or in the print output we generate from our marked-up text. This precision in tagging also allows us to change our rendering in the future, if the MLA Handbook or the Chicago Manual of Style prescribes new treatment of foreign words, or if we need to output our encoded text in a way that conforms to an entirely different style manual.¹
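To make the separation between description and rendering concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. It is an illustration only, not LEMDO's actual processing pipeline: the wrapping <s> element, the render function, and the two output styles are all assumptions made for this example. The same tagged sentence is rendered two different ways without touching the markup.

```python
import xml.etree.ElementTree as ET

# The foreign-phrase example from above, wrapped in a hypothetical <s>
# element so that the fragment has a single root element.
SOURCE = (
    '<s>These came by degrees, as '
    '<foreign xml:lang="la">additamenta honoris</foreign>.</s>'
)

def render(xml_string, foreign_open, foreign_close):
    """Return the sentence as a string, wrapping the content of each
    <foreign> element in the given opening and closing delimiters."""
    root = ET.fromstring(xml_string)
    parts = [root.text or ""]
    for child in root:
        content = child.text or ""
        if child.tag == "foreign":
            content = foreign_open + content + foreign_close
        parts.append(content)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

# Two renderings of the same descriptive markup.
italic_html = render(SOURCE, "<i>", "</i>")
plain_quoted = render(SOURCE, "'", "'")
```

Because the markup records only that the phrase is foreign, a change of house style means changing the rendering rule, not re-editing the text.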

Elements, Attributes, and Values

All XML languages have components that we call elements, attributes, and values. You need to understand these terms—and text node, string, opening tag, closing tag, and empty element—before you read any other documentation.

An element is the descriptive tag that identifies an item in the text. In the following example, <title> is the element:

<title>Measure for Measure</title>

You can think of an element as being like a noun. The element describes what something is. In this case, Measure for Measure is a title. By tagging this string of alphanumeric characters with the <title> element, we are ensuring that the computer processor treats the string as a title.

The character, string, word, phrase, or text you are marking up is called the text node. In the previous example, Measure for Measure is the text node. The text node is the thing you are describing, identifying, and/or clarifying with your markup.

Every element has both an opening tag and a closing tag. Elements wrap around the text node, clearly marking the beginning and the end of the text node about which you want to say something. The opening tag is <title> and the closing tag is </title>. Note that the closing tag in XML has a forward slash character (/) immediately after its opening angle bracket.

What if you want to say more about the text that you’ve identified as a title? For example, you might want to indicate that Measure for Measure is the title of a book-like thing. You can add an attribute to your opening element tag:

<title level="m">Measure for Measure</title>

You can think about an attribute as a big category. Attributes are incomplete by themselves. You have to add a specific value that describes the text node. In this example, the value of the @level attribute is "m" (for monograph-like).

To recapitulate using a non-TEI example:

<animal size="big" colour="grey">elephant</animal>

The element describes what the text node is (animal).

The attribute is a category or quality of the element (size, colour).

The value is the precise value of the attribute (big, grey).

Colour is to grey as an attribute is to its value.

Note that the attribute and value go in the opening tag only. The closing tag does not repeat the attribute and value.
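If you are curious how a processor sees these three components, the following sketch parses the elephant example with Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions made for illustration; any XML processor exposes the same three pieces of information.

```python
import xml.etree.ElementTree as ET

# Parse the non-TEI example from above.
element = ET.fromstring('<animal size="big" colour="grey">elephant</animal>')

element_name = element.tag          # the element: what the text node is
attributes = dict(element.attrib)   # each attribute paired with its value
text_node = element.text            # the text that the element wraps around
```

Here element_name is "animal", attributes pairs size with big and colour with grey, and text_node is "elephant".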

Elements can have more than one attribute. In this example, as above, we’re using the recommended TEI attribute and value for <title>. The value "m" of the @level attribute means “monograph or monograph-like work”. The @when attribute allows us to add a date to the title:

<title level="m" when="1603">Measure for Measure</title>

While particular elements, attributes, and values vary from one XML language to another, the structure of an XML element is always the same:

<element attribute="value">text node</element>

Empty Elements

Some elements do not have a text node. These elements are called empty elements, milestone elements, or self-closing elements. Common milestone elements in LEMDO’s semi-diplomatic transcriptions are <lb> (“line beginning”), <pb> (“page beginning”), and <cb> (“column beginning”). Milestone elements can be written with an opening tag and a closing tag, or you can abbreviate the closing tag by adding a forward slash at the end of an opening tag to indicate that it is both the opening and the closing tag of a content-less element.

<ab>
<lb type="wln" n="2"/>TO sing a Song that old was sung, <lb type="wln" n="3"/>From ashes, auntient Gower is come,</ab>

Empty elements can still have attributes and values, but they do not wrap around a text node.
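The two ways of writing an empty element are equivalent to a processor. The sketch below, which assumes Python's standard xml.etree.ElementTree module purely for illustration, parses both forms of <lb> and then reads the lines from the <ab> example above. Note that the words of each line are not inside <lb>; they follow it.

```python
import xml.etree.ElementTree as ET

# The abbreviated (self-closing) form and the long form of the same element.
self_closed = ET.fromstring('<lb type="wln" n="2"/>')
long_form = ET.fromstring('<lb type="wln" n="2"></lb>')

# Both forms carry the same name and attributes, and neither has a text node.
same_element = (
    self_closed.tag == long_form.tag
    and self_closed.attrib == long_form.attrib
    and self_closed.text is None
    and long_form.text is None
)

# The <ab> example from above: each line of text follows its <lb> milestone.
ab = ET.fromstring(
    '<ab><lb type="wln" n="2"/>TO sing a Song that old was sung, '
    '<lb type="wln" n="3"/>From ashes, auntient Gower is come,</ab>'
)
lines_by_number = {lb.get("n"): (lb.tail or "").strip() for lb in ab.iter("lb")}
```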

XML Structure

XML markup languages are hierarchical. Elements are contained by (or nested within) other elements. In order to talk about the relationship between nested elements, we use the terms parent, child, and sibling. We use these terms frequently in our documentation. Make sure you understand them before you read additional documentation.

Parent elements contain child elements. If a parent has multiple child elements, those child elements are siblings of each other. In the following example, the <p> element is the parent of <list> because it contains the <list> element. The <list> element is the child of <p> because it is contained within the <p> element. Note that the hierarchy can have multiple levels: <list> is the parent of <item>. The <item> element is the child of <list>. We have three sibling <item> elements:

<p>According to the Shakespeare Census Project, copies survive at the following locations: <list rend="bulleted">
 <item>British Library</item>
 <item>Huntington Library</item>
 <item>Folger Shakespeare Library</item>
</list> The British Library copy is the control text for the semi-diplomatic transcription.</p>

Note that an element can be both a parent and a child if it is contained by another element and contains an element. To phrase it another way, an element can have both a parent and a child.
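The parent, child, and sibling relationships can be read directly from a parsed document. The following sketch assumes Python's standard xml.etree.ElementTree module and wraps the library example in the <p> element that the explanation describes; the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p>According to the Shakespeare Census Project, copies survive at the '
    'following locations: <list rend="bulleted">'
    '<item>British Library</item>'
    '<item>Huntington Library</item>'
    '<item>Folger Shakespeare Library</item>'
    '</list> The British Library copy is the control text for the '
    'semi-diplomatic transcription.</p>'
)

# <p> is a parent: iterating over it yields its child elements.
children_of_p = [child.tag for child in p]

# <list> is both a child (of <p>) and a parent (of the <item> elements).
list_element = p.find("list")
sibling_items = [item.text for item in list_element]
```

Iterating over an element yields only its direct children, which is why <p> reports a single child (<list>) even though three <item> elements sit further down the hierarchy.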

An XML document must be well-formed. In a well-formed XML document, all child elements are closed with a closing tag before the parent element is closed. Another way to put this requirement: all tags must be closed in the reverse order they were opened.
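A parser refuses to read a document that is not well-formed. As a minimal sketch (assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical), the check below accepts properly nested tags and rejects tags closed in the wrong order.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the string parses as XML, and False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# <title> is closed before its parent <p> is closed: well-formed.
nested = is_well_formed('<p><title level="m">Measure for Measure</title></p>')

# <p> is closed while its child <title> is still open: not well-formed.
overlapping = is_well_formed('<p><title level="m">Measure for Measure</p></title>')
```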

What Is TEI?

The Text Encoding Initiative is two things: (1) a consortium, and (2) an XML standard. The TEI Consortium describes itself and its work thus:

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI. (https://tei-c.org/)

The TEI standard predates XML, but became XML-compliant after the XML standard was introduced in the late 1990s. TEI is now a global standard published in eight languages (English, French, Spanish, Italian, German, Japanese, Mandarin Chinese, and Korean).

The TEI standard is captured in a robust set of guidelines known as the TEI Guidelines. The first set of guidelines, TEI P1, was released in 1990. The current edition of the guidelines is P5.

The standard is driven by the community of users. The guidelines have evolved continuously since the TEI was first proposed in November 1987. Anyone can post questions and suggestions for the TEI Technical Council to the TEI listserv. There is also an active GitHub channel for ticketing and commenting. The TEI Technical Council² meets monthly to respond to community requests and to improve both the standard and the guidelines. The community gathers for an annual meeting and conference, which features keynotes, conference papers, poster sessions, meetings of special interest groups, and business and council meetings.

TEI File Structure

The first line in the file starts with a pointy bracket and a question mark. It will be purple if you have opened this file in the default view in Oxygen:

<?xml version="1.0" encoding="UTF-8"?>

This first line tells the processor that this present file is an XML file. Do not alter the first line at all.

Green text in the file is an XML comment:

<!-- This is an XML comment. -->

You can leave XML comments in your file. Note that an XML comment begins with a less-than angle bracket, an exclamation mark, and two hyphens, usually followed by a space, and ends with two hyphens and a greater-than angle bracket. Everything written inside an XML comment is ignored by the processor. XML comments are how we as encoders and editors can leave notes in the file for ourselves or other editors. Note that leaving a comment for someone else does not send them a notification, so you should let people know if you have left comments for them in a certain file.
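To see what it means for a processor to ignore a comment, consider this small sketch. It assumes Python's standard xml.etree.ElementTree module, whose default parser discards comments; the wording of the comment is invented for the example.

```python
import xml.etree.ElementTree as ET

# A paragraph containing a note left by an encoder (hypothetical wording).
p = ET.fromstring('<p>Measure for Measure<!-- Check this title. --></p>')

# The default parser drops the comment: <p> has no children, and its text
# content is exactly what it would be if the comment had never been written.
child_count = len(p)
text_content = "".join(p.itertext())
```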

The second line in the file contains the opening tag of the root element. The root element contains the entire file (i.e., it is the biggest container in the document and contains all the other tags). For a TEI-XML document, the root element is <TEI>. The root element is the first element in the file and contains all the other elements; in other words, all the other elements are nested within the root element. If you scroll down to the very bottom of the file, you will see that the file ends with the closing tag of this root element.

Note that there are two attributes on the root element: @xmlns and @xml:id.

Namespace: The @xmlns attribute stands for XML namespace. Every XML language has its own namespace to which its elements and attributes belong (which is a bit like saying that French words belong to the French language, English words belong to the English language, and so on). The value of @xmlns is a URL that points to a page on the TEI website. This value tells the processor that we are encoding this file in TEI-XML (i.e., we are using the elements and attributes defined by the TEI instead of elements and attributes defined by another XML language).
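The effect of the namespace is visible as soon as a processor reads the root element. In this sketch (assuming Python's standard xml.etree.ElementTree module), the namespace value http://www.tei-c.org/ns/1.0 is the TEI namespace, while the document ID hypotheticalDocId is made up for the example. The processor records every element name together with its namespace, so a search for a bare element name finds nothing.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# A skeletal TEI document; the xml:id value is hypothetical.
doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="hypotheticalDocId">'
    '<teiHeader/><text/></TEI>'
)

# ElementTree stores each name as "{namespace}localname".
root_name = doc.tag

# A bare name belongs to no namespace, so it does not match the TEI element.
found_without_namespace = doc.find("teiHeader") is not None
found_with_namespace = doc.find("{" + TEI_NS + "}teiHeader") is not None
```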

The document ID: The @xml:id attribute stands for XML identifier. This string of letters and numbers is unique within the project. No other file or piece of data or element may have the XML identifier that we give to this file.³
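Uniqueness is the kind of rule that a program can check for us. The sketch below is an assumption-laden illustration, not LEMDO's own validation: it uses Python's standard xml.etree.ElementTree module, a hypothetical helper named duplicate_ids, and invented identifiers, and it looks for repeated @xml:id values within a single document only (checking a whole project would mean running the same count across every file).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree expands the built-in "xml:" prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_ids(xml_string):
    """Return a sorted list of xml:id values used more than once."""
    root = ET.fromstring(xml_string)
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

# Invented identifiers: "p1" is used twice, which breaks the uniqueness rule.
sample = '<div xml:id="doc1"><p xml:id="p1"/><p xml:id="p1"/><p xml:id="p2"/></div>'
duplicates = duplicate_ids(sample)
```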

The <TEI> root element has two children in most LEMDO files: a <teiHeader> element and a <text> element. You will put the metadata about the text in the <teiHeader>. The content you are encoding goes in the <text> element.
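Putting the pieces of this section together, a skeletal TEI file can be parsed and its top-level structure listed. This is a sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, the xml:id value is hypothetical, and the two children are left empty apart from comments, whereas a real LEMDO file contains much more.

```python
import xml.etree.ElementTree as ET

skeleton = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="hypotheticalDocId">'
    '<teiHeader><!-- metadata about the text goes here --></teiHeader>'
    '<text><!-- the content you are encoding goes here --></text>'
    '</TEI>'
)

# Encode to bytes because the string begins with an XML declaration that
# names its own encoding.
root = ET.fromstring(skeleton.encode("utf-8"))

# Strip the "{namespace}" part to recover the plain element names.
root_name = root.tag.split("}")[1]
child_names = [child.tag.split("}")[1] for child in root]
```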

Notes

1. This affordance is what makes citation managers able to output references in multiple styles, for example. There’s underlying markup that identifies the components of a bibliography entry so precisely that the information can be restyled as needed simply by writing rendering instructions for each different marked-up component.

2. LEMDO Project Director Janelle Jenstad, founding Lead Programmer Joey Takeda, and current Lead Programmer Martin Holmes have all served or currently serve on the TEI Technical Council.

3. It is a specific rule of the LEMDO platform that the xml:id you give to a document and the file name by which you save the document must be the same.

Prosopography

Chloe Mee

Chloe Mee (she/her) worked as a research assistant with the LEMDO team over several periods from 2022 to 2025. She graduated from the University of Victoria in 2025 with a BA (Hons with distinction) in English. She will be studying at the University of British Columbia to complete her MA in English. Chloe collaborated with the LEMDO team on a VKURA internship in summer 2022, mainly focusing on Hamlet quartos. Following her internship, she also worked as a research assistant in 2022–23 and 2025.

Isabella Seales

Isabella Seales is a fourth-year undergraduate completing her Bachelor of Arts in English at the University of Victoria. She has a special interest in Renaissance and Metaphysical Literature. She is assisting Dr. Jenstad with the MoEML Mayoral Shows anthology as part of the Undergraduate Student Research Award program.

Janelle Jenstad

Janelle Jenstad is a Professor of English at the University of Victoria, Director of The Map of Early Modern London, and Director of Linked Early Modern Drama Online. With Jennifer Roberts-Smith and Mark Kaethler, she co-edited Shakespeare’s Language in Digital Media: Old Words, New Tools (Routledge). She has edited John Stow’s A Survey of London (1598 text) for MoEML and is currently editing The Merchant of Venice (with Stephen Wittek) and Heywood’s 2 If You Know Not Me You Know Nobody for DRE. Her articles have appeared in Digital Humanities Quarterly, Elizabethan Theatre, Early Modern Literary Studies, Shakespeare Bulletin, Renaissance and Reformation, and The Journal of Medieval and Early Modern Studies. She contributed chapters to Approaches to Teaching Othello (MLA); Teaching Early Modern Literature from the Archives (MLA); Institutional Culture in Early Modern England (Brill); Shakespeare, Language, and the Stage (Arden); Performing Maternity in Early Modern England (Ashgate); New Directions in the Geohumanities (Routledge); Early Modern Studies and the Digital Turn (Iter); Placing Names: Enriching and Integrating Gazetteers (Indiana); Making Things and Drawing Boundaries (Minnesota); Rethinking Shakespeare Source Study: Audiences, Authors, and Digital Technologies (Routledge); and Civic Performance: Pageantry and Entertainments in Early Modern London (Routledge). For more details, see janellejenstad.com.

Joey Takeda

Joey Takeda is LEMDO’s Consulting Programmer and Designer, a role he assumed in 2020 after three years as the Lead Developer on LEMDO.

Mahayla Galliford

Project manager, 2025–present; research assistant, 2021–present. Mahayla Galliford (she/her) graduated with a BA (Hons with distinction) from the University of Victoria in 2024. Mahayla’s undergraduate research explored early modern stage directions and civic water pageantry. Mahayla continues her studies through UVic’s English MA program and her SSHRC-funded thesis project focuses on editing and encoding girls’ manuscripts, specifically Lady Rachel Fane’s dramatic entertainments, in collaboration with LEMDO.

Martin Holmes

Martin Holmes has worked as a developer in UVic’s Humanities Computing and Media Centre for over two decades, and has been involved with dozens of Digital Humanities projects. He has served on the TEI Technical Council and as Managing Editor of the Journal of the TEI. He took over from Joey Takeda as lead developer on LEMDO in 2020. He is a collaborator on the SSHRC Partnership Grant led by Janelle Jenstad.

Navarra Houldin

Training and Documentation Lead 2025–present. LEMDO project manager 2022–2025. Textual remediator 2021–present. Navarra Houldin (they/them) completed their BA with a major in history and minor in Spanish at the University of Victoria in 2022. Their primary research was on gender and sexuality in early modern Europe and Latin America. They are continuing their education through an MA program in Gender and Social Justice Studies at the University of Alberta where they will specialize in Digital Humanities.

Nicole Vatcher

Technical Documentation Writer, 2020–2022. Nicole Vatcher completed her BA (Hons.) in English at the University of Victoria in 2021. Her primary research focus was women’s writing in the modernist period.

Tracey El Hajj

Junior Programmer 2019–2020. Research Associate 2020–2021. Tracey received her PhD from the Department of English at the University of Victoria in the field of Science and Technology Studies. Her research focuses on the algorhythmics of networked communications. She was a 2019–2020 President’s Fellow in Research-Enriched Teaching at UVic, where she taught an advanced course on Artificial Intelligence and Everyday Life. Tracey was also a member of the Map of Early Modern London team, between 2018 and 2021. Between 2020 and 2021, she was a fellow in residence at the Praxis Studio for Comparative Media Studies, where she investigated the relationships between artificial intelligence, creativity, health, and justice. As of July 2021, Tracey has moved into the alt-ac world for a term position, while also teaching in the English Department at the University of Victoria.

Orgography

LEMDO Team (LEMD1)

The LEMDO Team is based at the University of Victoria and normally comprises the project director, the lead developer, project manager, junior developer(s), remediators, encoders, and remediating editors.

Metadata

Authority title	Introduction to Markup, XML, and TEI
Type of text	Documentation
Publisher	University of Victoria on the Linked Early Modern Drama Online Platform
Series	Linked Early Modern Drama Online
Source	TEI Customization created by Martin Holmes, Joey Takeda, and Janelle Jenstad; documentation written by members of the LEMDO Team
Editorial declaration	n/a
Edition	Released with Linked Early Modern Drama Online 1.0
Encoding description	Encoded in TEI P5 according to the LEMDO Customization and Encoding Guidelines
Document status	prgGenerated
Funder(s)	Social Sciences and Humanities Research Council of Canada
License/availability	This file is licensed under a CC BY-NC-ND 4.0 license, which means that it is freely downloadable without permission under the following conditions: (1) credit must be given to the author and LEMDO in any subsequent use of the files and/or data; (2) the content cannot be adapted or repurposed (except in quotations for the purposes of academic review and citation); and (3) commercial uses are not permitted without the knowledge and consent of the editor and LEMDO. This license allows for pedagogical use of the documentation in the classroom.