First Things First: Markup – BDMP Encoding Manual

What is Markup?

For a scholarly digital edition such as the Beckett Digital Manuscript Project (BDMP) to work, the texts in the edition's corpus need to be machine-readable: only then can we take full advantage of the possibilities the digital medium has to offer. To make our texts machine-readable, we transcribe them in XML, a descriptive markup language. So before you can start transcribing manuscripts, you first need to know what markup is.

As the Text Encoding Initiative's 'Gentle Introduction to XML' explains, the concept of electronic markup is derived from a practice dating back to the age of print, in which manuscripts were annotated with instructions for compositors or typists, explaining how the text should be printed or laid out (xxvii). By extension, markup can be regarded as a system that indicates how a text should be presented (or read). We can even go a step further. As James H. Coombs, Allen H. Renear, and Steven J. DeRose argued as early as 1987 in their influential paper 'Markup Systems and the Future of Scholarly Text Processing': 'Whenever an author writes anything, he or she "marks it up"' (934).

Because basic writing conventions such as the use of capitals, punctuation, or spacing can all be regarded as the most minimal of layout and reading instructions, markup can be said to form an inextricable part of the writing process itself. But there are of course many different ways to mark up text that go far beyond these basic writing conventions. In their paper, Coombs et al. distinguished six types of markup: punctuational markup, presentational markup, procedural markup, descriptive markup, referential markup, and metamarkup (935-937).

Punctuational markup

Punctuational markup is the type of markup that every writer uses: spaces to separate words from one another, full stops to distinguish between sentences, capitals to mark the beginning of sentences, titles, names, etc. Without punctuational markup, all texts would be written in scriptio continua – one long, uninterrupted string of characters.

Presentational markup

Presentational markup is a similar form of markup that is not applied to the words of the text itself, but to how those words are presented: pagination, indentation, white space, heading formats, etc. It is markup at the page level rather than at the word level.

Procedural markup

While punctuational and presentational markup are intended for human readers (to facilitate the reading process), procedural markup is intended for the computer. This type of markup consists of codes that a computer will interpret as formatting instructions (such as: leave a whitespace here or change the font here). In a WYSIWYG (What You See Is What You Get) input environment like Word, for instance, the codes that make up these instructions are hidden from the human reader’s sight, and translated into presentational markup instead.
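To give an impression of what such hidden codes can look like, here is a simplified fragment in the style of the XML that recent versions of Word store inside a .docx file. The sentence itself is invented for this example, and a real file contains a good deal more bookkeeping than is shown here.

[xml]
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r>
    <w:t xml:space="preserve">She read </w:t>
  </w:r>
  <w:r>
    <!-- formatting instruction: set this run of text in italics -->
    <w:rPr><w:i/></w:rPr>
    <w:t>Hamlet</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve"> and found it </w:t>
  </w:r>
  <w:r>
    <!-- the same instruction again, this time for an emphasized word -->
    <w:rPr><w:i/></w:rPr>
    <w:t>very</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve"> long.</w:t>
  </w:r>
</w:p>
[/xml]

Note that both italicized runs carry exactly the same instruction: the codes say how the words should look, not whether they are a book title or an emphasized word.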

Descriptive markup

Descriptive markup, then, does not tell the computer what to do, but rather what the text is (such as: this section of the text is a paragraph, this section of the text is a quote, or this section of the text is emphasized). To tell the computer which sections of the text are which, descriptive markup languages use tags that mark the beginning and ending of each section. For example:

[xml] <quote>To be, or not to be: that is the question.</quote>[/xml]

Referential markup and metamarkup

For our purposes, the most important thing about these last two types of markup is that they exist to make the descriptive markup work. As such, they mainly serve as reference points for the computer: referential markup refers to information that is external to the marked-up document, and metamarkup is a set of instructions that tells the computer what the different elements in the descriptive markup mean, how they can be used, and how they should be formatted.
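As a rough illustration of both types, the sketch below uses a small document type declaration. The element names (text, p, title, emph) and the entity name bdmp are invented for this example and are not taken from the BDMP's schema. The ELEMENT declarations play the part of metamarkup: they tell the computer which elements exist and what they may contain. The reference &bdmp; in the text plays the part of referential markup: it points to a piece of information that is stored elsewhere and filled in when the document is processed. For brevity that information is declared at the top of the same file here, but it could just as well be kept in a separate file.

[xml]
<!DOCTYPE text [
  <!-- Metamarkup: which elements exist, and what they may contain -->
  <!ELEMENT text (p+)>
  <!ELEMENT p (#PCDATA | title | emph)*>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT emph (#PCDATA)>
  <!-- The replacement text that the reference below points to -->
  <!ENTITY bdmp "Beckett Digital Manuscript Project">
]>
<text>
  <p>The &bdmp; transcribes its manuscripts in <emph>descriptive</emph> markup.</p>
</text>
[/xml]

Declarations of this kind do not say anything about formatting; instructions on how each element should be displayed are typically kept in a separate stylesheet.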

Descriptive markup has two significant advantages over procedural markup. First, it makes formatting a text much easier, because it allows the author to declare how all instances of a specific class of textual elements should be formatted (such as: indent every paragraph or italicize all emphasized text), and to change those declarations (and thus the text's formatting) at any point in the writing process. Say we have a text in which book titles and emphasis are both rendered in italics, and say we want to render all the book titles in bold instead. If we used a descriptive markup language to encode the text, doing so would be a piece of cake: just go to the metamarkup file that contains the declaration that book titles should be rendered in italics, and change it to bold. If we used procedural markup to encode the text, however, there would be no reliable way to change the formatting of all the book titles automatically, because nothing in the markup distinguishes an italicized title from an italicized emphasis. In a Word file, for instance, we would have to go over every italicized string of characters, determine whether it constitutes a book title or not, and change the formatting accordingly.
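To make this concrete, here is a minimal sketch of what such a declaration file could look like, written as an XSLT stylesheet that converts an encoded text to HTML. This is an assumption for the sake of illustration: the element names p, emph, and title are invented, and nothing here describes how the BDMP actually formats its texts.

[xml]
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Every paragraph becomes an HTML paragraph -->
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Emphasized text is rendered in italics -->
  <xsl:template match="emph">
    <i><xsl:apply-templates/></i>
  </xsl:template>

  <!-- Book titles are rendered in bold; this template used to say <i> -->
  <xsl:template match="title">
    <b><xsl:apply-templates/></b>
  </xsl:template>
</xsl:stylesheet>
[/xml]

Because the rule for book titles is stated only once, replacing <i> with <b> in the last template changes the rendering of every title in the text, while the emphasized passages are left untouched.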

The second advantage of descriptive markup languages over procedural markup languages is that identifying the different sections of text adds a layer of 'meaning' to the text that can be recognized, processed, and analyzed by a computer in a way that is impossible with procedural markup. For the text in our example, it would be relatively easy to produce a graph that displays how often the author used emphasis, and to compare her results to those of other authors. If her text were marked up using procedural markup, on the other hand, this would again be impossible, because there would be no (easy) way to automatically filter all of the book titles out of the graph's results.
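The counting step behind such a graph could be sketched as follows, again as an XSLT stylesheet and again under the assumption that emphasized passages are tagged with an invented emph element. The stylesheet does nothing more than report how many of those elements the document contains; drawing the graph and comparing authors would be separate steps.

[xml]
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text>Emphasized passages: </xsl:text>
    <!-- count every emph element, wherever it occurs in the document -->
    <xsl:value-of select="count(//emph)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
[/xml]

Book titles never enter the count, because they are tagged as something else, even if they happen to be displayed in the same italics.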

Standards for Markup Languages

Descriptive markup languages come in all shapes and sizes. To encode the quote from Hamlet above, I used a <quote> tag to mark the quote; but if I wanted to, I could just as easily have used a <q> tag instead. As long as you are consistent, you are free to choose your own tags. The problem, of course, is that if everyone uses their own tag-set to encode their texts, nobody will be able to use and process texts that were encoded by others. To solve this problem, the Text Encoding Initiative (TEI) developed a standard for encoding texts, using XML (eXtensible Markup Language). That is why we use TEI-compliant XML to encode our texts.
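By way of illustration, this is roughly what the quote from Hamlet could look like inside a very small TEI document. It is only a sketch: the contents of the header are placeholders made up for this example, and the transcriptions in the BDMP record far more information than this.

[xml]
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample transcription (placeholder title)</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Invented for illustration.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><quote>To be, or not to be: that is the question.</quote></p>
    </body>
  </text>
</TEI>
[/xml]

The header describes the file, and the text element holds the transcription itself. Because the element names and the namespace on the first line are defined by the TEI, any tool that understands the standard knows what each tag means.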

The TEI's goal is to provide an encoding standard that suits the needs of any type of text. As a result, the TEI's tag-set is huge, and its accompanying documentation can be daunting. But no one needs to use all the tags. As Lou Burnard explained in his recent monograph What is the Text Encoding Initiative?, deciding which tags your text will need is the first step of any text encoding project (9). Like any type of text, our corpus of genetic materials needs only a fraction of the tags that the TEI has to offer. This encoding manual therefore explains which of the TEI's tags we use at the BDMP, and how, both for those who collaborate on one of the BDMP's upcoming modules and for those who are interested in our encoding schema in general.