Searchable web pages with MUFI and Junicode 2

Junicode 2 fully implements the recommendations of the Medieval Unicode Font Initiative (MUFI), which offers an encoding scheme for medieval texts based on Unicode, the worldwide standard for text exchange. You probably know that a computer text is a string of numbers, each representing a character: using Unicode, you employ a standard mapping of numbers to characters, ensuring that a text sent to you by (say) your friend in Thailand will look the same when it reaches you as it did when it left your friend’s computer. MUFI does a similar job for medieval characters. For example, if you need a Dutch libra sign, you won’t find it in Unicode; but MUFI has it at code point U+F2EA, and it will be recognizable to anyone with a MUFI-compliant font (like Junicode 2).

To work its magic, MUFI selects characters of interest to medievalists from the Unicode standard, and when a medieval character is missing from Unicode it assigns a code point from a block of numbers set aside for individuals and groups to use in any way they like: the Private Use Area. MUFI-compliant fonts like Junicode 2 must all agree on the mappings of code point to character defined by MUFI so that texts can be exchanged reliably.
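To make the Private Use Area a little more concrete, here is a minimal Python sketch that tests whether a character falls in the block U+E000 to U+F8FF (the Private Use Area of Unicode's Basic Multilingual Plane). The function names are my own, chosen only for illustration; the one MUFI code point used is the libra sign mentioned above.

```python
import unicodedata

# The Basic Multilingual Plane's Private Use Area, as defined by Unicode.
PUA_START, PUA_END = 0xE000, 0xF8FF


def in_private_use_area(ch):
    """Return True if the single character ch lies in the BMP Private Use Area."""
    return PUA_START <= ord(ch) <= PUA_END


def private_use_chars(text):
    """List the Private Use Area characters in text as U+XXXX labels, in order."""
    return [f"U+{ord(ch):04X}" for ch in text if in_private_use_area(ch)]


libra = "\uF2EA"  # MUFI's code point for the Dutch libra sign (see above)

print(in_private_use_area(libra))   # the libra sign is a private-use character
print(unicodedata.category(libra))  # Unicode's own category for private use is "Co"
print(private_use_chars("a" + libra + "b"))
```

A check like this is a quick way to list the characters in a transcription that depend on a MUFI-compliant font to be legible.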

For example, here is the beginning of a diplomatic text of the Old Norse Vǫluspá, copied and pasted from the Medieval Nordic Text Archive (MENOTA). Because Junicode 2 Italic is not yet ready, the text is presented only in roman type:

Hlıoðſ bið ec allar kinꝺir meiri oc miɴi maugo
heimꝺalar uilðo at ec ualꝼꜹþr uel ꝼyr telia ꝼoꝛn
ſpioll ꝼíra þꜹ er ꝼremſt um man. Ec mán iǫtna
ar um boꝛna þ⸠ꜹ̣⸡a er ꝼoꝛꝺom mic ꝼǫꝺꝺa hoꝼꝺo. nio man ec heima
nío iviþi[vr] miot uið mran ꝼyr molꝺ neðan.

Looks great, right? And it will read the same way whether it’s set in Junicode 2, Andron Scriptor, Cardo, or Palemonas MUFI. There’s just one problem, but it’s a significant one. To illustrate, look at the last word in the second line: ꝼoꝛn. That’s an insular f in the first position and an r rotunda in the third: different flavors of f and r. Now try searching this text for the word “forn” using your browser’s search function (Ctrl-F or Cmd-F). What happened? Your search skipped right over ꝼoꝛn in the medieval text and landed on “forn” in this paragraph. Because f and r are encoded differently in this text from the way they’re encoded in plain text, this text can’t be searched the way most webpages can. (To be a little more precise, some browsers will find the f, but none of them will find the r.)
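To see why the search fails, it helps to look at the code points themselves. The following Python sketch (the `describe` helper is hypothetical, not part of any tool mentioned here) prints the code point and Unicode name of each letter in ꝼoꝛn, and then checks that Unicode's standard compatibility normalization (NFKD) leaves the word unchanged, so a naive substring search for “forn” cannot match it. Real browsers use their own matching rules, so this is an approximation of the problem rather than a model of any particular browser.

```python
import unicodedata

medieval = "\uA77Co\uA75Bn"  # insular f, o, r rotunda, n
plain = "forn"


def describe(text):
    """Return (U+XXXX, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(medieval):
    print(code_point, name)

# Neither special letter has a compatibility decomposition,
# so normalizing does not turn them into ordinary f and r.
normalized = unicodedata.normalize("NFKD", medieval)
print(normalized == medieval)

# A plain substring search therefore misses the word.
print(plain in "uel \uA77Cyr telia " + medieval)
```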

MENOTA has its own powerful search tools, so the inability to search in a browser is not a great loss; but this inability may also signal problems with accessibility, which is increasingly important to university administrations, funding agencies, and (of course) end users. Is it possible, then, to create a MUFI-compliant webpage, rich with medieval characters, that is both searchable and accessible?

Turns out that it is both possible and easy to create such a page using Junicode 2. We’ll take it in two steps. The first will be to create as plain an etext as we can get away with, and the second will be to style this text using CSS (“Cascading Style Sheets,” the standard for styling web pages) and the features of Junicode 2.

Here’s my attempt to convert the passage to plain text:

Hlioðs bið ec allar kindir meiri oc mini maugo
heimdalar uilðo at ec ualfavþr uel fyr telia forn
spioll fíra þav er fremst um man. Ec mán iǫtna
ar um borna þ⸠aṿ⸡a er fordom mic fǫdda hofdo. nio man ec heima
nío iviþi[vr] miot uið mę́ran fyr mold neðan.

If “plain text” is what you can type on a U.S. keyboard, we haven’t quite made it: we still have several accented characters and the Icelandic thorn and eth, not to mention the strange brackets in line 4 (U+2E20 and U+2E21—we’ll deal with these later). The Icelandic characters are familiar enough to medievalists that we needn’t worry about them. The dot under the v in what was an av digraph (which we took apart to make it searchable as av, but we’ll put it back together later) is a combining diacritical mark: you type it right after the base character, and your software takes care of positioning it correctly. The accented ę́ was a character in the Private Use Area, but I have changed it to ę (U+0119, common in modern Polish) plus another combining mark, U+0301, the acute accent. That makes it standard Unicode, like the other accented characters. The ę is searchable as e, and browsers will ignore the marks when searching.
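The idea that a search can ignore combining marks can be sketched in a few lines of Python. This is only an approximation, under the assumption that mark-insensitive matching works roughly by decomposing the text and dropping the marks; browsers apply their own rules, and the function name `fold_for_search` is mine.

```python
import unicodedata


def fold_for_search(text):
    """Decompose text (NFD) and drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# e with ogonek (U+0119) followed by a combining acute accent (U+0301)
word = "m\u0119\u0301ran"
print(fold_for_search(word))

# v followed by a combining dot below (U+0323)
print(fold_for_search("av\u0323"))

# Thorn and eth are letters in their own right and pass through unchanged.
print(fold_for_search("\u00FE\u00F0"))
```

Because the ogonek and the acute accent are both removed once the text is decomposed, the word folds to “meran,” which is why a standard-Unicode spelling is friendlier to searching than a single Private Use Area character.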

The next step is to select what we need from among the OpenType features of Junicode 2. OpenType is a standard for font construction that allows fonts to do all kinds of clever things. To cite one common example, when you type the letters “f‌ind” and the f and the i come out joined in a ligature (“find”), an OpenType feature did that. But Junicode 2 has more than eighty OpenType features that do a lot of useful things.

Here are the features we’ll want to apply to the whole text: ss03, cv05, cv09, and ss16.

The features whose four-letter tags begin with ss are Stylistic Sets, and they are either on or off. The Character Variant features (tags beginning with cv) contain a list of variants of a particular character; we select from this list with a numerical index (starting with 1). To turn on these OpenType features, then, we need the following CSS declaration:

font-feature-settings: "ss03" on, "cv05" 1, "cv09" 1, "ss16" on;

When we apply that to our fragment of Vǫluspá, here’s what we get:

Hlioðs bið ec allar kindir meiri oc mini maugo
heimdalar uilðo at ec ualfavþr uel fyr telia forn
spioll fíra þav er fremst um man. Ec mán iǫtna
ar um borna þ⸠aṿ⸡a er fordom mic fǫdda hofdo. nio man ec heima
nío iviþi[vr] miot uið mę́ran fyr mold neðan.

One line of CSS, and already we’re almost home! But there are several special cases to attend to. There’s the dotless i in Hlioðs, the small capital n in mini, the av digraph that has to be reassembled, and the r rotunda that should be a regular r at the end of ualfꜹþr. For those, we devise CSS classes that turn OpenType features on or off, and we apply these classes to individual words or letters.

As an example of what the HTML will look like, here’s the first line of our text:

<span class="initcap">H</span>l<span class="dotlessi">i</span>oðs bið
    ec allar kindir meiri oc mi<span class="pcap">n</span>i maugo

Of course it looks illegible and hard to write, but for most projects it will be generated automatically from an underlying text; and of course your readers will not see the ugly HTML code or the CSS, but rather this:

Hlioðs bið ec allar kindir meiri oc mini maugo
heimdalar uilðo at ec ualfavþr uel fyr telia forn
spioll fíra þav er fremst um man. Ec mán iǫtna
ar um borna þ⸠aṿ⸡a er fordom mic fǫdda hofdo. nio man ec heima
nío iviþi[vr] miot uið mę́ran fyr mold neðan.

and if any of them should happen to hit Ctrl-F and perform a quick-and-dirty search for “kindir” or “forn” or “mini” or “ualfavþr” or “meran,” they’ll find what they’re looking for.
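Since markup like this will usually be generated rather than typed, here is a minimal Python sketch of such a generator. It assumes the underlying text is stored as a plain string plus a list of (start, end, class) annotations; that data model and the function name `mark_up` are my own inventions for illustration, while the class names are the ones used in the HTML example above.

```python
from html import escape


def mark_up(text, annotations):
    """Wrap text[start:end] in <span class="..."> for each (start, end, css_class).

    Annotations must not overlap; they are applied in order of position.
    """
    parts = []
    pos = 0
    for start, end, css_class in sorted(annotations):
        parts.append(escape(text[pos:start]))  # untouched text before the span
        parts.append(
            f'<span class="{escape(css_class)}">{escape(text[start:end])}</span>'
        )
        pos = end
    parts.append(escape(text[pos:]))  # whatever follows the last span
    return "".join(parts)


line = "Hlioðs bið ec allar kindir meiri oc mini maugo"
n_at = line.index("mini") + 2  # position of the n in "mini"
annotations = [
    (0, 1, "initcap"),        # the initial H
    (2, 3, "dotlessi"),       # the i in Hlioðs
    (n_at, n_at + 1, "pcap"), # the small capital n
]

html_line = mark_up(line, annotations)
print(html_line)
```

The plain text stays the single source of truth, and the spans are layered on top of it, which is what keeps the page searchable.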

For one final refinement, we replace the brackets in the text with elements that carry CSS classes: “deletion” and “restoration.” The brackets themselves are not inserted in the text, but rather displayed via the CSS content property. The effect of this change is that searches of the text will ignore the brackets: thus a user can search for (and find) iviþivr. When displayed by this method, editorial interventions can also be restyled programmatically, for example highlighted or hidden. The final text:

Hlioðs bið ec allar kindir meiri oc mini maugo
heimdalar uilðo at ec ualfavþr uel fyr telia forn
spioll fíra þav er fremst um man. Ec mán iǫtna
ar um borna þaṿa er fordom mic fǫdda hofdo. nio man ec heima
nío iviþivr miot uið mę́ran fyr mold neðan.
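The difference the classes make can be checked with a short Python sketch. It compares a fragment with literal brackets against one where the restoration is marked by a span, extracting the text content with the standard library's `HTMLParser`. The `TextContent` class and `text_content` function are hypothetical helpers, and the extracted text only approximates what a browser search sees; the point is that brackets drawn by CSS are not part of the text at all.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collect only the character data of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_content(markup):
    """Return the text of markup with all tags removed."""
    parser = TextContent()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


with_brackets = "nío iviþi[vr] miot"
with_classes = 'nío iviþi<span class="restoration">vr</span> miot'

print("iviþivr" in with_brackets)               # brackets interrupt the word
print("iviþivr" in text_content(with_classes))  # the span does not
```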

This document is not only a demonstration, but also a how-to for producing a searchable online document. For detailed instructions, view the source for this page. The CSS is thoroughly commented so you can tell exactly what each bit does.

Junicode/Junicode 2 font copyright © 1998–2022 by Peter S. Baker.

Licensed under the Open Font License, v. 1.1.