Standard font loading in LaTeX2e with XeTeX and LuaTeX

The LaTeX Project have been making efforts over the past few years to update support in the LaTeX2e kernel for XeTeX and LuaTeX. Supporting these Unicode-enabled engines provides new features (and challenges) compared to the ‘classical’ 8-bit TeX engines (probably pdfTeX for most users). Over recent releases, the team have made the core of LaTeX ‘engine-aware’ and pulled a reasonable amount of basic Unicode data directly into the kernel. The next area we are addressing is font loading, or rather the question of what the out-of-the-box (text) font should be.

To date, the LaTeX kernel has loaded Knuth’s Computer Modern font in his original ‘OT1’ encoding for all engines. Whilst there are good reasons to load at least the T1-encoded version rather than the OT1 version, sticking with OT1 on an 8-bit engine can be justified: it’s a question of stability, and nothing is actually out-and-out wrong.

Things are different with the Unicode engines: some of the basic assumptions change. In particular, there are some characters in the upper half of the 8-bit range for T1 that are not in the same place in Unicode. That means that hyphenation will be wrong for words using some characters unless you load a Unicode font. At the same time, both LuaTeX and XeTeX have changed a lot over recent years: stability in the pdfTeX sense isn’t there. Finally, almost all ‘real’ documents using Unicode engines will be loading the excellent fontspec package to allow system font access. Under these circumstances, it’s appropriate to look again at the standard font loading.

After careful consideration, the team have therefore decided that as of the next (2017) LaTeX2e release, the standard text font loaded when XeTeX and LuaTeX are in use will be Latin Modern as a Unicode-encoded OpenType font. (This is the font chosen by fontspec, so for almost all users there will be no change in output.) No changes are being made to the macro interfaces for fonts, so users wanting anything other than Latin Modern will continue to be best served by loading fontspec. (Some adjustments are being made to the package to be ready for this.)
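
As a minimal sketch of what ‘loading fontspec’ looks like in practice for a document that wants something other than Latin Modern: the font name below is purely an illustrative assumption (any OpenType font installed on the system or in the TeX distribution would do), and the file needs to be compiled with XeLaTeX or LuaLaTeX.

```latex
% Compile with XeLaTeX or LuaLaTeX (fontspec does not work with pdfLaTeX).
\documentclass{article}
\usepackage{fontspec}
% Assumption for illustration: TeX Gyre Pagella is installed.
\setmainfont{TeX Gyre Pagella}
\begin{document}
Text set in a font other than Latin Modern.
\end{document}
```

Leaving out the \setmainfont line should give Latin Modern, which is why most fontspec users will see no difference from the kernel change.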

It’s important to add that no change is being made in math mode: the Unicode maths font situation is not anything like as clear as the text mode case.

There are still some details being finalised, but the general approach is clear and should make life easier for end users.

Font encodings, hyphenation and Unicode engines

The LaTeX team have over the past couple of months been taking a good look at the Unicode TeX engines, XeTeX and LuaTeX, and making efforts to make the LaTeX2e kernel more ‘Unicode aware’. We’ve now started looking at an important question: moving documents from pdfTeX to XeTeX or LuaTeX. There are some important differences in how the engines work, and I’ve discussed some of them in a TeX StackExchange post, but here I’m going to look at one (broad) area in particular: font encodings and hyphenation. To understand the issues, we’ll need a bit of background: first for ‘traditional’ TeX, then for Unicode engines.

Knuth’s TeX (TeX90), e-TeX and pdfTeX are all 8-bit programs. That means that each font loaded with these engines has 256 slots available for different glyphs. TeX works with numerical character codes, not with what we humans think of as characters, and so what’s happening when we give the input

\documentclass{article}
\begin{document}
Hello world
\end{document}

to produce the output is that TeX is using the glyph in position 72 of the current font (‘H’), then position 101 (‘e’), and so on. For that to work and to allow different languages to be supported, we use the concept of font encodings. Depending on the encoding, the relationship between character number and glyph appearance varies. So for example with

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘È’ but with

\documentclass{article}
\usepackage[T2A]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘И’ (T2A is a Cyrillic encoding).
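
The ‘slot number’ view can also be checked directly for the ASCII range, where the common encodings agree. A small sketch, assuming the default (OT1) setup of the article class:

```latex
\documentclass{article}
\begin{document}
% Slots 72, 101, 108, 108, 111 are H, e, l, l, o in the default encoding
\char72\char101\char108\char108\char111
\end{document}
```

This should typeset ‘Hello’ exactly as if the letters had been typed: for TeX, the typed letters and the explicit slot numbers amount to the same request.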

This has a knock-on effect on dealing with hyphenation: a word which uses ‘È’ will probably have very different allowed hyphenation positions from one using ‘И’. ‘Traditional’ TeX engines store hyphenation data (‘patterns’) in the format file, and to set that up we therefore need to know which encoding will be used for a particular language. For example, English text uses the T1 encoding while Russian uses T2A. So when the LaTeX format gets built for pdfTeX there is some code which selects the correct encoding and does various bits of set up for each language before reading the patterns.
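
At the document level, the pairing of language and encoding usually surfaces through fontenc and babel. The following is only a sketch for pdfLaTeX, and it assumes that T2A-encoded Cyrillic fonts (for example from the cm-super bundle) and the Russian hyphenation patterns are available in the installation:

```latex
% Compile with pdfLaTeX.
\documentclass{article}
\usepackage[T2A,T1]{fontenc}        % the last option (T1) is the document default
\usepackage[utf8]{inputenc}
\usepackage[russian,english]{babel} % the last option (english) is the main language
\begin{document}
This paragraph is hyphenated with the English patterns in T1.

% babel switches to the Cyrillic encoding and the Russian patterns here
\foreignlanguage{russian}{Этот текст использует кодировку T2A и русские шаблоны переноса.}
\end{document}
```

The patterns themselves were fixed when the format was built; the document only selects between what is already stored there.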

Unicode engines are different here for a few reasons. Unicode doesn’t need different font encodings to represent all of the glyph slots we need. Instead, there is a much clearer one-to-one relationship between a slot and what it represents. For the Latin-1 range this is (almost) the same as the T1 encoding. However, once we step outside of this all bets are off, and of course beyond the 8-bit range there’s no equivalent at all in classical TeX. That might sound fine (just pick the right encoding), but there’s the hyphenation issue to watch. Some years ago now the hyphenation patterns used by TeX were translated to Unicode form, and these are read natively by XeTeX (more on LuaTeX below). That means that at present XeTeX will only hyphenate text correctly if it’s either using a Unicode font setup or working in a language that is covered by the part of T1 that matches Latin-1: for example English, French or Spanish but not German (as ß is in a different position in T1 from its Latin-1 position).

LuaTeX is something of a special case as it doesn’t save patterns into the format and as the use of ‘callbacks’ allows behaviour to be modified ‘on the fly’. However, at least without some precautions the same ideas apply here: things are not really quite ‘right’ if you try to use a traditional encoding. (Using LuaLaTeX today you get the same result as with XeTeX.)

There are potential ways to fix the above, but at the moment these are not fully worked out. It’s also not clear how practical they might be: for XeTeX, it seems the only ‘correct’ solution is to save all of the hyphenation patterns twice, once for Unicode work and once for using ‘traditional’ encodings.

What does this mean for users? Bottom line: don’t use fontenc with XeTeX or LuaTeX unless your text is covered completely by Latin-1/T1. At the moment, if you try something as simple as

\documentclass{article}
\usepackage[T1]{fontenc}
% A quick test to use inputenc only with pdfTeX
\ifdefined\Umathchar\else
  \usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}

then you’ll get a surprise: the output is wrong with XeTeX and LuaTeX. So working today you should (probably) be removing fontenc (and almost certainly loading fontspec) if you are using XeTeX or LuaTeX. The team are working on making this more transparent, but it’s not so easy!
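
One way to follow that advice while keeping a single source file that works on all three engines is to branch on the engine in the preamble. This is a sketch rather than an official recommendation; it assumes the iftex package (not mentioned above) is available, and uses its \ifPDFTeX conditional:

```latex
\documentclass{article}
\usepackage{iftex}
\ifPDFTeX
  % 8-bit engine: traditional font and input encodings
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else
  % XeTeX or LuaTeX: Unicode fonts via fontspec, no fontenc
  \usepackage{fontspec}
\fi
\begin{document}
straße
\end{document}
```

With this arrangement fontenc is never loaded on the Unicode engines, so the ß mismatch described above does not arise.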

Font schemes and LaTeX3

There was a question recently on the TeX.sx site about font selection and LaTeX3. At the moment, there is not a LaTeX3 font system set up, and there are issues outstanding, so this is not something with a single answer. What I can do, though, is look at what seems likely and what some of the areas to consider are.

(New) Font Selection Scheme

TeX’s font mechanism is pretty basic. There is no relationship between one text font and another: they are all simply set up using the \font primitive. So with plain TeX

{\bf Some {\it text}}

will have ‘Some’ in bold, but ‘text’ in mid-weight italics. LaTeX2e introduced the ‘New Font Selection Scheme’ (NFSS), which provides a method for managing fonts in a way that is likely to be more logical for the user. Thus

{\bfseries Some {\itshape text}}

will have the inner text both bold and italic. At the same time, the NFSS provides a system for loading font files in an organised way and substituting fonts when a particular shape combination is unavailable.
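
The reason the two changes combine is that the NFSS tracks series (weight/width) and shape as separate attributes of the current font. As a rough sketch, assuming the default Computer Modern setup in which bold extended is series bx, the example above corresponds to these lower-level NFSS commands:

```latex
\documentclass{article}
\begin{document}
% \fontseries changes only the series; \fontshape changes only the shape
{\fontseries{bx}\selectfont Some {\fontshape{it}\selectfont text}}
\end{document}
```

Inside the inner group the series is still bx, so \selectfont asks for the bold extended italic member of the family rather than resetting to medium weight.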

Overall, the NFSS is one of the key successes of LaTeX2e compared with LaTeX2.09. There are also a lot of existing .fd files about for using fonts with LaTeX2e, and supporting those is important. So something like the NFSS is definitely needed: the ‘New’ is rather anachronistic nowadays, so the working title is just FSS.

The NFSS is not perfect, and so LaTeX3’s FSS cannot be simply a clone of NFSS. Perhaps the most common complaint about the NFSS is that \textsc is treated as a shape, which makes it impossible to combine it with \itshape to have italic small caps. Other areas which need addressing are for example flexible sizing and proportional/fixed width numbers for tables. This is all evolutionary, and so the plan is to port the existing NFSS first, tidy it up to fit better with LaTeX3 coding approaches, then add new abilities.

Font face loading

The second area to think about is loading fonts in the first place. The traditional LaTeX2e approach to this is to set up a small(ish) package to select a font family, for example lmodern or mathptmx, which will then use the NFSS to load the appropriate TeX font files. For users of XeTeX or LuaTeX, the standard method is to use the fontspec package, which provides an interface between the extended \font primitives in these engines and the NFSS.
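
To make the ‘small(ish) package’ idea concrete, here is a sketch of the core of what such a package does, written directly in a preamble. It assumes pdfLaTeX and the psnfss Times family (NFSS family name ptm); real packages such as mathptmx do rather more than this, for instance setting up maths fonts.

```latex
% Compile with pdfLaTeX.
\documentclass{article}
\usepackage[T1]{fontenc}
% Point the default roman family at Times; the NFSS then finds
% the matching .fd file and loads the font files on demand.
\renewcommand{\rmdefault}{ptm}
\begin{document}
Text in the Times family.
\end{document}
```

The user-visible choice is a single family name; everything below that (encodings, .fd files, TFM files) is handled by the NFSS.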

There are a few things to think about here. First, while XeTeX and LuaTeX can load system fonts directly, pdfTeX cannot. Secondly, even if you are using XeTeX or LuaTeX, access to traditional TeX fonts cannot be ignored. There is a lot of METAFONT material on CTAN which is not available in any other format, so simply dropping support for it is not an option.

What I feel we need is a single font-loading interface at the user level which is capable of dealing with these requirements. Clearly, fontspec is going to provide inspiration on how to proceed, but some mechanism for working with pdfTeX will also be needed. My personal take on this is we’ll need a mapping layer, which will mean that at the user level you choose a font by name (as you would in a GUI application), and which then does the appropriate translation to the engine layer.

There are also math mode fonts to worry about. OpenType maths fonts are very much in development, but that doesn’t help with pdfTeX and again does not cover all cases. So again we need to continue to support TeX’s traditional math mode fonts. That will probably be the last part of this particular jigsaw to be tackled, simply because it’s the one with the least clear path at present.

Improving LaTeX for the user

I’ve been discussing some points about the future of (La)TeX with various people, and some key issues come to mind. Most LaTeX users do not want to meddle with the internal parts of LaTeX or TeX. In an ideal world, I suspect most users would like to need little beyond the correct document class to get things “just right” in their layout. Perhaps a few simple settings, but really little more than that.

With the correct packages and class loaded, you can do many things in LaTeX. However, you really shouldn’t need to load specific support for hyperlinks, T1 encoding, basic font changing, creating new float types and so on in 2009 (let alone 2010, 2011, etc.). The efforts of the LaTeX3 team have to date focussed on programming and to a lesser extent document design. How things will work at the user level is much less clear.

I’d suggest that a real focus on getting something for users would be the best way forward. This might mean less improvement internally, but I’d think that a LaTeX kernel which could do everything in The LaTeX Companion would be pretty successful, even with few changes “under the hood”. This would mainly be a re-coding exercise from existing packages, which in a way is similar to what I’ve tried to do with siunitx. Much of the basics in siunitx are taken from other packages (at least in terms of user interface), but it brings several ideas together in one place. The same idea could easily be applied to the kernel. Of course, this might leave some of the clever ideas for LaTeX3 out of the code at this stage, but I’d hope it would get momentum behind a more regularly updated system.

One particular area to think about is fonts. With both XeTeX and LuaTeX able to handle system fonts directly, the basic LaTeX system seems very antiquated. At present, LaTeX3 only requires e-TeX, not LuaTeX (in contrast to ConTeXt Mark IV). Should the LaTeX team say something like:

For current testing purposes, only the e-TeX extensions are needed, but this is likely to change. XeTeX or LuaTeX will be required to run the release version of LaTeX3 with full functionality.

I’d say yes, as I think that it’s time to move on from complex font installation and usage restrictions. I’d also be very tempted to say that LaTeX3 will assume UTF-8 input unless otherwise specified (as both XeTeX and LuaTeX are native UTF-8 systems).

This type of approach will make LaTeX easier to use, and I’d hope to see it arrive! After all, users are the TeX community.