The LaTeX Project have been making efforts over the past few years to update support in the LaTeX2e kernel for XeTeX and LuaTeX. Supporting these Unicode-enabled engines provides new features (and challenges) compared to the ‘classical’ 8-bit TeX engines (probably pdfTeX for most users). Over recent releases, the team have made the core of LaTeX ‘engine-aware’ and pulled a reasonable amount of basic Unicode data directly into the kernel. The next area we are addressing is font loading, or rather the question of what the out-of-the-box (text) font should be.

To date, the LaTeX kernel has loaded Knuth’s Computer Modern font in his original ‘OT1’ encoding for all engines. Whilst there are good reasons to load at least the T1-encoded version rather than the OT1 version, with an 8-bit engine sticking with the OT1 version can be justified: it’s a question of stability, and nothing is actually out-and-out wrong.

Things are different with the Unicode engines: some of the basic assumptions change. In particular, there are some characters in the upper-half of the 8-bit range for T1 that are not in the same place in Unicode. That means that hyphenation will be wrong for words using some characters unless you load a Unicode font. At the same time, both LuaTeX and XeTeX have changed a lot over recent years: stability in the pdfTeX sense isn’t there. Finally, almost all ‘real’ documents using Unicode engines will be loading the excellent fontspec package to allow system font access. Under these circumstances, it’s appropriate to look again at the standard font loading.

After careful consideration, the team have therefore decided that as of the next (2017) LaTeX2e release, the standard text font loaded when XeTeX and LuaTeX are in use will be Latin Modern as a Unicode-encoded OpenType font. (This is the font chosen by fontspec so for almost all users there will be no change in output.) No changes are being made to the macro interfaces for fonts, so users wanting anything other than Latin Modern will continue to be best served by loading fontspec. (Some adjustments are being made to the package to be ready for this.)

It’s important to add that no change is being made in math mode: the Unicode maths font situation is not anything like as clear as the text mode case.

There are still some details being finalised, but the general approach is clear and should make life easier for end users.

## Font encodings, hyphenation and Unicode engines

The LaTeX team have over the past couple of months been taking a good look at the Unicode TeX engines, XeTeX and LuaTeX, and making efforts to make the LaTeX2e kernel more ‘Unicode aware’. We’ve now started looking at an important question: moving documents from pdfTeX to XeTeX or LuaTeX. There are some important differences in how the engines work, and I’ve discussed some of them in a TeX StackExchange post, but here I’m going to look at one (broad) area in particular: font encodings and hyphenation. To understand the issues, we’ll need a bit of background: first for ‘traditional’ TeX, then for Unicode engines.

Knuth’s TeX (TeX90), e-TeX and pdfTeX are all 8-bit programs. That means that each font loaded with these engines has 256 slots available for different glyphs. TeX works with numerical character codes, not with what we humans think of as characters, and so what’s happening when we give the input

\documentclass{article}
\begin{document}
Hello world
\end{document}


to produce the output is that TeX is using the glyph in position 72 of the current font (‘H’), then position 101 (‘e’), and so on. For that to work and to allow different languages to be supported, we use the concept of font encodings. Depending on the encoding the relationship between character number and glyph appearance varies. So for example with

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\char200
\end{document}


we get ‘È’ but with

\documentclass{article}
\usepackage[T2A]{fontenc}
\begin{document}
\char200
\end{document}


we get ‘И’ (T2A is a Cyrillic encoding).

This has a knock-on effect on dealing with hyphenation: a word which uses ‘È’ will probably have very different allowed hyphenation positions from one using ‘И’. ‘Traditional’ TeX engines store hyphenation data (‘patterns’) in the format file, and to set that up we therefore need to know which encoding will be used for a particular language. For example, English text uses the T1 encoding while Russian uses T2A. So when the LaTeX format gets built for pdfTeX there is some code which selects the correct encoding and does various bits of set up for each language before reading the patterns.
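One way to see which break points the loaded patterns allow is `\showhyphens`, which reports the permitted hyphenation points on the terminal and in the log rather than in the typeset output. A minimal sketch, assuming pdfLaTeX with the default English patterns; the test words are arbitrary:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% The result appears in the .log file, not on the page
\showhyphens{hyphenation encoding typesetting}
See the log for the hyphenation points.
\end{document}
```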

Unicode engines are different here for a few reasons. Unicode doesn’t need different font encodings to represent all of the glyph slots we need. Instead, there is a much clearer one-to-one relationship between a slot and what it represents. For the Latin-1 range this is (almost) the same as the T1 encoding. However, once we step outside of this all bets are off, and of course beyond the 8-bit range there’s no equivalent at all in classical TeX. That might sound fine (just pick the right encoding), but there’s the hyphenation issue to watch. Some years ago now the hyphenation patterns used by TeX were translated to Unicode form, and these are read natively by XeTeX (more on LuaTeX below). That means that at present XeTeX will only hyphenate text correctly if it’s either using a Unicode font set up or if it’s in a language that is covered by the Latin-1/T1 range: for example English, French or Spanish but not German (as ß is different in T1 from the Latin-1 position).
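The ß mismatch can be checked directly with `\char`. The sketch below assumes pdfLaTeX and the T1 layout in which slot `"FF` holds ß, while slot `"DF` (the Unicode and Latin-1 position of ß) holds the ‘SS’ glyph:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Slot "FF: where T1 places the sharp s
\char"FF
% Slot "DF: the Unicode/Latin-1 code point of the sharp s (U+00DF)
\char"DF
\end{document}
```

Patterns stored in Unicode form describe ß as character `"DF`, which is why they do not line up with text set in a T1-encoded font.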

LuaTeX is something of a special case as it doesn’t save patterns into the format and as the use of ‘callbacks’ allows behaviour to be modified ‘on the fly’. However, at least without some precautions the same ideas apply here: things are not really quite ‘right’ if you try to use a traditional encoding. (Using LuaLaTeX today you get the same result as with XeTeX.)

There are potential ways to fix the above, but at the moment these are not fully worked out. It’s also not clear how practical they might be: for XeTeX, it seems the only ‘correct’ solution is to save all of the hyphenation patterns twice, once for Unicode work and once for using ‘traditional’ encodings.

What does this mean for users? Bottom line: don’t use fontenc with XeTeX or LuaTeX unless your text is covered completely by Latin-1/T1. At the moment, if you try something as simple as

\documentclass{article}
\usepackage[T1]{fontenc}
% A quick test to use inputenc only with pdfTeX
\ifdefined\Umathchar\else
\usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}


then you’ll get a surprise: the output is wrong with XeTeX and LuaTeX. So working today you should (probably) be removing fontenc (and almost certainly loading fontspec) if you are using XeTeX or LuaTeX. The team are working on making this more transparent, but it’s not so easy!
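One way to keep a single source working across engines is to branch on the engine in the preamble. This is only a sketch: it assumes the `iftex` package is installed and that the only engines in play are pdfTeX, XeTeX and LuaTeX.

```latex
\documentclass{article}
\usepackage{iftex}
\ifPDFTeX
  % 8-bit engine: traditional font and input encodings
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else
  % XeTeX or LuaTeX: Unicode fonts via fontspec, no fontenc
  \usepackage{fontspec}
\fi
\begin{document}
straße
\end{document}
```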

## Case changing: solving the challenges in TeX

I wrote recently about handling UTF-8 input in Lua, and in particular the fact that doing text manipulation needs a bit of care. One area that I’ve been looking at recently is doing case changing operations. We’ve been looking at this for expl3, so I thought it would be worth looking at this in a bit of detail. I’m going to mainly focus on the results rather than implementation: the latter is important when it affects the output but not really otherwise (except for the team!).

## Background

The first thing to think about is what case changing is needed for. We’ll see in a bit that TeX uses ‘case changing’ for something very different from what we might think of as changing case in ‘text’. First, though, let’s look at what those ‘normal’ requirements are. The Unicode Consortium have looked in detail at this: take a look at the standard for all of the detail. The common situations are:

• ‘Removing’ the case from text to allow ‘caseless’ comparisons (‘case-folding’). This is primarily used ‘internally’ by code, and tends traditionally to be handled by simply lower casing everything before some comparison. The Unicode approach has some slight differences between case-folding and lower-casing, but it’s relatively straight-forward.
• Upper-casing ‘text’. Here, all characters that have a case mapping are changed to the upper-case versions. That’s a relatively simple concept, but there is a bit more to it (as we’ll see).
• Title- or sentence-casing ‘text’. The concept here is usually implemented by upper-casing the first character of a phrase, or of each word, then lower-casing the rest. Again, the Unicode specs have a bit more to say on this: there are some character(s) that should not be upper-cased at the start of a word in this context but need a special ‘title-case’ character. There are also language-specific cases: for example, in Dutch both letters of ‘ij’ at the start of a word should be upper-cased, giving ‘IJ’.

Just to make life a bit more fun, there are also some language-dependent rules for case changing, and some places where the outcome of a case change depends on the context (sigma at the end of words is the most obvious example). So there are a few challenges if we want to cover all of this in TeX. We’ve also got to think about the ‘TeX angle’: what does ‘text’ mean, how do we handle math mode, etc.

## TeX primitives

TeX provides two primitives for changing case, \lowercase and \uppercase. These are powerful operations, and in particular are very often used for something that’s got very little to do with case at all: making characters with non-standard category codes. As that isn’t a ‘real’ case change at all, I won’t look at it further here, other than noting that it means we need those primitives for something even if we do case changing another way entirely!
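For readers who have not met that category-code idiom, here is a minimal sketch of it. The choice of `*` and the replacement text are purely illustrative: the active `~` inside `\lowercase` is turned into an active `*`, which is then defined.

```latex
\documentclass{article}
\begingroup
  % Map the character code of ~ to that of * for \lowercase only
  \lccode`\~=`\*
  % \lowercase changes the character code but keeps the category code,
  % so the active ~ becomes an active *, which \def then defines
  \lowercase{\endgroup\def~}{\textbf{star}}
\begin{document}
{\catcode`\*=\active A * appears here.}
\end{document}
```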

Sticking with changing case of ‘text’, \uppercase and \lowercase rely on the fact that each character has a one-to-one mapping for upper- and lower-casing (defined by \uccode and \lccode). Assuming these are not ‘do nothing’ mappings, they allow a simple replacement of characters

\uppercase{hello} => HELLO
\lowercase{WORLD} => world
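The mappings themselves are plain integers and can be inspected with `\the`. A small sketch: the upper-case code of `a` is 65 (the code of `A`) and the lower-case code of `A` is 97 (the code of `a`).

```latex
\documentclass{article}
\begin{document}
% Upper-case code of `a', then lower-case code of `A'
\the\uccode`\a\ and \the\lccode`\A
\end{document}
```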


With XeTeX and LuaTeX, these mappings are set up for all sensible Unicode code points (‘characters’). However, they are one-to-one mappings with no context-awareness: that makes it impossible to cover some parts of the Unicode definitions I’ve mentioned (at least using the primitives directly). They also change everything in the input, which makes handling things like

\uppercase{Some text $y = mx + c$}


a bit tricky (there are ways, of course!).
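One such way, sketched here with a hypothetical macro name `\linearlaw`, relies on the fact that `\uppercase` changes only character tokens and leaves control sequences alone, so the math can be hidden inside a macro:

```latex
\documentclass{article}
% Hypothetical helper: keeps the math out of sight of \uppercase
\newcommand*\linearlaw{$y = mx + c$}
\begin{document}
% Only the character tokens "Some text " are upper-cased
\uppercase{Some text \linearlaw}
\end{document}
```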

Another TeX concern is ‘expandability’: \uppercase and \lowercase are not expandable. That means that while we can do

\uppercase{\def\foo{some text}}


and have \foo defined as SOME TEXT, the apparently ‘obvious’ alternative

\edef\foo{\uppercase{some text}}


doesn’t have the expected result (\foo here is defined as \uppercase{some text}). Moreover, it means we can’t use the primitives inside places where TeX requires expansion. As a result, things like

\csname\lowercase{Some-input}\endcsname


result in an error. Of course, there are always ways around the problem, but I think it looks a lot ‘nicer’ for the user if a way can be found to do these operations expandably. As we’ll see in a bit, that is doable if we accept a few restrictions.
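One of those non-expandable workarounds is to turn the nesting inside out, so that `\lowercase` acts on the character tokens first and `\csname` runs afterwards. A minimal sketch, with an arbitrary name and replacement text:

```latex
\documentclass{article}
\begin{document}
% Define the control sequence with the lower-case name
\expandafter\def\csname some-input\endcsname{found it}
% \lowercase converts the characters, then \csname builds the name
\lowercase{\csname Some-input\endcsname}
\end{document}
```

This works for typesetting, but it is still not expandable, so it does not help inside an `\edef` or similar contexts.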

## Case folding

If we want to implement case changing without using \lowercase and \uppercase then we have to have some form of iterative mapping over the ‘text’ input. Doing that while keeping the code expandable is doable if we accept a few restrictions, which I’ll come to in a bit. One to mention now is that the code here assumes e-TeX is available and that we have the \pdfstrcmp primitive or equivalent functionality: pdfTeX, XeTeX and LuaTeX all cover these requirements.

For ‘case-folding’ we can make some simplifications which make this the most straight-forward situation to set up. First, case-folding is a one-to-one change with no context-dependence: nice and easy. Secondly, as this is needed only for ‘internal’ stuff and not for ‘text’ to be typeset we can assume that everything can be handled as a (TeX) string by applying \detokenize. That avoids issues with things like escaping math mode, brace groups and the like. Setting up an expandable mapping is then relatively straight-forward, and the issue becomes simply how we actually change the case of each character.

With a list of over 1000 possible characters to case-fold, comparing each and every one to find a hit would get slow. Luckily, Bruno Le Floch spotted that we can divide up that long list into ‘bite sized’ chunks by using the last two digits of the character code of the input, giving 100 short lists, each of which is realistic just to look through. (For those interested in the internals, the final comparison is done using \str_case:nnF, which is an expandable string-based selection using \pdfstrcmp.)

Putting everything together leads to the documented interface

\str_fold_case:n { <input> }


which does exactly what it says: folds the case of the input, which is treated as a string. The only real point to note here is that with pdfTeX it doesn’t make sense to talk about UTF-8 as the engine doesn’t support it. Thus the changes here are restricted to ASCII (A-Z): for a string that’s clear, but life is a bit more murky for ‘text’ input. I’ll come back to that below.

## Case changing

Real case changing provides a few more challenges. Looking first at the Unicode definitions, there are both context- and language-dependent rules to worry about. It turns out that there are relatively few of these, so a bit of work with some hard-coding seems to cover most of them. That does require a bit of ‘bending the rules’ to fit in with how TeX parses stuff, so there may yet be more work to do here!

As we are now looking at text which might have a variety of TeX tokens in it then doing the mapping raises issues. It turns out that we can do an expandable mapping provided we accept that any brace groups end up with { and } as the grouping tokens even if that wasn’t true to start with (a bit of an edge-case but we have to specify these things!). (Note that this does require both e-TeX and \pdfstrcmp, so it’s not true for ‘classical’ TeX.) However, that raises an interesting issue: should stuff inside braces be case changed or not? At the moment, we’ve gone for ‘no’, as that’s very much like the BibTeX approach

title = {Some text including a {Proper-Name}}


which also makes the code a bit easier to write. However, it’s not quite clear if this is the best plan: I’ll point to one open question below.

Another question is what category codes should apply in the output. For the folding case, it was easy: everything is treated as a string so the output is too. That’s not the situation for general text, but at the same time it seems sensible to assume that you are case changing things that will be typeset (‘letters’). Again, this is rather more of a conceptual than a technical question.

Answering these questions, or at least taking a documented position on them, it’s possible to define functions such as

\tl_lower_case:n { <text> }
\tl_upper_case:nn { <language> } { <text> }


that implement the case changing I’ve outlined. As this is very much a ‘work in progress’ those names are not fixed: there’s a feeling that perhaps \text_... might be more ‘sensible’ (the input should be ‘well-behaved’). What’s needed is some testing: we think the idea is a good one, but at the moment it’s not clear we’ve got all of the ideas right!

Notice the versions that know about languages: the idea is that these will get things like Turkish dotted/dotless-i correct. Of course, that assumes you know the language the input is in, but hopefully that’s normally true!
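As a rough illustration of the language-aware idea, the sketch below assumes a recent LaTeX release in which these functions are available under the `\text_...` names (`\text_uppercase:n` and `\text_uppercase:nn`) rather than the draft names above, and that XeLaTeX or LuaLaTeX is in use. The wrapper commands `\DemoUpper` and `\DemoUpperTurkish` are hypothetical, defined only for this example.

```latex
\documentclass{article}
\usepackage{fontspec}
\ExplSyntaxOn
% Hypothetical wrappers, named only for this demonstration
\NewDocumentCommand \DemoUpper { m }
  { \text_uppercase:n {#1} }
\NewDocumentCommand \DemoUpperTurkish { m }
  { \text_uppercase:nn { tr } {#1} }
\ExplSyntaxOff
\begin{document}
% Language-neutral upper-casing
\DemoUpper{istanbul}

% Turkish rules: the initial i should become a dotted capital
\DemoUpperTurkish{istanbul}
\end{document}
```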

One thing to note here is again the pdfTeX case. As we are dealing with ‘engine native’ input, it’s only set up to do changes for the ASCII range. That’s fine, but it leaves open the question of LICR text. For example,

 \tl_upper_case:n { \'{e} }


currently doesn’t do anything as there are braces around the e. I’m not sure what’s best: skipping brace groups is generally easier for the user, but they probably would be surprised by this outcome! (With XeTeX or LuaTeX, the input would hopefully be é so the problem doesn’t arise.)

## Conclusions

Case changing is a tricky thing to get right. We’ve made some progress in providing a ‘clear’ interface in expl3 that can cover not only UTF-8 input but also language-dependence. What’s needed now is some testing and feedback: we hope these things are useful!

## LuaTeX: Manipulating UTF-8 text using Lua

Both the XeTeX and LuaTeX engines are natively UTF-8, which makes input of non-ASCII text a lot easier than with pdfTeX (certainly for the programmer: inputenc hides a lot of complexity for the end user!). With LuaTeX, there is the potential to script in Lua as well as program in TeX macros, and that of course means that you might well want to do manipulation of that UTF-8 input in Lua. What might then catch you out is that it’s not quite as simple as all that!

Lua itself can pass around arbitrary bytes, so input in UTF-8 won’t get mangled. However, the basic string functions provided by Lua are not UTF-8 aware. The LuaTeX manual cautions

The string library functions len, lower, sub, etc. are not UNICODE-aware.

As a result, applying these functions to anything outside the ASCII range is not a good idea. At best you might get unexpected output, so

tex.print (string.lower ("Ł"))


simply prints Ł unchanged (with the right font set up). Worse, you can get an error: for example

tex.print (string.match ("Ł","[Ł]"))


results in

! String contains an invalid utf-8 sequence.


which is not what you want!

Instead of using the string library, the current correct approach here is to use slnunicode. Again, the LuaTeX manual has some advice:

For strings in the UTF-8 encoding, i.e., strings containing characters above code point 127, the corresponding functions from the slnunicode library can be used, e.g., unicode.utf8.len, unicode.utf8.lower, etc.

and indeed

tex.print(unicode.utf8.lower("Ł"))


does print ł. There are still a few things to watch, though. The LuaTeX manual warns that unicode.utf8.find returns a byte range and that unicode.utf8.match and unicode.utf8.gmatch fall back on non-Unicode behaviour when an empty capture (()) is used. Both of those can be allowed for, of course: they should not be big issues.
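The byte-versus-character distinction is easy to see by comparing the two length functions. A minimal sketch, assuming LuaLaTeX and a LuaTeX build in which the slnunicode library (the `unicode` table) is present: Ł takes two bytes in UTF-8, so the byte-based count is 2 while the character count is 1.

```latex
\documentclass{article}
\begin{document}
% string.len counts bytes; unicode.utf8.len counts characters
\directlua{
  local s = "Ł"
  tex.print(string.len(s) .. " bytes, " .. unicode.utf8.len(s) .. " character")
}
\end{document}
```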

There’s still a bit of complexity for two reasons. First, there’s not really much documentation on the slnunicode library, so beyond trying examples it’s not so easy to know what ‘should’ happen. For example, case-changing in Unicode is more complex than a simple one-to-one mapping, and can have language-dependencies. I’ll probably return to that in another post for a TeX (or at least XeTeX/LuaTeX) take on this, but in the Lua context the problem is it’s not so clear quite what’s available! In a way, the second point links to this: the LuaTeX manual tells us

The slnunicode library will be replaced by an internal UNICODE library in a future LuaTeX version.

which of course should lead to better documentation but at the price of having to keep an eye on the situation.

Overall, provided you are aware that you have to think, using Lua with Unicode works well: it’s just that it’s not quite as obvious as you might expect!

## Unicode math versus document styling

There is a lot of work going on to develop methods for directly including mathematical meaning in documents. Projects such as STIX, XITS and Latin Modern Math are intended to provide a range of glyphs for mathematical use while retaining meaning by using the appropriate Unicode code point. Undoubtedly, this is a great idea for reusing information. However, there is always a trade-off, and in this case it is some awkwardness with document styling.

In a standard TeX math font, attributes such as bold or sans-serif can be switched on pretty easily, and also apply on an ‘ongoing’ basis. I make use of this in siunitx to allow ‘detection’ of the local font conditions. Life is much more complex with Unicode maths fonts. Instead of something like bold being a casual attribute of a symbol, it’s intrinsic to the symbol. So you can’t simply switch on bold, or sans-serif, or anything else.

For serious mathematicians, that probably makes good sense: they make a wide and complex use of the appearance of symbols to convey meaning. On the other hand, it’s a bit awkward if you have a caption which is set in bold and want your simple piece of mathematics to match. I’m still thinking about the best way to handle this: suggestions are welcome!
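To make the ‘intrinsic to the symbol’ point concrete, here is a small sketch assuming the `unicode-math` package with XeLaTeX or LuaLaTeX. Each command selects a different Unicode code point for the same letter rather than switching on a font attribute:

```latex
\documentclass{article}
\usepackage{unicode-math} % assumes XeLaTeX or LuaLaTeX
\begin{document}
% Plain, bold, sans-serif and bold sans-serif x:
% four distinct code points, not one symbol in four styles
$x$, $\symbf{x}$, $\symsf{x}$, $\symbfsf{x}$
\end{document}
```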

## Pretesting TeX Live 2010

The first testing builds of TeX Live 2010 are now available, which you can also read about in the TeXblog entry. I downloaded it a few days ago, currently just to my Mac (Windows testing on my system at work starts next week). There are a few changes, some of which were planned for TeX Live 2009 and did not make it. The highlights for me are:

• Restricted \write18 support is back. I’ve written about the issues with this before, but as I understand it these are now solved. The idea of this support is that EPS graphics can be turned into PDF graphics automatically, meaning that pdfLaTeX is much easier to use for end users with mainly EPS graphics available.
• The default PDF output is level 1.5, which means that more compression of the output is available. The amount of compression depends on the type of output (files with lots of hyperlinks seem to show the most dramatic results). I’ve been using PDF 1.5 for a while with no issues, so I hope that this is applicable to most users.
• There is a Unicode version of BibTeX included: BibTeXU. I can’t see any details of where this is coming from or the exact nature of the support: I hope to gain enlightenment at some stage. I’ll certainly be testing it.
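As an illustration of the first point, a document along the following lines relies on the restricted `\write18` support to convert the EPS file when pdfLaTeX runs. This is a sketch only: it assumes the `epstopdf` package and uses a hypothetical graphic called `diagram.eps` in the working directory.

```latex
\documentclass{article}
\usepackage{graphicx}
% Converts EPS to PDF on the fly, using the restricted shell escape
\usepackage{epstopdf}
\begin{document}
% diagram.eps is a hypothetical file name
\includegraphics{diagram.eps}
\end{document}
```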

As I’m currently testing on my Mac, I’ve installed the 64-bit binaries (these still have to be installed in addition to MacTeX at the moment). I’m seeing slightly better performance with the 64 bit binaries than the 32 bit ones, but not by much. On Windows I’m currently limited to 32 bit, so there I’ll have nothing to worry about!

So far, I’ve not had any major issues. TeX Live is very much evolution, not revolution, so that is not too much of a surprise. The team have done a good job, as usual, and I hope that others will brave the testing status of this release to help find any bugs before it’s unleashed on the TeX world at large.