Standard font loading in LaTeX2e with XeTeX and LuaTeX

The LaTeX Project have been making efforts over the past few years to update support in the LaTeX2e kernel for XeTeX and LuaTeX. Supporting these Unicode-enabled engines provides new features (and challenges) compared to the ‘classical’ 8-bit TeX engines (probably pdfTeX for most users). Over recent releases, the team have made the core of LaTeX ‘engine-aware’ and pulled a reasonable amount of basic Unicode data directly into the kernel. The next area we are addressing is font loading, or rather the question of what the out-of-the-box (text) font should be.

To date, the LaTeX kernel has loaded Knuth’s Computer Modern font in his original ‘OT1’ encoding for all engines. Whilst there are good reasons to prefer the T1-encoded version over the OT1 version, with an 8-bit engine sticking with OT1 can be justified: it’s a question of stability, and nothing is actually out-and-out wrong.

Things are different with the Unicode engines: some of the basic assumptions change. In particular, there are some characters in the upper half of the 8-bit range for T1 that are not in the same place in Unicode. That means that hyphenation will be wrong for words using some characters unless you load a Unicode font. At the same time, both LuaTeX and XeTeX have changed a lot over recent years: stability in the pdfTeX sense isn’t there. Finally, almost all ‘real’ documents using Unicode engines will be loading the excellent fontspec package to allow system font access. Under these circumstances, it’s appropriate to look again at the standard font loading.

After careful consideration, the team have therefore decided that as of the next (2017) LaTeX2e release, the standard text font loaded when XeTeX and LuaTeX are in use will be Latin Modern as a Unicode-encoded OpenType font. (This is the font chosen by fontspec, so for almost all users there will be no change in output.) No changes are being made to the macro interfaces for fonts, so users wanting anything other than Latin Modern will continue to be best served by loading fontspec. (Some adjustments are being made to the package to be ready for this.)
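For illustration, here is a minimal document (a sketch of my own, not part of the kernel change itself) showing what this means in practice: with fontspec loaded under XeTeX or LuaTeX, Latin Modern OpenType is already the default, so the output matches the planned kernel behaviour.

```latex
% A minimal sketch for XeLaTeX or LuaLaTeX: fontspec already
% defaults to Latin Modern (OpenType), matching the planned
% kernel behaviour; \setmainfont is only needed to pick a
% different font.
\documentclass{article}
\usepackage{fontspec}
% \setmainfont{TeX Gyre Termes} % only if you want another font
\begin{document}
Naïve café — Unicode input ‘just works’.
\end{document}
```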

It’s important to add that no change is being made in maths mode: the Unicode maths font situation is not anything like as clear as the text mode case.

There are still some details being finalised, but the general approach is clear and should make life easier for end users.

LuaTeX: Manipulating UTF-8 text using Lua

Both the XeTeX and LuaTeX engines are natively UTF-8, which makes input of non-ASCII text a lot easier than with pdfTeX (certainly for the programmer: inputenc hides a lot of complexity for the end user!). With LuaTeX, there is the potential to script in Lua as well as program in TeX macros, and that of course means that you might well want to do manipulation of that UTF-8 input in Lua. What might then catch you out is that it’s not quite as simple as all that!

Lua itself can pass around arbitrary bytes, so input in UTF-8 won’t get mangled. However, the basic string functions provided by Lua are not UTF-8 aware. The LuaTeX manual cautions

The string library functions len, lower, sub, etc. are not UNICODE-aware.

As a result, applying these functions to anything outside the ASCII range is not a good idea. At best you might get unexpected output: for example,

tex.print (string.lower ("Ł"))

simply prints Ł (with the right font set up). Worse, you may get an error: for example,

tex.print (string.match ("Ł","[Ł]"))

results in

! String contains an invalid utf-8 sequence.

which is not what you want!

Instead of using the string library, the current correct approach here is to use slnunicode. Again, the LuaTeX manual has some advice:

For strings in the UTF-8 encoding, i.e., strings containing characters above code point 127, the corresponding functions from the slnunicode library can be used, e.g., unicode.utf8.len, unicode.utf8.lower, etc.

and indeed

tex.print(unicode.utf8.lower("Ł"))

does indeed print ł. There are still a few things to watch, though. The LuaTeX manual warns that unicode.utf8.find returns a byte range and that unicode.utf8.match and unicode.utf8.gmatch fall back on non-Unicode behaviour when an empty capture (()) is used. Both of those can be allowed for, of course: they should not be big issues.
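The byte-range point is easy to demonstrate. In the sketch below (my own example, not one from the manual), the ‘d’ in ‘Łódź’ is reported at byte 5, because ‘Ł’ and ‘ó’ each occupy two bytes in UTF-8, even though ‘d’ is only the third character.

```latex
% unicode.utf8.find reports byte offsets, not character offsets:
% ‘Ł’ and ‘ó’ are two bytes each in UTF-8, so ‘d’ is found at
% byte 5, not at character position 3.
\directlua{
  local s, e = unicode.utf8.find("Łódź", "d")
  tex.print(s .. "-" .. e)
}
```

(Note that Lua line comments inside \directlua are best avoided: TeX joins the lines with spaces before Lua sees them, so a -- comment swallows the rest of the chunk.)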

There’s still a bit of complexity for two reasons. First, there’s not really much documentation on the slnunicode library, so beyond trying examples it’s not so easy to know what ‘should’ happen. For example, case-changing in Unicode is more complex than a simple one-to-one mapping, and can have language-dependencies. I’ll probably return to that in another post for a TeX (or at least XeTeX/LuaTeX) take on this, but in the Lua context the problem is it’s not so clear quite what’s available! In a way, the second point links to this: the LuaTeX manual tells us

The slnunicode library will be replaced by an internal UNICODE library in a future LuaTeX version.

which of course should lead to better documentation but at the price of having to keep an eye on the situation.

Overall, provided you are aware that you have to think, using Lua with Unicode works well: it’s just not quite as obvious as you might expect!

Fixing problems the rapid way

The latest l3kernel update included a ‘breaking change’: something we know alters behaviour, but which is needed for the long term. Of course, despite the team’s efforts to anticipate what such changes will break, we missed one, and there was an issue with lualatex-math as a result, which showed up for people using unicode-math (also reported on TeX-sx). Luckily, those packages all use GitHub, as does the LaTeX3 team, so it was easy for me to quickly fork the code and create a fix. That’s the big advantage of having code available using one of the distributed version control systems (GitHub and BitBucket are the two obvious places): sending in a fix is a two-minute job, even if it’s someone else’s project. So I’d encourage everyone developing open code that goes to CTAN to consider using one of these services: it really does make fixing bugs easier. From report to fix and CTAN update in less than 24 h, which I’d say is pretty good!

A LaTeX format beyond LaTeX2e

The question of why LaTeX3 development is not focussed on LuaTeX came up yesterday on the TeX-sx site. I’ve added an answer there covering some of the issues, but I thought that something a bit more open-ended might also be useful on the same topic.

Before I look at the approaches that are available, it’s worth asking why a format is needed beyond LaTeX2e. There are several reasons, but a few stand out.

The first, strangely, is stability. LaTeX2e is stable: there will be no changes other than bug fixes. That means that a document written 10 or more years ago should still give the same output when typeset today. That sounds great, but there is an issue here. While the kernel is stable, packages are not, and the limitations of the kernel mean that there are a lot of packages. So for a lot of real documents, stability in the kernel does not mean that they will still work after many years, at least without some effort. So we need a kernel which provides a lot more of the basics, and perhaps new approaches to providing stable code.

Secondly, and related, is the fact that most real documents need a lot of packages, and that is a barrier to new users. Again, stability is great but not if it means we don’t continue to attract new people to the LaTeX world. I think that the LaTeX approach is a good one, so that is important to me. So I feel that we need a format which works well and provides a lot more functionality as standard.

Thirdly, there are some fundamental issues which are hard to address, such as inter-paragraph spacing, the placement of floats and better separation of design from input. These all need big changes in LaTeX, and it’s not realistic to hope to bolt such changes on to LaTeX2e and have everything continue to work.

All of that tells me we need a new kernel. So the question is how to achieve that. There are at least four programming approaches I’ve thought about.

Two are closely related: stick with TeX macro programming and cross-engine working, but make things more systematic. Perhaps the simplest way to do this is to adopt an approach similar to the etoolbox package, and to essentially add to the structures already available. The more radical approach in the same area is to do what the LaTeX3 Project have to date, and define a new programming language from the ground up using TeX macros.  There are arguments in favour of both of these approaches: I’ve done some experiments with a more etoolbox-like method for creating a format. My take here is that if you really want something more systematic than LaTeX2e then you do have to go to something like the LaTeX3 method: dealing with expansion with names like \csletcs gets too unwieldy as you try to construct an entire format.
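To give a flavour of the difference, both of the following copy the meaning of \bar to \foo while constructing both names on the fly; the first is the real etoolbox command, the second the real expl3 one (the expl3 line assumes \ExplSyntaxOn is active).

```latex
% etoolbox: the csletcs name tells you both arguments are csnames,
% but each expansion variant needs its own one-off macro name.
\csletcs{foo}{bar}
% expl3: the :cc signature carries the same information
% systematically, and every other variant follows the same pattern.
\cs_set_eq:cc { foo } { bar }
```

With two names this looks like a wash; the expl3 scheme pays off when a whole format’s worth of expansion variants has to stay consistent.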

Moving to a LuaTeX-only solution, and doing a lot of the programming in Lua, is the method that the ConTeXt team has decided on. This brings in a proper programming language without any direct effort, but leaves open some issues. Using Lua does not automatically solve the challenges in writing a better format, and using LuaTeX does not mean that there is no TeX programming to do. So a LuaTeX-only approach would still need some TeX work.

Finally, there is the argument for parsing LaTeX-like input in an entirely new way. In this model, you don’t use TeX at all to read the user’s input: that’s done by another language, and TeX is only involved when you do the typesetting. That sounds challenging, and the big issue here is finding someone who has the necessary programming skills (I certainly do not).

Of the four approaches, it seems to me that from where we are now, the LaTeX3 approach is not so bad. If you were starting today with no code at all, and no background in programming expl3 or Lua, you might pick the LuaTeX method. That’s not, however, where we are: there is experience of expl3 available, and there is also code written (but in need of revision). Of course, the proof of that will be in delivering a working LaTeX3 format: on that, back to work!

LuaTeX category code tables

There are lots of very clever ideas in LuaTeX, and it’s easy to miss some of the good stuff there is. One that many people might miss is category code tables. As any TeX programmer rapidly becomes aware, category codes are central to TeX, and the construction

\catcode`\<char> = <number>\relax

is one you soon get used to. The problem comes when several people start altering the codes: there is no easy way to get back to a known position.

A good illustration of this is verbatim material. The way that something like LaTeX’s \verb macro works is by setting the category code for all of the ‘special’ characters to ‘other’:

\let\do\@makeother
\dospecials

where both \@makeother and \dospecials are provided by the LaTeX kernel. That works because \dospecials is defined as

\do \ \do \\\do \{\do \}\do \$\do \&\do \#\do \^\do \_\do \%\do \~

and so maps the function \@makeother to all of the ‘special’ characters. Using that, a (simplified) verbatim command looks like

\makeatletter
\newcommand*\stdverb{%
  \begingroup
    \let\do\@makeother
    \dospecials
    \@stdverb
}
\newcommand*\@stdverb[1]{%
  \catcode`#1=\active
  \lccode`\~=`#1%
  \lowercase{\let~}\endgroup
  \ttfamily
}
\makeatother

In most cases, this works fine. However, if someone makes another character ‘special’ then things go wrong:

\catcode`\+=\active
\newcommand+{oops}
\stdverb=#{+=

You could loop over every character, but that would be slow for 8-bit input and with UTF-8 input it becomes impractical. Alternatively, you could add each active character to \dospecials, but that depends on everyone sticking to good practice.

This is where category code tables come in. These are pre-set lists of category codes, which can be applied in one go. Heiko Oberdiek’s luatex package provides a LaTeX interface for these, meaning we can do:

\usepackage{luatex}
\makeatletter
\newcommand*\luaverb{%
  \begingroup
    \BeginCatcodeRegime\CatcodeTableOther
    \@stdverb
}
\makeatother

(reusing the same internal macro as before). Now trying

\catcode`\+=\active
\newcommand+{oops}
\luaverb=#{+=

works as expected, as all of the category codes change in one go. Simple, clear and effective!
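Under the hood, the package interface is built on LuaTeX’s catcode-table primitives. A rough sketch of using them directly (the table number 42 is an arbitrary choice of mine, and this assumes the primitives are available under their natural names, as the luatex package arranges):

```latex
% A sketch with the raw LuaTeX primitives: snapshot the current
% category codes into table 42, let arbitrary changes happen,
% then restore the whole lot in a single assignment.
\savecatcodetable42    % snapshot the current category codes
\catcode`\+=\active    % ...arbitrary later changes...
\catcodetable42        % restore the snapshot in one go
```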

LuaTeX 0.45.0

I see from the LuaTeX mailing list that version 0.45.0 has been released. There is the usual long list of new things and bug fixes, but a few caught my eye:

  • \input and \openin now accept braced filenames;
  • The new primitives \aligntab and \alignmark are aliases for the use of & and # in alignments;
  • LuaTeX can now optionally use kpathsea to find lua require() files.
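The \aligntab and \alignmark aliases are handy when the usual characters are unavailable. A plain-TeX-style sketch of my own, with & made active to show the point (and assuming the primitives are enabled in the format in use):

```latex
% With & active, \aligntab still acts as the alignment tab and
% \alignmark as the preamble #, so the \halign works regardless.
\catcode`\&=\active
\halign{\alignmark\quad\aligntab\quad\alignmark\cr
  left\aligntab right\cr
}
```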

Regular expressions

Regular expressions are very popular as a quick and powerful way to carry out searches and replacements in text of all sorts. Traditionally, TeX handles tokens and not strings or characters. This means that doing regex searches using TeX82 is pretty much impossible. To solve this, recent versions of pdfTeX add the \pdfmatch primitive to allow real string matching inside TeX. The LuaTeX team have decided not to take all of the existing “new” primitives forward from pdfTeX, and as I understand it \pdfmatch will not be implemented in LuaTeX. However, Lua itself has regular expression matching, and so the functionality will still be around.
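Lua’s pattern library (strictly speaking not regular expressions at all) is worth a quick illustration: the Perl-style \d and \w escapes do not exist, though bracketed classes do. An example of my own:

```latex
% Lua patterns: [0-9.]+ grabs the first run of digits and dots;
% the Perl-style \d+ simply isn't part of Lua's pattern syntax.
\directlua{
  tex.print(string.match("LuaTeX 0.45.0", "[0-9.]+"))
}
```

which typesets ‘0.45.0’.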

I’ve recently talked about adding new primitives to XeTeX, and you’ll see that \pdfmatch was not on the list for adding to XeTeX. The reason is that a XeTeX implementation would have to be slightly different from pdfTeX’s, as XeTeX is natively UTF-8, but also different to LuaTeX’s, as it would still be a TeX primitive and not a Lua function. So here “the prize wasn’t worth the winning”, in my opinion. As it is, use of \pdfmatch is not widespread, and the idea of having three different regex methods inside TeX didn’t seem like a great idea!

Talking of regex implementations, I’ve been reading Programming in Lua, and also working with TeXworks to try to get syntax highlighting the way I like it. Both systems are slightly different, and it seems both are different from the Perl implementation. It seems that every time you want to use a regex system you have to read the manual to see which things are different from every other implementation!

Lua, TeX, LaTeX and ConTeXt

There is currently a very active thread on comp.text.tex about LuaTeX. Several related questions have come up, and it seems that some kind of summary would help me, at least.

Why did the LuaTeX team choose Lua?

Possibly the most frequently asked question with LuaTeX is why the team behind it chose Lua, as opposed to one of the other scripting languages available. The LuaTeX FAQ addresses this as question 1, so it’s obviously important. As I understand it, they looked at a number of possible languages for extending TeX, and decided that many of the “obvious” candidates (Java, Perl, Python, Ruby, Scheme) were not suitable. Not everyone is going to agree with that assessment, and some people are bound to feel that there was not sufficient consultation.

With LuaTeX already at version 0.40, it is progressing and will deliver. So unless a team can be got together to provide an alternative scripting system, it looks like Lua is what we will get. I don’t see there being enough experienced programmers with time on their hands to deliver a complete alternative at the moment.

How does LuaTeX relate to ConTeXt and LaTeX?

LuaTeX is an “engine”, like pdfTeX or XeTeX. So it provides certain functions, which might or might not get used. The end-user can use them directly, but this really doesn’t help most people, as they don’t want to do this type of low-level work. So in the main it is down to either the TeX format (ConTeXt, LaTeX, …) or an add-on (LaTeX package, ConTeXt module, etc.) to take advantage of them.

In the ConTeXt case, the entire format is being re-written as “Mark IV”, which will use a lot of Lua. This of course means things fundamentally change compared to early versions of ConTeXt, and ties it to a single engine.

In the LaTeX case, the format is written to work with the original TeX engine, with no extensions at all. That is not about to change: essentially, LaTeX2e will never be “updated” in that sense. The current LaTeX3 plans don’t envisage requiring LuaTeX, although I’d hope that some low-level support will be included if LaTeX3 ever becomes a reality. There will, though, be LaTeX packages that use Lua: I’m sure that at some point there will be a fontspec-like package for LuaTeX, for example.

What is wrong with LaTeX2ε?

One issue that always comes up when future directions for TeX are discussed is whether LaTeX2ε needs to change, or whether it is okay as it is. Currently, if you know what you are doing then you can achieve a lot with LaTeX. There are a vast range of packages, and these cover very many of the things you could ever hope to do in TeX. So in a sense LaTeX works as it is. On the other hand, you have to know which packages to load, even to get some of the basics right (try making a new float type without loading any packages and using one of the base classes!). At the same time, things like UTF-8 text, system fonts and so on are not available using only the kernel.

A more fundamental issue is that LaTeX currently doesn’t do enough to provide structured data, and is not a good choice for dealing with XML input. This makes it hard to get data in and out, and is closely related to the fact that there is not enough separation of appearance and structure in the kernel and in add-on packages.

Neither of these points has escaped the notice of the LaTeX team. The question is whether a successor to LaTeX2ε will appear, and if it does whether it can succeed. There is a need to start from scratch with many things, meaning that a new LaTeX simply won’t work with most packages currently available. So the team have to deliver something that really works.

How can (La)TeX recruit new users?

One question that comes up here is what is a typical (La)TeX user anyway. Everyone has their own view on this: some people see TeX users mainly as mathematicians and physicists, other people point out that particularly with XeTeX available there is a lot of potential in the humanities.

Making life easier for new users is a priority for gaining new users. This means making it easy to install TeX (which both TeX Live and MiKTeX do, I think), making it easier to edit TeX files (the excellent TeXworks helps here), and making it easier to use TeX. It is, of course, the last one that is the problem. I think that we do need a new LaTeX as part of this. Removing the need to load a dozen packages and alter half a dozen internal macros to get reasonable output would benefit all LaTeX users. At the same time, something as simple as a generic thesis template could be made available: that would again help to “sell” LaTeX and by extension TeX as a whole.

LuaTeX has a role to play here. Using system fonts with pdfTeX is not easy, and XeTeX has helped a lot with this. LuaTeX provides this possibility and many more, and so we can hopefully look forward to not having to worry about loading system fonts at all. At the same time, it should be possible to do things similar to the current BibTeX and MakeIndex programs directly in the engine, but with full UTF-8 input. This is something that is long overdue, and again can’t hurt when it comes to selling (La)TeX.

LuaTeX reaches 0.40

A recent announcement on the LuaTeX mailing list that LuaTeX has reached v0.40 has been expected for a while. One of the most notable, and widely discussed, changes is in the way new primitives are made available. As of 0.40, “out of the box” LuaTeX only provides one new primitive, \directlua. To get any primitives beyond those from TeX82, you then have to call Lua and get it to turn them on. The reason for this rather radical change is that it avoids any name clashes between new primitives and existing packages. So LuaTeX should, in principle, be able to replace pdfTeX as the engine of choice for most people. Of course, that is still some way off: LuaTeX is scheduled to reach version 1.0 in 2012.