TeX Live 2017 Pretesting

Eager TeX users will have noticed that a few days ago TeX Live 2016 updates were frozen for good. We now have the pretest available for TeX Live 2017. As always, using pre-release software is not without risk, but as you can install it in parallel with the older releases this is not a big problem. The LaTeX team will be updating a few things on CTAN to go into the new release, and I’ll probably mention some of that in future posts. A quick look over the changes tells us that there are minor (and perhaps not-so-minor) engine changes to explore: I’m particularly keen to try out the new XeTeX math mode approach, using HarfBuzz.

Standard font loading in LaTeX2e with XeTeX and LuaTeX

The LaTeX Project have been making efforts over the past few years to update support in the LaTeX2e kernel for XeTeX and LuaTeX. Supporting these Unicode-enabled engines provides new features (and challenges) compared to the ‘classical’ 8-bit TeX engines (probably pdfTeX for most users). Over recent releases, the team have made the core of LaTeX ‘engine-aware’ and pulled a reasonable amount of basic Unicode data directly into the kernel. The next area we are addressing is font loading, or rather the question of what the out-of-the-box (text) font should be.

To date, the LaTeX kernel has loaded Knuth’s Computer Modern font in his original ‘OT1’ encoding for all engines. Whilst there are good reasons to load at least the T1-encoded version rather than the OT1 version, with an 8-bit engine sticking to the OT1 version can be justified: it’s a question of stability, and nothing is actually out-and-out wrong.

Things are different with the Unicode engines: some of the basic assumptions change. In particular, there are some characters in the upper-half of the 8-bit range for T1 that are not in the same place in Unicode. That means that hyphenation will be wrong for words using some characters unless you load a Unicode font. At the same time, both LuaTeX and XeTeX have changed a lot over recent years: stability in the pdfTeX sense isn’t there. Finally, almost all ‘real’ documents using Unicode engines will be loading the excellent fontspec package to allow system font access. Under these circumstances, it’s appropriate to look again at the standard font loading.

After careful consideration, the team have therefore decided that as of the next (2017) LaTeX2e release, the standard text font loaded when XeTeX and LuaTeX are in use will be Latin Modern as a Unicode-encoded OpenType font. (This is the font chosen by fontspec so for almost all users there will be no change in output.) No changes are being made to the macro interfaces for fonts, so users wanting anything other than Latin Modern will continue to be best served by loading fontspec. (Some adjustments are being made to the package to be ready for this.)

It’s important to add that no change is being made in math mode: the Unicode maths font situation is not anything like as clear as the text mode case.

There are still some details being finalised, but the general approach is clear and should make life easier for end users.
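
As a quick check of which defaults a given installation uses, a minimal document can simply print the kernel's default encoding and family. This is only a sketch: it assumes nothing beyond the standard \encodingdefault and \familydefault macros, and the values I would expect (TU and lmr with XeTeX or LuaTeX on a 2017 or later kernel, OT1 and cmr with pdfTeX) are my expectation rather than something stated above.

```latex
\documentclass{article}
\begin{document}
% Print the out-of-the-box font set up for the engine in use
Default encoding: \texttt{\encodingdefault}

Default family: \texttt{\familydefault}
\end{document}
```

Running the same file through pdflatex, xelatex and lualatex makes the engine-dependent difference visible without loading any packages.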

TeX on Windows: TeX Live versus MiKTeX revisited

On Windows, users have two main choices of TeX system to install: TeX Live or MiKTeX. I’ve looked at this before a couple of times: first in 2009 then again in 2011. Over the past few years both systems have developed, so it seems like a good time to revisit this. (I know from my logs that this is one of the most popular topics I’ve covered!)

The first thing to say is that for almost all ‘end users’ (with a TeX system on their own PC just for them to use), both options are fine: they’ll probably notice no difference between the two in use. It’s also worth noting that there is a third option: W32TeX. I’ve mentioned this before: it’s popular in the Far East and is where the Windows binaries for TeX Live come from. (There’s a close relationship between W32TeX and TeX Live, with W32TeX more ‘focussed’ and expecting more user decisions in installing.)

Assuming you are going for one of the ‘big two’, what is there to think about? For most people, it’s simply:

  • Both MiKTeX and TeX Live include a ‘full’ set of TeX-related binaries, including the engines pdfTeX, XeTeX, LuaTeX and support programs such as BibTeX, Biber, MakeIndex and Xindy.
  • The standard installer for MiKTeX installs ‘just the basics’ and uses on-the-fly installation for anything else you need; the standard install for TeX Live is ‘everything’ (about 4.5 GB!). Which is right for you will depend on how much space you have: you can of course customise the installation of either system to include more or less of the ‘complete’ set up.
  • MiKTeX has a slightly more flexible approach to licensing than TeX Live does: there are a small number of LaTeX packages that MiKTeX includes that TeX Live does not. (Probably the most obvious example is thesis.)
  • TeX Live has a Unix background so the management GUI looks slightly less ‘standard’ than the MiKTeX one.
  • TeX Live has a strict once-a-year freeze, which means that to update you have to do a fresh install once a year. On the other hand, MiKTeX versions change only when there is a significant change and otherwise ‘roll onward’.

So the decision is likely to come down to whether you want auto-installation of packages. (If you do go for MiKTeX on a one-user PC, choose the ‘Just for me’ installation option: it makes life a lot simpler!)

For more advanced users there are a few more factors you probably want to consider:

  • TeX Live was originally developed on Unix and so is available for Linux and on the Mac (and other systems) as well as Windows; MiKTeX is a Windows system so is (more-or-less) Windows-only. So if you want exactly the same set up on Windows and other operating systems, this of course means you need to use TeX Live.
  • Both systems have graphical management tools as well as command line interfaces. They have a lot in common, but they are not identical (in particular, MiKTeX tends to emulate TeX Live command line interfaces, but the reverse is not true).
  • The engine binaries in TeX Live are (almost) never updated other than in the yearly freeze period, meaning that for a given release you know which version of pdfTeX, etc., you’ll have: MiKTeX is more flexible with such updates. (At different times, one or other of the systems can be more ‘up to date’: this is not necessarily predictable! The W32TeX system often has very up-to-date testing binaries.)
  • The two systems differ slightly in how local trees are managed (places to add TeX files that are not controlled by the TeX system itself). TeX Live automatically expects <installation root>/texmf-local to hold system-wide ‘local’ additions and <user root>/texmf to hold per-user additions, whereas MiKTeX has no out-of-the-box locations, but does make it easier to add and remove them from the command line. MiKTeX also makes it easy to add multiple per-user trees, whereas for TeX Live there’s more of an assumption that all user additions will be added in one place. (This makes it easier in MiKTeX to add/remove local additions by altering a setting in the TeX system rather than deleting files.)
  • TeX Live has a team doing the work; MiKTeX is a one-man project. This cuts both ways: you know exactly who is doing everything in MiKTeX (Christian Schenk), and he’s very fast, but there is more ‘spread’ in TeX Live for the work.
  • For people wanting to step quickly between different versions of TeX system, the fact that TeX Live freezes once a year makes life convenient. (I have TeX Live 2009, 2010, 2011, 2012, 2013, 2014, 2015 and 2016 installed at present, plus MiKTeX 2.9 of course!) You can switch installations by adjusting the PATH or by choosing the appropriate version from your editor, so you have a ‘fall back’ if there is an issue when you update.
  • TeX Live has built-in package backup during maintenance updates.

Dependencies

There’s been some recent discussion on the TeX Live mailing list about recording dependencies for (La)TeX packages. This is a good idea but means that package authors need to think about their dependency situation. So I thought a few words on this would be helpful, at least from the point of view of the most common case: LaTeX packages.

It’s pretty easy to accumulate \RequirePackage lines in your source, but if you are serious about giving a useful set of dependencies you need to know what each one is for. In many ways the rule is easy: require each package you use. What makes that more complicated is that you might use features which are available when you load package X but are actually provided by package Y. For example, if you load my siunitx package, it loads array, which means that you can do for example

\begin{tabular}{>{$}l<{$}}

So how do you tell what your ‘real’ dependencies are? The usual rule is that you check the documentation: does it say that package X itself provides the features you use? In the case above, siunitx doesn’t document that syntax extension for tabular: it’s documented by array. So if you wrote a package that uses siunitx but also needs to use features from array you should

\RequirePackage{array}
\RequirePackage{siunitx}

This means that even if at some future stage there’s a change in the internals of a package you load, things should still all work.

If you want to track down where stuff might be coming from, you can always use \listfiles to get a full overview of your current package use (starting from a small example).
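
A small test file along the following lines shows the idea. It is only a sketch built from the siunitx/array example above: \listfiles has to come before \documentclass, and the resulting file list at the end of the log should show array.sty even though the document never asks for it directly.

```latex
\listfiles % must come before \documentclass to record every file
\documentclass{article}
\usepackage{siunitx} % loads array behind the scenes
\begin{document}
% The >{...} and <{...} column syntax is provided by array, not siunitx
\begin{tabular}{>{$}l<{$}}
  x^2 \\
\end{tabular}
\end{document}
```

After a run, look for the ‘*File List*’ block near the end of the .log file: anything listed there that you rely on directly is a candidate for its own \RequirePackage line.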

There are a few places where packages are so closely linked you might not have to list them both. The most obvious is TikZ/pgf: the two are different ‘layers’ of the same set up but are documented together, so if you load TikZ you can assume pgf. Of course, there is no harm in listing both!

LaTeX2e and e-TeX

LaTeX2e was released in 1994 and since then the LaTeX3 Project have been committed to keeping it working smoothly for users. That means balancing up keeping the code stable with fixing bugs and adding new features.

Back in 2003 the team announced that the e-TeX extensions would be used by the kernel when they were available. The new primitives offered by e-TeX make many parts of TeX programming easier and often there’s no way in ‘classical’ TeX to get the same effect. As e-TeX was finalised in 1999, starting to use it seriously in around 2004 meant most people had access to it.

Since then, the availability and use of e-TeX has spread, and almost all users have the extensions available. Indeed, the standard format-building routines for LaTeX have included them for many years. There are also a lot of packages on CTAN that use e-TeX, most obviously any using the expl3 programming language that the LaTeX3 Project have created.

The team had always meant to say at some stage that e-TeX was now required, and indeed thought we had until I checked over the official newsletters! So as of the next LaTeX2e release, scheduled for the start of 2017, the kernel will only build if e-TeX is enabled. For this release, we are likely to add a test for e-TeX but no actual use directly in the kernel, though in the future there will probably be more use of the extensions.
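
To give a feel for what such a test involves, here is a minimal sketch. It is not the code that will go into the kernel, just the classic approach of checking whether the \eTeXversion primitive is defined; the wording of the messages is made up for illustration.

```latex
% Sketch only: a test for e-TeX placed before \documentclass
\ifx\eTeXversion\undefined
  % Classical TeX: the primitive does not exist
  \errmessage{The e-TeX extensions are required}
\else
  % e-TeX available: report the version on the terminal and in the log
  \message{e-TeX version \the\eTeXversion\eTeXrevision}
\fi
\documentclass{article}
\begin{document}
Test
\end{document}
```

With any current pdfLaTeX, XeLaTeX or LuaLaTeX format the second branch is taken, which is exactly why requiring the extensions should go unnoticed by almost everyone.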

pgfplots: Showing points as just error bars

Presenting experimental work in a clear form is an important skill. For plotting data, I like the excellent pgfplots package, which makes it easy to put together consistent presentations of complex data. At the moment, I’m doing some experiments where showing the error bars on the raw data is important, but at the same time I want to show fit lines clearly. The best style I’ve seen for this is one where the data are shown as simple vertical bars which have length determined by the error bars for the measurements. The fit lines then stand out clearly without overcrowding the plot. That style isn’t built in to pgfplots but it’s easy to set up with a little work:

\documentclass{standalone}
\usepackage{pgfplots}

% Use features from current release
\pgfplotsset{compat = 1.12}

% Error 'sticks'
\pgfplotsset{
  error bars/error mark options = {draw = none}
  % OR more low-level
  % error bars/draw error bar/.code 2 args = {\draw #1 -- #2;} 
}

\begin{document}
\begin{tikzpicture}
  \begin{axis}
    [
      error bars/y dir      = both,
      error bars/y explicit = true,
    ]
    \addplot[draw = none] table[y error index = 2]
      {
        0   0.023 0.204
        1   0.956 0.332
        2   4.234 0.552
        3   8.764 0.345
        4  17.025 0.943
        5  27.201 2.445
      };
    \addplot[color = red, domain = 0:5, samples = 100] {x^2};
  \end{axis}
\end{tikzpicture}
\end{document}

My demo only has a few data points, but this style really shows its worth as the number of points rises.

Font encodings, hyphenation and Unicode engines

The LaTeX team have over the past couple of months been taking a good look at the Unicode TeX engines, XeTeX and LuaTeX, and making efforts to make the LaTeX2e kernel more ‘Unicode aware’. We’ve now started looking at an important question: moving documents from pdfTeX to XeTeX or LuaTeX. There are some important differences in how the engines work, and I’ve discussed some of them in a TeX StackExchange post, but here I’m going to look at one (broad) area in particular: font encodings and hyphenation. To understand the issues, we’ll first need a bit of background: first for ‘traditional’ TeX then for Unicode engines.

Knuth’s TeX (TeX90), e-TeX and pdfTeX are all 8-bit programs. That means that each font loaded with these engines has 256 slots available for different glyphs. TeX works with numerical character codes, not with what we humans think of as characters, and so what’s happening when we give the input

\documentclass{article}
\begin{document}
Hello world
\end{document}

to produce the output is that TeX is using the glyph in position 72 of the current font (‘H’), then position 101 (‘e’), and so on. For that to work and to allow different languages to be supported, we use the concept of font encodings. Depending on the encoding the relationship between character number and glyph appearance varies. So for example with

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘È’ but with

\documentclass{article}
\usepackage[T2A]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘И’ (T2A is a Cyrillic encoding).

This has a knock-on effect on dealing with hyphenation: a word which uses ‘È’ will probably have very different allowed hyphenation positions from one using ‘И’. ‘Traditional’ TeX engines store hyphenation data (‘patterns’) in the format file, and to set that up we therefore need to know which encoding will be used for a particular language. For example, English text uses the T1 encoding while Russian uses T2A. So when the LaTeX format gets built for pdfTeX there is some code which selects the correct encoding and does various bits of set up for each language before reading the patterns.
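
The slot numbers quoted earlier for ‘Hello world’ are easy to check. The sketch below uses only TeX primitives: \number with a backtick gives the character code of a letter, and \char goes the other way, picking a glyph by slot number from the current font.

```latex
\documentclass{article}
\begin{document}
% Character codes for 'H' and 'e': typesets "72, 101"
\number`H, \number`e

% The reverse direction: slots 72, 101, 108, 108, 111 typeset "Hello"
\char72 \char101 \char108 \char108 \char111
\end{document}
```

For the ASCII range the result is the same in every common encoding, which is why the encoding question only really bites for accented and non-Latin characters.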

Unicode engines are different here for a few reasons. Unicode doesn’t need different font encodings to represent all of the glyph slots we need. Instead, there is a much clearer one-to-one relationship between a slot and what it represents. For the Latin-1 range this is (almost) the same as the T1 encoding. However, once we step outside of this all bets are off, and of course beyond the 8-bit range there’s no equivalent at all in classical TeX. That might sound fine (just pick the right encoding), but there’s the hyphenation issue to watch. Some years ago now the hyphenation patterns used by TeX were translated to Unicode form, and these are read natively by XeTeX (more on LuaTeX below). That means that at present XeTeX will only hyphenate text correctly if it’s either using a Unicode font set up or if it’s in a language that is covered by the Latin-1/T1 range: for example English, French or Spanish but not German (as ß is different in T1 from the Latin-1 position).

LuaTeX is something of a special case as it doesn’t save patterns into the format and as the use of ‘callbacks’ allows behaviour to be modified ‘on the fly’. However, at least without some precautions the same ideas apply here: things are not really quite ‘right’ if you try to use a traditional encoding. (Using LuaLaTeX today you get the same result as with XeTeX.)

There are potential ways to fix the above, but at the moment these are not fully worked out. It’s also not clear how practical they might be: for XeTeX, it seems the only ‘correct’ solution is to save all of the hyphenation patterns twice, once for Unicode work and once for using ‘traditional’ encodings.

What does this mean for users? Bottom line: don’t use fontenc with XeTeX or LuaTeX unless your text is covered completely by Latin-1/T1. At the moment, if you try something as simple as

\documentclass{article}
\usepackage[T1]{fontenc}
% A quick test to use inputenc only with pdfTeX
\ifdefined\Umathchar\else
  \usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}

then you’ll get a surprise: the output is wrong with XeTeX and LuaTeX. So working today you should (probably) be removing fontenc (and almost certainly loading fontspec) if you are using XeTeX or LuaTeX. The team are working on making this more transparent, but it’s not so easy!
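
One way to follow that advice in a file that has to work with all three engines is to put the whole font set up inside the engine test. This is a sketch using the same \Umathchar test as above, with fontspec's standard Latin Modern set up assumed for the Unicode engines.

```latex
\documentclass{article}
\ifdefined\Umathchar
  % XeTeX or LuaTeX: Unicode font set up, no fontenc or inputenc
  \usepackage{fontspec}
\else
  % pdfTeX (or another 8-bit engine): traditional encodings
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}
```

The document body is untouched: only the preamble differs between engines, so the source stays portable.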

LaTeX2e and Unicode engines: the detail

As I mentioned in my last post, the LaTeX team are working on various small but important improvements to the LaTeX2e kernel. One area we are looking at is adjusting how the ‘vanilla’ format works with Unicode engines. I’ve been asked for a bit more detail on this area, so I’ll try to fill in what’s going on with the ‘newer’ engines.

To date, the ‘vanilla’ LaTeX format (latex.ltx and associated files) has been pretty much engine-neutral with no attempt to differentiate anything other than to deal with differences between TeX2 (7-bit, released 1982) and TeX3 (8-bit, released 1990). However, the LaTeX formats that almost all users actually load are not just made by running

<engine> -ini latex.ltx

(or similar). The ‘format builders’ [principally the TeX Live team and Christian Schenk (MiKTeX)] use a series of .ini files for building formats. For example, pdflatex.ini currently says

% Thomas Esser, 1998. public domain.
\input pdftexconfig.tex
\scrollmode
\input latex.ltx
\endinput

(The .ini files are in the main common to both TeX Live and MiKTeX.) For building a pdfLaTeX that makes sense: pdftexconfig.tex just sets up things related to direct PDF output as opposed to working in DVI mode. Things get more complicated, though, when we look at the Unicode engines: some of the stuff is really ‘general’ and should be present in all LaTeX-based formats with these engines.

Both XeTeX and LuaTeX work with the entire Unicode range, so they need information on things like case mapping (\lccode/\uccode) and Unicode math handling (\Umathcode). The LaTeX format includes a \dump at the end, so without hacking about no code can be added after it’s loaded. More importantly, as XeTeX builds hyphenation into the format in the same way as ‘classical’ TeX the \lccode data needs to be right before the format loads the patterns. However, that can’t be done just by reading data before latex.ltx: it sets up the 8-bit range for the T1 encoding scheme. That’s an issue as the hyphenation patterns are nowadays stored in Unicode form: the stuff that happens ‘behind the scenes’ therefore (quite reasonably) assumes that the Unicode engines can read these files with no ‘trickery’. To accommodate this, at the moment you’ll find that xelatex.ini includes

\input unicode-letters
% disable the \dump in latex.ltx
\expandafter\let\csname saved-dump-cs\endcsname\dump
\let\dump=\relax
\scrollmode
\input latex.ltx

and later

% Because latex.ltx sets up character code tables for T1 encoding by default,
% we need to reset values from unicode-letters that may have been overridden
\begingroup
\catcode`\@=11 \count@=128 % reset chars "80-"FF to category "other", no case mapping
\loop \ifnum\count@<256
  \global\uccode\count@=0 \global\lccode\count@=0
  \global\catcode\count@=12 \global\sfcode\count@=1000
  \advance\count@ by 1 \repeat
\def\C #1 #2 #3 {\global\uccode"#1="#2 \global\lccode"#1="#3 } % case mappings (non-letter)
\def\L #1 #2 #3 {\global\catcode"#1=11 % category: letter
  \C #1 #2 #3 % with case mappings
  \ifnum"#1="#3 \else \global\sfcode"#1=999 \fi % uppercase letters have sfcode=999
  \global\XeTeXmathcode"#1="7"01"#1 % BMP letters default to class 7 (var), fam 1
  }
\def\l #1 {\L #1 #1 #1 } % letter without case mappings
[data lines]
\endgroup
\expandafter\let\expandafter\dump\csname saved-dump-cs\endcsname
\dump

There are some slight differences for lualatex.ini, but the general idea is the same. The need to ‘hack around’ the kernel is not great, and the team are much keener on the idea that it’s a documented feature that the Unicode engines are set up for a Unicode encoding (‘UC’) rather than for T1. (I’ll probably return to Unicode encodings in another context in a later post.)
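
To see which case-mapping set up a particular format ended up with, you can ask TeX for the codes directly. The following is a sketch only: the slots "C8 and "FF are picked because the T1 and Unicode interpretations differ in the upper half of the 8-bit range, and I have deliberately not listed expected values for them as they depend on the engine and format in use.

```latex
\documentclass{article}
\begin{document}
% ASCII is the same everywhere: the lowercase code of 'A' is 97
\the\lccode`\A

% Upper half of the 8-bit range: compare pdfLaTeX with XeLaTeX/LuaLaTeX
\the\lccode"C8, \the\uccode"C8

\the\lccode"FF, \the\uccode"FF
\end{document}
```

Any difference between engines in the last two lines is exactly the kind of thing the .ini files are currently patching up after latex.ltx has been read.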

As well as this important area, there are some things that are ‘tacked on’ to the formats by the .ini files but which apply only to one of either XeTeX or LuaTeX. For XeTeX, there is a need to manage the \XeTeXinterchartoks system, for which xelatex.ini currently does

%
% Allocator for \XeTeXintercharclass values, from Enrico Gregorio 
%
\catcode`\@=11
\newcount\xe@alloc@intercharclass % allocates intercharclass
\xe@alloc@intercharclass=\thr@@ % from 4 (1,2 and 3 are used by CJK, AFAIK)
\def\xe@alloc@#1#2#3#4#5{\global\advance#1\@ne
 \xe@ch@ck#1#4#2% make sure there's still room
 \allocationnumber#1%
 \global#3#5\allocationnumber
 \wlog{\string#5=\string#2\the\allocationnumber}}
\def\xe@ch@ck#1#2#3{%
 \ifnum#1<#2\else
  \errmessage{No room for a new #3}%
 \fi}
\def\newXeTeXintercharclass{%
 \xe@alloc@\xe@alloc@intercharclass\XeTeXintercharclass\chardef\@cclv} %at most 254
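
For anyone who has not met this system, a short usage sketch may help (XeLaTeX only). It assumes the format provides \newXeTeXintercharclass as above; the class name \mycharclassa is made up for the example, and the XeTeX primitives \XeTeXcharclass, \XeTeXinterchartoks and \XeTeXinterchartokenstate are used as I understand them from the XeTeX reference.

```latex
\documentclass{article}
% Allocate a new character class (hypothetical name)
\newXeTeXintercharclass\mycharclassa
% Put the letter 'a' in that class
\XeTeXcharclass`\a=\mycharclassa
% Tokens inserted when moving from class 0 (ordinary) into the new class
\XeTeXinterchartoks 0 \mycharclassa = {[}
% ... and when moving back from the new class to class 0
\XeTeXinterchartoks \mycharclassa 0 = {]}
% Switch the mechanism on
\XeTeXinterchartokenstate=1
\begin{document}
banana
\end{document}
```

The allocator matters because several packages may want their own classes: without a shared \newXeTeXintercharclass they would have to pick numbers by hand and hope not to clash.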

For LuaTeX, there are a couple of things in lualatex.ini that should be in the format. First, there is a difference in how this engine handles negative values of \endlinechar compared with other TeX engines. That requires a patch to LaTeX2e’s \@xtypein. More importantly, LuaTeX only activates the extensions to TeX if some Lua code is used:

\begingroup
\catcode`\{=1
\catcode`\}=2
\directlua{
  % etex and pdftex primitives are enabled without prefixing
  % as well as extented Unicode math primitives (see below)
  tex.enableprimitives('', 
    tex.extraprimitives('etex', 'pdftex', 'umath'))
  % other primitives are prefixed with luatex (see below)
  tex.enableprimitives('luatex', 
    tex.extraprimitives('core', 'omega', 'aleph', 'luatex'))
  }
\endgroup

This has to come right at the start of the build process, but is another thing that can sensibly go into latex.ltx. The team also wonder if all of the primitives should have their ‘natural’ names without the luatex prefix.
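
If you are curious what tex.extraprimitives actually returns, a LuaLaTeX document can write one of the lists to the log. This is a sketch that assumes the 'etex' set name is accepted by your LuaTeX version (the set names have changed between releases) and uses texio.write_nl for the output.

```latex
\documentclass{article}
\begin{document}
\directlua{
  local names = tex.extraprimitives('etex')
  texio.write_nl('e-TeX primitives: ' .. table.concat(names, ' '))
}
Done.
\end{document}
```

The names come back without a leading backslash, which is why tex.enableprimitives can add a prefix such as luatex to them.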

All of this can be added to latex.ltx without altering what users have available and without breaking LaTeX2e for pdfTeX users. The team have these changes made in the development version of the kernel. There are other things yet to be finalised, but it’s highly likely the next release of the LaTeX2e kernel will (finally) recognise the Unicode engines and bring this stuff ‘in house’.

Fixing LaTeX2e

When LaTeX2e was first released in 1994 a lot of work had been done to avoid breaking existing LaTeX2.09 documents while allowing changes such as the package and font selection systems. The stability of LaTeX as demonstrated by that approach is one reason it’s been a success. However, there is also a need to allow for change: the world does not stand still. While the LaTeX2e kernel is not about to alter radically, the team are looking to address some areas where the needs today mean that change (or at least adaptation) is the right approach. David Carlisle talked about this at the UK-TUG meeting in November: here I’m going to try to look at the same issues in my own way. An important note before I start: the fixes I’m talking about here are all important but they are not about to change LaTeX2e into something else!

Kernel modifications

Over the years various bugs and issues have come up in the LaTeX2e kernel. Out-and-out bugs get fixed, but issues which are more about ‘code design’ are more tricky. There’s a tension between sorting these out and having the kernel ‘stable’, so not altering existing documents at all. The approach the team have taken to this to date is a package called fixltx2e. It contains ideas that really should go into the kernel but haven’t as they might alter existing documents. The idea is then that most people should really use these fixes in the form

\RequirePackage{fixltx2e}
\documentclass...

The problem: most people don’t do that, or load fixltx2e half-way through a preamble, or use it with packages that were not tested both with and without the fixes. That’s not a great position.

What we are looking at now is moving to a situation where the fixes are in the kernel as standard but with a mechanism to back them out. The details still need to be finalised, but the general plan is that once we make the change people will get the fixes without needing to take any action. If a document really has to be completely unchanged we’ll provide an ‘undo’ package with a way of setting the date that the kernel should be rolled-back to: that way you’ll be able to say ‘I always want the kernel as it was on … even if any fixes at all are made later’. We hope that will be a good balance.
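
In the form this mechanism eventually shipped, the ‘undo’ package is called latexrelease and takes the target date as its option. The sketch below follows the usage given in that package's documentation; the date itself is just an example.

```latex
% Ask for the kernel as it stood on the given date
\RequirePackage[2015/01/01]{latexrelease}
\documentclass{article}
\begin{document}
Text set with the kernel definitions of the requested date.
\end{document}
```

As with fixltx2e, the request has to come before \documentclass so that everything loaded afterwards sees a consistent kernel.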

Register allocation

Classical TeX provides 256 registers of each type. That limit was raised by the e-TeX extensions, which were finalised in 1999 and give us 32768 of the main register types (more on that nuance in a bit). While the team have used the extensions for many years in some packages, the LaTeX kernel itself still uses the classical TeX allocation system. That means that you can run into the

No room for a new ....

error even though there is lots of space. Loading the etex package

\RequirePackage{etex}
\documentclass...

modifies the allocation system to use those extra registers, but a lot of non-expert users don’t know this. So again we have a situation where a change in the kernel is the best plan.

What we are looking at here is the obvious solution: extending the register allocators in the LaTeX2e kernel ‘out of the box’ as long as the e-TeX extensions are available. That should be a transparent change for almost everyone, and will still allow etex to be loaded.
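
A simple stress test shows the difference. This sketch allocates 300 count registers in a loop (the demo@count@ names are made up): with the classical allocator and no etex package I would expect it to stop with ‘No room for a new \count’, whereas with an extended allocator it should run through cleanly.

```latex
\documentclass{article}
\makeatletter
\count@=0
\loop\ifnum\count@<300
  \advance\count@ by 1
  % Build a fresh name for each register, then allocate it
  \expandafter\newcount\csname demo@count@\the\count@\endcsname
\repeat
\makeatother
\begin{document}
Allocated 300 count registers.
\end{document}
```

Real documents hit the limit in a less obvious way, typically by loading several large packages that each allocate a few dozen registers.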

One minor wrinkle is inserts. e-TeX doesn’t extend how many inserts TeX has: there are still only 256. LaTeX2e doesn’t actually need many inserts as floats are handled without them (or without needing one insert per float), but at present the code for making floats does allocate inserts. The best solution here is to change what the kernel does so it no longer uses \newinsert to make floats: that will let us provide more float storage with basically no ‘cost’.

Unicode Engines

The Unicode engines XeTeX and LuaTeX have been with us for a few years now, and quite a lot of what they need to do at the format level is well-established. At the moment, the format-building routines make some changes ‘around’ the core latex.ltx file to accommodate these requirements: the code supplied by the team doesn’t ‘know’ about these newer engines. We’re therefore looking to address that by adding some conditional code.

The first area to tackle overlaps with the point above: LuaTeX extends the register allocation again beyond e-TeX, while XeTeX needs an allocator for \XeTeXinterchartoks. Both of these can readily be added to an updated allocation system.

The bigger impact of Unicode engines is that they have a different requirement from 8-bit engines in setting up the codes TeX uses for case changing. The LaTeX2e kernel sets up the \lccode and \uccode for the 8-bit range and assumes T1 encoding. With the newer engines, that’s not really great as they use Unicode code points and (almost certainly) Unicode (EU1/2) encodings. The format builders alter these assumptions using something of a hack, so we are looking to add the appropriate conditionals to the format itself. For end users that won’t really show, but it will mean that the format itself will be ‘in control’ here: something we are keen to work on.

LuaTeX extras

As well as the issues it shares with XeTeX, LuaTeX introduces ideas such as Lua callbacks and \attribute allocation. These areas are still somewhat ‘in flux’: the team currently feel that we need to get some consensus from the community (particularly active package authors) before adding anything here. However, it’s important that we get people thinking.

Conclusions

The changes we are looking at for LaTeX2e should help keep things ‘ticking over’ in the kernel: they will help us keep things working and offer some new abilities to end users. At the same time, they should move more of the kernel people see ‘in the wild’ back into the control of the team: something we are keen on as we need to be able to fix the bugs. We’re hoping to check in the code for these changes soon: expect requests for testing!