Some TeX Developments

Coding in the TeX world

Font encodings, hyphenation and Unicode engines

with 6 comments

The LaTeX team have over the past couple of months been taking a good look at the Unicode TeX engines, XeTeX and LuaTeX, and working to make the LaTeX2e kernel more ‘Unicode aware’. We’ve now started looking at an important question: moving documents from pdfTeX to XeTeX or LuaTeX. There are some important differences in how the engines work, and I’ve discussed some of them in a TeX StackExchange post, but here I’m going to look at one (broad) area in particular: font encodings and hyphenation. To understand the issues, we’ll first need a bit of background: first for ‘traditional’ TeX and then for the Unicode engines.

Knuth’s TeX (TeX90), e-TeX and pdfTeX are all 8-bit programs. That means that each font loaded with these engines has 256 slots available for different glyphs. TeX works with numerical character codes, not with what we humans think of as characters, and so what’s happening when we give the input

\documentclass{article}
\begin{document}
Hello world
\end{document}

to produce the output is that TeX is using the glyph in position 72 of the current font (‘H’), then position 101 (‘e’), and so on. For that to work and to allow different languages to be supported, we use the concept of font encodings. Depending on the encoding the relationship between character number and glyph appearance varies. So for example with

\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘È’ but with

\documentclass{article}
\usepackage[T2A]{fontenc}
\begin{document}
\char200
\end{document}

we get ‘И’ (T2A is a Cyrillic encoding).

This has a knock-on effect on dealing with hyphenation: a word which uses ‘È’ will probably have very different allowed hyphenation positions from one using ‘И’. ‘Traditional’ TeX engines store hyphenation data (‘patterns’) in the format file, and to set that up we therefore need to know which encoding will be used for a particular language. For example, English text uses the T1 encoding while Russian uses T2A. So when the LaTeX format gets built for pdfTeX there is some code which selects the correct encoding and does various bits of set up for each language before reading the patterns.

Unicode engines are different here for a few reasons. Unicode doesn’t need different font encodings to represent all of the glyph slots we need. Instead, there is a much clearer one-to-one relationship between a slot and what it represents. For the Latin-1 range this is (almost) the same as the T1 encoding. However, once we step outside of this all bets are off, and of course beyond the 8-bit range there’s no equivalent at all in classical TeX. That might sound fine (just pick the right encoding), but there’s the hyphenation issue to watch. Some years ago now the hyphenation patterns used by TeX were translated to Unicode form, and these are read natively by XeTeX (more on LuaTeX below). That means that at present XeTeX will only hyphenate text correctly if it’s either using a Unicode font set up or if it’s in a language that is covered by the Latin-1/T1 range: for example English, French or Spanish but not German (as ß is different in T1 from the Latin-1 position).
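
One way to see which patterns are being applied is \showhyphens, which writes the candidate break points to the log. A sketch, assuming babel and the German hyphenation patterns are installed:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[ngerman]{babel}
\begin{document}
\showhyphens{Wörterbuch straße}
\end{document}

Run with pdfLaTeX this logs the expected German break points; run with XeLaTeX and the same T1 set up, words containing ß are likely to show the problem just described.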

LuaTeX is something of a special case as it doesn’t save patterns into the format and as the use of ‘callbacks’ allows behaviour to be modified ‘on the fly’. However, at least without some precautions the same ideas apply here: things are not really quite ‘right’ if you try to use a traditional encoding. (Using LuaLaTeX today you get the same result as with XeTeX.)

There are potential ways to fix the above, but at the moment these are not fully worked out. It’s also not clear how practical they might be: for XeTeX, it seems the only ‘correct’ solution is to save all of the hyphenation patterns twice, once for Unicode work and once for using ‘traditional’ encodings.

What does this mean for users? Bottom line: don’t use fontenc with XeTeX or LuaTeX unless your text is covered completely by Latin-1/T1. At the moment, if you try something as simple as

\documentclass{article}
\usepackage[T1]{fontenc}
% A quick test to use inputenc only with pdfTeX
\ifdefined\Umathchar\else
  \usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}

then you’ll get a surprise: the output is wrong with XeTeX and LuaTeX. So working today you should (probably) be removing fontenc (and almost certainly loading fontspec) if you are using XeTeX or LuaTeX. The team are working on making this more transparent, but it’s not so easy!
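
A version of the test document above that behaves itself with all three engines might look like this (a sketch: fontspec loads Latin Modern by default, and the engine test simply checks for a Unicode-engine primitive, as in the example above):

\documentclass{article}
\ifdefined\Umathchar % XeTeX or LuaTeX: use a Unicode font set up
  \usepackage{fontspec}
\else                % pdfTeX: traditional 8-bit encodings
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi
\begin{document}
straße
\end{document}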

Written by Joseph Wright

January 28th, 2015 at 9:31 pm

Posted in LaTeX


LaTeX2e and Unicode engines: the detail

with one comment

As I mentioned in my last post, the LaTeX team are working on various small but important improvements to the LaTeX2e kernel. One area we are looking at is adjusting how the ‘vanilla’ format works with Unicode engines. I’ve been asked for a bit more detail on this area, so I’ll try to fill in what’s going on with the ‘newer’ engines.

To date, the ‘vanilla’ LaTeX format (latex.ltx and associated files) has been pretty much engine-neutral with no attempt to differentiate anything other than to deal with differences between TeX2 (7-bit, released 1982) and TeX3 (8-bit, released 1990). However, the LaTeX formats that almost all users actually load are not just made by running

<engine> -ini latex.ltx

(or similar). The ‘format builders’ [principally the TeX Live team and Christian Schenk (MiKTeX] use a series of .ini files for building formats. For example, pdflatex.inicurrently says

% Thomas Esser, 1998. public domain.
\input pdftexconfig.tex
\scrollmode
\input latex.ltx
\endinput

(The .ini files are in the main common to both TeX Live and MiKTeX.) For building a pdfLaTeX format that makes sense: pdftexconfig.tex just sets up things related to direct PDF output as opposed to working in DVI mode. Things get more complicated, though, when we look at the Unicode engines: some of the set up is really ‘general’ and should be present in all LaTeX-based formats with these engines.
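
For reference, pdftexconfig.tex does something along these lines (a rough sketch, not a verbatim copy of the file):

\pdfoutput=1                % produce PDF directly rather than DVI
\pdfpagewidth=210 true mm   % default page size (A4 here)
\pdfpageheight=297 true mm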

Both XeTeX and LuaTeX work with the entire Unicode range, so they need information on things like case mapping (\lccode/\uccode) and Unicode math handling (\Umathcode). The LaTeX format includes a \dump at the end, so without hacking about no code can be added after it’s loaded. More importantly, as XeTeX builds hyphenation into the format in the same way as ‘classical’ TeX, the \lccode data needs to be right before the format loads the patterns. However, that can’t be done just by reading data before latex.ltx: it sets up the 8-bit range for the T1 encoding scheme. That’s an issue as the hyphenation patterns are nowadays stored in Unicode form: the stuff that happens ‘behind the scenes’ therefore (quite reasonably) assumes that the Unicode engines can read these files with no ‘trickery’. To accommodate this, at the moment you’ll find that xelatex.ini includes

\input unicode-letters
% disable the \dump in latex.ltx
\expandafter\let\csname saved-dump-cs\endcsname\dump
\let\dump=\relax
\scrollmode
\input latex.ltx

and later

% Because latex.ltx sets up character code tables for T1 encoding by default,
% we need to reset values from unicode-letters that may have been overridden
\begingroup
\catcode`\@=11 \count@=128 % reset chars "80-"FF to category "other", no case mapping
\loop \ifnum\count@<256
  \global\uccode\count@=0 \global\lccode\count@=0
  \global\catcode\count@=12 \global\sfcode\count@=1000
  \advance\count@ by 1 \repeat
\def\C #1 #2 #3 {\global\uccode"#1="#2 \global\lccode"#1="#3 } % case mappings (non-letter)
\def\L #1 #2 #3 {\global\catcode"#1=11 % category: letter
  \C #1 #2 #3 % with case mappings
  \ifnum"#1="#3 \else \global\sfcode"#1=999 \fi % uppercase letters have sfcode=999
  \global\XeTeXmathcode"#1="7"01"#1 % BMP letters default to class 7 (var), fam 1
  }
\def\l #1 {\L #1 #1 #1 } % letter without case mappings
[data lines]
\endgroup
\expandafter\let\expandafter\dump\csname saved-dump-cs\endcsname
\dump

There are some slight differences for lualatex.ini, but the general idea is the same. Needing to ‘hack around’ the kernel like this is not ideal, and the team are much keener on making it a documented feature that the Unicode engines are set up for a Unicode encoding (‘UC’) rather than for T1. (I’ll probably return to Unicode encodings in another context in a later post.)

As well as this important area, there are some things that are ‘tacked on’ to the formats by the .ini files but which apply only to one of either XeTeX or LuaTeX. For XeTeX, there is a need to manage the \XeTeXinterchartoks system, for which xelatex.ini currently does

%
% Allocator for \XeTeXintercharclass values, from Enrico Gregorio 
%
\catcode`\@=11
\newcount\xe@alloc@intercharclass % allocates intercharclass
\xe@alloc@intercharclass=\thr@@ % from 4 (1,2 and 3 are used by CJK, AFAIK)
\def\xe@alloc@#1#2#3#4#5{\global\advance#1\@ne
 \xe@ch@ck#1#4#2% make sure there's still room
 \allocationnumber#1%
 \global#3#5\allocationnumber
 \wlog{\string#5=\string#2\the\allocationnumber}}
\def\xe@ch@ck#1#2#3{%
 \ifnum#1<#2\else
  \errmessage{No room for a new #3}%
 \fi}
\def\newXeTeXintercharclass{%
 \xe@alloc@\xe@alloc@intercharclass\XeTeXintercharclass\chardef\@cclv} %at most 254
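
Once the allocator exists, a package can grab a class and attach token insertion to transitions between classes. A sketch of the sort of use that is made of this (the class name and the particular effect here are made up purely for illustration):

\newXeTeXintercharclass\mypunctclass           % hypothetical class for ':'
\XeTeXcharclass`\: = \mypunctclass
\XeTeXinterchartokenstate = 1                  % switch the mechanism on
\XeTeXinterchartoks 0 \mypunctclass = {\thinspace} % thin space before ':' after an ordinary character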

For LuaTeX, there are a couple of things in lualatex.ini that should be in the format. First, there is a difference in how this engine handles negative values of \endlinechar compared with other TeX engines. That requires a patch to LaTeX2e’s \@xtypein. More importantly, LuaTeX only activates the extensions to TeX if some Lua code is used

\begingroup
\catcode`\{=1
\catcode`\}=2
\directlua{
  % etex and pdftex primitives are enabled without prefixing
  % as well as extented Unicode math primitives (see below)
  tex.enableprimitives('', 
    tex.extraprimitives('etex', 'pdftex', 'umath'))
  % other primitives are prefixed with luatex (see below)
  tex.enableprimitives('luatex', 
    tex.extraprimitives('core', 'omega', 'aleph', 'luatex'))
  }
\endgroup

This has to come right at the start of the build process, but is another thing that can sensibly go into latex.ltx. The team also wonder if all of the primitives should have their ‘natural’ names without the luatex prefix.
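
The practical effect of the prefixing is that, for example, \pdfstrcmp and the \U... math primitives keep their natural names, while the LuaTeX-specific \attribute registers are reached as \luatexattribute. A quick check of which naming scheme is in force (a sketch):

\ifdefined\luatexattribute
  \message{LuaTeX-specific primitives available under the 'luatex' prefix}%
\else\ifdefined\attribute
  \message{LuaTeX-specific primitives available under their natural names}%
\fi\fi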

All of this can be added to latex.ltx without altering what users have available and without breaking LaTeX2e for pdfTeX users. The team have these changes made in the development version of the kernel. There are other things yet to be finalised, but it’s highly likely the next release of the LaTeX2e kernel will (finally) recognise the Unicode engines and bring this stuff ‘in house’.

Written by Joseph Wright

January 17th, 2015 at 11:36 pm

Posted in LaTeX

Fixing LaTeX2e

with 3 comments

When LaTeX2e was first released in 1994 a lot of work had been done to avoid breaking existing LaTeX2.09 documents while allowing changes such as the package and font selection systems. The stability of LaTeX as demonstrated by that approach is one reason it’s been a success. However, there is also a need to allow for change: the world does not stand still. While the LaTeX2e kernel is not about to alter radically, the team are looking to address some areas where the needs of today mean that change (or at least adaptation) is the right approach. David Carlisle talked about this at the UK-TUG meeting in November: here I’m going to try to look at the same issues in my own way. An important note before I start: the fixes I’m talking about here are all important but they are not about to change LaTeX2e into something else!

Kernel modifications

Over the years various bugs and issues have come up in the LaTeX2e kernel. Out-and-out bugs get fixed, but issues which are more about ‘code design’ are more tricky. There’s a tension between sorting these out and having the kernel ‘stable’, so not altering existing documents at all. The approach the team have taken to this to date is a package called fixltx2e. It contains ideas that really should go into the kernel but haven’t as they might alter existing documents. The idea is then that most people should really use these fixes in the form

\RequirePackage{fixltx2e}
\documentclass...

The problem: most people don’t do that, or load fixltx2e half-way through a preamble, or use it with packages that were not tested both with and without the fixes. That’s not a great position.

What we are looking at now is moving to a situation where the fixes are in the kernel as standard but with a mechanism to back them out. The details still need to be finalised, but the general plan is that once we make the change people will get the fixes without needing to take any action. If a document really has to be completely unchanged we’ll provide an ‘undo’ package with a way of setting the date that the kernel should be rolled-back to: that way you’ll be able to say ‘I always want the kernel as it was on … even if any fixes at all are made later’. We hope that will be a good balance.
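
The shape of the ‘undo’ mechanism was not fixed at the time of writing, but the idea is that a document would pin the kernel to a date in much the same way as the fixes are loaded today. Something along these lines (purely illustrative: this is essentially the approach that later appeared as the latexrelease package, but the exact name and syntax were still open here):

\RequirePackage[2014/05/01]{latexrelease} % roll the kernel back to its state on this date
\documentclass...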

Register allocation

Classical TeX provides 256 registers of each type. That limit was raised by the e-TeX extensions, which were finalised in 1999 and give us 32768 of the main register types (more on that nuance in a bit). While the team have used the extensions for many years in some packages, the LaTeX kernel itself still uses the classical TeX allocation system. That means that you can run into the

No room for a new ....

error even though there is lots of space. Loading the etex package

\RequirePackage{etex}
\documentclass...

modifies the allocation system to use those extra registers, but a lot of non-expert users don’t know this. So again we have a situation where a change in the kernel is the best plan.

What we are looking at here is the obvious solution: extending the register allocators in the LaTeX2e kernel ‘out of the box’ as long as the e-TeX extensions are available. That should be a transparent change for almost everyone, and will still allow etex to be loaded.

One minor wrinkle is inserts. e-TeX doesn’t extend how many inserts TeX has: there are still only 256. LaTeX2e doesn’t actually need many inserts as floats are handled without them (or without needing one insert per float), but at present the code for making floats does allocate inserts. The best solution here is to change what the kernel does so it no longer uses \newinsert to make floats: that will let us provide more float storage with basically no ‘cost’.

Unicode Engines

The Unicode engines XeTeX and LuaTeX have been with us for a few years now, and quite a lot of what they need to do at the format level is well-established. At the moment, the format-building routines make some changes ‘around’ the core latex.ltx file to accommodate these requirements: the code supplied by the team doesn’t ‘know’ about these newer engines. We’re therefore looking to address that by adding some conditional code.

The first area to tackle overlaps with the point above: LuaTeX extends the register allocation again beyond e-TeX, while XeTeX needs an allocator for \XeTeXinterchartoks. Both of these can readily be added to an updated allocation system.

The bigger impact of Unicode engines is that they have a different requirement from 8-bit engines in setting up the codes TeX uses for case changing. The LaTeX2e kernel sets up the \lccode and \uccode for the 8-bit range and assumes T1 encoding. With the newer engines, that’s not really great as they use Unicode code points and (almost certainly) Unicode (EU1/2) encodings. The format builders alter these assumptions using something of a hack, so we are looking to add the appropriate conditionals to the format itself. For end users that won’t really show, but it will mean that the format itself will be ‘in control’ here: something we are keen to work on.

LuaTeX extras

As well as the issues it shares with XeTeX, LuaTeX introduces ideas such as Lua callbacks and \attribute allocation. These areas are still somewhat ‘in flux’: the team currently feel that we need to get some consensus from the community (particularly active package authors) before adding anything here. However, it’s important that we get people thinking.

Conclusions

The changes we are looking at for LaTeX2e should help keep things ‘ticking over’ in the kernel and offer some new abilities to end users. At the same time, they should move more of the kernel people see ‘in the wild’ back into the control of the team: something we are keen on as we need to be able to fix the bugs. We’re hoping to check in the code for these changes soon: expect requests for testing!

Written by Joseph Wright

December 28th, 2014 at 10:55 pm

Posted in LaTeX

TUG Membership

with 2 comments

While TeX and all of the supporting ideas are free (both in monetary terms and intellectually), supporting that is a lot of effort from a range of volunteers and hard cash for parts of the infrastructure behind it. A key component of making all of that work is TUG: the worldwide TeX user group. TUG is the central point for co-ordinating a range of activities: running the TUG conference series, supporting TeX development, producing TeX Live and hosting mailing lists, to name a few.

Those of us in TUG have recently had a mail from the President pointing to a slightly concerning trend: a slow but perceptible drop in membership. That doesn’t mean there are fewer TeX users about: the accessibility of modern TeX systems means that there are a lot of TeX users (see for example the popularity of the TeX StackExchange site). That accessibility means that users don’t need to join a user group to use TeX, so there is something of a challenge.

To encourage people to take up membership, and of course take advantage of the benefits, TUG have launched a membership campaign. The aim is to encourage existing members to look out for new recruits, and of course to remind us that TUG is only as strong as its membership. So if you are a member, remind your fellow TeX users to join TUG, and if you are not in TUG: why not?

Written by Joseph Wright

December 11th, 2014 at 8:43 pm

Posted in General


A new list for TeX meetings

without comments

Keeping track of which TeX meetings are going on can be tricky. To help us all keep up, Karl Berry has just set up a new mailing list: TeX meetings. The idea is simple: it gives everyone a single place to post notices of upcoming meetings, and so to track what is happening in the TeX world. I’ve joined up (of course), and I’d encourage everyone else to do so as well. It should make life a lot easier, particularly if we can get a good take-up from the people organising meetings.

Written by Joseph Wright

November 21st, 2014 at 8:42 am

Posted in General


Beamer overlays beyond the \visible

without comments

I wrote earlier this year about using the beamer overlay concept with relative slide specifications to produce dynamic slide structures. Another question about overlays came up recently on TeX StackExchange, but this time wanting to do something a bit different.

The ‘standard’ beamer overlay system does the same as the \visible command: makes things appear and disappear, but always keeps space for them on the slide. However, beamer also provides \only, which completely omits items not visible on a slide. So the question was how to combine this idea with the general overlay concept.

It turns out that this is all quite straight-forward if you know what to look for. The standard beamer overlay syntax, for example

\item<+->

extends to include an action type to specify what the overlay should do. That is given as a keyword and an @ before the overlay number(s). So for example

\begin{itemize}
  \item First item
  \item<only@1> Second item
  \item<only@2> Replacement second item
...

will show Second item on the first slide then replace it entirely with Replacement second item on the second slide. That approach can be combined with the idea of relative slide specs, as I talked about before, to give something like

\documentclass{beamer}
\begin{document}
   \begin{frame}
   \begin{itemize}[<+->]
      \item item 1
      \item item 2
      \item<only@+-.(2)> item 3
      \item item 4
      \item item 5
   \end{itemize}

   \end{frame}
\end{document}

to have the ‘normal’ items appear one at a time but with item 3 only on slides 3 and 4.

This doesn’t just apply to only: other keywords that work here include visible and alert. The latter tends to be seen with another syntax element: | to separate out appearance from a second action. A classic example of that is

\documentclass{beamer}

\begin{document}
   \begin{frame}
   \begin{itemize}[<+->]
      \item item 1
      \item item 2
      \item<+-|alert@+(1)> item 3
      \item item 4
      \item item 5
   \end{itemize}

   \end{frame}
\end{document}

where item 3 appears on the third slide and is highlighted on the fourth one. (Note that both + substitutions in this line use the same value for the pause counter, hence needing the (1) offset.) That’s useful even without the ‘one at a time’ effect, with for example

\documentclass{beamer}

\begin{document}
   \begin{frame}
   \begin{itemize}
      \item item 1
      \item item 2
      \item<alert@+(1)> item 3
      \item item 4
      \item item 5
   \end{itemize}

   \end{frame}
\end{document}

highlighting the item on the second slide.
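
The same pattern works with the visible keyword: for example (a sketch)

\item<visible@2-> item 3

keeps the space for the item on every slide but only prints it from the second slide onwards; unlike the default overlay behaviour it is not affected by \setbeamercovered{transparent}.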

A bit of imagination with this syntax can cover almost any appearance/disappearance/highlight requirement. As I said before: the key thing is not to overdo it!

Written by Joseph Wright

October 13th, 2014 at 8:00 pm

Posted in beamer

Reworking and exposing siunitx internals

with 6 comments

I’ve been talking for a while about working on a new major version of siunitx. I’ve got plans to add some new features which are difficult or impossible to deliver using the v2 set up, but here I want to look at what’s perhaps more important: the back end, the programming set up and related matters.

I’ve now made a start on the new code, working first on what I always think of as the core of siunitx: the unit processor. If you take a look at the new material and compare it with the existing release, the first thing that should be obvious is that I’ve finally made a start on splitting everything up into different sub-parts. There are at least a couple of reasons for this. First, the monolithic .dtx for v2 is simply too big to work with comfortably. More importantly, though, the package contains a lot of different ideas and some of them are quite useful beyond my own work. To ensure that these are available to other people, it would seem best to make the boundaries clear, and separate sources help with that.

That leads onto the bigger picture change that I’m aiming for. As regular readers will know, I wrote the first version of siunitx somewhat by accident and in an ad hoc fashion. Working on v2, I decided to make things more organised and also to use expl3, which I’d not really looked at before. So the process of writing the second version was something of a learning experience. At the same time, expl3 itself has firmed up a lot over the time I’ve been working with it. As such, the current release of siunitx has rather a lot of rough edges. In the new code, I’m working from a much firmer foundation in terms of conventions, coding ideas and testing implementations. So for v3 I’m aiming to do several things. A key one for prospective expl3 programmers is the idea of defined interfaces. Rather than making everything internal, this time I’m documenting code-level access to the system. That means doing some work to have clearly defined paths for information to pass between sub-modules, but that’s overall a good thing. I’m also using the LaTeX3 team’s new testing suite, l3build, to start setting up proper code tests: these are already proving handy.

The net result of the work should be a better package for end users but also extremely solid code that can be used by other people. I’m also hopeful that the ideas will be usable with little change in a ‘pure’ LaTeX3 context. Documenting how things work might even have a knock-on effect in emulating siunitx in say MathJax. Beyond that, I’ve viewed siunitx as something of a sales pitch for expl3, and providing a really top-class piece of code is an important part of that. If I can get the code level documentation and interfaces up to the standard of the user level ones, and improve the user experience at the same time, I think I’ll be doing my job there.

Written by Joseph Wright

September 18th, 2014 at 9:28 pm

Posted in LaTeX3,siunitx


River Valley videos on the move

without comments

Many readers will be familiar with River Valley, a typesetting company with a long-standing interest in TeX and related technologies. One of the things they do is great work videoing meetings in the area of publishing, technology, XML and all kinds of related things. I had an e-mail a couple of days ago from Kaveh Bazargan, the head of River Valley, to let me know that the videos are ‘on the move’ to a new site: http://river-valley.zeeba.tv/: I’ll be altering my links in the blog.

Written by Joseph Wright

September 6th, 2014 at 8:11 am

Posted in General


Case changing: solving the challenges in TeX

with 2 comments

I wrote recently about handling UTF-8 input in Lua, and in particular the fact that doing text manipulation needs a bit of care. One area that I’ve been looking at is case changing operations. We’ve been considering this for expl3, so I thought it would be worth going through the area in a bit of detail. I’m going to mainly focus on the results rather than the implementation: the latter is important when it affects the output but not really otherwise (except for the team!).

Background

The first thing to think about is what case changing is needed for. We’ll see in a bit that TeX uses ‘case changing’ for something very different from what we might think of as changing case in ‘text’. First, though, let’s look at what those ‘normal’ requirements are. The Unicode Consortium have looked at this in detail: take a look at the standard for the full story. The common situations are:

  • ‘Removing’ the case from text to allow ‘caseless’ comparisons (‘case-folding’). This is primarily used ‘internally’ by code, and traditionally tends to be handled by simply lower-casing everything before the comparison. The Unicode approach has some slight differences between case-folding and lower-casing, but it’s relatively straight-forward.

  • Upper-casing ‘text’. Here, all characters that have a case mapping are changed to the upper-case versions. That’s a relatively simple concept, but there is a bit more to it (as we’ll see).

  • Title- or sentence-casing ‘text’. The concept here is usually implemented by upper-casing the first character of a phrase, or of each word, then lower-casing the rest. Again, the Unicode specs have a bit more to say on this: there are some characters that should not simply be upper-cased at the start of a word in this context but instead need a special ‘title-case’ character. (For example, in Dutch the digraph ‘ij’ at the start of a word should have both letters upper-cased.)

Just to make life a bit more fun, there are also some language-dependent rules for case changing, and some places where the outcome of a case change depends on the context (sigma at the end of words is the most obvious example). So there are a few challenges if we want to cover all of this in TeX. We’ve also got to think about the ‘TeX angle’: what does ‘text’ mean, how do we handle math mode, etc.

TeX primitives

TeX provides two primitives for changing case, \lowercase and \uppercase. These are powerful operations, and in particular are very often used for something that’s got very little to do with case at all: making characters with non-standard category codes. As that isn’t a ‘real’ case change at all, I won’t look at it further here, other than noting that it means we need those primitives for something even if we do case changing another way entirely!

Sticking with changing case of ‘text’, \uppercase and \lowercase rely on the fact that each character has a one-to-one mapping for upper- and lower-casing (defined by \uccode and \lccode). Assuming these are not ‘do nothing’ mappings, they allow a simple replacement of characters

\uppercase{hello} => HELLO
\lowercase{WORLD} => world

With XeTeX and LuaTeX, these mappings are set up for all sensible UTF-8 codepoints (‘characters’). However, they are one-to-one mappings with no context-awareness: that makes it impossible to cover some parts of the Unicode definitions I’ve mentioned (at least using the primitives directly). They also change everything in the input, which makes handling things like

\uppercase{Some text $y = mx + c$}

a bit tricky (there are ways, of course!).
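
One classical workaround is to hide the maths inside a macro: \uppercase only acts on character tokens and does not expand anything, so the contents of the macro pass through untouched. A sketch (\mymath is simply a name made up for the example):

\def\mymath{$y = mx + c$}
\uppercase{Some text \mymath}% typesets 'SOME TEXT' followed by the unchanged maths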

Another TeX concern is ‘expandability’: \uppercase and \lowercase are not expandable. That means that while we can do

\uppercase{\def\foo{some text}}

and have \foo defined as SOME TEXT, the apparently ‘obvious’ alternative

\edef\foo{\uppercase{some text}}

doesn’t have the expected result (\foo here is defined as \uppercase{some text}). Moreover, it means we can’t use the primitives inside places where TeX requires expansion. As a result, things like

\csname\lowercase{Some-input}\endcsname

result in an error. Of course, there are always ways around the problem, but I think it looks a lot ‘nicer’ for the user if a way can be found to do these operations expandably. As we’ll see in a bit, that is doable if we accept a few restrictions.

Case folding

If we want to implement case changing without using \lowercase and \uppercase then we have to have some form of iterative mapping over the ‘text’ input. Doing that while keeping the code expandable is doable if we accept a few restrictions, which I’ll come to in a bit. One to mention now is that the code here assumes e-TeX is available and that we have the \pdfstrcmp primitive or equivalent functionality: pdfTeX, XeTeX and LuaTeX all cover these requirements.

For ‘case-folding’ we can make some simplifications which make this the most straight-forward situation to set up. First, case-folding is a one-to-one change with no context-dependence: nice and easy. Secondly, as this is needed only for ‘internal’ stuff and not for ‘text’ to be typeset we can assume that everything can be handled as a (TeX) string by applying \detokenize. That avoids issues with things like escaping math mode, brace groups and the like. Setting up an expandable mapping is then relatively straight-forward, and the issue becomes simply how we actually change the case of each character.

With a list of over 1000 possible characters to case-fold, comparing each and every one to find a hit would get slow. Luckily, Bruno Le Floch spotted that we can divide up that long list into ‘bite sized’ chunks by using the last two digits of the character code of the input, giving 100 short lists, each of which is short enough to search through quickly. (For those interested in the internals, the final comparison is done using \str_case:nnF, which is an expandable string-based selection using \pdfstrcmp.)

Putting everything together leads to the documented interface

\str_fold_case:n { <input> }

which does exactly what it says: folds the case of the input, which is treated as a string. The only real point to note here is that with pdfTeX it doesn’t make sense to talk about UTF-8 as the engine doesn’t support it. Thus the changes here are restricted to ASCII (A-Z): for a string that’s clear, but life is a bit more murky for ‘text’ input. I’ll come back to that below.
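
As the function is expandable, a typical use is normalising input on the fly, for example inside \message or an \edef. A sketch using the name from the post (in current expl3 the same function is also available as \str_casefold:n):

\ExplSyntaxOn
\message { \str_fold_case:n { LaTeX } } % writes 'latex' to the terminal and log
\ExplSyntaxOff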

Case changing

Real case changing provides a few more challenges. Looking first at the Unicode definitions, there are both context- and language-dependent rules to worry about. It turns out that there are relatively few of these, so a bit of work with some hard-coding seems to cover most of them. That does require a bit of ‘bending the rules’ to fit in with how TeX parses stuff, so there may yet be more work to do here!

As we are now looking at text which might have a variety of TeX tokens in it then doing the mapping raises issues. It turns out that we can do an expandable mapping provided we accept that any brace groups end up with { and } as the grouping tokens even if that wasn’t true to start with (a bit of an edge-case but we have to specify these things!). (Note that this does require both e-TeX and \pdfstrcmp, so it’s not true for ‘classical’ TeX.) However, that raises an interesting issue: should stuff inside braces be case changed or not? At the moment, we’ve gone for ‘no’, as that’s very much like the BibTeX approach

title = {Some text with including a {Proper-Name}}

which also makes the code a bit easier to write. However, it’s not quite clear if this is the best plan: I’ll point to one open question below.

Another question is what category codes should apply in the output. For the folding case, it was easy: everything is treated as a string so the output is too. That’s not the situation for general text, but at the same time it seems sensible to assume that you are case changing things that will be typeset (‘letters’). Again, this is rather more of a conceptual than a technical question.

Answering these questions, or at least taking a documented position on them, it’s possible to define functions such as

\tl_lower_case:n { <text> }
\tl_upper_case:nn { <language> } { <text> }

that implement the case changing I’ve outlined. As this is very much a ‘work in progress’ those names are not fixed: there’s a feeling that perhaps \text_... might be more ‘sensible’ (the input should be ‘well-behaved’). What’s needed is some testing: we think the idea is a good one, but at the moment it’s not clear we’ve got all of the ideas right!

Notice the versions that know about languages: the idea is that these will get things like Turkish dotted/dotless-i correct. Of course, that assumes you know the language the input is in, but hopefully that’s normally true!
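
For example (a sketch, using the experimental names from the post: in current expl3 the corresponding functions are \text_uppercase:n and \text_uppercase:nn):

\ExplSyntaxOn
\tl_upper_case:n { hello~world }       % HELLO WORLD
\tl_upper_case:nn { tr } { istanbul }  % ISTANBUL with a dotted capital I, following the Turkish rule
\ExplSyntaxOff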

One thing to note here is again the pdfTeX case. As we are dealing with ‘engine native’ input, it’s only set up to do changes for the ASCII range. That’s fine, but it leaves open the question of LICR text. For example,

 \tl_upper_case:n { \'{e} }

currently doesn’t do anything as there are braces around the e. I’m not sure what’s best: skipping brace groups is generally easier for the user, but they would probably be surprised by this outcome! (With XeTeX or LuaTeX, the input would hopefully be é so the problem doesn’t arise.)

Conclusions

Case changing is a tricky thing to get right. We’ve made some progress in providing a ‘clear’ interface in expl3 that can cover not only UTF-8 input but also language-dependence. What’s needed now is some testing and feedback: we hope these things are useful!

Written by Joseph Wright

July 10th, 2014 at 9:21 am

Posted in LaTeX3


LuaTeX: Manipulating UTF-8 text using Lua

with 2 comments

Both the XeTeX and LuaTeX engines are natively UTF-8, which makes input of non-ASCII text a lot easier than with pdfTeX (certainly for the programmer: inputenc hides a lot of complexity for the end user!). With LuaTeX, there is the potential to script in Lua as well as program in TeX macros, and that of course means that you might well want to do manipulation of that UTF-8 input in Lua. What might then catch you out is that it’s not quite as simple as all that!

Lua itself can pass around arbitrary bytes, so input in UTF-8 won’t get mangled. However, the basic string functions provided by Lua are not UTF-8 aware. The LuaTeX manual cautions

The string library functions len, lower, sub, etc. are not UNICODE-aware.

As a result, applying these functions to anything outside the ASCII range is not a good idea. At best you might get unexpected output, so

tex.print (string.lower ("Ł"))

simply prints Ł unchanged (with the right font set up). Worse, you can get an error, as for example

tex.print (string.match ("Ł","[Ł]"))

results in

! String contains an invalid utf-8 sequence.

which is not what you want!

Instead of using the string library, the current correct approach here is to use slnunicode. Again, the LuaTeX manual has some advice:

For strings in the UTF-8 encoding, i.e., strings containing characters above code point 127, the corresponding functions from the slnunicode library can be used, e.g., unicode.utf8.len, unicode.utf8.lower, etc.

and indeed

tex.print(unicode.utf8.lower("Ł"))

does indeed print ł. There are still a few things to watch, though. The LuaTeX manual warns that unicode.utf8.find returns a byte range and that unicode.utf8.match and unicode.utf8.gmatch fall back on non-Unicode behaviour when an empty capture (()) is used. Both of those can be allowed for, of course: they should not be big issues.
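
The byte-versus-character distinction behind the unicode.utf8.find caveat is easy to see directly (a sketch):

tex.print(string.len("straße"))        -- 7: the ß occupies two bytes in UTF-8
tex.print(unicode.utf8.len("straße"))  -- 6: six characters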

There’s still a bit of complexity for two reasons. First, there’s not really much documentation on the slnunicode library, so beyond trying examples it’s not so easy to know what ‘should’ happen. For example, case-changing in Unicode is more complex than a simple one-to-one mapping, and can have language-dependencies. I’ll probably return to that in another post for a TeX (or at least XeTeX/LuaTeX) take on this, but in the Lua context the problem is it’s not so clear quite what’s available! In a way, the second point links to this: the LuaTeX manual tells us

The slnunicode library will be replaced by an internal UNICODE library in a future LuaTeX version.

which of course should lead to better documentation but at the price of having to keep an eye on the situation.

Overall, provided you are aware that you have to think about what you are doing, using Lua with Unicode works well: it’s just not quite as obvious as you might expect!

Written by Joseph Wright

July 8th, 2014 at 1:01 pm

Posted in General
