## Reworking and exposing siunitx internals

I’ve been talking for a while about working on a new major version of siunitx. I’ve got plans to add some new features which are difficult or impossible to deliver using the v2 set up, but here I want to look at perhaps what’s more important: the back end, programming set up and related matters.

I’ve now made a start on the new code, working first on what I always think of as the core of siunitx: the unit processor. If you take a look at the new material and compare it with the existing release the first thing that should be obvious is that I’ve finally made a start on splitting everything up into different sub-parts. There are at least a couple of reasons for this. First, the monolithic .dtx for v2 is simply too big to work with comfortably. More importantly, though, the package contains a lot of different ideas and some of them are quite useful beyond my own work. To ensure that these are available to other people, it would seem best to make the boundaries clear, and separate sources helps with that.

That leads onto the bigger picture change that I’m aiming for. As regular readers will know, I wrote the first version of siunitx somewhat by accident and in an ad hoc fashion. Working on v2, I decided to make things more organised and also to use expl3, which I’d not really looked at before. So the process of writing the second version was something of a learning experience. At the same time, expl3 itself has firmed up a lot over the time I’ve been working with it. As such, the current release of siunitx has rather a lot of rough edges. In the new code, I’m working from a much firmer foundation in terms of conventions, coding ideas and testing implementations. So for v3 I’m aiming to do several things. A key one for prospective expl3 programmers is the idea of defined interfaces. Rather than making everything internal, this time I’m documenting code-level access to the system. That means doing some work to have clearly defined paths for information to pass between sub-modules, but that’s overall a good thing. I’m also using the LaTeX3 teams new testing suite, l3build, to start setting up proper code tests: these are already proving handy.

The net result of the work should be a better package for end users but also extremely solid code that can be used by other people. I’m also hopeful that the ideas will be usable with little change in a ‘pure’ LaTeX3 context. Documenting how things work might even have a knock-on effect in emulating siunitx in say MathJax. Beyond that, I’ve viewed siunitx as something of a sales pitch for expl3, and providing a really top-class piece of code is an important part of that. If I can get the code level documentation and interfaces up to the standard of the user level ones, and improve the user experience at the same time, I think I’ll be doing my job there.

## River Valley videos on the move

Many readers will be familiar with River Valley, a typesetting company with a long-standing interested in TeX and related technologies. One of the things they do is great work videoing meetings in the area of publishing, technology, XML and all kinds of related things. I had an e-mail a couple of days ago from Kaveh Bazargan, the head of River Valley, to let me know that videos are ‘on the move’ to a new site: http://river-valley.zeeba.tv/: I’ll be altering my links in the blog.

## Case changing: solving the challenges in TeX

I wrote recently about handling UTF-8 input in Lua, and in particular the fact that doing text manipulation needs a bit of care. One area that I’ve been looking at recently is doing case changing operations. We’ve been looking at this for expl3, so I thought it would be worth looking at this in a bit of detail. I’m going to mainly focus on the results rather than implementation: the latter is important when it affects the output but not really otherwise (except for the team!).

## Background

The first thing to think about is what case changing is needed for. We’ll see in a bit that TeX uses ‘case changing’ for something very different from what we might think of as changing case in ‘text’. First, though, let’s look at what those ‘normal’ requirements are. The Unicode Consortium have looked in detail at this: take a look at the standard for all of the detail. The common situations are:

• ‘Removing’ the case from text to allow ‘caseless’ comparisons (‘case-folding’). This is primarily used ‘internally’ by code, and tends traditionally to be handled by simply lower casing everything before some comparison. The Unicode approach has some slight differences between case-folding and lower-casing, but it’s relatively straight-forward.

• Upper-casing ‘text’. Here, all characters that have a case mapping are changed to the upper-case versions. That’s a relatively simple concept, but there is a bit more to it (as we’ll see).

• Title- or sentence-casing ‘text’. The concept here is usually implemented by upper-casing the first character of a phrase, or of each word, then to lower-case the rest. Again, the Unicode specs have a bit more to say on this: there are some character(s) that should not be upper-cased at the start of a word in this context but need a special ‘title-case’ character. (For example, in Dutch ‘IJ’ at the start of words should both be upper-cased.)

Just to make life a bit more fun, there are also some language-dependent rules for case changing, and some places where the outcome of a case change depends on the context (sigma at the end of words is the most obvious example). So there are a few challenges if we want to cover all of this in TeX. We’ve also got to think about the ‘TeX angle’: what does ‘text’ mean, how do we handle math mode, etc.

## TeX primitives

TeX provides two primitives for changing case, \lowercase and \uppercase. These are powerful operations, and in particular are very often used for something that’s got very little to do with case at all: making characters with non-standard category codes. As that isn’t a ‘real’ case change at all, I won’t look at it further here, other than noting that it means we need those primitives for something even if we do case changing another way entirely!

Sticking with changing case of ‘text’, \uppercase and \lowercase rely on the fact that each character has a one-to-one mapping for upper- and lower-casing (defined by \uccode and \lccode). Assuming these are not ‘do nothing’ mappings, they allow a simple replacement of characters

\uppercase{hello} => HELLO
\lowercase{WORLD} => world


With XeTeX and LuaTeX, these mappings are set up for all sensible UTF-8 codepoints (‘characters’). However, the are one-to-one mapping with no context-awareness: that makes it impossible to cover some parts of the Unicode definitions I’ve mentioned (at least using the primitives directly). They also change everything in the input, which makes handling things like

\uppercase{Some text $y = mx + c$}


a bit tricky (there are ways, of course!).

Another TeX concern is ‘expandability’: \uppercase and \lowercase are not expandable. That means that while we can do

\uppercase{\def\foo{some text}}


and have \foo defined as SOME TEXT, the apparently ‘obvious’ alternative

\edef\foo{\uppercase{some text}}


doesn’t have the expected result (\foo here is defined as \uppercase{some text}). Moreover, it means we can’t use the primitives inside places where TeX requires expansion. As a result, things like

\csname\lowercase{Some-input}\endcsname


result in an error. Of course, there are always ways around the problem, but I think it looks a lot ‘nicer’ for the user if a way can be found to do these operations expandably. As we’ll see in a bit, that is doable if we accept a few restrictions.

## Case folding

If we want to implement case changing without using \lowercase and \uppercase then we have to have some form of iterative mapping over the ‘text’ input. Doing that while keeping the code expandable is doable if we accept a few restrictions, which I’ll come to in a bit. One to mention now is that the code here assumes e-TeX is available and that we have the \pdfstrcmp primitive or equivalent functionality: pdfTeX, XeTeX and LuaTeX all cover these requirements.

For ‘case-folding’ we can make some simplifications which make this the most straight-forward situation to set up. First, case-folding is a one-to-one change with no context-dependence: nice and easy. Secondly, as this is needed only for ‘internal’ stuff and not for ‘text’ to be typeset we can assume that everything can be handled as a (TeX) string by applying \detokenize. That avoids issues with things like escaping math mode, brace groups and the like. Setting up an expandable mapping is then relatively straight-forward, and the issue becomes simply how do with actually change the case of each character.

With a list of over 1000 possible characters to case-fold, comparing each and every one to find a hit would get slow. Luckily, Bruno Le Floch spotted that we can divide up that long list into ‘bite sized’ chunks by using the last two digits of the character code of the input, giving 100 short lists, each of which is realistic just to look through. (For those interested in the internals, the final comparison is done using \str_case:nnF, which is an expandable string-based selection using \pdfstrcmp.)

Putting everything together lead to the documented interface

\str_fold_case:n { <input> }


which does exactly what it says: folds the case of the input, which is treated as a string. The only real point to note here is that with pdfTeX it doesn’t make sense to talk about UTF-8 as the engine doesn’t support it. Thus the changes here are restricted to ASCII (A-Z): for a string that’s clear, but life is a bit more murky for ‘text’ input. I’ll come back to that below.

## Case changing

Real case changing provides a few more challenges. Looking first at the Unicode definitions, there are both context- and language-dependent rules to worry about. It turns out that there are relatively few of these, so a bit of work with some hard-coding seems to cover most of them. That does require a bit of ‘bending the rules’ to fit in with how TeX parses stuff, so there may yet be more work to do here!

As we are now looking at text which might have a variety of TeX tokens in it then doing the mapping raises issues. It turns out that we can do an expandable mapping provided we accept that any brace groups end up with { and } as the grouping tokens even if that wasn’t true to start with (a bit of an edge-case but we have to specify these things!). (Note that this does require both e-TeX and \pdfstrcmp, so it’s not true for ‘classical’ TeX.) However, that raises an interesting issue: should stuff inside braces be case changed or not? At the moment, we’ve gone for ‘no’, as that’s very much like the BibTeX approach

title = {Some text with including a {Proper-Name}}


which also makes the code a bit easier to write. However, it’s not quite clear if this is the best plan: I’ll point to one open question below.

Another question is what category codes should apply in the output. For the folding case, it was easy: everything is treated as a string so the output is too. That’s not the situation for general text, but at the same time it seems sensible to assume that you are case changing things that will be typeset (‘letters’). Again, this is rather more of a concepts than a technical question.

Answering these questions, or at least taking a documented position on them, it’s possible to define functions such as

\tl_lower_case:n { <text> }
\tl_upper_case:nn { <language> } { <text> }


that implement the case changing I’ve outlines. As this is very much a ‘work in progress’ those names are not fixed: there’s a feeling that perhaps \text_... might be more ‘sensible’ (the input should be ‘well-behaved’). What’s needed is some testing: we thing the idea is a good one, but at the moment it’s not clear we’ve got all of the ideas right!

Notice the versions that know about languages: the idea is that these will get things like Turkish dotted/dotless-i correct. Of course, that assumes you know the language the input is in, but hopefully that’s normally true!

One thing to note here is again the pdfTeX case. As we are dealing with ‘engine native’ input, it’s only set up to do changes for the ASCII range. That’s fine, but it leaves open the question of LICR text. For example,

 \tl_upper_case:n { \'{e} }


currently doesn’t do anything as there are braces around the e. I’m not sure what’s best: skipping brace groups is generally easier for the user, but they probably would be surprise by this outcome! (With XeTeX or LuaTeX, the input would hopefully be é so the problem doesn’t arise.)

## Conclusions

Case changing is a tricky thing to get right. We’ve made some progress in providing a ‘clear’ interface in expl3 that can cover not only UTF-8 input but also language-dependence. What’ needed now is some testing and feedback: we hope these things are useful!

## LuaTeX: Manipulating UTF-8 text using Lua

Both the XeTeX and LuaTeX engines are natively UTF-8, which makes input of non-ASCII text a lot easier than with pdfTeX (certainly for the programmer: inputenc hides a lot of complexity for the end user!). With LuaTeX, there is the potential to script in Lua as well as program in TeX macros, and that of course means that you might well want to do manipulation of that UTF-8 input in Lua. What might then catch you out is that it’s not quite as simple as all that!

Lua itself can pass around arbitrary bytes, so input in UTF-8 won’t get mangled. However, the basic string functions provided by Lua are not UTF-8 aware. The LuaTeX manual cautions

The string library functions len, lower, sub, etc. are not UNICODE-aware.

As a result, applying these functions to anything outside the ASCII range is not a good idea. At best you might get unexpected output, so

tex.print (string.lower ("Ł"))


simply prints in Ł (with the right font set up). Worse, get an error as for example

tex.print (string.match ("Ł","[Ł]"))


results in

! String contains an invalid utf-8 sequence.


which is not what you want!

Instead of using the string library, the current correct approach here is to use slnunicode. Again, the LuaTeX manual has some advice:

For strings in the UTF-8 encoding, i.e., strings containing characters above code point 127, the corresponding functions from the slnunicode library can be used, e.g., unicode.utf8.len, unicode.utf8.lower, etc.

and indeed

tex.print(unicode.utf8.lower("Ł"))


does indeed print ł. There are still a few things to watch, though. The LuaTeX manual warns that unicode.utf8.find returns a byte range and that unicode.utf8.match and unicode.utf8.gmatch fall back on non-Unicode behaviour when an empty capture (()) is used. Both of those can be be allowed for, of course: they should not be big issues.

There’s still a bit of complexity for two reasons. First, there’s not really much documentation on the slnunicode library, so beyond trying examples it’s not so easy to know what ‘should’ happen. For example, case-changing in Unicode is more complex than a simple one-to-one mapping, and can have language-dependencies. I’ll probably return to that in another post for a TeX (or at least XeTeX/LuaTeX) take on this, but in the Lua context the problem is it’s not so clear quite what’s available! In a way, the second point links to this: the LuaTeX manual tells us

The slnunicode library will be replaced by an internal UNICODE library in a future LuaTeX version.

which of course should lead to better documentation but at the price of having to keep an eye on the situation.

Over all, provided you are aware that you have to think, using Lua with Unicode works well, it’s just that it’s not quite as obvious as you might expect!

## Testing TeX: Lua and TeX, and not just for LuaTeX

I wrote recently about the LaTeX3 build scripts, and that we are moving them to Lua for cross-platform work. One particular area the team are interested in is ‘unit testing’, something that’s common in ‘real’ programming but not so widespread for (La)TeX. The main reason for that is obvious: most people programming TeX aren’t professionals but instead do it as an add-on to their ‘real’ jobs (in my case, chemistry).

The LaTeX3 team have had unit tests for LaTeX for many years. The way these work is quite simple. First, set up a TeX run which does whatever tests are needed and write the output to the log file. The raw logs tend to be rather long, and have information in that varies from system to system, so the second stage is to run a script to extract out just the ‘important’ parts of the log (those get marked up by the TeX set up). The ‘processed’ log can then be compared to one prepared in a ‘reference’ run (where someone has checked the results by hand): if the two results match, the test passes.

Up to now, we’ve used Perl for that ‘log processing’ and have only run tests using pdfTeX. Moving to Lua for our scripting, we can drop Perl and do the post-processing in the build scripts themselves. That makes almost everything self-contained: other than needing Info-ZIP for making CTAN release zip files, all that is needed is a TeX system (featuring LuaTeX for texlua) and the operating system itself (basic file operations but also diff/fc). At the same time, we’re expanding our tests to run them with pdfTeX, XeTeX and LuaTeX. That’s already thrown up several bugs in LuaTeX: nothing most people will notice most of the time, but reported to the developers and to be fixed. (Most of these are about formatting in the log: if you test based on log changes, these are important!)

While the scripts aren’t fully ‘portable’ as they are designed around our development set up, the structures should be pretty clear. The LaTeX test script is also quite general. So we’d like to hope that other people can adopt and adapt them: feedback always very welcome!

## Lua for LaTeX3 build scripts

Anyone who follows the LaTeX3 GitHub pages will have seen a lot of checkins recently, not only for the code itself but also for a new set of build scripts. The work should hopefully make us more efficient but also impact on others, directly or indirectly.

Once of the first things I did when I joined the LaTeX3 Project was to write a load of Windows batch scripts for building the code. At the time, most of the rest of the team were using Unix (Linux, Macs, …) but I was using Windows, so this was quite important. Since then, the team’s IT set up has varied a bit but so have the requirements for our scripts, so I’ve ended up working quite a bit with both the batch files and the Unix Makefiles.

The ‘two sets of scripts’ approach has a few problems. The most obvious is that we’ve had to keep the two in synch: not always entirely successful, and not always remembered! At the same time, the ‘standard’ facilities available on the two systems are different: we’ve had to require Perl on all platforms, and work out how best to make everything ‘looks the same’.

Some recent discussion prompted me to consider some new work on the scripts, but with Lua now available anywhere that has a TeX system (as texlua for script work), it seemed natural to move to an entirely new set up. The plan has been to have one ‘master’ script for the entire LaTeX3 repository, rather than the ‘copy-paste’ arrangement we’ve had to date, and to use very simple ‘configuration’ files for each separate module. That’s worked really well: we’ve now got one big file covering almost everything, no longer need Perl and have also addressed some of the things that prompted this in the first place! (More on those in another post soon.)

Even if you are not using LuaTeX for day-to-day TeX work, the scripting ability for supporting development is very handy. Lua doesn’t cover the OS-dependent stuff, but using our existing script code and a small amount of detection we’ve been able to work with that. The new scripts are hopefully nice and clear, more flexible than the old ones and most importantly only need changing in one place for each issue.

So how does this impact on others? First, it makes it easier for the team to work on LaTeX3, which should be a good thing all round. The scripts should also mean that we don’t have the odd strange change depending on which operating system is used to do a release. Second, we’d like to hope that other TeX developers can take ideas from our scripts for their own work. We’re particularly interested in testing TeX code, and I’ll come back to that in my next post. Last, and linked to testing, we are picking up some ‘interesting’ engine-based issues. Those are getting reported, and even if you never use LaTeX we’d hope that will be a benefit!

## Biblatex: more versatile shorthand lists

One of the useful features of biblatex is shorthands, which can be defined in a BibTeX database and listed in the bibliography. A long-standing request has been to make this even more powerful by allows several different kinds of shorthand, for example for journal abbreviations, general abbreviations, etc. This ability has now been added to the development version of the package, by generalising shorthands to ‘biblists’. Of course, new features always need testing, so it would be great if interested users would grab the code and try it out!

## A place for (PGF) plot examples

Not content with running The LaTeX Community, TeXample and texdoc.net, blogging on TeX matters and being a moderator on TeX Stack Exchange, Stefan Kottwitz has now started a new site: pgfplots.net. The idea for the new site is simple: it’s a place to collect great examples of plots, (primarily) made using using the excellent pgfplots package. Why do this? Plots are just graphics, but they are a very special form of graphic with particular requirements. As a working scientist, I really appreciate the need for well-presented, carefully-constructed plots: they can make (or break) a paper.

At the moment, the selection of plots is of course quite small: the site is new and the room for ‘artistic’ work is perhaps a little more limited than in the TeXample gallery. I’m sure it will soon grow, and we can all pick up a trick or two! (Don’t worry: there will certainly be a few plots for chemists. Indeed, you might already spot some.)

## Work on siunitx v3

I recently posted a few ‘notes to myself’ about future directions in siunitx development. With them down in print, I’ve been giving them some serious thought and have made a proper start on work on version 3 of the package. I’m starting where I’m happiest: the unit parser and related code, and am working on proper separation of different parts of the code. That’s not easy work, but I think it should give me a good platform to build on. I’m also working hard to make the new code show ‘best practice’ in LaTeX3 coding: the plan is to have much richer documentation and some test material to go with the new code. Looking forward, that should make creating a ‘pure’ LaTeX3 units module pretty easy: it will be a minor set of edits from what I’m working on now.

I’ve got a good idea of the amount of work I need to do: there are about 17k lines in the current siunitx.dtx, which comes out to around 7.5k lines of code. That sounds like a lot, but as much of what I need to do is more editing that writing from scratch I’m hoping for an alpha build of version 3 some time this summer.

## siunitx development: Notes for my future self

Working on siunitx over the years, I’ve learnt a lot about units and how people want to typeset them. I’ve also realised that I’ve made a few questionable choices in how I’ve tackled various problems. With one eye to future LaTeX3 work, and another to what I might still improve in siunitx, I thought it would be interesting to make a few notes about what I’ve picked up.

1. Sticking to math mode only would be a good idea. The flexibility that the package offers in terms of using either math or text mode is very popular, but it makes my life very tricky and actually makes some features impossible (TeX only allows us to reliably escape from math mode by using a box). It’s also got some performance implications, and the more I’ve thought about it, the more I realise that it was probably not the best plan to allow both choices.

2. A different approach to the \boldmath issue would be sensible. Currently, one of the reasons I use a switch from text to math mode internally is that it allows ‘escaping’ from \boldmath. However, that’s probably not the best plan, as it again introduces some restrictions and performance hits, and I think is very unlikely to actually be helpful!

3. The default number parser should be simple and quick: complex parsing should be an option. As it stands, the parser in siunitx is quite tricky as it does lots of things. A better approach would be to only deal with digit separation ‘out of the box’ (so not really parsing at all), and to allow things like uncertainties, complex numbers and the like as add-ons.

4. Tables really need a separate subpackage. Dealing with tables of numbers was never really my aim, and I think much clearer tack would be to have some of the internals of the number parse accessible ‘publicly’, then build the table functionality up separately.

5. The unit parser works very well: don’t change it! Although people ask me mainly about numbers and tables, the real business end of siunitx is the unit parser/formatter. It’s basically spot-on, with only really minor tune-ups needed.

Probably most of this has to wait for a ‘real’ LaTeX3 numbers/units bundle: I can’t break existing documents. However, I’ve got a few ideas which can be implemented when I get the time: watch this space.