LuaTeX: Manipulating UTF-8 text using Lua

Both the XeTeX and LuaTeX engines are natively UTF-8, which makes input of non-ASCII text a lot easier than with pdfTeX (certainly for the programmer: inputenc hides a lot of complexity for the end user!). With LuaTeX, there is the potential to script in Lua as well as program in TeX macros, and that of course means that you might well want to do manipulation of that UTF-8 input in Lua. What might then catch you out is that it’s not quite as simple as all that!

Lua itself can pass around arbitrary bytes, so input in UTF-8 won’t get mangled. However, the basic string functions provided by Lua are not UTF-8 aware. The LuaTeX manual cautions

The string library functions len, lower, sub, etc. are not UNICODE-aware.

As a result, applying these functions to anything outside the ASCII range is not a good idea. At best you might get unexpected output, so

tex.print (string.lower ("Ł"))

simply prints in Ł (with the right font set up). Worse, get an error as for example

tex.print (string.match ("Ł","[Ł]"))

results in

! String contains an invalid utf-8 sequence.

which is not what you want!

Instead of using the string library, the current correct approach here is to use slnunicode. Again, the LuaTeX manual has some advice:

For strings in the UTF-8 encoding, i.e., strings containing characters above code point 127, the corresponding functions from the slnunicode library can be used, e.g., unicode.utf8.len, unicode.utf8.lower, etc.

and indeed

tex.print(unicode.utf8.lower("Ł"))

does indeed print ł. There are still a few things to watch, though. The LuaTeX manual warns that unicode.utf8.find returns a byte range and that unicode.utf8.match and unicode.utf8.gmatch fall back on non-Unicode behaviour when an empty capture (()) is used. Both of those can be be allowed for, of course: they should not be big issues.

There’s still a bit of complexity for two reasons. First, there’s not really much documentation on the slnunicode library, so beyond trying examples it’s not so easy to know what ‘should’ happen. For example, case-changing in Unicode is more complex than a simple one-to-one mapping, and can have language-dependencies. I’ll probably return to that in another post for a TeX (or at least XeTeX/LuaTeX) take on this, but in the Lua context the problem is it’s not so clear quite what’s available! In a way, the second point links to this: the LuaTeX manual tells us

The slnunicode library will be replaced by an internal UNICODE library in a future LuaTeX version.

which of course should lead to better documentation but at the price of having to keep an eye on the situation.

Over all, provided you are aware that you have to think, using Lua with Unicode works well, it’s just that it’s not quite as obvious as you might expect!

Lua for LaTeX3 build scripts

Anyone who follows the LaTeX3 GitHub pages will have seen a lot of checkins recently, not only for the code itself but also for a new set of build scripts. The work should hopefully make us more efficient but also impact on others, directly or indirectly.

Once of the first things I did when I joined the LaTeX3 Project was to write a load of Windows batch scripts for building the code. At the time, most of the rest of the team were using Unix (Linux, Macs, …) but I was using Windows, so this was quite important. Since then, the team’s IT set up has varied a bit but so have the requirements for our scripts, so I’ve ended up working quite a bit with both the batch files and the Unix Makefiles.

The ‘two sets of scripts’ approach has a few problems. The most obvious is that we’ve had to keep the two in synch: not always entirely successful, and not always remembered! At the same time, the ‘standard’ facilities available on the two systems are different: we’ve had to require Perl on all platforms, and work out how best to make everything ‘looks the same’.

Some recent discussion prompted me to consider some new work on the scripts, but with Lua now available anywhere that has a TeX system (as texlua for script work), it seemed natural to move to an entirely new set up. The plan has been to have one ‘master’ script for the entire LaTeX3 repository, rather than the ‘copy-paste’ arrangement we’ve had to date, and to use very simple ‘configuration’ files for each separate module. That’s worked really well: we’ve now got one big file covering almost everything, no longer need Perl and have also addressed some of the things that prompted this in the first place! (More on those in another post soon.)

Even if you are not using LuaTeX for day-to-day TeX work, the scripting ability for supporting development is very handy. Lua doesn’t cover the OS-dependent stuff, but using our existing script code and a small amount of detection we’ve been able to work with that. The new scripts are hopefully nice and clear, more flexible than the old ones and most importantly only need changing in one place for each issue.

So how does this impact on others? First, it makes it easier for the team to work on LaTeX3, which should be a good thing all round. The scripts should also mean that we don’t have the odd strange change depending on which operating system is used to do a release. Second, we’d like to hope that other TeX developers can take ideas from our scripts for their own work. We’re particularly interested in testing TeX code, and I’ll come back to that in my next post. Last, and linked to testing, we are picking up some ‘interesting’ engine-based issues. Those are getting reported, and even if you never use LaTeX we’d hope that will be a benefit!

Scripting in TeXworks

Stefan Löffler has recently posted to the TeXworks mailing list that he’s sorted out a patch to integrate Lua into TeXworks. Many people will be aware that Lua is very much the scripting language of the moment in the TeX world, because of the LuaTeX project. So it makes sense to consider it as a method for scripting TeXworks. The idea has always been that TeXworks will have a simple interface but powerful features available to those who want them. So adding scripting is a vital step forward. With a light-weight scripting system available, power users can code their own features into TeXworks while leaving the basics accessible to everyone.

That doesn’t mean that Lua has to be the (only) language available: Jonathan Kew (TeXworks lead developer) has suggested QtScript (which is JavaScript-like) as a possibility. TeXworks is built using Qt, so there is a definite logic here. As Jonathan himself points out, actually having a patch available certainly means that Lua is likely to be integrated!

“Programming in Lua”

Over the week-end, the copy of Programming in Lua that I’d ordered arrived at the bookshop. So I’ve been able to make at least a small start on the book, and get some of the ideas into my head. Having looked at the first 40 or so pages, I can see why Lua has been chosen for integration with TeX as LuaTeX. Some TeX concepts are also Lua concepts (such as the scope of variables and the ability to re-define anything), and so the fit between the two languages looks good. How that translates in practice I’ll have to see.

What I have found to date is that you really don’t want to be a beginner if you pick up Programming in Lua. The book starts with a lot of pretty complicated programming ideas, and without some background I’d be in trouble. Luckily I’ve picked up bits and pieces over the years (from Basic, C and Perl, amongst other things), and so I’m not totally lost. But for a book aimed at people starting off with the language, it is tough going.

The arrival of the book happily coincides with the release of LuaTeX 0.36.0, as announced on the LuaTeX mailing list. There are a number of changes aimed at making more changes in the future, but at least I can now begin to understand what some of them mean!