Regular expressions

Regular expressions are very popular as a quick and powerful way to carry out searches and replacements in text of all sorts. Traditionally, TeX handles tokens and not strings or characters. This means that doing regex searches using TeX82 is pretty much impossible. To solve this, recent versions of pdfTeX add the \pdfmatch primitive to allow real string matching inside TeX. The LuaTeX team have decided not to take all of the existing “new” primitives forward from pdfTeX, and as I understand it \pdfmatch will not be implemented in LuaTeX. However, Lua itself has pattern matching (its own, lighter-weight take on regular expressions), and so the functionality will still be around.
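
To make the contrast concrete, here is a minimal sketch of the same test done both ways in one plain TeX file. It assumes a pdfTeX new enough to have \pdfmatch (which takes a POSIX-style pattern followed by the string to test), and a LuaTeX in which \directlua takes no state number. The sample string and pattern are my own. I have stuck to a bracket class rather than Lua's % classes, as % would need special handling inside \directlua.

```tex
% Plain TeX: run with pdftex or luatex and look at the terminal or log.
\ifx\directlua\undefined
  % pdfTeX route: \pdfmatch expands to 1 (match), 0 (no match)
  % or -1 (invalid pattern), so it can be used inside \ifnum.
  \message{\ifnum\pdfmatch{^[0-9]+$}{12345}=1 match\else no match\fi}
\else
  % LuaTeX route: hand the job to Lua's own pattern matching.
  \directlua{texio.write_nl(string.match("12345", "^[0-9]+$") and "match" or "no match")}
\fi
\bye
```

Both branches should report a match for this input, but note that the two pattern languages only happen to agree here: anything beyond the simplest patterns needs checking against each manual.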

I’ve recently talked about adding new primitives to XeTeX, and you’ll see that \pdfmatch was not on the list for adding to XeTeX. The reason is that a XeTeX implementation would have to be slightly different from pdfTeX, as it is natively UTF-8, but also would be different to LuaTeX, as it would still be a TeX primitive and not a Lua function. So here “the prize wasn’t worth the winning”, in my opinion. As it is, using \pdfmatch is not widespread, and the idea of having three different regex methods inside TeX didn’t seem like a great idea!

Talking of regex implementations, I’ve been reading Programming in Lua, and also working with TeXworks to try to get syntax highlighting the way I like it. Both systems are slightly different, and it seems both are different from the Perl implementation. It seems that every time you want to use a regex system you have to read the manual to see which things are different from every other implementation!

Lua, TeX, LaTeX and ConTeXt

There is currently a very active thread on comp.text.tex about LuaTeX. Several related questions have come up, and it seems that some kind of summary would help me, at least.

Why did the LuaTeX team choose Lua?

Possibly the most frequently asked question with LuaTeX is why the team behind it chose Lua, as opposed to one of the other scripting languages available. The LuaTeX FAQ addresses this as question 1, so it’s obviously important. As I understand it, they looked at a number of possible languages for extending TeX, and decided that many of the “obvious” candidates (Java, Perl, Python, Ruby, Scheme) were not suitable. Not everyone is going to agree with that assessment, and some people are bound to feel that there was not sufficient consultation.

With LuaTeX already at version 0.40, it is progressing and will deliver. So unless a team can be put together to provide an alternative scripting system, it looks like Lua is what we will get. I don’t see there being enough experienced programmers with time on their hands to deliver a complete alternative at the moment.

How does LuaTeX relate to ConTeXt and LaTeX?

LuaTeX is an “engine”, like pdfTeX or XeTeX. So it provides certain functions, which might or might not get used. The end-user can use them directly, but that does not help most people, who don’t want to do this type of low-level work. So in the main it is down to either the TeX format (ConTeXt, LaTeX, …) or an add-on (LaTeX package, ConTeXt module, etc.) to take advantage of them.

In the ConTeXt case, the entire format is being re-written as “Mark IV”, which will use a lot of Lua. This of course means things fundamentally change compared to early versions of ConTeXt, and ties it to a single engine.

In the LaTeX case, the format is written to work on Knuth’s original TeX, with no engine extensions at all. That is not about to change: essentially, LaTeX2e will never be “updated” in that sense. The current LaTeX3 plans don’t envisage requiring LuaTeX, although I’d hope that some low-level support will be included if LaTeX3 ever becomes a reality. There will, though, be LaTeX packages that use Lua: I’m sure that at some point there will be a fontspec-like package for LuaTeX, for example.

What is wrong with LaTeX2ε?

One issue that always comes up when future directions for TeX are discussed is whether LaTeX2ε needs to change, or whether it is okay as it is. Currently, if you know what you are doing then you can achieve a lot with LaTeX. There is a vast range of packages, and these cover very many of the things you could ever hope to do in TeX. So in a sense LaTeX works as it is. On the other hand, you have to know which packages to load, even to get some of the basics right (try making a new float type without loading any packages and using one of the base classes!). At the same time, things like UTF-8 text, system fonts and so on are not available using only the kernel.

A more fundamental issue is that LaTeX currently doesn’t do enough to provide structured data, and is not a good choice for dealing with XML input. This makes it hard to get data in and out, and is closely related to the fact that there is not enough separation of appearance and structure in the kernel and in add-on packages.

Neither of these points has escaped the notice of the LaTeX team. The question is whether a successor to LaTeX2ε will appear, and if it does whether it can succeed. There is a need to start from scratch with many things, meaning that a new LaTeX simply won’t work with most packages currently available. So the team have to deliver something that really works.

How can (La)TeX recruit new users?

One question that comes up here is what a typical (La)TeX user is anyway. Everyone has their own view on this: some people see TeX users mainly as mathematicians and physicists, while others point out that, particularly with XeTeX available, there is a lot of potential in the humanities.

Making life easier for newcomers is a priority if we want to gain new users. This means making it easy to install TeX (which both TeX Live and MiKTeX do, I think), making it easier to edit TeX files (the excellent TeXworks helps here), and making it easier to use TeX. It is, of course, the last one that is the problem. I think that we do need a new LaTeX as part of this. Removing the need to load a dozen packages and alter half a dozen internal macros to get reasonable output would benefit all LaTeX users. At the same time, something as simple as a generic thesis template could be made available: that would again help to “sell” LaTeX and by extension TeX as a whole.

LuaTeX has a role to play here. Using system fonts with pdfTeX is not easy, and XeTeX has helped a lot with this. LuaTeX provides this possibility and many more, and so we can hopefully look forward to not having to worry about loading system fonts at all. At the same time, it should be possible to do things similar to the current BibTeX and MakeIndex programs directly in the engine, but with full UTF-8 input. This is something that is long overdue, and again can’t hurt when it comes to selling (La)TeX.

More on XeTeX primitives

There has been a bit more work on the idea of adding primitives to XeTeX to match those available in pdfTeX. The list of pdfTeX primitives which look interesting has grown slightly, and now reads:

  • \ifincsname
  • \ifpdfprimitive
  • \pdfprimitive
  • \pdfshellescape
  • \pdfstrcmp
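
As an illustration of what one of these does, here is a small sketch using \ifincsname, which is true only while TeX is building a name between \csname and \endcsname. It assumes a pdfTeX recent enough to provide the primitive; the macro name \marker and the words it expands to are made up for the example.

```tex
% Plain TeX: run with a pdfTeX that provides \ifincsname.
% \marker is a hypothetical macro that expands differently depending
% on whether it is being used to build a control sequence name.
\def\marker{\ifincsname inside\else outside\fi}

% Ordinary expansion: the \else branch is taken.
\message{\marker}

% Expansion between \csname and \endcsname: the true branch is taken,
% so the control sequence built here is \inside.
\message{\expandafter\string\csname\marker\endcsname}
\bye
```

The practical use is for macros that normally produce formatted output but may also end up inside a \csname construction, where only plain characters are safe.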

At the same time, it would be useful to include the “extended” version of \vadjust which pdfTeX makes available. This is something that has been asked about before, and as with the rest of the changes the main issue is not the idea of doing it but the time for actual implementation.

The real need to have \pdfstrcmp available for LaTeX3 work means that some effort has actually gone into this. I’ve got no experience with either Pascal or the WEB format, but I’ve managed by dint of determination to get something passable to Jonathan Kew. There will need to be some adjustments, as XeTeX works with UTF-8 internally, which pdfTeX does not do. However, I’m hopeful that we will see new primitives in XeTeX soon.

Quite how the primitives will be named is still to be decided. The existing \pdf... naming does not really make sense with these non-PDF related functions. So they could end up as \XeTeX... or may just be given generic names. I’m leaving that to Jonathan!

Additional primitives for XeTeX

XeTeX has, over the past few years, made using TeX with multiple fonts and UTF-8 input easy. The work-flow using XeTeX is very much more accessible than the routes needed using pdfTeX or TeX82. So I’m sure that many people, like me, use XeTeX whenever they want to use arbitrary fonts or to write anything which doesn’t use western European characters.
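
For anyone who has not tried it, a minimal sketch of that work-flow with XeLaTeX and the fontspec package looks something like this. The font name is purely an assumption on my part: substitute any font installed on your system that covers the scripts you need.

```tex
% Compile with xelatex; save the file as UTF-8.
\documentclass{article}
\usepackage{fontspec}
% Assumption: a system font with this name is installed.
\setmainfont{DejaVu Serif}
\begin{document}
English, Ελληνικά and русский typed directly in one file.
\end{document}
```

There is no font installation step on the TeX side and no input-encoding set-up, which is exactly what makes this route so much more approachable.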

XeTeX is based on ε-TeX, which means it has a number of primitives which were not present in TeX82, but are present in ε-TeX itself or in pdfTeX (which also includes the ε-TeX primitives). However, ε-TeX was finalised over ten years ago, and since then the pdfTeX team have added a number of new primitives, many related directly to PDF output. At the same time, XeTeX includes its own new or extended primitive functions, in this case focussed on UTF-8 input. For the most part this does not concern people as things work fine.

Recently, there has been some testing of the current LaTeX3 code with XeTeX (and older versions of pdfTeX, which don’t have all of the newer primitives). LaTeX3 requires the ε-TeX extensions, which are as I said available with any modern TeX engine. However, when it’s available LaTeX3 also uses the \pdfstrcmp primitive: this is only present in newer versions of pdfTeX. For those people not familiar with \pdfstrcmp, it allows you to do string comparisons of text (not token comparisons), and in an expandable manner. This is very useful, and much better than doing things without it; with no \pdfstrcmp, comparisons are not expandable. It became clear that there is a danger of some things working when using newer versions of pdfTeX, but failing with older ones or with XeTeX. Failing with older versions of pdfTeX is one thing (the advice can simply be “sorry, you’ll have to update your pdfTeX”), but failing with XeTeX is simply not acceptable. After a bit of discussion, the best solution seemed to be to talk to Jonathan Kew about getting a very small number of “new” pdfTeX primitives into XeTeX.
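
The difference is easiest to see side by side. The sketch below is plain TeX and assumes a pdfTeX that has \pdfstrcmp, which expands to -1, 0 or 1 depending on how the two strings compare. The macro names \first, \second, \result, \compare, \tempa and \tempb are mine, and the \ifx approach shown is just one common way of managing without the primitive, not the LaTeX3 code itself.

```tex
% Plain TeX: run with a pdfTeX that provides \pdfstrcmp.
\def\first{abc}
\def\second{abc}

% With \pdfstrcmp the whole comparison is expandable,
% so it can sit inside an \edef.
\edef\result{\ifnum\pdfstrcmp{\first}{\second}=0 same\else different\fi}
\message{Using \string\pdfstrcmp: \result}

% Without it, one common approach stores both strings in macros and
% compares them with \ifx. The \edef assignments mean that \compare
% cannot itself be used inside an \edef or \message.
\def\compare#1#2{%
  \edef\tempa{#1}\edef\tempb{#2}%
  \ifx\tempa\tempb \def\result{same}\else \def\result{different}\fi}
\compare{\first}{\second}
\message{Using \string\ifx: \result}
\bye
```

Both routes report the strings as the same here, but only the first can be used in places where TeX is expanding and not executing.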

At the moment, things are still under discussion, but the list of additional primitives is going to be small (somewhere between 2 and 5 seems likely). I think it’s giving nothing away to say that \pdfstrcmp is one that really is needed (although the name might be an issue!). Another likely candidate is \ifincsname, which looks handy and also not too complex to implement. There are a few other suggestions, but I’m not sure just yet what will be really needed, as opposed to nice to have. What is clear is that this is a one-off request. Once these small gaps are filled, LaTeX3 will not be using other primitives for general functions. I’m not sure how long it will take to finalise things, both for the team to agree on what is needed and for Jonathan Kew to do the hard work, but I’d imagine a matter of weeks, not longer.

A XeTeX snag

Testing the current LaTeX3 code means worrying about what primitives are available. Recent releases of pdfTeX have included a number of new primitives, and the expl3 system currently uses \pdfstrcmp if it finds it. This has already caused one issue, as things were broken if it was not available. So I’ve been modifying the test system a little, to run the tests with pdfLaTeX and XeLaTeX. The “new” primitives are not available with XeTeX, so this is a good test of our code: it’s supposed to work with both. I’ve found one odd thing going on with the LaTeX3 code, but also one thing that seems to be a XeTeX snag.

The snag can be tracked down (thanks to Morten Høgholm) to a minimal case:

\message{\ifcat\par\relax T\else F\fi}

If you try this in TeX or pdfTeX, you get T in the log (as The TeXbook says you should). With XeTeX, you get F. Whether this is a bug or a feature, I don’t know. I’m sure Jonathan Kew will elaborate, but it reminds me of the need to test using different set-ups. You can imagine some very odd errors appearing with this type of thing!
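
If you want to repeat the comparison yourself, a slightly longer test file along these lines records which engine produced which answer. It is only a sketch: it assumes that \XeTeXversion and \XeTeXrevision are defined when, and only when, the file is run with XeTeX.

```tex
% Plain TeX: run the same file with tex, pdftex and xetex,
% then compare what each one writes to the log.
\begingroup
% \csname ... \endcsname yields \relax for an undefined name,
% so this is true on engines without \XeTeXversion.
\expandafter\ifx\csname XeTeXversion\endcsname\relax
  \message{Engine: not XeTeX.}
\else
  \message{Engine: XeTeX \the\XeTeXversion\XeTeXrevision.}
\fi
\message{Result: \ifcat\par\relax T\else F\fi}
\endgroup
\bye
```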

Supporting Windows users

Many “open” projects are developed mainly by people using Unix-based operating systems (Linux, Mac OS X, OpenBSD, etc.). This sometimes leads to a rather awkward situation for Windows users. Many of the tools that are assumed to be available (GCC, make, grep, …) simply are not. TeX is luckily cross-platform, and recent versions of TeX Live work hard to work on Windows as well as on *nix systems. However, that can still leave a few issues.

The LaTeX3 experimental code has a series of test files, make scripts and so on, to aid development. However, these only work if things like make and bash are actually available. So I’ve recently added a set of batch files to the source store, which hopefully do basically the same thing but using Windows tools. I’ve had to require Perl for the test scripts, as these need to be parsed and re-formatted. The batch files also need a command-line zip program: I tend to use 7-Zip. Hopefully, though, these are not too much to ask for!

In the longer term, LuaTeX should mean that auxiliary stuff can be done using Lua, as it will be available cross-platform. However, it could be many years before we reach that state!