Regular expressions

Regular expressions are very popular as a quick and powerful way to carry out searches and replacements in text of all sorts. Traditionally, TeX handles tokens and not strings or characters. This means that doing regex searches using TeX82 is pretty much impossible. To solve this, recent versions of pdfTeX add the \pdfmatch primitive to allow real string matching inside TeX. The LuaTeX team have decided not to take all of the existing “new” primitives forward from pdfTeX, and as I understand it \pdfmatch will not be implemented in LuaTeX. However, Lua itself has pattern matching in its string library (not quite full regular expressions, but close in spirit), and so the functionality will still be around.

I’ve recently talked about adding new primitives to XeTeX, and you’ll see that \pdfmatch was not on the list for adding to XeTeX. The reason is that a XeTeX implementation would have to be slightly different from the pdfTeX one, as XeTeX is natively UTF-8, but would also be different from LuaTeX, as it would still be a TeX primitive and not a Lua function. So here “the prize wasn’t worth the winning”, in my opinion. As it is, use of \pdfmatch is not widespread, and having three different regex methods inside TeX didn’t seem like a great idea!

Talking of regex implementations, I’ve been reading Programming in Lua, and also working with TeXworks to try to get syntax highlighting the way I like it. The two systems differ slightly from each other, and it seems both differ from the Perl implementation. It seems that every time you want to use a regex system you have to read the manual to see which things are different from every other implementation!

6 thoughts on “Regular expressions”

  1. Whenever the LuaTeX team cannot program something, they say that it has to be done in Lua! (Sorry for that, but it is what I feel!)

    • That’s one way to look at it. Of course, they have programmed it in the past: \pdfmatch is a regex matching primitive. I’m not sure anyone uses it, though!

  2. I’m not a big fan of regex. As with the TeX language, regular expressions feel fine for simple stuff but once you try to do something more complex… Well, if anything can challenge TeX in code obscurity, then regex language must be it. And the fact that, as you say, everybody seems to use a different flavour of this language doesn’t help either.

    BTW, do you know which flavour is supported by TeXworks? I thought it uses Qt specs but just the other day I played with syntax highlighting patterns and wanted to use look ahead/behind matching to colour environment names separately from begin / end commands. This should be supported according to Qt docs but it didn’t work for me.

  3. I’m not a big regex person either, but in TeX it is hard to do search and replace for strings rather than tokens. So I see why newer engines provide something.

    I’ve also had some struggles with the TeXworks regex engine: I also wanted some look behind things. There was some discussion on the mailing list, but I’m not sure what the result was. I think TeXworks just loads a regex parser, but perhaps the Qt version is different or something.

  4. After some more careful study of Qt docs I’ve found out that only lookahead assertions are supported but not lookbehind (which would be much more useful for TeX sources).
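To make the lookahead/lookbehind point concrete, here is a minimal sketch in Python. Python’s re module is used only because it supports both kinds of assertion; it is not the Qt engine that TeXworks uses, and the sample string and variable names are made up for illustration. The first pattern picks out just the environment name using lookbehind; the second does without lookbehind by matching the whole \begin{ or \end{ prefix and pulling the name out of a capture group.

```python
import re

# Hypothetical one-line TeX source used only for this illustration.
source = r"\begin{itemize} \item one \end{itemize}"

# With lookbehind: the match itself is only the environment name.
# Python's re requires each lookbehind to be fixed-width, so \begin{ and
# \end{ are written as two separate lookbehind alternatives.
with_lookbehind = re.compile(r"(?:(?<=\\begin\{)|(?<=\\end\{))[A-Za-z*]+(?=\})")

# Without lookbehind: match the \begin{ or \end{ prefix as well, and
# recover the name from capture group 1. Lookahead is still used for "}".
with_group = re.compile(r"\\(?:begin|end)\{([A-Za-z*]+)(?=\})")

names_lookbehind = [m.group(0) for m in with_lookbehind.finditer(source)]
names_group = [m.group(1) for m in with_group.finditer(source)]

print(names_lookbehind)
print(names_group)
```

Both lists contain the same environment names. Whether the capture-group approach helps in a syntax highlighter depends on whether that highlighter can colour a single group rather than the whole match, which is not something the comments above settle.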
