Programming LaTeX3: Token list variables

In the last post, I talked about the concept of a token list and some general functions which act on token lists. That’s fine if you just want to take some input and ‘do stuff’, but a very common requirement when programming is storing input, and for that we need variables. LaTeX3 provides a number of different types of variable: we’ll start with perhaps the most general of all, the token list variable.

Token list variables

So what is a token list variable (‘tl’)? You might well guess from the name that it’s a way of storing a token list! As such, a tl can be used to hold just about anything, and indeed this means that several of the other variable types we’ll meet later are tls with a special internal structure.

Before we can save anything in a tl, we need to create the variable: this is a general principle of programming LaTeX3. We can then store something inside the variable by setting it:

\tl_new:N \l_mypkg_name_tl
\tl_set:Nn \l_mypkg_name_tl { Fred }

Hopefully, the analysis of this code is not too hard. First, \tl_new:N creates a new token list variable which I’ve called \l_mypkg_name_tl. (I’ll explain how the naming works in a little while.) The second line will set the new tl to contain the text Fred. Assuming that the surrounding code has done nothing strange, we’ve stored four letter tokens in \l_mypkg_name_tl.

As I said, a tl can contain anything: we are not limited to letters. So

\tl_new:N \l_mypkg_other_tl
\tl_set:Nn \l_mypkg_other_tl { \ERROR ^ _ # $ ! }

is also perfectly valid for the content of a token list variable (although whether we’ll be able to use it safely is a different matter).
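A quick way to confirm what has been stored is to ask for the content to be shown on the terminal. The sketch below assumes the standard test document used elsewhere in this series and uses \tl_show:N, which pauses the run in the same way as the \show primitive:

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Declare the variable first, then store four letter tokens in it
\tl_new:N \l_mypkg_name_tl
\tl_set:Nn \l_mypkg_name_tl { Fred }
% Display the current content on the terminal and in the log
\tl_show:N \l_mypkg_name_tl
\ExplSyntaxOff
\begin{document}
\end{document}
```

Nothing is typeset here: the point is only to look at the variable while the code is being developed.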

Variable naming and TeX’s grouping system

From the earlier discussion of the way that functions are named in LaTeX3, it might be obvious that there is also a system to how variables are named. Skipping over the initial \l_, what we’ve got is a module name (mypkg), some further description of the nature of the variable (in this case name), and finally the variable type (tl), divided up by _ in exactly the same way we did for functions. We’ll see that other variables follow the same scheme.

So what’s the leading \l_ about? This tells us about the scope that we should use when setting the variable. As TeX is a macro expansion language, variables are not local to functions. However, they can be local to TeX groups, which are created in LaTeX3 using

\group_begin:
% Code here
\group_end:

Setting a variable locally means that any changes stay within a group

\tl_new:N \l_mypkg_name_tl
\tl_set:Nn \l_mypkg_name_tl { Fred }
\group_begin:
  \tl_set:Nn \l_mypkg_name_tl { Ginger }
\group_end:
% \l_mypkg_name_tl reverts to 'Fred'

On the other hand, we sometimes need global variables which ignore any groups

\tl_new:N \g_mypkg_name_tl
\tl_gset:Nn \g_mypkg_name_tl { Fred }
\group_begin:
  \tl_gset:Nn \g_mypkg_name_tl { Ginger }
\group_end:
% \g_mypkg_name_tl still 'Ginger'

So the \l_ or \g_ tells you what scope the variable contents have, and whether you should set or gset it. (You can probably work out that gset means ‘set globally’.)

Using the content of token list variables

Okay, putting stuff into token list variables is all very well and good, but unless we can do something with the content then it’s not really that useful. Of course, we can do things with the content of variables. The most basic thing to do is simply to insert the content of the tl into the input that TeX is working with

\tl_use:N \l_mypkg_name_tl
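With \tl_use:N available, the grouping behaviour described above can be checked by typesetting both variables after the group has ended. This is only a sketch: the names \mypkg_scope_demo: and \ScopeDemo are made up for the example, and \cs_new_protected:Npn is the variant of \cs_new:Npn normally chosen for functions that carry out assignments.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\tl_new:N \l_mypkg_name_tl
\tl_new:N \g_mypkg_name_tl
% Hypothetical demonstration function: sets both variables, changes
% them inside a group, then typesets what survives outside the group
\cs_new_protected:Npn \mypkg_scope_demo:
  {
    \tl_set:Nn  \l_mypkg_name_tl { Fred }
    \tl_gset:Nn \g_mypkg_name_tl { Fred }
    \group_begin:
      \tl_set:Nn  \l_mypkg_name_tl { Ginger }
      \tl_gset:Nn \g_mypkg_name_tl { Ginger }
    \group_end:
    Local:~  \tl_use:N \l_mypkg_name_tl ;~ % typesets 'Fred'
    Global:~ \tl_use:N \g_mypkg_name_tl    % typesets 'Ginger'
  }
\cs_new_eq:NN \ScopeDemo \mypkg_scope_demo:
\ExplSyntaxOff
\begin{document}
\ScopeDemo
\end{document}
```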

That’s very handy, but we can also examine the content of a token list variable. For example, we saw before that \tl_count:n will produce the length of a token list, and we can do the same for a token list variable using \tl_count:N.

\tl_set:Nn \l_mypkg_name_tl { Fred }
\tl_count:N \l_mypkg_name_tl % '4'

There’s a lot more we can do with token list variables, but this post is already long enough, so I’ll come back to more that we can do with them in the next post.

Tips for TeX programmers: the internals of token list variables

Experienced TeX programmers are probably wondering about token list variables, and in particular exactly what the underlying TeX structure is. A tl is just a macro that we are using as a variable rather than as a function. That should not be too much of a surprise, as storing tokens in macros is very much basic TeX programming. So \tl_set:Nn is almost the same as the \def primitive.

What might worry you slightly is that I said

\tl_new:N \l_mypkg_other_tl
\tl_set:Nn \l_mypkg_other_tl { \ERROR ^ _ # $ ! }

will work. That won’t work with \def, and you’d normally expect to need a token register (toks) for this. However, we don’t use toks for LaTeX3 programming at all, and that’s because we require e-TeX. So

\tl_set:Nn \l_mypkg_other_tl { \ERROR ^ _ # $ ! }

is actually the same as

\edef \l_mypkg_other_tl { \unexpanded { \ERROR ^ _ # $ ! } }

which will allow us to put any tokens inside a macro.
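One way to see this for yourself is to store the same tokens by both routes and write the meaning of each macro to the log. In the sketch below \mypkgprimitive is a made-up name, and both definitions sit inside \ExplSyntaxOn so that they are read with the same category codes:

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Route 1: the LaTeX3 interface
\tl_new:N \l_mypkg_other_tl
\tl_set:Nn \l_mypkg_other_tl { \ERROR ^ _ # $ ! }
% Route 2: the e-TeX primitives used directly (hypothetical macro name)
\edef \mypkgprimitive { \unexpanded { \ERROR ^ _ # $ ! } }
% Write both meanings to the terminal and log so they can be compared
\typeout { \meaning \l_mypkg_other_tl }
\typeout { \meaning \mypkgprimitive }
\ExplSyntaxOff
\begin{document}
\end{document}
```

The two lines written out should show the same stored tokens, with \ERROR never being expanded.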

The other thing you might notice is that I’ve said that tls have to be declared, even though at a TeX level this is not the case. This is a principle of good LaTeX3 programming, and although it’s not enforced as standard, any non-declared token list variables are coding errors. You can test for this using

\usepackage[check-declarations]{expl3}

which uses some slow checking code to make sure that all variables are declared before they are used.

Programming LaTeX3: Category codes, tokens and token lists

Understanding LaTeX3 programming relies on understanding TeX concepts, and one we need to get to grips with is how TeX deals with tokens. Experienced TeX programmers will probably find the first part of this post very straightforward, so might want to skim-read the start!

Category codes and tokens

When TeX reads input, it is not only the characters that are there that are important. Each character has an associated category code: a way to interpret that character. The combination of a character and its category code then sets how TeX will deal with the input. For example, when TeX reads ‘a’ it finds that it’s (normally) a letter, and so tokenizes the input as ‘a, letter’. This seems pretty obvious: ‘a’ is a letter, after all. But this is not fixed, at least for TeX. I’ve already mentioned that within the LaTeX3 programming environment : and _ can be part of function names: that’s because they are ‘letters’ while we are programming! What’s of most importance now is that a control sequence (something like \emph or \cs_new:Npn) is stored as a single token. So most of the time these can’t be divided up into their component characters: they act as a single item.
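To make the idea of changing category codes concrete, here is a small sketch that records the codes of : and _ both outside and inside the programming environment and then typesets them. The four macro names are invented for the example; \catcode is the TeX primitive that holds the category code of a character.

```latex
\documentclass{article}
\usepackage{expl3}
% Record the category codes in normal document syntax
\edef\ColonOutside{\the\catcode`\:}      % 12 ('other')
\edef\UnderscoreOutside{\the\catcode`\_} % 8 (subscript)
\ExplSyntaxOn
% Record them again inside the programming environment
\edef\ColonInside{\the\catcode`\:}       % 11 (letter)
\edef\UnderscoreInside{\the\catcode`\_}  % 11 (letter)
\ExplSyntaxOff
\begin{document}
Colon: \ColonOutside{} outside, \ColonInside{} inside.

Underscore: \UnderscoreOutside{} outside, \UnderscoreInside{} inside.
\end{document}
```

Category code 11 is the one TeX uses for letters, which is why both characters can appear in function names while we are programming.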

Token lists

The fact that TeX works with tokens means that most of the time we carry out operations on a token-by-token basis, rather than on strings. In LaTeX3 terminology, an arbitrary set of tokens is called a token list, which has both defined content and defined order. To get a better feel for how token lists work, we’ll apply a few basic token list functions to some simple input:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new:Npn \demo:n #1
  {
    \tl_count:n {#1} ;
    \tl_if_empty:nT {#1} { Empty! }
    \tl_if_blank:nTF {#1}
      { Blank! }
      {
        Head = \tl_head:n {#1} ;
        Tail = \tl_tail:n {#1} ;
        End
      }
  }
\cs_new_eq:NN \demo \demo:n
\ExplSyntaxOff
\newcommand*{\hello}{hello}
\begin{document}
\demo{Hello world}

\demo{ }

\demo{}

\demo{\hello}
\end{document}

Okay, what’s going on here? Well, as we saw last time, I’ve created a new function, in this case called \demo:n, which contains the code I want to use. In contrast to the last post, I’ve not used it directly but have instead used \cs_new_eq:NN to make a copy of this function but with a document-level name. This is a general LaTeX3 idea: the internals of your code should be defined separately from the interface (indeed, we’ll see later that there is a more formalised way of creating a document-level function). You can probably work out that \cs_new_eq:NN needs two arguments: the new function to create and the old one to copy. (For experienced TeX programmers, it will be no surprise that this is a wrapper around the \let primitive.)

Moving on to what \demo:n is doing, the first thing to see is that I’ve defined it with one argument, agreeing with the :n part of its name. I’ve then done some simple tests on the argument. The first is \tl_count:n, which will count how many tokens are in the input and simply output the result. You’ll notice that it’s ignored the space in Hello world: it’s a common feature of TeX that spaces are often skipped over. You can also see the space-skipping behaviour in the line where I feed \demo a space: the result has a ‘length’ of zero. Also notice that as promised \hello is only a single token. (There is an experimental function in LaTeX3 to count the length of a token list including the spaces. Most of the time, we’ll actually want to ignore them so we won’t worry about that here!)

We then have two conditionals, \tl_if_empty:nT and \tl_if_blank:nTF. First, we’ll look at what a conditional does in general, then at these two in particular. The LaTeX3 approach to conditionals is to accept either one or two arguments, which might read T, F or TF, so in general there are always three related functions:

\foo_if_something:nT
\foo_if_something:nF
\foo_if_something:nTF

The test is always the same for the three related versions, with the T and F parts telling us what code is used depending on the result of the test. So if we do a test and it’s true, the T code will be used if it’s there, and the F code will be skipped entirely, while if there is no T code then nothing happens. It’s of course the other way around when the test is false!

So what’s happening with \tl_if_empty:nT and \tl_if_blank:nTF? In the first test, we only print { Empty! } if there is nothing at all in the argument to \demo:n. If the argument is not empty, then this test does nothing at all. On the other hand, the \tl_if_blank:nTF test will print { Blank! } if the argument is either entirely empty or is only made up of spaces (so it looks blank). However, if it’s not blank then we apply two more functions.

The functions \tl_head:n and \tl_tail:n find the very first token and everything but the very first token, respectively. So \tl_head:n finds just the H of Hello world while \tl_tail:n finds ello world. I’ve only used them if the entire argument is not blank as they are not really designed to deal with cases where there is nothing to split up! You might wonder about the last test, where \demo{\hello} has hello as the head part and nothing as the tail. That happens because what is tested here is \hello, a single token, which is then turned into the text we see by TeX during typesetting. That can be avoided, but at this stage we’ll not worry too much!
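To see the three related forms side by side, the sketch below applies the T, F and TF versions of the same test to one argument. The names \mypkg_test:n and \TestEmpty are invented for the example, and the square brackets are only there to make an empty result visible.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical function: run the same test in its T, F and TF forms
\cs_new:Npn \mypkg_test:n #1
  {
    [ \tl_if_empty:nT  {#1} { T:~empty } ] ~
    [ \tl_if_empty:nF  {#1} { F:~not~empty } ] ~
    [ \tl_if_empty:nTF {#1} { TF:~empty } { TF:~not~empty } ]
  }
\cs_new_eq:NN \TestEmpty \mypkg_test:n
\ExplSyntaxOff
\begin{document}
\TestEmpty{}

\TestEmpty{abc}
\end{document}
```

With an empty argument the F form leaves its brackets empty, and with abc it is the T form that does nothing.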

Programming LaTeX3: Creating functions

Teaching a programming language traditionally starts with a method to print ‘Hello World’. For programming LaTeX3, we can’t quite start there as

\documentclass{article}
\begin{document}
Hello world
\end{document}

will happily do that without needing any programming. So I’ll start by printing ‘Hello World’ lots of times!

Our first function

LaTeX3 has a built-in method for creating multiple copies of text, which we could use directly. However, that would mean using a code-level macro in the document itself, and so I’ll create a wrapper macro. For this first example, I’ll include all of the document:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new:Npn \SayHello #1
  { \prg_replicate:nn {#1} { Hello~World!~ } }
\ExplSyntaxOff
\begin{document}
\SayHello{100}
\end{document}

This will give you, as promised, 100 copies of ‘Hello World!’.

So what is going on here? As you might work out, I’ve defined a new command called \SayHello which prints as many copies of ‘Hello World!’ as requested. Later on we’ll see that this is usually not how I’d choose to create a ‘document command’, but for the moment I’ll pass over that point so we can get some basics established.

The structure of function names

Getting down to detail, I’ve introduced two LaTeX3 functions here: \cs_new:Npn and \prg_replicate:nn. As promised, these use : and _ as ‘letters’ in their names. But what do they do? As you might guess from the names, \cs_new:Npn is used to create a new control sequence, while \prg_replicate:nn makes lots of copies of something (it replicates stuff). The naming convention for LaTeX3 is that the first part of the name (\cs_… or \prg_…) refers to the module the function comes from. So \cs_new:Npn is from the module for control sequences, which we abbreviate as cs, while \prg_replicate:nn is from the general programming utilities module, which is abbreviated as prg. For programmers working outside of the LaTeX3 kernel, a module is probably going to be the same as a LaTeX2e package. So the module part of the name is used to divide up code into related blocks: each module should use a unique prefix, and I’ll tend to use \mypkg… for demonstration purposes.

Up to the :, the rest of the name is up to the programmer and should help you understand what a function does. So \cs_new:Npn tells us that the function makes a new control sequence, and so is pretty similar to LaTeX2e’s \newcommand. We can have multiple parts to the name divided by _ for ease of reading. For example \cs_new_nopar:Npn is available for creating new functions which will give an error if they pick up a \par: this is similar to \newcommand*. You can probably work out the analysis of \prg_replicate:nn yourself!

The part of the name after the : is perhaps one of the most confusing ideas for new LaTeX3 programmers, especially if they are used to other languages. It’s called the argument specification or signature of the function, and tells us about the number and type of arguments a function takes. If you have experience in other programming languages, you’re probably wondering why we include this information in the function name. As we’ll see as we look in more detail at LaTeX3, this approach works as it reflects how TeX works.

So what do the different letters mean? Each letter (usually) represents one argument for a function. So \prg_replicate:nn with two letters after the : needs two arguments. (For those of you who haven’t come across arguments before, something like \maketitle takes no arguments, \emph needs one argument, \setlength takes two arguments, and so on.) The letter itself then tells us about the type of argument: n means tokens in braces (a ‘normal’ argument). In \cs_new:Npn, the n-type argument is the code which we are creating. An N means that the argument has to be a single token without any braces: in our current case this will be the name of the new function. The p is a bit more complicated: it means that the second argument here is a parameter specification. Here, we can use #1, #2, etc., to represent the arguments for the new function, in exactly the same way we do in the code. So when we use \SayHello, it will expect to find one argument, and will insert that into the place marked as #1 in the code part.
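As a further illustration of how the signature and the parameter specification line up, here is a sketch of a function taking two n-type arguments. The names \mypkg_greet:nn and \Greet are invented for the example; \cs_new_eq:NN simply makes a copy of the function under a document-level name.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% N: the new function name; p: '#1#2' declares two parameters;
% n: the braced code, in which #1 and #2 are replaced on use
\cs_new:Npn \mypkg_greet:nn #1#2
  { \prg_replicate:nn {#1} { Hello~#2!~ } }
\cs_new_eq:NN \Greet \mypkg_greet:nn
\ExplSyntaxOff
\begin{document}
\Greet{3}{Fred}
\end{document}
```

The :nn in the name matches the #1#2 in the definition: two letters, two arguments.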

Analysis of \prg_replicate:nn

The same analysis applies to \prg_replicate:nn, which we can now see needs two arguments, both in braces. The first one is the number of times to repeat, and the second argument is what to repeat. So in \SayHello the number of repetitions is supplied by the user (this will replace #1), but the text is fixed by the programmer.

The reference for finding out what functions are available, and what arguments they take, is interface3. I’ll only be covering a selection of what is available, so over time you’ll need to get familiar with the formal documentation to find out what you can do. If you take a look there, you’ll see that the first argument for \prg_replicate:nn is an integer expression. That means that we don’t have to use a number directly here, but can also use something that will result in a number once TeX has worked it out. That will carry through to our user function, so

\SayHello{ 10 - 3 + 4 }

will be valid input.
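To check what an integer expression works out to, it can be passed to \int_eval:n, which leaves the result in the input. The sketch below uses a made-up command \CountHello that prints the evaluated number before the copies:

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical command: show the value of the expression, then
% use the same expression as the repeat count
\cs_new:Npn \CountHello #1
  { \int_eval:n {#1} :~ \prg_replicate:nn {#1} { Hello~ } }
\ExplSyntaxOff
\begin{document}
\CountHello{ 10 - 3 + 4 } % evaluates to 11, so eleven copies follow
\end{document}
```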

Functions or macros?

Experienced TeX programmers will probably be worried that I’m talking about ‘functions’ and not about ‘macros’. TeX is a macro expansion language, which means that when it reads \SayHello, it replaces it by the code we’ve defined as the meaning of \SayHello, then reads the start of the inserted code, replaces it as necessary and so on until it has something to typeset (such as a letter) or execute (a ‘primitive’). That means that programming TeX is very different from programming using true functions.

The LaTeX3 programming approach allows us to treat many macros as if they were functions, but there are places where we’ll need to think about macros being expanded. Throughout the LaTeX3 documentation, programming is described in terms of functions, and so I’ll stick to that approach. Bear in mind that underlying everything is a set of macros, and that this will show up from time to time.

Programming LaTeX3: The programming environment

In the previous post, I mentioned that programming LaTeX3 today really means programming using LaTeX3 ideas but on top of LaTeX2e. To do that, we are going to need to load the appropriate code, and then access the LaTeX3 programming environment. The exact detail depends on whether we are programming in the preamble of a LaTeX document or creating a package. I’ll look at both of these before taking a closer look at the LaTeX3 programming environment in general. What you should notice is that the use of a separate programming environment very much separates out the process of creating code from creating documents: that is quite deliberate and is something that we’ll see again in the series.

In the preamble of a document

The LaTeX3 programming code usable with LaTeX2e is available as a package called expl3 (which for various reasons is distributed as part of l3kernel). This is loaded in the usual way

\documentclass{article}
\usepackage{expl3}

That loads the code, but does not get us into the programming environment. To do that, we need to use a couple of new macros

\ExplSyntaxOn
% Code goes here
\ExplSyntaxOff

In some ways, this is similar to the LaTeX2e \makeatletter … \makeatother idea, but as we’ll see it’s a bit more advanced.

In a LaTeX2e package

In exactly the same way as in a document, the first stage in using LaTeX3 programming in a package is to load the code.

\RequirePackage{expl3}

Once again, that loads the code but does not switch the syntax on. We could use \ExplSyntaxOn here, but for packages a more flexible alternative is to declare the package as being LaTeX3-based:

\ProvidesExplPackage
  {mypkg}               % Package name
  {2011-12-11}          % Release date
  {1.0}                 % Release version
  {Some things I wrote} % Description

This is a special version of the standard \ProvidesPackage macro, which will automatically turn on LaTeX3 programming syntax and more importantly turn it off at the end of the package. It also deals properly with nested package loading, and so is the recommended way to use LaTeX3 syntax inside LaTeX2e packages.
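Putting the two steps together, a minimal package file might look something like the sketch below. The file name mypkg.sty and the function \mypkg_hello:n are placeholders chosen for the example:

```latex
% File: mypkg.sty (hypothetical package)
\RequirePackage{expl3}
\ProvidesExplPackage
  {mypkg}               % Package name
  {2011-12-11}          % Release date
  {1.0}                 % Release version
  {Some things I wrote} % Description
% From here to the end of the file, LaTeX3 syntax is active
\cs_new:Npn \mypkg_hello:n #1
  { \prg_replicate:nn {#1} { Hello~World!~ } }
% Make the function available under a document-level name
\cs_new_eq:NN \SayHello \mypkg_hello:n
```

A document would then load this with \usepackage{mypkg} and could use \SayHello without ever switching on the programming syntax itself.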

The coding environment

Whether you’re using LaTeX3 syntax in a document or a package, the basic ideas are the same. The first thing to notice is that white space (spaces, tabs and new lines) is ignored inside the programming environment. This means we can use it to lay out our code more clearly, but you might wonder how to actually include a space. This is handled by defining ~ as a ‘normal’ space, rather than as the usual non-breaking version.
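The effect is easy to see with a small test. In this sketch (the command name \SpacingDemo is made up) the first pair of words is separated only by a literal space in the source, while the second pair uses ~:

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
\cs_new:Npn \SpacingDemo
  {
    Hello World ,~ % the literal space is ignored: 'HelloWorld,'
    Hello ~ World  % the ~ gives a real space: 'Hello World'
  }
\ExplSyntaxOff
\begin{document}
\SpacingDemo
\end{document}
```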

The programming environment also makes it possible to use : and _ inside the names of commands, which are more formally called control sequences. TeX decides what is a valid control sequence name based on something called the category code of each character. I’ll be explaining more about category codes as we go along, but for the moment the key is to understand that a control sequence is \ followed either by exactly one non-‘letter’ or by one or more ‘letters’. Inside the code environment : and _ are treated as letters by TeX: this is the same idea as using @ as an extra ‘letter’ in LaTeX2e code.

Not only are : and _ available for use in control sequences but they are required by the conventions of LaTeX3 programming. In contrast to LaTeX2e’s sometimes haphazard use of @ in names, there are guidelines for applying both : and _ in LaTeX3 names. Rather than give a formal list now, I’ll bring in the system in the next couple of posts using some examples.

One difference between programming in a document and in a package is the status of @. LaTeX2e automatically makes it a letter in package code, but in a document this does not happen. LaTeX3 does not assign any special meaning to @, and so this difference is not affected by loading LaTeX3 support.

A standard document

As we’ll be needing the basics here for everything from now on, I’ll assume that you are using a short testing document for LaTeX3 programming:

\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Code will go here
\ExplSyntaxOff
\begin{document}
\end{document}

Programming LaTeX3: Background

Before the series on programming LaTeX3 can really get started, it’s going to be important to establish some background, basic concepts and indeed what the aims are. So in this post I’m going to cover some of these issues: we won’t be seeing any code just yet! The approach I’m aiming to take is to bring in concepts as they are needed: this may mean a few simplifications in the beginning to allow ideas to be developed.

LaTeX3: What is available now?

The very first thing to cover is what the current status of LaTeX3 is, and what the aim of this series is. Anyone following LaTeX3 development will know that at the moment it’s not ready for creating documents independent of LaTeX2e. What is available now is a programming layer: l3kernel. At the same time, one of the aims with LaTeX3 is to clearly separate out programming, design decisions and actually using LaTeX. So what I will be covering here is programming. At the same time, I’ll aim to highlight concepts which are not necessarily tied to LaTeX3 programming but which the LaTeX3 Project feel are part of the overall aims of LaTeX3 development.

The target audience

I have two distinct audiences in mind in writing this series. The first is experienced (La)TeX programmers who want to see what ideas LaTeX3 introduces. These people will be familiar with many basic TeX concepts, and will want to see the relationship between what they are used to and the ‘LaTeX3 way’. The second group is experienced LaTeX2e users who want to learn to program LaTeX, and have decided to miss out learning to program TeX first. It’s important that the latter group are included: another key aim for LaTeX3 is to provide a complete set of documentation and support without having to say ‘read The TeXbook’ as a requirement to make progress.

What both of these groups have in common is lots of experience with LaTeX2e. So I’m going to expect familiarity with LaTeX2e’s user syntax, concepts and so on. So that will very much be the baseline: I do hope that the more experienced LaTeX programmers will bear with me.

Requirements

As I’ve indicated, programming LaTeX3 currently means working on top of LaTeX2e. So to get started you need a LaTeX2e installation, which for most people means either TeX Live or MiKTeX. Most of the code in the programming layer of LaTeX3 has been moving to a stable situation for some time, but there are refinements going on all of the time. As a result, I’ll be assuming that readers have the latest CTAN releases of l3kernel and l3packages installed. That can be done by downloading them from CTAN directly, or using the package managers in TeX Live 2011 or MiKTeX 2.9.

Programming LaTeX3: Introduction

Development of LaTeX3 has attracted interest from other TeX programmers for a while. One of the big barriers to new entrants is that programming LaTeX3 is distinct from programming LaTeX2e or plain TeX. So what is needed is a ‘Programming LaTeX3’ guide. The problem is getting one written: these things take time, and what to write is also something of a challenge.

To make a start on tackling this, I thought it would be useful to write a series of short blog posts, taking one area of LaTeX3 at a time and looking at it from the point of view of a beginner in programming LaTeX3. The idea is that by keeping things short I can divide the problem into manageable chunks (both for readers and for me), and get feedback on each part before taking on the next one. If I make decent progress, I’ll then have some material to edit into something like an article for TUGboat.

Now, to do a reasonable job I will have to cover some things I’ve looked at before: sorry if it turns out to be repetitive in places. I’m planning to start by looking at how you can actually start programming LaTeX3 today, covering the idea of ‘LaTeX3 in 2e’, for example. Then it will be on to the basics of the language, before we even get to creating any macros. Ideas for topics to cover are very welcome!