Some TeX Developments

Case changing in expl3

A few years ago I wrote about the work the LaTeX team were doing on providing case changing functions in expl3. Since then, the code has been tested and revised, and very recently has moved to a ‘final’ home within expl3. It therefore seems like a good time to look again at what the challenges are and what tools we’ve provided.

It’s worth noting up-front that all of the expl3 functions work with UTF-8 input, and as far as possible case changing (and other text manipulation) follows the Unicode Consortium guidelines.

Different kinds of input, different kinds of case changing

To understand what functions we’ve provided for case changing, we first have to know what different types of input we might be dealing with. There are broadly two types: strings, which are handled programmatically, and text, which is intended for typesetting.

Unsurprisingly, case-changing strings is a lot more straightforward than case-changing text. They ‘live’ in different parts of the expl3 code, so I’ll look at them separately.

Strings

In TeX terms, a string is a series of characters which are all treated as ‘other’ tokens (except spaces, which are still spaces). That’s important here because it means strings won’t contain any control sequences, and because with pdfTeX there can’t be any (useful) accented characters.

The most obvious need to handle case in programming strings is when comparing in a caseless manner: ‘removing’ the case. Programmers often do that by lowercasing text, but there are places where that’s not right. For example, Greek has two forms of the lowercase sigma (σ and ς), and these should be treated as the same for a caseless test. Unicode define the correct operation: case folding. In expl3, that’s called \str_foldcase:n.

\exp_args:Ne \str_show:n
  { \str_foldcase:n { AbC } }
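
As a minimal sketch of the caseless comparison described above, one option is to fold both strings and then compare the results. The helper name \demo_str_if_eq_caseless:nnTF is made up for this illustration, and \str_foldcase:n is used under the name given in this post.

\ExplSyntaxOn
% Hypothetical helper: compare two strings after case folding
\cs_new:Npn \demo_str_if_eq_caseless:nnTF #1#2
  {
    \str_if_eq:eeTF
      { \str_foldcase:n {#1} }
      { \str_foldcase:n {#2} }
  }
% Both inputs fold to 'abc', so the 'true' branch is used
\demo_str_if_eq_caseless:nnTF { AbC } { aBc }
  { \iow_term:n { same } }
  { \iow_term:n { different } }
\ExplSyntaxOff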

Much rarer is the need to upper- or lowercase a string. Unicode do not mention this at all, but in TeX we might want to construct a control sequence dynamically. To do that, we might want to uppercase the first character of some user input string, and lowercase the rest. We can do that by combining \str_uppercase:n and \str_lowercase:n with the \str_head:n and \str_tail:n functions:

\exp_args:Ne \str_show:n
  {
    \str_uppercase:f { \str_head:n { SomeThing } }
    \str_lowercase:f { \str_tail:n { SomeThing } }
  }
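
To connect this to the dynamic construction of control sequences, here is a rough sketch in which the normalised string becomes part of a command name. All of the \demo_... names are hypothetical and exist only for this example.

\ExplSyntaxOn
% Hypothetical: normalise input to 'Xxxxx' form for use in a csname
\cs_new:Npn \demo_normalise:n #1
  {
    \str_uppercase:f { \str_head:n {#1} }
    \str_lowercase:f { \str_tail:n {#1} }
  }
% A handler whose name contains the normalised form
\cs_new:cpn { demo_handle_Something:n } #1
  { \iow_term:n { Handling~#1 } }
% 'SOMEthing' is normalised to 'Something' before the name is built
\use:c { demo_handle_ \demo_normalise:n { SOMEthing } :n } { value }
\ExplSyntaxOff

Because the string functions are expandable, the normalisation can happen directly inside the name being constructed.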

Text

The basics

Case changing text is much more complicated because it has to deal with control sequences, accents, math mode and context. The first step of case changing here is to expand the input as far as possible: that’s done using a function called \text_expand:n which works very similarly to the LaTeX2e command \protected@edef, but is expandable. We don’t really need to worry too much about this: it’s built into the case changing system anyway.
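
A small sketch of the expansion step may help. It assumes a simple, unprotected user macro (\demoname, defined only for this example): the macro is expected to be replaced by its definition, both when \text_expand:n is called directly and when the case changer applies the same step internally.

\ExplSyntaxOn
% Hypothetical user command, defined only for this example
\cs_new:Npn \demoname { world }
\exp_args:Ne \tl_show:n
  {
    % Explicit expansion step
    \text_expand:n { Hello ~ \demoname } ~
    % The same expansion happens before the case change is applied
    \text_uppercase:n { Hello ~ \demoname }
  }
\ExplSyntaxOff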

Upper- and lowercasing is quite straightforward: the functions have the natural names \text_uppercase:n and \text_lowercase:n. These deal correctly with things like the Greek final-sigma rule and (with LuaTeX and XeTeX) cover the full Unicode range.

% Try with XeTeX or LuaTeX
\exp_args:Ne \tl_show:n
  {
    \text_uppercase:n { Ragıp~Hulûsi~Özdem } ~
    \text_lowercase:n { ὈΔΥΣΣΕΎΣ }
  }

A variety of standard LaTeX accents and letter-like commands are set up for correct case changing with no user intervention required.

\exp_args:Ne \tl_show:n
  {
    \text_uppercase:n { \aa{}ngstr\"{o}m  ~ caf\'{e} }
  }

Case changing exceptions

There are places where case changing should not apply, most obviously math mode material. There is a set of exceptions built into the case changer, and that list can be extended: it’s easy to add the equivalent of \NoCaseChange from the textcase package.

\tl_put_right:Nn \l_text_case_exclude_arg_tl
  { \NoCaseChange }
\exp_args:Ne \tl_show:n
  {
    \text_uppercase:n { Hello ~ $y = max + c$ } ~
    \text_lowercase:n { \NoCaseChange { iPhone } ~ iPhone }
  }

Titlecasing

Commonly, people think about uppercasing the first character of some text then lowercasing the rest, for example to use it at the start of a sentence. Unicode describe this operation as titlecasing, as there are some situations where the ‘first character’ is handled in a non-standard way. Perhaps the best example is IJ in Dutch: it’s treated as a single ‘letter’, so both letters have to be uppercase at the start of a sentence. (We’ll come to language-dependence in a second.)

Depending on the exact nature of the input, we might want to titlecase the first ‘character’ then lowercase everything else, or we might want just to titlecase the first ‘character’ and leave everything else unchanged. These are called \text_titlecase:n and \text_titlecase_first:n, respectively.

\exp_args:Ne \tl_show:n
  {
    \text_titlecase:n { some~text } ~
    \text_titlecase:n { SOME~TEXT } ~
    \text_titlecase_first:n { some~text } ~
    \text_titlecase_first:n { SOME~TEXT }
  }

As we are not simply grabbing the first token of the input, non-letters are ignored and the first real text is case-changed.

\exp_args:Ne \tl_show:n
  {
    \text_titlecase:n { 'some~text' }
  }

Language-dependent functions

One important context for case changing text is the language the text is written in: there are special considerations for Dutch, Lithuanian, Turkic languages and Greek. That’s all handled by using versions of the case-changing functions that take a second argument: a BCP 47 string which can determine the path taken.

\exp_args:Ne \tl_show:n
  {
    \text_uppercase:n { Ragıp~Hulûsi~Özdem } ~
    \text_uppercase:nn { tr } { Ragıp~Hulûsi~Özdem }
  }
\exp_args:Ne \tl_show:n
  {
    \text_uppercase:n { ὈΔΥΣΣΕΎΣ } ~
    \text_uppercase:nn { el } { ὈΔΥΣΣΕΎΣ }
  }
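
The Dutch IJ case mentioned earlier can be sketched in the same way. This assumes that the titlecasing functions follow the same pattern as the uppercasing ones, so that \text_titlecase:nn takes the BCP 47 string as its first argument.

\ExplSyntaxOn
\exp_args:Ne \tl_show:n
  {
    % Without a language, only the first letter is treated specially
    \text_titlecase:n { ijsselmeer } ~
    % With 'nl', an initial 'ij' is handled as a single letter
    \text_titlecase:nn { nl } { ijsselmeer }
  }
\ExplSyntaxOff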

Over time, mechanisms to link this behaviour to babel will be developed.

Conclusions

Case-changing functions in expl3 are now mature and stable, and ready for wider use. It’s likely that they will be made available as document-level commands in the medium term, but programmers can use them now to make handling this important aspect of text manipulation easier.