Universal UTF-8

The existence of editors such as TeXworks makes it very easy to work with UTF-8 source documents. However, there are still a number of issues to think about before deciding to use UTF-8 for all of your work.

First, there is the issue of other users. If you are writing things that will not need to be edited by others, then the choice is down to you. The moment you have collaborators, you need to ensure that they are also okay with UTF-8. They might be using an editor where this is not going to work (WinEdt, for example). If you are preparing stuff for a publisher, you have to be even more careful, as they may have quite a “traditional” TeX system. I know that the American Chemical Society don’t even have the e-TeX extensions, for example.

Then there are more technical issues. If you are a LaTeX user, you might well also use BibTeX. BibTeX is old, and as yet there is no real UTF-8-aware replacement. So, at least in your database of references, you may have to stick with escape sequences or some other encoding.
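To make the “escape sequences” point concrete, here is a minimal sketch of a reference entry that stays pure ASCII. The file name, citation key, author names, title and journal are all made up for illustration; the only thing that matters is that accented letters are written as brace-protected TeX accent commands, and that the key itself contains no accents.

```latex
% Hypothetical example: every name, title and key below is a placeholder.
% The .bib file is written out by LaTeX so the example is self-contained.
\begin{filecontents}{example-refs.bib}
@article{muller2008,
  author  = {M{\"u}ller, Anna and Sk{\l}odowska, Ewa},
  title   = {A Placeholder Title},
  journal = {Journal of Placeholder Results},
  year    = {2008},
}
\end{filecontents}

\documentclass{article}
\begin{document}
See \cite{muller2008}.  % the key is plain ASCII; only field values carry accents

\bibliographystyle{plain}
\bibliography{example-refs}
\end{document}
```

Running LaTeX, then BibTeX, then LaTeX twice more should resolve the citation. Wrapping each accent in its own braces, as in `{\"u}`, is the usual way to let BibTeX treat the accented letter as a single character when sorting and building labels.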

There are also choices to make about the engine you use. XeTeX is the obvious choice for UTF-8 documents, but that means missing out on the pdfTeX extensions to TeX, for example micro-typography. LuaTeX might help here, but if you are a MiKTeX user this is still some way off being available.
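If you want one source file that copes with more than one engine, something along the following lines might do. This is only a sketch, assuming the iftex, inputenc, fontenc, microtype and fontspec packages are all installed; the sample sentence is just a test string with plenty of non-ASCII letters.

```latex
% A sketch of an engine-aware preamble; save the file as UTF-8.
\documentclass{article}
\usepackage{iftex}            % provides \ifPDFTeX, \ifXeTeX, \ifLuaTeX

\ifPDFTeX
  % 8-bit engine: decode UTF-8 input by macro and use a T1-encoded font
  \usepackage[utf8]{inputenc}
  \usepackage[T1]{fontenc}
  \usepackage{microtype}      % the pdfTeX micro-typography extensions
\else
  % XeTeX or LuaTeX: UTF-8 is read natively, fonts are set up by fontspec
  \usepackage{fontspec}
\fi

\begin{document}
Zażółć gęślą jaźń.
\end{document}
```

Loading microtype only in the pdfTeX branch simply mirrors the trade-off described above; how much of it is usable under the other engines depends on the versions you have.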

All in all, UTF-8 input is not quite the universal standard for TeX, just yet. New editors and engines mean that things are almost there, but a few awkward issues remain.

5 thoughts on “Universal UTF-8”

  1. Hi,

    first of all I’d like to say `thank you’ for your blog!

    As for the UTF-8 problem, I’d add a few thoughts, especially that my mother tongue is incompatible with ASCII.

    1. Not using e-TeX really sucks, as of 2008…

    2. Probably XeTeX *is* an `obvious choice’; I, however, still use (for LaTeX) a more `traditional’ TeX engine (pdf-e-TeX) with [utf8]{inputenc}. Being aware of inputenc’s problems (it messes with catcodes, and log/terminal output with UTF-8 is rather garbled), I’m ok with it. I’ll consider switching to XeTeX, too. When using ConTeXt, I use luaTeX. (encTeX is another option, btw.)

    3. Does LuaTeX really have problems with MiKTeX? If yes, maybe that’s another reason to switch to TeX Live?

    4. BibTeX sucks. Really. *Do* use amsrefs instead. (Or maybe some kind of a replacement for BibTeX, there are a few; having amsrefs, I’ve never needed them.)

    5. You didn’t mention another problem with UTF-8: the infamous BOMs. (I use Linux almost exclusively, and convert to CP-1250 when going Windows at work, so it’s not a problem for me;).)

    6. As for data interchange, CP-1250 is probably a reasonable choice when working with Polish-speaking Windows people; Emacs (and vim, too) have no problems with it. Emacs even parses the LaTeX file to find an inputenc-loading line, checks the encoding and acts accordingly.

    Greets

    • Thanks for the detailed comments, Marcin. You are quite right about the BOM issue, but I’ve never had a problem with that one myself, hence it slipped my mind. I’ve not really tried amsrefs either, as I have lots of references in a BibTeX database (used for general tracking of references). With care, you can use some UTF-8, provided it is kept away from BibTeX; so no accents in keys, for example.
      MiKTeX doesn’t so much have a problem with LuaTeX, more it’s simply not been integrated yet. There are other things that MiKTeX handles much more cleanly on Windows, so for the moment I shall be patient!
