PDF Version and file size

The PDF format has evolved over the years as Adobe have released new versions of their Acrobat and Reader software. New ideas have been added to the file format, and as a result there are different versions of the PDF format. If you take a look at a PDF in Adobe Reader, you can see which version the file is using in the Document Properties information. Of course, files using the newer versions of the PDF format need a suitable viewer, be that Reader or something else.

This is relevant to TeX users as PDF tends to be the target format, either directly or via DVI files, for many users. Tools such as pdfTeX are not tied to one version of the PDF specification. For example, when creating a PDF directly with pdfTeX the \pdfminorversion primitive can be used to set the PDF version to 1.3, 1.4 or 1.5.

Why would you want to do this? Well, obviously the newer versions bring new features. A particularly significant one is the compression of non-stream objects. The detail of these objects is not really important, but they relate to items such as links within documents. Version 1.5 of the PDF specification allows these to be compressed, which can make quite a difference to the resulting file size. For example, I did a trial run with the siunitx manual, where adding the lines

\pdfminorversion=5
\pdfobjcompresslevel=2

resulted in reducing the file size from around 700 KiB to around 550 KiB, a saving of roughly 20 %.
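To show where these settings might go, here is a minimal sketch of a complete file. It assumes pdfLaTeX is the engine and that the settings need to be in place before anything is written to the PDF, which is why they sit before \documentclass. The document body is just a placeholder.

```latex
% Compile with pdfLaTeX
\pdfminorversion=5        % ask for PDF version 1.5
\pdfobjcompresslevel=2    % compress non-stream objects (needs PDF 1.5)
\documentclass{article}
\begin{document}
Hello, world.
\end{document}
```

How much is saved will depend on the document: one with many links and other non-stream objects should benefit more than a short file of plain text.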

There is some discussion ongoing at the moment on the TeX Live mailing list about possibly changing the default PDF version produced by tools such as pdfLaTeX, XeLaTeX, etc. The current standard setting is version 1.4, which makes larger files but does have the advantage of being readable by a wider range of viewers. On the other hand, PDF version 1.5 was first released in 2003, and there is pretty good support for it in most of the well-known readers. As long as switching to version 1.5 also enables the compression, this looks like a good idea: just moving to version 1.5 without using the features available seems a bit odd to me.

There are times when you need to use PDF version 1.4 (for example for archive-type PDFs), but for those you also need to check other features of the PDF. So I feel that the change looks like a good idea, provided there is a good way to set the version to something else.
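With pdfTeX, setting the version "to something else" could look like the sketch below. It assumes pdfLaTeX and that the lines come before any output is produced. Turning off object compression explicitly is my assumption about the tidiest approach, as compressed objects are a PDF 1.5 feature.

```latex
% Compile with pdfLaTeX
\pdfminorversion=4        % ask for PDF version 1.4
\pdfobjcompresslevel=0    % no compressed objects, as PDF 1.4 lacks them
\documentclass{article}
\begin{document}
A document intended for archiving.
\end{document}
```

This only covers the version number. As noted above, an archive-type PDF has other requirements which need checking separately.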

EPS graphics with PDF(La)TeX

One issue a lot of people find confusing with (La)TeX is the rules about which types of graphic files work with which engines. EPS files are fine when going via the DVI route, but do not work with direct PDF creation. The solution is to turn the EPS files into PDFs, and the problem goes away. However, there is then the question of how to do the conversion.

For most documents, having to convert every file by hand is not a sensible choice. The next nearest thing is the epstopdf package, which will do the same thing but from within a LaTeX run. However, it needs \write18 enabled, and this is not always desirable. More importantly, a lot of people who struggle with the graphics problem do not know how to turn on \write18 anyway. A good way around this has been added to the latest version of TeX Live, which is currently in the final testing stages. TeX Live 2009 has some restricted \write18 functions enabled as standard, and also has a version of epstopdf “built in”. The result is that EPS files are automatically converted to PDF files, in a transparent manner. Of course, this only happens if the PDF does not already exist! At the moment, this feature is not in MiKTeX 2.8, so it is one reason to favour TeX Live 2009 even on Windows.
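For systems where the conversion is not automatic, loading the epstopdf package by hand might look like the following sketch. The file name "figure" is hypothetical: it assumes that figure.eps sits next to the document, and that \write18 (full or restricted) is available for the conversion step.

```latex
% Compile with pdfLaTeX; the conversion step needs \write18
\documentclass{article}
\usepackage{graphicx}
\usepackage{epstopdf}% loaded after graphicx
\begin{document}
% "figure" is a hypothetical name: figure.eps is assumed to exist
\includegraphics{figure}
\end{document}
```

The name given to the converted PDF depends on the version of the package, so it is worth checking the epstopdf documentation rather than relying on a particular output file name.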

There are places where epstopdf will not help: for example, when using psfrag or pstricks. There, the best solution will either be auto-pst-pdf or pstool. Both are written by Will Robertson, and both need \write18 enabled to work. pstool is more efficient (it only re-creates graphics as needed), but for some cases only auto-pst-pdf will work. Will has documented both packages very well, so the best way to learn about them is to have a read of the documentation.
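As a rough illustration of the pstool approach, here is a sketch based on my reading of the package documentation. The file name "figure" and the tag "x" are hypothetical: figure.eps is assumed to exist and to contain the text "x" for psfrag to replace.

```latex
% Compile with: pdflatex -shell-escape <file>
\documentclass{article}
\usepackage{pstool}
\begin{document}
% "figure" and the tag "x" are hypothetical
\pstool{figure}{%
  \psfrag{x}{$x$-axis label}%
}
\end{document}
```

The psfrag instructions are passed on to a separate run which creates the PDF version of the graphic, so the package documentation is the place to check the details of what is allowed in the second argument.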

Regular expressions

Regular expressions are very popular as a quick and powerful way to carry out searches and replacements in text of all sorts. Traditionally, TeX handles tokens and not strings or characters. This means that doing regex searches using TeX82 is pretty much impossible. To solve this, recent versions of pdfTeX add the \pdfmatch primitive to allow real string matching inside TeX. The LuaTeX team have decided not to take all of the existing “new” primitives forward from pdfTeX, and as I understand it \pdfmatch will not be implemented in LuaTeX. However, Lua itself has regular-expression-style pattern matching, and so the functionality will still be around.
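A minimal sketch of \pdfmatch in use, assuming a version of pdfLaTeX which provides the primitive. It expands to 1 when the pattern matches the string and 0 when it does not, so it can be used inside \ifnum. The pattern and the test string here are made up for illustration.

```latex
% Compile with pdfLaTeX (a version which provides \pdfmatch)
\documentclass{article}
\begin{document}
% \pdfmatch{<pattern>}{<string>}: 1 for a match, 0 for no match
\ifnum\pdfmatch{^[0-9]+$}{12345}=1
  digits only
\else
  something else
\fi
\end{document}
```

Patterns containing backslashes or # need more care, as TeX reads the pattern as tokens before it is turned into a string.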

I’ve recently talked about adding new primitives to XeTeX, and you’ll see that \pdfmatch was not on the list for adding to XeTeX. The reason is that a XeTeX implementation would have to be slightly different from pdfTeX, as it is natively UTF-8, but also would be different to LuaTeX, as it would still be a TeX primitive and not a Lua function. So here “the prize wasn’t worth the winning”, in my opinion. As it is, using \pdfmatch is not widespread, and the idea of having three different regex methods inside TeX didn’t seem like a great idea!

Talking of regex implementations, I’ve been reading Programming in Lua, and also working with TeXworks to try to get syntax highlighting the way I like it. Both systems are slightly different, and it seems both are different from the Perl implementation. It seems that every time you want to use a regex system you have to read the manual to see which things are different from every other implementation!
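To give a flavour of the Lua side, here is a sketch which does a simple match from within LuaLaTeX using \directlua. One difference from Perl is that Lua patterns use % to introduce character classes (%d for a digit, for example), and % is also the TeX comment character. To avoid that clash, this sketch sticks to an explicit [0-9] set. The test string is made up for illustration.

```latex
% Compile with LuaLaTeX
\documentclass{article}
\begin{document}
\directlua{
  if string.match("12345", "^[0-9]+$") then
    tex.print("digits only")
  else
    tex.print("something else")
  end
}
\end{document}
```

For anything more involved, putting the Lua code in a separate .lua file avoids having to think about how TeX reads the pattern.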

More on XeTeX primitives

There has been a bit more work on the idea of adding primitives to XeTeX to match those available in pdfTeX. The list of pdfTeX primitives which look interesting has grown slightly, and now reads:

  • \ifincsname
  • \ifpdfprimitive
  • \pdfprimitive
  • \pdfshellescape
  • \pdfstrcmp
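Of the primitives in this list, \pdfshellescape is perhaps the easiest to illustrate. The sketch below assumes a pdfLaTeX in which it expands to 0 when \write18 is disabled, 1 when it is fully enabled and 2 when it is restricted. Older versions may only know about the first two values.

```latex
% Compile with pdfLaTeX
\documentclass{article}
\begin{document}
Shell escape is
\ifcase\pdfshellescape
  disabled% value 0
\or
  fully enabled% value 1
\or
  restricted% value 2
\else
  in an unknown state%
\fi.
\end{document}
```

A package can use the same test to give a sensible warning instead of failing when it needs to run an external program.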

At the same time, it would be useful to include the “extended” version of \vadjust which pdfTeX makes available. This is something that has been asked about before, and as with the rest of the changes the main issue is not the idea of doing it but the time for actual implementation.

The real need to have \pdfstrcmp available for LaTeX3 work means that some effort has actually gone into this. I’ve got no experience with either Pascal or the WEB format, but I’ve managed by dint of determination to get something passable to Jonathan Kew. There will need to be some adjustments, as XeTeX works with UTF-8 internally, which pdfTeX does not do. However, I’m hopeful that we will see new primitives in XeTeX soon.

Quite how the primitives will be named is still to be decided. The existing \pdf... naming does not really make sense with these non-PDF related functions. So they could end up as \XeTeX... or may just be given generic names. I’m leaving that to Jonathan!

Additional primitives for XeTeX

XeTeX has, over the past few years, made using TeX with multiple fonts and UTF-8 input easy. The work-flow using XeTeX is very much more accessible than the routes needed using pdfTeX or TeX82. So I’m sure that many people, like me, use XeTeX whenever they want to use arbitrary fonts or to write anything which doesn’t use western European characters.

XeTeX is based on ε-TeX, which means it has a number of primitives which were not present in TeX82, but are present in ε-TeX itself or in pdfTeX (which also includes the ε-TeX primitives). However, ε-TeX was finalised over ten years ago, and since then the pdfTeX team have added a number of new primitives, many related directly to PDF output. At the same time, XeTeX includes its own new or extended primitive functions, in this case focussed on UTF-8 input. For the most part this does not concern people as things work fine.

Recently, there has been some testing of the current LaTeX3 code with XeTeX (and older versions of pdfTeX, which don’t have all of the newer primitives). LaTeX3 requires the ε-TeX extensions, which are as I said available with any modern TeX engine. However, when it’s available LaTeX3 also uses the \pdfstrcmp primitive: this is only present in newer versions of pdfTeX. For those people not familiar with \pdfstrcmp, it allows you to do string comparisons of text (not token comparisons), and in an expandable manner. This is very useful, and much better than doing things without it; with no \pdfstrcmp, comparisons are not expandable. It became clear that there is a danger of some things working when using newer versions of pdfTeX, but failing with older ones or with XeTeX. Failing with older versions of pdfTeX is one thing (the advice can simply be “sorry, you’ll have to update your pdfTeX”), but failing with XeTeX is simply not acceptable. After a bit of discussion, the best solution seemed to be to talk to Jonathan Kew about getting a very small number of “new” pdfTeX primitives into XeTeX.
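To make the point about expandability concrete, here is a minimal sketch assuming a pdfLaTeX new enough to provide \pdfstrcmp. The primitive expands to -1, 0 or 1, with 0 meaning the two strings are the same, so the whole test can be carried out inside an \edef. The strings and the name \result are made up for illustration.

```latex
% Compile with pdfLaTeX (a version which provides \pdfstrcmp)
\documentclass{article}
\begin{document}
% The comparison is done by expansion alone, so it works inside \edef
\edef\result{\ifnum\pdfstrcmp{abc}{abc}=0 same\else different\fi}
\result
\end{document}
```

After the \edef, \result holds only the text of the branch that was chosen, with no trace of the test itself.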

At the moment, things are still under discussion, but the list of additional primitives is going to be small (somewhere between 2 and 5 seems likely). I think it’s giving nothing away to say that \pdfstrcmp is one that really is needed (although the name might be an issue!). Another likely candidate is \ifincsname, which looks handy and also not too complex to implement. There are a few other suggestions, but I’m not sure just yet what will be really needed, as opposed to nice to have. What is clear is that this is a one-off request. Once these small gaps are filled, LaTeX3 will not be using other primitives for general functions. I’m not sure how long it will take to finalise things, both for the team to agree on what is needed and for Jonathan Kew to do the hard work, but I’d imagine weeks, not longer.
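For anyone who has not met \ifincsname, the sketch below shows the idea, assuming a pdfLaTeX which provides it. The test is true only while TeX is expanding material between \csname and \endcsname, so one macro can behave differently in the two situations. The names \whereami and name-inside are made up for illustration.

```latex
% Compile with pdfLaTeX (a version which provides \ifincsname)
\documentclass{article}
\begin{document}
\newcommand*\whereami{\ifincsname inside\else outside\fi}
% Within \csname ... \endcsname, \whereami gives "inside",
% so this defines the command called "name-inside"
\expandafter\newcommand\csname name-\whereami\endcsname{defined via csname}
% Here \whereami is used in ordinary text
\whereami: \csname name-inside\endcsname
\end{document}
```

This is handy for macros which need to typeset something complicated in running text but give a plain string when used to build a command name.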