It’s strange but true: seven years after the PDF reference was published as an ISO standard (ISO 32000-1), there are still developers who think that the Portable Document Format is a closed document format owned by Adobe. PDF is often perceived as an odd, impenetrable format. It’s high time we buried that idea and took a look at what’s going on in the world of PDF today...

In this guest post Bruno Lowagie, the project lead of iText, gives an overview of some of the most important innovations in the PDF standard.

Different ISO committees are currently working on new standards such as PDF 2.0 (to be released in 2017), XFDF, and ECMAScript for PDF. Substandards such as PDF/X, PDF/A, PDF/E, PDF/VT, and PDF/UA are updated on a regular basis.

To this day, ignorance about these standards and about best practices for PDF prevents smarter use of documents. Creation of PDF is often seen as a one-way process: once a PDF is created, the content is locked inside and no longer accessible. Content management systems treat such PDFs as if they were images. It shouldn’t have to be that way.

PDF is for viewing and printing – an idea from the past

The Portable Document Format was created in 1993 to solve the problem of viewing and printing documents in a platform-independent way. The goal was to create a document format that allowed the industry to render documents reliably and consistently, regardless of the operating system used.

The ubiquity of Adobe Reader contributed to the success of the format: the viewer was distributed for free and could be used either as a standalone application or as a browser plug-in. Content owners could easily create documents in the PDF format and upload them to their server without having to worry whether the consumer would be able to view or print the document. So what’s the problem? Let’s take a look at some of the issues that have emerged over the years:

  • Too many features. The PDF specification has been growing in many different directions. As a result, not all PDFs are created equal. For example, documents can be set to depend on external fonts to reduce the file size, but this results in an illegible or odd-looking page if those fonts are missing on the end user’s machine.
  • Too many tools. As Adobe allowed third parties to use the PDF specification to create software that produces PDF, there are thousands of different tools and applications that generate PDF. Some of those tools cut corners, which results in badly formatted PDFs. Sometimes a PDF document is nothing more than a bunch of scanned images. Although the human eye sees text, there is no text inside the document when a machine looks at the PDF.
  • Lack of responsiveness. Most PDFs are made for a specific paper size. When viewed on a small screen, a visitor needs to do a lot of panning to read the content. It’s possible to provide reflow capabilities in a PDF by “tagging” the content, but most PDFs aren’t tagged correctly (or aren’t tagged at all).
  • Inaccessible content. Most current PDF documents are not accessible by default. For instance, what we perceive as a table is, to a machine, nothing more than a bunch of lines and text added at specific coordinates. As far as the PDF is concerned, there’s no way of telling whether a specific text snippet belongs to a cell in a table or to a paragraph, because that information wasn’t added when the PDF was created. As a result, it’s really hard to browse through such documents with screen readers.
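
To make the tagging problem a little more concrete, here is a minimal Python sketch of how one might guess whether a PDF claims to be tagged. In a Tagged PDF, the document catalog carries a /StructTreeRoot entry and a /MarkInfo dictionary with /Marked set to true. The function name and the two sample byte strings are assumptions made up for this illustration. The byte scan is only a heuristic: it fails when the catalog is stored in a compressed object stream, so a real check needs a proper PDF parser.

```python
import re

# Matches a catalog entry such as: /MarkInfo << /Marked true >>
_MARKED_TRUE = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes mention a structure tree and /Marked true."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = _MARKED_TRUE.search(pdf_bytes) is not None
    return has_struct_tree and is_marked


# Hypothetical catalog fragments for the demo, not complete PDF files.
TAGGED_CATALOG = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R "
    b"/MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\nendobj\n"
)
UNTAGGED_CATALOG = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

if __name__ == "__main__":
    print(looks_tagged(TAGGED_CATALOG))    # True
    print(looks_tagged(UNTAGGED_CATALOG))  # False
```

Note that a positive result only means the file claims to be tagged. Whether the tags describe the content correctly is a separate question.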

Standards to the rescue!

PDF/A: a standard for archiving PDF

When archiving a document, you want to be sure that you will still be able to read the document in 10, 30, maybe even 100 years. This promise of long-term preservation is hard to keep if the specification allows too much freedom in the document creation process. The PDF/A standard (ISO 19005) is a subset of the PDF specification, introducing restrictions and obligations. For instance, the document may not be encrypted, as there is no guarantee that it will be possible to decrypt the document in the distant future. The document also has to be self-contained: fonts need to be embedded, because there is no guarantee that an external font will still be available in the long term. Another obligation is that the document requires metadata in the XMP format.

The PDF/A standard consists of different parts: PDF/A-1 (2005), PDF/A-2 (2011), and PDF/A-3 (2012). Approved parts will never become invalid; new parts define new, useful features. There are also different conformance levels.

  • Level A: the document needs to be accessible. This is achieved through Tagged PDF.
  • Level B: this is the basic level. The visual representation is guaranteed, but the semantic structure of the content may be missing.
  • Level U: the visual representation needs to be guaranteed, and all text needs to be stored in Unicode (introduced in PDF/A-2).
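
A PDF/A document declares its part and conformance level in its XMP metadata, using the properties pdfaid:part and pdfaid:conformance. As a minimal sketch, the Python snippet below reads that declaration from an XMP packet that has already been extracted from the file. The function name and the sample packet are assumptions for this illustration, and extracting the packet from a real PDF is not shown. Keep in mind that this only reads what the document claims; it doesn't validate that the document actually conforms.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Hypothetical XMP packet for the demo (xpacket wrapper omitted).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>3</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def pdfa_claim(xmp_xml: str):
    """Return the claimed (part, conformance) pair, or None if there is no claim."""
    root = ET.fromstring(xmp_xml)
    part_key = f"{{{PDFAID_NS}}}part"
    conf_key = f"{{{PDFAID_NS}}}conformance"
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows simple properties as child elements or as attributes.
        part = desc.findtext(part_key) or desc.get(part_key)
        conf = desc.findtext(conf_key) or desc.get(conf_key)
        if part:
            return int(part.strip()), conf.strip().upper() if conf else None
    return None


if __name__ == "__main__":
    print(pdfa_claim(SAMPLE_XMP))  # (3, 'B')
```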

There is only one difference between PDF/A-2 and PDF/A-3. In PDF/A-2, all attachments need to be conforming PDF/A documents. In PDF/A-3, the attachment can be in any format. For instance: you can add a Word file that was used as the source of the PDF document, you can add an XLS file containing raw data that is explained in the document, and so on.

ZUGFeRD: a standard for invoices

In 2014, the German Forum for Electronic Invoicing (FeRD) released a standard that is built on top of PDF/A-3. To a human consumer, a ZUGFeRD invoice looks just like any other electronic invoice, but because the document conforms to the PDF/A-3 standard, it can be preserved for the long term. ZUGFeRD also requires the PDF to have a standardized XML attachment that allows machines to read and interpret the content of the invoice. Soon, suppliers for government agencies will be required to submit their invoices in PDF/ZUGFeRD. Over time, as the standard gains wider adoption, e-commerce applications will also have to follow.
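
To illustrate why a machine-readable attachment matters, here is a minimal Python sketch that reads an invoice from XML and computes the total. The XML below is a hypothetical, heavily simplified stand-in: the element names are made up for this example and are not those of the real ZUGFeRD schema, which is namespaced and far richer. Extracting the attachment from the PDF is not shown either.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, simplified invoice XML. NOT the real ZUGFeRD schema.
SAMPLE_INVOICE_XML = """<Invoice>
  <ID>INV-2015-001</ID>
  <Currency>EUR</Currency>
  <Line><Description>Consulting</Description><Amount>1200.00</Amount></Line>
  <Line><Description>Travel</Description><Amount>150.50</Amount></Line>
</Invoice>"""


def read_invoice(xml_text: str) -> dict:
    """Parse the simplified invoice and sum its line amounts."""
    root = ET.fromstring(xml_text)
    # Decimal avoids the rounding surprises of binary floats for money.
    amounts = [Decimal(line.findtext("Amount")) for line in root.findall("Line")]
    return {
        "id": root.findtext("ID"),
        "currency": root.findtext("Currency"),
        "total": sum(amounts, Decimal("0")),
    }


if __name__ == "__main__":
    invoice = read_invoice(SAMPLE_INVOICE_XML)
    print(invoice["id"], invoice["total"], invoice["currency"])
    # INV-2015-001 1350.50 EUR
```

The point is that no text extraction or OCR is involved: the numbers come straight from structured data that travels with the human-readable page.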

PAdES: making the document official with PDF Advanced Electronic Signatures

A PDF document can be signed using a digital signature. This serves three purposes: ensuring the integrity of the document (for instance: once an invoice is final, it shouldn’t change anymore), making sure the document is authentic (for instance: making sure the invoice is genuine and not sent by a third party), and allowing non-repudiation (the signer can’t later deny having signed).

The digital signing mechanism of PDFs is described in an ETSI standard that makes it possible to verify a document independently of the vendor of the software that was used to generate it. Digital signatures in PDF allow documents to be signed by different people in a workflow, and they also support Long-Term Validation (LTV). All of this is currently described in PAdES and will be part of the PDF 2.0 standard (ISO 32000-2) that will be released in 2017.
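
The integrity part can be sketched in a few lines of Python. A PDF signature covers the byte ranges listed in the signature dictionary’s /ByteRange array, given as (offset, length) pairs; the gap between the ranges is the slot that holds the signature itself. The snippet below only computes a digest over such ranges. The sample bytes and names are assumptions for this illustration. A real PAdES signature wraps the digest in a CMS structure created with a private key and verified against a certificate chain, none of which is shown here.

```python
import hashlib
from typing import Sequence


def digest_byte_ranges(data: bytes, byte_range: Sequence[int]) -> str:
    """SHA-256 over the (offset, length) pairs listed in byte_range."""
    h = hashlib.sha256()
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        h.update(data[offset:offset + length])
    return h.hexdigest()


# Hypothetical stand-in for a signed file: content, signature slot, content.
HEAD, SLOT, TAIL = b"%PDF-head", b"<0000000000>", b"tail%%EOF"
DOC = HEAD + SLOT + TAIL
BYTE_RANGE = [0, len(HEAD), len(HEAD) + len(SLOT), len(TAIL)]

if __name__ == "__main__":
    original = digest_byte_ranges(DOC, BYTE_RANGE)
    resigned = HEAD + b"<ABCDEF1234>" + TAIL  # only the slot changed
    tampered = HEAD + SLOT + b"TAIL%%EOF"     # covered content changed
    print(digest_byte_ranges(resigned, BYTE_RANGE) == original)  # True
    print(digest_byte_ranges(tampered, BYTE_RANGE) == original)  # False
```

Filling in the signature slot leaves the digest untouched, while changing any covered byte produces a different digest. That is what lets a verifier detect modifications made after signing.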

PDF/UA: universal accessibility

There has been a strong push for a more accessible web. When accessibility standards are observed, they enable blind and visually impaired users to understand, navigate, and scan through documents much more efficiently. PDF/UA implements these accessibility standards by requiring, among other things, the already mentioned Tagged PDF functionality. Web accessibility standards have become compulsory for public sector websites in a range of countries. There is now also a push to make PDF/UA obligatory (in the US, among other places).

Conclusion

We’ve only scratched the surface as far as PDF is concerned. We didn’t even mention the different technologies that are available in PDF with respect to forms and templates. We’ve only listed a couple of the many benefits.

You can view the video of the "PDF Is Dead; Long Live PDF ...and Java!" presentation here. It was part of JavaOne, a Java conference that took place in San Francisco in October 2015.

Better PDF for Drupal

Do you think it would be worth the investment to build a PDF daemon that we can use from Drupal? Do you need accessible, human and/or computer readable PDFs? Do you want to be informed when we start an iText PDF implementation? Leave your comments on the feedback form; it will help us make a case for the development work.

This was a guest post by Bruno Lowagie, the project lead of iText. You can find more information about iText on their website.

About the author

Bruno Lowagie

Project Lead of iText

Original developer of iText, an open source PDF library available in Java and C# under either an AGPL or a commercial license. Author of iText in Action, first and second edition, published by Manning Publications. Currently writing books on LeanPub.