In the web community, PDF has become synonymous with a range of bad accessibility practices. Some people even think that we would all be better off if PDF finally died, just like Flash and Internet Explorer. As a result, PDF is not very sexy in the Drupal and wider PHP community, and this has negatively impacted our tooling.
This is a shame: when properly implemented, the PDF standard doesn’t need to suffer from the accessibility issues that plague a lot of online PDF documents. PDF also holds a unique position in the digital world: it is a widely accepted standard that enables a range of applications for which there are no real alternatives.
In this series we will give a short introduction to some of the new PDF standards, explain how you can recognise an accessible PDF and describe how we could up our PDF game in the Drupal community.
Interoperability in a centralising world
More and more, digital services are being delivered in walled gardens through online platforms and their APIs. It is easier to develop applications inside a closed system where the data structures are dictated by a single authority, and centralisation aggregates information that can be used to train artificial intelligence. It is also easier to monetise SaaS software.
Centralisation, however, also carries an inherent risk, as it makes systems less resilient. For this reason many organisations require applications that work in a distributed architecture. Open formats and protocols like email and HTML enable an ecosystem in which a large number of clients implement the same standard and nobody needs to rely on a central data store to exchange information.
Benefits of PDF
Correct use of the PDF standards has many benefits:
Interoperability: an invoice produced using the ZUGFeRD standard (as explained in part 2 of this series) can be processed by any system that supports the standard. A document that was signed using the PAdES standard can be verified using any system that supports the PAdES standard. Documents that are properly tagged can be read and interpreted by any screen reader that understands the standardised Tagged PDF structure. As long as the standards are followed, different systems running on different platforms can use PDF as the format of choice to exchange documents.
Open data: several countries have signed into law open data policies that guarantee open access to datasets that were created with government funds. Traditionally a lot of data is shared with the public in PDFs, but most of these are flat PDFs that don’t retain any of the semantic mark-up. As a result, data often needs to be extracted by hand and cleaned up before it can be reused. In open data portals, visitors sometimes expect to be able to generate PDF reports. If PDF/UA becomes required by law for government agencies, these reports will also have to provide their data in this accessible format.
Decentralised use: you can use PDF to transmit structured information, embed movies and sound, and carry different attachments, and you can seal a PDF and sign it with a digital signature. While most applications are becoming ever more centralised, this gives PDF a distinct set of advantages as a digital envelope that works in a distributed architecture.
PDF and PHP/Drupal
PDF in Drupal is in a rather sorry state. A lot of the projects are not properly maintained, and most of them were created to export web pages, so they often lack the ambition to implement the improvements that have been made in the new PDF standards. PDF implementations in PHP are often also resource hungry, and might cause serious performance issues on a high-traffic site. To make nice PDFs in Drupal you will often need to install third-party libraries that hosting companies might not allow. (Check out Chris Ward’s blog post for a rundown of the currently available solutions and some of their problems.)
The Java community has better PDF generation tools. One such tool is iText, an open source Java library available under a dual license (AGPL for free usage and a paid commercial license for organisations that can’t comply with the AGPL). As a specialised tool for PDF generation, it implements a broad range of the PDF features mentioned above.
Open source PDF API
I see very strong parallels between search and PDF generation functionality in Drupal. With Lucene, the Java community had a more powerful, better performing solution for indexing and searching websites. When a plugin was developed for Apache Solr, which uses Lucene, Drupal sites were able to delegate a resource-intensive function and make use of a superior search technology. I believe that we need something similar for PDF generation, and I think that iText would be a great candidate to solve our PDF problem.
Do you think it would be worth the investment to build a PDF daemon that we can use from Drupal? Do you need accessible, human- and/or computer-readable PDFs? Do you want to be informed when we start an iText PDF implementation? Leave your comments on the feedback form; they will help us make a case for the development work.
This post is based on a discussion I had with Bruno Lowagie, the project lead of iText, an open source PDF library for Java.