December 2006 archive

pyPdf 1.8 – with PDF encryption!

PyPdf version 1.8 has been released. This new version features two major improvements over the last release. The first is support for the PDF standard security handler, allowing the encryption and decryption of average PDF files. The second major feature is documentation.

The security handler was a fun project to implement. Sometimes, reading encryption algorithms in a document can be a fairly mind-warping experience. It’s not until you start to code the algorithm that you begin to understand the purpose, and how it all fits together. To be honest, sometimes even after you code it, it doesn’t make much sense.

I’m no cryptography expert, but I do feel I have a pretty good basic grasp of the technology and concepts. The PDF reference manual, section 3.5.2, contains a small number of algorithms that include processes like this:

Do the following 50 times: Take the output from the previous MD5 hash and pass the first n bytes of the output as input into a new MD5 hash…

Frankly, it doesn’t make much sense to me. It seems like busy-work. If the chosen hash function is believed to be secure, then rehashing the output 50 times is unnecessary. If the hash function turns out to be insecure, you should replace it, rather than running it 50 times. But I suppose it doesn’t matter much — pyPdf supports it now, whether it makes sense or not.

Documentation was another fun matter. It took a surprising amount of searching to find pythondoc, a documentation system. All I wanted was something that allowed the documentation to be integrated with the code, and allow hyperlinks between documentation bits. I recommend pythondoc if anyone has similar needs — it worked great to generate pyPdf’s documentation.