pyPdf 1.8 – with PDF encryption!

pyPdf version 1.8 has been released. This new version features two major improvements over the last release. The first is support for the PDF standard security handler, which allows pyPdf to encrypt and decrypt most ordinary password-protected PDF files. The second major feature is documentation.

The security handler was a fun project to implement. Sometimes, reading encryption algorithms in a document can be a fairly mind-warping experience. It’s not until you start to code the algorithm that you begin to understand the purpose, and how it all fits together. To be honest, sometimes even after you code it, it doesn’t make much sense.

I’m no cryptography expert, but I do feel I have a pretty good basic grasp of the technology and concepts. The PDF reference manual, section 3.5.2, contains a small number of algorithms that include processes like this:

Do the following 50 times: Take the output from the previous MD5 hash and pass the first n bytes of the output as input into a new MD5 hash…

Frankly, it doesn’t make much sense to me. It seems like busy-work. If the chosen hash function is believed to be secure, then rehashing the output 50 times is unnecessary. If the hash function turns out to be insecure, you should replace it, rather than running it 50 times. But I suppose it doesn’t matter much — pyPdf supports it now, whether it makes sense or not.
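To make the quoted step concrete, here is a minimal sketch of that re-hashing loop using Python's standard hashlib module. It is only an illustration of the loop, not pyPdf's actual code and not the complete key-derivation algorithm from the PDF reference. The function name stretch_md5, its parameters, and the seed bytes are all made up for this example; in the real algorithm the first hash is computed over the padded password and several fields from the document.

```python
import hashlib


def stretch_md5(seed: bytes, key_len: int = 16, rounds: int = 50) -> bytes:
    """Hash `seed` once with MD5, then re-hash the first `key_len` bytes
    of each digest `rounds` more times.

    Hypothetical helper for illustration only; `seed` stands in for the
    material the real algorithm feeds into the first hash.
    """
    digest = hashlib.md5(seed).digest()
    for _ in range(rounds):
        # Take the first n bytes of the previous output as the new input.
        digest = hashlib.md5(digest[:key_len]).digest()
    # The key is the first n bytes of the final digest.
    return digest[:key_len]


# Placeholder input, not real PDF key material.
key = stretch_md5(b"example password material")
print(len(key), key.hex())
```

Whatever the merits of the design, the loop means every password attempt costs 51 MD5 computations instead of one.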

Documentation was another fun matter. It took a surprising amount of searching to find pythondoc, a documentation system. All I wanted was something that allowed the documentation to be integrated with the code and allowed hyperlinks between documentation bits. I recommend pythondoc if anyone has similar needs; it worked great to generate pyPdf’s documentation.

5 Comments on pyPdf 1.8 – with PDF encryption!

  1. Justin
    2006/12/15 at 8:21 am (7 years ago)

    Usually that kind of thing is done to make brute-forcing keys harder. WPA does a similar thing, but with 4096 rounds of hashing.

  2. Sharad Popli
    2007/02/23 at 1:26 am (7 years ago)

    Justin’s right. Each call to the hash function consumes a minuscule amount of time and the iterations cause the time spent to increase. For ordinary, legitimate use, the user will barely notice. But for someone attempting a brute force attack, it increases the difficulty level significantly. And hopefully their frustration levels too… ;)

  3. samj
    2007/02/28 at 6:41 am (7 years ago)

    Just a note to say thanks for making PyPdf available – it works really well.

  4. BernieC
    2007/04/30 at 6:54 am (7 years ago)

    Many thx for the excellent work on PyPdf. I love it.
    A quick question: would it require much work to include set-functions on the documentInfo objects? FYI, I use pypdf to read in pdf-files from my academic literature, and to build an up-2-date db of everything I own on pdf, including a title, author, and summary. Unfortunately, not many researchers write metadata. So it would be nice if on first encounter with empty metadata fields I could enter them myself, instead of having to open every pdf and do it there.

    kind regards,
    Bernie.

  5. Lionel
    2007/05/14 at 6:40 am (7 years ago)

    I have a document with pages in A4 format, and I want to merge them to get 2 pages on an A3 page. I played with mediaBox, cropBox, … of page0 and merged page1 onto it, with no success: page1 is always ON TOP of page0 (I also changed the boxes of page1, which did not help). Now my question: how do you use merging to put page0 on the left and page1 on the right of an A3 page?