Posts Tagged ‘pdf’

Python 3.0a1 support in pyPdf and pg8000

pyPdf and pg8000 have been ported to run under Python 3.0a1, in new Mercurial repository branches.

pg8000 is a Pure-Python database driver for PostgreSQL, compatible with the standard DB API (although under Python 3.0, the Binary object expects a bytes argument). pg8000 does not yet support every standard PostgreSQL data type, but it supports some of the most common data types.

pyPdf is a Pure-Python PDF toolkit. It is capable of reading and writing PDF files, and can be easily used for operations like splitting and merging PDF files.

I am pretty happy with the upgrade to Python 3.0a1. The 2to3 conversion utility provides a good start for some of the most mechanical of changes. pyPdf and pg8000 used strings as byte buffers pretty extensively, especially pyPdf, and so the changes were pretty extensive.

Having a good test suite is essential to the upgrade process. That was why I chose these two projects to start with, as I have a pretty good pg8000 test suite, and a very comprehensive pyPdf suite. After running 2to3 on the source code, it was just a matter of beating the code into order until all the tests run. It took about 4 hours per project, but many projects wouldn’t have as many changes as these projects have.

There are a couple of unexpected behaviours (in my opinion) regarding the new bytes type:

  • b"xref"[0] != b"x". Getting a single item out of a bytes type returns an integer, which fails to compare with a bytes instance of a length 1.
  • b"x" == "x" throws an exception, rather than returning False. This exception is useful for finding places where byte/string comparisons are being done by mistake, but I ran into one instance where I wanted to compare these objects and have it be false. It was easy to code around.
  • You can’t derive a class from bytes. I hope that this will be fixed in future releases, since pyPdf’s StringObject class derived from str previously. (It can’t derive from str now, since the PDF files have no encoding information for strings [that I know of…])

Good work on Python 3.0a1, developers! I love the separation of strings and byte arrays, even though it took me a lot of work to fix up these couple of projects. It’s the right way to do things.

pyPdf 1.8 – with PDF encryption!

PyPdf version 1.8 has been released. This new version features two major improvements over the last release. The first is support for the PDF standard security handler, allowing the encryption and decryption of average PDF files. The second major feature is documentation.

The security handler was a fun project to implement. Sometimes, reading encryption algorithms in a document can be a fairly mind-warping experience. It’s not until you start to code the algorithm that you begin to understand the purpose, and how it all fits together. To be honest, sometimes even after you code it, it doesn’t make much sense.

I’m no cryptography expert, but I do feel I have a pretty good basic grasp of the technology and concepts. The PDF reference manual, section 3.5.2, contains a small number of algorithms that include processes like this:

Do the following 50 times: Take the output from the previous MD5 hash and pass the first n bytes of the output as input into a new MD5 hash…

Frankly, it doesn’t make much sense to me. It seems like busy-work. If the chosen hash function is believed to be secure, then rehashing the output 50 times is unnecessary. If the hash function turns out to be insecure, you should replace it, rather than running it 50 times. But I suppose it doesn’t matter much — pyPdf supports it now, whether it makes sense or not.

Documentation was another fun matter. It took a surprising amount of searching to find pythondoc, a documentation system. All I wanted was something that allowed the documentation to be integrated with the code, and allow hyperlinks between documentation bits. I recommend pythondoc if anyone has similar needs — it worked great to generate pyPdf’s documentation.

pyPdf 1.6

Finally! Apparently I must be unemployed in order to get anything done on pyPdf. I’ve finally released version 1.6 today. Major highlights include:

  • Reads more PDF files than ever before.
  • Supports reading and creating compressed content streams.
  • Allows access to document information, such as the title, author, creator, and so on.

Basically, it’s just better. Mr. I-Am-Bitter has been using it on mountains of PDF files, so I feel confident that it works better than ever.

“import zlib” vs. .NET Framework

During my current period of unemployedness, I’ve been preparing for some contract development work that I expect to be doing in the near future. Inspired by the article series on IronPython and .NET GUI development over at The Voidspace Techie Blog, I’ve been looking into what kinds of development struggles I might face using IronPython and .NET as a platform. To that end, I began to look at making pyPdf work under IronPython.

The first struggle I encountered was that the “zlib” module was not available in IronPython. “No problem,” I think to myself. “There’s got to be access to a DEFLATE library through .NET, somehow.”

“Yes, younger-self,” my older-self now says. “There is a .NET way to do this, but apparently it requires an annoyingly large amount of code.”

Here’s the original Python code that was used to implement the FlateEncode streams in pyPdf:

import zlib
def decompress(data):
    return zlib.decompress(data)
def compress(data):
    return zlib.compress(data)

Okay, that was simple and straightforward. Here’s the IronPython solution (note, if you have suggestions to make this shorter, please do let me know):

import System
from System import IO, Collections, Array
def _string_to_bytearr(buf):
    retval = Array.CreateInstance(System.Byte, len(buf))
    for i in range(len(buf)):
        retval[i] = ord(buf[i])
    return retval
def _bytearr_to_string(bytes):
    retval = ""
    for i in range(bytes.Length):
        retval += chr(bytes[i])
    return retval
def _read_bytes(stream):
    ms = IO.MemoryStream()
    buf = Array.CreateInstance(System.Byte, 2048)
    while True:
        bytes = stream.Read(buf, 0, buf.Length)
        if bytes == 0:
            ms.Write(buf, 0, bytes)
    retval = ms.ToArray()
    return retval
def decompress(data):
    bytes = _string_to_bytearr(data)
    ms = IO.MemoryStream()
    ms.Write(bytes, 0, bytes.Length)
    ms.Position = 0  # fseek 0
    gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress)
    bytes = _read_bytes(gz)
    retval = _bytearr_to_string(bytes)
    return retval
def compress(data):
    bytes = _string_to_bytearr(data)
    ms = IO.MemoryStream()
    gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True)
    gz.Write(bytes, 0, bytes.Length)
    ms.Position = 0 # fseek 0
    bytes = ms.ToArray()
    retval = _bytearr_to_string(bytes)
    return retval

Basically, the code grew in length for a few reasons. First of all, the original compress and decompress functions took string arguments, but they were basically being used as arrays of bytes. In .NET, there is a clear difference between an array of bytes and a string, so conversion methods were necessary to create a byte array from a string. I actually like this, because it forces you to encode and decode strings whenever you use them, making you aware of their unicode nature (which is actually optional in CPython, basically).

The other added complexity was the use of streams, rather than just basic functions that can be called. A nice object-oriented stream library is actually quite flexible and powerful, but as you can see it can make things a little more verbose. But, you know… both have their advantages.

Finally, I had to write a function just to read an entire stream into a byte array. MemoryStream has a simple “ToArray()” function on it — I wish this was standard on all Stream objects. But regardless, this function only really needs to be written once and can be used for many different purposes. So it isn’t really adding to the length of the deflate encoding, it should be adding to the length of my toolbox somewhere else. Note that my implementation is fairly wasteful of memory, but it is a simple approach that will not fail if Read returns partial buffers, or anything like that.

IronPython is interesting. One hurdle is down for PyPdf, but a few still exist. We’ll see what happens next.

Python PDF Split/Merge Library

When you have good tools, working with PDF files can be fun. When you have no tools – it’s time to build a pure-Python library for working with PDF files.

Enter the challenge: create a website that can split and merge PDF files on demand. Given a PDF file of a few hundred pages, split the PDF file and store individual pages as seperate PDF files. On demand, merge any set of individual pages to create and serve a new PDF file.

Rejected solution #1: activePDF Toolkit, a COM based library that receives excellent reviews from a co-worker. Sounds super! However, my deployment platform is Linux, making a Windows COM library virtually unusable.

Rejected solution #2: pdftk, a command-line utility that allows splitting and merging PDF files. pdftk is based on a modified version of the Java iText library, which I am familiar with. However, spawning processes on every page view to merge PDF files is probably relatively slow. When you add in the fact that my pdftk process kept dying with SIGABRT when running it through os.system, os.spawnl, and popen (in other words, I couldn’t get it to work), this solution was rejected.

Rejected solution #3: Use the iText Java library, which is capable of splitting and merging files. However, my web server is somewhat memory limited at the moment. Adding a JRE would not help. Plus, who wants to code in Java when it can be done in Python? Nobody, that’s who.

Enter the solution: a pure-Python library for working with PDF files. It may not be perfect (okay, okay, it definitely is not), but it does work with the PDF files I was most interested in splitting and merging. I’ve also tested it lightly with other random PDF files I’ve found on my system and it seems to work pretty happily with them.

I’ve created a pyPdf project page and uploaded it to PyPI.