Mathieu Fenniak's Weblog

2007/09/05

Python 3.0a1 support in pyPdf and pg8000

Filed under: pdf, postgresql, programming, python — admin @ 10:28 pm

pyPdf and pg8000 have been ported to run under Python 3.0a1, in new Mercurial repository branches.

pg8000 is a Pure-Python database driver for PostgreSQL, compatible with the standard DB API (although under Python 3.0, the Binary object expects a bytes argument). pg8000 does not yet support every standard PostgreSQL data type, but it supports some of the most common data types.

pyPdf is a Pure-Python PDF toolkit. It is capable of reading and writing PDF files, and can be easily used for operations like splitting and merging PDF files.

I am pretty happy with the upgrade to Python 3.0a1. The 2to3 conversion utility provides a good start by handling the most mechanical of the changes. Both pyPdf and pg8000 used strings as byte buffers extensively, pyPdf especially, so the remaining manual changes were substantial.
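
As a rough illustration of the kind of change involved (this is not code from either project, and the build_record helper and its record layout are invented for the example), code that assembles binary data has to use bytes throughout and encode any text explicitly:

```python
import struct

def build_record(tag, text):
    """Return a bytes record: 4-byte big-endian payload length, tag, payload.

    Hypothetical layout, used only to show bytes handling.
    """
    payload = text.encode("utf-8")  # text must be encoded explicitly
    return struct.pack("!I", len(payload)) + tag + payload

record = build_record(b"DATA", "caf\u00e9")
```

Passing a str where the tag is expected fails with a TypeError, because bytes and str can no longer be concatenated. That is the sort of error a test suite flushes out quickly.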

Having a good test suite is essential to the upgrade process. That is why I chose these two projects to start with: I have a pretty good pg8000 test suite, and a very comprehensive pyPdf suite. After running 2to3 on the source code, it was just a matter of beating the code into order until all the tests passed. It took about 4 hours per project, but many projects wouldn’t need as many changes as these two did.

There are a couple of unexpected behaviours (in my opinion) regarding the new bytes type:

  • b"xref"[0] != b"x". Getting a single item out of a bytes object returns an integer, which does not compare equal to a bytes instance of length 1.
  • b"x" == "x" throws an exception, rather than returning False. This exception is useful for finding places where byte/string comparisons are being done by mistake, but I ran into one instance where I wanted to compare these objects and have it be false. It was easy to code around.
  • You can’t derive a class from bytes. I hope that this will be fixed in future releases, since pyPdf’s StringObject class derived from str previously. (It can’t derive from str now, since the PDF files have no encoding information for strings [that I know of...])
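
The indexing behaviour in the first bullet is easy to work around by slicing instead of indexing. This is a minimal sketch, and the first_byte_is helper is a made-up name for illustration:

```python
data = b"xref"

item = data[0]     # indexing yields an int: the byte value 120 (ASCII "x")
piece = data[0:1]  # slicing yields a bytes object of length 1

def first_byte_is(buf, expected):
    """Return True if buf begins with the length-1 bytes object `expected`."""
    # Slicing also copes with an empty buffer, where buf[0] would raise IndexError.
    return buf[0:1] == expected
```

With this, first_byte_is(data, b"x") compares bytes with bytes, so no integer ever enters the comparison.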

Good work on Python 3.0a1, developers! I love the separation of strings and byte arrays, even though it took me a lot of work to fix up these two projects. It’s the right way to do things.

3 Comments

  1. Your first two bytes issues are covered here: http://wiki.python.org/moin/BytesStr
    Someone could also set up a wiki page for documenting early porting experiences like yours and these:

    http://intertwingly.net/blog/2007/09/01/2to3
    http://oakwinter.com/code/porting-setuptools-to-py3k/

    Comment by Eduardo Padoan — 2007/09/05 @ 11:10 pm

  2. Off topic:
    Is there a way to catch the exception if the file that is loaded is not a PDF?

    Something like this:

    try:
        pdfFile = pyPdf.PdfFileReader(file(path, "rb"))
    except IOError, e:
        print 'File wasn\'t opened'
    except PdfError, e:
        print 'File is no supported pdf'

    Comment by Koblaid — 2007/10/01 @ 11:00 am

  3. Koblaid: most PDF read errors will raise a pyPdf.utils.PdfReadError exception.

    Comment by Mathieu Fenniak — 2007/10/07 @ 9:33 am
