pyPdf and pg8000 have been ported to run under Python 3.0a1, in new Mercurial repository branches.
pg8000 is a Pure-Python database driver for PostgreSQL, compatible with the standard DB API (although under Python 3.0, the Binary object expects a bytes argument). pg8000 does not yet support every standard PostgreSQL data type, but it supports some of the most common data types.
pyPdf is a Pure-Python PDF toolkit. It is capable of reading and writing PDF files, and can be easily used for operations like splitting and merging PDF files.
I am pretty happy with the upgrade to Python 3.0a1. The 2to3 conversion utility provides a good start for the most mechanical changes. Both pyPdf and pg8000 used strings as byte buffers extensively, pyPdf especially, so the remaining manual changes were substantial.
Having a good test suite is essential to the upgrade process. That was why I chose these two projects to start with, as I have a pretty good pg8000 test suite, and a very comprehensive pyPdf suite. After running 2to3 on the source code, it was just a matter of beating the code into order until all the tests passed. It took about 4 hours per project, but many projects wouldn’t need as many changes as these two did.
There are a couple of unexpected behaviours (in my opinion) regarding the new bytes type:
- b"xref"[0] != b"x". Getting a single item out of a bytes object returns an integer, which does not compare equal to a bytes instance of length 1.
- b"x" == "x" throws an exception, rather than returning False. This exception is useful for finding places where byte/string comparisons are being done by mistake, but I ran into one instance where I wanted to compare these objects and have it be false. It was easy to code around.
- You can’t derive a class from bytes. I hope that this will be fixed in future releases, since pyPdf’s StringObject class derived from str previously. (It can’t derive from str now, since the PDF files have no encoding information for strings [that I know of…])
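The first point in the list above can be checked with a few lines. This is a minimal sketch that assumes a released Python 3 interpreter rather than 3.0a1, so details may differ from the alpha described here; the helper name `starts_with_x` is made up for illustration.

```python
# Behaviour of bytes under a released Python 3 (not necessarily 3.0a1).
data = b"xref"

first_item = data[0]      # indexing yields an int (the byte value)
first_slice = data[0:1]   # slicing yields a bytes object of length 1


def starts_with_x(buf):
    """Hypothetical helper: compare via a slice so both sides are bytes."""
    return buf[0:1] == b"x"


print(type(first_item).__name__)   # int
print(first_item == ord("x"))      # True
print(first_item == b"x")          # False: an int never equals a bytes object
print(starts_with_x(data))         # True
```

Slicing instead of indexing is the usual way around the integer result when a one-byte comparison is wanted.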
Good work on Python 3.0a1, developers! I love the separation of strings and byte arrays, even though it took me a lot of work to fix up these couple of projects. It’s the right way to do things.
A new version of pg8000, a Pure-Python interface for the PostgreSQL database, has been released today. This version supports DB-API 2.0 as documented in PEP-249. DB-API support was the most common request I heard after the last pg8000 release.
Also new in version 1.02 are SSL support, datetime parameter input, comprehensive unit tests, and bytea object support.
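For readers unfamiliar with DB-API 2.0, the general shape of the interface looks roughly like this. The sketch deliberately uses the standard library's `sqlite3` module as a stand-in driver so that it runs without a PostgreSQL server; pg8000's own connection arguments and parameter style are not shown in this post and may well differ. The table and column names are invented for the example.

```python
import sqlite3  # stand-in DB-API 2.0 driver; NOT pg8000

# connect() -> connection, connection.cursor() -> cursor, as PEP 249 describes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE files (name TEXT, body BLOB)")

# Binary() wraps raw bytes destined for a binary column
# (bytea would be the PostgreSQL equivalent).
payload = sqlite3.Binary(b"%PDF-1.4")

# sqlite3 uses "?" placeholders; the placeholder style varies by driver.
cur.execute("INSERT INTO files (name, body) VALUES (?, ?)", ("a.pdf", payload))
conn.commit()

cur.execute("SELECT name, body FROM files")
row = cur.fetchone()

cur.close()
conn.close()
```

Code written against this interface should, in principle, need only a different `connect()` call and placeholder style to move between compliant drivers.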
pg8000 is a Pure-Python interface to the PostgreSQL database engine. Yesterday, it was released to the public for the first time.
pg8000’s name comes from the belief that it is probably about the 8000th PostgreSQL interface for Python. However, pg8000 is somewhat distinctive in that it is written entirely in Python and does not rely on any external libraries (such as a compiled python module, or PostgreSQL’s libpq library). As such, it is quite small and easy to deploy. It is suitable for distribution where one might not have a compiled libpq available, and it is a great alternative to supplying one with your package.
Why use pg8000?
- No external dependencies other than Python’s standard library.
- Pretty cool to hack on, since it is 100% Python with no C involved.
- Being entirely written in Python means it should work with Jython, PyPy, or IronPython without too much difficulty.
- libpq reads the entire result set into memory immediately following a query. pg8000 uses cursors to read chunks of rows into memory, attempting to find a balance between speed and memory usage for large datasets. You could accomplish this yourself using libpq by declaring cursors and then executing them to read rows, but this has two disadvantages:
  - You have to do it yourself.
  - You have to know when your query returns rows, because you can’t DECLARE CURSOR on an INSERT, UPDATE, DELETE, CREATE, ALTER, etc.
- pg8000 offers objects to represent prepared statements. This makes them easy to use, which should increase their usage and improve your application’s performance.
- It has some pretty nice documentation, I think.
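The chunked-read idea from the list above can be illustrated from the client side. This is only an analogy, not pg8000's internals: pg8000 does its chunking at the protocol level, whereas the sketch below uses the DB-API `fetchmany` call on the standard library's `sqlite3` driver as a stand-in. The function name `iter_rows` and the table are made up for the example.

```python
import sqlite3  # stand-in driver so the sketch runs without PostgreSQL


def iter_rows(cursor, chunk_size=100):
    """Yield rows one at a time while holding only one chunk in memory."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield row


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE numbers (n INTEGER)")
cur.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(250)])

cur.execute("SELECT n FROM numbers ORDER BY n")
total = sum(row[0] for row in iter_rows(cur, chunk_size=100))
conn.close()
```

The chunk size is the knob that trades memory use against the number of round trips.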
Now, that being said, reality kicks in. Here’s why not to use pg8000:
- It’s pretty new. This means there are likely bugs that haven’t been found yet. It will mature over the next couple weeks with some community feedback and some internal testing.
- It doesn’t support the DB-API interface. I didn’t want to limit myself to DB-API, so I created just a slightly different interface that made more sense to me. I intend to include a DB-API wrapper in the next release, v1.01.
- It isn’t thread-safe. When a sequence of messages needs to be sent to the PG backend, it often needs to occur in a given order. The next release, v1.01, will address this by protecting critical areas of the code.
- It doesn’t support every PostgreSQL type, or even the majority of them. Notably lacking are: parameter send for float, datetime, decimal, interval; data receive for interval. This will just be a matter of time as well, and hopefully some user patches to add more functions. For the case of interval, I expect to optionally link in mxDateTime, but have a reasonable fallback if it is not available.
- It doesn’t support UNIX sockets for connection to the PostgreSQL backend. I just don’t quite know how to reliably find the socket location. It seems that information is compiled into libpq. Support could be added very easily if it was just assumed that the socket location was provided by the user.
- It only supports authentication to the PG backend via trust, ident, or md5 hashed password.
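As background for the last item, the md5 method in the PostgreSQL protocol has the client hash the password with the user name, then hash that result with a 4-byte salt sent by the server. The sketch below shows that computation with `hashlib`; the function name `pg_md5_response` is hypothetical and this is not taken from pg8000's source.

```python
import hashlib


def pg_md5_response(user, password, salt):
    """Build the md5 password response; all arguments are bytes.

    The result is b"md5" followed by md5hex(md5hex(password + user) + salt).
    """
    inner = hashlib.md5(password + user).hexdigest().encode("ascii")
    outer = hashlib.md5(inner + salt).hexdigest().encode("ascii")
    return b"md5" + outer


# Example values are invented; a real salt comes from the server.
response = pg_md5_response(b"alice", b"secret", b"\x01\x02\x03\x04")
```

Because the salt changes per connection attempt, the same password produces a different response each time.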
pg8000’s website is http://pybrary.net/pg8000/. The source code is directly accessible through SVN at http://svn.pybrary.net/pg8000/.
PyPdf version 1.8 has been released. This new version features two major improvements over the last release. The first is support for the PDF standard security handler, allowing the encryption and decryption of average PDF files. The second major feature is documentation.
The security handler was a fun project to implement. Sometimes, reading encryption algorithms in a document can be a fairly mind-warping experience. It’s not until you start to code the algorithm that you begin to understand the purpose, and how it all fits together. To be honest, sometimes even after you code it, it doesn’t make much sense.
I’m no cryptography expert, but I do feel I have a pretty good basic grasp of the technology and concepts. The PDF reference manual, section 3.5.2, contains a small number of algorithms that include processes like this:
Do the following 50 times: Take the output from the previous MD5 hash and pass the first n bytes of the output as input into a new MD5 hash…
Frankly, it doesn’t make much sense to me. It seems like busy-work. If the chosen hash function is believed to be secure, then rehashing the output 50 times is unnecessary. If the hash function turns out to be insecure, you should replace it, rather than running it 50 times. But I suppose it doesn’t matter much — pyPdf supports it now, whether it makes sense or not.
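For the curious, the quoted step on its own is only a few lines of code. This is a sketch of that single step, assuming `hashlib` for MD5; it is not the full key-derivation algorithm from the PDF reference, nor pyPdf's actual code, and the name `rehash_md5` and the sample input are invented.

```python
import hashlib


def rehash_md5(digest, n, rounds=50):
    """Repeatedly feed the first n bytes of an MD5 digest into a new MD5 hash."""
    for _ in range(rounds):
        digest = hashlib.md5(digest[:n]).digest()
    return digest


# Start from an ordinary MD5 digest of some placeholder input.
initial = hashlib.md5(b"example input").digest()
key_material = rehash_md5(initial, n=16)
```

Each round costs one extra hash, so the loop makes every password guess roughly 50 times more expensive without changing the strength of the hash itself.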
Documentation was another fun matter. It took a surprising amount of searching to find pythondoc, a documentation system. All I wanted was something that allowed the documentation to be integrated with the code, and allow hyperlinks between documentation bits. I recommend pythondoc if anyone has similar needs — it worked great to generate pyPdf’s documentation.
My friend Bradley is putting on a talk at “VanPy” entitled “Rapid Development of Enterprise-Level Web Applications”. It is going to be an interesting case study of a large web application that was re-developed in Python over a couple of years. The application went from ASP and Windows based to Python and Linux – yay! For anyone who has never seen a Python talk that has to do with an Oracle database (*gasp – not MySQL?*), and terabytes of data, this is your chance.
Hopefully Jim Hugunin’s IronPython talk won’t steal too much of the potential audience away.
I’ll be the rude guy in the back of the room making silly faces.