Mathieu Fenniak's Weblog

2007/09/05

Python 3.0a1 support in pyPdf and pg8000

Filed under: pdf, postgresql, programming, python — admin @ 10:28 pm

pyPdf and pg8000 have been ported to run under Python 3.0a1, in new Mercurial repository branches.

pg8000 is a Pure-Python database driver for PostgreSQL, compatible with the standard DB API (although under Python 3.0, the Binary object expects a bytes argument). pg8000 does not yet support every standard PostgreSQL data type, but it supports some of the most common data types.

pyPdf is a Pure-Python PDF toolkit. It is capable of reading and writing PDF files, and can be easily used for operations like splitting and merging PDF files.

I am pretty happy with the upgrade to Python 3.0a1. The 2to3 conversion utility provides a good start on the most mechanical changes. pyPdf and pg8000 both used strings as byte buffers, pyPdf especially, so the remaining changes were extensive.

Having a good test suite is essential to the upgrade process. That is why I chose these two projects to start with: I have a pretty good pg8000 test suite, and a very comprehensive pyPdf suite. After running 2to3 on the source code, it was just a matter of beating the code into order until all the tests passed. It took about 4 hours per project, but many projects wouldn’t need as many changes as these two did.

There are a couple of unexpected behaviours (in my opinion) regarding the new bytes type:

  • b"xref"[0] != b"x". Getting a single item out of a bytes object returns an integer, which never compares equal to a bytes instance of length 1.
  • b"x" == "x" throws an exception, rather than returning False. This exception is useful for finding places where byte/string comparisons are being done by mistake, but I ran into one instance where I wanted to compare these objects and have it be false. It was easy to code around.
  • You can’t derive a class from bytes. I hope that this will be fixed in future releases, since pyPdf’s StringObject class derived from str previously. (It can’t derive from str now, since the PDF files have no encoding information for strings [that I know of...])
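The first point is easy to reproduce, and slicing is the usual way around it. This is a small sketch written against released Python 3 versions rather than 3.0a1, and it is not pyPdf code; the variable names are made up. Note that released versions differ from the alpha on the other two points: comparing bytes with str returns False by default, and bytes can be subclassed.

```python
data = b"xref"

first_item = data[0]       # indexing yields an int (the byte value)
first_slice = data[0:1]    # slicing yields a bytes object of length 1

print(first_item)                  # the integer value of the byte "x"
print(first_item == b"x")          # int compared with bytes
print(first_slice == b"x")         # bytes compared with bytes
print(first_item == ord("x"))      # int compared with int
```

Slicing (or comparing against ord("x")) keeps both sides of the comparison the same type.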

Good work on Python 3.0a1, developers! I love the separation of strings and byte arrays, even though it took me a lot of work to fix up these couple of projects. It’s the right way to do things.

2007/03/13

pg8000 v1.02

Filed under: postgresql, programming, python — admin @ 2:34 pm

A new version of pg8000, a Pure-Python interface for the PostgreSQL database, has been released today. This version supports DB-API 2.0 as documented in PEP 249. DB-API support was the most common request I heard after the last pg8000 release.

Also new in version 1.02 are SSL support, datetime parameter input, comprehensive unit tests, and bytea object support.
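For readers who have not used DB-API 2.0, here is a minimal sketch of the PEP 249 call sequence (connect, cursor, execute, fetch). It uses the standard library's sqlite3 module as a stand-in so that it runs without a server. It is not pg8000 code: pg8000's connect() arguments and parameter style are not shown here, and the table and data are invented for illustration.

```python
import sqlite3  # stand-in DB-API 2.0 driver, NOT pg8000

# Connection and cursor objects are the core of the DB-API.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, purely for illustration.
cur.execute("CREATE TABLE release (name TEXT, year INTEGER)")
cur.executemany(
    "INSERT INTO release (name, year) VALUES (?, ?)",  # "?" is sqlite3's paramstyle
    [("v1.00", 2007), ("v1.02", 2007), ("old", 2006)],
)
conn.commit()

# Query with a bound parameter and fetch every matching row.
cur.execute("SELECT name FROM release WHERE year = ? ORDER BY name", (2007,))
rows = cur.fetchall()
print(rows)

conn.close()
```

Because the method names come from PEP 249, code written this way mostly needs only a different connect() call and placeholder style when the driver changes.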

2007/03/09

pg8000 v1.00 — a new PostgreSQL/Python interface

Filed under: postgresql, programming, python — admin @ 10:24 am

pg8000 is a Pure-Python interface to the PostgreSQL database engine. Yesterday, it was released to the public for the first time.

pg8000’s name comes from the belief that it is probably about the 8000th PostgreSQL interface for Python. However, pg8000 is somewhat distinctive in that it is written entirely in Python and does not rely on any external libraries (such as a compiled Python module, or PostgreSQL’s libpq library). As such, it is quite small and easy to deploy. It is suitable for distribution where one might not have a compiled libpq available, and it is a great alternative to bundling libpq with your package.

Why use pg8000?

  • No external dependencies other than Python’s standard library.
  • Pretty cool to hack on, since it is 100% Python with no C involved.
  • Being entirely written in Python means it should work with Jython, PyPy, or IronPython without too much difficulty.
  • libpq reads the entire result set into memory immediately following a query. pg8000 uses cursors to read chunks of rows into memory, attempting to find a balance between speed and memory usage for large datasets. You could accomplish this yourself using libpq by declaring cursors and then executing them to read rows, but this has two disadvantages:
    • You have to do it yourself.
    • You have to know when your query returns rows, because you can’t DECLARE CURSOR on an INSERT, UPDATE, DELETE, CREATE, ALTER, etc.
  • pg8000 offers objects to represent prepared statements. This makes them easy to use, which should increase their usage and improve your application’s performance.
  • It has some pretty nice documentation, I think.

Now, that being said, reality kicks in. Here’s why not to use pg8000:

  • It’s pretty new. This means there are likely bugs that haven’t been found yet. It will mature over the next couple weeks with some community feedback and some internal testing.
  • It doesn’t support the DB-API interface. I didn’t want to limit myself to DB-API, so I created just a slightly different interface that made more sense to me. I intend to include a DB-API wrapper in the next release, v1.01.
  • It isn’t thread-safe. When a sequence of messages needs to be sent to the PG backend, it often needs to occur in a given order. The next release, v1.01, will address this by protecting critical areas of the code.
  • It doesn’t support every PostgreSQL type, or even the majority of them. Notably lacking are: parameter send for float, datetime, decimal, interval; data receive for interval. This will just be a matter of time as well, and hopefully some user patches to add more functions. For the case of interval, I expect to optionally link in mxDateTime, but have a reasonable fallback if it is not available.
  • It doesn’t support UNIX sockets for connection to the PostgreSQL backend. I just don’t quite know how to reliably find the socket location. It seems that information is compiled into libpq. Support could be added very easily if it was just assumed that the socket location was provided by the user.
  • It only supports authentication to the PG backend via trust, ident, or md5 hashed password.

pg8000’s website is http://pybrary.net/pg8000/. The source code is directly accessible through SVN at http://svn.pybrary.net/pg8000/.

2006/12/14

pyPdf 1.8 – with PDF encryption!

Filed under: pdf, programming, python — admin @ 9:47 pm

pyPdf version 1.8 has been released. This new version features two major improvements over the last release. The first is support for the PDF standard security handler, allowing the encryption and decryption of typical PDF files. The second major feature is documentation.

The security handler was a fun project to implement. Sometimes, reading encryption algorithms in a document can be a fairly mind-warping experience. It’s not until you start to code the algorithm that you begin to understand the purpose, and how it all fits together. To be honest, sometimes even after you code it, it doesn’t make much sense.

I’m no cryptography expert, but I do feel I have a pretty good basic grasp of the technology and concepts. The PDF reference manual, section 3.5.2, contains a small number of algorithms that include processes like this:

Do the following 50 times: Take the output from the previous MD5 hash and pass the first n bytes of the output as input into a new MD5 hash…

Frankly, it doesn’t make much sense to me. It seems like busy-work. If the chosen hash function is believed to be secure, then rehashing the output 50 times is unnecessary. If the hash function turns out to be insecure, you should replace it, rather than running it 50 times. But I suppose it doesn’t matter much — pyPdf supports it now, whether it makes sense or not.
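The quoted step is simple to write down. The following is only a sketch of that one loop using Python's hashlib, not the complete key computation from the PDF reference and not pyPdf's actual code; the function name, its parameters, and the sample input are all hypothetical.

```python
import hashlib


def rehash_md5(seed: bytes, n: int, rounds: int = 50) -> bytes:
    """Hash `seed` once, then re-hash the first `n` bytes of each digest `rounds` times."""
    digest = hashlib.md5(seed).digest()
    for _ in range(rounds):
        # Feed the first n bytes of the previous output into a new MD5 hash.
        digest = hashlib.md5(digest[:n]).digest()
    return digest


# Illustrative input only; a real implementation derives the seed from the
# password and document data as described in the PDF reference.
key = rehash_md5(b"example input", n=16)
print(len(key))  # an MD5 digest is always 16 bytes long
```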

Documentation was another fun matter. It took a surprising amount of searching to find pythondoc, a documentation system. All I wanted was something that allowed the documentation to be integrated with the code, and allow hyperlinks between documentation bits. I recommend pythondoc if anyone has similar needs — it worked great to generate pyPdf’s documentation.

2006/11/27

Pybrary.Plot

Filed under: programming — admin @ 11:09 am

I just finished building a webpage for this software release, so I thought I’d mention it here.

Pybrary.Plot is a C# / .NET library designed for simple X/Y graphs and plots. It has the following capabilities:

  • multiple datasets with independent line and symbol styling,
  • numeric, date based, and time span based X axis options,
  • unlimited number of numeric Y axes,
  • scatter series type (arbitrary x/y values),
  • stacked plot series type,
  • basic menu driven user-interface (zooming, saving plot),
  • ability to save plot to clipboard as data and image,
  • capable of drawing on any graphics implementation – e.g. printers and screens,
  • Open Source, available under the modified BSD license.

Check out the website for more information, including screenshots and an online demo.

2006/08/11

Beyond SELECT — Part 1: Constraints

Filed under: programming — admin @ 12:28 pm

You’ve just built your first database application. You’re proud of your accomplishments — and you should be. You mastered the fundamentals of SQL: creating a table, putting data into it, and querying it. You took the basic approach: when it came time to sum a column of numbers, you wrote a FOR loop. Everything works, but shouldn’t your database be doing more for you?

Many people are in the position just described. When time came to use a database, the first web hit for “SQL Tutorial” became the bible. SELECT, INSERT, and DELETE — the bread and butter of data manipulation. The SQL database has satisfied the need for data storage. What else can it do?

It can do a lot more. From aggregate functions to stored procedures, this article will help your database sing and dance. (Note: the author is not responsible for damage, physical, mental or emotional, caused by database servers and software singing and dancing)

This article is part 1 of a multi-part article, Beyond SELECT. This introduction to more than basic SQL is written with multiple databases in mind. Feature availability is documented for PostgreSQL and MySQL. If you have information on the use of these features with other databases, please leave a comment and I will be glad to update this article. Links to the next parts will be added at the end as they are published. I am eager to do similar writing on a freelance basis for any publications. If you are interested, please leave a comment.

Constraints — Make a smarter database

Keys, and Primary Keys

A key is a value that uniquely identifies a row in a table. (PostgreSQL, MySQL 5) A table can have many keys, and most tables have at least one. One key is called the primary key, and is intended to be the method of identifying a row in a table.

Most new SQL users are familiar with the concept of a primary key in the form of an auto-incrementing integer column. In MySQL, such a column is declared with AUTO_INCREMENT. In PostgreSQL, the SERIAL pseudo-type is the equivalent. These columns are easy to use and understand, and make excellent primary keys in many cases.

When an integer is the only key on a table, undesirable data duplication becomes possible. For example, let’s create a simple table describing an employee, and recording his name and social security number:

CREATE TABLE employee (
    employee_id SERIAL,
    name VARCHAR(200),
    ssn VARCHAR(11)
);

With this data definition, a simple data entry mistake could create two employee rows representing the same employee. Maybe the HR director has big thick fingers, and he hit the enter button twice when creating a record. You could put application logic in place to prevent two people with the same social security number from being created, but your database can do that for you. Let’s drop the employee_id column, and use the SSN as the table’s primary key instead:

CREATE TABLE employee (
    ssn VARCHAR(11)
        PRIMARY KEY,
    name VARCHAR(200)
);

By taking this simple step, we’ve reduced the size of our employee table and also made it more “error-proof”. If you attempt to enter two employees with the same SSN, the database will refuse and give you an error.
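To watch the database refuse the duplicate, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in so the example runs anywhere; PostgreSQL and MySQL reject the second row in the same way but with their own error messages. The SSN value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        ssn VARCHAR(11) PRIMARY KEY,
        name VARCHAR(200)
    )
""")

insert = "INSERT INTO employee (ssn, name) VALUES (?, ?)"
conn.execute(insert, ("000-00-0001", "Jane B"))

# The HR director hits enter twice: same SSN again.
try:
    conn.execute(insert, ("000-00-0001", "Jane B"))
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("database refused the row:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(duplicate_rejected, row_count)
```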

Not every table will have a single field primary key. For example, a posting to an online forum can’t be keyed off the time, since multiple people could post at the same time. Sometimes it is easier to use an integer primary key, and sometimes it is a good idea to use a composite primary key. A composite primary key spans two or more columns, and ensures that no two rows in the table have the same combination of values in those columns. It uniquely identifies a row based upon multiple values.

Our company has just gone multinational. It seems unlikely, but somehow John Q in Canada has the same social insurance number as Jane B in Tennessee. Now we need to alter our database to allow both users in, but we still don’t want to go to an integer primary key. We can use a composite primary key on the country of employment AND the social (insurance | security) number of the employee:

CREATE TABLE employee (
    ssn VARCHAR(11),
    country VARCHAR(2),
    name VARCHAR(200),
    PRIMARY KEY (country, ssn)
);

Now ('US', '223-0423-85', 'Jane B') is a distinct row from ('CA', '223-0423-85', 'John Q'), and both can be entered into the database. Of course, this is a contrived example since Canadian SIN numbers and US SSN numbers have different formats, but that doesn’t really matter.
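The composite key can be exercised the same way. This sketch again uses Python's sqlite3 module as a stand-in for PostgreSQL or MySQL, with the two rows from the text plus one deliberate duplicate; the third employee name is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        ssn VARCHAR(11),
        country VARCHAR(2),
        name VARCHAR(200),
        PRIMARY KEY (country, ssn)
    )
""")

insert = "INSERT INTO employee (country, ssn, name) VALUES (?, ?, ?)"

# Same number, different countries: both rows are accepted.
conn.execute(insert, ("US", "223-0423-85", "Jane B"))
conn.execute(insert, ("CA", "223-0423-85", "John Q"))

# Same country AND same number: the pair already exists, so this is refused.
try:
    conn.execute(insert, ("US", "223-0423-85", "Jane B (typed twice)"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(duplicate_rejected, row_count)
```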

What is the difference between a primary key and a key? Is there such a thing as a secondary key? A tertiary key? As it happens, there’s nothing much “primary” about a primary key. More than anything else, the “primary” is documentation, letting people know how this table was intended to be used. You can create any number of keys on your table by creating unique indices.

For example, let’s say that you decided not to use a composite primary key for the employee table. Many other tables reference employee, so it is easiest to just use the original employee_id that we started with. However, we still don’t want multiple employees to have the same social security number, so we create a unique index on that column:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    ssn VARCHAR(11)
);
CREATE UNIQUE INDEX employee_ssn_key ON employee (ssn);

Now only one row can have any given social security number, and only one row can have any given employee_id. This table has two equally valid keys. Keys should be created on any unique data, and ideally the primary key should not be a manufactured arbitrary number. However, you will find that integer primary keys can reduce the size and complexity of your database once you start using many foreign keys.
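Here is a small sketch of the two keys working side by side, using Python's sqlite3 module as a stand-in. One assumption to flag: SQLite has no SERIAL, so the sketch uses INTEGER PRIMARY KEY, which auto-assigns ids in a similar way. The names and SSN are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,  -- SQLite stand-in for SERIAL
        name VARCHAR(200),
        ssn VARCHAR(11)
    )
""")
conn.execute("CREATE UNIQUE INDEX employee_ssn_key ON employee (ssn)")

insert = "INSERT INTO employee (name, ssn) VALUES (?, ?)"

# The id is generated by the database; we never supply it.
first_id = conn.execute(insert, ("Jane B", "000-00-0001")).lastrowid

# A second row with the same SSN violates the unique index.
try:
    conn.execute(insert, ("Jane B again", "000-00-0001"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(first_id, duplicate_rejected)
```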

Foreign Key Constraints

A foreign key occurs when a row in one table references a row in another table. (PostgreSQL, MySQL 5) This is the relational part of a relational database system, and it is very common. Most people learn how to make two tables reference each other, but a surprising number don’t know that the database itself can help enforce that. Let’s create a little example:

CREATE TABLE manager (
    manager_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    evil BOOLEAN
);
CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    manager_id INTEGER
);

In this scenario, each employee has a manager, which can be looked up based upon their manager_id. It’s actually a pretty bad design, for a couple different reasons. Most importantly, I could enter an employee with a manager_id that doesn’t exist. That could be intentional, or it could be a giant mistake. We’re going to introduce a foreign key constraint that will make sure all employees have a manager that exists:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200),
    manager_id INTEGER
        REFERENCES manager (manager_id)
);

Suddenly it is impossible to enter a manager_id that does not exist. The database is doing the hard work of checking every input manager_id for us, and all it took was a couple of words! But we actually didn’t quite accomplish what we want. The manager_id can still have a NULL value entered into it. Is there some kind of constraint that can fix that?
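The behaviour described above, including the NULL loophole, can be seen in a short sketch with Python's sqlite3 module. Two assumptions to flag: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON (PostgreSQL needs no such switch), and INTEGER PRIMARY KEY stands in for SERIAL. The helper function and the names in the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement

conn.execute("""
    CREATE TABLE manager (
        manager_id INTEGER PRIMARY KEY,  -- SQLite stand-in for SERIAL
        name VARCHAR(200),
        evil BOOLEAN
    )
""")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name VARCHAR(200),
        manager_id INTEGER
            REFERENCES manager (manager_id)
    )
""")
conn.execute("INSERT INTO manager (manager_id, name, evil) VALUES (1, 'Pointy', 1)")


def try_insert(name, manager_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO employee (name, manager_id) VALUES (?, ?)",
            (name, manager_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


existing_manager_ok = try_insert("Alice", 1)    # manager 1 exists
missing_manager_ok = try_insert("Bob", 99)      # manager 99 does not exist
null_manager_ok = try_insert("Carol", None)     # NULL is still allowed
print(existing_manager_ok, missing_manager_ok, null_manager_ok)
```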

NULL / NOT NULL Constraints

The NOT NULL constraint is quite possibly the simplest we’re going to take a look at. (PostgreSQL, MySQL 5) Let’s throw the words NOT NULL into a table, and see what effect that has:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    name VARCHAR(200)
        NOT NULL,
    manager_id INTEGER
        REFERENCES manager (manager_id)
        NOT NULL
);

Adding NOT NULL has made it so that the name and manager_id fields must be provided. The employee_id field is already NOT NULL, because it is a database primary key. Now we have employees that must have names, and must have managers, and their managers must exist in the manager table.
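A quick sketch of the combined constraints, again with Python's sqlite3 module standing in for PostgreSQL or MySQL. The same caveats apply: foreign keys must be switched on in SQLite, INTEGER PRIMARY KEY replaces SERIAL, and the helper function and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement

conn.execute("""
    CREATE TABLE manager (
        manager_id INTEGER PRIMARY KEY,  -- SQLite stand-in for SERIAL
        name VARCHAR(200)
    )
""")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name VARCHAR(200)
            NOT NULL,
        manager_id INTEGER
            REFERENCES manager (manager_id)
            NOT NULL
    )
""")
conn.execute("INSERT INTO manager (manager_id, name) VALUES (1, 'Pointy')")


def try_insert(name, manager_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO employee (name, manager_id) VALUES (?, ?)",
            (name, manager_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


complete_row_ok = try_insert("Alice", 1)     # name and manager both given
missing_name_ok = try_insert(None, 1)        # NULL name
missing_manager_ok = try_insert("Bob", None) # NULL manager_id
print(complete_row_ok, missing_name_ok, missing_manager_ok)
```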

There is a constraint (or, really, a lack of constraint) that is opposite of NOT NULL. Writing "NULL" in after a field allows a value to be NULL. This is actually the default for all columns, and it does not have to be explicitly stated. Personally, I like writing NULL for every field that can be NULL — it’s a reminder to myself, when I’m looking at the schema in the future.

Don’t you think it’s about time our employees got paid?

CHECK Constraints

A check constraint enforces a defined rule on a table or column. (PostgreSQL, unsupported in MySQL 5?) It is another tool that helps you design databases that only accept logical, sensible data. Let’s create a table of employees and how much they get paid:

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    monthly_salary NUMERIC
        NOT NULL
);

This table works great, until an employee complains about the payroll system deducting money from his bank account. A quick look shows that the HR director and his fat fingers are the cause once again — he entered $-640 into the payroll application! He must have been aiming for the 9 key, you figure. Let’s put a check constraint onto that column to prevent this from ever happening again. While we’re at it, let’s limit the monthly salary to values under $15,000. You can always come back and change it later if necessary, but for now it will prevent another data entry typo.

CREATE TABLE employee (
    employee_id SERIAL
        PRIMARY KEY,
    monthly_salary NUMERIC
        NOT NULL
        CHECK (monthly_salary > 0)
        CHECK (monthly_salary < 15000)
);

Now we cannot enter salary values outside of (0 … 15000) per month. You might need to rebuild this table and increase the salary limit, once your boss finds out how you prevented a terrible payroll mistake. … ha ha. ha.
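The two checks can be tried out with a few sample salaries. This is a sketch using Python's sqlite3 module as a stand-in (SQLite enforces CHECK constraints; INTEGER PRIMARY KEY replaces SERIAL); the helper function is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,  -- SQLite stand-in for SERIAL
        monthly_salary NUMERIC
            NOT NULL
            CHECK (monthly_salary > 0)
            CHECK (monthly_salary < 15000)
    )
""")


def salary_accepted(amount):
    """Return True if the database accepts a row with this monthly salary."""
    try:
        conn.execute("INSERT INTO employee (monthly_salary) VALUES (?)", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False


# A sensible salary, the fat-fingered negative one, and one at the upper limit.
results = {amount: salary_accepted(amount) for amount in (640, -640, 15000)}
print(results)
```

Both bounds are exclusive as written, so a salary of exactly 15000 is refused along with the negative value.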

Part 2

In the next few days, additional portions of this article will be published. The next section will deal with data querying.

2006/07/31

Vancouver Python Workshop

Filed under: programming, python — admin @ 7:35 am

My friend Bradley is putting on a talk at “VanPy” entitled “Rapid Development of Enterprise-Level Web Applications”. It is going to be an interesting case study of a large web application that was re-developed in Python over a couple of years. The application went from ASP and Windows based to Python and Linux – yay! For anyone who has never seen a Python talk that has to do with an Oracle database (*gasp – not MySQL? ;-) *), and terabytes of data, this is your chance.

Hopefully Jim Hugunin’s IronPython talk won’t steal too much of the potential audience away.

I’ll be the rude guy in the back of the room making silly faces.

Ever want to unwrite something?

Filed under: programming, python — admin @ 7:24 am

In my last post, I kinda ranted about Python development. I thought that I was being constructive and presenting a well thought out point-of-view, but I wasn’t really. There were probably some ideas in there somewhere, but I skipped a couple of important steps in writing it. I did not research the topic of Python 3000 very well, and I did not think about rational reactions to my “arguments”.

I think that I was wrong.

If you didn’t read it, the gist of the post was that development of the Python language should stop. The real reason I felt this way is because I’ve been reading about Perl 6 lately. I love Python, and I was scared to think of it going down the same road. Since then, I’ve thought about language development in general, and I’ve watched a presentation by Guido about Python 3000 (which I will probably see again this weekend at “VanPy”).

Programming languages do need to develop and evolve. Mistakes are made, new alternatives are developed, and things need to be fixed. Python 2.4 is a better environment for developing software than Python 1.5 was, and Python should continue to improve. There will likely be some growing pains. Maybe a feature I like will get chopped from the language. But it’s not the end of the world. I look forward to seeing the future of Python – don’t ever go Perl 6 on me, please.

2006/07/17

My Vision of Python 3000 — Back To Basics

Filed under: programming, python — admin @ 1:05 pm

Let’s just get a little bit of a disclaimer out of the way – I don’t know everything. I tell my wife that I do, but I’m not sure she’s really pretending her hardest to believe me. Designing and developing a programming language is a hard job, and I’m certainly no expert in the field. I am extremely thankful for the many years of hard work that the Python development team has put into Python. In my humble opinion, Python is the greatest computer programming language in existence.

Now please, stop developing Python. Put down your keyboards, and walk away.

Well, I don’t really mean that. Not in the literal sense, anyways. If everyone put down their keyboards and walked away, they might find their way outside where they might get sunburnt. Then they would try to remember who told them to go outside and get sunburnt, and suddenly I would have a class-action lawsuit to deal with. I would have to retain a legal team at an extreme expense to defend myself. I don’t have extreme cash to expense, so don’t put down your keyboards, don’t walk away, and don’t sue me.

Python the language is complete. It doesn’t need any more features.

Let me define “need” for you — that feeling you get when you know something will be missed in the first two minutes of looking at a new programming language. Alright, I admit it, it’s a crappy definition of “need”. But it’ll do.

I did not need the builtin function reversed(). If it had been included in the standard library, I would have been happy.

I did not need the builtin set types. They already were included in the standard library. Adding them to the language as a new type was not necessary.

I did not need generator expressions. I could have just written a generator function.

I do not need a “with” statement. I can use a short variable name.

The difficulty comes with defining which syntax options are worth including in the language itself. For example, I can add two numbers with the expression “x – (-y)”, but I think most rational people would agree that “x + y” is a better choice. Sometimes adding language syntax could provide a new and powerful expressive tool. When do you draw the line? I’m drawing the line at Python 2.4. That far, no further.

New builtin functions? Don’t bother, please. Put them into a library. Adding a new builtin function means that I need an updated vim syntax colouring file. It means my namespace is just a touch more polluted. It’s just not necessary.

CPython the Python interpreter probably isn’t complete. It keeps getting faster and better the more people work on it. We all like you CPython! Chin up!

Python’s documentation is a good effort. Writing documentation is hard, thankless work. Python documentation writers deserve a pat on the back. But it’s a job that is not “done”.

New users have difficulty finding some very specific parts of the documentation. For example, section 2.3.6.1 of the Python library reference is “String Methods”. This isn’t even on the library reference’s table of contents, since it is buried one level too deep, yet it is likely one of the first few pages people need. The same is true for section 2.3.6.2 (String Formatting Operations) and section 2.3.6.4 (Mutable Sequence Types). The entire section 2.3, builtin types, is one of the most important references for Python. It is where many people will start – how do I read a file, how do I manipulate a string or an array? Yet some Python programmers don’t even find it.

Want to get a little more complex? How about creating a class that acts like a dictionary? You’ll probably need section 3.3 (Special Methods) of the Python reference manual. Can someone please take this section, create a list of all the special method names, and hyperlink them to the appropriate description of what they are?
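As an illustration of the special methods involved, here is a minimal sketch of a class that acts like a dictionary. The class and its case-insensitive behaviour are hypothetical, chosen only to show which method backs which piece of syntax; the method names themselves (__getitem__, __setitem__, __delitem__, __contains__, __len__, __iter__) are the standard ones.

```python
class CaseInsensitiveDict:
    """Hypothetical example: a mapping whose string keys ignore case."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):           # d[key]
        return self._data[key.lower()]

    def __setitem__(self, key, value):    # d[key] = value
        self._data[key.lower()] = value

    def __delitem__(self, key):           # del d[key]
        del self._data[key.lower()]

    def __contains__(self, key):          # key in d
        return key.lower() in self._data

    def __len__(self):                    # len(d)
        return len(self._data)

    def __iter__(self):                   # for key in d
        return iter(self._data)


headers = CaseInsensitiveDict()
headers["Content-Type"] = "application/pdf"

print(headers["content-type"])
print("CONTENT-TYPE" in headers, len(headers), list(headers))
```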

The Python documentation has good content. I believe that finding what you’re looking for is the hard part.

Python’s standard library has a fork in the road, ahead.

I want to say that the standard library should include everything. It would make it easier to work with new python software if it had no dependencies though, right? If the Python standard library pulled more and more software into it, it would be very easy to build new software. Need a widgetly? Use the widgetly module. Need a foogleblarg? It’s in the foogle package. But this isn’t practical. The widgetly people need a different release schedule than the foogle package. foogle needs to put out an emergency patch to fix a security hole, but there’s no Python release planned so it doesn’t get out to people.

So then there’s the middle road, that the standard library is traveling on now. Need the ‘socket’ module? It’s included in Python. Need some XML parsing? Got that too. Need a bit more XML parsing? You’ve passed the line, you better download PyXML. CGI, yes; FastCGI, no; regular expressions, yes; Python Imaging Library, no. Everyone has their own vision of what should be included, and what shouldn’t be.

I think the third road here lies with distutils and the “.egg” package format. If this can make it so that dependent packages of appropriate versions are automatically downloaded, compiled, and installed… then Python doesn’t need a standard library. Every module of importance could be distributed and maintained separately from Python. Modules can still be maintained and developed by the Python developers, but until a programmer or another piece of software needs the sndhdr module it isn’t installed.

I’m not being critical that much. But I don’t like the way Python 3000 planning seems to be going. I’m not into language changes. Python is pretty darn great the way it is – that’s why I like it.

2006/06/06

pyPdf 1.6

Filed under: pdf, programming, python — admin @ 9:54 am

Finally! Apparently I must be unemployed in order to get anything done on pyPdf. I’ve finally released version 1.6 today. Major highlights include:

  • Reads more PDF files than ever before.
  • Supports reading and creating compressed content streams.
  • Allows access to document information, such as the title, author, creator, and so on.

Basically, it’s just better. Mr. I-Am-Bitter has been using it on mountains of PDF files, so I feel confident that it works better than ever.
