Mathieu Fenniak's Weblog

2006/06/04

“import zlib” vs. .NET Framework

Filed under: pdf, programming, python — admin @ 9:44 pm

During my current period of unemployedness, I’ve been preparing for some contract development work that I expect to be doing in the near future. Inspired by the article series on IronPython and .NET GUI development over at The Voidspace Techie Blog, I’ve been looking into what kinds of development struggles I might face using IronPython and .NET as a platform. To that end, I began to look at making pyPdf work under IronPython.

The first struggle I encountered was that the “zlib” module was not available in IronPython. “No problem,” I think to myself. “There’s got to be access to a DEFLATE library through .NET, somehow.”

“Yes, younger-self,” my older-self now says. “There is a .NET way to do this, but apparently it requires an annoyingly large amount of code.”

Here’s the original Python code that was used to implement the FlateDecode streams in pyPdf:

import zlib
def decompress(data):
    return zlib.decompress(data)
def compress(data):
    return zlib.compress(data)

Okay, that was simple and straightforward. Here’s the IronPython solution (note, if you have suggestions to make this shorter, please do let me know):

import System
from System import IO, Array

def _string_to_bytearr(buf):
    # Copy a Python byte string into a .NET byte array.
    retval = Array.CreateInstance(System.Byte, len(buf))
    for i in range(len(buf)):
        retval[i] = ord(buf[i])
    return retval

def _bytearr_to_string(bytes):
    # Convert a .NET byte array back into a Python byte string.
    # Joining a list avoids repeated string concatenation.
    return "".join([chr(bytes[i]) for i in range(bytes.Length)])

def _read_bytes(stream):
    # Read a stream to its end, 2048 bytes at a time.
    ms = IO.MemoryStream()
    buf = Array.CreateInstance(System.Byte, 2048)
    while True:
        count = stream.Read(buf, 0, buf.Length)
        if count == 0:
            break
        ms.Write(buf, 0, count)
    retval = ms.ToArray()
    ms.Close()
    return retval

def decompress(data):
    bytes = _string_to_bytearr(data)
    ms = IO.MemoryStream()
    ms.Write(bytes, 0, bytes.Length)
    ms.Position = 0  # rewind so DeflateStream reads from the start
    gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress)
    bytes = _read_bytes(gz)
    retval = _bytearr_to_string(bytes)
    gz.Close()
    return retval

def compress(data):
    bytes = _string_to_bytearr(data)
    ms = IO.MemoryStream()
    # True = leave ms open when the DeflateStream is closed.
    gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True)
    gz.Write(bytes, 0, bytes.Length)
    gz.Close()  # closing flushes the remaining compressed data into ms
    bytes = ms.ToArray()  # ToArray returns the whole buffer regardless of Position
    retval = _bytearr_to_string(bytes)
    ms.Close()
    return retval

The code grew in length for a few reasons. First of all, the original compress and decompress functions took string arguments, but the strings were really being used as arrays of bytes. In .NET, there is a clear difference between an array of bytes and a string, so conversion functions were necessary to create a byte array from a string and back. I actually like this, because it forces you to encode and decode strings whenever you use them, making you aware of their Unicode nature (which is effectively optional in CPython).

The other added complexity was the use of streams, rather than just basic functions that can be called. A nice object-oriented stream library is actually quite flexible and powerful, but as you can see it can make things a little more verbose. But, you know… both have their advantages.

Finally, I had to write a function just to read an entire stream into a byte array. MemoryStream has a simple “ToArray()” method on it, and I wish this was standard on all Stream objects. Regardless, this function only really needs to be written once and can be used for many different purposes, so it isn’t really adding to the length of the deflate encoding; it should be adding to the length of my toolbox somewhere else. Note that my implementation is fairly wasteful of memory, but it is a simple approach that will not fail if Read returns partial buffers, or anything like that.
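A caveat worth checking before relying on the IronPython version: as far as I know, DeflateStream reads and writes raw DEFLATE data (RFC 1951), while zlib.compress and zlib.decompress use the zlib container (RFC 1950), which wraps the same data in a two-byte header and a four-byte Adler-32 checksum. The CPython sketch below shows the difference. The helper names are made up for illustration, and the fixed two-byte header assumes no preset dictionary is in use.

```python
import zlib

def raw_deflate(data):
    # Negative wbits asks zlib for a bare DEFLATE stream,
    # with no zlib header and no Adler-32 trailer.
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def raw_inflate(data):
    # Decompress a bare DEFLATE stream.
    return zlib.decompress(data, -15)

def strip_zlib_wrapper(data):
    # Drop the 2-byte zlib header and the 4-byte Adler-32 trailer,
    # leaving only the DEFLATE stream in the middle.
    return data[2:-4]

sample = b"pyPdf stream contents " * 50
wrapped = zlib.compress(sample)       # zlib container
bare = strip_zlib_wrapper(wrapped)    # raw DEFLATE
restored = raw_inflate(bare)
```

If that understanding is right, data produced by zlib.compress would need its wrapper removed before being handed to a raw DEFLATE decoder, and data written by a raw encoder would need the wrapper added before zlib.decompress could read it.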

IronPython is interesting. One hurdle is down for pyPdf, but a few still exist. We’ll see what happens next.

2006/01/10

Python PDF Split/Merge Library

Filed under: pdf, programming, python — admin @ 12:48 pm

When you have good tools, working with PDF files can be fun. When you have no tools – it’s time to build a pure-Python library for working with PDF files.

Enter the challenge: create a website that can split and merge PDF files on demand. Given a PDF file of a few hundred pages, split the PDF file and store individual pages as separate PDF files. On demand, merge any set of individual pages to create and serve a new PDF file.

Rejected solution #1: activePDF Toolkit, a COM based library that receives excellent reviews from a co-worker. Sounds super! However, my deployment platform is Linux, making a Windows COM library virtually unusable.

Rejected solution #2: pdftk, a command-line utility that allows splitting and merging PDF files. pdftk is based on a modified version of the Java iText library, which I am familiar with. However, spawning processes on every page view to merge PDF files is probably relatively slow. When you add in the fact that my pdftk process kept dying with SIGABRT when running it through os.system, os.spawnl, and popen (in other words, I couldn’t get it to work), this solution was rejected.

Rejected solution #3: Use the iText Java library, which is capable of splitting and merging files. However, my web server is somewhat memory limited at the moment. Adding a JRE would not help. Plus, who wants to code in Java when it can be done in Python? Nobody, that’s who.

Enter the solution: a pure-Python library for working with PDF files. It may not be perfect (okay, okay, it definitely is not), but it does work with the PDF files I was most interested in splitting and merging. I’ve also tested it lightly with other random PDF files I’ve found on my system and it seems to work pretty happily with them.

I’ve created a pyPdf project page and uploaded it to PyPI.

2005/01/16

Freedom?

Filed under: politics — admin @ 2:41 pm

German politicians call for Nazi symbol ban in Europe:

"In a Europe of peace and freedom there should be no place for Nazi
symbols," said General Secretary of the Christian Social Party, Markus
Soeder, to the Bild am Sonntag newspaper published on Sunday.

You keep using that word. I do not think it means what you think it means.

2004/12/08

Java Web Start 1.5

Filed under: java, programming — admin @ 3:37 pm

The Java Network Launching Protocol (JNLP), commonly known as Java Web Start, is some very cool technology. In my opinion, it hits a sweet spot between web applications and local applications:

  • Software is run locally on the client machine. JNLP is supported on every platform on which Sun’s Java Runtime Environment is available, including Linux, Windows, and Mac OS X.
  • The software can be updated on the remote server, and it is updated locally as soon as it is next run.
  • Desktop integration is possible, meaning that the user of the software has a nice double-clickable application shortcut on their desktop, start menu, applications folder, or wherever you would store such a thing in Linux.
  • New in Java 1.5^H^H^H5.0 – file associations can be specified in JNLP files, meaning that I can open a “.whatever” file with the Whatever Java Web Start application by clicking it. Cool!

So, basically JNLP has some of the advantages of Web applications (immediate upgrades available, doesn’t need to be rolled out by network administrators [whether this is really an advantage is questionable]), and some of the advantages of desktop applications (run locally, quick to run, shortcuts and file associations). Cool.

However, the changes to JNLP in Java 1.5 are somewhat poorly documented. I’ve discovered three tidbits of information that I believe are pretty useful, so I’m going to share them with you.

First of all, I wanted my users to know that by using Java 1.5 they’re going to get some additional functionality (file associations), but my application still runs in Java 1.4. I updated my web start launch page to check for 1.5.0 (thanks to Sun’s documentation), and then made it link to my download page with some useful text if they don’t have 1.5. However, if the user went to the download page when running Java 1.4, it did not upgrade to Java 1.5 – it saw that Java Web Start was available, and launched the application.

Adding a #Version tag to the download page fixed this, forcing an upgrade to 1.5.0:

<!--
Automatically installs Java 1.5.0 and runs the PDA application with
Java Web Start.  The addition of the #Version section on the object's
codebase will cause an upgrade to JRE 1.5.0 even if the user already
has a Java Runtime installed.
-->
<object
  codebase="http://java.sun.com/update/1.5.0/jinstall-1_5_0-windows-i586.cab#Version=1,5,0,0"
  classid="clsid:5852F5ED-8BF4-11D4-A245-0080C6F74284"
  height="0"
  width="0">
    <param name="app" value="http://yoursite.com/app.jnlp">
    <param name="back" value="true">
    <!-- Alternate HTML for browsers which cannot instantiate the object -->
    <a href="http://java.sun.com/j2se/1.5.0/download.html">
          Download Java Web Start
    </a>
</object>

So, now my users had the option to upgrade to Java 1.5. Some of them might even do it. Now, how does one get the file associations to work? At the time I was looking at it (and still as I write this), Sun’s getting started guide is grossly incorrect when discussing the <association> tag in the JNLP file. The getting started guide also does not explain what changes to make to your application to get it to actually open the file when you double click on it.

The first part was easy – Keith Lea already documented the problem, and Google led me straight to it. Add an <association> tag to your JNLP file, like this:

One step closer! Now I can click on my files and the application launches, but it doesn’t do anything with the file I opened. I took a guess that it was probably passing the filename in through the command line parameters, and put some message boxes into my main function. Sure enough, that’s how it’s being done. I’m being passed two options: -open, and then the filename to open. The following code in my main function dealt with this:

// Handle JNLP association command line argument file opening.
// Format: '-open' 'path-to-file'
boolean openFlag = false;
for (int i = 0; i < args.length; i++)
{
    if (openFlag)
        openPath(args[i]);

    if (args[i].equals("-open"))
        openFlag = true;
    else
        openFlag = false;
}

Now I’m happily taking advantage of the new features in JNLP 1.5, without forcing my users to upgrade if they don’t want to. It’s a happy day.

2004/11/23

Adapting Classes

Filed under: java, programming — admin @ 12:55 am

A few days ago, the wisdom of the java.io.Reader interface dawned on me suddenly, and at the same moment the world of interfaces came into a new light. I’ve always understood what an interface (or pure virtual class) is, and the purpose of them – they allow you to change the implementation of your class without changing the calling code. Some people have even told me that the use of interfaces can replace multiple inheritance – but I never really got how.

For those of you who are unaware, Reader is a basic interface (strictly speaking it is an abstract class, but it plays the role of an interface here) that reads arrays of character data from “some source”. This seems like a good idea, of course. You get the data, and you don’t care about the source. Yay, nice and simple, and everyone is happy. Typically one creates a java.io.FileInputStream, creates a java.io.InputStreamReader (which extends Reader), and you’re off to the races.

One day, another class caught my eye: java.io.BufferedReader. This class implements the Reader interface, but doesn’t specify in the name any kind of data source. How does this class work? A BufferedReader takes another Reader instance as part of its constructor, and adapts it.

Why is this such a special idea? Because BufferedReader is not derived from InputStreamReader. As a result, any Reader can be buffered by this simple class. In the same way, other classes can adapt a BufferedReader to add additional functionality. A LineNumberReader can take a BufferedReader, an InputStreamReader, a KeyboardJunkReader, a RandomDataReader, or a ManagementBullshitReader – whatever Reader one wants to count the lines of. (The fact that LineNumberReader is derived from BufferedReader is irrelevant [and frankly, pointless...])

So, you’re thinking “fine, but so what?”. Let me give you an example of another simple interface that this kind of adapting would be cool for:

public interface XYDataset
{
    public int getCount();
    public Number getX(int index);
    public Number getY(int index);
}

This interface is pretty simple, and would be good as part of a plotting package. Any set of X and Y values could be plotted easily by creating an instance of this XYDataset wherever your data is. This is simple, effective, and cool. A basic implementation of this could have two List objects, or one List object, or … whatever, who cares – storing data is boring.

What if you find that some users are plotting thousands of points, and it’s very slow? Let’s create a filtering adapter class:

public class FilteringXYDataset implements XYDataset
{
    private XYDataset delegate;
    private int maxPoints = 1000; // only plot this many points.

    public FilteringXYDataset(XYDataset initDelegate)
    {
        delegate = initDelegate;
    }

    public int getCount()
    {
        int realCount = delegate.getCount();
        if (realCount < maxPoints)
            return realCount;
        else
            return maxPoints;
    }

    public Number getX(int index)
    {
        int correctItemCount = delegate.getCount();
        if (correctItemCount < maxPoints)
            return delegate.getX(index);

        // Spread the maxPoints indices evenly across the full dataset.
        int newIdx = (int)(correctItemCount * ((double) index / (double) maxPoints));
        return delegate.getX(newIdx);
    }
    // (repeat for getY)
}

Cool, the dataset is filtered now. It’s a crude filtering, but when you’re plotting a thousand points it’ll do nicely – it’s hard to tell a thousand points from ten thousand points on a normal screen sized plot.

What other dataset adapters could you use?

  • Add an extra 50% to the number of points, and generate them from a Bézier smoothing curve.
  • Add a (0, 0) point to every dataset. Pretend the count is one greater, and then add the point in at the appropriate index.
  • Plotting arbitrary X-Y points on a log-log plot is impossible if the points are negative – so chop them out in another adapter class.

Adapter classes like this are simple and nifty – you take one interface, and provide the same interface back to the library user.

The really great part is that, without multiple inheritance, you can later create a dataset that is smoothed, filtered, and has negative points chopped out – all with one line of code. The smoothing and filtering algorithms are only written once, but can be applied in various orders and with various other tools. … and you still don’t know how the XYDataset is being stored. Good!
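To make the "one line of code" claim concrete, here is a rough sketch of the same idea in Python rather than Java, purely for brevity. All of the class and method names (ListDataset, FilteringDataset, PositiveOnlyDataset, get_count and friends) are invented for this illustration, and the positive-only adapter recomputes its index list on every call, which is wasteful but keeps the example short.

```python
class ListDataset(object):
    """Plain storage: parallel lists of x and y values."""
    def __init__(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
    def get_count(self):
        return len(self.xs)
    def get_x(self, index):
        return self.xs[index]
    def get_y(self, index):
        return self.ys[index]

class FilteringDataset(object):
    """Adapter: expose at most max_points evenly spaced points."""
    def __init__(self, delegate, max_points=1000):
        self.delegate = delegate
        self.max_points = max_points
    def get_count(self):
        return min(self.delegate.get_count(), self.max_points)
    def _map(self, index):
        # Translate an index in the filtered view to the delegate's index.
        real = self.delegate.get_count()
        if real < self.max_points:
            return index
        return int(real * (float(index) / self.max_points))
    def get_x(self, index):
        return self.delegate.get_x(self._map(index))
    def get_y(self, index):
        return self.delegate.get_y(self._map(index))

class PositiveOnlyDataset(object):
    """Adapter: hide points that cannot go on a log-log plot."""
    def __init__(self, delegate):
        self.delegate = delegate
    def _kept(self):
        # Indices of the delegate's points with x > 0 and y > 0.
        d = self.delegate
        return [i for i in range(d.get_count())
                if d.get_x(i) > 0 and d.get_y(i) > 0]
    def get_count(self):
        return len(self._kept())
    def get_x(self, index):
        return self.delegate.get_x(self._kept()[index])
    def get_y(self, index):
        return self.delegate.get_y(self._kept()[index])

raw = ListDataset([-1, 1, 2, 3, 4], [5, 10, -20, 30, 40])
# The "one line": stack whichever adapters you want, in whatever order.
plotted = FilteringDataset(PositiveOnlyDataset(raw), max_points=2)
points = [(plotted.get_x(i), plotted.get_y(i))
          for i in range(plotted.get_count())]
# points == [(1, 10), (3, 30)]
```

Neither adapter knows or cares what it is wrapping, which is the whole point: swapping the order, or adding a third adapter, changes only the line that builds plotted.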

2004/10/22

Exponential Secant Root Finder

Filed under: programming — admin @ 7:25 pm

Given a function ƒ, find x such that ƒ(x) = 0. Sounds simple enough…

Root finding is a hobby of mine. It’s kind of a lame hobby. It can be lots of fun though, and of course there are many real world usages for it. Once in a while, I’ll read through papers and methods on the general subject and implement new algorithms. I once even had an “algorithm war” program which would pit two against each other: given an arbitrary function with one root, which method would find it first? Most consistently first? Highest accuracy with fewest function calls?

I was faced with an interesting, more specific problem yesterday. A simple secant solver had been used on a function with great results, except at a few extreme software inputs. In general, the solver only knew that 0 < x < ∞, and, in fact, ƒ could not be evaluated for values <= 0. “Hm…. the secant method worked so well for this nice, smooth function… if only there was a way to prevent it from going negative,” I thought.

What if the secant method were drawn on a semi-log plot, rather than a cartesian plot? That would cause the root finder to never go negative. It would have greater resolution at smaller values with fewer iterations, but could suffer from requiring more iterations at higher values (for the same tolerance).

/wsimages/plot-linear.png

Figure A: Function plotted on a cartesian scale.

The function being solved looked similar to figure A. In figure A, if the secant method were to hit points between 5 and 10 billion, it would quickly be thrown off by the straight slope into values less than zero, where the function cannot be evaluated.

/wsimages/plot-log.png

Figure B: Function plotted on a semi-log scale.

If it is instead plotted on a semi-log scale, as shown in figure B, the flattened area would throw the values off to very small numbers (around 1E-10). This function would have a very high slope at those values, and the secant method would throw it back into a reasonable range very quickly.

Implementing this modified secant method idea was pretty simple. The standard secant root solver is used, but with a variation of how the next point is chosen:

  • Rather than fitting y = m * x + b to two points on the curve, a logarithmic function y = m * log(x) + b (a straight line on the semi-log plot) was fitted to the two points.
    1. The m value was calculated with m = (y1 - y2) / (log(x1) - log(x2)),
    2. the b value was calculated as b = y1 - log(x1) * m.
  • A new x coordinate was found by solving the equation for y == 0, which simplifies down to x = exp((y1 * log(x2) - log(x1) * y2) / (y1 - y2)).

This method proved to be as quick as a normal secant solution, and very effective for functions which cannot be evaluated at values <= 0.
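The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the solver that was actually used: the function name log_secant_root, the tol and max_iter parameters, and the stopping test are all assumptions made for the example.

```python
import math

def log_secant_root(f, x1, x2, tol=1e-12, max_iter=100):
    """Find x > 0 with f(x) == 0 using secant steps taken in log(x).

    x1 and x2 are two positive starting guesses.
    """
    y1, y2 = f(x1), f(x2)
    for _ in range(max_iter):
        if y1 == y2:
            raise ZeroDivisionError("flat secant; cannot take a step")
        # x = exp((y1 * log(x2) - log(x1) * y2) / (y1 - y2))
        # exp() keeps every new guess strictly positive.
        x_new = math.exp((y1 * math.log(x2) - math.log(x1) * y2) / (y1 - y2))
        y_new = f(x_new)
        # Stop on an exact root or when the step is tiny relative to x.
        if y_new == 0.0 or abs(x_new - x2) <= tol * abs(x_new):
            return x_new
        x1, y1, x2, y2 = x2, y2, x_new, y_new
    raise RuntimeError("did not converge")

# A function that is only defined for x > 0.
root = log_secant_root(lambda x: math.log(x) - 1.0, 1.0, 10.0)
```

Because each new guess comes out of exp(), the solver can never propose a value at or below zero, which is exactly the property the plain secant method was missing.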

2004/10/01

Some Nifty Things

Filed under: java, programming, python — admin @ 1:08 am

Lately I seem to have pushed a large number of important projects to the side to make room for some smaller personal projects. Here’s what I’ve been thinking about lately that’s kinda nifty:

  • I gave notice of resignation to my employer earlier this week. October 8th is my final work day. I’ll be free for a few months to pursue some contract work that I’ve got waiting in the wings, and then I’m into a new job in the new year. I’ll be working as the head of software development at a small start-up company. Exciting!

  • I built a new photo gallery system. This one builds on a number of features in flickr. It has tag based image categorizing, EXIF photo information, and a small number of different image sizes that can be viewed. You can see it in action on my pictures page.

  • A new version of Java was released today. Java 1.5, Tiger, contains a number of boring features: autoboxing (saves some typing), a new for/in loop (saves some typing), and generics (saves some typing – may find some ClassCastExceptions waiting to happen at compile time). These seem to be the features everyone thinks are cool, but I think they’re pretty lame.

    That said, there are a couple of features that are cool. Java 1.5 adds the ability to change a function’s return type when it is being overridden. You can only change it to a subclass of the original type, but this makes a lot of sense. When you really think about it… it turns out that all it does is save you some typing. In any situation where you’d actually use this, you’d be doing some casting that you wouldn’t have to do anymore.

    Hmmm… so… what does it have that’s cool? Variable length argument lists… I’ve never missed these in Java before. Annotations look pretty cool, but if you think about the pretty static applications they have unless you recode your own compiler, they seem to be pretty much just a couple nice builtin features and a new way to add documentation. So… static imports? Yeah, that’s nice. Yay!

    I really wanted to be impressed with Java 1.5’s new features, since I’m doing a lot of Java development these days. But I can just type faster, and I’ll still retain backwards compatibility with people using Java 1.4.

  • I added HTTP Digest authentication into my Twisted based weblog aggregator. This allows me to view LiveJournal RSS feeds as a logged-in user, and hence get links to protected LiveJournal entries that my user can see. I submitted a small patch to urllib2 to make it work with those same LiveJournal feeds, and I may add real authentication support to twisted.web.client rather than the hacked support I’m currently using. Maybe this weekend, if I have the inclination.

2004/08/29

Book and Game Reviews

Filed under: books, video games — admin @ 8:43 pm

Book: The Curious Incident of the Dog in the Night-Time

A week ago, Joel Spolsky wrote some comments regarding a book called The Curious Incident of the Dog in the Night-Time, written by Mark Haddon. I added the book to an order I was placing at Amazon, and Thursday afternoon it arrived at my workplace.

The story is written from the point of view of Christopher Boone, a fifteen-year-old boy who suffers from Asperger’s Syndrome. Young Christopher finds a poodle killed, and sets out to solve the mystery of who killed the Wellington.

This book is very definitely going to hold a place near the top of my most favourite novels. It is quite simply an amazing novel.

The use of the first-person perspective pulls you into this novel, and engages you completely. When reading this book, it truly feels honest and real. The author writes as if the main character is writing this book – as if it’s a novel he wrote during his efforts to solve the mystery. It is an amazingly strong narrative technique. If the author’s name on the cover and the character’s name in the book matched up, a reader would never believe this to be a work of fiction.

I recommend that everyone read this book, especially you geeky types. You’ll probably giggle through the first few chapters, as I did, but it won’t take long before you realize how serious things are.

Game: Tales of Symphonia

Tales of Symphonia is a role-playing game for the GameCube system. It is published by Namco. It is sure to be available at any of your favourite video game stores.

Colette is the Chosen of Regeneration, fated to regenerate the world and save everyone’s existence. Her party, including the inexperienced swordsman (Lloyd, your main character), a young magician (Genis), your school-teacher (Raine) and a mercenary (Kratos) set out to regenerate the world, each for their own reasons. All that stands in their way is an army of half-elves, a world where humans are harvested in ranches, and just about everything else you can imagine.

This game is basically amazing. It has wonderful elements of some of the best fantasy games ever, such as FF6 and FF7. The story is a beautiful twisting saga. The graphics are beautiful, cel-shaded, and very similar to Wind Waker’s. The voice acting is the best I’ve ever heard in a video game, as it actually adds to the game’s dialogue rather than making my ears cringe. And finally, the characters are deep (for a video game), and subtle (for a video game).

I recommend this game strongly for anyone who has ever liked Final Fantasy games, and who owns a GameCube. Actually, you know what… buy a GameCube for this. It’s worthwhile and relatively cheap anyways.

Roundup

Filed under: programming, python — admin @ 2:47 am

Roundup is some damn beautiful software. It’s a very nice and simple package for software bug tracking (oh, pardon me… issue tracking). It can be customized very easily, and in fact from a minimal ‘tracker’ just about any web-based database application could be built with a minimum of fuss. The mail gateway is a beautiful design too. Oh, and I love the fact that e-mailing the system creates a user “account” for that e-mail address (unless it’s associated with an existing account, of course). No fuss bug tracking.

I’d love if it supported some e-mail security, though. Digitally signed messages, for example. The current complete lack of e-mail security makes me irrationally scared – a bad person couldn’t do much, but they could do some.

Here’s a neat trick – for nice clean URLs, place the roundup.cgi script wherever you want it to be, rename it to just roundup, and add a couple lines to your Apache configuration:

<Location /blah/roundup>
    SetHandler cgi-script
</Location>

And you’ll magically get the CGI interface of roundup working without the minor annoyance of having ‘roundup.cgi’ in your URLs. Go Apache!

2004/08/26

DevEnv vs. the Programmer

Filed under: programming, python — admin @ 3:57 pm

How can you capture the console output of a program, when it buffers that output if you’re not using a console to view it? This was a problem I ran into when building an automation tool for MS Visual Studio .NET. In the end, the programmer subjugated his tool (as it should be) by beating it over the head with a pipe.

A few of us programmers work in the unfortunately unfriendly environment of MS Windows. It might look pretty and have lots of applications written for it, but it’s basically an unfriendly environment for a software developer. Even MS Visual Studio .NET can be unfriendly to a developer, which is unfortunate since it’s the one program you’d expect would be really friendly.

Visual Studio allows you to provide command-line options which start a software build. Running inside a command prompt, all you need to do is pass a solution file and a build configuration to the program, and you’re off. In fact, Visual Studio even gives you more command-line flexibility by providing two executables, devenv.com and devenv.exe – the former will tend towards printing console output all the time, while the latter will avoid it if a build log file is provided instead.

In the creation of a complete build tool, I wanted to run devenv.com and capture the output so I could display the progress to a user. That’s when it became tough. Running the executable through os.popen (or any other popen function) didn’t accomplish what I wanted – the output being printed to the console (and now being read through a pipe) was buffered inside the devenv process and only printed after the build was completed. Clearly this didn’t accomplish the goal of providing a progress display for the user.

devenv.com provides an option which I thought might have some promise: /out. This writes the build output to a specified file. Great! All I need to do is start it writing to a file, and read through the file at the same time. I wasn’t sure of the implementation details, but it seemed feasible. Unfortunately, the devenv process locks the output file exclusively. Python’s open() was unable to read it, and even trying to find obscure parameters to win32file’s functions failed to give me the necessary access to the file.

In the UNIX world, the solution would be obvious. Create a pipe, and write the build output into the pipe while reading the pipe. In the Windows world though, a pipe is not a filesystem object. It can’t be created in a specific location, and so devenv wouldn’t be able to open it like a normal file and write to it. I considered for a while that there are a bunch of standard reserved file names, like CON and PRN. Might one of them help me? Could one of them be used to connect to a pipe? Well, no. Not really. They’re ancient history, a relic from years gone past, and they don’t have any concept of a pipe.
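For comparison, the UNIX approach described above can be sketched with os.mkfifo. This is only an illustration and runs only where os.mkfifo exists: fake_build stands in for a real build tool that writes its log to a path, and read_through_fifo is a made-up helper name.

```python
import os
import tempfile
import threading

def read_through_fifo(write_output):
    """Run write_output(path) in a thread and collect what it writes
    to a named pipe (FIFO) created on the filesystem."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "build-output")
    os.mkfifo(path)  # the pipe is an ordinary filesystem object
    writer = threading.Thread(target=write_output, args=(path,))
    writer.start()
    lines = []
    try:
        # open() blocks until the writer has opened the other end.
        with open(path) as fifo:
            for line in fifo:
                lines.append(line)  # a real tool would display this live
    finally:
        writer.join()
        os.remove(path)
        os.rmdir(tmpdir)
    return lines

def fake_build(path):
    # Stand-in for a build tool that is told to log to `path`.
    with open(path, "w") as out:
        out.write("compiling a.c\n")
        out.write("compiling b.c\n")

captured = read_through_fifo(fake_build)
```

The writer never needs to know it is talking to a pipe; it just opens the path it was given, which is the property the Windows solution has to recreate.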

I started digging around for more information about named pipes, which seemed to be the preferred mechanism for IPC in Windows software. Could a named pipe be referenced through a file location? Yes, it can! \\%(host)s\pipe\%(name)s refers to the named pipe called name on the machine called host. And as a bonus, the host . refers to the local machine at all times. Now I finally have a plan of action: Create a named pipe, make devenv write to \\.\pipe\buildOutput, and read the output on the fly.
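The path format is easy to get wrong because of all the backslashes, so here is a tiny helper that builds it. The function name pipe_path is hypothetical; only the path layout comes from the description above.

```python
def pipe_path(name, host="."):
    """Build the file-style path for a Windows named pipe.

    host "." means the local machine.
    """
    # A raw string keeps the backslashes literal.
    return r"\\%s\pipe\%s" % (host, name)

local = pipe_path("buildOutput")
remote = pipe_path("buildOutput", host="buildserver")
```

Here local is the same \\.\pipe\buildOutput path used in the plan above, and buildserver is just a placeholder machine name.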

In the end, I wrapped the named pipe code into a module, NamedPipe, and the code to read devenv output on the fly was easy:

import os
import threading

from NamedPipe import AnonymousNamedPipeReader

pipe = AnonymousNamedPipeReader()

# Application command line...
# (build application cmd line, devenv.com x.sln /build Release, etc.)
# {code omitted}
cmd = cmd + r' /out \\.\pipe\%s' % pipe.name

# Okay, one of us needs to loop and accept a pipe connection, read
# data, display it to the user, and so on.
# The other of us needs to run the build command.
class ExecThread(threading.Thread):
    def __init__(self, cmdLine):
        threading.Thread.__init__(self)
        self.cmdLine = cmdLine
    def run(self):
        self.retval = os.system(self.cmdLine)
thread = ExecThread(cmd)
thread.start()

buildLog = ""
line = ""

for data in pipe:
    buildLog += data
    # {code omitted - display output on the fly}

Now, obviously this code snippet has left out all the magic. It’s a bit long and boring, so I thought maybe you’d just like a link to NamedPipe.py instead. Through the magic of functions like CreateNamedPipe and ConnectNamedPipe, you can read data being written to a file on the fly. It even works when the writer is a jerk, locking the file.
