Python PDF Split/Merge Library

When you have good tools, working with PDF files can be fun. When you have no tools – it’s time to build a pure-Python library for working with PDF files.

Enter the challenge: create a website that can split and merge PDF files on demand. Given a PDF file of a few hundred pages, split the PDF file and store individual pages as separate PDF files. On demand, merge any set of individual pages to create and serve a new PDF file.

Rejected solution #1: activePDF Toolkit, a COM-based library that receives excellent reviews from a co-worker. Sounds super! However, my deployment platform is Linux, making a Windows COM library virtually unusable.

Rejected solution #2: pdftk, a command-line utility that allows splitting and merging PDF files. pdftk is based on a modified version of the Java iText library, which I am familiar with. However, spawning a process on every page view to merge PDF files is likely to be slow. Add in the fact that my pdftk process kept dying with SIGABRT when run through os.system, os.spawnl, and popen (in other words, I couldn’t get it to work), and this solution was rejected.

Rejected solution #3: Use the iText Java library, which is capable of splitting and merging files. However, my web server is somewhat memory limited at the moment. Adding a JRE would not help. Plus, who wants to code in Java when it can be done in Python? Nobody, that’s who.

Enter the solution: a pure-Python library for working with PDF files. It may not be perfect (okay, okay, it definitely is not), but it does work with the PDF files I was most interested in splitting and merging. I’ve also tested it lightly with other random PDF files I’ve found on my system and it seems to work pretty happily with them.

I’ve created a pyPdf project page and uploaded it to PyPI.
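
To give a feel for it, here is a minimal sketch of the split and merge steps described above. It assumes the `PdfFileReader` and `PdfFileWriter` classes and the `getNumPages`, `getPage`, `addPage`, and `write` methods from the pyPdf 1.x releases; the helper names (`page_filename`, `split_pdf`, `merge_pages`) and the file naming scheme are made up for illustration and are not part of the library.

```python
import os


def page_filename(out_dir, index):
    """Return the path used to store page `index` (0-based) on disk.

    The naming scheme is an assumption for this sketch.
    """
    return os.path.join(out_dir, "page-%04d.pdf" % (index + 1))


def split_pdf(source_path, out_dir):
    """Write every page of source_path to its own one-page PDF file."""
    # Imported here so page_filename stays usable without pyPdf installed.
    from pyPdf import PdfFileReader, PdfFileWriter

    with open(source_path, "rb") as source:
        reader = PdfFileReader(source)
        for index in range(reader.getNumPages()):
            writer = PdfFileWriter()
            writer.addPage(reader.getPage(index))
            with open(page_filename(out_dir, index), "wb") as target:
                writer.write(target)


def merge_pages(page_paths, output_stream):
    """Concatenate the given one-page PDFs, in order, into output_stream."""
    from pyPdf import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    sources = []
    try:
        for path in page_paths:
            source = open(path, "rb")
            sources.append(source)
            writer.addPage(PdfFileReader(source).getPage(0))
        # The inputs are kept open until the write finishes, to be safe.
        writer.write(output_stream)
    finally:
        for source in sources:
            source.close()
```

In a web view, `merge_pages` could be handed the response stream directly, so nothing has to be spawned per request.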

11 Comments on Python PDF Split/Merge Library

  1. Florian
    2006/04/16 at 9:46 am (8 years ago)

    I get the same errors as Shailesh when using pyPdf, and the only workaround I found is to uncompress the PDF files with:
    pdftk doc.pdf output doc2.pdf uncompress
    before using it… which is definitely not a good solution :(

    Any ideas? Maybe I missed something!

    Good job, however! It’s a really useful tool!

  2. Shailesh
    2006/03/01 at 3:06 am (8 years ago)

    I am getting errors when I try to run the example script. Any ideas?

    Traceback (most recent call last):
      File "splmrg.py", line 7, in ?
        output.addPage(input1.getPage(0))
      File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 232, in getPage
        self._flatten()
      File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 244, in _flatten
        catalog = self.getObject(self.trailer["/Root"])
      File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 296, in getObject
        retval = readObject(self.stream, self)
      File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 63, in readObject
        return DictionaryObject.readFromStream(stream, pdf)
      File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 329, in readFromStream
        value = readObject(stream, pdf)
      File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 54, in readObject
        return ArrayObject.readFromStream(stream, pdf)
      File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 128, in readFromStream
        arr.append(readObject(stream, pdf))
      File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 57, in readObject
        return NullObject.readFromStream(stream)
    NameError: global name 'NullObject' is not defined

  3. David Boddie
    2006/01/18 at 10:30 am (8 years ago)

    It’s good to see another PDF library for Python. I wrote one aimed at reading PDF files and extracting information from them:

    http://www.boddie.org.uk/david/Projects/Python/pdftools/

    It’s licensed under the LGPL.

    At this point I usually tentatively put forward the idea of collaborating on a general purpose PDF library for Python, but I’ve got a lot of other ongoing projects to contend with.

    Still, let me know if you want to cooperate or discuss ideas.

  4. Lee June
    2006/01/24 at 1:15 am (8 years ago)

    Thank you. I think I can write my pdf download-merge application in pure python soon :)
    Two more things:
    1. How can I create a PDF so that all the pages appear to be the same size in PDF readers?
    2. Is there a to-do list? For example, (re)constructing the bookmarks from a text file.

  5. Fabrizio
    2006/05/26 at 2:32 pm (8 years ago)

    Good job, thanks!
    One of the best solutions for generating PDF files in Python is ReportLab, I guess, but the merge functionality is missing from that library (well, missing from the open-source version :) ).

  6. Steven Lee
    2006/08/08 at 2:33 am (8 years ago)

    When I tried the document FreeImage390.pdf (downloaded from http://switch.dl.sourceforge.net/sourceforge/freeimage/FreeImage390.pdf), I got the following error:

    title = FreeImage 3.9.0 documentation
    document1.pdf has 101 pages.
    Traceback (most recent call last):
      File "F:\pdf\python\pyPdf-1.6\ex1.py", line 38, in ?
        output.write(outputStream)
      File "D:\Python24\Lib\site-packages\pyPdf\pdf.py", line 121, in write
        obj.writeToStream(stream)
      File "D:\Python24\Lib\site-packages\pyPdf\generic.py", line 326, in writeToStream
        value.writeToStream(stream)
      File "D:\Python24\Lib\site-packages\pyPdf\generic.py", line 128, in writeToStream
        data.writeToStream(stream)
    AttributeError: type object 'NullObject' has no attribute 'writeToStream'

  7. bob
    2006/12/29 at 1:04 pm (7 years ago)

    Just came across this – great effort and excellent library.

  8. Ric
    2007/04/18 at 3:36 pm (7 years ago)

    Hi!

    I’ve been looking for a lib to handle PDFs in python, and yours looked awesome… until I tried to handle PDFs from gb.espacenet.com (a patent search site).

    I was trying to build a tool to merge the separate PDF pages into full PDF files (you will know what I mean if you download a PDF from the site and have a look at it), but pyPdf keeps complaining that the PDF it is handling has not been decrypted, even though I never needed a password to open any of these PDFs in any reader.

    I would greatly appreciate it if you could have a look at it.

  9. Ric
    2007/04/18 at 4:15 pm (7 years ago)

    I just patched it myself. The problem was the absence of an “/ID” entry in the PDF.

    I patched your pdf.py this way:

    Changed
        id_entry = self.safeGetObject(self.trailer['/ID'])
        id1_entry = self.safeGetObject(id_entry[0])

    to:
        if self.trailer.has_key('/ID'):
            id_entry = self.safeGetObject(self.trailer['/ID'])
            id1_entry = self.safeGetObject(id_entry[0])
        else:
            id1_entry = ''

    And now it works perfectly.
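
The guard in the patch above can be sketched in isolation. This is only an illustration: a plain dict stands in for the reader's trailer, the `safeGetObject` indirection is left out, and `first_id_entry` is a hypothetical name, not part of pyPdf.

```python
def first_id_entry(trailer):
    """Return the first element of the trailer's /ID array, or '' if absent.

    `trailer` is a plain dict here, standing in for the parsed PDF trailer.
    """
    if "/ID" in trailer:
        return trailer["/ID"][0]
    # No /ID entry: fall back to an empty string, as the patch does.
    return ""


# A trailer with an /ID array and one without.
with_id = {"/Root": "catalog", "/ID": ["abc123", "def456"]}
without_id = {"/Root": "catalog"}

print(repr(first_id_entry(with_id)))
print(repr(first_id_entry(without_id)))
```

The `in` test does the same job as `has_key` and is the spelling that still works on newer Python versions.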
