When you have good tools, working with PDF files can be fun. When you have no tools – it’s time to build a pure-Python library for working with PDF files.
Enter the challenge: create a website that can split and merge PDF files on demand. Given a PDF file of a few hundred pages, split it and store the individual pages as separate PDF files. On demand, merge any set of individual pages to create and serve a new PDF file.
Rejected solution #1: activePDF Toolkit, a COM based library that receives excellent reviews from a co-worker. Sounds super! However, my deployment platform is Linux, making a Windows COM library virtually unusable.
Rejected solution #2: pdftk, a command-line utility that allows splitting and merging PDF files. pdftk is based on a modified version of the Java iText library, which I am familiar with. However, spawning processes on every page view to merge PDF files is probably relatively slow. When you add in the fact that my pdftk process kept dying with SIGABRT when running it through os.system, os.spawnl, and popen (in other words, I couldn’t get it to work), this solution was rejected.
Rejected solution #3: Use the iText Java library, which is capable of splitting and merging files. However, my web server is somewhat memory limited at the moment. Adding a JRE would not help. Plus, who wants to code in Java when it can be done in Python? Nobody, that’s who.
Enter the solution: a pure-Python library for working with PDF files. It may not be perfect (okay, okay, it definitely is not), but it does work with the PDF files I was most interested in splitting and merging. I’ve also tested it lightly with other random PDF files I’ve found on my system and it seems to work pretty happily with them.
I’ve created a pyPdf project page and uploaded it to PyPI.
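A minimal sketch of the split-and-merge flow described above. It assumes the pyPdf class names PdfFileReader and PdfFileWriter and their getNumPages, getPage, addPage, and write methods. The helper names split_pages, merge_pages, and split_to_files, and the page-N.pdf naming scheme, are made up for illustration.

```python
# Sketch only: the reader/writer objects are assumed to behave like pyPdf's
# PdfFileReader and PdfFileWriter (getNumPages, getPage, addPage, write).


def split_pages(reader, new_writer):
    """Return a list of writers, each holding exactly one page of `reader`."""
    writers = []
    for index in range(reader.getNumPages()):
        writer = new_writer()
        writer.addPage(reader.getPage(index))
        writers.append(writer)
    return writers


def merge_pages(readers, new_writer):
    """Return one writer holding every page of every reader, in order."""
    writer = new_writer()
    for reader in readers:
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
    return writer


def split_to_files(path):
    """Write each page of `path` to page-N.pdf (hypothetical naming scheme)."""
    # Imported here so the helpers above can be used without pyPdf installed.
    from pyPdf import PdfFileReader, PdfFileWriter

    source = open(path, "rb")
    try:
        reader = PdfFileReader(source)
        for number, writer in enumerate(split_pages(reader, PdfFileWriter)):
            target = open("page-%d.pdf" % (number + 1), "wb")
            try:
                writer.write(target)
            finally:
                target.close()
    finally:
        source.close()
```

The helpers take the reader and a writer factory as arguments, so the page-shuffling logic can be exercised with stand-in objects before pointing it at real files.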
I get the same errors as Shailesh when using pyPdf, and the only solution I have found is to uncompress the PDF files with:
pdftk doc.pdf output doc2.pdf uncompress
before using them… which is definitely not a good solution.
Any ideas? Maybe I missed something!
Good job, though! It’s a really useful tool!
Comment by Florian — 2006/04/16 @ 9:46 am
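Florian's workaround above could be scripted roughly as follows. This is a sketch that assumes pdftk is installed and on the PATH; the helper names uncompress_command and uncompress are hypothetical, and the command line is the one quoted in the comment.

```python
import subprocess


def uncompress_command(source, target, pdftk="pdftk"):
    """Build the argument list for: pdftk SOURCE output TARGET uncompress."""
    return [pdftk, source, "output", target, "uncompress"]


def uncompress(source, target):
    """Run pdftk on `source`, writing the uncompressed copy to `target`.

    Raises subprocess.CalledProcessError if pdftk exits with a non-zero status.
    """
    subprocess.check_call(uncompress_command(source, target))
```

Calling uncompress("doc.pdf", "doc2.pdf") would then produce the file that pyPdf is pointed at afterwards.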
I am getting errors when I try to run the example script. Any ideas?
Traceback (most recent call last):
  File "splmrg.py", line 7, in ?
    output.addPage(input1.getPage(0))
  File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 232, in getPage
    self._flatten()
  File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 244, in _flatten
    catalog = self.getObject(self.trailer["/Root"])
  File "E:\Python23\Lib\site-packages\pyPdf\pdf.py", line 296, in getObject
    retval = readObject(self.stream, self)
  File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 63, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 329, in readFromStream
    value = readObject(stream, pdf)
  File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 54, in readObject
    return ArrayObject.readFromStream(stream, pdf)
  File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 128, in readFromStream
    arr.append(readObject(stream, pdf))
  File "E:\Python23\Lib\site-packages\pyPdf\generic.py", line 57, in readObject
    return NullObject.readFromStream(stream)
NameError: global name 'NullObject' is not defined
Comment by Shailesh — 2006/03/01 @ 3:06 am
It’s good to see another PDF library for Python. I wrote one aimed at reading PDF files and extracting information from them:
http://www.boddie.org.uk/david/Projects/Python/pdftools/
It’s licensed under the LGPL.
At this point I usually tentatively put forward the idea of collaborating on a general purpose PDF library for Python, but I’ve got a lot of other ongoing projects to contend with.
Still, let me know if you want to cooperate or discuss ideas.
Comment by David Boddie — 2006/01/18 @ 10:30 am
Thank you. I think I can write my PDF download-and-merge application in pure Python soon.
Two more things:
1. How can I create a PDF so that all the pages appear to be the same size in PDF readers?
2. Is there a to-do list? For example, (re)constructing the bookmarks according to a text file.
Comment by Lee June — 2006/01/24 @ 1:15 am
Good job, thanks!
One of the best solutions for generating PDF files in Python is ReportLab, I guess, but the merge functionality is missing in that lib (well, missing in the open-source edition).
Comment by Fabrizio — 2006/05/26 @ 2:32 pm
When I tried the document FreeImage390.pdf (downloaded from http://switch.dl.sourceforge.net/sourceforge/freeimage/FreeImage390.pdf),
I got the following error:
title = FreeImage 3.9.0 documentation
document1.pdf has 101 pages.
Traceback (most recent call last):
  File "F:\pdf\python\pyPdf-1.6\ex1.py", line 38, in ?
    output.write(outputStream)
  File "D:\Python24\Lib\site-packages\pyPdf\pdf.py", line 121, in write
    obj.writeToStream(stream)
  File "D:\Python24\Lib\site-packages\pyPdf\generic.py", line 326, in writeToStream
    value.writeToStream(stream)
  File "D:\Python24\Lib\site-packages\pyPdf\generic.py", line 128, in writeToStream
    data.writeToStream(stream)
AttributeError: type object 'NullObject' has no attribute 'writeToStream'
Comment by Steven Lee — 2006/08/08 @ 2:33 am
Interesting. Saw this (pyPDF) today. Nice to see another Python/PDF library. I’ll download it and check it out. I came across pyPDF via http://del.icio.us/ixx/pdf, which also had a link to an article of mine about my xtopdf PDF creation/conversion toolkit, also written in Python (it uses ReportLab). xtopdf is here:
http://www.dancingbison.com/products.html
Vasudev Ram
Dancing Bison Enterprises
http://www.dancingbison.com
Comment by Vasudev Ram — 2006/12/24 @ 9:09 am
Just came across this – great effort and excellent library.
Comment by bob — 2006/12/29 @ 1:04 pm
Hi!
I’ve been looking for a lib to handle PDFs in python, and yours looked awesome… until I tried to handle PDFs from gb.espacenet.com (a patent search site).
I was trying to build a tool to merge the separate PDF pages into full PDF files (you will know what I mean if you download a PDF from the site and have a look at it), but pyPdf keeps complaining that the PDF it is handling has not been decrypted, even though I never needed a password to open any of these PDFs in any reader.
I would greatly appreciate it if you could have a look at it.
Comment by Ric — 2007/04/18 @ 3:36 pm
I just got it patched myself. The problem was the absence of an “/ID” mark on the pdf.
I patched your pdf.py this way:
I changed
    id_entry = self.safeGetObject(self.trailer['/ID'])
    id1_entry = self.safeGetObject(id_entry[0])
to:
    if self.trailer.has_key('/ID'):
        id_entry = self.safeGetObject(self.trailer['/ID'])
        id1_entry = self.safeGetObject(id_entry[0])
    else:
        id1_entry = ''
And now it works perfectly.
Comment by Ric — 2007/04/18 @ 4:15 pm
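The guard in Ric's patch can be tried in isolation with a plain dict standing in for the PDF trailer. This is only a sketch: first_id_entry is a hypothetical helper, and its resolve parameter is a stand-in for safeGetObject that defaults to returning its argument unchanged. It uses the `in` operator rather than has_key, since has_key was removed from dicts in Python 3.

```python
def first_id_entry(trailer, resolve=lambda obj: obj):
    """Return the first element of the trailer's /ID array, or '' if absent.

    `resolve` is a placeholder for whatever dereferences indirect objects
    (safeGetObject in the patch above); by default it is the identity.
    """
    if "/ID" in trailer:
        id_entry = resolve(trailer["/ID"])
        return resolve(id_entry[0])
    # No /ID entry in the trailer: fall back to an empty identifier.
    return ""
```

With a trailer such as {"/Root": "catalog"} the function returns an empty string instead of raising KeyError, which mirrors the behaviour the patch is after.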