A big part of the product I'm working on entails producing PDFs of business reports. The product is completely web-based and all reports are available via the web but PDF export is necessary for a couple of reasons:
- The reports must go to print and must look professional when printed.
- Reports are often sent via email to people who don’t have access to the website. (We don’t like this and would prefer for everyone to come to the website to view reports. However, PDF/Excel via email is how things have been done in this space forever, so we have to take a gradual approach to bringing our clients’ processes to the web.)
The web application presents multi-page reports as multiple web pages (each report page has its own URL) but when exporting to PDF, a bunch of these pages need to go into a single multi-page document.
The product is built on Rails (with lots of custom stuff on top). Each individual report page has an ERB (.rhtml) template that generates the HTML version of the report. We then duplicate the view code in an .rpdf template, which is just Ruby with a PDF::Writer object scoped in. We also export to Excel so we have an additional .rxls template, which is just Ruby with a Spreadsheet::Excel object scoped in. For any given report, we end up with something that looks like this in the project:
- cost-comparison.rhtml: The HTML view code.
- cost-comparison.rpdf: The PDF view code.
- cost-comparison.rxls: The Excel view code.
The system is pretty simple, really, and we've been able to accomplish a lot with it. Our controller and model code is completely view independent – all routing and template picking logic lives in a nice little plugin.
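A minimal sketch of what format-based template picking might look like. This is not the actual plugin: the `TEMPLATE_EXTENSIONS` table, the `template_for` helper, and the default `views_dir` are hypothetical names made up for illustration. Only the three file extensions come from the setup described above.

```ruby
# Hypothetical sketch of format-based template picking; not the real plugin.
TEMPLATE_EXTENSIONS = {
  html: "rhtml", # ERB template producing HTML
  pdf:  "rpdf",  # Ruby with a PDF::Writer object in scope
  xls:  "rxls"   # Ruby with a Spreadsheet::Excel object in scope
}.freeze

# Returns the template path for a report and a requested format.
# The views_dir default is an assumption, not the project's real layout.
def template_for(report, format, views_dir = "app/views/reports")
  ext = TEMPLATE_EXTENSIONS.fetch(format.to_sym) do
    raise ArgumentError, "unsupported format: #{format.inspect}"
  end
  File.join(views_dir, "#{report}.#{ext}")
end
```

The point of a table like this is that controllers and models never mention a view format; adding a fourth output type would mean one new entry and one new template per report.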
Rails 2.0 adds explicit support for multi-view setups like this and, although I haven’t looked into it yet with any great depth, it seems to be very well done and should be able to accommodate our needs.
(As an aside, we hacked support for HTTP content negotiation in here as well, which was fun but absolutely no-one cares or will ever use it, so I digress.)
Anyway, what I really want to talk about is that maintaining these PDF templates is a real pain in the ass. PDF is basically a raw 2D vector surface with drawing primitives similar to what you find in SVG (e.g., lines, rects, circles, ellipses, paths, text, etc.). You don’t get a nice box model and flowing layout as you do in HTML. Ruby’s PDF::Writer alleviates a bit of the pain here by providing automatic text wrapping and a few other niceties but, for the most part, you're forced to manage the x, y, width, and height of everything manually. Pain.
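To give a feel for that bookkeeping, here is a small sketch in plain Ruby. It does not use PDF::Writer at all and draws nothing; `place_rows`, `Placement`, and the page, margin, and row sizes are made-up values for illustration. All it does is work out which page and y coordinate each row of a table would land on, which is the part HTML's flow layout normally does for you.

```ruby
# Illustrative only: manual positioning of table rows on fixed-size pages.
PAGE_HEIGHT = 792 # points (US Letter); PDF's origin is the bottom-left corner
MARGIN      = 36  # assumed top and bottom margin
ROW_HEIGHT  = 18  # assumed fixed height per row

Placement = Struct.new(:page, :y, :text)

# Assigns each row a page number and the y coordinate of its bottom edge,
# starting a new page whenever a row would cross the bottom margin.
def place_rows(rows)
  page = 1
  y = PAGE_HEIGHT - MARGIN
  rows.map do |text|
    if y - ROW_HEIGHT < MARGIN
      page += 1
      y = PAGE_HEIGHT - MARGIN
    end
    y -= ROW_HEIGHT
    Placement.new(page, y, text)
  end
end
```

Even this toy version ignores variable row heights, repeated table headers, and horizontal positioning, each of which adds more of the same arithmetic to a real template.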
The PDF templates are especially disgusting from a code clarity perspective. They are typically 3-4x the LOC of our HTML templates, at least half of which is dedicated to managing the incidental complexity of page layout, positioning, and style. The other half is a complete duplication of the presentation logic from the HTML view.
Managing these PDF templates is a nightmare and everyone tries to avoid them if at all possible.
I've been experimenting with different ways to make generating PDFs a bit simpler. One of the more promising experiments was using a CSS print stylesheet on top of the existing HTML-based reports. This turned out to work fairly well in Firefox using MacOS’s built-in Print to PDF support. The resulting PDF actually looked quite a bit better than the hand-coded PDF output in a lot of cases.
Instead of maintaining N PDF templates (one for each report page), I could theoretically have a single print stylesheet for the entire project and get PDF versions of all of our HTML based reports for free.
Theoretically.
The problem is that you can’t count on Print to PDF support on the client (MacOS is the only platform I know of that has good support built in). You can get Print to PDF support on Windows but you have to purchase a separate PDF print driver. But even if you could count on Print to PDF being supported everywhere, the browser incompatibilities going to print are another barrier you'd have to deal with. And even if that worked, we would still need to generate PDFs on the server because we have a report scheduler that automatically runs reports and sends Email with PDF attachments.
Clearly, moving PDF generation client side isn’t the answer. But I hope that doesn’t necessarily mean we can’t use the browser as our rendering engine. What I'd like to do is run Firefox/Gecko on the server. It would load up the report, render it with the print stylesheet and then output the PDF. The concept is not unlike khtml2png or webkit2png but instead of outputting a raster image, it would output a PDF: gecko2pdf, if you will.
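Since no such tool exists, the most I can sketch is the shape of the server-side code that would call it. In the Ruby below, `ReportPdfExporter`, the renderer interface, and the example URLs are all hypothetical; the renderer is injected so that a real browser-engine wrapper could be swapped in later without touching the callers (the scheduler, the export controller).

```ruby
# Sketch only: the renderer is injected because no "gecko2pdf" tool exists.
# Anything that responds to call(page_urls, output_path) will do.
class ReportPdfExporter
  def initialize(renderer)
    @renderer = renderer
  end

  # Asks the renderer to turn every page of a report into one
  # multi-page PDF, then returns the path of that PDF.
  def export(page_urls, output_path)
    raise ArgumentError, "no report pages given" if page_urls.empty?
    @renderer.call(page_urls, output_path)
    output_path
  end
end

# Stand-in renderer that records what it was asked to do instead of
# driving a real browser engine.
calls = []
fake_renderer = ->(urls, path) { calls << [urls, path] }

exporter = ReportPdfExporter.new(fake_renderer)
exporter.export(["/reports/cost-comparison/1", "/reports/cost-comparison/2"],
                "cost-comparison.pdf")
```

The renderer receives all page URLs at once because the export has to produce a single document from several web pages.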
I've been researching the concept off and on for about six months but I haven’t seen anything even approaching this. All of the discussion around Firefox 3.0’s PDF support seems specific to saving the screen presented page as PDF (i.e., without the print CSS applied). Another huge downer is that I can’t seem to find many examples of people using Gecko on the server in any sort of automated fashion.
Quite frankly, I'm stumped. I figured I'd take a moment and write down my thoughts to clear my head. Maybe someone has been down this dark path before or maybe I'm just way out in left field with the whole concept.
Discuss
I know it is neither free nor open source, but I've just learned about princexml (http://www.princexml.com) from this presentation: http://youtube.com/watch?v=vcXUrNSvjhU. It seems like a nice product, although somewhat pricey.
— Eric-Olivier Lamey on Thursday, January 17, 2008 at 08:48 AM #
I did a similar thing to email filled-in hosting contracts a few years ago. At that time it was html2ps, and then converting the PS to PDF using ps2pdf. Both are available on most Unix boxes.
— garyl on Thursday, January 17, 2008 at 09:43 AM #
http://www.philhassey.com/blog/2007/10/17/great-commercial-libraries-for-web-development/
I also rec. princexml. The existing open html2ps projects are okay, but not nearly as nice.
— philhassey on Thursday, January 17, 2008 at 11:26 AM #
Why not just write to LaTeX and let that generate the report? LaTeX is pretty easy to write to, has the fluid box model, etc., and quickly generates PDF.
Instead of PDF, you could run Firefox 2 itself headless on the server, use drawWindow to create a PNG image of the HTML page, and send THAT to the client.
— duryodhan on Thursday, January 17, 2008 at 11:58 AM #
I suppose I should give Prince XML another look. A while back I watched the video philhassey linked to above and it looked pretty amazing. We don’t currently run any proprietary software so that’s a bit of a turn off but it does seem like it would be able to handle anything we threw at it.
— Ryan Tomayko on Thursday, January 17, 2008 at 12:08 PM #
I highly recommend PrinceXML, you can use it to produce high quality output from the same HTML you render in the browser, using HTML and CSS. And with print media CSS, you can handle headers/footers, page numbering, page breaks, etc.
Here’s the Web site for one project I'm working on, this is all HTML generated from Textile/HAML:
http://incubator.apache.org/buildr/
We also generate an everything-in-one-page HTML file, and run PrinceXML against it, to create the PDF:
http://incubator.apache.org/buildr/buildr.pdf
We didn’t have to do anything specific in the HTML (although site and one-page have slightly different templates). Title in the header, page numbers in the footer, non-breaking paragraphs, and page numbers in the ToC, are all specified in the CSS:
http://svn.apache.org/repos/asf/incubator/buildr/trunk/doc/css/print.css
— Assaf on Thursday, January 17, 2008 at 03:07 PM #
There are also CSSToXSLFO (http://www.re.be/css2xslfo/) and, python based, pisa (http://pisa.spirito.de/).
— michele on Friday, January 18, 2008 at 07:59 AM #
Although I haven’t tried very fancy CSS with it, OpenOffice has HTML/CSS2 support and PDF export.
It also has X-less installation packages so you don’t need to install X on the server.
— Ryan on Friday, January 18, 2008 at 10:16 AM #
I don’t have a complete answer, but I know there've been definite movements on the Gecko front to make things like this possible. On the one hand it should be possible on Gecko trunk to hook into Cairo and use its PDF abilities – and on the other hand it’s possible to render web pages to images.
I see wishes for functionality similar to this every other month or so, so one of these days someone with the technical know-how to pull it all together should be getting around to actually doing that. I hope…
— Sander on Monday, January 21, 2008 at 03:29 AM #
We have the same problem at our company. I ended up hacking together an, um, “interesting” but basically solid solution using a Virtual PC install, Firefox, and some C# WinForms. The website has a PDF generation queue into which it inserts the URL of the printable form of a document. The C# code monitors that queue and sends keystrokes to Firefox to open the URL and then print it to the default printer (which is one of the many free PDF printers). The C# code then copies the file to the server and hits a notification page to remove it from the queue. It is every bit as terrifying as it sounds, but it works fairly well (though I have the C# app restart every 15 minutes to make sure it doesn’t get hung up somewhere by an unexpected dialog). Currently, using this system (and without as much tweaking as you'd think), we've created over 16600 PDFs. And yes, I'm always looking for a better idea. The latest one I've heard involves using a headless OpenOffice install and some Python on Linux (might work under Windows though) to convert to a PDF from one of the many formats OpenOffice supports. It'd require a redesign though, and frankly, until my ugly hack starts failing I don’t have time.
— Aaron on Monday, January 21, 2008 at 04:26 PM #
This sort of scenario is entirely the reason for XSL-FO, and somebody above mentioned css2xslfo, which in theory should do just what you want and let you avoid learning XSL-FO and dealing with an XSLT template, but I have no idea how well it works. Also: all the free/open tools for XSL-FO related stuff seem to be Java-based.
— Wes on Saturday, February 02, 2008 at 11:23 PM #
Check out InfixServer at http://www.iceni.com/infixServer.htm. It takes PDF templates and can dynamically add/change/reflow text & graphics. It can also add/place/scale a PDF from another PDF document onto a PDF page. It is driven via an XML interface, so it is very easy to integrate.
— Richard Patterson on Monday, February 11, 2008 at 05:25 AM #
This might take you down the right road: the program Paparazzi! for Mac uses a modified webkit2png and generates PDFs, and it's also open source. It might not be too much of a leap to get khtml2png doing this.
— Steven on Wednesday, April 16, 2008 at 02:53 PM #
This is quite possible, I've got an extension (printpdf) that “prints” the current page to a PDF file using the Cairo PDF surface.
The functionality is in Gecko 1.9, just not exposed in Firefox 3 without an extension.
— Alex on Monday, May 26, 2008 at 01:49 AM #