
# Libraries for Extracting Data and Text from PDFs: A Review

<meta>
* Authors: Rufus Pollock
    [add your name here if you contribute and want to be credited]
* Who is this for? Data wranglers looking to extract information from PDFs
* We should try and offer opinions on tools where possible
[should actually review the tools, i.e. which is best? pros/cons? level of capability required? reliability]
[paragraph on crowd-scraping as well? "alternative approaches when your geeks can't do it"]
[perhaps find a sample PDF and see how each tool does? show the differences in output?] - This is a GREAT idea.
</meta>

Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.


There are three broad categories of task:

* Extracting text from PDF
* Extracting tables from PDF
* Extracting data (text or otherwise) from scanned PDFs

The last case is really a job for OCR (optical character recognition), so we're largely going to ignore it here.
[should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]
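
For scanned PDFs the usual route is to render each page to an image and run it through an OCR engine; results depend heavily on scan quality and usually need pre-processing (deskewing, cleanup) to be usable. As a rough illustration of that route, here is a minimal sketch assuming the pdf2image and pytesseract Python packages (neither is reviewed in this post), with the Poppler and Tesseract system tools installed.

```python
# Minimal OCR sketch for scanned PDFs. Assumptions: the pdf2image and
# pytesseract packages are installed, along with the Poppler and Tesseract
# system tools. Neither package is reviewed in this post.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned.pdf", dpi=300)  # "scanned.pdf" is a placeholder
for number, image in enumerate(pages, start=1):
    print("--- page %d ---" % number)
    print(pytesseract.image_to_string(image))
```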

[[TODO: some nice PDF screenshots - perhaps we can reference]]

## Generic (PDF -> text)

* [PDFMiner][pdfminer] - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
  * Pure Python (a minimal usage sketch appears at the end of this section)
* [pdftohtml][] - pdftohtml is a utility which converts PDF files into HTML and XML formats. It is based on xpdf.
  * Command-line, Linux
* [pdftoxml][] - command-line utility to convert PDF to XML, built on Poppler.
* [docsplit][] - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
* [pypdf2xml][] - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
* [pdf2htmlEX][] - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction.

[pdf2htmlEX]: http://coolwanglu.github.io/pdf2htmlEX/
[pypdf2xml]: https://github.com/zejn/pypdf2xml
[docsplit]: http://documentcloud.github.io/docsplit/
[pdfminer]: http://www.unixuser.org/~euske/python/pdfminer/
[pdftohtml]: http://pdftohtml.sourceforge.net/
[pdftoxml]: http://pdftoxml.sourceforge.net/
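
To give a feel for the simplest case, here is a minimal text-extraction sketch, assuming the pdfminer.six fork of PDFMiner and its `extract_text` helper (the original pdfminer exposes a lower-level API); the filename is a placeholder.

```python
# Minimal PDF -> text sketch, assuming the pdfminer.six fork
# (pip install pdfminer.six). The original pdfminer has a lower-level API
# (PDFResourceManager / TextConverter), so treat this as illustrative only.
from pdfminer.high_level import extract_text

text = extract_text("report.pdf")  # "report.pdf" is a placeholder
print(text[:500])                  # first 500 characters of extracted text
```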

### Tables from PDF

* <http://tabula.nerdpower.org/> - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.
* <https://github.com/okfn/pdftables> - open-source. Created by ScraperWiki; the original no longer seems to be available, so this is a fork.
* <http://pdftoxml.sourceforge.net/> - one of the better options for tables, but I have not used it for a while.
* <http://pdftohtml.sourceforge.net/> - Linux only, as far as I can tell.
* <https://github.com/liberit/scraptils/blob/master/scraptils/tools/pdf2csv.py> - AGPLv3+, Python. Scraptils has other useful tools as well; pdf2csv needs pdfminer==20110515.
* [pdf.js](http://mozilla.github.io/pdf.js/) - you probably want a fork like [pdf2json](https://github.com/modesty/pdf2json) or [node-pdfreader](https://github.com/jviereck/node-pdfreader) that integrates this better with node. I have not tried this on tables though ...
* Using ScraperWiki + pdftoxml - see this tutorial: [Get Started With Scraping – Extracting Simple Tables from PDF Documents][scoda-simple-tables]. A rough sketch of the coordinate-grouping approach these tools share follows this list.
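
Most of the table extractors above work in the same basic way: get every text fragment together with its position on the page, group fragments that share a vertical position into rows, and order each row left to right. Here is a rough sketch of that idea, assuming the pdfminer.six fork of PDFMiner (the filename is a placeholder); tools like Tabula or pdftables handle column detection and merged cells far more robustly.

```python
# Rough sketch of coordinate-based row grouping, assuming pdfminer.six.
# Real table extractors handle merged cells, ragged baselines and column
# detection; this only buckets text fragments by their rounded y position.
from collections import defaultdict
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def rough_rows(path):
    rows = defaultdict(list)
    for page in extract_pages(path):
        for element in page:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                # one bucket per (page, rounded y) ~ one table row
                rows[(page.pageid, round(y0))].append((x0, element.get_text().strip()))
    # PDF y coordinates run bottom-up, so sort rows top-to-bottom
    for key in sorted(rows, key=lambda k: (k[0], -k[1])):
        yield [text for _, text in sorted(rows[key])]

for row in rough_rows("table.pdf"):  # "table.pdf" is a placeholder
    print(row)
```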

### Existing open services

* http://pdfx.cs.man.ac.uk/ - has a nice command line interface
  * Is this open? Says at [bottom of usage](http://pdfx.cs.man.ac.uk/usage) that it is powered by http://www.utopiadocs.com/
* Scraperwiki - https://views.scraperwiki.com/run/pdf-to-html-preview-1/ and [this tutorial](http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/)

### Existing proprietary services (free or paid-for)

* http://www.newocr.com/ - free, no API
* http://www.free-ocr.com/ - free, no API, captcha
* http://www.onlineocr.net/ - free
* http://captricity.com/
* https://pdftables.com/ - pay-per-page service

Google App Engine used to offer a Conversion API that could do this, but it has since been shut down: http://developers.google.com/appengine/docs/python/conversion/overview

### By Language

@maxogden has this list of Node libraries and tools:

https://gist.github.com/maxogden/5842859

Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247

## Other good intros

* <http://thomaslevine.com/!/parsing-pdfs/>
* [Extracting Data from PDFs - School of Data][scoda-1]

[scoda-1]: http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/
[scoda-simple-tables]: http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/