
# Libraries for Extracting Data and Text from PDFs: A Review

* Authors: Rufus Pollock
    [add your name here if you contribute and want to be credited]
* Who is this for? Data wranglers looking to extract information from PDFs
* We should try and offer opinions on tools where possible
[we should actually review the tools, i.e. which is best? pros/cons? level of capability required? reliability?]
[paragraph on crowd-scraping as well? "alternative approaches when your geeks can't do it"]
[perhaps find a sample PDF and see how each tool does? show the differences in output?] - This is a GREAT idea.

Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.

The task falls into 3 categories:

* Extracting text from PDF
* Extracting tables from PDF
* Extracting data (text or otherwise) from PDFs with scans

The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here.
[should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]

[[TODO: some nice PDF screenshots - perhaps we can reference]]

## Generic (PDF -> text)

* [PDFMiner][pdfminer] - PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
  * Pure Python
* [pdftohtml][] - pdftohtml is a utility which converts PDF files into HTML and XML formats. Based on xpdf.
  * Command-line; Linux
* [pdftoxml][] - command line utility to convert PDF to XML built on poppler.
* [docsplit][] - part of DocumentCloud. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
* [pypdf2xml][] - convert PDF to XML. Built on pdfminer. Started as an alternative to poppler's pdftoxml, which didn't properly decode CID Type2 fonts in PDFs.
* [pdf2htmlEX][] - Convert PDF to HTML without losing text or format. C++. Fast. Primarily focused on producing HTML that exactly resembles the original PDF. Limited use for straightforward text extraction.
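To make the "PDF -> text" step concrete, here is a toy sketch of what these tools do at the lowest level: pull literal strings out of a page's text-showing operators. This is stdlib-only and assumes an *uncompressed* content stream; the stream bytes below are hand-written for illustration, not read from a real file. Real tools like PDFMiner additionally handle compressed streams, font encodings and glyph positioning.

```python
import re

# A hand-written fragment of a PDF page content stream (illustrative only).
# BT/ET bracket a text object; (...) Tj shows a literal string.
stream = b"""
BT
/F1 12 Tf
72 712 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Collect the string operand of every Tj operator.
texts = re.findall(rb"\((.*?)\)\s*Tj", stream)
print(b" ".join(texts).decode("latin-1"))  # -> Hello World
```

This breaks down immediately on compressed streams, hex strings, or non-trivial encodings, which is exactly why the libraries above exist.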


### Tables from PDF

* <> - open-source, designed specifically for tabular data. Now easy to install. Ruby-based.
* - open-source. Created by Scraperwiki but no longer seems to be available, so here is a fork.
* <> - one of the better options for tables, but I have not used it for a while
* <> - Linux only, as far as I can tell
* <> - AGPLv3+, Python. scraptils has other useful tools as well; pdf2csv needs pdfminer==20110515
* [pdf.js]() - you probably want a fork like [pdf2json]() or [node-pdfreader]() that integrates it better with Node. I have not tried this on tables though ...
* Using scraperwiki + pdftoxml - see this recent tutorial [Get Started With Scraping – Extracting Simple Tables from PDF Documents][scoda-simple-tables]

### Existing open services

* - has a nice command line interface
  * Is this open? Says at [bottom of usage]() that it is powered by
* Scraperwiki - and [this tutorial]()

### Existing proprietary free or paid-for services

* - free, no API
* - free, no API, captcha
* - free
* - pay-per-page service

Google App Engine used to do this.

### By Language

@maxogden has this list of Node libraries and tools:

Here's a gist showing how to use pdf2json:

## Other good intros

* <!/parsing-pdfs/>
* [Extracting Data from PDFs - School of Data][scoda-1]