Convert Trapped Tables within PDFs as Pandas DataFrames

pandas Nov 30, 2022

Pandas is the most popular Python data analysis library available today and can read in data directly from a wide variety of sources, including CSVs, Excel Workbooks, JSON files, SQL databases, parquet files, and even from your clipboard. Currently, there is no direct method using pandas to read in data trapped within a PDF file. Thankfully, the tabula-py library (credit to Aki Ariga for developing it) is available to read in these tables within a PDF as pandas DataFrames.

The tabula-py library itself is a wrapper around tabula-java, a command line tool for extracting trapped data within a PDF. Get started by installing it with pip install tabula-py.

Sample PDF - Book Sales¶

For this tutorial, we will extract a single table within the royalty statement from Pandas Cookbook, a book I published in 2017. An image of the PDF is displayed below.

Automatic table detection¶

Tabula-py automatically detects all tables on the pages provided with its read_pdf function. All we have to do is provide it the file path of the PDF document and the pages to extract. The returned result is a list of DataFrames, stored here as the variable dfs.

In [1]:

import pandas as pd
import tabula
dfs = tabula.read_pdf('2018Q1.pdf', pages=1)
len(dfs)

Out[1]:

Tabula has detected a single table. Let's output the DataFrame to see how well the extraction worked.

In [2]:

df = dfs[0]
df

Out[2]:

	Unnamed: 0	AUTHOR ROYALTY STATEMENT	Unnamed: 1
0	NaN	ROYALTY STATEMENT DATE: 01/01/18 - 31/03/18	NaN
1	Pandas Cookbook	NaN	NaN
2	Format	Copies Revenue	Royalties
3	Subscription 3rd Party	0 $ 0.00	$ 0.00
4	Video	0 $ 0.00	$ 0.00
5	Translation	0 $ 0.00	$ 0.00
6	Course	0 $ 0.00	$ 0.00
7	Subscription Mapt	203 $ 626.05	$ 100.16
8	Print Book	366 $ 12,927.57	$ 2,068.41
9	Ebook	643 $ 9,131.04	$ 1,460.97

For this PDF, I wanted to extract the table beginning in the middle box with the four columns Format, Copies, Revenue, Royalties. Instead, Tabula extracted the two boxes above this and returned a three-column table combining Copies and Revenue. While the result isn't bad, it's not usable in the current form. We could use pandas at this point to clean up the data, but tabula-py provides us parameters to define the boundary of the entire table itself and the positioning of the columns.

The relative_area and relative_columns parameters are set to True which allows us to provide boundaries to both the area and columns parameters as a percentage of the page, as opposed to point coordinates. In my opinion, it is easier to define a boundary using relative page percentage.

The area column is passed a four-item list of top, left, bottom, and right. These represent the boundaries of the table relative to the page. The three-item list for columns represent the vertical lines separating the four columns. Let's re-read in the data and output the resulting DataFrame.

In [3]:

dfs = tabula.read_pdf('2018Q1.pdf', pages=1, 
                      relative_area=True,
                      relative_columns=True,
                      area=[42, 0, 60, 100], 
                      columns= [45, 65, 82])
df = dfs[0]
df

Out[3]:

	Format	Copies	Revenue	Royalties
0	Subscription 3rd Party	0	$ 0.00	$ 0.00
1	Video	0	$ 0.00	$ 0.00
2	Translation	0	$ 0.00	$ 0.00
3	Course	0	$ 0.00	$ 0.00
4	Subscription Mapt	203	$ 626.05	$ 100.16
5	Print Book	366	$ 12,927.57	$ 2,068.41
6	Ebook	643	$ 9,131.04	$ 1,460.97

Convert to numeric¶

This extraction looks much better and almost exactly what we want. Let's check the data types of each column to understand which ones need more processing.

In [4]:

df.dtypes

Out[4]:

Format       object
Copies        int64
Revenue      object
Royalties    object
dtype: object

Conveniently, the Copies column was successfully converted to an integer. The Revenue and Royalties columns were read in as strings. We'll now have to turn to pandas to convert these columns to floats. We do so by replacing any dollar signs, commas, and spaces with empty spaces using a regular expression.

In [5]:

df[['Revenue', 'Royalties']] = (df[['Revenue', 'Royalties']]
                                  .replace('[$, ]', '', regex=True)
                                  .astype('float64'))
df

Out[5]:

	Format	Copies	Revenue	Royalties
0	Subscription 3rd Party	0	0.00	0.00
1	Video	0	0.00	0.00
2	Translation	0	0.00	0.00
3	Course	0	0.00	0.00
4	Subscription Mapt	203	626.05	100.16
5	Print Book	366	12927.57	2068.41
6	Ebook	643	9131.04	1460.97

Summary¶

In this tutorial, we passed custom inputs into tabula-py's read_pdf function to extract a table trapped within a PDF as a pandas DataFrame. We then cleaned up the result with pandas ability to replace characters using a regular expression. Several more options are available to tabula-py to customize the extraction process.

Master Data Analysis with Python¶

If you enjoyed this lesson and are looking to become an expert with Pandas, then check out my book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, and comes with video lessons, 500+ exercises with solutions, and certification exams.

Convert Trapped Tables within PDFs as Pandas DataFrames

Sample PDF - Book Sales¶

Automatic table detection¶

Convert to numeric¶

Summary¶

Master Data Analysis with Python¶

Master Data Analysis with Python

Start learning Data Science using Python with our free courses!

Register for a free account