Pandas is the most popular Python data analysis library available today and can read in data directly from a wide variety of sources, including CSVs, Excel Workbooks, JSON files, SQL databases, parquet files, and even from your clipboard. Currently, there is no direct method using pandas to read in data trapped within a PDF file. Thankfully, the tabula-py library (credit to Aki Ariga for developing it) is available to read in these tables within a PDF as pandas DataFrames.
The tabula-py library itself is a wrapper around tabula-java, a command line tool for extracting trapped data within a PDF. Get started by installing it with pip install tabula-py
.
For this tutorial, we will extract a single table within the royalty statement from Pandas Cookbook, a book I published in 2017. An image of the PDF is displayed below.
Tabula-py automatically detects all tables on the pages provided with its read_pdf
function. All we have to do is provide it the file path of the PDF document and the pages to extract. The returned result is a list of DataFrames, stored here as the variable dfs
.
import pandas as pd
import tabula
dfs = tabula.read_pdf('2018Q1.pdf', pages=1)
len(dfs)
1
Tabula has detected a single table. Let's output the DataFrame to see how well the extraction worked.
df = dfs[0]
df
Unnamed: 0 | AUTHOR ROYALTY STATEMENT | Unnamed: 1 | |
---|---|---|---|
0 | NaN | ROYALTY STATEMENT DATE: 01/01/18 - 31/03/18 | NaN |
1 | Pandas Cookbook | NaN | NaN |
2 | Format | Copies Revenue | Royalties |
3 | Subscription 3rd Party | 0 $ 0.00 | $ 0.00 |
4 | Video | 0 $ 0.00 | $ 0.00 |
5 | Translation | 0 $ 0.00 | $ 0.00 |
6 | Course | 0 $ 0.00 | $ 0.00 |
7 | Subscription Mapt | 203 $ 626.05 | $ 100.16 |
8 | Print Book | 366 $ 12,927.57 | $ 2,068.41 |
9 | Ebook | 643 $ 9,131.04 | $ 1,460.97 |
For this PDF, I wanted to extract the table beginning in the middle box with the four columns Format, Copies, Revenue, Royalties. Instead, Tabula extracted the two boxes above this and returned a three-column table combining Copies and Revenue. While the result isn't bad, it's not usable in the current form. We could use pandas at this point to clean up the data, but tabula-py provides us parameters to define the boundary of the entire table itself and the positioning of the columns.
The relative_area
and relative_columns
parameters are set to True
which allows us to provide boundaries to both the area
and columns
parameters as a percentage of the page, as opposed to point coordinates. In my opinion, it is easier to define a boundary using relative page percentage.
The area
column is passed a four-item list of top, left, bottom, and right. These represent the boundaries of the table relative to the page. The three-item list for columns
represent the vertical lines separating the four columns. Let's re-read in the data and output the resulting DataFrame.
dfs = tabula.read_pdf('2018Q1.pdf', pages=1,
relative_area=True,
relative_columns=True,
area=[42, 0, 60, 100],
columns= [45, 65, 82])
df = dfs[0]
df
Format | Copies | Revenue | Royalties | |
---|---|---|---|---|
0 | Subscription 3rd Party | 0 | $ 0.00 | $ 0.00 |
1 | Video | 0 | $ 0.00 | $ 0.00 |
2 | Translation | 0 | $ 0.00 | $ 0.00 |
3 | Course | 0 | $ 0.00 | $ 0.00 |
4 | Subscription Mapt | 203 | $ 626.05 | $ 100.16 |
5 | Print Book | 366 | $ 12,927.57 | $ 2,068.41 |
6 | Ebook | 643 | $ 9,131.04 | $ 1,460.97 |
This extraction looks much better and almost exactly what we want. Let's check the data types of each column to understand which ones need more processing.
df.dtypes
Format object Copies int64 Revenue object Royalties object dtype: object
Conveniently, the Copies column was successfully converted to an integer. The Revenue and Royalties columns were read in as strings. We'll now have to turn to pandas to convert these columns to floats. We do so by replacing any dollar signs, commas, and spaces with empty spaces using a regular expression.
df[['Revenue', 'Royalties']] = (df[['Revenue', 'Royalties']]
.replace('[$, ]', '', regex=True)
.astype('float64'))
df
Format | Copies | Revenue | Royalties | |
---|---|---|---|---|
0 | Subscription 3rd Party | 0 | 0.00 | 0.00 |
1 | Video | 0 | 0.00 | 0.00 |
2 | Translation | 0 | 0.00 | 0.00 |
3 | Course | 0 | 0.00 | 0.00 |
4 | Subscription Mapt | 203 | 626.05 | 100.16 |
5 | Print Book | 366 | 12927.57 | 2068.41 |
6 | Ebook | 643 | 9131.04 | 1460.97 |
In this tutorial, we passed custom inputs into tabula-py's read_pdf
function to extract a table trapped within a PDF as a pandas DataFrame. We then cleaned up the result with pandas ability to replace characters using a regular expression. Several more options are available to tabula-py to customize the extraction process.
If you enjoyed this lesson and are looking to become an expert with Pandas, then check out my book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, and comes with video lessons, 500+ exercises with solutions, and certification exams.
Upon registration, you'll get access to the following free courses: