Pandas is the most popular Python data analysis library available today and can read in data directly from a wide variety of sources, including CSVs, Excel Workbooks, JSON files, SQL databases, parquet files, and even from your clipboard. Currently, there is no direct method using pandas to read in data trapped within a PDF file. Thankfully, the tabula-py library (credit to Aki Ariga for developing it) is available to read in these tables within a PDF as pandas DataFrames.
For this tutorial, we will extract a single table within the royalty statement from Pandas Cookbook, a book I published in 2017. An image of the PDF is displayed below.
Tabula-py automatically detects all tables on the pages provided with its
read_pdf function. All we have to do is provide it the file path of the PDF document and the pages to extract. The returned result is a list of DataFrames, stored here as the variable
import pandas as pd import tabula dfs = tabula.read_pdf('2018Q1.pdf', pages=1) len(dfs)
Tabula has detected a single table. Let's output the DataFrame to see how well the extraction worked.
df = dfs df
|Unnamed: 0||AUTHOR ROYALTY STATEMENT||Unnamed: 1|
|0||NaN||ROYALTY STATEMENT DATE: 01/01/18 - 31/03/18||NaN|
|3||Subscription 3rd Party||0 $ 0.00||$ 0.00|
|4||Video||0 $ 0.00||$ 0.00|
|5||Translation||0 $ 0.00||$ 0.00|
|6||Course||0 $ 0.00||$ 0.00|
|7||Subscription Mapt||203 $ 626.05||$ 100.16|
|8||Print Book||366 $ 12,927.57||$ 2,068.41|
|9||Ebook||643 $ 9,131.04||$ 1,460.97|
For this PDF, I wanted to extract the table beginning in the middle box with the four columns Format, Copies, Revenue, Royalties. Instead, Tabula extracted the two boxes above this and returned a three-column table combining Copies and Revenue. While the result isn't bad, it's not usable in the current form. We could use pandas at this point to clean up the data, but tabula-py provides us parameters to define the boundary of the entire table itself and the positioning of the columns.
relative_columns parameters are set to
True which allows us to provide boundaries to both the
columns parameters as a percentage of the page, as opposed to point coordinates. In my opinion, it is easier to define a boundary using relative page percentage.
area column is passed a four-item list of top, left, bottom, and right. These represent the boundaries of the table relative to the page. The three-item list for
columns represent the vertical lines separating the four columns. Let's re-read in the data and output the resulting DataFrame.
dfs = tabula.read_pdf('2018Q1.pdf', pages=1, relative_area=True, relative_columns=True, area=[42, 0, 60, 100], columns= [45, 65, 82]) df = dfs df
|0||Subscription 3rd Party||0||$ 0.00||$ 0.00|
|1||Video||0||$ 0.00||$ 0.00|
|2||Translation||0||$ 0.00||$ 0.00|
|3||Course||0||$ 0.00||$ 0.00|
|4||Subscription Mapt||203||$ 626.05||$ 100.16|
|5||Print Book||366||$ 12,927.57||$ 2,068.41|
|6||Ebook||643||$ 9,131.04||$ 1,460.97|
Format object Copies int64 Revenue object Royalties object dtype: object
Conveniently, the Copies column was successfully converted to an integer. The Revenue and Royalties columns were read in as strings. We'll now have to turn to pandas to convert these columns to floats. We do so by replacing any dollar signs, commas, and spaces with empty spaces using a regular expression.
df[['Revenue', 'Royalties']] = (df[['Revenue', 'Royalties']] .replace('[$, ]', '', regex=True) .astype('float64')) df
|0||Subscription 3rd Party||0||0.00||0.00|
In this tutorial, we passed custom inputs into tabula-py's
read_pdf function to extract a table trapped within a PDF as a pandas DataFrame. We then cleaned up the result with pandas ability to replace characters using a regular expression. Several more options are available to tabula-py to customize the extraction process.
If you enjoyed this lesson and are looking to become an expert with Pandas, then check out my book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, and comes with video lessons, 500+ exercises with solutions, and certification exams.