Minimally Sufficient Pandas Cheat Sheet

pandas Jan 03, 2020

This article summarizes the very detailed guide presented in Minimally Sufficient Pandas.

What is Minimally Sufficient Pandas?

  • It is a small subset of the library that is sufficient to accomplish nearly everything that the library has to offer.
  • It allows you to focus on doing data analysis and not on the syntax.

How will Minimally Sufficient Pandas benefit you?

  • All common data analysis tasks will use the same syntax.
  • Fewer commands will be easier to commit to memory.
  • Your code will be easier for others, and for you, to understand.
  • It will be easier to put Pandas code in production.
  • It reduces the chance of landing on a Pandas bug.

Specific Guidance

Selecting a Single Column of Data

Use brackets and not dot notation to select a single column of data. Dot notation cannot select column names that contain spaces, names that collide with DataFrame methods, or columns whose name is stored in a variable.
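As a minimal sketch of the three cases where dot notation falls short, assuming a small made-up DataFrame (the column names "unit price" and "count" are hypothetical and chosen only to trigger each case):

```python
import pandas as pd

# Hypothetical data: one name contains a space, the other collides with a method.
df = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

# Brackets work for any column name.
prices = df["unit price"]   # df.unit price would be a syntax error
counts = df["count"]        # returns the column

# Dot notation finds the DataFrame.count method first, not the column.
dot_count = df.count

# Brackets also accept a variable that holds the column name.
col = "unit price"
from_variable = df[col]
```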

The deprecated ix indexer

The ix indexer is ambiguous and confusing (and now deprecated) as it allows selection by both label and integer location. Every trace of ix should be removed and replaced with the explicit loc or iloc indexers.
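A short sketch of the two explicit indexers, using a made-up DataFrame with string labels so the difference between label and integer location is visible:

```python
import pandas as pd

# Hypothetical data with a string index.
df = pd.DataFrame({"age": [30, 25, 41]}, index=["ann", "bob", "cy"])

# loc selects strictly by label.
by_label = df.loc["bob", "age"]

# iloc selects strictly by integer location (second row, first column).
by_position = df.iloc[1, 0]
```

Both selections return the same cell here, but each states its intent explicitly.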

Selection with at and iat

The at and iat indexers give a small increase in performance when selecting a single DataFrame cell. If your application relies on fast single-cell selection, use NumPy arrays rather than at or iat.
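One way to follow this advice, sketched on made-up data: pull the underlying NumPy array out once with the values attribute, then index that array directly wherever single-cell speed matters. No timings are claimed here; measure in your own application.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Ordinary single-cell selection with iloc.
via_iloc = df.iloc[2, 1]

# Extract the NumPy array once, then index it as often as needed.
arr = df.values
via_numpy = arr[2, 1]
```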

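read_csv vs read_table

The only difference between these two functions is the default delimiter. Use read_csv for all cases. read_table was deprecated in pandas 0.24; the deprecation was reversed in 0.25, but read_csv with an explicit sep argument still covers every case.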
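For example, tab-delimited text can be read with read_csv by passing the delimiter explicitly. This sketch reads from an in-memory string (made-up contents) so that it runs without a file on disk:

```python
import io

import pandas as pd

# Hypothetical tab-delimited content standing in for a file.
tab_text = "name\tscore\nann\t90\nbob\t85\n"

# sep="\t" makes read_csv handle tab-delimited data.
df = pd.read_csv(io.StringIO(tab_text), sep="\t")
```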

isna vs isnull and notna vs notnull

isnull is an alias of isna and notnull is an alias of notna. Use isna and notna as they end with ‘na’ like the other missing value methods dropna and fillna.
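A brief illustration on a made-up Series with one missing value, including a check that the alias returns the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

missing = s.isna()    # True where the value is missing
present = s.notna()   # True where the value is present

# isnull is an alias, so it gives an identical result.
same_as_alias = s.isna().equals(s.isnull())
```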

Arithmetic and Comparison Operators vs Methods

Use the operators (+, -, *, >, <=, etc.) and not their corresponding methods (add, sub, mul, gt, le, etc.) in all cases except when absolutely necessary, such as when you need to change the direction of the alignment.
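The sketch below shows the everyday case, where operators are enough, and one case where a method is needed to change the alignment direction. The data and the row_offsets Series are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two rows, two columns.
df = pd.DataFrame({"x": [10, 20], "y": [30, 40]}, index=["r1", "r2"])
row_offsets = pd.Series([1, 2], index=["r1", "r2"])

# Everyday arithmetic and comparison: use the operators.
doubled = df * 2
is_large = df > 15

# The minus operator would align the Series index with the column labels.
# The sub method with axis=0 aligns it with the row labels instead.
shifted = df.sub(row_offsets, axis=0)
```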

Builtin Python functions vs Pandas methods with the same name

Use the Pandas method over any built-in Python function with the same name.
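Missing values are one reason the two can differ. In this sketch on a made-up Series, the built-in sum propagates the missing value while the Pandas method skips it by default:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# The built-in iterates element by element, so NaN propagates.
builtin_total = sum(s)

# The Pandas method skips missing values by default.
method_total = s.sum()
```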

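Standardizing groupby Aggregation

There are a few different syntaxes available to do a groupby aggregation. Use df.groupby('grouping column').agg({'aggregating column': 'aggregating function'}) as it can handle more complex cases.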
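A small sketch of the dictionary form of agg, where each aggregating column is mapped to its own function. The team, points, and fouls columns are hypothetical.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "points": [10, 20, 5],
    "fouls": [1, 3, 2],
})

# Map each aggregating column to its aggregating function.
result = df.groupby("team").agg({"points": "sum", "fouls": "max"})
```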

Handling a MultiIndex

A DataFrame with a MultiIndex offers little benefit over one with a single-level index. I advise against using them. Instead, flatten them after a call to groupby by renaming the columns and resetting the index.
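One possible way to do the flattening, sketched on made-up data: applying several functions to one column yields MultiIndex columns, which are then replaced with plain names before the index is reset. The new names min_points and max_points are arbitrary choices.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"team": ["a", "a", "b"], "points": [10, 20, 5]})

# Several functions on one column produce MultiIndex columns.
result = df.groupby("team").agg({"points": ["min", "max"]})
had_multiindex = isinstance(result.columns, pd.MultiIndex)

# Flatten: assign single-level column names, then move the group
# labels out of the index and into an ordinary column.
result.columns = ["min_points", "max_points"]
result = result.reset_index()
```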

The equivalency of groupby aggregation and pivot_table

A groupby aggregation and a pivot_table produce the same exact data with a different shape. Use groupby when you want to continue an analysis and pivot_table when you want to compare groups.
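To make the difference in shape concrete, here is a sketch that computes the same means both ways on made-up data (the dept, sex, and salary columns are hypothetical):

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "dept": ["x", "x", "y", "y"],
    "sex": ["f", "m", "f", "m"],
    "salary": [10, 20, 30, 40],
})

# Long form: one row per group, convenient for further analysis.
long_form = df.groupby(["dept", "sex"]).agg({"salary": "mean"}).reset_index()

# Wide form: groups laid out side by side, convenient for comparison.
wide_form = df.pivot_table(index="dept", columns="sex",
                           values="salary", aggfunc="mean")
```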

The equivalency of pivot_table and pd.crosstab

The pivot_table method and the crosstab function are very similar. Only use crosstab when finding the relative frequency.
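Relative frequency is available through the normalize parameter of pd.crosstab. A sketch on made-up data, normalizing within each row:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "dept": ["x", "x", "y", "y"],
    "sex": ["f", "m", "m", "m"],
})

# normalize="index" divides each row by its row total.
rel_freq = pd.crosstab(df["dept"], df["sex"], normalize="index")
```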

pivot vs pivot_table

The pivot method pivots data without aggregating. It is possible to duplicate its functionality with pivot_table by selecting an aggregation function. Consider using only pivot_table and not pivot.
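A sketch of the duplication on made-up data. Each day and city pair appears only once, so any aggregation function that returns the single value in its group reproduces the result of pivot; max is used here as one arbitrary choice.

```python
import pandas as pd

# Hypothetical data with one row per (day, city) pair.
df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["nyc", "sf", "nyc", "sf"],
    "temp": [50, 60, 52, 61],
})

# pivot reshapes without aggregating.
via_pivot = df.pivot(index="day", columns="city", values="temp")

# pivot_table gives the same table when each group holds one value.
via_pivot_table = df.pivot_table(index="day", columns="city",
                                 values="temp", aggfunc="max")
```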

The similarity between melt and stack

Both the melt and stack methods reshape the data in a very similar manner. Use melt over stack because it allows you to rename columns and it avoids a MultiIndex.
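The two approaches side by side, sketched on made-up data. The names subject and score passed to melt are arbitrary labels chosen for this example.

```python
import pandas as pd

# Hypothetical wide data: one column per subject.
df = pd.DataFrame({"name": ["ann", "bob"], "math": [90, 80], "art": [70, 60]})

# melt names the new columns directly and keeps a simple index.
tidy = df.melt(id_vars="name", var_name="subject", value_name="score")

# stack holds the same values but returns a Series with a MultiIndex.
stacked = df.set_index("name").stack()
```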

The similarity between pivot and unstack

Both pivot and unstack reshape data similarly but, as noted above, pivot_table can handle all cases that pivot can, so I suggest using it over both of the others.
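A sketch comparing unstack with pivot_table on made-up long-form data. Each name and subject pair appears once, so the aggregation function (max, an arbitrary choice) simply returns the single value in each group.

```python
import pandas as pd

# Hypothetical long data: one row per (name, subject) pair.
tidy = pd.DataFrame({
    "name": ["ann", "ann", "bob", "bob"],
    "subject": ["math", "art", "math", "art"],
    "score": [90, 70, 80, 60],
})

# unstack needs the keys moved into a MultiIndex first.
via_unstack = tidy.set_index(["name", "subject"])["score"].unstack()

# pivot_table works directly on the columns.
via_pivot_table = tidy.pivot_table(index="name", columns="subject",
                                   values="score", aggfunc="max")
```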

Best of the DataFrame API

The above examples are the most common areas of Pandas where multiple options are available to its users. There are many other attributes and methods that are not discussed. Below, I provide a categorized list of the minimum number of DataFrame attributes and methods that can accomplish nearly all of your data analysis tasks. It reduces the number from over 240 to fewer than 80.

Attributes

  • columns
  • dtypes
  • index
  • shape
  • T
  • values

Aggregation Methods

These result in a single value for each column

  • all
  • any
  • count
  • describe
  • idxmax
  • idxmin
  • max
  • mean
  • median
  • min
  • mode
  • nunique
  • sum
  • std
  • var

Non-Aggregation Statistical Methods

  • abs
  • clip
  • corr
  • cov
  • cummax
  • cummin
  • cumprod
  • cumsum
  • diff
  • nlargest
  • nsmallest
  • pct_change
  • prod
  • quantile
  • rank
  • round

Subset Selection

  • head
  • iloc
  • loc
  • tail

Missing Value Handling

  • dropna
  • fillna
  • interpolate
  • isna
  • notna

Grouping

  • expanding
  • groupby
  • pivot_table
  • resample
  • rolling

Joining Data

  • append
  • merge

Other

  • asfreq
  • astype
  • copy
  • drop
  • drop_duplicates
  • equals
  • isin
  • melt
  • plot
  • rename
  • replace
  • reset_index
  • sample
  • select_dtypes
  • shift
  • sort_index
  • sort_values
  • to_csv
  • to_json
  • to_sql

Functions

  • pd.concat
  • pd.crosstab
  • pd.cut
  • pd.qcut
  • pd.read_csv
  • pd.read_json
  • pd.read_sql
  • pd.to_datetime
  • pd.to_timedelta