The Craziness of Subset Selection in Pandas

pandas Nov 21, 2019

Selecting subsets of data in pandas is not a trivial task, as there are numerous ways to do the same thing. Different pandas users select data in different ways, so these options can be overwhelming. I wrote a long four-part series on it to clarify how it's done. For instance, take a look at the following options for selecting a single column of data (assuming it’s the first column):

  • df['colname']
  • df[['colname']]
  • df.colname
  • df.loc[:, 'colname']
  • df.iloc[:, 0]
  • df.get('colname')

Summary of this post

In this post, I want to cover a single edge case of subset selection whose behavior I believe most pandas users are unaware of. Let’s say we have a DataFrame df and issue the following subset selection.

df[1, 2]

Deceptively simple

This appears to be quite a simple subset selection. There are so few characters on the screen. How difficult can this get? If you are a casual user of pandas, you might assume that you can figure out its meaning.

Even if you don’t know pandas well, or at all, and had to guess what this selects, you might think something along the lines of ‘the value located at the first row and second column’.

Attempt to select on a ‘normal’ DataFrame

Let’s take a look at a ‘normal’ DataFrame with string names as columns and attempt to make the selection df[1, 2].
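The original example is not reproduced here, so the following is a minimal sketch. It assumes a hypothetical DataFrame with the string column names 'name', 'age', and 'height', and captures the key that pandas reports when the selection fails.

```python
import pandas as pd

# Hypothetical data: the column names are plain strings
df = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "age": [34, 29, 41],
        "height": [180, 165, 172],
    }
)

try:
    df[1, 2]
    error_key = None
except KeyError as err:
    # The exception argument is the key pandas could not find
    error_key = err.args[0]

print(repr(error_key))
```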

We are met with a KeyError which is what you get when you attempt to select a column not in the DataFrame. This is typically triggered when you misspell a column name like this:
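As a rough illustration of the misspelling case, the sketch below uses the same hypothetical columns ('name', 'age', 'height') and deliberately misspells 'height'.

```python
import pandas as pd

# Hypothetical data with string column names
df = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "age": [34, 29, 41],
        "height": [180, 165, 172],
    }
)

try:
    df["hieght"]  # deliberate typo for 'height'
    missing = None
except KeyError as err:
    # pandas reports the misspelled name as the missing key
    missing = err.args[0]

print(repr(missing))
```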

Tuples as column names

Oddly enough, tuples are allowable as valid column names in a pandas DataFrame. The KeyError informs us that the tuple (1, 2) is not a column in your DataFrame. Yes, that is correct, it’s looking for the tuple (1, 2) as a column for the DataFrame.

Let’s create a DataFrame with a tuple as a column name:
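The original construction is not shown, so here is one way it might look. The data values and the other column names ('b' and 'c') are assumptions chosen only for illustration.

```python
import pandas as pd

# The first key is the tuple (1, 2); the remaining names are ordinary strings.
# Mixing a tuple with strings keeps the columns as a regular (single-level) Index.
df2 = pd.DataFrame(
    {
        (1, 2): [10, 40],
        "b": [20, 50],
        "c": [30, 60],
    }
)

print(df2.columns.tolist())
```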

 

The first column name is the tuple (1, 2). Any hashable object is allowable as a column name.

Repeat selection

Let’s repeat our original selection, df2[1, 2].

This successfully selects the first column from our DataFrame as a Series.
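A small sketch of that selection, assuming a hypothetical df2 whose first column is named with the tuple (1, 2):

```python
import pandas as pd

# Hypothetical DataFrame; only the tuple column name matters here
df2 = pd.DataFrame(
    {
        (1, 2): [10, 40],
        "b": [20, 50],
        "c": [30, 60],
    }
)

result = df2[1, 2]

# The result is a Series named after the tuple column
print(type(result).__name__, result.name)
```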

The rules for just the brackets

I use the terminology just the brackets to describe subset selection when the brackets are appended directly to a DataFrame or Series variable name. This helps differentiate it from the loc and iloc indexers, which also use the brackets.

pandas has specific rules that you must know to use just the brackets correctly. The behavior of just the brackets changes based on what you place inside of it. Here are the rules for different objects:

Slice

Select rows based on integer location or label. df[2:5] selects the rows at integer locations 2 through 4; the stop value 5 is excluded.

df['Niko':'Penelope'] selects all rows beginning at label ‘Niko’ up to and including the row labeled ‘Penelope’.
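Both kinds of slice can be tried on a small example. The DataFrame below, including every name in its index, is hypothetical.

```python
import pandas as pd

# Hypothetical data with string row labels
people = pd.DataFrame(
    {
        "age": [34, 29, 41, 25, 38, 52],
        "height": [180, 165, 172, 158, 190, 169],
    },
    index=["Aria", "Niko", "Omar", "Penelope", "Quinn", "Rosa"],
)

# Integer slice: positions 2, 3, and 4 (the stop value is excluded)
by_position = people[2:5]

# Label slice: both endpoints are included
by_label = people["Niko":"Penelope"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```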

List

Select each column in the list and return a DataFrame. df[['age', 'height']] selects the columns ‘age’ and ‘height’ as a DataFrame.
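A short sketch of list selection on a hypothetical DataFrame. Note that a one-item list still returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "age": [34, 29, 41],
        "height": [180, 165, 172],
    }
)

# A list of column names returns a DataFrame with just those columns
subset = people[["age", "height"]]

# Even a single-item list returns a DataFrame
single = people[["age"]]

print(type(subset).__name__, subset.columns.tolist())
print(type(single).__name__, single.columns.tolist())
```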

Boolean Series or List

If you pass in a Series or list of all boolean values, pandas uses those booleans to select only the rows where True is located.
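The sketch below shows boolean selection with both a Series and a plain list. The data and the threshold of 30 are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "age": [34, 29, 41],
    }
)

# A boolean Series, one value per row
mask = people["age"] > 30
older = people[mask]

# A plain list of booleans with the same length as the DataFrame
picked = people[[True, False, True]]

print(older.index.tolist())
print(picked.index.tolist())
```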

Any other object

Supplying any other object will have pandas attempt to select that column as a Series. For instance, passing the string ‘height’ to the brackets selects the column ‘height’ as a Series.

Providing just the brackets with an object that is not a column name raises a KeyError. Trying to select the boolean value True (not to be confused with the string ‘True’) produces a KeyError.
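Here is a brief sketch of both outcomes, using a hypothetical DataFrame whose columns are all strings.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "height": [180, 165, 172],
    }
)

# A string that matches a column name returns that column as a Series
height = people["height"]

# The boolean True is not a column name here, so pandas raises a KeyError
try:
    people[True]
    raised_for_true = False
except KeyError:
    raised_for_true = True

print(type(height).__name__, raised_for_true)
```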

What does df[1, 2] do?

Attempting the selection df[1, 2] falls into the ‘any other object’ category from above. It is not a slice, and it is not a list. The 1, 2 with just the brackets is received by pandas as a tuple.

Why is it received as a tuple?

In order to understand why pandas receives this object as a tuple, you must understand how the __getitem__ special method works in Python. If you define this special method for your object, then the brackets work as if they were a method that accepts a single parameter. Whatever is inside the brackets is treated as a single parameter and is passed to the __getitem__ special method.

Let’s show how some of the subset examples using the brackets get translated into a call to the __getitem__ special method.

  • df[2:5] becomes df.__getitem__(slice(2, 5))
  • df['Niko':'Penelope'] becomes df.__getitem__(slice('Niko', 'Penelope'))
  • df['height'] becomes df.__getitem__('height')

You might be asking, “isn’t df[1, 2] passing two separate arguments to __getitem__?” The answer is “no”. It treats 1, 2 as a tuple and passes that tuple as a single argument to the __getitem__ special method. Therefore, df[1, 2] becomes df.__getitem__((1, 2)). pandas receives the tuple. It is not a slice, and not a list, therefore it looks to see if this object is a column name. It is not a column name in the df DataFrame and raises a KeyError.
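You can check this without pandas at all. The class below is a hypothetical helper, not part of any library, whose __getitem__ simply returns whatever the brackets hand it.

```python
class ShowKey:
    """Hypothetical helper that echoes the object passed to __getitem__."""

    def __getitem__(self, key):
        return key


show = ShowKey()

from_slice = show[2:5]                      # a slice object
from_label_slice = show["Niko":"Penelope"]  # also a slice object
from_string = show["height"]                # the string itself
from_pair = show[1, 2]                      # one tuple, not two arguments

print(from_slice, from_label_slice, from_string, from_pair)
```

Python builds the tuple before __getitem__ is ever called, so any class that implements this method sees the same single argument.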

Our other DataFrame, df2, does have a column name equal to the tuple (1, 2) so it gets selected as a Series.
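As a quick check, the three spellings below should be interchangeable. The DataFrame is a hypothetical stand-in for df2 with a tuple as its first column name.

```python
import pandas as pd

# Hypothetical stand-in for df2
df2 = pd.DataFrame(
    {
        (1, 2): [10, 40],
        "b": [20, 50],
        "c": [30, 60],
    }
)

with_commas = df2[1, 2]                  # the tuple is built implicitly
with_parens = df2[(1, 2)]                # the tuple is written out
explicit_call = df2.__getitem__((1, 2))  # what the brackets call underneath

print(with_commas.equals(with_parens), with_commas.equals(explicit_call))
```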

There’s more — MultiIndex Selection

The rules change again for just the brackets whenever you have a MultiIndex for the columns. If you pass in a tuple, it will use the first item in the tuple as the value for the columns in the top level. It takes the second item in the tuple as the value for the columns in the next level.

Take a look at the following DataFrame with a two-level MultiIndex. There are six total columns. The top level has two values — the integers 1 and 2.
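The original DataFrame is not reproduced here. One way to build something similar is sketched below; the second-level values 1, 2, and 3 and the data are assumptions.

```python
import pandas as pd

# Top level: 1 and 2. Second level (hypothetical): 1, 2, and 3.
columns = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]])

df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=columns,
)

print(df3.columns.tolist())
```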

We select all of the columns with top level value equal to 1 like this:
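A sketch of selecting by top-level value, again on a hypothetical two-level DataFrame whose second-level values 1, 2, and 3 are assumed:

```python
import pandas as pd

# Hypothetical two-level columns: top level 1 and 2, second level 1, 2, 3
columns = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]])
df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=columns,
)

# A single top-level value returns every column beneath it as a DataFrame
top_one = df3[1]

print(type(top_one).__name__, top_one.columns.tolist())
```

In this sketch the returned DataFrame keeps only the second-level labels as its column names.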

To select a single column in this MultiIndex DataFrame, use a tuple just like we did above.
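With the same hypothetical two-level DataFrame, a tuple picks one value from each level and returns a single column:

```python
import pandas as pd

# Hypothetical two-level columns: top level 1 and 2, second level 1, 2, 3
columns = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]])
df3 = pd.DataFrame(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    columns=columns,
)

# Top-level value 1, second-level value 2
one_column = df3[1, 2]

print(type(one_column).__name__, one_column.tolist())
```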

Summary

Just the brackets makes subset selections on a pandas DataFrame and changes its behavior based on the object passed to it. Below are the objects it accepts and what it returns.

  • slice — rows
  • list — columns
  • boolean Series or list — rows
  • anything else — a single column (or multiple columns if it’s a tuple on a MultiIndex DataFrame)
