Selecting subsets of data in pandas is not a trivial task, as there are numerous ways to do the same thing. Different pandas users select data in different ways, so these options can be overwhelming. I wrote a long four-part series on it to clarify how it’s done. For instance, even a single column of data can be selected in several different ways, either by name or by position (assuming it’s the first column).
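A few of those options, sketched on a small hypothetical DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data; "fruit" is the first column.
df = pd.DataFrame({"fruit": ["apple", "pear", "plum"], "count": [3, 5, 2]})

by_brackets = df["fruit"]        # just the brackets with the column name
by_attribute = df.fruit          # dot notation (only for names that are valid identifiers)
by_loc = df.loc[:, "fruit"]      # label-based indexer
by_iloc = df.iloc[:, 0]          # position-based indexer (first column)
```

All four return the same Series, which is part of why the options can feel overwhelming.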
In this post, I want to cover a single edge case of subset selection that I believe most pandas users are unaware of, both in what it does and in how it works. Let’s say we have a DataFrame df and issue the subset selection df[1, 2].
This appears to be quite a simple subset selection. There are so few characters on the screen. How difficult can this get? If you are a casual user of pandas, you might think that you should be able to figure out its meaning.
Even if you don’t know pandas well or at all, but had to guess what this selected, you might think something along the lines of ‘the value located at the first row and second column’.
Let’s take a look at a ‘normal’ DataFrame with string names as columns and attempt to make the selection
We are met with a KeyError, which is what you get when you attempt to select a column that is not in the DataFrame. This is typically triggered when you misspell a column name.
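Both situations can be reproduced with a minimal sketch. The DataFrame below is hypothetical (its names and values are not from the original post), and the misspelling "heigth" is deliberate:

```python
import pandas as pd

# A "normal" DataFrame with string column names (made-up data).
df = pd.DataFrame(
    {
        "name": ["Niko", "Penelope", "Aria"],
        "age": [34, 29, 41],
        "height": [180, 165, 172],
    }
)

# The selection from the post: pandas looks for a column named (1, 2).
try:
    df[1, 2]
    tuple_error = None
except KeyError as err:
    tuple_error = err

# The more familiar way to hit the same error: a misspelled column name.
try:
    df["heigth"]
    typo_error = None
except KeyError as err:
    typo_error = err
```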
Oddly enough, tuples are allowable as valid column names in a pandas DataFrame. The KeyError informs us that the tuple (1, 2) is not a column in your DataFrame. Yes, that is correct: it’s looking for the tuple (1, 2) as a column of the DataFrame.
Let’s create a DataFrame with a tuple as a column name:
The first column name is the tuple (1, 2). Any hashable object is allowable as a column name.
Let’s repeat our original selection on this DataFrame, df2[1, 2]. This successfully selects the first column from our DataFrame as a Series.
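Here is a minimal sketch of such a DataFrame. The values and the second column name are assumptions, and the columns are built with an explicit pd.Index and tupleize_cols=False so that the tuple stays a single label rather than being turned into a MultiIndex:

```python
import pandas as pd

# Build the column labels explicitly; the first label is the tuple (1, 2).
columns = pd.Index([(1, 2), "other"], tupleize_cols=False)
df2 = pd.DataFrame([[10, 1.5], [20, 2.5], [30, 3.5]], columns=columns)

# Just the brackets receive the tuple (1, 2), which matches a column name.
selected = df2[1, 2]
```

The result is a Series whose name is the tuple (1, 2).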
I use the term just the brackets to describe subset selection when the brackets are appended directly to a DataFrame or Series variable name. This helps differentiate it from the loc and iloc indexers, which also use the brackets.
pandas has specific rules that you must know to use just the brackets correctly. The behavior of just the brackets changes based on what you place inside of them. Here are the rules for different objects:
A slice: select rows based on integer location or label.
df[2:5] selects rows with integer location 2 to 4.
df['Niko':'Penelope'] selects all rows beginning at label ‘Niko’ up to and including the row labeled ‘Penelope’.
A list: select each column in the list and return a DataFrame.
df[['age', 'height']] selects the columns ‘age’ and ‘height’ as a DataFrame.
A boolean Series or list: if you pass in a Series or list of all boolean values, pandas uses those booleans to select only the rows where True is located.
Any other object: pandas attempts to select that column as a Series. For instance, passing the string ‘height’ to the brackets selects the column ‘height’ as a Series.
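The rules above can be sketched on a small hypothetical DataFrame. The row labels and values are assumptions chosen so that the examples from the list apply directly:

```python
import pandas as pd

# Made-up data with string row labels in alphabetical order.
df = pd.DataFrame(
    {
        "age": [34, 29, 41, 52, 23, 38],
        "height": [180, 165, 172, 190, 158, 175],
    },
    index=["Aria", "Ben", "Niko", "Omar", "Penelope", "Quinn"],
)

rows_by_position = df[2:5]             # slice of integers: rows at locations 2, 3, 4
rows_by_label = df["Niko":"Penelope"]  # slice of labels: both endpoints included
two_columns = df[["age", "height"]]    # list: those columns, as a DataFrame
tall_rows = df[df["height"] > 170]     # boolean Series: rows where it is True
height = df["height"]                  # any other object: one column, as a Series
```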
Providing just the brackets with an object that is not a column name raises a KeyError. Trying to select the boolean value True (not to be confused with the string ‘True’) produces a KeyError.
Attempting the selection df[1, 2] falls into the ‘any other object’ category from above. It is not a slice, and it is not a list. The 1, 2 with just the brackets is received by pandas as a tuple.
In order to understand why pandas receives this object as a tuple, you must understand how the __getitem__ special method works in Python. If you define this special method for your object, then the brackets work as if they were a method that accepts a single parameter. Whatever is inside the brackets is treated as a single parameter and is passed to the __getitem__ special method.
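Let’s show how some of the subset examples using the brackets get translated into a call to the __getitem__ special method. df['height'] becomes df.__getitem__('height'), df[2:5] becomes df.__getitem__(slice(2, 5)), and df[['age', 'height']] becomes df.__getitem__(['age', 'height']).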
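One way to see what the brackets actually hand over is a tiny class whose __getitem__ simply returns its argument. KeyEcho is a hypothetical helper written for this illustration; it is not part of pandas:

```python
class KeyEcho:
    """Hypothetical helper: returns whatever object the brackets received."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

string_key = echo["height"]         # a str
slice_key = echo[2:5]               # a slice object
list_key = echo[["age", "height"]]  # a list
tuple_key = echo[1, 2]              # a single tuple, not two arguments
```

In every case __getitem__ is called with exactly one argument; the comma inside the brackets just builds a tuple.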
You might be asking, “isn’t df[1, 2] passing two separate arguments to __getitem__?” The answer is “no”. Python treats 1, 2 as a tuple and passes that tuple as a single argument to the __getitem__ special method. Therefore, df[1, 2] becomes df.__getitem__((1, 2)). pandas receives the tuple. It is not a slice and not a list, so pandas checks whether this object is a column name. It is not a column name in the df DataFrame, so pandas raises a KeyError.
Our other DataFrame, df2, does have a column name equal to the tuple (1, 2), so it gets selected as a Series.
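The equivalence can be checked by calling __getitem__ directly. This is a sketch with made-up data, where df has only string column names and df2 has the tuple (1, 2) as its first column name:

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames discussed above.
df = pd.DataFrame({"age": [34, 29], "height": [180, 165]})
df2 = pd.DataFrame(
    [[10, 1.5], [20, 2.5]],
    columns=pd.Index([(1, 2), "other"], tupleize_cols=False),
)

# The bracket syntax and the explicit method call select the same column.
via_brackets = df2[1, 2]
via_method = df2.__getitem__((1, 2))

# The same call on df fails because no column is named (1, 2).
try:
    df.__getitem__((1, 2))
    raised_key_error = False
except KeyError:
    raised_key_error = True
```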
The rules change again for just the brackets whenever you have a MultiIndex for the columns. If you pass in a tuple, it will use the first item in the tuple as the value for the columns in the top level. It takes the second item in the tuple as the value for the columns in the next level.
Take a look at the following DataFrame with a two-level MultiIndex. There are six total columns. The top level has two values — the integers 1 and 2.
We select all of the columns with top level value equal to 1 like this:
To select a single column in this MultiIndex DataFrame, use a tuple just like we did above.
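A minimal sketch of such a DataFrame follows. The top level uses the integers 1 and 2 as described; the second-level values (1, 2, 3) and the data are assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Two-level column MultiIndex: 2 top-level values x 3 second-level values = 6 columns.
columns = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]])
df_mi = pd.DataFrame(np.arange(12).reshape(2, 6), columns=columns)

# A single value selects every column under that top-level value, as a DataFrame.
top_level_1 = df_mi[1]

# A tuple selects one column: top level 1, second level 2, as a Series.
single_column = df_mi[1, 2]
```

With this layout the original selection, [1, 2], no longer looks for a column literally named (1, 2). It walks the levels instead.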
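Just the brackets makes subset selections on a pandas DataFrame and changes its behavior based on the object passed to it. Below are the objects it accepts and what it returns.
A slice selects rows and returns a DataFrame.
A list selects columns and returns a DataFrame.
A boolean Series or list selects the rows where True is located and returns a DataFrame.
Any other object, including a tuple, is treated as a single column name and returns a Series, or raises a KeyError if no such column exists.
With a MultiIndex for the columns, the items of a tuple are matched against the column levels in order, starting from the top level.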