Sep 17, 2019

In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.

The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:

Our performance using this naive solution takes nearly 4 seconds.

- My book Master Data Analysis with Python is the most comprehensive text on the market to learn data analysis using Python and comes with 300+ exercises and projects.
- Sign-up for the
**FREE**Intro to Pandas class - Follow me on Twitter @TedPetrou for my daily data science tricks

In order to greatly increase our performance, we need to take advantage of the **built-in** methods available to groupby objects. Above, we used a custom function to do many, many calculations. These calculations were performed on each group. Let’s get the total number of groups.

`>>> len(df.groupby(['country', 'region']))`

520

For each of these 520 groups, the massive `f_final`

function was called, recomputing each filter. Running these same calculations for each group is one of the main causes of poor performance with `apply`

.

Take a look at all the filters in `f_final`

. You'll notice that each of them are independent on the particular group. This means that we can calculate these filters **before** grouping and get the same result. As a concrete example, take a look at the following DataFrame.

`df_rev = pd.DataFrame({'state': ['TX', 'TX', 'TX', 'CA', 'CA', `

'CA'],

'category': ['tech', 'energy', 'energy',

'tech', 'energy', 'energy'],

'revenue' : [10, 5, 8, 20, 12, 2]})

Let’s calculate the revenue for each state, but just for the energy category. A naive solution involves writing a custom function, where each group will be filtered for just the energy category and then have the revenue summed.

**def** f(x):

is_energy = x['category'] == 'energy'

**return** x.loc[is_energy, 'revenue'].sum()

Using this custom function with apply returns the correct revenue for each state’s energy category.

`>>> df_rev.groupby('state').apply(f)`

state

CA 14

TX 13

Instead, we can create an entire new column for just the energy revenue. First, we create a boolean Series where `True`

corresponds to 'energy'. We multiply this Series by the original revenue column. Because `False`

evaluates to 0 and `True`

evaluates as 1, the new column will be just like the original, but have the 0 everywhere the category is not energy.

`filt = df_rev['category'] == 'energy'`

df_rev['energy_revenue'] = filt * df_rev['revenue']

df_rev

We can now use the built-in `sum`

method instead of our custom function during the grouping. There is no need for `apply`

here.

`>>> df_rev.groupby('state')['energy_revenue'].sum()`

state

CA 14

TX 13

`where`

methodInstead of replacing the filtered values with 0, as we did above, you might need to make them missing. This is crucial if you are calculating something like the mean or median, which will take into account the value of 0. The `where`

method will replace the `False`

values of the passed boolean Series with NaN.

`filt = df_rev['category'] == 'energy'`

df_rev['energy_revenue'] = df_rev['revenue'].where(filt)

df_rev

We can call the same groupby to get the same result.

`>>> df_rev.groupby('state')['energy_revenue'].sum()`

state

CA 14.0

TX 13.0

Let’s calculate `rev_2019`

which is defined as the revenue during the first half of 2019. Let's use the naive way of thinking first by defining a custom function.

>>> defget_rev_2019(x):

is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')

returnx.loc[is_2019H1, 'revenue'].sum()

>>> df.groupby(['country', 'region']).apply(get_rev_2019).head()country region

Argentina A 150508

B 139048

C 118035

D 131728

E 146201

Now, let’s use our new method of applying the filter to the entire DataFrame first, creating a new column, and then using the built-in `sum`

method.

`>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')`

>>> df['rev_2019'] = df['revenue'].where(is_2019H1)

>>> df.groupby(['country', 'region'])['rev_2019'].sum().head()

country region

Argentina A 150508.0

B 139048.0

C 118035.0

D 131728.0

E 146201.0

Let’s compare the performance between the two methods.

`>>> %`**timeit** df.groupby(['country', 'region']).apply(get_rev_2019)

633 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`>>> %%timeit`

>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')

>>> df['rev_2019'] = df['revenue'].where(is_2019H1)

>>> df.groupby(['country', 'region'])['rev_2019'].sum()

27.3 ms ± 530 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The custom function is 25x slower than the built-in method and this is just a simple calculation. The more complex the custom function, the larger the performance difference becomes.

If you are attempting to avoid using apply, then you have no choice but to use the built-in groupby methods. This limits the possibilities and forces you to approach the problem differently. The main ‘trick’ is to execute operations to the entire DataFrame before using the groupby method.

Not all operations will be able to be executed on the entire DataFrame, only those that are **independent** of the group. So, how do you know if an operation is independent of the group? The operation will not contain a calculation that is specific to the current group. For instance, we are grouping by country and region. If an operation is dependent on the particular country or region, then it would not be able to be executed on the entire DataFrame.

A concrete example can help here — If the definition of the first half of the year was January through July for Greece and January through June for all other countries, then the calculation of revenue for the first half of the year would depend on the group.

In this challenge, all the operations are independent of the group. There are no special cases based on the group. This means that we can execute all of our operations that we used within the custom function passed to `apply`

outside of it before we group.

The complete optimal solution will now be given. We will use the same definition for our filters as we did in the custom function, but instead calculate them on the entire DataFrame. We will then create new columns that have NaN where the filter is `False`

.

We can now use just the built-in groupby methods to aggregate the data.

This is not the final DataFrame, as some columns can only be calculated from the result of the aggregated values. We also need to drop some of these intermediate columns that are no longer desired in the result.

Let’s verify that the DataFrames are equivalent.

`>>> df1 = df.groupby(['country', 'region']).apply(f_final)`

>>> df1.equals(df_new1.astype('float'))

True

Let’s put all the steps of the optimal solution into a single function, which we can then use to measure performance.

**def** optimal():

... all steps

The optimal solution is about 20x as fast as the naive solution due to pre-calculating new columns on the entire DataFrame and only using built-in groupby methods.

>>> %**timeit** optimal()

201 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I suggest avoiding apply at all costs as the performance can be badly impacted by its use. The following steps summarize the procedure for avoiding apply:

- Compute any calculations that are independent of each group before grouping. These calculations are applied to the entire DataFrame as a whole
- Create new columns in your DataFrame that contain the result of these new calculations from step 1.
- When grouping, only use the built-in groupby aggregation methods. Do not use a custom function.
- You may have some aggregations that use the result of a groupby aggregation (taking the difference between mean and median for example). Calculate these columns after the groupby.

Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. **Purchase the All Access Pass to get lifetime access to all current and future courses****.** Some of the courses it contains:

- Master the Fundamentals of Python— A comprehensive introduction to Python (300+ pages, 150+ exercises)
**Master Data Analysis with Python**— The most comprehensive course available to learn pandas. (800+ pages and 500+ exercises)**Master Machine Learning with Python**— A deep dive into doing machine learning with scikit-learn constantly updated to showcase the latest and greatest tools. (300+ pages)

Upon registration, you'll get access to the following free courses:

- Python Installation
- Intro to Jupyter Notebooks
- Intro to Pandas
- Python and Pandas Challenges