Dunder Data Challenge #3 - Optimal Solution

dunder data challenges Sep 17, 2019

In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.

Naive Solution — Custom function with apply

The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:


Our performance using this naive solution takes nearly 4 seconds.

Become an Expert

Optimal Solution

In order to greatly increase our performance, we need to take advantage of the built-in methods available to groupby objects. Above, we used a custom function to do many, many calculations. These calculations were performed on each group. Let’s get the total number of groups.

>>> len(df.groupby(['country', 'region']))

For each of these 520 groups, the massive f_final function was called, recomputing each filter. Running these same calculations for each group is one of the main causes of poor performance with apply.

Take a look at all the filters in f_final. You'll notice that each of them are independent on the particular group. This means that we can calculate these filters before grouping and get the same result. As a concrete example, take a look at the following DataFrame.

df_rev = pd.DataFrame({'state': ['TX', 'TX', 'TX', 'CA', 'CA', 
'category': ['tech', 'energy', 'energy',
'tech', 'energy', 'energy'],
'revenue' : [10, 5, 8, 20, 12, 2]})

Let’s calculate the revenue for each state, but just for the energy category. A naive solution involves writing a custom function, where each group will be filtered for just the energy category and then have the revenue summed.

def f(x):
is_energy = x['category'] == 'energy'
return x.loc[is_energy, 'revenue'].sum()

Using this custom function with apply returns the correct revenue for each state’s energy category.

>>> df_rev.groupby('state').apply(f)
CA 14
TX 13

Instead, we can create an entire new column for just the energy revenue. First, we create a boolean Series where True corresponds to 'energy'. We multiply this Series by the original revenue column. Because False evaluates to 0 and True evaluates as 1, the new column will be just like the original, but have the 0 everywhere the category is not energy.

filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = filt * df_rev['revenue']

We can now use the built-in sum method instead of our custom function during the grouping. There is no need for apply here.

>>> df_rev.groupby('state')['energy_revenue'].sum()
CA 14
TX 13

Use the where method

Instead of replacing the filtered values with 0, as we did above, you might need to make them missing. This is crucial if you are calculating something like the mean or median, which will take into account the value of 0. The where method will replace the False values of the passed boolean Series with NaN.

filt = df_rev['category'] == 'energy'
df_rev['energy_revenue'] = df_rev['revenue'].where(filt)

We can call the same groupby to get the same result.

>>> df_rev.groupby('state')['energy_revenue'].sum()
CA 14.0
TX 13.0

Filter with our challenge data

Let’s calculate rev_2019 which is defined as the revenue during the first half of 2019. Let's use the naive way of thinking first by defining a custom function.

>>> def get_rev_2019(x):
is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
return x.loc[is_2019H1, 'revenue'].sum()

>>> df.groupby(['country', 'region']).apply(get_rev_2019).head()country region
Argentina A 150508
B 139048
C 118035
D 131728
E 146201

Now, let’s use our new method of applying the filter to the entire DataFrame first, creating a new column, and then using the built-in sum method.

>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
>>> df['rev_2019'] = df['revenue'].where(is_2019H1)
>>> df.groupby(['country', 'region'])['rev_2019'].sum().head()
country region
Argentina A 150508.0
B 139048.0
C 118035.0
D 131728.0
E 146201.0

Custom function vs built-in method performance comparison

Let’s compare the performance between the two methods.

>>> %timeit df.groupby(['country', 'region']).apply(get_rev_2019)
633 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
>>> is_2019H1 = df['date'].between('2019-01-01', '2019-06-30')
>>> df['rev_2019'] = df['revenue'].where(is_2019H1)
>>> df.groupby(['country', 'region'])['rev_2019'].sum()
27.3 ms ± 530 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The custom function is 25x slower than the built-in method and this is just a simple calculation. The more complex the custom function, the larger the performance difference becomes.

The trick to avoiding apply

If you are attempting to avoid using apply, then you have no choice but to use the built-in groupby methods. This limits the possibilities and forces you to approach the problem differently. The main ‘trick’ is to execute operations to the entire DataFrame before using the groupby method.

Not all operations will be able to be executed on the entire DataFrame, only those that are independent of the group. So, how do you know if an operation is independent of the group? The operation will not contain a calculation that is specific to the current group. For instance, we are grouping by country and region. If an operation is dependent on the particular country or region, then it would not be able to be executed on the entire DataFrame.

A concrete example can help here — If the definition of the first half of the year was January through July for Greece and January through June for all other countries, then the calculation of revenue for the first half of the year would depend on the group.

In this challenge, all the operations are independent of the group. There are no special cases based on the group. This means that we can execute all of our operations that we used within the custom function passed to apply outside of it before we group.

Complete optimal solution

The complete optimal solution will now be given. We will use the same definition for our filters as we did in the custom function, but instead calculate them on the entire DataFrame. We will then create new columns that have NaN where the filter is False.

We can now use just the built-in groupby methods to aggregate the data.

This is not the final DataFrame, as some columns can only be calculated from the result of the aggregated values. We also need to drop some of these intermediate columns that are no longer desired in the result.

Let’s verify that the DataFrames are equivalent.

>>> df1 = df.groupby(['country', 'region']).apply(f_final)
>>> df1.equals(df_new1.astype('float'))

Let’s put all the steps of the optimal solution into a single function, which we can then use to measure performance.

def optimal():
... all steps

About 20x faster

The optimal solution is about 20x as fast as the naive solution due to pre-calculating new columns on the entire DataFrame and only using built-in groupby methods.

>>> %timeit optimal()
201 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary of how to unapply apply

I suggest avoiding apply at all costs as the performance can be badly impacted by its use. The following steps summarize the procedure for avoiding apply:

  1. Compute any calculations that are independent of each group before grouping. These calculations are applied to the entire DataFrame as a whole
  2. Create new columns in your DataFrame that contain the result of these new calculations from step 1.
  3. When grouping, only use the built-in groupby aggregation methods. Do not use a custom function.
  4. You may have some aggregations that use the result of a groupby aggregation (taking the difference between mean and median for example). Calculate these columns after the groupby.

Master Python, Data Science and Machine Learning

Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. Purchase the All Access Pass to get lifetime access to all current and future courses. Some of the courses it contains:

Get the All Access Pass now! 


Register for a free account

Upon registration, you'll get access to the following free courses:

  • Python Installation
  • Intro to Jupyter Notebooks
  • Intro to Pandas
  • Python  and Pandas Challenges