Sep 07, 2019

This is the first edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook thanks to Binder (mybinder.org).

In this challenge, your goal is to find the fastest solution while only using the Pandas library.

- My book Master Data Analysis with Python is the most comprehensive text on the market to learn data analysis using Python and comes with 300+ exercises and projects.
- Sign-up for the
**FREE**Intro to Pandas class - Follow me on Twitter @TedPetrou for my daily data science tricks

The `college_pop`

dataset contains the name, state, and population of all higher-ed institutions in the US and its territories. For each state, find the percentage of the total state population made up by the 5 largest colleges of that state. Below, you can inspect the first few rows of the data.

`>>> import pandas as pd`

>>> college = pd.read_csv('data/college_pop.csv')

>>> college.head()

Stop here if you want to attempt to answer this challenge on your own.

`grouping`

problemThis problem needs the use of the `groupby`

method to group all the colleges by state. With each group, we need to do these 4 steps:

- Select the top 5 largest schools
- Sum the population of these 5 largest schools
- Sum the entire state population
- Divide the results of step 2 and 3

Pandas `groupby`

objects have many methods such as `min`

, `max`

, `mean`

, `sum`

, etc… There is no direct method to accomplish our current task. We will need to do this problem in steps. There are multiple different approaches to solve this challenge which are outlined below.

Many Pandas users will see a problem like this and immediately think about creating a custom grouping function. Let’s start with this approach. Below, we create a custom function that accepts a single column and returns a single value. We will sort this column from greatest to least and then finish the problem.

`def find_top_5(s):`

s = s.sort_values(ascending=False)

top5 = s.iloc[:5]

top5_total = top5.sum()

total = s.sum()

return top5_total / total

Let’s use this custom function to get our result.

`>>> result = college.groupby('state').agg({'pop': find_top_5})`

>>> result.head()

It can be difficult to understand the operations taking place within custom grouping functions. One way to track and debug the code within custom grouping functions is to output the results of the variable you would like to inspect. You can use the `print`

function but I recommend using the `display`

function from the `IPython.display`

module. This will output DataFrames styled in the same manner as they would in the notebook. In this particular example, we will only output Series objects, but they are more useful for outputting DataFrames. Below, we define a new custom function that displays each line to the screen.

Note, that this function is actually called twice, which doesn’t seem to make any sense since an error is always raised before the function ends. Pandas always calls custom grouping functions twice for the first group regardless if it produces an error.

from IPython.display import displaydef find_top_5_display(s):

s = s.sort_values(ascending=False)

display('sorted schools', s)

top5 = s.iloc[:5]

display('top 5 schools', top5)

top5_total = top5.sum()

display('top 5 total', top5_total)

total = s.sum()

display('state total', total)

answer = top5_total / total

display('answer', answer)

raise

Let’s call this new function to see the results of each line of code.

>>> college.groupby('state').agg({'pop': find_top_5_display})

'sorted schools'

60 12865.0

62 5536.0

66 3256.0

63 1428.0

65 889.0

67 479.0

64 275.0

5171 109.0

5417 68.0

61 27.0

Name: pop, dtype: float64

'top 5 schools'

60 12865.0

62 5536.0

66 3256.0

63 1428.0

65 889.0

Name: pop, dtype: float64

'top 5 total'

23974.0

'state total'

24932.0

'answer'

0.9615754853200706

If you are enjoying this article, consider purchasing the All Access Pass! which includes all my current and future material for one low price.

On my machine, this solution completes in about 50ms.

>>> %timeit -n 5 college.groupby('state').agg({'pop': find_top_5})

53.1 ms ± 2.26 ms per loop

`apply`

instead of `agg`

You can use the `apply`

method as well which has slightly different syntax and returns a Series and not a DataFrame. Performance is similar.

>>> %timeit -n 5 college.groupby('state')['pop'].apply(find_top_5)

58.8 ms ± 3.7 ms per loop

Instead of sorting the data within the custom function, we can sort the entire DataFrame first. Pandas preserves the order of the rows within each group so we don’t need to worry about losing this sorted order during grouping. Below, we create a new custom function that assumes the data is already sorted. The result is the exact same as solution 1.

>>> cs = college.sort_values('pop', ascending=False)

>>> def find_top_5_sorted(s):

top5 = s.iloc[:5]

return top5.sum() / s.sum()

>>> cs.groupby('state').agg({'pop': find_top_5_sorted}).head()

On my machine, this performs twice as fast as solution 1.

>>> %timeit -n 5 cs.groupby('state').agg({'pop': find_top_5_sorted})

23.3 ms ± 1.2 ms per loop

One thing you must be aware of when using a custom grouping function is their potential for poor performance. Every line of code in the custom function must be re-run for each group. If you can apply a function to the entire DataFrame instead of within a custom function, you will almost always see a nice performance gain.

It’s possible to eliminate the custom function entirely. Pandas `groupby`

objects have a `head`

method that returns the first values of each group. This eliminates the need to call `s.iloc[:5]`

within the custom function. Let's see this portion now. Notice, we create a new variable `grouped`

to reference the `groupby`

object. We will use this again later.

`>>> cs = college.sort_values('pop', ascending=False)`

>>> grouped = cs.groupby('state')

>>> cs_top5 = grouped.head(5)

>>> cs_top5.head(10)

Only the first 5 rows for each state are returned. Let’s output the number of rows of this DataFrame below.

>>> cs_top5.shape(270, 5)

This dataset includes US territories which is why there are more than 250 rows (50 states * 5).

`groupby`

From here, we must perform another `groupby`

on this smaller dataset to get the total for these top 5 schools.

`>>> top5_total = cs_top5.groupby('state').agg({'pop': 'sum'})`

>>> top5_total.head()

Now we find the total for all the schools in each state.

`>>> total = grouped.agg({'pop': 'sum'})`

>>> total.head()

Now, we can divide the previous two to get the desired result, which is the same as the previous two solutions.

`>>> top5_total / total`

Eliminating the custom function altogether gives us the best performance, about 7x faster on my machine than Solution 1.

>>> %%timeit -n 5

>>> cs = college.sort_values('pop', ascending=False)

>>> grouped = cs.groupby('state')

>>> cs_top5 = grouped.head(5)

>>> top5_total = cs_top5.groupby('state').agg({'pop': 'sum'})

>>> total = grouped.agg({'pop': 'sum'})

>>> answer = top5_total / total7.43 ms ± 665 µs per loop

For complex grouping situations, you will be tempted to write your own custom function to do all of the work. This is dangerous and can lead to extremely inefficient code. Pandas cannot optimize custom functions. It has a limited number of builtin grouping methods. All of these are optimized and should yield better performance. The following are some guidelines when approaching a complex grouping situation.

If a builtin grouping method exists then you should use it over any custom function.

Solution 1 above was the slowest and performed all its calculations to each group within the custom function. This was the slowest solution. If you can perform an operation to the entire DataFrame outside of the custom grouping function, you will much better performance.

In general, use builtin Pandas methods whenever they exist and avoid custom functions if at all possible. Solution 3 uses no custom functions and performs the best.

Get all of my current and future material for one low price with the All Access Pass! The primary courses available are the following:

50% Complete

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.