Dunder Data Challenge #3 - Multiple Custom Grouping Aggregations

dunder data challenges Sep 09, 2019

Welcome to the third edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).

This challenge is going to be fairly difficult, but should answer a question that many pandas users face — What is the best way to perform a groupby that does many custom aggregations? In this context, a ‘custom aggregation’ is defined as one that is not directly available to use from pandas and one that you must write a custom function.

In Dunder Data Challenge #1, a single aggregation, which required a custom grouping function, was the desired result. In this challenge, you’ll need to return several aggregations when grouping. There are a few different solutions to this problem, but depending on how you arrive at your solution, there could arise enormous performance differences. I am looking for a compact, readable solution with very good performance.

Begin Mastering Data Science Now for Free!

Take my free Intro to Pandas course to begin your journey mastering data analysis with Python.

Online Courses

Master the Fundamentals of Python— A comprehensive introduction to Python (300+ pages, 150+ exercises)
Master Data Analysis with Python — The most comprehensive course available to learn pandas. (800+ pages and 500+ exercises)
Master Machine Learning with Python — A deep dive into doing machine learning with scikit-learn constantly updated to showcase the latest and greatest tools. (300+ pages)

Get the All Access Pass now!

In-Person Courses

Take a hands-on, interactive, in-person class with me

Social Media

I frequently post my python data science thoughts on social media. Follow me!

Corporate Training

If you have a group at your company looking to learn directly from an expert who understands how to teach and motivate students, let me know by filling out the form on this page.

Sales Data

In this challenge, you will be working with some mock sales data found in the sales.csv file. It contains 200,000 rows and 9 columns. Here are the first five rows.

The Challenge

There are many aggregations that you will need to return and it will take some time to understand what they are and how to return them. The following definitions for two time periods will be used throughout the aggregations.

Period 2019H1 is defined as the time period beginning January 1, 2019 and ending June 30, 2019
Period 2018H1 is defined as the time period beginning January 1, 2018 and ending June 30, 2018.

Aggregations

I will now list all the aggregations that are expected to be returned. Each bullet point represents a single column. Use the first word after the bullet point as the new column name.

For every country and region, return the following:

recency: Number of days between today’s date (9/9/2019) and the maximum value of the ‘date’ column
fast_and_fastest: Number of unique customer_id in period 2019H1 with delivery_type either ‘fast’ or ‘fastest’
revenue_2019: Total revenue for the period 2019H1
revenue_2018: Total revenue for the period 2018H1
cost_2019: Total cost for period 2019H1
cost_2019_expert: Total cost for period 2019H1 with cost_type ‘expert’
other_cost: Difference between cost_2019 and cost_2019_expert
revenue_per_60: Total of revenue when duration equals 60 in period 2019H1 divided by number of unique customer_id when duration equals 60 in period 2019H1
profit_margin: Take the difference of revenue_2019 and cost_2019_expert then divide by revenue_2019
cost_expert_per_60: Total of cost when duration is 60 and cost_type is ‘expert’ in period 2019H1 divided by the number of unique customer_id when duration equals 60 and cost_type is ‘expert’ in period 2019H1
growth: Find the percentage growth from revenue in period 2019H1 compared to the revenue in period 2018H1