Dunder Data Challenge #3 - Multiple Custom Grouping Aggregations

dunder data challenges Sep 09, 2019

Welcome to the third edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).

This challenge is going to be fairly difficult, but should answer a question that many pandas users face — What is the best way to perform a groupby that does many custom aggregations? In this context, a ‘custom aggregation’ is defined as one that is not directly available to use from pandas and one that you must write a custom function.

In Dunder Data Challenge #1, a single aggregation, which required a custom grouping function, was the desired result. In this challenge, you’ll need to return several aggregations when grouping. There are a few different solutions to this problem, but depending on how you arrive at your solution, there could arise enormous performance differences. I am looking for a compact, readable solution with very good performance.

Begin Mastering Data Science Now for Free!

Take my free Intro to Pandas course to begin your journey mastering data analysis with Python.

Online Courses

Get the All Access Pass now! 

In-Person Courses

Social Media

I frequently post my python data science thoughts on social media. Follow me!

Corporate Training

If you have a group at your company looking to learn directly from an expert who understands how to teach and motivate students, let me know by filling out the form on this page.

Sales Data

In this challenge, you will be working with some mock sales data found in the sales.csv file. It contains 200,000 rows and 9 columns. Here are the first five rows.

The Challenge

There are many aggregations that you will need to return and it will take some time to understand what they are and how to return them. The following definitions for two time periods will be used throughout the aggregations.

  • Period 2019H1 is defined as the time period beginning January 1, 2019 and ending June 30, 2019
  • Period 2018H1 is defined as the time period beginning January 1, 2018 and ending June 30, 2018.

Aggregations

I will now list all the aggregations that are expected to be returned. Each bullet point represents a single column. Use the first word after the bullet point as the new column name.

For every country and region, return the following:

  • recency: Number of days between today’s date (9/9/2019) and the maximum value of the ‘date’ column
  • fast_and_fastest: Number of unique customer_id in period 2019H1 with delivery_type either ‘fast’ or ‘fastest’
  • revenue_2019: Total revenue for the period 2019H1
  • revenue_2018: Total revenue for the period 2018H1
  • cost_2019: Total cost for period 2019H1
  • cost_2019_expert: Total cost for period 2019H1 with cost_type ‘expert’
  • other_cost: Difference between cost_2019 and cost_2019_expert
  • revenue_per_60: Total of revenue when duration equals 60 in period 2019H1 divided by number of unique customer_id when duration equals 60 in period 2019H1
  • profit_margin: Take the difference of revenue_2019 and cost_2019_expert then divide by revenue_2019
  • cost_expert_per_60: Total of cost when duration is 60 and cost_type is ‘expert’ in period 2019H1 divided by the number of unique customer_id when duration equals 60 and cost_type is ‘expert’ in period 2019H1
  • growth: Find the percentage growth from revenue in period 2019H1 compared to the revenue in period 2018H1
Close

Register for a free account

Upon registration, you'll get access to the following free courses:

  • Python Installation
  • Intro to Jupyter Notebooks
  • Intro to Pandas
  • Python  and Pandas Challenges