In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.
The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:
This naive solution takes nearly 4 seconds to run.
To view the problem setup, go to the Dunder Data Challenge #3 post. This post will contain the solution.
I will first present a naive solution that returns the correct results, but is extremely slow. It uses a large custom function with the groupby apply method. Using the groupby apply method has potential to capsize your program as performance can be awful.
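To make the contrast concrete, here is a minimal sketch of the two patterns. It is not the function from the challenge: the DataFrame, its column names (group, kind, value), and the custom_agg helper are all hypothetical and exist only for illustration. The first version calls a Python function once per group and returns a Series; the second applies the boolean filter once to the whole DataFrame and then relies on built-in aggregations.

```python
import numpy as np
import pandas as pd

# Hypothetical data: these column names are illustrative, not from the challenge dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=1_000),
    "kind": rng.choice(["x", "y"], size=1_000),
    "value": rng.integers(1, 100, size=1_000),
})


def custom_agg(g):
    # Boolean filter evaluated inside every group; one Series returned per group
    is_x = g["kind"] == "x"
    return pd.Series({
        "total": g["value"].sum(),
        "total_x": g.loc[is_x, "value"].sum(),
    })


# Pattern 1: groupby apply with a custom function (one Python call per group)
slow = df.groupby("group")[["kind", "value"]].apply(custom_agg)

# Pattern 2: filter once up front, then use built-in groupby aggregations
fast = (
    df.assign(value_x=df["value"].where(df["kind"] == "x", 0))
      .groupby("group")
      .agg(total=("value", "sum"), total_x=("value_x", "sum"))
)

# Both patterns should produce the same numbers
print(np.array_equal(slow.to_numpy(), fast.to_numpy()))
```

On a toy DataFrame like this the timing gap is small. The difference tends to grow with the number of groups, since the custom function runs once for each of them.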
One of my first attempts at using a groupby apply to solve a complex grouping problem resulted in a computation that took about eight hours to finish. The dataset was fairly large, at around a million rows, but could still easily fit in memory. I eventually ended up solving the problem using SAS (and not pandas) and shrank the execution...
Welcome to the third edition of the Dunder Data Challenge series designed to help you learn Python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).
This challenge is going to be fairly difficult, but should answer a question that many pandas users face: what is the best way to perform a groupby that does many custom aggregations? In this context, a ‘custom aggregation’ is one that is not directly available in pandas and for which you must write a custom function.
In Dunder Data Challenge #1, the desired result was a single aggregation that required a custom grouping function. In this challenge, you’ll need to return several aggregations when grouping. There are a few different solutions to this problem, but depending on how you arrive at your solution, the performance differences can be enormous. I am...
Welcome to the second edition of the Dunder Data Challenge series designed to help you learn Python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).
In this challenge, your goal is to explain why taking the mean of the following DataFrame is more than 1,000x faster when setting the parameter
This is the first edition of the Dunder Data Challenge series designed to help you learn Python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook thanks to Binder (mybinder.org).
In this challenge, your goal is to find the fastest solution while only using the pandas library.
The college_pop dataset contains the name, state, and population of all higher-ed institutions in the US and its territories. For each state, find the percentage of the total state population made up by the 5 largest colleges of that state. Below, you can inspect the first few rows of the...