Dunder Data Challenge #3 - Optimal Solution

dunder data challenges Sep 17, 2019

In this article, I will present an ‘optimal’ solution to Dunder Data Challenge #3. Please refer to that article for the problem setup. Work on this challenge directly in a Jupyter Notebook right now by clicking this link.

Naive Solution — Custom function with apply

The naive solution was presented in detail in the previous article. The end result was a massive custom function containing many boolean filters used to find specific subsets of data to aggregate. For each group, a Series was returned with 11 values. Each of these values became a new column in the resulting DataFrame. Let’s take a look at the custom function:


This naive solution takes nearly 4 seconds to run.

Become an Expert

Continue Reading...

Use the brackets to select a single pandas DataFrame column and not dot notation

pandas Sep 13, 2019

pandas gives its users two ways to select a single column of data: brackets or dot notation. In this article, I recommend using the brackets and not dot notation for the following ten reasons.

  1. Select column names with spaces
  2. Select column names that have the same name as methods
  3. Select columns with variables
  4. Select non-string columns
  5. Set new columns
  6. Select multiple columns
  7. Dot notation is a strict subset of the brackets
  8. Use one way which works for all situations
  9. Auto-completion works in the brackets and following it
  10. Brackets are the canonical way to select subsets for all objects

Selecting a single column

Let’s begin by creating a small DataFrame with a few columns:

import pandas as pd
df = pd.DataFrame({'name': ['Niko', 'Penelope', 'Aria'],
                   'average score': [10, 5, 3],
                   'max': [99, 100, 3]})

Let’s select the name column with dot notation. Many pandas users like dot notation.

>>> df.name
0 Niko
1 Penelope
2 Aria
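
As a brief sketch of the first few reasons listed above, the snippet below reuses the same small DataFrame to show cases where the brackets work and dot notation does not. The variable names and the 'passed' column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Niko', 'Penelope', 'Aria'],
                   'average score': [10, 5, 3],
                   'max': [99, 100, 3]})

# Reason 1: a column name with a space can only be reached with brackets
avg = df['average score']

# Reason 2: df.max refers to the DataFrame method, not the 'max' column
max_is_method = callable(df.max)
max_col = df['max']

# Reason 3: brackets accept a variable that holds the column name
col = 'name'
names = df[col]

# Reason 5: brackets create a new column ('passed' is a made-up example)
df['passed'] = df['average score'] > 4
```

Each of these selections returns a Series, the same type of object that df.name returns.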


Continue Reading...

Dunder Data Challenge #3 - Naive Solution

dunder data challenges Sep 12, 2019

To view the problem setup, go to the Dunder Data Challenge #3 post. This post will contain the solution.

I will first present a naive solution that returns the correct results, but is extremely slow. It uses a large custom function with the groupby apply method. Using the groupby apply method has potential to capsize your program as performance can be awful.
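
To make the pattern concrete, here is a minimal sketch of a custom function used with groupby apply. It is not the challenge solution: the data, the column names, and the custom_agg function are all hypothetical, and the real function has far more filters and returned values.

```python
import pandas as pd

# Hypothetical data, not the challenge dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'kind': ['x', 'y', 'x', 'x', 'y'],
    'value': [1, 2, 3, 4, 5],
})

def custom_agg(g):
    # A boolean filter picks the subset of the group to aggregate
    is_x = g['kind'] == 'x'
    # Each entry of the returned Series becomes a column in the result
    return pd.Series({
        'total': g['value'].sum(),
        'x_total': g.loc[is_x, 'value'].sum(),
        'y_max': g.loc[~is_x, 'value'].max(),
    })

# The function is called once per group, which is why this gets slow
result = df.groupby('group')[['kind', 'value']].apply(custom_agg)
```

The result has one row per group and one column per value in the returned Series. Because the Python function runs separately for every group, the cost grows with the number of groups.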

One of my first attempts at using a groupby apply to solve a complex grouping problem resulted in a computation that took about eight hours to finish. The dataset was fairly large, at around a million rows, but could still easily fit in memory. I eventually ended up solving the problem using SAS (and not pandas) and shrank the execution...

Continue Reading...

Dunder Data Challenge #3 - Multiple Custom Grouping Aggregations

dunder data challenges Sep 09, 2019

Welcome to the third edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).

This challenge is going to be fairly difficult, but should answer a question that many pandas users face: what is the best way to perform a groupby that does many custom aggregations? In this context, a ‘custom aggregation’ is one that is not directly available in pandas and for which you must write a custom function.

In Dunder...

Continue Reading...

Dunder Data Challenge #2 - Explain the 1,000x Speed Difference when taking the Mean

dunder data challenges Sep 08, 2019

Welcome to the second edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook courtesy of Binder (mybinder.org).

In this challenge, your goal is to explain why taking the mean of the following DataFrame is more than 1,000x faster when setting the parameter numeric_only to True.
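
For reference, this is what the call looks like on a tiny stand-in DataFrame. The data and column names here are hypothetical, not the bikes dataset, and the sketch only shows the parameter in use; it does not answer the challenge. Note that the behavior without numeric_only depends on your pandas version, and recent versions raise an error when string columns are present.

```python
import pandas as pd

# Hypothetical stand-in with one string column and two numeric columns
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'tripduration': [993, 623, 1040],
    'temperature': [73.9, 69.1, 73.0],
})

# Only the numeric columns are considered when numeric_only is True
means = df.mean(numeric_only=True)
```

The returned Series is indexed by the numeric column names only.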

Video Available!

A video tutorial of me completing this challenge is available on YouTube.

The Challenge

The bikes dataset below has about 50,000 rows. Calling the mean method on the entire DataFrame...

Continue Reading...

Dunder Data Challenge #1 - Optimize Custom Grouping Function

dunder data challenges Sep 07, 2019

This is the first edition of the Dunder Data Challenge series designed to help you learn python, data science, and machine learning. Begin working on any of the challenges directly in a Jupyter Notebook thanks to Binder (mybinder.org).

In this challenge, your goal is to find the fastest solution while only using the Pandas library.

The Challenge

The college_pop dataset contains the name, state, and population of all higher-ed institutions in the US and its territories. For each state, find the percentage of the total state population made up by the 5 largest colleges of that state. Below, you can inspect the first few rows of the...
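
To clarify what is being asked, here is one straightforward baseline on a made-up miniature table. It is a sketch only and makes no claim to be the fastest solution; the data is invented, the column names name, state, and population are assumed from the description above, and top5_pct is a hypothetical helper.

```python
import pandas as pd

# Made-up miniature stand-in for college_pop; column names are assumed
college_pop = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'state': ['TX'] * 6 + ['VT'] * 2,
    'population': [100, 90, 80, 70, 60, 100, 30, 10],
})

def top5_pct(pop):
    # Percentage of the state total held by its five largest colleges
    return pop.nlargest(5).sum() / pop.sum() * 100

# One value per state; a state with five or fewer colleges gives 100
result = college_pop.groupby('state')['population'].agg(top5_pct)
```

A custom function like this is called once per state, which is the kind of cost the challenge asks you to improve on.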

Continue Reading...

Pandas Cookbook — Develop Powerful Routines for Exploring Real-World Datasets

pandas Jul 18, 2019

In this article, I will discuss the overall approach I took to writing Pandas Cookbook along with highlights of each chapter.

New Book — Master Data Analysis with Python

I have a new book titled Master Data Analysis with Python that is far superior to Pandas Cookbook. It contains over 300 exercises and projects to reinforce all the material and will receive continuous updates through 2020. If you are interested in Pandas Cookbook, I would strongly suggest purchasing Master Data Analysis with Python instead.

All Access Pass!

If you want to learn python, data analysis, and machine learning, then the All Access Pass! will provide you access to all my current and future material for one low price.

Pandas Cookbook Guiding Principles

I had three main guiding principles when writing the book:

  • Use of real-world datasets
  • Focus on doing data analysis
  • Writing modern, idiomatic pandas

First, I wanted you, the reader, to explore real-world datasets and not randomly...

Continue Reading...

Python for Data Analysis — A Critical Line-by-Line Review

book review pandas python Jul 09, 2019

In this post, I will offer my review of the book Python for Data Analysis (2nd edition) by Wes McKinney. My name is Ted Petrou and I am an expert at pandas and author of the recently released Pandas Cookbook. I read PDA thoroughly and created a very long review that is available on GitHub. This post provides some of the highlights from that full review.

What is a critical line-by-line review?

I read this book as if I were the only technical reviewer and was counted on to find all the possible errors. Every single line of code was scrutinized and explored to see if a better solution existed. Having spent nearly every day of the last 18 months writing and talking about pandas, I have formed strong opinions about how it should be used. This critical examination led to my finding fault with quite a large percentage of the code.

Review Focuses on Pandas

The main focus of PDA is on the pandas library but it does have material on basic Python, IPython...

Continue Reading...

Anaconda is bloated — Set up a lean, robust data science environment with Miniconda and Conda-Forge

Uncategorized Jul 01, 2019

In this tutorial, I will describe a process for setting up a lean and robust Python data science environment on your system. By the end of the tutorial, your system will be set up such that:

  • Python is installed with only the most common and useful packages for data science
  • Conda is installed to manage packages and environments
  • You’ll have a single, robust environment which minimizes dependency issues by relying on the conda-forge channel

Become an Expert

I am extraordinarily dedicated to producing the absolute best content for doing data science using Python. For all my courses and live training visit Dunder Data.

Continue Reading...

Selecting Subsets of Data in Pandas: Part 4

pandas Jun 30, 2019

This article is available as a Jupyter Notebook, complete with practice exercises at the bottom and detailed solutions in another notebook. All material will be contained in my Learn-Pandas GitHub repository.

This is the fourth and final part of the series “How to Select Subsets of Data in Pandas”. Pandas offers a wide variety of options for subset selection, which necessitates multiple articles. This series is broken down into the following topics.

  1. Selection with [], .loc and .iloc
  2. Boolean indexing
  3. Assigning subsets of data
  4. How NOT to select subsets of data

Learning what not to do

In all programming...

Continue Reading...