Jan 03, 2020

This article summarizes the very detailed guide presented in Minimally Sufficient Pandas.

Take my free Intro to Pandas course to begin your journey mastering data analysis with Python.

- It is a small subset of the library that is sufficient to accomplish nearly everything that it has to offer.
- It allows you to focus on doing data analysis and not the syntax

- All common data analysis tasks will use the same syntax
- Fewer commands will be easier to commit to memory
- Your code will be easier to understand by others and by you
- It will be easier to put Pandas code in production
- It reduces the chance of landing on a Pandas bug

Use the **brackets** and not **dot notation** to select a single column of data, because dot notation cannot select column names with spaces, those that collide with...
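A minimal sketch of why brackets are the more robust choice (the DataFrame and its column names here are hypothetical):

```python
import pandas as pd

# hypothetical DataFrame with a column name containing a space
# and another that collides with a DataFrame method
df = pd.DataFrame({'max speed': [1, 2], 'count': [3, 4]})

# brackets work for any column name
print(df['max speed'].tolist())  # [1, 2]

# dot notation fails here: `df.max speed` is a syntax error,
# and df.count refers to the DataFrame `count` method, not the column
print(callable(df.count))  # True -- it's the method, not the data
```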

Jan 01, 2020

In this article, I will offer an opinionated perspective on how to best use the Pandas library for data analysis. My objective is to argue that only a small subset of the library is sufficient to complete nearly all of the data analysis tasks that one will encounter. This minimally sufficient subset of the library will benefit both beginners and professionals using Pandas. Not everyone will agree with the suggestions I put forward, but they are how I teach and how I use the library myself. If you disagree or have any of your own suggestions, please leave them in the comments below.

By the end of this article you will:

- Know why limiting Pandas to a small subset will keep your focus on the actual data analysis and not on the syntax
- Have specific guidelines for taking a single approach to completing a variety of common data analysis tasks with Pandas

Dec 10, 2019

Click the video at the top of this post to view the animation and final solution.


A tutorial now follows that describes the recreation. It covers the following:

- Figure and Axes setup
- Adding shapes
- Color gradients
- Animation

Understanding these topics should give you enough to start animating your own figures in matplotlib.

We first create a matplotlib Figure and Axes, remove the axis labels and tick marks, and set the x and y axis limits. The `fill_between` method is used to set two different background colors.

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(...
```
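The snippet above is truncated, so here is a sketch of the full setup as I would write it; the figure size, axis limits, and colors are my own assumptions, not the article's confirmed values:

```python
import matplotlib.pyplot as plt

# create the Figure and Axes, then strip the axis decorations
fig, ax = plt.subplots(figsize=(16, 8))
ax.axis('off')
ax.set_xlim(0, 100)
ax.set_ylim(0, 60)

# two background colors: fill_between shades a horizontal band
# between y1 and y2 across the given x range
ax.fill_between(x=[0, 100], y1=20, y2=60, color='#151c2e')  # sky
ax.fill_between(x=[0, 100], y1=0, y2=20, color='#2b2b2b')   # ground
```

Calling `set_xlim`/`set_ylim` before `fill_between` pins the view so the shapes added later don't rescale the axes.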

Nov 26, 2019

In this challenge, you will recreate the Tesla Cybertruck unveiled last week using matplotlib. All challenges are available to be completed in your browser in a Jupyter Notebook now thanks to Binder (mybinder.org).

Use matplotlib to recreate the Tesla Cybertruck image above.

Add animation so that it drives off the screen.

I’m still working on this challenge myself. My current recreation is below:

If you are looking to completely master the pandas library and become a trusted expert for doing data science work, check out my book Master Data Analysis with Python. It comes with over 300 exercises with detailed solutions covering the pandas library in-depth.

Nov 25, 2019

This post presents a solution to Dunder Data Challenge #5 — Keeping Values Within the Interquartile Range.

All challenges may be worked in a Jupyter Notebook right now thanks to Binder (mybinder.org).

We begin by finding the first and third quartiles of each stock using the `quantile` method. This is an **aggregation**, which by default returns a single value for each column. Set the first parameter, `q`, to a float between 0 and 1 to represent the quantile. Below, we create two variables to hold the first and third quartiles (also known as the 25th and 75th percentiles) and output their results to the screen.

```python
import pandas as pd

stocks = pd.read_csv('../data/stocks10.csv', index_col='date',
                     parse_dates=['date'])
stocks.head()
```

```python
>>> lower = stocks.quantile(.25)
>>> upper = stocks.quantile(.75)
>>> lower
MSFT    19.1500
AAPL     3.9100
SLB     25.6200
AMZN    40.4600
TSLA    33.9375
XOM     32.6200
WMT     37.6200
T       14.5000
FB      62.3000
V...
```
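The solution is truncated above. A plausible continuation clips each column to these bounds; the `clip` call below is my assumption about the approach, sketched on hypothetical data rather than the article's dataset:

```python
import pandas as pd

# hypothetical prices for two stocks
stocks = pd.DataFrame({'AAA': [1, 5, 9, 50], 'BBB': [2, 4, 6, 8]})

lower = stocks.quantile(.25)   # first quartile of each column
upper = stocks.quantile(.75)   # third quartile of each column

# values below the first quartile become the first quartile,
# values above the third quartile become the third quartile;
# axis=1 aligns the bound Series with the columns
result = stocks.clip(lower, upper, axis=1)
```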

Nov 21, 2019

Selecting subsets of data in pandas is not a trivial task, as there are numerous ways to do the same thing. Different pandas users select data in different ways, so these options can be overwhelming. I wrote a long four-part series on it to clarify how it’s done. For instance, take a look at the following options for selecting a single column of data (assuming it’s the first column):

```python
df['colname']
df[['colname']]
df.colname
df.loc[:, 'colname']
df.iloc[:, 0]
df.get('colname')
```
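A quick check on a hypothetical DataFrame that the Series-returning options all yield the same data, and that the double-bracket form is the odd one out:

```python
import pandas as pd

df = pd.DataFrame({'colname': [1, 2, 3], 'other': [4, 5, 6]})

a = df['colname']          # brackets
b = df.colname             # dot notation
c = df.loc[:, 'colname']   # label-based selection
d = df.iloc[:, 0]          # position-based: first column
e = df.get('colname')      # returns None if the column is missing

assert a.equals(b) and a.equals(c) and a.equals(d) and a.equals(e)

# df[['colname']] is different: it returns a one-column DataFrame
assert isinstance(df[['colname']], pd.DataFrame)
```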


In this post, I want to cover a single edge case of subset selection; I believe most pandas users will be unaware of what it does and how it works. Let’s say we have a DataFrame `df` and issue the following subset selection.

`df[1, 2]`

This appears to...
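The explanation is cut off above, but it likely hinges on Python passing `(1, 2)` to `__getitem__` as a single tuple key. A sketch on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})

# df[1, 2] is equivalent to df[(1, 2)]: pandas looks for a single
# column labeled with the tuple (1, 2)
df[(1, 2)] = [10, 20]          # create a column labeled with the tuple
print(df[1, 2].tolist())       # the same column comes back

# on a DataFrame without such a column, df[1, 2] raises a KeyError
```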

Nov 14, 2019

In this challenge, you are given a table of closing stock prices for 10 different stocks with data going back as far as 1999. For each stock, calculate the interquartile range (IQR). Return a DataFrame that satisfies the following conditions:

- Keep values as they are if they are within the IQR
- For values lower than the first quartile, make them equal to the exact value of the first quartile
- For values higher than the third quartile, make them equal to the exact value of the third quartile

Start this challenge in a Jupyter Notebook right now thanks to Binder (mybinder.org)

```python
import pandas as pd

stocks = pd.read_csv('../data/stocks10.csv', index_col='date', parse_dates=['date'])
stocks.head()
```

There is a straightforward solution that completes this challenge in a single line of readable code. Can you find it?


Nov 13, 2019

In this post, I detail the solution to Dunder Data Challenge #4 — Finding the Date of the Largest Percentage Stock Price Drop.

To begin, we need to find the percentage drop for each stock for each day. pandas has a built-in method for this called `pct_change`. By default, it finds the percentage change between the current value and the one immediately above it. Like most DataFrame methods, it treats each column independently from the others.
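A small illustration of that default behavior, using hypothetical numbers:

```python
import pandas as pd

prices = pd.DataFrame({'AAA': [100.0, 110.0, 99.0],
                       'BBB': [50.0, 50.0, 25.0]})

# each value is compared with the one directly above it;
# the first row has nothing above it, so it becomes NaN.
# AAA row 1 is (110 - 100) / 100 = 0.1; BBB row 2 is (25 - 50) / 50 = -0.5
returns = prices.pct_change()
```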

If we call it on our current DataFrame, we’ll get an error, as it will not work on our date column. Let’s re-read in the data, converting the date column to a datetime and placing it in the index.

```python
stocks = pd.read_csv('../data/stocks10.csv', parse_dates=['date'],
                     index_col='date')
stocks.head()
```


Placing the date column in the index is a key part of...

Nov 12, 2019

In this challenge, you are given a table of closing stock prices for 10 different stocks with data going back as far as 1999. For each stock, find the date where it had its largest one-day percentage loss.

Begin working this challenge now in a Jupyter Notebook thanks to Binder (mybinder.org). The data is found in the `stocks10.csv` file with the ticker symbol as a column name. The Dunder Data Challenges Github repository also contains all of the challenges.

Can you return a Series that has the ticker symbols in the index and the date where the largest percentage price drop happened as the values? There is a nice, fast solution that uses just a minimal amount of code without any loops.

Can you return a DataFrame with the ticker symbol as the columns with a row for the date and another row for the percentage price drop?
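One plausible shape for the Series answer (my assumption, not the challenge's confirmed solution) chains `pct_change` with `idxmin`, sketched here on a hypothetical stand-in for `stocks10.csv`:

```python
import pandas as pd

# hypothetical stand-in: dates in the index, one column per ticker
stocks = pd.DataFrame(
    {'AAA': [100, 90, 95], 'BBB': [50, 55, 11]},
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03']))

# idxmin returns the index label (here, the date) of each column's
# minimum value, skipping the NaN in the first row
worst_day = stocks.pct_change().idxmin()
```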

- My book Master Data Analysis with...

Nov 01, 2019

In this tutorial, we will cover an efficient and straightforward method for finding the percentage of missing values in a Pandas DataFrame. This tutorial is available as a video on YouTube.


The final solution to this problem is not quite intuitive for most people when they first encounter it. We will slowly build up to it and also provide some other methods that get us a result that is close but not exactly what we want.

We begin by reading in the flights dataset, which contains US domestic flight information during the year 2015. Pandas defaults the number of visible columns to 20. Since there are 31 columns in this DataFrame, we change this option below.

```python
>>> import pandas as pd
>>> pd.options.display.max_columns = 100
>>> pd.read_csv('flights.csv')
>>>...
```
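The walkthrough is truncated above. The standard idiom it likely builds to (an assumption on my part) takes the column-wise mean of the boolean missing-value mask, sketched on a hypothetical stand-in for the flights data:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the flights dataset
df = pd.DataFrame({'dep_delay': [5.0, np.nan, 7.0, np.nan],
                   'carrier': ['AA', 'UA', None, 'DL']})

# isna() gives True/False per cell; the mean of booleans is the
# fraction missing, so multiplying by 100 gives a percentage
pct_missing = df.isna().mean() * 100
```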
