Tuesday, February 11, 2020

Python - Slicing and Dicing Dataframes

Slicing_Dicing

Slicing and Dicing Dataframes

You have seen how to do indexing of dataframes using df.iloc and df.loc. Now, let's see how to subset dataframes based on certain conditions.

In [ ]:
# loading libraries and reading the data
import numpy as np
import pandas as pd

df = pd.read_csv("../global_sales_data/market_fact.csv")
df.head()

Subsetting Rows Based on Conditions

Often, you want to select rows which satisfy some given conditions. For e.g., select all the orders where the Sales > 3000, or all the orders where 2000 < Sales < 3000 and Profit < 100.

Arguably, the best way to do these operations is using df.loc[], since df.iloc[] would require you to remember the integer column indices, which is tedious.

Let's see some examples.

In [ ]:
# Select all rows where Sales > 3000
# First, we get a boolean array where True corresponds to rows having Sales > 3000
df.Sales > 3000
In [ ]:
# Then, we pass this boolean array inside df.loc
df.loc[df.Sales > 3000]
In [ ]:
# An alternative to df.Sales is df['Sales]
# You may want to put the : to indicate that you want all columns
# It is more explicit 
df.loc[df['Sales'] > 3000, :]
In [ ]:
# We combine multiple conditions using the & operator
# E.g. all orders having 2000 < Sales < 3000 and Profit > 100
df.loc[(df.Sales > 2000) & (df.Sales < 3000) & (df.Profit > 100), :]
In [ ]:
# The 'OR' operator is represented by a | (Note that 'or' doesn't work with pandas)
# E.g. all orders having 2000 < Sales  OR Profit > 100
df.loc[(df.Sales > 2000) | (df.Profit > 100), :]
In [ ]:
# E.g. all orders having 2000 < Sales < 3000 and Profit > 100
# Also, this time, you only need the Cust_id, Sales and Profit columns
df.loc[(df.Sales > 2000) & (df.Sales < 3000) & (df.Profit > 100), ['Cust_id', 'Sales', 'Profit']]
In [ ]:
# You can use the == and != operators 
df.loc[(df.Sales == 4233.15), :]
df.loc[(df.Sales != 1000), :]
In [ ]:
# You may want to select rows whose column value is in an iterable
# For instance, say a colleague gives you a list of customer_ids from a certain region

customers_in_bangalore = ['Cust_1798', 'Cust_1519', 'Cust_637', 'Cust_851']

# To get all the orders from these customers, use the isin() function
# It returns a boolean, which you can use to select rows
df.loc[df['Cust_id'].isin(customers_in_bangalore), :]

ML-Model DecisionTree Example-IncomePrediction

DecisionTree -- IncomePrediction Decision Tree: Income Prediction ¶ In this l...