A powerful data sorting method called logical indexing can be a great way to clean up huge datasets. Programming languages Octave/MATLAB, python, and R to name a few all are capable of logical indexing.

All the rules of booleans apply to logical indexing, such as stringing conditionals and, or, nand, nor, etc

To get an idea of what I'm talking about, let's do a quick example. First let's generate an array of random numbers, and then sort for the numbers less than 0.5 and greater than 0.1

In [1]:
# import python function random from the numpy library
from numpy import random
In [2]:
# generate 10 random numbers
randnums = random.random(10)
array([ 0.08547725,  0.27999385,  0.74051157,  0.77153143,  0.50921916,
        0.02230907,  0.63186024,  0.89142272,  0.0963861 ,  0.66311151])
In [3]:
# generate the boolnea logical index of the numbers that meet our criteria
ind = (randnums < 0.5) & (randnums > 0.1)
array([False,  True, False, False, False, False, False, False, False, False], dtype=bool)
In [4]:
# return an array of numbers with our new criteria
newnumbers = randnums[ind]
array([ 0.27999385])

Now, this seems somewhat simple to do with a modern programming language such as python, but to this this in a spreadsheet where many of my day-to-day tasks are done is a little trickier. Since Excel doesn't have an explicit logical index function, similar functionality can be worked out, at some cost of clarity that a code solution has.

Since everything in Excel (lets forget about VBA scripting for this tutorial) is represented in the spreadsheet and functions or values, creating arrays must be done as a new column. With python, we can either create a 2D array or have multiple variables. Creating new columns in Excel is fine, but sometimes can be difficult to understand the sorting logic. For simple analysis and data storage, spreadsheets are great, but as analysis methods become more complex (digital signal processing, mathematics, machine learning, etc) and datasets become large (gigabytes), spreadsheets just don't cut it.

This tutorial is a demo of logical indexing in python using the pandas module and comparing to the simpliest solution I could figure out in excel, which isn't really logical index, but more of a multi-criteria function search. This is similar to vlookup, which is a handy Excel function, but is limited to a single return value as opposed to a logical index array. The approach here in Excel allows for multiple criteria to be met, but alas, only one return value. This still offers substantially more functionality than vlookup.

To get python, I'd recommend using the Anaconda Python 3.5 distribution, although many others will work fine. Here is the silly example data that I will be manipulating.

Here is the spreadsheet that will be my example.

Since python is a general programming language, we need to import the libraries we need. xlwings is an awesome library!

In [5]:
import xlwings as xw
from pandas import DataFrame

Next, we will use the xlwings library to load the Excel file into a dataframe where we will employ our logical indexing methods of data sorting

In [6]:
# load excel file and turn into a pandas dataframe
data = xw.sheets('Sheet1').range("C6:E15").value
df = DataFrame(data[1:],columns=data[0])
Color Animal Awesomeness
0 green dog 75.0
1 blue dog 95.0
2 black dog 62.0
3 green dog 23.0
4 blue cat 90.0
5 black cat 39.0
6 green cat 22.0
7 blue cat 59.0
8 black cat 12.0

As a warm up, lets use a logical index to find all green items, and show only those values whose color is green

In [7]:
i = df['Color']=='green'
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
Name: Color, dtype: bool
In [8]:
# here are the filtered results from the dataframe
Color Animal Awesomeness
0 green dog 75.0
3 green dog 23.0
6 green cat 22.0

The advantage here is that using logical indexing filtering methods can easily be automated. For our next task, we will apply the logical values directly into the dataframe instead of making a index variable. This will index of all awesomeness levels greater than 40

In [9]:
df[df.Awesomeness > 40]
Color Animal Awesomeness
0 green dog 75.0
1 blue dog 95.0
2 black dog 62.0
4 blue cat 90.0
7 blue cat 59.0

Other handy features with logical indexing is we can sort by what the cells start with, in this case any color that starts with 'b'

In [10]:
Color Animal Awesomeness
1 blue dog 95.0
2 black dog 62.0
4 blue cat 90.0
5 black cat 39.0
7 blue cat 59.0
8 black cat 12.0

For our final task as shown in a similar fashion in excel, we need to find all colors with green and animals that are dogs. This is the syntax, which is not completley obvious, but once you have it, it is your's to use. Once we have our new dataset, we can choose the first row or index and return the same value that our Excel spreadsheet did

In [11]:
df1 = df[(df.Color == 'green') & (df.Animal == 'dog') ]
Color Animal Awesomeness
0 green dog 75.0
3 green dog 23.0
In [12]:
# and use the first index 0 to return the first value

This was a simple intro into python's logical indexing functionality which cannot be easily replicated in Excel, but this shows that the index and match function in Excel can get you some of the way there.

Excel is great for some things, python is great for others. It is handy to know what tools are out there and when to use which one. Some other great resources for using python and Excel are the book automatetheboringstuff and a greate blog Practical Business Python

Stay curious!


comments powered by Disqus