A powerful data sorting method called logical indexing can be a great way to clean up huge datasets. Programming languages Octave/MATLAB, python, and R to name a few all are capable of logical indexing.
All the rules of booleans apply to logical indexing, such as stringing conditionals and, or, nand, nor
, etc
To get an idea of what I'm talking about, let's do a quick example. First let's generate an array of random numbers, and then sort for the numbers less than 0.5 and greater than 0.1
# import python function random from the numpy library
from numpy import random
# generate 10 random numbers
randnums = random.random(10)
randnums
# generate the boolnea logical index of the numbers that meet our criteria
ind = (randnums < 0.5) & (randnums > 0.1)
ind
# return an array of numbers with our new criteria
newnumbers = randnums[ind]
newnumbers
Now, this seems somewhat simple to do with a modern programming language such as python, but to this this in a spreadsheet where many of my day-to-day tasks are done is a little trickier. Since Excel doesn't have an explicit logical index function, similar functionality can be worked out, at some cost of clarity that a code solution has.
Since everything in Excel (lets forget about VBA scripting for this tutorial) is represented in the spreadsheet and functions or values, creating arrays must be done as a new column. With python, we can either create a 2D array or have multiple variables. Creating new columns in Excel is fine, but sometimes can be difficult to understand the sorting logic. For simple analysis and data storage, spreadsheets are great, but as analysis methods become more complex (digital signal processing, mathematics, machine learning, etc) and datasets become large (gigabytes), spreadsheets just don't cut it.
This tutorial is a demo of logical indexing in python using the pandas module and comparing to the simpliest solution I could figure out in excel, which isn't really logical index, but more of a multi-criteria function search. This is similar to vlookup, which is a handy Excel function, but is limited to a single return value as opposed to a logical index array. The approach here in Excel allows for multiple criteria to be met, but alas, only one return value. This still offers substantially more functionality than vlookup.
To get python, I'd recommend using the Anaconda Python 3.5 distribution, although many others will work fine. Here is the silly example data that I will be manipulating.
Here is the spreadsheet that will be my example.
Since python is a general programming language, we need to import the libraries we need. xlwings is an awesome library!
import xlwings as xw
from pandas import DataFrame
Next, we will use the xlwings library to load the Excel file into a dataframe where we will employ our logical indexing methods of data sorting
# load excel file and turn into a pandas dataframe
xw.Book(r'excel_logical_indexing.xlsx')
data = xw.sheets('Sheet1').range("C6:E15").value
df = DataFrame(data[1:],columns=data[0])
df
As a warm up, lets use a logical index to find all green items, and show only those values whose color is green
i = df['Color']=='green'
i
# here are the filtered results from the dataframe
df[i]
The advantage here is that using logical indexing filtering methods can easily be automated. For our next task, we will apply the logical values directly into the dataframe instead of making a index variable. This will index of all awesomeness levels greater than 40
df[df.Awesomeness > 40]
Other handy features with logical indexing is we can sort by what the cells start with, in this case any color that starts with 'b'
df[df.Color.str.startswith('b')]
For our final task as shown in a similar fashion in excel, we need to find all colors with green and animals that are dogs. This is the syntax, which is not completley obvious, but once you have it, it is prednisone your's to use. Once we have our new dataset, we can choose the first row or index and return the same value that our Excel spreadsheet did
df1 = df[(df.Color == 'green') & (df.Animal == 'dog') ]
df1
# and use the first index 0 to return the first value
df1.Awesomeness.ix[0]
This was a simple intro into python's logical indexing functionality which cannot be easily replicated in Excel, but this shows that the index
and match
function in Excel can get you some of the way there.
Excel is great for some things, python is great for others. It is handy to know what tools are out there and when to use which one. Some other great resources for using python and Excel are the book automatetheboringstuff and a greate blog Practical Business Python
Stay curious!
Comments
comments powered by Disqus