One of the coolest tricks in Python when you want to search multiple conditions
is the clever use of in
operator
Let’s say you want to check if the user agrees with your proposal or not. May be you ask for the user’s opinion ( or may be read it from the frontend or a file ) and write some code like this
>>> answers = [ "Yes", "Yeah", "Nods"]
>>> opinion = input("Will you marry me? ")
Will you marry me? Nods
>>>
>>> print("Hurray !!!" if opinion in answers else "Try again later ...")
Hurray !!!
>>>
All that is good with simple strings or integers but how can we do something similar in a Pandas DataFrame?
Problem Statement
Consider this dataset that contains details about when a particular student submitted their exercises in my Course
The values in the table contain the date a particular student submitted any given exercise
>>> import pandas as pd
>>> df = pd.read_csv('exercises_table.csv')
>>> df.set_index('Name', inplace=True)
>>> df
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Name | ||||
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | NaN |
Girish | 17-Jan-2023 | NaN | 15-Jan-2023 | NaN |
Mohan | 19-Jan-2023 | NaN | NaN | NaN |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
From the above information, how can we find out which students submitted which exercises on 18-Jan-2023 and 19-Jan-2023 ?
Fixing the Row index name
Before we jump to the solution let’s first fix the Row index name of the DataFrame. One neat trick ( more of a hack though ) to remove the name for the Row index called as “Name” is to set it’s name attribute to None
>>> df.index.name = None
>>> df
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | NaN |
Girish | 17-Jan-2023 | NaN | 15-Jan-2023 | NaN |
Mohan | 19-Jan-2023 | NaN | NaN | NaN |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
Now that the table is in a better shape, let’s move on to the solution
Solution
Let’s start by defining the required dates in a list imp_dates
, similar to
the answers
variable we used earlier
>>> imp_dates = ['18-Jan-2023', '19-Jan-2023']
Now, we need to check which of the values in df
contain any of the dates
present in imp_dates
. This can be achieved using isin
method of DataFrame.
This does an elementwise comparison as was done in the earlier Python code at
the beginning of this post. And all elementwise operations can be run
parallelly, so this should be a fast operation.
The isin
method returns a boolean DataFrame with cells that contain any of
our values of interest ( any date present in imp_dates
) having True
and
the rest as False
>>> df.isin(imp_dates)
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | False | False | True | False |
Chetan | False | True | True | False |
Girish | False | False | False | False |
Mohan | True | False | False | False |
Vishal | False | False | False | True |
Now, to filter the rows that contain our dates of interest, we use the values present in the Boolean DataFrame as a filter over the DataFrame.
But, the filter needs to be a Series object which contains the row labels containing our dates of interest.
In the above boolean DataFrame, the row lables that contain True would be “Ajay”, “Chetan”, “Mohan” and “Vishal”
To get the required Series, we can sum up the values along each row.
Any row that contains atleast one True value would sum up into a value greater than 0
Or, in other words, if a row has only False values, then its sum would be 0
We can sum up all the boolean values along each row using sum(axis=1)
Note: Using sum()
by default sums the boolean values along each column
>>> df.isin(imp_dates).sum(axis=1)
Ajay 1
Chetan 2
Girish 0
Mohan 1
Vishal 1
dtype: int64
>>>
We are interested in only those rows that contain atleast one True value. So, the sum value for such rows would be > 0
Let’s create a mask of the boolean Series for the above comparison
>>> mask_1819 = df.isin(imp_dates).sum(axis=1) > 0
>>> mask_1819
Ajay True
Chetan True
Girish False
Mohan True
Vishal True
dtype: bool
>>>
Now, let’s use the mask to filter the values in the DataFrame
>>> df[mask_1819]
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | NaN |
Mohan | 19-Jan-2023 | NaN | NaN | NaN |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
And that is our expected result.
Compare the result with the highlighted values shown below from the original DataFrame
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | nan |
Girish | 17-Jan-2023 | nan | 15-Jan-2023 | nan |
Mohan | 19-Jan-2023 | nan | nan | nan |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
Wait a minute. How did we get those nicely colored cells above?
Well, this is more of a digression from the topic we are discussing here. But, it’s a good one I guess :-)
Highlighting specific cells
For highlighting specific rows or columns or cells in a Pandas Dataframe, we
use it’s style
property
One handy method on the style
property is highlight_null
. As the name
suggests, it is used to highlight empty or NaN
entries in a DataFrame. This
comes in pretty handy when we want to visually spot the empty cells.
Following code explains how highlight_null
can be used to highlight empty
cells with a “lightsalmon” color
>>> df.style.highlight_null('lightsalmon')
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | nan |
Girish | 17-Jan-2023 | nan | 15-Jan-2023 | nan |
Mohan | 19-Jan-2023 | nan | nan | nan |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
Here, we have used the color name as “lightsalmon”. You can use any other color name as well. You can find the list of the colors that you can provide at this stackoverflow answer
Unfortunately, there is no direct method like highlight_null
that can be used
to highlight specific cells. To be able to highlight specific cells based on a
condition, we need to tweak the underlying html properties as mentioned in
this stackoverflow answer
We use applymap
method to element-wise apply these CSS style changes
We use a lambda
function to decide if the CSS style changes are needed, based
on the comparison of the cell value with the values present in imp_date
list
>>> (df
... .style
... .applymap(lambda x: 'background-color:bisque'
... if x in imp_dates else None)
... )
>>>
Ex01_Date | Ex02_Date | Ex03_Date | Ex04_Date | |
---|---|---|---|---|
Ajay | 12-Jan-2023 | 14-Jan-2023 | 18-Jan-2023 | 16-Jan-2023 |
Chetan | 17-Jan-2023 | 18-Jan-2023 | 19-Jan-2023 | nan |
Girish | 17-Jan-2023 | nan | 15-Jan-2023 | nan |
Mohan | 19-Jan-2023 | nan | nan | nan |
Vishal | 14-Jan-2023 | 15-Jan-2023 | 17-Jan-2023 | 18-Jan-2023 |
Discussion
Start a discussion on GitHub about this article and let me know your thoughts or suggestions.