Skip to the content.
Home | Blog | Book Reading |About | All Courses

Pandas: Filter Rows based on cell values


One of the coolest tricks in Python when you want to search multiple conditions is the clever use of in operator

Let’s say you want to check if the user agrees with your proposal or not. May be you ask for the user’s opinion ( or may be read it from the frontend or a file ) and write some code like this

>>> answers = [ "Yes", "Yeah", "Nods"]
>>> opinion = input("Will you marry me? ")
Will you marry me? Nods
>>>
>>> print("Hurray !!!" if opinion in answers else "Try again later ...")
Hurray !!!
>>>

All that is good with simple strings or integers but how can we do something similar in a Pandas DataFrame?

Problem Statement

Consider this dataset that contains details about when a particular student submitted their exercises in my Course

The values in the table contain the date a particular student submitted any given exercise

>>> import pandas as pd
>>> df = pd.read_csv('exercises_table.csv')
>>> df.set_index('Name', inplace=True)
>>> df
Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Name
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 NaN
Girish 17-Jan-2023 NaN 15-Jan-2023 NaN
Mohan 19-Jan-2023 NaN NaN NaN
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

From the above information, how can we find out which students submitted which exercises on 18-Jan-2023 and 19-Jan-2023 ?

Fixing the Row index name

Before we jump to the solution let’s first fix the Row index name of the DataFrame. One neat trick ( more of a hack though ) to remove the name for the Row index called as “Name” is to set it’s name attribute to None

>>> df.index.name = None
>>> df
Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 NaN
Girish 17-Jan-2023 NaN 15-Jan-2023 NaN
Mohan 19-Jan-2023 NaN NaN NaN
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

Now that the table is in a better shape, let’s move on to the solution

Solution

Let’s start by defining the required dates in a list imp_dates, similar to the answers variable we used earlier

>>> imp_dates = ['18-Jan-2023', '19-Jan-2023']

Now, we need to check which of the values in df contain any of the dates present in imp_dates. This can be achieved using isin method of DataFrame. This does an elementwise comparison as was done in the earlier Python code at the beginning of this post. And all elementwise operations can be run parallelly, so this should be a fast operation.

The isin method returns a boolean DataFrame with cells that contain any of our values of interest ( any date present in imp_dates ) having True and the rest as False

>>> df.isin(imp_dates)
Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay False False True False
Chetan False True True False
Girish False False False False
Mohan True False False False
Vishal False False False True

Now, to filter the rows that contain our dates of interest, we use the values present in the Boolean DataFrame as a filter over the DataFrame.

But, the filter needs to be a Series object which contains the row labels containing our dates of interest.

In the above boolean DataFrame, the row lables that contain True would be “Ajay”, “Chetan”, “Mohan” and “Vishal”

To get the required Series, we can sum up the values along each row.

Any row that contains atleast one True value would sum up into a value greater than 0

Or, in other words, if a row has only False values, then its sum would be 0

We can sum up all the boolean values along each row using sum(axis=1)

Note: Using sum() by default sums the boolean values along each column

>>> df.isin(imp_dates).sum(axis=1)

    Ajay      1
    Chetan    2
    Girish    0
    Mohan     1
    Vishal    1
    dtype: int64
>>> 

We are interested in only those rows that contain atleast one True value. So, the sum value for such rows would be > 0

Let’s create a mask of the boolean Series for the above comparison

>>> mask_1819 = df.isin(imp_dates).sum(axis=1) > 0
>>> mask_1819

    Ajay       True
    Chetan     True
    Girish    False
    Mohan      True
    Vishal     True
    dtype: bool
>>> 

Now, let’s use the mask to filter the values in the DataFrame

>>> df[mask_1819]
Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 NaN
Mohan 19-Jan-2023 NaN NaN NaN
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

And that is our expected result.

Compare the result with the highlighted values shown below from the original DataFrame

  Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 nan
Girish 17-Jan-2023 nan 15-Jan-2023 nan
Mohan 19-Jan-2023 nan nan nan
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

Wait a minute. How did we get those nicely colored cells above?

Well, this is more of a digression from the topic we are discussing here. But, it’s a good one I guess :-)

Highlighting specific cells

For highlighting specific rows or columns or cells in a Pandas Dataframe, we use it’s style property

One handy method on the style property is highlight_null. As the name suggests, it is used to highlight empty or NaN entries in a DataFrame. This comes in pretty handy when we want to visually spot the empty cells.

Following code explains how highlight_null can be used to highlight empty cells with a “lightsalmon” color

>>> df.style.highlight_null('lightsalmon')
  Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 nan
Girish 17-Jan-2023 nan 15-Jan-2023 nan
Mohan 19-Jan-2023 nan nan nan
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

Here, we have used the color name as “lightsalmon”. You can use any other color name as well. You can find the list of the colors that you can provide at this stackoverflow answer

Unfortunately, there is no direct method like highlight_null that can be used to highlight specific cells. To be able to highlight specific cells based on a condition, we need to tweak the underlying html properties as mentioned in this stackoverflow answer

We use applymap method to element-wise apply these CSS style changes

We use a lambda function to decide if the CSS style changes are needed, based on the comparison of the cell value with the values present in imp_date list

>>> (df
...   .style
...   .applymap(lambda x: 'background-color:bisque' 
...             if x in imp_dates else None)
...   )
>>> 
  Ex01_Date Ex02_Date Ex03_Date Ex04_Date
Ajay 12-Jan-2023 14-Jan-2023 18-Jan-2023 16-Jan-2023
Chetan 17-Jan-2023 18-Jan-2023 19-Jan-2023 nan
Girish 17-Jan-2023 nan 15-Jan-2023 nan
Mohan 19-Jan-2023 nan nan nan
Vishal 14-Jan-2023 15-Jan-2023 17-Jan-2023 18-Jan-2023

Discussion

Start a discussion on GitHub about this article and let me know your thoughts or suggestions.