You have a couple of options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
Now the data frame looks something like this:
0 1 2 3 4 5
0 0.520113 0.884000 1.260966 -0.236597 0.312972 -0.196281
1 -0.837552 NaN 0.143017 0.862355 0.346550 0.842952
2 -0.452595 NaN -0.420790 0.456215 1.203459 0.527425
3 0.317503 -0.917042 1.780938 -1.584102 0.432745 0.389797
4 -0.722852 1.704820 -0.113821 -1.466458 0.083002 0.011722
5 -0.622851 -0.251935 -1.498837 NaN 1.098323 0.273814
6 0.329585 0.075312 -0.690209 -3.807924 0.489317 -0.841368
7 -1.123433 -1.187496 1.868894 -2.046456 -0.949718 NaN
8 1.133880 -0.110447 0.050385 -1.158387 0.188222 NaN
9 -0.513741 1.196259 0.704537 0.982395 -0.585040 -1.693810
- Option 1: df.isnull().any().any() returns a single boolean value.
You already know isnull(), which returns a DataFrame like this:
0 1 2 3 4 5
0 False False False False False False
1 False True False False False False
2 False True False False False False
3 False False False False False False
4 False False False False False False
5 False False False True False False
6 False False False False False False
7 False False False False False True
8 False False False False False True
9 False False False False False False
If you make it df.isnull().any(), you can find just the columns that have NaN values:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
One more .any() will tell you if any of the above are True:
> df.isnull().any().any()
True
- Option 2: df.isnull().sum().sum() returns an integer giving the total number of NaN values.
This works the same way as .any().any() does: it first sums the number of NaN values in each column, then sums those per-column counts:
df.isnull().sum()
0 0
1 2
2 0
3 1
4 0
5 2
dtype: int64
Finally, to get the total number of NaN values in the DataFrame:
df.isnull().sum().sum()
5
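Both options can be wrapped into one small reusable check. The sketch below is only an illustration: the helper name nan_summary is an assumption (it is not part of pandas), and the seed is fixed purely so the example is reproducible.

```python
import numpy as np
import pandas as pd


def nan_summary(frame):
    """Return (has_nan, total_nan) for a DataFrame (hypothetical helper)."""
    mask = frame.isnull()
    has_nan = bool(mask.any().any())    # Option 1: is anything missing?
    total_nan = int(mask.sum().sum())   # Option 2: how many cells are missing?
    return has_nan, total_nan


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 6)))
# Same NaN layout as in the answer above
df.iloc[1:3, 1] = np.nan
df.iloc[5, 3] = np.nan
df.iloc[7:9, 5] = np.nan

has_nan, total_nan = nan_summary(df)
print(has_nan, total_nan)
```

Computing the mask once and reusing it avoids calling isnull() twice on a large frame.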
NaN stands for Not A Number and is one of the common ways to represent the missing value in the data. It is a special floating-point value and cannot be converted to any other type than float.
NaN value is one of the major problems in Data Analysis. It is very essential to deal with NaN in order to get the desired results.
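One reason dedicated methods such as isnull() are needed is that NaN does not compare equal to itself, so an ordinary equality test cannot find it. The following is a minimal sketch of that behaviour; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN is a float and is not equal to itself
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = (np.nan == np.nan)

# A column of whole numbers becomes float64 once it contains NaN
s = pd.Series([10, 15, np.nan])

# Comparing with == finds nothing, while isnull() finds the missing entry
found_by_eq = int((s == np.nan).sum())
found_by_isnull = int(s.isnull().sum())

print(nan_is_float, nan_equals_itself, s.dtype, found_by_eq, found_by_isnull)
```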
Check for NaN Value in Pandas DataFrame
The ways to check for NaN in Pandas DataFrame are as follows:
- Check for NaN with isnull().values.any() method
- Count the NaN Using isnull().sum() Method
- Check for NaN Using isnull().sum().any() Method
- Count the NaN Using isnull().sum().sum() Method
Method 1: Using isnull().values.any() method
Example:
Python3
import pandas as pd
import numpy as np

num = {'Integers': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan]}

df = pd.DataFrame(num, columns=['Integers'])

check_nan = df['Integers'].isnull().values.any()
print(check_nan)
Output:
True
It is also possible to get the exact positions where NaN values are present. We can do so by removing .values.any() from isnull().values.any() .
Python3
Output:
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7      True
8     False
9     False
10     True
Name: Integers, dtype: bool
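The code for this step is not shown above. A minimal sketch of what the description implies, reusing the same Integers column, might look like the following; the final step that lists the index labels is an extra illustration, not part of the original method.

```python
import numpy as np
import pandas as pd

num = {'Integers': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan]}
df = pd.DataFrame(num, columns=['Integers'])

# Boolean Series: True wherever the value is NaN
nan_mask = df['Integers'].isnull()
print(nan_mask)

# Index labels of the NaN entries (extra step, for illustration)
nan_positions = df.index[nan_mask].tolist()
print(nan_positions)
```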
Method 2: Using isnull().sum() Method
Example:
Python3
import pandas as pd
import numpy as np

num = {'Integers': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan]}

df = pd.DataFrame(num, columns=['Integers'])

count_nan = df['Integers'].isnull().sum()
print('Number of NaN values present: ' + str(count_nan))
Output:
Number of NaN values present: 3
Method 3: Using isnull().sum().any() Method
Example:
Python3
import pandas as pd
import numpy as np

nums = {'Integers_1': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan],
        'Integers_2': [np.nan, 21, 22, 23, np.nan, 24, 25, np.nan, 26, np.nan, np.nan]}

df = pd.DataFrame(nums, columns=['Integers_1', 'Integers_2'])

nan_in_df = df.isnull().sum().any()
print(nan_in_df)
Output:
True
To get the exact positions where NaN values are present, we can do so by removing .sum().any() from isnull().sum().any() .
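As a hedged illustration of locating NaN values across a whole DataFrame, the sketch below goes one step beyond the plain isnull() mask: it selects the rows that contain any NaN and lists each NaN cell as a (row label, column name) pair. The variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

nums = {'Integers_1': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan],
        'Integers_2': [np.nan, 21, 22, 23, np.nan, 24, 25, np.nan, 26, np.nan, np.nan]}
df = pd.DataFrame(nums, columns=['Integers_1', 'Integers_2'])

# Boolean DataFrame: True wherever a cell is NaN
mask = df.isnull()

# Rows that contain at least one NaN
rows_with_nan = df[mask.any(axis=1)]

# (row label, column name) pair for every NaN cell, in row-major order
row_idx, col_idx = np.where(mask.to_numpy())
nan_cells = [(df.index[r], df.columns[c]) for r, c in zip(row_idx, col_idx)]

print(rows_with_nan)
print(nan_cells)
```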
Method 4: Using isnull().sum().sum() Method
Example:
Python3
import pandas as pd
import numpy as np

nums = {'Integers_1': [10, 15, 30, 40, 55, np.nan, 75, np.nan, 90, 150, np.nan],
        'Integers_2': [np.nan, 21, 22, 23, np.nan, 24, 25, np.nan, 26, np.nan, np.nan]}

df = pd.DataFrame(nums, columns=['Integers_1', 'Integers_2'])

nan_in_df = df.isnull().sum().sum()
print('Number of NaN values present: ' + str(nan_in_df))
Output:
Number of NaN values present: 8
Last Updated :
30 Jan, 2023
Here are 4 ways to check for NaN in Pandas DataFrame:
(1) Check for NaN under a single DataFrame column:
df['your column name'].isnull().values.any()
(2) Count the NaN under a single DataFrame column:
df['your column name'].isnull().sum()
(3) Check for NaN under an entire DataFrame:
df.isnull().values.any()
(4) Count the NaN under an entire DataFrame:
df.isnull().sum().sum()
(1) Check for NaN under a single DataFrame column
In the following example, we’ll create a DataFrame with a set of numbers and 3 NaN values:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

print(df)
You’ll now see the DataFrame with the 3 NaN values:
set_of_numbers
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
6 6.0
7 7.0
8 NaN
9 8.0
10 9.0
11 10.0
12 NaN
You can then use the following template in order to check for NaN under a single DataFrame column:
df['your column name'].isnull().values.any()
For our example, the DataFrame column is ‘set_of_numbers.’
And so, the code to check whether a NaN value exists under the ‘set_of_numbers’ column is as follows:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

check_for_nan = df['set_of_numbers'].isnull().values.any()
print(check_for_nan)
Run the code, and you’ll get ‘True’ which confirms the existence of NaN values under the DataFrame column:
True
And if you want to get the actual breakdown of the instances where NaN values exist, then you may remove .values.any() from the code. So the complete syntax to get the breakdown would look as follows:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

check_for_nan = df['set_of_numbers'].isnull()
print(check_for_nan)
You’ll now see the 3 instances of the NaN values:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 False
10 False
11 False
12 True
Here is another approach where you can get all the instances where a NaN value exists:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

df.loc[df['set_of_numbers'].isnull(), 'value_is_NaN'] = 'Yes'
df.loc[df['set_of_numbers'].notnull(), 'value_is_NaN'] = 'No'

print(df)
You’ll now see a new column (called ‘value_is_NaN’), which indicates all the instances where a NaN value exists:
set_of_numbers value_is_NaN
0 1.0 No
1 2.0 No
2 3.0 No
3 4.0 No
4 5.0 No
5 NaN Yes
6 6.0 No
7 7.0 No
8 NaN Yes
9 8.0 No
10 9.0 No
11 10.0 No
12 NaN Yes
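The same flag column can also be built in a single step. The sketch below swaps the two .loc assignments for NumPy's np.where; this is an alternative technique, not the approach used above, and it should produce the same 'Yes'/'No' labels.

```python
import numpy as np
import pandas as pd

data = {'set_of_numbers': [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, 8, 9, 10, np.nan]}
df = pd.DataFrame(data)

# 'Yes' where the value is NaN, 'No' everywhere else
df['value_is_NaN'] = np.where(df['set_of_numbers'].isnull(), 'Yes', 'No')

print(df)
```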
(2) Count the NaN under a single DataFrame column
You can apply this syntax in order to count the NaN values under a single DataFrame column:
df['your column name'].isnull().sum()
Here is the syntax for our example:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

count_nan = df['set_of_numbers'].isnull().sum()
print('Count of NaN: ' + str(count_nan))
You’ll then get the count of 3 NaN values:
Count of NaN: 3
And here is another approach to get the count:
import pandas as pd
import numpy as np

data = {'set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan]}
df = pd.DataFrame(data)

df.loc[df['set_of_numbers'].isnull(), 'value_is_NaN'] = 'Yes'
df.loc[df['set_of_numbers'].notnull(), 'value_is_NaN'] = 'No'

count_nan = df.loc[df['value_is_NaN']=='Yes'].count()
print(count_nan)
As before, you’ll get the count of 3 instances of NaN values:
value_is_NaN 3
(3) Check for NaN under an entire DataFrame
Now let’s add a second column into the original DataFrame. This column would include another set of numbers with NaN values:
import pandas as pd
import numpy as np

data = {'first_set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan],
        'second_set_of_numbers': [11,12,np.nan,13,14,np.nan,15,16,np.nan,np.nan,17,np.nan,19]}
df = pd.DataFrame(data)

print(df)
Run the code, and you’ll get 8 instances of NaN values across the entire DataFrame:
first_set_of_numbers second_set_of_numbers
0 1.0 11.0
1 2.0 12.0
2 3.0 NaN
3 4.0 13.0
4 5.0 14.0
5 NaN NaN
6 6.0 15.0
7 7.0 16.0
8 NaN NaN
9 8.0 NaN
10 9.0 17.0
11 10.0 NaN
12 NaN 19.0
You can then apply this syntax in order to verify the existence of NaN values under the entire DataFrame:
df.isnull().values.any()
For our example:
import pandas as pd
import numpy as np

data = {'first_set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan],
        'second_set_of_numbers': [11,12,np.nan,13,14,np.nan,15,16,np.nan,np.nan,17,np.nan,19]}
df = pd.DataFrame(data)

check_nan_in_df = df.isnull().values.any()
print(check_nan_in_df)
Once you run the code, you’ll get ‘True’ which confirms the existence of NaN values in the DataFrame:
True
You can get a further breakdown by removing .values.any() from the code:
import pandas as pd
import numpy as np

data = {'first_set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan],
        'second_set_of_numbers': [11,12,np.nan,13,14,np.nan,15,16,np.nan,np.nan,17,np.nan,19]}
df = pd.DataFrame(data)

check_nan_in_df = df.isnull()
print(check_nan_in_df)
Here is the result of the breakdown:
first_set_of_numbers second_set_of_numbers
0 False False
1 False False
2 False True
3 False False
4 False False
5 True True
6 False False
7 False False
8 True True
9 False True
10 False False
11 False True
12 True False
(4) Count the NaN under an entire DataFrame
You may now use this template to count the NaN values under the entire DataFrame:
df.isnull().sum().sum()
Here is the code for our example:
import pandas as pd
import numpy as np

data = {'first_set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan],
        'second_set_of_numbers': [11,12,np.nan,13,14,np.nan,15,16,np.nan,np.nan,17,np.nan,19]}
df = pd.DataFrame(data)

count_nan_in_df = df.isnull().sum().sum()
print('Count of NaN: ' + str(count_nan_in_df))
You’ll then get the total count of 8:
Count of NaN: 8
And if you want to get the count of NaN by column, then you may use the following code:
import pandas as pd
import numpy as np

data = {'first_set_of_numbers': [1,2,3,4,5,np.nan,6,7,np.nan,8,9,10,np.nan],
        'second_set_of_numbers': [11,12,np.nan,13,14,np.nan,15,16,np.nan,np.nan,17,np.nan,19]}
df = pd.DataFrame(data)

count_nan_in_df = df.isnull().sum()
print(count_nan_in_df)
And here is the result:
first_set_of_numbers 3
second_set_of_numbers 5
You just saw how to check for NaN in Pandas DataFrame. Alternatively you may:
- Drop Rows with NaN Values in Pandas DataFrame
- Replace NaN Values with Zeros
- Create NaN Values in Pandas DataFrame
By using the isnull().values.any() method you can check if a pandas DataFrame contains NaN/None values in any cell (all rows and columns). This method returns True if it finds NaN/None in any cell of a DataFrame, and False otherwise. In this article, I will explain how to check if any value is NaN in a pandas DataFrame.
NaN stands for Not A Number and is one of the common ways to represent a missing value in data. NaN values are a major problem in data analysis because operations on them can have side effects, so it is best practice to check whether a DataFrame has any missing data and replace it with values that make sense, for example an empty string or numeric zero.
1. Quick Examples of Check If any Value is NaN
If you are in a hurry, below are some quick examples of how to check if any value is nan in a pandas DataFrame.
# Below are some quick examples
# Checking NaN on entire DataFrame
value = df.isnull().values.any()
# Checking on Single Column
value = df['Fee'].isnull().values.any()
# Checking on multiple columns
value = df[['Fee','Duration']].isnull().values.any()
# Count NaN per column on entire DataFrame
result = df.isnull().sum()
# Count NaN on single column of DataFrame
result = df['Fee'].isnull().sum()
# Count NaN on selected columns of DataFrame
result = df[['Fee','Duration']].isnull().sum()
# Get Total Count of all Columns
count = df.isnull().sum().sum()
print('Number of NaN values present:' +str(count))
Now, let’s create a DataFrame with a few rows and columns, execute some examples, and validate the output. Our DataFrame contains the columns Courses, Fee, Duration, and Discount, with some NaN values.
# Create Sample DataFrame
import pandas as pd
import numpy as np
technologies = ({
'Courses':["Spark","Java","Hadoop","Python","pandas"],
'Fee' :[20000,np.nan,26000,np.nan,24000],
'Duration':['30days',np.nan,'35days','40days',np.nan],
'Discount':[1000,np.nan,2500,2100,np.nan]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
Courses Fee Duration Discount
0 Spark 20000.0 30days 1000.0
1 Java NaN NaN NaN
2 Hadoop 26000.0 35days 2500.0
3 Python NaN 40days 2100.0
4 pandas 24000.0 NaN NaN
Use the DataFrame.isnull().values.any() method to check if there is any missing data in a pandas DataFrame; missing data is represented as NaN or None values. When your data contains NaN or None, this method returns the boolean value True; otherwise it returns False. After identifying the columns with NaN, sometimes you may want to replace NaN with zero or with a blank or empty string.
# Check across all cells for NaN values
value = df.isnull().values.any()
print(value)
# Outputs: True
The above example checks all columns and returns True when it finds at least a single NaN/None value.
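To illustrate that None is treated as missing in the same way as NaN, here is a minimal sketch. The small DataFrame is hypothetical and separate from the sample data used in the rest of this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing None and np.nan
df = pd.DataFrame({
    'Courses': ['Spark', None, 'Hadoop'],
    'Fee': [20000, np.nan, 26000],
})

# isnull() flags both None and NaN as missing
mask = df.isnull()
any_missing = bool(mask.values.any())
missing_per_column = mask.sum().to_dict()

print(any_missing, missing_per_column)
```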
3. Check for NaN Values on Selected Columns
If you want to check whether NaN values exist in selected columns (single or multiple), first select the columns and then run the same method.
# Checking on Single Column
value = df['Fee'].isnull().values.any()
print(value)
# Outputs: True
# Checking on Single Column
value = df['Courses'].isnull().values.any()
print(value)
# Outputs: False
# Checking on multiple columns
value = df[['Fee','Duration']].isnull().values.any()
print(value)
# Outputs: True
3. Using DataFrame.isnull() Method
DataFrame.isnull() checks whether the value in each cell is missing: it returns True for cells holding NaN/None and False for all other cells.
# Using DataFrame.isnull() method
df2 = df['Fee'].isnull()
print(df2)
Yields below output.
0 False
1 True
2 False
3 True
4 False
Name: Fee, dtype: bool
4. Count the NaN Values on Single or Multiple DataFrame Columns
You can also count the NaN/None values present in the entire DataFrame, single or multiple columns.
# Count NaN per column on entire DataFrame
result = df.isnull().sum()
print(result)
# Outputs
#Courses 0
#Fee 2
#Duration 2
#Discount 2
#dtype: int64
# Count NaN on single column of DataFrame
result = df['Fee'].isnull().sum()
print(result)
# Outputs
#2
# Count NaN on selected columns of DataFrame
result = df[['Fee','Duration']].isnull().sum()
print(result)
# Outputs
#Fee 2
#Duration 2
#dtype: int64
Note that when you use sum() on multiple columns or on the entire DataFrame, it returns the count of NaN values for each column.
5. Total Count NaN Values on Entire DataFrame
To get the combined total count of NaN values, use isnull().sum().sum() on the DataFrame. The example below returns the total count of NaN values from all columns.
# To get the Count
count = df.isnull().sum().sum()
print('Number of NaN values present:' +str(count))
Yields below output.
Number of NaN values present:6
6. Complete Example to Check If Any Value Is NaN
Below is the complete example of how to check if any value is NaN in pandas DataFrame.
import pandas as pd
import numpy as np
technologies = ({
'Courses':["Spark","Java","Hadoop","Python","pandas"],
'Fee' :[20000,np.nan,26000,np.nan,24000],
'Duration':['30days',np.nan,'35days','40days',np.nan],
'Discount':[1000,np.nan,2500,2100,np.nan]
})
df = pd.DataFrame(technologies)
print(df)
# Checking NaN on entire DataFrame
value = df.isnull().values.any()
print(value)
# Checking on Single Column
value = df['Fee'].isnull().values.any()
print(value)
# Checking on Single Column
value = df['Courses'].isnull().values.any()
print(value)
# Checking on multiple columns
value = df[['Fee','Duration']].isnull().values.any()
print(value)
# Using DataFrame.isnull() method
df2 = df['Fee'].isnull()
print(df2)
# Count NaN per column on entire DataFrame
result = df.isnull().sum()
print(result)
# Count NaN on single column of DataFrame
result = df['Fee'].isnull().sum()
print(result)
# Count NaN on selected columns of DataFrame
result = df[['Fee','Duration']].isnull().sum()
print(result)
# To get the Count
count = df.isnull().sum().sum()
print('Number of NaN values present:' +str(count))
Conclusion
In this article, you have learned how to check if any value is NaN in the entire pandas DataFrame, or in a single column or multiple columns, using the DataFrame.isnull().values.any() and DataFrame.isnull().sum() methods. You have also learned how to get the total count of NaN values using the DataFrame.isnull().sum().sum() method.
References
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html
In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful, given that it is common to find datasets with missing or incorrect data values.
I will be using the numpy package to generate some data with NaN values.
Import necessary packages
import pandas as pd
import numpy as np
import platform
print(f'Python version: {platform.python_version()} ({platform.python_implementation()})')
print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')
Python version: 3.6.4 (CPython)
Pandas version: 0.23.1
Numpy version: 1.14.5
Generate data with NaN values
num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431) # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
0.44499302 -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 -1.10140989 0.84918019 0.72802961 -0.35161197
0.73519152 1.13145412 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 -0.09319255 -1.07837721 -0.76053341
0.73622083 -0.45518154 -0.69194032 1.02550409]
[-1.96339975 0.07593331 -0.16798377 -1.20398958 0.88333656 1.17908422
0.26324698 -2.65442248 -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 1.15043436 -0.40120489 0.00664105 -1.23247563
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 1.11158059 -0.82155143 0.88900313 2.16943761 -2.05250161
2.40156233 0.92453867 -0.24437783 -2.91029265]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 0.89859677]
[-0.07263982 -0.56840867 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 -1.54522486 -0.07891861]
[-1.96361682 -1.06315325 -0.45582138 -0.74566868 1.27579529 -2.46306005
0.57022673 -0.02793746 0.78652775 1.27690195]]
# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
nan -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 nan 0.84918019 nan -0.35161197
0.73519152 nan 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 nan -1.07837721 -0.76053341
0.73622083 nan -0.69194032 1.02550409]
[-1.96339975 0.07593331 nan -1.20398958 0.88333656 nan
0.26324698 nan -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 nan -0.40120489 0.00664105 nan
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 nan -0.82155143 0.88900313 2.16943761 nan
2.40156233 nan -0.24437783 nan]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 nan]
[-0.07263982 nan 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 nan nan]
[ nan nan nan nan 1.27579529 -2.46306005
nan nan 0.78652775 1.27690195]]
# Create a DataFrame from the generated data
df = pd.DataFrame(A)
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | NaN | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | NaN | 0.849180 | NaN | -0.351612 | 0.735192 | NaN | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | NaN | -1.078377 | -0.760533 | 0.736221 | NaN | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | NaN | -1.203990 | 0.883337 | NaN | 0.263247 | NaN | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | NaN | -0.401205 | 0.006641 | NaN | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | NaN | -0.821551 | 0.889003 | 2.169438 | NaN | 2.401562 | NaN | -0.244378 | NaN |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | NaN |
8 | -0.072640 | NaN | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | NaN | NaN |
9 | NaN | NaN | NaN | NaN | 1.275795 | -2.463060 | NaN | NaN | 0.786528 | 1.276902 |
Check for NaN values
Now that we have some data to operate on let’s see the different ways we can check for missing values.
There are two methods of the DataFrame object that can be used: DataFrame#isna() and DataFrame#isnull(). The source code shows that isnull() is simply an alias for isna(). To keep it simple I will only use the isna() method, as we would get the same result using isnull().
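A quick way to convince ourselves that the two methods agree is to compare their results directly. This is just a small sketch on a hypothetical two-column frame, not the data generated above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in each column
small = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})

# Both methods should produce identical boolean masks
same_mask = small.isna().equals(small.isnull())
print(same_mask)
```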
df.isna()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | True | False | False | False |
1 | False | False | True | False | True | False | False | True | False | False |
2 | False | False | False | True | False | False | False | True | False | False |
3 | False | False | True | False | False | True | False | True | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
5 | False | False | True | False | False | True | False | False | False | False |
6 | False | True | False | False | False | True | False | True | False | True |
7 | False | False | False | False | False | False | False | False | False | True |
8 | False | True | False | False | False | False | False | False | True | True |
9 | True | True | True | True | False | False | True | True | False | False |
As can be seen above, the isna() method returns a DataFrame with boolean values, where True indicates NaN values and False otherwise.
If we wanted to know how many missing values there are in each column or each row we could use the DataFrame#sum() method:
df.isna().sum(axis='rows') # 'rows' or 0; sums down the rows, one count per column
0 1
1 3
2 4
3 2
4 1
5 3
6 2
7 5
8 1
9 3
dtype: int64
df.isna().sum(axis='columns') # 'columns' or 1; sums across the columns, one count per row
0 1
1 3
2 2
3 3
4 0
5 2
6 4
7 1
8 3
9 6
dtype: int64
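The axis argument is easy to mix up, because axis=0 collapses the rows and therefore yields one value per column. Here is a tiny sketch on a hypothetical frame that is small enough to check by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' has 2 NaN values, column 'b' has 1
small = pd.DataFrame({'a': [np.nan, 1.0, np.nan], 'b': [2.0, np.nan, 3.0]})

# axis=0 collapses the rows: one count per column
per_column = small.isna().sum(axis=0)

# axis=1 collapses the columns: one count per row
per_row = small.isna().sum(axis=1)

print(per_column.to_dict())
print(per_row.to_dict())
```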
To simply know the total number of missing values we can call sum() again:
df.isna().sum().sum()
25
If we only want to know whether there is any missing value, regardless of how many, we can use the any() method:
df.isna().any() # can also receive axis='rows' or 'columns'
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
Calling it again we have a single boolean output:
df.isna().any().any()
True
Besides the isna() method we also have the notna() method, which is its boolean inverse. With it we can get the number of values that are not missing, or check whether no value is missing (using the all() method instead of any()).
print(df.notna().sum().sum()) # not missing
print(df.notna().all().all())
75
False
Note 1: the examples used the DataFrame methods to check for missing values, but the pandas package also has top-level functions with the same purpose that can be applied to other objects. Example:
print(pd.isna([1, 2, np.nan]))
print(pd.notna([1, 2, np.nan]))
[False False True]
[ True True False]
Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.
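As a small illustration of Note 2, the sketch below calls isna() on a Series and on an Index. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
idx = pd.Index([1.0, np.nan, 3.0])

# Series.isna() returns a boolean Series
series_mask = s.isna()

# Index.isna() returns a boolean NumPy array
index_mask = idx.isna()

print(series_mask.tolist())
print(index_mask.tolist())
```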
Time comparison
Comparing the time taken by the two methods we can see that using any() is faster, but sum() gives us the additional information of how many missing values there are.
%timeit df.isna().any().any()
333 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.isna().sum().sum()
561 µs ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
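The %timeit magic only works inside IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard timeit module. The data here is regenerated with a different random generator than the one used earlier, so the NaN positions differ, and the timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
# Set 25 randomly chosen cells to NaN
A.ravel()[rng.choice(A.size, 25, replace=False)] = np.nan
df = pd.DataFrame(A)

# Average time per call over 200 calls
n_calls = 200
t_any = timeit.timeit(lambda: df.isna().any().any(), number=n_calls) / n_calls
t_sum = timeit.timeit(lambda: df.isna().sum().sum(), number=n_calls) / n_calls

print(f'any().any(): {t_any * 1e6:.1f} µs per call')
print(f'sum().sum(): {t_sum * 1e6:.1f} µs per call')
```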
Dealing with missing values
Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna() and fillna() methods.
By default, the dropna() method returns a DataFrame without the rows that contain any missing value (pass axis='columns' to drop columns instead).
df.dropna()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
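Dropping every row that has a single NaN can throw away a lot of data, as the one surviving row above shows. The dropna() method accepts a few parameters to control this; below is a minimal sketch on a hypothetical frame using the axis and thresh parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is complete, 'b' has 1 NaN, 'c' has 2 NaN values
small = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, np.nan, 6.0],
    'c': [7.0, np.nan, np.nan],
})

# Default: drop every row that contains at least one NaN
rows_dropped = small.dropna()

# Drop columns (instead of rows) that contain at least one NaN
cols_dropped = small.dropna(axis='columns')

# Keep only the rows with at least 2 non-missing values
at_least_two = small.dropna(thresh=2)

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(at_least_two.index.tolist())
```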
The fillna() method will return a DataFrame with the missing values filled with a specified value.
df.fillna(value=5)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | 5.000000 | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | 5.000000 | 0.849180 | 5.000000 | -0.351612 | 0.735192 | 5.000000 | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | 5.000000 | -1.078377 | -0.760533 | 0.736221 | 5.000000 | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | 5.000000 | -1.203990 | 0.883337 | 5.000000 | 0.263247 | 5.000000 | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | 5.000000 | -0.401205 | 0.006641 | 5.000000 | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | 5.000000 | -0.821551 | 0.889003 | 2.169438 | 5.000000 | 2.401562 | 5.000000 | -0.244378 | 5.000000 |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | 5.000000 |
8 | -0.072640 | 5.000000 | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | 5.000000 | 5.000000 |
9 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 1.275795 | -2.463060 | 5.000000 | 5.000000 | 0.786528 | 1.276902 |
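A single constant is rarely the right fill value for every column. As a sketch under assumed data (the column names price and label are hypothetical), fillna() can also take a dict that maps each column to its own replacement, or a computed statistic such as the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a numeric and a text column
small = pd.DataFrame({
    'price': [20000, np.nan, 26000],
    'label': ['x', np.nan, 'z'],
})

# A dict maps each column name to its own replacement value
filled = small.fillna({'price': 0, 'label': ''})

# Alternatively, fill a numeric column with its own mean
price_mean_filled = small['price'].fillna(small['price'].mean())

print(filled)
print(price_mean_filled.tolist())
```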
References:
- Create sample numpy array with randomly placed NaNs (StackOverflow)
- How to check if any value is NaN in a Pandas DataFrame (StackOverflow)
- pandas.isnull
- pandas.isna
- pandas.notna
- pandas.DataFrame.dropna
- pandas.DataFrame.fillna