Handling NaN values#
When analysing tabular data, some table cells may not contain data. In Python this typically means the value is Not a Number (NaN). We cannot assume these values are 0
or -1
or any other number, because that would distort descriptive statistics such as the mean. We need to treat these NaN entries differently, and this notebook introduces how.
To get a first view of where NaNs play a role, we again load an example table and sort it.
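To illustrate the distortion with hypothetical numbers: pandas skips NaNs when computing statistics, whereas substituting 0 shifts the result.
```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value
values = pd.Series([4.0, 6.0, np.nan, 8.0])

# pandas skips NaNs when computing statistics ...
mean_skipping_nan = values.mean()          # (4 + 6 + 8) / 3 = 6.0

# ... whereas replacing NaNs with 0 distorts the result
mean_with_zero = values.fillna(0).mean()   # (4 + 6 + 0 + 8) / 4 = 4.5
```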
import numpy as np
import pandas as pd
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.sort_values(by="Area", ascending=False)
 | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
190 | 2755.0 | 859.928 | 235.458 | 539.0 | 3880.0 | 108.710 | 302.158 | 110.999 | 300.247 | 144.475 | 24.280 | 39.318 | 100 | C |
81 | 2295.0 | 765.239 | 96.545 | 558.0 | 1431.0 | 375.003 | 134.888 | 374.982 | 135.359 | 65.769 | 44.429 | 127.247 | 100 | B |
209 | 1821.0 | 847.761 | 122.074 | 600.0 | 1510.0 | 287.795 | 321.115 | 288.074 | 321.824 | 55.879 | 41.492 | 112.124 | 100 | A |
252 | 1528.0 | 763.777 | 83.183 | 572.0 | 1172.0 | 191.969 | 385.944 | 192.487 | 385.697 | 63.150 | 30.808 | 34.424 | 100 | B |
265 | 1252.0 | 793.371 | 117.139 | 579.0 | 1668.0 | 262.071 | 394.497 | 262.268 | 394.326 | 60.154 | 26.500 | 50.147 | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 1.0 | 587.000 | 0.000 | 587.0 | 587.0 | 399.500 | 117.500 | 399.500 | 117.500 | 1.128 | 1.128 | 0.000 | 100 | A |
310 | 1.0 | 866.000 | 0.000 | 866.0 | 866.0 | 343.500 | 408.500 | 343.500 | 408.500 | 1.128 | 1.128 | 0.000 | 100 | A |
219 | 1.0 | 763.000 | 0.000 | 763.0 | 763.0 | 411.500 | 296.500 | 411.500 | 296.500 | 1.128 | 1.128 | 0.000 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
391 rows × 14 columns
As you can see, there are rows at the bottom containing NaNs. They end up at the bottom of the table because NaN values cannot be compared to numbers; by default, sort_values places them last.
Checking whether there are NaNs anywhere in a DataFrame is an important quality check and good scientific practice.
data.isnull().values.any()
True
We can also find out in which columns these NaN values are located.
data.isnull().sum()
Area 2
Mean 5
StdDev 3
Min 3
Max 3
X 2
Y 3
XM 3
YM 5
Major 8
Minor 3
Angle 1
%Area 0
Type 0
dtype: int64
To get a first impression of whether we can further process the table, we may want to know the percentage of NaNs in each column.
data.isnull().mean().sort_values(ascending=False) * 100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
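If such percentages guide the decision about which columns to keep, a threshold-based filter can be sketched as follows. Note that the table and the 25 % tolerance below are made-up examples, not part of the data above.
```python
import numpy as np
import pandas as pd

# Hypothetical table: column 'b' has 50% NaNs, column 'a' has none
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [1.0, np.nan, np.nan, 4.0]})

# Keep only columns whose NaN fraction stays below a chosen tolerance
tolerance = 0.25  # 25%, an arbitrary example threshold
keep = df.columns[df.isnull().mean() < tolerance]
filtered = df[keep]
```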
Dropping rows that contain NaNs#
Depending on the kind of data analysis to be performed, it might make sense to just ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.
Which strategy is appropriate depends on your project and on what is important for the analysis. There is no easy answer.
data_no_nan = data.dropna(how="any")
data_no_nan
 | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
6 | 669.0 | 697.164 | 72.863 | 539.0 | 957.0 | 471.696 | 26.253 | 471.694 | 26.197 | 36.656 | 23.237 | 124.340 | 100 | A |
7 | 5.0 | 658.600 | 49.161 | 607.0 | 710.0 | 28.300 | 8.100 | 28.284 | 8.103 | 3.144 | 2.025 | 161.565 | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
383 | 94.0 | 746.617 | 85.198 | 550.0 | 1021.0 | 194.032 | 498.223 | 194.014 | 498.239 | 17.295 | 6.920 | 52.720 | 100 | B |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
374 rows × 14 columns
At the bottom of that table, you can see that it still contains 374 of the original 391 rows. If you remove rows, you should document in your later scientific publication how many out of the total number of datasets were analysed.
We can now also check again if NaNs are present.
data_no_nan.isnull().values.any()
False
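The alternative mentioned above, ignoring columns that contain NaNs, corresponds to dropna with axis=1. A minimal sketch on a hypothetical table:
```python
import numpy as np
import pandas as pd

# Hypothetical table where only column 'B' contains a NaN
df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [4.0, np.nan, 6.0]})

# axis=1 drops whole columns that contain at least one NaN
df_no_nan_columns = df.dropna(axis=1, how="any")
```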
Determining rows that contain NaNs#
In some use cases it might be useful to have a list of row indices where NaN values occur.
data2 = {
'A': [0, 1, 22, 21, 12, 23],
'B': [2, 3, np.nan, 2, 12, 22],
'C': [2, 3, 44, 2, np.nan, 52],
}
table = pd.DataFrame(data2)
table
 | A | B | C
---|---|---|---
0 | 0 | 2.0 | 2.0 |
1 | 1 | 3.0 | 3.0 |
2 | 22 | NaN | 44.0 |
3 | 21 | 2.0 | 2.0 |
4 | 12 | 12.0 | NaN |
5 | 23 | 22.0 | 52.0 |
np.max(table.isnull().values, axis=1)
array([False, False, True, False, True, False])
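An equivalent, more pandas-idiomatic way to build this mask is any(axis=1); it also gives direct access to the affected row indices:
```python
import numpy as np
import pandas as pd

table = pd.DataFrame({
    'A': [0, 1, 22, 21, 12, 23],
    'B': [2, 3, np.nan, 2, 12, 22],
    'C': [2, 3, 44, 2, np.nan, 52],
})

# True for every row that contains at least one NaN
rows_with_nan = table.isnull().any(axis=1)

# The corresponding row indices
nan_indices = table.index[rows_with_nan]
```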
Exercise#
Take the original data table and select the columns Area and Mean. Remove all rows that contain NaNs and count the remaining rows.
Afterwards, take the original data table again and select the columns Major and Minor. Remove NaNs and count the remaining rows again. What do you conclude?