{ "cells": [ { "cell_type": "markdown", "id": "78a139a0", "metadata": {}, "source": [ "# Handling NaN values\n", "When analysing tabular data, sometimes table cells are present that does not contain data. In Python this typically means the value is _Not a Number_ ([NaN](https://en.wikipedia.org/wiki/NaN)). We cannot assume these values are `0` or `-1` or any other value because that would distort descriptive statistics, for example. We need to deal with these NaN entries differently and this notebook will introduce how.\n", "\n", "To get a first view where NaNs play a role, we load again an example table and sort it." ] }, { "cell_type": "code", "execution_count": 1, "id": "189e76b0-0cc2-4baa-8290-e5a06ab2d70b", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd " ] }, { "cell_type": "code", "execution_count": 2, "id": "4e617db1-ac10-4f69-9ba9-97913d517a15", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AreaMeanStdDevMinMaxXYXMYMMajorMinorAngle%AreaType
1902755.0859.928235.458539.03880.0108.710302.158110.999300.247144.47524.28039.318100C
812295.0765.23996.545558.01431.0375.003134.888374.982135.35965.76944.429127.247100B
2091821.0847.761122.074600.01510.0287.795321.115288.074321.82455.87941.492112.124100A
2521528.0763.77783.183572.01172.0191.969385.944192.487385.69763.15030.80834.424100B
2651252.0793.371117.139579.01668.0262.071394.497262.268394.32660.15426.50050.147100A
.............................................
1131.0587.0000.000587.0587.0399.500117.500399.500117.5001.1281.1280.000100A
3101.0866.0000.000866.0866.0343.500408.500343.500408.5001.1281.1280.000100A
2191.0763.0000.000763.0763.0411.500296.500411.500296.5001.1281.1280.000100A
3NaNNaNNaN608.0964.0NaNNaNNaN7.6657.359NaN101.121100A
5NaNNaN69.438566.0792.0348.5007.500NaN7.508NaN3.088NaN100A
\n", "

391 rows × 14 columns

\n", "
" ], "text/plain": [ " Area Mean StdDev Min Max X Y XM \\\n", " \n", "190 2755.0 859.928 235.458 539.0 3880.0 108.710 302.158 110.999 \n", "81 2295.0 765.239 96.545 558.0 1431.0 375.003 134.888 374.982 \n", "209 1821.0 847.761 122.074 600.0 1510.0 287.795 321.115 288.074 \n", "252 1528.0 763.777 83.183 572.0 1172.0 191.969 385.944 192.487 \n", "265 1252.0 793.371 117.139 579.0 1668.0 262.071 394.497 262.268 \n", ".. ... ... ... ... ... ... ... ... \n", "113 1.0 587.000 0.000 587.0 587.0 399.500 117.500 399.500 \n", "310 1.0 866.000 0.000 866.0 866.0 343.500 408.500 343.500 \n", "219 1.0 763.000 0.000 763.0 763.0 411.500 296.500 411.500 \n", "3 NaN NaN NaN 608.0 964.0 NaN NaN NaN \n", "5 NaN NaN 69.438 566.0 792.0 348.500 7.500 NaN \n", "\n", " YM Major Minor Angle %Area Type \n", " \n", "190 300.247 144.475 24.280 39.318 100 C \n", "81 135.359 65.769 44.429 127.247 100 B \n", "209 321.824 55.879 41.492 112.124 100 A \n", "252 385.697 63.150 30.808 34.424 100 B \n", "265 394.326 60.154 26.500 50.147 100 A \n", ".. ... ... ... ... ... ... \n", "113 117.500 1.128 1.128 0.000 100 A \n", "310 408.500 1.128 1.128 0.000 100 A \n", "219 296.500 1.128 1.128 0.000 100 A \n", "3 7.665 7.359 NaN 101.121 100 A \n", "5 7.508 NaN 3.088 NaN 100 A \n", "\n", "[391 rows x 14 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')\n", "data.sort_values(by = \"Area\", ascending=False)" ] }, { "cell_type": "markdown", "id": "d4532812", "metadata": {}, "source": [ "As you can see, there are rows at the bottom containing NaNs. These are at the bottom of the table because pandas cannot sort them." ] }, { "cell_type": "markdown", "id": "2bc0c85e", "metadata": {}, "source": [ "A quick check if there are NaNs anywhere in a DataFrame is an important quality check and good scientific practice." ] }, { "cell_type": "code", "execution_count": 3, "id": "5c152771", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "markdown", "id": "87f14e62-b3e5-45f6-9c02-9820b21bd929", "metadata": {}, "source": [ "We can also get some deeper insights in which columns these NaN values are located." ] }, { "cell_type": "code", "execution_count": 4, "id": "7f6b5eb4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Area 2\n", "Mean 5\n", "StdDev 3\n", "Min 3\n", "Max 3\n", "X 2\n", "Y 3\n", "XM 3\n", "YM 5\n", "Major 8\n", "Minor 3\n", "Angle 1\n", "%Area 0\n", "Type 0\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().sum()" ] }, { "cell_type": "markdown", "id": "b27eeccd", "metadata": {}, "source": [ "For getting a glimpse about if we can further process that tabel, we may want to know the percentage of NaNs for each column?" ] }, { "cell_type": "code", "execution_count": 5, "id": "a9297b56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Major 2.046036\n", "Mean 1.278772\n", "YM 1.278772\n", "StdDev 0.767263\n", "Min 0.767263\n", "Max 0.767263\n", "Y 0.767263\n", "XM 0.767263\n", "Minor 0.767263\n", "Area 0.511509\n", "X 0.511509\n", "Angle 0.255754\n", "%Area 0.000000\n", "Type 0.000000\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().mean().sort_values(ascending=False) *100" ] }, { "cell_type": "markdown", "id": "f605facf", "metadata": {}, "source": [ "# Dropping rows that contain NaNs\n", "Depending on what kind of data analysis should be performed, it might make sense to just ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.\n", "\n", "It depends on your project and what is important or not for the analysis. Its not an easy answer." ] }, { "cell_type": "code", "execution_count": 6, "id": "25bac0a9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AreaMeanStdDevMinMaxXYXMYMMajorMinorAngle%AreaType
118.0730.389103.354592.0948.0435.0004.722434.9624.6975.9873.828168.425100A
2126.0718.33390.367556.01046.0388.0878.683388.1838.68716.5599.688175.471100A
468.0686.98561.169571.0880.0126.1478.809126.1928.81115.1365.720168.133100A
6669.0697.16472.863539.0957.0471.69626.253471.69426.19736.65623.237124.340100A
75.0658.60049.161607.0710.028.3008.10028.2848.1033.1442.025161.565100A
.............................................
38394.0746.61785.198550.01021.0194.032498.223194.014498.23917.2956.92052.720100B
387152.0801.599111.328582.01263.0348.487497.632348.451497.67517.77310.88911.829100A
38960.0758.03377.309601.0947.0259.000499.300258.990499.2899.4768.06290.000100A
39012.0714.83367.294551.0785.0240.167498.167240.179498.1484.6063.317168.690100A
39123.0695.04367.356611.0846.049.891503.02249.882502.9796.4544.53773.243100A
\n", "

374 rows × 14 columns

\n", "
" ], "text/plain": [ " Area Mean StdDev Min Max X Y XM \\\n", " \n", "1 18.0 730.389 103.354 592.0 948.0 435.000 4.722 434.962 \n", "2 126.0 718.333 90.367 556.0 1046.0 388.087 8.683 388.183 \n", "4 68.0 686.985 61.169 571.0 880.0 126.147 8.809 126.192 \n", "6 669.0 697.164 72.863 539.0 957.0 471.696 26.253 471.694 \n", "7 5.0 658.600 49.161 607.0 710.0 28.300 8.100 28.284 \n", ".. ... ... ... ... ... ... ... ... \n", "383 94.0 746.617 85.198 550.0 1021.0 194.032 498.223 194.014 \n", "387 152.0 801.599 111.328 582.0 1263.0 348.487 497.632 348.451 \n", "389 60.0 758.033 77.309 601.0 947.0 259.000 499.300 258.990 \n", "390 12.0 714.833 67.294 551.0 785.0 240.167 498.167 240.179 \n", "391 23.0 695.043 67.356 611.0 846.0 49.891 503.022 49.882 \n", "\n", " YM Major Minor Angle %Area Type \n", " \n", "1 4.697 5.987 3.828 168.425 100 A \n", "2 8.687 16.559 9.688 175.471 100 A \n", "4 8.811 15.136 5.720 168.133 100 A \n", "6 26.197 36.656 23.237 124.340 100 A \n", "7 8.103 3.144 2.025 161.565 100 A \n", ".. ... ... ... ... ... ... \n", "383 498.239 17.295 6.920 52.720 100 B \n", "387 497.675 17.773 10.889 11.829 100 A \n", "389 499.289 9.476 8.062 90.000 100 A \n", "390 498.148 4.606 3.317 168.690 100 A \n", "391 502.979 6.454 4.537 73.243 100 A \n", "\n", "[374 rows x 14 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_no_nan = data.dropna(how=\"any\")\n", "data_no_nan " ] }, { "cell_type": "markdown", "id": "0d857531-4c88-450b-ae62-9008388088ba", "metadata": {}, "source": [ "On the bottom of that table, you can see that it still contains 374 of the original 391 columns. If you remove rows, you should document in your later scientific publication, how many out of the total number of datasets were analysed.\n", "\n", "We can now also check again if NaNs are present." ] }, { "cell_type": "code", "execution_count": 7, "id": "f09a2106", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_no_nan.isnull().values.any()" ] }, { "cell_type": "markdown", "id": "ab941145-75fe-4d5a-80fb-ab85672d0a86", "metadata": {}, "source": [ "## Determining rows that contain NaNs\n", "In some use-cases it might be useful to have a list of row-indices where there are NaN values." ] }, { "cell_type": "code", "execution_count": 8, "id": "84addf53-beb6-4955-bad9-fe52517f64d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
002.02.0
113.03.0
222NaN44.0
3212.02.0
41212.0NaN
52322.052.0
\n", "
" ], "text/plain": [ " A B C\n", "0 0 2.0 2.0\n", "1 1 3.0 3.0\n", "2 22 NaN 44.0\n", "3 21 2.0 2.0\n", "4 12 12.0 NaN\n", "5 23 22.0 52.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {\n", " 'A': [0, 1, 22, 21, 12, 23],\n", " 'B': [2, 3, np.nan, 2, 12, 22],\n", " 'C': [2, 3, 44, 2, np.nan, 52],\n", "}\n", "\n", "table = pd.DataFrame(data)\n", "table" ] }, { "cell_type": "code", "execution_count": 9, "id": "7b3b3292-0864-484e-b2d4-dab5c1a0b6ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, False, True, False])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.max(table.isnull().values, axis=1)" ] }, { "cell_type": "markdown", "id": "780acb1c-e8f0-4678-aa2c-ca497fe26872", "metadata": {}, "source": [ "## Exercise\n", "Take the original `data` table and select the columns `Area` and `Mean`. Remove all rows that contain NaNs and count the remaining rows.\n", "Afterwards, take the original `data` table again and select the columns `Major` and `Minor`. Remove NaNs and count the remaining rows again. What do you conclude?" ] }, { "cell_type": "code", "execution_count": null, "id": "12c2e25b-8d38-4c21-8390-ea94742a6070", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }