{ "cells": [ { "cell_type": "markdown", "id": "51e6f4d9-a000-4b05-9bcc-dc52db91658c", "metadata": {}, "source": [ "# Tidy-Data" ] }, { "cell_type": "markdown", "id": "248c6fda-e47c-46c2-957d-cf4bdf4d141d", "metadata": {}, "source": [ "Hadley Wickham wrote a [great article](https://www.jstatsoft.org/article/view/v059i10) in favor of “tidy data.” Tidy data frames follow the rules:\n", "\n", "- Each variable is a column.\n", "\n", "- Each observation is a row.\n", "\n", "- Each type of observation has its own separate data frame.\n", "\n", "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats." ] }, { "cell_type": "markdown", "id": "774c1a89-2f91-404b-8f74-284acabd3bcb", "metadata": {}, "source": [ "Let's import a saved table with measurements. Is this table tidy?" ] }, { "cell_type": "code", "execution_count": 1, "id": "5c88af81-7a31-42bb-8f12-69a89f2f1e0a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "504fd34a-9454-4fb3-9ccf-4b9964feada9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BeforeAfter
channel_1channel_2channel_1channel_2
013.25000021.00000015.13798442.022776
144.95454524.31818243.32883648.661610
213.59090918.77272711.68599537.926184
385.03225819.74193586.03146140.396353
410.73170725.26829310.07542151.471865
594.62500036.45000095.18090073.347843
689.83673534.69387889.85786469.902829
7100.26190534.904762101.98985270.156432
829.61538552.11538531.516654104.525198
915.86842124.92105316.08693250.563301
1012.47500025.45000011.52992451.381594
1187.87500028.05000089.74552256.543107
1258.80000022.60000059.64622945.215405
1391.06122440.36734789.93589381.326111
1423.500000117.33333321.676993235.067654
1582.56666734.56666784.09773569.820702
1636.12000029.60000037.68867659.870177
1770.68750033.84375072.56911268.493363
18102.02127733.297872100.41974667.379506
1972.318182103.90909170.843134207.956510
2018.10000029.16666717.86520158.361239
215.21739136.3478266.96134673.286439
2219.92592672.81481518.607102145.900739
2326.67307757.40384627.611368115.347217
2413.34000030.40000014.16054361.225962
2515.02857138.40000014.52996377.490249
\n", "
" ], "text/plain": [ " Before After \n", " channel_1 channel_2 channel_1 channel_2\n", "0 13.250000 21.000000 15.137984 42.022776\n", "1 44.954545 24.318182 43.328836 48.661610\n", "2 13.590909 18.772727 11.685995 37.926184\n", "3 85.032258 19.741935 86.031461 40.396353\n", "4 10.731707 25.268293 10.075421 51.471865\n", "5 94.625000 36.450000 95.180900 73.347843\n", "6 89.836735 34.693878 89.857864 69.902829\n", "7 100.261905 34.904762 101.989852 70.156432\n", "8 29.615385 52.115385 31.516654 104.525198\n", "9 15.868421 24.921053 16.086932 50.563301\n", "10 12.475000 25.450000 11.529924 51.381594\n", "11 87.875000 28.050000 89.745522 56.543107\n", "12 58.800000 22.600000 59.646229 45.215405\n", "13 91.061224 40.367347 89.935893 81.326111\n", "14 23.500000 117.333333 21.676993 235.067654\n", "15 82.566667 34.566667 84.097735 69.820702\n", "16 36.120000 29.600000 37.688676 59.870177\n", "17 70.687500 33.843750 72.569112 68.493363\n", "18 102.021277 33.297872 100.419746 67.379506\n", "19 72.318182 103.909091 70.843134 207.956510\n", "20 18.100000 29.166667 17.865201 58.361239\n", "21 5.217391 36.347826 6.961346 73.286439\n", "22 19.925926 72.814815 18.607102 145.900739\n", "23 26.673077 57.403846 27.611368 115.347217\n", "24 13.340000 30.400000 14.160543 61.225962\n", "25 15.028571 38.400000 14.529963 77.490249" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')\n", "df" ] }, { "cell_type": "markdown", "id": "e497f862-4f03-4d58-9b92-0dea8d750f16", "metadata": {}, "source": [ "The most useful function for tidying data is [pd.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)." ] }, { "cell_type": "code", "execution_count": 3, "id": "0022b484-fd53-4c82-be11-9a3fe2261497", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variable_0variable_1value
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ], "text/plain": [ " variable_0 variable_1 value\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt()" ] }, { "cell_type": "markdown", "id": "dd24af67-9f6e-4f8c-a432-37eb146e2bc9", "metadata": {}, "source": [ "We can specify names for the value and for the variables. In this case, our measurements are of intensity and our variables are Intervention (before or after) and channel." ] }, { "cell_type": "code", "execution_count": 4, "id": "9ffaf527-78d8-4a9c-89e0-19858b877c2e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InterventionChannelintensity
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ], "text/plain": [ " Intervention Channel intensity\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy = df.melt(value_name='intensity', var_name=['Intervention', 'Channel'])\n", "df_tidy" ] }, { "cell_type": "markdown", "id": "d95d7c92-95fe-48a5-8554-4b366830f351", "metadata": {}, "source": [ "It may not look better for interpreting, but it becomes easier to manipulate, because now we can more easily mask by columns. Here we select intensity measurement rows for \"channel_2\" and \"After\" intervention." ] }, { "cell_type": "code", "execution_count": 5, "id": "512f335d-61f7-4dc9-9ff1-239802b2f9da", "metadata": {}, "outputs": [], "source": [ "row_mask = (df_tidy['Channel'] == 'channel_2') & (df_tidy['Intervention'] == 'After')" ] }, { "cell_type": "code", "execution_count": 6, "id": "88c4b6bb-81b7-48e8-ada8-5f2acf2e4888", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "78 42.022776\n", "79 48.661610\n", "80 37.926184\n", "81 40.396353\n", "82 51.471865\n", "83 73.347843\n", "84 69.902829\n", "85 70.156432\n", "86 104.525198\n", "87 50.563301\n", "88 51.381594\n", "89 56.543107\n", "90 45.215405\n", "91 81.326111\n", "92 235.067654\n", "93 69.820702\n", "94 59.870177\n", "95 68.493363\n", "96 67.379506\n", "97 207.956510\n", "98 58.361239\n", "99 73.286439\n", "100 145.900739\n", "101 115.347217\n", "102 61.225962\n", "103 77.490249\n", "Name: intensity, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy.loc[row_mask, :]['intensity']" ] }, { "cell_type": "code", "execution_count": null, "id": "22c0f001-a3ae-4d4a-902a-8fe55cbd79b9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }