{ "cells": [ { "cell_type": "markdown", "id": "99c606f8-037f-4258-81e7-a9f4ac511242", "metadata": {}, "source": [ "# Introduction to working with DataFrames\n", "In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), the primary tool in the Python ecosystem for handling data. Its primary object, the \"DataFrame\" is extremely useful in wrangling data. which are more handy in the next steps." ] }, { "cell_type": "code", "execution_count": 1, "id": "0cfceb6c-1acc-4632-b084-8b0871a7c50a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "8b77888b-c9a8-4a67-a4eb-f7df46eda970", "metadata": {}, "source": [ "## Creating DataFrames from a dictionary of lists\n", "Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ff80484f-657b-4231-8d8f-cdc26577542b", "metadata": {}, "outputs": [], "source": [ "measurements = {\n", " \"labels\": [1, 2, 3],\n", " \"area\": [45, 23, 68],\n", " \"minor_axis\": [2, 4, 4],\n", " \"major_axis\": [3, 4, 5],\n", "}" ] }, { "cell_type": "markdown", "id": "b2afa6a9-e15c-4147-bdd4-ec4d4f87fb36", "metadata": {}, "source": [ "This data structure can be nicely visualized using a DataFrame:" ] }, { "cell_type": "code", "execution_count": 3, "id": "8bf4e4b5-ef72-4f63-84d2-48cc3a77c297", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(measurements)\n", "df" ] }, { "cell_type": "markdown", "id": "930c082b-8f16-4711-b3e0-e56a7ec6d272", "metadata": {}, "source": [ "Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns. This is done elementwise." ] }, { "cell_type": "code", "execution_count": 4, "id": "a34866ff-a2cb-4a7c-a4e8-4544559b634c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axisaspect_ratio
0145231.50
1223441.00
2368451.25
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis aspect_ratio\n", "0 1 45 2 3 1.50\n", "1 2 23 4 4 1.00\n", "2 3 68 4 5 1.25" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"aspect_ratio\"] = df[\"major_axis\"] / df[\"minor_axis\"]\n", "df" ] }, { "cell_type": "markdown", "id": "201a2142-22c7-4607-bc2d-f1dfce4c7e26", "metadata": {}, "source": [ "## Saving data frames\n", "We can also save this table for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that is easily read into data structures in many programming languages." ] }, { "cell_type": "code", "execution_count": 5, "id": "fb01d2d9-4d8b-4b6a-b158-9516a581e000", "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"../../data/short_table.csv\")" ] }, { "cell_type": "markdown", "id": "2677e5a9-8d1b-4454-b009-4ac26e549b2d", "metadata": {}, "source": [ "You should generally always store your data in such a format, not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files." ] }, { "cell_type": "markdown", "id": "0240857d-292f-4ac3-ba87-8878aa941cde", "metadata": {}, "source": [ "## Creating DataFrames from lists of lists\n", "Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists" ] }, { "cell_type": "code", "execution_count": 6, "id": "c72a82b1-4da6-468d-afa6-149cb00f7d37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
labels123
area452368
minor_axis244
major_axis345
\n", "
" ], "text/plain": [ " 0 1 2\n", "labels 1 2 3\n", "area 45 23 68\n", "minor_axis 2 4 4\n", "major_axis 3 4 5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "row_header = ['labels', 'area', 'minor_axis', 'major_axis']\n", "\n", "data = [\n", " [1, 2, 3], # first row\n", " [45, 23, 68], # second row\n", " [2, 4, 4], # third row\n", " [3, 4, 5], # fourth row\n", "]\n", " \n", "# convert the data and header arrays in a pandas data frame\n", "data_frame = pd.DataFrame(data, row_header)\n", "\n", "# show it\n", "data_frame" ] }, { "cell_type": "markdown", "id": "9f64996f", "metadata": {}, "source": [ "## Or from a 2D numpy array" ] }, { "cell_type": "code", "execution_count": 7, "id": "320a8295", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaminor_axismajor_axis
00.8007970.4218620.956755
10.4703180.4311160.171390
20.2109300.0765680.199061
30.0506820.3469190.254751
\n", "
" ], "text/plain": [ " area minor_axis major_axis\n", "0 0.800797 0.421862 0.956755\n", "1 0.470318 0.431116 0.171390\n", "2 0.210930 0.076568 0.199061\n", "3 0.050682 0.346919 0.254751" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "data = np.random.random((4,3))\n", "column_header = ['area', 'minor_axis', 'major_axis']\n", "\n", "pd.DataFrame(data, columns=column_header)" ] }, { "cell_type": "markdown", "id": "a8b1b6b0-027c-4536-8710-e3f87aca1896", "metadata": {}, "source": [ "As you can see, this table is _rotated_. We can bring it in the usual form like this:" ] }, { "cell_type": "code", "execution_count": 8, "id": "40669e82-4264-4883-9c4e-8a366b061610", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelsareaminor_axismajor_axis
014523
122344
236845
\n", "
" ], "text/plain": [ " labels area minor_axis major_axis\n", "0 1 45 2 3\n", "1 2 23 4 4\n", "2 3 68 4 5" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rotate/flip it\n", "data_frame = data_frame.transpose()\n", "\n", "# show it\n", "data_frame" ] }, { "cell_type": "markdown", "id": "ccf08662-fccf-4dc1-91c2-3365fa85a96b", "metadata": {}, "source": [ "## Loading data frames\n", "Tables can be read from CSV files with [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv)." ] }, { "cell_type": "code", "execution_count": 9, "id": "aa7c74db-68ab-4004-aa5e-01ba1ad88c79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
00422192.37914716.48855034.5667890.8789000.58611135.22783023.1798850113035
11182180.13186811.73607420.8026970.8256650.78787921.37755815.2226670531174
22661205.21633928.40950230.2084330.3399340.87433932.75667929.01053809528122
33437216.58581223.14399624.6061300.3395760.82608726.92582423.588253014423167
44476212.30252119.85288231.0751060.7693170.86388431.38471024.618327023729256
..........................................
5656211185.06161114.52276218.4891380.6188930.78148118.97366616.3906542323925054
575778185.2307696.02863817.5797990.9393610.72222218.0277569.965575248170254188
585886183.7209305.42687121.2614270.9668760.78181822.00000010.464158249117254139
595951190.4313735.03241413.7420790.9305340.72857114.0356698.058239249228254242
606046175.3043483.80398215.9487140.9711390.76666715.0332967.6530402506725482
\n", "

61 rows × 13 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n", "0 0 422 192.379147 16.488550 34.566789 \n", "1 1 182 180.131868 11.736074 20.802697 \n", "2 2 661 205.216339 28.409502 30.208433 \n", "3 3 437 216.585812 23.143996 24.606130 \n", "4 4 476 212.302521 19.852882 31.075106 \n", ".. ... ... ... ... ... \n", "56 56 211 185.061611 14.522762 18.489138 \n", "57 57 78 185.230769 6.028638 17.579799 \n", "58 58 86 183.720930 5.426871 21.261427 \n", "59 59 51 190.431373 5.032414 13.742079 \n", "60 60 46 175.304348 3.803982 15.948714 \n", "\n", " eccentricity extent feret_diameter_max equivalent_diameter_area \\\n", "0 0.878900 0.586111 35.227830 23.179885 \n", "1 0.825665 0.787879 21.377558 15.222667 \n", "2 0.339934 0.874339 32.756679 29.010538 \n", "3 0.339576 0.826087 26.925824 23.588253 \n", "4 0.769317 0.863884 31.384710 24.618327 \n", ".. ... ... ... ... \n", "56 0.618893 0.781481 18.973666 16.390654 \n", "57 0.939361 0.722222 18.027756 9.965575 \n", "58 0.966876 0.781818 22.000000 10.464158 \n", "59 0.930534 0.728571 14.035669 8.058239 \n", "60 0.971139 0.766667 15.033296 7.653040 \n", "\n", " bbox-0 bbox-1 bbox-2 bbox-3 \n", "0 0 11 30 35 \n", "1 0 53 11 74 \n", "2 0 95 28 122 \n", "3 0 144 23 167 \n", "4 0 237 29 256 \n", ".. ... ... ... ... \n", "56 232 39 250 54 \n", "57 248 170 254 188 \n", "58 249 117 254 139 \n", "59 249 228 254 242 \n", "60 250 67 254 82 \n", "\n", "[61 rows x 13 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv = pd.read_csv('../../data/blobs_statistics.csv')\n", "df_csv" ] }, { "cell_type": "markdown", "id": "da74dce2-609a-4e52-bc16-77ddf45efc98", "metadata": {}, "source": [ "That's a bit too much information. We can use the `.head()` method of data frames to look at the first few rows (or the `.tail()` to check the last rows)." ] }, { "cell_type": "code", "execution_count": 10, "id": "08d22b50-f860-428f-9edc-41c749ba5ae7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
00422192.37914716.48855034.5667890.8789000.58611135.22783023.1798850113035
11182180.13186811.73607420.8026970.8256650.78787921.37755815.2226670531174
22661205.21633928.40950230.2084330.3399340.87433932.75667929.01053809528122
\n", "
" ], "text/plain": [ " Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n", "0 0 422 192.379147 16.488550 34.566789 \n", "1 1 182 180.131868 11.736074 20.802697 \n", "2 2 661 205.216339 28.409502 30.208433 \n", "\n", " eccentricity extent feret_diameter_max equivalent_diameter_area \\\n", "0 0.878900 0.586111 35.227830 23.179885 \n", "1 0.825665 0.787879 21.377558 15.222667 \n", "2 0.339934 0.874339 32.756679 29.010538 \n", "\n", " bbox-0 bbox-1 bbox-2 bbox-3 \n", "0 0 11 30 35 \n", "1 0 53 11 74 \n", "2 0 95 28 122 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.head(3)" ] }, { "cell_type": "code", "execution_count": 11, "id": "c3c35f55-146f-4cad-bda6-3d83b049a8f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
585886183.7209305.42687121.2614270.9668760.78181822.00000010.464158249117254139
595951190.4313735.03241413.7420790.9305340.72857114.0356698.058239249228254242
606046175.3043483.80398215.9487140.9711390.76666715.0332967.6530402506725482
\n", "
" ], "text/plain": [ " Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n", "58 58 86 183.720930 5.426871 21.261427 \n", "59 59 51 190.431373 5.032414 13.742079 \n", "60 60 46 175.304348 3.803982 15.948714 \n", "\n", " eccentricity extent feret_diameter_max equivalent_diameter_area \\\n", "58 0.966876 0.781818 22.000000 10.464158 \n", "59 0.930534 0.728571 14.035669 8.058239 \n", "60 0.971139 0.766667 15.033296 7.653040 \n", "\n", " bbox-0 bbox-1 bbox-2 bbox-3 \n", "58 249 117 254 139 \n", "59 249 228 254 242 \n", "60 250 67 254 82 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.tail(3)" ] }, { "cell_type": "markdown", "id": "01732b57-35d9-4b25-9c1b-d322487d2757", "metadata": {}, "source": [ "We can also get column names with the DataFrame attribute `columns`." ] }, { "cell_type": "code", "execution_count": 12, "id": "cc7d6cbe-6487-49a6-84b2-e837f7070f25", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',\n", " 'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',\n", " 'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],\n", " dtype='object')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.columns" ] }, { "cell_type": "markdown", "id": "ff187a52-9fc0-4f6f-b143-f872dfe620c2", "metadata": {}, "source": [ "## Selecting rows and columns" ] }, { "attachments": {}, "cell_type": "markdown", "id": "19fe76e8-5ea4-40c5-8bcd-1f9e12c9d85f", "metadata": {}, "source": [ "Ok, let's get the dataframe first column:" ] }, { "cell_type": "code", "execution_count": 13, "id": "bb65a716-5e56-466b-9a19-e388e6900d8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 422\n", "1 182\n", "2 661\n", "3 437\n", "4 476\n", " ... \n", "56 211\n", "57 78\n", "58 86\n", "59 51\n", "60 46\n", "Name: area, Length: 61, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv['area']" ] }, { "cell_type": "markdown", "id": "62ba3625-5206-44c6-b79f-0554fc55bdb7", "metadata": {}, "source": [ "Notice that when it was printed, the index of the rows came along with it. That's beacuse a Pandas DataFrame with one column is a [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html?highlight=series#pandas.Series)." ] }, { "cell_type": "markdown", "id": "3cb46414-d625-4f6a-85ba-8e05e521b8a5", "metadata": {}, "source": [ "We can get more columns by passing their names as a list. Furthermore, we can store this \"sub-dataframe\" in a new variable." ] }, { "cell_type": "code", "execution_count": 14, "id": "ffb0453b-a0d5-4ae9-88ae-da79e332bb0c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensity
0422192.379147
1182180.131868
2661205.216339
3437216.585812
4476212.302521
.........
56211185.061611
5778185.230769
5886183.720930
5951190.431373
6046175.304348
\n", "

61 rows × 2 columns

\n", "
" ], "text/plain": [ " area mean_intensity\n", "0 422 192.379147\n", "1 182 180.131868\n", "2 661 205.216339\n", "3 437 216.585812\n", "4 476 212.302521\n", ".. ... ...\n", "56 211 185.061611\n", "57 78 185.230769\n", "58 86 183.720930\n", "59 51 190.431373\n", "60 46 175.304348\n", "\n", "[61 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analysis = df_csv[ ['area', 'mean_intensity'] ].copy()\n", "df_analysis" ] }, { "cell_type": "markdown", "id": "361726eb-c73d-4c3a-a5ed-530e9517abb5", "metadata": {}, "source": [ "This gave us the area measurements we were after. \n", "\n", "If we want to get a single row, the proper way of doing that is to use the `.loc` method:" ] }, { "cell_type": "code", "execution_count": 15, "id": "6df270e8-5f3c-4f23-a934-3c554575dd09", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "area 661.000000\n", "mean_intensity 205.216339\n", "Name: 2, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.loc[ 2, ['area', 'mean_intensity']]" ] }, { "cell_type": "markdown", "id": "95b4a73b-f03b-484a-8ed9-82a06ae38bc5", "metadata": {}, "source": [ "Note that following `.loc`, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers. And you should not count on them being integers.\n", "\n", "In case you really want to access elements in a dataframe by integer indices, like in a numpy array, you can use the `.iloc` method. " ] }, { "cell_type": "code", "execution_count": 16, "id": "5a6cf822-dd51-4955-94ab-ece56bc7ac45", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensity
2661205.216339
3437216.585812
\n", "
" ], "text/plain": [ " area mean_intensity\n", "2 661 205.216339\n", "3 437 216.585812" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_csv.iloc[2:4, 1:3]" ] }, { "cell_type": "markdown", "id": "d576e7e3-db2b-4a73-9237-0deec1f4116d", "metadata": {}, "source": [ "The downside is that the code becomes less explicit when column names are absent.\n", "\n", "In practice you will almost never use row indices, but rather use Boolean indexing or Masking." ] }, { "attachments": {}, "cell_type": "markdown", "id": "6fe65e75-8003-4175-96c6-21cb26eb5d31", "metadata": {}, "source": [ "## Selecting parts of tables based on criteria\n", "In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes call masking.\n", "\n", "Suppose we want the rows for which `df_analysis[\"area\"] > 50`. We can essentially plug this syntax directly when using `.loc`." ] }, { "cell_type": "code", "execution_count": 17, "id": "a4eadd9b-e287-4ca8-b1ff-d1278c24151c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensity
55280189.800000
56211185.061611
5778185.230769
5886183.720930
5951190.431373
\n", "
" ], "text/plain": [ " area mean_intensity\n", "55 280 189.800000\n", "56 211 185.061611\n", "57 78 185.230769\n", "58 86 183.720930\n", "59 51 190.431373" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analysis_filtered = df_analysis[ df_analysis[\"area\"] > 50]\n", "df_analysis_filtered.tail()" ] }, { "cell_type": "markdown", "id": "64eb1086-ebc8-4905-afc2-ed0dc01620b9", "metadata": {}, "source": [ "## Adding new columns\n", "In Pandas, it is very easy to generate new columns from existing columns. We just pass some operation between other columns to a new column name." ] }, { "cell_type": "code", "execution_count": 18, "id": "402892eb-b1ea-4f11-b272-9c44207f7991", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areamean_intensitytotal_intensity
0422192.37914781184.0
1182180.13186832784.0
2661205.216339135648.0
3437216.58581294648.0
4476212.302521101056.0
............
56211185.06161139048.0
5778185.23076914448.0
5886183.72093015800.0
5951190.4313739712.0
6046175.3043488064.0
\n", "

61 rows × 3 columns

\n", "
" ], "text/plain": [ " area mean_intensity total_intensity\n", "0 422 192.379147 81184.0\n", "1 182 180.131868 32784.0\n", "2 661 205.216339 135648.0\n", "3 437 216.585812 94648.0\n", "4 476 212.302521 101056.0\n", ".. ... ... ...\n", "56 211 185.061611 39048.0\n", "57 78 185.230769 14448.0\n", "58 86 183.720930 15800.0\n", "59 51 190.431373 9712.0\n", "60 46 175.304348 8064.0\n", "\n", "[61 rows x 3 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']\n", "df_analysis" ] }, { "cell_type": "markdown", "id": "273def9d-da84-49b7-83be-11695c64c3d9", "metadata": {}, "source": [ "# Appending tables\n", "When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.\n", "\n", "We start with two small tables of measurements that could have been obtained from different functions or different libraries." ] }, { "cell_type": "code", "execution_count": 19, "id": "1477c227", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 20, "id": "156d6b7c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelcircularityelongation
010.32.3
120.53.4
230.71.2
\n", "
" ], "text/plain": [ " label circularity elongation\n", "0 1 0.3 2.3\n", "1 2 0.5 3.4\n", "2 3 0.7 1.2" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table1 = pd.DataFrame({\n", " \"label\": [1, 2, 3],\n", " \"circularity\": [0.3, 0.5, 0.7],\n", " \"elongation\": [2.3, 3.4, 1.2],\n", " })\n", "table1" ] }, { "cell_type": "code", "execution_count": 21, "id": "1877c769", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelareaskewness
03220.5
12320.6
21250.3
34180.3
\n", "
" ], "text/plain": [ " label area skewness\n", "0 3 22 0.5\n", "1 2 32 0.6\n", "2 1 25 0.3\n", "3 4 18 0.3" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table2 = pd.DataFrame({\n", " \"label\": [3, 2, 1, 4],\n", " \"area\": [22, 32, 25, 18],\n", " \"skewness\": [0.5, 0.6, 0.3, 0.3],\n", " })\n", "table2" ] }, { "cell_type": "markdown", "id": "9682caa7-ea47-452a-a773-952639b53de8", "metadata": { "tags": [] }, "source": [ "## Combining same measurement from different image files" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3e37e64c-0185-439f-9e52-f1e23e732865", "metadata": {}, "source": [ "When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows.\n", "To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.\n", "\n", "Let's open two tables generated from applying the same worflow to different files." ] }, { "cell_type": "code", "execution_count": 22, "id": "23b7b945-b874-49ba-b48f-d494adf7e8c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
025693.25000019.99501717.0215591.174688
19082.48888915.9399697.5163262.120713
257790.63778235.32445821.7594341.623409
327095.64074120.22943117.6690521.144908
415384.90849715.68370312.4204751.262730
\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length aspect_ratio\n", "0 256 93.250000 19.995017 17.021559 1.174688\n", "1 90 82.488889 15.939969 7.516326 2.120713\n", "2 577 90.637782 35.324458 21.759434 1.623409\n", "3 270 95.640741 20.229431 17.669052 1.144908\n", "4 153 84.908497 15.683703 12.420475 1.262730" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 23, "id": "f0bf9a4b-925d-4227-8aff-4348fa2afca3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
013996.54676317.50410410.2927701.700621
136086.61388935.74680814.9831242.385805
24391.48837212.9678844.3515732.980045
314073.74285718.94050810.3144041.836316
414489.37500013.63930813.4585321.013432
\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length aspect_ratio\n", "0 139 96.546763 17.504104 10.292770 1.700621\n", "1 360 86.613889 35.746808 14.983124 2.385805\n", "2 43 91.488372 12.967884 4.351573 2.980045\n", "3 140 73.742857 18.940508 10.314404 1.836316\n", "4 144 89.375000 13.639308 13.458532 1.013432" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')\n", "df2.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6121c288-6584-4c30-a584-43cc58871599", "metadata": {}, "source": [ "Because we know we have the same columns, we can concatenate them into a single big table." ] }, { "cell_type": "code", "execution_count": 24, "id": "0cf88538-6bea-4a47-ba1e-09a26fba2602", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
025693.25000019.99501717.0215591.174688
19082.48888915.9399697.5163262.120713
257790.63778235.32445821.7594341.623409
327095.64074120.22943117.6690521.144908
415384.90849715.68370312.4204751.262730
..................
4231591.13333320.92709519.2092831.089426
4320694.26213623.38187911.6696682.003646
444568.3777789.4063716.2764451.498678
453376.72727310.7242754.1745682.568955
461676.7500007.7459672.7838822.782433
\n", "

111 rows × 5 columns

\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length aspect_ratio\n", "0 256 93.250000 19.995017 17.021559 1.174688\n", "1 90 82.488889 15.939969 7.516326 2.120713\n", "2 577 90.637782 35.324458 21.759434 1.623409\n", "3 270 95.640741 20.229431 17.669052 1.144908\n", "4 153 84.908497 15.683703 12.420475 1.262730\n", ".. ... ... ... ... ...\n", "42 315 91.133333 20.927095 19.209283 1.089426\n", "43 206 94.262136 23.381879 11.669668 2.003646\n", "44 45 68.377778 9.406371 6.276445 1.498678\n", "45 33 76.727273 10.724275 4.174568 2.568955\n", "46 16 76.750000 7.745967 2.783882 2.782433\n", "\n", "[111 rows x 5 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "big_df = pd.concat([df1, df2], axis=0)\n", "big_df" ] }, { "cell_type": "markdown", "id": "12290fe6-86aa-408e-b9f7-1c2b2900a64a", "metadata": {}, "source": [ "The problem is that we lose their source identity. An easy fix for that is to add a new column with the file name before concatenating them. This will facilitating splitting them back and plotting later on.\n", "\n", "When we give a single value to a new column, it is assigned to all rows." ] }, { "cell_type": "code", "execution_count": 25, "id": "a9a4745e-69a6-4bef-95e0-a449d8ec2dea", "metadata": {}, "outputs": [], "source": [ "df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'\n", "\n", "df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'" ] }, { "cell_type": "code", "execution_count": 26, "id": "4c87f225-eb3c-4107-b5d6-454c2598c084", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratiofile_name
025693.25000019.99501717.0215591.174688BBBC007_20P1_POS0007_D_1UL
19082.48888915.9399697.5163262.120713BBBC007_20P1_POS0007_D_1UL
257790.63778235.32445821.7594341.623409BBBC007_20P1_POS0007_D_1UL
327095.64074120.22943117.6690521.144908BBBC007_20P1_POS0007_D_1UL
415384.90849715.68370312.4204751.262730BBBC007_20P1_POS0007_D_1UL
.....................
4231591.13333320.92709519.2092831.089426BBBC007_20P1_POS0010_D_1UL
4320694.26213623.38187911.6696682.003646BBBC007_20P1_POS0010_D_1UL
444568.3777789.4063716.2764451.498678BBBC007_20P1_POS0010_D_1UL
453376.72727310.7242754.1745682.568955BBBC007_20P1_POS0010_D_1UL
461676.7500007.7459672.7838822.782433BBBC007_20P1_POS0010_D_1UL
\n", "

111 rows × 6 columns

\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length aspect_ratio \\\n", "0 256 93.250000 19.995017 17.021559 1.174688 \n", "1 90 82.488889 15.939969 7.516326 2.120713 \n", "2 577 90.637782 35.324458 21.759434 1.623409 \n", "3 270 95.640741 20.229431 17.669052 1.144908 \n", "4 153 84.908497 15.683703 12.420475 1.262730 \n", ".. ... ... ... ... ... \n", "42 315 91.133333 20.927095 19.209283 1.089426 \n", "43 206 94.262136 23.381879 11.669668 2.003646 \n", "44 45 68.377778 9.406371 6.276445 1.498678 \n", "45 33 76.727273 10.724275 4.174568 2.568955 \n", "46 16 76.750000 7.745967 2.783882 2.782433 \n", "\n", " file_name \n", "0 BBBC007_20P1_POS0007_D_1UL \n", "1 BBBC007_20P1_POS0007_D_1UL \n", "2 BBBC007_20P1_POS0007_D_1UL \n", "3 BBBC007_20P1_POS0007_D_1UL \n", "4 BBBC007_20P1_POS0007_D_1UL \n", ".. ... \n", "42 BBBC007_20P1_POS0010_D_1UL \n", "43 BBBC007_20P1_POS0010_D_1UL \n", "44 BBBC007_20P1_POS0010_D_1UL \n", "45 BBBC007_20P1_POS0010_D_1UL \n", "46 BBBC007_20P1_POS0010_D_1UL \n", "\n", "[111 rows x 6 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "big_df = pd.concat([df1, df2], axis=0)\n", "big_df" ] }, { "cell_type": "markdown", "id": "bf553c51-dd29-4d5b-8e72-dcf3a262c418", "metadata": {}, "source": [ "Now, when we can safely distinguish the source of each row." ] }, { "cell_type": "markdown", "id": "78a139a0", "metadata": {}, "source": [ "# Handling NaN values\n", "When analysing tabular data, sometimes table cells are present that does not contain data. In Python this typically means the value is _Not a Number_ ([NaN](https://en.wikipedia.org/wiki/NaN)). We cannot assume these values are `0` or `-1` or any other value because that would distort descriptive statistics, for example. We need to deal with these NaN entries differently and this notebook will introduce how.\n", "\n", "To get a first view where NaNs play a role, we load again an example table and sort it." ] }, { "cell_type": "code", "execution_count": 27, "id": "189e76b0-0cc2-4baa-8290-e5a06ab2d70b", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd " ] }, { "cell_type": "code", "execution_count": 28, "id": "4e617db1-ac10-4f69-9ba9-97913d517a15", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AreaMeanStdDevMinMaxXYXMYMMajorMinorAngle%AreaType
1902755.0859.928235.458539.03880.0108.710302.158110.999300.247144.47524.28039.318100C
812295.0765.23996.545558.01431.0375.003134.888374.982135.35965.76944.429127.247100B
2091821.0847.761122.074600.01510.0287.795321.115288.074321.82455.87941.492112.124100A
2521528.0763.77783.183572.01172.0191.969385.944192.487385.69763.15030.80834.424100B
2651252.0793.371117.139579.01668.0262.071394.497262.268394.32660.15426.50050.147100A
.............................................
1131.0587.0000.000587.0587.0399.500117.500399.500117.5001.1281.1280.000100A
3101.0866.0000.000866.0866.0343.500408.500343.500408.5001.1281.1280.000100A
2191.0763.0000.000763.0763.0411.500296.500411.500296.5001.1281.1280.000100A
3NaNNaNNaN608.0964.0NaNNaNNaN7.6657.359NaN101.121100A
5NaNNaN69.438566.0792.0348.5007.500NaN7.508NaN3.088NaN100A
\n", "

391 rows × 14 columns

\n", "
" ], "text/plain": [ " Area Mean StdDev Min Max X Y XM \\\n", " \n", "190 2755.0 859.928 235.458 539.0 3880.0 108.710 302.158 110.999 \n", "81 2295.0 765.239 96.545 558.0 1431.0 375.003 134.888 374.982 \n", "209 1821.0 847.761 122.074 600.0 1510.0 287.795 321.115 288.074 \n", "252 1528.0 763.777 83.183 572.0 1172.0 191.969 385.944 192.487 \n", "265 1252.0 793.371 117.139 579.0 1668.0 262.071 394.497 262.268 \n", ".. ... ... ... ... ... ... ... ... \n", "113 1.0 587.000 0.000 587.0 587.0 399.500 117.500 399.500 \n", "310 1.0 866.000 0.000 866.0 866.0 343.500 408.500 343.500 \n", "219 1.0 763.000 0.000 763.0 763.0 411.500 296.500 411.500 \n", "3 NaN NaN NaN 608.0 964.0 NaN NaN NaN \n", "5 NaN NaN 69.438 566.0 792.0 348.500 7.500 NaN \n", "\n", " YM Major Minor Angle %Area Type \n", " \n", "190 300.247 144.475 24.280 39.318 100 C \n", "81 135.359 65.769 44.429 127.247 100 B \n", "209 321.824 55.879 41.492 112.124 100 A \n", "252 385.697 63.150 30.808 34.424 100 B \n", "265 394.326 60.154 26.500 50.147 100 A \n", ".. ... ... ... ... ... ... \n", "113 117.500 1.128 1.128 0.000 100 A \n", "310 408.500 1.128 1.128 0.000 100 A \n", "219 296.500 1.128 1.128 0.000 100 A \n", "3 7.665 7.359 NaN 101.121 100 A \n", "5 7.508 NaN 3.088 NaN 100 A \n", "\n", "[391 rows x 14 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')\n", "data.sort_values(by = \"Area\", ascending=False)" ] }, { "cell_type": "markdown", "id": "d4532812", "metadata": {}, "source": [ "As you can see, there are rows at the bottom containing NaNs. These are at the bottom of the table because pandas cannot sort them." ] }, { "cell_type": "markdown", "id": "2bc0c85e", "metadata": {}, "source": [ "A quick check if there are NaNs anywhere in a DataFrame is an important quality check and good scientific practice." ] }, { "cell_type": "code", "execution_count": 29, "id": "5c152771", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "markdown", "id": "87f14e62-b3e5-45f6-9c02-9820b21bd929", "metadata": {}, "source": [ "We can also get some deeper insights in which columns these NaN values are located." ] }, { "cell_type": "code", "execution_count": 30, "id": "7f6b5eb4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Area 2\n", "Mean 5\n", "StdDev 3\n", "Min 3\n", "Max 3\n", "X 2\n", "Y 3\n", "XM 3\n", "YM 5\n", "Major 8\n", "Minor 3\n", "Angle 1\n", "%Area 0\n", "Type 0\n", "dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().sum()" ] }, { "cell_type": "markdown", "id": "b27eeccd", "metadata": {}, "source": [ "For getting a glimpse about if we can further process that tabel, we may want to know the percentage of NaNs for each column?" 
] }, { "cell_type": "code", "execution_count": 31, "id": "a9297b56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Major 2.046036\n", "Mean 1.278772\n", "YM 1.278772\n", "StdDev 0.767263\n", "Min 0.767263\n", "Max 0.767263\n", "Y 0.767263\n", "XM 0.767263\n", "Minor 0.767263\n", "Area 0.511509\n", "X 0.511509\n", "Angle 0.255754\n", "%Area 0.000000\n", "Type 0.000000\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().mean().sort_values(ascending=False) *100" ] }, { "cell_type": "markdown", "id": "f605facf", "metadata": {}, "source": [ "# Dropping rows that contain NaNs\n", "Depending on what kind of data analysis should be performed, it might make sense to just ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.\n", "\n", "It depends on your project and what is important or not for the analysis. Its not an easy answer." ] }, { "cell_type": "code", "execution_count": 32, "id": "25bac0a9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AreaMeanStdDevMinMaxXYXMYMMajorMinorAngle%AreaType
118.0730.389103.354592.0948.0435.0004.722434.9624.6975.9873.828168.425100A
2126.0718.33390.367556.01046.0388.0878.683388.1838.68716.5599.688175.471100A
468.0686.98561.169571.0880.0126.1478.809126.1928.81115.1365.720168.133100A
6669.0697.16472.863539.0957.0471.69626.253471.69426.19736.65623.237124.340100A
75.0658.60049.161607.0710.028.3008.10028.2848.1033.1442.025161.565100A
.............................................
38394.0746.61785.198550.01021.0194.032498.223194.014498.23917.2956.92052.720100B
387152.0801.599111.328582.01263.0348.487497.632348.451497.67517.77310.88911.829100A
38960.0758.03377.309601.0947.0259.000499.300258.990499.2899.4768.06290.000100A
39012.0714.83367.294551.0785.0240.167498.167240.179498.1484.6063.317168.690100A
39123.0695.04367.356611.0846.049.891503.02249.882502.9796.4544.53773.243100A
\n", "

374 rows × 14 columns

\n", "
" ], "text/plain": [ " Area Mean StdDev Min Max X Y XM \\\n", " \n", "1 18.0 730.389 103.354 592.0 948.0 435.000 4.722 434.962 \n", "2 126.0 718.333 90.367 556.0 1046.0 388.087 8.683 388.183 \n", "4 68.0 686.985 61.169 571.0 880.0 126.147 8.809 126.192 \n", "6 669.0 697.164 72.863 539.0 957.0 471.696 26.253 471.694 \n", "7 5.0 658.600 49.161 607.0 710.0 28.300 8.100 28.284 \n", ".. ... ... ... ... ... ... ... ... \n", "383 94.0 746.617 85.198 550.0 1021.0 194.032 498.223 194.014 \n", "387 152.0 801.599 111.328 582.0 1263.0 348.487 497.632 348.451 \n", "389 60.0 758.033 77.309 601.0 947.0 259.000 499.300 258.990 \n", "390 12.0 714.833 67.294 551.0 785.0 240.167 498.167 240.179 \n", "391 23.0 695.043 67.356 611.0 846.0 49.891 503.022 49.882 \n", "\n", " YM Major Minor Angle %Area Type \n", " \n", "1 4.697 5.987 3.828 168.425 100 A \n", "2 8.687 16.559 9.688 175.471 100 A \n", "4 8.811 15.136 5.720 168.133 100 A \n", "6 26.197 36.656 23.237 124.340 100 A \n", "7 8.103 3.144 2.025 161.565 100 A \n", ".. ... ... ... ... ... ... \n", "383 498.239 17.295 6.920 52.720 100 B \n", "387 497.675 17.773 10.889 11.829 100 A \n", "389 499.289 9.476 8.062 90.000 100 A \n", "390 498.148 4.606 3.317 168.690 100 A \n", "391 502.979 6.454 4.537 73.243 100 A \n", "\n", "[374 rows x 14 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_no_nan = data.dropna(how=\"any\")\n", "data_no_nan " ] }, { "cell_type": "markdown", "id": "0d857531-4c88-450b-ae62-9008388088ba", "metadata": {}, "source": [ "On the bottom of that table, you can see that it still contains 374 of the original 391 columns. If you remove rows, you should document in your later scientific publication, how many out of the total number of datasets were analysed.\n", "\n", "We can now also check again if NaNs are present." ] }, { "cell_type": "code", "execution_count": 33, "id": "f09a2106", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_no_nan.isnull().values.any()" ] }, { "cell_type": "markdown", "id": "ab941145-75fe-4d5a-80fb-ab85672d0a86", "metadata": {}, "source": [ "## Determining rows that contain NaNs\n", "In some use-cases it might be useful to have a list of row-indices where there are NaN values." ] }, { "cell_type": "code", "execution_count": 34, "id": "84addf53-beb6-4955-bad9-fe52517f64d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
002.02.0
113.03.0
222NaN44.0
3212.02.0
41212.0NaN
52322.052.0
\n", "
" ], "text/plain": [ " A B C\n", "0 0 2.0 2.0\n", "1 1 3.0 3.0\n", "2 22 NaN 44.0\n", "3 21 2.0 2.0\n", "4 12 12.0 NaN\n", "5 23 22.0 52.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2 = {\n", " 'A': [0, 1, 22, 21, 12, 23],\n", " 'B': [2, 3, np.nan, 2, 12, 22],\n", " 'C': [2, 3, 44, 2, np.nan, 52],\n", "}\n", "\n", "table = pd.DataFrame(data2)\n", "table" ] }, { "cell_type": "code", "execution_count": 35, "id": "7b3b3292-0864-484e-b2d4-dab5c1a0b6ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, False, True, False])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.max(table.isnull().values, axis=1)" ] }, { "cell_type": "markdown", "id": "51e6f4d9-a000-4b05-9bcc-dc52db91658c", "metadata": {}, "source": [ "# Tidy-Data" ] }, { "cell_type": "markdown", "id": "248c6fda-e47c-46c2-957d-cf4bdf4d141d", "metadata": {}, "source": [ "Hadley Wickham wrote a [great article](https://www.jstatsoft.org/article/view/v059i10) in favor of “tidy data.” Tidy data frames follow the rules:\n", "\n", "- Each variable is a column.\n", "\n", "- Each observation is a row.\n", "\n", "- Each type of observation has its own separate data frame.\n", "\n", "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats." ] }, { "cell_type": "markdown", "id": "774c1a89-2f91-404b-8f74-284acabd3bcb", "metadata": {}, "source": [ "Let's import a saved table with measurements. Is this table tidy?" ] }, { "cell_type": "code", "execution_count": 36, "id": "5c88af81-7a31-42bb-8f12-69a89f2f1e0a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 37, "id": "504fd34a-9454-4fb3-9ccf-4b9964feada9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BeforeAfter
channel_1channel_2channel_1channel_2
013.25000021.00000015.13798442.022776
144.95454524.31818243.32883648.661610
213.59090918.77272711.68599537.926184
385.03225819.74193586.03146140.396353
410.73170725.26829310.07542151.471865
594.62500036.45000095.18090073.347843
689.83673534.69387889.85786469.902829
7100.26190534.904762101.98985270.156432
829.61538552.11538531.516654104.525198
915.86842124.92105316.08693250.563301
1012.47500025.45000011.52992451.381594
1187.87500028.05000089.74552256.543107
1258.80000022.60000059.64622945.215405
1391.06122440.36734789.93589381.326111
1423.500000117.33333321.676993235.067654
1582.56666734.56666784.09773569.820702
1636.12000029.60000037.68867659.870177
1770.68750033.84375072.56911268.493363
18102.02127733.297872100.41974667.379506
1972.318182103.90909170.843134207.956510
2018.10000029.16666717.86520158.361239
215.21739136.3478266.96134673.286439
2219.92592672.81481518.607102145.900739
2326.67307757.40384627.611368115.347217
2413.34000030.40000014.16054361.225962
2515.02857138.40000014.52996377.490249
\n", "
" ], "text/plain": [ " Before After \n", " channel_1 channel_2 channel_1 channel_2\n", "0 13.250000 21.000000 15.137984 42.022776\n", "1 44.954545 24.318182 43.328836 48.661610\n", "2 13.590909 18.772727 11.685995 37.926184\n", "3 85.032258 19.741935 86.031461 40.396353\n", "4 10.731707 25.268293 10.075421 51.471865\n", "5 94.625000 36.450000 95.180900 73.347843\n", "6 89.836735 34.693878 89.857864 69.902829\n", "7 100.261905 34.904762 101.989852 70.156432\n", "8 29.615385 52.115385 31.516654 104.525198\n", "9 15.868421 24.921053 16.086932 50.563301\n", "10 12.475000 25.450000 11.529924 51.381594\n", "11 87.875000 28.050000 89.745522 56.543107\n", "12 58.800000 22.600000 59.646229 45.215405\n", "13 91.061224 40.367347 89.935893 81.326111\n", "14 23.500000 117.333333 21.676993 235.067654\n", "15 82.566667 34.566667 84.097735 69.820702\n", "16 36.120000 29.600000 37.688676 59.870177\n", "17 70.687500 33.843750 72.569112 68.493363\n", "18 102.021277 33.297872 100.419746 67.379506\n", "19 72.318182 103.909091 70.843134 207.956510\n", "20 18.100000 29.166667 17.865201 58.361239\n", "21 5.217391 36.347826 6.961346 73.286439\n", "22 19.925926 72.814815 18.607102 145.900739\n", "23 26.673077 57.403846 27.611368 115.347217\n", "24 13.340000 30.400000 14.160543 61.225962\n", "25 15.028571 38.400000 14.529963 77.490249" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')\n", "df" ] }, { "cell_type": "markdown", "id": "e497f862-4f03-4d58-9b92-0dea8d750f16", "metadata": {}, "source": [ "The most useful function for tidying data is [pd.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)." ] }, { "cell_type": "code", "execution_count": 38, "id": "0022b484-fd53-4c82-be11-9a3fe2261497", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variable_0variable_1value
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ], "text/plain": [ " variable_0 variable_1 value\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt()" ] }, { "cell_type": "markdown", "id": "dd24af67-9f6e-4f8c-a432-37eb146e2bc9", "metadata": {}, "source": [ "We can specify names for the value and for the variables. In this case, our measurements are of intensity and our variables are Intervention (before or after) and channel." ] }, { "cell_type": "code", "execution_count": 39, "id": "9ffaf527-78d8-4a9c-89e0-19858b877c2e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TreatmentChannelintensity
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ], "text/plain": [ " Treatment Channel intensity\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy = df.melt(value_name='intensity', var_name=['Treatment', 'Channel'])\n", "df_tidy" ] }, { "cell_type": "markdown", "id": "d95d7c92-95fe-48a5-8554-4b366830f351", "metadata": {}, "source": [ "It may not look better for interpreting, but it becomes easier to manipulate, because now we can more easily mask by columns. Here we select intensity measurement rows for \"channel_2\" and \"After\" intervention." ] }, { "cell_type": "code", "execution_count": 40, "id": "512f335d-61f7-4dc9-9ff1-239802b2f9da", "metadata": {}, "outputs": [], "source": [ "row_mask = (df_tidy['Channel'] == 'channel_2') & (df_tidy['Treatment'] == 'After')" ] }, { "cell_type": "code", "execution_count": 41, "id": "88c4b6bb-81b7-48e8-ada8-5f2acf2e4888", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TreatmentChannelintensity
26Beforechannel_221.000000
27Beforechannel_224.318182
28Beforechannel_218.772727
29Beforechannel_219.741935
30Beforechannel_225.268293
31Beforechannel_236.450000
32Beforechannel_234.693878
33Beforechannel_234.904762
34Beforechannel_252.115385
35Beforechannel_224.921053
36Beforechannel_225.450000
37Beforechannel_228.050000
38Beforechannel_222.600000
39Beforechannel_240.367347
40Beforechannel_2117.333333
41Beforechannel_234.566667
42Beforechannel_229.600000
43Beforechannel_233.843750
44Beforechannel_233.297872
45Beforechannel_2103.909091
46Beforechannel_229.166667
47Beforechannel_236.347826
48Beforechannel_272.814815
49Beforechannel_257.403846
50Beforechannel_230.400000
51Beforechannel_238.400000
78Afterchannel_242.022776
79Afterchannel_248.661610
80Afterchannel_237.926184
81Afterchannel_240.396353
82Afterchannel_251.471865
83Afterchannel_273.347843
84Afterchannel_269.902829
85Afterchannel_270.156432
86Afterchannel_2104.525198
87Afterchannel_250.563301
88Afterchannel_251.381594
89Afterchannel_256.543107
90Afterchannel_245.215405
91Afterchannel_281.326111
92Afterchannel_2235.067654
93Afterchannel_269.820702
94Afterchannel_259.870177
95Afterchannel_268.493363
96Afterchannel_267.379506
97Afterchannel_2207.956510
98Afterchannel_258.361239
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "
" ], "text/plain": [ " Treatment Channel intensity\n", "26 Before channel_2 21.000000\n", "27 Before channel_2 24.318182\n", "28 Before channel_2 18.772727\n", "29 Before channel_2 19.741935\n", "30 Before channel_2 25.268293\n", "31 Before channel_2 36.450000\n", "32 Before channel_2 34.693878\n", "33 Before channel_2 34.904762\n", "34 Before channel_2 52.115385\n", "35 Before channel_2 24.921053\n", "36 Before channel_2 25.450000\n", "37 Before channel_2 28.050000\n", "38 Before channel_2 22.600000\n", "39 Before channel_2 40.367347\n", "40 Before channel_2 117.333333\n", "41 Before channel_2 34.566667\n", "42 Before channel_2 29.600000\n", "43 Before channel_2 33.843750\n", "44 Before channel_2 33.297872\n", "45 Before channel_2 103.909091\n", "46 Before channel_2 29.166667\n", "47 Before channel_2 36.347826\n", "48 Before channel_2 72.814815\n", "49 Before channel_2 57.403846\n", "50 Before channel_2 30.400000\n", "51 Before channel_2 38.400000\n", "78 After channel_2 42.022776\n", "79 After channel_2 48.661610\n", "80 After channel_2 37.926184\n", "81 After channel_2 40.396353\n", "82 After channel_2 51.471865\n", "83 After channel_2 73.347843\n", "84 After channel_2 69.902829\n", "85 After channel_2 70.156432\n", "86 After channel_2 104.525198\n", "87 After channel_2 50.563301\n", "88 After channel_2 51.381594\n", "89 After channel_2 56.543107\n", "90 After channel_2 45.215405\n", "91 After channel_2 81.326111\n", "92 After channel_2 235.067654\n", "93 After channel_2 69.820702\n", "94 After channel_2 59.870177\n", "95 After channel_2 68.493363\n", "96 After channel_2 67.379506\n", "97 After channel_2 207.956510\n", "98 After channel_2 58.361239\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy[df_tidy['Channel'] == 'channel_2']" ] }, { "cell_type": "markdown", "id": "731039f0-a993-430b-996a-cad1b809ac02", "metadata": {}, "source": [ "# Split-Apply-Combine" ] }, { "cell_type": "markdown", "id": "f34f881f-7f9b-4e51-9ee7-8c50b877390c", "metadata": {}, "source": [ "Assume the following dataframe. It contains measurements of objects from 2 different files." ] }, { "cell_type": "code", "execution_count": 42, "id": "8e6ddf36-5f87-415e-a983-364168dee4c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratiofile_name
0FalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalse
.....................
106FalseFalseFalseFalseFalseFalse
107FalseFalseFalseFalseFalseFalse
108FalseFalseFalseFalseFalseFalse
109FalseFalseFalseFalseFalseFalse
110FalseFalseFalseFalseFalseFalse
\n", "

111 rows × 6 columns

\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length \\\n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", ".. ... ... ... ... \n", "106 False False False False \n", "107 False False False False \n", "108 False False False False \n", "109 False False False False \n", "110 False False False False \n", "\n", " aspect_ratio file_name \n", "0 False False \n", "1 False False \n", "2 False False \n", "3 False False \n", "4 False False \n", ".. ... ... \n", "106 False False \n", "107 False False \n", "108 False False \n", "109 False False \n", "110 False False \n", "\n", "[111 rows x 6 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('../../data/BBBC007_analysis.csv')\n", "df.isnull()" ] }, { "cell_type": "markdown", "id": "d50bf409-5dd7-41b4-85a6-ae8d3189abb7", "metadata": {}, "source": [ "Let’s say we want to compute the median \"intensity_mean\" of round objects and also discriminate these objects per file. Ignoring for the second the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?\n", "\n", "- Split the data set up according to a 'round' criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.\n", "\n", "- Apply a median function to the intensity in these split data sets.\n", "\n", "- Combine the results of these averages on the split data set into a new, summary data set that contains the two classes (round and not round) and medians for each.\n", "\n", "We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in [this paper](https://www.jstatsoft.org/article/view/v040i01). It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.\n", "\n", "Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.\n", "\n", "Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it." ] }, { "cell_type": "markdown", "id": "ab72dfca-5ea2-4a8c-9b26-4e6afdef4c27", "metadata": {}, "source": [ "Before all that, we create a new column with our cirterion for roundness." ] }, { "cell_type": "code", "execution_count": 43, "id": "e6cd8e95-a0f7-47e5-adb7-2a57fd765f5f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratiofile_nameround
013996.54676317.50410410.2927701.70062120P1_POS0010_D_1ULFalse
136086.61388935.74680814.9831242.38580520P1_POS0010_D_1ULFalse
24391.48837212.9678844.3515732.98004520P1_POS0010_D_1ULFalse
314073.74285718.94050810.3144041.83631620P1_POS0010_D_1ULFalse
414489.37500013.63930813.4585321.01343220P1_POS0010_D_1ULTrue
........................
10630588.25245920.22653219.2442101.05104520P1_POS0007_D_1ULTrue
10759389.90556536.50837021.3653941.70876220P1_POS0007_D_1ULFalse
108289106.85121120.42780918.2214521.12108620P1_POS0007_D_1ULTrue
109277100.66426020.30796517.4329201.16492020P1_POS0007_D_1ULTrue
1104670.86956511.6488955.2980032.19873320P1_POS0007_D_1ULFalse
\n", "

111 rows × 7 columns

\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length aspect_ratio \\\n", "0 139 96.546763 17.504104 10.292770 1.700621 \n", "1 360 86.613889 35.746808 14.983124 2.385805 \n", "2 43 91.488372 12.967884 4.351573 2.980045 \n", "3 140 73.742857 18.940508 10.314404 1.836316 \n", "4 144 89.375000 13.639308 13.458532 1.013432 \n", ".. ... ... ... ... ... \n", "106 305 88.252459 20.226532 19.244210 1.051045 \n", "107 593 89.905565 36.508370 21.365394 1.708762 \n", "108 289 106.851211 20.427809 18.221452 1.121086 \n", "109 277 100.664260 20.307965 17.432920 1.164920 \n", "110 46 70.869565 11.648895 5.298003 2.198733 \n", "\n", " file_name round \n", "0 20P1_POS0010_D_1UL False \n", "1 20P1_POS0010_D_1UL False \n", "2 20P1_POS0010_D_1UL False \n", "3 20P1_POS0010_D_1UL False \n", "4 20P1_POS0010_D_1UL True \n", ".. ... ... \n", "106 20P1_POS0007_D_1UL True \n", "107 20P1_POS0007_D_1UL False \n", "108 20P1_POS0007_D_1UL True \n", "109 20P1_POS0007_D_1UL True \n", "110 20P1_POS0007_D_1UL False \n", "\n", "[111 rows x 7 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['round'] = df['aspect_ratio'] < 1.2\n", "df" ] }, { "cell_type": "markdown", "id": "464afbc4-287e-4ebc-9b29-dc80fe4f8805", "metadata": {}, "source": [ "## Aggregation: median intensity" ] }, { "cell_type": "markdown", "id": "ef04e5ff-02a2-4b72-aab6-e596dcb63eda", "metadata": {}, "source": [ "Let's start by grouping by 'round'." ] }, { "cell_type": "code", "execution_count": 44, "id": "37aaf674-fcc8-4ff1-96e2-216a0135fe10", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df.groupby('round')\n", "\n", "# Take a look\n", "grouped" ] }, { "cell_type": "markdown", "id": "00eac35a-50a6-4085-b94e-6c79592c5ef5", "metadata": {}, "source": [ "There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Typing `grouped.` and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want." ] }, { "cell_type": "code", "execution_count": 45, "id": "14983422-5c93-431e-a105-7eb90f739372", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
round
False270.092.78834521.45949515.8583241.412849
True291.0100.25600020.15554718.3522871.101700
\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length minor_axis_length \\\n", "round \n", "False 270.0 92.788345 21.459495 15.858324 \n", "True 291.0 100.256000 20.155547 18.352287 \n", "\n", " aspect_ratio \n", "round \n", "False 1.412849 \n", "True 1.101700 " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median = grouped.median(numeric_only = True)\n", "\n", "# Take a look\n", "df_median" ] }, { "cell_type": "markdown", "id": "cfd9ac1c-b987-444b-96cf-aad7acc82cab", "metadata": {}, "source": [ "Here the numeric_only option is set to disconsider for now calculating a median for the categorical 'file_name' column." ] }, { "cell_type": "markdown", "id": "339a3a54-7d67-427b-9a74-c62239040959", "metadata": {}, "source": [ "The outputted data frame has the medians of all quantities, including the intensities that we wanted. Note that this data frame has 'round' as the name of the row index. If we want to instead keep 'round' (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method." ] }, { "cell_type": "code", "execution_count": 46, "id": "103b54c4-24a1-487a-a169-4f4fc535ddf7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
roundareaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
0False270.092.78834521.45949515.8583241.412849
1True291.0100.25600020.15554718.3522871.101700
\n", "
" ], "text/plain": [ " round area intensity_mean major_axis_length minor_axis_length \\\n", "0 False 270.0 92.788345 21.459495 15.858324 \n", "1 True 291.0 100.256000 20.155547 18.352287 \n", "\n", " aspect_ratio \n", "0 1.412849 \n", "1 1.101700 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median = df_median.reset_index()\n", "df_median" ] }, { "cell_type": "markdown", "id": "588cf301-c855-4905-a565-caa307ec2dfe", "metadata": {}, "source": [ "We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups, round from dirst file, round from second file, not round from first file, and not round from second file. To do this, we simply pass in a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line." ] }, { "cell_type": "code", "execution_count": 47, "id": "e5880741-5332-431d-89af-0e332f8b984d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
roundfile_nameareaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
0False20P1_POS0007_D_1UL323.091.79679123.75522717.0724771.467410
1False20P1_POS0010_D_1UL237.093.26911320.41073714.8320351.353858
2True20P1_POS0007_D_1UL293.098.22779920.30796518.5990431.101700
3True20P1_POS0010_D_1UL277.5103.29982519.66233017.6807411.103133
\n", "
" ], "text/plain": [ " round file_name area intensity_mean major_axis_length \\\n", "0 False 20P1_POS0007_D_1UL 323.0 91.796791 23.755227 \n", "1 False 20P1_POS0010_D_1UL 237.0 93.269113 20.410737 \n", "2 True 20P1_POS0007_D_1UL 293.0 98.227799 20.307965 \n", "3 True 20P1_POS0010_D_1UL 277.5 103.299825 19.662330 \n", "\n", " minor_axis_length aspect_ratio \n", "0 17.072477 1.467410 \n", "1 14.832035 1.353858 \n", "2 18.599043 1.101700 \n", "3 17.680741 1.103133 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['round', 'file_name']).median().reset_index()" ] }, { "cell_type": "markdown", "id": "5839d877-f17a-4c5c-8a3a-a3e0d5d537c4", "metadata": {}, "source": [ "This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median." ] }, { "cell_type": "markdown", "id": "b6130950-ff18-4487-ab18-9fb05588d407", "metadata": {}, "source": [ "## Aggregating with custom functions" ] }, { "cell_type": "markdown", "id": "f4a4eeab-fd4e-43fc-b2d5-26d0c21a04e7", "metadata": {}, "source": [ "If we want to apply a function that is not built-in, we can also do so. For example, let's apply the coefficient of variance. We can define a generic function that calculates it like shown below." ] }, { "cell_type": "code", "execution_count": 48, "id": "e86a620d-9b0c-4255-b66a-4c3dea178dcd", "metadata": {}, "outputs": [], "source": [ "def coefficient_of_variance(data):\n", " \"\"\"Compute coefficient of variation from an array of data.\"\"\"\n", " return np.std(data) / np.mean(data)" ] }, { "cell_type": "markdown", "id": "05d72d03-18dd-409f-8034-2fb2641b02b7", "metadata": {}, "source": [ "Now we group it and apply it as an aggregating function. If there are other categorical variables, they should be masked." ] }, { "cell_type": "code", "execution_count": 49, "id": "a0c01a3c-6064-4764-8772-78b52a334229", "metadata": {}, "outputs": [], "source": [ "grouped = df.groupby(['round', 'file_name'])" ] }, { "cell_type": "code", "execution_count": 50, "id": "cbc2b527-219c-4e94-a153-a7b785cdce48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areaintensity_meanmajor_axis_lengthminor_axis_lengthaspect_ratio
roundfile_name
False20P1_POS0007_D_1UL0.5403990.1459560.3498570.2890630.243450
20P1_POS0010_D_1UL0.7651560.1435060.4476380.4027080.316206
True20P1_POS0007_D_1UL0.2487990.0996360.1452470.1476260.036950
20P1_POS0010_D_1UL0.2751200.1070080.1677220.1662140.043755
\n", "
" ], "text/plain": [ " area intensity_mean major_axis_length \\\n", "round file_name \n", "False 20P1_POS0007_D_1UL 0.540399 0.145956 0.349857 \n", " 20P1_POS0010_D_1UL 0.765156 0.143506 0.447638 \n", "True 20P1_POS0007_D_1UL 0.248799 0.099636 0.145247 \n", " 20P1_POS0010_D_1UL 0.275120 0.107008 0.167722 \n", "\n", " minor_axis_length aspect_ratio \n", "round file_name \n", "False 20P1_POS0007_D_1UL 0.289063 0.243450 \n", " 20P1_POS0010_D_1UL 0.402708 0.316206 \n", "True 20P1_POS0007_D_1UL 0.147626 0.036950 \n", " 20P1_POS0010_D_1UL 0.166214 0.043755 " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.agg(coefficient_of_variance)" ] }, { "cell_type": "markdown", "id": "e20ae5a5", "metadata": {}, "source": [ "# Exercises" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9db24255-2290-4e83-ac74-93d780378175", "metadata": {}, "source": [ "## Exercise 1\n" ] }, { "cell_type": "code", "execution_count": 51, "id": "87f226cd-721b-43e3-a31a-faed5e8a6733", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0areamean_intensityminor_axis_lengthmajor_axis_lengtheccentricityextentferet_diameter_maxequivalent_diameter_areabbox-0bbox-1bbox-2bbox-3
00422192.37914716.48855034.5667890.8789000.58611135.22783023.1798850113035
11182180.13186811.73607420.8026970.8256650.78787921.37755815.2226670531174
22661205.21633928.40950230.2084330.3399340.87433932.75667929.01053809528122
33437216.58581223.14399624.6061300.3395760.82608726.92582423.588253014423167
44476212.30252119.85288231.0751060.7693170.86388431.38471024.618327023729256
\n", "
" ], "text/plain": [ " Unnamed: 0 area mean_intensity minor_axis_length major_axis_length \\\n", "0 0 422 192.379147 16.488550 34.566789 \n", "1 1 182 180.131868 11.736074 20.802697 \n", "2 2 661 205.216339 28.409502 30.208433 \n", "3 3 437 216.585812 23.143996 24.606130 \n", "4 4 476 212.302521 19.852882 31.075106 \n", "\n", " eccentricity extent feret_diameter_max equivalent_diameter_area \\\n", "0 0.878900 0.586111 35.227830 23.179885 \n", "1 0.825665 0.787879 21.377558 15.222667 \n", "2 0.339934 0.874339 32.756679 29.010538 \n", "3 0.339576 0.826087 26.925824 23.588253 \n", "4 0.769317 0.863884 31.384710 24.618327 \n", "\n", " bbox-0 bbox-1 bbox-2 bbox-3 \n", "0 0 11 30 35 \n", "1 0 53 11 74 \n", "2 0 95 28 122 \n", "3 0 144 23 167 \n", "4 0 237 29 256 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_shape = pd.read_csv('../../data/blobs_statistics.csv')\n", "df_shape.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "75b9d1d9", "metadata": {}, "source": [ "Create a new data frame that only contains these columns:\n", "* `minor_axis_length`\n", "* `major_axis_length`\n", "* `aspect_ratio`" ] }, { "cell_type": "code", "execution_count": null, "id": "59916b36-cdd1-4af6-852e-25b0806ac11c", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "780acb1c-e8f0-4678-aa2c-ca497fe26872", "metadata": {}, "source": [ "## Exercise 2\n" ] }, { "cell_type": "code", "execution_count": 52, "id": "3022565a-6713-4fd0-ad78-1bb626e27077", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AreaMeanStdDevMinMaxXYXMYMMajorMinorAngle%AreaType
118.0730.389103.354592.0948.0435.0004.722434.9624.6975.9873.828168.425100A
2126.0718.33390.367556.01046.0388.0878.683388.1838.68716.5599.688175.471100A
3NaNNaNNaN608.0964.0NaNNaNNaN7.6657.359NaN101.121100A
468.0686.98561.169571.0880.0126.1478.809126.1928.81115.1365.720168.133100A
5NaNNaN69.438566.0792.0348.5007.500NaN7.508NaN3.088NaN100A
\n", "
" ], "text/plain": [ " Area Mean StdDev Min Max X Y XM YM \\\n", " \n", "1 18.0 730.389 103.354 592.0 948.0 435.000 4.722 434.962 4.697 \n", "2 126.0 718.333 90.367 556.0 1046.0 388.087 8.683 388.183 8.687 \n", "3 NaN NaN NaN 608.0 964.0 NaN NaN NaN 7.665 \n", "4 68.0 686.985 61.169 571.0 880.0 126.147 8.809 126.192 8.811 \n", "5 NaN NaN 69.438 566.0 792.0 348.500 7.500 NaN 7.508 \n", "\n", " Major Minor Angle %Area Type \n", " \n", "1 5.987 3.828 168.425 100 A \n", "2 16.559 9.688 175.471 100 A \n", "3 7.359 NaN 101.121 100 A \n", "4 15.136 5.720 168.133 100 A \n", "5 NaN 3.088 NaN 100 A " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')\n", "data.head()\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1667d7b9", "metadata": {}, "source": [ "Select the columns `Area` and `Mean`. Remove all rows that contain NaNs and count the remaining rows." ] }, { "cell_type": "code", "execution_count": null, "id": "6ef5bfa3", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "38e39be3", "metadata": {}, "source": [ "Now select the columns `Major` and `Minor`. Remove NaNs and count the remaining rows again. What do you conclude?" ] }, { "cell_type": "code", "execution_count": null, "id": "4e1fa0d0", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "88a608f2-0337-4ebe-bf1b-ff0b46def99e", "metadata": {}, "source": [ "## Exercise 3" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7e45e410-c69b-41b0-b11e-e939d35e80df", "metadata": {}, "source": [ "Group the tidy dataframe from above by 'Treatment' and 'Channel'" ] }, { "cell_type": "code", "execution_count": 53, "id": "1b8769e5-0287-4fcc-b0d1-8ab9884c2c9d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TreatmentChannelintensity
0Beforechannel_113.250000
1Beforechannel_144.954545
2Beforechannel_113.590909
3Beforechannel_185.032258
4Beforechannel_110.731707
............
99Afterchannel_273.286439
100Afterchannel_2145.900739
101Afterchannel_2115.347217
102Afterchannel_261.225962
103Afterchannel_277.490249
\n", "

104 rows × 3 columns

\n", "
" ], "text/plain": [ " Treatment Channel intensity\n", "0 Before channel_1 13.250000\n", "1 Before channel_1 44.954545\n", "2 Before channel_1 13.590909\n", "3 Before channel_1 85.032258\n", "4 Before channel_1 10.731707\n", ".. ... ... ...\n", "99 After channel_2 73.286439\n", "100 After channel_2 145.900739\n", "101 After channel_2 115.347217\n", "102 After channel_2 61.225962\n", "103 After channel_2 77.490249\n", "\n", "[104 rows x 3 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tidy" ] }, { "cell_type": "code", "execution_count": null, "id": "6eade53a", "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "2d09a3a3", "metadata": {}, "source": [ "Calculate the mean and standard deviation for the groups" ] }, { "cell_type": "code", "execution_count": null, "id": "7abd7722-9eb2-4acc-81b7-3f9dceeac73a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 ('dbnapari-arm64')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "f13a533d7f6b525f1b9e27d74dd8bd7ea06c4c95ae7aaf0187c7426193b5690e" } } }, "nbformat": 4, "nbformat_minor": 5 }