Introduction to working with DataFrames#

In basic Python, we often use dictionaries containing lists of our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce pandas DataFrames, the primary tool in the Python ecosystem for handling tabular data. Its central object, the DataFrame, is extremely useful for wrangling data and will make the next steps much easier.

import pandas as pd

Creating DataFrames from a dictionary of lists#

Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:

measurements = {
    "labels":      [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

df = pd.DataFrame(measurements)
df
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values elementwise from existing columns.

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels area minor_axis major_axis aspect_ratio
0 1 45 2 3 1.50
1 2 23 4 4 1.00
2 3 68 4 5 1.25

Saving data frames#

We can also save this table to continue working with it later. We choose to save it as a CSV file, where CSV stands for comma-separated values. This is a text file that is easily read into data structures in many programming languages.

df.to_csv("../../data/short_table.csv")
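
Note that by default to_csv also writes the row index as an extra column, which reappears as an unnamed column when the file is read back. If you prefer not to store it, a minimal sketch (with a hypothetical file name):

# hypothetical file name; index=False omits the row index column
df.to_csv("../../data/short_table_no_index.csv", index=False)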

You should generally store your data in such a format: not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files.

Creating DataFrames from lists of lists#

Sometimes we are confronted with data in the form of lists of lists. To make pandas interpret that form of data correctly, we also need to provide the headers in the same order as the lists.

row_header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],      # first row
    [45, 23, 68],   # second row
    [2, 4, 4],      # third row
    [3, 4, 5],      # fourth row
]
          
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, row_header)

# show it
data_frame
0 1 2
labels 1 2 3
area 45 23 68
minor_axis 2 4 4
major_axis 3 4 5

Or from a 2D numpy array#

import numpy as np

data = np.random.random((4,3))
column_header = ['area', 'minor_axis', 'major_axis']

pd.DataFrame(data, columns=column_header)
area minor_axis major_axis
0 0.800797 0.421862 0.956755
1 0.470318 0.431116 0.171390
2 0.210930 0.076568 0.199061
3 0.050682 0.346919 0.254751

As you can see, the table created from the lists of lists above is rotated. We can bring it into the usual form by transposing it:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Loading data frames#

Tables can be read from CSV files with pd.read_csv.

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 56 211 185.061611 14.522762 18.489138 0.618893 0.781481 18.973666 16.390654 232 39 250 54
57 57 78 185.230769 6.028638 17.579799 0.939361 0.722222 18.027756 9.965575 248 170 254 188
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

61 rows × 13 columns

That’s a bit too much information. We can use the .head() method of data frames to look at the first few rows (or the .tail() to check the last rows).

df_csv.head(3)
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
df_csv.tail(3)
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

We can also get column names with the DataFrame attribute columns.

df_csv.columns
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting rows and columns#

Ok, let’s extract a single column from the DataFrame, for example the area measurements:

df_csv['area']
0     422
1     182
2     661
3     437
4     476
     ... 
56    211
57     78
58     86
59     51
60     46
Name: area, Length: 61, dtype: int64

Notice that when it was printed, the index of the rows came along with it. That’s because a single column of a Pandas DataFrame is a Pandas Series.

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable; calling .copy() ensures that later modifications do not affect the original table.

df_analysis = df_csv[ ['area', 'mean_intensity'] ].copy()
df_analysis
area mean_intensity
0 422 192.379147
1 182 180.131868
2 661 205.216339
3 437 216.585812
4 476 212.302521
... ... ...
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373
60 46 175.304348

61 rows × 2 columns

This gives us a new table containing only the area and mean intensity measurements.

If we want to get a single row, the proper way of doing that is to use the .loc method:

df_csv.loc[ 2, ['area', 'mean_intensity']]
area              661.000000
mean_intensity    205.216339
Name: 2, dtype: float64

Note that following .loc, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers, and you should not count on them being integers.
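
For illustration, a minimal sketch with hypothetical string row labels:

# hypothetical table whose row labels are strings rather than integers
df_named = pd.DataFrame(
    {"area": [45, 23], "major_axis": [3, 4]},
    index=["cell_a", "cell_b"],
)

# .loc works with whatever labels the index holds
df_named.loc["cell_b", "area"]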

In case you really want to access elements in a dataframe by integer indices, like in a numpy array, you can use the .iloc method.

df_csv.iloc[2:4, 1:3]
area mean_intensity
2 661 205.216339
3 437 216.585812

The downside is that the code becomes less explicit when column names are absent.

In practice you will almost never use row indices directly, but rather boolean indexing, also called masking.

Selecting parts of tables based on criteria#

In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes called masking.

Suppose we want the rows for which df_analysis["area"] > 50. We can plug this expression directly into the indexing brackets.

df_analysis_filtered  = df_analysis[ df_analysis["area"] > 50]
df_analysis_filtered.tail()
area mean_intensity
55 280 189.800000
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373

Adding new columns#

In Pandas, it is very easy to generate new columns from existing ones. We simply assign the result of an operation between existing columns to a new column name.

df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
area mean_intensity total_intensity
0 422 192.379147 81184.0
1 182 180.131868 32784.0
2 661 205.216339 135648.0
3 437 216.585812 94648.0
4 476 212.302521 101056.0
... ... ... ...
56 211 185.061611 39048.0
57 78 185.230769 14448.0
58 86 183.720930 15800.0
59 51 190.431373 9712.0
60 46 175.304348 8064.0

61 rows × 3 columns

Appending tables#

When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
    })
table1
label circularity elongation
0 1 0.3 2.3
1 2 0.5 3.4
2 3 0.7 1.2
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    })
table2
label area skewness
0 3 22 0.5
1 2 32 0.6
2 1 25 0.3
3 4 18 0.3

Combining the same measurements from different image files#

When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.

Let’s open two tables generated by applying the same workflow to different files.

df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730
df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 139 96.546763 17.504104 10.292770 1.700621
1 360 86.613889 35.746808 14.983124 2.385805
2 43 91.488372 12.967884 4.351573 2.980045
3 140 73.742857 18.940508 10.314404 1.836316
4 144 89.375000 13.639308 13.458532 1.013432

Because we know we have the same columns, we can concatenate them into a single big table.

big_df = pd.concat([df1, df2], axis=0)
big_df
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730
... ... ... ... ... ...
42 315 91.133333 20.927095 19.209283 1.089426
43 206 94.262136 23.381879 11.669668 2.003646
44 45 68.377778 9.406371 6.276445 1.498678
45 33 76.727273 10.724275 4.174568 2.568955
46 16 76.750000 7.745967 2.783882 2.782433

111 rows × 5 columns

The problem is that we lose their source identity. An easy fix is to add a new column with the file name before concatenating them. This will facilitate splitting them back apart and plotting later on.

When we give a single value to a new column, it is assigned to all rows.

df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'

df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'
big_df = pd.concat([df1, df2], axis=0)
big_df
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name
0 256 93.250000 19.995017 17.021559 1.174688 BBBC007_20P1_POS0007_D_1UL
1 90 82.488889 15.939969 7.516326 2.120713 BBBC007_20P1_POS0007_D_1UL
2 577 90.637782 35.324458 21.759434 1.623409 BBBC007_20P1_POS0007_D_1UL
3 270 95.640741 20.229431 17.669052 1.144908 BBBC007_20P1_POS0007_D_1UL
4 153 84.908497 15.683703 12.420475 1.262730 BBBC007_20P1_POS0007_D_1UL
... ... ... ... ... ... ...
42 315 91.133333 20.927095 19.209283 1.089426 BBBC007_20P1_POS0010_D_1UL
43 206 94.262136 23.381879 11.669668 2.003646 BBBC007_20P1_POS0010_D_1UL
44 45 68.377778 9.406371 6.276445 1.498678 BBBC007_20P1_POS0010_D_1UL
45 33 76.727273 10.724275 4.174568 2.568955 BBBC007_20P1_POS0010_D_1UL
46 16 76.750000 7.745967 2.783882 2.782433 BBBC007_20P1_POS0010_D_1UL

111 rows × 6 columns

Now we can safely distinguish the source of each row.
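
The same pattern scales to whole folders. A minimal sketch, assuming a hypothetical folder result_tables containing one CSV file per analysed image:

from pathlib import Path

# collect one table per file, remembering where each came from
tables = []
for csv_path in sorted(Path("result_tables").glob("*.csv")):
    table = pd.read_csv(csv_path)
    table["file_name"] = csv_path.stem  # keep track of the source file
    tables.append(table)

big_df = pd.concat(tables, axis=0)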

Handling NaN values#

When analysing tabular data, sometimes table cells are present that do not contain data. In Python this typically means the value is Not a Number (NaN). We cannot assume these values are 0 or -1 or any other value, because that would distort descriptive statistics, for example. We need to deal with these NaN entries differently, and this notebook will introduce how.
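
To illustrate the distortion, a minimal sketch with made-up measurements:

# made-up example: pandas skips NaN when computing statistics
example = pd.Series([10.0, 20.0, np.nan, 30.0])

example.mean()            # 20.0, the NaN is ignored
example.fillna(0).mean()  # 15.0, pretending the NaN were 0 biases the result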

To get a first view where NaNs play a role, we load again an example table and sort it.

import numpy as np
import pandas as pd 
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.sort_values(by = "Area", ascending=False)
Area Mean StdDev Min Max X Y XM YM Major Minor Angle %Area Type
190 2755.0 859.928 235.458 539.0 3880.0 108.710 302.158 110.999 300.247 144.475 24.280 39.318 100 C
81 2295.0 765.239 96.545 558.0 1431.0 375.003 134.888 374.982 135.359 65.769 44.429 127.247 100 B
209 1821.0 847.761 122.074 600.0 1510.0 287.795 321.115 288.074 321.824 55.879 41.492 112.124 100 A
252 1528.0 763.777 83.183 572.0 1172.0 191.969 385.944 192.487 385.697 63.150 30.808 34.424 100 B
265 1252.0 793.371 117.139 579.0 1668.0 262.071 394.497 262.268 394.326 60.154 26.500 50.147 100 A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 1.0 587.000 0.000 587.0 587.0 399.500 117.500 399.500 117.500 1.128 1.128 0.000 100 A
310 1.0 866.000 0.000 866.0 866.0 343.500 408.500 343.500 408.500 1.128 1.128 0.000 100 A
219 1.0 763.000 0.000 763.0 763.0 411.500 296.500 411.500 296.500 1.128 1.128 0.000 100 A
3 NaN NaN NaN 608.0 964.0 NaN NaN NaN 7.665 7.359 NaN 101.121 100 A
5 NaN NaN 69.438 566.0 792.0 348.500 7.500 NaN 7.508 NaN 3.088 NaN 100 A

391 rows × 14 columns

As you can see, there are rows at the bottom containing NaNs. They end up at the bottom because pandas places values it cannot compare during sorting at the end.

Quickly checking whether there are NaNs anywhere in a DataFrame is an important quality check and good scientific practice.

data.isnull().values.any()
True

We can also get deeper insight into how many NaN values each column contains.

data.isnull().sum()
Area      2
Mean      5
StdDev    3
Min       3
Max       3
X         2
Y         3
XM        3
YM        5
Major     8
Minor     3
Angle     1
%Area     0
Type      0
dtype: int64

To get a sense of whether we can further process the table, we may want to know the percentage of NaNs in each column:

data.isnull().mean().sort_values(ascending=False) *100
Major     2.046036
Mean      1.278772
YM        1.278772
StdDev    0.767263
Min       0.767263
Max       0.767263
Y         0.767263
XM        0.767263
Minor     0.767263
Area      0.511509
X         0.511509
Angle     0.255754
%Area     0.000000
Type      0.000000
dtype: float64

Dropping rows that contain NaNs#

Depending on what kind of data analysis should be performed, it might make sense to just ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.

Which option is appropriate depends on your project and on what is important for the analysis. It is not an easy decision.

data_no_nan = data.dropna(how="any")
data_no_nan 
Area Mean StdDev Min Max X Y XM YM Major Minor Angle %Area Type
1 18.0 730.389 103.354 592.0 948.0 435.000 4.722 434.962 4.697 5.987 3.828 168.425 100 A
2 126.0 718.333 90.367 556.0 1046.0 388.087 8.683 388.183 8.687 16.559 9.688 175.471 100 A
4 68.0 686.985 61.169 571.0 880.0 126.147 8.809 126.192 8.811 15.136 5.720 168.133 100 A
6 669.0 697.164 72.863 539.0 957.0 471.696 26.253 471.694 26.197 36.656 23.237 124.340 100 A
7 5.0 658.600 49.161 607.0 710.0 28.300 8.100 28.284 8.103 3.144 2.025 161.565 100 A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
383 94.0 746.617 85.198 550.0 1021.0 194.032 498.223 194.014 498.239 17.295 6.920 52.720 100 B
387 152.0 801.599 111.328 582.0 1263.0 348.487 497.632 348.451 497.675 17.773 10.889 11.829 100 A
389 60.0 758.033 77.309 601.0 947.0 259.000 499.300 258.990 499.289 9.476 8.062 90.000 100 A
390 12.0 714.833 67.294 551.0 785.0 240.167 498.167 240.179 498.148 4.606 3.317 168.690 100 A
391 23.0 695.043 67.356 611.0 846.0 49.891 503.022 49.882 502.979 6.454 4.537 73.243 100 A

374 rows × 14 columns

At the bottom of that table, you can see that it still contains 374 of the original 391 rows. If you remove rows, you should document in your later scientific publication how many out of the total number of datasets were analysed.

We can now also check again if NaNs are present.

data_no_nan.isnull().values.any()
False
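
As mentioned above, the alternative is to drop the affected columns rather than the rows. A minimal sketch using dropna along the column axis:

# drop every column that contains at least one NaN; only fully populated columns remain
data_no_nan_columns = data.dropna(axis=1, how="any")
data_no_nan_columns.columns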

Determining rows that contain NaNs#

In some use cases it might be useful to have a list of row indices where NaN values occur.

data2 = {
    'A': [0, 1, 22, 21, 12, 23],
    'B': [2, 3, np.nan,  2,  12, 22],
    'C': [2, 3, 44,  2,  np.nan, 52],
}

table = pd.DataFrame(data2)
table
A B C
0 0 2.0 2.0
1 1 3.0 3.0
2 22 NaN 44.0
3 21 2.0 2.0
4 12 12.0 NaN
5 23 22.0 52.0
np.max(table.isnull().values, axis=1)
array([False, False,  True, False,  True, False])
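
The boolean array above marks which rows contain at least one NaN. To turn it into a list of row indices, one can combine such a mask with the DataFrame’s index; a minimal sketch:

# index labels of all rows that contain at least one NaN
nan_rows = table.index[table.isnull().any(axis=1)].tolist()
nan_rows  # [2, 4]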

Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

  • Each variable is a column.

  • Each observation is a row.

  • Each type of observation has its own separate data frame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a saved table with measurements. Is this table tidy?

import pandas as pd
df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df
Before After
channel_1 channel_2 channel_1 channel_2
0 13.250000 21.000000 15.137984 42.022776
1 44.954545 24.318182 43.328836 48.661610
2 13.590909 18.772727 11.685995 37.926184
3 85.032258 19.741935 86.031461 40.396353
4 10.731707 25.268293 10.075421 51.471865
5 94.625000 36.450000 95.180900 73.347843
6 89.836735 34.693878 89.857864 69.902829
7 100.261905 34.904762 101.989852 70.156432
8 29.615385 52.115385 31.516654 104.525198
9 15.868421 24.921053 16.086932 50.563301
10 12.475000 25.450000 11.529924 51.381594
11 87.875000 28.050000 89.745522 56.543107
12 58.800000 22.600000 59.646229 45.215405
13 91.061224 40.367347 89.935893 81.326111
14 23.500000 117.333333 21.676993 235.067654
15 82.566667 34.566667 84.097735 69.820702
16 36.120000 29.600000 37.688676 59.870177
17 70.687500 33.843750 72.569112 68.493363
18 102.021277 33.297872 100.419746 67.379506
19 72.318182 103.909091 70.843134 207.956510
20 18.100000 29.166667 17.865201 58.361239
21 5.217391 36.347826 6.961346 73.286439
22 19.925926 72.814815 18.607102 145.900739
23 26.673077 57.403846 27.611368 115.347217
24 13.340000 30.400000 14.160543 61.225962
25 15.028571 38.400000 14.529963 77.490249

The most useful function for tidying data is pd.melt.

df.melt()
variable_0 variable_1 value
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

We can specify names for the value and for the variables. In this case, our measurements are of intensity and our variables are Treatment (before or after) and Channel.

df_tidy = df.melt(value_name='intensity', var_name=['Treatment', 'Channel'])
df_tidy
Treatment Channel intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

It may not look better for interpretation, but it becomes easier to manipulate, because now we can more easily mask rows by column values. Here we select the intensity measurement rows for “channel_2”.

row_mask = df_tidy['Channel'] == 'channel_2'
df_tidy[row_mask]
Treatment Channel intensity
26 Before channel_2 21.000000
27 Before channel_2 24.318182
28 Before channel_2 18.772727
29 Before channel_2 19.741935
30 Before channel_2 25.268293
31 Before channel_2 36.450000
32 Before channel_2 34.693878
33 Before channel_2 34.904762
34 Before channel_2 52.115385
35 Before channel_2 24.921053
36 Before channel_2 25.450000
37 Before channel_2 28.050000
38 Before channel_2 22.600000
39 Before channel_2 40.367347
40 Before channel_2 117.333333
41 Before channel_2 34.566667
42 Before channel_2 29.600000
43 Before channel_2 33.843750
44 Before channel_2 33.297872
45 Before channel_2 103.909091
46 Before channel_2 29.166667
47 Before channel_2 36.347826
48 Before channel_2 72.814815
49 Before channel_2 57.403846
50 Before channel_2 30.400000
51 Before channel_2 38.400000
78 After channel_2 42.022776
79 After channel_2 48.661610
80 After channel_2 37.926184
81 After channel_2 40.396353
82 After channel_2 51.471865
83 After channel_2 73.347843
84 After channel_2 69.902829
85 After channel_2 70.156432
86 After channel_2 104.525198
87 After channel_2 50.563301
88 After channel_2 51.381594
89 After channel_2 56.543107
90 After channel_2 45.215405
91 After channel_2 81.326111
92 After channel_2 235.067654
93 After channel_2 69.820702
94 After channel_2 59.870177
95 After channel_2 68.493363
96 After channel_2 67.379506
97 After channel_2 207.956510
98 After channel_2 58.361239
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249
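
Conditions can be combined elementwise with the & operator, for example to keep only the “After” treatment of “channel_2”. A minimal sketch:

# combine two boolean masks; parentheses around each condition are required
row_mask = (df_tidy['Channel'] == 'channel_2') & (df_tidy['Treatment'] == 'After')
df_tidy[row_mask]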

Split-Apply-Combine#

Assume the following dataframe. It contains measurements of objects from 2 different files. As a first quality check, we verify with isnull() that it contains no missing values.

df = pd.read_csv('../../data/BBBC007_analysis.csv')
df.isnull()
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
... ... ... ... ... ... ...
106 False False False False False False
107 False False False False False False
108 False False False False False False
109 False False False False False False
110 False False False False False False

111 rows × 6 columns

Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per file. Ignoring for a second the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  • Split the data set up according to a ‘round’ criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.

  • Apply a median function to the intensity in these split data sets.

  • Combine the results of these medians on the split data sets into a new, summary data set that contains the two classes (round and not round) and the median for each.

We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to these new data sets, and combine the results into a new data frame.

Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.

Before all that, we create a new column with our criterion for roundness.

df['round'] = df['aspect_ratio'] < 1.2
df
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name round
0 139 96.546763 17.504104 10.292770 1.700621 20P1_POS0010_D_1UL False
1 360 86.613889 35.746808 14.983124 2.385805 20P1_POS0010_D_1UL False
2 43 91.488372 12.967884 4.351573 2.980045 20P1_POS0010_D_1UL False
3 140 73.742857 18.940508 10.314404 1.836316 20P1_POS0010_D_1UL False
4 144 89.375000 13.639308 13.458532 1.013432 20P1_POS0010_D_1UL True
... ... ... ... ... ... ... ...
106 305 88.252459 20.226532 19.244210 1.051045 20P1_POS0007_D_1UL True
107 593 89.905565 36.508370 21.365394 1.708762 20P1_POS0007_D_1UL False
108 289 106.851211 20.427809 18.221452 1.121086 20P1_POS0007_D_1UL True
109 277 100.664260 20.307965 17.432920 1.164920 20P1_POS0007_D_1UL True
110 46 70.869565 11.648895 5.298003 2.198733 20P1_POS0007_D_1UL False

111 rows × 7 columns

Aggregation: median intensity#

Let’s start by grouping by ‘round’.

grouped = df.groupby('round')

# Take a look
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11bb28e20>

There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Typing grouped. and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want.

df_median = grouped.median(numeric_only = True)

# Take a look
df_median
area intensity_mean major_axis_length minor_axis_length aspect_ratio
round
False 270.0 92.788345 21.459495 15.858324 1.412849
True 291.0 100.256000 20.155547 18.352287 1.101700

Here the numeric_only option is set so that the categorical ‘file_name’ column is excluded from the median calculation.

The resulting data frame has the medians of all quantities, including the intensities that we wanted. Note that this data frame has ‘round’ as the name of the row index. If we instead want to keep ‘round’ (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.

df_median = df_median.reset_index()
df_median
round area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False 270.0 92.788345 21.459495 15.858324 1.412849
1 True 291.0 100.256000 20.155547 18.352287 1.101700

We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups: round from the first file, round from the second file, not round from the first file, and not round from the second file. To do this, we simply pass a list of columns to df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line.

df.groupby(['round', 'file_name']).median().reset_index()
round file_name area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False 20P1_POS0007_D_1UL 323.0 91.796791 23.755227 17.072477 1.467410
1 False 20P1_POS0010_D_1UL 237.0 93.269113 20.410737 14.832035 1.353858
2 True 20P1_POS0007_D_1UL 293.0 98.227799 20.307965 18.599043 1.101700
3 True 20P1_POS0010_D_1UL 277.5 103.299825 19.662330 17.680741 1.103133

This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median.
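
Several summary statistics can be computed in one go by passing a list of function names to agg. A minimal sketch, reusing the grouping column from above:

# median and mean of selected columns, computed per group
df.groupby('round')[['area', 'intensity_mean']].agg(['median', 'mean'])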

Aggregating with custom functions#

If we want to apply a function that is not built in, we can also do so. For example, let’s apply the coefficient of variation. We can define a generic function that calculates it, as shown below.

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

Now we group the data and apply our custom function as an aggregating function. Note that other categorical columns, here ‘file_name’, should either be part of the grouping or be excluded, because the custom function only works on numeric data.

grouped = df.groupby(['round', 'file_name'])
grouped.agg(coefficient_of_variance)
area intensity_mean major_axis_length minor_axis_length aspect_ratio
round file_name
False 20P1_POS0007_D_1UL 0.540399 0.145956 0.349857 0.289063 0.243450
20P1_POS0010_D_1UL 0.765156 0.143506 0.447638 0.402708 0.316206
True 20P1_POS0007_D_1UL 0.248799 0.099636 0.145247 0.147626 0.036950
20P1_POS0010_D_1UL 0.275120 0.107008 0.167722 0.166214 0.043755

Exercises#

Exercise 1#

df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape.head()
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256

Create a new data frame that only contains these columns:

  • minor_axis_length

  • major_axis_length

  • aspect_ratio

Exercise 2#

data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.head()
Area Mean StdDev Min Max X Y XM YM Major Minor Angle %Area Type
1 18.0 730.389 103.354 592.0 948.0 435.000 4.722 434.962 4.697 5.987 3.828 168.425 100 A
2 126.0 718.333 90.367 556.0 1046.0 388.087 8.683 388.183 8.687 16.559 9.688 175.471 100 A
3 NaN NaN NaN 608.0 964.0 NaN NaN NaN 7.665 7.359 NaN 101.121 100 A
4 68.0 686.985 61.169 571.0 880.0 126.147 8.809 126.192 8.811 15.136 5.720 168.133 100 A
5 NaN NaN 69.438 566.0 792.0 348.500 7.500 NaN 7.508 NaN 3.088 NaN 100 A

Select the columns Area and Mean. Remove all rows that contain NaNs and count the remaining rows.

Now select the columns Major and Minor. Remove NaNs and count the remaining rows again. What do you conclude?

Exercise 3#

Group the tidy dataframe from above by ‘Treatment’ and ‘Channel’

df_tidy
Treatment Channel intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

Calculate the mean and standard deviation for the groups