Split-Apply-Combine#

Assume the following dataframe. It contains measurements of objects from 2 different files.

import pandas as pd
import numpy as np
df = pd.read_csv('../../data/BBBC007_analysis.csv')
df
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name
0 139 96.546763 17.504104 10.292770 1.700621 20P1_POS0010_D_1UL
1 360 86.613889 35.746808 14.983124 2.385805 20P1_POS0010_D_1UL
2 43 91.488372 12.967884 4.351573 2.980045 20P1_POS0010_D_1UL
3 140 73.742857 18.940508 10.314404 1.836316 20P1_POS0010_D_1UL
4 144 89.375000 13.639308 13.458532 1.013432 20P1_POS0010_D_1UL
... ... ... ... ... ... ...
106 305 88.252459 20.226532 19.244210 1.051045 20P1_POS0007_D_1UL
107 593 89.905565 36.508370 21.365394 1.708762 20P1_POS0007_D_1UL
108 289 106.851211 20.427809 18.221452 1.121086 20P1_POS0007_D_1UL
109 277 100.664260 20.307965 17.432920 1.164920 20P1_POS0007_D_1UL
110 46 70.869565 11.648895 5.298003 2.198733 20P1_POS0007_D_1UL

111 rows × 6 columns

Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per file. Ignoring for the second the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?

  • Split the data set up according to a ‘round’ criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.

  • Apply a median function to the intensity in these split data sets.

  • Combine the results of these averages on the split data set into a new, summary data set that contains the two classes (round and not round) and medians for each.

We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.

Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.

Before all that, we create a new column with our cirterion for roundness.

df['round'] = df['aspect_ratio'] < 1.2
df
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name round
0 139 96.546763 17.504104 10.292770 1.700621 20P1_POS0010_D_1UL False
1 360 86.613889 35.746808 14.983124 2.385805 20P1_POS0010_D_1UL False
2 43 91.488372 12.967884 4.351573 2.980045 20P1_POS0010_D_1UL False
3 140 73.742857 18.940508 10.314404 1.836316 20P1_POS0010_D_1UL False
4 144 89.375000 13.639308 13.458532 1.013432 20P1_POS0010_D_1UL True
... ... ... ... ... ... ... ...
106 305 88.252459 20.226532 19.244210 1.051045 20P1_POS0007_D_1UL True
107 593 89.905565 36.508370 21.365394 1.708762 20P1_POS0007_D_1UL False
108 289 106.851211 20.427809 18.221452 1.121086 20P1_POS0007_D_1UL True
109 277 100.664260 20.307965 17.432920 1.164920 20P1_POS0007_D_1UL True
110 46 70.869565 11.648895 5.298003 2.198733 20P1_POS0007_D_1UL False

111 rows × 7 columns

Aggregation: median intensity#

Let’s start by grouping by ‘round’.

grouped = df.groupby('round')

# Take a look
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001788C205760>

There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Typing grouped. and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want.

df_median = grouped.median(numeric_only = True)

# Take a look
df_median
area intensity_mean major_axis_length minor_axis_length aspect_ratio
round
False 270.0 92.788345 21.459495 15.858324 1.412849
True 291.0 100.256000 20.155547 18.352287 1.101700

Here the numeric_only option is set to disconsider for now calculating a median for the categorical ‘file_name’ column.

The outputted data frame has the medians of all quantities, including the intensities that we wanted. Note that this data frame has ‘round’ as the name of the row index. If we want to instead keep ‘round’ (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.

df_median.reset_index()
round area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False 270.0 92.788345 21.459495 15.858324 1.412849
1 True 291.0 100.256000 20.155547 18.352287 1.101700

Note, though, that this was not done in-place. df_median still has an index labeled ‘round’. If you want to update your data frame, you have to explicitly do so with an assignment operator.

df_median = df_median.reset_index()

We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups, round from dirst file, round from second file, not round from first file, and not round from second file. To do this, we simply pass in a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line.

df.groupby(['round', 'file_name']).median().reset_index()
round file_name area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 False 20P1_POS0007_D_1UL 323.0 91.796791 23.755227 17.072477 1.467410
1 False 20P1_POS0010_D_1UL 237.0 93.269113 20.410737 14.832035 1.353858
2 True 20P1_POS0007_D_1UL 293.0 98.227799 20.307965 18.599043 1.101700
3 True 20P1_POS0010_D_1UL 277.5 103.299825 19.662330 17.680741 1.103133

This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median.

Aggregating with custom functions#

If we want to apply a function that is not built-in, we can also do so. For example, let’s apply the coefficient of variance. We can define a generic function that calculates it like shown below.

def coefficient_of_variance(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

Now we group it and apply it as an aggregating function. If there are other categorical variables, they should be masked.

grouped = df.groupby(['round', 'file_name'])
grouped.agg(coefficient_of_variance)
area intensity_mean major_axis_length minor_axis_length aspect_ratio
round file_name
False 20P1_POS0007_D_1UL 0.540399 0.145956 0.349857 0.289063 0.243450
20P1_POS0010_D_1UL 0.765156 0.143506 0.447638 0.402708 0.316206
True 20P1_POS0007_D_1UL 0.248799 0.099636 0.145247 0.147626 0.036950
20P1_POS0010_D_1UL 0.275120 0.107008 0.167722 0.166214 0.043755

Exercise 1#

From the tidy dataframe from the previous notebook, group them by ‘Intervention’ and ‘Channel’ and display summary statistics for intensity.

df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df_tidy = df.melt(value_name='intensity', var_name=['Intervention', 'Channel'])
df_tidy
Intervention Channel intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

Exercise 2#

Calculate the skewness of the intensities grouped by channel and intervention.

Hint: use the function skew from scipy.stats and use the .agg method.