Split-Apply-Combine#
Assume the following dataframe. It contains measurements of objects from 2 different files.
import pandas as pd
import numpy as np
df = pd.read_csv('../../data/BBBC007_analysis.csv')
df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | |
---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL |
... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL |
111 rows × 6 columns
Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per file. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set up according to a ‘round’ criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.
Apply a median function to the intensity in these split data sets.
Combine the results of these medians on the split data sets into a new, summary data set that contains the two classes (round and not round) and the median for each.
We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.
Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.
Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.
Before all that, we create a new column with our criterion for roundness.
df['round'] = df['aspect_ratio'] < 1.2
df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | round | |
---|---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL | False |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL | False |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL | False |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL | False |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL | True |
... | ... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL | True |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL | False |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL | True |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL | True |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL | False |
111 rows × 7 columns
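As a quick sanity check after creating a boolean column like this, value_counts() shows how many objects fall into each class. A minimal sketch with synthetic aspect ratios (the values here are illustrative, not from the BBBC007 data):

```python
import pandas as pd

# Synthetic example: a threshold on aspect_ratio creates a boolean column
toy = pd.DataFrame({'aspect_ratio': [1.0, 1.1, 1.5, 2.0]})
toy['round'] = toy['aspect_ratio'] < 1.2

# Count how many objects fall into each class
counts = toy['round'].value_counts()
print(counts)
```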
Aggregation: median intensity#
Let’s start by grouping by ‘round’.
grouped = df.groupby('round')
# Take a look
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001788C205760>
There is not much to see in the resulting DataFrameGroupBy object, but there is a lot we can do with it. Typing grouped. and hitting tab will show you the many possibilities. For most of these, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want.
df_median = grouped.median(numeric_only = True)
# Take a look
df_median
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
round | |||||
False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
Here the numeric_only option is set so that the categorical ‘file_name’ column is excluded; a median is not defined for strings.
The resulting data frame has the medians of all numeric quantities, including the intensities that we wanted. Note that this data frame has ‘round’ as the name of the row index. If we want to instead keep ‘round’ (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.
df_median.reset_index()
round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|
0 | False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
1 | True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
Note, though, that this was not done in-place. df_median still has an index labeled ‘round’. If you want to update your data frame, you have to explicitly do so with an assignment operator.
df_median = df_median.reset_index()
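A minimal sketch (with synthetic data) illustrating that reset_index() returns a new data frame rather than modifying the original in place:

```python
import pandas as pd

# Synthetic grouped summary: 'key' becomes the row index after groupby
s = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})
summary = s.groupby('key').median()

flat = summary.reset_index()   # 'key' becomes an ordinary column in the copy
print(summary.index.name)      # the original still has 'key' as its index name
print(list(flat.columns))
```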
We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups: round objects from the first file, round from the second file, not round from the first file, and not round from the second file. To do this, we simply pass in a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line.
df.groupby(['round', 'file_name']).median().reset_index()
round | file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|---|
0 | False | 20P1_POS0007_D_1UL | 323.0 | 91.796791 | 23.755227 | 17.072477 | 1.467410 |
1 | False | 20P1_POS0010_D_1UL | 237.0 | 93.269113 | 20.410737 | 14.832035 | 1.353858 |
2 | True | 20P1_POS0007_D_1UL | 293.0 | 98.227799 | 20.307965 | 18.599043 | 1.101700 |
3 | True | 20P1_POS0010_D_1UL | 277.5 | 103.299825 | 19.662330 | 17.680741 | 1.103133 |
This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median.
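Aggregations are not limited to a single statistic: passing a list of function names to agg() computes several summaries per group in one call. A sketch with a small synthetic table (column names chosen to mirror the example above):

```python
import pandas as pd

# Synthetic data: two groups of a boolean 'round' column
toy = pd.DataFrame({
    'round': [True, True, False, False],
    'intensity_mean': [100.0, 102.0, 90.0, 94.0],
})

# Split by 'round', then apply several aggregations and combine the results
summary = toy.groupby('round')['intensity_mean'].agg(['median', 'mean', 'count'])
print(summary)
```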
Aggregating with custom functions#
If we want to apply a function that is not built-in, we can also do so. For example, let’s compute the coefficient of variation (the standard deviation divided by the mean). We can define a generic function that calculates it as shown below.
def coefficient_of_variation(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
Now we group the data and pass the function to agg() as an aggregating function. Note that any non-numeric columns must either be part of the grouping or be dropped first; here the only non-numeric column, ‘file_name’, is a grouping key.
grouped = df.groupby(['round', 'file_name'])
grouped.agg(coefficient_of_variation)
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | ||
---|---|---|---|---|---|---|
round | file_name | |||||
False | 20P1_POS0007_D_1UL | 0.540399 | 0.145956 | 0.349857 | 0.289063 | 0.243450 |
20P1_POS0010_D_1UL | 0.765156 | 0.143506 | 0.447638 | 0.402708 | 0.316206 | |
True | 20P1_POS0007_D_1UL | 0.248799 | 0.099636 | 0.145247 | 0.147626 | 0.036950 |
20P1_POS0010_D_1UL | 0.275120 | 0.107008 | 0.167722 | 0.166214 | 0.043755 |
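For a one-off aggregation, the same custom function can also be passed inline as a lambda, which avoids defining a named function. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic data: two groups, chosen so both have a coefficient of variation of 0.5
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 2.0, 6.0],
})

# Inline custom aggregation: standard deviation divided by mean, per group
cv = toy.groupby('group')['value'].agg(lambda x: np.std(x) / np.mean(x))
print(cv)
```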
Exercise 1#
From the tidy dataframe from the previous notebook, group it by ‘Intervention’ and ‘Channel’ and display summary statistics for intensity.
df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df_tidy = df.melt(value_name='intensity', var_name=['Intervention', 'Channel'])
df_tidy
Intervention | Channel | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
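One possible starting point for this exercise, sketched here on a small synthetic tidy frame rather than the Multi_analysis data, is to group by both columns and call describe(), which produces count, mean, standard deviation, and quartiles per group:

```python
import pandas as pd

# Synthetic tidy data mirroring the structure of df_tidy (values illustrative)
tidy = pd.DataFrame({
    'Intervention': ['Before', 'Before', 'After', 'After'],
    'Channel': ['channel_1', 'channel_2', 'channel_1', 'channel_2'],
    'intensity': [10.0, 20.0, 30.0, 40.0],
})

# Group by both categorical columns, then summarize the intensity column
stats = tidy.groupby(['Intervention', 'Channel'])['intensity'].describe()
print(stats)
```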
Exercise 2#
Calculate the skewness of the intensities grouped by channel and intervention.
Hint: use the function skew from scipy.stats together with the .agg method.
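As a related sketch (not the scipy-based solution the exercise asks for), pandas also exposes a built-in skewness aggregation that can be passed to .agg as the string 'skew'. With synthetic data:

```python
import pandas as pd

# Synthetic data: group 'a' is right-skewed, group 'b' is symmetric
toy = pd.DataFrame({
    'group': ['a'] * 4 + ['b'] * 4,
    'value': [1.0, 2.0, 3.0, 10.0, 1.0, 2.0, 3.0, 4.0],
})

# Per-group sample skewness using pandas' built-in aggregation
skews = toy.groupby('group')['value'].agg('skew')
print(skews)
```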