Split-Apply-Combine#
Assume the following dataframe. It contains measurements of objects from 2 different files.
import pandas as pd
import numpy as np
df = pd.read_csv('../../data/BBBC007_analysis.csv')
df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | |
---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL |
... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL |
111 rows × 6 columns
Let’s say we want to compute the median “intensity_mean” of round objects and also discriminate these objects per file. Ignoring for the moment the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set up according to a ‘round’ criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.
Apply a median function to the intensity in these split data sets.
Combine the results of these medians on the split data sets into a new, summary data set that contains the two classes (round and not round) and the median for each.
We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.
Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.
Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.
Before all that, we create a new column with our criterion for roundness.
df['round'] = df['aspect_ratio'] < 1.2
df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | round | |
---|---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL | False |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL | False |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL | False |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL | False |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL | True |
... | ... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL | True |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL | False |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL | True |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL | True |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL | False |
111 rows × 7 columns
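As a quick sanity check after creating a boolean column like this, value_counts() shows how many objects fall into each class. A minimal sketch with synthetic aspect ratios (the values here are illustrative, not from the BBBC007 data):

```python
import pandas as pd

# Synthetic example: a threshold on aspect_ratio creates a boolean column
toy = pd.DataFrame({'aspect_ratio': [1.0, 1.1, 1.5, 2.0]})
toy['round'] = toy['aspect_ratio'] < 1.2

# Count how many objects fall into each class
counts = toy['round'].value_counts()
print(counts)
```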
Aggregation: median intensity#
Let’s start by grouping by ‘round’.
grouped = df.groupby('round')
# Take a look
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001788C205760>
There is not much to see in the resulting DataFrameGroupBy object, but there is a lot we can do with it. Typing grouped. and hitting tab will show you the many possibilities. For most of these, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want.
df_median = grouped.median(numeric_only = True)
# Take a look
df_median
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
round | |||||
False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
Here the numeric_only option is set so that the categorical ‘file_name’ column is excluded; a median is not defined for strings.
The resulting data frame has the medians of all numeric quantities, including the intensities that we wanted. Note that this data frame has ‘round’ as the name of the row index. If we want to instead keep ‘round’ (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.
df_median.reset_index()
round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|
0 | False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
1 | True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
Note, though, that this was not done in-place. df_median still has an index labeled ‘round’. If you want to update your data frame, you have to explicitly do so with an assignment operator.
df_median = df_median.reset_index()
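A minimal sketch (with synthetic data) illustrating that reset_index() returns a new data frame rather than modifying the original in place:

```python
import pandas as pd

# Synthetic grouped summary: 'key' becomes the row index after groupby
s = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})
summary = s.groupby('key').median()

flat = summary.reset_index()   # 'key' becomes an ordinary column in the copy
print(summary.index.name)      # the original still has 'key' as its index name
print(list(flat.columns))
```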
We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups: round objects from the first file, round from the second file, not round from the first file, and not round from the second file. To do this, we simply pass in a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line.
df.groupby(['round', 'file_name']).median().reset_index()
round | file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|---|
0 | False | 20P1_POS0007_D_1UL | 323.0 | 91.796791 | 23.755227 | 17.072477 | 1.467410 |
1 | False | 20P1_POS0010_D_1UL | 237.0 | 93.269113 | 20.410737 | 14.832035 | 1.353858 |
2 | True | 20P1_POS0007_D_1UL | 293.0 | 98.227799 | 20.307965 | 18.599043 | 1.101700 |
3 | True | 20P1_POS0010_D_1UL | 277.5 | 103.299825 | 19.662330 | 17.680741 | 1.103133 |
This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median.
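Aggregations are not limited to a single statistic: passing a list of function names to agg() computes several summaries per group in one call. A sketch with a small synthetic table (column names chosen to mirror the example above):

```python
import pandas as pd

# Synthetic data: two groups of a boolean 'round' column
toy = pd.DataFrame({
    'round': [True, True, False, False],
    'intensity_mean': [100.0, 102.0, 90.0, 94.0],
})

# Split by 'round', then apply several aggregations and combine the results
summary = toy.groupby('round')['intensity_mean'].agg(['median', 'mean', 'count'])
print(summary)
```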
Aggregating with custom functions#
If we want to apply a function that is not built-in, we can also do so. For example, let’s compute the coefficient of variation (the standard deviation divided by the mean). We can define a generic function that calculates it as shown below.
def coefficient_of_variation(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)
Now we group the data and pass the function to agg() as an aggregating function. Note that any non-numeric columns must either be part of the grouping or be dropped first; here the only non-numeric column, ‘file_name’, is a grouping key.
grouped = df.groupby(['round', 'file_name'])
grouped.agg(coefficient_of_variation)
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | ||
---|---|---|---|---|---|---|
round | file_name | |||||
False | 20P1_POS0007_D_1UL | 0.540399 | 0.145956 | 0.349857 | 0.289063 | 0.243450 |
20P1_POS0010_D_1UL | 0.765156 | 0.143506 | 0.447638 | 0.402708 | 0.316206 | |
True | 20P1_POS0007_D_1UL | 0.248799 | 0.099636 | 0.145247 | 0.147626 | 0.036950 |
20P1_POS0010_D_1UL | 0.275120 | 0.107008 | 0.167722 | 0.166214 | 0.043755 |
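For a one-off aggregation, the same custom function can also be passed inline as a lambda, which avoids defining a named function. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic data: two groups, chosen so both have a coefficient of variation of 0.5
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 2.0, 6.0],
})

# Inline custom aggregation: standard deviation divided by mean, per group
cv = toy.groupby('group')['value'].agg(lambda x: np.std(x) / np.mean(x))
print(cv)
```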
Exercise 1#
From the tidy dataframe from the previous notebook, group it by ‘Intervention’ and ‘Channel’ and display summary statistics for intensity.
df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df_tidy = df.melt(value_name='intensity', var_name=['Intervention', 'Channel'])
df_tidy
Intervention | Channel | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
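One possible starting point for this exercise, sketched here on a small synthetic tidy frame rather than the Multi_analysis data, is to group by both columns and call describe(), which produces count, mean, standard deviation, and quartiles per group:

```python
import pandas as pd

# Synthetic tidy data mirroring the structure of df_tidy (values illustrative)
tidy = pd.DataFrame({
    'Intervention': ['Before', 'Before', 'After', 'After'],
    'Channel': ['channel_1', 'channel_2', 'channel_1', 'channel_2'],
    'intensity': [10.0, 20.0, 30.0, 40.0],
})

# Group by both categorical columns, then summarize the intensity column
stats = tidy.groupby(['Intervention', 'Channel'])['intensity'].describe()
print(stats)
```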
Exercise 2#
Calculate the skewness of the intensities grouped by channel and intervention.
Hint: use the function skew from scipy.stats together with the .agg method.
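As a related sketch (not the scipy-based solution the exercise asks for), pandas also exposes a built-in skewness aggregation that can be passed to .agg as the string 'skew'. With synthetic data:

```python
import pandas as pd

# Synthetic data: group 'a' is right-skewed, group 'b' is symmetric
toy = pd.DataFrame({
    'group': ['a'] * 4 + ['b'] * 4,
    'value': [1.0, 2.0, 3.0, 10.0, 1.0, 2.0, 3.0, 4.0],
})

# Per-group sample skewness using pandas' built-in aggregation
skews = toy.groupby('group')['value'].agg('skew')
print(skews)
```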