Introduction to working with DataFrames#
In basic Python, we often use dictionaries containing our measurements as lists of numbers. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce pandas DataFrames, the primary tool in the Python ecosystem for handling tabular data. Its central object, the DataFrame, is extremely useful for wrangling data in the next steps.
import pandas as pd
Creating DataFrames from a dictionary of lists#
Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:
measurements = {
"labels": [1, 2, 3],
"area": [45, 23, 68],
"minor_axis": [2, 4, 4],
"major_axis": [3, 4, 5],
}
This data structure can be nicely visualized using a DataFrame:
df = pd.DataFrame(measurements)
df
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values elementwise from existing columns.
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels | area | minor_axis | major_axis | aspect_ratio | |
---|---|---|---|---|---|
0 | 1 | 45 | 2 | 3 | 1.50 |
1 | 2 | 23 | 4 | 4 | 1.00 |
2 | 3 | 68 | 4 | 5 | 1.25 |
Saving data frames#
We can also save this table to continue working with it later. We choose to save it as a CSV file, where CSV stands for comma-separated values. This is a text file that can easily be read into data structures in many programming languages.
df.to_csv("../../data/short_table.csv")
You should generally always store your data in such a format, not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files.
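Note that to_csv by default also writes the row index as an extra column. If you do not want that, you can pass index=False; a minimal sketch, reusing the df from above:
# write the table without the row index, then read it back in
df.to_csv("../../data/short_table.csv", index=False)
pd.read_csv("../../data/short_table.csv")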
Creating DataFrames from lists of lists#
Sometimes, we are confronted with data in the form of lists of lists. To make pandas interpret that form of data correctly, we also need to provide the row headers in the same order as the lists.
row_header = ['labels', 'area', 'minor_axis', 'major_axis']
data = [
[1, 2, 3], # labels
[45, 23, 68], # area
[2, 4, 4], # minor_axis
[3, 4, 5], # major_axis
]
# convert the data and headers into a pandas DataFrame
data_frame = pd.DataFrame(data, index=row_header)
# show it
data_frame
0 | 1 | 2 | |
---|---|---|---|
labels | 1 | 2 | 3 |
area | 45 | 23 | 68 |
minor_axis | 2 | 4 | 4 |
major_axis | 3 | 4 | 5 |
Or from a 2D numpy array#
import numpy as np
data = np.random.random((4,3))
column_header = ['area', 'minor_axis', 'major_axis']
pd.DataFrame(data, columns=column_header)
area | minor_axis | major_axis | |
---|---|---|---|
0 | 0.800797 | 0.421862 | 0.956755 |
1 | 0.470318 | 0.431116 | 0.171390 |
2 | 0.210930 | 0.076568 | 0.199061 |
3 | 0.050682 | 0.346919 | 0.254751 |
As you can see, the table we created from lists of lists above is rotated compared to the usual layout. We can bring it into the usual form like this:
# rotate/flip it
data_frame = data_frame.transpose()
# show it
data_frame
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
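If you prefer to skip the transpose step, you can build the same DataFrame column-wise directly; a minimal sketch, assuming the same data and row_header as above:
# pair each header with its list of values and build the DataFrame column-wise
pd.DataFrame(dict(zip(row_header, data)))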
Loading data frames#
Tables can be read from CSV files with pd.read_csv.
df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns
That’s a bit too much information. We can use the .head() method of data frames to look at the first few rows (or .tail() to check the last rows).
df_csv.head(3)
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
df_csv.tail(3)
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
We can also get the column names with the DataFrame attribute columns.
df_csv.columns
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
dtype='object')
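The 'Unnamed: 0' column is presumably just the row index that was written into the CSV file. Assuming that is the case, it can be avoided by telling read_csv to use that first column as the index:
# use the first CSV column as the row index instead of an 'Unnamed: 0' data column
pd.read_csv('../../data/blobs_statistics.csv', index_col=0).head(3)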
Selecting rows and columns#
Ok, let’s get a single column of the dataframe, for example the area measurements:
df_csv['area']
0 422
1 182
2 661
3 437
4 476
...
56 211
57 78
58 86
59 51
60 46
Name: area, Length: 61, dtype: int64
Notice that when it was printed, the index of the rows came along with it. That’s because a single column of a pandas DataFrame is a pandas Series.
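You can confirm this by checking the type of the selection:
# a single selected column is a pandas Series
type(df_csv['area'])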
We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.
df_analysis = df_csv[ ['area', 'mean_intensity'] ].copy()
df_analysis
area | mean_intensity | |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
This gave us the area and intensity measurements we were after.
If we want to get a single row, the proper way of doing that is to use the .loc indexer:
df_csv.loc[ 2, ['area', 'mean_intensity']]
area 661.000000
mean_intensity 205.216339
Name: 2, dtype: float64
Note that following .loc, we give the index by row and then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers, and you should not count on them being integers.
In case you really want to access elements in a dataframe by integer positions, like in a numpy array, you can use the .iloc indexer.
df_csv.iloc[2:4, 1:3]
area | mean_intensity | |
---|---|---|
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
The downside is that the code becomes less explicit when column names are absent.
In practice you will almost never use row indices, but rather boolean indexing, or masking.
Selecting parts of tables based on criteria#
In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes called masking.
Suppose we want the rows for which df_analysis["area"] > 50 holds. We can plug this expression directly into the square brackets (and the same syntax also works with .loc).
df_analysis_filtered = df_analysis[ df_analysis["area"] > 50]
df_analysis_filtered.tail()
area | mean_intensity | |
---|---|---|
55 | 280 | 189.800000 |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
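Several criteria can be combined with & (and) and | (or); each condition needs its own parentheses. A minimal sketch, reusing df_analysis and an arbitrarily chosen intensity threshold:
# keep objects that are both large enough and bright enough
df_analysis[(df_analysis["area"] > 50) & (df_analysis["mean_intensity"] > 190)].head()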
Adding new columns#
In pandas, it is very easy to generate new columns from existing columns. We just assign the result of an operation between existing columns to a new column name.
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
area | mean_intensity | total_intensity | |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
... | ... | ... | ... |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
60 | 46 | 175.304348 | 8064.0 |
61 rows × 3 columns
Appending tables#
When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.
We start with two small tables of measurements that could have been obtained from different functions or different libraries.
import pandas as pd
table1 = pd.DataFrame({
"label": [1, 2, 3],
"circularity": [0.3, 0.5, 0.7],
"elongation": [2.3, 3.4, 1.2],
})
table1
label | circularity | elongation | |
---|---|---|---|
0 | 1 | 0.3 | 2.3 |
1 | 2 | 0.5 | 3.4 |
2 | 3 | 0.7 | 1.2 |
table2 = pd.DataFrame({
"label": [3, 2, 1, 4],
"area": [22, 32, 25, 18],
"skewness": [0.5, 0.6, 0.3, 0.3],
})
table2
label | area | skewness | |
---|---|---|---|
0 | 3 | 22 | 0.5 |
1 | 2 | 32 | 0.6 |
2 | 1 | 25 | 0.3 |
3 | 4 | 18 | 0.3 |
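Note that the two tables list the same labeled objects in a different order, and table2 contains an extra label. One way to combine such tables is to match rows via the shared label column; a minimal sketch using pd.merge:
# match rows by 'label'; how='outer' also keeps labels that occur in only one table
pd.merge(table1, table2, how='outer', on='label')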
Combining same measurement from different image files#
When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.
Let’s open two tables generated by applying the same workflow to different files.
df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.head()
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 |
df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.head()
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 |
Because we know we have the same columns, we can concatenate them into a single big table.
big_df = pd.concat([df1, df2], axis=0)
big_df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 |
... | ... | ... | ... | ... | ... |
42 | 315 | 91.133333 | 20.927095 | 19.209283 | 1.089426 |
43 | 206 | 94.262136 | 23.381879 | 11.669668 | 2.003646 |
44 | 45 | 68.377778 | 9.406371 | 6.276445 | 1.498678 |
45 | 33 | 76.727273 | 10.724275 | 4.174568 | 2.568955 |
46 | 16 | 76.750000 | 7.745967 | 2.783882 | 2.782433 |
111 rows × 5 columns
The problem is that we lose their source identity. An easy fix is to add a new column with the file name before concatenating them. This will facilitate splitting them back apart and plotting later on.
When we give a single value to a new column, it is assigned to all rows.
df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'
df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'
big_df = pd.concat([df1, df2], axis=0)
big_df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | |
---|---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 | BBBC007_20P1_POS0007_D_1UL |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 | BBBC007_20P1_POS0007_D_1UL |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 | BBBC007_20P1_POS0007_D_1UL |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 | BBBC007_20P1_POS0007_D_1UL |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 | BBBC007_20P1_POS0007_D_1UL |
... | ... | ... | ... | ... | ... | ... |
42 | 315 | 91.133333 | 20.927095 | 19.209283 | 1.089426 | BBBC007_20P1_POS0010_D_1UL |
43 | 206 | 94.262136 | 23.381879 | 11.669668 | 2.003646 | BBBC007_20P1_POS0010_D_1UL |
44 | 45 | 68.377778 | 9.406371 | 6.276445 | 1.498678 | BBBC007_20P1_POS0010_D_1UL |
45 | 33 | 76.727273 | 10.724275 | 4.174568 | 2.568955 | BBBC007_20P1_POS0010_D_1UL |
46 | 16 | 76.750000 | 7.745967 | 2.783882 | 2.782433 | BBBC007_20P1_POS0010_D_1UL |
111 rows × 6 columns
Now we can safely distinguish the source of each row.
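The file_name column also makes it straightforward to pull the rows of a single file back out later, for example:
# select only the measurements that came from the first file
big_df[big_df['file_name'] == 'BBBC007_20P1_POS0007_D_1UL'].head()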
Handling NaN values#
When analysing tabular data, sometimes table cells are present that do not contain data. In Python this typically means the value is Not a Number (NaN). We cannot assume these values are 0 or -1 or any other number, because that would distort descriptive statistics, for example. We need to deal with these NaN entries differently, and this notebook will introduce how.
To get a first impression of where NaNs play a role, we again load an example table and sort it.
import numpy as np
import pandas as pd
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.sort_values(by = "Area", ascending=False)
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
190 | 2755.0 | 859.928 | 235.458 | 539.0 | 3880.0 | 108.710 | 302.158 | 110.999 | 300.247 | 144.475 | 24.280 | 39.318 | 100 | C |
81 | 2295.0 | 765.239 | 96.545 | 558.0 | 1431.0 | 375.003 | 134.888 | 374.982 | 135.359 | 65.769 | 44.429 | 127.247 | 100 | B |
209 | 1821.0 | 847.761 | 122.074 | 600.0 | 1510.0 | 287.795 | 321.115 | 288.074 | 321.824 | 55.879 | 41.492 | 112.124 | 100 | A |
252 | 1528.0 | 763.777 | 83.183 | 572.0 | 1172.0 | 191.969 | 385.944 | 192.487 | 385.697 | 63.150 | 30.808 | 34.424 | 100 | B |
265 | 1252.0 | 793.371 | 117.139 | 579.0 | 1668.0 | 262.071 | 394.497 | 262.268 | 394.326 | 60.154 | 26.500 | 50.147 | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 1.0 | 587.000 | 0.000 | 587.0 | 587.0 | 399.500 | 117.500 | 399.500 | 117.500 | 1.128 | 1.128 | 0.000 | 100 | A |
310 | 1.0 | 866.000 | 0.000 | 866.0 | 866.0 | 343.500 | 408.500 | 343.500 | 408.500 | 1.128 | 1.128 | 0.000 | 100 | A |
219 | 1.0 | 763.000 | 0.000 | 763.0 | 763.0 | 411.500 | 296.500 | 411.500 | 296.500 | 1.128 | 1.128 | 0.000 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
391 rows × 14 columns
As you can see, there are rows at the bottom containing NaNs. These are at the bottom of the table because pandas cannot sort them.
A quick check whether there are NaNs anywhere in a DataFrame is an important quality control step and good scientific practice.
data.isnull().values.any()
True
We can also get some deeper insight into which columns these NaN values are located in.
data.isnull().sum()
Area 2
Mean 5
StdDev 3
Min 3
Max 3
X 2
Y 3
XM 3
YM 5
Major 8
Minor 3
Angle 1
%Area 0
Type 0
dtype: int64
To get a feeling for whether we can further process that table, we may want to know the percentage of NaNs in each column.
data.isnull().mean().sort_values(ascending=False) *100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
Dropping rows that contain NaNs#
Depending on what kind of data analysis should be performed, it might make sense to just ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs. Which option is appropriate depends on your project and on what is important for the analysis; there is no easy answer.
data_no_nan = data.dropna(how="any")
data_no_nan
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
6 | 669.0 | 697.164 | 72.863 | 539.0 | 957.0 | 471.696 | 26.253 | 471.694 | 26.197 | 36.656 | 23.237 | 124.340 | 100 | A |
7 | 5.0 | 658.600 | 49.161 | 607.0 | 710.0 | 28.300 | 8.100 | 28.284 | 8.103 | 3.144 | 2.025 | 161.565 | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
383 | 94.0 | 746.617 | 85.198 | 550.0 | 1021.0 | 194.032 | 498.223 | 194.014 | 498.239 | 17.295 | 6.920 | 52.720 | 100 | B |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
374 rows × 14 columns
At the bottom of that table, you can see that it still contains 374 of the original 391 rows. If you remove rows, you should document in your later scientific publication how many of the total number of datasets were analysed.
We can now also check again if NaNs are present.
data_no_nan.isnull().values.any()
False
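If only some columns matter for your analysis, dropna can be restricted to those columns via its subset parameter; a minimal sketch, reusing the data table:
# only drop rows where 'Angle' is NaN; NaNs in other columns are kept
data.dropna(subset=["Angle"]).shape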
Determining rows that contain NaNs#
In some use cases it might be useful to have a list of row indices where NaN values occur.
data2 = {
'A': [0, 1, 22, 21, 12, 23],
'B': [2, 3, np.nan, 2, 12, 22],
'C': [2, 3, 44, 2, np.nan, 52],
}
table = pd.DataFrame(data2)
table
A | B | C | |
---|---|---|---|
0 | 0 | 2.0 | 2.0 |
1 | 1 | 3.0 | 3.0 |
2 | 22 | NaN | 44.0 |
3 | 21 | 2.0 | 2.0 |
4 | 12 | 12.0 | NaN |
5 | 23 | 22.0 | 52.0 |
np.max(table.isnull().values, axis=1)
array([False, False, True, False, True, False])
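To turn this boolean array into the actual row indices, one option is:
# indices of the rows that contain at least one NaN
table[table.isnull().any(axis=1)].index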
Tidy-Data#
Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:
Each variable is a column.
Each observation is a row.
Each type of observation has its own separate data frame.
This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.
Let’s import a saved table with measurements. Is this table tidy?
import pandas as pd
df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df
Before | After | |||
---|---|---|---|---|
channel_1 | channel_2 | channel_1 | channel_2 | |
0 | 13.250000 | 21.000000 | 15.137984 | 42.022776 |
1 | 44.954545 | 24.318182 | 43.328836 | 48.661610 |
2 | 13.590909 | 18.772727 | 11.685995 | 37.926184 |
3 | 85.032258 | 19.741935 | 86.031461 | 40.396353 |
4 | 10.731707 | 25.268293 | 10.075421 | 51.471865 |
5 | 94.625000 | 36.450000 | 95.180900 | 73.347843 |
6 | 89.836735 | 34.693878 | 89.857864 | 69.902829 |
7 | 100.261905 | 34.904762 | 101.989852 | 70.156432 |
8 | 29.615385 | 52.115385 | 31.516654 | 104.525198 |
9 | 15.868421 | 24.921053 | 16.086932 | 50.563301 |
10 | 12.475000 | 25.450000 | 11.529924 | 51.381594 |
11 | 87.875000 | 28.050000 | 89.745522 | 56.543107 |
12 | 58.800000 | 22.600000 | 59.646229 | 45.215405 |
13 | 91.061224 | 40.367347 | 89.935893 | 81.326111 |
14 | 23.500000 | 117.333333 | 21.676993 | 235.067654 |
15 | 82.566667 | 34.566667 | 84.097735 | 69.820702 |
16 | 36.120000 | 29.600000 | 37.688676 | 59.870177 |
17 | 70.687500 | 33.843750 | 72.569112 | 68.493363 |
18 | 102.021277 | 33.297872 | 100.419746 | 67.379506 |
19 | 72.318182 | 103.909091 | 70.843134 | 207.956510 |
20 | 18.100000 | 29.166667 | 17.865201 | 58.361239 |
21 | 5.217391 | 36.347826 | 6.961346 | 73.286439 |
22 | 19.925926 | 72.814815 | 18.607102 | 145.900739 |
23 | 26.673077 | 57.403846 | 27.611368 | 115.347217 |
24 | 13.340000 | 30.400000 | 14.160543 | 61.225962 |
25 | 15.028571 | 38.400000 | 14.529963 | 77.490249 |
The most useful function for tidying data is pd.melt.
df.melt()
variable_0 | variable_1 | value | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
We can specify names for the values and for the variables. In this case, our measurements are of intensity, and our variables are the treatment (before or after the intervention) and the channel.
df_tidy = df.melt(value_name='intensity', var_name=['Treatment', 'Channel'])
df_tidy
Treatment | Channel | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
It may not look better for interpretation, but it becomes easier to manipulate, because now we can more easily mask rows by column values. Below we build a boolean mask for intensity measurements of “channel_2” after the intervention, and first display all “channel_2” rows; applying the full mask is shown after the table.
row_mask = (df_tidy['Channel'] == 'channel_2') & (df_tidy['Treatment'] == 'After')
df_tidy[df_tidy['Channel'] == 'channel_2']
Treatment | Channel | intensity | |
---|---|---|---|
26 | Before | channel_2 | 21.000000 |
27 | Before | channel_2 | 24.318182 |
28 | Before | channel_2 | 18.772727 |
29 | Before | channel_2 | 19.741935 |
30 | Before | channel_2 | 25.268293 |
31 | Before | channel_2 | 36.450000 |
32 | Before | channel_2 | 34.693878 |
33 | Before | channel_2 | 34.904762 |
34 | Before | channel_2 | 52.115385 |
35 | Before | channel_2 | 24.921053 |
36 | Before | channel_2 | 25.450000 |
37 | Before | channel_2 | 28.050000 |
38 | Before | channel_2 | 22.600000 |
39 | Before | channel_2 | 40.367347 |
40 | Before | channel_2 | 117.333333 |
41 | Before | channel_2 | 34.566667 |
42 | Before | channel_2 | 29.600000 |
43 | Before | channel_2 | 33.843750 |
44 | Before | channel_2 | 33.297872 |
45 | Before | channel_2 | 103.909091 |
46 | Before | channel_2 | 29.166667 |
47 | Before | channel_2 | 36.347826 |
48 | Before | channel_2 | 72.814815 |
49 | Before | channel_2 | 57.403846 |
50 | Before | channel_2 | 30.400000 |
51 | Before | channel_2 | 38.400000 |
78 | After | channel_2 | 42.022776 |
79 | After | channel_2 | 48.661610 |
80 | After | channel_2 | 37.926184 |
81 | After | channel_2 | 40.396353 |
82 | After | channel_2 | 51.471865 |
83 | After | channel_2 | 73.347843 |
84 | After | channel_2 | 69.902829 |
85 | After | channel_2 | 70.156432 |
86 | After | channel_2 | 104.525198 |
87 | After | channel_2 | 50.563301 |
88 | After | channel_2 | 51.381594 |
89 | After | channel_2 | 56.543107 |
90 | After | channel_2 | 45.215405 |
91 | After | channel_2 | 81.326111 |
92 | After | channel_2 | 235.067654 |
93 | After | channel_2 | 69.820702 |
94 | After | channel_2 | 59.870177 |
95 | After | channel_2 | 68.493363 |
96 | After | channel_2 | 67.379506 |
97 | After | channel_2 | 207.956510 |
98 | After | channel_2 | 58.361239 |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
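Applying the full mask defined above restricts the selection further to the rows measured after the intervention:
# both conditions at once: 'channel_2' measurements after the intervention
df_tidy[row_mask]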
Split-Apply-Combine#
Assume the following dataframe. It contains measurements of objects from two different files. We first check whether it contains any missing values.
df = pd.read_csv('../../data/BBBC007_analysis.csv')
df.isnull()
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | False |
2 | False | False | False | False | False | False |
3 | False | False | False | False | False | False |
4 | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... |
106 | False | False | False | False | False | False |
107 | False | False | False | False | False | False |
108 | False | False | False | False | False | False |
109 | False | False | False | False | False | False |
110 | False | False | False | False | False | False |
111 rows × 6 columns
Let’s say we want to compute the median “intensity_mean” of round objects, and also discriminate these objects per file. Ignoring for a second the mechanics of how we would do this with Python, let’s think about it in English. What do we need to do?
Split the data set up according to a ‘round’ criterion field, i.e., split it up so we have a separate data set for the two classes, those round and those not round.
Apply a median function to the intensity in these split data sets.
Combine the results of these medians on the split data set into a new, summary data set that contains the two classes (round and not round) and the median for each.
We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.
Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.
Pandas’s split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through. It is extensive though, so don’t let yourself get intimidated by it.
Before all that, we create a new column with our criterion for roundness.
df['round'] = df['aspect_ratio'] < 1.2
df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | round | |
---|---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL | False |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL | False |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL | False |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL | False |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL | True |
... | ... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL | True |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL | False |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL | True |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL | True |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL | False |
111 rows × 7 columns
Aggregation: median intensity#
Let’s start by grouping by ‘round’.
grouped = df.groupby('round')
# Take a look
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11bb28e20>
There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Typing grouped. and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The grouped.median() method is exactly what we want.
df_median = grouped.median(numeric_only = True)
# Take a look
df_median
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
round | |||||
False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
Here the numeric_only option is set so that, for now, no median is calculated for the categorical ‘file_name’ column.
The resulting data frame has the medians of all quantities, including the intensities that we wanted. Note that this data frame has ‘round’ as the name of the row index. If we instead want to keep ‘round’ (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the reset_index() method.
df_median = df_median.reset_index()
df_median
round | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|
0 | False | 270.0 | 92.788345 | 21.459495 | 15.858324 | 1.412849 |
1 | True | 291.0 | 100.256000 | 20.155547 | 18.352287 | 1.101700 |
We can also use multiple columns in our groupby() operation. For example, we may wish to look at four groups: round from the first file, round from the second file, not round from the first file, and not round from the second file. To do this, we simply pass a list of columns into df.groupby(). We will chain the methods, performing a groupby, applying a median, and then resetting the index of the result, all in one line.
df.groupby(['round', 'file_name']).median().reset_index()
round | file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|---|---|
0 | False | 20P1_POS0007_D_1UL | 323.0 | 91.796791 | 23.755227 | 17.072477 | 1.467410 |
1 | False | 20P1_POS0010_D_1UL | 237.0 | 93.269113 | 20.410737 | 14.832035 | 1.353858 |
2 | True | 20P1_POS0007_D_1UL | 293.0 | 98.227799 | 20.307965 | 18.599043 | 1.101700 |
3 | True | 20P1_POS0010_D_1UL | 277.5 | 103.299825 | 19.662330 | 17.680741 | 1.103133 |
This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median.
Aggregating with custom functions#
If we want to apply a function that is not built in, we can also do so. For example, let’s compute the coefficient of variation. We can define a generic function that calculates it, as shown below.
def coefficient_of_variance(data):
"""Compute coefficient of variation from an array of data."""
return np.std(data) / np.mean(data)
Now we group the data and apply our function as an aggregating function. If there are other categorical columns that are not used for grouping, they should be excluded (masked) first.
grouped = df.groupby(['round', 'file_name'])
grouped.agg(coefficient_of_variance)
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | ||
---|---|---|---|---|---|---|
round | file_name | |||||
False | 20P1_POS0007_D_1UL | 0.540399 | 0.145956 | 0.349857 | 0.289063 | 0.243450 |
20P1_POS0010_D_1UL | 0.765156 | 0.143506 | 0.447638 | 0.402708 | 0.316206 | |
True | 20P1_POS0007_D_1UL | 0.248799 | 0.099636 | 0.145247 | 0.147626 | 0.036950 |
20P1_POS0010_D_1UL | 0.275120 | 0.107008 | 0.167722 | 0.166214 | 0.043755 |
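agg also accepts a list of functions, so several summary statistics can be computed in one call; a minimal sketch for a single column:
# apply the built-in median and our custom function at once
grouped['intensity_mean'].agg(['median', coefficient_of_variance])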
Exercises#
Exercise 1#
df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape.head()
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
Create a new data frame that only contains these columns:
minor_axis_length
major_axis_length
aspect_ratio
Exercise 2#
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.head()
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
Select the columns Area and Mean. Remove all rows that contain NaNs and count the remaining rows. Now select the columns Major and Minor. Remove NaNs and count the remaining rows again. What do you conclude?
Exercise 3#
Group the tidy dataframe from above by ‘Treatment’ and ‘Channel’
df_tidy
Treatment | Channel | intensity | |
---|---|---|---|
0 | Before | channel_1 | 13.250000 |
1 | Before | channel_1 | 44.954545 |
2 | Before | channel_1 | 13.590909 |
3 | Before | channel_1 | 85.032258 |
4 | Before | channel_1 | 10.731707 |
... | ... | ... | ... |
99 | After | channel_2 | 73.286439 |
100 | After | channel_2 | 145.900739 |
101 | After | channel_2 | 115.347217 |
102 | After | channel_2 | 61.225962 |
103 | After | channel_2 | 77.490249 |
104 rows × 3 columns
Calculate the mean and standard deviation of the intensity for these groups.