Appending tables#

When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd

table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
    })
table1

	label	circularity	elongation
0	1	0.3	2.3
1	2	0.5	3.4
2	3	0.7	1.2

table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    })
table2

	label	area	skewness
0	3	22	0.5
1	2	32	0.6
2	1	25	0.3
3	4	18	0.3

Combining columns of tables#

According to the pandas documentation there are multiple ways for combining tables. We first use a wrong example to highlight pitfalls when combining tables.

In the following example, measurements of label 1 and 3 are mixed. Furthermore, one of our tables did not contain measurements for label 4.

wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables

	label	circularity	elongation	label	area	skewness
0	1.0	0.3	2.3	3	22	0.5
1	2.0	0.5	3.4	2	32	0.6
2	3.0	0.7	1.2	1	25	0.3
3	NaN	NaN	NaN	4	18	0.3

A better way for combining tables is the merge command. It allows to explicitly specify on which column the tables should be combined. Data scientists speak of the ‘index’ or ‘identifier’ of rows in the tables.

correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1

	label	circularity	elongation	area	skewness
0	1	0.3	2.3	25	0.3
1	2	0.5	3.4	32	0.6
2	3	0.7	1.2	22	0.5

You may note that in the above example, label 4 is missing. We can also get it by out table by performing an outer join.

correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2

	label	circularity	elongation	area	skewness
0	1	0.3	2.3	25	0.3
1	2	0.5	3.4	32	0.6
2	3	0.7	1.2	22	0.5
3	4	NaN	NaN	18	0.3

correctly_combined_tables2 = pd.merge(table1, table2, how='right', on='label')
correctly_combined_tables2

	label	circularity	elongation	area	skewness
0	3	0.7	1.2	22	0.5
1	2	0.5	3.4	32	0.6
2	1	0.3	2.3	25	0.3
3	4	NaN	NaN	18	0.3

Suppose there is a common measurement name from different tables. For example, table3 below also contain “elongation”.

table3 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    "elongation":  [2.3, 3.4, 1.2, 1.1]
    })
table3

	label	area	skewness	elongation
0	3	22	0.5	2.3
1	2	32	0.6	3.4
2	1	25	0.3	1.2
3	4	18	0.3	1.1

Applying merge still preserves both measurements in different columns.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3 

	label	circularity	elongation_x	area	skewness	elongation_y
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

We can change ‘x’ and ‘y’ by passing other suffixes.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method1', '_method2'))
correctly_combined_tables3

	label	circularity	elongation_method1	area	skewness	elongation_method2
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

Combining same measurement from different image files#

When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.

Let’s open two tables generated from applying the same worflow to different files.

df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.head()

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio
0	256	93.250000	19.995017	17.021559	1.174688
1	90	82.488889	15.939969	7.516326	2.120713
2	577	90.637782	35.324458	21.759434	1.623409
3	270	95.640741	20.229431	17.669052	1.144908
4	153	84.908497	15.683703	12.420475	1.262730

df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.head()

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio
0	139	96.546763	17.504104	10.292770	1.700621
1	360	86.613889	35.746808	14.983124	2.385805
2	43	91.488372	12.967884	4.351573	2.980045
3	140	73.742857	18.940508	10.314404	1.836316
4	144	89.375000	13.639308	13.458532	1.013432

In this particular case where we know we have the same columns, we could concatenate them into a single big table.

big_df = pd.concat([df1, df2], axis=0)
big_df

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio
0	256	93.250000	19.995017	17.021559	1.174688
1	90	82.488889	15.939969	7.516326	2.120713
2	577	90.637782	35.324458	21.759434	1.623409
3	270	95.640741	20.229431	17.669052	1.144908
4	153	84.908497	15.683703	12.420475	1.262730
...	...	...	...	...	...
42	315	91.133333	20.927095	19.209283	1.089426
43	206	94.262136	23.381879	11.669668	2.003646
44	45	68.377778	9.406371	6.276445	1.498678
45	33	76.727273	10.724275	4.174568	2.568955
46	16	76.750000	7.745967	2.783882	2.782433

111 rows × 5 columns

The problem is that we lose their source identity. An easy fix for that is to add a new column with the file name before concatenating them. This will facilitating splitting them back and plotting later on.

When we give a single value to a new column, it is assigned to all rows.

df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'

df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'

big_df = pd.concat([df1, df2], axis=0)
big_df

	area	intensity_mean	major_axis_length	minor_axis_length	aspect_ratio	file_name
0	256	93.250000	19.995017	17.021559	1.174688	BBBC007_20P1_POS0007_D_1UL
1	90	82.488889	15.939969	7.516326	2.120713	BBBC007_20P1_POS0007_D_1UL
2	577	90.637782	35.324458	21.759434	1.623409	BBBC007_20P1_POS0007_D_1UL
3	270	95.640741	20.229431	17.669052	1.144908	BBBC007_20P1_POS0007_D_1UL
4	153	84.908497	15.683703	12.420475	1.262730	BBBC007_20P1_POS0007_D_1UL
...	...	...	...	...	...	...
42	315	91.133333	20.927095	19.209283	1.089426	BBBC007_20P1_POS0010_D_1UL
43	206	94.262136	23.381879	11.669668	2.003646	BBBC007_20P1_POS0010_D_1UL
44	45	68.377778	9.406371	6.276445	1.498678	BBBC007_20P1_POS0010_D_1UL
45	33	76.727273	10.724275	4.174568	2.568955	BBBC007_20P1_POS0010_D_1UL
46	16	76.750000	7.745967	2.783882	2.782433	BBBC007_20P1_POS0010_D_1UL

111 rows × 6 columns

Now, when we can safely distinguish the source of each row.

Quantitative Bio-image Analysis with Python

Appending tables

Contents

Appending tables#

Combining columns of tables#

Combining same measurement from different image files#

	label	circularity	elongation	label	area	skewness
0	1.0	0.3	2.3	3	22	0.5
1	2.0	0.5	3.4	2	32	0.6
2	3.0	0.7	1.2	1	25	0.3
3	NaN	NaN	NaN	4	18	0.3

	label	circularity	elongation_x	area	skewness	elongation_y
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

	label	circularity	elongation_method1	area	skewness	elongation_method2
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

	label	circularity	elongation	label	area	skewness
0	1.0	0.3	2.3	3	22	0.5
1	2.0	0.5	3.4	2	32	0.6
2	3.0	0.7	1.2	1	25	0.3
3	NaN	NaN	NaN	4	18	0.3

	label	circularity	elongation_x	area	skewness	elongation_y
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

	label	circularity	elongation_method1	area	skewness	elongation_method2
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

	label	circularity	elongation	label	area	skewness
0	1.0	0.3	2.3	3	22	0.5
1	2.0	0.5	3.4	2	32	0.6
2	3.0	0.7	1.2	1	25	0.3
3	NaN	NaN	NaN	4	18	0.3

	label	circularity	elongation_x	area	skewness	elongation_y
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1

	label	circularity	elongation_method1	area	skewness	elongation_method2
0	1	0.3	2.3	25	0.3	1.2
1	2	0.5	3.4	32	0.6	3.4
2	3	0.7	1.2	22	0.5	2.3
3	4	NaN	NaN	18	0.3	1.1