Introduction to working with DataFrames#

In basic Python, measurements are often collected in dictionaries whose values are lists. While these structures are handy for collecting data, they are suboptimal for further data processing. For that purpose we introduce pandas DataFrames, which are more convenient in the next steps. In Python, scientists often call tables “DataFrames”.

import pandas as pd

Creating DataFrames from a dictionary of lists#

Assume we did some image processing and have the results available in a dictionary that contains lists of numbers:

measurements = {
    "labels":      [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

df = pd.DataFrame(measurements)
df

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

Using DataFrames, data modification is straightforward. For example, one can append a new column and compute its values from existing columns:

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df

	labels	area	minor_axis	major_axis	aspect_ratio
0	1	45	2	3	1.50
1	2	23	4	4	1.00
2	3	68	4	5	1.25
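For comparison, here is a minimal sketch of the same aspect-ratio computation done once on the plain dictionary and once on the DataFrame. The values are the ones from the table above; the variable name `aspect_ratio_list` is only introduced for this illustration.

```python
import pandas as pd

measurements = {
    "labels":     [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

# plain Python: loop over the paired entries of two lists
aspect_ratio_list = [
    major / minor
    for major, minor in zip(measurements["major_axis"], measurements["minor_axis"])
]

# pandas: a single element-wise expression on whole columns
df = pd.DataFrame(measurements)
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]

print(aspect_ratio_list)
print(df["aspect_ratio"].tolist())
```

Both approaches give the same numbers, but the DataFrame version keeps the result attached to the table as a named column.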

Saving data frames#

We can also save this table to a CSV file to continue working with it later.

df.to_csv("../../data/short_table.csv")

Creating DataFrames from lists of lists#

Sometimes we are confronted with data in the form of lists of lists. To make pandas interpret such data correctly, we also need to provide the headers in the same order as the lists.

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]

# convert the data and header lists into a pandas DataFrame;
# the headers are passed as row labels (index), one per inner list
data_frame = pd.DataFrame(data, index=header)

# show it
data_frame

	0	1	2
labels	1	2	3
area	45	23	68
minor_axis	2	4	4
major_axis	3	4	5

As you can see, this table is rotated: each inner list became a row. We can bring it into the usual form like this:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5
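As an alternative to transposing, the headers and the lists can be paired up before the table is built. This is only a sketch of one possible approach, assuming that each inner list holds the values of exactly one column, as in the example above.

```python
import pandas as pd

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]

# pair each header with its list of values, giving a dictionary of lists,
# and build the table column by column
data_frame = pd.DataFrame(dict(zip(header, data)))

print(data_frame)
```

This produces the table in the usual orientation directly, because a dictionary of lists is interpreted column-wise.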

Loading data frames#

Tables can also be read from CSV files.

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
0	0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	0	11	30	35
1	1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	0	53	11	74
2	2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
3	3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	0	144	23	167
4	4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	0	237	29	256
...	...	...	...	...	...	...	...	...	...	...	...	...	...
56	56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82

61 rows × 13 columns
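The column named `Unnamed: 0` in the table above typically appears when a table was saved together with its row index. Whether this is how this particular file was written is an assumption. The following sketch reproduces the effect on a small in-memory table and shows two common ways to avoid the extra column; all variable names are made up for this illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"area": [45, 23, 68], "minor_axis": [2, 4, 4]})

# by default, to_csv writes the row index as a first column without a name;
# calling it without a path returns the CSV content as a string
csv_with_index = df.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))

# option 1: tell read_csv that the first column holds the row index
reloaded_index = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(reloaded_index.columns))

# option 2: do not write the index in the first place
csv_without_index = df.to_csv(index=False)
reloaded_plain = pd.read_csv(io.StringIO(csv_without_index))
print(list(reloaded_plain.columns))
```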

Typically, we don’t need all the information in these tables, so it makes sense to reduce the table. To do so, we first print out the column names.

df_csv.keys()

Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting columns#

We can then copy and paste the column names we’re interested in and create a new data frame. This is especially recommended when tables are overwhelmingly large.

df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis

	area	mean_intensity
0	422	192.379147
1	182	180.131868
2	661	205.216339
3	437	216.585812
4	476	212.302521
...	...	...
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373
60	46	175.304348

61 rows × 2 columns
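Note the double square brackets in the selection above: the inner pair is a Python list of column names. A short sketch of the difference between single and double brackets, using the first three rows of the table above as example data:

```python
import pandas as pd

df = pd.DataFrame({
    "area": [422, 182, 661],
    "mean_intensity": [192.379147, 180.131868, 205.216339],
    "eccentricity": [0.878900, 0.825665, 0.339934],
})

# a single column name returns one column (a pandas Series)
single = df["area"]

# a list of column names returns a table (a pandas DataFrame)
subset = df[["area", "mean_intensity"]]

print(type(single).__name__)
print(type(subset).__name__)
print(list(subset.columns))
```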

Selecting rows#

We may want to focus our further analysis on cells that have a certain minimum area. We can do this by selecting rows. This process is sometimes also called masking.

selected_data = df_analysis[df_analysis["area"] > 50]
selected_data

	area	mean_intensity	total_intensity
0	422	192.379147	81184.0
1	182	180.131868	32784.0
2	661	205.216339	135648.0
3	437	216.585812	94648.0
4	476	212.302521	101056.0
5	277	206.469314	57192.0
6	259	178.007722	46104.0
7	219	191.598174	41960.0
8	67	167.522388	11224.0
10	486	190.946502	92800.0
11	630	173.600000	109368.0
12	221	197.936652	43744.0
13	78	173.128205	13504.0
14	449	208.766147	93736.0
15	516	194.403101	100312.0
16	390	180.779487	70504.0
17	419	196.849642	82480.0
18	267	200.958801	53656.0
19	353	189.779037	66992.0
20	151	186.225166	28120.0
21	400	187.960000	75184.0
22	426	201.577465	85872.0
23	246	182.113821	44800.0
24	503	198.648111	99920.0
25	278	190.187050	52872.0
26	681	198.308370	135048.0
27	176	195.272727	34368.0
28	358	197.787709	70808.0
29	544	198.455882	107960.0
30	597	190.954774	114000.0
31	181	184.883978	33464.0
32	629	193.666137	121816.0
33	596	210.067114	125200.0
35	263	190.022814	49976.0
36	899	198.291435	178264.0
37	476	204.924370	97544.0
38	233	193.167382	45008.0
39	164	184.634146	30280.0
40	394	181.401015	71472.0
41	411	200.253041	82304.0
42	235	189.140426	44448.0
43	375	195.498667	73312.0
44	654	199.706422	130608.0
45	376	208.638298	78448.0
46	579	200.649396	116176.0
47	64	190.250000	12176.0
48	161	183.950311	29616.0
49	457	168.210066	76872.0
50	625	217.894400	136184.0
51	535	189.936449	101616.0
52	205	199.180488	40832.0
53	562	215.928826	121352.0
54	845	198.295858	167560.0
55	280	189.800000	53144.0
56	211	185.061611	39048.0
57	78	185.230769	14448.0
58	86	183.720930	15800.0
59	51	190.431373	9712.0
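To see what happens during masking, it can help to look at the mask itself. The sketch below uses four rows taken from the table above; the names `mask`, `selected` and `mid_sized`, as well as the upper threshold of 100, are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "area": [422, 67, 46, 51],
    "mean_intensity": [192.379147, 167.522388, 175.304348, 190.431373],
})

# the comparison yields one True/False value per row
mask = df["area"] > 50
print(mask.tolist())

# indexing with the mask keeps only the rows where it is True;
# the original row labels are preserved
selected = df[mask]
print(selected)

# conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
mid_sized = df[(df["area"] > 50) & (df["area"] < 100)]
print(mid_sized)
```

Because row labels are preserved, gaps in the index of the selected table (such as the missing rows 9 and 34 above) indicate which rows were filtered out.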

Adding new columns#

We can also compute new columns from existing ones. Because df_analysis was selected from df_csv, we first make an explicit copy; otherwise pandas may emit a SettingWithCopyWarning when we assign a new column to it.

# work on an independent copy of the selected columns
df_analysis = df_analysis.copy()
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis

	area	mean_intensity	total_intensity
0	422	192.379147	81184.0
1	182	180.131868	32784.0
2	661	205.216339	135648.0
3	437	216.585812	94648.0
4	476	212.302521	101056.0
...	...	...	...
56	211	185.061611	39048.0
57	78	185.230769	14448.0
58	86	183.720930	15800.0
59	51	190.431373	9712.0
60	46	175.304348	8064.0

61 rows × 3 columns
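If the original table should stay unchanged, a new column can also be added with `DataFrame.assign`, which returns a new table instead of modifying the existing one. A minimal sketch using the first three rows from above; the name `df_with_total` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "area": [422, 182, 661],
    "mean_intensity": [192.379147, 180.131868, 205.216339],
})

# assign returns a new DataFrame with the additional column;
# df itself is left untouched
df_with_total = df.assign(total_intensity=df["area"] * df["mean_intensity"])

print(list(df.columns))
print(list(df_with_total.columns))
```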

Exercise#

From the loaded CSV file, create a table that only contains these columns:

minor_axis_length
major_axis_length
aspect_ratio

df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape

Image Analysis with Napari and Python (Helmholtz Imaging Summer Academy 2022)
