Introduction to working with DataFrames#

In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that we introduce panda DataFrames which are more handy in the next steps. In Python, scientists often call tables “DataFrames”.

import pandas as pd

Creating DataFrames from a dictionary of lists#

Assume we did some image processing and have some results in available in a dictionary that contains lists of numbers:

measurements = {
    "labels":      [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

df = pd.DataFrame(measurements)
df
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns:

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels area minor_axis major_axis aspect_ratio
0 1 45 2 3 1.50
1 2 23 4 4 1.00
2 3 68 4 5 1.25

Saving data frames#

We can also save this table for continuing to work with it.

df.to_csv("../../data/short_table.csv")

Creating DataFrames from lists of lists#

Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]
          
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, header)

# show it
data_frame
0 1 2
labels 1 2 3
area 45 23 68
minor_axis 2 4 4
major_axis 3 4 5

As you can see, this tabls is rotated. We can bring it in the usual form like this:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Loading data frames#

Tables can also be read from CSV files.

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 56 211 185.061611 14.522762 18.489138 0.618893 0.781481 18.973666 16.390654 232 39 250 54
57 57 78 185.230769 6.028638 17.579799 0.939361 0.722222 18.027756 9.965575 248 170 254 188
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

61 rows × 13 columns

Typically, we don’t need all the information in these tables and thus, it makes sense to reduce the table. For that, we print out the column names first.

df_csv.keys()
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting columns#

We can then copy&paste the colum names we’re interested in and create a new data frame. This is recommended especially when tables are overwhelmingly large.

df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis
area mean_intensity
0 422 192.379147
1 182 180.131868
2 661 205.216339
3 437 216.585812
4 476 212.302521
... ... ...
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373
60 46 175.304348

61 rows × 2 columns

Selecting rows#

In case we want to focus our further analysis on cells that have a certain minimum area. We can do this by selecting rows. The process is also sometimes call masking.

selected_data  = df_analysis[ df_analysis["area"] > 50]
selected_data
area mean_intensity total_intensity
0 422 192.379147 81184.0
1 182 180.131868 32784.0
2 661 205.216339 135648.0
3 437 216.585812 94648.0
4 476 212.302521 101056.0
5 277 206.469314 57192.0
6 259 178.007722 46104.0
7 219 191.598174 41960.0
8 67 167.522388 11224.0
10 486 190.946502 92800.0
11 630 173.600000 109368.0
12 221 197.936652 43744.0
13 78 173.128205 13504.0
14 449 208.766147 93736.0
15 516 194.403101 100312.0
16 390 180.779487 70504.0
17 419 196.849642 82480.0
18 267 200.958801 53656.0
19 353 189.779037 66992.0
20 151 186.225166 28120.0
21 400 187.960000 75184.0
22 426 201.577465 85872.0
23 246 182.113821 44800.0
24 503 198.648111 99920.0
25 278 190.187050 52872.0
26 681 198.308370 135048.0
27 176 195.272727 34368.0
28 358 197.787709 70808.0
29 544 198.455882 107960.0
30 597 190.954774 114000.0
31 181 184.883978 33464.0
32 629 193.666137 121816.0
33 596 210.067114 125200.0
35 263 190.022814 49976.0
36 899 198.291435 178264.0
37 476 204.924370 97544.0
38 233 193.167382 45008.0
39 164 184.634146 30280.0
40 394 181.401015 71472.0
41 411 200.253041 82304.0
42 235 189.140426 44448.0
43 375 195.498667 73312.0
44 654 199.706422 130608.0
45 376 208.638298 78448.0
46 579 200.649396 116176.0
47 64 190.250000 12176.0
48 161 183.950311 29616.0
49 457 168.210066 76872.0
50 625 217.894400 136184.0
51 535 189.936449 101616.0
52 205 199.180488 40832.0
53 562 215.928826 121352.0
54 845 198.295858 167560.0
55 280 189.800000 53144.0
56 211 185.061611 39048.0
57 78 185.230769 14448.0
58 86 183.720930 15800.0
59 51 190.431373 9712.0

Adding new columns#

You can then access columns and add new columns.

df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
C:\Users\rober\AppData\Local\Temp\ipykernel_25216\206920941.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
area mean_intensity total_intensity
0 422 192.379147 81184.0
1 182 180.131868 32784.0
2 661 205.216339 135648.0
3 437 216.585812 94648.0
4 476 212.302521 101056.0
... ... ... ...
56 211 185.061611 39048.0
57 78 185.230769 14448.0
58 86 183.720930 15800.0
59 51 190.431373 9712.0
60 46 175.304348 8064.0

61 rows × 3 columns

Exercise#

From the loaded CSV file, create a table that only contains these columns:

  • minor_axis_length

  • major_axis_length

  • aspect_ratio

df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape