Introduction to working with DataFrames
Contents
Introduction to working with DataFrames#
In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that we introduce panda DataFrames which are more handy in the next steps. In Python, scientists often call tables “DataFrames”.
import pandas as pd
Creating DataFrames from a dictionary of lists#
Assume we did some image processing and have some results in available in a dictionary that contains lists of numbers:
measurements = {
"labels": [1, 2, 3],
"area": [45, 23, 68],
"minor_axis": [2, 4, 4],
"major_axis": [3, 4, 5],
}
This data structure can be nicely visualized using a DataFrame:
df = pd.DataFrame(measurements)
df
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns:
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels | area | minor_axis | major_axis | aspect_ratio | |
---|---|---|---|---|---|
0 | 1 | 45 | 2 | 3 | 1.50 |
1 | 2 | 23 | 4 | 4 | 1.00 |
2 | 3 | 68 | 4 | 5 | 1.25 |
Saving data frames#
We can also save this table for continuing to work with it.
df.to_csv("../../data/short_table.csv")
Creating DataFrames from lists of lists#
Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists
header = ['labels', 'area', 'minor_axis', 'major_axis']
data = [
[1, 2, 3],
[45, 23, 68],
[2, 4, 4],
[3, 4, 5],
]
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, header)
# show it
data_frame
0 | 1 | 2 | |
---|---|---|---|
labels | 1 | 2 | 3 |
area | 45 | 23 | 68 |
minor_axis | 2 | 4 | 4 |
major_axis | 3 | 4 | 5 |
As you can see, this tabls is rotated. We can bring it in the usual form like this:
# rotate/flip it
data_frame = data_frame.transpose()
# show it
data_frame
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Loading data frames#
Tables can also be read from CSV files.
df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns
Typically, we don’t need all the information in these tables and thus, it makes sense to reduce the table. For that, we print out the column names first.
df_csv.keys()
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
dtype='object')
Selecting columns#
We can then copy&paste the colum names we’re interested in and create a new data frame. This is recommended especially when tables are overwhelmingly large.
df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis
area | mean_intensity | |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
Selecting rows#
In case we want to focus our further analysis on cells that have a certain minimum area. We can do this by selecting rows. The process is also sometimes call masking.
selected_data = df_analysis[ df_analysis["area"] > 50]
selected_data
area | mean_intensity | total_intensity | |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
5 | 277 | 206.469314 | 57192.0 |
6 | 259 | 178.007722 | 46104.0 |
7 | 219 | 191.598174 | 41960.0 |
8 | 67 | 167.522388 | 11224.0 |
10 | 486 | 190.946502 | 92800.0 |
11 | 630 | 173.600000 | 109368.0 |
12 | 221 | 197.936652 | 43744.0 |
13 | 78 | 173.128205 | 13504.0 |
14 | 449 | 208.766147 | 93736.0 |
15 | 516 | 194.403101 | 100312.0 |
16 | 390 | 180.779487 | 70504.0 |
17 | 419 | 196.849642 | 82480.0 |
18 | 267 | 200.958801 | 53656.0 |
19 | 353 | 189.779037 | 66992.0 |
20 | 151 | 186.225166 | 28120.0 |
21 | 400 | 187.960000 | 75184.0 |
22 | 426 | 201.577465 | 85872.0 |
23 | 246 | 182.113821 | 44800.0 |
24 | 503 | 198.648111 | 99920.0 |
25 | 278 | 190.187050 | 52872.0 |
26 | 681 | 198.308370 | 135048.0 |
27 | 176 | 195.272727 | 34368.0 |
28 | 358 | 197.787709 | 70808.0 |
29 | 544 | 198.455882 | 107960.0 |
30 | 597 | 190.954774 | 114000.0 |
31 | 181 | 184.883978 | 33464.0 |
32 | 629 | 193.666137 | 121816.0 |
33 | 596 | 210.067114 | 125200.0 |
35 | 263 | 190.022814 | 49976.0 |
36 | 899 | 198.291435 | 178264.0 |
37 | 476 | 204.924370 | 97544.0 |
38 | 233 | 193.167382 | 45008.0 |
39 | 164 | 184.634146 | 30280.0 |
40 | 394 | 181.401015 | 71472.0 |
41 | 411 | 200.253041 | 82304.0 |
42 | 235 | 189.140426 | 44448.0 |
43 | 375 | 195.498667 | 73312.0 |
44 | 654 | 199.706422 | 130608.0 |
45 | 376 | 208.638298 | 78448.0 |
46 | 579 | 200.649396 | 116176.0 |
47 | 64 | 190.250000 | 12176.0 |
48 | 161 | 183.950311 | 29616.0 |
49 | 457 | 168.210066 | 76872.0 |
50 | 625 | 217.894400 | 136184.0 |
51 | 535 | 189.936449 | 101616.0 |
52 | 205 | 199.180488 | 40832.0 |
53 | 562 | 215.928826 | 121352.0 |
54 | 845 | 198.295858 | 167560.0 |
55 | 280 | 189.800000 | 53144.0 |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
Adding new columns#
You can then access columns and add new columns.
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
C:\Users\rober\AppData\Local\Temp\ipykernel_25216\206920941.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
area | mean_intensity | total_intensity | |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
... | ... | ... | ... |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
60 | 46 | 175.304348 | 8064.0 |
61 rows × 3 columns
Exercise#
From the loaded CSV file, create a table that only contains these columns:
minor_axis_length
major_axis_length
aspect_ratio
df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape