Introduction to working with DataFrames
Contents
Introduction to working with DataFrames#
In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce panda DataFrames, the primary tool in the Python ecosystem for handling data. Its primary object, the “DataFrame” is extremely useful in wrangling data. which are more handy in the next steps.
import pandas as pd
Creating DataFrames from a dictionary of lists#
Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:
measurements = {
"labels": [1, 2, 3],
"area": [45, 23, 68],
"minor_axis": [2, 4, 4],
"major_axis": [3, 4, 5],
}
This data structure can be nicely visualized using a DataFrame:
df = pd.DataFrame(measurements)
df
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns. This is done elementwise.
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels | area | minor_axis | major_axis | aspect_ratio | |
---|---|---|---|---|---|
0 | 1 | 45 | 2 | 3 | 1.50 |
1 | 2 | 23 | 4 | 4 | 1.00 |
2 | 3 | 68 | 4 | 5 | 1.25 |
Saving data frames#
We can also save this table for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that is easily read into data structures in many programming languages.
df.to_csv("../../data/short_table.csv")
You should generally always store your data in such a format, not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files.
Creating DataFrames from lists of lists#
Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists
header = ['labels', 'area', 'minor_axis', 'major_axis']
data = [
[1, 2, 3],
[45, 23, 68],
[2, 4, 4],
[3, 4, 5],
]
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, header)
# show it
data_frame
0 | 1 | 2 | |
---|---|---|---|
labels | 1 | 2 | 3 |
area | 45 | 23 | 68 |
minor_axis | 2 | 4 | 4 |
major_axis | 3 | 4 | 5 |
As you can see, this tabls is rotated. We can bring it in the usual form like this:
# rotate/flip it
data_frame = data_frame.transpose()
# show it
data_frame
labels | area | minor_axis | major_axis | |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Loading data frames#
Tables can be read from CSV files with pd.read_csv.
df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns
That’s a bit too much information. We can use the .head()
method of data frames to look at the first few rows (or the .tail()
to check the last rows).
df_csv.head()
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
df_csv.tail()
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
We can also get column names with the DataFrame attribute columns
.
df_csv.columns
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
dtype='object')
Selecting rows and columns#
Ok, let’s get the dataframe first row:
df_csv[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3800, in Index.get_loc(self, key, method, tolerance)
3799 try:
-> 3800 return self._engine.get_loc(casted_key)
3801 except KeyError as err:
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [12], line 1
----> 1 df_csv[0]
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
3800 return self._engine.get_loc(casted_key)
3801 except KeyError as err:
-> 3802 raise KeyError(key) from err
3803 except TypeError:
3804 # If we have a listlike key, _check_indexing_error will raise
3805 # InvalidIndexError. Otherwise we fall through and re-raise
3806 # the TypeError.
3807 self._check_indexing_error(key)
KeyError: 0
Ooops… We got a big error. That’s because we index dataframes by columns.
We can then copy&paste the colum names we’re interested in and create a new data frame. This is recommended especially when tables are overwhelmingly large.
df_csv['area']
0 422
1 182
2 661
3 437
4 476
...
56 211
57 78
58 86
59 51
60 46
Name: area, Length: 61, dtype: int64
Notice that when it was printed, the index of the rows came along with it. That’s beacuse a Pandas DataFrame with one column is a Pandas Series.
We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.
df_analysis = df_csv[ ['area', 'mean_intensity'] ]
df_analysis
area | mean_intensity | |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
df_analysis
area | mean_intensity | |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
This gave us the area measurements we were after.
If we want to get a single row, the proper way of doing that is to use the .loc
method:
df_csv.loc[ 2, ['area', 'mean_intensity']]
area 661.000000
mean_intensity 205.216339
Name: 2, dtype: float64
Note that following .loc
, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers. And you should not count on them being integers.
In case you really want to access elements in a dataframe by integer indices, like in a numpy array, you can use the .iloc
method.
df_csv.iloc[2:4, 1:3]
area | mean_intensity | |
---|---|---|
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
The downside is that the code becomes less explicit when column names are absent.
In practice you will almost never use row indices, but rather use Boolean indexing or Masking.
Boolean indexing of data frames#
In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes call masking.
Suppose we want the rows for which df_analysis["area"] > 50
. We can essentially plug this syntax directly when using .loc
.
df_analysis_filtered = df_analysis[ df_analysis["area"] > 50]
df_analysis_filtered.tail()
area | mean_intensity | |
---|---|---|
55 | 280 | 189.800000 |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
Adding new columns#
In Pandas, it is very easy to generate new columns from existing columns. We just pass some operation between other columns to a new column name.
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
C:\Users\Marcelo_Researcher\AppData\Local\Temp\ipykernel_19228\206920941.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
area | mean_intensity | total_intensity | |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
... | ... | ... | ... |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
60 | 46 | 175.304348 | 8064.0 |
61 rows × 3 columns
Exercise#
From the loaded CSV file, create a table that only contains these columns:
minor_axis_length
major_axis_length
aspect_ratio
df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape
Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns