Introduction to working with DataFrames#

In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce panda DataFrames, the primary tool in the Python ecosystem for handling data. Its primary object, the “DataFrame” is extremely useful in wrangling data. which are more handy in the next steps.

import pandas as pd

Creating DataFrames from a dictionary of lists#

Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:

measurements = {
    "labels":      [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

df = pd.DataFrame(measurements)
df
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns. This is done elementwise.

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
labels area minor_axis major_axis aspect_ratio
0 1 45 2 3 1.50
1 2 23 4 4 1.00
2 3 68 4 5 1.25

Saving data frames#

We can also save this table for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that is easily read into data structures in many programming languages.

df.to_csv("../../data/short_table.csv")

You should generally always store your data in such a format, not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files.

Creating DataFrames from lists of lists#

Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]
          
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, header)

# show it
data_frame
0 1 2
labels 1 2 3
area 45 23 68
minor_axis 2 4 4
major_axis 3 4 5

As you can see, this tabls is rotated. We can bring it in the usual form like this:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame
labels area minor_axis major_axis
0 1 45 2 3
1 2 23 4 4
2 3 68 4 5

Loading data frames#

Tables can be read from CSV files with pd.read_csv.

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 56 211 185.061611 14.522762 18.489138 0.618893 0.781481 18.973666 16.390654 232 39 250 54
57 57 78 185.230769 6.028638 17.579799 0.939361 0.722222 18.027756 9.965575 248 170 254 188
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

61 rows × 13 columns

That’s a bit too much information. We can use the .head() method of data frames to look at the first few rows (or the .tail() to check the last rows).

df_csv.head()
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
df_csv.tail()
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
56 56 211 185.061611 14.522762 18.489138 0.618893 0.781481 18.973666 16.390654 232 39 250 54
57 57 78 185.230769 6.028638 17.579799 0.939361 0.722222 18.027756 9.965575 248 170 254 188
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

We can also get column names with the DataFrame attribute columns.

df_csv.columns
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting rows and columns#

Ok, let’s get the dataframe first row:

df_csv[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [12], line 1
----> 1 df_csv[0]

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 0

Ooops… We got a big error. That’s because we index dataframes by columns.

We can then copy&paste the colum names we’re interested in and create a new data frame. This is recommended especially when tables are overwhelmingly large.

df_csv['area']
0     422
1     182
2     661
3     437
4     476
     ... 
56    211
57     78
58     86
59     51
60     46
Name: area, Length: 61, dtype: int64

Notice that when it was printed, the index of the rows came along with it. That’s beacuse a Pandas DataFrame with one column is a Pandas Series.

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.

df_analysis = df_csv[ ['area', 'mean_intensity'] ]
df_analysis
area mean_intensity
0 422 192.379147
1 182 180.131868
2 661 205.216339
3 437 216.585812
4 476 212.302521
... ... ...
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373
60 46 175.304348

61 rows × 2 columns

df_analysis
area mean_intensity
0 422 192.379147
1 182 180.131868
2 661 205.216339
3 437 216.585812
4 476 212.302521
... ... ...
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373
60 46 175.304348

61 rows × 2 columns

This gave us the area measurements we were after.

If we want to get a single row, the proper way of doing that is to use the .loc method:

df_csv.loc[ 2, ['area', 'mean_intensity']]
area              661.000000
mean_intensity    205.216339
Name: 2, dtype: float64

Note that following .loc, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers. And you should not count on them being integers.

In case you really want to access elements in a dataframe by integer indices, like in a numpy array, you can use the .iloc method.

df_csv.iloc[2:4, 1:3]
area mean_intensity
2 661 205.216339
3 437 216.585812

The downside is that the code becomes less explicit when column names are absent.

In practice you will almost never use row indices, but rather use Boolean indexing or Masking.

Boolean indexing of data frames#

In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes call masking.

Suppose we want the rows for which df_analysis["area"] > 50. We can essentially plug this syntax directly when using .loc.

df_analysis_filtered  = df_analysis[ df_analysis["area"] > 50]
df_analysis_filtered.tail()
area mean_intensity
55 280 189.800000
56 211 185.061611
57 78 185.230769
58 86 183.720930
59 51 190.431373

Adding new columns#

In Pandas, it is very easy to generate new columns from existing columns. We just pass some operation between other columns to a new column name.

df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
C:\Users\Marcelo_Researcher\AppData\Local\Temp\ipykernel_19228\206920941.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
area mean_intensity total_intensity
0 422 192.379147 81184.0
1 182 180.131868 32784.0
2 661 205.216339 135648.0
3 437 216.585812 94648.0
4 476 212.302521 101056.0
... ... ... ...
56 211 185.061611 39048.0
57 78 185.230769 14448.0
58 86 183.720930 15800.0
59 51 190.431373 9712.0
60 46 175.304348 8064.0

61 rows × 3 columns

Exercise#

From the loaded CSV file, create a table that only contains these columns:

  • minor_axis_length

  • major_axis_length

  • aspect_ratio

df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape
Unnamed: 0 area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 56 211 185.061611 14.522762 18.489138 0.618893 0.781481 18.973666 16.390654 232 39 250 54
57 57 78 185.230769 6.028638 17.579799 0.939361 0.722222 18.027756 9.965575 248 170 254 188
58 58 86 183.720930 5.426871 21.261427 0.966876 0.781818 22.000000 10.464158 249 117 254 139
59 59 51 190.431373 5.032414 13.742079 0.930534 0.728571 14.035669 8.058239 249 228 254 242
60 60 46 175.304348 3.803982 15.948714 0.971139 0.766667 15.033296 7.653040 250 67 254 82

61 rows × 13 columns