Introduction to working with DataFrames#

In basic python, we often use dictionaries containing our measurements as vectors. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce panda DataFrames, the primary tool in the Python ecosystem for handling data. Its primary object, the “DataFrame” is extremely useful in wrangling data. which are more handy in the next steps.

import pandas as pd

Creating DataFrames from a dictionary of lists#

Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:

measurements = {
    "labels":      [1, 2, 3],
    "area":       [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

This data structure can be nicely visualized using a DataFrame:

df = pd.DataFrame(measurements)
df

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

Using these DataFrames, data modification is straighforward. For example one can append a new column and compute its values from existing columns. This is done elementwise.

df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df

	labels	area	minor_axis	major_axis	aspect_ratio
0	1	45	2	3	1.50
1	2	23	4	4	1.00
2	3	68	4	5	1.25

Saving data frames#

We can also save this table for continuing to work with it. We chose to save it as a CSV file, where CSV stands for comma-separated value. This is a text file that is easily read into data structures in many programming languages.

df.to_csv("../../data/short_table.csv")

You should generally always store your data in such a format, not necessarily CSV, but a format that is open, has a well-defined specification, and is readable in many contexts. Excel files do not meet these criteria. Neither do .mat files.

Creating DataFrames from lists of lists#

Sometimes, we are confronted to data in form of lists of lists. To make pandas understand that form of data correctly, we also need to provide the headers in the same order as the lists

header = ['labels', 'area', 'minor_axis', 'major_axis']

data = [
    [1, 2, 3],
    [45, 23, 68],
    [2, 4, 4],
    [3, 4, 5],
]
          
# convert the data and header arrays in a pandas data frame
data_frame = pd.DataFrame(data, header)

# show it
data_frame

	0	1	2
labels	1	2	3
area	45	23	68
minor_axis	2	4	4
major_axis	3	4	5

As you can see, this tabls is rotated. We can bring it in the usual form like this:

# rotate/flip it
data_frame = data_frame.transpose()

# show it
data_frame

	labels	area	minor_axis	major_axis
0	1	45	2	3
1	2	23	4	4
2	3	68	4	5

Loading data frames#

Tables can be read from CSV files with pd.read_csv.

df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
0	0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	0	11	30	35
1	1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	0	53	11	74
2	2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
3	3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	0	144	23	167
4	4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	0	237	29	256
...	...	...	...	...	...	...	...	...	...	...	...	...	...
56	56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82

61 rows × 13 columns

That’s a bit too much information. We can use the .head() method of data frames to look at the first few rows (or the .tail() to check the last rows).

df_csv.head()

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-1	bbox-2	bbox-3
0	0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	11	30	35
1	1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	53	11	74
2	2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	95	28	122
3	3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	144	23	167
4	4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	237	29	256

df_csv.tail()

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
56	56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82

We can also get column names with the DataFrame attribute columns.

df_csv.columns

Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
       'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
       'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
      dtype='object')

Selecting rows and columns#

Ok, let’s get the dataframe first row:

df_csv[0]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [12], line 1
----> 1 df_csv[0]

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~\anaconda3\envs\devbio-napari-env\lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 0

Ooops… We got a big error. That’s because we index dataframes by columns.

We can then copy&paste the colum names we’re interested in and create a new data frame. This is recommended especially when tables are overwhelmingly large.

df_csv['area']

   422
   182
   661
   437
   476
     ... 
  211
   78
   86
   51
   46
Name: area, Length: 61, dtype: int64

Notice that when it was printed, the index of the rows came along with it. That’s beacuse a Pandas DataFrame with one column is a Pandas Series.

We can get more columns by passing their names as a list. Furthermore, we can store this “sub-dataframe” in a new variable.

df_analysis = df_csv[ ['area', 'mean_intensity'] ]
df_analysis

	area	mean_intensity
0	422	192.379147
1	182	180.131868
2	661	205.216339
3	437	216.585812
4	476	212.302521
...	...	...
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373
60	46	175.304348

61 rows × 2 columns

df_analysis

	area	mean_intensity
0	422	192.379147
1	182	180.131868
2	661	205.216339
3	437	216.585812
4	476	212.302521
...	...	...
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373
60	46	175.304348

61 rows × 2 columns

This gave us the area measurements we were after.

If we want to get a single row, the proper way of doing that is to use the .loc method:

df_csv.loc[ 2, ['area', 'mean_intensity']]

area              661.000000
mean_intensity    205.216339
Name: 2, dtype: float64

Note that following .loc, we have the index by row then column, separated by a comma, in brackets. It is also important to note that row indices need not be integers. And you should not count on them being integers.

In case you really want to access elements in a dataframe by integer indices, like in a numpy array, you can use the .iloc method.

df_csv.iloc[2:4, 1:3]

	area	mean_intensity
2	661	205.216339
3	437	216.585812

The downside is that the code becomes less explicit when column names are absent.

In practice you will almost never use row indices, but rather use Boolean indexing or Masking.

Boolean indexing of data frames#

In case we want to focus our further analysis on cells that have a certain minimum area, we can do this by passing boolean indices to the rows. This process is also sometimes call masking.

Suppose we want the rows for which df_analysis["area"] > 50. We can essentially plug this syntax directly when using .loc.

df_analysis_filtered  = df_analysis[ df_analysis["area"] > 50]
df_analysis_filtered.tail()

	area	mean_intensity
55	280	189.800000
56	211	185.061611
57	78	185.230769
58	86	183.720930
59	51	190.431373

Adding new columns#

In Pandas, it is very easy to generate new columns from existing columns. We just pass some operation between other columns to a new column name.

df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis

C:\Users\Marcelo_Researcher\AppData\Local\Temp\ipykernel_19228\206920941.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']

	area	mean_intensity	total_intensity
0	422	192.379147	81184.0
1	182	180.131868	32784.0
2	661	205.216339	135648.0
3	437	216.585812	94648.0
4	476	212.302521	101056.0
...	...	...	...
56	211	185.061611	39048.0
57	78	185.230769	14448.0
58	86	183.720930	15800.0
59	51	190.431373	9712.0
60	46	175.304348	8064.0

61 rows × 3 columns

Exercise#

From the loaded CSV file, create a table that only contains these columns:

minor_axis_length
major_axis_length
aspect_ratio

df_shape = pd.read_csv('../../data/blobs_statistics.csv')
df_shape

	Unnamed: 0	area	mean_intensity	minor_axis_length	major_axis_length	eccentricity	extent	feret_diameter_max	equivalent_diameter_area	bbox-0	bbox-1	bbox-2	bbox-3
0	0	422	192.379147	16.488550	34.566789	0.878900	0.586111	35.227830	23.179885	0	11	30	35
1	1	182	180.131868	11.736074	20.802697	0.825665	0.787879	21.377558	15.222667	0	53	11	74
2	2	661	205.216339	28.409502	30.208433	0.339934	0.874339	32.756679	29.010538	0	95	28	122
3	3	437	216.585812	23.143996	24.606130	0.339576	0.826087	26.925824	23.588253	0	144	23	167
4	4	476	212.302521	19.852882	31.075106	0.769317	0.863884	31.384710	24.618327	0	237	29	256
...	...	...	...	...	...	...	...	...	...	...	...	...	...
56	56	211	185.061611	14.522762	18.489138	0.618893	0.781481	18.973666	16.390654	232	39	250	54
57	57	78	185.230769	6.028638	17.579799	0.939361	0.722222	18.027756	9.965575	248	170	254	188
58	58	86	183.720930	5.426871	21.261427	0.966876	0.781818	22.000000	10.464158	249	117	254	139
59	59	51	190.431373	5.032414	13.742079	0.930534	0.728571	14.035669	8.058239	249	228	254	242
60	60	46	175.304348	3.803982	15.948714	0.971139	0.766667	15.033296	7.653040	250	67	254	82