Basic Operations#

Inspiration and some parts of this notebook came from the Python Data Science GitHub repository (MIT License) and Introduction to Pandas by Google (Apache 2.0 License).

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

#!pip install seaborn
#!pip install watermark
import pandas as pd
import numpy as np
import seaborn as sns

Learning Objectives:#

  • Operations on Series and DataFrames

  • Selection

  • Filtering

  • Concatenation

  • Handling NaNs

For this notebook, we will continue using the cities and California housing DataFrames.

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
cities_dataframe
       City name  Population
0  San Francisco      852469
1       San Jose     1015785
2     Sacramento      485199
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0

Manipulating Data#

Applying functions#

You may apply Python’s basic arithmetic operations to Series. For example:

population / 1000
0     852.469
1    1015.785
2     485.199
dtype: float64

pandas Series can be used as arguments to most NumPy functions:

np.log(population)
0    13.655892
1    13.831172
2    13.092314
dtype: float64

For more complex single-column transformations, you can use Series.apply. Like the Python map function, Series.apply accepts a function (often a lambda), which is applied to each value.

The example below creates a new Series that indicates whether population is over one million:

big_city = population.apply(lambda val: val > 1000000)
big_city
0    False
1     True
2    False
dtype: bool
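
Series.apply is not limited to lambdas: any function that takes a single value works too. A minimal sketch (the function name over_a_million is just illustrative):

def over_a_million(val):
    # True if a city's population exceeds one million
    return val > 1000000

population.apply(over_a_million)
0    False
1     True
2    False
dtype: bool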

Filtering#

We can use this result as a boolean mask to select a subset of the DataFrame.

cities_dataframe[big_city]
  City name  Population
1  San Jose     1015785

Here is another way of generating a boolean mask without explicitly using a lambda function.

big_city = cities_dataframe['Population'] > 1000000
big_city
0    False
1     True
2    False
Name: Population, dtype: bool
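
Boolean masks can also be combined element-wise with & (and) and | (or), with parentheses around each condition. A minimal sketch using the columns defined above (the variable name is just illustrative):

big_san_city = (cities_dataframe['Population'] > 500000) & (cities_dataframe['City name'].str.startswith('San'))
big_san_city
0     True
1     True
2    False
dtype: bool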

Adding new columns#

Modifying DataFrames is also straightforward. For example, the following code adds two Series to an existing DataFrame. One of them is computed from two existing columns.

cities_dataframe['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities_dataframe['Population density'] = cities_dataframe['Population'] / cities_dataframe['Area square miles']
cities_dataframe
       City name  Population  Area square miles  Population density
0  San Francisco      852469              46.87        18187.945381
1       San Jose     1015785             176.53         5754.177760
2     Sacramento      485199              97.92         4955.055147
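
As a side note, the same columns can be added without modifying the original DataFrame: DataFrame.assign returns a copy with the extra columns. A minimal sketch (column names containing spaces have to be passed via dictionary unpacking; cities_copy is just an illustrative name):

cities_copy = cities_dataframe.assign(
    **{'Population density': cities_dataframe['Population'] / cities_dataframe['Area square miles']}
)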

Concatenating DataFrames#

Let’s imagine we collect another similar data sample, like the one below.

city_names = pd.Series(['Sao Paulo', 'Sao Luis', 'Salvador'])
population = pd.Series([12400232, 1108975, 2886698])

another_cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })

another_cities_dataframe['Area square miles'] = pd.Series([587.34, 319.36, 268])
another_cities_dataframe['Population density'] = another_cities_dataframe['Population'] / another_cities_dataframe['Area square miles']
another_cities_dataframe
   City name  Population  Area square miles  Population density
0  Sao Paulo    12400232             587.34        21112.527667
1   Sao Luis     1108975             319.36         3472.491859
2   Salvador     2886698             268.00        10771.261194

Before concatenating, it is probably a good idea to insert an identifier column so that we can keep track of where the data came from.

We can easily do that by creating a new column in each dataframe before concatenating.

cities_dataframe['Country'] = 'USA'
another_cities_dataframe['Country'] = 'Brazil'

We can now concatenate similar DataFrames with the pd.concat function.

result = pd.concat([cities_dataframe, another_cities_dataframe])
result
       City name  Population  Area square miles  Population density Country
0  San Francisco      852469              46.87        18187.945381     USA
1       San Jose     1015785             176.53         5754.177760     USA
2     Sacramento      485199              97.92         4955.055147     USA
0      Sao Paulo    12400232             587.34        21112.527667  Brazil
1       Sao Luis     1108975             319.36         3472.491859  Brazil
2       Salvador     2886698             268.00        10771.261194  Brazil

We now have a longer DataFrame with some repeated indices. To get unique indices again, we can use .reset_index(drop=True).

result = result.reset_index(drop=True)
result
       City name  Population  Area square miles  Population density Country
0  San Francisco      852469              46.87        18187.945381     USA
1       San Jose     1015785             176.53         5754.177760     USA
2     Sacramento      485199              97.92         4955.055147     USA
3      Sao Paulo    12400232             587.34        21112.527667  Brazil
4       Sao Luis     1108975             319.36         3472.491859  Brazil
5       Salvador     2886698             268.00        10771.261194  Brazil
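
As an alternative to calling .reset_index afterwards, pd.concat can renumber the rows directly via its ignore_index parameter; this should produce the same result in one step:

result = pd.concat([cities_dataframe, another_cities_dataframe], ignore_index=True)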

NaNs#

DataFrame objects can be created by passing a dict mapping string column names to their respective Series. If the Series don't match in length, missing values are filled with special NA/NaN values. We cannot just assume what these values should be, because that would distort the results, so we need to deal with these NaN values.
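
A minimal sketch of this mechanism, with two Series of different lengths; the shorter column is padded with NaN (and promoted to float):

pd.DataFrame({'a': pd.Series([1, 2, 3]), 'b': pd.Series([10, 20])})
   a     b
0  1  10.0
1  2  20.0
2  3   NaN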

We can test for missing values using the isnull() function.

We will work with one of the seaborn example datasets, Penguins.

penguins = sns.load_dataset("penguins")
penguins.isnull()
    species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g    sex
0     False   False           False          False              False        False  False
1     False   False           False          False              False        False  False
2     False   False           False          False              False        False  False
3     False   False            True           True               True         True   True
4     False   False           False          False              False        False  False
..      ...     ...             ...            ...                ...          ...    ...
339   False   False            True           True               True         True   True
340   False   False           False          False              False        False  False
341   False   False           False          False              False        False  False
342   False   False           False          False              False        False  False
343   False   False           False          False              False        False  False

344 rows × 7 columns

But it is more practical to test whether there are any NaNs at all than to scan the output by eye. We can use the .isnull().values.any() approach.

penguins.isnull().values.any()
True

Or we can count the missing values in each column using .isnull().sum().

penguins.isnull().sum()
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64
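
If we want to inspect the incomplete rows themselves rather than just count them, .isnull().any(axis=1) gives a per-row mask that can be used for filtering, just like in the Filtering section above:

penguins[penguins.isnull().any(axis=1)]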

We can drop all rows containing missing entries with the .dropna() function.

penguins_cleaned = penguins.dropna()
penguins_cleaned.isnull().sum()
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
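
By default, .dropna discards a row if any of its columns is missing. If only some columns matter, the subset parameter restricts the check; a minimal sketch that only drops rows where the sex entry is missing:

penguins.dropna(subset=['sex'])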

Exercise#

The table below contains shape and intensity measurements from a biological sample. Make a subset with the columns Area and Mean. Remove all rows that contain NaNs from it and count the remaining rows.

Afterwards, take the initial table again and make a new subset with the columns Major and Minor. Remove NaNs and count the remaining rows again.

What do you conclude?

dataframe = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
dataframe
      Area     Mean   StdDev    Min     Max        X        Y       XM       YM   Major   Minor    Angle  %Area Type
1     18.0  730.389  103.354  592.0   948.0  435.000    4.722  434.962    4.697   5.987   3.828  168.425    100    A
2    126.0  718.333   90.367  556.0  1046.0  388.087    8.683  388.183    8.687  16.559   9.688  175.471    100    A
3      NaN      NaN      NaN  608.0   964.0      NaN      NaN      NaN    7.665   7.359     NaN  101.121    100    A
4     68.0  686.985   61.169  571.0   880.0  126.147    8.809  126.192    8.811  15.136   5.720  168.133    100    A
5      NaN      NaN   69.438  566.0   792.0  348.500    7.500      NaN    7.508     NaN   3.088      NaN    100    A
..     ...      ...      ...    ...     ...      ...      ...      ...      ...     ...     ...      ...    ...  ...
387  152.0  801.599  111.328  582.0  1263.0  348.487  497.632  348.451  497.675  17.773  10.889   11.829    100    A
388   17.0  742.706   69.624  620.0   884.0  420.500  496.382  420.513      NaN     NaN   3.663   49.457    100    A
389   60.0  758.033   77.309  601.0   947.0  259.000  499.300  258.990  499.289   9.476   8.062   90.000    100    A
390   12.0  714.833   67.294  551.0   785.0  240.167  498.167  240.179  498.148   4.606   3.317  168.690    100    A
391   23.0  695.043   67.356  611.0   846.0   49.891  503.022   49.882  502.979   6.454   4.537   73.243    100    A

391 rows × 14 columns
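
If you need a starting point, one possible approach is sketched below (the variable names are just illustrative; run it and compare the two counts to draw your conclusion):

# subset of Area and Mean, rows with NaNs removed
area_mean = dataframe[['Area', 'Mean']].dropna()

# subset of Major and Minor, rows with NaNs removed
major_minor = dataframe[['Major', 'Minor']].dropna()

print(len(area_mean), len(major_minor))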

Watermark#

from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))
Last updated: 2023-08-24T14:25:40.097708+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark   : 2.4.3
numpy       : 1.23.5
pandas      : 2.0.3
seaborn     : 0.12.2
pivottablejs: 0.9.0