Basic Operations#
Inspiration and some parts came from the Python Data Science GitHub repository (MIT License) and Introduction to Pandas by Google (Apache 2.0 License).
If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.
#!pip install seaborn
#!pip install watermark
import pandas as pd
import numpy as np
import seaborn as sns
Learning Objectives:#
operations
selection
filtering
concat
NaNs
For this notebook, we will continue using the cities and California housing dataframes.
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
cities_dataframe
| | City name | Population |
---|---|---|
0 | San Francisco | 852469 |
1 | San Jose | 1015785 |
2 | Sacramento | 485199 |
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
Manipulating Data#
Applying functions#
You may apply Python’s basic arithmetic operations to `Series`. For example:
population / 1000
0 852.469
1 1015.785
2 485.199
dtype: float64
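These operations are element-wise. Note that when you combine two `Series`, pandas aligns values by index label rather than by position. A minimal sketch, using a made-up `extra` Series with a shifted index:

extra = pd.Series([100, 200], index=[1, 2])
# Values are matched by index label; index 0 has no partner, so it becomes NaN.
population + extra
# 0           NaN
# 1    1015885.0
# 2     485399.0
# dtype: float64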
pandas `Series` can be used as arguments to most NumPy functions:
np.log(population)
0 13.655892
1 13.831172
2 13.092314
dtype: float64
For more complex single-column transformations, you can use `Series.apply`. Like the Python `map` function, `Series.apply` accepts a lambda function as an argument, which is applied to each value.
The example below creates a new `Series` that indicates whether `population` is over one million:
big_city = population.apply(lambda val: val > 1000000)
big_city
0 False
1 True
2 False
dtype: bool
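`Series.apply` also accepts a named function, which is handy when the transformation is too long for a lambda. A minimal sketch, where `describe_size` is a helper we made up for illustration:

def describe_size(val):
    # Map each population value to a short text label.
    return 'big' if val > 1000000 else 'small'

population.apply(describe_size)
# 0    small
# 1      big
# 2    small
# dtype: object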
Filtering#
One can use this result as a boolean mask to select a subset of the dataframe.
cities_dataframe[big_city]
| | City name | Population |
---|---|---|
1 | San Jose | 1015785 |
Here is another way of generating a boolean mask without explicitly using a lambda function.
big_city = cities_dataframe['Population'] > 1000000
big_city
0 False
1 True
2 False
Name: Population, dtype: bool
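Boolean masks can also be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). The parentheses are required because these operators bind more tightly than the comparisons. A minimal sketch:

# Cities with a population between half a million and one million:
# this selects only San Francisco.
medium_city = (cities_dataframe['Population'] > 500000) & (cities_dataframe['Population'] < 1000000)
cities_dataframe[medium_city]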
Adding new columns#
Modifying a `DataFrame` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`; one of them is computed from two existing columns.
cities_dataframe['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities_dataframe['Population density'] = cities_dataframe['Population'] / cities_dataframe['Area square miles']
cities_dataframe
| | City name | Population | Area square miles | Population density |
---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 |
1 | San Jose | 1015785 | 176.53 | 5754.177760 |
2 | Sacramento | 485199 | 97.92 | 4955.055147 |
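New columns do not have to be numeric. As a side note, the boolean mask from the filtering section can be stored as a column as well; a minimal sketch that works on a copy so the dataframe used below stays unchanged (the column name `Is big city` is our own choice):

demo = cities_dataframe.copy()
# Store the boolean comparison directly as a new column.
demo['Is big city'] = demo['Population'] > 1000000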
Concatenating DataFrames#
Let’s imagine we collect another similar data sample, like the one below.
city_names = pd.Series(['Sao Paulo', 'Sao Luis', 'Salvador'])
population = pd.Series([12400232, 1108975, 2886698])
another_cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
another_cities_dataframe['Area square miles'] = pd.Series([587.34, 319.36, 268])
another_cities_dataframe['Population density'] = another_cities_dataframe['Population'] / another_cities_dataframe['Area square miles']
another_cities_dataframe
| | City name | Population | Area square miles | Population density |
---|---|---|---|---|
0 | Sao Paulo | 12400232 | 587.34 | 21112.527667 |
1 | Sao Luis | 1108975 | 319.36 | 3472.491859 |
2 | Salvador | 2886698 | 268.00 | 10771.261194 |
Before concatenating, it is a good idea to add an identifier column to each dataframe, so that we can keep track of where the data came from.
cities_dataframe['Country'] = 'USA'
another_cities_dataframe['Country'] = 'Brazil'
We can now concatenate the two dataframes with the `pandas.concat` function.
result = pd.concat([cities_dataframe, another_cities_dataframe])
result
| | City name | Population | Area square miles | Population density | Country |
---|---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 | USA |
1 | San Jose | 1015785 | 176.53 | 5754.177760 | USA |
2 | Sacramento | 485199 | 97.92 | 4955.055147 | USA |
0 | Sao Paulo | 12400232 | 587.34 | 21112.527667 | Brazil |
1 | Sao Luis | 1108975 | 319.36 | 3472.491859 | Brazil |
2 | Salvador | 2886698 | 268.00 | 10771.261194 | Brazil |
We now have a longer dataframe with some repeated indices. To get unique indices again, we can use `.reset_index`.
result = result.reset_index(drop=True)
result
| | City name | Population | Area square miles | Population density | Country |
---|---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 | USA |
1 | San Jose | 1015785 | 176.53 | 5754.177760 | USA |
2 | Sacramento | 485199 | 97.92 | 4955.055147 | USA |
3 | Sao Paulo | 12400232 | 587.34 | 21112.527667 | Brazil |
4 | Sao Luis | 1108975 | 319.36 | 3472.491859 | Brazil |
5 | Salvador | 2886698 | 268.00 | 10771.261194 | Brazil |
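As a shortcut, `pd.concat` can renumber the rows directly through its `ignore_index` argument, combining both steps into one:

# Equivalent to concatenating and then calling .reset_index(drop=True).
result = pd.concat([cities_dataframe, another_cities_dataframe], ignore_index=True)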
NaNs#
`DataFrame` objects can be created by passing a `dict` that maps `string` column names to their respective `Series`. If the `Series` don’t match in length, the missing entries are filled with special NA/NaN values. We cannot simply assume what these values should have been, because that would distort the results, so we need to deal with these NaN values explicitly.
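A minimal sketch of how such NaNs arise, using two made-up `Series` of different lengths:

# The Series 'b' is shorter, so its missing entry is filled with NaN
# (and the column is promoted to float).
pd.DataFrame({
    'a': pd.Series([1, 2, 3]),
    'b': pd.Series([10, 20]),
})
#    a     b
# 0  1  10.0
# 1  2  20.0
# 2  3   NaN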
We can test for missing values using the `isnull()` function.
We will work with one of the `seaborn` example datasets, `penguins`.
penguins = sns.load_dataset("penguins")
penguins.isnull()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | True | True | True | True | True |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
339 | False | False | True | True | True | True | True |
340 | False | False | False | False | False | False | False |
341 | False | False | False | False | False | False | False |
342 | False | False | False | False | False | False | False |
343 | False | False | False | False | False | False | False |
344 rows × 7 columns
But it is more practical to test whether there are any NaNs at all than to scan the table by eye. We can use the `.isnull().values.any()` approach.
penguins.isnull().values.any()
True
Or we can count the missing values in each column using `.isnull().sum()`.
penguins.isnull().sum()
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64
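Since `True` counts as 1, replacing the sum with a mean gives the fraction of missing values per column, e.g. 11/344 ≈ 3.2% for `sex`:

# Fraction of missing values per column.
penguins.isnull().mean()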
We can then drop all rows with missing entries using the `.dropna()` function.
penguins_cleaned = penguins.dropna()
penguins_cleaned.isnull().sum()
species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
dtype: int64
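By default, `.dropna()` removes a row if any of its columns is missing. Its `subset` argument restricts the check to particular columns, a pattern worth remembering for the exercise below; a minimal sketch:

# Drop a row only if 'bill_length_mm' or 'sex' is missing;
# NaNs in other columns are ignored.
penguins.dropna(subset=['bill_length_mm', 'sex'])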
Exercise#
The table below contains shape and intensity measurements from a biological sample. Make a subset with the columns `Area` and `Mean`. Remove all rows that contain NaNs from it and count the remaining rows.
Afterwards, take the initial table again and make a new subset with the columns `Major` and `Minor`. Remove the NaNs and count the remaining rows again.
What do you conclude?
dataframe = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
dataframe
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
391 rows × 14 columns
Watermark#
from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))
Last updated: 2023-08-24T14:25:40.097708+02:00
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.14.0
Compiler : MSC v.1929 64 bit (AMD64)
OS : Windows
Release : 10
Machine : AMD64
Processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores : 16
Architecture: 64bit
watermark : 2.4.3
numpy : 1.23.5
pandas : 2.0.3
seaborn : 0.12.2
pivottablejs: 0.9.0