Basic Operations#
Inspiration and some parts came from the Python Data Science GitHub repository (MIT License) and Introduction to Pandas by Google (Apache 2.0 License).
If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.
#!pip install seaborn
#!pip install watermark
import pandas as pd
import numpy as np
import seaborn as sns
Learning Objectives:#
operations
selection
filtering
concat
NaNs
For this notebook, we will continue using the cities and California housing dataframes.
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
cities_dataframe
| | City name | Population |
---|---|---|
0 | San Francisco | 852469 |
1 | San Jose | 1015785 |
2 | Sacramento | 485199 |
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
Manipulating Data#
Applying functions#
You may apply Python’s basic arithmetic operations to `Series`. For example:
population / 1000
0 852.469
1 1015.785
2 485.199
dtype: float64
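These operations are element-wise. Note that when you combine two `Series`, pandas aligns values by index label rather than by position. A minimal sketch, using a made-up `extra` Series with a shifted index:

extra = pd.Series([100, 200], index=[1, 2])
# Values are matched by index label; index 0 has no partner, so it becomes NaN.
population + extra
# 0           NaN
# 1    1015885.0
# 2     485399.0
# dtype: float64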
pandas `Series` can be used as arguments to most NumPy functions:
np.log(population)
0 13.655892
1 13.831172
2 13.092314
dtype: float64
For more complex single-column transformations, you can use `Series.apply`. Like the Python `map` function, `Series.apply` accepts a lambda function as an argument, which is applied to each value.
The example below creates a new `Series` that indicates whether `population` is over one million:
big_city = population.apply(lambda val: val > 1000000)
big_city
0 False
1 True
2 False
dtype: bool
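`Series.apply` also accepts a named function, which is handy when the transformation is too long for a lambda. A minimal sketch, where `describe_size` is a helper we made up for illustration:

def describe_size(val):
    # Map each population value to a short text label.
    return 'big' if val > 1000000 else 'small'

population.apply(describe_size)
# 0    small
# 1      big
# 2    small
# dtype: object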
Filtering#
One can use this result as a boolean mask to select a subset of the dataframe.
cities_dataframe[big_city]
| | City name | Population |
---|---|---|
1 | San Jose | 1015785 |
Here is another way of generating a boolean mask without explicitly using a lambda function.
big_city = cities_dataframe['Population'] > 1000000
big_city
0 False
1 True
2 False
Name: Population, dtype: bool
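Boolean masks can also be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). The parentheses are required because these operators bind more tightly than the comparisons. A minimal sketch:

# Cities with a population between half a million and one million:
# this selects only San Francisco.
medium_city = (cities_dataframe['Population'] > 500000) & (cities_dataframe['Population'] < 1000000)
cities_dataframe[medium_city]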
Adding new columns#
Modifying a `DataFrame` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`; one of them is computed from two existing columns.
cities_dataframe['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities_dataframe['Population density'] = cities_dataframe['Population'] / cities_dataframe['Area square miles']
cities_dataframe
| | City name | Population | Area square miles | Population density |
---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 |
1 | San Jose | 1015785 | 176.53 | 5754.177760 |
2 | Sacramento | 485199 | 97.92 | 4955.055147 |
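New columns do not have to be numeric. As a side note, the boolean mask from the filtering section can be stored as a column as well; a minimal sketch that works on a copy so the dataframe used below stays unchanged (the column name `Is big city` is our own choice):

demo = cities_dataframe.copy()
# Store the boolean comparison directly as a new column.
demo['Is big city'] = demo['Population'] > 1000000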
Concatenating DataFrames#
Let’s imagine we collect another similar data sample, like the one below.
city_names = pd.Series(['Sao Paulo', 'Sao Luis', 'Salvador'])
population = pd.Series([12400232, 1108975, 2886698])
another_cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
another_cities_dataframe['Area square miles'] = pd.Series([587.34, 319.36, 268])
another_cities_dataframe['Population density'] = another_cities_dataframe['Population'] / another_cities_dataframe['Area square miles']
another_cities_dataframe
| | City name | Population | Area square miles | Population density |
---|---|---|---|---|
0 | Sao Paulo | 12400232 | 587.34 | 21112.527667 |
1 | Sao Luis | 1108975 | 319.36 | 3472.491859 |
2 | Salvador | 2886698 | 268.00 | 10771.261194 |
Before concatenating, it is a good idea to add an identifier column to each dataframe, so that we can keep track of where the data came from.
cities_dataframe['Country'] = 'USA'
another_cities_dataframe['Country'] = 'Brazil'
We can now concatenate the two dataframes with the `pandas.concat` function.
result = pd.concat([cities_dataframe, another_cities_dataframe])
result
| | City name | Population | Area square miles | Population density | Country |
---|---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 | USA |
1 | San Jose | 1015785 | 176.53 | 5754.177760 | USA |
2 | Sacramento | 485199 | 97.92 | 4955.055147 | USA |
0 | Sao Paulo | 12400232 | 587.34 | 21112.527667 | Brazil |
1 | Sao Luis | 1108975 | 319.36 | 3472.491859 | Brazil |
2 | Salvador | 2886698 | 268.00 | 10771.261194 | Brazil |
We now have a longer dataframe with some repeated indices. To get unique indices again, we can use `.reset_index`.
result = result.reset_index(drop=True)
result
| | City name | Population | Area square miles | Population density | Country |
---|---|---|---|---|---|
0 | San Francisco | 852469 | 46.87 | 18187.945381 | USA |
1 | San Jose | 1015785 | 176.53 | 5754.177760 | USA |
2 | Sacramento | 485199 | 97.92 | 4955.055147 | USA |
3 | Sao Paulo | 12400232 | 587.34 | 21112.527667 | Brazil |
4 | Sao Luis | 1108975 | 319.36 | 3472.491859 | Brazil |
5 | Salvador | 2886698 | 268.00 | 10771.261194 | Brazil |
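As a shortcut, `pd.concat` can renumber the rows directly through its `ignore_index` argument, combining both steps into one:

# Equivalent to concatenating and then calling .reset_index(drop=True).
result = pd.concat([cities_dataframe, another_cities_dataframe], ignore_index=True)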
NaNs#
`DataFrame` objects can be created by passing a `dict` that maps `string` column names to their respective `Series`. If the `Series` don’t match in length, the missing entries are filled with special NA/NaN values. We cannot simply assume what these values should have been, because that would distort the results, so we need to deal with these NaN values explicitly.
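A minimal sketch of how such NaNs arise, using two made-up `Series` of different lengths:

# The Series 'b' is shorter, so its missing entry is filled with NaN
# (and the column is promoted to float).
pd.DataFrame({
    'a': pd.Series([1, 2, 3]),
    'b': pd.Series([10, 20]),
})
#    a     b
# 0  1  10.0
# 1  2  20.0
# 2  3   NaN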
We can test for missing values using the `isnull()` function.
We will work with one of the `seaborn` example datasets, `penguins`.
penguins = sns.load_dataset("penguins")
penguins.isnull()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | True | True | True | True | True |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
339 | False | False | True | True | True | True | True |
340 | False | False | False | False | False | False | False |
341 | False | False | False | False | False | False | False |
342 | False | False | False | False | False | False | False |
343 | False | False | False | False | False | False | False |
344 rows × 7 columns
But it is more practical to test whether there are any NaNs at all than to scan the table by eye. We can use the `.isnull().values.any()` approach.
penguins.isnull().values.any()
True
Or we can count the missing values in each column using `.isnull().sum()`.
penguins.isnull().sum()
species 0
island 0
bill_length_mm 2
bill_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 11
dtype: int64
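Since `True` counts as 1, replacing the sum with a mean gives the fraction of missing values per column, e.g. 11/344 ≈ 3.2% for `sex`:

# Fraction of missing values per column.
penguins.isnull().mean()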
We can then drop all rows with missing entries using the `.dropna()` function.
penguins_cleaned = penguins.dropna()
penguins_cleaned.isnull().sum()
species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
dtype: int64
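By default, `.dropna()` removes a row if any of its columns is missing. Its `subset` argument restricts the check to particular columns, a pattern worth remembering for the exercise below; a minimal sketch:

# Drop a row only if 'bill_length_mm' or 'sex' is missing;
# NaNs in other columns are ignored.
penguins.dropna(subset=['bill_length_mm', 'sex'])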
Exercise#
The table below contains shape and intensity measurements from a biological sample. Make a subset with the columns `Area` and `Mean`. Remove all rows that contain NaNs from it and count the remaining rows.
Afterwards, take the initial table again and make a new subset with the columns `Major` and `Minor`. Remove the NaNs and count the remaining rows again.
What do you conclude?
dataframe = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
dataframe
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
391 rows × 14 columns
Watermark#
from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))
Last updated: 2023-08-24T14:25:40.097708+02:00
Python implementation: CPython
Python version : 3.9.17
IPython version : 8.14.0
Compiler : MSC v.1929 64 bit (AMD64)
OS : Windows
Release : 10
Machine : AMD64
Processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores : 16
Architecture: 64bit
watermark : 2.4.3
numpy : 1.23.5
pandas : 2.0.3
seaborn : 0.12.2
pivottablejs: 0.9.0