Exploratory Data Analysis

Inspiration and some parts of this notebook come from the Python Data Science GitHub repository (MIT License) and Introduction to Pandas by Google (Apache 2.0).

If you are running this notebook in Google Colab, uncomment the two pip install lines in the cell below before running it. Otherwise, leave them commented out.

#!pip install seaborn
#!pip install watermark
import pandas as pd
import seaborn as sns
from scipy import stats

Learning Objectives:

  • compute descriptive statistics as a first step of exploratory data analysis (EDA)

  • compute and visualize a correlation matrix

For this notebook, we will use the California housing dataset, loaded into a pandas DataFrame.

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
... ... ... ... ... ... ... ... ... ...
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

17000 rows × 9 columns

Exploring Data

As shown above, a few rows of a large DataFrame give only a limited overview of its contents. The DataFrame.describe method helps by reporting summary statistics for every numeric column: count, mean, standard deviation, minimum, quartiles, and maximum.

california_housing_dataframe.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000
mean -119.562108 35.625225 28.589353 2643.664412 539.410824 1429.573941 501.221941 3.883578 207300.912353
std 2.005166 2.137340 12.586937 2179.947071 421.499452 1147.852959 384.520841 1.908157 115983.764387
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.790000 33.930000 18.000000 1462.000000 297.000000 790.000000 282.000000 2.566375 119400.000000
50% -118.490000 34.250000 29.000000 2127.000000 434.000000 1167.000000 409.000000 3.544600 180400.000000
75% -118.000000 37.720000 37.000000 3151.250000 648.250000 1721.000000 605.250000 4.767000 265000.000000
max -114.310000 41.950000 52.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
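
One thing describe makes easy to spot is skew: when the mean sits well above the 50% value (the median), as for total_rooms above, the column likely has a long right tail. The sketch below illustrates this on a small synthetic DataFrame and also shows the percentiles argument of describe. The column names and values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical, not the housing dataset):
# one roughly symmetric column and one right-skewed column.
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=1000),
    "right_skewed": rng.lognormal(mean=3.0, sigma=1.0, size=1000),
})

# Ask describe() for extra percentiles besides the default quartiles.
summary = toy.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
print(summary)

# Compare mean and median per column; a large positive gap hints at
# a long right tail.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]
print(mean_minus_median)

# Sample skewness: near 0 for symmetric data, positive for a right tail.
skewness = toy.skew()
print(skewness)
```

The same three calls can be applied to california_housing_dataframe directly; the synthetic frame is used here only so the example runs without a download.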

Another useful method is DataFrame.head, which displays the first few records of a DataFrame (five by default). You can pass the number of rows to display.

california_housing_dataframe.head(10)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
5 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74000.0
6 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82400.0
7 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48500.0
8 -114.59 33.61 34.0 4789.0 1175.0 3134.0 1056.0 2.1782 58400.0
9 -114.60 34.83 46.0 1497.0 309.0 787.0 271.0 2.1908 48100.0

Similarly, DataFrame.tail displays the last few records of a DataFrame:

california_housing_dataframe.tail()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0
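
In the output above, the first and last rows sit at opposite ends of the longitude range, so the file appears to be ordered by longitude and its two ends may not be representative. A random sample, together with a few structural checks, can round out the first look. The following is a minimal sketch on a tiny hand-made DataFrame (the column names and values are hypothetical); the same calls apply to california_housing_dataframe.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame with one missing value (hypothetical data).
df = pd.DataFrame({
    "rooms": [3, 5, 2, 4, 6, 1],
    "price": [120.0, 250.0, np.nan, 180.0, 310.0, 90.0],
})

print(df.shape)          # (number of rows, number of columns)
print(df.dtypes)         # data type of each column
missing_per_column = df.isna().sum()  # count of missing values per column
print(missing_per_column)

first_two = df.head(2)   # first 2 rows
last_two = df.tail(2)    # last 2 rows
# Randomly chosen rows; random_state makes the draw reproducible.
random_two = df.sample(n=2, random_state=0)
print(first_two, last_two, random_two, sep="\n\n")
```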

Correlation Matrix

Consider the table of measurements below.

blobs_statistics = pd.read_csv('../../data/blobs_statistics.csv', index_col=0)
blobs_statistics.head()
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
0 422 192.379147 16.488550 34.566789 0.878900 0.586111 35.227830 23.179885 0 11 30 35
1 182 180.131868 11.736074 20.802697 0.825665 0.787879 21.377558 15.222667 0 53 11 74
2 661 205.216339 28.409502 30.208433 0.339934 0.874339 32.756679 29.010538 0 95 28 122
3 437 216.585812 23.143996 24.606130 0.339576 0.826087 26.925824 23.588253 0 144 23 167
4 476 212.302521 19.852882 31.075106 0.769317 0.863884 31.384710 24.618327 0 237 29 256

After measuring many features or properties, some of them are often strongly correlated and add little new information. In pandas, we can calculate the pairwise correlation between columns (Pearson by default) like this.

blobs_statistics.corr()
area mean_intensity minor_axis_length major_axis_length eccentricity extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 bbox-2 bbox-3
area 1.000000 0.548612 0.890649 0.895282 -0.192147 -0.267454 0.916652 0.975964 -0.066508 -0.081937 0.034083 -0.003961
mean_intensity 0.548612 1.000000 0.657131 0.440678 -0.362592 -0.011555 0.487183 0.611103 0.015188 0.217484 0.069184 0.266504
minor_axis_length 0.890649 0.657131 1.000000 0.664507 -0.566486 -0.037872 0.716706 0.937795 -0.163017 -0.056785 -0.077817 0.015790
major_axis_length 0.895282 0.440678 0.664507 1.000000 0.168454 -0.551362 0.995196 0.880909 -0.010743 -0.128821 0.093556 -0.057776
eccentricity -0.192147 -0.362592 -0.566486 0.168454 1.000000 -0.432629 0.103529 -0.272402 0.257938 -0.060467 0.253671 -0.076793
extent -0.267454 -0.011555 -0.037872 -0.551362 -0.432629 1.000000 -0.517428 -0.278453 -0.076688 0.048511 -0.128149 0.019310
feret_diameter_max 0.916652 0.487183 0.716706 0.995196 0.103529 -0.517428 1.000000 0.911211 -0.025173 -0.122607 0.080054 -0.049283
equivalent_diameter_area 0.975964 0.611103 0.937795 0.880909 -0.272402 -0.278453 0.911211 1.000000 -0.107059 -0.096706 -0.004660 -0.018489
bbox-0 -0.066508 0.015188 -0.163017 -0.010743 0.257938 -0.076688 -0.025173 -0.107059 1.000000 0.050957 0.993418 0.053563
bbox-1 -0.081937 0.217484 -0.056785 -0.128821 -0.060467 0.048511 -0.122607 -0.096706 0.050957 1.000000 0.032728 0.996062
bbox-2 0.034083 0.069184 -0.077817 0.093556 0.253671 -0.128149 0.080054 -0.004660 0.993418 0.032728 1.000000 0.041855
bbox-3 -0.003961 0.266504 0.015790 -0.057776 -0.076793 0.019310 -0.049283 -0.018489 0.053563 0.996062 0.041855 1.000000
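
To pick out the strongly correlated pairs without scanning the whole table, one option is to keep only the upper triangle of the matrix and filter by absolute value. The sketch below uses synthetic data with made-up column names and an arbitrary threshold of 0.9; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic measurements (hypothetical): "diameter" is almost a linear
# function of "length", while "noise" is unrelated to both.
rng = np.random.default_rng(seed=1)
length = rng.uniform(10, 40, size=200)
measurements = pd.DataFrame({
    "length": length,
    "diameter": 0.5 * length + rng.normal(0, 0.5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

corr = measurements.corr()  # Pearson correlation by default

# Keep entries strictly above the diagonal so each pair appears once
# and the trivial self-correlations (1.0) are dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Arbitrary threshold chosen for this example.
threshold = 0.9
strong_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(strong_pairs)
```

Applied to blobs_statistics.corr(), the same steps would surface pairs such as major_axis_length and feret_diameter_max, whose correlation is 0.995 in the table above.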

The matrix is hard to read in numeric form. Is there a better way to look at the data?

Below we take a quick shortcut to seaborn and display the correlation matrix as a heatmap.

# calculate the correlation matrix on the numeric columns
corr = blobs_statistics.select_dtypes('number').corr()

# plot the heatmap
sns.heatmap(corr, cmap="Blues", annot=False)
[Figure: heatmap of the correlation matrix of blobs_statistics]
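
Because correlation coefficients range from -1 to 1, a diverging colormap centered at zero can make negative and positive correlations easier to tell apart than a sequential one. The following sketch is one possible variant, not the plot produced above. It uses synthetic data with hypothetical column names, and the "vlag" colormap and the upper-triangle mask are choices made for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical columns): b rises with a, c falls with a.
rng = np.random.default_rng(seed=2)
a = rng.normal(size=300)
data = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=300),
    "c": -a + rng.normal(scale=0.3, size=300),
})
corr = data.corr()

# Hide the cells above the diagonal; they mirror the cells below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# vmin/vmax pin the color scale to the full correlation range and
# center=0 puts the neutral color at zero correlation.
ax = sns.heatmap(corr, mask=mask, cmap="vlag", vmin=-1, vmax=1,
                 center=0, annot=True, fmt=".2f", square=True)
ax.set_title("Correlation matrix (synthetic data)")
```

With annot=True each cell also shows its value, which stays readable for small matrices; for a 12 by 12 matrix like the one above, leaving annotations off may be the cleaner choice.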

Watermark

from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))
Last updated: 2023-08-24T14:26:10.347260+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark   : 2.4.3
numpy       : 1.23.5
pandas      : 2.0.3
seaborn     : 0.12.2
pivottablejs: 0.9.0