Basic Plots, Groupping and Multi-Level Tables

Basic Plots, Groupping and Multi-Level Tables#

Inspiration and some of the parts came from: Python Data Science GitHub repository, MIT License and Introduction to Pandas by Google, Apache 2.0

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

#!pip install seaborn
#!pip install watermark
#!pip install pivottablejs

import pandas as pd
import seaborn as sns

# For generating a pivot table widget
from pivottablejs import pivot_ui
from IPython.display import HTML
from IPython.display import IFrame
import json, io

Learning Objectives:#

Simple plotting from DataFrame
split-apply-combine on tidy data
Pivot tables

For this notebook, we will use the california housing dataframes.

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-114.31	34.19	15.0	5612.0	1283.0	1015.0	472.0	1.4936	66900.0
1	-114.47	34.40	19.0	7650.0	1901.0	1129.0	463.0	1.8200	80100.0
2	-114.56	33.69	17.0	720.0	174.0	333.0	117.0	1.6509	85700.0
3	-114.57	33.64	14.0	1501.0	337.0	515.0	226.0	3.1917	73400.0
4	-114.57	33.57	20.0	1454.0	326.0	624.0	262.0	1.9250	65500.0

Simple Plotting#

The example below is using DataFrame.describe to show interesting statistics about a DataFrame.

california_housing_dataframe.describe()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000
mean	-119.562108	35.625225	28.589353	2643.664412	539.410824	1429.573941	501.221941	3.883578	207300.912353
std	2.005166	2.137340	12.586937	2179.947071	421.499452	1147.852959	384.520841	1.908157	115983.764387
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.790000	33.930000	18.000000	1462.000000	297.000000	790.000000	282.000000	2.566375	119400.000000
50%	-118.490000	34.250000	29.000000	2127.000000	434.000000	1167.000000	409.000000	3.544600	180400.000000
75%	-118.000000	37.720000	37.000000	3151.250000	648.250000	1721.000000	605.250000	4.767000	265000.000000
max	-114.310000	41.950000	52.000000	37937.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

Another powerful feature of pandas is graphing. For example, DataFrame.hist lets you quickly study the distribution of values in a column:

california_housing_dataframe.hist('housing_median_age')

array([[<Axes: title={'center': 'housing_median_age'}>]], dtype=object)

../_images/8b73b297e42d106c8b0c963d3e4e42254980c1ed092dc37a2af02e0120215f3e.png

Another example, DataFrame.plot.scatter lets you quickly study the possition of houses:

california_housing_dataframe.plot.scatter(x='longitude',
                      y='latitude',
                      c='DarkBlue')

<Axes: xlabel='longitude', ylabel='latitude'>

../_images/fe0149caaf0e88cb89518eec84deab3230473e1744efb27ddcd57a69a6549136.png

Boxplots can also be displayed with DataFrame.boxplot.

california_housing_dataframe.boxplot(column=["median_income"])

<Axes: >

../_images/ed6c6cc296e66712801b8518fe1545be9306ed93b22a26270560c796c7227243.png

You can learn other plot options in the plot section of the pandas API here.

Split-apply-combine on Tidy Data#

Tidy Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

Each variable is a column.
Each observation is a row.
Each type of observation has its own separate data frame.

This is less pretty to visualize as a table (you may notice too many rows with repeated values in certain columns), but the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s look at the titanic dataset below.

# load dataset titanic from seaborn package
titanic = sns.load_dataset('titanic')
titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

According to the statements above, this dataframe is tidy, so it is easier to process.

Split-apply-combine#

We might be interested in computing a statistic, let’s say the survival mean, not for the entire data, but according to subgroups. Basically, we want to:

Split the data according to the ‘sex’ criterion field, i.e., split it up so we have a separate data set for the two classes, you for ‘male’ and one for ‘female’.
Apply a function (mean) to the ‘survived’ field in these split data sets.
Combine the results of these averages on the split data set into a new, summary data set that contains the two classes (‘male’ and ‘female’) and mean survival rate for each.

The first step is to apply a groupby operation.

grouped = titanic.groupby('sex')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D743336850>

There is not much to see in the DataFrameGroupBy object that resulted. But there is a lot we can do with this object. Now, we apply the mean function and check the combined result of this operation.

mean_measurements_per_sex = grouped.mean(numeric_only = True)
mean_measurements_per_sex

	survived	pclass	age	sibsp	parch	fare	adult_male	alone
sex
female	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818	0.000000	0.401274
male	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893	0.930676	0.712305

Here the numeric_only option is set to disconsider for calculating mean over columns that contain strings like ‘embark_town’ for example.

The ‘sex’ field is now the index of our dataframe. We can put it back as a column with the .reset_index method.

mean_measurements_per_sex = mean_measurements_per_sex.reset_index()
mean_measurements_per_sex

	sex	survived	pclass	age	sibsp	parch	fare	adult_male	alone
0	female	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818	0.000000	0.401274
1	male	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893	0.930676	0.712305

We may choose to display only the ‘sex’ and ‘survived’ columns.

mean_measurements_per_sex[['sex', 'survived']]

	sex	survived
0	female	0.742038
1	male	0.188908

Now that you know all the individual steps, here is all the above steps in one shot.

mean_survival_per_sex = titanic.groupby('sex').mean(numeric_only = True).reset_index()[['sex', 'survived']]
mean_survival_per_sex

	sex	survived
0	female	0.742038
1	male	0.188908

Pivot Tables#

This is useful, but we might like to go one step deeper and look at survival rates by both sex and, say, class. Using the vocabulary of groupby, we might proceed using a process like this: we first group by ‘class’ and ‘sex’, then select survival, apply a mean aggregate, combine the resulting groups, and finally unstack the hierarchical index to reveal the hidden multidimensionality. In code:

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

class	First	Second	Third
sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

This gives us a better idea of how both sex and class affected survival, but the code is starting to look a bit garbled. While each step of this pipeline makes sense in light of the tools we’ve previously discussed, the long string of code is not particularly easy to read or use. This two-dimensional groupby is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multidimensional aggregation.

Pivot Table Syntax#

Here is the equivalent to the preceding operation using the DataFrame.pivot_table method:

titanic.pivot_table('survived', index='sex', columns='class', aggfunc='mean')

class	First	Second	Third
sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

This is eminently more readable than the manual groupby approach, and produces the same result. As you might expect of an early 20th-century transatlantic cruise, the survival gradient favors both higher classes and people recorded as females in the data. First-class females survived with near certainty (hi, Rose!), while only one in eight or so third-class males survived (sorry, Jack!).

Pivot Table Widgets#

Some of the widgets needs to be specificaly tweaked for Google Colab or Jupyter Notebooks

from pivottablejs import pivot_ui
from IPython.display import HTML
from IPython.display import IFrame
import json, io

# Google colab alternative template

TEMPLATE = u"""
<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>PivotTable.js</title>

        <!-- external libs from cdnjs -->
        <link rel="stylesheet" type="text/css" href="https://cdnjs.cloudflare.com/ajax/libs/c3/0.4.11/c3.min.css">
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/c3/0.4.11/c3.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.2/jquery.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jquery-csv/0.71/jquery.csv-0.71.min.js"></script>


        <link rel="stylesheet" type="text/css" href="https://cdnjs.cloudflare.com/ajax/libs/pivottable/2.19.0/pivot.min.css">
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/pivottable/2.19.0/pivot.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/pivottable/2.19.0/d3_renderers.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/pivottable/2.19.0/c3_renderers.min.js"></script>
        <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/pivottable/2.19.0/export_renderers.min.js"></script>

        <style>
            body {font-family: Verdana;}
            .node {
              border: solid 1px white;
              font: 10px sans-serif;
              line-height: 12px;
              overflow: hidden;
              position: absolute;
              text-indent: 2px;
            }
            .c3-line, .c3-focused {stroke-width: 3px !important;}
            .c3-bar {stroke: white !important; stroke-width: 1;}
            .c3 text { font-size: 12px; color: grey;}
            .tick line {stroke: white;}
            .c3-axis path {stroke: grey;}
            .c3-circle { opacity: 1 !important; }
            .c3-xgrid-focus {visibility: hidden !important;}
        </style>
    </head>
    <body>
        <script type="text/javascript">
            $(function(){

                $("#output").pivotUI(
                    $.csv.toArrays($("#output").text())
                    , $.extend({
                        renderers: $.extend(
                            $.pivotUtilities.renderers,
                            $.pivotUtilities.c3_renderers,
                            $.pivotUtilities.d3_renderers,
                            $.pivotUtilities.export_renderers
                            ),
                        hiddenAttributes: [""]
                    }
                    , {
                        onRefresh: function(config) {
                            var config_copy = JSON.parse(JSON.stringify(config));
                            //delete some values which are functions
                            delete config_copy["aggregators"];
                            delete config_copy["renderers"];
                            //delete some bulky default values
                            delete config_copy["rendererOptions"];
                            delete config_copy["localeStrings"];
                            $("#output2").text(JSON.stringify(config_copy, undefined, 2));
                        }
                    }
                    , %(kwargs)s
                    , %(json_kwargs)s)
                ).show();
             });
        </script>
        <div id="output" style="display: none;">%(csv)s</div>

        <textarea id="output2"
        style="float: left; width: 0px; height: 0px; margin: 0px; opacity:0;" readonly>
        </textarea>

        <button onclick="copyTextFunction()">Copy settings</button>
        <script>
        function copyTextFunction() {
                    var copyText = document.getElementById("output2");
                    copyText.select();
                    document.execCommand("copy");
                    }
        </script>

    </body>
</html>
"""


def pivot_cht_html(df, outfile_path = "pivottablejs.html", url="",
    width="100%", height="500",json_kwargs='',  **kwargs):
  with io.open(outfile_path, 'wt', encoding='utf8') as outfile:
      csv = df.to_csv(encoding='utf8')
      if hasattr(csv, 'decode'):
          csv = csv.decode('utf8')
      outfile.write(TEMPLATE %
            dict(csv=csv, kwargs=json.dumps(kwargs),json_kwargs=json_kwargs))

  IFrame(src=url or outfile_path, width=width, height=height)
  return HTML(outfile_path)

Calling the function ‘pivot_ui’ with Pandas DataFrame as input will allow you interactively explore and plot its values

pivot_ui(titanic)
HTML('pivottablejs.html')

PivotTable.js

Watermark

from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))

Last updated: 2023-08-25T10:51:16.460182+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit

watermark   : 2.4.3
numpy       : 1.23.5
pandas      : 2.0.3
seaborn     : 0.12.2
pivottablejs: 0.9.0