# Introduction to Pandas

[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.

Inspiration and some parts of this notebook come from the [Python Data Science Handbook GitHub repository](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master) ([MIT License](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/LICENSE-CODE)) and from [Introduction to Pandas](https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb) by Google ([Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)).

If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it.

In [1]:
#!pip install watermark

## Learning Objectives:

 * Gain an introduction to the *DataFrame* and *Series* data structures of the pandas library

 * Import CSV data into a pandas *DataFrame*

 * Access and manipulate data within a *DataFrame* and *Series*

 * Export *DataFrame* to CSV

## Basic Concepts

The following line imports the *pandas* API and prints the API version:

In [2]:
import pandas as pd
pd.__version__

'2.0.3'

The primary data structures in *pandas* are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html).

### pandas.Series

One way to create a `Series` is to construct a `Series` object. For example:

In [3]:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

0    San Francisco
1         San Jose
2       Sacramento
dtype: object
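
As a minimal sketch of what a `Series` holds, the example below gives the same data a name and inspects its index. The variable name `cities` and the choice of `'City name'` as the series name are ours, used only for illustration.

```python
import pandas as pd

# A Series pairs an index (row labels) with a one-dimensional array of values.
cities = pd.Series(['San Francisco', 'San Jose', 'Sacramento'], name='City name')

print(list(cities.index))  # default integer labels created by pandas
print(cities.name)         # the name is used as the column name inside a DataFrame
print(cities.loc[1])       # look up a value by its index label
```

With the default index, the labels are simply the positions 0, 1 and 2.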

### pandas.DataFrame

`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:

In [4]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

cities_dataframe = pd.DataFrame({ 'City name': city_names, 'Population': population })
cities_dataframe

| | City name | Population |
|---|---|---|
| 0 | San Francisco | 852469 |
| 1 | San Jose | 1015785 |
| 2 | Sacramento | 485199 |
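
To illustrate the note about mismatched lengths, here is a small sketch in which the population `Series` is deliberately one entry short. The names `short_population` and `df` are ours, and the shortened series is an assumption made only for this demonstration.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
# Deliberately one value short: there is no entry for index label 2.
short_population = pd.Series([852469, 1015785])

df = pd.DataFrame({'City name': city_names, 'Population': short_population})
print(df)

# The missing entry is filled with NaN. Because NaN is a floating-point value,
# the integer column is converted to float64.
print(df['Population'].isna())
print(df['Population'].dtype)
```

The rows are aligned on index labels, not on position, so the gap appears at whichever label is missing from the shorter `Series`.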


#### Reading a DataFrame from a file

Most of the time, however, you load an entire file into a `DataFrame`. The following example loads a file with California housing data and displays its first rows with `.head()`:

In [5]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head()
#california_housing_dataframe.tail()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |
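
If the download URL is not reachable, the same call can be tried offline. The sketch below feeds `pd.read_csv` a few rows copied from the table above through an in-memory text buffer. The names `csv_text` and `sample` are ours, and only three of the columns are included.

```python
import io
import pandas as pd

# A tiny CSV document held in memory (values taken from the first rows above).
csv_text = """longitude,latitude,population
-114.31,34.19,1015.0
-114.47,34.4,1129.0
-114.56,33.69,333.0
"""

# read_csv accepts file-like objects as well as paths and URLs.
sample = pd.read_csv(io.StringIO(csv_text), sep=",")

print(sample.head(2))  # first two rows only
print(sample.shape)    # (number of rows, number of columns)
```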


If you need to take a quick peek at the documentation, you can append **?** to a function name in IPython or Jupyter.

In [6]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols=

### Selecting Columns, Rows and Creating Subsets

Indexing a `DataFrame` with a column name returns that column as a `Series`:

In [7]:
california_housing_dataframe['population']

0        1015.0
1        1129.0
2         333.0
3         515.0
4         624.0
          ...  
16995     907.0
16996    1194.0
16997    1244.0
16998    1298.0
16999     806.0
Name: population, Length: 17000, dtype: float64

We can select several columns at once by passing their names as a list. The result is a new `DataFrame`, which we can store in its own variable.

In [8]:
sub_dataframe = california_housing_dataframe[ ['population', 'households'] ]
sub_dataframe

| | population | households |
|---|---|---|
| 0 | 1015.0 | 472.0 |
| 1 | 1129.0 | 463.0 |
| 2 | 333.0 | 117.0 |
| 3 | 515.0 | 226.0 |
| 4 | 624.0 | 262.0 |
| ... | ... | ... |
| 16995 | 907.0 | 369.0 |
| 16996 | 1194.0 | 465.0 |
| 16997 | 1244.0 | 456.0 |
| 16998 | 1298.0 | 478.0 |
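
The difference between the two selection styles shows up in the type of the result. The sketch below uses a three-row stand-in for the housing data (values copied from the table above). The names `housing`, `as_series` and `as_frame` are ours.

```python
import pandas as pd

# Small stand-in for the housing data so the example runs offline.
housing = pd.DataFrame({
    'population': [1015.0, 1129.0, 333.0],
    'households': [472.0, 463.0, 117.0],
})

as_series = housing['population']    # a single name returns a Series
as_frame = housing[['population']]   # a list of names returns a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

Passing a list keeps the result two-dimensional even when the list contains only one column name.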


If we want to get a single row, the proper way is to use the `.loc` indexer, which selects by row label and, optionally, by column labels:

In [9]:
row_with_index_2 = california_housing_dataframe.loc[2,  ['population', 'households'] ]
row_with_index_2

population    333.0
households    117.0
Name: 2, dtype: float64
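
Because `.loc` works with labels, it behaves differently from position-based access once the index is no longer the default 0, 1, 2. The sketch below contrasts `.loc` with its positional counterpart `.iloc` on a small frame. The string index labels `'a'`, `'b'`, `'c'` are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd

housing = pd.DataFrame(
    {'population': [1015.0, 1129.0, 333.0], 'households': [472.0, 463.0, 117.0]},
    index=['a', 'b', 'c'],  # hypothetical row labels
)

# .loc selects by label: the row labelled 'c', restricted to two columns.
by_label = housing.loc['c', ['population', 'households']]

# .iloc selects by integer position: the third row, whatever its label is.
by_position = housing.iloc[2]

print(by_label)
print(by_position)
```

Both calls return the same row here, as a `Series` whose name is the row label.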

In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here.

## Saving data

A `DataFrame` can be saved as a `.csv` file with the `.to_csv` method.

In [16]:
cities_dataframe.to_csv('cities_out.csv', index=False, sep=";")
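
As a quick check of what `index=False` and `sep=";"` do, the following sketch writes a small frame to a temporary directory and reads it back. The temporary path and the names `cities`, `first_line` and `restored` are ours. Note that the same separator has to be passed to `pd.read_csv` when loading the file again.

```python
import os
import tempfile
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cities_out.csv')

    # index=False omits the row labels; sep=';' uses semicolons between fields.
    cities.to_csv(path, index=False, sep=';')

    # Look at the raw header line of the written file.
    with open(path) as f:
        first_line = f.readline().strip()

    # Read the file back with the matching separator.
    restored = pd.read_csv(path, sep=';')

print(first_line)
print(restored)
```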

## Exercise

From the following loaded CSV file, create a table that only contains these columns:

- minor_axis_length
- major_axis_length
- aspect_ratio

In [16]:
blobs_df = pd.read_csv('../../data/blobs_statistics.csv')
blobs_df

| | Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
| 1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
| 2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
| 3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
| 4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
| 57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
| 58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
| 59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
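
As a hint, here is one possible approach sketched on a two-row toy frame (values copied from the first rows above), not on the real file. Note that `aspect_ratio` is not a column in the loaded table. The sketch assumes it is defined as `major_axis_length / minor_axis_length`, which is our assumption, and the names `toy_df` and `subset` are ours.

```python
import pandas as pd

# Toy stand-in for blobs_df with only a few of its columns.
toy_df = pd.DataFrame({
    'area': [422, 182],
    'minor_axis_length': [16.488550, 11.736074],
    'major_axis_length': [34.566789, 20.802697],
})

# Select the two existing columns; .copy() gives an independent table
# so adding a column does not touch toy_df.
subset = toy_df[['minor_axis_length', 'major_axis_length']].copy()

# Assumed definition: ratio of the major to the minor axis length.
subset['aspect_ratio'] = subset['major_axis_length'] / subset['minor_axis_length']

print(subset)
```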


**Watermark**

In [19]:
from watermark import watermark
watermark(iversions=True, globals_=globals())
print(watermark())
print(watermark(packages="watermark,numpy,pandas,seaborn,pivottablejs"))

Last updated: 2023-08-24T14:24:06.278180+02:00

Python implementation: CPython
Python version       : 3.9.17
IPython version      : 8.14.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
CPU cores   : 16
Architecture: 64bit



watermark   : 2.4.3
numpy       : 1.23.5
pandas      : 2.0.3
seaborn     : 0.12.2
pivottablejs: 0.9.0

