Essential Tools: Pandas

Introduction

In this lecture we discuss one of the most useful Python packages for data science: Pandas.

We’ll touch on some highlights here; to learn more, start with the Pandas Getting started tutorials.

Pandas

  • Pandas is a Python library for data manipulation and analysis with an emphasis on tabular data.
  • It can be used to produce high quality plots and integrates nicely with other libraries that expect NumPy arrays.
  • Knowledge of Pandas is essential for a data scientist.

The most important data structure in Pandas is the DataFrame, implemented in the DataFrame class.

Unlike a NumPy array, a DataFrame can have columns of different types.
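For instance, a minimal sketch with made-up values shows each column keeping its own dtype:

```python
import pandas as pd

# Hypothetical toy data: strings, floats, and integers side by side
df = pd.DataFrame({
    'ticker': ['NVDA', 'META'],          # strings (object dtype)
    'price': [49.5, 296.4],              # floats
    'volume': [401277000, 353285000],    # integers
})
print(df.dtypes)
```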

Make it a habit: when you’re given a tabular dataset, load it into a DataFrame.

Fetching, storing and retrieving your data

For demonstration purposes, we’ll use the yfinance package to fetch financial data via the Yahoo! Finance API and store it in a DataFrame.

import pandas as pd
import yfinance as yf

# Download NVIDIA stock prices for 2023 (yf.download already returns a DataFrame)
nvidia_stocks = yf.download('NVDA', start='2023-01-01', end='2023-12-31', progress=False)

It’s important to inspect the data you are working with, and Pandas provides a variety of methods to do so, such as .head(), .tail(), .info(), and .describe().

nvidia_stocks.head()
Open High Low Close Adj Close Volume
Date
2023-01-03 14.851 14.996 14.096 14.315 14.306805 401277000
2023-01-04 14.567 14.853 14.241 14.749 14.740557 431324000
2023-01-05 14.491 14.564 14.148 14.265 14.256832 389168000
2023-01-06 14.474 15.010 14.034 14.859 14.850493 405044000
2023-01-09 15.284 16.056 15.141 15.628 15.619054 504231000

Notice how each row has a label and each column has a label.


A DataFrame is a Python object with many associated methods to explore and manipulate the data.

The method .info() gives you a description of the dataframe.

nvidia_stocks.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 250 entries, 2023-01-03 to 2023-12-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       250 non-null    float64
 1   High       250 non-null    float64
 2   Low        250 non-null    float64
 3   Close      250 non-null    float64
 4   Adj Close  250 non-null    float64
 5   Volume     250 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 13.7 KB

The method .describe() gives you summary statistics of the dataframe.

nvidia_stocks.describe()
Open High Low Close Adj Close Volume
count 250.000000 250.000000 250.000000 250.000000 250.000000 2.500000e+02
mean 36.515168 37.123132 35.926540 36.569260 36.558143 4.735575e+08
std 10.849127 10.916913 10.657784 10.756455 10.756568 1.614028e+08
min 14.474000 14.564000 14.034000 14.265000 14.256832 1.982090e+08
25% 26.831000 27.171000 26.428500 26.899750 26.888915 3.790238e+08
50% 41.635500 42.503000 40.962500 41.893499 41.882687 4.341845e+08
75% 45.770749 46.223248 44.988249 45.646250 45.633528 5.172652e+08
max 50.216000 50.548000 49.411999 50.409000 50.398064 1.543911e+09

Reading to/from a .csv file

Pandas can read and write dataframes with many file formats such as .csv, .json, .parquet, .xlsx, .html, SQL, etc.
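The method names follow a consistent to_*/read_* pattern. As a sketch (with a made-up file name), here is a JSON round trip; formats such as Parquet or Excel work the same way but need optional dependencies (e.g., pyarrow or openpyxl).

```python
import pandas as pd

tiny = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})
tiny.to_json('demo.json')          # write
back = pd.read_json('demo.json')   # read back
print(back)

# tiny.to_parquet('demo.parquet')  # Parquet needs pyarrow or fastparquet
```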

Here we write the dataframe to a .csv file.

nvidia_stocks.to_csv('nvidia_data.csv')

In a Jupyter notebook, we can run a shell command using the ! operator to see the top of the file.

!head nvidia_data.csv
Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,14.85099983215332,14.996000289916992,14.095999717712402,14.3149995803833,14.306804656982422,401277000
2023-01-04,14.567000389099121,14.852999687194824,14.241000175476074,14.74899959564209,14.740556716918945,431324000
2023-01-05,14.491000175476074,14.564000129699707,14.14799976348877,14.265000343322754,14.256832122802734,389168000
2023-01-06,14.473999977111816,15.010000228881836,14.034000396728516,14.859000205993652,14.850493431091309,405044000
2023-01-09,15.284000396728516,16.055999755859375,15.140999794006348,15.628000259399414,15.619053840637207,504231000
2023-01-10,15.506999969482422,15.961999893188477,15.472000122070312,15.909000396728516,15.899893760681152,384101000
2023-01-11,15.84000015258789,16.027999877929688,15.562999725341797,16.000999450683594,15.991839408874512,353285000
2023-01-12,16.100000381469727,16.636999130249023,15.491999626159668,16.51099967956543,16.501544952392578,551409000
2023-01-13,16.277999877929688,16.922000885009766,16.165000915527344,16.89900016784668,16.889324188232422,447287000

And of course we can likewise read a .csv file into a dataframe. This is probably the most common way you will get data into Pandas.

df = pd.read_csv('nvidia_data.csv')
df.head()
Date Open High Low Close Adj Close Volume
0 2023-01-03 14.851 14.996 14.096 14.315 14.306805 401277000
1 2023-01-04 14.567 14.853 14.241 14.749 14.740557 431324000
2 2023-01-05 14.491 14.564 14.148 14.265 14.256832 389168000
3 2023-01-06 14.474 15.010 14.034 14.859 14.850493 405044000
4 2023-01-09 15.284 16.056 15.141 15.628 15.619054 504231000
Caution

But be careful: the index column is not automatically set.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       250 non-null    object 
 1   Open       250 non-null    float64
 2   High       250 non-null    float64
 3   Low        250 non-null    float64
 4   Close      250 non-null    float64
 5   Adj Close  250 non-null    float64
 6   Volume     250 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.8+ KB

Note the index description: the index is now a generic RangeIndex, and Date has become a regular column of dtype object.


To set the index column, we can use the index_col parameter.

df = pd.read_csv('nvidia_data.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 2023-01-03 to 2023-12-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       250 non-null    float64
 1   High       250 non-null    float64
 2   Low        250 non-null    float64
 3   Close      250 non-null    float64
 4   Adj Close  250 non-null    float64
 5   Volume     250 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 13.7+ KB

Working with data columns

In general, we’ll typically describe the rows in the dataframe as items (or observations or data samples) and the columns as features.

df.columns
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

Pandas allows you to reference a column like a Python dictionary key, using the column name in square brackets.

df['Open']
Date
2023-01-03    14.851000
2023-01-04    14.567000
2023-01-05    14.491000
2023-01-06    14.474000
2023-01-09    15.284000
                ...    
2023-12-22    49.195000
2023-12-26    48.967999
2023-12-27    49.511002
2023-12-28    49.643002
2023-12-29    49.813000
Name: Open, Length: 250, dtype: float64

Note that this returns a Series object, the other fundamental data structure in Pandas.

type(df['Open'])
pandas.core.series.Series

Also note that the Series in this case is indexed by dates rather than simple integers.


Pandas also allows you to refer to columns using attribute syntax.

Note that this only works when the column name is a valid Python identifier; in particular, it cannot contain a space.

df.Open
Date
2023-01-03    14.851000
2023-01-04    14.567000
2023-01-05    14.491000
2023-01-06    14.474000
2023-01-09    15.284000
                ...    
2023-12-22    49.195000
2023-12-26    48.967999
2023-12-27    49.511002
2023-12-28    49.643002
2023-12-29    49.813000
Name: Open, Length: 250, dtype: float64

You can select a list of columns:

df[['Open', 'Close']].head()
Open Close
Date
2023-01-03 14.851 14.315
2023-01-04 14.567 14.749
2023-01-05 14.491 14.265
2023-01-06 14.474 14.859
2023-01-09 15.284 15.628

The result is just another DataFrame, which is why we can chain the .head() method.

type(df[['Open', 'Close']])
pandas.core.frame.DataFrame

Changing column names is as simple as assigning to the .columns property.

Let’s adjust the column names to remove spaces.

new_column_names = [x.lower().replace(' ', '_') for x in df.columns]
df.columns = new_column_names
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, 2023-01-03 to 2023-12-29
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       250 non-null    float64
 1   high       250 non-null    float64
 2   low        250 non-null    float64
 3   close      250 non-null    float64
 4   adj_close  250 non-null    float64
 5   volume     250 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 21.8+ KB

Observe that we first created a list of column names without spaces using a list comprehension. This is the pythonic way to generate a new list.


Now all columns can be accessed using the dot notation:

df.adj_close.head()
Date
2023-01-03    14.306805
2023-01-04    14.740557
2023-01-05    14.256832
2023-01-06    14.850493
2023-01-09    15.619054
Name: adj_close, dtype: float64

A sampling of DataFrame methods

There are many useful methods in the DataFrame object. It is important to familiarize yourself with these methods.

The following methods calculate the mean, standard deviation, and median of the numeric columns.

df.mean()
open         3.651517e+01
high         3.712313e+01
low          3.592654e+01
close        3.656926e+01
adj_close    3.655814e+01
volume       4.735575e+08
dtype: float64

Or we can apply the method to a subset of columns:

df[['open', 'close', 'volume', 'adj_close']].mean()
open         3.651517e+01
close        3.656926e+01
volume       4.735575e+08
adj_close    3.655814e+01
dtype: float64
df.std()
open         1.084913e+01
high         1.091691e+01
low          1.065778e+01
close        1.075645e+01
adj_close    1.075657e+01
volume       1.614028e+08
dtype: float64
df.median()
open         4.163550e+01
high         4.250300e+01
low          4.096250e+01
close        4.189350e+01
adj_close    4.188269e+01
volume       4.341845e+08
dtype: float64

Or apply the method to a single column:

df.open.mean()
np.float64(36.515168033599856)
df.high.mean()
np.float64(37.12313207244873)

Plotting methods

Pandas also wraps matplotlib and provides a variety of easy-to-use plotting functions directly from the dataframe object.

These are your “first look” functions and are useful in exploratory data analysis.

Later, we will use more specialized graphics packages to create more sophisticated visualizations.

import matplotlib.pyplot as plt

df.high.plot(label='High')
df.low.plot(label='Low')
plt.title('NVIDIA Stock Price')
plt.ylabel('Dollars')
plt.legend(loc='best')
plt.show()


Or a histogram on the adjusted closing price.

df.adj_close.hist()
plt.xlabel('Adjusted Closing Price')
plt.ylabel('Dollars')
plt.title('NVIDIA Stock Price')
plt.show()

Accessing rows of the DataFrame

So far we’ve seen how to access a column of the DataFrame. To access a row we use different syntax.

To access a row by its index label, use the .loc[] accessor (‘location’).

df.loc['2023-01-23']
open         1.806400e+01
high         1.924500e+01
low          1.781800e+01
close        1.919300e+01
adj_close    1.918202e+01
volume       6.551630e+08
Name: 2023-01-23, dtype: float64

As a tangent, we can use the .apply() method to format the output.

df.loc['2023-01-23'].apply(lambda x: '{:,.2f}'.format(x) if isinstance(x, (int, float)) else x)
open                  18.06
high                  19.25
low                   17.82
close                 19.19
adj_close             19.18
volume       655,163,000.00
Name: 2023-01-23, dtype: object

To access a row by its index position (i.e., like an array index), use the .iloc[] accessor (‘integer location’).

df.iloc[0, :]
open         1.485100e+01
high         1.499600e+01
low          1.409600e+01
close        1.431500e+01
adj_close    1.430680e+01
volume       4.012770e+08
Name: 2023-01-03, dtype: float64

and similarly formatted:

df.iloc[0, :].apply(lambda x: '{:,.2f}'.format(x) if isinstance(x, (int, float)) else x)
open                  14.85
high                  15.00
low                   14.10
close                 14.31
adj_close             14.31
volume       401,277,000.00
Name: 2023-01-03, dtype: object

To iterate over the rows you can use .iterrows().

num_positive_days = 0
for idx, row in df.iterrows():
    if row.close > row.open:
        num_positive_days += 1

print(f"The total number of positive-gain days is {num_positive_days} out of {len(df)} days, or {num_positive_days/len(df):.2%}")
The total number of positive-gain days is 137 out of 250 days, or 54.80%
Note

This is only capturing the intraday gain/loss, not the cumulative inter-day gain/loss.
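To capture inter-day movement instead, one could compare consecutive closing prices; a minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical closing prices on four consecutive trading days
close = pd.Series([14.32, 14.75, 14.27, 14.86])

# Close-to-close (inter-day) changes; .diff() subtracts the previous row
interday_up = (close.diff() > 0).sum()
print(interday_up)  # 2 of the 3 day-to-day changes are positive
```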

Filtering

It is easy to select rows from the data.

All the operations below return a new Series or DataFrame, which itself can be treated the same way as all Series and DataFrames we have seen so far.

tmp_high = df.high > 45
tmp_high.tail()
Date
2023-12-22    True
2023-12-26    True
2023-12-27    True
2023-12-28    True
2023-12-29    True
Name: high, dtype: bool

Summing a Boolean array is the same as counting the number of True values.

sum(tmp_high)
87
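The same count can be obtained with the Series method .sum(), which is the more idiomatic (and faster) form; a small illustration with made-up values:

```python
import pandas as pd

mask = pd.Series([True, False, True, True])
print(sum(mask))   # Python's built-in sum iterates the values
print(mask.sum())  # idiomatic: vectorized count of True entries
```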

Now, let’s select only the rows of df that correspond to tmp_high.

Note

We can pass a Boolean Series to the DataFrame to select rows.

df[tmp_high]
open high low close adj_close volume
Date
2023-07-13 44.518002 46.154999 44.492001 45.977001 45.963242 478204000
2023-07-14 46.583000 48.088001 45.060001 45.469002 45.455391 772075000
2023-07-17 46.289001 46.495998 45.262001 46.460999 46.447094 510488000
2023-07-18 46.701000 47.896000 45.734001 47.493999 47.479786 569164000
2023-07-19 47.464001 47.818001 46.742001 47.077000 47.062908 427502000
... ... ... ... ... ... ...
2023-12-22 49.195000 49.382999 48.466999 48.830002 48.823704 252507000
2023-12-26 48.967999 49.599998 48.959999 49.278999 49.272640 244200000
2023-12-27 49.511002 49.680000 49.084999 49.417000 49.410622 233648000
2023-12-28 49.643002 49.883999 49.411999 49.521999 49.515610 246587000
2023-12-29 49.813000 49.997002 48.750999 49.521999 49.515610 389293000

87 rows × 6 columns


Putting it all together, we can count the number of positive days without iterating over the rows.

positive_days = df[df.close > df.open]
print(f"Total number of positive-gain days is {len(positive_days)}")
positive_days.head()
Total number of positive-gain days is 137
open high low close adj_close volume
Date
2023-01-04 14.567 14.853 14.241 14.749000 14.740557 431324000
2023-01-06 14.474 15.010 14.034 14.859000 14.850493 405044000
2023-01-09 15.284 16.056 15.141 15.628000 15.619054 504231000
2023-01-10 15.507 15.962 15.472 15.909000 15.899894 384101000
2023-01-11 15.840 16.028 15.563 16.000999 15.991839 353285000

Or count the number of days with a gain of more than $2.

very_positive_days = df[(df.close - df.open) > 2]
print(f"Total number of days with gain > $2 is {len(very_positive_days)}")
very_positive_days.head()
Total number of days with gain > $2 is 4
open high low close adj_close volume
Date
2023-06-14 40.824001 43.000000 40.551998 42.997002 42.984131 740465000
2023-08-14 40.486000 43.799999 40.311001 43.752998 43.739902 690286000
2023-08-21 44.493999 47.064999 44.222000 46.966999 46.952946 692573000
2023-08-29 46.666000 49.081001 46.390999 48.784000 48.769402 701397000

Note that this doesn’t explain the total gain for the year. Why?

Creating new columns

To create a new column, simply assign values to it. The column name is similar to a key in a dictionary.

Let’s look at the daily change in closing price.

# Calculate the daily change in closing price
df['daily_change'] = df['close'].diff()

# Create the cumulative profit column
df['cum_profit'] = df['daily_change'].cumsum()

# Display the first few rows to verify the new column
print(df[['close', 'daily_change', 'cum_profit']].head())
             close  daily_change  cum_profit
Date                                        
2023-01-03  14.315           NaN         NaN
2023-01-04  14.749      0.434000    0.434000
2023-01-05  14.265     -0.483999   -0.049999
2023-01-06  14.859      0.594000    0.544001
2023-01-09  15.628      0.769000    1.313001

It is convenient that .diff() by default computes the difference between the current and previous row.
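The optional periods argument controls how far back the difference reaches; a tiny sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 15.0])
print(s.diff().tolist())   # default periods=1: difference from the previous row
print(s.diff(2).tolist())  # difference from two rows back
```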


Let’s look at the histogram of the daily change in stock price.

# Plot histogram of daily_change
plt.figure(figsize=(10, 6))
df['daily_change'].hist(bins=50, edgecolor='black')
plt.title('Histogram of Daily Change in Stock Price')
plt.xlabel('Daily Change')
plt.ylabel('Frequency')
plt.show()


Let’s give each row a gain value as a categorical variable.

for idx, row in df.iterrows():
    if row.daily_change < 0:
        df.loc[idx,'cat_gain']='negative'
    elif row.daily_change < 1:
        df.loc[idx,'cat_gain']='small_gain'
    elif row.daily_change < 2:
        df.loc[idx,'cat_gain']='medium_gain'
    elif row.daily_change >= 2:
        df.loc[idx,'cat_gain']='large_gain'
df.head()
open high low close adj_close volume daily_change cum_profit cat_gain
Date
2023-01-03 14.851 14.996 14.096 14.315 14.306805 401277000 NaN NaN NaN
2023-01-04 14.567 14.853 14.241 14.749 14.740557 431324000 0.434000 0.434000 small_gain
2023-01-05 14.491 14.564 14.148 14.265 14.256832 389168000 -0.483999 -0.049999 negative
2023-01-06 14.474 15.010 14.034 14.859 14.850493 405044000 0.594000 0.544001 small_gain
2023-01-09 15.284 16.056 15.141 15.628 15.619054 504231000 0.769000 1.313001 small_gain

Here is another, more “functional”, way to accomplish the same thing.

First, let’s drop the cat_gain column so we can start fresh.

df.drop('cat_gain', axis=1, inplace=True)
df.head()
open high low close adj_close volume daily_change cum_profit
Date
2023-01-03 14.851 14.996 14.096 14.315 14.306805 401277000 NaN NaN
2023-01-04 14.567 14.853 14.241 14.749 14.740557 431324000 0.434000 0.434000
2023-01-05 14.491 14.564 14.148 14.265 14.256832 389168000 -0.483999 -0.049999
2023-01-06 14.474 15.010 14.034 14.859 14.850493 405044000 0.594000 0.544001
2023-01-09 15.284 16.056 15.141 15.628 15.619054 504231000 0.769000 1.313001

Define a function that classifies rows, and apply it to each row.

def namerow(row):
    if row.daily_change < 0:
        return 'negative'
    elif row.daily_change < 1:
        return 'small_gain'
    elif row.daily_change < 2:
        return 'medium_gain'
    elif row.daily_change >= 2:
        return 'large_gain'

df['cat_gain'] = df.apply(namerow, axis=1)
df.head()
open high low close adj_close volume daily_change cum_profit cat_gain
Date
2023-01-03 14.851 14.996 14.096 14.315 14.306805 401277000 NaN NaN None
2023-01-04 14.567 14.853 14.241 14.749 14.740557 431324000 0.434000 0.434000 small_gain
2023-01-05 14.491 14.564 14.148 14.265 14.256832 389168000 -0.483999 -0.049999 negative
2023-01-06 14.474 15.010 14.034 14.859 14.850493 405044000 0.594000 0.544001 small_gain
2023-01-09 15.284 16.056 15.141 15.628 15.619054 504231000 0.769000 1.313001 small_gain
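A vectorized alternative worth knowing is pd.cut, sketched here on a made-up Series. Note that pd.cut’s bins are right-inclusive by default, so the handling of exact boundary values differs slightly from the < comparisons in the loop above.

```python
import numpy as np
import pandas as pd

changes = pd.Series([float('nan'), 0.43, -0.48, 1.35, 2.91])
cats = pd.cut(changes,
              bins=[-np.inf, 0, 1, 2, np.inf],
              labels=['negative', 'small_gain', 'medium_gain', 'large_gain'])
print(cats.tolist())
```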

Grouping

A powerful DataFrame method is groupby().

This is analagous to GROUP BY in SQL.

It will group the rows of a DataFrame by the values in one (or more) columns and let you iterate through each group.

Here we will look at the average gain among the categories of gains (negative, small, medium, and large) we defined above and stored in the cat_gain column.

gain_groups = df.groupby(by='cat_gain')

Essentially, gain_groups behaves like a dictionary:

  • the keys are the unique values found in the cat_gain column, and
  • the values are DataFrames that contain only the rows having the corresponding unique values.
for gain, gain_data in gain_groups:
    print(gain)
    print(gain_data[['close', 'daily_change']].head())
    print('=============================')
large_gain
                close  daily_change
Date                               
2023-02-23  23.664000      2.910000
2023-05-25  37.980000      7.441999
2023-07-13  45.977001      2.075001
2023-08-14  43.752998      2.897999
2023-08-21  46.966999      3.667999
=============================
medium_gain
                close  daily_change
Date                               
2023-01-20  17.839001      1.074001
2023-01-23  19.193001      1.354000
2023-02-01  20.943001      1.406000
2023-02-07  22.173000      1.084000
2023-02-14  22.971001      1.183001
=============================
negative
                close  daily_change
Date                               
2023-01-05  14.265000     -0.483999
2023-01-18  17.377001     -0.324999
2023-01-19  16.764999     -0.612001
2023-01-30  19.162001     -1.202999
2023-02-03  21.100000     -0.608999
=============================
small_gain
                close  daily_change
Date                               
2023-01-04  14.749000      0.434000
2023-01-06  14.859000      0.594000
2023-01-09  15.628000      0.769000
2023-01-10  15.909000      0.281000
2023-01-11  16.000999      0.091999
=============================
for gain, gain_data in df.groupby("cat_gain"):
    print('The average daily change for the {} group is {}'.format(gain,
                                                           gain_data.daily_change.mean()))
The average daily change for the large_gain group is 3.798599624633789
The average daily change for the medium_gain group is 1.29327072968354
The average daily change for the negative group is -0.6942264628860185
The average daily change for the small_gain group is 0.41535633389312465
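Group statistics like these can also be computed without an explicit loop by aggregating directly on the grouped object; a minimal sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'cat_gain': ['negative', 'small_gain', 'small_gain'],
                    'daily_change': [-0.5, 0.4, 0.6]})
# .agg applies one or more reductions to each group at once
print(toy.groupby('cat_gain')['daily_change'].agg(['mean', 'count']))
```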

Other Pandas Classes

A DataFrame is essentially an annotated 2-D array.

Pandas also has annotated versions of 1-D and higher-dimensional arrays.

A 1-D array in Pandas is called a Series. You can think of a DataFrame as a dictionary of Series.

Higher-dimensional (e.g., 3-D) data is represented using a MultiIndex, a hierarchical index on a Series or DataFrame.

For more information read the documentation.
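A minimal sketch of both ideas, on made-up values: a DataFrame built from a dict of Series, and an extra dimension represented with a MultiIndex.

```python
import pandas as pd

# A DataFrame is essentially a dict of Series sharing one index
open_ = pd.Series([14.85, 14.57], index=['2023-01-03', '2023-01-04'])
close = pd.Series([14.32, 14.75], index=['2023-01-03', '2023-01-04'])
prices = pd.DataFrame({'open': open_, 'close': close})
print(type(prices['open']))

# An extra dimension (ticker x date) via a MultiIndex on a Series
idx = pd.MultiIndex.from_product([['NVDA', 'META'],
                                  ['2023-01-03', '2023-01-04']],
                                 names=['ticker', 'date'])
s = pd.Series([14.32, 14.75, 124.6, 127.4], index=idx)
print(s.loc['NVDA'])  # select one ticker's slice
```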

Comparing multiple stocks

As a last task, we will use what we have learned so far, and pick up a few new tools, to compare the performance of different stocks.

stocks = ['NVDA', 'META', 'MSFT', 'TSLA', 'IBM', 'INTC']
stock_df = pd.DataFrame()
for s in stocks:
    # yf.download already returns a DataFrame; keep only its 'Close' column
    stock_df[s] = yf.download(s,
                              start='2023-09-01',
                              end='2024-08-30',
                              progress=False)['Close']
stock_df.head()
NVDA META MSFT TSLA IBM INTC
Date
2023-09-01 48.508999 296.380005 328.660004 245.009995 147.940002 36.610001
2023-09-05 48.548000 300.149994 333.549988 256.489990 148.130005 36.709999
2023-09-06 47.061001 299.170013 332.880005 251.919998 148.059998 36.980000
2023-09-07 46.241001 298.670013 329.910004 251.490005 147.520004 38.180000
2023-09-08 45.571999 297.890015 334.269989 248.500000 147.679993 38.009998

Let’s look at the closing prices of the stocks.

import matplotlib.pyplot as plt

stock_df.plot()
plt.title('Stock Closing Prices')
plt.ylabel('Dollars')
plt.legend(loc='best')
plt.show()

But that is not as interesting as the returns.


So next, we calculate the returns over a period of length \(T\). The returns are defined as

\[ r(t) = \frac{f(t)-f(t-T)}{f(t-T)}. \]

The returns can be computed with the simple DataFrame method pct_change(). Note that for the first \(T\) timesteps this value is not defined.

rets = stock_df.pct_change(30)
rets.iloc[25:35]
NVDA META MSFT TSLA IBM INTC
Date
2023-10-09 NaN NaN NaN NaN NaN NaN
2023-10-10 NaN NaN NaN NaN NaN NaN
2023-10-11 NaN NaN NaN NaN NaN NaN
2023-10-12 NaN NaN NaN NaN NaN NaN
2023-10-13 NaN NaN NaN NaN NaN NaN
2023-10-16 -0.049764 0.083575 0.012110 0.036366 -0.059010 -0.001366
2023-10-17 -0.094958 0.079460 -0.004467 -0.006394 -0.052724 -0.017706
2023-10-18 -0.103377 0.059498 -0.008321 -0.036678 -0.054640 -0.036236
2023-10-19 -0.089531 0.047343 0.004274 -0.124776 -0.064466 -0.065741
2023-10-20 -0.091833 0.036121 -0.022736 -0.146922 -0.071235 -0.081294
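To connect pct_change() back to the formula, here is a check on toy values with \(T = 1\):

```python
import pandas as pd

f = pd.Series([100.0, 110.0, 99.0])
manual = (f - f.shift(1)) / f.shift(1)  # r(t) = (f(t) - f(t-T)) / f(t-T), T = 1
auto = f.pct_change()
print(auto)
# the two agree up to floating-point rounding
print((auto - manual).abs().max() < 1e-12)
```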

Now we’ll plot the time series of the rolling 30-day returns of the different stocks.

Notice that the NaN values are dropped by the plotting function.

rets[['NVDA', 'META']].plot()
plt.ylabel('Returns (%)')
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.2f}%'.format(y * 100)))
plt.show()


Let’s do a scatter plot of the returns of NVDA versus META.

plt.scatter(rets.NVDA, rets.META)
plt.xlabel('NVDA 30-day returns')
plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.2f}%'.format(y * 100)))
plt.ylabel('META 30-day returns')
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.2f}%'.format(y * 100)))
plt.tight_layout()  # Adjust layout to prevent label cutoff
plt.show()

There appears to be some (fairly strong) correlation between the movement of NVDA and META stocks. Let’s measure this.


The correlation coefficient between variables \(X\) and \(Y\) is defined as follows

\[ \text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}. \]

Pandas provides a DataFrame method called corr() that computes the correlation coefficient of all pairs of columns.

rets.corr()
NVDA META MSFT TSLA IBM INTC
NVDA 1.000000 0.710654 0.534221 -0.120444 0.212919 0.189073
META 0.710654 1.000000 0.456792 -0.374675 0.574674 0.218855
MSFT 0.534221 0.456792 1.000000 -0.012553 0.331533 0.710729
TSLA -0.120444 -0.374675 -0.012553 1.000000 -0.016866 0.305960
IBM 0.212919 0.574674 0.331533 -0.016866 1.000000 0.402570
INTC 0.189073 0.218855 0.710729 0.305960 0.402570 1.000000

It takes a bit of time to examine that table and draw conclusions.


To speed that process up let’s visualize the table.

import seaborn as sns

sns.heatmap(rets.corr(), annot=True)
plt.show()

It does seem like there is a strong correlation between NVDA and META.

What about TSLA and META?


plt.scatter(rets.TSLA, rets.META)
plt.xlabel('TSLA 30-day returns')
plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.2f}%'.format(y * 100)))
plt.ylabel('META 30-day returns')
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: '{:.2f}%'.format(y * 100)))
plt.tight_layout()  # Adjust layout to prevent label cutoff
plt.show()

What can we say about the 30-day returns of TSLA and META?


Pandas plotting

As mentioned, the plotting performed by Pandas is just a layer on top of matplotlib (i.e., the plt module).

So Pandas’ plots can (and often should) be replaced or improved using additional functions from matplotlib.

For example, suppose we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk).


Here is a visualization of the result of such an analysis. We construct the plot using only functions from matplotlib.

plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Standard Deviation (Risk)')
plt.xlim([-.05, .1])
plt.ylim([0, .3])
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (30, -30),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
plt.show()

To understand what these functions are doing (especially the annotate function), you will need to consult the online documentation for matplotlib.

Recap

In this section we got a first glimpse of the Pandas library.

We learned how to:

  • load data from a CSV file
  • inspect the data
  • manipulate the data
  • plot the data
  • access rows and columns of the dataframe
  • filter the data
  • create new columns
  • group the data
  • compute the correlation between variables