The objective of exploratory data analysis (EDA) is to load data into the structures most suitable for further analysis, to identify and rectify any wrong/bad data, and to gain basic insight into the data: the types of fields present; whether they are categorical or not; how many values are missing; how the fields are related; and so on.
EDA is the process of procuring, understanding, and deriving meaningful statistical insights from structured/unstructured data of interest.
Steps in EDA
- Loading the necessary libraries and setting them up
- Data collection
- Data wrangling/munging
- Data cleaning
- Obtaining descriptive statistics
- Visual inspection of the data
- Further data cleaning (outlier removal)
- Advanced visualisation techniques
Loading the necessary libraries and setting them up
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sn
import matplotlib.pyplot as plt
Data collection
Data sources include comma-separated values (CSV) files, Excel files, web scrapes, binary files, and so on.
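As a minimal sketch, the three instruments might be loaded from CSV files as follows (the filenames and the presence of a date index column are assumptions, not details from the original text):
# Hypothetical filenames; each file is assumed to have a date column as its index
A = pd.read_csv('A.csv', index_col=0, parse_dates=True)
B = pd.read_csv('B.csv', index_col=0, parse_dates=True)
C = pd.read_csv('C.csv', index_col=0, parse_dates=True)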
Data wrangling/munging
Use the pandas.DataFrame.join(…) method to merge DataFrames on their index, with the lsuffix= and rsuffix= parameters disambiguating overlapping column names between the left and right DataFrames. In the example below, the first join suffixes only A's overlapping columns with _A, leaving B's columns unsuffixed; the second join then suffixes those columns with _B and C's with _C:
merged_df = A.join(B, how='outer', lsuffix='_A', sort=True).join(C, how='outer', lsuffix='_B', rsuffix='_C', sort=True)
Data cleaning
Data cleaning refers to the process of addressing data errors coming from missing data, incorrect data values, and outliers.
Check if there are any rows where all values are missing (NaN),
merged_df[merged_df.isnull().all(axis=1)]
Find out how many rows exist that have at least one field that is missing/NaN,
merged_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
We need valid Close prices, so drop all rows where the Close price for any of the three instruments is missing,
valid_close_df = merged_df.dropna(subset=['Close_A', 'Close_B', 'Close_C'], how='any')
After dropping rows with missing Close prices, check that no missing Close price fields remain,
valid_close_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
Deal with rows that have NaN values for any of the other fields, starting by getting a sense of how many such rows exist,
valid_close_df.isnull().any(axis=1).sum()
Inspect a few of the rows with at least some fields with a missing value,
valid_close_df[valid_close_df.isnull().any(axis=1)]
Use the pandas.DataFrame.fillna(…) method with method='backfill', which fills each missing value with the next valid value that follows it,
valid_close_complete = valid_close_df.fillna(method='backfill')
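Note that newer pandas releases deprecate the method= argument to fillna(…); on those versions, the equivalent backfill call is:
# Equivalent backfill on newer pandas versions
valid_close_complete = valid_close_df.bfill()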
Check the result,
valid_close_complete.isnull().any(axis=1).sum()
Obtaining descriptive statistics
Generate the key basic statistics on the data to build familiarity with each field, using the DataFrame.describe(…) method,
pd.set_option('display.max_rows', None)
valid_close_complete.describe()
Visual inspection of the data
Plot the prices of A
valid_close_complete['Open_A'].plot(figsize=(12, 6), linestyle='--', color='black', legend='Open_A')
valid_close_complete['Close_A'].plot(figsize=(12, 6), linestyle='-', color='grey', legend='Close_A')
valid_close_complete['Low_A'].plot(figsize=(12, 6), linestyle=':', color='black', legend='Low_A')
valid_close_complete['High_A'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend='High_A')
Plot the prices of C
valid_close_complete['Open_C'].plot(figsize=(12, 6), linestyle='--', color='black', legend='Open_C')
valid_close_complete['Close_C'].plot(figsize=(12, 6), linestyle='-', color='grey', legend='Close_C')
valid_close_complete['Low_C'].plot(figsize=(12, 6), linestyle=':', color='black', legend='Low_C')
valid_close_complete['High_C'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend='High_C')
Further data cleaning can be performed to eliminate outlier values, using two common techniques: the interquartile range (IQR) and the Z-score.
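Both techniques below operate on a prices_only DataFrame holding just the price columns. Its construction is not shown in this section; a minimal sketch, assuming the column-naming pattern used above, might be:
# Assumed construction: select only the OHLC price columns by name pattern
price_cols = [col for col in valid_close_complete.columns
              if col.split('_')[0] in ('Open', 'High', 'Low', 'Close')]
prices_only = valid_close_complete[price_cols]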
The IQR method uses a percentile/quantile range of values over the entire dataset to identify and remove outliers.
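A minimal sketch of IQR-based filtering follows; the 25th/75th percentiles and the 1.5 multiplier are the conventional choices rather than values from the original text, and no_outlier_prices_iqr is a hypothetical name:
# Conventional IQR filter: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every column
q1 = prices_only.quantile(0.25)
q3 = prices_only.quantile(0.75)
iqr = q3 - q1
within_fences = (prices_only >= q1 - 1.5 * iqr) & (prices_only <= q3 + 1.5 * iqr)
no_outlier_prices_iqr = prices_only[within_fences.all(axis=1)]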
The Z-score (or standard score) is obtained by subtracting the mean of the dataset from each data point and normalising the result by dividing by the standard deviation of the dataset; in symbols, z = (x − μ) / σ.
Use scipy.stats.zscore(…) to compute the Z-score of each column in the prices_only DataFrame, and then use numpy.abs(…) to get the magnitude of the Z-scores. Finally, select the rows where all fields have Z-scores lower than 6, and save the result in a no_outlier_prices DataFrame.
no_outlier_prices = prices_only[(np.abs(stats.zscore(prices_only))<6).all(axis=1)]
Plot the prices of C again, now without outliers
no_outlier_prices['Open_C'].plot(figsize=(12,6), linestyle='--', color='black', legend='Open_C')
no_outlier_prices['Close_C'].plot(figsize=(12,6), linestyle='-', color='grey', legend='Close_C')
no_outlier_prices['Low_C'].plot(figsize=(12,6), linestyle=':', color='black', legend='Low_C')
no_outlier_prices['High_C'].plot(figsize=(12,6), linestyle='-.', color='grey', legend='High_C')
Check the impact of our outlier removal work by re-inspecting the descriptive statistics
pd.set_option('display.max_rows', None)
no_outlier_prices[['Open_C', 'Close_C', 'Low_C', 'High_C']].describe()
Reset the number of rows to display for a pandas DataFrame
pd.set_option('display.max_rows', 5)
Advanced visualisation techniques
Explore univariate and multivariate statistics visualisation techniques
Collect the close prices for the three instruments
close_prices = no_outlier_prices[['Close_A', 'Close_B', 'Close_C']]
Compute the daily close price changes to evaluate whether there is a relationship between the daily price changes of the three instruments
Use the pandas.DataFrame.shift(…) method to shift the original DataFrame by one period so that we can compute the price change
delta_close_prices = (close_prices.shift(-1) - close_prices).fillna(0)
delta_close_prices.columns = ['Delta_Close_A', 'Delta_Close_B', 'Delta_Close_C']
delta_close_prices
Inspect the summary statistics for this new DataFrame to get a sense of how the delta price values are distributed
pd.set_option('display.max_rows', None)
delta_close_prices.describe()
Histogram plot
Observe the distribution of Delta_Close_A
delta_close_prices['Delta_Close_A'].plot(kind='hist', bins=100, figsize=(12,6), color='black', grid=True)
Box plot
Draw a box plot, which also helps in assessing the values’ distribution
delta_close_prices['Delta_Close_B'].plot(kind='box', figsize=(12,6), color='black', grid=True)
Correlation charts
The first step in multivariate data statistics is to assess the correlations between Delta_Close_A, Delta_Close_B, and Delta_Close_C.
Draw a scatter matrix of the delta close prices, demonstrating the option of using kernel density estimation (KDE) plots on the diagonal
pd.plotting.scatter_matrix(delta_close_prices, figsize=(10,10), color='black', alpha=0.75, diagonal='kde', grid=True)
Look at some statistics that quantify the relationship between the variables
delta_close_prices.corr()
Pairwise correlation heatmap
plt.figure(figsize=(6,6))
sn.heatmap(delta_close_prices.corr(), annot=True, square=True, linewidths=2)
Revelation of the identity of A, B, and C and EDA’s conclusions
The A instrument is the Dow Jones Industrial Average (DJIA), a large-cap equity index exchange-traded fund (ETF). The B instrument is the S&P 500 (SPY), another large-cap equity index ETF. The C instrument is the Chicago Board Options Exchange (CBOE) Volatility Index (VIX), which tracks how volatile markets are at any given time (essentially, a function of equity index price swings).
From our EDA on the mystery instruments, we drew the following conclusions:
- C (VIX) cannot have negative prices or prices above 90, which has historically held true.
- A (DJIA) and B (SPY) had huge drops in 2008 and 2020, corresponding to the stock market crash and the COVID-19 pandemic, respectively. Also, the price of C (VIX) spiked at the same time, indicating heightened market turmoil.
- A (DJIA) has the largest daily price swings, followed by B (SPY), and finally C (VIX), which has very low daily price swings. These observations are also consistent with the underlying instruments that the labels were hiding.
- A (DJIA) and B (SPY) have very strong positive correlations, which makes sense since both are large-cap equity indices. C (VIX) has strong negative correlations with both A (DJIA) and B (SPY), which also makes sense: during periods of prosperity, volatility remains low and markets rise, while during periods of crisis, volatility spikes and markets drop.
Special Python libraries for EDA
import dtale

# Launch an interactive D-Tale session for the DataFrame in the browser
dtale.show(valid_close_df)
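D-Tale is not part of the standard scientific Python stack; assuming the usual PyPI package name, it can be installed with:
pip install dtale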