The objective of exploratory data analysis (EDA) is to load data into the structures most suitable for further analysis, to identify and rectify any wrong/bad data, and to gain basic insight into the data: the types of fields present; whether they are categorical or not; how many values are missing; how the fields are related; and so on.
EDA is the process of procuring, understanding, and deriving meaningful statistical insights from structured/unstructured data of interest.
Steps in EDA
- Loading the necessary libraries and setting them up
- Data collection
- Data wrangling/munging
- Data cleaning
- Obtaining descriptive statistics
- Visual inspection of the data
- Further data cleaning (outlier removal)
- Advanced visualisation techniques
Loading the necessary libraries and setting them up
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sn
import matplotlib.pyplot as plt
Data collection
Data sources include comma-separated values (CSV) files, Excel files, web scrapes, binary files, and so on.
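As a minimal sketch, the three instruments might be loaded from CSV files as follows (the filenames and the presence of a date index column are assumptions, not details from the original text):
# Hypothetical filenames; each file is assumed to have a date column as its index
A = pd.read_csv('A.csv', index_col=0, parse_dates=True)
B = pd.read_csv('B.csv', index_col=0, parse_dates=True)
C = pd.read_csv('C.csv', index_col=0, parse_dates=True)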
Data wrangling/munging
Use the pandas.DataFrame.join(…) method to merge DataFrames on their index, with the lsuffix= and rsuffix= parameters disambiguating overlapping column names between the left and right DataFrames. In the example below, the first join suffixes only A's overlapping columns with _A, leaving B's columns unsuffixed; the second join then suffixes those columns with _B and C's with _C:
merged_df = A.join(B, how='outer', lsuffix='_A', sort=True).join(C, how='outer', lsuffix='_B', rsuffix='_C', sort=True)
Data cleaning
Data cleaning refers to the process of addressing data errors coming from missing data, incorrect data values, and outliers.
Check if there are any rows where all values are missing (NaN),
merged_df[merged_df.isnull().all(axis=1)]
Find out how many rows exist that have at least one field that is missing/NaN,
merged_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
We need valid Close prices, so drop all rows where the Close price for any of the three instruments is missing,
valid_close_df = merged_df.dropna(subset=['Close_A', 'Close_B', 'Close_C'], how='any')
After dropping rows with missing Close prices, check that no missing Close price fields remain,
valid_close_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
Deal with rows that have NaN values for any of the other fields, starting by getting a sense of how many such rows exist,
valid_close_df.isnull().any(axis=1).sum()
Inspect a few of the rows with at least some fields with a missing value,
valid_close_df[valid_close_df.isnull().any(axis=1)]
Use the pandas.DataFrame.fillna(…) method with method='backfill', which fills each missing value with the next valid value that follows it,
valid_close_complete = valid_close_df.fillna(method='backfill')
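Note that newer pandas releases deprecate the method= argument to fillna(…); on those versions, the equivalent backfill call is:
# Equivalent backfill on newer pandas versions
valid_close_complete = valid_close_df.bfill()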
Check the result,
valid_close_complete.isnull().any(axis=1).sum()
Obtaining descriptive statistics
Generate the key basic statistics on the data to build familiarity with each field, using the DataFrame.describe(…) method,
pd.set_option('display.max_rows', None)
valid_close_complete.describe()
Visual inspection of the data
Plot the prices of A
valid_close_complete['Open_A'].plot(figsize=(12, 6), linestyle='--', color='black', legend='Open_A')
valid_close_complete['Close_A'].plot(figsize=(12, 6), linestyle='-', color='grey', legend='Close_A')
valid_close_complete['Low_A'].plot(figsize=(12, 6), linestyle=':', color='black', legend='Low_A')
valid_close_complete['High_A'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend='High_A')
Plot the prices of C
valid_close_complete['Open_C'].plot(figsize=(12, 6), linestyle='--', color='black', legend='Open_C')
valid_close_complete['Close_C'].plot(figsize=(12, 6), linestyle='-', color='grey', legend='Close_C')
valid_close_complete['Low_C'].plot(figsize=(12, 6), linestyle=':', color='black', legend='Low_C')
valid_close_complete['High_C'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend='High_C')
Further data cleaning can be performed to eliminate outlier values, using two common techniques: the interquartile range (IQR) and the Z-score.
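Both techniques below operate on a prices_only DataFrame holding just the price columns. Its construction is not shown in this section; a minimal sketch, assuming the column-naming pattern used above, might be:
# Assumed construction: select only the OHLC price columns by name pattern
price_cols = [col for col in valid_close_complete.columns
              if col.split('_')[0] in ('Open', 'High', 'Low', 'Close')]
prices_only = valid_close_complete[price_cols]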
The IQR method uses a percentile/quantile range of values over the entire dataset to identify and remove outliers.
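A minimal sketch of IQR-based filtering follows; the 25th/75th percentiles and the 1.5 multiplier are the conventional choices rather than values from the original text, and no_outlier_prices_iqr is a hypothetical name:
# Conventional IQR filter: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every column
q1 = prices_only.quantile(0.25)
q3 = prices_only.quantile(0.75)
iqr = q3 - q1
within_fences = (prices_only >= q1 - 1.5 * iqr) & (prices_only <= q3 + 1.5 * iqr)
no_outlier_prices_iqr = prices_only[within_fences.all(axis=1)]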
The Z-score (or standard score) is obtained by subtracting the mean of the dataset from each data point and normalising the result by dividing by the standard deviation of the dataset; in symbols, z = (x − μ) / σ.
Use scipy.stats.zscore(…) to compute the Z-score of each column in the prices_only DataFrame, and then use numpy.abs(…) to get the magnitude of the Z-scores. Finally, select the rows where all fields have Z-scores lower than 6, and save the result in a no_outlier_prices DataFrame.
no_outlier_prices = prices_only[(np.abs(stats.zscore(prices_only))<6).all(axis=1)]
Plot the prices of C again, now without outliers
no_outlier_prices['Open_C'].plot(figsize=(12,6), linestyle='--', color='black', legend='Open_C')
no_outlier_prices['Close_C'].plot(figsize=(12,6), linestyle='-', color='grey', legend='Close_C')
no_outlier_prices['Low_C'].plot(figsize=(12,6), linestyle=':', color='black', legend='Low_C')
no_outlier_prices['High_C'].plot(figsize=(12,6), linestyle='-.', color='grey', legend='High_C')
Check the impact of our outlier removal work by re-inspecting the descriptive statistics
pd.set_option('display.max_rows', None)
no_outlier_prices[['Open_C', 'Close_C', 'Low_C', 'High_C']].describe()
Reset the number of rows to display for a pandas DataFrame
pd.set_option('display.max_rows', 5)
Advanced visualisation techniques
Explore univariate and multivariate statistics visualisation techniques
Collect the close prices for the three instruments
close_prices = no_outlier_prices[['Close_A', 'Close_B', 'Close_C']]
Compute the daily close price changes to evaluate whether there is a relationship between the daily price changes of the three instruments
Use the pandas.DataFrame.shift(…) method to shift the original DataFrame by one period so that we can compute the price change
delta_close_prices = (close_prices.shift(-1) - close_prices).fillna(0)
delta_close_prices.columns = ['Delta_Close_A', 'Delta_Close_B', 'Delta_Close_C']
delta_close_prices
Inspect the summary statistics for this new DataFrame to get a sense of how the delta price values are distributed
pd.set_option('display.max_rows', None)
delta_close_prices.describe()
Histogram plot
Observe the distribution of Delta_Close_A
delta_close_prices['Delta_Close_A'].plot(kind='hist', bins=100, figsize=(12,6), color='black', grid=True)
Box plot
Draw a box plot, which also helps in assessing the values’ distribution
delta_close_prices['Delta_Close_B'].plot(kind='box', figsize=(12,6), color='black', grid=True)
Correlation charts
The first step in multivariate data statistics is to assess the correlations between Delta_Close_A, Delta_Close_B, and Delta_Close_C.
Draw a scatter matrix of the delta close prices, demonstrating the option of using kernel density estimation (KDE) plots on the diagonal
pd.plotting.scatter_matrix(delta_close_prices, figsize=(10,10), color='black', alpha=0.75, diagonal='kde', grid=True)
Look at some statistics that quantify the relationship between the variables
delta_close_prices.corr()
Pairwise correlation heatmap
plt.figure(figsize=(6,6))
sn.heatmap(delta_close_prices.corr(), annot=True, square=True, linewidths=2)
Revelation of the identity of A, B, and C and EDA’s conclusions
The A instrument is the Dow Jones Industrial Average (DJIA), a large-cap equity index exchange-traded fund (ETF). The B instrument is the S&P 500 (SPY), another large-cap equity index ETF. The C instrument is the Chicago Board Options Exchange (CBOE) Volatility Index (VIX), which tracks how volatile markets are at any given time (essentially, a function of equity index price swings).
From our EDA on the mystery instruments, we drew the following conclusions:
- C (VIX) cannot have negative prices or prices above 90, which has historically held true.
- A (DJIA) and B (SPY) had huge drops in 2008 and 2020, corresponding to the stock market crash and the COVID-19 pandemic, respectively. Also, the price of C (VIX) spiked at the same time, indicating heightened market turmoil.
- A (DJIA) has the largest daily price swings, followed by B (SPY), and finally C (VIX), which has very low daily price swings. These observations are also consistent with the underlying instruments that the labels were hiding.
- A (DJIA) and B (SPY) have very strong positive correlations, which makes sense since both are large-cap equity indices. C (VIX) has strong negative correlations with both A (DJIA) and B (SPY), which also makes sense: during periods of prosperity, volatility remains low and markets rise, while during periods of crisis, volatility spikes and markets drop.
Special Python libraries for EDA
import dtale

# Launch an interactive D-Tale session for the DataFrame in the browser
dtale.show(valid_close_df)
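D-Tale is not part of the standard scientific Python stack; assuming the usual PyPI package name, it can be installed with:
pip install dtale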