Exploratory data analysis (EDA) is the process of procuring, understanding, and deriving meaningful statistical insights from structured or unstructured data of interest. Its objective is to load the data into structures suitable for further analysis, to identify and rectify any wrong/bad data, and to gain basic insight into the data — the types of fields there are; whether they are categorical or not; how many missing values there are; how the fields are related; and so on.
Steps in EDA
- Loading the necessary libraries and setting them up
- Data collection
- Data wrangling/munging
- Data cleaning
- Obtaining descriptive statistics
- Visual inspection of the data
- Further data cleaning (outlier removal)
- Advanced visualisation techniques
Loading the necessary libraries and setting them up
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
Data collection
Data sources: comma-separated values (CSV) files, Excel files, web scrapes, binary files, and so on.
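As a minimal sketch of the loading step, the snippet below reads CSV data with pandas; an in-memory string stands in for a real file on disk, and the column names are hypothetical:

```python
import io
import pandas as pd

# In-memory CSV standing in for a file on disk; with a real file you would
# pass its path to pd.read_csv instead (pd.read_excel works analogously).
csv_text = "Date,Open_A,Close_A\n2021-01-04,10.0,10.5\n2021-01-05,10.5,10.2\n"
df = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)
```

Passing `parse_dates=True` converts the index to datetimes, which makes later time-based slicing and plotting straightforward.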
Data wrangling/munging
Use the pandas.DataFrame.join(…) method to merge DataFrames on their index; its lsuffix= and rsuffix= parameters disambiguate overlapping column names coming from the left and right DataFrames, respectively.
For example:
merged_df = A.join(B, how='outer', lsuffix='_A', sort=True).join(C, how='outer', lsuffix='_B', rsuffix='_C', sort=True)
Data cleaning
Data cleaning refers to the process of addressing data errors coming from missing data, incorrect data values, and outliers.
Check if there are any rows where all values are missing (NaN),
merged_df[merged_df.isnull().all(axis=1)]
Find out how many rows exist that have at least one field that is missing/NaN,
merged_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
Need to have valid Close prices, and drop all rows where the Close price for any of the three instruments is missing,
valid_close_df = merged_df.dropna(subset=['Close_A', 'Close_B', 'Close_C'], how='any')
After dropping the missing Close prices, check no more missing Close price fields,
valid_close_df[['Close_A', 'Close_B', 'Close_C']].isnull().any(axis=1).sum()
Deal with rows that have NaN values for any of the other fields, starting with getting a sense of how many such rows exist,
valid_close_df.isnull().any(axis=1).sum()
Inspect a few of the rows with at least some fields with a missing value,
valid_close_df[valid_close_df.isnull().any(axis=1)]
Use the pandas.DataFrame.fillna(…) method with method='backfill' (backward fill), which fills each missing value with the next valid value that follows it,
valid_close_complete = valid_close_df.fillna(method='backfill')
Check the result,
valid_close_complete.isnull().any(axis=1).sum()
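The cleaning steps above can be sketched end to end on a small synthetic DataFrame (the column names mirror the merged data; `.bfill()` is the modern spelling of `fillna(method='backfill')`):

```python
import numpy as np
import pandas as pd

# Synthetic merged_df with deliberately missing values, standing in for the
# joined price data.
merged_df = pd.DataFrame({
    'Close_A': [10.0, np.nan, 10.2, 10.4],
    'Close_B': [20.0, 20.5, np.nan, 20.4],
    'Close_C': [30.0, 30.5, 30.2, 30.4],
    'Volume_A': [100.0, 110.0, np.nan, 120.0],
})

# Drop rows where any Close price is missing...
valid_close_df = merged_df.dropna(
    subset=['Close_A', 'Close_B', 'Close_C'], how='any')

# ...then backfill any remaining missing fields with the next valid value.
valid_close_complete = valid_close_df.bfill()
print(valid_close_complete.isnull().any(axis=1).sum())  # 0
```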
Obtaining descriptive statistics
Generate key summary statistics on the data to build familiarity with each field, using the DataFrame.describe(…) method,
pd.set_option('display.max_rows', None)
valid_close_complete.describe()
Visual inspection of the data
Plot the prices of A
valid_close_complete['Open_A'].plot(figsize=(12, 6), linestyle='--', color='black', legend=True)
valid_close_complete['Close_A'].plot(figsize=(12, 6), linestyle='-', color='grey', legend=True)
valid_close_complete['Low_A'].plot(figsize=(12, 6), linestyle=':', color='black', legend=True)
valid_close_complete['High_A'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend=True)
plt.show()
Plot the prices of C
valid_close_complete['Open_C'].plot(figsize=(12, 6), linestyle='--', color='black', legend=True)
valid_close_complete['Close_C'].plot(figsize=(12, 6), linestyle='-', color='grey', legend=True)
valid_close_complete['Low_C'].plot(figsize=(12, 6), linestyle=':', color='black', legend=True)
valid_close_complete['High_C'].plot(figsize=(12, 6), linestyle='-.', color='grey', legend=True)
plt.show()
Further data cleaning to eliminate outlier values relies on two common techniques: the interquartile range (IQR) and the Z-score.
The IQR method uses a percentile/quantile range of values over the entire dataset to identify and remove outliers.
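A single-column sketch of the IQR rule, keeping only values inside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] (the series values are hypothetical, with 55.0 as a deliberate outlier):

```python
import pandas as pd

# Toy price series; 55.0 is a deliberate outlier.
prices = pd.Series([10.0, 10.2, 10.1, 10.3, 10.2, 55.0])

# Compute the first and third quartiles and keep values within
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
mask = (prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)
filtered = prices[mask]
```

The multiplier 1.5 is the conventional choice; widening it (e.g. to 3) removes only the most extreme points.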
The Z-score (or standard score) is obtained by subtracting the mean of the dataset from each data point and normalising the result by dividing by the standard deviation of the dataset.
Use scipy.stats.zscore(…) to compute the Z-scores of each column in the price_only DataFrame, and then use numpy.abs(…) to get the magnitudes of the Z-scores.
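A minimal sketch of the Z-score filter, assuming a hypothetical price_only DataFrame standing in for the price columns of the cleaned data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical price_only DataFrame; 99.0 is a deliberate outlier.
price_only = pd.DataFrame({
    'Close_A': [10.0] * 5 + [10.2] * 5 + [99.0],
    'Close_B': [20.0] * 6 + [20.2] * 5,
})

# Z-score each column, take the magnitudes, and keep only rows where every
# column's |Z| is below 3 (a common cutoff).
z = np.abs(stats.zscore(price_only))
no_outliers = price_only[(z < 3).all(axis=1)]
```

Note that with very small samples the largest attainable |Z| is bounded, so the cutoff of 3 only triggers once the dataset is reasonably large.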