Four key statistical libraries in Python
- statsmodels
- pmdarima
- fbprophet
- scikit-learn
statsmodels is a Python library to explore data, perform statistical tests, and estimate statistical models
Normal distribution test with Q-Q plots
While there are many robust statistical tests for normal distributions, an intuitive visual method is known as a quantile-quantile plot (Q-Q plot). If a sample is normally distributed, its Q-Q plot is a straight line.
In the following code blocks, the statsmodels.graphics.api.qqplot(…) method is used to check whether samples drawn from uniform, exponential, and normal distributions appear normally distributed:
from statsmodels.graphics.api import qqplot
import numpy as np
fig = qqplot(np.random.uniform(size=10000), line='s')
fig.set_size_inches(12, 6)
fig = qqplot(np.random.exponential(size=10000), line='s')
fig.set_size_inches(12, 6)
fig = qqplot(np.random.normal(size=10000), line='s')
fig.set_size_inches(12, 6)
Time series modelling with statsmodels
A time series is a sequence of numerical data points in time order.
import statsmodels.api as sm
sm.tsa.datetools.dates_from_range('2020', length=12)
The frequency of dates in the dates_from_range(…) method can be specified by the start date and a special format, where the m1 suffix means first month and monthly frequency, and q1 means first quarter and quarterly frequency, as illustrated in the following code snippet:
sm.tsa.datetools.dates_from_range('2010m1', length=120)
Error, Trend, Seasonality (ETS) analysis of a time series
The ETS analysis of a time series breaks down the data into three different components, as follows:
- The trend component captures the overall trend of the time series.
- The seasonality component captures cyclical/seasonal changes.
- The error component captures noise in the data that could not be captured with the other two components.
Let’s generate 20 years of monthly dates as an index to the Pandas DataFrame dataset using the datetools.dates_from_range(…) method, as follows:
import pandas as pd
n_obs = 12 * 20
linear_trend = np.linspace(100, 200, num=n_obs)
cycle = np.sin(linear_trend) * 10
error_noise = np.random.randn(n_obs)
dataset = pd.DataFrame(linear_trend + cycle + error_noise, index=sm.tsa.datetools.dates_from_range('2000m1', length=n_obs), columns=['Price'])
import matplotlib.pyplot as plt
dataset.plot(figsize=(12, 6), color='black')
The Hodrick-Prescott filter
The Hodrick-Prescott (HP) filter is used to separate the trend and cyclical components from time series data by removing short-term fluctuations from the longer-term trend.
In statsmodels, this is implemented as statsmodels.api.tsa.filters.hpfilter(…).
hp_cycle, hp_trend = sm.tsa.filters.hpfilter(dataset['Price'], lamb=129600)
decomp = dataset[['Price']].copy()
decomp['HP_Cycle'] = hp_cycle
decomp['HP_Trend'] = hp_trend
UnobservedComponents model
uc = sm.tsa.UnobservedComponents(dataset['Price'], level='lltrend', cycle=True, stochastic_cycle=True)
res_uc = uc.fit(method='powell', disp=True)
res_uc.summary()
Access the ETS/cyclical components using the resid, cycle.smoothed, and level.smoothed attributes and add them to the decomp DataFrame
decomp['UC_Cycle'] = res_uc.cycle.smoothed
decomp['UC_Trend'] = res_uc.level.smoothed
decomp['UC_Error'] = res_uc.resid
statsmodels.tsa.seasonal.seasonal_decompose(…) method
The following code block uses an additive model by specifying the model='additive' parameter and adds SDC_Cycle, SDC_Trend, and SDC_Error columns to the decomp DataFrame by accessing the seasonal, trend, and resid attributes of the DecomposeResult object:
from statsmodels.tsa.seasonal import seasonal_decompose
s_dc = seasonal_decompose(dataset['Price'], model='additive')
decomp['SDC_Cycle'] = s_dc.seasonal
decomp['SDC_Trend'] = s_dc.trend
decomp['SDC_Error'] = s_dc.resid
Plotting the results of the HP filter, the UnobservedComponents model, and the seasonal_decompose method
plt.title('Trend components')
decomp['Price'].plot(figsize=(12, 6), color='black', linestyle='-', legend='Price')
decomp['HP_Trend'].plot(figsize=(12, 6), color='darkgray', linestyle="--", lw=2, legend='HP_Trend')
decomp['UC_Trend'].plot(figsize=(12, 6), color='black', linestyle=':', lw=2, legend='UC_Trend')
decomp['SDC_Trend'].plot(figsize=(12, 6), color='black', linestyle=':', lw=2, legend='SDC_Trend')
The following code block plots the cycle/seasonal components obtained from the three models:
plt.title('Cycle/Seasonal components')
decomp['HP_Cycle'].plot(figsize=(12, 6), color='darkgray', linestyle='--', lw=2, legend='HP_Cycle')
decomp['UC_Cycle'].plot(figsize=(12, 6), color='black', linestyle=':', lw=2, legend='UC_Cycle')
decomp['SDC_Cycle'].plot(figsize=(12, 6), color='black', linestyle='-.', lw=2, legend='SDC_Cycle')
Visualise the error terms in the UnobservedComponents and seasonal_decompose methods
plt.title('Error components')
plt.ylim((-20, 20))
decomp['UC_Error'].plot(figsize=(12, 6), color='black', linestyle=':', lw=2, legend='UC_Error')
decomp['SDC_Error'].plot(figsize=(12, 6), color='black', linestyle='-.', lw=2, legend='SDC_Error')
Augmented Dickey-Fuller test for stationarity of a time series
Stationary time series are time series whose statistical properties such as mean, variance, and autocorrelation are constant over time. Many statistical forecasting models assume that time series datasets can be transformed into stationary datasets by some mathematical operations, such as differencing.
An Augmented Dickey-Fuller (ADF) test is used to check whether a dataset is stationary. It tests the null hypothesis that the series is non-stationary (that it contains a unit root), and when the resulting probability (p-value) is very low, we can reject that hypothesis and conclude that the dataset is stationary.
Step 1 — ADF test on the prices
from statsmodels.tsa.stattools import adfuller
result = adfuller(dataset['Price'])
print('Test Stat: {}\np value: {}\nLags: {}\nNum observations: {}'.format(result[0], result[1], result[2], result[3]))
Step 2 — First differencing on prices
price_diff = (dataset['Price'].shift(-1) - dataset['Price']).fillna(0)
Step 3 — ADF test on the differenced prices
result = adfuller(price_diff)
print('Test Stat: {}\np value: {}\nLags: {}\nNum observations: {}'.format(result[0], result[1], result[2], result[3]))
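A minimal way to act on the ADF output, assuming the conventional 0.05 significance threshold (the cutoff itself is a choice, not part of the test), is shown in the following snippet:
# result[1] is the p-value for the null hypothesis that the series has a unit root (is non-stationary)
if result[1] < 0.05:
    print('p-value below 0.05: reject the unit-root null, treat the series as stationary')
else:
    print('p-value above 0.05: cannot reject the unit-root null, consider differencing again')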
Autocorrelation and partial autocorrelation of a time series
Autocorrelation, or serial correlation, is the correlation of an observation with a delayed copy of itself, expressed as a function of the delay. It measures whether the currently observed value has any relationship to values in the past or the future.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig = plot_acf(dataset['Price'], lags=100)
fig.set_size_inches(12, 6)
The statsmodels.graphics.tsaplots.plot_pacf(…) method lets us plot the partial autocorrelation values against different lag values. The difference between autocorrelation and partial autocorrelation is that partial autocorrelation uses only the direct correlation between an observation and the observation exactly that many lag periods earlier, with the correlation effects of the lower-lag terms removed.
fig = plot_pacf(dataset['Price'], lags=100)
fig.set_size_inches(12, 6)
ARIMA time series model
The Auto-Regressive Integrated Moving Average (ARIMA) model is one of the most well-known time series forecasting models. It is used to forecast time series whose data points are correlated over time.
The ARIMA model is composed of three components, outlined as follows:
- Auto-regression (AR): a regression of the series on its own lagged values, with order p
- Integrated (I): differencing of the series to make it stationary, with order d
- Moving Average (MA): a regression on the lagged forecast errors, with order q
from statsmodels.tsa.arima.model import ARIMA
arima = ARIMA(dataset['Price'], order=(36, 1, 2))
res_ar = arima.fit()
res_ar.summary()
Using the statsmodels.tsa.arima.ARIMAResults.predict(…) method, we can use the fitted model to predict values over the specified start and end datetime indices (in this case, the entire dataset).
dataset['PredPrice'] = res_ar.predict(dataset.index[0], dataset.index[-1])
Plot the original Price and the PredPrice fields in the following code block to visually compare the two
plt.ylim(70, 250)
dataset['Price'].plot(figsize=(12, 6), color='darkgray', linestyle='-', lw=4, legend='Price')
dataset['PredPrice'].plot(figsize=(12, 6), color='black', linestyle='-.', legend='PredPrice')
The predicted prices are quite accurate because the specified (p, d, q) parameters were chosen to fit this dataset well.
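As a quick, illustrative sanity check (not part of the original walkthrough), the in-sample fit can be quantified with the root-mean-squared error between the Price and PredPrice fields:
# in-sample RMSE between the observed and ARIMA-fitted prices
rmse = np.sqrt(((dataset['Price'] - dataset['PredPrice']) ** 2).mean())
print('In-sample RMSE: {:.2f}'.format(rmse))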
Use this fitted model to forecast values for dates out in the future. First, we build an extended_dataset DataFrame with another 4 years’ worth of datetime indices and no data (which will be filled in with NaN values) using the datetools.dates_from_range(…) method and the pandas.DataFrame.append(…) method,
extended_dataset = pd.DataFrame(index=sm.tsa.datetools.dates_from_range('2020m1', length=48))
extended_dataset = dataset.append(extended_dataset)
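Note that pandas.DataFrame.append(…) is deprecated and was removed in pandas 2.0; on recent pandas versions the equivalent construction uses pandas.concat(…), as in this sketch:
# equivalent to dataset.append(extended_dataset) on pandas >= 2.0
extended_dataset = pd.concat([dataset, extended_dataset])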
Call the ARIMAResults.predict(…) method again to generate predicted prices for the entire time series and thus forecast onto the new dates,
extended_dataset['PredPrice'] = res_ar.predict(extended_dataset.index[0], extended_dataset.index[-1])
Plot the last 100 observations from the extended_dataset DataFrame,
extended_dataset['Price'].iloc[-100:].plot(figsize=(12, 6), color='darkgray', linestyle='-', lw=4, legend='Price')
extended_dataset['PredPrice'].iloc[-100:].plot(figsize=(12, 6), color='black', linestyle='-.', legend='PredPrice')
Using a SARIMAX time series model with pmdarima
SARIMA is an extension of the ARIMA model for univariate time series with a seasonal component.
SARIMAX is the name of the extended model that also supports exogenous variables.
These are the three ARIMA parameters:
- p = trend auto-regressive order
- d = trend difference order
- q = trend MA order
In addition to the preceding parameters, SARIMA introduces four more,
- P = seasonal auto-regressive order
- D = seasonal difference order
- Q = seasonal MA order
- m = the length of a single seasonal period in the number of time steps
Finding these parameters manually can be time-consuming, and it may be advantageous to use an auto-ARIMA model.
The best model is determined by minimising the chosen information criterion: the Akaike information criterion (AIC), the corrected AIC, the Bayesian information criterion (BIC), the Hannan-Quinn information criterion (HQIC), or out-of-bag (OOB) validation scoring.
import pmdarima as pm
model = pm.auto_arima(dataset['Price'], seasonal=True, stepwise=True, m=12)
print(model.summary())
extended_dataset = pd.DataFrame(index=sm.tsa.datetools.dates_from_range('2020m1', length=48))
extended_dataset['PredPrice'], conf_int = model.predict(48, return_conf_int=True, alpha=0.05)
plt.plot(dataset['Price'], c='blue')
plt.plot(extended_dataset['PredPrice'], c='green')
plt.show()
print(extended_dataset)
print(conf_int)
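Because conf_int holds one (lower, upper) pair per forecast step, the 95% interval can also be shaded around the forecast; the following sketch reuses the variables defined above:
# shade the 95% confidence band around the pmdarima forecast
plt.figure(figsize=(12, 6))
plt.plot(dataset['Price'], c='blue', label='Price')
plt.plot(extended_dataset['PredPrice'], c='green', label='PredPrice')
plt.fill_between(extended_dataset.index, conf_int[:, 0], conf_int[:, 1], color='lightgray', label='95% confidence interval')
plt.legend()
plt.show()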
Time series forecasting with Facebook’s Prophet library
Facebook Prophet is a Python library used for forecasting univariate time series with strong support for seasonality and holiday effects. It is especially suitable for time series with frequent changes of trends and is robust enough to handle outliers.
More specifically, the Prophet model is an additive regression model with the following attributes:
- Piecewise linear or logistic growth trend
- Yearly seasonal component modelled with a Fourier series
- Weekly seasonal component modelled with dummy variables
- A user-provided list of holidays
from fbprophet import Prophet
prophet_dataset = dataset.rename(columns={'Price': 'y'}).rename_axis('ds').drop(columns='PredPrice').reset_index()
print(prophet_dataset)
model = Prophet()
model.fit(prophet_dataset)
df_forecast = model.make_future_dataframe(periods=48, freq='M')
df_forecast = model.predict(df_forecast)
print(df_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
model.plot(df_forecast, xlabel='Date', ylabel='Value')
model.plot_components(df_forecast)
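The holiday and logistic-growth options listed above are not exercised in this example; a minimal sketch of how they could be supplied (the holiday dates and the cap value below are purely illustrative) would look like this:
# purely illustrative: pass a user-provided holiday list and switch to logistic growth
holidays = pd.DataFrame({
    'holiday': 'new_year',
    'ds': pd.to_datetime(['2018-01-01', '2019-01-01', '2020-01-01']),
})
prophet_dataset['cap'] = 250  # logistic growth requires a carrying-capacity column
model_with_holidays = Prophet(growth='logistic', holidays=holidays)
model_with_holidays.fit(prophet_dataset)
# note: any future DataFrame passed to predict(...) would also need a 'cap' column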
Introduction to scikit-learn regression and classification
scikit-learn is a Python supervised and unsupervised machine learning library built on top of the numpy and scipy libraries.
Generating the dataset
We generate a pandas DataFrame containing 20 years of daily data with BookPressure, TradePressure, RelativeValue, and Microstructure fields representing synthetic trading signals (also known as features or predictors). The PriceChange field represents the daily change in price that we are trying to predict (also known as the response or target variable).
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range('2000', '2020'))
df['BookPressure'] = np.random.randn(len(df)) * 2
df['TradePressure'] = np.random.randn(len(df)) * 100
df['RelativeValue'] = np.random.randn(len(df)) * 50
df['Microstructure'] = np.random.randn(len(df)) * 10
true_coefficients = np.random.randint(low=-100, high=101, size=4) / 10
df['PriceChange'] = (df['BookPressure'] * true_coefficients[0]) + (df['TradePressure'] * true_coefficients[1]) + (df['RelativeValue'] * true_coefficients[2]) + (df['Microstructure'] * true_coefficients[3]) + (np.random.randn(len(df)) * 200)
df['Price'] = df['PriceChange'].cumsum(0) + 100000
Visually inspect the Price field
df['Price'].plot(figsize=(12, 6), color='black', legend='Price')
Display a scatter matrix of all columns but the Price column
pd.plotting.scatter_matrix(df.drop('Price', axis=1), color='black', alpha=0.2, grid=True, diagonal='kde', figsize=(10, 10))
Running RidgeCV regression on the dataset
Use a scikit-learn regression method to fit a linear regression model to our dataset.
First, collect the features and target into a DataFrame and a Series,
features = df[['BookPressure', 'TradePressure', 'RelativeValue', 'Microstructure']]
target = df['PriceChange']
Use sklearn.linear_model.RidgeCV, a linear regression model with L2 regularisation (an L2 norm penalty factor to avoid overfitting) that uses cross-validation to learn the optimal coefficients,
from sklearn.linear_model import RidgeCV
ridge = RidgeCV()
ridge.fit(features, target)
The result is a RidgeCV object,
RidgeCV(alphas=array([0.1, 1., 10.]), cv=None, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
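After fitting, the regularisation strength selected by cross-validation can be inspected via the alpha_ attribute; with the default grid shown above it will be one of 0.1, 1.0, or 10.0:
# regularisation strength chosen by cross-validation from the default alphas grid
print(ridge.alpha_)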
Access the weights/coefficients learned by the Ridge model using the RidgeCV.coef_ attribute and compare them with the actual coefficients,
true_coefficients, ridge.coef_
The RidgeCV.score(…) method returns the R2 score (the coefficient of determination), which represents how well the fitted model explains the variance of the target,
ridge.score(features, target)
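The same score can be reproduced by hand from the definition of the coefficient of determination, R2 = 1 - SS_res / SS_tot, which makes clear that it measures explained variance rather than classification-style accuracy:
# reproduce RidgeCV.score(...) from the definition of the coefficient of determination
residuals = target - ridge.predict(features)
ss_res = (residuals ** 2).sum()
ss_tot = ((target - target.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # matches ridge.score(features, target)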
The RidgeCV.predict(…) method outputs the predicted price change values, which we accumulate with a cumulative sum (cumsum(…)) to generate the predicted price series, and then save in the PredPrice field,
df['PredPrice'] = ridge.predict(features).cumsum(0) + 100000
The true Price field is plotted alongside the predicted PredPrice field,
df['Price'].plot(figsize=(12, 6), color='gray', linestyle='--', legend='Price')
df['PredPrice'].plot(figsize=(12, 6), color='black', linestyle='-.', legend='PredPrice')
Zoom in to the first quarter of 2010 to inspect the prediction errors,
df['Price'].loc['2010-01-01':'2010-03-31'].plot(figsize=(12, 6), color='darkgray', linestyle='-', legend='Price')
df['PredPrice'].loc['2010-01-01':'2010-03-31'].plot(figsize=(12, 6), color='black', linestyle='-.', legend='PredPrice')
Compute the prediction errors and plot them using a density plot,
df['Errors'] = df['Price'] - df['PredPrice']
df['Errors'].plot(figsize=(12, 6), kind='kde', color='black', legend='Errors')
Running a classification method on the dataset
First, create discrete categorical target labels for the classification model to predict.
We bin the PriceChange values into five categories, assign them the -2, -1, 0, 1, and 2 numeric labels respectively, and save the discrete target labels in the target_discrete pandas.Series object,
target_discrete = pd.cut(target, bins=5, labels=[-2, -1, 0, 1, 2]).astype(int)
Visualise the distribution of the five labels by using the following code,
target_discrete.plot(figsize=(12, 6), kind='hist', color='black')
For the classification, use an ensemble of decision tree classifiers provided by sklearn.ensemble.RandomForestClassifier.
Random forest is a classifier that uses the bagging ensemble method and builds a forest of decision trees by training each tree on datasets generated by random sampling with replacement from the original dataset.
Using a max_depth=5 parameter, limit the height of each tree to reduce overfitting and then call the RandomForestClassifier.fit(…) method to fit the model,
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=5)
rf.fit(features, target_discrete)
This builds the following fitted RandomForestClassifier model:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=5, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
The RandomForestClassifier.score(…) method returns the mean accuracy of the predictions compared to the true labels,
rf.score(features, target_discrete)
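The same number can be reproduced directly by comparing the predicted labels with the true ones, which makes explicit that this is simply the fraction of correctly classified rows (and, being measured in-sample, an optimistic estimate):
# fraction of rows whose predicted label equals the true label (in-sample accuracy)
print((rf.predict(features) == target_discrete).mean())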
Add the DiscretePriceChange and PredDiscretePriceChange fields to the DataFrame to hold the true labels and the predicted labels using the RandomForestClassifier.predict(…) method,
df['DiscretePriceChange'] = target_discrete
df['PredDiscretePriceChange'] = rf.predict(features)
Plot the two fields for the first quarter of 2010,
df['DiscretePriceChange'].loc['2010-01-01':'2010-03-31'].plot(figsize=(12, 6), color='darkgray', linestyle='-', legend='DiscretePriceChange')
df['PredDiscretePriceChange'].loc['2010-01-01':'2010-03-31'].plot(figsize=(12, 6), color='black', linestyle='-.', legend='PredDiscretePriceChange')
Compute and plot the distribution of the ClassificationErrors DataFrame with the following code,
df['ClassificationErrors'] = df['DiscretePriceChange'] - df['PredDiscretePriceChange']
df['ClassificationErrors'].plot(figsize=(12, 6), kind='kde', color='black', legend='ClassificationErrors')









