The Data Set
Suppose that we have a file, AAPL.csv, in a data folder.
fn = '../data/AAPL.csv'
with open(fn, 'r') as f:
    for _ in range(5):
        print(f.readline(), end='')
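Since AAPL.csv itself is not shown, here is a minimal sketch that creates a small CSV file with assumed column names and sample values and prints its first lines the same way:

```python
import csv
import os
import tempfile

# Hypothetical stand-in for AAPL.csv: column names and values are
# made up for illustration only.
rows = [
    ['Date', 'HIGH', 'CLOSE', 'LOW', 'OPEN', 'VOLUME'],
    ['2020-04-01', '248.72', '240.91', '239.13', '246.50', '44054638'],
    ['2020-04-02', '245.15', '244.93', '236.90', '240.34', '41483493'],
]

fn = os.path.join(tempfile.mkdtemp(), 'AAPL.csv')
with open(fn, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# print the first lines of the file, as above
with open(fn, 'r') as f:
    for _ in range(2):
        print(f.readline(), end='')
```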
Reading from a CSV File with Python
import csv
csv_reader = csv.reader(open(fn, 'r'))
data = list(csv_reader)
Using a csv.DictReader iterator object instead of the standard csv.reader object.
csv_reader = csv.DictReader(open(fn, 'r'))
data = list(csv_reader)
Calculate the mean of the Apple stock closing prices
sum([float(l['CLOSE']) for l in data]) / len(data)
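The same mean calculation can be sketched with an in-memory CSV instead of the (assumed) AAPL.csv file; the CLOSE column name and values are made up:

```python
import csv
import io

# Sketch: mean of a CLOSE column via csv.DictReader on synthetic data.
raw = 'Date,CLOSE\n2020-04-01,240.91\n2020-04-02,244.93\n'
data = list(csv.DictReader(io.StringIO(raw)))
mean = sum(float(l['CLOSE']) for l in data) / len(data)
print(mean)  # arithmetic mean of the two closing values
```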
Reading from a CSV File with pandas
Use pandas to work with the Apple stock price data set.
import pandas as pd
data = pd.read_csv(fn, index_col=0, parse_dates=True)
data.info()
data.tail()
Calculate the mean of the Apple stock closing values
data['CLOSE'].mean()
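A minimal sketch of the same calculation on a small synthetic frame (the CLOSE column name is carried over from above, the values are made up):

```python
import pandas as pd

# Synthetic stand-in for the Apple closing prices.
data = pd.DataFrame(
    {'CLOSE': [240.91, 244.93, 241.41]},
    index=pd.to_datetime(['2020-04-01', '2020-04-02', '2020-04-03']),
)
print(data['CLOSE'].mean())  # mean over the three closing values
```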
Exporting to Excel and JSON
pandas allows one to export data in the form of Excel spreadsheet files as well as JSON files.
data.to_excel('data/aapl.xls', 'AAPL')
data.to_json('data/aapl.json')
ls -n data/
Reading from Excel and JSON
data_copy_1 = pd.read_excel('data/aapl.xls', 'AAPL', index_col=0)
data_copy_1.head()
data_copy_2 = pd.read_json('data/aapl.json')
data_copy_2.head()
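The JSON round trip can be sketched on a small synthetic frame; the Excel round trip works the same way but requires an Excel writer/reader package (e.g. openpyxl) to be installed:

```python
import os
import tempfile

import pandas as pd

# Synthetic frame (made-up values) written to and read back from JSON.
data = pd.DataFrame({'CLOSE': [240.91, 244.93]},
                    index=pd.to_datetime(['2020-04-01', '2020-04-02']))
path = os.path.join(tempfile.mkdtemp(), 'aapl.json')
data.to_json(path)
data_copy = pd.read_json(path)
print(data_copy)
```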
!rm data/*
Working with Open Data Sources
Obtain a key by signing up for a free Quandl account.
Data requests requiring an API key expect the key to be provided as the parameter api_key.
import configparser
config = configparser.ConfigParser()
config.read('../pyalgo.cfg')
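The pyalgo.cfg file itself is not shown; a hypothetical example of its structure (section and option names assumed, values placeholders) that configparser can read:

```python
import configparser

# Assumed structure of a pyalgo.cfg-style file; the key values
# below are placeholders, not real credentials.
cfg = '''
[quandl]
api_key = YOUR_API_KEY

[eikon]
app_key = YOUR_APP_KEY
'''
config = configparser.ConfigParser()
config.read_string(cfg)
print(config['quandl']['api_key'])
```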
import quandl as q
data = q.get('BCHAIN/MKPRU', api_key=config['quandl']['api_key'])
data.info()
data['Value'].resample('A').last()
Quandl also provides diverse data sets for single stocks, like end-of-day stock prices, stock fundamentals, or data sets related to options traded on a certain stock.
data = q.get('FSE/SAP_X', start_date='2018-1-1', end_date='2020-05-01', api_key=config['quandl']['api_key'])
data.info()
The API key can also be configured permanently with the Python wrapper.
q.ApiConfig.api_key = 'YOUR_API_KEY'
The Quandl platform also offers premium data sets for which a subscription or fee is required.
q.ApiConfig.api_key = config['quandl']['api_key']
vol = q.get('VOL/MSFT')
vol.iloc[:, :10].info()
vol[['IvMean30', 'IvMean60', 'IvMean90']].tail()
Eikon Data API
Refinitiv is one of the biggest financial data and news providers in the world. Its current desktop flagship product is Eikon, which is the equivalent to the Terminal by Bloomberg, the major competitor in the data services field.
Refinitiv has streamlined its API landscape and has released a Python wrapper package, called eikon, for the Eikon Data API, which is installed via
pip install eikon
Get an Eikon app_key via the App Key Generator (APPKEY) application in either Eikon or Workspace
import eikon as ek
ek.set_app_key(config['eikon']['app_key'])
help(ek)
Retrieving Historical Structured Data
symbols = ['AAPL.O', 'MSFT.O', 'GOOG.O']
data = ek.get_timeseries(symbols, start_date='2020-01-01', end_date='2020-05-01', interval='daily', fields=['*'])
data.keys()
type(data['AAPL.O'])
data['AAPL.O'].info()
data['AAPL.O'].tail()
%%time
data = ek.get_timeseries(symbols, start_date='2020-08-14', end_date='2020-08-15', interval='minute', fields='*')
print(data['GOOG.O'].loc['2020-08-14 16:00:00':'2020-08-14 16:00:00'])
for sym in symbols:
print('\n' + sym + '\n', data[sym].iloc[-300:-295])
Compare the performance of the following code, which retrieves tick data.
%%time
data = ek.get_timeseries(symbols[0], start_date='2020-08-14 15:00:00', end_date='2020-08-14 15:30:00', interval='tick', fields='*')
data.info()
data.head()
resampled = data.resample('30s', label='right').agg({'VALUE': 'last', 'VOLUME': 'sum'})
resampled.tail()
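The same last/sum aggregation can be sketched on a few synthetic ticks (column names VALUE and VOLUME as in the Eikon tick data, values made up):

```python
import pandas as pd

# Synthetic ticks within a one-minute window.
idx = pd.to_datetime([
    '2020-08-14 15:00:05', '2020-08-14 15:00:25',
    '2020-08-14 15:00:35', '2020-08-14 15:00:55',
])
ticks = pd.DataFrame({'VALUE': [100.0, 100.5, 101.0, 100.8],
                      'VOLUME': [10, 20, 30, 40]}, index=idx)

# 30-second bars: last price per bar, summed volume per bar.
resampled = ticks.resample('30s', label='right').agg(
    {'VALUE': 'last', 'VOLUME': 'sum'})
print(resampled)
```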
Retrieving Historical Unstructured Data
The code that follows retrieves news headlines for a fixed time interval that includes Apple Inc. as a company and “Macbook” as a word.
headlines = ek.get_news_headlines(query='R:AAPL.0 macbook', count=5, date_from='2020-4-1', date_to='2020-5-1')
story = headlines.iloc[0]
news_text = ek.get_news_story(story['storyId'])
from IPython.display import HTML
HTML(news_text)
Storing Financial Data Efficiently
The helper function generate_sample_data() generates a sample financial data set in tabular form of arbitrary size.
from sample_data import generate_sample_data
print(generate_sample_data(rows=5, cols=4))
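The implementation of generate_sample_data() is not reproduced here; the following is a hypothetical sketch of such a helper based on NumPy pseudo-random numbers (column names, start date, and value range are assumptions and may differ from the real sample_data.py):

```python
import numpy as np
import pandas as pd

def generate_sample_data(rows, cols, freq='1min'):
    '''Hypothetical sketch: DataFrame of pseudo-random walk columns
    No0, No1, ... on a regular time grid, shifted to positive values.'''
    rows = int(rows)
    index = pd.date_range('2021-1-1', periods=rows, freq=freq)
    dr = np.random.standard_normal((rows, cols)).cumsum(axis=0)
    data = 100 + dr - dr.min(axis=0)  # shift so each column starts >= 100
    return pd.DataFrame(data, index=index,
                        columns=['No%d' % i for i in range(cols)])

print(generate_sample_data(rows=5, cols=4).round(4))
```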
Storing DataFrame Objects
The storage of a pandas DataFrame object as a whole is made simple by the pandas HDFStore wrapper functionality for the HDF5 binary storage standard. It allows one to dump complete DataFrame objects in a single step to a file-based database object.
%time data = generate_sample_data(rows=5e6, cols=10).round(4)
data.info()
Open an HDFStore object on disk and write the DataFrame object to it.
h5 = pd.HDFStore('data/data.h5', 'w')
%time h5['data'] = data
ls -n data/data.*
h5.close()
Read the data from the file-based HDFStore object.
h5 = pd.HDFStore('data/data.h5', 'r')
%time data_copy = h5['data']
data_copy.info()
h5.close()
rm data/data.h5
There is another, more flexible way of writing the data from a DataFrame object to an HDFStore object.
%time data.to_hdf('data/data.h5', 'data', format='table')
ls -n data/data.*
%time data_copy = pd.read_hdf('data/data.h5', 'data')
data_copy.info()
The advantage of this approach is that one can work with the table object on disk like with any other table object of the PyTables package, which pandas uses under the hood.
import tables as tb
h5 = tb.open_file('data/data.h5', 'r')
h5.root.data.table[:3]
h5.close()
rm data/data.h5
Using TsTables
%%time
data = generate_sample_data(rows=2.5e6, cols=5, freq='1s').round(4)
data.info()
Some more imports and the creation of the TsTables table object.
import tstables
import tables as tb
class desc(tb.IsDescription):
    '''Description of TsTables table structure.'''
    timestamp = tb.Int64Col(pos=0)
    No0 = tb.Float64Col(pos=1)
    No1 = tb.Float64Col(pos=2)
    No2 = tb.Float64Col(pos=3)
    No3 = tb.Float64Col(pos=4)
h5 = tb.open_file('data/data.h5ts', 'w')
ts = h5.create_ts('/', 'data', desc)
Write the sample data stored in a DataFrame object to the table object on disk
%time ts.append(data)
Read subsets of the data from a TsTables table object. TsTables returns the data already as a DataFrame object.
import datetime
start = datetime.datetime(2021, 1, 2)
end = datetime.datetime(2021, 1, 3)
%time subset = ts.read_range(start, end)
start = datetime.datetime(2021, 1, 2, 12, 30, 0)
end = datetime.datetime(2021, 1, 5, 17, 15, 30)
%time subset = ts.read_range(start, end)
subset.info()
h5.close()
rm data/*
Storing Data with SQLite3
The DataFrame class provides the method to_sql() to write data to a table in a relational database.
%time data = generate_sample_data(1e6, 5, '1min').round(4)
data.info()
import sqlite3 as sq3
con = sq3.connect('data/data.sql')
%time data.to_sql('data', con)
ls -n data/data.*
The size on disk of 100+ MB indicates that there is quite some overhead when using relational databases.
Consider a query that selects all those rows where the value in column No1 is greater than 105 and the value in column No2 is less than 108.
query = 'SELECT * FROM data WHERE No1 > 105 and No2 < 108'
%time res = con.execute(query).fetchall()
res[:5]
len(res)
con.close()
rm data/*
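The same SELECT pattern can be sketched against an in-memory SQLite database with a few hand-made rows (table and column names as above, sample values made up):

```python
import sqlite3 as sq3

# In-memory database with a minimal 'data' table.
con = sq3.connect(':memory:')
con.execute('CREATE TABLE data (No1 real, No2 real)')
con.executemany('INSERT INTO data VALUES (?, ?)',
                [(104.5, 107.0), (105.5, 107.5), (106.0, 109.0)])

# Only the middle row satisfies both conditions.
query = 'SELECT * FROM data WHERE No1 > 105 and No2 < 108'
res = con.execute(query).fetchall()
print(res)
con.close()
```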









