The name pandas is derived from panel data, an econometrics term for a multidimensional structured dataset.
Introducing pandas Series, pandas DataFrames, and pandas Indexes
pandas.Series
The pandas.Series data structure represents a one-dimensional series of homogeneous values (integer values, string values, double values, and so on). A Series is like a single list of values paired with an index. A DataFrame, on the other hand, is a collection of one or more Series.
import pandas as pd
ser1 = pd.Series(range(1,6));
ser2 = pd.Series(range(1,6), index=['a', 'b', 'c', 'd', 'e']);
Create a series by specifying the index->value mapping via a dictionary
ser3 = pd.Series({'a':1.0, 'b':2.0, 'c':3.0, 'd':4.0, 'e':5.0});
Access the index
ser3.index
Access the values
ser3.values
Assign the series a name
ser3.name = 'Alphanumeric';
pandas.DataFrame
The majority of statistical time series operations are performed on DataFrames. pandas.DataFrame is optimised for vectorised processing of whole tables of data, which is much faster than performing the same processing on each Series separately.
df1 = pd.DataFrame({'A': range(1,5,1), 'B': range(10,50,10), 'C': range(100, 500, 100)});
Pass the index= parameter to label the indices
df2 = pd.DataFrame({'A': range(1,5,1), 'B': range(10,50,10), 'C': range(100, 500, 100)}, index=['a', 'b', 'c', 'd']);
Return the names of the columns
df2.columns
Access the indices
df2.index
Return the values. The result is a 2D array
df2.values
Add a new column to the DataFrame
df2['D'] = range(1000, 5000, 1000);
Assign names to the DataFrame’s index and columns
df2.index.name = 'lowercase';
df2.columns.name = 'uppercase';
pandas.Index
Both the pandas.Series and pandas.DataFrame data structures utilise the pandas.Index data structure.
Special types of Index objects (a brief construction sketch follows this list):
- Int64Index: contains integer index values.
- MultiIndex: contains indices that are tuples of values, used in hierarchical indexing.
- DatetimeIndex: contains datetime index values for time series datasets; we have seen it before.
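As a brief illustration (a minimal sketch; the labels and dates are arbitrary), a MultiIndex and a DatetimeIndex can be constructed directly:
multi_ind = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)]);
dt_ind = pd.date_range('2021-01-01', periods=3, freq='D');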
Create a pandas.Index object
ind2 = pd.Index(list(range(5)));
Index objects are immutable and thus cannot be modified in place.
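For example (a minimal sketch using the ind2 object created above), attempting item assignment raises a TypeError:
ind2[0] = 10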
Learning essential pandas.DataFrame operations
Indexing, selection, and filtering of DataFrames
pandas data structures are indexed by special Index objects, while numpy.ndarrays and Python list objects are only indexable by integers.
Select the Series of values in column B
df2['B']
Select multiple columns by passing a list of column names
df2[['A', 'C']]
Use Boolean selection with DataFrames
df2[(df2['D'] > 1000) & (df2['D'] <= 3000)]
Index rows
df2.loc[['c', 'd']]
Use standard integer indexing
df2.iloc[[0]]
Modify the DataFrame with an operation
df2[df2['D'] == 2000] = 0;
Dropping rows and columns from a DataFrame
Drop the row at index b
df2.drop('b')
The original DataFrame was not changed. To modify the original DataFrame, use the inplace= parameter
df2.drop('b', inplace=True)
Drop multiple rows
df2.drop(['a','d'])
Drop columns
df2.drop(['A', 'B'], axis=1)
Sorting values and ranking the values’ order within a DataFrame
Create a DataFrame with integer row indices, integer column names, and random values
import numpy as np
df = pd.DataFrame(np.random.randn(5,5), index=np.random.randint(0, 100, size=5), columns=np.random.randint(0, 100, size=5));
Sorts the DataFrame by index values
df.sort_index()
Sort by column name values by specifying the axis parameter
df.sort_index(axis=1)
Sort the values in the DataFrame. The sort_values(…) method takes a by= parameter specifying which column(s) to sort by
df.sort_values(by=df.columns[0])
Yield a DataFrame containing the rank/order of values in each column
df.rank()
Arithmetic operations on DataFrames
Create two DataFrames
df1 = pd.DataFrame(np.random.randn(3,2), index=['A','C','E'], columns=['colA','colB']);
df2 = pd.DataFrame(np.random.randn(4,3), index=['A','B','C','D'], columns=['colA','colB','colC']);
Add the two DataFrames together. They have different index values as well as different columns
df1 + df2
Use the pandas.DataFrame.add(…) method with the fill_value= parameter to specify a value to be used instead of NaN
df1.add(df2, fill_value=0)
Perform arithmetic operations between a DataFrame and a single column (selected here as a one-column DataFrame)
df1 - df2[['colB']]
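Note that df2[['colB']] selects a one-column DataFrame rather than a Series. For arithmetic between a DataFrame and a Series aligned along the row index, a minimal sketch uses the axis= parameter of pandas.DataFrame.sub(…):
df1.sub(df2['colB'], axis=0)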
Merging and combining multiple DataFrames into a single DataFrame
Assign names to the index and columns of the two DataFrames
df1.index.name = 'Index';
df1.columns.name = 'Columns';
df2.index.name = 'Index';
df2.columns.name = 'Columns';
Joins/merges two DataFrames. The left_index= and right_index= parameters indicate that the merge should be performed on Index values in both DataFrames
pd.merge(df1, df2, left_index=True, right_index=True)
Specify custom suffixes with the suffixes= parameter
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'))
Specify the behaviour of the join (outer, inner, left, or right join) using the how= parameter
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
or
df1.merge(df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
or
df1.join(df2, lsuffix='_1', rsuffix='_2', how='outer')
Concatenating rows together
pd.concat([df1, df2])
Concatenate across columns by specifying the axis= parameter
pd.concat([df1, df2], axis=1)
Hierarchical indexing
Hierarchical indexing uses MultiIndex objects, which are tuples of multiple values per Index.
Create sub-DataFrames inside a single DataFrame.
df = pd.DataFrame(np.random.randn(10,2), index=[list('aaabbbccdd'), [1,2,3,1,2,3,1,2,1,2]], columns=['A','B']);
Assign names to the MultiIndex object
df.index.names = ['alpha', 'numeric'];
Remove all indexing levels
df.reset_index()
This yields an integer-indexed DataFrame in which the MultiIndex values have been added as columns.
The pandas.DataFrame.unstack(…) method has similar behaviour: it pivots the inner level of indexing and converts it to columns.
df.unstack()
The opposite of unstack
df.stack()
Examine the structure of the MultiIndex DataFrame
df.stack().index
Grouping operations in DataFrames
Grouping operations in pandas generally follow the split-apply-combine process of operations:
- First, the data is split into groups based on one or more keys.
- Then we apply the necessary functions to these groups to compute the desired results.
- Finally, we combine them to build the transformed dataset.
Remove all hierarchical indexing from the previous df DataFrame
df = df.reset_index();
Group the A and B columns by the alpha column
grouped = df[['A','B']].groupby(df['alpha']);
Collect summary descriptive statistics
grouped.describe()
Apply different functions to each group of the grouped object
grouped.apply(pd.DataFrame.unstack)
The agg(…) method accepts functions and aggregates each column for each group
grouped[['A','B']].agg('mean')
The similar transform(…) method works on one column at a time and returns a sequence of values of the same length as the series
from scipy import stats
grouped[['A','B']].transform(stats.zscore)
Transforming values in DataFrames’ axis indices
Rename the Index labels
df2.index = ['Alpha', 'Beta', 'Gamma', 'Delta'];
The map function takes the first three characters of each index label and sets that as the new label
df2.index = df2.index.map(lambda x : x[:3]);
The pandas.DataFrame.rename(…) method transforms both index labels and column names; it accepts dictionaries mapping from the old name to the new name
df2.rename(index={'Alp':0, 'Bet':1, 'Gam':2, 'Del':3}, columns={'colA':'A', 'colB':'B', 'colC':'C'})
Handling missing data in DataFrames
Filtering out missing data
Set some values to NaN
for row, col in [('Bet','colA'),('Bet','colB'),('Bet','colC'),('Del','colB'),('Gam','colC')]:
    df2.at[row,col] = np.nan
Detect missing values
df2.isnull()
Detect non-missing values
df2.notnull()
Drop rows with missing values. The additional how= parameter controls which rows get dropped
df2.dropna(how='all')
df2.dropna(how='any')
Filling in missing data
Fill in the missing values, either with a value of our choice or by propagating other valid values in the same column into the missing positions
df2.fillna(method='backfill', inplace=True)
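Alternatively, fill with a value of our choice; a minimal sketch replacing any remaining missing values with 0:
df2.fillna(0)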
The transformation of DataFrames with functions and mappings
pandas DataFrame values can also be modified by passing functions and dictionary mappings that operate on one or more data values and generate new transformed values.
Adding a new column
df2['Category'] = ['HIGH', 'LOW', 'LOW', 'HIGH'];
The pandas.Series.map(…) method accepts a dictionary containing a mapping from the old value to the new value and transforms the values.
df2['Category'] = df2['Category'].map({'HIGH': 'H', 'LOW': 'L'});
The pandas.DataFrame.applymap(…) method allows us to apply functions to data values in a DataFrame.
The following code applies the numpy.exp(…) method, which calculates the exponential:
df2.drop('Category', axis=1).applymap(np.exp)
Discretisation/bucketing of DataFrame values
The simplest way to achieve discretisation is to create ranges of values and assign a single discrete label to all values that fall within a certain bucket.
Generate a random valued ndarray
arr = np.random.randn(10);
The pandas.cut(…) method can be used to discretise these values. The following code uses the bins= and labels=[…] arguments to bin the values into five discrete values with the labels provided:
cat = pd.cut(arr, bins=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']);
The pandas.qcut(…) method is similar, but uses quantiles to bin the continuous values into discrete values so that each category has the same number of observations.
qcat = pd.qcut(arr, q=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']);
Build a pandas DataFrame consisting of the original continuous values as well as the categories generated from cut and qcut
pd.DataFrame({'Value': arr, 'Category': cat, 'Quantile Category': qcat})
The bucket ranges
pd.cut(arr, bins=5).categories
Inspect the buckets for qcut
pd.qcut(arr, q=5).categories
Permuting and sampling DataFrame values to generate new DataFrames
Permuting available datasets to generate new datasets and sampling datasets to either sub-sample (reduce the number of observations) or super-sample (increase the number of observations) are common operations in statistical analysis.
Generate a DataFrame
df = pd.DataFrame(np.random.randn(10,5), index=np.sort(np.random.randint(0, 100, size=10)), columns=list('ABCDE'));
The numpy.random.permutation(…) method, when applied to a DataFrame, randomly shuffles along the Index axis and can be used to permute the rows in the dataset:
df.loc[np.random.permutation(df.index)]
Use the numpy.random.randint(…) method to generate random integers within a certain range, and then use the pandas.DataFrame.iloc[…] attribute to randomly sample with replacement (the same observation can be picked more than once) from our DataFrame
df.iloc[np.random.randint(0, len(df), size=5)]
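pandas also provides a built-in alternative for this; a minimal sketch using the pandas.DataFrame.sample(…) method with the replace= parameter to draw five rows with replacement:
df.sample(n=5, replace=True)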
Exploring file operations with pandas.DataFrames
pandas supports the persistence of DataFrames in both plain-text and binary formats. The common text formats are CSV and JSON files; the most used binary formats are Excel XLSX, HDF5, and pickle.
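As a brief sketch of the binary formats (assuming the optional openpyxl and PyTables dependencies are installed for Excel and HDF5 support; the file names are arbitrary):
df.to_pickle('df.pkl');
df.to_excel('df.xlsx');
df.to_hdf('df.h5', key='df');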
CSV files
Writing CSV files
df.to_csv('df.csv', sep=',', header=True, index=True)
Inspect the file written to disk
!head -n 4 df.csv
Reading CSV files
Use the index_col= parameter to specify which column to treat as the Index of the DataFrame, and the nrows= parameter to specify how many rows to read in
pd.read_csv('df.csv', sep=',', index_col=0, nrows=5)
Read in the specified number of lines at a time with the chunksize= parameter
pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2)
Force the generator to finish evaluation by wrapping it in a list and observe the entire DataFrame loaded in chunks of two lines
list(pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2))
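In practice, the chunks are usually consumed in a loop; a minimal sketch that processes each two-line chunk as it is read:
for chunk in pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2):
    print(len(chunk))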
JSON files
JSON files are based upon data structures that map naturally to Python dictionaries. This makes JSON files very convenient for many purposes, including representing DataFrames as well as representing configuration files.
The pandas.DataFrame.to_json(…) method conveniently writes a DataFrame to a JSON file on disk.
df.iloc[:4].to_json('df.json')
Check out the JSON file
!cat df.json
Read JSON files back to pandas DataFrames
pd.read_json('df.json')





