The name pandas is derived from panel data, an econometrics term for a multidimensional structured dataset.
Introducing pandas Series, pandas DataFrames, and pandas Indexes
pandas.Series
The pandas.Series data structure represents a one-dimensional series of homogeneous values (integer values, string values, double values, and so on). A Series is like a single list of values paired with an index. A DataFrame, on the other hand, is a collection of one or more Series.
import pandas as pd
ser1 = pd.Series(range(1,6));
ser2 = pd.Series(range(1,6), index=['a', 'b', 'c', 'd', 'e']);
Create a series by specifying the index->value mapping via a dictionary
ser3 = pd.Series({'a':1.0, 'b':2.0, 'c':3.0, 'd':4.0, 'e':5.0});
Access the index
ser3.index
Access the values
ser3.values
Assign the series a name
ser3.name = 'Alphanumeric';
pandas.DataFrame
The majority of statistical time series operations are performed on DataFrames. pandas.DataFrame is optimised for vectorised processing of whole tables of data, which is much faster than performing the same processing on each Series separately.
df1 = pd.DataFrame({'A': range(1,5,1), 'B': range(10,50,10), 'C': range(100, 500, 100)});
Pass the index= parameter to label the indices
df2 = pd.DataFrame({'A': range(1,5,1), 'B': range(10,50,10), 'C': range(100, 500, 100)}, index=['a', 'b', 'c', 'd']);
Return the names of the columns
df2.columns
Access the indices
df2.index
Return the values. The result is a 2D array
df2.values
Add a new column to the DataFrame
df2['D'] = range(1000, 5000, 1000);
Assign names to the DataFrame’s index and columns
df2.index.name = 'lowercase';
df2.columns.name = 'uppercase';
pandas.Index
Both the pandas.Series and pandas.DataFrame data structures utilise the pandas.Index data structure.
Special types of Index objects (a brief construction sketch follows this list):
- Int64Index: contains integer index values.
- MultiIndex: contains indices that are tuples of values, used in hierarchical indexing.
- DatetimeIndex: contains datetime index values for time series datasets; we have seen it before.
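As a brief illustration (a minimal sketch; the labels and dates are arbitrary), a MultiIndex and a DatetimeIndex can be constructed directly:
multi_ind = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)]);
dt_ind = pd.date_range('2021-01-01', periods=3, freq='D');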
Create a pandas.Index object
ind2 = pd.Index(list(range(5)));
Index objects are immutable and thus cannot be modified in place.
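For example (a minimal sketch using the ind2 object created above), attempting item assignment raises a TypeError:
ind2[0] = 10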
Learning essential pandas.DataFrame operations
Indexing, selection, and filtering of DataFrames
pandas data structures are indexed by special Index objects, while numpy.ndarrays and Python list objects are only indexable by integers.
Select the Series of values in column B
df2['B']
Select multiple columns by passing a list of column names
df2[['A', 'C']]
Use Boolean selection with DataFrames
df2[(df2['D'] > 1000) & (df2['D'] <= 3000)]
Index rows
df2.loc[['c', 'd']]
Use standard integer indexing
df2.iloc[[0]]
Modify the DataFrame with an operation
df2[df2['D'] == 2000] = 0;
Dropping rows and columns from a DataFrame
Drop the row at index b
df2.drop('b')
The original DataFrame was not changed. To modify the original DataFrame, use the inplace= parameter
df2.drop('b', inplace=True)
Drop multiple rows
df2.drop(['a','d'])
Drop columns
df2.drop(['A', 'B'], axis=1)
Sorting values and ranking the values’ order within a DataFrame
Create a DataFrame with integer row indices, integer column names, and random values
import numpy as np
df = pd.DataFrame(np.random.randn(5,5), index=np.random.randint(0, 100, size=5), columns=np.random.randint(0, 100, size=5));
Sorts the DataFrame by index values
df.sort_index()
Sort by column name values by specifying the axis parameter
df.sort_index(axis=1)
Sort the values in the DataFrame. The sort_values(…) method takes a by= parameter specifying which column(s) to sort by
df.sort_values(by=df.columns[0])
Yield a DataFrame containing the rank/order of values in each column
df.rank()
Arithmetic operations on DataFrames
Create two DataFrames
df1 = pd.DataFrame(np.random.randn(3,2), index=['A','C','E'], columns=['colA','colB']);
df2 = pd.DataFrame(np.random.randn(4,3), index=['A','B','C','D'], columns=['colA','colB','colC']);
Add the two DataFrames together. They have different index values as well as different columns
df1 + df2
Use the pandas.DataFrame.add(…) method with the fill_value= parameter to specify a value to be used instead of NaN
df1.add(df2, fill_value=0)
Perform arithmetic operations between a DataFrame and a single column (selected here as a one-column DataFrame)
df1 - df2[['colB']]
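Note that df2[['colB']] selects a one-column DataFrame rather than a Series. For arithmetic between a DataFrame and a Series aligned along the row index, a minimal sketch uses the axis= parameter of pandas.DataFrame.sub(…):
df1.sub(df2['colB'], axis=0)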
Merging and combining multiple DataFrames into a single DataFrame
Assign names to the index and columns of the two DataFrames
df1.index.name = 'Index';
df1.columns.name = 'Columns';
df2.index.name = 'Index';
df2.columns.name = 'Columns';
Joins/merges two DataFrames. The left_index= and right_index= parameters indicate that the merge should be performed on Index values in both DataFrames
pd.merge(df1, df2, left_index=True, right_index=True)
Specify custom suffixes with the suffixes= parameter
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'))
Specify the behaviour of the join (outer, inner, left, or right join) using the how= parameter
pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
or
df1.merge(df2, left_index=True, right_index=True, suffixes=('_1', '_2'), how='outer')
or
df1.join(df2, lsuffix='_1', rsuffix='_2', how='outer')
Concatenating rows together
pd.concat([df1, df2])
Concatenate across columns by specifying the axis= parameter
pd.concat([df1, df2], axis=1)
Hierarchical indexing
Hierarchical indexing uses MultiIndex objects, which are tuples of multiple values per Index.
Create sub-DataFrames inside a single DataFrame.
df = pd.DataFrame(np.random.randn(10,2), index=[list('aaabbbccdd'), [1,2,3,1,2,3,1,2,1,2]], columns=['A','B']);
Assign names to the MultiIndex object
df.index.names = ['alpha', 'numeric'];
Remove all indexing levels
df.reset_index()
This yields an integer-indexed DataFrame in which the MultiIndex values have been added as columns.
The pandas.DataFrame.unstack(…) method has similar behaviour: it pivots the inner level of indexing and converts it to columns.
df.unstack()
The opposite of unstack
df.stack()
Examine the structure of the MultiIndex DataFrame
df.stack().index
Grouping operations in DataFrames
Grouping operations in pandas generally follow the split-apply-combine process of operations:
- First, the data is split into groups based on one or more keys.
- Then we apply the necessary functions to these groups to compute the desired results.
- Finally, we combine them to build the transformed dataset.
Remove all hierarchical indexing from the previous df DataFrame
df = df.reset_index();
Group the A and B columns by the alpha column
grouped = df[['A','B']].groupby(df['alpha']);
Collect summary descriptive statistics
grouped.describe()
Apply different functions to each group of the grouped object
grouped.apply(pd.DataFrame.unstack)
The agg(…) method accepts functions and aggregates each column for each group
grouped[['A','B']].agg('mean')
The similar transform(…) method works on one column at a time and returns a sequence of values of the same length as the series
from scipy import stats
grouped[['A','B']].transform(stats.zscore)
Transforming values in DataFrames’ axis indices
Rename the Index labels
df2.index = ['Alpha', 'Beta', 'Gamma', 'Delta'];
The map function takes the first three characters of each index label and sets that as the new label
df2.index = df2.index.map(lambda x : x[:3]);
The pandas.DataFrame.rename(…) method transforms both index labels and column names; it accepts dictionaries mapping from the old name to the new name
df2.rename(index={'Alp':0, 'Bet':1, 'Gam':2, 'Del':3}, columns={'colA':'A', 'colB':'B', 'colC':'C'})
Handling missing data in DataFrames
Filtering out missing data
Set some values to NaN
for row, col in [('Bet','colA'),('Bet','colB'),('Bet','colC'),('Del','colB'),('Gam','colC')]:
    df2.at[row,col] = np.nan
Detect missing values
df2.isnull()
Detect non-missing values
df2.notnull()
Drop rows with missing values. The additional how= parameter controls which rows get dropped
df2.dropna(how='all')
df2.dropna(how='any')
Filling in missing data
Fill in the missing values, either with a value of our choice or by propagating other valid values in the same column into the missing positions
df2.fillna(method='backfill', inplace=True)
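Alternatively, fill with a value of our choice; a minimal sketch replacing any remaining missing values with 0:
df2.fillna(0)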
The transformation of DataFrames with functions and mappings
pandas DataFrame values can also be modified by passing functions and dictionary mappings that operate on one or more data values and generate new transformed values.
Adding a new column
df2['Category'] = ['HIGH', 'LOW', 'LOW', 'HIGH'];
The pandas.Series.map(…) method accepts a dictionary containing a mapping from the old value to the new value and transforms the values.
df2['Category'] = df2['Category'].map({'HIGH': 'H', 'LOW': 'L'});
The pandas.DataFrame.applymap(…) method allows us to apply functions to data values in a DataFrame.
The following code applies the numpy.exp(…) method, which calculates the exponential:
df2.drop('Category', axis=1).applymap(np.exp)
Discretisation/bucketing of DataFrame values
The simplest way to achieve discretisation is to create ranges of values and assign a single discrete label to all values that fall within a certain bucket.
Generate a random valued ndarray
arr = np.random.randn(10);
The pandas.cut(…) method can be used to discretise these values. The following code uses the bins= and labels=[…] arguments to bin the values into five discrete values with the labels provided:
cat = pd.cut(arr, bins=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']);
The pandas.qcut(…) method is similar, but uses quantiles to bin the continuous values into discrete values so that each category has the same number of observations.
qcat = pd.qcut(arr, q=5, labels=['Very Low', 'Low', 'Med', 'High', 'Very High']);
Build a pandas DataFrame consisting of the original continuous values as well as the categories generated from cut and qcut
pd.DataFrame({'Value': arr, 'Category': cat, 'Quantile Category': qcat})
The bucket ranges
pd.cut(arr, bins=5).categories
Inspect the buckets for qcut
pd.qcut(arr, q=5).categories
Permuting and sampling DataFrame values to generate new DataFrames
Permuting available datasets to generate new datasets and sampling datasets to either sub-sample (reduce the number of observations) or super-sample (increase the number of observations) are common operations in statistical analysis.
Generate a DataFrame
df = pd.DataFrame(np.random.randn(10,5), index=np.sort(np.random.randint(0, 100, size=10)), columns=list('ABCDE'));
The numpy.random.permutation(…) method, when applied to a DataFrame, randomly shuffles along the Index axis and can be used to permute the rows in the dataset:
df.loc[np.random.permutation(df.index)]
Use the numpy.random.randint(…) method to generate random integers within a certain range, and then use the pandas.DataFrame.iloc[…] attribute to randomly sample with replacement (the same observation can be picked more than once) from our DataFrame
df.iloc[np.random.randint(0, len(df), size=5)]
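pandas also provides a built-in alternative for this; a minimal sketch using the pandas.DataFrame.sample(…) method with the replace= parameter to draw five rows with replacement:
df.sample(n=5, replace=True)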
Exploring file operations with pandas.DataFrames
pandas supports the persistence of DataFrames in both plain-text and binary formats. The common text formats are CSV and JSON files; the most used binary formats are Excel XLSX, HDF5, and pickle.
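As a brief sketch of the binary formats (assuming the optional openpyxl and PyTables dependencies are installed for Excel and HDF5 support; the file names are arbitrary):
df.to_pickle('df.pkl');
df.to_excel('df.xlsx');
df.to_hdf('df.h5', key='df');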
CSV files
Writing CSV files
df.to_csv('df.csv', sep=',', header=True, index=True)
Inspect the file written to disk
!head -n 4 df.csv
Reading CSV files
Use the index_col= parameter to specify which column to treat as the Index of the DataFrame, and the nrows= parameter to specify how many rows to read in
pd.read_csv('df.csv', sep=',', index_col=0, nrows=5)
Read in the specified number of lines at a time with the chunksize= parameter
pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2)
Force the generator to finish evaluation by wrapping it in a list and observe the entire DataFrame loaded in chunks of two lines
list(pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2))
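In practice, the chunks are usually consumed in a loop; a minimal sketch that processes each two-line chunk as it is read:
for chunk in pd.read_csv('df.csv', sep=',', index_col=0, chunksize=2):
    print(len(chunk))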
JSON files
JSON files are based upon data structures that map naturally to Python dictionaries. This makes JSON files very convenient for many purposes, including representing DataFrames as well as representing configuration files.
The pandas.DataFrame.to_json(…) method conveniently writes a DataFrame to a JSON file on disk.
df.iloc[:4].to_json('df.json')
Check out the JSON file
!cat df.json
Read JSON files back to pandas DataFrames
pd.read_json('df.json')





