Pandas data cleaning

Apply a variable function to a variable column

Posted by Matt Witman on July 23, 2019

Pandas provides the handy pandas.Series.apply() method for invoking a function on the values in a Series, i.e. for transforming a column of a DataFrame. Below is a template function demonstrating how flexibly and succinctly we can apply arbitrary transformations to our data. transform_df() modifies a DataFrame in place, given a variable column name and a variable function, using setattr and getattr. The **kwargs are forwarded to apply(), which passes them on to the function, so we can easily use any function we want on a given column regardless of which keyword arguments it takes. Note that attribute access only reaches a column that already exists and whose name does not clash with an existing DataFrame attribute or method (e.g. size). For a current project, we have a database with hundreds to thousands of columns, many of which need to be modified in different ways. I found this method quite helpful in keeping my data cleaning process neat and organized.

import pandas as pd
import numpy as np

def minus_one_to_hex(val):
    return hex(val-1)

def weird_function(val, to_mod=True, denom=2, scale=4):
    if to_mod:
        return val%denom
    else:
        return val/denom*scale

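def transform_df(df, col, func, **kwargs):
    # Overwrite column `col` in place with the result of applying func to it;
    # extra keyword arguments are forwarded to func via Series.apply.
    setattr(df, col, getattr(df, col).apply(func, **kwargs))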

d = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(d)
print("Original:")
print(df)

print("Alter 'col1':")
transform_df(df,'col1',np.square)
print(df)

print("Alter 'col2':")
transform_df(df,'col2',minus_one_to_hex)
print(df)

print("Alter 'col3':")
kwargs={'to_mod':True, 'denom':3, 'scale':1.22}
transform_df(df,'col3',weird_function, **kwargs)
print(df)

>>>
Original:
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9
Alter 'col1':
   col1  col2  col3
0     1     4     7
1     4     5     8
2     9     6     9
Alter 'col2':
   col1 col2  col3
0     1  0x3     7
1     4  0x4     8
2     9  0x5     9
Alter 'col3':
   col1 col2  col3
0     1  0x3     1
1     4  0x4     2
2     9  0x5     0
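
One case the setattr/getattr template does not cover is a column whose name matches an existing DataFrame attribute, such as size: getattr(df, 'size') returns the number of elements rather than the column. Below is a minimal sketch of a variant that uses bracket indexing instead, driven by a small "cleaning plan" that maps each column to its function and keyword arguments. The names transform_column, scale_value, plan, and frame are all hypothetical and only serve this illustration; it is one possible arrangement, not the only way to organize this.

```python
import pandas as pd


def transform_column(df, col, func, **kwargs):
    """Hypothetical variant of transform_df that uses bracket indexing."""
    # df[col] always refers to the column, even when its name shadows a
    # DataFrame attribute such as 'size' or 'count'.
    df[col] = df[col].apply(func, **kwargs)


def scale_value(val, factor=1):
    # Hypothetical helper used only for this illustration.
    return val * factor


# Hypothetical cleaning plan: column name -> (function, keyword arguments).
plan = {
    "size": (scale_value, {"factor": 10}),
    "col2": (scale_value, {"factor": 2}),
}

frame = pd.DataFrame({"size": [1, 2, 3], "col2": [4, 5, 6]})

# Apply each planned transformation to its column, in place.
for col, (func, kwargs) in plan.items():
    transform_column(frame, col, func, **kwargs)

print(frame)
```

Keeping the plan in a dictionary means that, with many columns to clean, adding or changing a transformation is a one-line edit rather than another hand-written call.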