Transformations

Transformations allow you to preprocess data, save your preprocessing steps to a pickle file, and use it in the model inference.

Build-in transformations

Several built-in transformations can help you to preprocess some specific data.

An example of using build-in transformations:

import pandas as pd

from insolver import InsolverDataFrame
from insolver.transforms import (
    TransformExp,
    InsolverTransform,
    TransformAge,
    TransformMapValues,
    TransformPolynomizer,
    TransformAgeGender
)
from insolver.model_tools import download_dataset

download_dataset('freMPL-R')
InsDataFrame = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))

InsTransforms = InsolverTransform(InsDataFrame, [
    TransformAge('DrivAge', 18, 75),
    TransformExp('LicAge', 57),
    TransformMapValues('Gender', {'Male': 0, 'Female': 1}),
    TransformMapValues('MariStat', {'Other': 0, 'Alone': 1}),
    TransformAgeGender('DrivAge', 'Gender', 'Age_m', 'Age_f', age_default=18,
                       gender_male=0, gender_female=1),
    TransformPolynomizer('Age_m'),
    TransformPolynomizer('Age_f')
])

InsTransforms.ins_transform()
InsTransforms.save('transforms.pickle')

General preprocessing

These classes are used to encode categorical values.

class TransformToNumeric Transforms parameter values to numeric types, uses pandas.to_numeric.
class TransformGetDummies Gets dummy columns of the parameter, uses pandas.get_dummies.
class TransformMapValues Transforms parameter values according to the given dictionary.

You can also generate the polynomial features.

class TransformPolynomizer Gets the polynomials of parameters values.

Grouping and sorting

class TransformParamUselessGroup: Groups all parameter values with a small amount of data into one group.
class TransformParamSortFreq: Sorts by the frequency values of the chosen column.
class TransformParamSortAC: Sorts by the average sum values of the chosen column.

Preprocessing data about a person

These classes are used to preprocess data about a person, such as gender, age, or name.

class TransformGenderGetFromName: For Russian names only. Gets the gender of a person from Russian second names.
class TransformAgeGetFromBirthday: Gets the age of a person in years from birth dates.
class TransformAge: Transform a person’s age to age for a specified age_min (lower values are invalid) and age_max (exceeding values will be grouped) age.
class TransformAgeGender: Gets the intersection of a person’s minimum age and gender.
class TransformNameCheck: Checks if the person’s first name is on the particular list. Names may concatenate surnames, first names, and last names.

Preprocessing of insurance data

Since Insolver was made for the insurance industry, some classes are also available to handle the driving experience, vehicle, and region data.

class TransformExp: Transforms the minimum driving experience values in years with a grouping of values greater than exp_max.
class TransformAgeExpDiff: Transforms records with the difference between the minimum driver age and the minimum experience less than diff_min years, sets the minimum driver experience equal to the minimum driver age minus diff_min years.
class TransformVehPower: Transforms vehicle power values. Values under power_min and over power_max will be grouped. Values between power_min and power_max will be grouped with step power_step.
class TransformVehAgeGetFromIssueYear: Get the vehicles’ age (in years) by year of issue and policy start dates.
class TransformVehAge: Transforms vehicle age values (in years). Values over veh_age_max will be grouped.
class TransformRegionGetFromKladr: Gets the region number from KLADRs.
class TransformCarFleetSize: Calculates fleet sizes for policyholders.

AutoFill NA values

Class AutoFillNATransforms is used to fill NA values in a dataset. It fills NA values with median values for numerical columns and the most frequently used categorical columns.

import numpy as np
import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms

df = InsolverDataFrame(pd.DataFrame(data={'col1': [1, 2, np.nan]}))

print(df)
#    col1
# 0   1.0
# 1   2.0
# 2   NaN

df_transformed = InsolverTransform(df, [
        AutoFillNATransforms(),
    ])
df_transformed.ins_transform()

print(df_transformed)
#    col1
# 0   1.0
# 1   2.0
# 2   1.5

Numerical AutoFillNA methods

There are several options for the numerical_method parameter available to fill the NA numerical values:

median (by default) - the value separating the higher half from the lower half;
mean - the sum of the values divided by the number of values;
mode - the value that appears most often in a set of data values, if several values are found, the first one is used;
remove - removes all columns containing NA values.

Categorical AutoFillNA methods

There are several options for the categorical_method parameter available to fill the NA categorical values:

frequent (by default) - the category that appears most often in a set of data values, if several values are found, the first one is used;
new_category - creates a new category “Unknown” for NA values;
imputed_column - fills with the frequent category and creates a new bool column containing whether a value was imputed or not;
remove - removes all columns containing NA values.

Using constants

You can also use constants to fill NA values using the numerical_constants and categorical_constants parameters for numerical and categorical columns.

import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms
from insolver.model_tools import download_dataset

download_dataset('freMPL-R')
df = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))
transform = InsolverTransform(df, [
    AutoFillNATransforms(numerical_constants={'col1': '111'}), 
])
transform.ins_transform()

print(df)
#    col1
# 0   1.0
# 1   2.0
# 2   111

Date and datetime

Class DatetimeTransforms is used to preprocess date and date-time columns. Unlike other transformations, this class does not change the date columns but creates new ones with the used feature in the name.

import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, DatetimeTransforms

df = InsolverDataFrame(pd.DataFrame(data={'last_review': ['2018-10-19', '2019-05-21']}))

print(df)
#    last_review
# 0   2018-10-19
# 1   2019-05-21

transform = InsolverTransform(df, [
        DatetimeTransforms(['last_review']),
    ])
    
transform.ins_transform()

print(df)
#    last_review last_review_unix
# 0   2018-10-19   1.539907e+09     
# 1   2019-05-21   1.558397e+09

Label Encoder

EncoderTransforms based on sklearn’s LabelEncoder. Encode target labels with value between 0 and n_classes-1.

import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, EncoderTransforms


df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))

print(df)
#   col1
# 0    A
# 1    B
# 2    C
# 3    A

df_transformed = InsolverTransform(df, [
    EncoderTransforms(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
#    col1
# 0     0
# 1     1
# 2     2
# 3     0

One Hot Encoder

OneHotEncoderTransforms based on sklearn’s OneHotEncoder. Encode categorical features as a one-hot numeric array.

import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms


df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))

print(df)
#   col1
# 0    A
# 1    B
# 2    C
# 3    A

df_transformed = InsolverTransform(df, [
    OneHotEncoderTransforms(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
#    col1_A  col1_B  col1_C
# 0     1.0     0.0     0.0
# 1     0.0     1.0     0.0
# 2     0.0     0.0     1.0
# 3     1.0     0.0     0.0

Custom transformations

Custom transformations can also be created. Transfromation can be defined as a class object, which has __init__ and __call__ methods. The __call__ method should take one argument, which is the initial dataframe, and return transformed one. Also, all packages and assets used in the custom transformation should be imported explicitly in methods where they are used. Otherwise, custom transformations may not work properly, since the saved transformation is serialized by dill package, which may not resolve all the references when transformation will be loaded.

import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform


class TransformToNumeric:
    """Example of user-defined transformations. Transform values to numeric.

    Parameters:
        column_names (list): List of columns for transformations
        downcast (str): parameter from pd.to_numeric, default: 'float'
    """
    def __init__(self, column_names, downcast='float'):
        self.column_names = column_names
        self.downcast = downcast

    def __call__(self, df):
        import pandas as pd
        for column in self.column_names:
            df[column] = pd.to_numeric(df[column], downcast=self.downcast)
        return df

    
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['1.0', '2', -3]}))

print(df)
print(df.dtypes)
#   col1
# 0  1.0
# 1    2
# 2   -3
# col1    object
# dtype: object

df_transformed = InsolverTransform(df, [
    TransformToNumeric(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
print(df_transformed.dtypes)
#    col1
# 0   1.0
# 1   2.0
# 2  -3.0
# col1    float32
# dtype: object

Saving and loading transformations

Transformations can also be saved with save() method for both development and production use.

import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms

df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))
df_transformed = InsolverTransform(df, [
    OneHotEncoderTransforms(['col1']),
])

df_transformed.ins_transform()
df_transformed.save('transforms')

Transformations saving is performed by serialization with dill package. To use saved transformations (including user-defined) your workflow, you should pass their filepath into load_transforms function. After that you can use them in the same way as the build-in imported transformations.

import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, load_transforms

# load data
df = pd.read_json('request_example.json')
InsDataFrame = InsolverDataFrame(df)

# load transformations
transforms = load_transforms('transforms')
InsTransforms = InsolverTransform(InsDataFrame, transforms)
InsTransforms.ins_transform()
...