Transformations
Transformations allow you to preprocess data, save your preprocessing steps to a pickle file, and use it in the model inference.
Build-in transformations
Several built-in transformations can help you to preprocess some specific data.
An example of using build-in transformations:
import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import (
TransformExp,
InsolverTransform,
TransformAge,
TransformMapValues,
TransformPolynomizer,
TransformAgeGender
)
from insolver.model_tools import download_dataset
download_dataset('freMPL-R')
InsDataFrame = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))
InsTransforms = InsolverTransform(InsDataFrame, [
TransformAge('DrivAge', 18, 75),
TransformExp('LicAge', 57),
TransformMapValues('Gender', {'Male': 0, 'Female': 1}),
TransformMapValues('MariStat', {'Other': 0, 'Alone': 1}),
TransformAgeGender('DrivAge', 'Gender', 'Age_m', 'Age_f', age_default=18,
gender_male=0, gender_female=1),
TransformPolynomizer('Age_m'),
TransformPolynomizer('Age_f')
])
InsTransforms.ins_transform()
InsTransforms.save('transforms.pickle')
General preprocessing
These classes are used to encode categorical values.
class
TransformToNumericTransforms parameter values to numeric types, usespandas.to_numeric.class
TransformGetDummiesGets dummy columns of the parameter, usespandas.get_dummies.class
TransformMapValuesTransforms parameter values according to the given dictionary.
You can also generate the polynomial features.
class
TransformPolynomizerGets the polynomials of parameters values.
Grouping and sorting
class
TransformParamUselessGroup: Groups all parameter values with a small amount of data into one group.class
TransformParamSortFreq: Sorts by the frequency values of the chosen column.class
TransformParamSortAC: Sorts by the average sum values of the chosen column.
Preprocessing data about a person
These classes are used to preprocess data about a person, such as gender, age, or name.
class
TransformGenderGetFromName: For Russian names only. Gets the gender of a person from Russian second names.class
TransformAgeGetFromBirthday: Gets the age of a person in years from birth dates.class
TransformAge: Transform a person’s age to age for a specifiedage_min(lower values are invalid) andage_max(exceeding values will be grouped) age.class
TransformAgeGender: Gets the intersection of a person’s minimum age and gender.class
TransformNameCheck: Checks if the person’s first name is on the particular list. Names may concatenate surnames, first names, and last names.
Preprocessing of insurance data
Since Insolver was made for the insurance industry, some classes are also available to handle the driving experience, vehicle, and region data.
class
TransformExp: Transforms the minimum driving experience values in years with a grouping of values greater thanexp_max.class
TransformAgeExpDiff: Transforms records with the difference between the minimum driver age and the minimum experience less thandiff_minyears, sets the minimum driver experience equal to the minimum driver age minusdiff_minyears.class
TransformVehPower: Transforms vehicle power values. Values underpower_minand overpower_maxwill be grouped. Values betweenpower_minandpower_maxwill be grouped with steppower_step.class
TransformVehAgeGetFromIssueYear: Get the vehicles’ age (in years) by year of issue and policy start dates.class
TransformVehAge: Transforms vehicle age values (in years). Values overveh_age_maxwill be grouped.class
TransformRegionGetFromKladr: Gets the region number from KLADRs.class
TransformCarFleetSize: Calculates fleet sizes for policyholders.
AutoFill NA values
Class AutoFillNATransforms is used to fill NA values in a dataset.
It fills NA values with median values for numerical columns and the most frequently used categorical columns.
import numpy as np
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms
df = InsolverDataFrame(pd.DataFrame(data={'col1': [1, 2, np.nan]}))
print(df)
# col1
# 0 1.0
# 1 2.0
# 2 NaN
df_transformed = InsolverTransform(df, [
AutoFillNATransforms(),
])
df_transformed.ins_transform()
print(df_transformed)
# col1
# 0 1.0
# 1 2.0
# 2 1.5
Numerical AutoFillNA methods
There are several options for the numerical_method parameter available to fill the NA numerical values:
median(by default) - the value separating the higher half from the lower half;mean- the sum of the values divided by the number of values;mode- the value that appears most often in a set of data values, if several values are found, the first one is used;remove- removes all columns containing NA values.
Categorical AutoFillNA methods
There are several options for the categorical_method parameter available to fill the NA categorical values:
frequent(by default) - the category that appears most often in a set of data values, if several values are found, the first one is used;new_category- creates a new category “Unknown” for NA values;imputed_column- fills with the frequent category and creates a newboolcolumn containing whether a value was imputed or not;remove- removes all columns containing NA values.
Using constants
You can also use constants to fill NA values using the numerical_constants and categorical_constants parameters for numerical and categorical columns.
import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms
from insolver.model_tools import download_dataset
download_dataset('freMPL-R')
df = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))
transform = InsolverTransform(df, [
AutoFillNATransforms(numerical_constants={'col1': '111'}),
])
transform.ins_transform()
print(df)
# col1
# 0 1.0
# 1 2.0
# 2 111
Date and datetime
Class DatetimeTransforms is used to preprocess date and date-time columns.
Unlike other transformations, this class does not change the date columns but creates new ones with the used feature in the name.
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, DatetimeTransforms
df = InsolverDataFrame(pd.DataFrame(data={'last_review': ['2018-10-19', '2019-05-21']}))
print(df)
# last_review
# 0 2018-10-19
# 1 2019-05-21
transform = InsolverTransform(df, [
DatetimeTransforms(['last_review']),
])
transform.ins_transform()
print(df)
# last_review last_review_unix
# 0 2018-10-19 1.539907e+09
# 1 2019-05-21 1.558397e+09
Label Encoder
EncoderTransforms based on sklearn’s LabelEncoder. Encode target labels with value between 0 and n_classes-1.
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, EncoderTransforms
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))
print(df)
# col1
# 0 A
# 1 B
# 2 C
# 3 A
df_transformed = InsolverTransform(df, [
EncoderTransforms(['col1']),
])
df_transformed.ins_transform()
print(df_transformed)
# col1
# 0 0
# 1 1
# 2 2
# 3 0
One Hot Encoder
OneHotEncoderTransforms based on sklearn’s OneHotEncoder. Encode categorical features as a one-hot numeric array.
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))
print(df)
# col1
# 0 A
# 1 B
# 2 C
# 3 A
df_transformed = InsolverTransform(df, [
OneHotEncoderTransforms(['col1']),
])
df_transformed.ins_transform()
print(df_transformed)
# col1_A col1_B col1_C
# 0 1.0 0.0 0.0
# 1 0.0 1.0 0.0
# 2 0.0 0.0 1.0
# 3 1.0 0.0 0.0
Custom transformations
Custom transformations can also be created. Transfromation can be defined as a class object, which has __init__ and __call__ methods.
The __call__ method should take one argument, which is the initial dataframe, and return transformed one.
Also, all packages and assets used in the custom transformation should be imported explicitly in methods where they are used.
Otherwise, custom transformations may not work properly, since the saved transformation is serialized by dill package, which may not resolve all the references when transformation will be loaded.
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform
class TransformToNumeric:
"""Example of user-defined transformations. Transform values to numeric.
Parameters:
column_names (list): List of columns for transformations
downcast (str): parameter from pd.to_numeric, default: 'float'
"""
def __init__(self, column_names, downcast='float'):
self.column_names = column_names
self.downcast = downcast
def __call__(self, df):
import pandas as pd
for column in self.column_names:
df[column] = pd.to_numeric(df[column], downcast=self.downcast)
return df
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['1.0', '2', -3]}))
print(df)
print(df.dtypes)
# col1
# 0 1.0
# 1 2
# 2 -3
# col1 object
# dtype: object
df_transformed = InsolverTransform(df, [
TransformToNumeric(['col1']),
])
df_transformed.ins_transform()
print(df_transformed)
print(df_transformed.dtypes)
# col1
# 0 1.0
# 1 2.0
# 2 -3.0
# col1 float32
# dtype: object
Saving and loading transformations
Transformations can also be saved with save() method for both development and production use.
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))
df_transformed = InsolverTransform(df, [
OneHotEncoderTransforms(['col1']),
])
df_transformed.ins_transform()
df_transformed.save('transforms')
Transformations saving is performed by serialization with dill package.
To use saved transformations (including user-defined) your workflow, you should pass their filepath into load_transforms function.
After that you can use them in the same way as the build-in imported transformations.
import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, load_transforms
# load data
df = pd.read_json('request_example.json')
InsDataFrame = InsolverDataFrame(df)
# load transformations
transforms = load_transforms('transforms')
InsTransforms = InsolverTransform(InsDataFrame, transforms)
InsTransforms.ins_transform()
...