# Transformations

Transformations allow you to preprocess data, save your preprocessing steps to a pickle file, and use it in the model inference.

## Build-in transformations
Several built-in transformations can help you to preprocess some specific data.

An example of using build-in transformations:

```python
import pandas as pd

from insolver import InsolverDataFrame
from insolver.transforms import (
    TransformExp,
    InsolverTransform,
    TransformAge,
    TransformMapValues,
    TransformPolynomizer,
    TransformAgeGender
)
from insolver.model_tools import download_dataset

download_dataset('freMPL-R')
InsDataFrame = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))

InsTransforms = InsolverTransform(InsDataFrame, [
    TransformAge('DrivAge', 18, 75),
    TransformExp('LicAge', 57),
    TransformMapValues('Gender', {'Male': 0, 'Female': 1}),
    TransformMapValues('MariStat', {'Other': 0, 'Alone': 1}),
    TransformAgeGender('DrivAge', 'Gender', 'Age_m', 'Age_f', age_default=18,
                       gender_male=0, gender_female=1),
    TransformPolynomizer('Age_m'),
    TransformPolynomizer('Age_f')
])

InsTransforms.ins_transform()
InsTransforms.save('transforms.pickle')
```

### General preprocessing
These classes are used to encode categorical values.

* class `TransformToNumeric`
    Transforms parameter values to numeric types, uses [`pandas.to_numeric`](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html).

* class `TransformGetDummies`
    Gets dummy columns of the parameter, uses [`pandas.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

* class `TransformMapValues`
    Transforms parameter values according to the given dictionary.

You can also generate the polynomial features.

* class `TransformPolynomizer`
    Gets the polynomials of parameters values.

### Grouping and sorting 

* class `TransformParamUselessGroup`:
    Groups all parameter values with a small amount of data into one group. 

* class `TransformParamSortFreq`:
    Sorts by the frequency values of the chosen column.

* class `TransformParamSortAC`:
    Sorts by the average sum values of the chosen column.

### Preprocessing data about a person
These classes are used to preprocess data about a person, such as gender, age, or name.
* class `TransformGenderGetFromName`:
    For Russian names only. Gets the gender of a person from Russian second names.

* class `TransformAgeGetFromBirthday`:
    Gets the age of a person in years from birth dates.

* class `TransformAge`:
    Transform a person's age to age for a specified `age_min` (lower values are invalid) and `age_max` (exceeding values will be grouped) age. 

* class `TransformAgeGender`:
    Gets the intersection of a person's minimum age and gender.

* class `TransformNameCheck`:
    Checks if the person's first name is on the particular list. Names may concatenate surnames, first names, and last names.

### Preprocessing of insurance data 
Since Insolver was made for the insurance industry, some classes are also available to handle the driving experience, vehicle, and region data.
* class `TransformExp`:
    Transforms the minimum driving experience values in years with a grouping of values greater than `exp_max`. 

* class `TransformAgeExpDiff`:
    Transforms records with the difference between the minimum driver age and the minimum experience less than `diff_min` years, sets the minimum driver experience equal to the minimum driver age minus `diff_min` years. 

* class `TransformVehPower`:
    Transforms vehicle power values. Values under `power_min` and over `power_max` will be grouped. Values between `power_min` and `power_max` will be grouped with step `power_step`.

* class `TransformVehAgeGetFromIssueYear`:
    Get the vehicles' age (in years) by year of issue and policy start dates. 

* class `TransformVehAge`:
    Transforms vehicle age values (in years). Values over `veh_age_max` will be grouped.

* class `TransformRegionGetFromKladr`:
    Gets the region number from KLADRs.

* class `TransformCarFleetSize`:
    Calculates fleet sizes for policyholders.

### AutoFill NA values

```{eval-rst}
.. autoclass:: insolver.transforms.AutoFillNATransforms
    :show-inheritance:
```

Class `AutoFillNATransforms` is used to fill NA values in a dataset. 
It fills NA values with median values for numerical columns and the most frequently used categorical columns.

```python
import numpy as np
import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms

df = InsolverDataFrame(pd.DataFrame(data={'col1': [1, 2, np.nan]}))

print(df)
#    col1
# 0   1.0
# 1   2.0
# 2   NaN

df_transformed = InsolverTransform(df, [
        AutoFillNATransforms(),
    ])
df_transformed.ins_transform()

print(df_transformed)
#    col1
# 0   1.0
# 1   2.0
# 2   1.5
```

#### Numerical AutoFillNA methods
There are several options for the `numerical_method` parameter available to fill the NA numerical values: 
- `median` (by default) - the value separating the higher half from the lower half;
- `mean` - the sum of the values divided by the number of values;
- `mode` - the value that appears most often in a set of data values, if several values are found, the first one is used;
- `remove` - removes all columns containing NA values.

#### Categorical AutoFillNA methods
There are several options for the `categorical_method` parameter available to fill the NA categorical values: 
- `frequent` (by default) - the category that appears most often in a set of data values, if several values are found, the first one is used;
- `new_category` - creates a new category "Unknown" for NA values; 
- `imputed_column` - fills with the frequent category and creates a new `bool` column containing whether a value was imputed or not;
- `remove` - removes all columns containing NA values.

#### Using constants
You can also use constants to fill NA values using the `numerical_constants` and `categorical_constants` parameters for numerical and categorical columns.

```python
import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, AutoFillNATransforms
from insolver.model_tools import download_dataset

download_dataset('freMPL-R')
df = InsolverDataFrame(pd.read_csv('./datasets/freMPL-R.csv', low_memory=False))
transform = InsolverTransform(df, [
    AutoFillNATransforms(numerical_constants={'col1': '111'}), 
])
transform.ins_transform()

print(df)
#    col1
# 0   1.0
# 1   2.0
# 2   111
```

### Date and datetime

```{eval-rst}
.. autoclass:: insolver.transforms.DatetimeTransforms
    :show-inheritance:
```

Class `DatetimeTransforms` is used to preprocess date and date-time columns. 
Unlike other transformations, this class does not change the date columns but creates new ones with the used feature in the name.

```python
import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, DatetimeTransforms

df = InsolverDataFrame(pd.DataFrame(data={'last_review': ['2018-10-19', '2019-05-21']}))

print(df)
#    last_review
# 0   2018-10-19
# 1   2019-05-21

transform = InsolverTransform(df, [
        DatetimeTransforms(['last_review']),
    ])
    
transform.ins_transform()

print(df)
#    last_review last_review_unix
# 0   2018-10-19   1.539907e+09     
# 1   2019-05-21   1.558397e+09
```

### Label Encoder

EncoderTransforms based on [sklearn's LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Encode target labels with value between 0 and `n_classes`-1.

```python
import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, EncoderTransforms


df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))

print(df)
#   col1
# 0    A
# 1    B
# 2    C
# 3    A

df_transformed = InsolverTransform(df, [
    EncoderTransforms(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
#    col1
# 0     0
# 1     1
# 2     2
# 3     0
```

### One Hot Encoder

OneHotEncoderTransforms based on [sklearn's OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Encode categorical features as a one-hot numeric array.

```python
import pandas as pd

from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms


df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))

print(df)
#   col1
# 0    A
# 1    B
# 2    C
# 3    A

df_transformed = InsolverTransform(df, [
    OneHotEncoderTransforms(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
#    col1_A  col1_B  col1_C
# 0     1.0     0.0     0.0
# 1     0.0     1.0     0.0
# 2     0.0     0.0     1.0
# 3     1.0     0.0     0.0
```

## Custom transformations

Custom transformations can also be created. Transfromation can be defined as a `class` object, which has `__init__` and `__call__` methods.
The `__call__` method should take one argument, which is the initial dataframe, and return transformed one.
Also, all packages and assets used in the custom transformation should be imported explicitly in methods where they are used.
Otherwise, custom transformations may not work properly, since the saved transformation is serialized by `dill` package, which may not resolve all the references when transformation will be loaded.

```python
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform


class TransformToNumeric:
    """Example of user-defined transformations. Transform values to numeric.

    Parameters:
        column_names (list): List of columns for transformations
        downcast (str): parameter from pd.to_numeric, default: 'float'
    """
    def __init__(self, column_names, downcast='float'):
        self.column_names = column_names
        self.downcast = downcast

    def __call__(self, df):
        import pandas as pd
        for column in self.column_names:
            df[column] = pd.to_numeric(df[column], downcast=self.downcast)
        return df

    
df = InsolverDataFrame(pd.DataFrame(data={'col1': ['1.0', '2', -3]}))

print(df)
print(df.dtypes)
#   col1
# 0  1.0
# 1    2
# 2   -3
# col1    object
# dtype: object

df_transformed = InsolverTransform(df, [
    TransformToNumeric(['col1']),
])
df_transformed.ins_transform()

print(df_transformed)
print(df_transformed.dtypes)
#    col1
# 0   1.0
# 1   2.0
# 2  -3.0
# col1    float32
# dtype: object
```

## Saving and loading transformations

Transformations can also be saved with `save()` method for both development and production use.

```python
import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform, OneHotEncoderTransforms

df = InsolverDataFrame(pd.DataFrame(data={'col1': ['A', 'B', 'C', 'A']}))
df_transformed = InsolverTransform(df, [
    OneHotEncoderTransforms(['col1']),
])

df_transformed.ins_transform()
df_transformed.save('transforms')
```

Transformations saving is performed by serialization with `dill` package.
To use saved transformations (including user-defined) your workflow, you should pass their filepath into `load_transforms` function.
After that you can use them in the same way as the build-in imported transformations.

```python
import pandas as pd
from insolver import InsolverDataFrame
from insolver.transforms import InsolverTransform, load_transforms

# load data
df = pd.read_json('request_example.json')
InsDataFrame = InsolverDataFrame(df)

# load transformations
transforms = load_transforms('transforms')
InsTransforms = InsolverTransform(InsDataFrame, transforms)
InsTransforms.ins_transform()
...
```