Feature Engineering

The DataPreprocessing class preprocesses your data automatically. It supports the following feature engineering steps: categorical data transformation, automatic filling of NA values, normalization, feature selection, dimensionality reduction, smoothing, and sampling.

You can just call the method preprocess and it will normalize, fill in NA values and transform your data.

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
new_df = DataPreprocessing().preprocess(df=df, target='target')

The data must be a pandas.DataFrame or an insolver.InsolverDataFrame.
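To make the default steps more concrete, here is a minimal sketch in plain pandas of what filling NA values, transforming data (read here as encoding categorical columns) and normalizing can look like. It is not insolver's implementation: the helper name `basic_preprocess`, the median/mode fill rules, the integer category codes and the min-max scaling are all assumptions chosen for illustration.

```python
import pandas as pd


def basic_preprocess(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Hypothetical helper: fill NA, encode categories, min-max scale features."""
    out = df.copy()
    features = [c for c in out.columns if c != target]
    for col in features:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumed rule: numerical gaps are filled with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumed rule: categorical gaps are filled with the most frequent
            # value, then each category is replaced by an integer code.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
        # Min-max scaling to [0, 1]; constant columns become zeros.
        span = out[col].max() - out[col].min()
        out[col] = (out[col] - out[col].min()) / span if span else 0.0
    return out


raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["A", "B", None, "B"],
    "target": [0, 1, 0, 1],
})
clean = basic_preprocess(raw, target="target")
print(clean)
```

The target column is left untouched in this sketch, which is why it is passed separately from the feature columns.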

You can also set selected columns as numerical or categorical:

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
new_df = DataPreprocessing(
    numerical_columns=['1', '2', '3'],
    categorical_columns=['5'],
).preprocess(df=df, target='target')

If you want to use some of the functionality available in this class, but don’t want to select a specific method, set the parameters to True and it will use the default values:

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
preprocess = DataPreprocessing(normalization=True, fillna=True, sampling=True, transform_categorical=True)
new_df = preprocess.preprocess(df = df, target='target')

However, some functions need to have initialized parameters:

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
preprocess = DataPreprocessing(feature_selection=True, feat_select_task='class', smoothing=True,
                           smoothing_column='smoothing_column')
new_df = preprocess.preprocess(df = df, target='target')

The DataPreprocessing class also supports multiple targets. To use them, pass the target parameter of the preprocess method as a list:

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
new_df = DataPreprocessing().preprocess(df = df, target=['target', 'target_2'])

You can also customize each step by changing its parameters, or disable a step that is enabled by default by setting it to None or False:

import pandas as pd
from insolver.feature_engineering import DataPreprocessing

df = pd.DataFrame(...)
preprocess = DataPreprocessing(transform_categorical=None,  normalization='minmax', fillna=True,
                           fillna_categorical='imputed_column', fillna_numerical='mode', sampling='cluster',
                           sampling_n=2, sampling_n_clusters=5, smoothing='moving_average',
                           smoothing_column='smoothing_column', feature_selection='lasso', feat_select_task='class',
                           feat_select_threshold='mean')
new_df = preprocess.preprocess(df = df, target='target')

dim_red_preprocess = DataPreprocessing(dim_red='isomap', dim_red_n_components=1, dim_red_n_neighbors=10)
dim_red_new_df = dim_red_preprocess.preprocess(df = df, target='target')

Feature Selection

The FeatureSelection class computes feature importances using the selected method. It can also plot them with a chosen plot size and importance threshold, and create a new dataset that keeps only the best features according to the computed importances. For some models, permutation importance can be used for model inspection as well.

Class FeatureSelection supports such tasks as classification, regression, multiclass classification, and multiclass multioutput classification.

The following methods can be used for each task:

for the class task, Mutual information, F statistics, chi-squared test, Random Forest, Lasso, or ElasticNet can be used;
for the reg task, Mutual information, F statistics, Random Forest, Lasso, or ElasticNet can be used;
for the multiclass task, Random Forest, Lasso or ElasticNet can be used;
for the multiclass_multioutput classification Random Forest can be used.

Random Forest is used by default.

All the methods used in this class are from scikit-learn:

random_forest - classification model / regression model,
lasso - classification model / regression model,
elasticnet - classification model / regression model,
mutual_inf - classification information / regression information,
f_statistic - classification statistic / regression statistic,
chi2 - classification statistic.
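For the class task, the score-based methods above (mutual information, F statistics and the chi-squared test) can be tried directly in scikit-learn. The sketch below computes them on the iris dataset purely as an illustration. How FeatureSelection calls them internally is not shown here, and the variable names are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# Each function returns one score per feature; higher means more informative.
mi_scores = mutual_info_classif(X, y, random_state=0)
f_scores, f_pvalues = f_classif(X, y)
chi2_scores, chi2_pvalues = chi2(X, y)  # chi2 requires non-negative features

for name, mi, f, c in zip(feature_names, mi_scores, f_scores, chi2_scores):
    print(f"{name:<13} MI={mi:.3f}  F={f:.1f}  chi2={c:.1f}")
```

Note that chi2 only accepts non-negative feature values, which is one reason it is listed for classification only.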

The permutation feature importance technique also comes from scikit-learn. It is supported only for the estimator-based methods: Random Forest, Lasso, and ElasticNet.
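As an illustration of the technique itself, the sketch below calls scikit-learn's permutation_importance on a random forest. The synthetic data, in which only the first column determines the label, and all variable names are assumptions made for this example. It does not show how FeatureSelection uses the function internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)  # only the first column determines the label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column n_repeats times and record the average drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

A feature the model relies on loses a lot of accuracy when shuffled, while a pure noise feature should score close to zero.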

Methods diagram

[Figure: diagram of the feature selection methods]

Example

import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.feature_engineering import FeatureSelection

# create dataset using InsolverDataFrame or pandas.DataFrame
dataset = InsolverDataFrame(pd.read_csv("..."))

# init class FeatureSelection with default method
fs = FeatureSelection(y_column='y_column', task='class')

# create model using create_model()
fs.create_model(dataset)

# plot created model importances using plot_importance()
fs.plot_importance()

# create permutation importance using create_permutation_importance()
fs.create_permutation_importance()

# create new dataset using create_new_dataset()
new_dataset = fs.create_new_dataset()

# you can also create permutation importance by setting parameter permutation_importance=True
fs_p = FeatureSelection(y_column='y_column', method='lasso', task='class', permutation_importance=True)
fs_p.create_model(dataset)

Normalization

Normalization can be defined as adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization can be defined as more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment. Class Normalization implements seven methods for data normalization.

You can select the method by changing the method parameter:

standard - StandardScaler standardizes features by removing the mean and scaling to unit variance;
minmax - MinMaxScaler transforms features by scaling each feature to a given range;
robust - RobustScaler scales features using statistics that are robust to outliers;
normalizer - Normalizer normalizes samples individually to unit norm;
yeo-johnson - PowerTransformer(method='yeo-johnson') applies a power transform featurewise to make data more Gaussian-like, supports both positive and negative data;
box-cox - PowerTransformer(method='box-cox') applies a power transform featurewise to make data more Gaussian-like, requires input data to be strictly positive;
log - logarithm of the values.
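The scikit-learn transformers named in the list can be compared side by side on a single skewed column. This is only a sketch of the underlying scalers, not of the Normalization class. The synthetic log-normal data and the use of the natural logarithm for log are assumptions.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    RobustScaler,
    StandardScaler,
)

rng = np.random.default_rng(0)
# One strictly positive, right-skewed column with shape (n_samples, 1).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
robust = RobustScaler().fit_transform(X)      # centred on the median, scaled by the IQR
boxcox = PowerTransformer(method="box-cox").fit_transform(X)  # needs X > 0
logged = np.log(X)                            # natural logarithm (assumed base)

print(standard.mean(), standard.std(), minmax.min(), minmax.max())
```

Box-Cox and the logarithm both fail or produce invalid values on zero or negative inputs, so yeo-johnson is the safer choice when the sign of the data is not known in advance.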

To transform only selected columns with the method, set the column_names parameter to a str or a list. You can also choose a different method for particular columns by setting the column_method parameter to a dict of the form {'column name': 'method'}. If column_method is set and method is None, only the columns from column_method will be transformed. A column cannot appear in both column_method and column_names. If column_names and column_method are both None, all columns will be transformed using the specified method.
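The column selection rules above can be sketched as a small dispatch function. Everything in this block is hypothetical: the helper `normalize_columns`, the `METHODS` table and its three simplified transforms only mirror the described behaviour and are not the library's code. One case the text leaves open is assumed here: when method and column_method are set but column_names is None, the shared method is not applied to the remaining columns.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column transforms standing in for the methods listed above.
METHODS = {
    "standard": lambda s: (s - s.mean()) / s.std(ddof=0),
    "minmax": lambda s: (s - s.min()) / (s.max() - s.min()),
    "log": np.log,
}


def normalize_columns(df, method=None, column_names=None, column_method=None):
    """Hypothetical helper mirroring the column selection rules described above."""
    if isinstance(column_names, str):
        column_names = [column_names]
    column_method = dict(column_method or {})
    if column_names and set(column_names) & set(column_method):
        raise ValueError("a column cannot be in both column_names and column_method")
    if method is not None:
        if column_names is None and not column_method:
            # Neither option set: every column gets the shared method.
            targets = list(df.columns)
        else:
            # Assumption: the shared method applies to column_names only.
            targets = column_names or []
        column_method.update({col: method for col in targets})
    out = df.copy()  # the original frame is left unchanged
    for col, name in column_method.items():
        out[col] = METHODS[name](out[col])
    return out


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": [1.0, 10.0, 100.0],
})
result = normalize_columns(df, method="minmax", column_names="a", column_method={"c": "log"})
print(result)
```

Here column a is min-max scaled, column c is log-transformed, and column b is returned unchanged because it is named in neither option.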

transform(data) is the main normalization method. It creates a new pandas.DataFrame as a copy of the original data and transforms either the selected columns or all columns.

You can also plot the original and transformed data with the plot_transformed(column, **kwargs) method. It plots the selected column before and after the transformation. Any **kwargs are passed on to seaborn.displot.

Example

import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.feature_engineering.normalization import Normalization

#create dataset using InsolverDataFrame or pandas.DataFrame
df = InsolverDataFrame(pd.read_csv("..."))

#create class instance with the selected method
norm = Normalization(method='standard', 
                     column_method={'column3':'minmax', 'Y_column': 'log'}, 
                     column_names=['column1', 'column2'])

#use transform() to create new dataframe
new_data = norm.transform(data = df)

#plot result
norm.plot_transformed(column = 'Y_column', kind="kde")

#set method=None to transform only columns from `column_method`
new_data = Normalization(method=None, 
                         column_method={'column3':'minmax', 'Y_column': 'log'}).transform(data = df)

Dimensionality Reduction

DimensionalityReduction class allows you to reduce data dimensionality with a selected method. There are three types of techniques implemented: decomposition, manifold, and discriminant analysis.

The type of the method can be specified in the method parameter. The list of methods that can be assigned is presented below. All methods are implemented from scikit-learn.

Matrix decomposition is represented by methods such as:

pca - Principal Component Analysis, PCA;
svd - truncated Singular Value Decomposition, SVD;
fa - Factor Analysis, FA;
nmf - Non-Negative Matrix Factorization, NMF.

Discriminant Analysis is represented by methods such as:

lda - Linear Discriminant Analysis, LDA.

Manifold learning is represented by methods such as:

lle - Locally Linear Embedding, LLE;
isomap - Isomap Embedding;
t_sne - T-distributed Stochastic Neighbor Embedding, T-SNE.

Use the transform(X, y=None, **kwargs) method to create a new transformed X. Parameters assigned as kwargs can be used to change model(estimator) parameters which can be found in the sklearn pages above.
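Since the methods are taken from scikit-learn, the pca option presumably corresponds to something like the following direct use of sklearn.decomposition.PCA. This is a sketch of the underlying estimator under that assumption, not of the DimensionalityReduction wrapper itself. It shows the kind of estimator parameter (n_components) that would be passed through kwargs.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Keyword arguments such as n_components are estimator parameters.
estimator = PCA(n_components=2)
new_X = estimator.fit_transform(X)

print(new_X.shape)                          # (150, 2)
print(estimator.explained_variance_ratio_)  # share of variance kept by each component
```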

You can plot the transformed X and y with the plot_transformed(y, figsize=(10, 10), **kwargs) method. It uses seaborn to create plots. If the number of components is less than 3, a seaborn.scatterplot is created; otherwise a seaborn.pairplot is created. The y parameter is used as the hue. Parameters passed as kwargs can be used to change the plot parameters described in the seaborn documentation.

You can access the created model through the estimator attribute.

Example

import pandas as pd
from insolver.feature_engineering import DimensionalityReduction

#create X and y
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = pd.DataFrame(iris.target, columns=['y'])

#create DimensionalityReduction
dm = DimensionalityReduction(method='nmf')

#use transform() to create new X
new_X = dm.transform(X=X, n_components=3)

#plot result
dm.plot_transformed(y, figsize=(5, 5), palette='Set2')

Smoothing

Data smoothing can be defined as a statistical approach of eliminating outliers from datasets to make the patterns more noticeable. Class Smoothing implements four methods for data smoothing.

You can select the method by changing the method parameter:

moving_average - Moving average analyzes data points by creating a series of averages of different subsets of the full data set (uses pandas.DataFrame.rolling().mean());
lowess - Locally Weighted Scatterplot Smoothing is a generalization of moving average and polynomial regression (uses statsmodels.api.nonparametric.lowess);
s_g_filter - Savitzky–Golay filter smooths data by fitting successive subsets of adjacent data points with a low-degree polynomial by the method of linear least squares (uses scipy.signal.savgol_filter);
fft - Fast Fourier transform is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT) (uses scipy.fft.rfft and scipy.fft.irfft).

transform(data, **kwargs) is the main smoothing method. It creates a new pandas.DataFrame as a copy of the original data and adds a new transformed column.

This class has parameters that are used for different methods:

window - Window size for the moving_average and s_g_filter methods.
polyorder - Polyorder for the s_g_filter method.
threshold - Threshold for the fft method.
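To show how window, polyorder and threshold influence the result, the sketch below applies three of the underlying library calls to a noisy sine wave. It does not use the Smoothing class. The synthetic signal, the centred rolling window and the reading of threshold as a cut-off index in the frequency spectrum are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.fft import irfft, rfft
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + rng.normal(scale=0.3, size=t.size)

# Moving average: mean over a sliding window (centred here to avoid a lag).
window = 9
moving_avg = pd.Series(noisy).rolling(window, center=True).mean().to_numpy()

# Savitzky-Golay: fit a low-degree polynomial inside each window.
sg = savgol_filter(noisy, window_length=21, polyorder=3)

# FFT: zero every frequency component at or above the assumed cut-off index.
threshold = 10
spectrum = rfft(noisy)
spectrum[threshold:] = 0
fft_smoothed = irfft(spectrum, n=noisy.size)


def mse(estimate):
    """Mean squared error against the clean signal, ignoring NaN edges."""
    return float(np.nanmean((estimate - clean) ** 2))


print(mse(noisy), mse(moving_avg), mse(sg), mse(fft_smoothed))
```

A larger window or a lower cut-off removes more noise but also flattens genuine peaks, so these values are usually tuned against the time scale of the pattern you want to keep.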

Other parameters that are used for each method (besides fft) can be passed as kwargs to the transform(data, **kwargs) method.

You can also plot original and transformed data with the plot_transformed(figsize=(7, 7)) method.

Example

import pandas as pd
from insolver.frame import InsolverDataFrame
from insolver.feature_engineering import Smoothing

#create dataset using InsolverDataFrame or pandas.DataFrame
df = InsolverDataFrame(pd.read_csv("..."))

#create class instance with the selected method
smoothing = Smoothing(method='fft', x_column='x')

#use transform() to create new dataframe
new_data = smoothing.transform(data=df)

#plot result
smoothing.plot_transformed(figsize=(10,10))

Sampling

Sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Class Sampling implements methods from probability sampling. A probability sample is a sample in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. There are four methods you can use by changing the method parameter:

simple (default) sampling is a technique in which a subset of a given size is selected at random from a set;
systematic sampling is a technique in which a subset is selected from a set using a defined step;
cluster sampling is a technique in which a set is divided into clusters, then the set is determined by a randomly selected number of clusters;
stratified sampling is a technique in which a set is divided into clusters, then the set is determined by a randomly selected number of units from each cluster.

The n parameter is used differently in each sampling method:

for simple sampling, n is the number of values to keep;
for systematic sampling, n is the step size;
for cluster sampling, n is the number of clusters to keep;
for stratified sampling, n is the number of values to keep in each cluster.

You can use a dataframe column to define the clusters by setting cluster_column. The values of this column are then used as cluster labels by the cluster and stratified methods.
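The four meanings of n can be illustrated with plain pandas. This is a sketch of the sampling techniques themselves rather than of the Sampling class. The example frame, the fixed random_state values and the choice to start systematic sampling at the first row are assumptions, and the name column plays the role of cluster_column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "value": range(12),
})

# simple: n is the number of rows to keep
simple = df.sample(n=5, random_state=0)

# systematic: n is the step size, so every n-th row is kept
systematic = df.iloc[::3]

# cluster: n is the number of whole clusters to keep
kept = pd.Series(df["name"].unique()).sample(n=2, random_state=0)
cluster = df[df["name"].isin(kept)]

# stratified: n is the number of rows to keep from each cluster
stratified = pd.concat(
    [group.sample(n=2, random_state=0) for _, group in df.groupby("name")]
)

print(len(simple), len(systematic), len(cluster), len(stratified))
```

Cluster sampling keeps or drops each group as a whole, while stratified sampling keeps a few rows from every group, so the latter preserves all cluster labels in the result.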

Example

import pandas as pd
from insolver import InsolverDataFrame
from insolver.feature_engineering import Sampling

#create dataset using InsolverDataFrame or pandas.DataFrame
dataset = InsolverDataFrame(pd.read_csv("..."))

#create class instance with the selected sampling method
sampling = Sampling(n=10, n_clusters=5, method='stratified')

#use method sample_dataset() to create new dataframe
new_dataset = sampling.sample_dataset(df=dataset)

#using dataframe column as clusters
sampling = Sampling(n=2, cluster_column='name', method='stratified')
new_dataset = sampling.sample_dataset(df=dataset)