API References

Insolver DataFrame

class insolver.frame.frame.InsolverDataFrame(data: Any | None = None, index: Any | None = None, columns: Any | None = None, dtype: dtype | None = None, copy: bool | None = None)[source]
__init__(data: Any | None = None, index: Any | None = None, columns: Any | None = None, dtype: dtype | None = None, copy: bool | None = None) -> None[source]

Primary DataFrame class for Insolver. It behaves almost the same as pandas.DataFrame.

Parameters:
  • data (ndarray (structured or homogeneous), Iterable, dict, or pandas.DataFrame) – Dict can contain pandas.Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains pandas.Series which have an index defined, it is aligned by its index (default=None).

  • index (pandas.Index or array-like) – Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided (default=None).

  • columns (pandas.Index or array-like) – Column labels to use for resulting frame when data does not have them, defaulting to pandas.RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead (default=None).

  • dtype (numpy.dtype) – Data type to force. Only a single dtype is allowed. If None, infer (default=None).

  • copy (bool) – Whether to copy data from the inputs. For pandas.DataFrame or 2d ndarray input, the default of None behaves like copy=False (default=None).

get_meta_info() -> Dict[str, str | int | List[Dict[str, str | dtype]]][source]

Gets JSON with Insolver meta information.

Returns:

Meta information JSON.

Return type:

dict

sample_request(batch_size: int = 1) -> Dict[str, object][source]

Creates a JSON request from a random sample of the InsolverDataFrame.

Parameters:

batch_size (int) – Number of random samples.

Returns:

request (dict)

split_frame(val_size: float, test_size: float, random_state: int | None = 0, shuffle: bool = True, stratify: Any | None = None) -> List[DataFrame][source]

Function for splitting dataset into train/validation/test partitions.

Parameters:
  • val_size (float) – The proportion of the dataset to include in validation partition.

  • test_size (float) – The proportion of the dataset to include in test partition.

  • random_state (int, optional) – Random state, passed to train_test_split() from scikit-learn (default=0).

  • shuffle (bool, optional) – Passed to train_test_split() from scikit-learn (default=True).

  • stratify (array_like, optional) – Passed to train_test_split() from scikit-learn (default=None).

Returns:

(train, valid, test). A list of partitions of the initial dataset.

Return type:

list
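One way to read val_size and test_size is as fractions of the full dataset, with the remainder going to the train partition. The sketch below illustrates that reading with the standard library only. split_rows is a hypothetical helper, not part of Insolver, and the real method delegates to train_test_split() from scikit-learn, so its rounding and shuffling may differ.

```python
import random


def split_rows(rows, val_size, test_size, random_state=0, shuffle=True):
    """Hypothetical helper: split a sequence into [train, valid, test]."""
    rows = list(rows)
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(random_state).shuffle(rows)
    n_val = round(len(rows) * val_size)
    n_test = round(len(rows) * test_size)
    n_train = len(rows) - n_val - n_test
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return [train, valid, test]


train, valid, test = split_rows(range(100), val_size=0.2, test_size=0.1)
print(len(train), len(valid), len(test))  # 70 20 10
```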

Built-in transformations

Core utils for transformations

class insolver.transforms.core.InsolverTransform(data: Any, transforms: List | Dict[str, List | Dict] | None = None, copy: bool = False)[source]

Class to compose transforms to be done on InsolverDataFrame. Transforms may have the priority param:

  • Priority=0: transforms which derive values from other columns (TransformAgeGetFromBirthday, TransformRegionGetFromKladr, etc).

  • Priority=1: main transforms of values (TransformAge, TransformVehPower, etc).

  • Priority=2: transforms which get intersections of features (TransformAgeGender, etc) and transforms which sort values (TransformParamSortFreq, TransformParamSortAC).

  • Priority=3: transforms which get functions of values (TransformPolynomizer, TransformGetDummies, etc).

Parameters:
  • data – InsolverDataFrame to transform.

  • transforms – List of transforms to be done.
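The priority levels suggest an execution order. The sketch below assumes that transforms with a lower priority value run first; the source does not state this explicitly. Step and run_in_priority_order are hypothetical names used only for illustration and are not the Insolver implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Step:
    """Hypothetical stand-in for a transform object with a priority."""
    name: str
    priority: int
    func: Callable[[Dict], Dict]


def run_in_priority_order(record, steps):
    """Apply steps in ascending priority and report which ones ran."""
    done = []
    # sorted() is stable, so steps with equal priority keep their order.
    for step in sorted(steps, key=lambda s: s.priority):
        record = step.func(record)
        done.append(step.name)
    return record, done


steps = [
    Step("square_age", 3, lambda r: {**r, "age_sq": r["age"] ** 2}),
    Step("cap_age", 1, lambda r: {**r, "age": min(r["age"], 70)}),
    Step("age_from_years", 0,
         lambda r: {**r, "age": r["start_year"] - r["birth_year"]}),
]
record, done = run_in_priority_order(
    {"birth_year": 1940, "start_year": 2020}, steps
)
print(done)  # ['age_from_years', 'cap_age', 'square_age']
```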

ins_transform() -> Dict[source]

Transforms data in InsolverDataFrame.

Returns:

List of transforms that have been applied.

Return type:

list

exception insolver.transforms.core.TransformsWarning(message: str)[source]

Basic data transformations

class insolver.transforms.basic.EncoderTransforms(column_names, le_classes=None, priority=3)[source]

Label Encoder

Parameters:
  • column_names (list) – columns for label encoding

  • le_classes (dict) – dictionary with label encoding classes for each column

class insolver.transforms.basic.OneHotEncoderTransforms(column_names, encoder_dict=None, priority=3)[source]

OneHotEncoder Transformations

Parameters:
  • column_names (list) – columns for one hot encoding

  • encoder_dict (dict) – dictionary with encoder_params for each column

class insolver.transforms.basic.TransformGetDummies(column_param, drop_first=False, inference=False, dummy_columns=None, priority=3)[source]

Gets dummy columns of the parameter, uses Pandas’ ‘get_dummies’.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter to transform.

  • drop_first (bool) – Whether to get k-1 dummies out of k categorical levels by removing the first level, False by default.

  • inference (bool) – Flag indicating whether the transformation is used for inference, False by default.

  • dummy_columns (list) – List of the dummy columns, for inference only.
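The role of dummy_columns at inference time can be sketched as follows: the columns observed during training are reused, so unseen categories simply produce all-zero rows. get_dummies below is a hypothetical pure-Python helper written for illustration; it is not pandas' get_dummies and it does not reproduce pandas' column naming.

```python
def get_dummies(values, drop_first=False, dummy_columns=None):
    """Hypothetical helper: one-hot encode a list of categorical values.

    When dummy_columns is given (inference), exactly those columns are used.
    """
    if dummy_columns is None:
        levels = sorted(set(values))
        if drop_first:
            levels = levels[1:]  # keep k-1 dummies out of k levels
        dummy_columns = levels
    rows = [{col: int(v == col) for col in dummy_columns} for v in values]
    return rows, list(dummy_columns)


# Training: learn the dummy columns.
train_rows, cols = get_dummies(["suv", "sedan", "truck"], drop_first=True)
# Inference: reuse them; the unseen level 'van' maps to all zeros.
infer_rows, _ = get_dummies(["sedan", "van"], dummy_columns=cols)
print(cols)
print(infer_rows)
```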

class insolver.transforms.basic.TransformMapValues(column_param, dictionary, priority=1)[source]

Transforms parameter’s values according to the dictionary.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter to map.

  • dictionary (dict) – The dictionary for mapping.

class insolver.transforms.basic.TransformPolynomizer(column_param, n=2, priority=3)[source]

Gets polynomials of parameter’s values.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter to polynomize.

  • n (int) – Polynomial degree.

class insolver.transforms.basic.TransformToNumeric(column_param, downcast='integer', priority=0)[source]

Transforms parameter’s values to numeric types, uses Pandas’ ‘to_numeric’.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter to transform.

  • downcast – Target numeric dtype, equal to Pandas’ ‘downcast’ in the ‘to_numeric’ function, ‘integer’ by default.

Person data methods

class insolver.transforms.person.TransformAge(column_driver_minage, age_min=18, age_max=70, priority=1)[source]

Transforms values of drivers’ minimum ages in years. Values under ‘age_min’ are invalid. Values over ‘age_max’ will be grouped.

Parameters:
  • column_driver_minage (str) – Column name in InsolverDataFrame containing drivers’ minimum ages in years, column type is integer.

  • age_min (int) – Minimum value of drivers’ age in years, lower values are invalid, 18 by default.

  • age_max (int) – Maximum value of drivers’ age in years, bigger values will be grouped, 70 by default.
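The rule described above can be sketched for a single value. The sketch assumes that invalid ages are turned into missing values and that ages above age_max are grouped by capping them at age_max; the source does not say how the library represents either case. transform_age is a hypothetical helper.

```python
def transform_age(age, age_min=18, age_max=70):
    """Hypothetical helper mirroring the described rule for one value."""
    if age < age_min:
        return None  # assumption: invalid ages become missing
    return min(age, age_max)  # assumption: grouping means capping


print([transform_age(a) for a in (16, 18, 45, 85)])  # [None, 18, 45, 70]
```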

class insolver.transforms.person.TransformAgeGender(column_age, column_gender, column_age_m, column_age_f, age_default=18, gender_male='male', gender_female='female', priority=2)[source]

Gets intersections of drivers’ minimum ages and genders.

Parameters:
  • column_age (str) – Column name in InsolverDataFrame containing clients’ ages in years, column type is integer.

  • column_gender (str) – Column name in InsolverDataFrame containing clients’ genders.

  • column_age_m (str) – Column name in InsolverDataFrame for males’ ages, for females default value is applied, column type is integer.

  • column_age_f (str) – Column name in InsolverDataFrame for females’ ages, for males default value is applied, column type is integer.

  • age_default (int) – Default value of the age in years, 18 by default.

  • gender_male – Value for male gender in InsolverDataFrame, ‘male’ by default.

  • gender_female – Value for female gender in InsolverDataFrame, ‘female’ by default.
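As a minimal sketch of the age and gender intersection, the function below fills the male and female age columns for one record, applying age_default to the column that does not match the record's gender. age_gender is a hypothetical helper, not the library code.

```python
def age_gender(age, gender, age_default=18,
               gender_male="male", gender_female="female"):
    """Hypothetical helper: return (age_m, age_f) for one record."""
    age_m = age if gender == gender_male else age_default
    age_f = age if gender == gender_female else age_default
    return age_m, age_f


print(age_gender(40, "male"))    # (40, 18)
print(age_gender(33, "female"))  # (18, 33)
```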

class insolver.transforms.person.TransformAgeGetFromBirthday(column_date_birth, column_date_start, column_age, priority=0)[source]

Gets clients’ ages in years from their birth dates and policies’ start dates.

Parameters:
  • column_date_birth (str) – Column name in InsolverDataFrame containing clients’ birth dates, column type is date.

  • column_date_start (str) – Column name in InsolverDataFrame containing policies’ start dates, column type is date.

  • column_age (str) – Column name in InsolverDataFrame for clients’ ages in years, column type is int.
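A common way to compute an age in completed years from two dates is shown below. This is an assumption about the intended semantics; the library may compute the difference differently (for example from a day count). age_in_years is a hypothetical helper.

```python
from datetime import date


def age_in_years(birth, start):
    """Hypothetical helper: completed years between birth and start dates."""
    # Subtract one year if the birthday has not yet occurred in the start year.
    before_birthday = (start.month, start.day) < (birth.month, birth.day)
    return start.year - birth.year - int(before_birthday)


print(age_in_years(date(1990, 6, 15), date(2020, 6, 14)))  # 29
print(age_in_years(date(1990, 6, 15), date(2020, 6, 15)))  # 30
```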

class insolver.transforms.person.TransformGenderGetFromName(column_name, column_gender, gender_male='male', gender_female='female', priority=0)[source]

Gets clients’ genders from their Russian second names.

Parameters:
  • column_name (str) – Column name in InsolverDataFrame containing clients’ names, column type is string.

  • column_gender (str) – Column name in InsolverDataFrame for clients’ genders.

  • gender_male (str) – Return value for male gender in InsolverDataFrame, ‘male’ by default.

  • gender_female (str) – Return value for female gender in InsolverDataFrame, ‘female’ by default.

class insolver.transforms.person.TransformNameCheck(column_name, column_name_check, names_list, name_full=False, name_position=1, priority=1)[source]

Checks whether clients’ first names are in a given list. A name may be a concatenation of surname, first name and last name.

Parameters:
  • column_name (str) – Column name in InsolverDataFrame containing clients’ names, column type is string.

  • name_full (bool) – Flag indicating whether the name is the concatenation of surname, first name and last name, False by default.

  • column_name_check (str) – Column name in InsolverDataFrame for bool values if first names are in the list or not.

  • names_list (list) – The list of clients’ first names.

  • name_position (int) – The position of the name in the full name. For example, the argument should be 0 for notation such as 'John Doe', but 1 for notation like 'Ivanov Ivan'.
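The interplay of name_full and name_position can be sketched as follows, assuming the full name is split on whitespace. first_name_in_list is a hypothetical helper and not the library implementation.

```python
def first_name_in_list(name, names_list, name_full=False, name_position=1):
    """Hypothetical helper: check whether the first name is in names_list."""
    if name_full:
        parts = name.split()  # assumption: name parts are space-separated
        if name_position >= len(parts):
            return False
        name = parts[name_position]
    return name in names_list


print(first_name_in_list("Ivanov Ivan", ["Ivan"], name_full=True, name_position=1))
print(first_name_in_list("John Doe", ["John"], name_full=True, name_position=0))
```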

Insurance data methods

class insolver.transforms.insurance.TransformAgeExpDiff(column_driver_minage, column_driver_minexp, diff_min=18, priority=2)[source]

Transforms records where the difference between drivers’ minimum age and minimum experience is less than ‘diff_min’ years: sets drivers’ minimum experience equal to drivers’ minimum age minus ‘diff_min’ years.

Parameters:
  • column_driver_minage (str) – Column name in InsolverDataFrame containing drivers’ minimum ages in years, column type is integer.

  • column_driver_minexp (str) – Column name in InsolverDataFrame containing drivers’ minimum experiences in years, column type is integer.

  • diff_min (int) – Minimum allowed difference between age and experience in years.
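For a single record, the correction described above amounts to the following. fix_experience is a hypothetical helper used only to illustrate the rule.

```python
def fix_experience(age, exp, diff_min=18):
    """Hypothetical helper: cap experience so that age - exp >= diff_min."""
    if age - exp < diff_min:
        return age - diff_min
    return exp


print(fix_experience(30, 15))  # 12, because 30 - 15 < 18
print(fix_experience(30, 10))  # 10, already consistent
```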

class insolver.transforms.insurance.TransformCarFleetSize(column_id, column_date_start, column_fleet_size, priority=3)[source]

Calculates fleet sizes for policyholders.

Parameters:
  • column_id (str) – Column name in InsolverDataFrame containing policyholders’ IDs.

  • column_date_start (str) – Column name in InsolverDataFrame containing policies’ start dates, column type is date.

  • column_fleet_size (str) – Column name in InsolverDataFrame for fleet sizes, column type is int.

class insolver.transforms.insurance.TransformExp(column_driver_minexp, exp_max=52, priority=1)[source]

Transforms values of drivers’ minimum experiences in years with values over ‘exp_max’ grouped.

Parameters:
  • column_driver_minexp (str) – Column name in InsolverDataFrame containing drivers’ minimum experiences in years, column type is integer.

  • exp_max (int) – Maximum value of drivers’ experience in years, bigger values will be grouped, 52 by default.

class insolver.transforms.insurance.TransformRegionGetFromKladr(column_kladr, column_region_num, priority=0)[source]

Gets regions’ numbers from KLADRs.

Parameters:
  • column_kladr (str) – Column name in InsolverDataFrame containing KLADRs, column type is string.

  • column_region_num (str) – Column name in InsolverDataFrame for regions’ numbers, column type is integer.
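A KLADR code begins with a two-digit region code, so the region number can be read from the first two characters. The sketch below assumes codes are stored as strings, which preserves leading zeros; region_from_kladr is a hypothetical helper, and the library's handling of malformed codes is not documented here.

```python
def region_from_kladr(kladr):
    """Hypothetical helper: region number from a KLADR code string."""
    return int(kladr[:2])


print(region_from_kladr("7700000000000"))  # 77
print(region_from_kladr("0200000100000"))  # 2
```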

class insolver.transforms.insurance.TransformVehAge(column_veh_age, veh_age_max=25, priority=1)[source]

Transforms values of vehicles’ ages in years. Values over ‘veh_age_max’ will be grouped.

Parameters:
  • column_veh_age (str) – Column name in InsolverDataFrame containing vehicles’ ages in years, column type is integer.

  • veh_age_max (int) – Maximum value of vehicles’ age in years, bigger values will be grouped, 25 by default.

class insolver.transforms.insurance.TransformVehAgeGetFromIssueYear(column_veh_issue_year, column_date_start, column_veh_age, priority=0)[source]

Gets vehicles’ ages in years from issue years and policies’ start dates.

Parameters:
  • column_veh_issue_year (str) – Column name in InsolverDataFrame containing vehicles’ issue years, column type is integer.

  • column_date_start (str) – Column name in InsolverDataFrame containing policies’ start dates, column type is date.

  • column_veh_age (str) – Column name in InsolverDataFrame for vehicles’ ages in years, column type is integer.

class insolver.transforms.insurance.TransformVehPower(column_veh_power, power_min=10, power_max=500, power_step=10, priority=1)[source]

Transforms values of vehicles’ powers. Values under ‘power_min’ and over ‘power_max’ will be grouped. Values between ‘power_min’ and ‘power_max’ will be grouped with step ‘power_step’.

Parameters:
  • column_veh_power (str) – Column name in InsolverDataFrame containing vehicles’ powers, column type is float.

  • power_min (float) – Minimum value of vehicles’ power, lower values will be grouped, 10 by default.

  • power_max (float) – Maximum value of vehicles’ power, bigger values will be grouped, 500 by default.

  • power_step (int) – Values of vehicles’ power will be divided by this parameter, rounded to integers, 10 by default.
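One possible reading of the grouping rule is: clip the value to [power_min, power_max], divide by power_step and round to an integer. The sketch below follows that reading; group_power is a hypothetical helper and the library's exact rounding may differ.

```python
def group_power(power, power_min=10, power_max=500, power_step=10):
    """Hypothetical helper: clip, then bucket by power_step."""
    clipped = max(power_min, min(power, power_max))
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(clipped / power_step)


print([group_power(p) for p in (5, 123, 148, 750)])  # [1, 12, 15, 50]
```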

Grouping and sorting data methods

class insolver.transforms.grouping_sorting.TransformParamSortAC(column_param, column_param_sort_ac, column_claims_count, column_claims_sum, inference=False, param_ac_dict=None, priority=2)[source]

Gets parameter’s values sorted by claims’ average sum.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter.

  • column_param_sort_ac (str) – Column name in InsolverDataFrame for sorted values of parameter, column type is integer.

  • column_claims_count (str) – Column name in InsolverDataFrame containing numbers of claims, column type is integer or float.

  • column_claims_sum (str) – Column name in InsolverDataFrame containing sums of claims, column type is integer or float.

  • inference (bool) – Flag indicating whether the transformation is used for inference, False by default.

  • param_ac_dict (dict) – The dictionary of sorted values of the parameter, for inference only.

class insolver.transforms.grouping_sorting.TransformParamSortFreq(column_param, column_param_sort_freq, column_policies_count, column_claims_count, inference=False, param_freq_dict=None, priority=2)[source]

Gets parameter’s values sorted by claims’ frequency.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter.

  • column_param_sort_freq (str) – Column name in InsolverDataFrame for sorted values of parameter, column type is integer.

  • column_policies_count (str) – Column name in InsolverDataFrame containing numbers of policies, column type is integer or float.

  • column_claims_count (str) – Column name in InsolverDataFrame containing numbers of claims, column type is integer or float.

  • inference (bool) – Flag indicating whether the transformation is used for inference, False by default.

  • param_freq_dict (dict) – The dictionary of sorted values of the parameter, for inference only.
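The idea behind sorting by frequency can be sketched as: aggregate claims and policies per parameter value, compute claims per policy, and map each value to its rank. The sketch assumes ascending order and zero-based integer ranks, which the source does not specify. sort_by_frequency is a hypothetical helper; sorting by average claim would be analogous, using sums of claims divided by claim counts.

```python
from collections import defaultdict


def sort_by_frequency(rows):
    """Hypothetical helper.

    rows: iterable of (param_value, policies_count, claims_count).
    Returns a dict mapping each value to its rank by claim frequency.
    """
    policies = defaultdict(float)
    claims = defaultdict(float)
    for value, n_policies, n_claims in rows:
        policies[value] += n_policies
        claims[value] += n_claims
    freq = {v: claims[v] / policies[v] for v in policies}
    ordered = sorted(freq, key=freq.get)  # lowest frequency first
    return {value: rank for rank, value in enumerate(ordered)}


rows = [("A", 100, 5), ("B", 50, 10), ("A", 100, 15), ("C", 200, 2)]
param_freq_dict = sort_by_frequency(rows)
print(param_freq_dict)  # {'C': 0, 'A': 1, 'B': 2}
```

A dictionary of this kind is what one would pass back at inference time, so that the same ordering is applied to new data.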

class insolver.transforms.grouping_sorting.TransformParamUselessGroup(column_param, size_min=1000, group_name=0, inference=False, param_useless=None, priority=1)[source]

Groups all parameter’s values with little data into one group.

Parameters:
  • column_param (str) – Column name in InsolverDataFrame containing parameter.

  • size_min (int) – Minimum allowed number of records for each parameter value, 1000 by default.

  • group_name – Name of the group for parameter’s values with few data.

  • inference (bool) – Flag indicating whether the transformation is used for inference, False by default.

  • param_useless (list) – The list of useless values of the parameter, for inference only.

static _param_useless_get(df, column_param, size_min)[source]

Checks the amount of data for each parameter’s value.

Parameters:
  • df – InsolverDataFrame to explore.

  • column_param (str) – Column name in InsolverDataFrame containing parameter.

  • size_min (int) – Minimum allowed number of records for each parameter’s value, 1000 by default.

Returns:

List of parameter’s values with few data.

Return type:

list
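The two steps, finding values with few records and replacing them with a single group, can be sketched as follows. The sketch assumes that a value counts as "few" when it has strictly fewer than size_min records. rare_values and group_rare are hypothetical helpers, not the library code.

```python
from collections import Counter


def rare_values(values, size_min=1000):
    """Hypothetical helper: values that occur fewer than size_min times."""
    counts = Counter(values)
    return [v for v, n in counts.items() if n < size_min]


def group_rare(values, param_useless, group_name=0):
    """Hypothetical helper: replace rare values with one group label."""
    useless = set(param_useless)
    return [group_name if v in useless else v for v in values]


values = ["a"] * 5 + ["b"] * 2 + ["c"]
useless = rare_values(values, size_min=3)
grouped = group_rare(values, useless, group_name="other")
print(useless)  # ['b', 'c']
print(grouped)
```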

Missing values imputation methods

class insolver.transforms.autofillna.AutoFillNATransforms(numerical_columns=None, categorical_columns=None, numerical_method='median', categorical_method='frequent', numerical_constants=None, categorical_constants=None, priority=0)[source]

Auto Fill NA values.

Parameters:
  • numerical_columns (list) – List of numerical columns

  • categorical_columns (list) – List of categorical columns

  • numerical_method (str) – Fill numerical NA values using this specified method: ‘median’ (by default), ‘mean’, ‘mode’ or ‘remove’

  • categorical_method (str) – Fill categorical NA values using this specified method: ‘frequent’ (by default), ‘new_category’, ‘imputed_column’ or ‘remove’

  • numerical_constants (dict) – Dictionary of constants for each numerical column

  • categorical_constants (dict) – Dictionary of constants for each categorical column

_fillna_numerical(df)[source]

Replace numerical NaN values using specified method

_fillnan_categorical(df)[source]

Replace categorical NaN values using specified method
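The default strategies, median for numerical columns and most frequent value for categorical columns, can be sketched with the standard library. Missing values are represented by None here, which is an assumption for the sake of a self-contained example; fill_numerical and fill_categorical are hypothetical helpers.

```python
from collections import Counter
from statistics import median


def fill_numerical(values):
    """Hypothetical helper: replace None with the median of observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def fill_categorical(values):
    """Hypothetical helper: replace None with the most frequent value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


print(fill_numerical([1.0, None, 3.0, 10.0]))   # [1.0, 3.0, 3.0, 10.0]
print(fill_categorical(["a", None, "b", "a"]))  # ['a', 'a', 'b', 'a']
```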

Date and Datetime methods

class insolver.transforms.date_time.DatetimeTransforms(column_names, column_types=None, dayfirst=False, yearfirst=False, feature='unix', column_feature=None, priority=0)[source]

Get selected feature from date variable.

Parameters:
  • column_names (list) – List of columns to convert, columns in column_names can’t be duplicated in column_feature.

  • column_types (dict) – Dictionary of columns and types to return.

  • dayfirst (bool) – Parameter from pandas.to_datetime(), specify a date parse order if arg is str or its list-likes.

  • yearfirst (bool) – Parameter from pandas.to_datetime(), specify a date parse order if arg is str or its list-likes.

  • feature (str) – Type of feature to get from date variable: unix (by default), date, time, month, quarter, year, day, day_of_the_week, weekend.

  • column_feature (dict) – Dictionary of columns and the feature to extract from each of them; columns in column_feature can’t be duplicated in column_names.
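Several of the listed features can be sketched with the standard datetime module. The sketch assumes that naive datetimes are interpreted as UTC for the unix feature and that day_of_the_week counts from Monday = 0; the library's conventions may differ. date_feature is a hypothetical helper.

```python
from datetime import datetime, timezone


def date_feature(value, feature="unix"):
    """Hypothetical helper: extract one feature from a datetime."""
    extractors = {
        # assumption: naive datetimes are treated as UTC
        "unix": lambda d: d.replace(tzinfo=timezone.utc).timestamp(),
        "year": lambda d: d.year,
        "quarter": lambda d: (d.month - 1) // 3 + 1,
        "month": lambda d: d.month,
        "day": lambda d: d.day,
        "day_of_the_week": lambda d: d.weekday(),  # Monday = 0
        "weekend": lambda d: int(d.weekday() >= 5),
    }
    return extractors[feature](value)


d = datetime(2021, 5, 15, 12, 30)  # a Saturday
print(date_feature(d, "quarter"), date_feature(d, "weekend"))  # 2 1
```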

Feature engineering

Data preprocessing

Feature selection

class insolver.feature_engineering.feature_selection.FeatureSelection(y_column, task, method='random_forest', permutation_importance=False)[source]

Feature selection. Supports the following tasks: classification, regression, multiclass classification and multiclass multioutput classification.

Note

The following specified methods can be used for each individual task:

  • for the classification problem Mutual information, F statistics, chi-squared test, Random Forest, Lasso or ElasticNet can be used;

  • for the regression problem Mutual information, F statistics, Random Forest, Lasso or ElasticNet can be used;

  • for the multiclass classification Random Forest, Lasso or ElasticNet can be used;

  • for the multiclass multioutput classification Random Forest can be used.

Random Forest is used by default.

Parameters:
  • y_column (str) – The name of the column to predict.

  • task (str) – A task for the model. Values reg, class, multiclass and multiclass_multioutput are supported.

  • method (str) – A technique to compute feature importance. Values random_forest (default), mutual_inf, chi2, f_statistic, lasso and elasticnet are supported.

  • permutation_importance (bool) – Whether to use permutation feature importance, False by default.

new_dataframe

New dataframe with the selected features only.

Type:

pandas.DataFrame

importances

A list of the importances created using selected method.

Type:

np.ndarray

model

A model for feature selection.

permutation_model

Permutation model for feature selection.

_init_importance_dict()[source]

Non-public method for creating an importance dictionary.

_init_methods_dict()[source]

Non-public method for creating a methods’ dictionary.

Raises:

NotImplementedError – If self.task is not supported.

create_model(df)[source]

A method to create a model for feature selection using specified method. Random Forest is used by default.

Parameters:

df (pandas.Dataframe) – The dataframe.

Raises:
  • ValueError – If there are null values in the dataframe.

  • ValueError – If there are object columns in the dataframe.

  • NotImplementedError – If self.method isn’t supported with the task.

create_new_dataset(threshold='mean')[source]

A method for creating new dataset. It uses threshold parameter to select features.

Note

This method can be called only after method ‘create_model’ has been called. This method uses the absolute numeric value of the importances during comparison with the threshold value.

Parameters:

threshold – The threshold value to use. It can be ‘mean’(default), ‘median’ or numeric.

Raises:

Exception – Model was not created.
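The threshold logic can be sketched as follows: take absolute importances, derive a cutoff from ‘mean’, ‘median’ or a number, and keep the features above it. Whether the library uses a strict or non-strict comparison is not stated; the sketch assumes a strict one. select_features is a hypothetical helper.

```python
from statistics import mean, median


def select_features(importances, threshold="mean"):
    """Hypothetical helper.

    importances: dict mapping feature name to (possibly negative) importance.
    """
    magnitudes = {name: abs(value) for name, value in importances.items()}
    if threshold == "mean":
        cutoff = mean(magnitudes.values())
    elif threshold == "median":
        cutoff = median(magnitudes.values())
    else:
        cutoff = float(threshold)
    # assumption: strictly greater than the cutoff
    return [name for name, m in magnitudes.items() if m > cutoff]


importances = {"age": 0.5, "power": -0.3, "region": 0.05, "exp": 0.15}
print(select_features(importances))        # ['age', 'power']
print(select_features(importances, 0.1))   # ['age', 'power', 'exp']
```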

create_permutation_importance(**kwargs)[source]

A method for creating permutation importance for the features. This method is called automatically if the ‘permutation_importance’ parameter was set to True. Feature importances will be set to importances_mean from the permutation_importance model.

Note

This method can be called only after method ‘create_model’ has been called.

Raises:
  • Exception – Model was not created, self.x or self.importances was not initialized.

  • Exception – Permutation importance was used with the method that doesn’t implement class sklearn.base.BaseEstimator.

plot_importance(figsize=(5, 5), importance_threshold=None)[source]

A method for plotting feature importance using created model.

Note

This method can be called only after method ‘create_model’ has been called.

Parameters:
  • figsize (list) – Figsize of the plot.

  • importance_threshold (float) – The threshold of importance by which the features will be plotted.

Raises:

Exception – Model was not created, self.x or self.importances was not initialized.

Dimensionality reduction

class insolver.feature_engineering.dimensionality_reduction.DimensionalityReduction(method='pca')[source]

Dimensionality Reduction. This class can be used for dimensionality reduction and plotting of the result.

Parameters:

method (str) – Dimensionality reduction method supports: pca, svd, lda, t_sne, isomap, lle, fa, nmf.

method

Dimensionality reduction method.

Type:

str

estimator

Created model.

X_transformed

Transformed X.

Type:

pandas.DataFrame

methods_dict

Methods dictionary.

Type:

dict

_init_methods()[source]

Methods dictionary initialization.

plot_transformed(y, figsize=(10, 10), **kwargs)[source]

Plot transformed X values using y as hue. If n_components < 2 it will use seaborn.scatterplot to plot values. Otherwise it will use seaborn.pairplot to create plots.

Note

This method can be called only after method ‘transform’ has been called.

Parameters:
  • y (pandas.Series, pandas.DataFrame) – y-value.

  • figsize (list) – Figure size, (10, 10) by default.

  • **kwargs – Arguments for the plot function.

Raises:
  • TypeError – If y is not pandas.DataFrame or pandas.Series.

  • Exception – If method is called before transform() method.

transform(X, y=None, **kwargs)[source]

Main dimensionality reduction method. It creates an estimator and calls fit_transform() on the given values.

Parameters:
  • X – X-value.

  • y – y-value.

  • **kwargs – Arguments for the estimator.

Returns:

New X dataframe with transformed values.

Return type:

X_transformed (pandas.DataFrame)

Raises:

NotImplementedError – If method is not supported.

Sampling

class insolver.feature_engineering.sampling.Sampling(n, cluster_column=None, n_clusters=10, method='simple')[source]

Sampling class. It includes several different techniques: simple sampling, systematic sampling, cluster sampling, stratified sampling.

Parameters:
  • n (int) – Its meaning depends on the chosen sampling method: for simple sampling, n is the number of values to keep; for systematic sampling, n is the step size; for cluster sampling, n is the number of clusters to keep; for stratified sampling, n is the number of values to keep in each cluster.

  • n_clusters (int) – Number of clusters for the cluster and stratified sampling.

  • cluster_column (str) – Column name of the data frame used as clusters.

  • method (str) – Sampling method, supported methods: simple, systematic, cluster, stratified.
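The different meanings of n can be sketched on plain lists. The three functions below are hypothetical helpers that illustrate simple, systematic and stratified sampling; they are not the library implementation, and the seed argument is an addition made for reproducibility of the example.

```python
import random


def simple_sampling(rows, n, seed=0):
    """Keep n rows chosen at random without replacement."""
    return random.Random(seed).sample(rows, n)


def systematic_sampling(rows, n):
    """Keep every n-th row, starting from the first one."""
    return rows[::n]


def stratified_sampling(rows, clusters, n, seed=0):
    """Keep up to n random rows from each cluster."""
    rng = random.Random(seed)
    by_cluster = {}
    for row, cluster in zip(rows, clusters):
        by_cluster.setdefault(cluster, []).append(row)
    sample = []
    for members in by_cluster.values():
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample


rows = list(range(10))
clusters = ["x"] * 7 + ["y"] * 3
print(systematic_sampling(rows, 3))  # [0, 3, 6, 9]
print(len(simple_sampling(rows, 4)), len(stratified_sampling(rows, clusters, 2)))
```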

_cluster_sampling(df)[source]

Cluster sampling.

Parameters:

df (pandas.Dataframe) – The dataframe.

Returns:

New dataset with selected rows.

_create_clusters(df)[source]

Creating dataframe with clusters. If self.cluster_column is defined, the clusters column is created using the dataframe column. Otherwise the clusters are formed according to the existing order.

Parameters:

df (pandas.Dataframe) – The dataframe.

Raises:

ValueError – Values in the column must be not null.

Returns:

New dataset with cluster column.

_simple_sampling(df)[source]

Simple sampling.

Parameters:

df (pandas.Dataframe) – The dataframe.

Returns:

New dataset with selected rows.

_stratified_sampling(df)[source]

Stratified sampling.

Parameters:

df (pandas.Dataframe) – The dataframe.

Returns:

New dataset with selected rows.

_systematic_sampling(df)[source]

Systematic sampling.

Parameters:

df (pandas.Dataframe) – The dataframe.

Returns:

New dataset with selected rows.

sample_dataset(df)[source]

A method for performing sampling with the dataset using selected method.

Parameters:

df (pandas.Dataframe) – The dataframe.

Raises:

NotImplementedError – If self.method is not supported.

Returns:

New dataset with selected rows.

insolver.feature_engineering.sampling.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

Added in version 1.7.0.

Note

New code should use the choice method of a numpy.random.Generator instance instead; please see NumPy’s random quick start.

Parameters:
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice – which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')

Smoothing

Normalization

Discretization

class insolver.discretization.discretizer.InsolverDiscretizer(method='uniform')[source]

Transforms a continuous variable into discrete form.

Parameters:

method (str) – The method used to discretize. Should be in {‘uniform’, ‘quantile’, ‘kmeans’, ‘cart’, ‘chimerge’}.

transform(X, y=None, n_bins=None, min_samples_leaf=None)[source]

Apply discretization to given data.

Parameters:
  • X – 1-D array, the data to be discretized.

  • y – 1-D array, the target values, ignored for unsupervised transformations.

  • n_bins (int, str) – The number of bins; either an integer or a value in {‘square-root’, ‘sturges’, ‘rice-rule’, ‘scotts-rule’, ‘freedman-diaconis’}.

  • min_samples_leaf (int, float) – The minimum number of samples required to be at a leaf node. Used for the 'cart' method only, ignored otherwise.

Returns:

1-D array, The transformed data.
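Equal-width ('uniform') binning and two of the named bin-count rules can be sketched in plain Python. The formulas below are the textbook forms of the square-root and Sturges rules; whether Insolver rounds them in the same way is an assumption. uniform_bins, sqrt_rule and sturges_rule are hypothetical helpers.

```python
import math


def uniform_bins(values, n_bins):
    """Hypothetical helper: equal-width bin index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # The maximum falls on the last edge, so clip it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def sqrt_rule(n_samples):
    """Square-root rule: ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n_samples))


def sturges_rule(n_samples):
    """Sturges' rule: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n_samples)) + 1


X = [85, 90, 78, 96, 80, 70, 65, 95]
print(uniform_bins(X, n_bins=3))  # [1, 2, 1, 2, 1, 0, 0, 2]
print(sqrt_rule(len(X)), sturges_rule(len(X)))  # 3 4
```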

Examples:

Unsupervised discretization

>>> import numpy as np
>>> from insolver.discretization import InsolverDiscretizer
>>> X = np.array([85, 90, 78, 96, 80, 70, 65, 95])
>>> insolverDisc = InsolverDiscretizer(method='uniform')
>>> insolverDisc.transform(X, n_bins=3)
array([1., 2., 1., 2., 1., 0., 0., 2.])

Supervised discretization

>>> import numpy as np
>>> from insolver.discretization import InsolverDiscretizer
>>> X = np.array([85, 90, 78, 96, 80, 70, 65, 95])
>>> y = np.array([1, 0, 1, 0, 0, 1, 1, 1])
>>> insolverDisc = InsolverDiscretizer(method='chimerge')
>>> insolverDisc.transform(X, y, n_bins=3)
array([1, 1, 0, 2, 1, 0, 0, 1], dtype=int64)

class insolver.discretization.discretizer_utils.CARTDiscretizer[source]
static _transform(X, y, min_samples_leaf=None, min_tree_depth=1, max_tree_depth=3)[source]

Apply CART discretization.

Parameters:
  • X – 1-D array, the data to be discretized.

  • y – 1-D array, the target values.

  • min_samples_leaf (int) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. If None, then min_samples_leaf is implicitly set to 0.1.

Returns:

1-D array, The transformed data.
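The int, float and None cases of min_samples_leaf described above resolve to an absolute sample count as in this sketch. resolve_min_samples_leaf is a hypothetical helper that only restates the documented rule; it is not taken from the library source.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Hypothetical helper: turn the parameter into an absolute count."""
    if min_samples_leaf is None:
        min_samples_leaf = 0.1  # documented implicit default
    if isinstance(min_samples_leaf, float):
        # A float is a fraction of the number of samples.
        return math.ceil(min_samples_leaf * n_samples)
    return min_samples_leaf


print(resolve_min_samples_leaf(5, 40))     # 5
print(resolve_min_samples_leaf(0.25, 40))  # 10
print(resolve_min_samples_leaf(None, 10))  # 1
```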

References

[1] Liu, Huan, et al. “Discretization: An enabling technique.” Data mining and knowledge discovery 6.4 (2002): 393-423.

class insolver.discretization.discretizer_utils.ChiMergeDiscretizer[source]
_transform(X, y, n_bins)[source]

Apply ChiMerge discretization

Parameters:
  • X – 1-D array, the data to be discretized.

  • y – 1-D array, The target values.

  • n_bins (int) – The number of bins.

Returns:

1-D array, The transformed data.

References

[1] Kerber, Randy. “Chimerge: Discretization of numeric attributes.” Proceedings of the tenth national conference on Artificial intelligence. 1992. Available from: https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf

class insolver.discretization.discretizer_utils.SklearnDiscretizer[source]
static _transform(X, n_bins, method)[source]

Apply discretizations from scikit-learn.

Parameters:
  • X – 1-D array, The data to be discretized.

  • n_bins (int) – The number of bins.

  • method (string) – The method used by scikit-learn’s KBinsDiscretizer. Either ‘uniform’, ‘quantile’ or ‘kmeans’.

Returns:

1-D array, The transformed data.
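
For intuition, the ‘uniform’ strategy of KBinsDiscretizer uses bins of equal width. The sketch below reproduces that idea in plain Python. It is a simplified illustration under that assumption; the function name uniform_bins is hypothetical, and scikit-learn's handling of edge cases may differ.

```python
def uniform_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins, numbered from 0."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum lands exactly on the last edge, so clip it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


X = [85, 90, 78, 96, 80, 70, 65, 95]
print(uniform_bins(X, 3))  # [1, 2, 1, 2, 1, 0, 0, 2]
```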

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

Interpretation

DiCE

LIME

Plots

Model Wrappers

Base Wrapper

class insolver.wrappers.base.InsolverBaseWrapper(backend)[source]

Base wrapper serving as a building block for other wrappers.

Parameters:

backend (str) – Name of the backend to build the model.

load_model(load_path)[source]

Loading a model to the wrapper.

Parameters:

load_path (str) – Path to the model that will be loaded to wrapper.

save_model(path=None, name=None, suffix=None, **kwargs)[source]

Saving the model contained in wrapper.

Parameters:
  • path (str, optional) – Path to save the model. Defaults to the current working directory.

  • name (str, optional) – Name of the model.

  • suffix (str, optional) – Suffix in the name of the model.

  • **kwargs – Other parameters passed to, e.g. h2o.save_model().

Trivial Wrapper

class insolver.wrappers.trivial.InsolverTrivialWrapper(task=None, col_name=None, agg=None, thresh=0.5, **kwargs)[source]

Dummy wrapper for returning trivial “predictions” for metric comparison and statistics.

Parameters:
  • col_name (str, list, optional) – String or list of strings containing column name(s) to perform groupby operation.

  • agg (callable, optional) – Aggregation function.

  • thresh (float, optional) – Threshold for continuous prediction in dummy classification.

  • **kwargs – Other arguments.

fit(X, y)[source]

Fitting dummy model.

Parameters:
  • X (pd.DataFrame) – Data.

  • y (pd.Series) – Target values.

load_model(load_path)

Loading a model to the wrapper.

Parameters:

load_path (str) – Path to the model that will be loaded to wrapper.

predict(X)[source]

Making dummy predictions.

Parameters:

X (pd.DataFrame, pd.Series) – Data.

Returns:

Trivial model “prediction”.

Return type:

array
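
The idea of a trivial baseline can be sketched as follows: aggregate the target per group and return that aggregate as the "prediction". This is a plain-Python illustration only, not the wrapper's code. The function names are hypothetical, and the fallback to the overall aggregate for unseen groups is an assumption.

```python
from collections import defaultdict
from statistics import mean


def fit_trivial(keys, y, agg=mean):
    """Aggregate target values per group key and keep an overall fallback."""
    groups = defaultdict(list)
    for key, target in zip(keys, y):
        groups[key].append(target)
    per_group = {key: agg(values) for key, values in groups.items()}
    return per_group, agg(list(y))


def predict_trivial(model, keys):
    """Look up the group aggregate; unseen keys get the overall aggregate (assumed)."""
    per_group, overall = model
    return [per_group.get(key, overall) for key in keys]


model = fit_trivial(["a", "a", "b"], [1.0, 3.0, 10.0])
print(predict_trivial(model, ["a", "b", "c"]))
```

Such a baseline is useful as a floor when comparing metrics: a real model should beat it.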

save_model(path=None, name=None, suffix=None, **kwargs)

Saving the model contained in wrapper.

Parameters:
  • path (str, optional) – Path to save the model. Defaults to the current working directory.

  • name (str, optional) – Name of the model.

  • suffix (str, optional) – Suffix in the name of the model.

  • **kwargs – Other parameters passed to, e.g. h2o.save_model().

Generalized Linear Model Wrapper

class insolver.wrappers.glm.InsolverGLMWrapper(backend, family=None, link=None, standardize=True, h2o_init_params=None, load_path=None, **kwargs)[source]

Insolver wrapper for Generalized Linear Models.

Parameters:
  • backend (str) – Framework for building GLM, currently ‘h2o’ and ‘sklearn’ are supported.

  • family (str, float, int, optional) – Distribution for GLM. Supports any family from h2o as str. For sklearn supported str families are [‘gaussian’, ‘normal’, ‘poisson’, ‘gamma’, ‘inverse_gaussian’], also may be defined as int or float as a power for Tweedie GLM. By default, Gaussian GLM is fitted.

  • link (str, optional) – Link function for GLM. If None, sets to default value for both h2o and sklearn.

  • standardize (bool, optional) – Whether to standardize data before fitting the model. Enabled by default.

  • h2o_init_params (dict, optional) – Parameters passed to h2o.init(), when backend == ‘h2o’.

  • load_path (str, optional) – Path to GLM model to load from disk.

  • **kwargs – Parameters for GLM estimators (for H2OGeneralizedLinearEstimator or TweedieRegressor) except family (power for TweedieRegressor) and link.
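
For the sklearn backend, the named families correspond to well-known Tweedie powers (0 for normal, 1 for Poisson, 2 for gamma, 3 for inverse Gaussian), which is how a string family and a numeric power can describe the same model. The sketch below shows one way such a translation might look. The helper family_to_power and its lookup table are hypothetical and are not taken from insolver.

```python
# Standard Tweedie powers for the named distributions.
TWEEDIE_POWER = {
    "gaussian": 0,
    "normal": 0,
    "poisson": 1,
    "gamma": 2,
    "inverse_gaussian": 3,
}


def family_to_power(family=None):
    """Translate a family given as str, int or float into a Tweedie power."""
    if family is None:
        # Documented default: a Gaussian GLM.
        return 0
    if isinstance(family, (int, float)):
        # A number is interpreted directly as the Tweedie power.
        return family
    return TWEEDIE_POWER[family]


print(family_to_power("poisson"))
print(family_to_power(1.5))
print(family_to_power())
```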

_hyperopt_obj_cv(params, X, y, scoring, cv=None, agg=None, maximize=False, **kwargs)

Default hyperopt objective performing K-fold cross-validation.

Parameters:
  • params (dict) – Dictionary of hyperopt parameters.

  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • scoring (callable) – Metrics passed to cross_val_score calculation.

  • cv (int, iterable, cross-validation generator, optional) – Cross-validation strategy from sklearn. Performs 5-fold cv by default.

  • agg (callable, optional) – Function computing the final score out of test cv scores.

  • maximize (bool, optional) – Indicator whether to maximize or minimize objective. Minimizing by default.

  • **kwargs – Other parameters passed to sklearn.model_selection.cross_val_score().

Returns:

{‘status’: STATUS_OK, ‘loss’: cv_score}

Return type:

dict
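
A cross-validated objective of this shape can be sketched in plain Python as below. This is an illustration, not the wrapper's implementation: the names objective_cv, kfold_indices and fit_predict are hypothetical, the folds are contiguous and unshuffled, and negating the score when maximize=True is an assumption based on hyperopt always minimizing the loss.

```python
from statistics import mean

STATUS_OK = "ok"  # the string value hyperopt exposes as hyperopt.STATUS_OK


def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def objective_cv(params, X, y, fit_predict, scoring, n_splits=5, agg=mean, maximize=False):
    """Score params on each fold and aggregate into a hyperopt-style result."""
    scores = []
    for train, test in kfold_indices(len(y), n_splits):
        preds = fit_predict(
            params,
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in test],
        )
        scores.append(scoring([y[i] for i in test], preds))
    score = agg(scores)
    # hyperopt minimizes the loss, so a score to maximize is negated (assumption).
    return {"status": STATUS_OK, "loss": -score if maximize else score}


def constant_model(params, X_train, y_train, X_test):
    """Toy model: always predicts the constant params['c']."""
    return [params["c"] for _ in X_test]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


X = list(range(10))
y = [2.0] * 10
print(objective_cv({"c": 3.0}, X, y, constant_model, mae))
```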

coef()[source]

Output GLM coefficients for non-standardized data. These are also computed when the GLM was fitted on standardized data.

Returns:

{str: float} Dictionary containing GLM coefficients for non-standardized data.

Return type:

dict

coef_norm()[source]

Output GLM coefficients for standardized data.

Returns:

{str: float} Dictionary containing GLM coefficients for standardized data.

Return type:

dict
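
The relation between coefficients on standardized and non-standardized data can be sketched as follows, assuming each feature was standardized as (x - mean) / std. The helper unstandardize and the key name "Intercept" are hypothetical; how each backend performs the conversion is not stated here.

```python
def unstandardize(coef_norm, means, stds, intercept_key="Intercept"):
    """Convert coefficients for standardized features to the original scale."""
    # Each slope is divided by the standard deviation of its feature.
    coefs = {
        name: value / stds[name]
        for name, value in coef_norm.items()
        if name != intercept_key
    }
    # The intercept absorbs the shift introduced by subtracting the means.
    shift = sum(coef_norm[name] * means[name] / stds[name] for name in coefs)
    coefs[intercept_key] = coef_norm[intercept_key] - shift
    return coefs


coef_norm = {"Intercept": 1.0, "age": 2.0}
coefs = unstandardize(coef_norm, means={"age": 40.0}, stds={"age": 10.0})
print(coefs)

# Both parameterizations give the same linear predictor for age = 50.
standardized = coef_norm["Intercept"] + coef_norm["age"] * (50.0 - 40.0) / 10.0
original = coefs["Intercept"] + coefs["age"] * 50.0
print(standardized, original)
```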

coef_to_csv(path_or_buf=None, **kwargs)[source]

Write GLM coefficients to a comma-separated values (csv) file.

Parameters:
  • path_or_buf (str or file handle, default None) – File path or object. If None is provided, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.

  • **kwargs – Other parameters passed to Pandas DataFrame.to_csv method.

Returns:

None or str

If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
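
The "return a string when no target is given" convention can be illustrated with the standard csv module. This is only a sketch of the convention: the documented method passes its arguments to the pandas DataFrame.to_csv method, and the function name coefs_to_csv and the column names used here are hypothetical.

```python
import csv
import io


def coefs_to_csv(coefs, path_or_buf=None):
    """Write {name: value} rows as CSV; return the text when no target is given."""

    def write(handle):
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["feature", "coefficient"])
        writer.writerows(coefs.items())

    if path_or_buf is None:
        buffer = io.StringIO()
        write(buffer)
        return buffer.getvalue()
    if isinstance(path_or_buf, str):
        # newline="" stops the csv module's line endings from being translated.
        with open(path_or_buf, "w", newline="") as handle:
            write(handle)
    else:
        write(path_or_buf)
    return None


print(coefs_to_csv({"Intercept": -7.0, "age": 0.2}))
```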

fit(X, y, sample_weight=None, X_valid=None, y_valid=None, sample_weight_valid=None, report=None, **kwargs)[source]

Fit a Generalized Linear Model.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • sample_weight (pd.DataFrame, pd.Series, optional) – Training sample weights.

  • X_valid (pd.DataFrame, pd.Series, optional) – Validation data (only h2o supported).

  • y_valid (pd.DataFrame, pd.Series, optional) – Validation target values (only h2o supported).

  • sample_weight_valid (pd.DataFrame, pd.Series, optional) – Validation sample weights.

  • report (list, tuple, optional) – A list of metrics to report after model fitting.

  • **kwargs – Other parameters passed to H2OGeneralizedLinearEstimator.

hyperopt_cv(X, y, params, fn=None, algo=None, max_evals=10, timeout=None, fmin_params=None, fn_params=None, p_last=True)

Hyperparameter optimization using hyperopt. Using cross-validation to evaluate hyperparameters by default.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • params (dict) – Dictionary of hyperparameters passed to hyperopt.

  • fn (callable, optional) – Objective function to optimize with hyperopt.

  • algo (callable, optional) – Algorithm for hyperopt. Available choices are: hyperopt.tpe.suggest and hyperopt.random.suggest. Using hyperopt.tpe.suggest by default.

  • max_evals (int, optional) – Number of function evaluations before returning.

  • timeout (None, int, optional) – Limits search time by parametrized number of seconds. If None, then the search process has no time constraint. None by default.

  • fmin_params (dict, optional) – Dictionary of supplementary arguments for hyperopt.fmin function.

  • fn_params (dict, optional) – Dictionary of supplementary arguments for custom fn objective function.

  • p_last (bool, optional) – If model object is a sklearn.Pipeline then apply fit parameters to the last step. True by default.

Returns:

Dictionary of best choice of hyperparameters. Also best model is fitted.

Return type:

dict

load_model(load_path)

Loading a model to the wrapper.

Parameters:

load_path (str) – Path to the model that will be loaded to wrapper.

optimize_hyperparam(hyper_params, X, y, sample_weight=None, X_valid=None, y_valid=None, sample_weight_valid=None, h2o_train_params=None, **kwargs)

Hyperparameter optimization & fitting model in H2O.

Parameters:
  • hyper_params

  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • sample_weight (pd.DataFrame, pd.Series, optional) – Training sample weights.

  • X_valid (pd.DataFrame, pd.Series, optional) – Validation data (only h2o supported).

  • y_valid (pd.DataFrame, pd.Series, optional) – Validation target values (only h2o supported).

  • sample_weight_valid (pd.DataFrame, pd.Series, optional) – Validation sample weights.

  • h2o_train_params (dict, optional) – Parameters passed to H2OGridSearch.train().

  • **kwargs – Other parameters passed to H2OGridSearch.

Returns:

{hyperparameter_name: optimal_choice}, Dictionary containing optimal hyperparameter choice.

Return type:

dict

predict(X, sample_weight=None, **kwargs)[source]

Predict using GLM with feature matrix X.

Parameters:
  • X (pd.DataFrame, pd.Series) – Samples.

  • sample_weight (pd.DataFrame, pd.Series, optional) – Test sample weights.

  • **kwargs – Other parameters passed to H2OGeneralizedLinearEstimator.predict().

Returns:

Returns predicted values.

Return type:

array

save_model(path=None, name=None, suffix=None, **kwargs)

Saving the model contained in wrapper.

Parameters:
  • path (str, optional) – Path to save the model. Defaults to the current working directory.

  • name (str, optional) – Name of the model.

  • suffix (str, optional) – Suffix in the name of the model.

  • **kwargs – Other parameters passed to, e.g. h2o.save_model().

Gradient Boosting Machine Wrapper

class insolver.wrappers.gbm.InsolverGBMWrapper(backend, task=None, objective=None, n_estimators=100, load_path=None, **kwargs)[source]

Insolver wrapper for Gradient Boosting Machines.

Parameters:
  • backend (str) – Framework for building GBM, ‘xgboost’, ‘lightgbm’ and ‘catboost’ are supported.

  • task (str) – Task that GBM should solve: Classification or Regression. Values ‘reg’ and ‘class’ are supported.

  • n_estimators (int, optional) – Number of boosting rounds. Equals 100 by default.

  • objective (str, callable) – Objective function for GBM to optimize.

  • load_path (str, optional) – Path to GBM model to load from disk.

  • **kwargs – Parameters for GBM estimators except n_estimators and objective. Will not be changed in hyperopt.

_hyperopt_obj_cv(params, X, y, scoring, cv=None, agg=None, maximize=False, **kwargs)

Default hyperopt objective performing K-fold cross-validation.

Parameters:
  • params (dict) – Dictionary of hyperopt parameters.

  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • scoring (callable) – Metrics passed to cross_val_score calculation.

  • cv (int, iterable, cross-validation generator, optional) – Cross-validation strategy from sklearn. Performs 5-fold cv by default.

  • agg (callable, optional) – Function computing the final score out of test cv scores.

  • maximize (bool, optional) – Indicator whether to maximize or minimize objective. Minimizing by default.

  • **kwargs – Other parameters passed to sklearn.model_selection.cross_val_score().

Returns:

{‘status’: STATUS_OK, ‘loss’: cv_score}

Return type:

dict

cross_val(X, y, scoring=None, cv=None, **kwargs)[source]

Method for performing cross-validation given the hyperparameters of initialized or fitted model.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • scoring (callable) – Metrics passed to sklearn.model_selection.cross_validate calculation.

  • cv (int, iterable, cross-validation generator, optional) – Cross-validation strategy from sklearn. Performs 5-fold cv by default.

  • **kwargs – Other parameters passed to sklearn.model_selection.cross_validate.

Returns:

DataFrame with metrics on folds, DataFrame with shap values on folds.

Return type:

pd.DataFrame, pd.DataFrame

fit(X, y, report=None, **kwargs)[source]

Fit a Gradient Boosting Machine.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • report (list, tuple, optional) – A list of metrics to report after model fitting.

  • **kwargs – Other parameters passed to Scikit-learn API .fit().

hyperopt_cv(X, y, params, fn=None, algo=None, max_evals=10, timeout=None, fmin_params=None, fn_params=None, p_last=True)

Hyperparameter optimization using hyperopt. Using cross-validation to evaluate hyperparameters by default.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • params (dict) – Dictionary of hyperparameters passed to hyperopt.

  • fn (callable, optional) – Objective function to optimize with hyperopt.

  • algo (callable, optional) – Algorithm for hyperopt. Available choices are: hyperopt.tpe.suggest and hyperopt.random.suggest. Using hyperopt.tpe.suggest by default.

  • max_evals (int, optional) – Number of function evaluations before returning.

  • timeout (None, int, optional) – Limits search time by parametrized number of seconds. If None, then the search process has no time constraint. None by default.

  • fmin_params (dict, optional) – Dictionary of supplementary arguments for hyperopt.fmin function.

  • fn_params (dict, optional) – Dictionary of supplementary arguments for custom fn objective function.

  • p_last (bool, optional) – If model object is a sklearn.Pipeline then apply fit parameters to the last step. True by default.

Returns:

Dictionary of best choice of hyperparameters. Also best model is fitted.

Return type:

dict

load_model(load_path)

Loading a model to the wrapper.

Parameters:

load_path (str) – Path to the model that will be loaded to wrapper.

predict(X, **kwargs)[source]

Predict using GBM with feature matrix X.

Parameters:
  • X (pd.DataFrame, pd.Series) – Samples.

  • **kwargs – Other parameters passed to Scikit-learn API .predict().

Returns:

Returns predicted values.

Return type:

array

save_model(path=None, name=None, suffix=None, **kwargs)

Saving the model contained in wrapper.

Parameters:
  • path (str, optional) – Path to save the model. Defaults to the current working directory.

  • name (str, optional) – Name of the model.

  • suffix (str, optional) – Suffix in the name of the model.

  • **kwargs – Other parameters passed to, e.g. h2o.save_model().

shap(X, show=False, plot_type='bar')[source]

Method for shap values calculation and corresponding plot of feature importances.

Parameters:
  • X (pd.DataFrame, pd.Series) – Data for shap values calculation.

  • show (boolean, optional) – Whether to plot a graph.

  • plot_type (str, optional) – Type of feature importance graph, takes value in [‘dot’, ‘bar’].

Returns:

JSON containing shap values.

shap_explain(data, index=None, link=None, show=True, layout_dict=None)[source]

Method for plotting a waterfall graph or return corresponding JSON if show=False.

Parameters:
  • data (pd.DataFrame, pd.Series) – Data for shap values calculation.

  • index (int, optional) – Index of the observation of interest, if data is pd.DataFrame.

  • link (callable, optional) – A function for transforming shap values into predictions. Unnecessary if self.objective is present and it takes values in [‘binary’, ‘poisson’, ‘gamma’].

  • show (boolean, optional) – Whether to plot a graph or return a json.

  • layout_dict (dict, optional) – Dictionary containing the parameters of plotly figure layout.

Returns:

Waterfall graph or corresponding JSON.

Return type:

None or dict
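
The role of link can be sketched as follows: SHAP values for tree models add up, together with a base value, to the raw model output, and a link function then maps that raw output to the prediction scale. The mapping below (sigmoid for ‘binary’, exp for ‘poisson’ and ‘gamma’) is an assumption based on the usual log-odds and log links; the names INVERSE_LINKS and prediction_from_shap are hypothetical.

```python
import math

# Assumed transformations from the raw score to the prediction scale.
INVERSE_LINKS = {
    "binary": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "poisson": math.exp,
    "gamma": math.exp,
}


def prediction_from_shap(base_value, shap_values, objective=None, link=None):
    """Sum the base value and SHAP values, then map the result to a prediction."""
    raw = base_value + sum(shap_values)
    if link is None:
        # Fall back to the identity when the objective has no known link.
        link = INVERSE_LINKS.get(objective, lambda z: z)
    return link(raw)


print(prediction_from_shap(0.0, [0.0], objective="binary"))
print(prediction_from_shap(0.0, [1.0, -1.0], objective="poisson"))
print(prediction_from_shap(1.0, [0.5]))
```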

Random Forest Wrapper

class insolver.wrappers.general.InsolverRFWrapper(backend, task=None, n_estimators=100, load_path=None, **kwargs)[source]

Insolver wrapper for Random Forest.

Parameters:
  • backend (str) – Framework for building RF, ‘sklearn’ is supported.

  • task (str) – Task that RF should solve: Classification or Regression. Values ‘reg’ and ‘class’ are supported.

  • n_estimators (int, optional) – Number of trees in the forest. Equals 100 by default.

  • load_path (str, optional) – Path to RF model to load from disk.

  • **kwargs – Parameters for RF estimators except n_estimators. Will not be changed in hyperopt.

_hyperopt_obj_cv(params, X, y, scoring, cv=None, agg=None, maximize=False, **kwargs)

Default hyperopt objective performing K-fold cross-validation.

Parameters:
  • params (dict) – Dictionary of hyperopt parameters.

  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • scoring (callable) – Metrics passed to cross_val_score calculation.

  • cv (int, iterable, cross-validation generator, optional) – Cross-validation strategy from sklearn. Performs 5-fold cv by default.

  • agg (callable, optional) – Function computing the final score out of test cv scores.

  • maximize (bool, optional) – Indicator whether to maximize or minimize objective. Minimizing by default.

  • **kwargs – Other parameters passed to sklearn.model_selection.cross_val_score().

Returns:

{‘status’: STATUS_OK, ‘loss’: cv_score}

Return type:

dict

cross_val(X, y, scoring=None, cv=None, **kwargs)[source]

Method for performing cross-validation given the hyperparameters of initialized or fitted model.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • scoring (callable) – Metrics passed to sklearn.model_selection.cross_validate calculation.

  • cv (int, iterable, cross-validation generator, optional) – Cross-validation strategy from sklearn. Performs 5-fold cv by default.

  • **kwargs – Other parameters passed to sklearn.model_selection.cross_validate.

Returns:

DataFrame with metrics on folds, DataFrame with shap values on folds.

Return type:

pd.DataFrame, pd.DataFrame

fit(X, y, report=None, **kwargs)[source]

Fit a Random Forest.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • report (list, tuple, optional) – A list of metrics to report after model fitting.

  • **kwargs – Other parameters passed to Scikit-learn API .fit().

hyperopt_cv(X, y, params, fn=None, algo=None, max_evals=10, timeout=None, fmin_params=None, fn_params=None, p_last=True)

Hyperparameter optimization using hyperopt. Using cross-validation to evaluate hyperparameters by default.

Parameters:
  • X (pd.DataFrame, pd.Series) – Training data.

  • y (pd.DataFrame, pd.Series) – Training target values.

  • params (dict) – Dictionary of hyperparameters passed to hyperopt.

  • fn (callable, optional) – Objective function to optimize with hyperopt.

  • algo (callable, optional) – Algorithm for hyperopt. Available choices are: hyperopt.tpe.suggest and hyperopt.random.suggest. Using hyperopt.tpe.suggest by default.

  • max_evals (int, optional) – Number of function evaluations before returning.

  • timeout (None, int, optional) – Limits search time by parametrized number of seconds. If None, then the search process has no time constraint. None by default.

  • fmin_params (dict, optional) – Dictionary of supplementary arguments for hyperopt.fmin function.

  • fn_params (dict, optional) – Dictionary of supplementary arguments for custom fn objective function.

  • p_last (bool, optional) – If model object is a sklearn.Pipeline then apply fit parameters to the last step. True by default.

Returns:

Dictionary of best choice of hyperparameters. Also best model is fitted.

Return type:

dict

load_model(load_path)

Loading a model to the wrapper.

Parameters:

load_path (str) – Path to the model that will be loaded to wrapper.

predict(X, **kwargs)[source]

Predict using RF with feature matrix X.

Parameters:
  • X (pd.DataFrame, pd.Series) – Samples.

  • **kwargs – Other parameters passed to Scikit-learn API .predict().

Returns:

Returns predicted values.

Return type:

array

save_model(path=None, name=None, suffix=None, **kwargs)

Saving the model contained in wrapper.

Parameters:
  • path (str, optional) – Path to save the model. Defaults to the current working directory.

  • name (str, optional) – Name of the model.

  • suffix (str, optional) – Suffix in the name of the model.

  • **kwargs – Other parameters passed to, e.g. h2o.save_model().

Model Tools

Model Comparison

class insolver.model_tools.model_comparison.ModelMetricsCompare(X, y, task=None, create_models=False, source=None, metrics=None, stats=None, h2o_init_params=None, predict_params=None, features=None, names=None)[source]

Class for model comparison. It computes statistics and metrics for the regression task and metrics for the classification task. Models to compare are taken from the source parameter; if source is None, the current working directory is used as the source. To create new models, set the create_models parameter to True. If source is provided and create_models is True, the new models are added to the source list.

Parameters:
  • X (pd.DataFrame, pd.Series) – Data for making predictions.

  • y (pd.DataFrame, pd.Series) – Actual target values for X.

  • task (str, None) – A task for models and metrics. Used when new models are created. Supports ‘reg’ and ‘class’.

  • create_models (bool) – If True, new models will be created and added to the comparison list.

  • source (str, list, tuple, None) – List or tuple of insolver wrappers or path to the folder with models. If None, taking current working directory as source.

  • metrics (list, tuple, callable, optional) – Metrics or list of metrics to compute.

  • stats (list, tuple, callable, optional) – Statistics or list of statistics to compute.

  • h2o_init_params (dict, optional) – Parameters passed to h2o.init(), when backend == ‘h2o’.

  • predict_params (list, optional) – List of dictionaries containing parameters passed to predict methods for each model.

  • features (list, optional) – List of lists containing features for predict method for each model.

  • names (list, optional) – List of model names.

_calc_metrics()[source]

Computes metrics and statistics for the models.

Raises:
  • TypeError – Statistics type are not supported.

  • TypeError – Metrics type are not supported.

Returns:

Returns None, but results available in self.stats, self.metrics.

_init_default_metrics()[source]

Initializes default metrics and adds them to the metrics list. If task is ‘class’, accuracy score and f1 score are added. If task is ‘reg’, mean absolute error and r2 score are added. If self.metrics is callable, it is converted to a list.
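
One reading of this behaviour is sketched below in plain Python: a single callable is wrapped into a list, and the task's default metrics are appended if they are not already present. The metric implementations and the helper resolve_metrics are hypothetical stand-ins, not the functions insolver uses, and f1_binary covers only the binary case.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


DEFAULTS = {"class": [accuracy, f1_binary], "reg": [mean_absolute_error, r2_score]}


def resolve_metrics(task, metrics=None):
    """Normalize metrics to a list and append the defaults for the task."""
    if metrics is None:
        chosen = []
    elif callable(metrics):
        chosen = [metrics]
    else:
        chosen = list(metrics)
    for default in DEFAULTS[task]:
        if default not in chosen:
            chosen.append(default)
    return chosen


print([m.__name__ for m in resolve_metrics("reg", mean_absolute_error)])
```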

_init_new_models()[source]

Initializes new models using the task parameter. If task is ‘class’, a Gradient Boosting model with the catboost backend and a Random Forest with the sklearn backend are created. If task is ‘reg’, a Gradient Boosting model with the catboost backend, a Random Forest with the sklearn backend and a Linear Model with the sklearn backend are created. This method uses train_test_split from sklearn.model_selection, fits the models on the train values and replaces self.X and self.y with the test values, so metrics are calculated on the test values.

_init_source_models()[source]

Initializes source models. If source is None, the current working directory is used as the source.

Raises:
  • Exception – Models with the insolver name format were not found in the current working directory.

  • TypeError – Source type is not supported.

compare()[source]

Compares models using initialized parameters. If self.create_models == True, new models will be created and added to the source list.

Raises:

Exception – task parameter must be initialized and be class or reg.

Model utils

insolver.model_tools.model_utils.train_test_column_split(x, y, df_column)[source]

Function for splitting dataset into train/test partitions w.r.t. a column (pd.Series).

Parameters:
  • x (pd.DataFrame) – DataFrame containing predictors.

  • y (pd.DataFrame) – DataFrame containing target variable.

  • df_column (pd.Series) – Series for train/test split, assuming it is contained in x.

Returns:

(x_train, x_test, y_train, y_test).

A tuple of partitions of the initial dataset.

Return type:

tuple

insolver.model_tools.model_utils.train_val_test_split(*arrays, val_size, test_size, random_state=0, shuffle=True, stratify=None)[source]

Function for splitting dataset into train/validation/test partitions.

Parameters:
  • *arrays (array_like) – Arrays to split into train/validation/test sets containing predictors.

  • val_size (float) – The proportion of the dataset to include in validation partition.

  • test_size (float) – The proportion of the dataset to include in test partition.

  • random_state (int, optional) – Random state, passed to train_test_split() from scikit-learn. (default=0).

  • shuffle (bool, optional) – Passed to train_test_split() from scikit-learn. (default=True).

  • stratify (array_like, optional) – Passed to train_test_split() from scikit-learn. (default=None).

Returns:

[x_train, x_valid, x_test, y_train, y_valid, y_test].

A list of partitions of the initial dataset.

Return type:

list
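
The shape of the result can be illustrated with a plain-Python sketch. The parameter descriptions indicate that the documented function relies on train_test_split() from scikit-learn; the version below does not, ignores stratify, and uses the hypothetical name split_three_way. It is meant only to show the proportions and the order of the returned partitions.

```python
import random


def split_three_way(*arrays, val_size, test_size, random_state=0, shuffle=True):
    """Split each array into train/validation/test parts with shared indices."""
    n_samples = len(arrays[0])
    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    n_test = round(n_samples * test_size)
    n_val = round(n_samples * val_size)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    # Output order per array: train, validation, test.
    result = []
    for array in arrays:
        for part in (train_idx, val_idx, test_idx):
            result.append([array[i] for i in part])
    return result


x = list(range(10))
y = [value * 2 for value in x]
x_train, x_valid, x_test, y_train, y_valid, y_test = split_three_way(
    x, y, val_size=0.2, test_size=0.2
)
print(len(x_train), len(x_valid), len(x_test))
```

Because the same indices are applied to every input array, predictors and targets stay aligned within each partition.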