ATgfe (Automated Transparent Genetic Feature Engineering)
What is ATgfe?
ATgfe stands for Automated Transparent Genetic Feature Engineering. ATgfe uses a genetic algorithm to engineer new features. The idea is to compose new interpretable features based on interactions between the existing features. The predictive power of the newly constructed features is measured using a predefined evaluation metric, which can be custom designed.
ATgfe applies the following techniques to generate candidate features:
 Simple feature interactions by using the basic operators (+, -, *, /).
(petalwidth * petallength)
 Scientific feature interactions by applying transformation operators (e.g. log, cosine, cube, etc. as well as custom operators which can be easily implemented using user defined functions).
squared(sepalwidth)*(log_10(sepalwidth)/squared(petalwidth))-cube(sepalwidth)
 Weighted feature interactions by adding weights to the simple and/or scientific feature interactions.
(0.09*exp(petallength)+0.7*sepallength/0.12*exp(petalwidth))+0.9*squared(sepalwidth)
 Complex feature interactions by applying groupBy on the categorical features.
(0.56*groupByYear0TakeMeanOfFeelslike*0.51*feelslike)+(0.45*temp)
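The four kinds of candidate features listed above can be reproduced by hand with pandas, which may help when reading the expressions ATgfe reports. The sketch below is only an illustration: the toy DataFrame, its values and the new column names are assumptions, not ATgfe output.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration only).
df = pd.DataFrame({
    "year": [0, 0, 1, 1],
    "temp": [10.0, 14.0, 20.0, 24.0],
    "feelslike": [8.0, 12.0, 21.0, 25.0],
})

# Simple interaction: basic arithmetic between two columns.
df["temp_times_feelslike"] = df["temp"] * df["feelslike"]

# Scientific interaction: transformation operators applied to columns.
df["log10_temp_over_sq_feelslike"] = np.log10(df["temp"]) / df["feelslike"] ** 2

# Weighted interaction: the same idea with numeric weights.
df["weighted"] = 0.45 * df["temp"] + 0.51 * df["feelslike"]

# Grouped interaction: the mean of feelslike within each year,
# broadcast back to every row of that year.
df["groupByYearTakeMeanOfFeelslike"] = (
    df.groupby("year")["feelslike"].transform("mean")
)

print(df)
```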
Why ATgfe?
ATgfe allows you to deal with nonlinear problems by generating new interpretable features from existing features. The generated features can then be used with a linear model, which is inherently explainable. The idea is to explore potential predictive information that can be represented using interactions between existing features.
When compared with nonlinear models (e.g. gradient boosting machines, random forests, etc.), ATgfe can achieve comparable results and in some cases outperform them. This is demonstrated in the following examples: BMI, Rational difference and IRIS.
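To see why a single interpretable feature can make a linear model sufficient, consider the BMI example by hand. The sketch below uses synthetic data and a plain least-squares fit from NumPy. It does not call ATgfe; it simply adds the feature weight/height^2 manually, which is the kind of expression the genetic search is meant to discover. The data ranges and the helper name linear_fit_rmse are assumptions.

```python
import numpy as np

# Synthetic data (assumed ranges, for illustration only).
rng = np.random.default_rng(0)
weight = rng.uniform(50.0, 100.0, size=200)  # kg
height = rng.uniform(1.5, 2.0, size=200)     # m
bmi = weight / height ** 2

def linear_fit_rmse(features, target):
    """Fit ordinary least squares with an intercept and return the RMSE."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - target) ** 2)))

# Linear model on the original features only.
rmse_raw = linear_fit_rmse([weight, height], bmi)

# Linear model on the original features plus one engineered interaction.
rmse_engineered = linear_fit_rmse([weight, height, weight / height ** 2], bmi)

print(f"original features:    RMSE = {rmse_raw:.4f}")
print(f"with weight/height^2: RMSE = {rmse_engineered:.4f}")
```

With the engineered column present, the linear model can represent the target exactly, so its error drops to numerical noise while the model stays a readable weighted sum.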
Results
Generated
Expression  Linear Regression  LightGBM Regressor  Linear Regression + ATgfe 

BMI = weight/height^2 



Y = (X1 - X2) / (X3 - X4) 



Y = (Log10(X1) + Log10(X2)) / X5 



Y = 0.4X2^2 + 2X4 + 2 



Classification
Dataset  Logistic Regression  LightGBM Classifier  Logistic Regression + ATgfe 

IRIS (4 features) 



Regression
Dataset  Linear Regression  LightGBM Regressor  Linear Regression + ATgfe 

Concrete (8 features) 



Boston (13 features) 



Get started
Requirements
 Python ^3.6
 DEAP ^1.3
 Pandas ^0.25.2
 Scipy ^1.3
 Numpy ^1.17
 Sympy ^1.4
Install ATgfe
pip install atgfe
Upgrade ATgfe
pip install -U atgfe
Usage
Examples
The Examples are grouped under the following two sections:

Generated examples test ATgfe against handcrafted nonlinear problems where we know there is information that can be captured using feature interactions.

Toy Examples show how to use ATgfe in solving a mix of regression and classification problems from publicly available benchmark datasets.
Preprocessing for column names
ATgfe requires column names that are free of spaces and special characters (e.g. @, $, %, #).
# example
def prepare_column_names(columns):
    return [col.replace(' ', '').replace('(cm)', '_cm') for col in columns]
columns = prepare_column_names(df.columns.tolist())
df.columns = columns
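The example above targets one specific dataset. A more general variant could strip every character that is not a letter, digit or underscore. The helper below is a hypothetical sketch (the name sanitize_column_names is not part of ATgfe), and it assumes that underscores are acceptable in column names, as the '_cm' suffix above suggests.

```python
import re

def sanitize_column_names(columns):
    """Hypothetical helper: keep only letters, digits and underscores."""
    cleaned = []
    for col in columns:
        # Replace each run of disallowed characters with one underscore,
        # then drop underscores left at either end of the name.
        name = re.sub(r"\W+", "_", str(col), flags=re.ASCII).strip("_")
        cleaned.append(name)
    return cleaned

print(sanitize_column_names(["sepal length (cm)", "price $", "ratio%"]))
```

Two different raw names can collapse to the same cleaned name, so it is worth checking the result for duplicates before assigning it to df.columns.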
Configuring the parameters of GeneticFeatureEngineer
GeneticFeatureEngineer(
model,
x_train: pandas.core.frame.DataFrame,
y_train: pandas.core.frame.DataFrame,
numerical_features: List[str],
number_of_candidate_features: int,
number_of_interacting_features: int,
evaluation_metric: Callable[..., Any],
minimize_metric: bool = True,
categorical_features: List[str] = None,
enable_grouping: bool = False,
sampling_size: int = None,
cv: int = 10,
fit_wo_original_columns: bool = False,
enable_feature_transformation_operations: bool = False,
enable_weights: bool = False,
enable_bias: bool = False,
max_bias: float = 100.0,
weights_number_of_decimal_places: int = 2,
shuffle_training_data_every_generation: bool = False,
cross_validation_in_objective_func: bool = False,
objective_func_cv: int = 3,
n_jobs: int = 1,
verbose: bool = True
)
model
ATgfe works with any model or pipeline that follows the scikit-learn API (i.e. the model should implement the fit() and predict() methods).
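As an illustration of what "model or pipeline" can mean here, the sketch below builds a scikit-learn pipeline and checks that it exposes fit() and predict(). The choice of StandardScaler and LinearRegression and the toy data are assumptions made for the example; any estimator with the same two methods should qualify.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline exposes fit() and predict() just like a single estimator.
model = make_pipeline(StandardScaler(), LinearRegression())

# Toy data following y = 2x + 1 (made up for illustration).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

model.fit(X, y)
print(model.predict([[4.0]]))
```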
x_train
Training features in a pandas DataFrame.
y_train
Training labels in a pandas DataFrame, which also allows multiple-target problems to be handled.
numerical_features
The list of column names that represent the numerical features.
number_of_candidate_features
The maximum number of features to be generated.
number_of_interacting_features
The maximum number of existing features that can be used in constructing new features.
These features are selected from those passed in the numerical_features argument.
evaluation_metric
Any of the scikit-learn metrics or a custom-made evaluation metric to be used by the genetic algorithm to evaluate the predictive power of the newly generated features.
import numpy as np
from sklearn.metrics import mean_squared_error
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
minimize_metric
A boolean flag, which should be set to True if the evaluation metric is to be minimized, or to False if the evaluation metric is to be maximized.
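The direction of the metric decides which of two candidate scores counts as better. The helper below is a hypothetical sketch of that comparison and is not taken from ATgfe's source: error metrics such as RMSE pair with minimize_metric=True, while scores such as accuracy or R^2 pair with minimize_metric=False.

```python
def is_improvement(new_score, best_score, minimize_metric=True):
    """Return True when new_score beats best_score in the chosen direction."""
    if minimize_metric:
        return new_score < best_score
    return new_score > best_score

# Error metrics such as RMSE shrink as the model improves.
print(is_improvement(new_score=2.1, best_score=2.5, minimize_metric=True))

# Scores such as accuracy grow as the model improves.
print(is_improvement(new_score=0.93, best_score=0.90, minimize_metric=False))
```

Setting the flag the wrong way round would make the search favour worse features, so it is worth double-checking for custom metrics.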
categorical_features
The list of column names that represent the categorical features. The parameter enable_grouping should be set to True in order for the categorical_features to be utilized in grouping.
enable_grouping
A boolean flag, which should be set to True to construct complex feature interactions that use the pandas groupby operation.
sampling_size
The exact size of the sampled training dataset. Use this parameter to run the optimization using the specified number of observations in the training data. If the sampling_size is greater than the number of observations, then ATgfe will create a sample with replacement.
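The sampling behaviour described above can be imitated with pandas. This is a sketch of the idea rather than ATgfe's internal code: the helper name sample_training_data and the fixed random_state are assumptions.

```python
import pandas as pd

def sample_training_data(x_train, sampling_size, random_state=77):
    """Hypothetical helper: draw exactly sampling_size rows from x_train."""
    # Sampling with replacement is only needed when more rows are requested
    # than the DataFrame contains.
    replace = sampling_size > len(x_train)
    return x_train.sample(n=sampling_size, replace=replace, random_state=random_state)

x_train = pd.DataFrame({"a": range(5)})

smaller = sample_training_data(x_train, 3)  # subset, no repeated rows
larger = sample_training_data(x_train, 8)   # more rows than available

print(len(smaller), len(larger))
```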
cv
The number of folds for cross validation. In every generation of the genetic algorithm, ATgfe evaluates the current best solution using k-fold cross validation. The default number of folds is 10.
fit_wo_original_columns
A boolean flag, which should be set to True to fit the model without the original features specified in numerical_features. In this case, ATgfe will only use the newly generated features together with any remaining original features in x_train.
enable_feature_transformation_operations
A boolean flag, which should be set to True to enable scientific feature interactions on the numerical_features.
The predefined transformation operators are listed as follows:
np_log(), np_log_10(), np_exp(), squared(), cube()
You can easily remove from or add to the existing list of transformation operators. Check out the next section for examples.
enable_weights
A boolean flag, which should be set to True to enable weighted feature interactions.
weights_number_of_decimal_places
The number of decimal places (i.e. precision) to be applied to the weight values.
enable_bias
A boolean flag, which enables the genetic algorithm to add a bias to the expressions generated. For example:
0.43*log(cement) + 806.8557595548646
max_bias
The value of the bias will be between -max_bias and max_bias.
If max_bias is 100, then the bias value will be between -100 and 100.
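The symmetric range can be pictured with a few lines of standard-library Python. How ATgfe actually chooses the bias is not described here, so the uniform draw and the helper name draw_bias below are assumptions used only to show the bounds.

```python
import random

def draw_bias(max_bias=100.0, rng=None):
    """Hypothetical helper: pick a bias in [-max_bias, max_bias]."""
    rng = rng or random.Random()
    return rng.uniform(-max_bias, max_bias)

rng = random.Random(77)
biases = [draw_bias(100.0, rng) for _ in range(5)]
print(biases)
```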
shuffle_training_data_every_generation
A boolean flag. If enabled, the train_test_split method in the objective function uses the generation number as its random seed, so the split changes from one generation to the next. This can help reduce overfitting.
This option is only available if cross_validation_in_objective_func is set to False.
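Seeding the split with the generation number gives a different but reproducible split in each generation. The sketch below shows the idea with scikit-learn's train_test_split. The toy row list, the validation size of 3 and the helper name split_for_generation are assumptions, not values taken from ATgfe.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # stand-in for the training observations

def split_for_generation(generation):
    """Split the rows using the generation number as the random seed."""
    return train_test_split(rows, test_size=3, random_state=generation)

for generation in range(3):
    train_rows, validation_rows = split_for_generation(generation)
    print(generation, sorted(validation_rows))
```

Because the seed is the generation number, rerunning the same generation reproduces the same split, which keeps runs repeatable.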
cross_validation_in_objective_func
A boolean flag. If enabled, the train_test_split method will not be used in the objective function. Instead, the genetic algorithm will use cross validation to evaluate the generated features.
The default number of folds is 3. The number of folds can be modified using the objective_func_cv parameter.
objective_func_cv
The number of folds to be used when cross_validation_in_objective_func is enabled.
verbose
A boolean flag, which should be set to True to enable the logging functionality.
n_jobs
To enable parallel processing, set n_jobs to the number of CPUs that you would like to utilise. If n_jobs is set to -1, all the machine's CPUs will be utilised.
Configuring the parameters of fit()
gfe.fit(
number_of_generations: int = 100,
mu: int = 10,
lambda_: int = 100,
crossover_probability: float = 0.5,
mutation_probability: float = 0.2,
early_stopping_patience: int = 5,
random_state: int = 77
)
number_of_generations
The maximum number of generations to be explored by the genetic algorithm.
mu
The number of solutions to select for the next generation.
lambda_
The number of children to produce at each generation.
crossover_probability
The crossover probability.
mutation_probability
The mutation probability.
early_stopping_patience
The maximum number of generations to explore without an improvement in the validation score before the early stopping criterion is satisfied.
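The interplay between number_of_generations and early_stopping_patience can be sketched as a simple counter. The function below is a hypothetical illustration of patience-based early stopping in general and is not ATgfe's implementation: it assumes the counter resets on every strict improvement, and the scores are made up.

```python
def generations_run(scores, patience=5, minimize_metric=True):
    """Hypothetical helper: count generations evaluated before stopping early."""
    best = None
    stale = 0  # generations since the last improvement
    for generation, score in enumerate(scores, start=1):
        if best is None:
            improved = True
        elif minimize_metric:
            improved = score < best
        else:
            improved = score > best
        if improved:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= patience:
            return generation
    return len(scores)

# Validation scores per generation (lower is better here).
scores = [5.0, 4.0, 3.5, 3.6, 3.5, 3.7, 3.5, 3.9, 3.4]
print(generations_run(scores, patience=5))
print(generations_run(scores, patience=10))
```

A small patience saves time but can stop just before a late improvement, as the last score in this toy list shows.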
Configuring the parameters of transform()
X = gfe.transform(X)
Where X is the pandas DataFrame to which the generated features will be appended.
Transformation operations
Get current transformation operations
gfe.get_enabled_transformation_operations()
The enabled transformation operations will be returned.
['None', 'np_log', 'np_log_10', 'np_exp', 'squared', 'cube']
Remove existing transformation operations
gfe.remove_transformation_operation accepts a string or a list of strings.
gfe.remove_transformation_operation('squared')
gfe.remove_transformation_operation(['np_log_10', 'np_exp'])
Add new transformation operations
np_sqrt = np.sqrt
def some_func(x):
    return (x * 2)/3
gfe.add_transformation_operation('sqrt', np_sqrt)
gfe.add_transformation_operation('some_func', some_func)
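When writing a custom operator, it can help to choose a function that is defined for every value the column may contain, since a logarithm or square root returns NaN or infinity for negative or zero inputs. The operator below is a hypothetical example (log1p_abs is not part of ATgfe), and whether ATgfe guards its built-in operators against such inputs is not stated here.

```python
import numpy as np

def log1p_abs(x):
    """Hypothetical operator: log(1 + |x|), defined for every real input."""
    return np.log1p(np.abs(x))

values = np.array([-3.0, 0.0, 3.0])
print(log1p_abs(values))

# Registration would follow the same pattern as above, given an existing
# GeneticFeatureEngineer instance named gfe:
# gfe.add_transformation_operation('log1p_abs', log1p_abs)
```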