Best Practices

Feature Engineering

 Normalize parameter to get the percentage

The following example shows how to use the normalize parameter of value_counts to get the size of each group as a percentage instead of a raw count.

  • print(y_test.value_counts(normalize=True)*100)

    Exited
    0    79.25
    1    20.75
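A minimal sketch of the same call on a toy Series, to show the difference between raw counts and percentages. The data and the name y_demo are made up for illustration.

```python
import pandas as pd

# Hypothetical target column: four 0s and one 1
y_demo = pd.Series([0, 0, 0, 0, 1], name="Exited")

counts = y_demo.value_counts()                      # absolute counts per class
shares = y_demo.value_counts(normalize=True) * 100  # share of each class in percent

print(counts)
print(shares.round(2))
```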

 Split to Train, Test and Validation

The following example shows how to

  • divide the data into temporary and test sets with a ratio of 80:20
  • divide the temporary set into train and validation sets with a ratio of 75:25

from sklearn.model_selection import train_test_split

# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

print(X_train.shape, X_val.shape, X_test.shape)
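Because 25% of the remaining 80% is 20% of the full data, the two-step split yields 60:20:20 for train, validation and test. The sketch below checks this on synthetic data and confirms that stratify keeps the class ratio the same in every part. All names and data here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with exactly 20% positives
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y_demo = pd.Series((np.arange(1000) % 5 == 0).astype(int))

# Step 1: hold out 20% as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
# Step 2: 25% of the remaining 80% becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)

# Report the size and the positive rate of each part
for part_name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(part_name, len(part), round(part.mean(), 3))
```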


Data Encoding

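from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Classifier"] = le.fit_transform(df["Classifier"])

# Look up the codes assigned to specific labels
le.transform(["yes", "no"])

=> LabelEncoder assigns the integer codes in sorted (alphabetical) order of the labels, so with only these two labels no = 0 and yes = 1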
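To see which code was assigned to which label, the fitted encoder exposes classes_, and inverse_transform maps codes back to labels. A small sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["yes", "no", "no", "yes"])

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Map integer codes back to the original labels
decoded = le_demo.inverse_transform([0, 1])
print(decoded)
```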


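# Discretize a variable into quantile-based buckets
# (with these cut points: the bottom 25%, the next 25% and the top 50% of values)
df["city"] = pd.qcut(
    # the first argument (the numeric column to bin) is missing in the source
    q=[0, 0.25, 0.5, 1],
    labels=["Under_Developed", "Developing", "Developed"],
)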
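A minimal sketch of pd.qcut with the same cut points and labels on a made-up numeric Series (the name scores is hypothetical). It shows that the buckets follow the quantiles in q, so here the top bucket holds half of the values.

```python
import pandas as pd

# Hypothetical numeric column
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="score")

bands = pd.qcut(
    scores,
    q=[0, 0.25, 0.5, 1],  # bucket edges at the 0th, 25th, 50th and 100th percentiles
    labels=["Under_Developed", "Developing", "Developed"],
)

# Count how many values fall into each bucket, in label order
band_counts = bands.value_counts(sort=False)
print(band_counts)
```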


Missing-Value Treatment – KNN Imputer

One way to treat missing values is to impute them with the KNN imputer.

·       Each sample's missing values are imputed using the n_neighbors nearest neighbors found in the training set. The default is n_neighbors=5.

·       The KNN imputer replaces a missing value with the average of that feature's values among the k nearest neighbors that have the feature present.

·       Nearest neighbors are found using a Euclidean distance that skips missing coordinates (the nan_euclidean metric).

The imputed values are not always integers, which is a problem when imputing label-encoded categorical columns.

  • To take care of that, we can round the imputed values to the nearest integer.

from sklearn.impute import KNNImputer

# Define the imputer
imputer = KNNImputer(n_neighbors=5)

# Define list of columns to impute
reqd_col_for_impute = [
    # the column names are missing in the source
]

# Convert/encode any non-numeric columns to numerical values before executing imputer
X[reqd_col_for_impute] = imputer.fit_transform(X[reqd_col_for_impute])
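The rounding step mentioned above can be sketched as follows. The data and the column names age and is_member are hypothetical, and n_neighbors=3 is chosen only so that the average comes out as a non-integer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: is_member is a label-encoded category with one missing value
X_demo = pd.DataFrame({
    "age": [25, 27, 52, 50, 26],
    "is_member": [0, 0, 1, 1, np.nan],
})

imputer_demo = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer_demo.fit_transform(X_demo), columns=X_demo.columns)

# The raw imputed value is the neighbors' average, which need not be a valid code
raw_value = imputed.loc[4, "is_member"]
print(raw_value)

# Round to the nearest integer so the column holds valid category codes again
imputed["is_member"] = imputed["is_member"].round().astype(int)
print(imputed)
```

Rounding an average is only a rough fix for a binary or ordinal code; for unordered categories with more than two levels, the rounded value may not correspond to a meaningful category.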

Missing-Value Treatment – Simple Imputer

This is another way to treat missing values: it replaces them with the mean, the median, the most frequent value (mode) or a constant.

# Import the library
from sklearn.impute import SimpleImputer

# Define imputer
# (fill_value is only used with strategy='constant', so it is omitted here)
imputer = SimpleImputer(strategy='median')

# The median strategy needs numeric columns; fit_transform returns a NumPy array
dataframe_imputed = imputer.fit_transform(dataframe)
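When a table mixes numeric and categorical columns, one option is to use a separate imputer per column type. A small sketch with made-up columns income and city:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
df_demo = pd.DataFrame({
    "income": [10.0, 20.0, np.nan, 1000.0],
    "city": pd.Series(["A", "B", "A", np.nan], dtype=object),
})

# The median is robust to the outlier in income; the mode works for text columns
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df_demo[["income"]] = num_imputer.fit_transform(df_demo[["income"]])
df_demo[["city"]] = cat_imputer.fit_transform(df_demo[["city"]])

print(df_demo)
```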

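Oversampling of Data

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Baseline: fit on the original (imbalanced) training data
lr1 = LogisticRegression(random_state=1)
lr1.fit(X_train, y_train)

# model_performance_classification_sklearn is a user-defined helper, not part of sklearn
model_performance_classification_sklearn(lr1, X_train, y_train)

# Synthetic Minority Over-sampling Technique; sampling_strategy=1 balances the classes 1:1
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

lr2 = LogisticRegression(random_state=1)
lr2.fit(X_train_over, y_train_over)

model_performance_classification_sklearn(lr2, X_train_over, y_train_over)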
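The helper model_performance_classification_sklearn is not defined in these notes. The following is only a sketch of what such a helper might look like, assuming it returns accuracy, recall, precision and F1 as a one-row DataFrame. The body and the toy data are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def model_performance_classification_sklearn(model, predictors, target):
    """Hypothetical helper: common classification metrics in a one-row DataFrame."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(target, pred)],
            "Recall": [recall_score(target, pred)],
            "Precision": [precision_score(target, pred)],
            "F1": [f1_score(target, pred)],
        }
    )


# Toy data, only to show the call
X_toy = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5]})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(random_state=1).fit(X_toy, y_toy)

perf = model_performance_classification_sklearn(toy_model, X_toy, y_toy)
print(perf)
```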


Confusion Matrix from Sklearn


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)

# Rows are the actual classes, columns are the predicted classes
confusion_matrix(y_train, rf.predict(X_train))
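For a binary problem, the four cells of the matrix can be unpacked with ravel() and used to compute metrics by hand. A small sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For labels 0 and 1, the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were correct
print(recall, precision)
```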


Cross-Validation Score

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("LR", LogisticRegression(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all models' CV scores
names = []  # Empty list to store the names of the models

# Loop through all models to get the mean cross-validated score
print("\n" "Cross-Validation Performance:" "\n")

for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))


Using Randomized Search

import numpy as np
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
# Note: scikit-learn 1.2 renamed "base_estimator" to "estimator"; use "estimator" on newer versions
param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model, param_distributions=param_grid, n_jobs=-1,
    n_iter=50, scoring=scorer, cv=5, random_state=1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
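best_score_ is the mean score over the CV folds of the training data, so it is worth checking the tuned model on the validation set as well. With the default refit=True, best_estimator_ is already refitted on the full training data. A small sketch on synthetic data, where the data, the reduced parameter ranges and the names are assumptions chosen to keep the run short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical imbalanced data standing in for the real train and validation sets
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

search = RandomizedSearchCV(
    estimator=AdaBoostClassifier(random_state=1),
    param_distributions={"n_estimators": [10, 20, 30], "learning_rate": [0.1, 0.5, 1.0]},
    n_iter=4,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X_tr, y_tr)

# best_estimator_ was refitted on all of X_tr with the best parameters found
best_model = search.best_estimator_
val_recall = recall_score(y_va, best_model.predict(X_va))

print(search.best_params_, search.best_score_, val_recall)
```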




Undersampling

from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier

rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

# Model trained on the undersampled data, checked on train and validation sets
model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train_un, y_train_un)

model_performance_classification_sklearn(model1, X_train_un, y_train_un)
model_performance_classification_sklearn(model1, X_val, y_val)

# Model trained on the oversampled data, for comparison
model2 = AdaBoostClassifier(random_state=1)
model2.fit(X_train_over, y_train_over)

model_performance_classification_sklearn(model2, X_train_over, y_train_over)
model_performance_classification_sklearn(model2, X_val, y_val)
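To make the idea behind random undersampling concrete, here is a pandas-only sketch that keeps as many rows of each class as the smallest class has. It illustrates the principle and is not imblearn's implementation. The data and names are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class 0, 2 rows of class 1
df_demo = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

# Size of the smallest class
n_min = df_demo["y"].value_counts().min()

# Randomly keep n_min rows from every class
balanced = df_demo.groupby("y", group_keys=False).sample(n=n_min, random_state=1)

balanced_counts = balanced["y"].value_counts().to_dict()
print(balanced_counts)
```

Undersampling discards majority-class rows, so it trades information for balance. That is why the notes compare it against the oversampled model on the same validation set.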


