Predicting Life Expectancy¶
In this notebook we will tackle the well-known Life Expectancy dataset. The data can be found here on Kaggle. Our goal is to see how well we can predict Life expectancy for an unknown location based on health and economic factors. We will then try to identify which factors matter most for this task. Identifying them could, for example, inform future public policies in places that wish to improve Life expectancy.
A - Data exploration and Preprocessing¶
First, let's load the necessary libraries:
import numpy as np
np.random.seed(62)
import pandas as pd
import random as rd
rd.seed(62)
import matplotlib.pyplot as plt
import seaborn as sns
We'll now load the data:
df = pd.read_csv('Life_Expectancy_Data.csv')
datasets = {'raw':df}
features = list(df.columns)
print(f'Shape : {df.shape}')
print(f"\nFeatures : {features}")
print("\nHere's a small description of the target variable (Life Expectancy)")
TARGET = 'Life expectancy '
print(df[TARGET].describe())
Shape : (2938, 22)

Features : ['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling']

Here's a small description of the target variable (Life Expectancy)
count    2928.000000
mean       69.224932
std         9.523867
min        36.300000
25%        63.100000
50%        72.100000
75%        75.700000
max        89.000000
Name: Life expectancy , dtype: float64
We will not be using Country, because the goal is to predict, for an unknown country, the life expectancy of its inhabitants based on descriptors of health and economy in that country. The Year variable is also not useful: it is neither a health- nor an economy-related feature and is only present in the dataset because of the data collection protocol. There is, however, a correlation for a given country between Year and Life expectancy, as shown below, but given the task we set for this study it will not be considered.
country_mask = np.where(df['Country'] == 'Bangladesh', True, False)
sns.pointplot(data=df.iloc[country_mask,:], x='Year', y=TARGET)
plt.xticks(rotation=90)
plt.show()
features.remove('Country')
features.remove('Year')
df_2 = df[features]
datasets.update({'raw_no_Country_Year':df_2})
Let's visualize the data:
def plot_distribution(df, col):
plt.figure()
plt.hist(df[col])
plt.title(f'Distribution of {col}')
plt.xlabel(col)
plt.ylabel('Number of observations')
plt.show()
df = datasets['raw_no_Country_Year']
fig, axes = plt.subplots(5,4, figsize=(12,24), sharey=False)
for i, col in enumerate(list(df.columns)):
row = i // 4
col_index = i % 4
sns.histplot(df[col], kde=False, ax=axes[row, col_index])
axes[row, col_index].set_title(f'Distribution of {col}')
axes[row, col_index].set_xlabel(col)
if col_index == 0:
axes[row, col_index].set_ylabel('Count')
else:
axes[row, col_index].set_ylabel('')
plt.tight_layout()
plt.show()
def look_NAs(df, col):
NAs = np.where(pd.isna(df[col]), 1, 0).sum()
percentage = np.where(pd.isna(df[col]), 1, 0).mean() * 100
print(f'\nThere are {NAs} NAs in {col}.')
print(f'This represents roughly {percentage} % of missing values.')
if NAs == 0:
return False
return True
cols_with_NAs = []
cols_no_NAs = []
for col in features:
if look_NAs(df, col):
cols_with_NAs.append(col)
else:
cols_no_NAs.append(col)
There are 0 NAs in Status. This represents roughly 0.0 % of missing values.
There are 10 NAs in Life expectancy . This represents roughly 0.3403675970047651 % of missing values.
There are 10 NAs in Adult Mortality. This represents roughly 0.3403675970047651 % of missing values.
There are 0 NAs in infant deaths. This represents roughly 0.0 % of missing values.
There are 194 NAs in Alcohol. This represents roughly 6.603131381892443 % of missing values.
There are 0 NAs in percentage expenditure. This represents roughly 0.0 % of missing values.
There are 553 NAs in Hepatitis B. This represents roughly 18.82232811436351 % of missing values.
There are 0 NAs in Measles . This represents roughly 0.0 % of missing values.
There are 34 NAs in  BMI . This represents roughly 1.1572498298162015 % of missing values.
There are 0 NAs in under-five deaths . This represents roughly 0.0 % of missing values.
There are 19 NAs in Polio. This represents roughly 0.6466984343090538 % of missing values.
There are 226 NAs in Total expenditure. This represents roughly 7.6923076923076925 % of missing values.
There are 19 NAs in Diphtheria . This represents roughly 0.6466984343090538 % of missing values.
There are 0 NAs in HIV/AIDS. This represents roughly 0.0 % of missing values.
There are 448 NAs in GDP. This represents roughly 15.248468345813478 % of missing values.
There are 652 NAs in Population. This represents roughly 22.19196732471069 % of missing values.
There are 34 NAs in thinness 1-19 years. This represents roughly 1.1572498298162015 % of missing values.
There are 34 NAs in thinness 5-9 years. This represents roughly 1.1572498298162015 % of missing values.
There are 167 NAs in Income composition of resources. This represents roughly 5.684138869979578 % of missing values.
There are 163 NAs in Schooling. This represents roughly 5.547991831177672 % of missing values.
The following columns do not have any missing values: Status, infant deaths, percentage expenditure, Measles, under-five deaths and HIV/AIDS.
To handle missing values, we will consider three strategies: mean, median, or KNN imputation.
from sklearn.impute import SimpleImputer, KNNImputer
mean_imp = SimpleImputer(strategy="mean")
med_imp = SimpleImputer(strategy="median")
knn_imp = KNNImputer(weights="distance") # close neighbors have greater influence on imputation value than further neighbors.
mean_imp_features = pd.DataFrame(mean_imp.fit_transform(datasets['raw_no_Country_Year'][cols_with_NAs]), columns=cols_with_NAs)
mean_df = pd.concat([datasets['raw_no_Country_Year'][cols_no_NAs], mean_imp_features], axis=1)
print(f'\nThere are {np.where(pd.isna(mean_df), 1, 0).sum()} NAs in this dataframe')
datasets.update({'mean_imputed':mean_df})
med_imp_features = pd.DataFrame(med_imp.fit_transform(datasets['raw_no_Country_Year'][cols_with_NAs]), columns=cols_with_NAs)
med_df = pd.concat([datasets['raw_no_Country_Year'][cols_no_NAs], med_imp_features], axis=1)
print(f'\nThere are {np.where(pd.isna(med_df), 1, 0).sum()} NAs in this dataframe')
datasets.update({'median_imputed':med_df})
knn_imp_features = pd.DataFrame(knn_imp.fit_transform(datasets['raw_no_Country_Year'][cols_with_NAs]), columns=cols_with_NAs)
knn_df = pd.concat([datasets['raw_no_Country_Year'][cols_no_NAs], knn_imp_features], axis=1)
print(f'\nThere are {np.where(pd.isna(knn_df), 1, 0).sum()} NAs in this dataframe')
datasets.update({'knn_imputed':knn_df})
There are 0 NAs in this dataframe
There are 0 NAs in this dataframe
There are 0 NAs in this dataframe
Now that we have datasets with no missing values, let's normalize our data. We will simply make sure every feature is on the same scale using MinMaxScaler. Then we will one-hot encode the Status feature, the only remaining categorical feature.
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
scaler = MinMaxScaler()
enc = OneHotEncoder(sparse_output=False)
def preprocess_data(df, cat, target):
numerical = [col for col in df.columns if col not in cat and col != target]
scaled_df = pd.DataFrame(scaler.fit_transform(df[numerical]), columns=numerical)
one_hot_df = pd.DataFrame(enc.fit_transform(df[cat]), columns=list(np.unique(df[cat])))
return pd.concat([scaled_df, one_hot_df, df[target]], axis=1)
for imputed_df in ['mean_imputed', 'median_imputed', 'knn_imputed']:
df = datasets[imputed_df]
df = preprocess_data(df, ['Status'], TARGET)
datasets[imputed_df] = df
We have handled missing values, adjusted scales and transformed the categorical feature: we are now ready to start training ML models.
B - Models¶
Before training models, we need to define our training strategy. Since we have a reasonable amount of data, we will use 5-fold cross-validation to evaluate and compare models. This cross-validation will be done on a subset of our data.
More concretely, we will do a simple train/test split, using 80% of the data for the training set and 20% for the test set. Cross-validation will be performed on the training set.
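For reference, the same protocol could be set up with scikit-learn utilities. The snippet below is only a sketch of that equivalent setup; the notebook keeps its own make_sets helper so that the seed set at the top controls the split.
from sklearn.model_selection import train_test_split, KFold

# Equivalent split with scikit-learn (illustration only, not used below).
all_rows = np.arange(datasets['mean_imputed'].shape[0])
sk_train, sk_test = train_test_split(all_rows, test_size=0.2, random_state=62)
cv_splitter = KFold(n_splits=5, shuffle=True, random_state=62)  # could be passed as cv= to cross_val_score
print(len(sk_train), len(sk_test))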
def make_sets(df_size, test_size=0.2):
train_size = int(np.ceil(df_size * (1 - test_size)))
train_set = rd.sample([i for i in range(df_size)], train_size)
test_set = [i for i in range(df_size) if i not in train_set]
rd.shuffle(test_set)
return train_set, test_set
def get_data(df, row_set):
return df.iloc[row_set,:]
def get_train_test(df, train, test, target):
columns = list(df.columns)
columns.remove(target)
return get_data(df, train)[columns], get_data(df, train)[target], get_data(df, test)[columns], get_data(df, test)[target]
train_set, test_set = make_sets(datasets['mean_imputed'].shape[0], test_size=0.2)
print(f'\nThere are {len(train_set)} observations in the training set.')
print(f'\nThere are {len(test_set)} observations in the test set.')
There are 2351 observations in the training set.
There are 587 observations in the test set.
The baseline model for this task is LinearRegression. We will compare it with Ridge, Lasso, ElasticNet, a linear SVR, a Gaussian SVR, RandomForestRegressor and CatBoostRegressor. This will be done on the datasets produced by all three imputation techniques.
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
def cv_on_datasets(estimator, train_set, test_set, target, cv=5):
dico = {}
for imputed_df in ['mean_imputed', 'median_imputed', 'knn_imputed']:
X_train, Y_train, X_test, Y_test = get_train_test(datasets[imputed_df], train_set, test_set, target)
scores = cross_val_score(estimator, X_train, Y_train, cv=cv, scoring='neg_mean_absolute_percentage_error')
perf = -scores.mean() * 100 # To make it a percentage
dico.update({imputed_df:perf, 'std_'+imputed_df:np.abs(np.std(-np.array(scores)*100))})
return dico
baseline_model = LinearRegression(n_jobs=-1)
baseline_perfs = cv_on_datasets(baseline_model, train_set, test_set, TARGET)
print(baseline_perfs)
results = {'Model':['LinearRegression'],
'mean_imputed':[baseline_perfs['mean_imputed']],
'std_mean_imputed':[baseline_perfs['std_mean_imputed']],
'median_imputed':[baseline_perfs['median_imputed']],
'std_median_imputed':[baseline_perfs['std_median_imputed']],
'knn_imputed':[baseline_perfs['knn_imputed']],
'std_knn_imputed':[baseline_perfs['std_knn_imputed']]}
{'mean_imputed': 4.6810660441156235, 'std_mean_imputed': 0.15182584715280947, 'median_imputed': 4.668883489956363, 'std_median_imputed': 0.1574078133628543, 'knn_imputed': 4.4674953856839075, 'std_knn_imputed': 0.11952634376293078}
The baseline model already seems to perform really well, with only about a 4 % average error on predicted Life expectancy across all folds. The best imputation technique seems to be the knn_imputed one.
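As a reminder, the MAPE reported here is the mean of the per-observation relative errors, expressed as a percentage. Here is a tiny worked example with made-up values, just to make the metric concrete.
from sklearn.metrics import mean_absolute_percentage_error

# MAPE = mean(|y - yhat| / |y|) * 100, illustrated on three made-up observations.
y_true = np.array([70.0, 60.0, 80.0])
y_pred = np.array([68.0, 63.0, 80.0])
manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100
print(manual, mean_absolute_percentage_error(y_true, y_pred) * 100)  # both ~2.62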
Let's compare these baseline performances to other models.
to_compare = {'Ridge':Ridge(),
'Lasso':Lasso(),
'ElasticNet':ElasticNet(),
'SVR(linear)':SVR(kernel='linear'),
'SVR(gaussian)':SVR(kernel='rbf'),
'RandomForestRegressor':RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'CatBoostRegressor':CatBoostRegressor(n_estimators=100, random_state=42, verbose=0, thread_count=-1)}
def compare_models(models_to_compare, train_set, test_set, target, result_dict, cv=5):
for model_name in models_to_compare.keys():
print(model_name)
model = models_to_compare[model_name]
mod_perfs = cv_on_datasets(model, train_set, test_set, target, cv=cv)
result_dict['Model'].append(model_name)
result_dict['mean_imputed'].append(mod_perfs['mean_imputed'])
result_dict['std_mean_imputed'].append(mod_perfs['std_mean_imputed'])
result_dict['median_imputed'].append(mod_perfs['median_imputed'])
result_dict['std_median_imputed'].append(mod_perfs['std_median_imputed'])
result_dict['knn_imputed'].append(mod_perfs['knn_imputed'])
result_dict['std_knn_imputed'].append(mod_perfs['std_knn_imputed'])
df = pd.DataFrame(result_dict)
return df
def plot_performances(result_df):
fig, axes = plt.subplots(3,1, figsize=(12,12), sharex=True)
for i,method in enumerate(['mean_imputed', 'median_imputed', 'knn_imputed']):
barplot = sns.barplot(data=result_df, y='Model', x=method, hue='Model',
legend=False, palette='viridis', ax=axes[i], orient='h')
for container in barplot.containers:
barplot.bar_label(container, fmt='%.2f', label_type='center')
# Manually add error bars
for j, (x, err) in enumerate(zip(result_df[method], result_df['std_'+method])):
barplot.errorbar(x=x, y=j, xerr=err, fmt='none', ecolor='black', capsize=3, capthick=1)
axes[i].set_title(f'Model performances on {method} dataset (lower is better)')
axes[i].set_ylabel('Models')
axes[i].set_xlabel('MAPE')
plt.tight_layout()
plt.show()
results_df = compare_models(to_compare, train_set, test_set, TARGET, results, cv=5)
Ridge
Lasso
ElasticNet
SVR(linear)
SVR(gaussian)
RandomForestRegressor
CatBoostRegressor
plot_performances(results_df)
We can see that the best performance is obtained with the RandomForestRegressor on the mean_imputed dataset. Performance is already really good, with less than 2 % average error on the predicted values!
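To read the best configuration off results_df programmatically, here is a quick check (nothing new, it only summarizes the table built above).
# For each imputation strategy, report the model with the lowest cross-validated MAPE.
for method in ['mean_imputed', 'median_imputed', 'knn_imputed']:
    best_row = results_df.loc[results_df[method].idxmin()]
    print(f"{method}: {best_row['Model']} ({best_row[method]:.2f} % MAPE)")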
C - Optimization¶
Now that we have found the best algorithm for our regression task, we can try to optimize it. We will first look for the minimal subset of features that gives the best performance, then check whether engineered features can further improve the model's predictions. Finally, we will tune its hyperparameters.
For the first step, let's look at the correlations between each feature and Life expectancy. The goal is to build intuition about which features seem useful.
df = datasets['mean_imputed']
for i in range(7):
sns.pairplot(df, x_vars=df.columns[3*i:3*(i+1)],y_vars=[TARGET])
plt.show()
correlations = df.corr()[TARGET].drop(TARGET).sort_values()
plt.figure(figsize=(10, 6))
sns.barplot(x=correlations.values, y=correlations.index, hue=correlations.index, legend=False, palette='coolwarm')
plt.title(f'Correlation of each feature with {TARGET}')
plt.xlabel('Correlation coefficient')
plt.ylabel('Features')
plt.show()
From these plots, one could argue that under-five deaths, infant deaths, Measles, Population, Hepatitis B and Total expenditure should be removed from the features, since they are the least correlated with the target. However, apart from Population, whose correlation with Life expectancy is close to 0, all of these features could still be of value to the model. To truly identify which ones to remove, we will use a forward (ascending) feature selection algorithm.
def select_features(estimator, training_data, trainset, testset, target, cv=5, scoring='neg_mean_absolute_percentage_error', min_improvement=1e-4, warm_start=([], 1e5)):
X_train, Y_train, _, _ = get_train_test(training_data, trainset, testset, target)
columns = X_train.columns
perf_history = []
old_perf = +np.inf
perf = warm_start[1]
n_features = 1
kept = warm_start[0]
best_comb = list(kept)  # fallback: if no candidate combination improves, keep the warm-start combination
while perf <= old_perf - min_improvement:
print(f'\nBest performance so far : {perf}')
print(f'Best combination so far : {kept}')
# Create combinations of features
combinations = [kept + [col] for col in columns if col not in kept]
if combinations == []:
break
# For each combination we use cross-validation to evaluate the model
perf_list = []
for comb in combinations:
scores = cross_val_score(estimator, X_train[comb], Y_train, cv=cv, scoring=scoring, n_jobs=-1)
if 'neg' in scoring:
new_perf = -scores.mean()
if scoring == 'neg_mean_absolute_percentage_error':
new_perf = new_perf * 100
perf_list.append(new_perf)
perf_history.append((comb, new_perf))
# We take the best performance and update the performance threshold
old_perf = perf
perf = np.min(perf_list)
if perf < old_perf:
# We update the best combination of features accordingly
best_comb = combinations[np.argmin(perf_list)]
kept = best_comb
# We look for combinations of n+1 features
n_features += 1
return best_comb, perf_history
def plot_feature_selection(perf_history, best_comb=None, metric='Metric'):
if best_comb:
gradual = [best_comb[0:i+1] for i in range(len(best_comb) - 1)] + [best_comb]
dico = {'Features':[str(gradual[0])] + ['+ ' + str(comb[-1]) for comb in gradual[1:]],
metric: [t[1] for t in perf_history if t[0] in gradual]}
print(dico)
else:
dico = {'Features':[str(t[0]) for t in perf_history],
metric: [t[1] for t in perf_history]}
df = pd.DataFrame(dico)
plt.figure()
ax = sns.barplot(data=df, y='Features', x=metric, hue='Features', legend=False, palette='viridis', orient='h')
for bars in ax.containers:
ax.bar_label(bars)
ax.set_xlim((0,1.2*np.max(dico[metric])))
plt.show()
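As an optional cross-check (not used in the rest of the notebook), scikit-learn ships a SequentialFeatureSelector that implements the same forward-selection idea. The sketch below requests a fixed number of features up front, since it does not use the minimum-improvement stopping rule of the select_features helper above.
from sklearn.feature_selection import SequentialFeatureSelector

# Hedged sketch: forward selection with a fixed number of features to keep.
sfs_estimator = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
sfs = SequentialFeatureSelector(
    sfs_estimator,
    n_features_to_select=9,  # chosen up front for this sketch
    direction='forward',
    scoring='neg_mean_absolute_percentage_error',
    cv=5,
    n_jobs=-1,
)
X_train_sfs, Y_train_sfs, _, _ = get_train_test(datasets['mean_imputed'], train_set, test_set, TARGET)
sfs.fit(X_train_sfs, Y_train_sfs)
print(list(X_train_sfs.columns[sfs.get_support()]))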
best_model = to_compare['RandomForestRegressor']
best_comb, perf_history = select_features(best_model, datasets['mean_imputed'], train_set, test_set, TARGET)
Best performance so far : 100000.0
Best combination so far : []

Best performance so far : 4.185676116439454
Best combination so far : ['Adult Mortality']

Best performance so far : 2.4876195262944667
Best combination so far : ['Adult Mortality', 'Income composition of resources']

Best performance so far : 2.0011636822189005
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years']

Best performance so far : 1.8434063994594623
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ']

Best performance so far : 1.791672165150825
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS']

Best performance so far : 1.7726989244402087
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing']

Best performance so far : 1.7583468669741678
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling']

Best performance so far : 1.7541577287333292
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths']

Best performance so far : 1.7505564009136694
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed']
print(f'\nSelected Features : {best_comb}')
df = datasets['mean_imputed']
datasets.update({'best_features':df[best_comb+[TARGET]]})
plot_feature_selection(perf_history, best_comb, metric='MAPE')
Selected Features : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed']

{'Features': ["['Adult Mortality']", '+ Income composition of resources', '+ thinness 5-9 years', '+ under-five deaths ', '+ HIV/AIDS', '+ Developing', '+ Schooling', '+ infant deaths', '+ Developed'], 'MAPE': [4.185676116439454, 2.4876195262944667, 2.0011636822189005, 1.8434063994594623, 1.791672165150825, 1.7726989244402087, 1.7583468669741678, 1.7541577287333292, 1.7505564009136694]}
This subset of features gives better performance than using all of them. We can see that with Adult Mortality alone, the model is already capable of making quite good predictions of Life expectancy! A reassuring note is that this subset is composed of a mix of health and economy factors, just as we expected.
We will now see whether adding interaction variables can further improve performance.
def add_interactions(df, target):
columns = list(df.columns)
if target in columns:
columns.remove(target)
int_dico = {}
memory = []
for col1 in columns:
for col2 in columns:
if col1 != col2:
name = col1 + '_x_' + col2
alt_name = col2 + '_x_' + col1
if name not in memory and alt_name not in memory:
interaction = df[col1] * df[col2]
memory.append(name)
int_dico.update({name:interaction})
int_df = pd.DataFrame(int_dico)
return pd.concat([df[columns], int_df, df[target]], axis=1)
with_interactions = add_interactions(df, TARGET)
datasets.update({'best_features_w_interactions':with_interactions})
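For reference, a similar set of pairwise products could also be generated with scikit-learn's PolynomialFeatures. This is only a sketch of an alternative to the add_interactions helper above, not what the rest of the notebook uses.
from sklearn.preprocessing import PolynomialFeatures

# Sketch: pairwise interaction terms only (no bias column, no squared terms),
# built on the same mean-imputed feature matrix used above.
base_df = datasets['mean_imputed']
X_cols = [c for c in base_df.columns if c != TARGET]
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly_matrix = poly.fit_transform(base_df[X_cols])
poly_names = poly.get_feature_names_out(X_cols)
print(poly_matrix.shape, len(poly_names))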
We will start from our previously identified best combination.
best_comb_int, perf_history_int = select_features(best_model, datasets['best_features_w_interactions'],
train_set, test_set, TARGET,
warm_start=(best_comb, [t[1] for t in perf_history if t[0] == best_comb][0]))
Best performance so far : 1.7505564009136694
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed']

Best performance so far : 1.7313922504356534
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing']

Best performance so far : 1.7252295553490373
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling']

Best performance so far : 1.7180323809984268
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling', 'Schooling_x_Developed']

Best performance so far : 1.7163105652964157
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling', 'Schooling_x_Developed', ' HIV/AIDS_x_Schooling']

Best performance so far : 1.7107973701816115
Best combination so far : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling', 'Schooling_x_Developed', ' HIV/AIDS_x_Schooling', 'Adult Mortality_x_Developing']
print(f'\nSelected Features : {best_comb_int}')
datasets.update({'best_features_w_interactions':datasets['best_features_w_interactions'][best_comb_int+[TARGET]]})
Selected Features : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling', 'Schooling_x_Developed', ' HIV/AIDS_x_Schooling', 'Adult Mortality_x_Developing']
print(f'\nThe best combination is : {best_comb_int}')
print(f'New best performance : {[t[1] for t in perf_history_int if t[0] == best_comb_int][0]}')
print(f'Difference with previous best : {[t[1] for t in perf_history_int if t[0] == best_comb_int][0] - [t[1] for t in perf_history if t[0] == best_comb][0]}')
The best combination is : ['Adult Mortality', 'Income composition of resources', ' thinness 5-9 years', 'under-five deaths ', ' HIV/AIDS', 'Developing', 'Schooling', 'infant deaths', 'Developed', 'Alcohol_x_Developing', ' thinness 5-9 years_x_Schooling', 'Schooling_x_Developed', ' HIV/AIDS_x_Schooling', 'Adult Mortality_x_Developing']
New best performance : 1.7107973701816115
Difference with previous best : -0.03975903073205789
import pickle as pkl
with open('all_datasets.pkl', 'wb') as f:
pkl.dump(datasets, f)
dico = {'best':best_comb, 'best_history':perf_history, 'best_w_interactions':best_comb_int, 'best_w_interactions_history':perf_history_int}
with open('feature_selection.pkl', 'wb') as f:
pkl.dump(dico, f)
with open('all_datasets.pkl', 'rb') as f:
datasets = pkl.load(f)
with open('feature_selection.pkl', 'rb') as f:
feature_selection_results = pkl.load(f)
best_comb, perf_history = feature_selection_results['best'], feature_selection_results['best_history']
best_comb_int, perf_history_int = feature_selection_results['best_w_interactions'], feature_selection_results['best_w_interactions_history']
def cross_val(estimator, df, train, test, target, cv=5, scoring='neg_mean_absolute_percentage_error'):
X_train, Y_train, _, _ = get_train_test(df, train, test, target)  # use the function arguments, not the global train/test sets
scores = cross_val_score(estimator, X_train, Y_train, cv=cv, scoring=scoring, n_jobs=-1)
perf = -scores.mean() * 100
return perf
perf_with_best_interactions = cross_val(best_model, datasets['best_features_w_interactions'], train_set, test_set, TARGET)
print(perf_with_best_interactions)
print(len(best_comb_int))
1.7107973701816117
14
In the end we went from 21 features down to 14, and the average prediction error went from 1.89 % to 1.71 %.
We will now see whether hyperparameter tuning can further improve performance.
import optuna
def objective(trial):
n_estimators = trial.suggest_int("n_estimators", 50, 1000)
max_depth = trial.suggest_int("max_depth", 4, 20)
criterion = trial.suggest_categorical("criterion", ['squared_error', 'absolute_error', 'friedman_mse','poisson'])
min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 10)
max_features = trial.suggest_categorical("max_features", ["sqrt", "log2", None])
model = RandomForestRegressor(
n_estimators=n_estimators,
max_depth=max_depth,
criterion=criterion,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
max_features=max_features,
n_jobs=-1,
random_state=42
)
perf = cross_val(model, datasets['best_features_w_interactions'], train_set, test_set, TARGET)
print(perf)
return perf
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
[I 2024-11-01 12:53:25,046] A new study created in memory with name: no-name-b3121759-1c70-43bf-bff1-618348cf521c
[I 2024-11-01 12:53:39,865] Trial 0 finished with value: 1.8692153963421752 and parameters: {'n_estimators': 445, 'max_depth': 11, 'criterion': 'poisson', 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': None}. Best is trial 0 with value: 1.8692153963421752.
1.8692153963421752
[I 2024-11-01 12:53:44,310] Trial 1 finished with value: 2.1472938005501216 and parameters: {'n_estimators': 173, 'max_depth': 13, 'criterion': 'friedman_mse', 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': None}. Best is trial 0 with value: 1.8692153963421752.
2.1472938005501216
[I 2024-11-01 12:53:49,240] Trial 2 finished with value: 2.4273857598729514 and parameters: {'n_estimators': 471, 'max_depth': 7, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 'log2'}. Best is trial 0 with value: 1.8692153963421752.
2.4273857598729514
[I 2024-11-01 12:54:06,150] Trial 3 finished with value: 2.459105263400133 and parameters: {'n_estimators': 237, 'max_depth': 17, 'criterion': 'absolute_error', 'min_samples_split': 2, 'min_samples_leaf': 9, 'max_features': 'log2'}. Best is trial 0 with value: 1.8692153963421752.
2.459105263400133
[I 2024-11-01 12:54:17,963] Trial 4 finished with value: 2.058120803032954 and parameters: {'n_estimators': 868, 'max_depth': 20, 'criterion': 'squared_error', 'min_samples_split': 4, 'min_samples_leaf': 4, 'max_features': 'log2'}. Best is trial 0 with value: 1.8692153963421752.
2.058120803032954
[I 2024-11-01 12:54:22,024] Trial 5 finished with value: 2.431245361050971 and parameters: {'n_estimators': 372, 'max_depth': 15, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 9, 'max_features': 'log2'}. Best is trial 0 with value: 1.8692153963421752.
2.431245361050971
[I 2024-11-01 12:54:25,889] Trial 6 finished with value: 2.3868006769195462 and parameters: {'n_estimators': 263, 'max_depth': 18, 'criterion': 'squared_error', 'min_samples_split': 8, 'min_samples_leaf': 8, 'max_features': 'sqrt'}. Best is trial 0 with value: 1.8692153963421752.
2.3868006769195462
[I 2024-11-01 12:54:28,220] Trial 7 finished with value: 1.6943086056550947 and parameters: {'n_estimators': 155, 'max_depth': 18, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
1.6943086056550947
[I 2024-11-01 12:54:36,629] Trial 8 finished with value: 2.7247277573482145 and parameters: {'n_estimators': 700, 'max_depth': 6, 'criterion': 'squared_error', 'min_samples_split': 6, 'min_samples_leaf': 6, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
2.7247277573482145
[I 2024-11-01 12:54:40,806] Trial 9 finished with value: 2.5722439296323616 and parameters: {'n_estimators': 188, 'max_depth': 6, 'criterion': 'squared_error', 'min_samples_split': 4, 'min_samples_leaf': 8, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
2.5722439296323616
[I 2024-11-01 12:54:41,851] Trial 10 finished with value: 2.1876399343790416 and parameters: {'n_estimators': 51, 'max_depth': 10, 'criterion': 'poisson', 'min_samples_split': 7, 'min_samples_leaf': 4, 'max_features': 'sqrt'}. Best is trial 7 with value: 1.6943086056550947.
2.1876399343790416
[I 2024-11-01 12:55:00,874] Trial 11 finished with value: 1.8713569066632498 and parameters: {'n_estimators': 583, 'max_depth': 11, 'criterion': 'poisson', 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
1.8713569066632498
[I 2024-11-01 12:55:13,745] Trial 12 finished with value: 1.9469744251965995 and parameters: {'n_estimators': 412, 'max_depth': 14, 'criterion': 'poisson', 'min_samples_split': 10, 'min_samples_leaf': 3, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
1.9469744251965995
[I 2024-11-01 12:58:26,135] Trial 13 finished with value: 1.961259512125055 and parameters: {'n_estimators': 651, 'max_depth': 10, 'criterion': 'absolute_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
1.961259512125055
[I 2024-11-01 12:58:38,968] Trial 14 finished with value: 1.9444791830076358 and parameters: {'n_estimators': 985, 'max_depth': 17, 'criterion': 'friedman_mse', 'min_samples_split': 6, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 7 with value: 1.6943086056550947.
1.9444791830076358
[I 2024-11-01 12:58:44,243] Trial 15 finished with value: 2.371710278199946 and parameters: {'n_estimators': 347, 'max_depth': 8, 'criterion': 'poisson', 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
2.371710278199946
[I 2024-11-01 12:58:46,758] Trial 16 finished with value: 1.778663393902621 and parameters: {'n_estimators': 63, 'max_depth': 20, 'criterion': 'poisson', 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
1.778663393902621
[I 2024-11-01 12:58:47,877] Trial 17 finished with value: 1.7171834583123782 and parameters: {'n_estimators': 79, 'max_depth': 20, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
1.7171834583123782
[I 2024-11-01 12:58:49,752] Trial 18 finished with value: 3.39760393022678 and parameters: {'n_estimators': 139, 'max_depth': 4, 'criterion': 'friedman_mse', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
3.39760393022678
[I 2024-11-01 12:58:53,234] Trial 19 finished with value: 2.174140722799763 and parameters: {'n_estimators': 266, 'max_depth': 18, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 5, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
2.174140722799763
[I 2024-11-01 12:58:54,591] Trial 20 finished with value: 1.9935398614125317 and parameters: {'n_estimators': 86, 'max_depth': 15, 'criterion': 'friedman_mse', 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
1.9935398614125317
[I 2024-11-01 12:59:00,618] Trial 21 finished with value: 1.8969364092531942 and parameters: {'n_estimators': 67, 'max_depth': 20, 'criterion': 'absolute_error', 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'log2'}. Best is trial 7 with value: 1.6943086056550947.
1.8969364092531942
[I 2024-11-01 12:59:04,933] Trial 22 finished with value: 1.721949445524661 and parameters: {'n_estimators': 135, 'max_depth': 20, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 7 with value: 1.6943086056550947.
1.721949445524661
[I 2024-11-01 12:59:10,635] Trial 23 finished with value: 1.6912366928377367 and parameters: {'n_estimators': 307, 'max_depth': 18, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 23 with value: 1.6912366928377367.
1.6912366928377367
[I 2024-11-01 12:59:15,362] Trial 24 finished with value: 1.6908013636395873 and parameters: {'n_estimators': 313, 'max_depth': 18, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.6908013636395873
[I 2024-11-01 12:59:19,230] Trial 25 finished with value: 2.065122197550533 and parameters: {'n_estimators': 275, 'max_depth': 16, 'criterion': 'friedman_mse', 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
2.065122197550533
[I 2024-11-01 12:59:26,871] Trial 26 finished with value: 1.9483020065367702 and parameters: {'n_estimators': 513, 'max_depth': 18, 'criterion': 'friedman_mse', 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.9483020065367702
[I 2024-11-01 12:59:32,224] Trial 27 finished with value: 1.9154585486320632 and parameters: {'n_estimators': 365, 'max_depth': 16, 'criterion': 'friedman_mse', 'min_samples_split': 7, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.9154585486320632
[I 2024-11-01 12:59:36,295] Trial 28 finished with value: 1.7684227445904774 and parameters: {'n_estimators': 323, 'max_depth': 13, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.7684227445904774
[I 2024-11-01 13:00:10,977] Trial 29 finished with value: 2.5056760068454014 and parameters: {'n_estimators': 465, 'max_depth': 18, 'criterion': 'absolute_error', 'min_samples_split': 5, 'min_samples_leaf': 10, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
2.5056760068454014
[I 2024-11-01 13:00:13,888] Trial 30 finished with value: 1.8207039205968563 and parameters: {'n_estimators': 218, 'max_depth': 19, 'criterion': 'friedman_mse', 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.8207039205968563
[I 2024-11-01 13:00:16,088] Trial 31 finished with value: 1.711893283267114 and parameters: {'n_estimators': 136, 'max_depth': 19, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.711893283267114
[I 2024-11-01 13:00:18,277] Trial 32 finished with value: 1.7124769833390823 and parameters: {'n_estimators': 154, 'max_depth': 17, 'criterion': 'friedman_mse', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.7124769833390823
[I 2024-11-01 13:00:22,197] Trial 33 finished with value: 1.8195560862114373 and parameters: {'n_estimators': 310, 'max_depth': 19, 'criterion': 'friedman_mse', 'min_samples_split': 4, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.8195560862114373
[I 2024-11-01 13:00:24,793] Trial 34 finished with value: 1.9539583390050117 and parameters: {'n_estimators': 216, 'max_depth': 16, 'criterion': 'friedman_mse', 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.9539583390050117
[I 2024-11-01 13:00:30,001] Trial 35 finished with value: 1.6996608676583196 and parameters: {'n_estimators': 406, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.6996608676583196
[I 2024-11-01 13:00:35,355] Trial 36 finished with value: 1.848243602957832 and parameters: {'n_estimators': 429, 'max_depth': 15, 'criterion': 'squared_error', 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.848243602957832
[I 2024-11-01 13:00:41,627] Trial 37 finished with value: 1.8526421583386452 and parameters: {'n_estimators': 523, 'max_depth': 13, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
1.8526421583386452
[I 2024-11-01 13:00:47,387] Trial 38 finished with value: 2.3144170687498615 and parameters: {'n_estimators': 400, 'max_depth': 17, 'criterion': 'squared_error', 'min_samples_split': 4, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 24 with value: 1.6908013636395873.
2.3144170687498615
[I 2024-11-01 13:00:52,256] Trial 39 finished with value: 1.6756511972615542 and parameters: {'n_estimators': 300, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.6756511972615542
[I 2024-11-01 13:00:56,019] Trial 40 finished with value: 2.071687337284148 and parameters: {'n_estimators': 287, 'max_depth': 14, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'log2'}. Best is trial 39 with value: 1.6756511972615542.
2.071687337284148
[I 2024-11-01 13:01:00,499] Trial 41 finished with value: 1.6757607187373635 and parameters: {'n_estimators': 324, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.6757607187373635
[I 2024-11-01 13:01:03,761] Trial 42 finished with value: 1.689632901319494 and parameters: {'n_estimators': 210, 'max_depth': 18, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.689632901319494
[I 2024-11-01 13:01:07,396] Trial 43 finished with value: 1.8231974192724758 and parameters: {'n_estimators': 239, 'max_depth': 17, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.8231974192724758
[I 2024-11-01 13:01:11,408] Trial 44 finished with value: 1.7023947129128612 and parameters: {'n_estimators': 352, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.7023947129128612
[I 2024-11-01 13:01:19,906] Trial 45 finished with value: 1.9500334626410079 and parameters: {'n_estimators': 586, 'max_depth': 18, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.9500334626410079
[I 2024-11-01 13:01:22,496] Trial 46 finished with value: 1.9903905861358466 and parameters: {'n_estimators': 187, 'max_depth': 16, 'criterion': 'squared_error', 'min_samples_split': 9, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.9903905861358466
[I 2024-11-01 13:01:27,145] Trial 47 finished with value: 1.6817817547311238 and parameters: {'n_estimators': 322, 'max_depth': 18, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.6817817547311238
[I 2024-11-01 13:01:32,640] Trial 48 finished with value: 2.238034558135237 and parameters: {'n_estimators': 464, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
2.238034558135237
[I 2024-11-01 13:01:35,916] Trial 49 finished with value: 1.6882408776970421 and parameters: {'n_estimators': 239, 'max_depth': 17, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}. Best is trial 39 with value: 1.6756511972615542.
1.6882408776970421
trial = study.best_trial
print("MAPE: {}".format(trial.value))
print("Best hyperparameters: {}".format(trial.params))
print(f'\nPerformance difference with default parameters is : {trial.value - [t[1] for t in perf_history_int if t[0] == best_comb_int][0]}')
MAPE: 1.6756511972615542
Best hyperparameters: {'n_estimators': 300, 'max_depth': 19, 'criterion': 'squared_error', 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt'}

Performance difference with default parameters is : -0.035146172920057284
With this, we should be close to the best model we can get with our approach.
D - Testing¶
Let's now test our best model on the test set!
from sklearn.metrics import mean_absolute_percentage_error, r2_score
best_model = RandomForestRegressor(**trial.params)
X_train, Y_train, X_test, Y_test = get_train_test(datasets['best_features_w_interactions'], train_set, test_set, TARGET)
best_model.fit(X_train, Y_train)
yhat = best_model.predict(X_test)
final_perf = mean_absolute_percentage_error(Y_test, yhat) * 100
r2 = r2_score(Y_test, yhat)
dico = {'True':Y_test, 'Predicted':yhat, 'Difference':np.abs(yhat - Y_test)}
df = pd.DataFrame(dico)
norm = plt.Normalize(df['Difference'].min(), df['Difference'].max())
sm = plt.cm.ScalarMappable(cmap="viridis", norm=norm)
sm.set_array([])
ax = sns.scatterplot(x='True', y='Predicted', data=df, palette='viridis', hue='Difference')
ax.get_legend().remove()
ax.figure.colorbar(sm, ax=ax, label='Absolute Error | yhat - Y_test |')
plt.title(f"{final_perf:.2f} % average error\nMax error : {df['Difference'].max():.2f}\nR2 : {r2:.2f}")
plt.show()
feature_importances = best_model.feature_importances_
features = best_model.feature_names_in_
dico_features = {'Features':features, 'Importances':feature_importances}
df_features = pd.DataFrame(dico_features)
ax = sns.barplot(data=df_features, y='Features', x='Importances', orient='h', hue='Importances')
ax.get_legend().remove()
plt.show()
Feature importances give a summarized view of which features matter most to the model when making predictions. The plot above shows that the model relies heavily on the native features HIV/AIDS, Income composition of resources, Adult Mortality and Schooling, and on the interactions HIV/AIDS_x_Schooling and Adult Mortality_x_Developing. This seems reasonable: it means the model is looking at deaths of young children due to HIV/AIDS, an index of the average wealth of a country's inhabitants, the probability of dying between 15 and 60 years of age, and the average number of years of schooling (see the dataset description on Kaggle).
A more critical reading would note that the importance of the two interaction features is largely due to the fact that they contain very important native features (HIV/AIDS and Adult Mortality). We should keep in mind that each split in our RandomForest only considers a random subset of the available features (max_features='sqrt' in the tuned model). So when a tree's splits do not get access to the highly important native features but do get access to these interaction features, using the interactions is almost as valuable, because the main correlations are preserved within them.
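One way to partially control for this bias, not part of the original analysis but shown here as a sketch, is permutation importance: it measures how much the held-out score degrades when a single feature's values are shuffled.
from sklearn.inspection import permutation_importance

# Model-agnostic importance: shuffle each feature on the test set and measure
# the drop in score, averaged over several repeats.
perm = permutation_importance(
    best_model, X_test, Y_test,
    scoring='neg_mean_absolute_percentage_error',
    n_repeats=10, random_state=42, n_jobs=-1,
)
perm_df = pd.DataFrame({'Features': X_test.columns, 'Importances': perm.importances_mean})
perm_df = perm_df.sort_values('Importances', ascending=False)
ax = sns.barplot(data=perm_df, y='Features', x='Importances', orient='h')
plt.show()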
Another way to understand the model is to use Shapley values (here is a valuable blog entry on the topic).
Shapley values estimate how much each feature's value impacts the model output. They are really valuable when trying to make sense of a model's predictions.
import shap
explainer = shap.Explainer(best_model)
shap_values = explainer(X_test)
shap.plots.bar(shap_values, max_display=len(X_test.columns))
This plot shows the average absolute SHAP value of each feature in our model. Our findings from sklearn's feature_importances_ are largely recapitulated here. There are slight differences, though: with this method the most impactful feature for Life expectancy prediction is Income composition of resources, followed by HIV/AIDS, whereas it was the other way around with the previous method.
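To put the two rankings side by side, here is a short sketch reusing the objects already computed above.
# Compare impurity-based importances with mean absolute SHAP values (normalized
# so that both columns sum to 1 and are directly comparable).
mean_abs_shap = np.abs(shap_values.values).mean(axis=0)
comparison = pd.DataFrame({
    'Feature': X_test.columns,
    'RF importance': best_model.feature_importances_,
    'Mean |SHAP| (normalized)': mean_abs_shap / mean_abs_shap.sum(),
}).sort_values('Mean |SHAP| (normalized)', ascending=False)
print(comparison)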
The power of Shapley values is that we can examine even more precisely the output the model gives for a given observation, which lets us investigate individual predictions and thus either run sanity checks or, on the contrary, look for explanations of errors.
big_Life_expectancy = (df['True'] > 74)
low1 = df.loc[big_Life_expectancy]['Difference'].nsmallest(1).iloc[0]
idx1 = list(df['Difference']).index(low1)
small_Life_expectancy = (df['True'] < 59)
low2 = df.loc[small_Life_expectancy]['Difference'].nsmallest(1).iloc[0]
idx2 = list(df['Difference']).index(low2)
# Here are two observations of the test set accurately predicted by the model.
# df keeps the original row labels of the raw data, so we map the positional
# test-set index back to the raw dataframe through df.index to recover the country.
plt.title(f"Country : {datasets['raw']['Country'].loc[df.index[idx1]]}, True Life expectancy : {df['True'].iloc[idx1]}")
shap.plots.waterfall(shap_values[idx1], max_display=len(X_test.columns)) # High Life expectancy true value
plt.title(f"Country : {datasets['raw']['Country'].loc[df.index[idx2]]}, True Life expectancy : {df['True'].iloc[idx2]}")
shap.plots.waterfall(shap_values[idx2], max_display=len(X_test.columns)) # Low Life expectancy true value
The plots above show how the model arrived at its prediction for each observation. Each line can be read as: "because feature F had value X in this observation, the model shifted its prediction in this direction", where the size of the shift corresponds to that feature's Shapley value.
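As a sanity check on how to read these plots, the per-feature SHAP values plus the base value should reconstruct the model's prediction for that observation. A quick sketch using the objects computed above:
# Additivity check: base value + sum of SHAP values == model prediction
# (up to floating point error) for the first accurately predicted observation.
pred = best_model.predict(X_test.iloc[[idx1]])[0]
reconstructed = shap_values[idx1].base_values + shap_values[idx1].values.sum()
print(f'Model prediction              : {pred:.3f}')
print(f'Base value + sum(SHAP values) : {reconstructed:.3f}')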
Let's look at some observations the model got wrong.
largests = list(df['Difference'].nlargest(2))
biggest_error_idx = list(df['Difference']).index(largests[0])
second_error_idx = list(df['Difference']).index(largests[1])
# Here are two observations of the test set badly predicted by the model.
plt.title(f"Country : {datasets['raw']['Country'].loc[df.index[biggest_error_idx]]}, True Life expectancy : {df['True'].iloc[biggest_error_idx]}")
shap.plots.waterfall(shap_values[biggest_error_idx], max_display=len(X_test.columns)) # Largest prediction error
plt.title(f"Country : {datasets['raw']['Country'].loc[df.index[second_error_idx]]}, True Life expectancy : {df['True'].iloc[second_error_idx]}")
shap.plots.waterfall(shap_values[second_error_idx], max_display=len(X_test.columns)) # Second largest prediction error
Looking at the observations above, I think we can all agree that the model's output "makes sense": these were countries with high child mortality due to HIV/AIDS and high Adult Mortality. Analyzing them suggests that features the model lacks, such as birth rate or population-structure indicators (proportion of citizens under the poverty threshold, proportion of very high income households), could help better estimate the true Life expectancy. In any case, the model did not learn patterns that generalize to these observations.
One last thing I want to show is how Shapley values can be used to see how each feature's values tend to impact the model output across all samples, instead of looking at individual observations.
shap.plots.beeswarm(shap_values, max_display=len(X_test.columns))
In this plot, the more widely a feature's points are scattered, the more important that feature is for the predictions (high absolute Shapley values). For each feature, the color of the points indicates how high or low the feature's value was for that observation. When the colors form a clear gradient, we can directly read how the feature's value tends to impact the model output. For example, very high values of HIV/AIDS have a huge negative impact on the model output, whereas low to very low values barely impact it. This means the model learned to treat any increase in HIV/AIDS as a major indicator that Life expectancy will be lower.
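To look at a single feature in isolation, a SHAP dependence scatter could also be drawn. This is only a sketch; note the leading space in the column name, inherited from the raw CSV.
# SHAP dependence plot for the HIV/AIDS feature; passing color=shap_values lets
# shap pick the feature with the strongest interaction for the coloring.
shap.plots.scatter(shap_values[:, ' HIV/AIDS'], color=shap_values)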
I will conclude the project on this note, as we have applied a careful machine learning approach to this task and even delved into model explainability and error analysis.