Tiffany Chan
Ensemble Techniques Project
Deliverable 1 (Exploratory data quality report reflecting the following) (20 marks)
1. Univariate analysis (12 marks)
a. Univariate analysis: data types and description of the independent attributes, including name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body and tails of the distributions, missing values, and outliers.
b. Strategies to address the different data challenges, such as data pollution, outlier treatment and missing value treatment.
c. Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
#Importing necessary packages needed for data cleaning, plotting and for creating models:
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import Image
#import pydotplus as pydot
from sklearn import tree
from os import system
#Reading in dataset, and renaming it "bank"
bank = pd.read_csv("bank-full.csv")
#Printing the top 5 cases in the dataset to check we read in the dataset correctly.
print(bank.head(5))
print("")
#Dimensions of the dataset
print("Dimensions of dataset:")
print(bank.shape)
Univariate Analysis
#1a
#Data Types
print("Data Types")
print("")
print(bank.dtypes)
print("")
#Descriptive statistics of continuous numerical variables before data cleaning/transformation.
print("Descriptive Statistics of continuous numerical variables.")
print("")
print(bank.describe())
print("")
#Changing string variables into categorical variables.
for feature in bank.columns:  # Loop through all columns in the dataframe
    if bank[feature].dtype == 'object':  # Only apply to columns holding categorical strings
        bank[feature] = pd.Categorical(bank[feature])  # Convert the strings to a pandas Categorical
#Verifying that the string variables are changed into categorical variables.
print("Modified datatypes of dataset variables")
bank.dtypes
Looking at the descriptive statistics, the pdays and previous variables have an excessive number of -1 and 0 values respectively, and may need some form of transformation or deletion during exploratory data analysis (EDA). The balance variable has a minimum of -8019, which most likely means that account has been overdrawn and the owner has a negative balance. This variable may contain outliers and is something to evaluate later during univariate analysis.
#Frequency of categorical variables
print("Frequency of categorical variables")
print("")
print("Dependent variable:")
print(bank['Target'].value_counts())
print("")
print("Independent categorical variables:")
print(bank['job'].value_counts())
print(bank['marital'].value_counts())
print(bank['education'].value_counts())
print(bank['default'].value_counts())
print(bank['housing'].value_counts())
print(bank['loan'].value_counts())
print(bank['contact'].value_counts())
print(bank['month'].value_counts())
print(bank['poutcome'].value_counts())
print("")
#Frequency counts for the continuous numerical variables of interest.
#I would like to see how many -1 and 0s there are in pdays and previous, respectively.
print("Pdays Variable Frequency")
print(bank['pdays'].value_counts())
print("Previous Variable Frequency")
print(bank['previous'].value_counts())
#Number of NAs in dataset
print("Number of missing values in dataset:")
bank.isnull().sum()
There are no NAs in the dataset. However, there are 36954 values of -1 in pdays. According to the codebook for this assignment, a -1 means either that the individual was never contacted or that more than 900 days have passed since the last contact from the previous campaign. Because these two situations cannot be distinguished, the -1 values can be interpreted as missing values. The previous variable contains the same number of zeroes.
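One way to make this explicit is to recode the -1 sentinel as a missing value and measure how much of the column it covers (a minimal sketch, assuming we choose to treat -1 as missing; it works on a temporary copy, so the original column is untouched):
#Sketch: treat the -1 sentinel in pdays as missing and quantify the missingness.
pdays_as_na = bank['pdays'].replace(-1, np.nan)  # temporary copy; bank itself is unchanged
print("Share of pdays treated as missing:", round(pdays_as_na.isnull().mean() * 100, 1), "%")
print("Share of previous equal to 0:", round((bank['previous'] == 0).mean() * 100, 1), "%")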
#Heatmap to evaluate Pearson's correlation (numeric columns only).
sns.heatmap(bank.select_dtypes(include='number').corr(), annot = True);
Pdays and previous have a positive Pearson's correlation of 0.45, which can be interpreted as a low-to-moderate correlation. This alone is not a reason to exclude either variable from the models; more analysis is needed to understand these variables statistically before deciding how to handle them.
Two other pairings show weak correlation: day and campaign (Pearson's 0.16), and balance and age (Pearson's 0.098). Day and campaign are weakly correlated, while the correlation between balance and age is so close to zero that it may suggest no linear relationship at all.
The remaining Pearson's correlations are very small and close to zero, suggesting little or no linear relationship between the variables in those pairings (although a low Pearson's correlation does not rule out a nonlinear relationship or guarantee independence).
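To read off the exact pairwise values behind the annotated heatmap, the correlation matrix can also be listed as sorted pairs (a small sketch; select_dtypes keeps only the numeric columns, so it works whether or not the string columns have been converted to categoricals):
#Sketch: list each pair of numeric columns once, sorted by absolute Pearson's correlation.
num_corr = bank.select_dtypes(include='number').corr()
pairs = num_corr.where(np.triu(np.ones(num_corr.shape, dtype=bool), k=1)).stack()
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index))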
#Univariate analysis
#Histograms of all the continuous variables to see the spread and shape of the data.
columns = list(bank)[:] # Creating a new list with all columns
bank[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); # Histogram of all columns
From these histograms, balance, campaign, duration, pdays and previous are skewed to the right, which suggests there may be outliers in these variables that must be handled. Age is more evenly spread, but it may still contain outliers. To verify whether there are true outliers, we need to look at the range and boxplots of these variables.
#A better look at the histogram for age.
#It does not follow a normal distribution.
sns.distplot(bank['age'])
#Boxplot for age to see if there are outliers beyond the whiskers.
bank[['age']].boxplot()
There are outliers for age, and there are multiple ways of dealing with them. For a variable like age that will be used to predict a categorical dependent variable, a common approach is to bin it into age groups.
bins = [18, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']
bank['agerange'] = pd.cut(bank.age, bins, labels = labels,include_lowest = True)
bank['agerange'] = bank.agerange.astype(str)
bank.dtypes #agerange is the categorical variable for age, but it is listed below as a string. I will change this later down below with One Hot Encoding.
# Balance boxplot to see the outliers.
bank[['balance']].boxplot()
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['balance'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Looking at how many of the values in this variable are outliers.
balanceoutlier_max = bank[(bank['balance'] > 3462)]
print(balanceoutlier_max) #10.4% of the data variable
balanceoutlier_min = bank[(bank['balance'] < -1962)]
balanceoutlier_min #0.037% of the data variable
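The percentages quoted in the comments above can also be computed directly instead of being read off the printed frames (a quick sketch that reuses the whisker values calculated above):
#Sketch: express the counts above and below the whiskers as percentages of all rows.
upper_whisker = q75 + 1.5 * iqr
lower_whisker = q25 - 1.5 * iqr
print("Above upper whisker:", round((bank['balance'] > upper_whisker).mean() * 100, 1), "% of rows")
print("Below lower whisker:", round((bank['balance'] < lower_whisker).mean() * 100, 3), "% of rows")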
#Replace the outliers with the value of the whiskers. This is one technique to handle the outliers of skewed data.
#Making a new boxplot afterwards to check that the outliers have been replaced. Using .loc avoids chained assignment.
bank.loc[bank['balance'] > 3462, 'balance'] = 3462
bank[['balance']].boxplot()
bank.loc[bank['balance'] < -1962, 'balance'] = -1962
bank[['balance']].boxplot()
#Boxplot for campaign variable.
bank[['campaign']].boxplot();
This variable only has outliers above the upper whisker, on the right-skewed side.
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['campaign'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
#Set the threshold to the upper whisker. Find out how many outliers are in this variable.
campaignoutlier_max = bank[(bank['campaign'] > 6)]
print(campaignoutlier_max) #6.8% of the data
#Set the outliers to the upper whisker value and verify with a new boxplot.
bank.loc[bank['campaign'] > 6, 'campaign'] = 6
bank[['campaign']].boxplot()
#Boxplot for day variable.
bank[['day']].boxplot()
There are no outliers in the day variable. So, there is no need to transform the data.
#Boxplot for duration.
bank[['duration']].boxplot()
There are only outliers above the upper whisker.
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['duration'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
durationoutlier_max = bank[(bank['duration'] > 643)]
print(durationoutlier_max) #7.2% of the data variable
#Replacing the outliers with the value of the upper whisker.
bank.loc[bank['duration'] > 643, 'duration'] = 643
bank[['duration']].boxplot()
#Boxplot for pdays.
bank[['pdays']].boxplot()
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['pdays'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
The upper whisker sits at the -1 value (which can be interpreted as unknown, because we do not know whether these clients were never contacted or simply never responded after 900 days). Replacing the outliers with this value would not serve much benefit, since it would only create more -1 (unknown) values in the data. It makes more sense to discard this column altogether.
bank[['previous']].boxplot()
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['previous'], [75 ,25])
iqr = q75 - q25
#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
It would also make sense to discard this column, because it contains too many zeroes.
1b. The variables of concern are pdays and previous.
For pdays, the number of -1 values is concerning because -1 means either that the customer was not contacted or that more than 900 days have passed since the last contact. These two conditions have completely different meanings, so the values should be treated as missing. There is not enough information in the other dataset variables or the codebook to differentiate the two groups. The value -1 appears 36954 times in the pdays variable, which is approximately 81.7% of the dataset. When more than half of a column is missing, it is generally more beneficial to drop the column altogether (Kumar, 2020). This way, the model stays more robust than if mean/median/mode or confidence-interval imputations were applied to 81.7% of the variable. The downside of this method is the loss of data; however, only about 18.3% of the column contains informative (non-missing) values, so relatively little usable data is sacrificed. The same reasoning applies to the previous variable.
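A minimal sketch of this drop-if-mostly-missing rule, assuming the -1 and 0 sentinels are counted as missing and using a 50% threshold (the actual drop happens further below, together with the other excluded columns, when X is built):
#Sketch: flag columns whose sentinel ("missing") share exceeds a chosen threshold.
sentinels = {'pdays': -1, 'previous': 0}  # assumed sentinel value per column
threshold = 0.50
mostly_missing = [col for col, val in sentinels.items() if (bank[col] == val).mean() > threshold]
print("Columns that would be dropped:", mostly_missing)
#bank = bank.drop(columns=mostly_missing)  # uncomment to drop them here instead of later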
#Boxplot for balance x target
print(sns.boxplot(x = 'Target', y = 'balance', data = bank));
#Density curves for balance x target
balancekde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'balance'], color='r', shade=True, label='Yes')
balancekde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'balance'], color='b', shade=True, label='No');
balancekde.set(xlabel="balance", ylabel = "Probability Density");
The balance distributions look similar for the two target groups. Both the 'yes' and 'no' curves show spikes, likely because many sampled individuals have balances just below or just above 0, and another group has balances around $3000-4000.
#Boxplot for campaign
print(sns.boxplot(x = 'Target', y = 'campaign', data = bank));
#Density plot for campaign
campaignkde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'campaign'], color='r', shade=True, label='Yes')
campaignkde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'campaign'], color='b', shade=True, label='No')
campaignkde.set(xlabel="campaign", ylabel = "Probability Density");
Most clients were contacted only once during the campaign. According to the boxplot, the median number of campaign contacts was roughly 2, both for those who subscribed to a term deposit and for those who did not.
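The medians read off the boxplot can also be checked numerically (a quick sketch; note that the campaign column has already been capped at 6 above):
#Sketch: median and mean number of campaign contacts per target class.
print(bank.groupby('Target')['campaign'].agg(['median', 'mean', 'count']))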
#Duration - This variable will not be used in the models later because it strongly affects the output target and, according to the data source, should be discarded.
#Boxplot for duration.
print(sns.boxplot(x = 'Target', y = 'duration', data = bank));
#Density graphs for duration.
durationkde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'duration'], color='r', shade=True, label='Yes')
durationkde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'duration'], color='b', shade=True, label='No')
durationkde.set(xlabel="duration", ylabel = "Probability Density");
print(sns.boxplot(x = 'Target', y = 'pdays', data = bank));
pdayskde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'pdays'], color='r', shade=True, label='Yes')
pdayskde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'pdays'], color='b', shade=True, label='No')
pdayskde.set(xlabel="pdays", ylabel = "Probability Density");
As discussed before, this variable will not be included in the models: the right-skewed tail represents only about 18.3% of the values in this column, and many of the values in that tail are considered outliers.
print(sns.boxplot(x = 'Target', y = 'previous', data = bank));
#Density plot for previous x target
previouskde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'previous'], color='r', shade=True, label='Yes')
previouskde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'previous'], color='b', shade=True, label='No')
previouskde.set(xlabel="previous", ylabel = "Probability Density");
For the same reasons stated above for pdays, imputing this variable's outliers on the skewed tail would only generate more zeroes. Discarding this variable should produce more robust models.
#Days x Target
print(sns.boxplot(x = 'Target', y = 'day', data = bank));
#Density plot for day x target
daykde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
'day'], color='r', shade=True, label='Yes')
daykde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
'day'], color='b', shade=True, label='No')
daykde.set(xlabel="day", ylabel = "Probability Density");
There seem to be more positive responses when the last contact falls between day 1 and roughly day 15 of the month. From day 15 onwards, fewer contacts result in term deposit sign-ups.
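To put a rough number on this pattern, the subscription rate can be computed for each day of the month (a small sketch; Target has not yet been recoded to 0/1 at this point, so the yes/no labels are compared directly):
#Sketch: share of "yes" responses by calendar day of the last contact.
yes_rate_by_day = (bank['Target'] == 'yes').groupby(bank['day']).mean()
print(yes_rate_by_day.round(3))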
# Education x Target
#Frequency using 3x2 table and then plotting bar graph to illustrate this frequency.
print(pd.crosstab(bank['education'],bank['Target']))
ax = sns.countplot(x="Target", hue="education", data=bank)
print(ax)
#Crosstabs to evaluate the relationship between the two variables.
print(pd.crosstab(bank['education'],bank['Target'],normalize='columns'))
Those who have tertiary education subscribe more to term deposits compared to those who have a primary education.
#Job x Target
#Table to evaluate frequency values for job x target, and plotting these results.
print(pd.crosstab(bank['job'],bank['Target']))
sns.set(rc={'figure.figsize':(11.7,8.27)})
bx = sns.countplot(x="Target", hue="job", data=bank)
bx;
#Crosstabs to evaluate the relationship between both variables (job x target)
print(pd.crosstab(bank['job'],bank['Target'],normalize='columns'))
Those with blue-collar jobs are less likely to take a term deposit than other occupations. The self-employed appear in roughly the same proportion among subscribers and non-subscribers, and the same holds for those with unknown jobs, technicians, and administrators. Retirees and students, however, are more likely to take a term deposit.
#Marital x Target
#2x2 to observe the frequency of marital status by target status, and accompanying bar plot.
print(pd.crosstab(bank['marital'],bank['Target']))
cx = sns.countplot(x="Target", hue="marital", data=bank)
cx;
#Crosstabs to evaluate the relationship between both variables (marital x target)
print(pd.crosstab(bank['marital'],bank['Target'],normalize='columns'))
Those who are married are less likely to subscribe to a term deposit than those who are single. Divorced clients appear in roughly the same proportion among subscribers and non-subscribers.
#Default x Target
#Frequency of target status by default status using 2x2 table and plot.
print(pd.crosstab(bank['default'],bank['Target']))
dx = sns.countplot(x="Target", hue="default", data=bank)
dx;
#Crosstabs to evaluate the relationship between both variables (default x target)
print(pd.crosstab(bank['default'],bank['Target'],normalize='columns'))
Those who did not default appear in roughly the same proportion among subscribers and non-subscribers, while those who did default are less likely to take a term deposit.
#Housing x Target
#2x2 table showing the frequency of housing by target, and accompanying bar chart.
print(pd.crosstab(bank['housing'],bank['Target']))
ex = sns.countplot(x="Target", hue="housing", data=bank)
ex;
#Crosstabs to evaluate the relationship between both variables (housing x target)
print(pd.crosstab(bank['housing'],bank['Target'],normalize='columns'))
Those who have a housing loan are less likely to subscribe to a term deposit, while those without a housing loan are more likely to subscribe.
#Loan x Target
#2x2 table showing target frequencies by loan and accompanying bar chart.
print(pd.crosstab(bank['loan'],bank['Target']))
fx = sns.countplot(x="Target", hue="loan", data=bank)
fx;
#Crosstabs to evaluate the relationship between both variables (loan x target)
print(pd.crosstab(bank['loan'],bank['Target'],normalize='columns'))
Those that have a personal loan are less likely to get a term deposit. Those that don't have a personal loan are slightly more likely to get a term deposit.
#Contact x Target
#2x2 table showing frequencies of target by contact means, and bar chart.
print(pd.crosstab(bank['contact'],bank['Target']))
gx = sns.countplot(x="Target", hue="contact", data=bank)
gx;
#Crosstabs to evaluate the relationship between both variables (contact x target)
print(pd.crosstab(bank['contact'],bank['Target'],normalize='columns'))
Those contacted on a cellular phone were more likely to agree to a term deposit, and those contacted on a telephone were also more likely to subscribe, while those with unknown contact means were more likely not to subscribe.
#Month x Target
#Frequency table of Target by Month, and bar plot.
print(pd.crosstab(bank['month'],bank['Target']))
sns.set(rc={'figure.figsize':(11.7,8.27)})
hx = sns.countplot(x="Target", hue="month", data=bank)
hx;
#Crosstabs to evaluate the relationship between both variables (month x target)
print(pd.crosstab(bank['month'],bank['Target'],normalize='columns'))
Those last contacted in February, March, April, September, October or December were more likely to say yes to a term deposit, while those last contacted in January, May, June, July or November were less likely to say yes. The remaining months show little difference between accepting and refusing a term deposit.
#Frequency table of target variable by poutcome and bar chart.
print(pd.crosstab(bank['poutcome'],bank['Target']))
ix = sns.countplot(x="Target", hue="poutcome", data=bank)
ix;
#Crosstabs to evaluate the relationship between both variables (poutcome x target)
print(pd.crosstab(bank['poutcome'],bank['Target'],normalize='columns'))
Those that had a positive outcome from the last campaign are more likely to subscribe to a term deposit, while those whose outcomes were unknown for the last campaign are more likely to not subscribe to a term deposit. Those who had a negative outcome did not show much of a difference in subscribing and not subscribing to a term deposit.
#Agerange x Target
#Agerange is a categorical variable that was derived from age.
#Frequency of agerange by target variable, and bar chart.
print(pd.crosstab(bank['agerange'],bank['Target']))
jx = sns.countplot(x="Target", hue="agerange", data=bank)
jx;
#Crosstabs to evaluate the relationship between both variables (agerange x target)
print(pd.crosstab(bank['agerange'],bank['Target'],normalize='columns'))
Those between the ages of 30 and 59 are less likely to subscribe to a term deposit, while younger clients (18-29) and those near or past retirement age are more likely to take a term deposit.
#Evaluating every variable's dtype.
bank.dtypes
#Replacing string categories with numerical values. Making these variables numerical.
replaceStruct = {
"default": {"no": 0, "yes": 1 },
"housing": {"no": 0, "yes":1 },
"loan": {"no": 0, "yes":1 },
"Target": {"no": 0, "yes": 1 }
}
#One hot encoding variables with multiple categories.
oneHotCols=["job","marital","education","contact","month","poutcome","agerange"]
#Applying the above to the dataset.
#Create dummy variables for model.
bank=bank.replace(replaceStruct)
bank=pd.get_dummies(bank, columns=oneHotCols)
bank.head(10)
#Verifying the dtype of each newly encoded variable.
bank.info()
#Preparing the data. Separating the dependent variable "Target" from the rest of the dataset. Drop pdays, previous, age and duration for the reasons discussed above.
X = bank.drop(["Target", "pdays", "previous", "age", "duration"], axis = 1)
y = bank.pop("Target")
#Splitting data into training (70%) and testing (30%) datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
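Because the target classes are heavily imbalanced (far more 'no' than 'yes'), it is worth confirming that the split preserves roughly the same positive rate in both sets (a quick sketch; passing stratify=y to train_test_split would be one hedged way to guarantee this, though it was not used here):
#Sketch: compare the positive ("yes" = 1) rate in the training and testing sets.
print("Positive rate in training set:", round((y_train == 1).mean(), 3))
print("Positive rate in testing set:", round((y_test == 1).mean(), 3))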
#Creating logistic regression model
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Fit the model on train
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
#predict on test
y_predict = model.predict(X_test)
model_score = model.score(X_test, y_test)
print(model_score)
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Numbers from Confusion Matrix")
print(cm)
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print("")
#Confusion Matrix
print("Confusion Matrix")
def draw_cm( actual, predicted ):
    cm = confusion_matrix( actual, predicted)
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
#Making dataframe for logistic regression metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Logistic_Regression':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe1 = pd.DataFrame(data)
dataframe1
The logistic regression model may not be the best suited for this data. Training accuracy and testing accuracy are both around 90%, which is very high, but these high scores are driven by the large number of correctly predicted negatives (true negatives).
However, looking at the misclassified cases is more important when evaluating this model. Recall for this logistic regression model is 0.19, which is low. Recall measures, of all the clients who actually subscribed to a term deposit, how many the model predicted correctly. 1 - Recall (0.81) reflects the severity of the false negatives. This is fairly high, and while it indicates a poor model, it can work out in the bank's favour: these are people who invested in a term deposit even though the model did not predict they would, so the bank receives more money than the model expected and can invest it in profitable ventures. A high 1 - Recall therefore benefits the bank financially, but it also means the model's predictions were not very good. There are 1264 false negatives.
Precision is the other measure to focus on. Precision is 0.61, which is decent. It measures, of the clients the model predicted would take a term deposit, how many actually did. 1 - Precision (0.39) reflects the severity of the false positives: individuals the bank expected to participate but who did not, leaving the bank with less money to invest. There are 187 false positives.
Since there are more false negatives than false positives, the bank would, according to this model, come out ahead. However, because the model's recall and precision scores are not great, we should ask whether another model would be better suited to this data. The F1 score is also low, which again indicates a weak model.
The net gain for the bank, beyond the true positive cases, is predicted to be 1264 - 187 = 1077 term deposit accounts.
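The FN - FP arithmetic used here (and repeated for the later models) can be wrapped in a small helper; this is a hypothetical convenience function, not part of the original notebook:
#Sketch: hypothetical helper that derives the "net gain" (false negatives minus false positives)
#directly from the test labels and predictions.
def net_gain(actual, predicted):
    tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
    return fn - fp
#Example usage with the logistic regression predictions above:
#print("Net gain (FN - FP):", net_gain(y_test, y_predict))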
## Feature Importance or Coefficients
fi = pd.DataFrame()
fi['Col'] = X_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Evaluating the feature importances after one hot encoding, poutcome_success (coefficient 1.45) has the most influence on the model's predictions. month_march (1.04), month_jan (1.02), contact_unknown (0.89) and month_october (0.75) also exert a strong influence. Overall, the most influential components come from the poutcome, month and contact variables.
Variables with moderate impact include housing (0.47), loan (0.35), and some of the age categories such as 70-79 (0.54) and 50-59 (0.47).
Other variables, such as balance and day, do not appear to have much influence in the logistic regression model (duration was excluded from the features altogether).
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
The training score is much higher than the testing score, which shows the model has overfit the training data, so we must prune the tree.
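Rather than fixing max_depth at 7 by hand as done below, one hedged alternative would be to let cross-validation choose the depth (a sketch, not what was done in this notebook):
#Sketch: choose the pruning depth by cross-validated grid search instead of manually.
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 7, 9, 11]}
grid = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=1), param_grid, cv=5, scoring='f1')
grid.fit(X_train, y_train)
print("Best depth:", grid.best_params_, " CV F1:", round(grid.best_score_, 3))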
#Pruning
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 7, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
After pruning, the model fits the training data and the testing data equally well; the two scores are very close to one another.
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = X_train.columns))
As in the logistic regression model, poutcome_success (0.5) has the highest impact on the model. contact_unknown is the second most important variable. Among the months, month_march, month_june, month_september and month_october play a moderate role in this prediction model. Oddly, month_jan does not seem as important here as it did in the logistic regression model. The remaining features contribute comparatively little.
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
#Making dataframe for Decision Tree metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Decision_Tree':[dTreeR.score(X_train,y_train), dTreeR.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe2 = pd.DataFrame(data)
dataframe2
From these results, the decision tree model performs better than the logistic regression model; both the F1 score and the ROC AUC score are higher. Recall has increased to 0.214, while precision stays about the same at 0.6. The number of true positives has increased to 332, an improvement over the previous model. There are 1219 false negatives and 224 false positives.
Again, the bank would come out ahead because the false negatives outnumber the false positives, but the model is still not performing adequately: the recall remains low, even though the precision is acceptable.
The net gain for the bank, beyond the true positive cases, is predicted to be 1219 - 224 = 995 term deposit accounts.
#Bagging
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)
bgcl = bgcl.fit(X_train, y_train)
y_predict = bgcl.predict(X_test)
print(bgcl.score(X_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
#Making dataframe for Bagging metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Bagging':[bgcl.score(X_train,y_train), bgcl.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe3 = pd.DataFrame(data)
dataframe3
For the bagging model, recall (0.27) has improved but precision (0.53) has gone down. The question we need to ask is whether recall or precision matters more. As with the other models, the accuracy is high because of the large number of true negatives. There is still a net gain despite the misclassifications, since the false negatives outnumber the false positives. The net gain in customers, beyond the true positive cases, who will take a term deposit is 1125 - 367 = 758 customers (term deposit accounts).
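If recall matters more to the bank than precision (or the other way around), one hedged option is to move the decision threshold away from the default 0.5 using the bagged model's predicted probabilities, rather than switching algorithms (a sketch, not part of the original analysis):
#Sketch: trade precision against recall by lowering the classification threshold.
proba = bgcl.predict_proba(X_test)[:, 1]  # predicted probability of subscribing
for threshold in [0.5, 0.4, 0.3]:
    preds = (proba >= threshold).astype(int)
    print("threshold", threshold,
          "recall", round(recall_score(y_test, preds), 3),
          "precision", round(precision_score(y_test, preds), 3))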
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(X_train, y_train)
y_predict = abcl.predict(X_test)
print(abcl.score(X_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
#Making dataframe for AdaBoost metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'AdaBoost':[abcl.score(X_train,y_train), abcl.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe4 = pd.DataFrame(data)
dataframe4
The recall score is 0.19, which is lower than both the decision tree model and the bagging model. The precision score for the AdaBoost model is 0.64, which is higher than all the previous models.
The net gain for the bank, beyond the true positive cases, is predicted to be 1262 - 160 = 1102 term deposit accounts.
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
y_predict = gbcl.predict(X_test)
print(gbcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Metrics
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
#Making dataframe for GradientBoost metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'GradientBoost':[gbcl.score(X_train,y_train), gbcl.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe5 = pd.DataFrame(data)
dataframe5
The recall score is 0.17, which is lower than all of the other models. The precision score for the Gradient Boosting model is 0.66, which is higher than all the previous models.
The net gain for the bank, beyond the true positive cases, is predicted to be 1287 - 139 = 1148 term deposit accounts.
#Random forest model:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(X_train, y_train)
#Confusion Matrix
y_predict = rfcl.predict(X_test)
print(rfcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
columns = [i for i in ["Predict_No","Predict_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
#Metrics
print("Trainig accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
The recall score is 0.25, and the precision score for the random forest is 0.55, which is not as high as GradientBoost's.
The net gain for the bank, beyond the true positive cases, is predicted to be 1165 - 312 = 853 term deposit accounts.
#Making dataframe out of the random forest metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Random_Forest':[rfcl.score(X_train,y_train), rfcl.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]}
dataframe6 = pd.DataFrame(data)
dataframe6
# 3. Merge all dataframes together to make a large dataframe with all metrics for all algorithms
Metrics_Dataframe = pd.merge(dataframe1,dataframe2,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe3,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe4,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe5,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe6,how='outer',on='Metrics')
Metrics_Dataframe
All the models have about the same training and testing accuracy scores.
It seems that of all the models, bagging has the best recall score but GradientBoost has the best precision.
Bagging has the highest ROC area under the curve score and may be the best model overall.
However, the model that generates the most gain from the misclassification errors is GradientBoost: it has the highest precision and the fewest false positives of all the models.
In other words, for the bank to actually make a profit, you would want a model with more false negatives (customers who were predicted not to subscribe to a term deposit but did) and fewer false positives (customers who were predicted to subscribe but ended up not participating). This, however, may not be the model that makes the best predictions.
Bagging, with the highest ROC AUC and F1 scores, seems to make the best predictions, but the bank earns more with a model like GradientBoost because it has the fewest false positives and the highest precision.
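For a side-by-side view, the merged table can also be transposed so that each row is a model (a small optional sketch):
#Sketch: one row per model, metrics as columns, rounded for readability.
comparison = Metrics_Dataframe.set_index('Metrics').T.round(3)
print(comparison)
#comparison[['Recall', 'Precision', 'F1 Score']].plot(kind='bar');  # optional visual comparison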