Tiffany Chan

Ensemble Techniques Project

Deliverable 1 (Exploratory data quality report reflecting the following) (20 marks)

1. Univariate analysis (12 marks)

a. Univariate analysis: data types and description of the independent attributes, which should include name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of the distributions / tails, missing values, and outliers.

b. Strategies to address the different data challenges such as data pollution, outlier treatment and missing values treatment.

c. Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.

In [1088]:
#Importing necessary packages needed for data cleaning, plotting and for creating models:
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import Image  
#import pydotplus as pydot
from sklearn import tree
from os import system
In [1089]:
#Reading in dataset, and renaming it "bank"
bank = pd.read_csv("bank-full.csv")
In [1090]:
#Printing the top 5 cases in the dataset to check we read in the dataset correctly.
print(bank.head(5))
print("")
#Dimensions of the dataset
print("Dimensions of dataset:")
print(bank.shape)
   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome Target  
0  unknown    5   may       261         1     -1         0  unknown     no  
1  unknown    5   may       151         1     -1         0  unknown     no  
2  unknown    5   may        76         1     -1         0  unknown     no  
3  unknown    5   may        92         1     -1         0  unknown     no  
4  unknown    5   may       198         1     -1         0  unknown     no  

Dimensions of dataset:
(45211, 17)

Univariate Analysis

In [1091]:
#1a
#Data Types
print("Data Types")
print("")
print(bank.dtypes)
print("")

#Descriptive statistics of continuous numerical variables before data cleaning/transformation.
print("Descriptive Statistics of continuous numerical variables.")
print("")
print(bank.describe())
print("")

#Changing string variables into categorical variables.
for feature in bank.columns: # Loop through all columns in the dataframe
    if bank[feature].dtype == 'object': # Only apply for columns with categorical strings
        bank[feature] = pd.Categorical(bank[feature]) # Convert strings to the pandas categorical dtype
#Verifying that the string variables are changed into categorical variables.
print("Modified datatypes of dataset variables")
bank.dtypes
Data Types

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
Target       object
dtype: object

Descriptive Statistics of continuous numerical variables.

                age        balance           day      duration      campaign  \
count  45211.000000   45211.000000  45211.000000  45211.000000  45211.000000   
mean      40.936210    1362.272058     15.806419    258.163080      2.763841   
std       10.618762    3044.765829      8.322476    257.527812      3.098021   
min       18.000000   -8019.000000      1.000000      0.000000      1.000000   
25%       33.000000      72.000000      8.000000    103.000000      1.000000   
50%       39.000000     448.000000     16.000000    180.000000      2.000000   
75%       48.000000    1428.000000     21.000000    319.000000      3.000000   
max       95.000000  102127.000000     31.000000   4918.000000     63.000000   

              pdays      previous  
count  45211.000000  45211.000000  
mean      40.197828      0.580323  
std      100.128746      2.303441  
min       -1.000000      0.000000  
25%       -1.000000      0.000000  
50%       -1.000000      0.000000  
75%       -1.000000      0.000000  
max      871.000000    275.000000  

Modified datatypes of dataset variables
Out[1091]:
age             int64
job          category
marital      category
education    category
default      category
balance         int64
housing      category
loan         category
contact      category
day             int64
month        category
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome     category
Target       category
dtype: object

Looking at the descriptive statistics, the pdays variable and the previous variable have excessive counts of -1 and 0 respectively, and may need some form of transformation or deletion during exploratory data analysis (EDA). The balance variable has a minimum value of -8019, which most likely means the account has been overdrawn and the owner holds a negative balance. This variable may contain outliers and is something to evaluate later when conducting univariate analysis.

In [1092]:
#Frequency of categorical variables

print("Frequency of categorical variables")
print("")
print("Dependent variable:")
print(bank['Target'].value_counts())
print("")

print("Independent categorical variables:")
print(bank['job'].value_counts())
print(bank['marital'].value_counts())
print(bank['education'].value_counts())
print(bank['default'].value_counts())
print(bank['housing'].value_counts())
print(bank['loan'].value_counts())
print(bank['contact'].value_counts())
print(bank['month'].value_counts())
print(bank['poutcome'].value_counts())
print("")

#Frequency of continuous variables of interest that are numerical and continuous in nature. 
#I would like to see how many -1 and 0s there are in pdays and previous, respectively.
print("Pdays Variable Frequency")
print(bank['pdays'].value_counts())
print("Previous Variable Frequency")
print(bank['previous'].value_counts())


#Number of NAs in dataset
print("Number of missing values in dataset:")
bank.isnull().sum()
Frequency of categorical variables

Dependent variable:
no     39922
yes     5289
Name: Target, dtype: int64

Independent categorical variables:
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64
married     27214
single      12790
divorced     5207
Name: marital, dtype: int64
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64
no     44396
yes      815
Name: default, dtype: int64
yes    25130
no     20081
Name: housing, dtype: int64
no     37967
yes     7244
Name: loan, dtype: int64
cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: month, dtype: int64
unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64

Pdays Variable Frequency
-1      36954
 182      167
 92       147
 183      126
 91       126
        ...  
 749        1
 717        1
 589        1
 493        1
 32         1
Name: pdays, Length: 559, dtype: int64
Previous Variable Frequency
0      36954
1       2772
2       2106
3       1142
4        714
5        459
6        277
7        205
8        129
9         92
10        67
11        65
12        44
13        38
15        20
14        19
17        15
16        13
19        11
23         8
20         8
22         6
18         6
24         5
27         5
29         4
25         4
21         4
30         3
28         2
26         2
37         2
38         2
55         1
40         1
35         1
58         1
51         1
41         1
32         1
275        1
Name: previous, dtype: int64
Number of missing values in dataset:
Out[1092]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
Target       0
dtype: int64

There are no NAs in the dataset. However, there are 36954 values of -1 in pdays. According to the codebook for this assignment, -1 represents individuals who were either never contacted or who did not respond within 900 days of the last contact from the previous campaign. These values can therefore be treated as missing, since we cannot tell whether a client was never contacted or simply never responded. The previous variable contains the same number of zeroes.
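
One possible treatment (a minimal sketch, not applied in this notebook) is to mark the -1 values explicitly as missing so that they show up in isnull() checks; pdays_as_missing is a hypothetical name:

#Sketch only: treat -1 in pdays as missing (NaN)
pdays_as_missing = bank['pdays'].replace(-1, np.nan)
print(pdays_as_missing.isnull().mean())  # roughly 0.817, i.e. ~81.7% of the column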

In [1093]:
#Heatmap to evaluate Pearson's correlation.
sns.heatmap(bank.corr(), annot = True);

Pdays and previous have a positive Pearson's correlation of 0.45, which can be interpreted as a low-to-moderate correlation. This alone does not mean we should exclude either one from our models; more analysis is needed to understand these variables statistically before deciding how to handle them.

Two other pairings show weak correlations: day and campaign (Pearson's 0.16) and balance and age (Pearson's 0.098). Day and campaign are weakly correlated, while the balance-age correlation is very close to zero and suggests essentially no linear relationship.

The remaining Pearson's correlation values are very small and close to zero, suggesting little or no linear relationship between those pairs of variables.
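
To read exact values off the heatmap, the pairwise correlations can also be listed directly; a small sketch, assuming the same bank dataframe at this point in the notebook (before any new columns are added):

corr = bank.corr()  # Pearson correlation of the numeric columns
mask = np.triu(np.ones(corr.shape, dtype=bool))  # hide the diagonal and one triangle so each pair appears once
pairs = corr.mask(mask).stack().abs().sort_values(ascending=False)
print(pairs.head(5))  # strongest absolute correlations, e.g. pdays/previous at about 0.45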

In [1094]:
#Univariate analysis
#Histograms of all the continuous variables to see the spread and shape of the data.
columns = list(bank)[:] # Creating a new list with all columns 
bank[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); # Histogram of all columns

From these histograms, balance, campaign, duration, pdays and previous are skewed to the right, which suggests these variables may contain outliers that need to be handled. Age has a reasonable spread, but it may also contain outliers. To verify whether there are true outliers, we need to look at the range and boxplots of these variables.

In [1095]:
#A better look at the histogram for age. 
#It does not follow a normal distribution.

sns.distplot(bank['age'])
Out[1095]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccfb6b0a0>
In [1096]:
#Boxplot for age to see if there are outliers beyond the whiskers.

bank[['age']].boxplot()
Out[1096]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccecb5580>

There are outliers for age, and there are multiple ways of dealing with them. For a variable like age that is used to predict a categorical dependent variable, a common approach is to bin it into age groups.

In [1097]:
bins = [18, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']
bank['agerange'] = pd.cut(bank.age, bins, labels = labels,include_lowest = True)
bank['agerange'] = bank.agerange.astype(str)
In [1098]:
bank.dtypes  #agerange is the categorical variable for age, but it is listed below as a string. I will change this later down below with One Hot Encoding.
Out[1098]:
age             int64
job          category
marital      category
education    category
default      category
balance         int64
housing      category
loan         category
contact      category
day             int64
month        category
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome     category
Target       category
agerange       object
dtype: object

Many variables in this dataset have outliers. Outliers can be defined as points that lie beyond the boxplot whiskers: values beyond 3*IQR are treated as outliers, and values beyond 1.5*IQR are suspected outliers.

For these variables, values beyond the boxplot whiskers will be capped at the whisker values.
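
Since the same whisker calculation and capping is repeated for several variables below, a small reusable helper could express the idea in one place. This is only a sketch of the 1.5*IQR rule described above; cap_at_whiskers is a hypothetical name and is not used elsewhere in the notebook.

def cap_at_whiskers(series):
    #Clip a numeric series to the 1.5*IQR boxplot whiskers (Tukey's rule).
    q25, q75 = series.quantile([0.25, 0.75])
    iqr = q75 - q25
    lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

#Example usage (equivalent to the manual capping done below):
#bank['balance'] = cap_at_whiskers(bank['balance'])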

In [1099]:
# Balance boxplot to see the outliers.
bank[['balance']].boxplot()
Out[1099]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccef17d00>
In [1100]:
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['balance'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
1356.0
75th percentile
1428.0
25th percentile
72.0
Upper whisker
3462.0
Lower whisker
-1962.0
In [1101]:
#Looking at how much of the variable points are outliers 

balanceoutlier_max = bank[(bank['balance'] > 3462)]
print(balanceoutlier_max)   #10.4% of the data variable
balanceoutlier_min = bank[(bank['balance'] < -1962)]
balanceoutlier_min  #0.037% of the data variable
       age          job   marital  education default  balance housing loan  \
34      51   management   married   tertiary      no    10635     yes   no   
65      51   management   married   tertiary      no     6530     yes   no   
69      35  blue-collar    single  secondary      no    12223     yes  yes   
70      57  blue-collar   married  secondary      no     5935     yes  yes   
186     40     services  divorced    unknown      no     4384     yes   no   
...    ...          ...       ...        ...     ...      ...     ...  ...   
45164   35     services   married   tertiary      no     4655      no   no   
45181   46  blue-collar   married  secondary      no     6879      no   no   
45185   60     services   married   tertiary      no     4256     yes   no   
45191   75      retired  divorced   tertiary      no     3810     yes   no   
45208   72      retired   married  secondary      no     5715      no   no   

        contact  day month  duration  campaign  pdays  previous poutcome  \
34      unknown    5   may       336         1     -1         0  unknown   
65      unknown    5   may        91         1     -1         0  unknown   
69      unknown    5   may       177         1     -1         0  unknown   
70      unknown    5   may       258         1     -1         0  unknown   
186     unknown    5   may       315         1     -1         0  unknown   
...         ...  ...   ...       ...       ...    ...       ...      ...   
45164  cellular    9   nov       111         2     -1         0  unknown   
45181  cellular   15   nov        74         2    118         3  failure   
45185  cellular   16   nov       200         1     92         4  success   
45191  cellular   16   nov       262         1    183         1  failure   
45208  cellular   17   nov      1127         5    184         3  success   

      Target agerange  
34        no    50-59  
65        no    50-59  
69        no    30-39  
70        no    50-59  
186       no    30-39  
...      ...      ...  
45164     no    30-39  
45181     no    40-49  
45185    yes    50-59  
45191    yes    70-79  
45208    yes    70-79  

[4712 rows x 18 columns]
Out[1101]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target agerange
1896 57 self-employed married tertiary yes -3313 yes yes unknown 9 may 153 1 -1 0 unknown no 50-59
7413 43 management married tertiary yes -3372 yes no unknown 29 may 386 2 -1 0 unknown no 40-49
7756 49 management divorced tertiary no -2049 yes no unknown 30 may 169 3 -1 0 unknown no 40-49
11655 56 management divorced tertiary yes -1968 no no unknown 20 jun 60 3 -1 0 unknown no 50-59
12909 26 blue-collar single secondary yes -8019 no yes cellular 7 jul 299 3 -1 0 unknown no 18-29
13687 38 blue-collar divorced secondary no -2093 yes yes unknown 9 jul 120 3 -1 0 unknown no 30-39
13708 49 entrepreneur married secondary no -1965 no yes telephone 10 jul 317 2 -1 0 unknown no 40-49
14434 51 management divorced tertiary no -2282 yes yes cellular 14 jul 301 6 -1 0 unknown no 50-59
15682 49 management married tertiary yes -6847 no yes cellular 21 jul 206 1 -1 0 unknown no 40-49
17159 51 entrepreneur married secondary yes -2082 no yes cellular 28 jul 123 6 -1 0 unknown no 50-59
18573 40 technician married tertiary yes -2827 yes yes cellular 31 jul 843 1 -1 0 unknown no 30-39
20317 35 management single tertiary yes -1980 yes yes cellular 11 aug 227 1 -1 0 unknown no 30-39
25119 49 blue-collar single primary yes -2604 yes no cellular 18 nov 142 1 -1 0 unknown no 40-49
25240 43 services married primary no -2122 yes yes cellular 18 nov 141 3 -1 0 unknown no 40-49
31509 52 management married tertiary no -2712 yes yes cellular 2 apr 253 1 -1 0 unknown no 50-59
32713 39 self-employed married tertiary no -3058 yes yes cellular 17 apr 882 3 -1 0 unknown yes 30-39
38736 60 management divorced tertiary no -4057 yes no cellular 18 may 242 6 -1 0 unknown no 50-59
In [1105]:
#Replace the outliers with the value of the whiskers. This is one technique for handling the outliers of skewed data.
#Using .loc avoids pandas' SettingWithCopyWarning; a new boxplot checks that the outliers were replaced.
bank.loc[bank['balance'] > 3462, 'balance'] = 3462
bank.loc[bank['balance'] < -1962, 'balance'] = -1962
bank[['balance']].boxplot()
Out[1105]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccf820a00>
In [1106]:
#Boxplot for campaign variable.

bank[['campaign']].boxplot();

There are only outliers above the upper whisker in this variable (the distribution is right-skewed).

In [1107]:
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['campaign'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
2.0
75th percentile
3.0
25th percentile
1.0
Upper whisker
6.0
Lower whisker
-2.0
In [1108]:
#Set the threshold to the upper whisker. Find out how many outliers are in this variable.

campaignoutlier_max = bank[(bank['campaign'] > 6)]
print(campaignoutlier_max)   #6.8% of the data
       age            job  marital  education default  balance housing loan  \
758     59       services  married  secondary      no      307     yes  yes   
780     30         admin.  married  secondary      no        4      no   no   
906     27       services   single  secondary      no        0     yes   no   
1103    52     technician  married    unknown      no      133     yes   no   
1105    43         admin.  married   tertiary      no     1924     yes   no   
...    ...            ...      ...        ...     ...      ...     ...  ...   
44666   25     technician   single  secondary      no     1298      no   no   
44680   27  self-employed   single   tertiary      no      560      no   no   
44770   37       services   single   tertiary      no     1189      no   no   
44886   38     management  married   tertiary      no        0      no   no   
45056   64     management  married   tertiary      no     3462      no   no   

         contact  day month  duration  campaign  pdays  previous poutcome  \
758      unknown    6   may       250         7     -1         0  unknown   
780      unknown    7   may       172         8     -1         0  unknown   
906      unknown    7   may       388         7     -1         0  unknown   
1103     unknown    7   may       253         8     -1         0  unknown   
1105     unknown    7   may       244         7     -1         0  unknown   
...          ...  ...   ...       ...       ...    ...       ...      ...   
44666   cellular    2   sep       206         7     94         2    other   
44680  telephone    3   sep       543         9    465         2  failure   
44770   cellular   13   sep       323         9    199         4    other   
44886   cellular   24   sep       246        12    197         6  failure   
45056  telephone   19   oct       898         7    137        11    other   

      Target agerange  
758       no    50-59  
780       no    18-29  
906       no    18-29  
1103      no    50-59  
1105      no    40-49  
...      ...      ...  
44666     no    18-29  
44680     no    18-29  
44770    yes    30-39  
44886     no    30-39  
45056    yes    60-69  

[3064 rows x 18 columns]
In [1109]:
#Set the outliers to the upper whisker value and verify with a new boxplot.

bank.loc[bank['campaign'] > 6, 'campaign'] = 6
bank[['campaign']].boxplot()
Out[1109]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcbc6b5910>
In [1110]:
#Boxplot for day variable.

bank[['day']].boxplot()
Out[1110]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc780d8e0>

There are no outliers in the day variable. So, there is no need to transform the data.

In [1111]:
#Boxplot for duration.

bank[['duration']].boxplot()
Out[1111]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccce69730>

There are only outliers above the upper whisker.

In [1112]:
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['duration'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
216.0
75th percentile
319.0
25th percentile
103.0
Upper whisker
643.0
Lower whisker
-221.0
In [1113]:
durationoutlier_max = bank[(bank['duration'] > 643)]
print(durationoutlier_max)   #7.2% of the data variable
       age          job   marital  education default  balance housing loan  \
37      53   technician   married  secondary      no       -3      no   no   
43      54      retired   married  secondary      no      529     yes   no   
53      42       admin.    single  secondary      no      -76     yes   no   
59      46     services   married    primary      no      179     yes   no   
61      53   technician  divorced  secondary      no      989     yes   no   
...    ...          ...       ...        ...     ...      ...     ...  ...   
45124   27  blue-collar    single    primary      no      232     yes   no   
45199   34  blue-collar    single  secondary      no     1475     yes   no   
45200   38   technician   married  secondary      no      557     yes   no   
45206   51   technician   married   tertiary      no      825      no   no   
45208   72      retired   married  secondary      no     3462      no   no   

        contact  day month  duration  campaign  pdays  previous poutcome  \
37      unknown    5   may      1666         1     -1         0  unknown   
43      unknown    5   may      1492         1     -1         0  unknown   
53      unknown    5   may       787         1     -1         0  unknown   
59      unknown    5   may      1778         1     -1         0  unknown   
61      unknown    5   may       812         1     -1         0  unknown   
...         ...  ...   ...       ...       ...    ...       ...      ...   
45124  cellular   26   oct       701         2     -1         0  unknown   
45199  cellular   16   nov      1166         3    530        12    other   
45200  cellular   16   nov      1556         4     -1         0  unknown   
45206  cellular   17   nov       977         3     -1         0  unknown   
45208  cellular   17   nov      1127         5    184         3  success   

      Target agerange  
37        no    50-59  
43        no    50-59  
53        no    40-49  
59        no    40-49  
61        no    50-59  
...      ...      ...  
45124    yes    18-29  
45199     no    30-39  
45200    yes    30-39  
45206    yes    50-59  
45208    yes    70-79  

[3235 rows x 18 columns]
In [1114]:
#Replacing the outliers with the value of the upper whisker.

bank.loc[bank['duration'] > 643, 'duration'] = 643
bank[['duration']].boxplot()
Out[1114]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc923c730>
In [1115]:
#Boxplot for pdays.

bank[['pdays']].boxplot()
Out[1115]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc4748940>
In [1116]:
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['pdays'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
0.0
75th percentile
-1.0
25th percentile
-1.0
Upper whisker
-1.0
Lower whisker
-1.0

The upper whisker is the -1 value (which can be interpreted as an unknown value, because we don't know whether these clients were never contacted or simply never responded after 900 days). Replacing the outliers with this value would not serve much benefit, since it would only create more -1 (unknown) values in the data. It makes more sense to discard this column altogether.

In [1117]:
bank[['previous']].boxplot()
Out[1117]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dccf83e7f0>
In [1118]:
#Calculate the whisker values of the boxplot to get the threshold that separates the outliers.
#Calculate the interquartile range:
q75, q25 = np.percentile(bank['previous'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
0.0
75th percentile
0.0
25th percentile
0.0
Upper whisker
0.0
Lower whisker
0.0

For the same reason, it makes sense to discard this column as well, because the vast majority of its values are zero.

1b. The variables of concern are pdays and previous.

For pdays, the number of -1 values is concerning because -1 means either that the customer was not contacted or that more than 900 days have passed since the last contact. These two conditions have completely different meanings, so the value should be treated as missing. There is not enough information in the other dataset variables or the codebook to differentiate the two groups. "-1" appears 36954 times in the pdays variable, which is approximately 81.7% of the dataset. Because of this excessive number of missing values (well over half the column), it is more beneficial to drop the column altogether (Kumar, 2020). This keeps the model more robust than imputing a mean/median/mode or confidence-level estimate for 81.7% of the variable. The downside is the loss of data, but only the roughly 18.3% of informative values in this column would be sacrificed, which is not too much. The same reasoning applies to the previous variable.
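
The proportions quoted above can be checked directly (a small sketch; both columns are still present in bank at this point):

print((bank['pdays'] == -1).mean())    # share of rows with pdays == -1, about 0.817
print((bank['previous'] == 0).mean())  # the same share of rows have previous == 0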

BIVARIATE ANALYSIS:

2. Multivariate analysis (8 marks)

a. Bivariate analysis between the predictor variables and the target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.

b. Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.

Bivariate analysis for continuous predictors and target variable.

In [1119]:
#Boxplot for balance x target

print(sns.boxplot(x = 'Target', y = 'balance', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1120]:
#Density curves for balance x target

balancekde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'balance'], color='r', shade=True, label='Yes')

balancekde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'balance'], color='b', shade=True, label='No');
balancekde.set(xlabel="balance", ylabel="Probability Density");

The balance distributions look similar for the two target classes. Both the 'yes' and 'no' groups show spikes in density, probably because many sampled individuals have balances just below or just above 0, and another group has balances of roughly $3000-4000.

In [1121]:
#Boxplot for campaign
print(sns.boxplot(x = 'Target', y = 'campaign', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1122]:
#Density plot for campaign

campaignkde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'campaign'], color='r', shade=True, label='Yes')

campaignkde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'campaign'], color='b', shade=True, label='No')

campaignkde.set(xlabel="campaign", ylabel="Probability Density");

Most clients received only one call from the bank. According to the boxplot, the median number of campaign contacts was roughly 2 for both those who did and those who did not subscribe to a term deposit.
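
The median read off the boxplot can be confirmed numerically (a small sketch):

#Median number of campaign contacts per target class; expected to be roughly 2 for both groups.
print(bank.groupby('Target')['campaign'].median())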

In [1123]:
#Duration - this variable will not be used in the later models because, according to the data source, it highly affects the output target and should be discarded.
#Boxplot for duration. 
print(sns.boxplot(x = 'Target', y = 'duration', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1124]:
#Density graphs for duration.
durationkde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'duration'], color='r', shade=True, label='Yes')

durationkde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'duration'], color='b', shade=True, label='No')

durationkde.set(xlabel="duration", ylabel="Probability Density");

The density plot makes sense: the more time spent on the call, the higher the probability that the individual will sign up for a term deposit. Less time spent on the phone suggests disinterest and is less likely to lead to a subscription.

In [1125]:
print(sns.boxplot(x = 'Target', y = 'pdays', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1126]:
pdayskde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'pdays'], color='r', shade=True, label='Yes')

pdayskde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'pdays'], color='b', shade=True, label='No')

pdayskde.set(xlabel="pdays", ylabel="Probability Density");
C:\Users\TiChan\Anaconda3\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)

As discussed before, this variable will not be included in the model: the non-missing values in its right-skewed tail represent only about 18.3% of the variable, and many of them are outliers.

In [1127]:
print(sns.boxplot(x = 'Target', y = 'previous', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1128]:
#Density plot for previous x target
previouskde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'previous'], color='r', shade=True, label='Yes')

previouskde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'previous'], color='b', shade=True, label='No')

previouskde.set(xlabel="previous", ylabel="Probability Density");
C:\Users\TiChan\Anaconda3\lib\site-packages\seaborn\distributions.py:369: UserWarning: Default bandwidth for data is 0; skipping density estimation.
  warnings.warn(msg, UserWarning)

For the same reasons stated above for pdays: imputing this variable's outliers on the skewed tail would only generate more 0s. Discarding this variable should produce more robust models.

In [1129]:
#Days x Target
print(sns.boxplot(x = 'Target', y = 'day', data = bank));
AxesSubplot(0.125,0.125;0.775x0.755)
In [1130]:
#Density plot for day x target
daykde = sns.kdeplot(bank.loc[(bank['Target']=='yes'),
            'day'], color='r', shade=True, label='Yes')

daykde = sns.kdeplot(bank.loc[(bank['Target']=='no'),
            'day'], color='b', shade=True, label='No')

daykde.set(xlabel="day", ylabel="Probability Density");

There seem to be more positive responses from the start of the month up to around day 15; from day 15 onward, fewer positive outcomes (term deposit sign-ups) are observed.

Bivariate analysis for categorical independent variables

In [1131]:
# Education x Target

#Frequency using 3x2 table and then plotting bar graph to illustrate this frequency.
print(pd.crosstab(bank['education'],bank['Target']))
ax = sns.countplot(x="Target", hue="education", data=bank)
print(ax)
Target        no   yes
education             
primary     6260   591
secondary  20752  2450
tertiary   11305  1996
unknown     1605   252
AxesSubplot(0.125,0.125;0.775x0.755)
In [1132]:
#Crosstabs to evaluate the relationship between the two variables.
print(pd.crosstab(bank['education'],bank['Target'],normalize='columns'))
Target           no       yes
education                    
primary    0.156806  0.111741
secondary  0.519814  0.463226
tertiary   0.283177  0.377387
unknown    0.040203  0.047646

Those who have tertiary education subscribe more to term deposits compared to those who have a primary education.
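
To quantify the degree of relation suggested by the crosstab, a chi-square test of independence could be applied; this is only a sketch and assumes scipy is available (it is not imported elsewhere in this notebook):

from scipy.stats import chi2_contingency

#Chi-square test of independence between education and Target
chi2, p, dof, expected = chi2_contingency(pd.crosstab(bank['education'], bank['Target']))
print("chi2:", chi2, "p-value:", p, "dof:", dof)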

In [1133]:
#Job x Target
#Table to evaluate frequency values for job x target, and plotting these results.

print(pd.crosstab(bank['job'],bank['Target']))
sns.set(rc={'figure.figsize':(11.7,8.27)})
bx = sns.countplot(x="Target", hue="job", data=bank)
bx;
Target           no   yes
job                      
admin.         4540   631
blue-collar    9024   708
entrepreneur   1364   123
housemaid      1131   109
management     8157  1301
retired        1748   516
self-employed  1392   187
services       3785   369
student         669   269
technician     6757   840
unemployed     1101   202
unknown         254    34
In [1134]:
#Crosstabs to evaluate the relationship between both variables (job x target)
print(pd.crosstab(bank['job'],bank['Target'],normalize='columns'))
Target               no       yes
job                              
admin.         0.113722  0.119304
blue-collar    0.226041  0.133863
entrepreneur   0.034167  0.023256
housemaid      0.028330  0.020609
management     0.204323  0.245982
retired        0.043785  0.097561
self-employed  0.034868  0.035356
services       0.094810  0.069767
student        0.016758  0.050860
technician     0.169255  0.158820
unemployed     0.027579  0.038192
unknown        0.006362  0.006428

Those with blue-collar jobs are less likely to subscribe to a term deposit compared with other occupations. The self-employed show roughly the same distribution among subscribers and non-subscribers, and the same trend is observed for those with unknown jobs, technicians, and administrators. However, those who are retired and students are more likely to subscribe to a term deposit.

In [1135]:
#Marital x Target
#2x2 to observe the frequency of marital status by target status, and accompanying bar plot.
print(pd.crosstab(bank['marital'],bank['Target']))

cx = sns.countplot(x="Target", hue="marital", data=bank)
cx;
Target       no   yes
marital              
divorced   4585   622
married   24459  2755
single    10878  1912
In [1136]:
#Crosstabs to evaluate the relationship between both variables (marital x target)
print(pd.crosstab(bank['marital'],bank['Target'],normalize='columns'))
Target          no       yes
marital                     
divorced  0.114849  0.117603
married   0.612670  0.520892
single    0.272481  0.361505

Those that are married are less likely to subscribe to a term deposit than those that are single. The distribution is the same among those that are divorced when it comes to subscribing or not subscribing to a term deposit.

In [1137]:
#Default x Target
#Frequency of target status by default status using 2x2 table and plot.

print(pd.crosstab(bank['default'],bank['Target']))
dx = sns.countplot(x="Target", hue="default", data=bank)
dx;
Target      no   yes
default             
no       39159  5237
yes        763    52
In [1138]:
#Crosstabs to evaluate the relationship between both variables (default x target)
print(pd.crosstab(bank['default'],bank['Target'],normalize='columns'))
Target         no       yes
default                    
no       0.980888  0.990168
yes      0.019112  0.009832

Those that did not default have the same distribution when it comes to subscribing or not subscribing to term deposit. Those that did default are less likely to take a term deposit.

In [1139]:
#Housing x Target
#2x2 table showing the frequency of housing by target, and accompanying bar chart.

print(pd.crosstab(bank['housing'],bank['Target']))
ex = sns.countplot(x="Target", hue="housing", data=bank)
ex;
Target      no   yes
housing             
no       16727  3354
yes      23195  1935
In [1140]:
#Crosstabs to evaluate the relationship between both variables (housing x target)
print(pd.crosstab(bank['housing'],bank['Target'],normalize='columns'))
Target         no       yes
housing                    
no       0.418992  0.634146
yes      0.581008  0.365854

Those that have a housing loan are less likely to subscribe to a term deposit, while those without a housing loan are more likely to subscribe.

In [1141]:
#Loan x Target
#2x2 table showing target frequencies by loan and accompanying bar chart. 

print(pd.crosstab(bank['loan'],bank['Target']))
fx = sns.countplot(x="Target", hue="loan", data=bank)
fx;
Target     no   yes
loan               
no      33162  4805
yes      6760   484
In [1142]:
#Crosstabs to evaluate the relationship between both variables (loan x target)
print(pd.crosstab(bank['loan'],bank['Target'],normalize='columns'))
Target       no       yes
loan                     
no      0.83067  0.908489
yes     0.16933  0.091511

Those that have a personal loan are less likely to get a term deposit. Those that don't have a personal loan are slightly more likely to get a term deposit.

In [1143]:
#Contact x Target
#2x2 table showing frequencies of target by contact means, and bar chart.

print(pd.crosstab(bank['contact'],bank['Target']))
gx = sns.countplot(x="Target", hue="contact", data=bank)
gx;
Target        no   yes
contact               
cellular   24916  4369
telephone   2516   390
unknown    12490   530
In [1144]:
#Crosstabs to evaluate the relationship between both variables (contact x target)
print(pd.crosstab(bank['contact'],bank['Target'],normalize='columns'))
Target           no       yes
contact                      
cellular   0.624117  0.826054
telephone  0.063023  0.073738
unknown    0.312860  0.100208

Those contacted on their cell phones were more likely to subscribe to a term deposit, and those contacted by telephone were slightly more likely to subscribe. Those with an unknown contact method were more likely not to subscribe.

In [1145]:
#Month x Target
#Frequency table of Target by Month, and bar plot.

print(pd.crosstab(bank['month'],bank['Target']))
sns.set(rc={'figure.figsize':(11.7,8.27)})
hx = sns.countplot(x="Target", hue="month", data=bank)
hx;
Target     no  yes
month             
apr      2355  577
aug      5559  688
dec       114  100
feb      2208  441
jan      1261  142
jul      6268  627
jun      4795  546
mar       229  248
may     12841  925
nov      3567  403
oct       415  323
sep       310  269
In [1146]:
#Crosstabs to evaluate the relationship between both variables (month x target)
print(pd.crosstab(bank['month'],bank['Target'],normalize='columns'))
Target        no       yes
month                     
apr     0.058990  0.109094
aug     0.139247  0.130081
dec     0.002856  0.018907
feb     0.055308  0.083381
jan     0.031587  0.026848
jul     0.157006  0.118548
jun     0.120109  0.103233
mar     0.005736  0.046890
may     0.321652  0.174891
nov     0.089349  0.076196
oct     0.010395  0.061070
sep     0.007765  0.050860

Those last contacted in February, March, April, September, October and December were more likely to say yes to a term deposit. Those last contacted in January, May, June, July, and November were less likely to say yes. The remaining month, August, showed little difference between accepting and refusing a term deposit.

In [1147]:
#Frequency table of target variable by poutcome and bar chart.

print(pd.crosstab(bank['poutcome'],bank['Target']))
ix = sns.countplot(x="Target", hue="poutcome", data=bank)
ix;
Target       no   yes
poutcome             
failure    4283   618
other      1533   307
success     533   978
unknown   33573  3386
In [1148]:
#Crosstabs to evaluate the relationship between both variables (poutcome x target)
print(pd.crosstab(bank['poutcome'],bank['Target'],normalize='columns'))
Target          no       yes
poutcome                    
failure   0.107284  0.116846
other     0.038400  0.058045
success   0.013351  0.184912
unknown   0.840965  0.640197

Those that had a positive outcome from the last campaign are more likely to subscribe to a term deposit, while those whose outcomes were unknown for the last campaign are more likely to not subscribe to a term deposit. Those who had a negative outcome did not show much of a difference in subscribing and not subscribing to a term deposit.

In [1149]:
#Agerange x Target
#Agerange is a categorical variable that was derived from age.
#Frequency of agerange by target variable, and bar chart.

print(pd.crosstab(bank['agerange'],bank['Target']))
jx = sns.countplot(x="Target", hue="agerange", data=bank)
jx;
Target       no   yes
agerange             
18-29      5885  1145
30-39     15875  1812
40-49     10220  1019
50-59      7256   811
60-69       417   284
70-79       213   175
80-89        54    38
90-99         2     5
In [1150]:
#Crosstabs to evaluate the relationship between both variables (agerange x target)
print(pd.crosstab(bank['agerange'],bank['Target'],normalize='columns'))
Target          no       yes
agerange                    
18-29     0.147412  0.216487
30-39     0.397650  0.342598
40-49     0.255999  0.192664
50-59     0.181754  0.153337
60-69     0.010445  0.053696
70-79     0.005335  0.033088
80-89     0.001353  0.007185
90-99     0.000050  0.000945

Those between the ages of 30 and 59 are less likely to subscribe to a term deposit. Younger clients (18-29) and those approaching retirement or already retired are more likely to take out a term deposit.

Deliverable 2 (Prepare the data for analytics) (10 marks)

1. Ensure the attribute types are correct. If not, take appropriate actions.

2. Get the data model ready.

3. Transform the data, i.e. scale / normalize if required.

4. Create the training set and test set in a ratio of 70:30.

  1. We need to convert the string variables into numerical form so that the decision tree and regression models can use them later on. The binary variables (default, housing, loan and Target) are mapped to 0/1, and the remaining multi-category variables (job, marital, education, contact, month, poutcome, agerange) are one hot encoded, as shown in the cells below.
In [1151]:
#Evaluating every variable's dtype.

bank.dtypes
Out[1151]:
age             int64
job          category
marital      category
education    category
default      category
balance         int64
housing      category
loan         category
contact      category
day             int64
month        category
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome     category
Target       category
agerange       object
dtype: object
In [1152]:
#Replacing string categories with numerical values. Making these variables numerical.

replaceStruct = {
                "default":  {"no": 0, "yes": 1 },
                "housing": {"no": 0, "yes":1 },
                "loan": {"no": 0, "yes":1 },
                "Target":     {"no": 0, "yes": 1 } 
                    }

#One hot encoding variables with multiple categories.
oneHotCols=["job","marital","education","contact","month","poutcome","agerange"]
In [1153]:
#Applying the above to the dataset.
#Create dummy variables for model.
bank=bank.replace(replaceStruct)
bank=pd.get_dummies(bank, columns=oneHotCols)
bank.head(10)
Out[1153]:
age default balance housing loan day duration campaign pdays previous ... poutcome_success poutcome_unknown agerange_18-29 agerange_30-39 agerange_40-49 agerange_50-59 agerange_60-69 agerange_70-79 agerange_80-89 agerange_90-99
0 58 0 2143 1 0 5 261 1 -1 0 ... 0 1 0 0 0 1 0 0 0 0
1 44 0 29 1 0 5 151 1 -1 0 ... 0 1 0 0 1 0 0 0 0 0
2 33 0 2 1 1 5 76 1 -1 0 ... 0 1 0 1 0 0 0 0 0 0
3 47 0 1506 1 0 5 92 1 -1 0 ... 0 1 0 0 1 0 0 0 0 0
4 33 0 1 0 0 5 198 1 -1 0 ... 0 1 0 1 0 0 0 0 0 0
5 35 0 231 1 0 5 139 1 -1 0 ... 0 1 0 1 0 0 0 0 0 0
6 28 0 447 1 1 5 217 1 -1 0 ... 0 1 1 0 0 0 0 0 0 0
7 42 1 2 1 0 5 380 1 -1 0 ... 0 1 0 0 1 0 0 0 0 0
8 58 0 121 1 0 5 50 1 -1 0 ... 0 1 0 0 0 1 0 0 0 0
9 43 0 593 1 0 5 55 1 -1 0 ... 0 1 0 0 1 0 0 0 0 0

10 rows × 57 columns

In [1154]:
#Verifying the dtype of each newly encoded variable.

bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 57 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   age                  45211 non-null  int64
 1   default              45211 non-null  int64
 2   balance              45211 non-null  int64
 3   housing              45211 non-null  int64
 4   loan                 45211 non-null  int64
 5   day                  45211 non-null  int64
 6   duration             45211 non-null  int64
 7   campaign             45211 non-null  int64
 8   pdays                45211 non-null  int64
 9   previous             45211 non-null  int64
 10  Target               45211 non-null  int64
 11  job_admin.           45211 non-null  uint8
 12  job_blue-collar      45211 non-null  uint8
 13  job_entrepreneur     45211 non-null  uint8
 14  job_housemaid        45211 non-null  uint8
 15  job_management       45211 non-null  uint8
 16  job_retired          45211 non-null  uint8
 17  job_self-employed    45211 non-null  uint8
 18  job_services         45211 non-null  uint8
 19  job_student          45211 non-null  uint8
 20  job_technician       45211 non-null  uint8
 21  job_unemployed       45211 non-null  uint8
 22  job_unknown          45211 non-null  uint8
 23  marital_divorced     45211 non-null  uint8
 24  marital_married      45211 non-null  uint8
 25  marital_single       45211 non-null  uint8
 26  education_primary    45211 non-null  uint8
 27  education_secondary  45211 non-null  uint8
 28  education_tertiary   45211 non-null  uint8
 29  education_unknown    45211 non-null  uint8
 30  contact_cellular     45211 non-null  uint8
 31  contact_telephone    45211 non-null  uint8
 32  contact_unknown      45211 non-null  uint8
 33  month_apr            45211 non-null  uint8
 34  month_aug            45211 non-null  uint8
 35  month_dec            45211 non-null  uint8
 36  month_feb            45211 non-null  uint8
 37  month_jan            45211 non-null  uint8
 38  month_jul            45211 non-null  uint8
 39  month_jun            45211 non-null  uint8
 40  month_mar            45211 non-null  uint8
 41  month_may            45211 non-null  uint8
 42  month_nov            45211 non-null  uint8
 43  month_oct            45211 non-null  uint8
 44  month_sep            45211 non-null  uint8
 45  poutcome_failure     45211 non-null  uint8
 46  poutcome_other       45211 non-null  uint8
 47  poutcome_success     45211 non-null  uint8
 48  poutcome_unknown     45211 non-null  uint8
 49  agerange_18-29       45211 non-null  uint8
 50  agerange_30-39       45211 non-null  uint8
 51  agerange_40-49       45211 non-null  uint8
 52  agerange_50-59       45211 non-null  uint8
 53  agerange_60-69       45211 non-null  uint8
 54  agerange_70-79       45211 non-null  uint8
 55  agerange_80-89       45211 non-null  uint8
 56  agerange_90-99       45211 non-null  uint8
dtypes: int64(11), uint8(46)
memory usage: 5.8 MB
In [1155]:
#Preparing the data. Separating the dependent variable "Target" from the rest of the dataset. Drop pdays, previous, age (replaced by agerange) and duration for the reasons discussed above.

X = bank.drop(["Target", "pdays", "previous", "age", "duration"], axis = 1)
y = bank.pop("Target")
In [1156]:
#Splitting data into training (70%) and testing (30%) datasets. 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
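
Given the class imbalance noted earlier (about 88% "no"), a stratified split is one alternative worth noting. This sketch is not what was used above; the "_s" variable names are hypothetical and the downstream results in this notebook come from the unstratified split.

#Alternative split that preserves the yes/no ratio in both sets (not used in this notebook).
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)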

Deliverable 3 (create the ensemble model) (30 marks)

1. First create models using Logistic Regression and the Decision Tree algorithm. Note the model performance using different metrics. Use a confusion matrix to evaluate class-level metrics, i.e. Precision/Recall. Also report the accuracy and F1 score of the model. (10 marks)

2. Build the ensemble models (Bagging and Boosting) and note the model performance using the same metrics as in the above models (at least 3 algorithms). (15 marks)

3. Make a DataFrame to compare the models and their metrics. Give a conclusion regarding the best algorithm and your reason behind it. (5 marks)

Logistic Regression

In [1157]:
#Creating logistic regression model

from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression(solver="liblinear")
model.fit(X_train, y_train)
#predict on test
y_predict = model.predict(X_test)
In [1158]:
model_score = model.score(X_test, y_test)
print(model_score)
0.8930256561486287
In [1159]:
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Numbers from Confusion Matrix")
print(cm)
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print("")

#Confusion Matrix
print("Confusion Matrix")
def draw_cm( actual, predicted ):
    cm = confusion_matrix( actual, predicted)
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
Numbers from Confusion Matrix
[[  287  1264]
 [  187 11826]]

Confusion Matrix
In [1160]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score

print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.18504190844616378

Precision: 0.6054852320675106

F1 Score: 0.2834567901234568

Roc Auc Score: 0.5847377193941466
In [1161]:
#Making dataframe for logistic regression metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Logistic_Regression':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1162]:
dataframe1 = pd.DataFrame(data)
In [1163]:
dataframe1
Out[1163]:
Metrics Logistic_Regression
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.185042
3 Precision 0.605485
4 F1 Score 0.283457
5 ROC AUC Score 0.584738

The logistic regression model may not be the best suited for this data. Training accuracy and testing accuracy are both about 89%, which looks very high, but these scores are driven mostly by the large number of correctly predicted negatives (true negatives).

However, the misclassified cases matter more when evaluating this model. Recall for this logistic regression model is 0.19, which is low. Recall measures, of all the clients who actually subscribed to a term deposit, how many the model predicted correctly. 1 - Recall (0.81) reflects the severity of false negatives: clients who subscribed even though the model did not predict them to. In a business sense this is not all bad for the bank, since these clients bring in more deposits than the model expected, allowing the bank to invest in profitable ventures, but it still indicates a weak model. There are 1264 false negatives.

Precision is the other measure to focus on. Precision is 0.61, which is decent. It measures, of those the model predicted would take a term deposit, how many actually did. 1 - Precision (0.39) reflects the severity of false positives: individuals the bank expected to participate but who did not, which means fewer funds for the bank to invest. There are 187 false positives.

Since there are more false negatives than false positives, the bank would, according to this model, end up gaining deposits it did not anticipate. However, because the model's recall and precision are not great, we should question whether another model would be better suited to this data. The low F1 score also points to a weak model.

Beyond the true positive cases, the net unanticipated gain for the bank is predicted to be 1264 - 187 = 1077 term deposit accounts.
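
The class-level metrics reported above follow directly from the confusion matrix counts; a short check using the counts printed earlier:

TP, FN, FP, TN = 287, 1264, 187, 11826              # from the confusion matrix above
recall = TP / (TP + FN)                             # 287/1551  ~ 0.185
precision = TP / (TP + FP)                          # 287/474   ~ 0.605
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.283
accuracy = (TP + TN) / (TP + TN + FP + FN)          # ~ 0.893
print(recall, precision, f1, accuracy)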

In [1164]:
## Feature Importance or Coefficients 
fi = pd.DataFrame()
fi['Col'] = X_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Out[1164]:
Col Coeff
42 poutcome_success 1.45
35 month_mar 1.04
32 month_jan 1.02
27 contact_unknown 0.89
38 month_oct 0.75
37 month_nov 0.74
39 month_sep 0.72
29 month_aug 0.71
40 poutcome_failure 0.71
43 poutcome_unknown 0.65
33 month_jul 0.58
30 month_dec 0.56
49 agerange_70-79 0.54
47 agerange_50-59 0.47
2 housing 0.47
41 poutcome_other 0.46
46 agerange_40-49 0.45
45 agerange_30-39 0.41
25 contact_cellular 0.41
36 month_may 0.37
3 loan 0.35
31 month_feb 0.32
34 month_jun 0.31
8 job_entrepreneur 0.27
21 education_primary 0.27
19 marital_married 0.27
48 agerange_60-69 0.26
50 agerange_80-89 0.26
14 job_student 0.25
9 job_housemaid 0.24
16 job_unemployed 0.15
0 default 0.15
17 job_unknown 0.13
5 campaign 0.12
12 job_self-employed 0.12
44 agerange_18-29 0.11
26 contact_telephone 0.10
22 education_secondary 0.10
11 job_retired 0.10
7 job_blue-collar 0.09
24 education_unknown 0.06
23 education_tertiary 0.06
20 marital_single 0.06
6 job_admin. 0.05
18 marital_divorced 0.04
15 job_technician 0.03
10 job_management 0.02
13 job_services 0.01
51 agerange_90-99 0.01
28 month_apr 0.00
1 balance 0.00
4 day 0.00

When evaluating the features of importance, after the use of one hot encoding, it seems that poutcome_success (coefficient: 1.45) has the most influence on this model when making predictions. Month_march (1.04), month_jan (1.02), contact_unknown(0.89), month_october (0.75), also exert high influence on this model, as well. The components that exert the most influence are for the most part the poutcome variables, the month variables, and contact variables.

Variables with moderate impact include housing (0.47), loan (0.35), and several of the age ranges, such as 70-79 (0.54) and 50-59 (0.47).

Other variables, such as balance, day, and month_apr, have coefficients of essentially zero and make little impact on the logistic regression model.
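Because the table above takes the absolute value of each coefficient, it hides whether a feature pushes the prediction toward or away from subscribing. The cell below is a small sketch, assuming model is still the fitted logistic regression and X_train the one-hot-encoded training frame, that lists the signed coefficients together with their odds ratios.

In [ ]:
#Sketch: signed logistic regression coefficients and odds ratios.
#A positive coefficient (odds ratio > 1) raises the odds of subscribing;
#a negative coefficient (odds ratio < 1) lowers them.
signed = pd.DataFrame({'Col': X_train.columns,
                       'Coeff': model.coef_[0],
                       'Odds_Ratio': np.exp(model.coef_[0])})
signed.sort_values(by='Coeff', ascending=False).head(10)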

Decision Tree Model

In [1165]:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
Out[1165]:
DecisionTreeClassifier(random_state=1)
In [1166]:
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
0.9994628242803425
0.8271158950162194

This shows the model has overfit the training data: training accuracy is nearly perfect while test accuracy is much lower. So we prune the tree by limiting its depth.
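One way to choose the pruning depth is to sweep a few values of max_depth and watch where the gap between training and testing accuracy starts to widen. A minimal sketch, assuming X_train, X_test, y_train, and y_test from the earlier split are still in scope:

In [ ]:
#Sketch: compare training and testing accuracy over a range of tree depths
#to see where overfitting begins.
for depth in range(3, 12):
    t = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=1)
    t.fit(X_train, y_train)
    print(depth,
          round(t.score(X_train, y_train), 4),
          round(t.score(X_test, y_test), 4))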

In [1167]:
#Pruning

dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 7, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
0.8978734161215913
0.8936154526688292

The pruned model fits the training and the testing data comparably well; the two scores are very close to one another, so the overfitting is largely resolved.

In [1168]:
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = X_train.columns))
                          Imp
default              0.000000
balance              0.025421
housing              0.014305
loan                 0.001614
day                  0.057128
campaign             0.008056
job_admin.           0.005837
job_blue-collar      0.000000
job_entrepreneur     0.001912
job_housemaid        0.000000
job_management       0.000712
job_retired          0.001274
job_self-employed    0.001095
job_services         0.000000
job_student          0.001232
job_technician       0.004263
job_unemployed       0.000000
job_unknown          0.000000
marital_divorced     0.004436
marital_married      0.001687
marital_single       0.001183
education_primary    0.000411
education_secondary  0.001754
education_tertiary   0.001814
education_unknown    0.003636
contact_cellular     0.002094
contact_telephone    0.000000
contact_unknown      0.070196
month_apr            0.003303
month_aug            0.004610
month_dec            0.000000
month_feb            0.008061
month_jan            0.000000
month_jul            0.000000
month_jun            0.055266
month_mar            0.056887
month_may            0.008904
month_nov            0.002239
month_oct            0.058660
month_sep            0.050280
poutcome_failure     0.001273
poutcome_other       0.000000
poutcome_success     0.501241
poutcome_unknown     0.001162
agerange_18-29       0.000000
agerange_30-39       0.000000
agerange_40-49       0.006180
agerange_50-59       0.002169
agerange_60-69       0.029708
agerange_70-79       0.000000
agerange_80-89       0.000000
agerange_90-99       0.000000

As was observed in the logistic regression model, poutcome_success (0.50) has the highest impact on the model. contact_unknown is the second most influential variable. Among the months, month_mar, month_jun, month_sep, and month_oct play a moderate role in this prediction model. Interestingly, month_jan does not appear as important here as it did in the logistic regression model. The remaining variables contribute comparatively little.
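For readability, the same importances can be sorted so the most influential features sit at the top and the zero-importance ones are easy to spot. A small sketch, assuming dTreeR and X_train from the cells above:

In [ ]:
#Sketch: the pruned tree's feature importances, sorted in descending order.
imp = pd.DataFrame(dTreeR.feature_importances_, columns=['Imp'], index=X_train.columns)
imp.sort_values(by='Imp', ascending=False).head(10)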

In [1169]:
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)

cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
                  columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.8936154526688292
Out[1169]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc906d400>
In [1170]:
print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.21405544809800128

Precision: 0.5971223021582733

F1 Score: 0.31514000949216897

Roc Auc Score: 0.5977044908849284
In [1171]:
#Making dataframe for Decision Tree metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Decision_Tree':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1172]:
dataframe2 = pd.DataFrame(data)
In [1173]:
dataframe2
Out[1173]:
Metrics Decision_Tree
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.214055
3 Precision 0.597122
4 F1 Score 0.315140
5 ROC AUC Score 0.597704

From these results, the decision tree model performs better than the logistic regression model: both the F1 score and the ROC AUC score are higher. Recall has risen to 0.214, while precision stays about the same at roughly 0.60. The number of true positives has increased to 332, an improvement over the previous model; 1219 individuals are false negatives and 224 are false positives.

Again, the bank would gain because the false negatives outnumber the false positives, but the model is still not performing adequately because recall remains low. The precision value is adequate.

The net gain for the bank, beyond the true positive cases, is predicted to be 1219 - 224 = 995 term deposit accounts.
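Since low recall is the recurring weakness, one option worth noting is to weight the minority class more heavily when growing the tree. The cell below is a sketch only, not part of the original analysis; the names dTreeW and pred_w are illustrative, and class_weight='balanced' is a standard scikit-learn parameter.

In [ ]:
#Sketch: a class-weighted version of the pruned tree, which typically trades
#some precision for higher recall on the minority ("yes") class.
from sklearn.metrics import recall_score, precision_score

dTreeW = DecisionTreeClassifier(criterion='gini', max_depth=7,
                                class_weight='balanced', random_state=1)
dTreeW.fit(X_train, y_train)
pred_w = dTreeW.predict(X_test)
print("Recall:", round(recall_score(y_test, pred_w), 3),
      "Precision:", round(precision_score(y_test, pred_w), 3))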

In [1174]:
#Bagging
from sklearn.ensemble import BaggingClassifier

bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)

bgcl = bgcl.fit(X_train, y_train)
In [1175]:
y_predict = bgcl.predict(X_test)

print(bgcl.score(X_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
                  columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.890002948982601
Out[1175]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcce8d9df0>
In [1176]:
print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.2746615087040619

Precision: 0.5372005044136192

F1 Score: 0.36348122866894206

Roc Auc Score: 0.6220556357305376
In [1177]:
#Making dataframe for Bagging metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Bagging':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1178]:
dataframe3 = pd.DataFrame(data)
In [1179]:
dataframe3
Out[1179]:
Metrics Bagging
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.274662
3 Precision 0.537201
4 F1 Score 0.363481
5 ROC AUC Score 0.622056

For the bagging model, recall (0.27) has improved but precision (0.53) has gone down. The question to ask is whether recall or precision matters more for the bank. As with the other models, accuracy is high because of the large number of true negatives. There is still a net gain despite the misclassifications, since the false negatives outnumber the false positives: beyond the true positive cases, the predicted net gain is 1125 - 367 = 758 customers (term deposit accounts).
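Recall and precision can also be traded off on the same model by moving the probability threshold instead of switching algorithms. A minimal sketch, assuming bgcl is the fitted bagging classifier from above; the 0.5 threshold is what predict uses by default.

In [ ]:
#Sketch: vary the decision threshold on the bagging model's predicted
#probabilities and watch recall and precision move in opposite directions.
from sklearn.metrics import recall_score, precision_score

proba = bgcl.predict_proba(X_test)[:, 1]   #probability of class 1 (subscribes)
for thr in [0.3, 0.4, 0.5, 0.6]:
    pred = (proba >= thr).astype(int)
    print(thr,
          "recall:", round(recall_score(y_test, pred), 3),
          "precision:", round(precision_score(y_test, pred), 3))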

In [1180]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(X_train, y_train)
In [1181]:
y_predict = abcl.predict(X_test)
print(abcl.score(X_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
                  columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.8951636685343557
Out[1181]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc47dbf70>
In [1182]:
print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.18633139909735655

Precision: 0.643652561247216

F1 Score: 0.28900000000000003

Roc Auc Score: 0.5865062472886267
In [1183]:
#Making dataframe for AdaBoost metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'AdaBoost':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1184]:
dataframe4 = pd.DataFrame(data)
In [1185]:
dataframe4
Out[1185]:
Metrics AdaBoost
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.186331
3 Precision 0.643653
4 F1 Score 0.289000
5 ROC AUC Score 0.586506

The recall score is 0.19, which is lower than both the decision tree model and the bagging model. The precision score for the AdaBoost model is 0.64, which is higher than all the previous models.

The net gain for the bank, beyond the true positive cases, is predicted to be 1262 - 160 = 1102 term deposit accounts.
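The AdaBoost model above uses only 10 estimators, so part of its weak recall may simply be under-training. The cell below is a sketch of a small grid search scored on recall; the parameter grid is an assumption, not something tuned in the original analysis.

In [ ]:
#Sketch: small grid search over AdaBoost settings, scored on recall
#(the weak spot across all of the models so far).
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

grid = GridSearchCV(AdaBoostClassifier(random_state=1),
                    param_grid={'n_estimators': [10, 50, 100],
                                'learning_rate': [0.5, 1.0]},
                    scoring='recall', cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))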

In [1186]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
In [1187]:
y_predict = gbcl.predict(X_test)
print(gbcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
                  columns = [i for i in ["Prediction_No","Prediction_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.8948687702742554
Out[1187]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcc55c0520>
In [1188]:
#Metrics
print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.1702127659574468

Precision: 0.6550868486352357

F1 Score: 0.27021494370522003

Roc Auc Score: 0.5793209838278035
In [1189]:
#Making dataframe for GradientBoost metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'GradientBoost':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1190]:
dataframe5 = pd.DataFrame(data)
In [1191]:
dataframe5
Out[1191]:
Metrics GradientBoost
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.170213
3 Precision 0.655087
4 F1 Score 0.270215
5 ROC AUC Score 0.579321

The recall score is 0.17, which is lower than all of the other models. The precision score is 0.66 for the GradientBoosting model, which is higher than all the previous models.

The net gain for the bank, beyond the true positive cases, is predicted to be 1287 - 139 = 1148 term deposit accounts.
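GradientBoostingClassifier exposes staged_predict, which makes it possible to watch a metric evolve as boosting stages are added and to judge whether 50 estimators are enough. A sketch, assuming gbcl is the fitted gradient boosting model from above; printing every tenth stage just keeps the output short.

In [ ]:
#Sketch: test-set recall after every 10 boosting stages of the fitted
#gradient boosting model.
from sklearn.metrics import recall_score

for i, staged_pred in enumerate(gbcl.staged_predict(X_test), start=1):
    if i % 10 == 0:
        print(i, round(recall_score(y_test, staged_pred), 3))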

EXTRA

In [1192]:
#Random forest model:

from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(X_train, y_train)
In [1193]:
#Confusion Matrix
y_predict = rfcl.predict(X_test)
print(rfcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["Outcome_No","Outcome_Yes"]],
                  columns = [i for i in ["Predict_No","Predict_Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.891108817457977
Out[1193]:
<matplotlib.axes._subplots.AxesSubplot at 0x2dcd32f39d0>
In [1194]:
#Metrics
print("Trainig accuracy",model.score(X_train,y_train))  
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Training accuracy 0.8909533289095333

Testing accuracy 0.8930256561486287

Recall: 0.2488716956802063

Precision: 0.5530085959885387

F1 Score: 0.34326367274344155

Roc Auc Score: 0.6114499159330026

The recall score is 0.25. The precision score is 0.55 for the random forest, which is not as high as GradientBoost.

The net gain for the bank, beyond the true positive cases, is predicted to be 1165 - 312 = 853 term deposit accounts.
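For comparison with the single decision tree, the random forest's feature importances can be listed the same way. A small sketch, assuming rfcl and X_train from the cells above:

In [ ]:
#Sketch: the random forest's feature importances, sorted in descending order,
#to compare against the pruned decision tree's importances earlier.
rf_imp = pd.DataFrame(rfcl.feature_importances_, columns=['Imp'], index=X_train.columns)
rf_imp.sort_values(by='Imp', ascending=False).head(10)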

In [1195]:
#Making dataframe out of the random forest metrics
data ={'Metrics':['Training Accuracy', 'Testing Accuracy', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'], 'Random_Forest':[model.score(X_train,y_train), model.score(X_test, y_test), recall_score(y_test,y_predict), precision_score(y_test,y_predict), f1_score(y_test,y_predict), roc_auc_score(y_test,y_predict)]} 
In [1196]:
dataframe6 = pd.DataFrame(data)
In [1197]:
dataframe6
Out[1197]:
Metrics Random_Forest
0 Training Accuracy 0.890953
1 Testing Accuracy 0.893026
2 Recall 0.248872
3 Precision 0.553009
4 F1 Score 0.343264
5 ROC AUC Score 0.611450
In [1198]:
# 3. Merge all dataframes together to make a large dataframe with all metrics for all algorithms

Metrics_Dataframe = pd.merge(dataframe1,dataframe2,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe3,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe4,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe5,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe, dataframe6,how='outer',on='Metrics')
In [1014]:
Metrics_Dataframe
Out[1014]:
Metrics Logistic_Regression Decision_Tree Bagging AdaBoost GradientBoost Random_Forest
0 Training Accuracy 0.890953 0.890953 0.890953 0.890953 0.890953 0.890953
1 Testing Accuracy 0.893026 0.893026 0.893026 0.893026 0.893026 0.893026
2 Recall 0.185042 0.214055 0.274662 0.186331 0.170213 0.248872
3 Precision 0.605485 0.597122 0.537201 0.643653 0.655087 0.553009
4 F1 Score 0.283457 0.315140 0.363481 0.289000 0.270215 0.343264
5 ROC AUC Score 0.584738 0.597704 0.622056 0.586506 0.579321 0.611450

The training and testing accuracy rows are identical across all columns because the metric cells re-score the same logistic regression object (model); each classifier's own test accuracy, printed above its confusion matrix, is also around 0.89, so overall accuracy is very similar across the models.

It seems that of all the models, bagging has the best recall score but GradientBoost has the best precision.

Bagging has the highest ROC area under the curve score and may be the best model overall.

However, the model that generates the most gain from the misclassification errors is GradientBoost: it has the highest precision and the fewest false positives of all the models.

In other words, for the bank to profit from the misclassifications, you would want a model with more false negatives (customers predicted not to subscribe to a term deposit who actually did) and fewer false positives (customers predicted to subscribe who ended up not participating). Such a model, however, is not necessarily the one that makes the best predictions.

Bagging, with the highest ROC AUC and F1 scores, appears to make the best predictions, but the bank earns more with a model like GradientBoost because it has the fewest false positives and the highest precision.
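A quick way to read the comparison table is to ask which model wins each metric. A one-line sketch, assuming Metrics_Dataframe from the merge above (for the identical accuracy rows, idxmax simply returns the first column):

In [ ]:
#Sketch: for each metric in the merged table, report the highest-scoring model.
best = Metrics_Dataframe.set_index('Metrics').idxmax(axis=1)
print(best)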

In [ ]: