Tiffany Chan

Supervised Learning Project.

Thera Bank Personal Loan Campaign

1. Import the datasets and libraries, check datatype, statistical summary, shape, null values or incorrect imputation. (5 marks)

In [496]:
## Importing necessary libraries for this assignment.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
In [497]:
#Reading CSV file for analysis and renaming the dataset.
bank = pd.read_csv("Bank_Personal_Loan_Modelling.csv")
In [498]:
#Looking at first 5 observations in dataset.
bank.head()
Out[498]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [499]:
#Data type of variables in the dataset.
bank.dtypes
Out[499]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object

In the data, there are 14 variables. Thirteen are integer variables, and 1 is a float (which has decimals).

In [500]:
#Descriptive statistics of variables
bank.describe().transpose()
Out[500]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Interpreting ID, and zip code's descriptive statistics do not make sense because the numbers are mainly characters and do not hold any numerical value.

In [467]:
#Shape of data.
bank.shape
Out[467]:
(5000, 14)
In [468]:
#Seeing if there are any variables with null values
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

There are no null values in the data. All cells in the dataset are filled.

In [501]:
#Checking to see unique values in Age, Experience, Income, Family, CCAvg, Education, Mortgage, Personal Loan,
#Securities Account, CD Account, Online and CreditCard
print("Age")
print(np.unique(bank[['Age']].values))
print("")
print("Experience")
print(np.unique(bank[['Experience']].values))
print("")
print("Income")
print(np.unique(bank[['Income']].values))
print("")
print("Family")
print(np.unique(bank[['Family']].values))
print("")
print("CCAvg")
print(np.unique(bank[['CCAvg']].values))
print("")
print("Education")
print(np.unique(bank[['Education']].values))
print("")
print("Mortgage")
print(np.unique(bank[['Mortgage']].values))
print("")
print("Personal Loan")
print(np.unique(bank[['Personal Loan']].values))
print("")
print("Securities Account")
print(np.unique(bank[['Securities Account']].values))
print("")
print("CD Account")
print(np.unique(bank[['CD Account']].values))
print("")
print("Online")
print(np.unique(bank[['Online']].values))
print("")
print("Credit Card")
print(np.unique(bank[['CreditCard']].values))
print("ZIP Code")
print(np.unique(bank['ZIP Code'].values))
Age
[23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67]

Experience
[-3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43]

Income
[  8   9  10  11  12  13  14  15  18  19  20  21  22  23  24  25  28  29
  30  31  32  33  34  35  38  39  40  41  42  43  44  45  48  49  50  51
  52  53  54  55  58  59  60  61  62  63  64  65  68  69  70  71  72  73
  74  75  78  79  80  81  82  83  84  85  88  89  90  91  92  93  94  95
  98  99 100 101 102 103 104 105 108 109 110 111 112 113 114 115 118 119
 120 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141
 142 143 144 145 148 149 150 151 152 153 154 155 158 159 160 161 162 163
 164 165 168 169 170 171 172 173 174 175 178 179 180 181 182 183 184 185
 188 189 190 191 192 193 194 195 198 199 200 201 202 203 204 205 218 224]

Family
[1 2 3 4]

CCAvg
[ 0.    0.1   0.2   0.3   0.4   0.5   0.6   0.67  0.7   0.75  0.8   0.9
  1.    1.1   1.2   1.3   1.33  1.4   1.5   1.6   1.67  1.7   1.75  1.8
  1.9   2.    2.1   2.2   2.3   2.33  2.4   2.5   2.6   2.67  2.7   2.75
  2.8   2.9   3.    3.1   3.2   3.25  3.3   3.33  3.4   3.5   3.6   3.67
  3.7   3.8   3.9   4.    4.1   4.2   4.25  4.3   4.33  4.4   4.5   4.6
  4.67  4.7   4.75  4.8   4.9   5.    5.1   5.2   5.3   5.33  5.4   5.5
  5.6   5.67  5.7   5.8   5.9   6.    6.1   6.2   6.3   6.33  6.4   6.5
  6.6   6.67  6.7   6.8   6.9   7.    7.2   7.3   7.4   7.5   7.6   7.8
  7.9   8.    8.1   8.2   8.3   8.5   8.6   8.8   8.9   9.    9.3  10.  ]

Education
[1 2 3]

Mortgage
[  0  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109
 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
 255 256 257 258 259 260 262 263 264 265 266 267 268 270 271 272 273 275
 276 277 278 280 281 282 283 284 285 286 287 289 290 292 293 294 295 296
 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314
 315 318 319 321 322 323 325 326 327 328 329 330 331 333 334 336 337 341
 342 343 344 345 351 352 353 354 355 357 358 359 360 361 364 366 368 372
 373 374 378 380 381 382 383 385 389 391 392 394 396 397 398 400 402 403
 405 406 408 410 412 415 416 419 421 422 427 428 429 431 432 433 437 442
 446 449 452 455 458 461 464 466 467 470 475 477 481 483 485 496 500 505
 508 509 522 524 535 541 547 550 553 565 567 569 571 577 581 582 587 589
 590 601 612 617 635]

Personal Loan
[0 1]

Securities Account
[0 1]

CD Account
[0 1]

Online
[0 1]

Credit Card
[0 1]
ZIP Code
[ 9307 90005 90007 90009 90011 90016 90018 90019 90024 90025 90027 90028
 90029 90032 90033 90034 90035 90036 90037 90041 90044 90045 90048 90049
 90057 90058 90059 90064 90065 90066 90068 90071 90073 90086 90089 90095
 90210 90212 90230 90232 90245 90250 90254 90266 90272 90274 90275 90277
 90280 90291 90304 90401 90404 90405 90502 90503 90504 90505 90509 90601
 90623 90630 90638 90639 90640 90650 90717 90720 90740 90745 90747 90755
 90813 90840 91006 91007 91016 91024 91030 91040 91101 91103 91105 91107
 91109 91116 91125 91129 91203 91207 91301 91302 91304 91311 91320 91326
 91330 91335 91342 91343 91345 91355 91360 91361 91365 91367 91380 91401
 91423 91604 91605 91614 91706 91709 91710 91711 91730 91741 91745 91754
 91763 91765 91768 91770 91773 91775 91784 91791 91801 91902 91910 91911
 91941 91942 91950 92007 92008 92009 92024 92028 92029 92037 92038 92054
 92056 92064 92068 92069 92084 92093 92096 92101 92103 92104 92106 92109
 92110 92115 92116 92120 92121 92122 92123 92124 92126 92129 92130 92131
 92152 92154 92161 92173 92177 92182 92192 92220 92251 92325 92333 92346
 92350 92354 92373 92374 92399 92407 92507 92518 92521 92606 92612 92614
 92624 92626 92630 92634 92646 92647 92648 92653 92660 92661 92672 92673
 92675 92677 92691 92692 92694 92697 92703 92704 92705 92709 92717 92735
 92780 92806 92807 92821 92831 92833 92834 92835 92843 92866 92867 92868
 92870 92886 93003 93009 93010 93014 93022 93023 93033 93063 93065 93101
 93105 93106 93107 93108 93109 93111 93117 93118 93302 93305 93311 93401
 93403 93407 93437 93460 93524 93555 93561 93611 93657 93711 93720 93727
 93907 93933 93940 93943 93950 93955 94002 94005 94010 94015 94019 94022
 94024 94025 94028 94035 94040 94043 94061 94063 94065 94066 94080 94085
 94086 94087 94102 94104 94105 94107 94108 94109 94110 94111 94112 94114
 94115 94116 94117 94118 94122 94123 94124 94126 94131 94132 94143 94234
 94301 94302 94303 94304 94305 94306 94309 94402 94404 94501 94507 94509
 94521 94523 94526 94534 94536 94538 94539 94542 94545 94546 94550 94551
 94553 94555 94558 94566 94571 94575 94577 94583 94588 94590 94591 94596
 94598 94604 94606 94607 94608 94609 94610 94611 94612 94618 94701 94703
 94704 94705 94706 94707 94708 94709 94710 94720 94801 94803 94806 94901
 94904 94920 94923 94928 94939 94949 94960 94965 94970 94998 95003 95005
 95006 95008 95010 95014 95020 95023 95032 95035 95037 95039 95045 95051
 95053 95054 95060 95064 95070 95112 95120 95123 95125 95126 95131 95133
 95134 95135 95136 95138 95192 95193 95207 95211 95307 95348 95351 95354
 95370 95403 95405 95422 95449 95482 95503 95518 95521 95605 95616 95617
 95621 95630 95670 95678 95741 95747 95758 95762 95812 95814 95816 95817
 95818 95819 95820 95821 95822 95825 95827 95828 95831 95833 95841 95842
 95929 95973 96001 96003 96008 96064 96091 96094 96145 96150 96651]

From looking at the unique values in the variables of interest, 'Experience' has negative numbers, which does not make any sense. Negative number of years of professional experience does not make sense. Ultimately, these values will need to be newly imputed later on.

2. EDA: Study the data distribution in each attribute and target variable, share your findings (20 marks)

Number of unique in each column?

Number of people with zero mortgage?

Number of people with zero credit card spending per month?

Value counts of all categorical columns.

Univariate and Bivariate

Get data model ready

In [470]:
#Number of unique in each column.
bank.nunique()
Out[470]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIP Code               467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal Loan            2
Securities Account       2
CD Account               2
Online                   2
CreditCard               2
dtype: int64
In [492]:
#Subset Mortgage = 0, and make a new dataframe and count the number of 0s
print("Number of Zeroes in Mortgage:")
bank_mortgage = bank[bank['Mortgage'] == 0]             
print(bank_mortgage['Mortgage'].value_counts().sum())  
#Double check
print("")
print("Double check")
print(bank['Mortgage'].value_counts())
Number of Zeroes in Mortgage:
3462

Double check
0      3462
98       17
103      16
119      16
83       16
91       16
89       16
90       15
102      15
78       15
118      14
101      14
87       14
94       14
131      14
104      14
109      13
106      13
116      13
144      13
112      13
81       13
185      12
76       12
120      12
97       12
121      12
100      12
158      11
184      11
95       11
111      11
86       11
137      11
79       11
84       11
153      11
115      11
161      10
82       10
135      10
151      10
117      10
113      10
142      10
108      10
149       9
166       9
159       9
88        9
167       9
123       9
221       9
128       9
146       9
148       9
110       8
138       8
170       8
174       8
122       8
75        8
218       8
194       8
114       8
147       8
157       8
105       8
169       8
207       8
205       8
132       8
140       8
164       8
204       7
249       7
129       7
126       7
219       7
154       7
124       7
85        7
96        7
80        7
196       7
230       7
229       7
127       7
193       7
107       6
251       6
130       6
136       6
99        6
125       6
236       6
141       6
145       6
240       6
150       6
187       6
163       6
182       6
192       6
155       6
134       6
294       5
171       5
256       5
178       5
198       5
301       5
180       5
307       5
172       5
188       5
272       5
227       5
156       5
209       5
212       5
220       5
217       5
93        4
162       4
241       4
189       4
181       4
297       4
190       4
245       4
77        4
247       4
211       4
232       4
199       4
224       4
244       4
342       4
152       4
179       4
175       4
203       4
231       4
139       4
200       4
92        4
223       3
352       3
328       3
308       3
177       3
197       3
239       3
165       3
215       3
226       3
255       3
160       3
380       3
400       3
341       3
143       3
333       3
329       3
325       3
428       3
257       3
305       3
176       3
168       3
213       3
263       3
248       3
310       3
214       3
260       3
264       3
238       3
242       3
186       3
327       3
323       3
270       3
319       3
282       3
285       3
315       3
277       3
133       3
228       3
268       3
275       3
422       3
366       3
216       3
208       3
427       2
243       2
267       2
455       2
271       2
303       2
359       2
299       2
233       2
287       2
225       2
183       2
201       2
259       2
222       2
289       2
312       2
292       2
280       2
276       2
442       2
402       2
394       2
358       2
354       2
330       2
322       2
314       2
306       2
252       2
266       2
262       2
250       2
246       2
234       2
202       2
565       2
304       2
392       2
364       2
293       2
309       2
313       2
357       2
372       2
397       2
368       2
408       2
437       2
449       2
360       1
416       1
432       1
452       1
464       1
508       1
496       1
300       1
412       1
284       1
500       1
524       1
296       1
396       1
344       1
336       1
635       1
612       1
283       1
195       1
191       1
590       1
582       1
550       1
522       1
470       1
466       1
458       1
446       1
410       1
406       1
398       1
382       1
378       1
374       1
334       1
235       1
295       1
318       1
311       1
571       1
567       1
547       1
535       1
483       1
475       1
467       1
431       1
419       1
415       1
403       1
391       1
383       1
355       1
351       1
343       1
331       1
326       1
302       1
173       1
461       1
429       1
421       1
405       1
389       1
385       1
381       1
373       1
361       1
353       1
345       1
337       1
321       1
281       1
273       1
265       1
253       1
237       1
433       1
477       1
298       1
481       1
290       1
286       1
278       1
258       1
210       1
206       1
617       1
601       1
589       1
581       1
587       1
569       1
553       1
541       1
509       1
505       1
485       1
577       1
Name: Mortgage, dtype: int64
In [494]:
#Subset CreditCard = 0, and make a new dataframe and count the number of 0s
print("Number of Zeroes in CreditCard:")
bank_credit_card = bank[bank['CreditCard'] == 0]
print(bank_credit_card['CreditCard'].value_counts().sum())
#Double check
print("")
print("Double Check")
print(bank['CreditCard'].value_counts())
Number of Zeroes in CreditCard:
3530

Double Check
0    3530
1    1470
Name: CreditCard, dtype: int64
In [495]:
#Categorical variables in dataset include: Family, Education, Personal Loan, Securities Account, CD Account, Online, Credit Card
print("Family")
print(bank['Family'].value_counts())
print("")
print("Education")
print(bank['Education'].value_counts())
print("")
print("Personal Loan")
print(bank['Personal Loan'].value_counts())
print("")
print("Securities Account")
print(bank['Securities Account'].value_counts())
print("")
print("CD Account")
print(bank['CD Account'].value_counts())
print("")
print("Online")
print(bank['Online'].value_counts())
print("")
print("CreditCard")
print(bank['CreditCard'].value_counts())
print("ZIP Code")
print(bank['ZIP Code'].value_counts())
Family
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64

Education
1    2096
3    1501
2    1403
Name: Education, dtype: int64

Personal Loan
0    4520
1     480
Name: Personal Loan, dtype: int64

Securities Account
0    4478
1     522
Name: Securities Account, dtype: int64

CD Account
0    4698
1     302
Name: CD Account, dtype: int64

Online
1    2984
0    2016
Name: Online, dtype: int64

CreditCard
0    3530
1    1470
Name: CreditCard, dtype: int64
ZIP Code
94720    169
94305    127
95616    116
90095     71
93106     57
92037     54
93943     54
91320     53
91711     52
94025     52
92093     51
90245     50
90024     50
90089     46
91330     46
92121     45
94304     45
94143     37
95051     34
94608     34
92182     32
92028     32
92521     32
95054     31
95814     30
95014     29
94542     27
94301     27
94550     27
93407     26
95064     26
95819     26
94501     26
95039     26
94105     25
94303     25
95060     25
94022     25
91107     25
94596     24
93117     24
93555     23
94080     23
95521     23
92717     22
91380     22
92612     22
94110     21
92647     21
91768     21
90034     20
90401     20
94117     20
94132     20
95747     19
92697     19
90025     19
94005     19
92122     19
90266     19
93940     18
94709     18
90840     18
92130     18
94122     18
94309     18
90291     18
92120     17
94583     17
92691     17
93023     17
92009     17
90630     17
92407     17
91311     17
90740     17
92096     17
92126     16
92677     16
94539     16
95136     16
90277     16
92780     16
94115     15
94061     15
92354     15
91604     15
94588     15
95929     15
95134     15
94928     15
94111     14
94920     14
92152     14
92507     14
94545     14
94590     14
95020     14
90064     14
92115     13
94611     13
92064     13
94704     13
94609     13
92103     13
95762     13
92007     13
94063     12
95133     12
95053     12
95818     12
94701     12
94109     12
94402     12
93955     12
93108     12
91775     12
95670     12
91355     12
94112     11
95008     11
95827     11
94024     11
92831     11
95035     11
93109     11
94801     11
92123     11
92110     11
94306     11
90210     11
92008     11
92173     10
94010     10
92373     10
91360     10
91902     10
94086     10
93907     10
92374     10
90041     10
95630     10
94706     10
90717     10
90405     10
92870     10
93305     10
94035     10
94065     10
92646     10
94998     10
91304     10
91605     10
93014     10
95032     10
95120      9
95023      9
91103      9
93711      9
91302      9
95123      9
94710      9
90250      9
95351      9
91125      9
95605      9
91335      9
90049      9
92660      9
96001      9
92806      9
94102      9
90033      9
94131      9
91910      9
92101      8
90028      8
95831      8
94104      8
90032      8
92672      8
92104      8
90747      8
90065      8
92630      8
92626      8
90503      8
93561      8
92821      8
95833      8
90009      8
93105      8
94591      8
90212      8
91016      8
92866      8
91101      8
95841      7
95138      7
93101      7
94607      7
90254      7
94551      7
90036      7
95370      7
91105      7
94002      7
95211      7
92220      7
94085      7
95821      7
95006      7
94234      7
91763      7
92124      7
94114      7
91423      7
92131      7
94015      7
92333      7
91942      7
91950      7
92038      7
94534      7
94123      7
92807      7
93118      7
92675      7
92709      7
94708      7
95112      7
93010      7
92606      7
92704      6
95812      6
95503      6
91706      6
94707      6
93727      6
91040      6
95207      6
95126      6
92868      6
91401      6
91365      6
91301      6
94949      6
91730      6
90502      6
91911      6
90007      6
94521      6
92346      6
94546      6
90073      6
90504      6
94538      6
96651      6
92056      6
94523      6
90035      6
91116      6
90071      6
94606      6
94043      6
94610      6
94960      6
91709      6
94923      6
96003      6
90058      6
95449      6
94904      6
90230      6
95621      6
94553      6
94040      6
91367      6
91765      6
91745      6
92029      6
92109      6
91741      6
90066      6
91754      5
94612      5
90274      5
90037      5
90029      5
91203      5
90509      5
90638      5
94555      5
93022      5
91770      5
95617      5
90019      5
95817      5
94124      5
92068      5
94066      5
96064      5
94901      5
92192      5
92703      5
90005      5
94107      5
93611      5
92054      5
93065      5
90404      5
92867      5
93302      5
91109      5
93107      5
95045      5
91007      5
95403      5
91006      5
92634      5
90232      5
95070      5
92154      5
93460      5
95822      5
93524      5
94939      5
92325      5
93437      5
90720      5
93950      5
92350      4
95678      4
94577      4
92084      4
95758      4
92129      4
90048      4
94536      4
92833      4
95354      4
94126      4
96150      4
93401      4
94705      4
95422      4
90639      4
95825      4
94118      4
90601      4
93003      4
95010      4
91361      4
93009      4
91030      4
95348      4
91342      4
91791      4
95003      4
94028      4
95973      4
92886      4
91614      4
95741      4
96091      4
92648      4
95828      4
90059      4
94302      4
92735      4
92069      4
94571      4
92624      4
90018      4
94108      4
95820      4
92653      4
95125      4
92661      3
92673      3
90640      3
94618      3
94566      3
92177      3
94806      3
94558      3
92251      3
93063      3
92835      3
91773      3
91801      3
92834      3
90755      3
90086      3
92614      3
92399      3
92518      3
90650      3
92843      3
90045      3
95518      3
93111      3
95192      3
93311      3
90275      3
93403      3
91710      3
92106      3
96008      3
93720      3
91345      3
90057      3
93657      3
95135      3
95131      3
93933      3
91343      3
94803      3
92692      3
94703      3
90505      3
92024      3
90011      3
90027      3
95405      2
90044      2
90016      2
94526      2
94509      2
90272      2
95005      2
90280      2
90623      2
95193      2
90304      2
95037      2
91207      2
95482      2
91129      2
95307      2
95842      2
91784      2
96094      2
92116      2
94507      2
94575      2
90745      2
93033      2
94019      2
92161      2
91941      2
94116      2
95816      2
91326      2
92705      2
94604      2
90813      1
94404      1
94965      1
9307       1
91024      1
92694      1
96145      1
94970      1
94598      1
90068      1
94087      1
Name: ZIP Code, dtype: int64
In [474]:
#Univariate analysis
columns = list(bank)[:] # Creating a new list with all columns 
bank[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); # Histogram of all columns

From the histograms above, you could see that Age, CCAvg, Experience, and Income are well represented. The income and CCAvg data are right skewed, suggesting the average is higher than the median value.

Plotting the histograms for ID and Zip Code do not make sense because they are and should be considered string variables.

The Mortgage variable is somewhat problematic because there is a large amount of zeroes in this variable, and imputation is most likely the better way of handling the zeroes. Unlike the other categorical variables, you could see on the x-axis, this is more likely a numerical continuous variable, and zeroes should be dealt with.

The other variables are categorical in nature, and plotting them this way is acceptable, but bar charts would have been better.

BIVARIATE ANALYSIS

In [175]:
# Pairplot gives us a rough idea of how the variables could be related.
sns.pairplot(bank[['Experience', 'Income', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account', 'CreditCard']]);
In [502]:
# Correlation
# A better way to evaluate the relationship between 2 variables is using correlation.
bank.corr()
Out[502]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
ID 1.000000 -0.008473 -0.008326 -0.017695 0.013432 -0.016797 -0.024675 0.021463 -0.013920 -0.024801 -0.016972 -0.006909 -0.002528 0.017028
Age -0.008473 1.000000 0.994215 -0.055269 -0.029216 -0.046418 -0.052012 0.041334 -0.012539 -0.007726 -0.000436 0.008043 0.013702 0.007681
Experience -0.008326 0.994215 1.000000 -0.046574 -0.028626 -0.052563 -0.050077 0.013152 -0.010582 -0.007413 -0.001232 0.010353 0.013898 0.008967
Income -0.017695 -0.055269 -0.046574 1.000000 -0.016410 -0.157501 0.645984 -0.187524 0.206806 0.502462 -0.002616 0.169738 0.014206 -0.002385
ZIP Code 0.013432 -0.029216 -0.028626 -0.016410 1.000000 0.011778 -0.004061 -0.017377 0.007383 0.000107 0.004704 0.019972 0.016990 0.007691
Family -0.016797 -0.046418 -0.052563 -0.157501 0.011778 1.000000 -0.109275 0.064929 -0.020445 0.061367 0.019994 0.014110 0.010354 0.011588
CCAvg -0.024675 -0.052012 -0.050077 0.645984 -0.004061 -0.109275 1.000000 -0.136124 0.109905 0.366889 0.015086 0.136534 -0.003611 -0.006689
Education 0.021463 0.041334 0.013152 -0.187524 -0.017377 0.064929 -0.136124 1.000000 -0.033327 0.136722 -0.010812 0.013934 -0.015004 -0.011014
Mortgage -0.013920 -0.012539 -0.010582 0.206806 0.007383 -0.020445 0.109905 -0.033327 1.000000 0.142095 -0.005411 0.089311 -0.005995 -0.007231
Personal Loan -0.024801 -0.007726 -0.007413 0.502462 0.000107 0.061367 0.366889 0.136722 0.142095 1.000000 0.021954 0.316355 0.006278 0.002802
Securities Account -0.016972 -0.000436 -0.001232 -0.002616 0.004704 0.019994 0.015086 -0.010812 -0.005411 0.021954 1.000000 0.317034 0.012627 -0.015028
CD Account -0.006909 0.008043 0.010353 0.169738 0.019972 0.014110 0.136534 0.013934 0.089311 0.316355 0.317034 1.000000 0.175880 0.278644
Online -0.002528 0.013702 0.013898 0.014206 0.016990 0.010354 -0.003611 -0.015004 -0.005995 0.006278 0.012627 0.175880 1.000000 0.004210
CreditCard 0.017028 0.007681 0.008967 -0.002385 0.007691 0.011588 -0.006689 -0.011014 -0.007231 0.002802 -0.015028 0.278644 0.004210 1.000000
In [503]:
#Correlation
#Correlation Matrix
def plot_corr(bank, size=18):
    corr = bank.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
In [504]:
plot_corr(bank)

Correlation Matrix Interpretation

Yellow suggests highly correlated, while dark blue suggests lowly correlated.

Age and Experience are highly correlated. To be exact, the Pearson's correlation value for this relationship is: 0.994. Since these two are so highly correlated, if we were to build a multivariate model, we should only include 1 of these, since the addition of the second variable would more likely not contribute much towards improving the model's performance.

CCAvg and Income have a Pearson's correlation value of 0.6.

Personal Loan and Income have a Pearson's correlation value of 0.5

Personal Loan and CCAvg have a Pearson's correlation value of 0.4

All other relationships are very lowly correlated.

Crosstabs to evaluate bivariate relationships.

Continuous Variables:

In [476]:
#I will only evaluate the mean for numerical continuous variables: Age, Experience, Income, CCAvg. and Mortgage
bank.groupby(["Personal Loan"]).mean()
Out[476]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Securities Account CD Account Online CreditCard
Personal Loan
0 2512.165487 45.367257 20.132301 66.237389 93152.428761 2.373451 1.729009 1.843584 51.789381 0.102212 0.035841 0.595796 0.293584
1 2390.650000 45.066667 19.843750 144.745833 93153.202083 2.612500 3.905354 2.233333 100.845833 0.125000 0.291667 0.606250 0.297917

The average income, CCAvg, and Mortgage were a lot higher for those who were offered and accepted the personal loan from last campaign. Age and experience did not differ too much between those who took out loans and those who didn't.

In [478]:
#I will also only evaluate the median for numerical continuous variables: Age, Experience, Income, CCAvg. and Mortgage
bank.groupby(["Personal Loan"]).median()
Out[478]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Securities Account CD Account Online CreditCard
Personal Loan
0 2518.5 45.0 20.0 59.0 93437.0 2.0 1.4 2.0 0.0 0.0 0.0 1.0 0.0
1 2342.0 45.0 20.0 142.5 93407.0 3.0 3.8 2.0 0.0 0.0 0.0 1.0 0.0

The median showed similar results as the mean did for all variables except Mortgage. Mortgage has a 0 median for both groups due to the excessive 0s. The reason why there are such high average values for the mortgage variable is because average is highly impacted by outliers. Mortgage may be a variable that should not be included into the model due to its excessive 0s, and outlier values.

Categorical Variables

In [479]:
pd.crosstab(bank['Family'],bank['Personal Loan'],normalize='columns')
Out[479]:
Personal Loan 0 1
Family
1 0.301991 0.222917
2 0.263274 0.220833
3 0.194027 0.277083
4 0.240708 0.279167

Families with more family members seem to take more personal loans and families with less family members take less personal loans.

In [334]:
pd.crosstab(bank['Education'],bank['Personal Loan'],normalize='columns')
Out[334]:
Personal Loan 0 1
Education
1 0.443142 0.193750
2 0.270133 0.379167
3 0.286726 0.427083

It seems that loans are offered and accepted more by individuals with more education than with less education.

In [335]:
pd.crosstab(bank['Online'],bank['Personal Loan'],normalize='columns')
Out[335]:
Personal Loan 0 1
Online
0 0.404204 0.39375
1 0.595796 0.60625

Online users have the same distribution when it comes to taking or not taking personal loans. The same trend is observed among non-online users.

In [338]:
pd.crosstab(bank['Securities Account'],bank['Personal Loan'],normalize='columns')
Out[338]:
Personal Loan 0 1
Securities Account
0 0.897788 0.875
1 0.102212 0.125

Those that have a securities account are taking slightly more personal loans.

In [337]:
pd.crosstab(bank['CD Account'],bank['Personal Loan'],normalize='columns')
Out[337]:
Personal Loan 0 1
CD Account
0 0.964159 0.708333
1 0.035841 0.291667

Those that have a CD account are taking more personal loans.

In [339]:
pd.crosstab(bank['CreditCard'],bank['Personal Loan'],normalize='columns')
Out[339]:
Personal Loan 0 1
CreditCard
0 0.706416 0.702083
1 0.293584 0.297917

Credit card holders have the same distribution when it comes to taking or not taking personal loans. The same trend is observed among non-credit card holders.

In [487]:
pd.crosstab(bank['ZIP Code'],bank['Personal Loan'],normalize='columns')
Out[487]:
Personal Loan 0 1
ZIP Code
9307 0.000221 0.000000
90005 0.001106 0.000000
90007 0.001327 0.000000
90009 0.001770 0.000000
90011 0.000664 0.000000
90016 0.000221 0.002083
90018 0.000885 0.000000
90019 0.000885 0.002083
90024 0.010841 0.002083
90025 0.003761 0.004167
90027 0.000442 0.002083
90028 0.001549 0.002083
90029 0.001106 0.000000
90032 0.001549 0.002083
90033 0.001991 0.000000
90034 0.004204 0.002083
90035 0.001327 0.000000
90036 0.001549 0.000000
90037 0.000885 0.002083
90041 0.002212 0.000000
90044 0.000442 0.000000
90045 0.000664 0.000000
90048 0.000885 0.000000
90049 0.001327 0.006250
90057 0.000442 0.002083
90058 0.001327 0.000000
90059 0.000442 0.004167
90064 0.002655 0.004167
90065 0.001327 0.004167
90066 0.001327 0.000000
90068 0.000221 0.000000
90071 0.001106 0.002083
90073 0.001106 0.002083
90086 0.000664 0.000000
90089 0.008407 0.016667
90095 0.013938 0.016667
90210 0.002212 0.002083
90212 0.001549 0.002083
90230 0.001106 0.002083
90232 0.001106 0.000000
90245 0.009956 0.010417
90250 0.001770 0.002083
90254 0.001327 0.002083
90266 0.004204 0.000000
90272 0.000442 0.000000
90274 0.001106 0.000000
90275 0.000664 0.000000
90277 0.003319 0.002083
90280 0.000442 0.000000
90291 0.003319 0.006250
90304 0.000442 0.000000
90401 0.004425 0.000000
90404 0.000885 0.002083
90405 0.001770 0.004167
90502 0.000885 0.004167
90503 0.001770 0.000000
90504 0.001106 0.002083
90505 0.000664 0.000000
90509 0.001106 0.000000
90601 0.000664 0.002083
90623 0.000442 0.000000
90630 0.003319 0.004167
90638 0.001106 0.000000
90639 0.000885 0.000000
90640 0.000442 0.002083
90650 0.000442 0.002083
90717 0.001991 0.002083
90720 0.001106 0.000000
90740 0.003540 0.002083
90745 0.000442 0.000000
90747 0.001770 0.000000
90755 0.000664 0.000000
90813 0.000221 0.000000
90840 0.003540 0.004167
91006 0.000885 0.002083
91007 0.001106 0.000000
91016 0.001549 0.002083
91024 0.000221 0.000000
91030 0.000885 0.000000
91040 0.001327 0.000000
91101 0.001327 0.004167
91103 0.001549 0.004167
91105 0.001327 0.002083
91107 0.004867 0.006250
91109 0.001106 0.000000
91116 0.001327 0.000000
91125 0.001770 0.002083
91129 0.000221 0.002083
91203 0.000885 0.002083
91207 0.000442 0.000000
91301 0.001327 0.000000
91302 0.001327 0.006250
91304 0.001991 0.002083
91311 0.003540 0.002083
91320 0.011283 0.004167
91326 0.000442 0.000000
91330 0.009513 0.006250
91335 0.001770 0.002083
91342 0.000885 0.000000
91343 0.000664 0.000000
91345 0.000664 0.000000
91355 0.001770 0.008333
91360 0.001770 0.004167
91361 0.000885 0.000000
91365 0.001106 0.002083
91367 0.001106 0.002083
91380 0.003761 0.010417
91401 0.001327 0.000000
91423 0.001106 0.004167
91604 0.002876 0.004167
91605 0.001991 0.002083
91614 0.000664 0.002083
91706 0.001327 0.000000
91709 0.001327 0.000000
91710 0.000664 0.000000
91711 0.010619 0.008333
91730 0.001327 0.000000
91741 0.001327 0.000000
91745 0.001327 0.000000
91754 0.001106 0.000000
91763 0.001549 0.000000
91765 0.001106 0.002083
91768 0.004204 0.004167
91770 0.001106 0.000000
91773 0.000664 0.000000
91775 0.002434 0.002083
91784 0.000442 0.000000
91791 0.000885 0.000000
91801 0.000664 0.000000
91902 0.001770 0.004167
91910 0.001991 0.000000
91911 0.001106 0.002083
91941 0.000442 0.000000
91942 0.001327 0.002083
91950 0.001549 0.000000
92007 0.001991 0.008333
92008 0.001991 0.004167
92009 0.003540 0.002083
92024 0.000664 0.000000
92028 0.006195 0.008333
92029 0.001327 0.000000
92037 0.010841 0.010417
92038 0.001327 0.002083
92054 0.001106 0.000000
92056 0.000664 0.006250
92064 0.002655 0.002083
92068 0.000885 0.002083
92069 0.000885 0.000000
92084 0.000885 0.000000
92093 0.009292 0.018750
92096 0.003540 0.002083
92101 0.001770 0.000000
92103 0.002876 0.000000
92104 0.001770 0.000000
92106 0.000442 0.002083
92109 0.001327 0.000000
92110 0.002212 0.002083
92115 0.002876 0.000000
92116 0.000442 0.000000
92120 0.003761 0.000000
92121 0.009292 0.006250
92122 0.003761 0.004167
92123 0.002212 0.002083
92124 0.001327 0.002083
92126 0.003097 0.004167
92129 0.000885 0.000000
92130 0.003982 0.000000
92131 0.001549 0.000000
92152 0.002434 0.006250
92154 0.001106 0.000000
92161 0.000442 0.000000
92173 0.001770 0.004167
92177 0.000664 0.000000
92182 0.005531 0.014583
92192 0.001106 0.000000
92220 0.001106 0.004167
92251 0.000664 0.000000
92325 0.001106 0.000000
92333 0.001106 0.004167
92346 0.001327 0.000000
92350 0.000885 0.000000
92354 0.003319 0.000000
92373 0.002212 0.000000
92374 0.002212 0.000000
92399 0.000664 0.000000
92407 0.003540 0.002083
92507 0.002876 0.002083
92518 0.000664 0.000000
92521 0.006416 0.006250
92606 0.001549 0.000000
92612 0.003982 0.008333
92614 0.000442 0.002083
92624 0.000885 0.000000
92626 0.001327 0.004167
92630 0.001770 0.000000
92634 0.001106 0.000000
92646 0.001549 0.006250
92647 0.004204 0.004167
92648 0.000885 0.000000
92653 0.000885 0.000000
92660 0.001549 0.004167
92661 0.000664 0.000000
92672 0.001327 0.004167
92673 0.000664 0.000000
92675 0.001327 0.002083
92677 0.002876 0.006250
92691 0.003761 0.000000
92692 0.000664 0.000000
92694 0.000221 0.000000
92697 0.003761 0.004167
92703 0.001106 0.000000
92704 0.001327 0.000000
92705 0.000442 0.000000
92709 0.001327 0.002083
92717 0.004204 0.006250
92735 0.000885 0.000000
92780 0.003540 0.000000
92806 0.001991 0.000000
92807 0.001549 0.000000
92821 0.001549 0.002083
92831 0.002212 0.002083
92833 0.000885 0.000000
92834 0.000442 0.002083
92835 0.000664 0.000000
92843 0.000664 0.000000
92866 0.001770 0.000000
92867 0.001106 0.000000
92868 0.001327 0.000000
92870 0.001991 0.002083
92886 0.000885 0.000000
93003 0.000664 0.002083
93009 0.000885 0.000000
93010 0.001327 0.002083
93014 0.001770 0.004167
93022 0.000664 0.004167
93023 0.003097 0.006250
93033 0.000442 0.000000
93063 0.000664 0.000000
93065 0.001106 0.000000
93101 0.001549 0.000000
93105 0.001770 0.000000
93106 0.011726 0.008333
93107 0.001106 0.000000
93108 0.001991 0.006250
93109 0.002434 0.000000
93111 0.000442 0.002083
93117 0.005088 0.002083
93118 0.001549 0.000000
93302 0.000885 0.002083
93305 0.001991 0.002083
93311 0.000442 0.002083
93401 0.000885 0.000000
93403 0.000442 0.002083
93407 0.004867 0.008333
93437 0.001106 0.000000
93460 0.001106 0.000000
93524 0.001106 0.000000
93555 0.004646 0.004167
93561 0.001327 0.004167
93611 0.001106 0.000000
93657 0.000664 0.000000
93711 0.001770 0.002083
93720 0.000442 0.002083
93727 0.001327 0.000000
93907 0.002212 0.000000
93933 0.000664 0.000000
93940 0.003540 0.004167
93943 0.011062 0.008333
93950 0.000885 0.002083
93955 0.001770 0.008333
94002 0.001327 0.002083
94005 0.004204 0.000000
94010 0.001991 0.002083
94015 0.001549 0.000000
94019 0.000442 0.000000
94022 0.004204 0.012500
94024 0.002434 0.000000
94025 0.010619 0.008333
94028 0.000885 0.000000
94035 0.002212 0.000000
94040 0.001327 0.000000
94043 0.001327 0.000000
94061 0.002876 0.004167
94063 0.002655 0.000000
94065 0.002212 0.000000
94066 0.001106 0.000000
94080 0.004867 0.002083
94085 0.001327 0.002083
94086 0.001770 0.004167
94087 0.000221 0.000000
94102 0.001549 0.004167
94104 0.001549 0.002083
94105 0.005310 0.002083
94107 0.001106 0.000000
94108 0.000442 0.004167
94109 0.002655 0.000000
94110 0.004425 0.002083
94111 0.002655 0.004167
94112 0.002434 0.000000
94114 0.001106 0.004167
94115 0.003097 0.002083
94116 0.000442 0.000000
94117 0.004425 0.000000
94118 0.000885 0.000000
94122 0.003540 0.004167
94123 0.001549 0.000000
94124 0.001106 0.000000
94126 0.000885 0.000000
94131 0.001549 0.004167
94132 0.004425 0.000000
94143 0.007522 0.006250
94234 0.001327 0.002083
94301 0.005531 0.004167
94302 0.000885 0.000000
94303 0.005088 0.004167
94304 0.008186 0.016667
94305 0.025221 0.027083
94306 0.001991 0.004167
94309 0.003540 0.004167
94402 0.002434 0.002083
94404 0.000221 0.000000
94501 0.005310 0.004167
94507 0.000442 0.000000
94509 0.000442 0.000000
94521 0.001327 0.000000
94523 0.001106 0.002083
94526 0.000442 0.000000
94534 0.001327 0.002083
94536 0.000664 0.002083
94538 0.001327 0.000000
94539 0.003540 0.000000
94542 0.005310 0.006250
94545 0.002876 0.002083
94546 0.001106 0.002083
94550 0.005973 0.000000
94551 0.001327 0.002083
94553 0.000885 0.004167
94555 0.000885 0.002083
94558 0.000664 0.000000
94566 0.000664 0.000000
94571 0.000885 0.000000
94575 0.000442 0.000000
94577 0.000885 0.000000
94583 0.003319 0.004167
94588 0.003097 0.002083
94590 0.002655 0.004167
94591 0.001770 0.000000
94596 0.004425 0.008333
94598 0.000221 0.000000
94604 0.000442 0.000000
94606 0.001106 0.002083
94607 0.001327 0.002083
94608 0.007301 0.002083
94609 0.002655 0.002083
94610 0.001106 0.002083
94611 0.002655 0.002083
94612 0.001106 0.000000
94618 0.000664 0.000000
94701 0.002655 0.000000
94703 0.000664 0.000000
94704 0.002434 0.004167
94705 0.000442 0.004167
94706 0.002212 0.000000
94707 0.001106 0.002083
94708 0.001549 0.000000
94709 0.003540 0.004167
94710 0.001770 0.002083
94720 0.033186 0.039583
94801 0.001991 0.004167
94803 0.000442 0.002083
94806 0.000664 0.000000
94901 0.000885 0.002083
94904 0.001106 0.002083
94920 0.002876 0.002083
94923 0.001327 0.000000
94928 0.002212 0.010417
94939 0.001106 0.000000
94949 0.001106 0.002083
94960 0.001106 0.002083
94965 0.000221 0.000000
94970 0.000221 0.000000
94998 0.001991 0.002083
95003 0.000885 0.000000
95005 0.000442 0.000000
95006 0.001549 0.000000
95008 0.001991 0.004167
95010 0.000664 0.002083
95014 0.005531 0.008333
95020 0.002876 0.002083
95023 0.001991 0.000000
95032 0.001549 0.006250
95035 0.002212 0.002083
95037 0.000442 0.000000
95039 0.004867 0.008333
95045 0.001106 0.000000
95051 0.006195 0.012500
95053 0.002434 0.002083
95054 0.005752 0.010417
95060 0.004646 0.008333
95064 0.005088 0.006250
95070 0.000885 0.002083
95112 0.001327 0.002083
95120 0.001991 0.000000
95123 0.001991 0.000000
95125 0.000664 0.002083
95126 0.001327 0.000000
95131 0.000664 0.000000
95133 0.002434 0.002083
95134 0.003097 0.002083
95135 0.000221 0.004167
95136 0.003097 0.004167
95138 0.001106 0.004167
95192 0.000442 0.002083
95193 0.000442 0.000000
95207 0.001327 0.000000
95211 0.001327 0.002083
95307 0.000442 0.000000
95348 0.000885 0.000000
95351 0.001991 0.000000
95354 0.000664 0.002083
95370 0.001549 0.000000
95403 0.000885 0.002083
95405 0.000442 0.000000
95422 0.000885 0.000000
95449 0.001106 0.002083
95482 0.000442 0.000000
95503 0.001106 0.002083
95518 0.000664 0.000000
95521 0.004867 0.002083
95605 0.001549 0.004167
95616 0.024336 0.012500
95617 0.001106 0.000000
95621 0.001327 0.000000
95630 0.002212 0.000000
95670 0.002655 0.000000
95678 0.000885 0.000000
95741 0.000664 0.002083
95747 0.003761 0.004167
95758 0.000664 0.002083
95762 0.002655 0.002083
95812 0.001327 0.000000
95814 0.005531 0.010417
95816 0.000442 0.000000
95817 0.001106 0.000000
95818 0.002212 0.004167
95819 0.005531 0.002083
95820 0.000885 0.000000
95821 0.001549 0.000000
95822 0.000885 0.002083
95825 0.000885 0.000000
95827 0.002434 0.000000
95828 0.000664 0.002083
95831 0.001549 0.002083
95833 0.001770 0.000000
95841 0.001327 0.002083
95842 0.000442 0.000000
95929 0.003097 0.002083
95973 0.000664 0.002083
96001 0.001991 0.000000
96003 0.001106 0.002083
96008 0.000221 0.004167
96064 0.001106 0.000000
96091 0.000885 0.000000
96094 0.000442 0.000000
96145 0.000221 0.000000
96150 0.000885 0.000000
96651 0.001327 0.000000

For all zip codes, there seems to not be a difference between those who take personal loans and those who don't. There are too few cases for each zip code to tell if there's a difference between those who got loans and those who didn't. All results are very close to 0.

In [289]:
# Getting Data Model Ready: 1. Replace negative values in Experience with the mean years of Experience. 
                          # Sidenote: Mortgage has excessive 0s for a continuous variable. If I were tasked with using this dataset as a small sample to predict the outcomes of a large population, I would impute a fraction of the zeroes in Mortgage to match the population that had 0 mortgages. Since that is not what we are trying to accomplish here, I will not impute that variable and will eliminate it from the model all together. 

#Impute Experience average for the negative values. 
from sklearn.impute import SimpleImputer
rep_neg3 = SimpleImputer(missing_values=-3, strategy="mean")
cols=['Experience']
imputer = rep_neg3.fit(bank[cols])
bank[cols] = imputer.transform(bank[cols])

rep_neg2 = SimpleImputer(missing_values=-2, strategy="mean")
cols=['Experience']
imputer = rep_neg2.fit(bank[cols])
bank[cols] = imputer.transform(bank[cols])

rep_neg1 = SimpleImputer(missing_values=-1, strategy="mean")
cols=['Experience']
imputer = rep_neg1.fit(bank[cols])
bank[cols] = imputer.transform(bank[cols])

#Double check
print("Experience")
print(np.unique(bank[['Experience']].values))
print(bank['Experience'].value_counts())
bank['Experience'].mean()
#The mean values differ based on the removal of these negative values from the dataset. However, the means are very close to one another and range from 20.12-20.33.
Experience
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43.]
32.0    154
20.0    148
9.0     147
5.0     146
23.0    144
35.0    143
25.0    142
28.0    138
18.0    137
19.0    135
26.0    134
24.0    131
3.0     129
16.0    127
14.0    127
30.0    126
27.0    125
34.0    125
17.0    125
22.0    124
29.0    124
7.0     121
8.0     119
15.0    119
6.0     119
0.0     118
10.0    118
13.0    117
33.0    117
37.0    116
11.0    116
36.0    114
4.0     113
21.0    113
31.0    104
12.0    102
38.0     88
2.0      85
39.0     85
1.0      74
40.0     57
41.0     43
42.0      8
43.0      3
Name: Experience, dtype: int64
Out[289]:
20.1196

3. Split the data into training and test set in the ratio of 70:30 respectively (5 marks)

In [505]:
from sklearn.model_selection import train_test_split


# I will drop age, id, zip code and mortgage from my list of indepdent variable predictors. 
# Reasons: 1. Age is highly correlated with 'Experience' from Correlation matrix (Pearson's corr = 1). Age doesn't show true independence from Experience. 
#          2. ID is just an identifier, is different for each case and holds no predictive value for the model.
#          3. ZIP code should not be in the model because there are too few cases reported in each zip code for any of the calculations to mean anything. In the bivariate analysis, it was difficult to interpret anything because the results were all close to 0.
#          4. Mortgage is not a reliable variable. It has excessive zeroes and has high outliers. 
X = bank.drop(['Personal Loan', 'Age', 'ID', 'ZIP Code', 'Mortgage'], axis=1)     # Separate predictor variables (independent variables) from outcome (dependent variable)
Y = bank['Personal Loan']   # Predicted class (1=True, 0=False)

#Make dummy variable out of the categorical predictor variables
X = pd.get_dummies(X, drop_first=True)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)    #Splitting 70:30
# 1 equals any random seed number

x_train.head()
Out[505]:
Experience Income Family CCAvg Education Securities Account CD Account Online CreditCard
1334 22 35 2 1.3 1 0 0 1 0
4768 14 39 1 2.0 2 0 0 1 0
65 35 131 1 3.8 1 0 0 1 1
177 3 65 4 1.8 2 0 0 0 0
4489 13 21 3 0.2 2 0 0 1 0
In [506]:
#Verifying if 70% is training data and 30% is test data

print("{0:0.2f}% data is in training set".format((len(x_train)/len(bank.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(bank.index)) * 100))
70.00% data is in training set
30.00% data is in test set
In [507]:
#Seing the distribution of values after data splitting

print("Original Personal Loan True Values    : {0} ({1:0.2f}%)".format(len(bank.loc[bank['Personal Loan'] == 1]), (len(bank.loc[bank['Personal Loan'] == 1])/len(bank.index)) * 100))
print("Original Personal Loan False Values   : {0} ({1:0.2f}%)".format(len(bank.loc[bank['Personal Loan'] == 0]), (len(bank.loc[bank['Personal Loan'] == 0])/len(bank.index)) * 100))
print("")
print("Training Personal Loan True Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Personal Loan False Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Personal Loan True Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Personal Loan False Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
Original Personal Loan True Values    : 480 (9.60%)
Original Personal Loan False Values   : 4520 (90.40%)

Training Personal Loan True Values    : 331 (9.46%)
Training Personal Loan False Values   : 3169 (90.54%)

Test Personal Loan True Values        : 149 (9.93%)
Test Personal Loan False Values       : 1351 (90.07%)

4. Use the Logistic Regression model to predict whether the customer will take a personal loan or not. Print all the metrics related to evaluating the model performance (accuracy, recall, precision, f1score, and roc_auc_score). Draw a heatmap to display confusion matrix (15 marks)

In [508]:
#Importing Logistic Regression from sklearn

from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
#predict on test
y_predict = model.predict(x_test)
In [509]:
#Looking at the model_score
model_score = model.score(x_test, y_test)

print(model_score)
0.9433333333333334
In [517]:
#Finding the actual true positive, true negative, false positive and false negative values

cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Numbers from Confusion Matrix")
print(cm)
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print("")

#Confusion Matrix
print("Confusion Matrix")
def draw_cm( actual, predicted ):
    cm = confusion_matrix( actual, predicted)
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
Numbers from Confusion Matrix
[[  77   72]
 [  13 1338]]

Confusion Matrix
In [518]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score

print("Trainig accuracy",model.score(x_train,y_train))  
print()
print("Testing accuracy",model.score(x_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
Trainig accuracy 0.9505714285714286

Testing accuracy 0.9433333333333334

Recall: 0.5167785234899329

Precision: 0.8555555555555555

F1 Score: 0.6443514644351466

Roc Auc Score: 0.753578010819726

5. Find out coefficients of all the attributes and show the output in a data frame with column names? For test data show all the rows where the predicted class is not equal to the observed class. (10 marks)

In [519]:
## Feature Importance or Coefficients 
fi = pd.DataFrame()
fi['Col'] = x_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Out[519]:
Col Coeff
6 CD Account 3.15
4 Education 1.18
8 CreditCard 1.01
5 Securities Account 0.94
7 Online 0.62
2 Family 0.46
3 CCAvg 0.13
1 Income 0.04
0 Experience 0.01

From the above syntax of coefficients, CD Account, Education, CreditCard, and Securities Account had the most impact on the making of this model. Other less contributive predictors include: Online, Family, CCAvg, Income and Experience.

These coefficients are derived from the independent variables in the training data that make up the prediction model.

Sidenote: In order to statistically infer causation between the independent variables (predictors) and the outcome (dependent variables), you would need to include the y variable and have syntax like the following:

import statsmodels.api as sm

logitmodel=sm.Logit(y_train, x_train)

result=logitmodel.fit()

print(result.summary())

Running this code and evaluating the p-values and confidence intervals may also lead to better selective combination of predictors.

In [513]:
predict= x_test.copy()
predict['Observed Personal Loan Status'] = y_test
predict['Predicted Personal Loan Status'] = y_predict

# Showing rows where predicted does not equal observed.
pd.set_option("display.max_rows", None, "display.max_columns", None)
predict.loc[predict['Observed Personal Loan Status'] != predict['Predicted Personal Loan Status']]
Out[513]:
Experience Income Family CCAvg Education Securities Account CD Account Online CreditCard Observed Personal Loan Status Predicted Personal Loan Status
4377 8 145 1 2.70 3 0 0 1 0 1 0
3271 27 93 4 4.10 2 0 0 0 1 1 0
349 2 60 2 3.00 1 0 0 0 0 1 0
927 40 95 3 3.70 2 0 0 0 1 1 0
2968 18 162 4 1.30 1 0 0 0 0 1 0
1626 6 180 2 6.70 1 0 0 0 0 0 1
1518 17 64 4 3.00 3 0 0 1 0 1 0
1552 5 195 1 4.30 1 0 0 0 0 0 1
2478 5 178 2 6.70 1 0 0 0 0 0 1
2784 9 115 4 2.20 2 0 0 0 0 1 0
1277 20 194 2 8.80 1 0 0 0 0 0 1
1064 17 138 3 6.90 2 0 0 1 0 1 0
57 31 131 2 1.20 3 0 0 0 0 1 0
4302 27 85 3 3.40 3 0 0 0 0 1 0
1822 23 112 1 5.10 2 1 1 1 0 1 0
941 13 129 4 4.40 1 0 0 0 0 1 0
3467 37 149 2 0.20 3 0 0 1 0 1 0
1784 29 119 3 2.00 1 1 1 0 0 1 0
1579 5 122 4 3.00 1 0 0 0 1 1 0
1177 3 71 1 3.30 2 1 1 1 0 1 0
4016 28 173 4 2.70 1 0 0 1 0 1 0
3784 6 115 4 3.80 2 0 0 1 0 1 0
4590 34 151 3 0.60 2 0 0 0 0 1 0
322 39 101 1 3.90 1 1 1 1 0 1 0
2158 25 83 4 3.10 1 0 0 0 1 1 0
1768 18 128 4 5.30 1 0 0 0 0 1 0
1577 8 65 1 3.00 1 0 0 1 0 1 0
2024 12 113 4 0.20 1 0 0 0 0 1 0
1499 26 91 1 4.30 2 0 1 1 1 1 0
3720 39 131 3 2.60 3 0 0 1 0 1 0
90 30 118 4 5.60 2 0 0 1 0 1 0
965 36 135 2 5.20 2 0 0 1 0 1 0
3573 36 165 3 5.60 1 0 0 0 1 1 0
2544 34 90 1 3.60 2 0 0 0 0 1 0
2268 3 105 1 3.00 2 1 0 0 0 1 0
2533 29 111 1 1.10 2 0 0 1 0 1 0
2262 29 131 2 0.70 2 0 0 0 1 1 0
1125 13 172 2 6.50 1 0 0 0 0 0 1
4532 22 133 2 3.10 2 0 0 1 0 1 0
4763 25 173 1 0.50 2 0 0 1 0 1 0
4575 27 115 2 0.50 3 0 0 0 0 1 0
1432 2 195 1 6.33 1 0 0 0 0 0 1
1328 36 145 4 6.90 1 0 0 0 1 1 0
1559 35 102 4 3.00 2 0 0 1 0 1 0
3858 18 158 2 0.40 2 0 0 0 1 1 0
1913 33 134 4 0.90 1 0 0 1 0 1 0
1126 8 104 2 3.70 1 0 0 0 1 1 0
3818 0 102 4 2.30 3 0 0 0 0 0 1
4168 34 139 4 0.40 1 0 0 1 0 1 0
316 31 165 1 1.60 2 0 0 1 0 1 0
2539 7 98 1 4.20 1 1 1 0 0 1 0
2934 13 195 2 6.50 1 0 0 1 0 0 1
3758 23 199 2 6.67 1 0 0 1 0 0 1
4927 19 121 1 0.70 2 0 1 1 1 1 0
2385 17 125 4 3.50 2 0 0 0 0 1 0
3835 9 131 3 2.20 3 1 0 0 0 1 0
3312 22 190 2 8.80 1 0 0 0 0 0 1
1225 6 118 2 2.80 2 0 0 0 0 1 0
3472 27 120 4 3.00 2 0 0 1 0 1 0
2874 25 114 1 2.50 3 0 0 1 0 1 0
671 41 105 1 3.00 2 1 1 1 0 1 0
438 32 113 2 3.80 2 0 1 1 1 1 0
4156 12 193 1 8.60 1 0 0 0 0 0 1
4702 5 108 2 2.75 3 1 0 0 0 1 0
3343 37 125 1 1.00 3 0 0 1 0 1 0
3318 20 105 4 3.20 1 0 0 0 0 1 0
650 22 122 1 5.10 3 1 0 0 0 1 0
2403 13 140 4 0.50 1 0 0 0 0 1 0
1794 32 98 3 3.90 3 0 0 0 0 1 0
1195 7 123 2 2.90 2 0 0 1 0 1 0
1373 35 135 3 0.30 3 0 0 1 0 1 0
4865 24 133 4 1.40 2 0 0 0 1 1 0
785 22 164 2 7.60 1 0 1 1 1 0 1
2159 35 99 1 4.80 3 0 0 0 1 1 0
4418 34 145 4 1.80 1 0 0 1 0 1 0
2470 7 81 2 4.50 3 0 1 1 1 0 1
2136 26 115 1 1.20 3 0 0 0 1 1 0
2536 25 104 1 4.20 2 0 0 1 0 1 0
464 19 83 4 3.60 3 0 0 0 1 1 0
1143 7 120 1 3.20 3 0 0 1 0 1 0
1793 9 113 3 0.80 3 0 0 1 0 1 0
813 25 130 1 1.10 2 0 0 1 0 1 0
3766 35 108 4 3.80 2 0 0 1 0 1 0
421 3 115 4 3.10 2 0 0 0 0 1 0
3651 23 140 1 1.90 3 0 0 0 1 1 0

6. Give conclusion related to the Business understanding of your model? (5 marks)

Logistic regression may not be the most suitable model for the data. Here are the reasons why:

According to the confusion matrix, 43 (77) individuals were true positive cases, meaning we predict these individuals to accept a personal loan offer and they will take it, based off of the data from the last campaign. We also anticipate 13 false positives, which we predict will accept the personal loan offered but they will refuse it. There will be 1338 true negative cases, which are individuals we predict will not accept the personal loan and they will not accept it. 72 individuals were labeled as false negative, which we predict will not accept the personal loan but they will accept it.

Important features of this logistic regression model include: CD Account (3.15), Education (1.18), Credit Card (1.01) and Securities Account (0.94). These are the top predictor variables (and their corresponding coefficients) that impact the model's performance. There were other variables that were not as influential, notably: Online (0.62), Family (0.46), CCAvg (0.13), Income (0.4) and Experience (0.1).

Important metric: In order for the bank to secure less of a loss, we need to rely on recall and precision as important metrics. Recall determines of all who accept the loan, how many will the model predict accurately. In this case, the recall value is 0.517. 1-Recall (0.483) determines the severity of false negatives. This value is decent. Larger false negative count is bad for the bank, as these are individuals that were predicted to not take the personal loan but they end up accepting it. So, the bank would have to provide loans and take a risk on these individuals that were not predicted by the model to accept it in the first place. In summary, the bank may have to lose money by lending to these 72 individuals (False negative count).

At least with this model, there were more true positives than false negatives.

Precision is also an important metric. It determines of all the cases the model predicted would accept the personal loan, how many will actually accept the loan. Precision for this model is: 0.8556. 1-Precision(0.1444) allows us to understand how the bank can save money on these loans. Individuals accounted for in 1-Precision are people who were predicted to take a loan but ended up not taking it; so the bank would not have to take a gamble on these individuals. There are 13 individuals in this group (False Positive) and the bank would save money.

According to this model, the bank would lose some money overall because there are more false negatives than false positives. This model predicts that the bank would have to pay for 59 additional personal loans (False Negative - False Positive = 72-13).