# -*- coding: utf-8 -*-
"""E Commerce Churn.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1XqlFbSG3txioNOHkao3S3ewu0M8Fnhy9
**Problem statement** :
Online e-commerce company XYZ has observed customers switching to competitors over the past couple of quarters. This has caused a huge dent in quarterly revenues and might drastically affect annual revenues for the ongoing financial year, causing the stock to plunge and market cap to shrink by X%. A team of experts from the fields of economics, product development, technology and data science was assembled to halt this decline.
**Objective** : Is it possible to use a model to predict with reasonable accuracy which customers will leave in the near future? The ability to accurately estimate the timing of their departure would be an added bonus.
**Definition of churn** : A customer having closed all their active accounts with the e-commerce platform is said to have churned. Churn can be defined in other ways as well, based on the context of the problem: a customer not transacting for 6 months or 1 year could also be defined as churned, depending on the business requirements
From the perspective of a business team/product manager:
(1) Business goal : Arrest slide in revenues or loss of active E-commerce customers
(2) Identify data source : Transactional systems, event-based logs, Data warehouse (MySQL DBs, Redshift/AWS), Data Lakes, NoSQL DBs
(3) Audit for data quality : De-duplication of events/transactions, Complete or partial absence of data for chunks of time in between, Obscuring PII (personal identifiable information) data
(4) Define business and data-related metrics : Tracking of these metrics over time, probably through some intuitive visualizations
(i) Business metrics : Churn rate (weekly/monthly/quarterly), trend of avg. number of products per customer, percentage of dormant customers, and other such descriptive metrics
(ii) Data-related metrics : F1-score, Recall, Precision (see the computation sketch at the end of this section)
Recall = TP/(TP + FN)
Precision = TP/(TP + FP)
F1-score = Harmonic mean of Recall and Precision = 2 * (Precision * Recall)/(Precision + Recall)
where TP = True Positive, FP = False Positive and FN = False Negative
(5) Prediction model output format : Since this is not going to be an online model, it doesn't require deployment. Instead, periodic (monthly/quarterly) model runs could be made and the list of customers, along with their propensity to churn, shared with the business (Sales/Marketing) or Product team (see the sketch after this list)
(6) Action to be taken based on the model's output/insights : Based on the output obtained from the Data Science team as above, various business interventions can be made to keep customers from churning: customer-centric e-commerce offers, getting in touch with customers to address grievances, etc. The Data Science team can also help with basic EDA to highlight different customer groups/segments and the appropriate intervention to apply to each
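A hedged sketch of the deliverable described in (5); `model`, `X_current` and `customer_ids` are hypothetical placeholders, not objects defined in this notebook:

```python
# Hypothetical: score current customers and export a ranked churn list
scores = model.predict_proba(X_current)[:, 1]  # churn propensity per customer
churn_list = pd.DataFrame({'CustomerID': customer_ids, 'churn_propensity': scores})
churn_list.sort_values('churn_propensity', ascending=False).to_csv('churn_scores.csv', index=False)
```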
__Collaboration with Engineering and DevOps :__
(1) Application deployment on production servers (In the context of this problem statement, not required)
(2) [DevOps] Monitoring the scale aspects of model performance over time (Again, not required, in this case)
### How to set the target/goal for the metrics?
* Data science-related metrics :
- Recall : >70%
- Precision : >70%
- F1-score : >70%
* Business metrics : Usually, it's top down. But a good practice is to expect at least half the impact of the data science metric. For example, if we take a Recall target of __70%__, meaning we correctly identify 70% of the customers who are going to churn in the near future, and business interventions (offers, getting in touch with customers, etc.) save 50% of those customers, that amounts to at least a __35%__ improvement in churn rate (0.70 × 0.50 = 0.35)
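As a minimal sketch, the data science metrics above can be computed with scikit-learn; the labels below are made up purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = churned (toy ground truth)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # toy model predictions
print(recall_score(y_true, y_pred))     # TP=3, FN=1 -> 0.75
print(precision_score(y_true, y_pred))  # TP=3, FP=1 -> 0.75
print(f1_score(y_true, y_pred))         # harmonic mean -> 0.75
```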
## Show me the code!
"""
from google.colab import drive
drive.mount('/content/drive')
!pip install "numpy<2.0" "xgboost==2.1.4" "scikit-learn==1.3.2"
import numpy
import xgboost
print(numpy.__version__) # Should be 1.26.4
print(xgboost.__version__) # Should be 2.1.4
## Import required libraries
import os
import pandas as pd
from pandas.plotting import parallel_coordinates
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.legend_handler import HandlerPathCollection
import matplotlib.lines as mlines
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.calibration import CalibratedClassifierCV
from sklearn.calibration import CalibrationDisplay
from sklearn.metrics import brier_score_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.compose import ColumnTransformer
from sklearn.inspection import PartialDependenceDisplay
from lightgbm import LGBMClassifier
import xgboost as xgb
from xgboost import XGBClassifier
# Commented out IPython magic to ensure Python compatibility.
# %matplotlib inline
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
## Display all rows and columns of a dataframe instead of a truncated version
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
## Reading the dataset
# This might be present in S3, or obtained through a query on a database
df_churn = pd.read_csv('/content/drive/MyDrive/Churn Projectpro/E Commerce Dataset.csv')
# Dataframe has 16890 observations and 21 columns
print(df_churn.shape)
df_churn.head(10).T
"""### Basic EDA"""
df_churn.describe() # Describe all numerical columns
df_churn.describe(include = ['O']) # Describe all non-numerical/categorical columns
## Checking number of unique customers in the dataset
df_churn.shape[0], df_churn.CustomerID.nunique()
customer_counts = df_churn['CustomerID'].value_counts()
print(customer_counts)
customer_counts = df_churn['CustomerID'].value_counts()
# Check if the minimum and maximum counts are both 3
min_count = customer_counts.min()
max_count = customer_counts.max()
print(f"Minimum observations per customer: {min_count}")
print(f"Maximum observations per customer: {max_count}")
# List of all columns expected to be consistent for a given CustomerID across observations
# Exclude 'CustomerID' itself and 'Unnamed: 0' as they are identifiers or index-like.
all_static_cols = [col for col in df_churn.columns if col not in ['CustomerID', 'Unnamed: 0']]
# This produces a DataFrame showing the number of unique values for each of these columns per CustomerID
consistency_df = df_churn.groupby('CustomerID')[all_static_cols].nunique()
# Check if all customers have a unique count of exactly 1 across all relevant columns
all_consistent_overall = (consistency_df == 1).all().all()
if all_consistent_overall:
print(f"\nAll customers have consistent values for all specified static attributes across their 3 observations.")
else:
print(f"\nSome customers have different values across their observations in one or more of the following columns: {all_static_cols}")
print("\nRun the next cell to see details per column.")
for col in all_static_cols:
inconsistent_customers_for_col = consistency_df[consistency_df[col] != 1].index.tolist()
if inconsistent_customers_for_col:
print(f"\nCustomers with inconsistent values for '{col}': {len(inconsistent_customers_for_col)} customers")
print(f"Example CustomerIDs for '{col}': {inconsistent_customers_for_col[:5]} (showing first 5 if many)")
# Optionally, display the actual inconsistent data for the first customer in that column
if inconsistent_customers_for_col:
first_inconsistent_customer = inconsistent_customers_for_col[0]
print(f"Data for CustomerID {first_inconsistent_customer} in column '{col}':")
display(df_churn[df_churn['CustomerID'] == first_inconsistent_customer][['CustomerID', col]])
check_customer_data = df_churn[df_churn['CustomerID'] == 50011]
display(check_customer_data)
df_churn_ = df_churn.copy()
df_churn_ = df_churn_.drop_duplicates(subset=['CustomerID'])
# Drop the redundant Churn column and rename the remaining one
#df_churn_ = df_churn_.drop(columns=['Churn_y'])
#df_churn_ = df_churn_.rename(columns={'Churn_x': 'Churn'})
df_churn_.shape[0], df_churn.CustomerID.nunique()
df_churn = df_churn.drop_duplicates(subset=['CustomerID'])
# Drop the redundant Churn column and rename the remaining one
#df_churn = df_churn.drop(columns=['Churn_y'])
df_churn.shape[0], df_churn.CustomerID.nunique()
# The aggregate-then-merge process below is a robust way to ensure everything works correctly.
# 1. Aggregate ONLY numeric columns using median
# Any customer who had ALL 3 rows as NaN will have a NaN here.
# df_agg_numeric = df_churn_.groupby('CustomerID').median(numeric_only=True).reset_index()
# 2. Aggregate categorical columns using 'first'
#df_agg_categorical = df_churn_.groupby('CustomerID')[['PreferredLoginDevice', 'PreferredPaymentMode',
# 'Gender', 'PreferedOrderCat', 'MaritalStatus', 'Churn']].first().reset_index()
# 3. Merge the results
#df_churn_ = pd.merge(df_agg_numeric, df_agg_categorical, on='CustomerID', how='left')
# Drop the redundant Churn column and rename the remaining one
#df_churn_ = df_churn_.drop(columns=['Churn_y'])
#df_churn_ = df_churn_.rename(columns={'Churn_x': 'Churn'})
# Check how many NaNs are in your final 5630-row DataFrame:
print("\nNumber of NaNs remaining in final DataFrame:")
print(df_churn_.isnull().sum())
# Check how many NaNs are in your final 5630-row DataFrame:
print("\nNumber of NaNs remaining in final DataFrame:")
print(df_churn.isnull().sum())
df_tier = df_churn_.groupby(['CityTier']).agg({'CustomerID':'count', 'Churn':'mean'}
).reset_index().sort_values(by='CustomerID', ascending=False)
df_tier
df_churn_.CityTier.value_counts(normalize=True)
"""#### Conclusion
* Discard Unnamed: 0
* Discard CustomerID as well, since it doesn't convey any extra info. Each row pertains to a unique customer
* Based on the above, columns/features can be segregated into non-essential, numerical, categorical and target variables
In general, CustomerID is a very useful feature on the basis of which we can calculate a lot of user-centric features. Here, the dataset is not sufficient to calculate any extra customer features
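As an illustration with made-up event-level data (the table and columns below are hypothetical, not part of this dataset), transaction logs would allow per-customer aggregations such as:

```python
# Hypothetical event log for illustration only
events = pd.DataFrame({'CustomerID': [1, 1, 2], 'order_value': [20.0, 35.0, 50.0]})
cust_feats = events.groupby('CustomerID').agg(
    order_count=('order_value', 'size'),
    avg_order_value=('order_value', 'mean'),
)
```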
"""
## Separating out different columns into various categories as defined above
## Are CityTier (and NumberOfDeviceRegistered) also categorical features?
target_var = ['Churn']
cols_to_remove = ['Unnamed: 0', 'CustomerID']
num_feats = ['Tenure', 'CityTier', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered', 'SatisfactionScore', 'NumberOfAddress', 'Complain', 'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount', 'DaySinceLastOrder', 'CashbackAmount']
cat_feats = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus']
"""Among these, NumberOfDeviceRegistered could also be categorical and Complain is binary categorical variable and SatisfactionScore probably ordinal.
"""
## Separating out target variable and removing the non-essential columns
y = df_churn_[target_var].values
df_churn_.drop(cols_to_remove, axis=1, inplace=True)
"""### Questioning the data :
- Why do values in the original dataset appear three times? An upload error?
- No date/time column. A lot of useful features can be built using date/time columns
- When was the data snapshot taken? There are certain customer features like : CashbackAmount, Tenure, OrderCount etc., which will have different values across time
- Are all these values/features pertaining to the same single date or spread across multiple dates?
- How frequently are customer features updated?
- Will it be possible to have the values of these features over a period of time as opposed to a single, snapshot date?
- Some customers have more than one address, do they order for others or for a company?
- Customer transaction patterns can also help us ascertain whether the customer has actually churned or not. For example, a customer might transact daily/weekly vs. a customer who transacts annually
Here, the objective is to understand the data and distill the problem statement and the stated goal further. In the process, if more data/context can be obtained, it improves the final model's performance
### Data Cleaning
"""
# Calculating the percentage of the missing values in every column
percent_missing = (df_churn_.isnull().sum() * 100) / len(df_churn_)
fig = plt.figure(figsize=(12, 6))
_ = percent_missing.plot(kind='bar')
_ = plt.title('Percentage of Missing Values per Column')
_ = plt.xlabel('Column Name')
_ = plt.ylabel('Percentage Missing')
_ = plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("Checking all columns for negative values:")
# Iterate through all columns in the DataFrame
for col in df_churn_.columns:
# We only want to check columns that are numeric types (int or float)
if np.issubdtype(df_churn_[col].dtype, np.number):
# Check if any value is less than zero
count_negative = (df_churn_[col] < 0).sum()
if count_negative > 0:
print(f"Column '{col}' has {count_negative} negative value(s).")
# else:
# print(f"Column '{col}' is clean (no negatives).")
print("\nCheck complete.")
# Before imputation
print("Value counts for 'PreferredPaymentMode' before imputation:")
display(df_churn_['PreferredPaymentMode'].value_counts(dropna=False))
# Fill missing values in 'PreferredPaymentMode' with 'Unknown'
df_churn_['PreferredPaymentMode'].fillna('Unknown', inplace=True)
# Verify that there are no more NaNs in this column
print(f"\nNumber of NaN values in 'PreferredPaymentMode' after imputation: {df_churn_['PreferredPaymentMode'].isnull().sum()}")
df_churn_['PreferredPaymentMode'] = df_churn_['PreferredPaymentMode'].replace({
'CC': 'Credit Card',
'COD': 'Cash on Delivery'
})
df_churn['PreferredPaymentMode'] = df_churn['PreferredPaymentMode'].replace({
'CC': 'Credit Card',
'COD': 'Cash on Delivery'
})
# Example 1: PreferredPaymentMode
print("Value counts for 'PreferredPaymentMode':")
display(df_churn_['PreferredPaymentMode'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn_, x='PreferredPaymentMode', palette='viridis')
_ = plt.title('Distribution of PreferredPaymentMode')
_ = plt.xlabel('PreferredPaymentMode')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
# Before imputation
print("Value counts for 'MaritalStatus' before imputation:")
display(df_churn_['MaritalStatus'].value_counts(dropna=False))
# Fill missing values in 'MaritalStatus' with 'Unknown'
df_churn_['MaritalStatus'].fillna('Unknown', inplace=True)
# After imputation
print("\nValue counts for 'MaritalStatus' after imputation:")
display(df_churn_['MaritalStatus'].value_counts(dropna=False))
# Verify that there are no more NaNs in this column
print(f"\nNumber of NaN values in 'MaritalStatus' after imputation: {df_churn_['MaritalStatus'].isnull().sum()}")
# Example 1: MaritalStatus
print("Value counts for 'MaritalStatus':")
display(df_churn_['MaritalStatus'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn_, x='MaritalStatus', palette='viridis')
_ = plt.title('Distribution of MaritalStatus')
_ = plt.xlabel('MaritalStatus')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
rows_with_zeros = df_churn_[df_churn_['PreferredLoginDevice'] == '0']
display(rows_with_zeros)
# Before replacement
print("Value counts for 'PreferredLoginDevice' before standardization:")
display(df_churn_['PreferredLoginDevice'].value_counts(dropna=False))
# Define the mapping
PreferredLoginDevice_mapping = {
'0': 'Unknown'
}
# Apply the replacement
df_churn_['PreferredLoginDevice'] = df_churn_['PreferredLoginDevice'].replace(PreferredLoginDevice_mapping)
# After replacement
print("\nValue counts for 'PreferredLoginDevice' after standardization:")
display(df_churn_['PreferredLoginDevice'].value_counts(dropna=False))
# PreferredLoginDevice
print("Value counts for 'PreferredLoginDevice':")
display(df_churn_['PreferredLoginDevice'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn_, x='PreferredLoginDevice', palette='viridis')
_ = plt.title('Distribution of PreferredLoginDevice')
_ = plt.xlabel('PreferredLoginDevice')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
# df_churn contains missing values
df_churn['PreferredLoginDevice'] = df_churn['PreferredLoginDevice'].replace("0", np.nan)
# df_churn contains missing values
print("Value counts for 'PreferredLoginDevice':")
display(df_churn['PreferredLoginDevice'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn, x='PreferredLoginDevice', palette='viridis')
_ = plt.title('Distribution of PreferredLoginDevice')
_ = plt.xlabel('PreferredLoginDevice')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
# Before replacement
print("Value counts for 'Gender' before treatment:")
display(df_churn_['Gender'].value_counts(dropna=False))
# Define the mapping
gender_mapping = {
'm': 'Male',
'f': 'Female'
}
# Apply the replacement
df_churn_['Gender'] = df_churn_['Gender'].replace(gender_mapping)
# After replacement
print("\nValue counts for 'Gender' after treatment:")
display(df_churn_['Gender'].value_counts(dropna=False))
# df_churn: changing 'm' to 'Male' and 'f' to 'Female'
print("Value counts for 'Gender' before treatment:")
display(df_churn['Gender'].value_counts(dropna=False))
# Define the mapping
gender_mapping = {
'm': 'Male',
'f': 'Female'
}
# Apply the replacement
df_churn['Gender'] = df_churn['Gender'].replace(gender_mapping)
# After replacement
print("\nValue counts for 'Gender' after treatment:")
display(df_churn['Gender'].value_counts(dropna=False))
rows_with_unreasonable_values = df_churn_[df_churn_['SatisfactionScore'] == 589314.0]
display(rows_with_unreasonable_values)
count_unreasonable_SatisfactionScore = (df_churn_['SatisfactionScore'] == 589314.0).sum()
print(f"Number of rows where 'SatisfactionScore' is 589314.0: {count_unreasonable_SatisfactionScore}")
# Example 1: SatisfactionScore
print("Value counts for 'SatisfactionScore':")
display(df_churn_['SatisfactionScore'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn_, x='SatisfactionScore', palette='viridis')
_ = plt.title('Distribution of SatisfactionScore')
_ = plt.xlabel('SatisfactionScore')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
df_churn['SatisfactionScore'] = df_churn['SatisfactionScore'].replace(589314.0, np.nan)
# Example 1: SatisfactionScore
print("Value counts for 'SatisfactionScore':")
display(df_churn['SatisfactionScore'].value_counts())
plt.figure(figsize=(8, 5))
sns.countplot(data=df_churn, x='SatisfactionScore', palette='viridis')
_ = plt.title('Distribution of SatisfactionScore')
_ = plt.xlabel('SatisfactionScore')
_ = plt.ylabel('Count')
_ = plt.xticks(rotation=45, ha='right') # Added rotation here
plt.show()
CategoricalFeatures = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus',
"SatisfactionScore"]
def ValueCounts():
for i in CategoricalFeatures:
print(df_churn_[i].value_counts())
ValueCounts()
# Count the total number of entries where the value in 'Tenure' is less than 0
count_negative_values = (df_churn_['Tenure'] < 0).sum()
print(f"Total number of negative values in df_churn_['Tenure']: {count_negative_values}")
rows_with_negative_tenure = df_churn_[df_churn_['Tenure'] == -10000]
display(rows_with_negative_tenure)
count_negative_tenure = (df_churn_['Tenure'] == -10000).sum()
print(f"Number of rows where 'Tenure' is -10000: {count_negative_tenure}")
count_nan_tenure = df_churn_['Tenure'].isnull().sum()
print(f"Number of NaN values in 'Tenure': {count_nan_tenure}")
df_churn_['Tenure'] = df_churn_['Tenure'].replace(-10000, np.nan)
# Let's verify the change in the new DataFrame
count_negative_tenure_new_df = (df_churn_['Tenure'] == -10000).sum()
count_nan_tenure_new_df = df_churn_['Tenure'].isnull().sum()
print(f"In the new DataFrame, 'df_churn_':")
print(f" Number of -10000 values in 'Tenure': {count_negative_tenure_new_df}")
print(f" Number of NaN values in 'Tenure' after replacement: {count_nan_tenure_new_df}")
# Also show a sample of the updated column
display(df_churn_[df_churn_['Tenure'].isnull()].head())
# Counting Nan values in df_churn
count_nan_tenure_in_df_churn = df_churn['Tenure'].isnull().sum()
print(f"Number of NaN values in 'Tenure': {count_nan_tenure_in_df_churn}")
# Verifying that -10000 has been changed to NaN in df_churn
df_churn['Tenure'] = df_churn['Tenure'].replace(-10000, np.nan)
# Let's verify the change in the new DataFrame
count_negative_tenure_new_df = (df_churn['Tenure'] == -10000).sum()
count_nan_tenure_new_df = df_churn['Tenure'].isnull().sum()
print(f"In the new DataFrame, 'df_churn':")
print(f" Number of -10000 values in 'Tenure': {count_negative_tenure_new_df}")
print(f" Number of NaN values in 'Tenure' after replacement: {count_nan_tenure_new_df}")
# Also show a sample of the updated column
display(df_churn[df_churn['Tenure'].isnull()].head())
"""* The missing values of categorical features have been filled with Unkown and also a new category been generated for the missing values
* The gender feature included m and f. m has bee assigned to Male and f has been assigned to Female.
* Tenure had among NaN also -10000 which are now treated as NaN values.
* For the second churn df (churn_df) the missing values or unreasonable values have been defined as NaN
### Separating out train-test-valid sets
Since this is the only data available to us, we keep aside a holdout/test set to evaluate our model at the very end in order to estimate our chosen model's performance on unseen data / new data.
A validation set is also created which we'll use in our baseline models to evaluate and tune our models
"""
## DataFrame roadmap
# 1. df_churn : original dataframe, contained duplicate entries (NOTE: duplicates removed before the split)
# 2. df_churn_ : duplicates removed; missing values in categorical features categorized as 'Unknown',
#    unreasonable numbers in numeric features set to NaN
# 3. df_churn  --> df_train_m, df_val_m, df_test_m and y_train_m, y_val_m, y_test_m (NOTE: duplicates removed, still contains missing values)
# 4. df_churn_ --> df_train, df_val, df_test and y_train, y_val, y_test
# 5. df_train, df_val, df_test --> df_train_II, df_val_II, df_test_II (II = Imputed, no multicollinearity, encoded)
#    df_train_m, df_val_m, df_test_m --> df_train_m_X, df_val_m_X, df_test_m_X
# 6. df_train_II, df_val_II, df_test_II --> sc_en_X_train_II, sc_en_X_val_II, sc_en_X_test_II (scaled and encoded)
#    df_train_II, df_val_II, df_test_II --> df_train_II_NoOutliers, df_val_II_NoOutliers, df_test_II_NoOutliers
#    sc_en_X_train_II, sc_en_X_val_II, sc_en_X_test_II --> sc_en_X_train_II_NoOutliers, sc_en_X_val_II_NoOutliers, sc_en_X_test_II_NoOutliers
# 7. X_train, X_val, X_test --> df_train_II, df_val_II, df_test_II (imputed, no multicollinearity, encoded)
# 8. X_train_m, X_val_m, X_test_m --> df_train_m, df_val_m, df_test_m
## Keeping aside a test/holdout set
df_train_val, df_test, y_train_val, y_test = train_test_split(df_churn_, y.ravel(), test_size = 0.1, random_state = 42)
## Splitting into train and validation set
df_train, df_val, y_train, y_val = train_test_split(df_train_val, y_train_val, test_size = 0.12, random_state = 42)
df_train.shape, df_val.shape, df_test.shape, y_train.shape, y_val.shape, y_test.shape
np.mean(y_train), np.mean(y_val), np.mean(y_test)
## Applying the same splits to df_churn, which still has missing values
## Keeping aside a test/holdout set
df_train_val_m, df_test_m, y_train_val_m, y_test_m = train_test_split(df_churn, y.ravel(), test_size = 0.1, random_state = 42)
## Splitting into train and validation set
df_train_m, df_val_m, y_train_m, y_val_m = train_test_split(df_train_val_m, y_train_val_m, test_size = 0.12, random_state = 42)
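## Note: the splits above rely on random_state alone, with the class means checked
## manually further up. A hedged alternative sketch using stratified splits to keep
## the churn ratio consistent across sets (variables suffixed _s are illustrative
## and not used later in this notebook):
df_train_val_s, df_test_s, y_train_val_s, y_test_s = train_test_split(
    df_churn_, y.ravel(), test_size=0.1, random_state=42, stratify=y.ravel())
df_train_s, df_val_s, y_train_s, y_val_s = train_test_split(
    df_train_val_s, y_train_val_s, test_size=0.12, random_state=42, stratify=y_train_val_s)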
# 'df' is your cleaned dataframe
#df_churn.to_csv('df_churn.csv', index=False)
#from google.colab import files
#files.download('df_churn.csv')
df_train_m_X = df_train_m.copy()
df_val_m_X = df_val_m.copy()
df_test_m_X = df_test_m.copy()
df_train_m_X.columns
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_train) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_train (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_test) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_test (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_val) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_val (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_train_m) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_train_m (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_test_m) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_test_m (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
sns.set(style="whitegrid")
# Create a count plot for the 'Churn' variable
plt.figure(figsize=(6, 4)) # Optional: Adjusts the size of the plot
sns.countplot(x="Churn", data=df_val_m) # 'x' maps to the categorical column
plt.title("Distribution of Churn df_val_m (No Churn vs Churn)") # Add a descriptive title
plt.xlabel("Churn Status") # Optional: label for the x-axis
plt.ylabel("Count") # Optional: label for the y-axis
plt.show() # Displays the plot
"""### Univariate plots of numerical variables in training set"""
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure()
sns.set(style = "whitegrid")
sns.boxplot(y = df_train[visual])
return plt.show()
numeric_features_visuals()
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure()
#sns.set(style = "whitegrid")
sns.violinplot(y = df_train[visual])
return plt.show()
numeric_features_visuals()
"""Violin plot suggests that NumberOfDeviceRegistered and HoursSpendOnApp could be treated as categorical features as well?"""
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure()
sns.set(style = 'ticks')
sns.distplot(df_train[visual], hist=True, kde=False)  # Note: distplot is deprecated in recent seaborn; histplot (used below) is the modern equivalent
return plt.show()
numeric_features_visuals()
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure()
#sns.set(style = 'ticks')
sns.kdeplot(df_train[visual])
return plt.show()
numeric_features_visuals()
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure()
#sns.set(style = 'ticks')
sns.histplot(df_train[visual])
return plt.show()
numeric_features_visuals()
"""* HoursSpendOnApp and NumberOfDeviceRegistered might not be strong predictors because they are only 3 or 6 numbers (could be treated as categorical or even ordinal).
* CouponUsed and OrderCount might also not be strong predictrs due to fewer numbers in some categories.
### Bivariate plots of categorical variables in training set
"""
print("Unique categories in 'PreferredLoginDevice':")
display(df_train['PreferredLoginDevice'].unique())
print("\nCategories and their counts in 'PreferredPaymentMode':")
display(df_train['PreferredPaymentMode'].unique())
print("\nCategories and their counts in 'Gender':")
display(df_train['Gender'].unique())
print("\nCategories and their counts in 'PreferedOrderCat':")
display(df_train['PreferedOrderCat'].unique())
print("\nCategories and their counts in 'MaritalStatus':")
display(df_train['MaritalStatus'].unique())
print("\nCategories and their counts in 'SatisfactionScore':")
display(df_train['SatisfactionScore'].unique())
print("\nCategories and their counts in 'Complain':")
display(df_train['Complain'].unique())
print("\nCategories and their counts in 'CityTier':")
display(df_train['CityTier'].unique())
categorical_for_churn_viz = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat',
'MaritalStatus', 'SatisfactionScore', 'Complain', 'CityTier']
def categorical_vs_target_visuals():
for visual in categorical_for_churn_viz:
# Calculate counts for each category and Churn status
temp_df = df_train.groupby(visual)['Churn'].value_counts().rename('count').reset_index()
# Filter for 'Not Churned' (Churn == 0) and sort by their counts
churn_0_df = temp_df[temp_df['Churn'].astype(int) == 0].sort_values(by='count', ascending=False)
# Get the ordered list of categories to use in the plot
order_list = churn_0_df[visual].tolist()
plt.figure(figsize=(10, 6))
sns.set(style = "whitegrid")
sns.countplot(data=df_train, x=visual, hue='Churn', palette='viridis', order=order_list)
plt.title(f'Distribution of {visual} by Churn (Ordered by Not Churned Count)')
plt.xlabel(visual)
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Churn', labels=['Not Churned', 'Churned'])
return plt.show()
categorical_vs_target_visuals()
def categorical_vs_target_visuals_percentage_ordered():
categorical_for_churn_viz = ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat',
'MaritalStatus', 'SatisfactionScore', 'Complain', 'CityTier']
for visual in categorical_for_churn_viz:
fig, ax = plt.subplots(figsize=(10, 6))
# Calculate churn percentages within each category
temp_df = df_train.groupby(visual)['Churn'].value_counts(normalize=True).rename('percentage').reset_index()
temp_df['percentage'] = temp_df['percentage'] * 100 # Convert to actual percentage
# --- NEW STEP: Determine the order based on 'Not Churned' percentage (assuming 0 means Not Churned) ---
# Filter for 'Not Churned' (Churn == 0) and sort in descending order of percentage
# Use .astype(int) for Churn if it's currently stored as float or object
churn_0_df = temp_df[temp_df['Churn'].astype(int) == 0].sort_values(by='percentage', ascending=False)
# Get the ordered list of categories to use in the plot
order_list = churn_0_df[visual].tolist()
# Plot the bar chart using the calculated percentages and the new 'order' parameter
sns.barplot(data=temp_df, x=visual, y='percentage', hue='Churn', palette='viridis', ax=ax, order=order_list)
plt.title(f'Churn Percentage Distribution within {visual} (Ordered by Not Churned %)')
plt.xlabel(visual)
plt.ylabel('Percentage (%)')
plt.xticks(rotation=45, ha='right')
# --- Fix Legend ---
handles, labels = ax.get_legend_handles_labels()
# Ensure labels are correct if Churn values are 0/1 or string '0'/'1'
ax.legend(handles, ['Not Churned', 'Churned'], title='Churn', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
return plt.show()
# Run the new function
categorical_vs_target_visuals_percentage_ordered()
# Assuming 'df_train' is your DataFrame and 'Churn' is 0/1 (0 for Not Churned, 1 for Churned)
def categorical_vs_target_visuals_percentage_ordered():
categorical_for_churn_viz = ['HourSpendOnApp', 'NumberOfDeviceRegistered']
for visual in categorical_for_churn_viz:
fig, ax = plt.subplots(figsize=(10, 6))
# Calculate churn percentages within each category
temp_df = df_train.groupby(visual)['Churn'].value_counts(normalize=True).rename('percentage').reset_index()
temp_df['percentage'] = temp_df['percentage'] * 100 # Convert to actual percentage
# --- NEW STEP: Determine the order based on 'Not Churned' percentage (assuming 0 means Not Churned) ---
# Filter for 'Not Churned' (Churn == 0) and sort in descending order of percentage
# Use .astype(int) for Churn if it's currently stored as float or object
churn_0_df = temp_df[temp_df['Churn'].astype(int) == 0].sort_values(by='percentage', ascending=False)
# Get the ordered list of categories to use in the plot
order_list = churn_0_df[visual].tolist()
# Plot the bar chart using the calculated percentages and the new 'order' parameter
sns.barplot(data=temp_df, x=visual, y='percentage', hue='Churn', palette='viridis', ax=ax, order=order_list)
plt.title(f'Churn Percentage Distribution within {visual} (Ordered by Not Churned %)')
plt.xlabel(visual)
plt.ylabel('Percentage (%)')
plt.xticks(rotation=45, ha='right')
# --- Fix Legend ---
handles, labels = ax.get_legend_handles_labels()
# Ensure labels are correct if Churn values are 0/1 or string '0'/'1'
ax.legend(handles, ['Not Churned', 'Churned'], title='Churn', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
return plt.show()
# Run the new function
categorical_vs_target_visuals_percentage_ordered()
"""Conlcusion
- HourSpendOnApp does not look like an informative feature. It might one of the least informative features in the dataset. There is no clear pattern.
- Regarding gender feature, both female and male the number of churn is the same for both gender.
- The other features show good pattern. For instance, the more somone is satisfied lessl likely that someone is going to churn. Also if someone complained then the churn likelyhood increases.
### Bivariate plots of numeric features
"""
num_list = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'NumberOfAddress', 'OrderAmountHikeFromlastYear','CouponUsed', 'OrderCount',
'DaySinceLastOrder', 'CashbackAmount']
def numeric_features_visuals():
for visual in num_list:
plt.figure(figsize=(8, 6)) # Adjust figure size for better readability
sns.set(style = "whitegrid")
sns.boxplot(x='Churn', y=visual, data=df_train, palette='viridis')
plt.title(f'Distribution of {visual} by Churn')
plt.xlabel('Churn (0 = No Churn, 1 = Churn)')
plt.ylabel(visual)
plt.grid(axis='y', linestyle='--', alpha=0.7)
return plt.show()
numeric_features_visuals()
"""Cconlusion
- Regarding tenure, the longer somone is registered as a member, less likely is the churn.
- The more devices someone registred, the more likely someone is going churn.
- The more adresses someone has, the more likely someone is going to churn.
- The likelihood of churn is higher if the day since the previous order is lower.
- The higher the cash back amount, less likely someone is going to churn.
"""
"""### Missing values and outlier treatment
#### Outliers
* Can be observed from univariate plots of different features
* Outliers can either be logically improbable (as per the feature definition) or just an extreme value as compared to the feature distribution
* As part of outlier treatment, the particular row containing the outlier can be removed from the training set, provided they do not form a significant chunk of the dataset (< 0.5-1%)
* In cases where the outlier value is logically faulty, e.g. a negative Age or Tenure < 0, the particular record can be replaced with the feature's mean or the nearest logical min/max value of the feature
Outliers in numerical features can be of a very high/low value, lying in the top 1% or bottom 1% of the distribution or values which are not possible as per the feature definition.
Outliers in categorical features are usually levels with a very low frequency/no. of samples as compared to other categorical levels.
##### Is outlier treatment always required?
No. Not all ML algorithms are sensitive to outliers. Algorithms like linear/logistic regression are sensitive to outliers, while tree-based algorithms, kNN, clustering algorithms etc. are, in general, robust to them.
Outliers also distort summary statistics such as the mean and standard deviation.
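As a minimal sketch (illustrative only; this treatment is not applied in this notebook), extreme values of a numeric feature can be capped at its training-set percentile bounds:

```python
# Winsorize: cap a feature at its 1st/99th training-set percentiles
low, high = df_train['CashbackAmount'].quantile([0.01, 0.99])
capped = df_train['CashbackAmount'].clip(lower=low, upper=high)
```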
#### Missing values in df_train
"""
# Calculating the percentage of the missing values in every column
percent_missing = (df_train.isnull().sum() * 100) / len(df_train)
fig = plt.figure(figsize=(12, 6))
_ = percent_missing.plot(kind='bar')
_ = plt.title('Percentage of Missing Values per Column')
_ = plt.xlabel('Column Name')
_ = plt.ylabel('Percentage Missing')
_ = plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
"""There are a few missing values which can also be observed from df.describe() commands. A couple of things which can be done in such cases :
- If the column/feature has too many missing values, it can be dropped as it might not add much relevance to the data
- If there a few missing values, the column/feature can be imputed with its summary statistics (mean/median/mode) and/or numbers like 0, -1 etc. which add value depending on the data and context.
"""
print(df_train.isnull().sum()/df_train.shape[0])
# Extract the 'WarehouseToHome' column into new, separate DataFrames
df_train_WarehouseToHome = df_train[['WarehouseToHome']].copy()
df_test_WarehouseToHome = df_test[['WarehouseToHome']].copy()
df_val_WarehouseToHome = df_val[['WarehouseToHome']].copy()
# The original df_train, df_test, and df_val remain untouched
# Generating a function for the specific columns to assign them to df_train, df_test and df_val
#def impute_missing_values(df_train_1, df_test_1, df_val_1):
# df_train_WarehouseToHome = df_train_1[['WarehouseToHome']]
# df_test_WarehouseToHome = df_test_1[['WarehouseToHome']]
# df_val_WarehouseToHome = df_val_1[['WarehouseToHome']]
# return df_train_1, df_test_1, df_val_1
#df_train_WarehouseToHome, df_test_WarehouseToHome, df_val_WarehouseToHome = impute_missing_values(df_train, df_test, df_val)
print(df_train.isnull().sum()/df_train.shape[0])
print(df_test.isnull().sum()/df_test.shape[0])
print(df_val.isnull().sum()/df_val.shape[0])
# Initialize the imputer once
imp = SimpleImputer(missing_values=np.nan, strategy='median')
# Impute missing values in the training set
df_train_WarehouseToHome['WarehouseToHome'] = imp.fit_transform(df_train_WarehouseToHome[['WarehouseToHome']])
# Impute missing values in the test set using the SAME fitted imputer
df_test_WarehouseToHome['WarehouseToHome'] = imp.transform(df_test_WarehouseToHome[['WarehouseToHome']])
# Impute missing values in the validation set using the SAME fitted imputer
df_val_WarehouseToHome['WarehouseToHome'] = imp.transform(df_val_WarehouseToHome[['WarehouseToHome']])
# Display samples of the imputed columns
print("df_train_WarehouseToHome after imputation:")
display(df_train_WarehouseToHome['WarehouseToHome'].head())
print("\ndf_test_WarehouseToHome after imputation:")
display(df_test_WarehouseToHome['WarehouseToHome'].head())
print("\ndf_val_WarehouseToHome after imputation:")
display(df_val_WarehouseToHome['WarehouseToHome'].head())
# Copying the dataframe so the original does not change
df_train_II = df_train.copy()
df_test_II = df_test.copy()
df_val_II = df_val.copy()
# Numeric Features to be imputed
# Check again whether NumberOfAddress has missing values
numerical_features = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp', 'NumberOfDeviceRegistered',
'OrderAmountHikeFromlastYear', 'CouponUsed' ,'OrderCount', 'DaySinceLastOrder']
# Select the relevant numerical data for the imputer
features_for_imputation_train = df_train_II[numerical_features].copy()
features_for_imputation_test = df_test_II[numerical_features].copy()
features_for_imputation_val = df_val_II[numerical_features].copy()
# Initialize a single imputer instance
imputer = IterativeImputer(sample_posterior=True, random_state=42)
# Fit the imputer ONLY on the training data and transform it
II_train_df = imputer.fit_transform(features_for_imputation_train)
# Transform the test and validation sets using the SAME fitted imputer
II_test_df = imputer.transform(features_for_imputation_test)
II_val_df = imputer.transform(features_for_imputation_val)
# Assign the imputed values back to the original DataFrame using the correct index
# The output is a NumPy array, so we create a temporary DataFrame to align indices/columns
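# A plausible completion of this truncated step (assumed from the comment above):
# wrap each imputed array in a DataFrame carrying the source index/columns,
# then assign back so rows stay aligned.
df_train_II[numerical_features] = pd.DataFrame(
    II_train_df, columns=numerical_features, index=df_train_II.index)
df_test_II[numerical_features] = pd.DataFrame(
    II_test_df, columns=numerical_features, index=df_test_II.index)
df_val_II[numerical_features] = pd.DataFrame(
    II_val_df, columns=numerical_features, index=df_val_II.index)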