A framework to improve churn prediction performance in retail banking

Financial Innovation

Table 1 Previous research assessing churn prediction models (the specifications of our study appear in the bottom row for comparison)

Authors	Context	Data preprocessing stages	Predictive models	Dataset size
Lemmens and Croux (2006)	Telecom	MVI, IDT-Over, IDT-Under	Logistic regression, bagging, and stochastic gradient boosting	Datasets 1 and 2: 51,306 customers; Dataset 3: 100,462 customers
Xie et al. (2009)	Banking	OT	Support Vector Machines	2,382 customers
Zhao and Dang (2008)	Banking	–	Random forests, neural networks, decision trees, and Support Vector Machines	1,524 customers
Benoit and Poel (2012)	Banking	–	Random forest	244,787 customers
Huang et al. (2012)	Telecom	FE	Logistic regression, Naive Bayes, linear classification, C4.5, neural networks, Support Vector Machines, and data mining by evolutionary learning (DMEL)	827,124 customers
Farquad et al. (2014)	Banking	IDT-Over, IDT-Under, FS	Support Vector Machines and Naive Bayes trees	14,814 customers
He et al. (2014)	Banking	IDT-Over, IDT-Under	Logistic regression and Support Vector Machines	46,406 customers
Datta et al. (2015)	TV subscription	MVI, FE	Binomial probit model	16,512 customers
Keramati et al. (2016)	Banking	OT, MVI, IDT-Over, FS	Decision trees	4,383 customers
Geiler et al. (2022)	Banking and others	IDT-Over, IDT-Under	KNN, logistic regression, Naive Bayes, Support Vector Machines, decision trees, neural networks, random forest, and XGBoost	16 datasets (average of 108,473 customers)
Tékouabou et al. (2022)	Banking	MVI, FS	Chandy-Misra-Bryant	45,000 customers
Our study	Banking	MVI, FE, IDT-Over, IDT-Under	XGBoost and Elastic Net	3,283,332 customers