Default or profit scoring credit systems? Evidence from European and US peer-to-peer lending markets

For the emerging peer-to-peer (P2P) lending markets to survive, they need to employ credit-risk management practices such that an investor base is profitable in the long run. Traditionally, credit-risk management relies on credit scoring that predicts loans’ probability of default. In this paper, we use a profit scoring approach that is based on modeling the annualized adjusted internal rate of return of loans. To validate our profit scoring models against traditional credit scoring models, we use data from a European P2P lending market, Bondora, and also a random sample of loans from the Lending Club P2P lending market. We compare the out-of-sample accuracy and profitability of the credit and profit scoring models within several classes of statistical and machine learning models including the following: logistic and linear regression, lasso, ridge, elastic net, random forest, and neural networks. We found that our approach outperforms standard credit scoring models for Lending Club and Bondora loans. More specifically, as opposed to credit scoring models, returns across all loans are 24.0% (Bondora) and 15.5% (Lending Club) higher, whereas accuracy is 6.7% (Bondora) and 3.1% (Lending Club) higher for the proposed profit scoring models. Moreover, our results are not driven by manual selection as profit scoring models suggest investing in more loans. Finally, even if we consider data sampling bias, we found that the set of superior models consists almost exclusively of profit scoring models. Thus, our results contribute to the literature by suggesting a paradigm shift in modeling credit-risk in the P2P market to prefer profit as opposed to credit-risk scoring models.

potential. The P2P market might represent a substitute and a complement to the traditional bank lending market (Tang 2019; De Roure et al. 2021). The substitution effect might occur during a regulatory or systemic shock to the traditional banking sector (Kou et al. 2021b). On the contrary, complementarity is visible when the P2P lending sector extends credit to the markets that remain underserviced within the traditional lending paradigm (see Jagtiani and Lemieux 2018 [for LC dataset]; Jagtiani et al. 2021 [for mortgage market]).
Although different business models exist for P2P lending, the most common and traditional approach is centered on an Internet-based lending platform that facilitates transactions between borrowers and potential lenders (e.g., Lending Club, Bondora, Prosper). Default rates above 10% are not exceptional in P2P markets (e.g., Bondora). Given the informational asymmetry between borrowers and lenders (see Emekter et al. 2015; Serrano-Cinca et al. 2015) and that such loans are usually unsecured, the lender faces considerable credit risks (Kou et al. 2014; Li et al. 2021). As with any new technology, for the P2P lending market to be sustainable, a (growing) customer base is necessary, which essentially means that lenders who will be profitable in the long run are needed. This requires state-of-the-art credit-risk models, the development of which is the aim of this paper. Balyuk (2019) showed how credit from a P2P market might signal to banks the increased creditworthiness of the borrower. The effect is larger for borrowers with short credit history and low credit grades. This information spillover from the P2P to the traditional credit market is somewhat surprising, given that banks have decades of experience in consumer lending. However, Balyuk (2019) suggested that the improved ability of P2P lenders to evaluate credit risk can most likely be attributed to the use of machine learning (ML) algorithms and alternative data sources. Interestingly, Jagtiani and Lemieux (2019) found that the correlation of credit grades for similar loans issued by banks and provided by P2P platforms declined over time as well. This effect can be attributed to the unique data or advanced scoring methodology used by players in the P2P lending market.
Given the recommendations of the Basel Committee for Banking Supervision, the financial industry has developed a plethora of statistical credit-risk evaluation methods that harness information from past loans to assess the credit risk of new loan applicants and thereby aid credit-risk modeling (e.g., Bastani et al. 2019; Giudici and Misheva 2018; Guo et al. 2016; Kim and Cho 2019a; Malekipirbazari and Aksakalli 2015; Serrano-Cinca and Gutiérrez-Nieto 2016; Xia et al. 2017a, b, 2018). Credit-risk models that are based on predicting the probability of default are usually referred to as credit scoring (CS) models. Most of the advances in P2P credit-risk models fall into this category (e.g., Malekipirbazari and Aksakalli 2015; Xu et al. 2019; Xia et al. 2018; Li et al. 2020). Alternatively, profit scoring (PS) models predict a loan's profitability and not only defaults. Despite very promising early results (e.g., Bastani et al. 2019; Serrano-Cinca and Gutiérrez-Nieto 2016; Xia et al. 2017b), little effort has been made in this domain, and many unanswered questions remain. In this paper, we contribute to this latest strand of the literature, that is, we propose a PS-based model that predicts the annualized adjusted internal rate of return of a loan. We document that in the context of both data sets considered (Bondora and Lending Club market places) and in an out-of-sample framework, our approach outperforms the standard CS models in terms of statistical significance and economic relevance (profitability and total profit).
The remainder of the paper is organized as follows. In the next section, we provide a short review of the most closely related works on PS models in the P2P market. Next, we present the data and specific features of our data sets. Then, we present our PS method, namely, how we estimate a loan's profitability, the statistical models that we use, and our forecasting and evaluation procedures. The presentation of empirical results follows, and the final section summarizes our key findings.

Related works
In the P2P literature, credit-risk models are based on either statistical (e.g., logistic regression (LR); Emekter et al. 2015; Ge et al. 2016; Guo et al. 2016; Serrano-Cinca et al. 2015), nonparametric (e.g., decision trees or random forest [RF]; Malekipirbazari and Aksakalli 2015; Zhou et al. 2019), or artificial intelligence methods (e.g., support vector machines or artificial neural networks; Byanjankar et al. 2015; Sjoblom et al. 2019; Xu et al. 2019). Among the credit-risk models, the standard CS framework models the probability of default. Subsequently, the higher the risk of the borrower, the lower the given credit rating grade. 1 Although the CS approach has proven to be successful, the goal of CS is not necessarily aligned with the investor's long-term goal, which is profit maximization. For example, many of the non-performing loans have a history of payments, suggesting that not all non-performing loans are alike. In some cases, borrowers may have paid off a sum that equals or is even greater than the initial loan amount, whereas in other cases, not a single payment has been made. Clearly, for the lender/investor, the difference between the two non-performing loans is relevant because it leads to different loan returns. Serrano-Cinca and Gutiérrez-Nieto (2016), therefore, suggested using the PS approach for credit-risk models, where the dependent variable is represented by the loans' returns as opposed to an indication of whether the loan defaulted or not.
In the P2P literature, the first PS model was presented by Serrano-Cinca and Gutiérrez-Nieto (2016). In their study, they used a data set from Lending Club (US) and found that CS and PS models represent different aspects of the loan. The reason is that the factors driving the probability of default are different from those driving investors' profitability. Moreover, they reported that when using a decision tree approach (CHAID) in a PS framework, the returns are not only above the average but also above those suggested by LR models. Next, Xia et al. (2017b) proposed a cost-sensitive boosted decision tree to evaluate annualized loan return. Using data from Lending Club and We.com (China), they found that their approach outperforms standard methods, and more importantly, that PS models explaining annualized returns outperform CS models explaining loan defaults. Bastani et al. (2019) followed the work of Serrano-Cinca and Gutiérrez-Nieto (2016) in that they used data from Lending Club and modeled the internal rate of return. Their approach is interesting in that it draws from the wide and deep learning algorithm of Cheng et al. (2016), which combines the predictions from the CS and the PS models in two stages. In the first stage, they predicted non-default loans, which are modeled in the second stage, where the predicted internal rate of return is of interest. In a test sample, the proposed two-stage approach resulted in positive returns and fared better than the approach of Serrano-Cinca and Gutiérrez-Nieto (2016). Thus, assembling information from CS and PS models might make sense. Finally, Xia et al. (2017b) and Bastani et al. (2019) also addressed the imbalance problem of the P2P datasets, where bad loans tend to be under-represented. They argued that models that account for the imbalance problem might lead to more accurate predictions.
Surprisingly, the literature on PS on P2P is very limited, given that the PS models seem to clearly outperform their CS counterparts. In this paper, we contribute to the literature on the P2P PS models in several ways. First, we do not model annualized return (Xia et al. 2017b) or the standard internal rate of return (Bastani et al. 2019; Serrano-Cinca and Gutiérrez-Nieto 2016). Instead, we model an adjusted annualized internal rate of return, where the reinvestment rate is based on the performance of previous loans on the market. Second, we propose a statistical framework that is based not only on standard, easy to implement and interpret models (multivariate linear and LRs with regularization constraints) but also on more sophisticated models (RF and neural networks) that are used for CS and PS, thereby facilitating a fair comparison. Third, previous evidence is mostly related to the Lending Club marketplace, whereas Serrano-Cinca and Gutiérrez-Nieto (2016) and Bastani et al. (2019) noted that PS models need to be validated on other P2P platforms as well. Are previous positive results of the PS model related only to the Lending Club (Bastani et al. 2019; Serrano-Cinca and Gutiérrez-Nieto 2016; Xia et al. 2017b) or We.com (Xia et al. 2017b) lending markets? Evidence from other markets is missing. Therefore, apart from a random sample of loans from the Lending Club database, we use a sample of loans from a European platform Bondora, which offers short-term risky loans. We found that PS models tend to perform much better than CS models, and therefore, we strengthen the case for PS models in the literature. We also evaluate individual models' performance using absolute profits and returns as a loss function because these are ultimately the main concerns of investors. Contrary to most studies in this field, our evaluation is also based on statistical tests that consider data snooping bias. We found a set of superior models that almost always includes models that predict loan returns.

Data
Existing studies predominantly worked with data from the US-based Lending Club (Bastani et al. 2019; Emekter et al. 2015; Guo et al. 2016; Jin and Zhu 2015; Serrano-Cinca and Gutiérrez-Nieto 2016; Serrano-Cinca et al. 2015; Teply and Polena 2020; Xia et al. 2017b; Ye et al. 2018) and Prosper (Guo et al. 2016; Miller 2015; Wang et al. 2018; Zhang and Liu 2012; Zhang and Chen 2017), leaving other P2P market platforms underrepresented in the literature. Our primary interest is establishing the validity of the PS models using data from a European lending platform Bondora. 2 However, to establish a fair comparison across markets and validate our PS models, we also use a random sample of loans from the Lending Club marketplace.
The lending platform Bondora offers a database of loan characteristics and payments. In this paper, we show that credit-risk models can be improved. Instead of modeling defaults, we model the annualized modified internal rate of return calculated from the loan payments database. The first loan in our sample starts on 21st February 2009, and the last one ends on 11th November 2016. To focus on short- to mid-term loans, we remove loans that lasted less than 1 month or longer than 5 years. 3 We use only a sample of loans that are issued from Estonia or Finland and that had a Bondora rating version 2 available. After data preprocessing, our sample covers 161 explanatory variables and consists of 10,002 loans, of which 8001 were selected for training and 2001 for validation.
To match the size of the Bondora dataset, we used a random sample of loans from the much larger Lending Club database of finished loans. As before, 8001 loans were randomly selected to form the training and 2002 loans to form the testing dataset. The earliest loans were from 1st January 2013. Although all loans had a nominal maturity of 36 months, loans that had a real duration of less than 1 month (very early repayments) were removed from the dataset. After data pre-processing, our sample covers 142 explanatory variables. 4 We used two versions of the training dataset (for Bondora and Lending Club data). For training PS models, where the internal rate of return is of concern, we used the raw datasets. For training the CS models, we addressed the imbalance that arises because of the under-representation of defaults in the training dataset (Table 1). Our approach is to use random under-sampling of the majority class (good loans) and random over-sampling (with replacement) of the minority class (non-performing, defaulted loans).
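The resampling step for the CS training data can be illustrated with a minimal Python sketch. It is not the code used in the paper: the function name `balance_by_resampling`, the `default` field, and the choice to resample each class to half of the original sample size are assumptions made only for illustration.

```python
import random

def balance_by_resampling(rows, label_key="default", seed=0):
    """Return a balanced sample: the majority class is randomly
    under-sampled (without replacement) and the minority class is
    randomly over-sampled (with replacement).

    Assumption: each class is resampled to half of the original size.
    """
    rng = random.Random(seed)
    good = [r for r in rows if r[label_key] == 0]
    bad = [r for r in rows if r[label_key] == 1]
    if not good or not bad:
        raise ValueError("both classes must be present")
    n = len(rows) // 2
    majority, minority = (good, bad) if len(good) >= len(bad) else (bad, good)
    sampled = rng.sample(majority, n) + rng.choices(minority, k=n)
    rng.shuffle(sampled)
    return sampled
```

With 80 good and 20 defaulted loans, the sketch returns 50 loans of each class, so a 0.5 probability threshold becomes a natural cut-off for a classifier trained on the result.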
Data for both datasets were pre-processed in two ways. First, several non-negative numerical variables were skewed (to the right), which led us to apply the logarithmic transformation. Moreover, all categorical variables were transformed into dummy variables. Second, both datasets were subject to the following algorithm to address extremes, under-representation of classes and collinearity issues:
• For all numerical variables, the lowest and highest 0.1% were winsorized.
• For each dummy variable, we required at least 1% of event occurrences (i.e., at least 1% of ones and at least 1% of zeros).
• If any two variables had an absolute value of the Spearman's rank correlation coefficient higher than 0.95, one of the two variables was (randomly) removed.
• We checked whether exact linear multi-collinearity exists, and if yes, one of the variables was (randomly) removed.
• We ensured that the range of each variable in the testing dataset falls within the range of the same variable in the training dataset.
Lyócsa et al. Financial Innovation (2022)
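Three of the pre-processing rules above (winsorizing, the 1% rule for dummy variables, and the Spearman filter) can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, the quantiles use a simple nearest-rank rule, and the authors' R implementation may differ in such details.

```python
def winsorize(values, lower=0.001, upper=0.999):
    """Clamp values to the empirical lower/upper quantiles (nearest rank)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[round(lower * last)]
    hi = ordered[round(upper * last)]
    return [min(max(v, lo), hi) for v in values]

def keep_dummy(column, min_share=0.01):
    """Keep a 0/1 variable only if both values occur in at least min_share of loans."""
    ones = sum(column)
    n = len(column)
    return min_share * n <= ones <= (1 - min_share) * n

def ranks(values):
    """Ranks starting at 1, with ties receiving the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation (Pearson correlation of the ranks).
    Assumes neither variable is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def too_correlated(x, y, limit=0.95):
    """True if one of the two variables would be dropped under the 0.95 rule."""
    return abs(spearman(x, y)) > limit
```

Because the Spearman coefficient works on ranks, any strictly monotone transformation of a variable (for example, its logarithm) is flagged as a duplicate of the original.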

Loan performance measures
To distinguish between potentially performing and non-performing loans, the CS literature on the P2P market uses the standard default/not-default credit-risk framework.
With that in mind, a loan is considered to perform well if all liabilities originating from the loan are repaid within a given payment schedule, that is, on time (including the grace period). We denote the standard loan performance measure as follows:
$$P^{(1)}_{i,t_i} = \begin{cases} 0 & \text{Performing loan} \\ 1 & \text{Non-performing loan,} \end{cases} \tag{1}$$
where index $i$ denotes the given loan, and $t = 1, 2, \ldots$ is the usual time index; we use $t_i$ to denote the beginning of the $i$th loan contract on the specific day $t$. The CS models aim to estimate Eq. (1) for loan evaluation purposes. In this paper, Eq. (1) is based on the status of the loan as reported by the respective P2P lending platform.
Assume that the borrower receives the loan amount in a single payment at the beginning of the period denoted as $CO_{i,t_i}$, where, as before, index $t_i$ highlights the fact that the loan amount is paid out at the beginning of the period. This is the case for all loans in our sample. The loans have different (nominal) maturities $m_i$ (in days), and their real maturities can also differ from the nominal (agreed upon) date because of early repayment by the borrower. Over the given time period, one or more regular or irregular payments are received from the borrower by the investor. If all the payments are made on time, then the investor receives the loan amount plus the profit determined by the interest rate on the loan. As loan maturities differ, we assume that the investor has the possibility to re-invest received payments. In this way, we make loans with different real maturities comparable to each other in terms of their profitability. The future value is as follows:
$$FV_{i,t_i} = \sum_{t \in (t_i,\, t_i + m_i]} CI_{i,t}\,(1 + R_{t_i})^{(t_i + m_i - t)/365}, \tag{2}$$
where $CI_{i,t}$ are cash inflows over the period $t \in (t_i, t_i + m_i]$, and $R_{t_i}$ is a fixed reinvestment rate assumed to be known at the start of the loan. The investor's annualized return is calculated as follows:
$$P^{(2)}_{i,t_i} = \left(\frac{FV_{i,t_i}}{CO_{i,t_i}}\right)^{365/m_i} - 1. \tag{3}$$
Equation (3) is our second loan performance measure. The value of the return depends on how the reinvestment rate is estimated. The standard critique of using the reinvestment rate is based on two premises. The first is whether an investor even has the opportunity to re-invest incoming proceeds in investments with similar risks. The second is whether the opportunities offer returns comparable to the assumed return from the evaluated investment/loan. Most established P2P markets (Lending Club, Mintos, and Bondora) have sufficient liquidity to offer many similar loans. Therefore, we consider the reinvestment assumption to be valid.
With regard to the value of the reinvestment rate, our approach is empirical and designed not to overestimate the overall return. In this case, we use a return that was achieved in the past, which proceeds in the following two steps:
1. In the first step, we calculate Eq. (3) for all loans in our sample, assuming that $R_{t_i} = 0$, that is, no reinvestment rate. We denote the resulting returns as $P^{(*)}_{i,t_i}$.
2. In the second step, we calculate Eq. (3) for all loans, but now for each loan, the reinvestment rate $R_{t_i}$ is equal to the following:
$$R_{t_i} = \operatorname{med}\left\{P^{(*)}_{j,t_j} : t_i - 365 \le t_j + m_j < t_i\right\},$$
that is, the reinvestment rate $R_{t_i}$ is the median value of $P^{(*)}_{j,t_j}$ calculated over all loans that finished (at $t_j + m_j$) in the past 365 days prior to the beginning of the evaluated $i$th loan.
This approach ensures that our reinvestment rate is historical and tracks the improvement or worsening of the economic conditions of the borrowers. The rate is calculated over loans that have concluded and also include defaulted loans. However, this approach cannot be applied to initial loans. Instead of removing such loans, we use a zero reinvestment rate for them.
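The two-step calculation can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, days are counted from the start of the loan, and daily compounding on a 365-day year is an assumption about how the reinvested cash flows are compounded.

```python
from statistics import median

def annualized_return(loan_amount, inflows, maturity_days, reinvest_rate=0.0):
    """Annualized modified internal rate of return of one loan.

    inflows: list of (days_since_loan_start, amount) pairs.
    Each inflow is reinvested at reinvest_rate until maturity
    (assumed convention: compounding exponent = remaining days / 365).
    """
    future_value = sum(
        amount * (1.0 + reinvest_rate) ** ((maturity_days - day) / 365.0)
        for day, amount in inflows
    )
    return (future_value / loan_amount) ** (365.0 / maturity_days) - 1.0

def reinvestment_rate(loan_start_day, finished_loans):
    """Median zero-reinvestment return of loans that ended in the 365 days
    before loan_start_day; zero when no such loan exists.

    finished_loans: list of (end_day, zero_rate_return) pairs.
    """
    window = [
        ret for end_day, ret in finished_loans
        if loan_start_day - 365 <= end_day < loan_start_day
    ]
    return median(window) if window else 0.0
```

For example, a loan of 100 repaid with a single payment of 110 after 365 days yields 10%, whereas a loan with no payments at all yields −100%, which is consistent with the mass of returns near −100% reported for defaulted loans.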

Competing models
To show that modeling an investor's rate of return is a meaningful exercise, we perform a statistical and economic evaluation in an out-of-sample forecasting framework. We compare realized returns per loan and total profits of a hypothetical investor who is using either the standard CS model based on default predictions or the PS model based on the prediction of the loan's return $P^{(2)}_{i}$.
The following sections describe the four classes of credit-risk models that we employ in this study: (1) linear regression-based regularization techniques (lasso, ridge, elastic net), (2) logistic-based regularization techniques, (3) RF, and (4) neural networks. We use regularization methods because they can be estimated quickly using conventional processing power and are also easy to interpret. We use RF and neural networks as these are standard ML models used in the P2P lending market literature.

Regularization in linear models
The lasso, ridge, and elastic net model estimates can be expressed as special cases of the following optimization problem (Tibshirani 1996; Zou and Hastie 2005):
$$\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],$$
where $y_i$ is the $i$th loan performance measure, $N$ is the number of loans, $\beta_0$ is the intercept, and $x_i$ and $\beta$ are $p \times 1$ column vectors of the standardized explanatory variables and coefficients, respectively. Parameter $\lambda$ controls for the weight of the penalty term, and if $\lambda = 0$, the model breaks down to ordinary least squares. If $\alpha = 1$, the model breaks down to the lasso approach, $\alpha = 0$ leads to the ridge regression, and $0 < \alpha < 1$ is the elastic net approach. The key difference between the three models lies in how they handle correlated regressors. In the case of multiple correlated regressors, lasso tends to select one into the model at the expense of others; ridge selects them and reduces their coefficients to a similar size, whereas the elastic net is a compromise between the two approaches. The combination of $\alpha \in \{0.1, 0.2, \ldots, 0.8, 0.9\}$ and $\lambda$ is estimated. This leads to the following four forecasts: $LM$ (linear regression model), $LM_{\lambda_{\min},\alpha=1}$, $LM_{\lambda_{\min},\alpha=0}$, and $LM_{\lambda_{\min},\alpha_{\min}}$.
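To make the role of the two tuning parameters concrete, the penalized objective can be evaluated directly. The sketch below assumes the commonly used parameterization in which the squared-error loss is divided by 2N and the ridge part of the penalty is halved; the function name is hypothetical, and the code only evaluates the objective rather than minimizing it.

```python
def elastic_net_objective(beta0, beta, X, y, lam, alpha):
    """Value of the penalized least-squares objective.

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty,
    and lam = 0 reduces the objective to (scaled) ordinary least squares.
    """
    n = len(y)
    rss = sum(
        (yi - beta0 - sum(b * x for b, x in zip(beta, xi))) ** 2
        for xi, yi in zip(X, y)
    )
    l1 = sum(abs(b) for b in beta)   # lasso part of the penalty
    l2 = sum(b * b for b in beta)    # ridge part of the penalty
    return rss / (2 * n) + lam * ((1 - alpha) / 2 * l2 + alpha * l1)
```

For coefficients (1, 2) that fit the data perfectly, the objective equals λ·3 under the lasso and λ·2.5 under ridge, which shows how the lasso penalizes coefficient size linearly while ridge penalizes it quadratically.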

Regularization in LR models
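As before, we use penalization techniques adapted for the LR. The parameter estimates can be expressed as follows:
$$\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\left(\beta_0 + x_i^{\top}\beta\right) - \log\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right)\right] + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right].$$
The only difference now is that $y_i$ represents a default/no-default loan performance measure. Letting $\lambda = 0$ leads to the standard LR model, whereas $\alpha = 1$ leads to the logistic lasso, $\alpha = 0$ to the logistic ridge, and $0 < \alpha < 1$ to the logistic elastic net. Suitable parameters are found via tenfold cross-validation maximizing area under the curve (AUC). We end up with four forecasts: $LR$, $LR_{\lambda_{\min},\alpha=1}$, $LR_{\lambda_{\min},\alpha=0}$, and $LR_{\lambda_{\min},\alpha_{\min}}$.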
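The AUC criterion used for tuning the classification models can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen defaulted loan receives a higher predicted default probability than a randomly chosen performing loan. The following Python sketch (function name hypothetical) uses this pairwise definition and counts ties as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons.

    labels: 1 for defaulted loans, 0 for performing loans.
    scores: predicted probabilities of default.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of loans, which is acceptable for a validation fold of about 800 loans; a rank-based formula would be preferable for larger samples.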

Random forest
As the name indicates, a random tree is a randomly constructed tree from a set of possible trees having K random features at each node. More formally, a RF classifier is a combination of tree-structured classifiers $\{h(x, \theta_k),\ k = 1, \ldots\}$, where the $\theta_k$ are independent identically distributed random vectors (Breiman 2001). Once the trees are created, they vote on the most popular class. The specific steps followed in training the RF model for both the classification (modeling defaults) and regression (modeling returns) tasks in this work are listed below (Friedman et al. 2001):
• For $k = 1, \ldots, K$:
  • Select a bootstrap sample $Z$ from the training data set.
  • Build a tree $T_k$ on the sample $Z$ by recursively repeating the following steps for each terminal node of the tree:
    • randomly choose $m$ features from the total input space $p$;
    • select the best performing feature and the best split point;
    • split the node.
• Output: $\{T_k\}_{k=1}^{K}$.
Having trained the RF algorithm, we proceed with making a prediction for a new loan contract, $x$:
• Modeling the returns: $f^{K}_{rf}(x) = \frac{1}{K}\sum_{k=1}^{K} T_k(x)$.
• Modeling the defaults: Let $C_k(x)$ be the predicted loan status of the $k$th tree. Then, $C^{K}_{rf}(x) = \text{majority vote}\ \{C_k(x)\}_{k=1}^{K}$. (5)
The RF models need to be tuned using data from the training data set. Specifically, suitable values for maximum tree depth (3, 6, 9, and 12), number of trees (500, 1500, and 3000), and the number of variables to possibly split at in each node (5, 10, 15, 20, 25, 30, and 40) need to be determined. We use tenfold cross-validation and a grid search, where optimum parameters are those that minimized mean squared error (regression) or maximized AUC (classification).
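The aggregation step and the size of the tuning grid can be illustrated with a short Python sketch. The function names are hypothetical, and the individual tree predictions are taken as given rather than produced by fitted trees.

```python
from collections import Counter
from itertools import product

def forest_predict_return(tree_predictions):
    """Regression forest: average the returns predicted by the K trees."""
    return sum(tree_predictions) / len(tree_predictions)

def forest_predict_default(tree_votes):
    """Classification forest: majority vote over the K predicted statuses.
    Ties are resolved in favor of the status encountered first."""
    return Counter(tree_votes).most_common(1)[0][0]

# Grid of tuning parameters as listed in the text:
# (maximum tree depth, number of trees, variables tried at each split).
TUNING_GRID = list(product(
    (3, 6, 9, 12),
    (500, 1500, 3000),
    (5, 10, 15, 20, 25, 30, 40),
))
```

The grid contains 4 × 3 × 7 = 84 parameter combinations, each of which is evaluated by tenfold cross-validation, so a full search requires 840 model fits per task.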

Neural networks
In addition to the RF, we also train feed-forward neural networks with a single hidden layer. Feed-forward networks have units with one-way connections to other units, and the units can be labeled from inputs to outputs so that each unit is only connected to units with higher numbers. A generic feed-forward network with one hidden layer can be represented by the following function (Ripley 2007):
$$y_k = f_k\Big(b_k + \sum_{j \to k} w_{j,k}\, f_j\Big(b_j + \sum_{i \to j} w_{i,j}\, x_i\Big)\Big), \tag{6}$$
where $b_j$ and $b_k$ denote the biases of the hidden and output units, respectively. Namely, to form the total input $x_i$, each unit sums its inputs and adds the bias. Consequently, to obtain the output $y_i$, we apply a function $f_i$ to $x_i$. The connections from $i$ to $j$ have weights $w_{i,j}$, which multiply the signal passing through the units. The inputs, on the other hand, have $f = 1$ as they only distribute the input. The neural network-based CS and PS system consists of two main steps: (1) data normalization and (2) model training and validation. In the first step, we re-scale the numerical variables into a range of [0,1], a step necessary for the neural network training and classification/evaluation. In the second step, to specify the two hyperparameters, size (i.e., the number of units in the hidden layer) and decay (i.e., the regularization parameter to avoid overfitting), we employ tenfold cross-validation using data from the training dataset. 5 Notably, the literature offers many studies that aimed to classify loan applicants into creditworthy or not-creditworthy using artificial neural networks (Byanjankar et al. 2015; Moscato et al. 2021; Plawiak et al. 2020; Turiel and Aste 2019). However, in practice, this methodology is not used extensively. One highly relevant barrier for wider adoption of such ML models in CS in practice is related to the concept of explainability (Arrieta et al. 2020; Arya et al. 2019). Namely, ML solutions, such as neural networks, are often referred to as black boxes because, typically, tracing the steps that the algorithm took to arrive at its decision is difficult.
This challenge is particularly relevant for P2P platforms, which, in the attempt to offer cheap administration of loans through automated scoring, are subject to the General Data Protection Regulation (GDPR). GDPR provides a right to explanation, thereby enabling users to ask for an explanation as to the decision-making processes affecting them.
5 For the Bondora dataset, the optimum size parameter is 5 (CS model) and 1 (PS model), whereas the decay parameter in both cases was found to be 0.1. For the Lending Club dataset, the optimum size parameter was again 5 (CS model) and 1 (PS model), with decay set to 0 and $10^{-4}$, respectively.
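The normalization step and the forward pass of a single-hidden-layer network can be sketched in Python as follows. The sketch assumes logistic activations in the hidden layer and either a logistic (default probability) or an identity (return) output unit; the function names and this choice of activations are illustrative assumptions, and training of the weights is not shown.

```python
import math

def min_max_scale(column):
    """Re-scale a numerical variable into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_bias, out_weights, out_bias,
            output_fn=logistic):
    """Output of a feed-forward network with one hidden layer.

    hidden_weights: one list of input weights per hidden unit
    (its length is the "size" hyperparameter).
    """
    hidden = [
        logistic(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    return output_fn(out_bias + sum(w * h for w, h in zip(out_weights, hidden)))
```

With size equal to 1, as selected for the PS models, the network reduces to a single logistic transformation of a linear index followed by a linear output, which is only a mild nonlinear extension of the linear regression.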

Forecasting procedure
Our forecasting procedure follows a standard procedure found in the P2P literature as we randomly divide our sample of loans into training (80%) and testing (20%) datasets. In the first step, using all loans from the training dataset, we estimate and fine-tune (via cross-validation) predictive models. In the second step, given estimated model parameters and characteristics of loans in the testing dataset, we predict those loans' performance measures $P^{*\{1,2\}}_{i,t_i,r}$. Figure 1 shows the procedure. 6 Simply having predicted loan performance measures is not enough to decide whether to invest or not in the given loan. For example, if the LR model for the $i$th loan estimates the probability of default to be $P^{*(1)}_{i,t_i,r} = 0.234$, should the investor invest? A similar question arises for models explaining loan returns. For example, if the LM model for the $i$th loan estimates the return to be $P^{*(2)}_{i,t_i,r} = 15.24\%$, should the investor make an investment? For both types of predictions, suitable threshold values are needed. For CS models, our threshold is $TR^{CS}_{r,i} = 0.50$ as we are using a balanced dataset (see the "Data" section), that is, loans with $P^{*(1)}_{i,t_i,r} > TR^{CS}_{r,i} = 0.5$ are predicted to default (non-performing). For PS models, we use the raw dataset, and our threshold is naturally $TR^{PS}_{r,i} = 0.00\%$, that is, loans with a negative predicted modified internal rate of return, $P^{*(2)}_{i,t_i,r} < TR^{PS}_{r,i} = 0\%$, are predicted to default (non-performing).

Returns and profit measures
Our choice to prefer returns and profits as performance measures is motivated by the fact that CS and PS models cannot be compared via the ROC curve and AUC measure. Moreover, in any real-life scenario, P2P platform operators and investors need to evaluate the credit-risk model by using a single threshold as opposed to a range of possible values. Average returns and total profit are economic measures that provide a direct and fair comparison between CS and PS models. To assess the performance of each model $r$ and loan $i$, we use the return:
$$R_{i,r} = P^{(2)}_{i,t_i} \times S(r,i), \tag{7}$$
where in the case of PS models, $S(r,i)$ is the signaling function:
$$S(r,i) = \begin{cases} 1 & \text{if } P^{*(2)}_{i,t_i,r} > TR^{PS}_{r,i} \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$
which returns 1 if the predicted performance measure exceeds the optimum threshold (the case for modeling defaults, that is, using $TR^{CS}_{r,i}$, is analogous) and 0 if otherwise. The mean return over all loans is as follows:
$$\bar{R}_r = \frac{1}{NF}\sum_{i=1}^{NF} R_{i,r}, \tag{9}$$
where $NF$ is the number of evaluated loans. Irrespective of the predictive model $r$, we use the realized return $P^{(2)}_{i,t_i}$ to evaluate the loan. Very conservative models might be highly successful in predicting positive returns correctly. However, from another aspect, these models might recommend investing in only a handful of loans. To remedy this effect, the mean return, as defined above, gives 0% return for loans for which investment is not recommended.
Fig. 1 Higher-level overview of the methodological procedure
6 Bastani et al. (2019) presented an alternative approach by combining the CS and PS models into a two-stage sequential approach. We do not opt for this approach as the one-stage PS models lead to higher average returns and total profits. However, their approach deserves attention in future studies.
Although the average return across all loans is of interest to the P2P platform provider, regulators, and supervisory institutions, investors might be interested in the realized returns across invested loans. We, therefore, report the mean return only over invested loans:
$$\bar{R}^{inv}_r = \frac{1}{\left|\{i : S(r,i) = 1\}\right|}\sum_{i=1}^{NF} R_{i,r}, \tag{10}$$
where $|.|$ denotes the cardinality of a set. 7 Next, we report the standard deviation of the returns as follows:
$$SD_r = \sqrt{\frac{1}{NF - 1}\sum_{i=1}^{NF}\left(R_{i,r} - \bar{R}_r\right)^2}. \tag{11}$$
$SD^{inv}_r$ is defined accordingly. We also report the total nominal profit $R^{profit}_r$, that is, the difference between the sum of money inflows and the loan amount without assuming any reinvestment, which is realized across only invested loans. In this way, we directly consider the importance of the prediction, which should be higher in the case of larger loans because the consequences of imprecise predictions are larger. More specifically:
$$R^{profit}_r = \sum_{i=1}^{NF} S(r,i)\left(\sum_{t \in (t_i,\, t_i + m_i]} CI_{i,t} - CO_{i,t_i}\right). \tag{12}$$
Interestingly, establishing what type of returns is usually reported in the literature is difficult, which is surprising given that the values tend to be quite different.
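The investment rule and the economic performance measures can be summarized in a short Python sketch. The function names are hypothetical, the thresholds follow the text (0.5 for predicted default probabilities and 0% for predicted returns), and the use of the sample standard deviation (divisor n − 1) is an assumption.

```python
from statistics import mean, stdev

def invest_signal(prediction, model_type):
    """Return 1 if the model recommends investing in the loan, 0 otherwise.

    CS: a predicted default probability above 0.5 means "do not invest".
    PS: the predicted return must exceed 0%.
    """
    if model_type == "CS":
        return 1 if prediction <= 0.5 else 0
    if model_type == "PS":
        return 1 if prediction > 0.0 else 0
    raise ValueError("model_type must be 'CS' or 'PS'")

def evaluate(signals, realized_returns, nominal_profits):
    """Economic measures of one model over the testing loans.

    Loans that are not recommended contribute a 0% return to the
    all-loan measures and nothing to the total nominal profit.
    """
    per_loan = [s * r for s, r in zip(signals, realized_returns)]
    invested = [r for s, r in zip(signals, realized_returns) if s == 1]
    return {
        "mean_all": mean(per_loan),
        "mean_invested": mean(invested) if invested else 0.0,
        "sd_all": stdev(per_loan),
        "total_profit": sum(p for s, p in zip(signals, nominal_profits) if s == 1),
    }
```

In a toy case with four loans, where the model avoids the single loss-making loan, the mean return over all loans (15%) is lower than the mean over invested loans (20%) because the avoided loan enters the former with a 0% return.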

Statistical evaluation
The model that leads to the highest returns might not always be superior to other models. Differences across the performance of models might be just an artifact of the inherent high uncertainty regarding the outcomes of P2P loans. Therefore, we formally compare the realized return [Eq. (9)] of all 12 models: six models that predict returns and six models that predict the probability of default. To account for multiple pairwise comparisons and possible data snooping, we use the model confidence set (MCS) of Hansen et al. (2011). In our setting, we define the loss function for a given model and a loan to be $l_{i,r} = -R_{i,r}$, that is, the higher the return, the lower the loss of a given model. Next, the difference between the losses of models $m, n$ is as follows:
$$d_{m,n} = l_{i,m} - l_{i,n}; \quad m, n = 1, 2, \ldots; \quad i = 1, 2, \ldots, NF. \tag{13}$$
The equal predictive ability (EPA) hypothesis is as follows:
$$H_0: E\left[d_{m,n}\right] = 0 \quad \text{for all } m, n. \tag{14}$$
We use the $T_{\max}$ statistic of Hansen et al. (2011) to test the above hypothesis, where the distribution under the null hypothesis is derived using a bootstrapping procedure with 5000 bootstrap samples. As indicated above, the MCS procedure is a sequence of EPA tests, where we start with a set of all 12 models and perform the above test. If the null is not rejected, the superior set consists of all models. If the null is rejected, we remove the worst performing model and continue with the EPA test on the remaining 11 models. The procedure continues until the null is not rejected or only one model is left. The significance level $\alpha$ is set to 0.10. The higher the $\alpha$, the lower the confidence level, and the fewer models tend to be retained in the superior set of models.
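The sequential elimination logic of the MCS can be illustrated with a simplified Python sketch. It is not the implementation used in the paper: it resamples loans independently (an i.i.d. bootstrap, which seems reasonable for cross-sectional loan data but is an assumption), measures each model's loss relative to the average loss of the models still in the set, and does not report MCS p-values. All names are hypothetical.

```python
import random

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def model_confidence_set(losses, alpha=0.10, n_boot=1000, seed=0):
    """Simplified MCS with a T_max-type statistic.

    losses: dict mapping model name -> list of per-loan losses
    (all lists have the same length and loan order).
    Returns the list of models that remain in the superior set.
    """
    rng = random.Random(seed)
    models = list(losses)
    n = len(losses[models[0]])
    while len(models) > 1:
        # Loss of each model relative to the average model in the current set.
        avg = [_mean(losses[m][i] for m in models) for i in range(n)]
        d = {m: [losses[m][i] - avg[i] for i in range(n)] for m in models}
        d_bar = {m: _mean(d[m]) for m in models}
        # Bootstrap the mean relative losses by resampling loans.
        boot = {m: [] for m in models}
        for _ in range(n_boot):
            idx = [rng.randrange(n) for _ in range(n)]
            for m in models:
                boot[m].append(_mean(d[m][i] for i in idx))
        sd = {m: max(_mean((b - d_bar[m]) ** 2 for b in boot[m]) ** 0.5, 1e-12)
              for m in models}
        t_stat = {m: d_bar[m] / sd[m] for m in models}
        t_max = max(t_stat.values())
        # Distribution of T_max under the null: centered bootstrap statistics.
        t_max_boot = [max((boot[m][b] - d_bar[m]) / sd[m] for m in models)
                      for b in range(n_boot)]
        p_value = _mean(1.0 if tb >= t_max else 0.0 for tb in t_max_boot)
        if p_value >= alpha:
            break  # equal predictive ability is not rejected
        models.remove(max(t_stat, key=t_stat.get))  # drop the worst model
    return models
```

In this sketch, a model whose loss is uniformly higher than that of its competitors is eliminated in the first round, whereas models with statistically indistinguishable losses remain together in the superior set.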
Data pre-processing, sampling (under- and over-sampling), statistical model estimation, and evaluation are performed in R (Wickham et al. 2021; Wallig et al. 2020a, 2020b; Bernardi and Catania 2018; Gorman 2018; Friedman et al. 2010; Kuhn et al. 2008; Ripley 2007; Wright and Ziegler 2015). Scripts performing the predictions and evaluation of loans are available as Additional file 1.

Overview of performance loan measures
In Fig. 2 (Bondora in the left panel, Lending Club in the right panel), we highlight loans with negative returns (P(2)); by returns, we henceforth mean modified internal rates of return. Similar figures are reported by Serrano-Cinca and Gutiérrez-Nieto (2016) and Bastani et al. (2019). The loan returns are skewed to the left, as negative returns are possible and returns near −100% occur often. This also leads to a large variation in returns. In Table 1 (training dataset), we report the default rates of 22.5% (Bondora) and 19.7% (Lending Club) and the average annualized returns stratified by profitable and nonprofitable loans.
The results show that fishing for good loans might be very lucrative, as the return is 27.06% for profitable loans on Bondora and 13.08% for Lending Club. Given the low interest rate environment in the respective countries over the sample period, these numbers suggest why investing in the P2P market is attractive for many investors. On the other hand, for nonprofitable loans, the return is −43.98% (Bondora) and −39.75% (Lending Club). Comparing loans from the two markets shows that our sample of loans from Bondora is riskier. That is, the sample offers higher returns for profitable loans but also lower returns for non-performing loans. Moreover, the standard deviations are larger for returns realized on Bondora than on Lending Club loans, both for performing (7.49% vs. 4.31%) and non-performing loans (41.71% vs. 32.38%).
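The stratified summary reported above can be illustrated with a short Python sketch. The function name `summarize_returns`, the rule that a loan counts as profitable when its return is strictly positive, and the use of the sample standard deviation are assumptions for the example; the inputs are toy values, not data from Table 1.

```python
from statistics import mean, stdev


def summarize_returns(returns, defaulted):
    """Default rate plus mean and SD of returns, split by profitability.

    returns:   per-loan annualized returns (e.g., 0.10 for 10%)
    defaulted: per-loan default flags (True if the loan defaulted)
    """
    # Assumption: a return of exactly zero is counted as nonprofitable.
    profitable = [r for r in returns if r > 0]
    nonprofitable = [r for r in returns if r <= 0]
    return {
        "default_rate": mean(1.0 if d else 0.0 for d in defaulted),
        "profitable": (mean(profitable), stdev(profitable)),
        "nonprofitable": (mean(nonprofitable), stdev(nonprofitable)),
    }
```

Each group needs at least two loans, because the sample standard deviation is undefined for a single observation.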

Credit or profit scoring?
Fig. 2 Distribution of loan returns. Notes: Red color denotes loans that led to a negative internal rate of return.
Table 2 Out-of-sample performance of credit and profit scoring models: Bondora. Notes: $LR_{\min,\alpha=1}$, $LR_{\min,\alpha=0}$, $LR_{\min,\alpha_{\min}}$ are lasso, ridge and elastic net versions of logistic regression; $LM_{\min,\alpha=1}$, $LM_{\min,\alpha=0}$, $LM_{\min,\alpha_{\min}}$ are lasso, ridge and elastic net versions of linear regression. SD standard deviation, LR logistic regression-based models, LM linear regression-based models, RFC random forest classification, NNC neural network classification, RRR denotes random forest regression, NNR neural network regression. † denotes a model that belongs to the set of superior models. [Table body not recovered; column headers include "Performance across all loans" and "Total profit in mil. EUR".]
Tables 2, 3, 4 and 5 show the results from the individual CS and PS models. We report results for the Bondora loans in Tables 2 and 3 and results for the Lending Club loans in Tables 4 and 5. For example, the value of 59.8% in the first row of Table 2 corresponds to the percentage of loans that the forecasting model in the given row identified as ones in which an investment can be made, that is, S(r, i) = 1. The average annualized return per invested loan using LR is 19.57%, with a considerable standard deviation of 33.06. Notably, for PS models, the returns are very similar but have a higher standard deviation. Although this might suggest the superiority of CS models, it comes at the price of a much lower percentage of invested loans, which for PS models jumps to more than 75%. An exception is the RF CS model, which performs similarly to the PS RF model. This model is also the only CS model whose returns belong to the superior set of models, as indicated by the test of Hansen et al. (2011). As CS models tend to overestimate risks (which leads to the low percentage of invested loans), unsurprisingly, the average return across all loans is much lower for CS models (around 11%) than for PS models (around 15%). Compared with CS models, returns for PS models are only 3.1% higher across invested loans but much higher, by 24.0%, across all loans. Is the higher number of invested loans worth the effort? The total profit measure⁸ suggests that it is, as it leads to absolute profits (the last column in Table 2) that are approximately 26.7% higher for PS models.
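The relation between the share of invested loans, the return per invested loan, and the return across all loans can be made concrete with a small Python sketch. The function names, the hurdle of zero for predicted returns, the 0.5 cutoff for predicted default probabilities, and the convention that a loan without investment contributes a zero return are all assumptions for illustration; the paper's exact definitions are given by its Eq. (9) and the investment indicator S.

```python
from statistics import fmean


def profit_scoring_rule(predicted_returns, hurdle=0.0):
    """S = 1 when the predicted return exceeds a hurdle (hypothetical: 0)."""
    return [1 if r > hurdle else 0 for r in predicted_returns]


def credit_scoring_rule(predicted_pd, cutoff=0.5):
    """S = 1 when the predicted default probability is below a cutoff
    (hypothetical: 0.5)."""
    return [1 if p < cutoff else 0 for p in predicted_pd]


def portfolio_metrics(invest, realized_returns):
    """Share of invested loans, mean return per invested loan,
    and mean return across all loans."""
    invested = [r for s, r in zip(invest, realized_returns) if s == 1]
    share_invested = len(invested) / len(invest)
    avg_invested = fmean(invested) if invested else 0.0
    # Assumption: loans that are not invested in contribute a zero return.
    avg_all = fmean(r if s == 1 else 0.0
                    for s, r in zip(invest, realized_returns))
    return share_invested, avg_invested, avg_all
```

Under the zero-return assumption, the average across all loans equals the invested share times the average per invested loan. For instance, 0.598 × 19.57% ≈ 11.7%, which is in line with the roughly 11% reported for CS models.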

In Table 3, we report accuracy, specificity, and sensitivity across CS and PS models for the Bondora loans. An interesting observation is that default scoring models tend to be more accurate at predicting good loans (specificity) at the expense of predicting nonperforming loans (sensitivity). The overall accuracy of PS models is, however, higher, on average by 6.7%, and ranges from 75.5% (neural network model) to 77.4% (RF). For CS models, it ranges from 67.9% (neural network model) to 78.8% (RF). As before, among CS models, RF clearly stands out, matching the RF regression, the best PS model. However, prior to any analysis, it is unclear which model will perform best. Given this model choice uncertainty, our results suggest that, in general, PS is overwhelmingly the preferred choice in modeling P2P credit risk on the Bondora loan market. The reason is that only one of the CS models is able to match the results of the individual PS models.
Having established our key results for the Bondora dataset, we report performance results for the Lending Club dataset in Tables 4 and 5. Previous research (e.g., Bastani et al. 2019; Serrano-Cinca and Gutiérrez-Nieto 2016) already established the superiority of PS models for Lending Club loans. However, whether these results hold for regularization methods, RF, and neural network models is unclear. In several ways, the results in Tables 4 and 5 are similar to those for the Bondora dataset. The average return across invested loans is in the range of 7.46% (RF) to 8.66% (LR) for CS models, whereas it ranges from 7.83% (neural networks) to 8.94% (lasso regression) for PS models. On average, PS models outperform CS models by 2.9%, close to what we observed with the Bondora dataset. When we turn our attention to average returns across all loans, the gap between CS (from 5.19% for the elastic net to 5.86% for RF) and PS (from 5.84% for linear regression to 6.50% for the elastic net) models widens to 15.5% in favor of PS models. These results are also supported by the MCS, which includes only five PS models (all except linear regression), and by total profit, which is 21.5% higher for PS models.
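The three classification measures can be computed from a confusion matrix as in the following Python sketch. The function name and the neutral output keys are assumptions for the example. Which of the two class-specific rates is called sensitivity and which specificity depends on the class treated as positive; the text above pairs specificity with good loans, so the sketch reports both rates under descriptive names rather than fixing the labels.

```python
def classification_metrics(predicted_bad, actual_bad):
    """Accuracy and per-class hit rates for good and bad loans.

    predicted_bad: per-loan booleans, True if the model flags the loan as bad
    actual_bad:    per-loan booleans, True if the loan turned out bad
    """
    pairs = list(zip(predicted_bad, actual_bad))
    bad_hit = sum(1 for p, a in pairs if p and a)            # bad, flagged bad
    bad_miss = sum(1 for p, a in pairs if not p and a)       # bad, flagged good
    good_hit = sum(1 for p, a in pairs if not p and not a)   # good, flagged good
    good_miss = sum(1 for p, a in pairs if p and not a)      # good, flagged bad
    return {
        "accuracy": (bad_hit + good_hit) / len(pairs),
        "bad_loans_identified": bad_hit / (bad_hit + bad_miss),
        "good_loans_identified": good_hit / (good_hit + good_miss),
    }
```

For a profit scoring model, a loan could be flagged as bad when its predicted return is not positive; this mapping is an assumption here and not a rule taken from the text.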
A closer inspection of our results (Tables 3 and 5) shows that the gains from PS stem from the ability of PS models to better identify bad loans, that is, they have higher specificity. On the contrary, CS models are better at identifying good loans. The overall accuracy is 3.1% higher for PS models. Moreover, as before, RF stands out among CS models, achieving a level of accuracy that even surpasses that of PS models. To visualize the accuracy of the RF model (generally the best-performing model), Fig. 3 plots predicted and realized returns. In Fig. 3, we can also observe that several loans are labeled as defaulted (red dot), whereas their realized return was positive. This happens if the borrower has not met all obligations but has repaid most of the loan and interest.
Investors' perspective typically differs from that of a lending platform, as the former is solely focused on profitable investments. To mimic the cherry-picking behavior of an investor, Serrano-Cinca and Gutiérrez-Nieto (2016) and Bastani et al. (2019) reported internal rates of return for the 100 (former) or 30 (latter) loans with the highest predicted scores on the Lending Club loan market. In the former case, the linear regression led to a return of 11.92%, whereas CHAID⁹ led to a 5.98% return across 100 loans. In the latter case, the one-stage methodology produced returns in a range of 9.4% (wide learning) to 13.4% (wide and deep learning). Moreover, the two-stage methodology led to a range of 12.8% (wide learning) to 16.4% (wide and deep learning) across 30 loans. Our results are similar, as we achieved 10.92% for the top 100 loans and 12.91% for the top 30 loans on the Lending Club loan market using the best PS approach, RF regression. With the CS approach, the most that we could hope for was an 8.05% and a 7.84% return with the RF classification algorithm.
The same strategy on the Bondora loan market would lead to higher returns for the investor. The best CS model (RF classification) led to 9.65% for the top 30 and 10.95% for the top 100 loans. PS models fared better here as well, with much higher returns of 28.95% for the top 100 and 28.80% for the top 30 loans (RF regression). These results show that PS models are specifically suited for budget-constrained, return-maximizing investors who have to select a certain number of loans.
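The cherry-picking exercise above amounts to ranking loans by predicted score and averaging the realized returns of the top k. A minimal Python sketch follows; the function name is hypothetical, and for a credit scoring model the score is assumed to be the negative of the predicted default probability, so that safer loans rank higher.

```python
def top_k_average_return(predicted_scores, realized_returns, k):
    """Mean realized return of the k loans with the highest predicted score."""
    ranked = sorted(zip(predicted_scores, realized_returns),
                    key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(r for _, r in top) / len(top)


# Profit scoring: the score is the predicted return itself.
# Credit scoring (assumption): the score is minus the predicted default
# probability, e.g. scores = [-p for p in predicted_pd].
```

Setting k to 30 or 100 reproduces the portfolio sizes discussed in the text; the inputs would be out-of-sample predictions and realized returns.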
An important aspect of investing is the trading activity governed by the given credit-risk model, which is of concern to investors and P2P market providers alike. As noted above, sizable differences exist: PS models suggest investment in 71.8% (neural networks) to 76.3% (lasso) of cases, whereas CS models do so in only 59.3% (ridge) to 75.8% (RF, an exception) of cases. On average, a difference of 20.4% exists in favor of PS models. A similar pattern is observed for the Lending Club loan market, with an increase in trading activity of 11.8% for PS models.
To summarize, the benefits of using PS models are higher overall returns, accuracy, and trading activity, whereas returns across invested loans are similar to those of CS models. These results hold for the Bondora and Lending Club loan markets.

Conclusion
In the past decade, the emergence of novel P2P lending has led to new challenges for investors, risk managers, and regulators. For the industry to thrive, its credit-risk models should be improved. The technology can serve as an intermediary between the lender and borrower in a market of consumer loans. In this paper, we present empirical results from PS models that help decision-makers, investors, and operators of P2P platforms to manage risky loans better. In doing so, we provide new and supporting evidence that PS models tend to outperform default scoring models.
We use data on loans and loan payments from Bondora, a European P2P platform that facilitates short-term risky loans between borrowers and lenders, as well as data from the well-known Lending Club marketplace. Using regularization methods (lasso, ridge, elastic net) in linear and logistic regressions, as well as RF and neural networks, our empirical results suggest that modeling the adjusted internal rate of return leads to much higher returns
⁹ Chi-square automatic interaction detection algorithm.