To supervise or to self-supervise: a machine learning based comparison on credit supervision

This study investigates the need for credit supervision as conducted by on-site banking supervisors. It builds on a real bank on-site credit examination to compare the performance of a hypothetical self-supervision approach, in which banks themselves assess their loan portfolios without external intervention, with the on-site banking supervision approach of the Central Bank of Brazil. The experiment develops two machine learning classification models: the first model is based on good and bad ratings informed by banks, and the second model is based on past on-site credit portfolio examinations conducted by banking supervision. The findings show that the overall performance of the on-site supervision approach is consistently higher than the performance of the self-supervision approach, justifying the need for on-site credit portfolio examination as conducted by the Central Bank.

The purpose of banking supervision is to keep the financial system sound and safe, ensuring that financial regulation, the set of rules governing the financial system, is followed (Masciandaro and Quintyn 2015). 1 The flagship of financial regulation is the Basel Accords, the policy directives prepared by the Basel Committee on Banking Supervision, a high-level committee of the Bank for International Settlements (BIS). These accords are adopted worldwide. The third Basel Accord, which emerged in the aftermath of the Great Financial Crisis (2008/2009), broadened the scope of prudential regulation and embraced liquidity and leverage as relevant microprudential issues. However, the solvency-based perspective remains the focus of prudential regulation, highlighting the capital adequacy ratio as its leading indicator.
From the solvency perspective, keeping the financial system sound and safe means asserting the worth of the bank's assets (Hellwig 2014). In particular, credit portfolio assessment, due to its relevance among assets, is an important task assigned to banking supervision. From the accounting standpoint, the credit portfolio is often measured at amortized cost net of the loan loss provision. The loan loss provision is a combination of incurred and expected losses, which is designed to adjust the credit portfolio to its fair value. The role of banking supervision is to assess loan portfolios and check whether banks comply with rules and regulatory requirements, especially the adequacy of the loan loss provision to the loan portfolio risk profile.
Although credit portfolio assessment is a classic banking supervision predicate, the Great Financial Crisis slowed down a self-regulation process that gradually increased the reach of internal-based models. The internal ratings-based approach, an important innovation introduced in the Second Basel Accord, allows banks to replace regulatory standardized models with proprietary versions internally developed. Continuous innovation in the financial system, brought about by the technology revolution, may suggest the reignition of this process in the spirit of Stefanadis (2003). De Chiara et al. (2018) analyzed the effect of tighter regulation and powerful supervision in the financial sector and the consequent social costs. The authors argued that the optimal supervisory architecture combines a supervisory regime where direct assessment by a supervisor is always required (mandatory supervision) with a flexible supervision regime where banks self-select the regulatory contract designed for their level of risk.
In this sense, this study investigates the need for credit supervision as conducted by the Central Bank of Brazil (CBB). It builds on a real bank on-site credit examination to compare the performance of a hypothetical self-supervision approach, where banks themselves assess their loan portfolios without external intervention, with CBB's on-site banking supervision. To conduct this experiment, we train two machine learning classification models: (1) a model based on good and bad ratings informed by banks and (2) a model based on past on-site loan portfolio examinations conducted by CBB's banking supervision.
1 Although complementary, banking regulations and banking supervision are separate activities, usually performed by different actors. The former concerns the rules governing the financial system, whereas the latter regards the enforcement of such rules (Masciandaro and Quintyn 2015). In the Brazilian financial system, the National Monetary Council is responsible for banking regulations and the Central Bank of Brazil (CBB) is responsible for banking supervision.
The findings show that CBB's on-site supervision consistently outperforms the self-supervision approach, which justifies the necessity of on-site credit portfolio examination as conducted by CBB.
The remainder of this paper is structured as follows. Section 2 discusses the related literature on financial supervision and loan loss provisioning. Section 3 highlights the Brazilian financial system, credit regulation, and supervision. Section 4 presents the empirical analysis comprising (1) the machine learning algorithm used to develop classification models based on on-site supervision and banks' experience; (2) on-site examination procedure that produced the ground truth against which both supervisory approaches are compared; and (3) the analysis of the results. Section 5 concludes the paper.

Banking supervision and loan loss provisioning regulation
The financial crisis cast doubt on policy certainties ranging from monetary policy to financial regulation and supervision. Barth et al. (2013) and Blanchard (2009) argued that the crisis was the result not only of incomplete regulation but also of ineffective supervision. Bernanke (2010) asserted that stronger regulation and supervision aimed at problems with underwriting practices and lenders' risk management would have been a more effective and surgical approach to constraining the housing bubble than a general increase in interest rates. His assertion is based on evidence of declining lending standards during the boom. Viñals et al. (2010) drew lessons from the financial crisis to answer why some countries were less affected than others with similar financial systems operating under the same set of global rules. The authors argued that, besides the need for better regulations in areas such as capital, liquidity, and provisioning, financial supervision was not as effective as it should have been. Moreover, they mentioned that to be effective, financial supervision must be intrusive, adaptive, skeptical, proactive, comprehensive, and conclusive. Therefore, a twofold approach was needed. On the one hand, regulation was broadened and enhanced, including an explicit financial stability mandate, headed by financial stability committees. On the other hand, supervisory skills incorporated additional toolkits to face the forward-looking assessments of risks and the challenging macroprudential dimension the crisis added to supervision. Masciandaro and Quintyn (2013) stated that financial supervision is the vital link between financial regulation and financial sector stability. Financial supervision acts as an essential complement to financial regulation in the authorities' pursuit of financial stability. The importance of financial supervision as an independent policy area motivated the development of strands in the literature to understand its role.
Among the topics that gained attention are the relationship between supervision and monetary policy (Goodhart and Schoenmaker 1995; Poloz 2015; Antunes, Moraes and Montes 2016); supervisory architecture (Taylor 1995, 1996); and supervisory governance (Kane 1989; Randall 1993).
Establishing clearly the roles assigned to financial regulation and supervision is a starting point to Freixas and Santomero's (2002) thorough review of the theoretical framework of banking regulation and supervision. On the one hand, financial intermediaries present the solution to market imperfections derived from asymmetric information problems. On the other hand, regulation and supervision are the response to avoid excessive risk-taking or undesired monopolistic powers that can emerge as consequences of financial intermediaries' actions. Whenever a financial intermediary addresses a market failure, it works as a second-best solution, for it causes another market failure and requires financial regulation and supervision.
Among the market failures addressed by financial intermediaries, such as providing liquidity risk insurance, creating safe assets, screening of potential borrowers, and monitoring customers' actions and efforts (Freixas and Santomero 2002), we draw attention to the screening and monitoring activities as those directly linked to this study.
The quality of banks' assets, as well as the quality of their balance sheets, points to the quality of screening and monitoring activities. Gorton (1988) argued that a bank failure may signal both a weakness limited to the bank and fragility in the system as a whole. Thus, systemic risk may emerge from microprudential failures that make depositors question the soundness of the financial system. Financial supervision is tasked with assessing the quality of screening and monitoring practices. Moreover, for credit, it is responsible for spotting bad credit clusters and properly resolving them before they turn into a going-concern problem. Such action avoids spillover effects and mitigates systemic risk.
An extensive number of studies have considered the influence of financial supervision on banks' risk-taking to be relevant. However, results are mixed when it comes to the effects of supervision on financial stability. For instance, Bhattacharya et al. (2002) concluded that intense supervision can improve the timeliness of supervisory intervention, whereas Delis and Staikouras (2009) showed that intense supervision can limit banks' risk-taking. White (2006) defended supervision and regulation as the best instruments to achieve financial stability, whereas Barth et al. (2004, 2008, 2013) argued that the efficiency of financial intermediation, hence financial performance, is reduced by financial supervision. Meanwhile, Brown and Dinç (2011) used a competing risk hazard model for bank survival to study bank failures in 21 emerging market countries in the 1990s and showed that a government is less likely to take over or close a failing bank if the banking system is weak, hence establishing a Too-Many-to-Fail effect based on regulatory forbearance.

Brazilian financial system: credit regulation and supervision
In the Brazilian financial system, henceforth financial system, different types of financial institutions coexist, ranging from niche institutions, which explore specific types of activities, to universal banks, which gather many different activities in the same entity. The financial system is complex and well developed. In June 2019, it comprised 178 banks, amounting to 126% of GDP in assets and 47% of GDP in credit, 2 which makes Brazil an interesting case study. The National Monetary Council (NMC) is the financial regulator, and unlike in other jurisdictions, CBB is responsible for all aspects of financial institutions' oversight, from entry to resolution, consistent with Barth et al.'s (2004) public interest view.
In Brazil, the supervisory process partially follows the Twin Peaks model (Group of Thirty 2008; FSI 2018), which recommends supervisory specialization by objectives: prudential monitoring of regulated institutions and oversight of business conduct. Although the Twin Peaks model expects two separate financial supervision authorities to tackle banking supervision, the Brazilian solution is a hybrid model in which an integrated supervisor, namely, CBB, holds both objectives inside the same authority.
Prudential supervision (henceforth banking supervision) is the focus of our analysis. The objective of banking supervision is to assess the soundness of financial institutions, mainly commercial banks, and to ensure that regulation is complied with. It rests on two cornerstones: examination, or on-site supervision, and monitoring, or off-site supervision. On-site supervision follows a supervision cycle and involves sending supervisory staff to banks to conduct specific examinations. Off-site supervision is a permanent process that analyzes banks' performance and compliance with regulation based on multiple sources of data, as well as the outcomes of on-site supervision.
The Brazilian financial regulator, the NMC, has not yet adopted IFRS 9 as the loan loss provisioning regulation for the financial system. To date, NMC resolution 2682/99 (NMC 1999) defines the loan loss provisioning regulation. It combines expected loss and incurred loss approaches in the same framework. Accordingly, financial intermediaries are bound to assign an individual rating to each credit operation booked in the loan portfolio. As presented in Table 1, nine different ratings reflect the minimum and maximum provisions as percentage points of the loan amount due. Whether the credit is due or past due defines the way ratings are assigned. For due credits, banks apply the expected loss approach, in which they assign ratings as they see fit, as long as these are based on consistent credit risk assessment. The expected loss approach assigns ratings compatible with the loss banks expect to face in each credit operation over its lifetime. However, when credit is past due, the incurred loss approach can be used, and banks are required to assign ratings compatible with the length of delinquency, as determined by the regulation (see Table 1 for more details).
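As a rough illustration of how the incurred loss rule constrains the rating of a past-due credit, the sketch below maps days past due to a floor rating and a minimum provision. Table 1 is not reproduced in this section, so the delinquency bands and provision percentages are assumptions taken from a reading of Resolution 2682/99 and should be checked against the table; all function and constant names are hypothetical.

```python
# ASSUMPTION: bands and percentages follow a reading of NMC Resolution 2682/99;
# verify them against Table 1 before relying on this sketch.
RATING_ORDER = ["AA", "A", "B", "C", "D", "E", "F", "G", "H"]  # best to worst

# (minimum days past due, floor rating), checked from the longest delay down.
DELINQUENCY_FLOORS = [
    (181, "H"), (151, "G"), (121, "F"), (91, "E"),
    (61, "D"), (31, "C"), (15, "B"),
]

# Minimum provision as a percentage of the amount due, per rating.
MIN_PROVISION_PCT = {
    "AA": 0.0, "A": 0.5, "B": 1.0, "C": 3.0, "D": 10.0,
    "E": 30.0, "F": 50.0, "G": 70.0, "H": 100.0,
}


def delinquency_floor(days_past_due):
    """Best rating still allowed for a credit with this delay."""
    for min_days, rating in DELINQUENCY_FLOORS:
        if days_past_due >= min_days:
            return rating
    return "AA"  # no delinquency-based floor below 15 days


def regulatory_rating(bank_rating, days_past_due):
    """Worse of the bank's own rating and the delinquency floor."""
    floor = delinquency_floor(days_past_due)
    return max(bank_rating, floor, key=RATING_ORDER.index)


def minimum_provision(bank_rating, days_past_due, amount_due):
    """Minimum loan loss provision implied by the resulting rating."""
    rating = regulatory_rating(bank_rating, days_past_due)
    return amount_due * MIN_PROVISION_PCT[rating] / 100.0
```

Under these assumptions, a credit that the bank rates "B" but that is 95 days past due would be floored at "E", the threshold the paper later uses to separate bad from good credits.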
In June 2019, the amount of loan loss provisions (LLP) in the Brazilian financial system equaled 6.22% of credit portfolios and 18.2% of equity, reflecting the relevance of credit activity, hence loan portfolios, to the financial system. 3 The larger the loan portfolio, the more vulnerable banks are to an increase in loan default arising from deteriorating economic conditions (Laeven and Majnoni 2003). Therefore, monitoring and supervising LLP is a crucial microprudential surveillance tool that bank supervisors use to assess banks' loan portfolio quality (Ozili and Outa 2017). On-site prudential credit supervision operates under two perspectives, namely, credit management and credit risk. Credit management inspections focus on credit processes and on the compliance of internal credit policies with credit regulation and good practices. As for credit risk inspections, the objective is to assess the quality of credit portfolios and the sufficiency of LLP. On-site credit risk supervision focuses on insufficiently provisioned incurred losses and is centered on the borrower's financial performance, which involves intense cash flow analysis, following Antunes et al. (2017) and Antunes et al. (2019). Banks that fall short of incurred loss provisions are required to increase them to match their loan portfolios' risk.
From the loan portfolio information that banks file monthly at CBB's credit bureau repository, it is possible to derive elementary cash flow variables, such as expected, received, and disbursed cash flows. Those variables are calculated monthly at the loan level. Because the analysis focuses on borrowers, the loan-level cash flow variables are aggregated into borrower-level cash flow variables. The next step is to calculate borrowers' financial performance indices, such as borrower's cash performance (BCP) and borrower's liquidity performance (BLP). These indices are calculated over the six-month period preceding the starting date of the analysis. Equations (1) and (2) present the calculation of the indices.
(1) BCP = …
(2) BLP = …
Cash-flow-based analysis selects borrowers under cash flow indices (1) and (2). Once borrowers are selected, the on-site credit risk examination assesses them to confirm the suggested bad credit risk and possible LLP insufficiency. Borrowers presenting credits over 90 days past due are considered bad, and the provision assigned by the bank is compared with the regulatory requirements.
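The aggregation from loan level to borrower level described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout and field names (`borrower_id`, `month`, `expected`, `received`, `disbursed`) are hypothetical, and the BCP and BLP formulas are deliberately not implemented because Eqs. (1) and (2) are not legible in this section.

```python
from collections import defaultdict

FLOW_FIELDS = ("expected", "received", "disbursed")  # hypothetical field names


def aggregate_to_borrower(loan_rows):
    """Sum loan-level monthly cash flows into borrower-level monthly totals."""
    totals = defaultdict(lambda: dict.fromkeys(FLOW_FIELDS, 0.0))
    for row in loan_rows:
        key = (row["borrower_id"], row["month"])
        for field in FLOW_FIELDS:
            totals[key][field] += row[field]
    return dict(totals)


# Toy records: borrower B1 holds two loans, borrower B2 holds one.
loans = [
    {"borrower_id": "B1", "loan_id": "L1", "month": "2019-01",
     "expected": 100.0, "received": 80.0, "disbursed": 0.0},
    {"borrower_id": "B1", "loan_id": "L2", "month": "2019-01",
     "expected": 50.0, "received": 50.0, "disbursed": 0.0},
    {"borrower_id": "B2", "loan_id": "L3", "month": "2019-01",
     "expected": 200.0, "received": 0.0, "disbursed": 500.0},
]
borrower_flows = aggregate_to_borrower(loans)
```

The borrower-level totals over the six months before the analysis date would then feed whatever index definitions Eqs. (1) and (2) prescribe.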
This study investigates the need for credit supervision as conducted by CBB. It builds on a real bank on-site credit examination to compare the performance of CBB on-site supervision with a hypothetical self-supervision, where banks themselves assess their loan portfolios without external intervention. The experiment trains a machine learning algorithm to develop two classification models: the first one based on past on-site loan portfolio examinations conducted by CBB's banking supervision, and the second one based on good and bad ratings informed by banks. Then, both classification models are applied to the real bank loan portfolio to compare performances. Figure 1 presents the procedure adopted to conduct the analysis.

Empirical analysis
The technology revolution has reached the financial system. Although the extent and depth of its effects on the conduct of business are yet to be fully realized, the only certainty is that business will change. Alongside the emergence of new entrants in the financial system, supervisory policymakers and standard setters around the world draw attention to risks and opportunities for financial stability. Technology-enabled innovation in financial services (FinTech) develops rapidly and demands a continuous assessment of the adequacy of regulatory frameworks (FSB 2017).
Financial supervision gradually absorbs innovative technology approaches and the terms regtech and suptech were coined to capture a series of initiatives that use innovative technologies in the financial supervision domain (FSI 2018). Regtech accounts for innovative technologies used in support of compliance with the financial regulation, whereas suptech refers to the conduct of financial supervision underpinned by innovative technologies. Kou et al. (2019) surveyed the existing literature on assessment and measurement of financial systemic risk combined with machine learning technologies, including big data analysis, network analysis, and sentiment analysis. They introduced the current research on financial systemic risk with machine learning methods and proposed directions for future work.
Concerning financial supervision, artificial intelligence techniques, mainly those involving machine learning algorithms, are the most used (FSI 2018). Samuel (1959) defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. Generally, machine learning deals with (automated) optimization, prediction, and categorization, not with causal inference. In other words, classifying whether the borrower is a good or bad credit risk is a machine learning task. However, determining what factors drive credit quality is unlikely to be a machine learning challenge (FSB 2017_A).
The different categories of machine learning algorithms relate to the extent of the human intervention required. In supervised learning, the algorithm receives a set of training data containing labels that classify the observations. In contrast, unsupervised learning detects patterns in the data through similar underlying characteristics, making labels unnecessary (Kou et al. 2014).

Methodology
According to Kirasich et al. (2018), selecting a learning algorithm to implement for a particular application remains an ad hoc process based on fundamental benchmarks such as the classifier's overall loss function and misclassification metrics.
Even so, models for predicting bankruptcy and default events have been the object of intensive research. Kou et al. (2021) proposed a bankruptcy prediction model for SMEs that uses transactional data and payment-network-based variables instead of financial (accounting) data. Wang et al. (2020) investigated credit default risk in P2P lending, arguing that standard binary classifiers are inappropriate in this setting because there are multiple credit classes and misclassification costs vary widely across them. Using publicly available data from Lending Club, the authors modeled credit rating in P2P lending as a cost-sensitive multi-class classification problem and showed that cost-sensitive classifiers can significantly reduce the total cost. Meanwhile, Shen et al. (2020) proposed a novel three-stage reject inference learning framework using unsupervised transfer learning and three-way decision theory to infer the possible repayment behavior of rejected credit applicants. The framework was validated on Chinese credit data.
In particular, traditional statistical techniques have been compared with artificial intelligence models. Barboza et al. (2017) tested different machine learning models, such as the random forest, to predict bankruptcy 1 year before the event, and compared their performance with results from statistical techniques, such as logistic regression. Using data from 1985 to 2013 on North American firms, the authors concluded that machine learning models show, on average, approximately 10% more accuracy than the traditional models. Comparing the best models, random forest led to 87% accuracy, whereas logistic regression analysis led to 69% accuracy in the testing sample. Addo et al. (2018) built binary classifiers based on machine and deep learning models and used real data to predict loan default probability. Their findings suggest that tree-based models are more stable than models based on multilayer artificial neural networks.
A more traditional statistical technique, such as logistic regression, might lead to similar results concerning the comparison between the supervisory approaches. However, one must keep in mind that when dealing with more complex datasets, a linear or continuous statistical model cannot fit complex non-linear and non-monotonic behavior and may not segment the class labels efficiently, thus leading to poor accuracy. More sophisticated algorithms, such as random forest, may then be required because they can learn a non-linear decision boundary and thus achieve higher accuracy scores, as presented in Fig. 2 (Bacham and Zhao 2017). The statistical method used in this study is random forest (RF), introduced by Breiman (2001) as an extension of the decision tree method (Breiman et al. 1984).
Table 2 Steps to turn a machine learning algorithm into a classification device
1 Choose the exogenous variables that make up the matrix of features
2 Define the borrowers from which the matrices of features will be built and whose labels (good or bad borrowers, the endogenous variable) are known. In this study, we use two different sets of borrowers, belonging to the two supervisory approaches investigated
3 Build the two datasets that will be used to train the algorithms, according to the two supervisory approaches analyzed
4 Run (train) the algorithms on the datasets and evaluate their performance
5 Build the validation set from the real bank credit portfolio to be classified
6 Apply the trained algorithms to the validation set and compare the outcomes
Fig. 4 Example of random forest decision trees
Random forest consists of a large number of decision trees that operate as an ensemble. Each individual tree in the random forest is made of successive splits of the sample into two leaves, according to a single exogenous variable exceeding or not a threshold.
The quality of each split is measured at the node by an impurity function, such as entropy, as presented in Eq. (3); the reduction in impurity achieved by a split is its information gain.
$$\mathrm{Entropy} = -p \log_2(p) - q \log_2(q) \tag{3}$$
where p and q are the probabilities of success and failure, respectively, in each node. Entropy measures the degree of disorganization in a system, hence the amount of information necessary to describe it. The split chosen at a node is the one that yields the lowest entropy relative to the parent node. By the end of the process, each tree produces a class prediction, which counts as one vote. The most voted class is the model prediction (Figs. 3 and 4).
However, before applying a machine learning algorithm, one must train it on a dataset with known outcomes, namely, a labeled training set. Therefore, to turn a machine learning algorithm into a classification device, the steps presented in Table 2 should be followed.
The first step is to choose the exogenous variables that make up the datasets. Instead of adding every piece of information about borrowers available in CBB's databases, which would lead to a matrix of features with hundreds of variables and a computationally expensive process, we opted for a parsimonious approach. We applied the experience of years of on-site supervision to choose the variables that best inform about the risk quality of a borrower. In other words, we developed 26 proxies that reflect on-site experience in classifying good and bad borrowers. The proxies reflect ex-post information related to the borrower's past credit risk behavior because on-site supervision focuses on incurred loss, instead of expected loss (Table 9 in Appendix describes the matrix of features employed in the study and Table 10 in Appendix presents the descriptive statistics for the datasets used).
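To make the entropy criterion of Eq. (3) and the voting rule of the forest concrete, the following minimal sketch computes the entropy of a node, the information gain of a single threshold split, and the majority vote across trees. It is an illustration only, not the scikit-learn implementation used in the study; the function names and the toy feature (days past due) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Binary entropy of Eq. (3); labels are 0 (good) or 1 (bad)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n   # share of bad borrowers in the node
    q = 1.0 - p           # share of good borrowers in the node
    # By convention, a term with probability 0 contributes 0.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one feature at a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


def majority_vote(tree_predictions):
    """Class predicted by most trees (first seen wins a tie)."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For instance, splitting four toy borrowers with 10, 20, 95, and 120 days past due (labels 0, 0, 1, 1) at a threshold of 90 separates the classes perfectly, so the children have zero entropy and the gain equals the parent entropy of one bit.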
The next stage is to choose the labeled borrowers whose data will form the datasets. The labeled borrowers are the endogenous variable of the datasets. In particular, "1" is assigned to bad borrowers and "0" is assigned to good borrowers. The first set of borrowers comprises 6,581 samples of good and bad borrowers (5,483 good and 1,098 bad) derived from 12 previous on-site credit portfolio examinations conducted by CBB's banking supervision from 2015 to 2018. Table 3 presents the parameters used to select borrowers and the criteria used to label them as good or bad.
The second set of borrowers gathers 1,005,653 samples of good and bad borrowers (956,016 good and 49,592 bad) obtained from banks' experience and extracted from credit risk information banks file monthly in CBB's repositories.
As for the dataset built from banks' experience, Table 4 presents the parameters used to select borrowers and the criteria used to label them as good or bad.
After these preliminary steps, datasets are gathered through the selection of the variables that constitute the matrix of features for each labeled borrower. In other words, the datasets are the merging of the endogenous variable and the exogenous variables. These datasets are used to train the algorithm.
The training procedure applies the machine learning algorithm to the datasets. This allows the algorithm to combine the matrix of features (26 fields of information about each borrower) and the labels to learn a general classification rule that predicts labels in any other out-of-sample dataset. When running the training, each dataset is split into two subsets: the training set and the test set. Following a usual rule of thumb, we assigned 70% of the dataset to the training set and the remaining 30% to the test set. We applied the random forest (RF) algorithm to the datasets (Table 11 in Appendix details the settings used to tune the algorithm). The algorithm is coded in Python and run with the scikit-learn package. The algorithm was subject to regularization procedures and K-fold cross-validation. The datasets are quite homogeneous, and thus, not much difference is detected between the training and test sets. Nevertheless, regardless of the results obtained in the training phase, trained models are not guaranteed to perform properly on an out-of-sample dataset.
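The 70/30 split and the K-fold procedure mentioned above can be sketched with the standard library alone. This is a simplified illustration, not the study's code: the seed and the number of folds (five) are assumptions, since the actual settings are reported in Table 11, and scikit-learn offers equivalent utilities (`train_test_split`, `KFold`).

```python
import random


def train_test_split_indices(n_samples, test_share=0.30, seed=0):
    """Shuffle sample indices and hold out `test_share` of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative choice
    n_test = round(n_samples * test_share)
    return indices[n_test:], indices[:n_test]  # (training, test)


def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for K-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# Size of the dataset built from past on-site examinations.
train_idx, test_idx = train_test_split_indices(6581)
cv_folds = list(k_fold_indices(train_idx, k=5))
```

Each fold serves once as the validation subset while the remaining folds train the model, so every training observation is used for validation exactly once.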
Having trained two models to classify good and bad borrowers, according to the supervisory approaches under comparison, the next step is to apply them to the real bank dataset and compare performances. The real bank dataset comprises 1,338 borrowers, with a minimum amount due of R$ 10,000. 4 To establish a common ground truth against which the performance of both supervisory approaches can be assessed, the other front of analysis involves the mapping of good and bad borrowers in the real bank dataset through an on-site credit examination. After excluding all borrowers rated as "H" by the bank, that is, 100% provisioned, the on-site examination concluded that 1,279 borrowers were good ("0") and 59 borrowers were bad ("1") (Table 5).
The assumption that the results of the on-site examination are the correct classification, that is, the ground truth, is central to the analysis. For that reason, one could reasonably argue that this procedure biases the results toward the on-site supervision approach. To counter this argument, we posit that on-site supervision focuses on incurred losses, which requires less judgmental analysis, and follows objective regulatory rules, which leave little room for discretionary decisions, thus mitigating this issue.
The criteria used by on-site credit supervision to identify bad loans are rather straightforward and clarify the frontier between objectivity and discretion. A credit that is over 90 days past due (rating "E" or worse) is considered a bad credit; hence, it is labeled as "1," and "0" otherwise. However, it is common to find evergreened credits, that is, credits artificially kept under the 90-day past-due threshold through successive rollovers. Another practice used to evergreen credits is to distribute the expected cash flow asymmetrically. In other words, small installments, smaller than the interest accrued, are concentrated at the beginning of the credit cash flow, while principal and the remaining interest are placed far in the future. That makes the credit easy to pay, though artificially. In both cases, the effect of these practices is disregarded, and borrowers are considered bad and thus labeled as "1." Having mapped the real bank credit portfolio, evaluating the supervisory approaches is just a matter of matching results.
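The labeling rule just described reduces to a short function. This is a sketch under assumptions: the `evergreened` flag stands for the examiner's judgment that rollovers or an asymmetric cash flow schedule keep the credit artificially current, and how that judgment is reached is not modeled here; the function and parameter names are hypothetical.

```python
def label_borrower(max_days_past_due, evergreened=False):
    """Return 1 for a bad borrower and 0 for a good one."""
    if evergreened:
        # The evergreening effect is disregarded: the borrower counts as bad.
        return 1
    # Over 90 days past due corresponds to rating "E" or worse.
    return 1 if max_days_past_due > 90 else 0
```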

Results analysis
If the comparison showed that the self-supervision approach built upon banks' experience outperformed CBB's on-site credit risk supervision, there would be a strong argument in favor of revising the scope assigned to on-site credit risk supervision. Therefore, the last step is to compare the performance of the on-site banking supervision and the hypothetical self-supervision approaches against the ground truth provided by the real bank on-site examination results.
As discussed before, the role assigned to on-site credit supervision is to detect bad borrowers classified as good ones and to quantify the consequent amount of insufficient incurred LLP. Therefore, picking up bad borrowers is central to the analysis, which makes type-2 errors, that is, classifying bad borrowers as good ones, much worse than type-1 errors, that is, classifying good borrowers as bad ones. Borrowers classified as good, even mistakenly, are not reviewed during an on-site credit examination. Thus, from the supervisory standpoint, minimizing type-2 errors is crucial, even at the cost of more type-1 errors, because the latter cases are reviewed and discarded during the examination.
Another aspect to highlight is that the distribution of good and bad borrowers in credit portfolios is heavily unbalanced, because the number of good borrowers is typically much greater than the number of bad borrowers. Accordingly, some measures used to assess the performance of machine learning algorithms may give a false sense of efficiency. Tables 6 and 7 present the confusion matrix for both supervisory approaches, while Table 8 presents the efficiency measures.
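A small calculation shows why imbalance can give a false sense of efficiency. The sketch below evaluates a hypothetical classifier that labels every borrower as good, using the ground-truth counts reported for the real bank portfolio (1,279 good and 59 bad); the classifier itself is an illustrative device, not one of the models in the study.

```python
N_GOOD, N_BAD = 1279, 59  # ground-truth counts from the on-site examination

# A trivial classifier that predicts "good" (0) for every borrower.
true_negatives = N_GOOD    # every good borrower is classified correctly
false_negatives = N_BAD    # every bad borrower is missed (type-2 error)
true_positives = 0

accuracy = (true_positives + true_negatives) / (N_GOOD + N_BAD)
recall = true_positives / N_BAD

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Accuracy is above 95% even though not a single bad borrower is detected, which is why recall and the type-2 error carry the weight in the comparison that follows.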
The confusion matrix is a performance measurement device for machine learning classification algorithms. It combines actual and predicted values to produce the elementary outcomes, which allow one to compute the efficiency metrics used to assess performance. This study consists of a binary classification problem; thus, four outcomes are derived from the confusion matrix, namely, True Positive (TP), True Negative (TN), False-Positive (FP, also type-1 error), and False-Negative (FN, also type-2 error). In the Appendix, Table 12 provides details on the confusion matrix specifics, whereas Table 13 describes the efficiency metrics.
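The four outcomes can be counted directly from the actual and predicted labels. The sketch below assumes the paper's coding, in which "1" (bad borrower) is the positive class; the function name and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN, with 1 (bad borrower) as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # bad borrower flagged as bad
        elif a == 0 and p == 0:
            counts["TN"] += 1   # good borrower classified as good
        elif a == 0 and p == 1:
            counts["FP"] += 1   # type-1 error: good borrower flagged as bad
        else:
            counts["FN"] += 1   # type-2 error: bad borrower classified as good
    return counts


# Toy example with six borrowers.
toy_counts = confusion_counts(actual=[1, 1, 0, 0, 0, 1],
                              predicted=[1, 0, 0, 1, 0, 1])
```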
From the confusion matrices, one can notice that the self-supervision approach classified fewer borrowers as bad (23) than the on-site supervision approach (70), which is positive from the efficiency standpoint, as it demands fewer work hours to examine the loan portfolio. However, this efficiency comes at a cost, because the narrower the sample, the harder it is to minimize type-2 errors. Of the 59 bad borrowers in the loan portfolio, 39 were correctly classified by the on-site supervision approach, resulting in a true positive rate of 0.66. As for the type-2 error, the approach failed to identify 20 of the 59 bad borrowers, leading to a false-negative rate of 0.33. The self-supervision approach, in turn, correctly classified only 17 bad borrowers, a true positive rate of 0.28, and failed to identify the remaining 42, a false-negative rate of 0.71.
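The rates quoted above follow directly from the counts in the paragraph. A minimal sketch, in which the helper name `recall_and_miss_rate` is hypothetical:

```python
def recall_and_miss_rate(tp, fn):
    """Return (true positive rate, false-negative rate) for the bad class."""
    total_bad = tp + fn
    return tp / total_bad, fn / total_bad

# Counts taken from the text: 59 bad borrowers in the examined portfolio.
onsite_tpr, onsite_fnr = recall_and_miss_rate(tp=39, fn=20)
self_tpr, self_fnr = recall_and_miss_rate(tp=17, fn=42)

print(f"on-site: TPR = {onsite_tpr:.3f}, FNR = {onsite_fnr:.3f}")
print(f"self:    TPR = {self_tpr:.3f}, FNR = {self_fnr:.3f}")
```

The exact ratios are approximately 0.661 and 0.339 for on-site supervision and 0.288 and 0.712 for self-supervision, so the two-decimal figures in the text appear to be truncated rather than rounded.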
For the sake of completeness, Table 8 also displays other performance measures. However, due to the specificities of the borrower classification problem addressed in the study, the information they convey is minor. The number of bad borrowers is small relative to the portfolio; thus, the false-positive rate is of little relevance. Similarly, the heavily unbalanced distribution of good and bad borrowers in loan portfolios makes accuracy a fragile indicator. Precision focuses on the share of borrowers classified as bad that are indeed bad and does not consider false negatives, which, as noted before, are crucial for credit supervision. Therefore, although the self-supervision approach presents a higher precision, this has no significant meaning.
The F1 score combines precision and recall (true positive rate) in the same measure. Hence, it informs how precise the classifier is, as well as how robust it is. The greater the F1 score, the better the performance. Although the on-site supervision approach is less precise because it produces more false positives, it presents a much better recall than the self-supervision approach, as its number of false negatives is smaller. Consequently, the on-site supervision approach shows a better F1 score, that is, a better performance, than the self-supervision approach.
Figure 5 provides ROC curves and AUC measures for both approaches. Once more, the heavily unbalanced dataset compromises the explanatory power of this classical measure. Nonetheless, the performance of CBB's supervisory approach is clearly superior.
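The trade-off between precision and recall can be sketched from the counts given in the text. The false-positive counts below are derived here as the number of borrowers each approach classified as bad minus its true positives (70 - 39 and 23 - 17); this derivation is an inference, and the resulting values are not copied from Table 8, so they may differ from the reported ones in the last digit.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean of the two) for the bad class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# FP is inferred as (borrowers flagged as bad) - (true positives).
onsite = precision_recall_f1(tp=39, fp=70 - 39, fn=20)
self_sup = precision_recall_f1(tp=17, fp=23 - 17, fn=42)

for name, (p, r, f1) in (("on-site", onsite), ("self", self_sup)):
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

Under these derived counts, the self-supervision approach has the higher precision but the lower F1 score, because the harmonic mean penalizes its low recall.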
In brief, the overall efficiency of CBB's supervisory approach is higher, and the number of bad borrowers missed by the self-supervision approach, the type-2 error, is roughly twice that of CBB's approach (42 versus 20). Apart from moral hazard issues, which are outside the scope of this analysis, the results of the self-supervision approach could be worse in the absence of CBB's on-site supervision, because the ratings "F," "G," and "H" that banks assign to their credit portfolios are sometimes imposed by CBB's supervision.

Concluding remarks
This study investigates the need for credit supervision as conducted by the CBB. To the best of our knowledge, based on the reviewed literature, this study is the first to use a real on-site bank credit examination to compare the performance of supervisory approaches. In particular, two machine learning classification models are employed: the first is based on past on-site loan portfolio examinations conducted by CBB's banking supervision, and the second is based on good and bad ratings reported by banks.
The overall efficiency of CBB's supervisory approach is higher, and the number of bad borrowers missed by the self-supervision approach, the type-2 error, is roughly twice that of CBB's approach. The on-site supervision approach identifies 39 of 59 bad borrowers, which corresponds to a true positive rate of 0.66. Meanwhile, the self-supervision approach catches 17 of 59 bad borrowers, a true positive rate of 0.28. From the type-2 error standpoint, the on-site supervision approach failed to identify 20 bad borrowers, a false-negative rate of 0.33, whereas the self-supervision approach failed to identify 42 bad borrowers, a false-negative rate of 0.71.
The consistently higher performance of CBB's supervisory approach compared with the self-supervision approach implies the necessity of on-site credit portfolio examination, as conducted by CBB. Seen from the opposite perspective, the self-supervision approach fails to report bad credit risk in accordance with the regulation.
The contribution of this study is threefold. First, it innovates by comparing the performance of on-site supervision with that of self-supervision against a common benchmark, represented by a real on-site bank credit examination. Second, regarding the methodology, it uses recently available machine learning algorithms to develop classification models based on on-site credit supervision experience and on banks' experience in order to establish the comparison. Third, it asserts the necessity of on-site credit supervision conducted by an independent external agent, such as the Central Bank.