A high-dimensionality-trait-driven learning paradigm for high dimensional credit classification

To address the high-dimensionality issue and improve accuracy in credit risk assessment, a high-dimensionality-trait-driven learning paradigm is proposed for feature extraction and classifier selection. The proposed paradigm consists of three main stages: categorization of high-dimensional data, high-dimensionality-trait-driven feature extraction, and high-dimensionality-trait-driven classifier selection. In the first stage, according to the definition of high dimensionality and the relationship between sample size and feature dimensions, the high-dimensionality traits of credit datasets are categorized into two types: 100 < feature dimensions < sample size, and feature dimensions ≥ sample size. In the second stage, several typical feature extraction methods are tested on the two categories of high dimensionality. In the final stage, four types of classifiers are applied to evaluate credit risk under the different high-dimensionality traits. For illustration and verification, credit classification experiments are performed on two publicly available credit risk datasets. The results show that the proposed high-dimensionality-trait-driven learning paradigm for feature extraction and classifier selection is effective in handling high-dimensional credit classification and improves classification accuracy relative to the benchmark models in this study.

Credit datasets exhibit an increasing number of problematic traits (Yu and Liu 2003), such as data noise (Yu et al. 2020a), missing data (Yu et al. 2020b), data imbalance (Yu et al. 2018), and small samples (Yu and Zhang 2021). As feature dimensions increase, more redundant features appear. The sparsity of data and the computational complexity caused by too many features are known as the curse of dimensionality. This problem becomes particularly important in credit risk classification: it increases not only the cost of credit classification but also the calculation time, exponentially, and classification accuracy therefore declines. Many traditional data mining algorithms fail or become less effective when applied directly to high-dimensional data. Credit risk classification with high-dimensional features has thus become a challenging task. Although much research progress has been made over the past decades, many problems and challenges in this field remain unsolved.
For a high-dimensional credit dataset, reducing the data dimension is an essential step in credit classification, and feature extraction is one of the main ways to do so. Latent semantic analysis (LSA) (Deerwester et al. 2010) is one of the earliest feature extraction methods; its main idea is to transform the feature values with singular value decomposition (SVD), change the spatial relationships of the original features, and combine them into new variables through an analysis of the relationships between features. Feature extraction methods fall into two groups: linear and nonlinear. One of the most prominent linear dimensionality reduction methods is principal component analysis (PCA) (Kambhatla and Leen 1997), followed by locality preserving projections (LPP) (He 2003), neighborhood preserving embedding (NPE) (He et al. 2005), and linear discriminant analysis (LDA) (Fisher 1936). However, LDA can reduce the dimension to at most (n − 1) for an n-class classification problem (Fisher 1936), so it cannot serve as an effective feature extraction strategy for the binary credit classification problem in this paper. Furthermore, LPP focuses on the local structure of the data without sufficiently considering its global structure (He 2003). Therefore, PCA is selected as the linear feature extraction method in this study.
Nonlinear feature extraction methods mainly include kernel-based methods and manifold learning, such as kernel PCA (KPCA) (Zhang 2009), locally linear embedding (LLE) (Roweis and Saul 2000), and isometric mapping (ISOMAP) (Tenenbaum et al. 2000). Nonlinear feature extraction has certain advantages in reducing the dimensionality of the original data, but in practical applications it faces several problems. For example, kernel-based methods must perform a nonlinear transformation for each sample, which imposes a heavy computational burden and can itself lead to a dimensionality problem. Meanwhile, manifold-learning-based methods require dense sampling (Roweis and Saul 2000), which is a major hurdle in high-dimensional settings. As a result, the practical performance of nonlinear dimension reduction is often worse than expected.
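For concreteness, the three nonlinear extraction methods named above can be run with scikit-learn; this is only an illustrative sketch on synthetic data, and the component and neighbor counts are arbitrary choices for demonstration:

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # synthetic stand-in: 200 samples, 30 features

# kernel-based nonlinear extraction (one nonlinear map per sample)
X_kpca = KernelPCA(n_components=5, kernel="rbf").fit_transform(X)

# manifold-learning-based nonlinear extraction (needs dense sampling)
X_iso = Isomap(n_components=5, n_neighbors=15).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_components=5, n_neighbors=15).fit_transform(X)
```

Each call maps the 30-dimensional samples into a 5-dimensional representation; the per-sample kernel evaluations in KPCA are the source of the computational burden mentioned above.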
After reducing the feature dimension through feature extraction, the next step is to classify the data samples with classification algorithms. The most popular credit scoring methods include expert systems, statistical and econometric models, artificial intelligence (AI) techniques, and their hybrid forms (Yu et al. 2015). At present, the most commonly used statistical and econometric methods include linear discriminant analysis (LDA) (Blei et al. 2003), logistic regression (LogR) (Grablowsky and Talley 1981; Feder and Just 1977), the wavelet method (Mabrouk 2020), the mathematical programming model (MPM) (Mangasarian 1965), and the k-nearest neighbor (KNN) algorithm (Henley and Hand 1996). With the rapid development of AI and machine learning (Kou et al. 2019; Pabuçcu et al. 2020), AI-based algorithms have proved more effective in credit risk classification than traditional methods; hence, more and more scholars apply them. The main AI algorithms include artificial neural networks (ANN) (Odom and Sharda 1990; Tam and Kiang 1992; Donskoy 2019), support vector machines (SVM) (Cortes and Vapnik 1995; Yu et al. 2020c), decision trees (DT) (Waheed et al. 2006; Rutkowski et al. 2014), and extreme learning machines (ELM) (Xin et al. 2014; Nayak and Misra 2020). These single classifiers can be divided into two types, linear and nonlinear. In addition to single models, hybrid and ensemble classifiers, such as bagging and neural network ensembles, form another substantial class of credit risk classification and prediction methods (Yu et al. 2008; Yu et al. 2010; Wang and Ma 2010; Song and Wang 2019).
In summary, numerous dimensionality reduction methods and classification models have been applied to high-dimensional credit risk classification, but experimental results are also affected by the data itself, and the same model may perform differently on data with different traits. However, little attention has been paid to the relationship between data traits and model selection. Nelson and Plosser (1982) found that traditional econometric models can produce spurious regressions when the data are non-stationary. Tang et al. (2013, 2014) proposed a novel data-characteristic-driven modeling methodology for nuclear energy consumption forecasting and showed that its performance is clearly superior. The data-trait-driven modeling methodology confirms that an effective model must match the data traits of the research sample. For this purpose, this paper proposes a high-dimensionality-trait-driven learning paradigm for feature extraction and classifier selection in high-dimensional credit classification.
The main objective is to provide and select different feature extraction strategies and classifiers for the two categories of high dimensionality and to establish the connection between high-dimensional data traits and feature extraction and classifier selection in credit risk classification. The rest of this paper is organized as follows: "Methodology formulation" section describes the proposed learning paradigm in detail. To verify and compare the validity of the proposed model, two real-world credit datasets are used, and the experimental design is presented in "Data descriptions and experimental design" section. "Results and discussion" section reports the results and further discussion. Finally, "Summary and discussion" section concludes the paper.

Methodology formulation
In this section, a high-dimensionality-trait-driven learning paradigm for feature extraction and classifier selection is proposed for the high-dimensional credit risk classification problem. In particular, the main purpose is to select the most suitable feature extraction method and classifier considering the two categories of high dimensionality, and to establish the connection between the trait of high dimensionality and feature extraction and classifier selection in credit risk classification. The general framework of the proposed high-dimensionality-trait-driven learning paradigm is shown in Fig. 1.
As can be seen from Fig. 1, the proposed paradigm includes three main stages: categorization of high dimensionality, high-dimensionality-trait-driven feature extraction, and high-dimensionality-trait-driven classifier selection. When 100 < feature dimensions < sample size (Chandrashekar and Sahin 2014), PCA is selected as the feature extraction strategy, and a single linear classifier is selected as the classification model. When feature dimensions ≥ sample size (Mwangi et al. 2014; Hua et al. 2009), non-feature extraction is selected as the feature extraction strategy, and a linear ensemble classifier is selected as the classification model. The detailed descriptions and related methodology of the three stages are given in the "Categorization of high dimensionality" through "High-dimensionality-trait-driven classifier selection" sections below.

Categorization of high dimensionality
In the existing literature, there are two different definitions of high dimensionality. On the one hand, high dimensionality means that the number of attribute features is larger than the sample size (Mwangi et al. 2014; Hua et al. 2009). For example, Bai and Li (2012) state that a sample whose number of attribute features is equal to or greater than its sample size can be called high-dimensional. On the other hand, some studies have found that, regardless of the number of samples, the number of attribute features significantly affects classifier performance. For example, Chandrashekar and Sahin (2014) reported that hundreds of variables can lead to high dimensionality and even to the curse of dimensionality. Based on these definitions and the quantitative relationship between the number of attribute features and the number of samples, the high-dimensionality traits of a credit dataset are categorized into two types, 100 < feature dimensions < sample size (Chandrashekar and Sahin 2014) and feature dimensions ≥ sample size (Mwangi et al. 2014; Hua et al. 2009), for the convenience of analysis and computation. Following the high-dimensionality-trait-driven idea, feature extraction strategy and classifier selection are carried out under these two categories of high-dimensional conditions, as elaborated below.
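For concreteness, the two-way categorization above can be expressed as a small helper function; this is a sketch, and the returned labels are illustrative names rather than terminology from the cited literature:

```python
def categorize_high_dimensionality(n_features, n_samples):
    """Assign a dataset to one of the paper's two high-dimensionality
    categories; return None if the data is not high-dimensional."""
    if n_features >= n_samples:
        return "dims >= samples"       # e.g. 769 features, 700 samples
    if 100 < n_features < n_samples:
        return "100 < dims < samples"  # e.g. 769 features, 10,550 samples
    return None
```

The dimension check is ordered so that the "feature dimensions ≥ sample size" case takes precedence, mirroring the categorization used throughout the paper.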

High-dimensionality-trait-driven feature extraction strategy selection
The concept of the curse of dimensionality was proposed by Bellman in 1961 and was later used to refer to various high-dimensionality problems in data analysis caused by an excessive number of features. To overcome the problems caused by the dimensionality disaster in credit risk classification, feature extraction is one of the effective dimensionality reduction approaches. Based on the idea of high-dimensionality-trait-driven modeling, this paper selects different feature extraction strategies according to the category of high dimensionality. In particular, we focus on the selection of the feature extraction method when 100 < feature dimensions < sample size and when feature dimensions ≥ sample size. In these circumstances, linear feature extraction, nonlinear feature extraction, and non-feature extraction are the three typical strategies, as illustrated below. In particular, non-feature extraction means that classification is performed directly, without dimensionality reduction.

Feature extraction strategy selection for 100 < feature dimensions < sample size
When 100 < feature dimensions < sample size (Chandrashekar and Sahin 2014), there are many attribute features, and the number of samples is relatively large; hence, a large amount of redundant information may be easily produced. In order to reduce the impact of redundant features on the classification performance and the calculation cost caused by data size, the dimension reduction is performed on the high-dimensionality dataset. This dimension reduction includes both nonlinear and linear feature extractions.
Nonlinear feature extraction mainly includes kernel-based feature extraction, such as kernel principal component analysis (KPCA), and manifold-learning-based feature extraction, such as isometric mapping (ISOMAP) and locally linear embedding (LLE). In this category, the data sample size is relatively large for the high-dimensional trait. Using a nonlinear feature extraction method to reduce the dimension has certain theoretical advantages, but in practical applications, especially when the sample size is large, it produces a huge calculation burden (Li and Lu 1999). Moreover, the performance of nonlinear dimension reduction is greatly influenced by noise in the data and is often worse than expected (Geng et al. 2005).
Therefore, the linear feature extraction method is chosen for dimensionality reduction under the condition "100 < feature dimensions < sample size" in order to avoid the computational time and complexity of the nonlinear feature extraction methods. Meanwhile, the linear relationships among the attribute features are strong when the datasets are classified directly, which supports the use of a linear feature extraction strategy (Rosenblatt 1988). Usually, linear feature extraction refers to constructing a linear dimension reduction mapping to obtain a low-dimensional representation of the high-dimensional data. This type of method is suitable not only for linear structures but also for high-dimensional traits with more samples, such as 100 < feature dimensions < sample size. Some conventional linear feature extraction methods include principal component analysis (PCA), linear discriminant analysis (LDA), and locality preserving projections (LPP). It is worth noting that LDA can reduce the dimension to at most (n − 1) for an n-class classification problem (Fisher 1936), so it cannot serve as an effective feature extraction strategy for the binary credit classification problem in this study. Furthermore, LPP focuses on the local structure of the data without sufficiently considering its global structure (He 2003). Therefore, when 100 < feature dimensions < sample size, PCA is selected as the high-dimensionality-trait-driven feature extraction strategy, according to the trait of high dimensionality. The details of PCA can be found in other papers, such as Kambhatla and Leen's (1997).
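For intuition, PCA's "linear dimension reduction mapping" can be written in a few lines via SVD; this is a minimal sketch on synthetic data, not the implementation used in the experiments:

```python
import numpy as np

def pca_project(X, k):
    """Minimal PCA: center the data, then project onto the top-k right
    singular vectors, i.e. a purely linear map to k dimensions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # 50 samples, 8 features
Z = pca_project(X, 3)          # low-dimensional representation
```

Because the projection directions are orthonormal, the extracted components are mutually uncorrelated, which is exactly why PCA removes the redundancy among linearly related features.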

Feature extraction strategy selection for feature dimensions ≥ sample size
When feature dimensions ≥ sample size (Mwangi et al. 2014; Hua et al. 2009), the number of samples is small relative to the number of features. Dimensionality reduction would further compress the amount of data, which may leave insufficient information for classification and harm subsequent classification performance (Li et al. 2011). In this case, feature extraction also offers no obvious advantage in reducing computational complexity or saving computational time. Therefore, when feature dimensions ≥ sample size, non-feature extraction is selected as the high-dimensionality-trait-driven strategy, and classification is carried out directly, without dimensionality reduction. This strategy is typically used for small-scale datasets with high-dimensionality traits, where the feature dimension reaches or exceeds the sample size. In the experimental analysis, this study selects different feature extraction strategies for the different high-dimensional traits: PCA is used when 100 < feature dimensions < sample size, and non-feature extraction is used when feature dimensions ≥ sample size, as illustrated in Fig. 1.

High-dimensionality-trait-driven classifier selection
In order to obtain good classification performance, different classifiers will be used. In this paper, single linear classifiers, single nonlinear classifiers, and their corresponding linear or nonlinear ensemble classifiers are used according to the high-dimensionality traits. In particular, a single classifier refers to a single classification model rather than an integration of multiple classifiers.

Classifier selection when 100 < feature dimensions < sample size
As mentioned in the previous section, when 100 < feature dimensions < sample size (Chandrashekar and Sahin 2014), a typical linear feature extraction strategy, PCA, is used for dimension reduction. In this case, 12 classifiers are utilized, and the experimental results show that the linear classifier performs the best. The main reasons are two-fold: On the one hand, there are strong linear relationships among the attribute features in the testing dataset, so the linear classifier can fit these linear features well. On the other hand, under the condition "100 < feature dimensions < sample size", the number of samples can be relatively large, resulting in a large data scale, and the linear classifier greatly reduces the computational complexity. Therefore, when 100 < feature dimensions < sample size, PCA is selected as the feature extraction strategy, and a single linear classifier is selected as the classification model, according to the high-dimensionality trait.
Usually, a linear classifier is a classification model that separates positive and negative samples with a hyperplane. The single linear classifiers used in this study include LDA (Blei et al. 2003) and LogR (Grablowsky and Talley 1981).

Classifier selection when feature dimensions ≥ sample size
As mentioned earlier, when feature dimensions ≥ sample size (Mwangi et al. 2014; Hua et al. 2009), the non-feature extraction strategy is selected. In this case, the number of samples is relatively small, and the classification performance of a single classifier is unstable and prone to errors. Therefore, an ensemble classifier is chosen to reduce the fluctuation error of the single linear classifier. Meanwhile, because of the strong linear relationships among the data features, the linear ensemble classifier is selected as the generic classification model under high-dimensional features.
The linear ensemble classifier used in this paper is obtained by integrating single linear classifiers with bootstrap aggregating (bagging) and majority voting. The bagging algorithm (Breiman 1996) draws m bootstrap subsets from the training set uniformly, trains a classification (or regression) model on each of the m subsets to obtain m single models, and then combines their outputs by majority voting. The main reason for selecting bagging as the ensemble method is that it has lower computational complexity than other ensemble methods, such as boosting.
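A sketch of such a linear bagging ensemble can be built with scikit-learn's `BaggingClassifier` around logistic regression, using three base classifiers as in the later experiments; the synthetic dataset and all other settings here are illustrative assumptions, not the paper's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic credit-like data: 300 samples, 20 features, binary labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# bagging: 3 bootstrap subsets -> 3 LogR base models -> majority vote
ensemble = BaggingClassifier(
    LogisticRegression(max_iter=1000),  # linear base classifier
    n_estimators=3,
    random_state=0,
).fit(X_tr, y_tr)

acc = ensemble.score(X_te, y_te)  # test-set accuracy of the majority vote
```

Each base model sees a different bootstrap resample of the training set, so the vote averages away the fluctuation error a single linear classifier would show on small samples.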

Data descriptions and evaluation criteria
In this section, two real-world credit datasets, Kaggle loan default prediction dataset (Kreienkamp and Kateshov 2014) and China Unionpay credit dataset (Liu et al. 2019), are used to test the effectiveness of the proposed high-dimensionality-trait-driven learning paradigm. The specific descriptions of the two datasets are shown below.

Kaggle loan default prediction dataset
This publicly available credit dataset is obtained from Kaggle's website (https://www.kaggle.com/c/loan-default-prediction) and consists of 105,471 samples with 769 attributes. Several preprocessing steps were applied to the original dataset. First, data cleaning was performed: samples with missing values were deleted directly. Second, classification labels were transformed. After deleting the samples with missing values, the default loss value of each sample was converted into a binary label indicating whether or not the loan defaulted, represented by 0 or 1. Third, the class imbalance was addressed. After label transformation, the ratio of non-default samples (good samples) to default samples (bad samples) was about 10:1, so the dataset is highly imbalanced. To reduce the influence of this imbalance on the experimental results, the dataset was undersampled based on clustering results (Chao et al. 2020): the good samples were grouped into 10 clusters, each cluster was undersampled at a specified sampling rate, and the resulting 5275 good samples were combined with the same number of bad samples to form a new analytical dataset with an imbalance rate of 1:1.
After that, the new dataset is composed of 10,550 samples with 769 features, and it meets the high-dimensionality condition because 100 < feature dimensions (769) < sample size (10,550). To cover the second high-dimensionality category as well, 700 samples are randomly selected from the preprocessed dataset to construct a dataset satisfying the condition "feature dimensions (769) ≥ sample size (700)".
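The cluster-based undersampling step described above might be sketched as follows; the function name and defaults are hypothetical, and this is only an approximation of the procedure of Chao et al. (2020), demonstrated here on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_good, n_clusters=10, n_keep=5275, seed=0):
    """Cluster the majority (good) samples, then draw from each cluster
    at a common sampling rate so the kept subset preserves the majority
    class's internal structure. Returns the kept row indices."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_good)
    rng = np.random.default_rng(seed)
    rate = n_keep / len(X_good)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(len(members), max(1, round(rate * len(members))))
        keep.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(keep[:n_keep])

# demo on synthetic "good" samples
rng = np.random.default_rng(3)
X_good = rng.normal(size=(500, 5))
keep_idx = cluster_undersample(X_good, n_clusters=5, n_keep=100, seed=3)
```

Sampling within clusters, rather than uniformly at random, keeps every region of the good-sample distribution represented in the balanced dataset.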

China Unionpay credit dataset
The China Unionpay credit dataset is obtained from the data competition created by China Unionpay (https://open.chinaums.com/#/intro). This dataset is a binary classification problem, which divides 11,017 observations into two classes: good credits (8873 observations) and bad credits (2144 observations). The dataset describes the observations on 199 feature attributes, including six major dimensions: identity information and property status, cardholder information, trading information, loan information, repayment information, and loan application information.
Similar to the Kaggle dataset, preprocessing steps are conducted. Missing values are first imputed with the mean of the corresponding feature. Then, 2144 samples are randomly selected from the good credit samples by random undersampling. Finally, all bad credit samples are combined with the 2144 randomly selected good samples to form an analytical dataset with an imbalance rate of 1:1.
After processing, the new analytical dataset is composed of 4288 samples with 199 features, and it meets the high-dimensionality condition since 100 < feature dimensions (199) < sample size (4288). To meet the other high-dimensionality condition, 199 samples are selected from the preprocessed dataset by random undersampling, so that "feature dimensions (199) ≥ sample size (199)" is satisfied.
To evaluate the performance of the analytical model, Total accuracy (Total for short), true positive accuracy (TP for short), true negative accuracy (TN for short), and area under curve (AUC for short) (Bradley 1997;Shen et al. 2020) are selected as evaluation criteria.
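The four criteria can be computed as below; `credit_metrics` is a hypothetical helper name, and class 1 is assumed here to denote the positive (default) class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def credit_metrics(y_true, y_pred, y_score):
    """Return (Total accuracy, TP rate, TN rate, AUC) as used in the paper.
    y_pred holds hard 0/1 predictions; y_score holds ranking scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = (y_true == y_pred).mean()             # overall accuracy
    tp = (y_pred[y_true == 1] == 1).mean()        # accuracy on positives
    tn = (y_pred[y_true == 0] == 0).mean()        # accuracy on negatives
    auc = roc_auc_score(y_true, y_score)          # area under ROC curve
    return float(total), float(tp), float(tn), float(auc)
```

Note that Total can look good on imbalanced data even when TP is poor, which is why the paper reports all four criteria together.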

Experimental design
In the experimental design, two main operations are conducted: feature extraction and classifier selection. For feature extraction strategy selection, PCA-based linear feature extraction was performed when 100 < feature dimensions < sample size, according to the high-dimensionality traits, with non-feature extraction serving as the benchmark for comparison. Similarly, non-feature extraction was carried out when feature dimensions ≥ sample size, with PCA feature extraction as the benchmark. The performance of 12 classifiers is compared to evaluate the effectiveness of the high-dimensionality-trait-driven feature extraction strategy selection. When PCA was used for dimension reduction, the principal components with a cumulative variance contribution rate of up to 98% were selected as the new attribute features after dimensionality reduction.
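The 98% cumulative-variance rule can be implemented as follows; the low-rank synthetic data is a stand-in for the real credit features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# synthetic data: 400 samples, 120 features, ~10 informative directions
X = rng.normal(size=(400, 10)) @ rng.normal(size=(10, 120))
X += 0.05 * rng.normal(size=X.shape)  # small noise

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.98)) + 1  # smallest k reaching 98%
X_new = PCA(n_components=k).fit_transform(X)  # reduced attribute features
```

`searchsorted` finds the first component index at which the cumulative variance contribution reaches the 98% threshold, so `k` is the number of principal components retained for classification.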
In the classifier selection, this study chose 12 classifiers: LDA, LogR, KNN, SVM, back propagation neural network (BPNN), classification and regression tree (CART), and the bagging ensemble classifiers corresponding to these 6 single classifiers. Among them, LDA and LogR are single linear classifiers, and their respective ensembles are linear ensemble classifiers. KNN, SVM, BPNN, and CART are single nonlinear classifiers, and their respective ensembles are nonlinear ensemble classifiers. When 100 < feature dimensions < sample size, a single linear classifier is used after PCA dimension reduction. When feature dimensions ≥ sample size, the linear ensemble classifier is used directly, without feature extraction, following the "Methodology formulation" section.
Regarding model specification, LDA uses the "diaglinear" discriminant function, and the tolerance of the LogR iteration termination condition is set to 0.7. The k value of KNN is set to 12. SVM uses an RBF kernel function with regularization parameter C = 12 and σ² = 2. The number of neurons in each layer of BPNN is set to 5, the transfer functions for the hidden layer and output layer are "logsig" and "purelin", respectively, and the training function is "traincgf". For the decision tree method, CART uses default parameters. When bagging is utilized to build the ensemble of base classifiers, the number of base classifiers is set to 3.
In addition, the datasets are divided into training and test sets at a 7:3 ratio. Due to the random initial conditions and the randomness generated by the training set, each model was run 10 times (Yu et al. 2008). The average values of Total, TP, TN, and AUC, together with the corresponding standard deviations, are reported as the results of these models. These results were then used to select suitable feature extraction methods and classifiers under the different high-dimensionality trait conditions, and to further verify the effectiveness of the proposed high-dimensionality-trait-driven learning paradigm.
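The repeated 7:3 evaluation protocol might be sketched as follows, with synthetic data and logistic regression standing in for a single linear classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=25, random_state=7)

accs = []
for run in range(10):  # 10 repeated runs, as in the paper
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=run)  # fresh 7:3 split per run
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accs.append(model.score(X_te, y_te))

mean_acc, std_acc = float(np.mean(accs)), float(np.std(accs))
```

Varying the split seed on each run reproduces the randomness of the training set, and the mean and standard deviation over runs are then reported as the model's result.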

Results and discussion
Experimental results when 100 < feature dimensions < sample size
According to the experimental design, the empirical results of the two datasets when 100 < feature dimensions < sample size are shown in Tables 1, 2, 3 and 4.
Tables 1 and 2 present the performance of the two datasets under the 12 classification algorithms after PCA feature extraction, for 100 < feature dimensions < sample size. To make the comparison more intuitive, the classification results without feature extraction are shown in brackets as the benchmark for comparison. In addition, results in bold font are the best in each table.
As seen from Tables 1 and 2, under the condition of 100 < feature dimensions < sample size, the high-dimensionality-trait-driven learning paradigm with PCA feature extraction achieves better classification performance than the non-feature extraction strategy on most of the 12 classifiers. This indicates the effectiveness of the high-dimensionality-trait-driven learning paradigm for feature extraction selection, for two reasons: On the one hand, in datasets with this high-dimensional trait, the data have strong linear relationships and high information redundancy, so PCA-based feature extraction can reduce noise and improve classification accuracy. On the other hand, PCA-based feature extraction also reduces computational complexity and saves computational time for a large sample size.
After demonstrating the importance of PCA-based feature extraction, the subsequent task is to illustrate the effectiveness of classifier selection within the proposed learning paradigm. According to the experimental results in Tables 3 and 4, when 100 < feature dimensions < sample size, the single linear classifier should be selected as the classification model because, on both credit datasets, single linear classifiers outperform the other classifiers considered in this paper.
To verify its effectiveness further, the single nonlinear classifier, linear ensemble classifier, and nonlinear ensemble classifier are compared as benchmark models. In the 10 experiments, the result of each type of classifier is the average over the base classifiers it contains. For example, the average of the classification results of LDA and LogR in each experiment is reported as the result of the single linear classifier. In each experiment, the performance of the four categories of classifiers under the four evaluation criteria can be compared, as reported in Tables 3 and 4. It should be noted that results in bold font indicate the best performance under the same evaluation criterion in each experiment. As can be seen from Tables 3 and 4, for both the Kaggle and the Unionpay datasets across the 10 experiments, the single linear classifier performs the best in terms of Total, TN, and AUC, while the nonlinear ensemble classifier performs the best in terms of TP. Furthermore, regarding averages and standard deviations, the PCA-based single linear classifier obtains the best results on the four evaluation criteria, indicating that the proposed high-dimensionality-trait-driven learning paradigm has strong robustness. The main reasons involve two aspects: First, when 100 < feature dimensions < sample size, PCA-based feature extraction reduces the impact of redundant features on classification performance as well as the calculation cost caused by the sample size. Second, the experimental results show that the datasets have strong linear relationships even without feature extraction, so the single linear classifier is effective.
Therefore, based on the four evaluation criteria, it can be concluded that when 100 < feature dimensions < sample size, the single linear classifier performs the best after PCA feature extraction, demonstrating the effectiveness of the proposed high-dimensionality-trait-driven learning paradigm. This indicates that feature extraction strategies and classifier selection should be carefully determined by the different traits of high dimensionality.

Experimental results when feature dimensions ≥ sample size
Similar to the "Experimental results when 100 < feature dimensions < sample size" section, this section reports the experimental results of the two datasets under the condition "feature dimensions ≥ sample size", following the framework of Fig. 1. The computational results are presented in Tables 5, 6, 7 and 8. In detail, Tables 5 and 6 show the performance of the two datasets under the 12 classification algorithms without feature extraction, when feature dimensions ≥ sample size. Similarly, the classification results with PCA feature extraction are shown in brackets as the benchmark for comparison. In addition, results in bold font are the best in each table.
As seen from Tables 5 and 6, when feature dimensions ≥ sample size, the high-dimensionality-trait-driven learning paradigm without feature extraction performs better than the PCA-based feature extraction strategy on most of the 12 classifiers, supporting the effectiveness of the high-dimensionality-trait-driven learning paradigm for feature extraction selection. There are two possible reasons: On the one hand, under the condition "feature dimensions ≥ sample size", there is only a small number of samples. If feature extraction is conducted, the samples cannot provide sufficient information for the classification task, which can degrade classification performance (Li et al. 2011). On the other hand, when the sample size is small, using a feature extraction method to reduce cost and improve calculation efficiency offers no clear advantage.

After establishing the importance of non-feature extraction when feature dimensions ≥ sample size, the subsequent task is to demonstrate the effectiveness of the classifier selection. According to the experimental results in Tables 5 and 6, when feature dimensions ≥ sample size, the linear ensemble classifier should be selected, because the performance of this type of classifier is superior on both credit datasets.
To verify its effectiveness further, the single linear classifier, single nonlinear classifier, and nonlinear ensemble classifier are selected as benchmark models for comparison purposes. In the 10 experiments, the result for each type of classifier is the average over the multiple base classifiers it contains. For each experiment, the performance of the four categories of classifiers under the four evaluation criteria is compared in Tables 7 and 8. It should be noted that the results in bold font are the best performance under the same evaluation criterion in every experiment. As seen from Tables 7 and 8, on both datasets the linear ensemble classifier performs better on the Total index, with excellent classification performance on the other three evaluation criteria as well. Overall, the linear ensemble classifier performs best compared to the other benchmark classifiers. Moreover, it has the lowest standard deviation for most evaluation criteria over the 10 experiments, indicating strong robustness. A likely explanation for this phenomenon is that bagging helps to reduce the errors caused by fluctuations in the training data: because the sample is small and the performance of a single classifier is unstable in this case, bagging can reduce the variance of the base classifier and improve generalization performance.

Therefore, based on the four evaluation criteria, it can be concluded that when feature dimensions ≥ sample size, the linear ensemble classifier without feature extraction performs best, demonstrating the effectiveness of the proposed high-dimensionality-trait-driven learning paradigm, as shown in Fig. 1. This also indicates that the feature extraction strategy and the classifier should be carefully selected according to the different traits of high dimensionality hidden in the credit dataset.

Summary and discussion
From the experimental results and analysis in the "Experimental results when 100 < feature dimensions < sample size" and "Experimental results when feature dimensions ≥ sample size" sections, several important findings and implications can be summarized. First, when 100 < feature dimensions < sample size, the high-dimensionality-trait-driven learning paradigm with the PCA feature extraction method achieves better classification performance than the non-feature extraction strategy, and the single linear classifier performs best after PCA processing, as shown in the "Experimental results when 100 < feature dimensions < sample size" section. Second, when feature dimensions ≥ sample size, direct use of the linear ensemble classifier, without feature extraction, achieves better classification performance, as shown in the "Experimental results when feature dimensions ≥ sample size" section.

Third, for the selection of a feature extraction strategy, PCA-based linear feature extraction is carried out when 100 < feature dimensions < sample size and non-feature extraction is conducted when feature dimensions ≥ sample size. This demonstrates the effectiveness of the proposed high-dimensionality-trait-driven learning paradigm, as shown in Fig. 1.
Fourth, for the classifier selection inspired by the proposed high-dimensionality-trait-driven learning paradigm, when 100 < feature dimensions < sample size, the single linear classifier is selected as the generic classification model; when feature dimensions ≥ sample size, the linear ensemble classifier is chosen.
Finally, the above analysis proves the effectiveness of the high-dimensionality-trait-driven learning paradigm for feature extraction and classifier selection. When 100 < feature dimensions < sample size, PCA is selected as the feature extraction strategy and the single linear classifier as the classification model. When feature dimensions ≥ sample size, non-feature extraction is selected as the feature extraction strategy and the linear ensemble classifier as the classification model.
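The overall selection rule summarized above can be written as a small decision function. The function name and the string labels it returns are illustrative assumptions; only the two dimensionality conditions come from the paradigm itself.

```python
# Minimal sketch of the trait-driven selection rule: choose the feature
# extraction strategy and classifier type from the relationship between
# feature dimensions and sample size. Names and labels are assumptions.
def select_strategy(n_features: int, n_samples: int):
    """Return (feature_extraction, classifier) for a credit dataset."""
    if 100 < n_features < n_samples:
        return ("PCA", "single linear classifier")
    if n_features >= n_samples:
        return ("none", "linear ensemble classifier")
    # Below 100 features the data is not treated as high-dimensional here.
    return ("none", "any standard classifier")

print(select_strategy(200, 1000))  # → ('PCA', 'single linear classifier')
print(select_strategy(500, 100))   # → ('none', 'linear ensemble classifier')
```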
Although the proposed high-dimensionality-trait-driven learning paradigm provides a reliable guideline for feature extraction and classifier selection in high-dimensional credit classification, the feature dimension categories depend on the sample data and lack strict mathematical reasoning and proof. This issue may limit the use of the proposed learning paradigm. In future research, more datasets should be used to verify its effectiveness.

Conclusions
To solve the high-dimensionality issue and improve accuracy in credit risk assessment, a high-dimensionality-trait-driven learning paradigm was proposed for feature extraction and classifier selection. For verification purposes, two credit datasets were used to test the classification capability and effectiveness of the proposed learning paradigm. The experimental results show that it can be effectively utilized to solve high-dimensionality issues in credit risk classification.
Moreover, the study provides important references for the selection of feature extraction methods and classifiers for datasets with different high-dimensionality traits, implying that the proposed high-dimensionality-trait-driven learning paradigm can serve as a promising credit risk assessment tool for data with high-dimensionality traits. In practical applications, the proposed paradigm can help financial institutions make suitable decisions and choose appropriate strategies when faced with different high-dimensionality traits. This can not only improve classification accuracy but also reduce possible economic losses for financial institutions, and thus has substantial practical significance.
In addition, directions for further improving the proposed learning paradigm are suggested. Regarding the selection of feature extraction methods, combinations of different methods can be explored. In terms of classifier training, popular optimization algorithms, such as particle swarm optimization (PSO), and powerful ensemble methods can be used to further improve the classification performance of the classifier. We plan to examine these issues in future work.