The method of residual-based bootstrap averaging of the forecast ensemble

This paper presents an optimization approach—residual-based bootstrap averaging (RBBA)—for different types of forecast ensembles. Unlike traditional residual-mean-square-error-based ensemble forecast averaging approaches, the RBBA method attempts to find optimal forecast weights in an ensemble and allows for their combination into the most effective additive forecast. In the RBBA method, all the different types of forecasts obtain the optimal weights for ensemble residuals that are statistically optimal in terms of the fitness function of the residuals. Empirical studies have been conducted to demonstrate why and how the RBBA method works. The experimental results based on the real-world time series of contemporary stock exchanges show that the RBBA method can produce ensemble forecasts with good generalization ability.


Introduction
The stock exchange data forecasting problem is very large and complex when traditional statistical methods are used alone. There are several examples of incorrect forecasts from the real world that are made by even the best specialists and forecasting business-consulting centers. Even the success of artificial neural networks (ANNs) and ANN ensembles in forecasting (Liu and Yao 1999) is insufficient to avoid typical ANN problems, such as too-deep-to-learn, too-wide-to-forget, or overfitting, whereby surplus memory may be mistakenly regarded as a good learning result. Different types of assets, as well as ample dependencies and external factors, make ANN-based stock exchange forecasting a very difficult task. In practice, it is possible to obtain reliable but not sufficiently accurate predictions via simple standard methods such as SARIMAX, single-layer neural networks, wavelets, exponential smoothing, asymptotic linear regression, and so on.
This paper deals exclusively with random processes that occur on the stock exchange and are in one way or another implicitly related to human behavior. The random processes under consideration differ from the high-frequency ergodic random processes occurring in nature.
It would be naive to believe that random stock exchange factors related mainly to human behavior fully correspond to the classical concepts of the statistics of Ivanyuk Financial Innovation (2023) 9:37 stochastic processes. If there were such a correspondence, most economists would have been able to unequivocally predict both the beginning of the Arab Spring and the crisis of 2008, as well as the beginning of the energy crisis of 2022, a few months before these events occurresd.
This problem primarily implies the adaptation of predictive methods based on the actual situation. Unfortunately, the currently available methods of expert forecasting (references to EIA forecasts from 2017 to 2021) as well as the classical methods of statistical forecasting proposed more than 100 years ago by Fienberg and Lazar (2001) are rather weak. The objective of this study is to develop a statistical evaluation method vis-à-vis the performance of a forecast that provides a more accurate result than the currently known predictive ensemble methods.
In this case, the major task is to combine forecasts such that the residuals exclusively contain the random walk component or express it most accurately (Hashem et al. 1994). In most cases, any optimal ensemble gives better forecasting results than individual methods (Kuncheva and Whitaker 2003). Meanwhile, it is almost generally accepted that an increase in the differences in extrapolation methods enhances the overall quality of forecasts (Gashler et al. 2008). Considering both traditional methods of ensemble optimization, such as the Bayes optimal classifier (Mitchell 1997), bootstrap aggregating (Salman et al. 2021), boosting (Emer 2018), and advanced ensemble methods based on statistical estimates of the forecast correlation (Liu and Yao 1999), the author reflects on a tendency that is common to all methods and consists of using the residual mean square error (RMSE) or its variations as a generally accepted geometric measure of forecast error. Paying attention to its properties, such as unidimensionality and asymmetry of the estimate, as a function of the area of the deviation bounds, the author has conducted a study that allows for the development of a multidimensional geometric evaluation function of the ensemble residuals that is easily applicable to bootstrap averaging as a penalty function and shows quite interesting results, especially for large samples.
Considering the works of contemporary authors, Kou et al. (2021) developed a model for predicting bankruptcies of small and medium-sized enterprises. This model uses transactional data and two-stage multi-objective feature selection. Wen et al. (2019) revealed a correlation between investor interest in an asset and asset quality. The better the asset, the higher the investor's interest in it. Conversely, if the investor's interest in the asset decreases, the asset soon collapses. Li et al. (2022) proposed a new clustering approach. They improved the multidimensional K-means algorithm, which allowed for better data clustering. This approach helps to reduce the number of clusters and obtain a more accurate image.
The remainder of this paper is organized as follows: Sect. "Multidimensional evaluation of residuals: idea and implementation" describes the idea and implementation of a multidimensional function of the geometric evaluation of the ensemble residual quality and the residual-based bootstrap averaging (RBBA) method itself. Section "Experimental verification of the RBBA method" describes the conditions and procedures for experimenting with a comparative analysis of the RBBA and traditional predictive ensemble methods. Section "The results of the experiment" presents the results of the real stock asset forecasting experiment, compares them with standard Bayesian optimization, AdaBoost, and bootstrap aggregating methods, and provides some discussions. Finally, Sect. "Discussion" concludes with a summary of the paper and a few remarks.

Computation age and artificial neural networks
It is usually assumed that RMSE minimization is a fairly good criterion for network training or prediction optimization, but the practical application of this method does not always present adequate results. Having agreed that the residuals of the approximation of the method should have minimum error, we nevertheless do not consider the basic statistical idea of the random component (Wetherill 1981). Often, an overfitted neural network or an idealized predictive method, being sufficiently trained, optimized, and showing good results on homogeneous, standardly distributed data, still turns out to be ineffective for practical application. With the stock market, this means that, in fact, the incoming data are not only heterogeneous but also have a purposeful tendency to deviate from statistically sound forecasts because a high-quality forecast that becomes a matter of common knowledge is actually a factor that minimizes the agent's profit. Thus, the use of standard methods of predictive error estimation forces the forecaster to deliberately expand the forecast corridor to avoid real estimates beyond theoretical boundaries. Of course, the least squares method, invented by Gauss more than 200 years ago (Stigler 1981), is currently one of the most commonly used optimization criteria for approximating dependencies. This method is extremely simple and requires minimal computing resources. However, it does not guarantee that it will separate the main functions and dependencies from a random component; it only finds its most plausible placement. As a rule, the evaluation of approximation residuals is more often used as a verification method rather than an optimizing one; traditionally, in the pre-computer and early computer era (primary and secondary information age), it was considered highly resource-intensive. Since 2008, the computing power of nonspecialized private systems has grown so much that we have entered a new computing era (tertiary information age) (Hilbert 2020). Modern computing resources have become widely available and extremely cheap over the past half-century, and the emergence of multi-core video processors (Corana 2015), method of massive error backpropagation (Goodfellow et al. 2016), and software libraries for the Levenberg-Marquardt algorithm (Transtrum and Sethna 2012) have allowed the widespread use of neural networks and computational ensembles as universal approximating and predictive mathematical tools. These three tools, mutually reinforcing one another, have created the conditions for a technological breakthrough in the field of processing and analyzing large amounts of data in small laboratories. The availability of supercomputing capacities has ceased to play a significant role because the computing power of conventional desktop computing systems has reached hundreds of teraflops.
Nearly every modern forecasting method uses the squared deviation or its modifications in one form or another, such as the fitness penalty function and the Levenberg-Marquardt algorithm, sometimes combined with stochastic functions, as a method of finding the optimum. Although the algorithm for finding a local or global error minimum solely affects the speed of achieving the result, the fitness penalty function is responsible for its quality. The idea of improving the quality of the forecast by replacing the optimization criterion appeared during the actual forecasting of stock data, when many forecasts of real data turned out to be of insufficient quality despite the low standard error and excellent approximation quality.

The idea of a new quality criterion of residuals
The main criteria for the quality of random residuals, regardless of whether they appeared as a result of multiplicative (Bessel distribution), additive (Gaussian distribution), or mixed effects of random factors, are the actual characteristics reflecting their association with a set of random processes. Such characteristics include, first, the mutual independence of residuals of the time series, the absence of autocorrelation, and plausibility expressed in proximity to the normal distribution (symmetry, pronounced modality, moderate excessivity). Such a synthetic criterion may be multidimensional and expressed geometrically, similar to the RMSE, in terms of volume or hypervolume corresponding to the deviation from the set of ideal values of statistical tests of residuals corresponding to zero values.

Mathematical implementation of the penalty function
The following criteria were selected as components of the residual quality measure: the Bienaymé turning point test (Kendall 1973), as a characteristic of the mutual dependence of time series elements; the Durbin-Watson statistic (Durbin and Watson 1950), as a characteristic of the presence of autocorrelation; and the Shapiro-Wilk test (Shapiro and Wilk 1965), as a measure of deviation from normality.

Modified turning point test
A test first described by Bienaymé (1874) and further considered in detail by Kendall (1973) characterizes the presence of connectivity of time series elements. In most of the literature (Brockwell and Davis 2002), as well as in the implementation of statistical software packages, such as R (the turningpoint.test function), counting turning points is a rather slow procedure because it uses two comparison operations and, accordingly, branches for each point. We have developed and first introduced an equation that allows it to speed up this procedure because it does not use comparisons and branches but is based on the properties of the modulus of a number and the sign function (1). This equation contains a single slow operation (division) performed at the end of the calculations and can be optimized for data flow processing. Typically, the total number of turning points of a series, calculated as (1), is estimated only by the left boundary of the criterion, checking for the presence of linear dependencies between the elements of the series. However, the right boundary also characterizes the presence of a connection, but unlike the left boundary, it is periodic. Thus, the necessary components normalized to the optimum corresponding to the zero value can be expressed as (2). (1) where respectively, T is the actual number of turning points, n is the sample size of the approximation residuals, sgn ( where respectively, D p is the penalty function component, n is the sample size of the approximation residuals, x t are the values of the time series of the approximation residuals, and t is the number of sample elements corresponding to the time points of the series.
Similar to the previous component, this modified criterion is normalized to one, and its optimum point corresponds to zero.

Modified Shapiro-Wilk test
This criterion which determines the plausibility of the normality of the residual distribution has limited statistical tables that do not allow estimating time series of more than 50 elements. To expand its applicability, we can use the Kazakevičius approximation (Kazakevičius 1988), expressed by Eqs. (4)-(6), which allows working with any sample size, normalized to one with an optimum at zero.
where respectively, z t is the approximated tabular value used to calculate the coefficients of a t , n is the sample size of the approximation residuals, t is the number of sample elements corresponding to the time point of the series, a t is the approximated tabular value Ivanyuk Financial Innovation (2023) 9:37 for calculating the value of W p , W p is the penalty function component, x t is the values of the time series of the approximation residuals, and x denotes the arithmetic mean of x t .
Despite the apparent complexity of Kazakevičius' equations, unlike the original Shapiro-Wilk test, they not only allow the evaluation of a sample of any volume, but are also well suited for parallelizing calculations on multiprocessor systems. In addition, it should be noted that the coefficients z t and a t are essentially tabular values that are calculated once and appear to be common to all estimated time series residuals.
Penalty function: as a result, having characteristics of the series of residuals as mutual independence, characterized by the value T p , the absence of autocorrelation, characterized by the value D p , and the proximity of the distribution to the normal, characterized as W p , considering that all three parameters have critical values normalized to unity with an optimum point at zero, we can create a function characterizing all three parameters as a volume normalized in the interval [0; 1] with an optimum at the origin point. The final function is given by Eq. (7).
In the future, we will consider RBBA to minimize the penalty function P p of the time series residuals by selecting the optimal values of the ensemble of variable coefficients of the approximated series by any gradient or stochastic method.

Experimental verification of the RBBA method
For the experimental verification of the method, a statistical assessment of the quality of additive ensemble forecasts formed by the following ensemble methods was applied: Bayes optimal classifier (Mitchell 1997), bootstrap aggregating (Breiman 1996), Adaptive Boosting (Hastie et al. 2009), and particularly, RBBA.
Ensemble components: The following types of forecasts were used as the main components of the additive predictive ensemble: classical linear, classical indicative, asymptotic linear, asymptotic indicative, classical wavelet, combined neural, and classical perceptron-based.
The classical and asymptotic linear forecasts are described by Eqs. (8) and (9): where respectively, F (t) lin denotes the value of the linear forecast, a is the coefficient of the linear equation selected by the optimization method, t denotes the number of sample elements corresponding to the time point of the series, b is the coefficient of the linear equation selected by the optimization method, F (t) asl is the asymptotic linear forecast value, n is the sample size of the time series elements, e is the Euler number (McCartin 2006), c is the asymptotic coefficient selected by the optimization method, and τ is the forecast lead time.
The classical and asymptotic indicative forecasts are described by Eqs. (10) and (11): where respectively: F (t) e is the indicative forecast value, H is the Heaviside function (Zhang and Zhou 2020), a is the coefficient of the linear equation selected by the optimization method, x max is the maximum element of the time series, x min is the minimum element of the time series, e is the Euler number (McCartin 2006), b is the coefficient of the linear equation selected by the optimization method, t denotes the number of sample elements corresponding to the time point of the series, F (t) ae is the asymptotic indicative forecast value, n is the sample size of the time series elements, c is the asymptotic coefficient selected by the optimization method, and τ is the forecast lead time.
Because asymptotic forecasts are non-standard methods ( Figs. 1 and 2), we suggest a graphic presentation of the difference between these forecasts and the existing classical methods, which allows for enhancing the prediction quality as a result of the asymptotic expansion of the forecast values.
The wavelet forecast is formed by the method of sequential decomposition of a time series by a set of wavelets similar to Eq. (12), all the wavelets are combined into a common additive forecast, according to Eq. (13).
where respectively, ψ i (t) is the value of the wavelet, a is the amplitude of the wavelet selected by the optimization method, i is the number of wavelets obtained during (10) F (t) e =H (a) x max − x min e −bt + H(−a) x min + x max e −bt , A perceptron-based neural forecast, which reflects the impact of external correlating factors, is formed by a single-layered nonrecurrent neural network with multiple inputs receiving data on the values of influencing external factors. Such a forecast is calculated according to Eq. (14): where respectively, F (t) ann is the value of the neural network forecast, ς is the Verhulst logistic function (Verhulst 1838), m is the unit depth of the neural network, k is the coefficients of the receptive field selected by the network training method, n is the unit width of the neural network, x ij is the input value of the neural network, t is the number of sample elements of the time series corresponding to the time point of the series, and τ is the forecast lead time.
Residual-based optimization function.
To perform optimization using gradient methods in an empirical study, a fourdimensional criterion of the residual quality was chosen, combining both the classical least-squares method and the proposed method for evaluating the quality of residuals corresponding to Eq. (15) (14) where respectively, F (t) ens is the ensemble forecast value, k is the coefficient of receptive confidence selected using the ensemble averaging method, and F(t) is the corresponding forecast described above.
The main optimization task for this type of forecast is to determine the confidence coefficients, which is the concept of ensemble optimization (ensemble averaging). Notably, it is just the method of determining these coefficients that constitutes the main difference between the basic approaches of the evaluated ensemble methods. Therefore, for the Bayes optimal classifier method, the probability of a correct forecast for each of the ensemble components determines the confidence coefficients; bootstrap aggregating is based on discarding the worst forecasts characterizing certain properties of the series already used in forecasts with better results; Adaptive Boosting, despite the pronounced tendency to overfitting, relies on the idea of gradual improvement of confidence coefficients based on a continuous reassessment of the squared forecast deviations; and RBBA, combining the ideas of Adaptive Boosting and bootstrap aggregating, implies a gradual sequential selection of confidence coefficients based on the optimization of comprehensive quality estimates of predictive residuals of a set of forecasts.
To assess the quality of the forecasts, a random sample of real stock exchange data was used. The main characteristics of the forecast quality are the estimates of the standard deviation (17), confidence interval (18), and total percentage of forecasting errors with the corresponding lead time. The evaluation results are averaged arithmetically.
where RMSE is the forecast error, i is the number of sample elements of the forecast residuals, n is the sample size of forecast residuals, x i is the value of the time series of the forecast residuals, x denotes the arithmetic mean of x i , �F (t) ens is the forecast confidence interval, and t α is the Student's t-distribution coefficient corresponding to the significance level α.
Quality evaluation of ensembles Below is the general result of the quality evaluation of ensembles for α = 0.05 , see Table 1 and Fig. 3.

Discussion
Evaluating the above, it is worth noting that the idea of the ideality of random residuals in itself is just a widespread hypothesis. In fact, when analyzing real data, we often observe periods of the distribution of residuals that are not only extremely close to normal (for example, the dollar exchange rate in the Russian Federation during the 2009-2014 period) but also extended periods with significant asymmetry and frequent outliers (for example, the dollar exchange rate in the Russian Federation in the 1994-2000 period), when the residuals do not correspond to statistical characteristics of normality by most criteria, which means that they imply trends that are unaccounted for and dependencies missed or deliberately ignored by forecasters. The optimization criteria chosen by the authors, despite the increase in the quality of predictive ensembles, do not guarantee that the approximation residuals correspond perfectly to the normal distribution but are only an alarming red flag indicating a significant intensity of processes and trends that are overlooked in the forecasts. The main advantages of the proposed method include higher prediction accuracy and no need to perform a quality test on the residuals of the forecast, which leaves out a whole mandatory stage from the standard methodology of statistical research. Owing to the optimization of the quality of residuals, we obtain a multidimensional, optimal result as early as the forecasting stage. However, because the proposed method uses criteria applicable exclusively for large samples, it is not applicable for small ones, which is its real downside. Over the past 120 years, statistics has come a long way as a separate scientific field, but the period of transition to the third information age, when computing resources became extremely cheap and new statistical methods gained high demand again, was practically ignored by it. Despite the allure of the idea of an approximating "self-learning smart black box, " which served the development and widespread use of multilayer neural networks of deep learning and the latest forecasting methods, we could not qualitatively predict the crisis of 2008, the Arab Spring, or the energy crisis of 2020-2022.

Conclusion
The idea of a multidimensional residual-based optimization approach is not limited to this evaluation method. As a penalty function, one can use third-and fourth-order deviations from the normal moments, as well as higher-order estimates, or other criteriaand non-criteria-based methods. It is evident that both the method of turning points and the Shapiro-Wilk test described in this article imply the evaluation of large samples. Meanwhile, the RBBA method remains suitable for samples with a small number of elements, although in this case, it is necessary to use similar criteria designed for small samples. It is also worth noting that the RBBA method cannot be sensitive to any initial conditions or parameters for calculating the forecast because, unlike stochastic bootstrap methods, which depend on the initial time, it uses statistical criteria and gradient methods.
It is also worth noting that the quality criteria of the residuals chosen by the authors are far from a comprehensive assessment, as parameters such as correlations of deep orders, higher moments of distributions, and systematic external asymmetry are clearly not considered in the penalty function, which is the main criterion for RBBA optimization. Perhaps, to improve the quality of forecasts, it is necessary to add other measures that primarily characterize the presence of unaccounted trends, hidden dependencies, and multiplicative factors.

RBBA
Residual-based bootstrap averaging RMSE Residual mean square error ANN Artificial neural network SARIMAX Seasonal auto-regressive integrated moving average with eXogenous factors.