Robust monitoring machine: a machine learning solution for out-of-sample R\(^2\)-hacking in return predictability monitoring
Financial Innovation volume 9, Article number: 94 (2023)
Abstract
The out-of-sample \(R^2\) is designed to measure forecasting performance without look-ahead bias. However, researchers can hack this performance metric even without multiple tests by constructing a prediction model using intuition derived from empirical properties that appear only in the test sample. Using ensemble machine learning techniques, we create a virtual environment that prevents researchers from peeking into the intuition in advance when performing out-of-sample prediction simulations. We apply this approach to robust monitoring, exploiting a dynamic shrinkage effect by switching between a proposed forecast and a benchmark. Considering stock return forecasting as an example, we show that the resulting robust monitoring forecast improves the average performance of the proposed forecast by 15% (in terms of mean-squared-error) and reduces the variance of its relative performance by 46% while avoiding the out-of-sample \(R^2\)-hacking problem. Our approach, as a final touch, can further enhance the performance and stability of forecasts from any model or method.
Introduction
The out-of-sample \(R^2\) is no better than the in-sample \(R^2\) regarding data-snooping concerns. Persistent researchers can attempt multiple prediction models and ingeniously report only the best-performing one (Inoue and Kilian 2005, 2006, de Prado 2019). Alternatively, careful researchers examine a model of their own choice, but only the lucky ones obtain false positive results that are good enough for publication (Chordia et al. 2017). However, the problem, in reality, is even deeper than that. Researchers often construct a prediction model using intuition derived from recent empirical findings that did not exist in the training sample period (Yae Forthcoming). It is unlikely that, prior to the recent findings, forecasters would have chosen such a model without a hint from the future; that is, an unintended look-ahead bias arises in pseudo out-of-sample testing.
For example, some early studies, such as Pesaran and Timmermann (1995), show that stock return predictability is time-varying: the predictability is stronger in recessions than in expansions. Many follow-up studies then utilize such empirical facts to improve out-of-sample predictability further.^{Footnote 1} The truth is, however, that a hypothetical forecaster in out-of-sample prediction simulations would not choose such stylized models without sufficient evidence at the moment of prediction (Martin and Nagel 2022). Therefore, the best option available to the forecaster was to consider all possible models and choose one or a combination, ex-ante optimally, without the help of not-yet-available intuition.
We consider a machine learning approach to tackle this comprehensive out-of-sample \(R^2\)-hacking problem. However, our strategy differs from others, which highlight the superior prediction accuracy of machine learning. We find a new solution by exploiting the weakness of machine learning: the black-box-like nature that potentially exacerbates the out-of-sample \(R^2\)-hacking problem and blurs economic intuition.^{Footnote 2} Figuratively, using this black box, we create a virtual environment that prevents researchers from peeking into the intuition from the future when they perform out-of-sample prediction simulations.^{Footnote 3} Therefore, we do not attempt to maximize the average forecasting performance ex-post. Instead, we aim to measure the attainable level of out-of-sample predictability in practice using machine learning.
As a practical example, we demonstrate our solution for out-of-sample \(R^2\)-hacking in the context of robust monitoring forecasts; that is, a forecast that switches between a proposed forecast and a benchmark while utilizing the conditional predictabilities, which are monitored in real time.^{Footnote 4} To avoid out-of-sample \(R^2\)-hacking, we suggest the following three-step approach: (1) choose an information set (i.e., conditioning variables) for monitoring, independently of intuition available only after the moment of prediction, (2) approximate the entire time-series model space of the conditioning variables with feature engineering, and (3) apply a combination of ensemble machine learning algorithms, instead of choosing what works best ex-post after attempting many. That is, we design each step to exclude the out-of-sample \(R^2\)-hacking associated with selecting data, models, and optimization algorithms, respectively. Therefore, implementing the first two steps is as important as the last step, which ensembles multiple machine learning algorithms. We call this class of machine learning approaches for robust monitoring forecasts the robust monitoring machine.
We apply our robust monitoring machine to a real-world prediction example for demonstration purposes. The proposed forecast in this example is the combination forecast from Rapach et al. (2010), an equal-weighted average of forecasts based on the 14 stock market predictors studied in Goyal and Welch (2008). The benchmark forecast is a real-time historical average of stock market returns, as Campbell and Thompson (2008) suggest when defining the out-of-sample \(R^2\). Then, we apply the three-step approach as follows. First, we choose the smallest possible information set for monitoring: a history of the prediction loss (e.g., squared-error loss) differences between the proposed and benchmark forecasts. Next, following Christ et al. (2018), we convert the variable in our information set into hundreds of time-series features, which act as building blocks to approximate the entire time-series model space. Finally, we blindly combine (with equal weights) three popular ensemble machine learning algorithms that we sequentially train and validate for out-of-sample forecasting: random forest, extremely randomized trees, and gradient boosting. We stress that we rule out the potential out-of-sample \(R^2\)-hacking problem by intentionally performing these tasks in an unsophisticated manner.
Nevertheless, the out-of-sample classification performance of the robust monitoring machine remains outstanding, despite our efforts to avoid out-of-sample \(R^2\)-hacking problems. Its sensitivity is 57.0% and its specificity is 56.8% without any look-ahead biases, which statistically differ from those of uninformative random classifiers. Our results imply that past return predictability truly contains information about the future return predictability of the proposed and benchmark forecasts.
To quantify the advantages of the robust monitoring machine, we measure a prediction loss difference (between a forecast and the benchmark forecast). Its mean represents the average forecasting performance, while its variance indicates the stability of forecasting performance. We find that our monitoring forecast beats the proposed forecast in both aspects: it improves the mean prediction loss difference by 15%, while the variance falls by 46.3%, consistent with the expected benefits of robust monitoring. We emphasize two things here. First, our approach aims to attain a realistic level of out-of-sample prediction rather than compete with other algorithms that maximize ex-post performance metrics. Second, a robust monitoring machine can be easily added to other forecasting methods as a final step. In particular, forecasters should apply the robust monitoring machine if they evaluate performance relative to a benchmark.
We also confirm that the robust monitoring machine captures the well-known economic intuition in real time: return predictability is greater in bad economic periods. The robust monitoring machine tends to favor the proposed forecast ex-ante in bad economic periods, which are characterized by low stock valuation and high macro uncertainty. Previously, this intuition looked clear only when researchers studied the whole sample period in hindsight.
This study contributes to two strands of the literature. First, a few recent works use a forward-looking approach to model selection or averaging; Zhu and Timmermann (2017), Gibbs and Vasnev (2018), and Granziera and Sekhposyan (2019) optimize the weighting strategy conditional on the expected future performance of the prediction models. This approach represents a considerable departure from the prior literature with backward-looking approaches. We fully scale up their forward-looking approach by searching the entire time-series model space as approximated by feature engineering (e.g., Kou et al. 2021), instead of considering only a few models possibly chosen by subjective intuition. Second, stock return prediction via machine learning has made tremendous progress recently in academia and practice.^{Footnote 5} Prevailing research, however, concentrates on improving the accuracy of the prediction.^{Footnote 6} Instead, we exploit the black-box-like nature of machine learning to intentionally block intuition from the hypothetical future data (test sample) in out-of-sample simulations. That is, this study resolves the out-of-sample \(R^2\)-hacking problem with machine learning techniques instead of creating one. Unlike existing solutions for multiple testing, our approach does not compromise forecasting performance.
To the best of our knowledge, no prior work attempts to resolve the \(R^2\)-hacking problem with machine learning techniques. This gap creates the need for a new approach. This study offers the first attempt by proposing how to properly use existing machine learning algorithms to avoid the out-of-sample \(R^2\)-hacking problem instead of proposing a single machine learning algorithm that maximizes forecasting accuracy by multiple testing.
The remainder of the paper is organized as follows. Section Monitoring forecasts revisited revisits forecast monitoring in a simple framework. Section Robust monitoring explains the benefits of the robust monitoring forecast. Section Robust monitoring machine outlines our machine learning approach to robust monitoring. We describe its real-world example in Sect. Applications: robust monitoring for return predictability and present the results in Sect. Empirical performance of monitoring forecasts. We discuss several important topics, such as how to interpret the results in conjunction with business cycles, in Sect. Discussion and conclude in Sect. Conclusion.
Monitoring forecasts revisited
Monitoring forecasts can be viewed as an extreme case of combination forecasts with dynamic weights. If a researcher is unsure which forecast predicts conditionally best, it is better to combine multiple forecasts for variance reduction. However, selecting a single predictor can be better if an accurate signal for conditional performance is available. In this section, we characterize the condition under which a monitoring forecast can outperform individual forecasts in a simple model. We also include a few metrics and statistical tests to quantify monitoring performance.
A simple model of monitoring
Suppose that r is the target variable to predict and its unconditional mean is \(\mu\). There are two unbiased forecasts, \(f^{(a)}\) and \(f^{(b)}\); that is, \(E[f^{(a)}]=E[f^{(b)}]=E[r]=\mu\), as follows.
The individual shocks \((e^{(y)},e^{(s)},e^{(a)},e^{(b)})\) each follow a normal distribution centered at zero. These shocks are uncorrelated with each other, and their standard deviations are \(SD(e^{(y)}) =\sigma _y\), \(SD(e^{(s)}) =\sigma _s\), and \(SD(e^{(a)}) = SD(e^{(b)}) = \sigma _f\), respectively. The indicator variable S follows a Bernoulli distribution with probability \(p_s\).
Without loss of generality, we assume that the forecast \(f^{(a)}\) is unconditionally more accurate than \(f^{(b)}\) such that \(p_s>\frac{1}{2}\). If \(S=1\), then the forecast \(f^{(a)}\) is conditionally more accurate than \(f^{(b)}\) ex-ante, while \(f^{(b)}\) is conditionally more accurate if \(S=0\).
Now consider a monitoring forecast \(f^{(m)}\) optimally switching between two forecasts \(f^{(a)}\) and \(f^{(b)}\) given a signal on S. We measure the accuracy of the signal, \(p_m\), as
M is a monitoring signal following a Bernoulli distribution. The higher \(p_m\) is, the more accurate the monitoring forecast is. For example, if \(p_m=1\), then the monitoring forecast always picks the more accurate forecast between \(f^{(a)}\) and \(f^{(b)}\). Next, we adopt a squared forecast error as a loss function:
and define a loss difference between two forecasts as follows.
Then, we can easily derive the condition for which the monitoring forecast \(f^{(m)}\) outperforms both individual forecasts on average. The monitoring forecast \(f^{(m)}\) is unconditionally more accurate than \(f^{(a)}\) (and so \(f^{(b)}\)) exante if and only if
Condition (8) states that the monitoring forecast will outperform if the monitoring signal is accurate enough to dominate the accuracy advantage of one individual forecast over the other. For example, in the case where the two forecasts are similar in accuracy (\(p_s=1/2\)), it is not difficult to beat both forecasts by monitoring; we need only \(p_m>1/2\).
Out-of-sample \(R^2\): a conventional metric for forecast performance
Campbell and Thompson (2008) suggest an out-of-sample \(R^2\) metric to evaluate the performance of stock return forecasts. Suppose \(f^{(i)}_{t|t-1}\) is a given forecast for the target \(r_t\). Then, the out-of-sample \(R^2\) for the period from \(t=t_0\) to \(t=t_1\) is
where \(f^{(b)}_{t|t-1}\) is a benchmark forecast, the historical average of past returns \(r_t\), commonly used in the stock return predictability literature. The out-of-sample \(R^2\) measures the reduction in mean squared prediction error for a forecast relative to the benchmark forecast. If \(R^2_{OS}\) is positive, then the forecast \(f^{(i)}_{t|t-1}\) outperforms the benchmark forecast in terms of the mean squared prediction error metric.
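To make the metric concrete, the following minimal sketch computes \(R^2_{OS}\) from aligned series of realized returns and one-step-ahead forecasts; the function and variable names are illustrative choices, not code from the paper.

```python
import numpy as np

def r2_os(r, f_i, f_b):
    """Out-of-sample R^2 of forecast f_i relative to benchmark f_b.

    r, f_i, f_b: aligned 1-D arrays over the evaluation period t0..t1,
    where f_i[t] and f_b[t] are forecasts of r[t] formed at t-1.
    """
    mse_i = np.mean((r - f_i) ** 2)  # mean squared prediction error of the candidate
    mse_b = np.mean((r - f_b) ** 2)  # mean squared prediction error of the benchmark
    return 1.0 - mse_i / mse_b       # positive => candidate beats the benchmark
```

A positive value means the candidate forecast has a lower mean squared prediction error than the historical-average benchmark over the evaluation window.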
Metrics and tests for monitoring performance
Monitoring forecasts solve a classification problem: which forecast performs better conditionally? Therefore, we can adopt metrics and statistical tests from the classification literature to evaluate monitoring performance. For example, we can define the event in which \(f^{(a)}\) outperforms \(f^{(b)}\), that is, \(\Delta L^{(b,a)}>0\), as positive, and the opposite as negative. Then, the sensitivity, or true positive rate (TPR), refers to the empirical probability that the monitoring forecast \(f^{(m)}\) equals \(f^{(a)}\) when \(\Delta L^{(b,a)}>0\) (i.e., \(f^{(a)}\) outperforms \(f^{(b)}\)). Likewise, specificity, or true negative rate (TNR), refers to the empirical probability that \(f^{(m)}\) equals \(f^{(b)}\) when \(\Delta L^{(b,a)}<0\). We also adopt other metrics, such as positive predictive value (PPV), negative predictive value (NPV), and accuracy (ACC), following their conventional definitions in a confusion matrix.
If the classifier is purely random, then both ‘‘sensitivity + specificity’’ and ‘‘PPV + NPV’’ should be one in the population. We compute their 95% confidence intervals as
where \(n_{1}\) and \(n_{2}\) are the numbers of true positives and negatives in the data, respectively.
where \(n_{3}\) and \(n_{4}\) are the numbers of predicted positives and negatives in the data, respectively. If these confidence intervals do not contain one, then we can conclude that a monitoring task is informative (\(p_m>1/2\)) at the 5% significance level.
We adopt two formal tests of monitoring performance from the classification context: Fisher's exact test and the Chi-square test for binary classifiers. The null hypothesis \(H_{0}\) in both tests is that the true (predicted) positives and true (predicted) negatives are equally likely to be predicted as (true) positives. Therefore, low p-values in these tests are evidence that a monitoring task is informative.
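As a rough illustration of how these metrics and tests can be computed from the realized signs of \(\Delta L^{(b,a)}\) and the monitor's choices, the sketch below uses a standard normal-approximation interval for "sensitivity + specificity"; the exact interval construction used in the paper is not reproduced here, so treat the variance formula as an assumption.

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

def monitoring_classification_report(y_true, y_pred):
    """y_true[t] = 1 if f^(a) actually outperformed f^(b) at t; y_pred[t] = monitor's call."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))

    sens, spec = tp / (tp + fn), tn / (tn + fp)   # TPR and TNR
    ppv, npv = tp / (tp + fp), tn / (tn + fn)

    # Assumed normal-approximation 95% CI for "sensitivity + specificity"
    n1, n2 = tp + fn, tn + fp                     # numbers of true positives / true negatives
    se = np.sqrt(sens * (1 - sens) / n1 + spec * (1 - spec) / n2)
    ci = (sens + spec - 1.96 * se, sens + spec + 1.96 * se)

    table = np.array([[tp, fn], [fp, tn]])
    _, p_fisher = fisher_exact(table)             # Fisher's exact test
    p_chi2 = chi2_contingency(table)[1]           # Chi-square test of independence
    return dict(sensitivity=sens, specificity=spec, ppv=ppv, npv=npv,
                ci_sens_plus_spec=ci, p_fisher=p_fisher, p_chi2=p_chi2)
```

If the interval for "sensitivity + specificity" excludes one, or the test p-values are small, the monitoring signal is informative.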
Robust monitoring
A monitoring forecast is an aggressive technique to maximize predictability, in contrast to a combination forecast, which aims to reduce the variance of forecast errors. That is, monitoring forecasts and combination forecasts are the traditional counterparts of boosting and bagging (i.e., bootstrap aggregating) in machine learning, respectively. However, a monitoring forecast, aggressive by nature, can produce a robust conservative predictor when it switches between a given proposed forecast and a benchmark. That is, the monitoring forecast becomes a robust version of the originally proposed forecast, whatever it is. We briefly explain the intuition in the following sections.
Using a benchmark to make a forecast robust
Suppose a researcher wants to measure the forecasting ability of a proposed forecast \(f^{(a)}\) relative to a benchmark forecast \(f^{(b)}\) in the context of the previous section. Then, the loss difference, \(\Delta L^{(b,a)} = L^{(b)}-L^{(a)}\), between the forecasts \(f^{(a)}\) and \(f^{(b)}\) is a measure of how much the forecast \(f^{(a)}\) outperforms the benchmark forecast \(f^{(b)}\). With a quadratic loss function, the expected loss difference \(E(\Delta L^{(b,a)})\) is the expected reduction in mean squared error when we replace the benchmark \(f^{(b)}\) by \(f^{(a)}\).
Moreover, researchers can construct a monitoring forecast \(f^{(m)}\) that switches between a proposed forecast \(f^{(a)}\) and a benchmark forecast \(f^{(b)}\). They may wonder if the monitoring forecast \(f^{(m)}\) outperforms the proposed forecast \(f^{(a)}\). Note that comparing the mean squared errors of the forecasts \(f^{(a)}\) and \(f^{(m)}\) is equivalent to comparing the expected loss differences \(E(\Delta L^{(b,m)})\) and \(E(\Delta L^{(b,a)})\) because of the following identity:
Here, the empirical loss difference \(\Delta L\) is the main building block for evaluating relative forecasting performance. Its first moment is a difference in the mean squared errors, a key comparison metric in the forecasting literature. Researchers, therefore, compare their means \(E(\Delta L^{(b,m)})\) and \(E(\Delta L^{(b,a)})\) to see if the monitoring forecast \(f^{(m)}\) outperforms the original forecast \(f^{(a)}\) on average. However, what about their variances \(Var(\Delta L^{(b,m)})\) and \(Var(\Delta L^{(b,a)})\)? Should researchers ignore or care about them? What do they even mean?
Those variances represent the uncertainty of how well the forecasts \(f^{(a)}\) and \(f^{(m)}\) perform at a given time relative to the benchmark forecast \(f^{(b)}\). Yae (2018) adopts the idea of a tracking-error-volatility (TEV) efficient frontier from the portfolio optimization literature and argues that risk-averse researchers should care about such variances if they evaluate performance relative to the benchmark forecast.^{Footnote 7} The idea of relative performance is common in investments. An investor who wants to outperform the market would consider deviating from the market as taking risks for higher returns. The more the portfolio deviates from the market, the greater the downside risk (and upside opportunity) relative to the market performance.
Nevertheless, conventional forecasting performance metrics focus only on mean performance while ignoring information in the variance of performance. Such second-moment information is used only in some formal tests of relative forecast performance, such as Diebold and Mariano (2002).^{Footnote 8} This first-moment-oriented practice should look alarming to financial economists because it ignores risk, which is the core of investment performance evaluation. Additionally, the variance of relative performance is actually common in economics and finance.^{Footnote 9} For example, in ‘‘Keeping-up-with-the-Joneses’’ preferences, an agent’s utility is determined relative to others’ consumption level (Abel 1990, Gali 1994). Similarly, stock market movements do not compensate or penalize a mutual fund manager whose official benchmark is the market portfolio.
A robust monitoring forecast is ‘‘a monitoring forecast switching between a proposed forecast and a benchmark.’’ It has a built-in shrinkage effect when the forecast is evaluated relative to a benchmark in terms of \(\Delta L^{(b,m)}\). The monitoring technique in this context makes a newly proposed forecast robust, and exploits shifts in the conditional predictability between two forecasts. Its mechanism is simple. If monitoring is informative, then the monitoring forecast will become a benchmark forecast \(f^{(m)}=f^{(b)}\) when the other forecast \(f^{(a)}\) is unlikely to outperform the benchmark. Whenever \(f^{(m)}=f^{(b)}\), the loss difference \(\Delta L^{(b,m)}\) becomes exactly zero, and the total variance of the loss difference becomes lower than that of the proposed forecast \(f^{(a)}\). In the monitoring model of the previous section, the law of total variance implies
where \(p_a \equiv Prob[f^{(m)}=f^{(a)}] = p_m p_s + (1-p_m)(1-p_s)\) denotes the unconditional probability that the monitoring forecast deviates from the benchmark (i.e., \(f^{(m)}=f^{(a)}\)). The second term is negligible unless the proposed forecast \(f^{(a)}\) outperforms the benchmark significantly and persistently, which would imply that the benchmark is improperly chosen as a straw man. Therefore, we can approximate the ratio of the variances of the loss differences \(\Delta L^{(b,m)}\) and \(\Delta L^{(b,a)}\) as \(p_a\):
The variance ratio on the left-hand side is always lower than one by construction. Therefore, monitoring with a benchmark forecast always lowers the variance of the loss difference \(\Delta L^{(b,m)}\), and the monitoring forecast will produce more statistically significant evidence of superior forecasting performance. This is a hidden benefit of forecast monitoring.
However, this benefit is never a free lunch. The following tradeoff between accuracy gain and variance reduction in loss difference exists^{Footnote 10}:
We derive this result by combining \(p_a = p_m p_s + (1-p_m)(1-p_s)\) with Eq. (8) to eliminate \(p_m\). Informative monitoring (i.e., high \(p_m\)) improves the accuracy of the monitoring forecast but decreases the reduction in loss-difference variance, as follows.
Note that this tradeoff is beyond the bias-variance tradeoff in statistical estimators or machine learning algorithms. The bias-variance tradeoff aims to maximize the average forecasting performance. Equation (11), in contrast, represents a tradeoff between the average forecasting performance and its variance. This new kind of tradeoff concerns the mean and variance of squared forecast errors, which correspond to the second and fourth moments of forecast errors, while the bias-variance tradeoff concerns the first and second moments of forecast errors. In a numerical example with \(p_s=p_m=3/4\), we can still expect a \(37.5\%\) reduction in the variance of the loss difference with the same accuracy as the originally proposed forecast in terms of the mean squared error.
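The arithmetic of this numerical example can be verified directly; the snippet below only reproduces the \(p_a\) calculation and the variance-ratio approximation \(Var[\Delta L^{(b,m)}]/Var[\Delta L^{(b,a)}]\approx p_a\) described above.

```python
p_s = 0.75  # probability that f^(a) is the conditionally more accurate forecast
p_m = 0.75  # accuracy of the monitoring signal

p_a = p_m * p_s + (1 - p_m) * (1 - p_s)  # probability the monitor selects f^(a)
variance_ratio = p_a                     # approximation from Eq. (10)
print(p_a, 1 - variance_ratio)           # 0.625, 0.375 -> 37.5% variance reduction
```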
The definition of the variance of the loss difference depends on the choice of benchmark forecast. In many applications, the benchmark is not subjective. For example, the classical view in the stock market is that price follows a random walk, with absolutely no predictability. In other words, the market risk premium is unconditionally constant, and investors have no conditioning variables to predict it. If researchers want to show the existence of a successful predictor or predictability of the market risk premium, then they need to set up a constant market risk premium as the null hypothesis and use the historical average stock market return as the benchmark forecast.^{Footnote 11}
Metrics for robust monitoring performance
We adopt two metrics for robust monitoring performance from Yae (2018). The metrics are analogous to investment performance measures: the risk premium as a raw metric and alpha as a riskadjusted metric.^{Footnote 12} Here, we use concise notations for performance metric inputs:
Monitoring Risk Premium Suppose a researcher chooses a forecast \(f^{(a)}\) when considering a benchmark forecast \(f^{(b)}\) as a reference point. Then, \(E[d_a]\) can represent the average forecasting performance as the expected reduction in mean squared forecast errors relative to the benchmark. If the researcher chooses \(f^{(b)}\) instead, then she obtains only \(E[d_b]\), which is zero by definition. We interpret \(E[d_a]\) as a kind of premium in the forecasting context. As an analogy, if an investor chooses the stock market portfolio over her benchmark risk-free asset, then the expected return difference between these two is called the market risk premium. The investor deviates from the benchmark risk-free asset in the hope of earning the premium. Likewise, a researcher deviates from the benchmark forecast \(f^{(b)}\) to \(f^{(a)}\) in the hope of earning \(E[d_a]\). She can earn \(E[d_m]\) instead by switching between \(f^{(a)}\) and \(f^{(b)}\) as a robust monitoring forecast \(f^{(m)}\). Then, \(E[d_m]\) scaled by \(E[d_a]\) is called the monitoring risk premium.
where \(R^2_{OS,a}\) and \(R^2_{OS,m}\) are the out-of-sample \(R^2\) for the proposed forecast \(f^{(a)}\) and the robust monitoring forecast \(f^{(m)}\) that switches between \(f^{(a)}\) and the benchmark forecast \(f^{(b)}\), respectively. If monitoring is sufficiently informative, as in Condition (8), then \(RP_m\) can be larger than one.
Monitoring Alpha The monitoring risk premium measures average forecasting performance but ignores any risk adjustment. By contrast, imagine investors who use the CAPM alpha (or alpha from a multi-factor model) as a risk-adjusted investment performance metric. Similarly, we define the monitoring alpha as
where \(m^*\) is the uninformed monitoring forecast that shifts randomly between the forecasts \(f^{(a)}\) and \(f^{(b)}\). It is a random strategy whose mixing probability is set so that its \(E[d_{m^*}^2]\) equals \(E[d_m^2]\). Thus, \(E[d_{m^*}]\) represents the accuracy gain from randomly shifting between the forecasts, and the real gain from informative monitoring should not include \(E[d_{m^*}]\). For example, if \(\alpha _m=0.4\), then informative monitoring adds 40% of the relative accuracy gain of the proposed forecast on top of the benefit from random switching. Alternatively, we can also express the monitoring alpha as^{Footnote 13}
The first term is the total accuracy gain through monitoring, scaled by the accuracy gain through the forecast \(f^{(a)}\) (that is, the monitoring risk premium), while the second term adjusts for the relative risk added by the monitoring procedure. When the unpredictable component is large in scale (\(\sigma _y\)), we have \((E[d_m])^2\ll Var[d_m]\), so \(E[d_m^2]\approx Var[d_m]\). Therefore, the monitoring alpha is analogous to a utility level under mean-variance preferences. Note that the monitoring alphas of the two forecasts \(f^{(a)}\) and \(f^{(b)}\) are zero by definition.
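A minimal sketch of the two metrics computed from realized loss differences is given below. The closed form for the uninformed monitor \(m^*\) (mixing probability \(q=E[d_m^2]/E[d_a^2]\), so that \(E[d_{m^*}]=q\,E[d_a]\)) is our reading of the definition above, not a formula quoted from the paper.

```python
import numpy as np

def monitoring_metrics(d_a, d_m):
    """d_a[t], d_m[t]: realized loss differences versus the benchmark,
    d = (r - f_b)^2 - (r - f)^2, for the proposed and monitoring forecasts."""
    rp_m = np.mean(d_m) / np.mean(d_a)           # monitoring risk premium

    # Uninformed monitor m*: picks f^(a) with probability q and f^(b) otherwise,
    # with q chosen so that E[d_{m*}^2] = E[d_m^2] (assumed: q = E[d_m^2] / E[d_a^2]).
    q = np.mean(d_m ** 2) / np.mean(d_a ** 2)
    e_d_mstar = q * np.mean(d_a)

    alpha_m = (np.mean(d_m) - e_d_mstar) / np.mean(d_a)   # monitoring alpha
    return rp_m, alpha_m
```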
Robust monitoring machine
Data snooping is a common issue in empirical research. The problem arises when a researcher reports only the best model or statistically significant variable after numerous failed trials. The same problem can also appear when numerous researchers try only one model or variable, but only a few researchers can successfully publish their results, which is dictated by luck, as shown in Chordia et al. (2017). This fundamental problem of empirical research is difficult to avoid and persists even in the forecasting context. Robust monitoring is not an exception.
Out-of-sample R\(^2\)-hacking problem
Data snooping (or p-hacking) comes from two root causes: data and models (algorithms). First, researchers face infinite combinations of choices of variables, sample periods, and training-evaluation sample splits. Second, they must also choose a model along with tuning parameters, algorithms, and estimation methods. Analyzing the whole data-model space exceeds an individual researcher’s cognitive ability. Therefore, they end up selecting one or a few combinations arbitrarily or deliberately, which might even worsen data snooping issues.
The p-hacking problem regarding variable choices is well known in the in-sample fitting context, but the same problem also exists in out-of-sample analysis regarding both variable and model choices. For example, in the return predictability literature, the out-of-sample \(R^2\) is mainly used as a forecasting performance metric. Nonetheless, the out-of-sample \(R^2\) (or any other cross-validation metric) still faces the multiple testing problem and look-ahead bias. For example, some early papers, such as the one by Pesaran and Timmermann (1995), report that predictability is time-varying and conditional on other variables; that is, the predictability is stronger in recessions than during expansions. Then, many follow-up papers internalize such empirical facts in a specific time-series model to further improve out-of-sample predictability, ex-post.
The pitfall of this practice is that researchers select such successful models and variables based on their intuition, which did not exist in the sample period of the backtests. The intuition is formed by empirical knowledge available now but unavailable at the beginning of the out-of-sample test period, such as the information that predictability is stronger in recessions than in expansions. Therefore, real forecasters at that time would have been highly unlikely to apply such models and variables from the beginning of the out-of-sample test period, due to a lack of prior evidence. That is, all sophisticated models inspired by such ex-post intuition are subject to this unintended look-ahead bias. The best option for a forecaster at that time, if feasible, was to compare all possible models and variables to find the best one (or best combination) ex-ante, starting from the beginning of the out-of-sample testing period and repeating the process sequentially.
A machine learning solution
Machine learning algorithms are often criticized because of their black-box-like nature. Despite their superior prediction ability, they make researchers blind to hidden mechanisms by blurring economic intuition. Here, we focus on the bright side of the black-box-like nature and transform this criticism into a crucial device in our study. We make our solution for the robust monitoring problem intentionally blind to any intuition based on information in the evaluation (test) sample period, as Yae (Forthcoming) suggests. Therefore, the goal of our approach is to confirm the existence of useful information for monitoring in the real-time data rather than to maximize the average forecasting performance ex-post. To achieve this goal, we implement the following three steps.
Robust Monitoring Machine

1. Choose a set of conditioning variables for monitoring, independent of information from the evaluation sample period.
2. Approximate the entire time-series model space of the conditioning variables with feature engineering.
3. Apply a combination of ensemble machine learning algorithms instead of choosing what works best ex-post after trying many.
Each step is designed to exclude the out-of-sample \(R^2\)-hacking associated with selecting data, models, and optimization algorithms, respectively. Therefore, how to implement the first two steps is as important as the last step. We emphasize that this three-step approach goes beyond the well-known ensemble technique, which corresponds only to step 3 here.
We call this machine learning approach the robust monitoring machine and implement each step as follows. First, we make the most parsimonious choice of conditioning variables rather than the most universal one. In monitoring, a researcher needs at least one forecast target and two competing forecasts. We use a single time series of loss differences between the two competing individual forecasts, \(\Delta L^{(b,a)}_t\), to predict its next sign. The other extreme is to consider the entire information space, yet it is difficult to fathom, collect, or even approximate. Furthermore, many data sources are private or highly costly and are thus out of reach of some researchers. As the entire information space is neither known nor accessible, random sampling from it is also impossible. Second, we expand the single time series of the conditioning variable \(\Delta L^{(b,a)}_t\) into hundreds of time-series features as building blocks to approximate the entire time-series model space. Finally, we combine multiple ensemble machine learning algorithms. We rely on machine learning algorithms to handle the complexity of the feature set but avoid cherry-picking an algorithm that appears to work best ex-post.
Applications: robust monitoring for return predictability
We apply our robust monitoring machine to the US stock market prediction problem. Section Data, benchmark, and proposed forecasts explains the data source and our choice of benchmark and proposed forecasts. Section Robust monitoring machine in action describes our three-step approach in detail.
Data, benchmark, and proposed forecasts
The random-walk hypothesis is the traditional view of the stock market. If investors are rational and fully utilize public information to price stocks, then such public information should not predict future stock returns. Following the literature on return predictability, we set the target variable to predict as the monthly stock market excess return: the continuously compounded return \(r_{sp,t+1}\) on the S&P 500 index, including dividends, in excess of the risk-free rate \(r_{f,t}\) implied by the Treasury bill rate.^{Footnote 14} Henceforth, we call the target variable simply ‘‘return.’’
Suppose the random-walk hypothesis is true. Then, econometricians will find that no publicly available variables are correlated with the subsequent return \(r_{t+1}\) or its conditional expectation \(E[r_{t+1}|{\mathcal {F}}_t]\), where \({\mathcal {F}}_t\) denotes the information set of econometricians at time t. The expected return, therefore, should look like a constant, \(E[r_{t+1}|{\mathcal {F}}_t]=E[r_{t+1}]\), whether it truly is or not. The natural benchmark forecast under the random-walk hypothesis as a null will be an estimate for the unconditional expected return \(E[r_{t+1}]\). That is, the historical average of past returns is the benchmark forecast.
The idea of the random-walk hypothesis is theoretically appealing and relates to the well-known efficient market hypothesis. However, it was mostly difficult to reject the hypothesis early on because of insufficient sample sizes. Later, however, researchers began to find statistical evidence that a few variables predict stock market returns based on in-sample regression analysis.^{Footnote 15} However, Goyal and Welch (2008) examine the out-of-sample performance of the real-time individual OLS regression estimators of 14 variables \(x_{i, t}\) and find that none of them has out-of-sample predictability. From the following 14 individual predictive regressions:
Goyal and Welch (2008) and Rapach et al. (2010) define the real-time individual forecasts as
where \(\hat{\alpha }_{i, t}\) and \(\hat{\beta }_{i, t}\) are the OLS coefficient estimates of \(\alpha _i\) and \(\beta _i\), respectively, using information up to time t (expanding window), consistent with Goyal and Welch (2008) and Rapach et al. (2010). Later, following Hendry and Clements (2004), Timmermann (2006), and many others, Rapach et al. (2010) construct the equal-weighted average of these 14 individual forecasts as follows and show that it outperforms the benchmark forecast (17) in out-of-sample prediction.
We use this equal-weight combination forecast as our proposed forecast in monitoring.^{Footnote 16} Then, we optimally switch our forecast choice between the benchmark forecast in (17) and the proposed forecast in (20).
We obtain the monthly data for the returns and predictor variables from Amit Goyal’s website.^{Footnote 17} The sample period is from January 1927 to December 2017. Note that the combination forecast in our analysis differs slightly from that of Rapach et al. (2010) because their training sample starts in 1947. However, this is hardly an issue because we do not attempt to criticize their combination forecast. On the contrary, we need only to demonstrate our idea using some forecast that overall outperforms the benchmark forecast but whose performance varies. The 14 predictor variables in our example include the dividend-price ratio (dp), dividend yield (dy), earnings-price ratio (ep), dividend-payout ratio (de), equity risk premium volatility (rvol), book-to-market ratio (bm), net equity expansion (ntis), Treasury bill rate (tbl), long-term yield (lty), long-term return (ltr), term spread (tms), default yield spread (dfy), default return spread (dfr), and inflation (infl). See Appendix 1 for detailed descriptions.
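For concreteness, a minimal sketch of how the real-time benchmark and the equal-weight combination forecast can be constructed with expanding-window OLS is shown below; the array layout, the minimum training window, and the use of `numpy.polyfit` are our own illustrative choices, not the authors' code.

```python
import numpy as np

def realtime_forecasts(r, X, t_start=240):
    """r: array of returns r_0..r_{T-1}.  X: (T, 14) array of predictors, where
    X[t-1] is observable when r[t] is forecast.  Returns the expanding-window
    historical-average benchmark and the equal-weight combination forecast."""
    T, k = X.shape
    f_bench = np.full(T, np.nan)
    f_comb = np.full(T, np.nan)

    for t in range(t_start, T):
        f_bench[t] = r[:t].mean()              # historical average of past returns
        preds = []
        for i in range(k):
            x = X[:t - 1, i]                   # x_{i,s} for s = 0,...,t-2
            y = r[1:t]                         # r_{s+1} for s = 0,...,t-2
            beta, alpha = np.polyfit(x, y, 1)  # OLS slope and intercept (highest degree first)
            preds.append(alpha + beta * X[t - 1, i])
        f_comb[t] = np.mean(preds)             # equal-weight combination (Rapach et al. 2010)
    return f_bench, f_comb
```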
Robust monitoring machine in action
This section explains how we implement the robust monitoring machine in detail.
Labeling for Binary Classification We define a new dependent variable \(y_t\) as follows to convert the monitoring problem into a binary classification problem.
where \(\Delta L_{t}^{(b,a)}=(r_t - f^{(b)}_{t|t-1})^2 - (r_t - f^{(a)}_{t|t-1})^2\) is the difference in squared forecast errors as a loss function. The target variable \(r_t\), benchmark forecast \(f^{(b)}_{t|t-1}\), and proposed forecast \(f^{(a)}_{t|t-1}\) are as defined in Sect. Data, benchmark, and proposed forecasts. This binary time-series variable \(y_t\), as a label, indicates which forecast outperforms the other, ex-post. Note that the resulting dependent variable \(y_t\) is identical whether the definition of \(\Delta L_{t}^{(b,a)}\) is based on the \(L^2\)-norm or the \(L^1\)-norm.
Features for Monitoring As discussed in Sect. A machine learning solution, we include no arbitrary conditioning variables in the monitoring task. Instead, we use a single time series of \(\Delta L_{t-1}^{(b,a)}\) to predict \(y_{t}\). Following Christ et al. (2018), we perform feature engineering. We transform the past 60 months of \(\Delta L_{t}^{(b,a)}\) into 441 time-series model features \(Z_{t-1}\).^{Footnote 18} These features capture a wide range of characteristics of the time series, such as the number of peaks, the average, the maximal value, autocorrelation, and linear trend.^{Footnote 19}
The features for monitoring, \(Z_{t-1}\), offer building blocks to approximate the entire space of time-series models that predict \(y_t\).
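A rough sketch of the labeling and feature-engineering step using the tsfresh package (the library associated with Christ et al. 2018) is shown below; the 60-month window follows the text, but the exact feature settings that produce 441 features are not specified here, so the default feature set is only a stand-in.

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features

def build_labels_and_features(dL, window=60):
    """dL: pd.Series of loss differences ΔL_t^{(b,a)} indexed by month.
    Returns labels y_t = 1{ΔL_t > 0} and features Z_{t-1} extracted from the
    trailing window ΔL_{t-window}, ..., ΔL_{t-1}."""
    y = (dL > 0).astype(int)

    # Long-format container: one "id" per prediction month, holding its trailing window.
    frames = [pd.DataFrame({"id": t, "time": np.arange(window),
                            "value": dL.iloc[t - window:t].values})
              for t in range(window, len(dL))]
    long_df = pd.concat(frames, ignore_index=True)

    Z = extract_features(long_df, column_id="id", column_sort="time",
                         column_value="value", disable_progressbar=True)
    Z = Z.dropna(axis=1, how="all").fillna(0.0)  # simple cleanup of undefined features
    return y.iloc[window:], Z
```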
Three Ensemble Decision-Tree Algorithms To handle hundreds of features, we use three flagship ensemble decision-tree algorithms to predict \(y_t\): random forest, extremely randomized trees, and gradient boosting. They can effectively accommodate nonlinearity and interactions among features. They also avoid the overfitting problems of traditional logistic regressions by combining the forecasts from many small trees into a single forecast. We summarize the technical differences among these three algorithms in Appendix 3.
Training and Tuning Parameters To predict \(y_{t}\), we train the trees using input-label pairs \((y_{t-\tau }, Z_{t-\tau -1})\) for \(\tau =1,2,...,120\) (ten years of monthly data). As feature engineering requires the past 60 months of data, we need 15 years of data to predict \(y_{t}\). Note that using a rolling window is a natural choice because the idea of monitoring is based on time-varying performance. We choose tuning parameters that maximize the ROC-AUC measure, the area under the receiver operating characteristic (ROC) curve. In simple terms, this procedure maximizes true positive rates relative to false positive rates. We split these 120 sample pairs into three sets of 40 pairs, \(D_1\), \(D_2\), and \(D_3\), chronologically. Then, we train trees using \(D_1\) and test on \(D_2\). Next, we train trees using \(D_1\) and \(D_2\) and test on \(D_3\). We choose the optimal tuning parameters based on these two test results (i.e., the validation sample). The out-of-sample forecast starts in January 1947, following Goyal and Welch (2008).
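The rolling training and chronological validation described above might look roughly like the following with scikit-learn; the candidate parameter grid and the choice of `max_depth` as the tuned parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tune_and_fit(Z, y, candidate_depths=(2, 3, 5, None)):
    """Z, y: numpy arrays holding the most recent 120 input-label pairs in time order.
    Splits them chronologically into D1, D2, D3 (40 pairs each), picks the depth with
    the best average validation ROC-AUC, then refits on all 120 pairs.
    Assumes both classes appear in each validation fold."""
    folds = [(slice(0, 40), slice(40, 80)),    # train on D1, validate on D2
             (slice(0, 80), slice(80, 120))]   # train on D1+D2, validate on D3
    best_depth, best_auc = None, -np.inf

    for depth in candidate_depths:
        aucs = []
        for tr, va in folds:
            clf = RandomForestClassifier(n_estimators=300, max_depth=depth, random_state=0)
            clf.fit(Z[tr], y[tr])
            aucs.append(roc_auc_score(y[va], clf.predict_proba(Z[va])[:, 1]))
        if np.mean(aucs) > best_auc:
            best_depth, best_auc = depth, np.mean(aucs)

    final = RandomForestClassifier(n_estimators=300, max_depth=best_depth, random_state=0)
    final.fit(Z, y)                            # refit on the full 120-month window
    return final
```

The same tuning routine would be repeated for the extremely randomized trees and gradient boosting classifiers before ensembling them.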
Ensemble At time t, the three different algorithms produce their best guesses of the probability of \(y_{t+1}=1\); say, \(Prob[y_{t+1}=1|A_i, {\mathcal {F}}_{t}]\) for \(i=1,2,3\), where \(A_i\) is an algorithm and \({\mathcal {F}}_{t}\) is the information set. Then, we compute the equal-weight average of these probabilities, relying on the wisdom of crowds in the algorithm domain.
The following rule will determine our monitoring forecast of \(y_{t+1}\) at time t:
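A sketch of the ensembling and switching rule as we read it: the three classifiers' predicted probabilities are averaged with equal weights, and the monitoring forecast picks \(f^{(a)}\) whenever the averaged probability exceeds one half, and the benchmark otherwise (the 0.5 threshold is our assumption).

```python
import numpy as np

def monitoring_forecast(models, Z_t, f_a_next, f_b_next, threshold=0.5):
    """models: the three fitted classifiers (random forest, extremely randomized
    trees, gradient boosting).  Z_t: 1 x n_features array of current features.
    f_a_next, f_b_next: the proposed and benchmark forecasts for t+1."""
    probs = [m.predict_proba(Z_t)[0, 1] for m in models]  # Prob[y_{t+1}=1 | A_i, F_t]
    p_t = float(np.mean(probs))                           # equal-weight ensemble probability
    f_m = f_a_next if p_t > threshold else f_b_next       # switch to f^(a) or stay with benchmark
    return p_t, f_m
```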
Empirical performance of monitoring forecasts
Performance of the proposed forecast
The first row of Table 1 shows the performance of the proposed forecast \(f^{(a)}\) in Eq. (20). The out-of-sample \(R^2\) of the proposed forecast is 0.5% when the evaluation sample starts in 1947. The performance metric, however, is unstable and sensitive to the sample split date. When we gradually change the date from 1947 to 2007, the out-of-sample \(R^2\) decreases and even becomes negative from 1987 onward. The proposed forecast performs poorly, showing no predictability (\(R^2_{OS}=-0.24\%\)), when the evaluation sample starts in 2007 and ends in 2017.
Figure 1 Panel (a) also confirms that the performance of the proposed forecast \(f^{(a)}\) varies significantly over time, with a downward trend. The time series in the plot is the 60-month trailing moving average of the loss difference between the benchmark and the combination forecasts, \(\Delta L_t^{(b,a)}\), which measures the performance of the proposed forecast relative to the benchmark. The plot shows two deep negative values, implying that the proposed forecast often greatly underperforms the benchmark, especially in recent periods. Such unstable and deteriorating performance can be a serious concern to investors, although the proposed forecast overall outperforms the benchmark during the testing period 1947–2007.
The unstable performance of the proposed forecast is a fundamental problem of the class of combination forecasts that Rapach et al. (2010) suggest. We explore different weighting schemes in combination forecasts to demonstrate the severity of the problem rather than to propose improved weighting schemes. First, we use the median of the 14 forecasts instead of their mean. Second, following Stock and Watson (2004), Equation (4), we compute the combination weights of the 14 individual forecasts based on their discounted mean squared forecast errors (DMSFEs) with two tuning parameters: (1) the trailing sample period and (2) the discount factor \(\delta\).
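A sketch of the DMSFE weighting scheme in the spirit of Stock and Watson (2004, Eq. 4) is shown below; the window length and discount factor correspond to the two tuning parameters mentioned above, and the exact normalization is an assumption.

```python
import numpy as np

def dmsfe_weights(errors, delta=0.9, window=None):
    """errors: (T, N) array of past forecast errors for N individual forecasts.
    Returns combination weights proportional to the inverse discounted MSFE."""
    if window is not None:
        errors = errors[-window:]                   # trailing sample period
    T = errors.shape[0]
    discounts = delta ** np.arange(T - 1, -1, -1)   # most recent error discounted by delta^0
    dmsfe = discounts @ (errors ** 2)               # discounted squared forecast errors, per forecast
    w = 1.0 / dmsfe
    return w / w.sum()                              # weights sum to one
```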
However, the unstable pattern of performance persists in all cases of different combination weights. For example, the out-of-sample \(R^2\) can rise to 1.17% with weights based on the DMSFE (using up to the past one month of data with a discount factor \(\delta =1\)). Yet the out-of-sample \(R^2\) drops to \(-1.26\%\) when the evaluation sample starts in 2007 and ends in 2017. Now, suppose we focus only on the average performance metric (\(R^2_{OS}=1.17\%\)) and argue that we have found a superior forecast. If we really do so, then this is cherry-picking, an example of out-of-sample \(R^2\)-hacking. Furthermore, it comes with a high price tag: this cherry-picked forecast has the worst performance of all forecasts over the recent decade. Such unstable performance is a red flag that suggests cherry-picking practices. To tackle this problem, we propose a machine learning solution in the next section.
Robust monitoring performance
We first evaluate how well our robust monitoring machine conditionally chooses between \(f^{(a)}\) and \(f^{(b)}\). The confusion matrix in Table 2 analyzes the monitoring performance as an out-of-sample classification problem without any look-ahead biases. Sensitivity and specificity are 57.0% and 56.8%, respectively, both higher than 50%. The 95% analytic confidence interval of the sum of sensitivity and specificity is (1.07, 1.21), above one, implying informative monitoring. Note that the sum of sensitivity and specificity should be one if a classifier is uninformative and purely random. Similarly, the PPV and NPV are 55.5% and 58.3%, respectively, both greater than 50%. The 95% analytic confidence interval of the sum of PPV and NPV is (1.07, 1.21), which is also above one. Furthermore, the p-values of Fisher’s exact test and the Chi-square test are both less than 0.01%. Our robust monitoring machine is informative in the out-of-sample prediction context.
Figure 2 visualizes two performance metrics for robust monitoring. The monitoring risk premium \(RP_m\) is the ratio of \(E[d_m]\) to \(E[d_a]\). The metric \(RP_m\) is larger than one since \(E[d_m]>E[d_a]\), and therefore the monitoring forecast \(f^{(m)}\) outperforms the proposed forecast \(f^{(a)}\) in terms of average out-of-sample \(R^2\). On the other hand, the monitoring alpha \(\alpha _m\) is the ratio of \(E[d_m]-E[d_{m^*}]\) to \(E[d_a]\). This metric adjusts for the risk in forecasts with respect to the benchmark forecast \(f^{(b)}\). Note that \(E[d_{m^*}]\) represents the baseline performance of an uninformative monitoring forecast. Thus, the monitoring alpha is the net performance gain of our robust monitoring machine, in addition to the risk reduction by robust monitoring described in Eqs. (10) and (11). The monitoring premium and alpha of our robust monitoring machine are 1.15 and 0.61, respectively, as Table 3 reports. Their bootstrap p-values are lower than 5%. The robust monitoring machine truly predicts which forecast performs better. Our monitoring performance metrics are consistent with the Diebold-Mariano (DM) test results. The p-value of the DM test is 0.4% for the robust monitoring machine, compared with 6.8% for the proposed forecast. The robust monitoring machine shows stronger statistical significance (lower p-values) because \(E[d_m]>E[d_a]\) and \(Var[d_m]<Var[d_a]\). The monitoring forecast boosts the average loss difference by 15.1% and reduces its variance by 46.3%, which is a typical benefit of robust monitoring.
Figure 1 Panel (b) repeats Panel (a) but for the robust monitoring machine, showing the 60-month trailing average of \(\Delta L_{t}^{(b,m)}\). This forecasting performance metric still fluctuates but rarely drops below zero. Its variation is much lower than that of the proposed forecast in Figure 1 Panel (a). Table 4 shows the out-of-sample \(R^2\) with different sample split dates. Unlike for the proposed forecast \(f^{(a)}\), the out-of-sample \(R^2\) of the robust monitoring forecast \(f^{(m)}\) never becomes negative. For the full sample period starting in 1947, the out-of-sample \(R^2\) is 0.57% for the robust monitoring forecast and 0.50% for the proposed forecast. Our robust monitoring machine increases average performance by 14% in terms of out-of-sample \(R^2\) while reducing the variation of conditional performance. A shrinkage forecast, which is a simple average of \(f^{(a)}\) and \(f^{(b)}\), fails to improve the performance. We include a few alternative approaches to robust monitoring for comparison. They all fail to improve the proposed forecast; that is, the superior performance of the robust monitoring machine is not easy to achieve.
Certainty equivalent return
Following Campbell and Thompson (2008) and Ferreira and Santa-Clara (2011), we compute the certainty equivalent return (CER) for an investor with a mean-variance preference who allocates her capital monthly across equities and risk-free bills using market return forecasts. At the end of month t, the investor optimally allocates the share \(w_t\) of the portfolio to a market index fund and the remaining share \(1 - w_t\) to a risk-free bill. The investor holds the position until the end of month \(t+1\) and repeats her asset allocation task every month. Then, we can compute the optimal share \(w_t\) as
where \(r_{t+1}\) is the excess return (the raw return less the risk-free rate) and \(E[r_{t+1}|{\mathcal {F}}_t]\) and \(Var[r_{t+1}|{\mathcal {F}}_t]\) are its conditional mean and variance, respectively. Following Campbell and Thompson (2008), we make the following assumptions. First, the investor replaces the conditional mean and variance with a given return forecast and a simple variance estimator from the past five years of monthly returns. Second, we restrict the share \(w_t\) to lie between 0 and 1.5. Third, we set the relative risk aversion coefficient \(\gamma\) to 5. Then, the CER measure of the portfolio is
where \({{\hat{\mu }}}_p\) and \(\hat{\sigma }_p^2\) are the time-series average and variance estimate, respectively, of the investor’s portfolio return \(r_{p, t+1}\) over the forecast evaluation period. We can easily calculate the portfolio return for month \(t+1\) as \(r_{p, t+1} = w_t r_{t+1} + (1-w_t) r_{f, t+1}\), ex-post. Unlike the out-of-sample \(R^2\), the CER measure explicitly accounts for the risk taken by an investor during the out-of-sample test period. One can interpret the CER as the risk-free rate of return that an investor would be willing to accept in exchange for her optimal risky portfolio.
The CER gain is the difference between the CER for the investor who uses any candidate forecast \(f^{(i)}\) of the market return and the CER for an investor who uses the historical average benchmark forecast \(f^{(b)}\).
We multiply CER gain by 1,200 so it represents the annual percentage portfolio management fee that an investor would be willing to pay to have access to the given forecast \(f^{(i)}\) instead of the historical average benchmark forecast \(f^{(b)}\).
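The CER exercise can be sketched as follows. The mean-variance weight \(w_t\) and the expression CER \(= {\hat{\mu }}_p - (\gamma /2)\hat{\sigma }_p^2\) are the standard Campbell-Thompson forms the text describes, and the 60-month variance window and the bounds on \(w_t\) follow the assumptions listed above; everything else (array names, alignment) is illustrative.

```python
import numpy as np

def cer_gain(r_excess, r_free, forecast, bench, gamma=5.0, var_window=60):
    """r_excess[t]: realized excess market return in month t; r_free[t]: risk-free rate.
    forecast[t], bench[t]: forecasts of r_excess[t] formed at t-1.
    Returns the annualized CER gain (in %) of the candidate forecast over the benchmark."""
    def cer(f):
        rp = []
        for t in range(var_window, len(r_excess)):
            var_hat = np.var(r_excess[t - var_window:t])      # variance from past five years
            w = np.clip(f[t] / (gamma * var_hat), 0.0, 1.5)   # mean-variance weight, bounded
            rp.append(w * r_excess[t] + r_free[t])            # realized portfolio return
        rp = np.array(rp)
        return rp.mean() - 0.5 * gamma * rp.var()             # certainty equivalent return

    return 1200.0 * (cer(forecast) - cer(bench))              # annual percentage fee
```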
Table 5 repeats the analysis in Table 4, but reports the annualized CER gains; that is, the economic values of forecasting market returns by given forecasts instead of the benchmark forecast. We observe the same pattern in the outcomes, confirming the results expressed in terms of outofsample \(R^2\) in Table 4. The CER gain is \(1.05 \%\) from 1947 to 2017, while the combination forecast has a CER gain of \(0.90\%\). The improvement increases when we evaluate more recent samples since 2007 (0.84 over 0.40) and since 1997 (1.22 over 0.36). Therefore, the CER gains using the robust monitoring machine are relatively stable over time.
Discussion
Outofsample performance
We further investigate the recent poor performance of the combination forecast of Rapach et al. (2010). We repeat the rolling-correlation tests in Rapach et al. (2010) by extending the ending date of the data from 2005 to 2017. We compute the correlations between the equity premium and the 14 individual predictors in Sect. Data, benchmark, and proposed forecasts based on 10-year rolling windows. The correlation plots (Fig. 5 in Appendix 2) reveal the following. First, the correlations between the realized risk premium and the 14 predictor variables that make up the combination forecasts have been highly unstable since 2006. Therefore, any forecast trained on the past predictor variable data suffers from this strong instability, which is too severe to be mitigated even by rolling-window rather than expanding-window estimation. Second, we do not find that the recent poor performance is particularly linked to business cycles. There are several business cycles in the 1947–2017 sample period, but the performance in the 2007–2017 period is much worse than in the rest. Therefore, the recent poor performance probably differs from the usual business cycle narrative that return predictability is better in bad times than in good times.
We stress that we do not argue that our robust monitoring forecast is better than the combination forecast only because of their relative performance (measured by out-of-sample \(R^2\)) in the recent period (2007–2017), ignoring their previous performances. Instead, we argue that a forecast should be chosen based not only on its average performance (e.g., out-of-sample \(R^2\) or average loss difference, \(E[d]\)) but also on the (in)stability of its performance, for example, the variance of the loss difference, \(Var[d]\). Suppose we live at the end of 2005 (when the test sample ends in Rapach et al. 2010) and try to choose between the robust monitoring forecast and the combination forecast. Table 6 shows that their average performances up to 2015 are practically identical, consistent with Rapach et al. (2010). However, Table 6 also shows that the combination forecast displays twice the forecast instability of the robust monitoring forecast. The 2006–2017 period then demonstrates how dramatically a forecast with unstable performance (i.e., the combination forecast) can fail in extreme times such as the financial crisis. Therefore, we consider the financial crisis an extreme observation that tests forecast instability, not an outlier to remove. This interesting pattern resembles the poor performance of ETFs after their inception (Brightman and Li 2015) and vanishing anomalies after academic publication (McLean and Pontiff 2016).
Empirical relationship to the economy
Mounting evidence shows that return predictability increases in bad economic times.^{Footnote 20} This established empirical fact produces hindsight theories that try to explain it and allows prediction models to take advantage of it for better ex-post performance.^{Footnote 21} However, the real question is whether we can create a prediction model that can foresee the concentration of return predictability before the observed data reveal it. To answer this question, we compare our robust monitoring machine outputs with two variables that represent bad economic times.
We first consider a variable for stock valuation. The log dividend-price ratio of the S&P 500 is commonly used in prior studies as a valuation ratio and a return predictor. High dividend-price ratios mean low valuations of stocks and thus bad economic times. Figure 3 displays both the log dividend-price ratio (dotted line) and the computed real-time probability \(p_t\) (solid line) that the proposed forecast \(f^{(a)}\) will outperform the benchmark \(f^{(b)}\), calculated by the robust monitoring machine. The fluctuation patterns of the two graphs are similar. When stock valuation is low, the robust monitoring machine favors the proposed forecast over the benchmark. However, determining whether the dividend-price ratio is low or high is often a hindsight observation because the true long-run mean of the dividend-price ratio is unknown. For example, the dividend-price ratio is systematically lower over the last two decades than in the twentieth century, because of either a permanent regime change or a temporary abnormality. By contrast, the computed real-time probability \(p_t\) is always between zero and one, so a forecaster can easily interpret its meaning and magnitude.
Similarly, Fig. 4 plots the macro uncertainty measure from Jurado et al. (2015) (dotted line) and the probability \(p_t\) (solid line). When macro uncertainty is high (usually during bad economic times), the robust monitoring machine favors the proposed forecast over the benchmark. This macro uncertainty measure consists of principal components that require a full-sample estimation. By contrast, the computed real-time probability \(p_t\) captures the same information in real time. From these two plots, we confirm that our robust monitoring machine can foresee the concentration of return predictability before the observed data explicitly reveal it to econometricians, although high correlations between \(p_t\) and market cycles are not surprising per se.
Factors in the cross-sectional variation of stock returns
Starting from the market factor, researchers have proposed hundreds of new factors (or anomalies) for the cross-sectional variation of stock returns, such as information discreteness by Da et al. (2014) or betting-against-beta by Novy-Marx and Velikov (2022), among many others.^{Footnote 22}
We find that not all factors are free from look-ahead bias. Some factors are naturally subject to look-ahead bias because of their empirical construction. Some factors were impossible to construct in real time because of poor database availability in the early days. Finally, some factors had insufficient statistical evidence in early periods; therefore, a portfolio strategy based on such factors may have looked unwise to investors in those periods. The last type of bias is well explained by Martin and Nagel (2022).
Although look-ahead bias does not directly affect formal asset pricing tests, the performance of portfolio strategies derived from the proposed anomalies and factors can mislead readers in the presence of look-ahead bias. The framework proposed in this paper can potentially help researchers distinguish factors with or without look-ahead bias. However, this task is crucially important and sensitive, so it is beyond the scope of this study, and we leave it for future research.
Trustworthy machine learning solutions
The underlying topic in this study extends to general requirements for trustworthy machine learning. We emphasize the pursuit of robustness and the elimination of look-ahead biases. Such desirable qualities are prerequisites for trustworthy machine learning. Similarly, Holzinger (2021) argues that trustworthy machine learning solutions, or artificial intelligence (AI) solutions in a broader sense, should possess other qualities such as comprehensibility, explainability, and interpretability for the human expert, in addition to robustness. Holzinger (2021) emphasizes the importance of human experts in decision processes using artificial intelligence systems to achieve such qualities. We relate our study to this lesson from Holzinger (2021), as our approach provides a way of curbing unintentional biases introduced by human experts.
Conclusion
Forecasting is a cornerstone of decision-making. Recently, the superior forecasting performance of machine learning methods has drawn significant attention. However, the black-box nature of machine learning methods often exacerbates the out-of-sample \(R^2\)-hacking problem, which exaggerates true forecasting performance through overfitting. In contrast, this study exploits the black-box nature of machine learning methods to avoid out-of-sample \(R^2\)-hacking.
We provide a machine learning solution to the out-of-sample \(R^2\)-hacking problem in robust monitoring. The resulting forecast improves the average performance of the proposed forecast by 15% and reduces the variance of its relative performance by 46.3%. The DM test statistic becomes significant, with its p-value falling from 0.068 to 0.004. The sensitivity and specificity of monitoring as a classifier are 57.0% and 56.8%, respectively, both statistically different from those of random classifiers. The robust monitoring machine predicts the time variation of return predictability over business cycles without look-ahead bias.
The proposed forecast in our application is a combination forecast from Rapach et al. (2010), yet the robust monitoring machine can be applied to any forecast to improve its performance and robustness. Therefore, professional forecasters can use our approach as a final touch to any sophisticated prediction model of their choice. Our approach complements other forecasting methods rather than competing with them. Additionally, our three-step approach can be implemented in many different ways for further practical improvements.
The framework of the robust monitoring machine can be applied to other forecasting problems, such as predicting macroeconomic variables or corporate earnings. Furthermore, the underlying idea of our approach extends to other fields in finance. For example, we can construct a trading strategy for a mutual fund manager whose performance is evaluated against a benchmark portfolio. Using machine learning techniques, the fund manager can deviate from the benchmark only when the signal is strong enough to earn extra trading profits above a predetermined threshold. We leave such extensions for future research.
Availability of data and materials
The datasets analyzed in the current study are available at Amit Goyal’s website: http://www.hec.unil.ch/agoyal
Notes
Among many others, Dangl and Halling (2012) study time-varying coefficient models whose coefficients follow a random walk. Henkel et al. (2011) and Zhu and Zhu (2013) choose regime-switching models instead. Rapach et al. (2013) use the discounted mean squared forecast error (DMSFE) of past forecasts to determine the combination weights, similarly to Stock and Watson (2004) and Bates and Granger (1969).
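For reference, a standard DMSFE weighting scheme (written here in generic notation, not necessarily the exact specification of the cited papers) assigns forecast \(i\) the weight
$$\begin{aligned} w_{i,t} = \frac{\phi _{i,t}^{-1}}{\sum _{j=1}^{N}\phi _{j,t}^{-1}}, \qquad \phi _{i,t} = \sum _{s=t_0}^{t-1}\theta ^{\,t-1-s}\left( r_{s+1}-f_{i,s+1}\right) ^2, \end{aligned}$$where \(\theta \in (0,1]\) is the discount factor, \(r_{s+1}\) the realized return, and \(f_{i,s+1}\) forecast \(i\)'s prediction; \(\theta =1\) gives equal weight to all past squared errors.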
Similarly, Kou et al. (2022) propose an extension of group decision-making and spherical fuzzy numbers.
Roll (1992) defines tracking error volatility as the square root of the sample second moment of the differences between portfolio and benchmark returns. The TEV-efficient frontier shows a tradeoff between the relative risk premium and the relative risk to the benchmark, while the standard efficient frontier in portfolio theory is the special case where the benchmark is the risk-free asset.
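A compact statement of this definition, using generic notation for portfolio return \(r_{p,t}\), benchmark return \(r_{b,t}\), and \(T\) sample periods (a non-central second moment, consistent with the description above), is
$$\begin{aligned} \text {TEV} = \sqrt{\frac{1}{T}\sum _{t=1}^{T}\left( r_{p,t}-r_{b,t}\right) ^2}. \end{aligned}$$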
Note that this second moment is not the variance of forecast errors but that of the differences in squared forecast errors, defined relative to the choice of benchmark. That is, the variance of relative performance is related to the fourth, not the second, moment of forecast errors.
However, unlike the first moment, the second (central) moment comparison requires caution regarding the existence of the benchmark forecast:
$$\begin{aligned} Var(L^{(a)})-Var(L^{(m)}) \ne Var(\Delta L^{(b,a)}) - Var(\Delta L^{(b,m)}). \end{aligned}$$
This tradeoff relationship is from Yae (2018).
Sect. 5 shows an empirical example.
We do not consider a utility-function-based measure such as a certainty equivalent although, for example, smooth ambiguity preferences (Klibanoff et al. 2005) can internalize \(Var(\Delta L^{(b,m)})\) and \(Var(\Delta L^{(b,a)})\).
It is easy to show that \(E[d_{m^*}]/E[d_{m^*}^2]\) is invariant, so \(E[d_{m^*}]\) is linearly proportional to \(E[d_{m^*}^2]\).
Since the risk-free rate is known at the time of the forecast, predicting raw returns is informationally equivalent to predicting excess returns.
It is worth noting that rejection of the random-walk hypothesis does not necessarily mean the stock market is inefficient in information processing or that investors are irrationally inattentive. It simply means the expected return is time-varying and correlated with some publicly available variables because, roughly speaking, investors' risk tolerance is time-varying. For example, in recessions investors become less risk-tolerant because of their reduced income and wealth. They avoid investing in stocks even if they know the expected return is higher than in boom periods.
Alternatively, we could optimize the combination weights or construct a new forecast that exploits nonlinearity and interactions among the 14 variables. However, our goal is to build a robust and general approach that can also be applied to forecasts from human forecasters or from different models and algorithms.
The data used in this paper can be found at http://www.hec.unil.ch/agoyal/ along with detailed descriptions.
Using the past sixty months is standard practice in the finance literature because of the changing nature of the market; CAPM beta estimation is one example.
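As an illustration of this convention, a CAPM beta could be estimated on a rolling sixty-month window as sketched below; the DataFrame df and its columns ret and mkt (asset and market excess returns) are assumed for illustration.

```python
import pandas as pd

# df: monthly DataFrame with columns "ret" (asset excess return) and "mkt"
# (market excess return); both names are illustrative assumptions.
window = 60  # past sixty months

rolling_cov = df["ret"].rolling(window).cov(df["mkt"])
rolling_var = df["mkt"].rolling(window).var()
df["beta_60m"] = rolling_cov / rolling_var  # rolling CAPM beta estimate
```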
See the TSFRESH package in Python; a minimal usage sketch follows.
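A minimal usage sketch of TSFRESH, assuming a long-format DataFrame ts with identifier column id and ordering column time (the data itself is a placeholder):

```python
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

# ts: long-format DataFrame with one row per observation, identified by "id"
# and ordered by "time"; the remaining column(s) hold the series values.
features = extract_features(ts, column_id="id", column_sort="time")
impute(features)  # replace NaN/inf produced by undefined features in place
```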
Cujean and Hasler (2017) provide a theoretical model to explain why stock return predictability concentrates in bad times.
Abbreviations
TPR: True positive rate
TNR: True negative rate
PPV: Positive predictive value
NPV: Negative predictive value
ACC: Accuracy
CER: Certainty equivalent return
AI: Artificial intelligence
References
AbadSegura E, González-Zamar MD (2020) Global research trends in financial transactions. Mathematics 8:614
Abel AB (1990) Asset prices under habit formation and catching up with the Joneses. Am Econ Rev 80:38
Bailey DH, Ger S, de Prado ML, Sim A (2015) Statistical overfitting and backtest performance. In: Risk-based and factor investing. Elsevier, pp 449–461
Bates JM, Granger CWJ (1969) The combination of forecasts. J Oper Res Soc 20:451–468
Brightman C, Li F, Xi L (2015) Chasing performance with ETFs, Research Affiliates Fundamentals (November)
Campbell JY, Thompson SB (2008) Predicting excess stock returns out of sample: Can anything beat the historical average? Rev Financ Stud 21:1509–1531
Chordia T, Goyal A, Saretto A (2017) p-hacking: Evidence from two million trading strategies
Christ M, Braun N, Neuffer J, KempaLiehr AW (2018) Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing 307:72–77
Cujean J, Hasler M (2017) Why does return predictability concentrate in bad times? J Finance 72:2717–2758
Da Z, Gurun UG, Warachka M (2014) Frog in the pan: continuous information and momentum. Rev Financ Stud 27:2171–2218
Dangl T, Halling M (2012) Predictive regressions with timevarying coefficients. J Financ Econ 106:157–181
de Prado ML (2018) The 10 reasons most machine learning funds fail. J Portfolio Manag 44:120–133
de Prado ML (2019) A data science solution to the multipletesting crisis in financial research. J Financ Data Sci 1:99–110
Diebold FX, Mariano RS (2002) Comparing predictive accuracy. J Bus Econ Stat 20:134–144
Feng G, Giglio S, Xiu D (2020) Taming the factor zoo: a test of new factors. J Financ 75:1327–1370
Ferreira MA, SantaClara P (2011) Forecasting stock market returns: the sum of the parts is more than the whole. J Financ Econ 100:514–537
Freyberger J, Neuhierl A, Weber M (2020) Dissecting characteristics nonparametrically. Rev Financ Stud 33:2326–2377
Gali J (1994) Keeping up with the Joneses: consumption externalities, portfolio choice, and asset prices. J Money Credit Bank 26:1–8
Gibbs C, Vasnev AL (2018) Conditionally optimal weights and forward-looking approaches to combining forecasts, Available at SSRN 2919117
Goldstein I, Spatt CS, Ye M (2021) Big data in finance. Rev Financ Stud 34:3213–3225
Goyal A, Welch I (2008) A comprehensive look at the empirical performance of equity premium prediction. Rev Financ Stud 21:1455–1508
Granziera E, Sekhposyan T (2019) Predicting relative forecasting performance: an empirical investigation. Int J Forecast 35:1636–1657
Gu S, Kelly B, Xiu D (2020) Empirical asset pricing via machine learning. Rev Financ Stud 33:2223–2273
Harvey CR, Liu Y, Zhu H (2016) … and the cross-section of expected returns. Rev Financ Stud 29:5–68
Heaton JB, Polson NG, Witte JH (2017) Deep learning for finance: deep portfolios. Appl Stoch Model Bus Ind 33:3–12
Hendry DF, Clements MP (2004) Pooling of forecasts. Economet J 7:1–31
Henkel SJ, Spencer Martin J, Nardari F (2011) Timevarying shorthorizon predictability. J Financ Econ 99:560–580
Holzinger A (2021) The next frontier: AI we can really trust. In: Machine learning and principles and practice of knowledge discovery in databases – international workshops of ECML PKDD 2021. Springer, pp 427–440
Hou K, Xue C, Zhang L (2020) Replicating anomalies. Rev Financ Stud 33:2019–2133
Inoue A, Kilian L (2005) In-sample or out-of-sample tests of predictability: Which one should we use? Economet Rev 23:371–402
Inoue A, Kilian L (2006) On the selection of forecasting models. J Econom 130:273–306
Jurado K, Ludvigson S, Ng S (2015) Measuring uncertainty. Am Econ Rev 105:1177–1216
Klibanoff P, Marinacci M, Mukerji S (2005) A smooth model of decision making under ambiguity. Econometrica 73:1849–1892
Kou G, Chao X, Peng Y, Alsaadi FE, Herrera-Viedma E (2019) Machine learning methods for systemic risk analysis in financial sectors. Technol Econ Dev Econ 25:716–742
Kou G, Peng Y, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12
Kou G, Yong X, Peng Y, Shen F, Chen Y, Chang K, Kou S (2021) Bankruptcy prediction for SMEs using transactional data and two-stage multiobjective feature selection. Decis Support Syst 140:113429
Kou G, Yüksel S, Dinçer H (2022) Inventive problem-solving map of innovative carbon emission strategies for solar energy-based transportation investment projects. Appl Energy 311:118680
Li T, Kou G, Peng Y, Yu Philip S (2021) An integrated cluster detection, optimization, and interpretation approach for financial data. IEEE Trans Cybern 52:13848–13861
Martin IWR, Nagel S (2022) Market efficiency in the age of big data. J Financ Econ 145:154–177
McLean RD, Pontiff J (2016) Does academic research destroy stock return predictability? J Financ 71:5–32
Novy-Marx R, Velikov M (2022) Betting against betting against beta. J Financ Econ 143:80–106
Pesaran MH, Timmermann A (1995) Predictability of stock returns: robustness and economic significance. J Financ 50:1201–1228
Rapach DE, Strauss JK, Zhou G (2010) Out-of-sample equity premium prediction: combination forecasts and links to the real economy. Rev Financ Stud 23:821–862
Rapach DE, Strauss JK, Zhou G (2013) International stock return predictability: What is the role of the United States? J Finance 68:1633–1662
Roll R (1992) A mean/variance analysis of tracking error. J Portfolio Manag 18:13–22
Stock JH, Watson MW (2004) Combination forecasts of output growth in a seven-country data set. J Forecast 23:405–430
Timmermann A (2006) Forecast combinations. Handb Econ Forecast 1:135–196
Yae J (2018) The efficient frontier of forecasts: Beyond the bias-variance tradeoff, Working Paper
Yae J (Forthcoming) Unintended look-ahead bias in out-of-sample forecasting, Applied Economics Letters
Zhu X, Zhu J (2013) Predicting stock returns: a regime-switching combination approach and economic links. J Bank Finance 37:4120–4133
Zhu Yi, Timmermann A (2017) Monitoring forecasting performance, UCSD working paper
Acknowledgements
The authors thank Hitesh Doshi, Cao Fang (2021 Asian FA discussant), Kris Jacobs, Jun Myung Song, Raul Susmel, and seminar participants at 2021 Asian FA Annual Meeting, 2021 AFAANZ Conference, and University of Houston for their valuable feedback.
Funding
No funding was received for conducting this study.
Author information
Authors and Affiliations
Contributions
All authors contributed equally to this work. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: A list of individual predictor variables
We list the names and brief descriptions of the 14 predictor variables used in this study; an illustrative construction sketch follows the list.

1. Dividend-price ratio (dp): log of a 12-month moving sum of dividends paid on the S&P 500 Index minus the log of stock prices (S&P 500 Index).
2. Dividend yield (dy): log of a 12-month moving sum of dividends minus the log of lagged stock prices.
3. Earnings-price ratio (ep): log of a 12-month moving sum of earnings on the S&P 500 Index minus the log of stock prices.
4. Dividend-payout ratio (de): log of a 12-month moving sum of dividends minus the log of a 12-month moving sum of earnings.
5. Equity risk premium volatility (rvol): based on a 12-month moving standard deviation estimator.
6. Book-to-market ratio (bm): book-to-market value ratio for the Dow Jones Industrial Average.
7. Net equity expansion (ntis): ratio of a 12-month moving sum of net equity issues by NYSE-listed stocks to the total end-of-year market capitalization of New York Stock Exchange (NYSE) stocks.
8. Treasury bill rate (tbl): interest rate on a three-month Treasury bill (secondary market).
9. Long-term yield (lty): long-term government bond yield.
10. Long-term return (ltr): return on long-term government bonds.
11. Term spread (tms): long-term yield minus the Treasury bill rate.
12. Default yield spread (dfy): difference between Moody's BAA- and AAA-rated corporate bond yields.
13. Default return spread (dfr): long-term corporate bond return minus the long-term government bond return.
14. Inflation (infl): calculated from the CPI for all urban consumers.
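For concreteness, the following minimal sketch shows how a few of these predictors could be constructed in Python from the monthly data available on Amit Goyal's website. The file name and the column names (Index, D12, E12, lty, tbl) are assumptions about the spreadsheet layout made for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed layout of the monthly spreadsheet: "Index" (S&P 500 level), "D12" and
# "E12" (12-month moving sums of dividends and earnings), "lty" and "tbl"
# (long-term yield and Treasury bill rate). Adjust names to the actual file.
df = pd.read_excel("PredictorData.xlsx", sheet_name="Monthly")

df["dp"] = np.log(df["D12"]) - np.log(df["Index"])           # dividend-price ratio
df["dy"] = np.log(df["D12"]) - np.log(df["Index"].shift(1))  # dividend yield (lagged price)
df["ep"] = np.log(df["E12"]) - np.log(df["Index"])           # earnings-price ratio
df["de"] = np.log(df["D12"]) - np.log(df["E12"])             # dividend-payout ratio
df["tms"] = df["lty"] - df["tbl"]                            # term spread
```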
Appendix 2: Time-varying correlation of predictor variables
See Fig. 5
Appendix 3: Machine learning algorithms used in the paper

1. Random Forest: This is a bagging method that combines forecasts from many small decision trees. The algorithm introduces extra randomness when growing trees: instead of fitting the whole sample, each small decision tree uses only a subsample of the dataset. Averaging the results from the small trees improves predictive accuracy and controls overfitting. This added diversity trades a slightly higher bias for a lower variance and yields an overall better model. The algorithm repeats the process many times, producing one decision tree each time, and finally combines the trees by averaging their forecasts. We use the default splitting criterion (Gini impurity) of the RandomForest implementation in the scikit-learn package.
2. Extremely Randomized Trees: Instead of using a subsample of the dataset and searching for the best possible threshold for each feature when splitting a node, this algorithm uses the full sample and random thresholds for each feature. This trades more bias for a lower variance and is usually faster to train than a regular random forest, because finding the best possible threshold for each feature at every node is time-consuming. We use the default splitting criterion (Gini impurity) of the ExtraTrees implementation in the scikit-learn package.
3. Gradient Boosting: This algorithm sequentially adds a new decision tree to an ensemble of previous trees, with each new tree trying to correct the forecasting errors of its predecessor; that is, it fits each new predictor to the residual errors made by the previous one. Shallow trees on their own are “weak learners” with limited predictive power. The theory behind boosting suggests that many weak learners can, as an ensemble, form a single “strong learner” with greater stability than a single complex tree. We use the default splitting criterion (friedman_mse) of the GradientBoosting implementation in the scikit-learn package. A minimal usage sketch of these three ensembles follows this list.
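The following minimal Python sketch illustrates how the three scikit-learn ensembles described above could be fit as classifiers for monitoring. The objects X_train, y_train, and X_test are hypothetical placeholders (features and outperformance labels), and the hyperparameters are illustrative rather than those used in the paper.

```python
from sklearn.ensemble import (RandomForestClassifier,
                              ExtraTreesClassifier,
                              GradientBoostingClassifier)

# X_train: feature matrix (e.g., rolling summary statistics of past forecast errors)
# y_train: binary labels indicating whether the proposed forecast beat the benchmark
# X_test : real-time features at the forecast dates (all placeholders for illustration)
models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=500, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}

# Each classifier yields a probability that the proposed forecast outperforms;
# a simple average across the ensemble is one way to combine them.
probs = sum(m.predict_proba(X_test)[:, 1] for m in fitted.values()) / len(fitted)
```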
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yae, J., Luo, Y. Robust monitoring machine: a machine learning solution for outofsample R\(^2\)hacking in return predictability monitoring. Financ Innov 9, 94 (2023). https://doi.org/10.1186/s4085402300497z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s4085402300497z
Keywords
 Machine learning
 Out-of-sample R\(^2\)-hacking
 Return predictability
 Monitoring
JEL Classification
 C52
 C53
 C55
 C58
 G17