A structural VAR and VECM modeling method for open-high-low-close data contained in candlestick chart

Abstract

The structural modeling of open-high-low-close (OHLC) data contained within the candlestick chart is crucial to financial practice. However, the inherent constraints in OHLC data pose immense challenges to its structural modeling. Models that fail to process these constraints may yield results deviating from those of the original OHLC data structure. To address this issue, a novel unconstrained transformation method, along with its explicit inverse transformation, is proposed to properly handle the inherent constraints of OHLC data. A flexible and effective framework for structurally modeling OHLC data is designed, and the detailed procedure for modeling OHLC data through the vector autoregression and vector error correction models is provided as an example of multivariate time-series analysis. Extensive simulations and three authentic financial datasets from the Kweichow Moutai, CSI 100 index, and 50 ETF of the Chinese stock market demonstrate the effectiveness and stability of the proposed modeling approach. The modeling results of support vector regression provide further evidence that the proposed unconstrained transformation not only ensures structural forecasting of OHLC data but also serves as an effective feature-extraction method that can improve the forecasting accuracy of machine-learning models for close prices.

Introduction

Technical analysis emerges as a preeminent investment analysis method in financial markets with the purpose of detecting price trends at an early stage to seize and profit from trading opportunities. The efficient market hypothesis (EMH) argues that financial prices comprehensively encapsulate all available information, rendering consistent market outperformance an implausible endeavor through equity selection and timing trades; that is, technical analysis is ineffective (Fama 1970). However, the premises of EMH are often disrupted by reality. For instance, the EMH assumes that financial prices are random walks, whereas there are often historical repetitions in investor behavior, leading to regularity in stock price fluctuations. Typical examples include the weekend (Doyle and Chen 2009), calendar (Ariss et al. 2011), and week-of-the-year effects (Levy and Yagil 2012). After decades of development, numerous scholars have confirmed the feasibility of technical analysis, and various methodologies have flourished (Hsu and Kuan 2005; Smith et al. 2016; Ilham et al. 2022).

Currently, candlestick chart analysis has become the most intuitive and extensively employed technical analysis methodology for analyzing the price movements of financial products (Romeo et al. 2015) owing to its precise definition (Caginalp and Laurent 1998), efficient representation of market signals, and cogent reflection of investors’ overall motivation (Tsai and Quan 2014). The candlestick chart encapsulates the four-dimensional price data of a particular financial product during a given period, including the open, high, low, and close prices, collectively termed OHLC data (Huang et al. 2022b). The academic community has spared no effort in studying the candlestick charts and the OHLC data contained within them, channeling their explorations into two domains: graphical analysis and numerical analysis.

Scholars engaged in graphical analysis ardently endeavor to forecast future price trends by identifying repetitive patterns in candlestick charts. Notably, Caginalp and Laurent (1998) investigated the candlestick charts of the S&P 500 from 1992 to 1996 and confirmed the significant predictive power of the 8-day reversal pattern using an out-of-sample test. The pattern demonstrated an empirical capacity to earn a profit of nearly 1% over a 2-day holding period. Goo et al. (2007) compared the average returns across different patterns and holding days based on daily data of 25 blue chip stocks in Taiwan from 1997 to 2006. The empirical results show that investors can earn an average return of \(9.99\%\) by holding the Bullish Harami pattern for 10 days. Lan et al. (2011) employed fuzzy logic theory to define the sequence of symptoms before the appearance of reversal points and identified the reversal patterns of candlestick charts in Chinese stock markets. Lu et al. (2012) identified three bullish and three bearish reversal patterns based on the Taiwan Top 50 Tracker Fund data in 2002–2008. Cutting-edge research on the graphical analysis of candlestick charts based on machine-learning methods and artificial neural networks can be found in Ramadhan et al. (2022), Santur (2022), Cagliero et al. (2023), Chen et al. (2023), and Varghese et al. (2023).

However, the predictability and profitability of the patterns derived from the graphical analysis tend to lack universality, evincing notable disparities across different markets and periods. For instance, Shiu and Lu (2011) validated the excellent predictive power of the Bullish Harami pattern by exhaustively analyzing electronic securities data from the Taiwan Stock Exchange from 1998 to 2007. In contrast, Marshall et al. (2008) concluded that the Bullish Harami pattern exhibited negligible predictive ability across 59 stocks within the TOPIX Large 70 Index and 41 stocks within the TOPIX Mid 400 Index during the expansive temporal ambit of 1975–2004. Although graphical analysis commands widespread prominence, an unequivocal consensus eludes the academic domain regarding the profitability of candlestick-chart patterns (Tharavanij et al. 2017). Furthermore, graphical analysis is unable to establish a quantitative relationship between financial prices and their explanatory indicators, thereby warranting supplementary augmentation through a numerical analysis.

Numerical analysis is about buying low and selling high through short-term forecasts of financial prices. Although ubiquitous price data in financial markets always possess an OHLC data structure, most literature concerning numerical analysis concentrates solely on the close price, using historical time series of close prices to forecast future close prices (García-Martos et al. 2013; Sun et al. 2016; García and Jaramillo-Morán 2020; Liu and Shen 2020; Xu and Zhang 2023). Some studies consider the optimal portfolio for a group of stocks based on close-price returns (Mehrjoo et al. 2014; Mahmoudi et al. 2021). A superior approach to modeling OHLC data is to treat it as interval data, considering the interval formed by the low and high prices (Mager et al. 2009; von Mettenheim and Breitner 2012). For instance, Arroyo et al. (2011) utilized the multilayer perceptron (MLP), K-nearest neighbor (KNN) algorithm, autoregressive integrated moving average (ARIMA) model, vector autoregression (VAR), vector error correction model (VECM), and exponential smoothing to perform regressions on the interval-based Dow Jones Industrial Average index and Euro-dollar exchange rate. They adopted two classical modeling methodologies for interval data: (1) the Center and Range method and (2) the Min and Max method (Hu and He 2007; Guo et al. 2012; Hao and Guo 2017). These two interval time-series modeling approaches achieved breakthroughs in the structural modeling of binary complex data. Although the modeling object has been expanded from unary to binary, these two methods can only consider low and high prices within the OHLC data.

For OHLC data, in addition to low and high prices, the open and close prices between the two boundaries possess strong explanatory power for future price movements and warrant careful consideration in forecasting models (Cheung 2007; Liu and Wang 2012; Huang et al. 2022a). Regrettably, open and close prices are beyond the scope of the Center and Range and Min and Max methods. As an extension of interval data, a novel structural modeling procedure for OHLC data should be investigated. Only a few studies have endeavored to utilize OHLC data for modeling. However, these studies only utilize OHLC data as model inputs, and their output objective remains to forecast only the close price (Liu and Wang 2012; Luo and Chen 2013; Qiu et al. 2020; Staffini 2022) or to forecast trading signals (Chang et al. 2011; Ahmadi et al. 2018; Chen and Hao 2020; Mahmoodi et al. 2023a, b). Table 1 summarizes the current state of research on OHLC data-forecasting techniques, demonstrating that the existing literature lacks structural modeling of OHLC data. This motivated us to propose a unified structural modeling framework for OHLC data.

Table 1 A summary of the literature review

Structural modeling of OHLC data is significant in finance practice, from which we may benefit compared to partial information modeling methodologies, especially for investors in financial markets. First, forecasting OHLC data can substantially assist investors in developing profitable investment plans. Specifically, according to the traditional forecast of the close price, investors can only try to buy a particular financial product at the closing quotation of period t and sell it at the closing quotation of period \((t+1)\) if an upward trend is forecasted (close-to-close strategy) (Dunis et al. 2011). If OHLC data are forecasted, investors can buy financial products at a price near the forecasted low price and sell them at a price near the forecasted high price to obtain excess profits (low-to-high strategy) (von Mettenheim and Breitner 2012). Furthermore, investors can sell promptly at the opening quotation to gain stop-loss profits when a bear market is forecasted (Huang et al. 2022a). In summary, trading based on OHLC data allows investors to obtain overnight returns (Cooper et al. 2008), reduce fund holding periods (Dunis et al. 2011), lower overnight exposure (Kelly and Clark 2011), and derive better profits from high-low price range trades (von Mettenheim and Breitner 2012). Second, candlestick charts can be drawn according to the forecasted OHLC data, whose patterns can reveal the power of demand and supply in financial markets and reflect market conditions and investor sentiment (Nison 2001; Tsai and Quan 2014). A graphical analysis can provide further investment advice based on the patterns of the forecasted candlestick chart, such as up and down indications (Marshall et al. 2006, 2008). Third, a comprehensive information set of OHLC prices can enhance the reliability and explanatory ability of the research (Huang et al. 2022a). 
As pointed out in Fiess and MacDonald (2002) and Cheung (2007), OHLC prices have proven to possess significant power in explaining price fluctuations and future trends. In addition, Rogers and Satchell (1991) and Magdon-Ismail and Atiya (2003) noted that a more stable and valid estimate of return volatility can be obtained by considering the daily high, low, and open prices in addition to the traditionally used close prices. Finally, forecasting OHLC data offers the possibility of applying a wide range of multivariate modeling techniques to explore the dynamic and structural relationships between the components of multidimensional vectorized complex data (Fiess and MacDonald 2002; Huang et al. 2022b).

The challenge of structurally modeling OHLC data lies in the proper handling of inherent constraints. The three inherent constraints of OHLC data are as follows: (1) the low price should be higher than 0, (2) the high price should be greater than the low price, and (3) the open and close prices should be within the interval consisting of low and high prices. Some studies on interval data have attempted to ensure that the upper boundary is greater than the lower boundary by adding additional conditions to the models (Neto and De Carvalho 2010; González-Rivera and Lin 2013). However, this approach increases model complexity and is not suitable for OHLC data with multiple constraints. Other studies have attempted to model the four prices of OHLC data separately without considering the inherent constraints of OHLC data (Manurung et al. 2018; Kumar and Sharma 2019). The disadvantage of this method is that the modeling results are likely to destroy the OHLC data structure. The three typical modeling failures originating from separate forecasts of open, high, low, and close prices are as follows: (1) the forecasted low price becomes negative (see Fig. 1a), (2) the forecasted high price is lower than the forecasted low price (see Fig. 1b), and (3) the forecasted open price (or forecasted close price) breaks through the boundaries consisting of the forecasted low and forecasted high prices (see Fig. 1c). These misleading forecasting results disrupt investors’ plans and significantly undermine their confidence in their investments (Huang et al. 2022a).

Fig. 1

Meaningless modeling results caused by ignoring the inherent constraints in OHLC data (the original data contain 200 periods, and 50 periods are forecasted forward by the linear models; the red dotted line perpendicular to the vertical axis indicates the zero value, and the red dotted line perpendicular to the horizontal axis indicates the 200th period, to the right of which are the forecasted values with 95% confidence intervals)

To this end, we propose a novel unconstrained transformation method to transform OHLC data from the original four-dimensional constrained subspace into the four-dimensional real-domain full space. The unconstrained transformation, along with its explicit inverse transformation, ensures that the subsequent forecasting models yield meaningful OHLC prices. As an example of combining multivariate time-series analysis, we illustrate the detailed procedure of VAR and VECM modeling for OHLC data, with support vector regression (SVR) as a special application of machine learning. Ample simulation experiments under different forecast horizons, base time periods, and signal-to-noise ratios were conducted to validate the effectiveness and stability of the unconstrained transformation method. Furthermore, three financial datasets from the Kweichow Moutai, CSI 100 index, and 50 ETF, from their advent in the Chinese stock market to June 14, 2019, were employed to demonstrate the empirical utility of the proposed method. The results showed a satisfactory modeling effect.

Compared with the existing literature, this study offers three contributions. First, the proposed unconstrained transformation method can properly handle the inherent constraints of OHLC data. Simulation experiments and empirical analysis demonstrate that the constraints inherent in OHLC data are satisfied throughout the numerical modeling process without increasing the complexity of the model, which enables more interpretable results. Second, this study proposes the first unified forecasting framework for OHLC data. Within this framework, VAR and VECM modeling of OHLC data was implemented. The modeling procedure can take full advantage of the information contained in OHLC data, including open, high, low, and close prices. This enables a more efficient analysis and provides satisfactory predictive accuracy on the three datasets of the Kweichow Moutai, CSI 100 index, and 50 ETF. Third, the method for dealing with unconstrained transformed variables can be generalized to all types of statistical models. The results from the extended SVR provide evidence that the proposed unconstrained transformation is an effective pre-processing technique for machine-learning models and can significantly improve the forecasting accuracy of close prices compared with the direct use of raw OHLC data. From this perspective, this study provides a novel, effective, and scalable alternative for OHLC data analysis, thereby enriching the existing literature on structural modeling for complex data.

The remainder of this study is organized as follows. In “Preliminaries” section, we introduce the mathematical definition of OHLC data and its inherent constraints. In “Methodology” section, we propose transformation and inverse transformation formulas to handle the inherent constraints of OHLC data and illustrate the VAR and VECM modeling processes for OHLC data. “Simulations” section presents the simulation experiments, and “Empirical analysis” section demonstrates the empirical application of the proposed method in the real financial market. Finally, we conclude the study with a brief discussion in “Conclusions” section.

Preliminaries

To obtain an intuitive depiction of the candlestick chart, we use the daily candlestick chart in the U.S. stock market as an example (all candlestick charts refer to daily candlestick charts in this study), as shown in Fig. 2. Obviously, a daily candlestick chart not only records the open, high, low, and close prices of a particular stock on that day but also visually reflects the difference between any two prices.

Fig. 2

An example of a daily candlestick chart

Generally, a candlestick chart is divided into two categories, as shown in Fig. 2. Specifically, Fig. 2a indicates that the close price is greater than the open price, which corresponds to a bull market, while Fig. 2b corresponds to a bear market with the close price being lower than the open price. In the U.S. stock market, green and red are habitually used to mark the real body of the candlestick chart of bull and bear markets, respectively. If daily candlestick charts are arranged in chronological order, a sequence reflecting the historical price changes of a particular financial product is formed, called the candlestick chart series, and the corresponding data are termed OHLC series.

The essence of an OHLC series is a four-dimensional time series of stock prices with three inherent constraints. First, all four prices in OHLC data should be positive, because prices in the financial market cannot be less than zero. Second, the high price must be higher than the low price on the same day. Third, the open and close prices should fall within the boundaries of the low and high prices. To represent these constraints mathematically for any time period t, we provide the following definition of OHLC data:

Definition 1

A four-dimensional vector \(\varvec{X}_t=(x_t^{(o)}, x_t^{(h)}, x_t^{(l)}, x_t^{(c)})^T\) is typical OHLC data if it satisfies

  1. \(x_t^{(l)} > 0\);
  2. \(x_t^{(l)} < x_t^{(h)}\);
  3. \(x_t^{(o)}, x_t^{(c)} \in \left[ x_t^{(l)},x_t^{(h)}\right]\).

Here, \(x_t^{(o)}\) is the t-period daily open price, \(x_t^{(h)}\) is the t-period daily high price, \(x_t^{(l)}\) is the t-period daily low price, and \(x_t^{(c)}\) is the t-period daily close price.
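Definition 1 can be verified mechanically. Below is a minimal sketch; the function name and the argument order (matching the components of \(\varvec{X}_t\)) are our own conventions:

```python
def is_valid_ohlc(o, h, l, c):
    """Check the three constraints of Definition 1 for a single period t."""
    return (
        l > 0              # constraint (1): the low price is positive
        and l < h          # constraint (2): the high price exceeds the low price
        and l <= o <= h    # constraint (3): the open price lies in [low, high]
        and l <= c <= h    # constraint (3): the close price lies in [low, high]
    )
```

For example, `is_valid_ohlc(10.2, 10.8, 10.0, 10.5)` holds, whereas swapping the high and low prices violates constraint (2).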

For the \(\mathcal {T}=[1,T]\) period, the collection of \(\varvec{X}_t\) for any \(t\in \mathcal {T}\) forms the OHLC series, denoted by

$$\begin{aligned} \varvec{S}=\{\varvec{X}_t\}_{t=1}^T. \end{aligned}$$

Compared with ordinary real-domain vectors, the most salient feature of the vectors in \(\varvec{S}\) is the intrinsic constraints among their four components, which pose a significant challenge to classical statistical analysis. To establish a time-series model of an OHLC series, the most difficult problem is ensuring that the calculation process and forecasting results also obey these constraints. Otherwise, the modeling results may be meaningless. That is, after obtaining the forecasting results in the forecasting period \((T+m) \; (m \in \mathbb {Z}^+)\) using time-series modeling, it must be ensured that

$$\begin{aligned} \hat{x}_{T+m}^{(l)}&> 0,\\ \hat{x}_{T+m}^{(l)}&< \hat{x}_{T+m}^{(h)}, \\ \hat{x}_{T+m}^{(o)},\hat{x}_{T+m}^{(c)}&\in \left[ \hat{x}_{T+m}^{(l)},\hat{x}_{T+m}^{(h)}\right] . \end{aligned}$$

These constraints are not guaranteed to be valid if we directly apply the time-series forecasting methods to the original four time series of OHLC data. To address this problem, a common practice is to remove these inherent constraints via proper data transformation. Then, we can freely forecast the transformed time-series data. Finally, we can obtain the forecaster for the original OHLC data using the corresponding inverse transformation.

Methodology

In “Data-transformation method” section, we first propose a flexible transformation method along with its inverse transformation method for OHLC data as well as a model-independent framework for modeling OHLC data. Then, we use VAR and VECM as implementations of the framework and present the corresponding forecasting procedure in “The VAR and VECM modeling process for OHLC data” section.

Data-transformation method

From Definition 1, the first constraint is \(x_t^{(l)}>0\), which can be relaxed via a commonly used logarithmic transformation. That is,

$$\begin{aligned} y_t^{(1)} = \ln x_t^{(l)}. \end{aligned}$$
(1)

It is quite clear that the transformed data \(y_t^{(1)}\) in Eq. (1) satisfies \(-\infty<y_t^{(1)}<+\infty\) with no positivity constraint. Moreover, the transformation not only preserves the positive relative relationship between the original data, as the logarithm is a monotonically increasing function, but also compresses the scale of the data, which reduces the absolute values of the original data and makes them somewhat more stable.

Second, to guarantee the second constraint \(x_t^{(l)}<x_t^{(h)}\), that is, \(x_t^{(h)}-x_t^{(l)}>0\), the same practice as that in Eq. (1) yields

$$\begin{aligned} y_t^{(2)} = \ln \left( x_t^{(h)}-x_t^{(l)}\right) , \end{aligned}$$
(2)

where \(y_t^{(2)}\) is also free of any constraints and can be modeled easily.

Finally, the last constraint is \(x_t^{(o)}, x_t^{(c)} \in [x_t^{(l)},x_t^{(h)}]\), implying that both the open and close prices must be higher than the low price and lower than the high price. Without properly processing the raw data, it is highly likely that the forecasted open or close prices are beyond the boundaries. To remedy this situation, based on the concept of a convex combination, we introduce two proxy datasets, \(\lambda _t^{(o)}\) and \(\lambda _t^{(c)}\), which are formulated as

$$\begin{aligned} \lambda _t^{(o)} = \frac{x_t^{(o)}-x_t^{(l)}}{x_t^{(h)}-x_t^{(l)}} \ \ \ \text{ and }\ \ \ \lambda _t^{(c)} = \frac{x_t^{(c)}-x_t^{(l)}}{x_t^{(h)}-x_t^{(l)}}. \end{aligned}$$
(3)

We have \(0\leqslant \lambda _t^{(o)},\lambda _t^{(c)}\leqslant 1\), and the original data \(x_t^{(o)}\) and \(x_t^{(c)}\) can be recovered as follows:

$$\begin{aligned} x_t^{(o)}&= \lambda _t^{(o)}x_t^{(h)}+\left( 1-\lambda _t^{(o)}\right) x_t^{(l)}, \end{aligned}$$
(4)
$$\begin{aligned} x_t^{(c)}&= \lambda _t^{(c)}x_t^{(h)}+\left( 1-\lambda _t^{(c)}\right) x_t^{(l)}. \end{aligned}$$
(5)

Thus, the original constraint \(x_t^{(o)}, x_t^{(c)} \in [x_t^{(l)},x_t^{(h)}]\) reduces to \(0\leqslant \lambda _t^{(o)},\lambda _t^{(c)}\leqslant 1\) if we deal with the proxy data \(\lambda _t^{(o)}\) and \(\lambda _t^{(c)}\) instead of \(x_t^{(o)}\) and \(x_t^{(c)}\). Moreover, \(\lambda _t^{(o)}\) and \(\lambda _t^{(c)}\) are readily interpretable. Specifically, a larger \(\lambda _t^{(o)}\) indicates that the open price \(x_t^{(o)}\) is closer to the high price \(x_t^{(h)}\), whereas a smaller \(\lambda _t^{(o)}\) indicates that the open price \(x_t^{(o)}\) is closer to the low price \(x_t^{(l)}\). A similar interpretation applies to \(\lambda _t^{(c)}\).
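To make Eqs. (3)–(5) concrete, the following sketch (with illustrative prices of our own choosing) computes the proxies and then recovers the open and close prices exactly via the convex combinations:

```python
# Illustration of Eq. (3): compute the proxies, then recover the
# open and close prices via the convex combinations in Eqs. (4)-(5).
o, h, l, c = 10.5, 11.0, 10.0, 10.75      # example OHLC prices (ours)
lam_o = (o - l) / (h - l)                 # Eq. (3): = 0.5
lam_c = (c - l) / (h - l)                 # Eq. (3): = 0.75
o_rec = lam_o * h + (1 - lam_o) * l       # Eq. (4) recovers the open price
c_rec = lam_c * h + (1 - lam_c) * l       # Eq. (5) recovers the close price
assert abs(o_rec - o) < 1e-12 and abs(c_rec - c) < 1e-12
```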

To further remove the constraint \(0\leqslant \lambda _t^{(o)},\lambda _t^{(c)}\leqslant 1\) on \(\lambda _t^{(o)}\) and \(\lambda _t^{(c)}\), following the idea of logistic regression, we propose the logit transformation to obtain the unconstrained data \(y_t^{(3)}\) and \(y_t^{(4)}\) as follows:

$$\begin{aligned} y_t^{(3)}&= \ln \frac{\lambda _t^{(o)}}{1-\lambda _t^{(o)}}, \end{aligned}$$
(6)
$$\begin{aligned} y_t^{(4)}&= \ln \frac{\lambda _t^{(c)}}{1-\lambda _t^{(c)}}. \end{aligned}$$
(7)

Thus far, via the transformation process, the raw OHLC data \(\varvec{X}_t=(x_t^{(o)}, x_t^{(h)}, x_t^{(l)}, x_t^{(c)})^T\) are transformed into the unconstrained four-dimensional data \(\varvec{Y}_t=(y_t^{(1)}, y_t^{(2)},y_t^{(3)},y_t^{(4)})^T.\) In summary, the proposed transformation method can be described as follows:

$$\begin{aligned} \varvec{Y}_t=\left( \begin{array}{c} y_t^{(1)} \\ y_t^{(2)} \\ y_t^{(3)} \\ y_t^{(4)} \\ \end{array} \right) =\left( \begin{array}{c} \ln x_t^{(l)} \\ \ln \left( x_t^{(h)}-x_t^{(l)}\right) \\ \ln \left( \frac{\lambda _t^{(o)}}{1-\lambda _t^{(o)}}\right) \\ \ln \left( \frac{\lambda _t^{(c)}}{1-\lambda _t^{(c)}}\right) \\ \end{array} \right) , \end{aligned}$$
(8)

where \(\lambda _t^{(o)}\) and \(\lambda _t^{(c)}\) are defined by Eq. (3). The transformation in Eq. (8) not only maps onto the entire real line \((-\infty, +\infty)\) and admits an explicit inverse on its range but also shares the flexibility of the well-known log and logit transformations. Furthermore, the components of \(\varvec{Y}_t\) have fruitful economic relevance. Specifically, \(y_t^{(1)}\) measures the basic price level of a specific financial product; \(y_t^{(2)}\) denotes the trading price range and measures intraday volatility; and \(y_t^{(3)}\) and \(y_t^{(4)}\) reflect the game dynamics between long and short positions in the financial market, as they describe the relative positions of the open and close prices within the boundaries formed by the low and high prices, respectively. The larger \(y_t^{(3)}\) or \(y_t^{(4)}\) is, the closer the open or close price is to the high price, respectively; the smaller \(y_t^{(3)}\) or \(y_t^{(4)}\) is, the closer the open or close price is to the low price. The relative sizes of \(y_t^{(3)}\) and \(y_t^{(4)}\) can also reflect the bullish or bearish attributes of the market, with \(y_t^{(3)}>y_t^{(4)}\) implying a bearish market and \(y_t^{(3)}<y_t^{(4)}\) pointing toward a bullish market. In summary, the unconstrained transformation process can effectively extract feature information from the original OHLC data, including the strength of intraday trends and price volatility. As indicated by Fiess and MacDonald (2002), the relationship between trends and volatility provides the underlying information about future price developments.
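Equation (8) can be implemented in a few lines. The function name `ohlc_to_unconstrained` and the scalar interface are our own conventions, and the sketch assumes the prices strictly satisfy Definition 1 (with the open and close strictly inside the low-high interval), so every logarithm and logit is finite:

```python
import numpy as np

def ohlc_to_unconstrained(o, h, l, c):
    """Eq. (8): map OHLC prices to the unconstrained vector (y1, y2, y3, y4).

    Assumes l > 0, h > l, and that o and c lie strictly between l and h.
    """
    lam_o = (o - l) / (h - l)            # Eq. (3): relative position of the open
    lam_c = (c - l) / (h - l)            # Eq. (3): relative position of the close
    y1 = np.log(l)                       # Eq. (1): basic price level
    y2 = np.log(h - l)                   # Eq. (2): intraday trading range
    y3 = np.log(lam_o / (1 - lam_o))     # Eq. (6): logit of lambda_o
    y4 = np.log(lam_c / (1 - lam_c))     # Eq. (7): logit of lambda_c
    return np.array([y1, y2, y3, y4])
```

For example, with \(o=10.5\), \(h=11\), \(l=10\), \(c=10.75\), the proxies are \(\lambda^{(o)}=0.5\) and \(\lambda^{(c)}=0.75\), giving \(y^{(3)}=0\) and \(y^{(4)}=\ln 3\).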

Therefore, forecasting the OHLC series \(\{\varvec{X}_t\}_{t=1}^T\) reduces to forecasting the unconstrained series \(\{\varvec{Y}_t\}_{t=1}^T\), which takes values over the entire real domain and exhibits variance stability, providing significant convenience for subsequent statistical modeling. That is, we can apply classical forecasting models (ARIMA, VAR, VECM, etc.) or machine-learning models (KNN, MLP, SVR, etc.) to \(\{\varvec{Y}_t\}_{t=1}^T\). After obtaining the forecaster of \(\varvec{Y}_t\) (\(\widehat{\varvec{Y}}_t=(\hat{y}_t^{(1)}, \hat{y}_t^{(2)}, \hat{y}_t^{(3)}, \hat{y}_t^{(4)})^T\), which may contain the results of m-step forecasts, \(m \in \mathbb {Z}^+\)), we can obtain the corresponding forecaster of \(\varvec{X}_t\) (\(\widehat{\varvec{X}}_t=(\hat{x}_t^{(o)}, \hat{x}_t^{(h)}, \hat{x}_t^{(l)}, \hat{x}_t^{(c)})^T\)) via the inverse transformation as follows:

$$\begin{aligned} \widehat{\varvec{X}}_t=\left( \begin{array}{c} \hat{x}_t^{(o)} \\ \hat{x}_t^{(h)} \\ \hat{x}_t^{(l)} \\ \hat{x}_t^{(c)} \\ \end{array} \right) =\left( \begin{array}{c} \hat{\lambda }_t^{(o)}\big (\exp \big \{\hat{y}_t^{(1)}\big \}+\exp \big \{\hat{y}_t^{(2)}\big \}\big )+\big (1-\hat{\lambda }_t^{(o)}\big )\exp \big \{\hat{y}_t^{(1)}\big \}\\ \exp \big \{\hat{y}_t^{(1)}\big \}+\exp \big \{\hat{y}_t^{(2)}\big \} \\ \exp \big \{\hat{y}_t^{(1)}\big \} \\ \hat{\lambda }_t^{(c)}\big (\exp \big \{\hat{y}_t^{(1)}\big \}+\exp \big \{\hat{y}_t^{(2)}\big \}\big )+\big (1-\hat{\lambda }_t^{(c)}\big )\exp \big \{\hat{y}_t^{(1)}\big \}\\ \end{array} \right) , \end{aligned}$$
(9)

where

$$\begin{aligned} \hat{\lambda }_t^{(o)}=\frac{\exp \big \{\hat{y}_t^{(3)}\big \}}{1+\exp \big \{\hat{y}_t^{(3)}\big \}} \ \ \ \text{ and }\ \ \ \hat{\lambda }_t^{(c)}=\frac{\exp \big \{\hat{y}_t^{(4)}\big \}}{1+\exp \big \{\hat{y}_t^{(4)}\big \}}. \end{aligned}$$
(10)

The unconstrained transformation expressed in Eq. (8) and the inverse transformation in Eq. (9) provide a new perspective for forecasting OHLC data: they make the forecasting results obey the three inherent constraints listed in Definition 1 and thus realize the structural modeling of OHLC data. Basically, the unified structural forecasting process for OHLC data can be summarized in three steps: (1) transform \(\{\varvec{X}_t\}_{t=1}^T\) into \(\{\varvec{Y}_t\}_{t=1}^T\) using the unconstrained transformation; (2) model \(\{\varvec{Y}_t\}_{t=1}^T\) using various time-series models to obtain \(\widehat{\varvec{Y}}^{(m)}_t\); and (3) conduct the inverse transformation on \(\widehat{\varvec{Y}}^{(m)}_t\) to derive \(\widehat{\varvec{X}}^{(m)}_t\). A detailed unified modeling framework for OHLC data is introduced in Algorithm 1. Furthermore, the proposed method is highly feasible and can be easily generalized to any type of positive interval data with minimum and maximum boundaries greater than zero and multi-valued sequences between the two boundaries, for example, the salaries of groups of people or rainfall in different districts.
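The inverse transformation of Eqs. (9)–(10) admits an equally short sketch (again with naming of our own choosing). Note that the Definition 1 constraints hold by construction for any real-valued input vector: the low price is an exponential, the high price adds a positive range, and the open and close are convex combinations of the two:

```python
import numpy as np

def unconstrained_to_ohlc(y):
    """Eqs. (9)-(10): map a forecast (y1, y2, y3, y4) back to OHLC prices.

    By construction: low = exp(y1) > 0, high = low + exp(y2) > low, and
    the open and close are convex combinations of low and high.
    """
    y1, y2, y3, y4 = y
    low = np.exp(y1)
    high = low + np.exp(y2)
    lam_o = np.exp(y3) / (1 + np.exp(y3))       # Eq. (10): inverse logit
    lam_c = np.exp(y4) / (1 + np.exp(y4))
    open_ = lam_o * high + (1 - lam_o) * low    # Eq. (4)
    close = lam_c * high + (1 - lam_c) * low    # Eq. (5)
    return open_, high, low, close
```

Feeding in \((\ln 10,\, 0,\, 0,\, \ln 3)\) returns the prices \((10.5,\, 11,\, 10,\, 10.75)\), i.e., the round trip with Eq. (8) is exact.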

Algorithm 1

Unified forecasting framework for OHLC data

It should be noted that, in the unconstrained transformation process, we assume that \(x_t^{(o)}, x_t^{(h)}, x_t^{(l)}\), and \(x_t^{(c)}\) are not equal \((\text {except for} \; x_t^{(o)} = x_t^{(c)})\). In other words, \(x_t^{(h)} \ne x_t^{(l)} \ne 0\) and \(\lambda _t^{(o)}, \lambda _t^{(c)} \notin \left\{ 0,1 \right\}\). However, such assumptions are inevitably violated in real financial markets. Here, we list the circumstances that render these assumptions invalid and provide a measure to deal with each accordingly. (1) When the subject is in trade suspension and all prices are equal to 0, namely, \(x_t^{(o)}=x_t^{(h)}=x_t^{(l)}=x_t^{(c)}=0\), we exclude these extreme cases from the raw data. (2) When \(\lambda _t^{(o)}\) or \(\lambda _t^{(c)}\) is equal to 0, which corresponds to \(x_t^{(o)}=x_t^{(l)}\) or \(x_t^{(c)}=x_t^{(l)}\), respectively, we add a random term to \(x_t^{(o)}\) or \(x_t^{(c)}\) to make \(\lambda _t^{(o)}\) or \(\lambda _t^{(c)}\) slightly greater than zero. In practice, determining the magnitude of this random term is difficult; in this study, it was set to one percent of the magnitude of the original data. In the future, the size of this random term can be treated as a model parameter for optimization. (3) When \(\lambda _t^{(o)}\) or \(\lambda _t^{(c)}\) is equal to 1, which indicates that \(x_t^{(o)}=x_t^{(h)}\) or \(x_t^{(c)}=x_t^{(h)}\), respectively, we subtract a random term from \(x_t^{(o)}\) or \(x_t^{(c)}\) to make \(\lambda _t^{(o)}\) or \(\lambda _t^{(c)}\) slightly smaller than 1. (4) When a particular financial product reaches a limit-up or limit-down right at the opening quotation, we have \(x_t^{(o)}=x_t^{(h)}=x_t^{(l)}=x_t^{(c)}\ne 0\). If a limit-up occurs, we first multiply \(x_t^{(c)}\) and \(x_t^{(h)}\) by 1.1 to create a relatively large interval; if a limit-down occurs, we first multiply \(x_t^{(o)}\) and \(x_t^{(h)}\) by 1.1. We then apply the measures given in circumstances (2) and (3).
This setting complies with the 10% daily price limit of the Chinese stock market; a 10% daily fluctuation is sufficient to reflect strong changes in financial markets. For financial markets without price limits, the interval magnification can be appropriately increased. (5) In extreme cases, financial markets can produce negative low prices, that is, \(x_t^{(l)}<0\). For instance, the downturn in the crude oil market due to the COVID-19 pandemic caused the May U.S. WTI crude oil futures to plummet, eventually closing at \(-37.63\) dollars per barrel on April 20, 2020 (the last trading day before the delivery date). When modeling a time series that includes such extremes, removal of these data should be considered, because such extreme prices are subject to rapid adjustments, and severely distorted extremes lose their ability to forecast future price movements. For instance, on April 21, 2020, the quoted WTI crude oil futures price switched to the contract with a delivery date of June 21, 2020, which quickly returned to positive values and closed at 10.01 dollars per barrel. Investigating alternatives to such extreme OHLC data will be a future research direction.
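The edge-case handling above can be sketched as a pre-processing step. This is a simplified illustration with our own naming: the random term is drawn as 1% of the price range rather than of the raw price level (so the perturbed price stays inside the interval), and, lacking the previous close needed to distinguish a limit-up from a limit-down, the sketch widens only the high price in case (4):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def preprocess_ohlc(o, h, l, c, eps_scale=0.01):
    """Handle degenerate cases (1)-(4) before the unconstrained transform.

    Simplifications relative to the text: the perturbation is 1% of the
    price range, and case (4) widens only the high price because the
    previous close (which tells limit-up from limit-down) is unavailable.
    """
    if o == h == l == c == 0:        # case (1): trade suspension -> drop
        return None
    if o == h == l == c:             # case (4): limit hit at the open
        h *= 1.1                     # widen the degenerate interval
    eps = eps_scale * (h - l)        # magnitude of the random term
    if o == l:                       # case (2): lambda_o = 0
        o += rng.uniform(0, eps)
    if c == l:                       # case (2): lambda_c = 0
        c += rng.uniform(0, eps)
    if o == h:                       # case (3): lambda_o = 1
        o -= rng.uniform(0, eps)
    if c == h:                       # case (3): lambda_c = 1
        c -= rng.uniform(0, eps)
    return o, h, l, c
```

The returned tuple then satisfies Definition 1 and can be passed to the unconstrained transformation; as the text suggests, the perturbation size could instead be treated as a model parameter and optimized.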

The VAR and VECM modeling process for OHLC data

Here, we employ the VAR and VECM as examples of the framework proposed in Algorithm 1 and present the corresponding procedure for forecasting OHLC data.

VAR for OHLC data

As one of the most widely used multivariate time-series analysis methods, the VAR, proposed by Sims (1980), has become an important research tool in economic studies owing to its ability to capture the linear interdependencies among multiple time series (Pesaran and Shin 1998). According to Algorithm 1, we embed the VAR into the modeling process of the unconstrained four-dimensional time-series data \(\{\varvec{Y}_t\}_{t=1}^T\). Without loss of generality, we first assume that all time series in \(\varvec{Y}_t\) are stationary. Then, a p-order (\(p \in \mathbb {Z}^+\)) VAR, denoted by VAR(p), can be formulated as

$$\begin{aligned} \varvec{Y}_t=\varvec{\alpha }+\varvec{A}_1\varvec{Y}_{t-1}+\cdots +\varvec{A}_p\varvec{Y}_{t-p}+\varvec{w}_t= \varvec{\alpha }+\sum \limits _{j = 1}^{p}\varvec{A}_j\varvec{Y}_{t-j}+\varvec{w}_t, \ \ \ t=(p+1),\ldots ,T \end{aligned}$$
(11)

where \(\varvec{Y}_{t-j}\) is the j-th lag of \(\varvec{Y}_t\); \(\varvec{\alpha }=(\alpha _1,\alpha _2,\alpha _3,\alpha _4)^T\) is a four-dimensional vector of intercepts; \(\varvec{A}_j\) stands for the time-invariant \(4\times 4\) coefficient matrix; and \(\varvec{w}_t=(w_t^{(1)},w_t^{(2)},w_t^{(3)},w_t^{(4)})^T\) is a four-dimensional error term vector satisfying:

  1. (1)

    Mean zero: \(E(\varvec{w}_t)=\varvec{0}\);

  2. (2)

    No correlation across time: \(E(\varvec{w}_{t-k}^T\varvec{w}_t)=0\), for any non-zero k.

Writing Eq. (11) in the concise matrix form yields

$$\begin{aligned} \varvec{Y}=\varvec{B}\varvec{Z}+\varvec{U}, \end{aligned}$$
(12)

where \(\varvec{Y}=[\varvec{Y}_{p+1}, \varvec{Y}_{p+2},\cdots ,\varvec{Y}_T]\) is a \(4\times (T-p)\) matrix; \(\varvec{B}=[\varvec{\alpha },\varvec{A}_1,\cdots ,\varvec{A}_p]\) is a \(4\times (4p+1)\) coefficient matrix; \(\varvec{U}=[\varvec{w}_{p+1},\varvec{w}_{p+2},\cdots ,\varvec{w}_T]\) is a \(4\times (T-p)\) error term matrix; and

$$\begin{aligned} \varvec{Z}=\begin{pmatrix} 1&{}\quad 1&{}\quad \cdots &{}\quad 1\\ \varvec{Y}_p&{}\quad \varvec{Y}_{p+1}&{}\quad \cdots &{}\varvec{Y}_{T-1}\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \varvec{Y}_1&{}\quad \varvec{Y}_2&{}\quad \cdots &{}\quad \varvec{Y}_{T-p} \end{pmatrix} \end{aligned}$$

is a \((4p+1)\times (T-p)\) matrix. Then, we can solve for the coefficient matrix \(\varvec{B}\) using a least-squares estimation (Lütkepohl 2005):

$$\begin{aligned} \widehat{\varvec{B}}=\varvec{Y}\varvec{Z}^T\big (\varvec{Z}\varvec{Z}^T\big )^{-1}. \end{aligned}$$
(13)
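The closed-form estimator can be sketched directly in NumPy; `var_ls` is an illustrative helper (not code from the paper) that builds \(\varvec{Z}\) as in Eq. (12) and applies the least-squares formula of Eq. (13).

```python
import numpy as np

def var_ls(Y, p):
    # Least-squares VAR(p) fit following Eq. (12): Y is a (K, T) array,
    # and the returned B_hat = [alpha, A_1, ..., A_p] has shape (K, K*p + 1).
    K, T = Y.shape
    Z = np.ones((K * p + 1, T - p))            # first row of Z is all ones
    for j in range(1, p + 1):                  # j-th block holds the j-th lags
        Z[1 + (j - 1) * K : 1 + j * K, :] = Y[:, p - j : T - j]
    Yp = Y[:, p:]                              # [Y_{p+1}, ..., Y_T]
    return Yp @ Z.T @ np.linalg.inv(Z @ Z.T)   # B_hat = Y Z^T (Z Z^T)^{-1}
```

As a sanity check, fitting a long simulated VAR(1) sample recovers the generating coefficient matrix up to estimation noise.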

VECM for OHLC data

The reliability of the VAR estimation is closely related to the stationarity of the variable sequences. If this assumption does not hold but the variables are cointegrated, we may use a restricted VAR, namely the VECM. Otherwise, the variables must first be differenced d times until they can be modelled by a VAR or VECM. As evidenced in Cheung (2007), stock prices in U.S. markets are typically characterized by I(1) processes, and the daily highs and lows follow a cointegration relationship. This implies that the VECM may be a more practically relevant model than the VAR for forecasting OHLC series. Here, we use the augmented Dickey–Fuller (ADF) unit-root test to examine the stationarity of each variable and the Johansen test (Johansen 1988) to determine the presence or absence of a cointegration relationship.

Assuming that \(\varvec{Y}_t\) is integrated to order one, the corresponding VECM takes the following form:

$$\begin{aligned} \Delta \varvec{Y}_t=\sum _{j=1}^{p-1}\varvec{\Gamma }_j\Delta \varvec{Y}_{t-j}+\varvec{\gamma }\varvec{\beta }^T\varvec{Y}_{t-1}+\varvec{\alpha }+\varvec{w}_t, \end{aligned}$$
(14)

where \(\Delta\) denotes the first difference, and \(\sum _{j=1}^{p-1}\varvec{\Gamma }_j\Delta \varvec{Y}_{t-j}\) and \(\varvec{\gamma }\varvec{\beta }^T\varvec{Y}_{t-1}\) are the first-difference and error-correction components, respectively. Here, \(\varvec{\Gamma }_j\) is a \(4\times 4\) matrix representing the short-term adjustments among the variables across the four equations at the j-th lag. The two matrices \(\varvec{\gamma }\) and \(\varvec{\beta }\) are of dimension \(4\times r\), where r is the order of cointegration; \(\varvec{\gamma }\) denotes the speed of adjustment, and \(\varvec{\beta }\) represents the cointegrating vectors, which can be obtained using the Johansen test (Johansen 1988; Cuthbertson et al. 1992). \(\varvec{\alpha }\) is a \(4\times 1\) constant vector representing a linear trend, p is the lag order, and \(\varvec{w}_t\) is a \(4\times 1\) white-noise error vector.

For the VECM in Eq. (14), Johansen (1991) employed the full-information maximum likelihood method to implement the estimation. Specifically, the main procedure consists of (1) testing whether all variables are integrated of order one by applying a unit-root test (Lai and Lai 1991), (2) determining the lag order p such that the residuals from each equation of the VECM are uncorrelated, (3) regressing \(\Delta \varvec{Y}_t\) against the lagged differences of \(\Delta \varvec{Y}_t\) and estimating the cointegrating vectors from the canonical correlations of the set of residuals from these regression equations, and (4) determining the order of cointegration r.
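To make the roles of the terms in Eq. (14) concrete, a one-step forecast can be sketched as follows; `vecm_forecast` is an illustrative helper that assumes the parameter matrices have already been estimated (e.g., by the Johansen procedure above).

```python
import numpy as np

def vecm_forecast(Y, Gammas, gamma, beta, alpha):
    # One-step forecast from Eq. (14): Y is the (K, T) history,
    # Gammas = [Gamma_1, ..., Gamma_{p-1}] are the short-run matrices,
    # gamma (K, r) holds the adjustment speeds, beta (K, r) the cointegrating
    # vectors; the forecast is Y_T plus the predicted first difference.
    dY = np.diff(Y, axis=1)
    short_run = sum(G @ dY[:, -1 - j] for j, G in enumerate(Gammas))
    error_corr = gamma @ (beta.T @ Y[:, -1])
    return Y[:, -1] + short_run + error_corr + alpha
```

With all matrices set to zero the forecast reduces to a drift step \(\varvec{Y}_T + \varvec{\alpha }\), which makes the contribution of each added term easy to inspect.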

Discussion of parameter selection in VAR and VECM

Finally, we discuss the determination of the lag order p in the VAR and the order of cointegration r in the VECM when modeling OHLC data. First, regarding p: on the one hand, it should be sufficiently large to fully reflect the dynamic characteristics of the constructed model; on the other hand, increasing p increases the number of parameters to be estimated and thus reduces the degrees of freedom of the model. Choosing p therefore involves a trade-off, and the commonly used criteria in practice are the AIC, BIC, and Hannan–Quinn criteria. We prefer the AIC because of its conciseness, which is formulated as

$$\begin{aligned} \text{ AIC }(p)=\ln \frac{\sum \nolimits _{i = 1}^{4}\sum \nolimits _{j = 1}^{T}\hat{u}_{ij}^2}{{T}}+\frac{2pK^2}{T}, \end{aligned}$$
(15)

where T denotes the total period number of OHLC series, p is the VAR lag order, K is the VAR dimension, and \(\hat{u}_{ij}=\hat{Y}_{j}^{(i)}-Y_{j}^{(i)}(1\le i\le 4,1\le j\le T)\) represents the VAR residuals. The optimal p is obtained by minimizing \(\text{ AIC }(p)\).
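This selection rule can be sketched by reusing a least-squares VAR fit for the residuals; `select_lag_aic` is an illustrative helper name, and the estimator follows Eq. (13).

```python
import numpy as np

def select_lag_aic(Y, p_max):
    # Minimise AIC(p) of Eq. (15) over p = 1, ..., p_max for a (K, T) array Y.
    K, T = Y.shape
    best_p, best_aic = None, np.inf
    for p in range(1, p_max + 1):
        Z = np.ones((K * p + 1, T - p))          # regressor matrix of Eq. (12)
        for j in range(1, p + 1):
            Z[1 + (j - 1) * K : 1 + j * K, :] = Y[:, p - j : T - j]
        Yp = Y[:, p:]
        B = Yp @ Z.T @ np.linalg.inv(Z @ Z.T)    # least-squares fit, Eq. (13)
        U = Yp - B @ Z                           # VAR residuals u_hat
        aic = np.log((U ** 2).sum() / T) + 2 * p * K ** 2 / T
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p
```

On data generated from a VAR(1), the penalty term \(2pK^2/T\) dominates the marginal fit gain of extra lags, so the criterion recovers the true order.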

Second, the order of cointegration r indicates the dimension of the cointegrating space: (1) if the rank of \(\varvec{\gamma }\varvec{\beta }^T\) is 4, that is, \(r=4\), then \(\varvec{Y}_t\) is already stationary, and the proper specification of Eq. (14) omits the error-correction term and degenerates into a VAR; (2) if \(\varvec{\gamma }\varvec{\beta }^T\) is a null matrix, that is, \(r=0\), there is no cointegration relation; and (3) if the rank of \(\varvec{\gamma }\varvec{\beta }^T\) is between 0 and 4, that is, \(0<r<4\), there exist r linearly independent columns in the matrix and r cointegration relations in the system of equations. Following Johansen (1991), r is determined by constructing the “Trace” or “Eigen” test statistics, the two widely used statistics in the Johansen test. For further details, please refer to Johansen (1995) and Lütkepohl (2005).

Unified modeling framework for OHLC data

In summary, as one of the popular econometric forecasting models in Step 4 of Algorithm 1, the main implementations of VAR and VECM can be summarized as Algorithm 2.

Algorithm 2
figure b

VAR and VECM framework

By incorporating Algorithm 2 into Algorithm 1, we can obtain a unified framework for the statistical modeling of the OHLC series, as shown in Fig. 3.

Fig. 3
figure 3

A unified framework for modeling OHLC data based on VAR and VECM

Simulations

We assessed the performance of the proposed method using finite-sample simulations. We first describe the data construction in “Data construction” section, then provide five indicators to evaluate forecasting accuracy in “Measurements” section and finally report the simulation results in “Results of simulations” section.

Data construction

We generate the simulation data under the VAR structure in Eq. (11) as follows: (1) Assign the lag period p and the coefficient matrices \(\varvec{A}_1, \varvec{A}_2, \cdots , \varvec{A}_p\); (2) Generate an original four-dimensional vector \(\varvec{Y}_1=[y^{(1)}_{1}, y^{(2)}_{1}, y^{(3)}_{1}, y^{(4)}_{1}]^T\); (3) Generate \(\{\varvec{Y}_t\}_{t=2}^T\) in a sequence via the VAR(p) model

$$\begin{aligned} \varvec{Y}_t=\varvec{A}_1\varvec{Y}_{t-1}+\cdots +\varvec{A}_p\varvec{Y}_{t-p}+\varvec{w}_t, \end{aligned}$$

where \(\varvec{w}_t\) follows the multivariate normal distribution with zero mean and covariance matrix \(\Sigma _{\varvec{w}}\). Finally, the simulated OHLC data \(\{\varvec{X}_t\}_{t=1}^T\) are generated by applying the inverse transformation formula in Eq. (9).

To evaluate the robustness of the proposed method, we considered the following scenarios with different variance component levels:

  • Scenario 1: \(p=1, T=220, \varvec{Y}_1=[4,\; 0.7,\; -0.85,\; 0]^T\) and

    $$\begin{aligned} \varvec{A}_1=\begin{pmatrix} 0.55 &{} \quad 0.12 &{}\quad 0.12 &{}\quad 0.12 \\ 0.12 &{}\quad 0.55 &{}\quad 0.12 &{}\quad 0.12 \\ 0.12 &{}\quad 0.12 &{}\quad 0.55 &{}\quad 0.12 \\ 0.12 &{}\quad 0.12 &{}\quad 0.12 &{}\quad 0.55 \end{pmatrix}, \end{aligned}$$

    and \(\Sigma _{\varvec{w}}\) is a \(4\times 4\) diagonal matrix with diagonal element being \(0.05^2\), i.e.,

    $$\begin{aligned} \Sigma _{\varvec{w}}=\text{ diag }\{0.05^2,0.05^2,0.05^2,0.05^2\}. \end{aligned}$$
  • Scenario 2: \(p, T, \varvec{Y}_1\), and \(\varvec{A}_1\) follow Scenario 1 except that

    $$\begin{aligned} \Sigma _{\varvec{w}}=\text{ diag }\{0.07^2,0.07^2,0.07^2,0.07^2\}. \end{aligned}$$
  • Scenario 3: \(p, T, \varvec{Y}_1\), and \(\varvec{A}_1\) follow Scenario 1 except that

    $$\begin{aligned} \Sigma _{\varvec{w}}=\text{ diag }\{0.03^2,0.03^2,0.03^2,0.03^2\}. \end{aligned}$$

All these scenarios generate transformed unconstrained time-series data \(\{\varvec{Y}_t\}_{t=1}^T\) that follow a VAR(1), with variance component levels corresponding to medium (Scenario 1), low (Scenario 2), and high (Scenario 3) signal-to-noise ratios, respectively. A higher signal-to-noise ratio means that the information contained in the data comes more from the signal than from the noise, indicating better data quality; conversely, a lower signal-to-noise ratio means that the noise carries more interference, indicating worse data quality.
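Steps (1)-(3) of the data construction can be sketched for Scenario 1 as follows; the seed is arbitrary, and the final back-transformation via Eq. (9) is omitted here.

```python
import numpy as np

def simulate_var(A_list, y1, T, sigma, rng):
    # Generate {Y_t} from a VAR(p) with coefficient matrices A_1, ..., A_p,
    # start vector y1, and N(0, sigma^2 I) innovations (steps (1)-(3)).
    p, K = len(A_list), len(y1)
    Y = np.zeros((T, K))
    Y[:p] = y1
    for t in range(p, T):
        Y[t] = sum(A @ Y[t - 1 - j] for j, A in enumerate(A_list))
        Y[t] += sigma * rng.standard_normal(K)
    return Y

# Scenario 1: 0.55 on the diagonal, 0.12 elsewhere, sigma = 0.05
A1 = np.full((4, 4), 0.12) + 0.43 * np.eye(4)
Y = simulate_var([A1], np.array([4.0, 0.7, -0.85, 0.0]), 220, 0.05,
                 np.random.default_rng(0))
Y = Y[20:]   # keep periods 21-220, discarding the volatile start
```

Since the eigenvalues of \(\varvec{A}_1\) (0.91 and 0.43) lie inside the unit circle, the generated series is stationary, as the scenario requires.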

Note that although the raw simulation data have 220 periods, we retain only periods 21–220 as the final simulated dataset, because the initially generated data may be highly volatile. Taking Scenario 2 as an illustration, Fig. 4 shows the simulated OHLC series \(\{\varvec{X}_t\}_{t=1}^T\).

Fig. 4
figure 4

Simulation OHLC data under Scenario 2

Measurements

Based on the process illustrated in Fig. 3, the VAR and VECM are used to measure the statistical relationships between the variables contained in \(\varvec{Y}_t\). As Corrado and Lee (1992) and Marshall et al. (2008) point out, short-term technical analyses can be more helpful for investors than long-term ones; therefore, we focus on a relatively short-term analysis. Specifically, q periods of the simulated data, namely the time period basement, are used to train the model and make out-of-sample forecasts m periods ahead. We let q range from 30 to 70 and set \(m=1,2,3\). For each setting (q, m), \(\varvec{Y}_i^{(q)}\) scrolls forward by one period and forecasts \((T-q-m+1)\) times in total, as indicated in Fig. 5.

Fig. 5
figure 5

Specific segment method

The forecasted \(\widehat{\varvec{Y}_i}^{(q,m)}\) is first derived, and the forecasted \(\widehat{\varvec{X}_i}^{(q,m)}\) is then obtained via the inverse transformation formula in Eq. (9). We evaluate the forecasting effectiveness in terms of five measurements, defined as follows:

  • The mean absolute percentage error (MAPE)

    $$\begin{aligned} \text{ MAPE } = \frac{100\%}{k}\sum \limits _{i = 1}^{k}\left| \frac{x^{(*)}_i-\widehat{x}^{(*)}_i}{x^{(*)}_i}\right| , \end{aligned}$$

    where \(x^{(*)}_i\) and \(\widehat{x}^{(*)}_i\) are the actual and forecasted values with \(x^{(*)}_i\) indicating \(x^{(o)}_i\), \(x^{(h)}_i\), \(x^{(l)}_i\), or \(x^{(c)}_i\), respectively; k is the number of forecasted points.

  • The standard deviation (SD) is defined as the empirical standard deviation of the forecasted values \(\{\widehat{x}^{(*)}_i\}_{i=1}^k\), i.e.,

    $$\begin{aligned} \text{ SD } =\sqrt{ \frac{1}{k-1}\sum \limits _{i = 1}^{k}\Big (\widehat{x}^{(*)}_i-\bar{\widehat{x}}^{(*)}\Big )^2 }, \end{aligned}$$

    where \(\bar{\widehat{x}}^{(*)}=\sum _{i=1}^k\widehat{x}^{(*)}_i/k\).

  • The root mean squared error (RMSE) as defined in Neto and de Carvalho (2008)

    $$\begin{aligned} \text{ RMSE }=\sqrt{\frac{1}{k}\sum \limits _{i = 1}^{k}\Big (x^{(*)}_i-\widehat{x}^{(*)}_i\Big )^2} \end{aligned}$$
  • The RMSE based on the Hausdorff distance (RMSEH) is defined in De Carvalho et al. (2006)

    $$\begin{aligned} \text{ RMSEH }=\sqrt{\frac{1}{k}\sum \limits _{i = 1}^{k}\left( \left| \frac{x_i^{(h)}+x_i^{(l)}}{2}-\frac{\widehat{x}_i^{(h)}+\widehat{x}_i^{(l)}}{2}\right| +\left| \frac{x_i^{(h)}-x_i^{(l)}}{2}-\frac{\widehat{x}_i^{(h)}-\widehat{x}_i^{(l)}}{2}\right| \right) ^2} \end{aligned}$$
  • The accuracy ratio (AR) is adopted in Hu and He (2007)

    $$\begin{aligned} \text{ AR }=\left\{ \begin{array}{lcl} \frac{1}{k}\sum \limits _{i = 1}^{k}\frac{w(SP_i \cap \widehat{SP_i})}{w(SP_i \cup \widehat{SP_i})}, \qquad \qquad \text{ if }\ \ \ \ (w(SP_i \cap \widehat{SP_i})\not =0)\\ 0, \qquad \qquad \qquad \qquad \qquad \text{ if }\ \ \ (w(SP_i \cap \widehat{SP_i})=0)\\ \end{array} \right. , \end{aligned}$$

    where \(w(SP_i \cap \widehat{SP_i})\) and \(w(SP_i \cup \widehat{SP_i})\) represent the length of the intersection and union between the observation interval \([x_i^{(l)}, x_i^{(h)}]\) and the forecasting interval \([\widehat{x}_i^{(l)}, \widehat{x}_i^{(h)}]\), respectively.

Smaller values of MAPE, RMSE, and RMSEH and a larger AR indicate a more accurate forecasting result, whereas a smaller SD indicates a more stable result.
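The five measurements can be sketched in NumPy as follows (the SD is omitted since it is the ordinary sample standard deviation). We read the AR definition as averaging the per-period overlap ratio, with a zero contribution when the intervals are disjoint.

```python
import numpy as np

def mape(x, x_hat):
    # Mean absolute percentage error, in percent
    return 100 * np.mean(np.abs((x - x_hat) / x))

def rmse(x, x_hat):
    return np.sqrt(np.mean((x - x_hat) ** 2))

def rmseh(h, l, h_hat, l_hat):
    # RMSE based on the Hausdorff distance between [l, h] and [l_hat, h_hat]
    mid = np.abs((h + l) / 2 - (h_hat + l_hat) / 2)
    half = np.abs((h - l) / 2 - (h_hat - l_hat) / 2)
    return np.sqrt(np.mean((mid + half) ** 2))

def accuracy_ratio(h, l, h_hat, l_hat):
    inter = np.minimum(h, h_hat) - np.maximum(l, l_hat)   # overlap length
    union = np.maximum(h, h_hat) - np.minimum(l, l_hat)   # union length
    return np.mean(np.where(inter > 0, inter / union, 0.0))
```

A perfect forecast yields MAPE, RMSE, and RMSEH of 0 and an AR of 1; disjoint intervals contribute 0 to the AR.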

Results of simulations

For Scenario 1, we summarize the results with \(q=40, 50, 70\) and \(m=1,2,3\) in Table 2. From this, we can observe that (1) the overall performance of the proposed method in terms of these five measurements is satisfactory and stable, and (2) for a fixed q, a smaller forecast period m makes the forecasted results more accurate with smaller values of MAPE, RMSE, and RMSEH and larger AR. There is no obvious pattern in terms of SD.

Table 2 Simulation results for Scenario 1 when \(q=40, 50, 70\) and \(m=1,2,3\)

Moreover, we present additional results with q ranging from 30 to 70 and \(m=1,2,3\) to further demonstrate the performance of the proposed method. Specifically, Fig. 6 summarizes the results in terms of MAPE (the left panel), SD (the middle panel), and RMSE (the right panel), while Fig. 7 shows the RMSEH and AR of the forecasted results.

Fig. 6
figure 6

MAPE (Left panel), SD (Middle panel) and RMSE (Right panel) of forecasted values for \(x^{(o)}_t\) (The first row), \(x^{(h)}_t\) (The second row), \(x^{(l)}_t\) (The third row) and \(x^{(c)}_t\) (The fourth row) with different q and m for Scenario 1 respectively

Fig. 7
figure 7

RMSEH (Left) and AR (Right) of forecasted values for different q and m in Scenario 1

From Fig. 6, under different q and m, the MAPE is between \(3.08\%\) and \(7.13\%\), the SD is between 0.048 and 0.158, and the RMSE is between 0.051 and 0.148, indicating good forecasting accuracy and stability. As the forecast period m increases, these three indicators increase synchronously, reducing forecasting accuracy. In contrast, as q increases, the forecasting performance first improves and then deteriorates. From Fig. 7, the RMSEH maintains a small value between 0.083 and 0.152. Meanwhile, the AR is relatively high, varying from 0.842 to 0.903, which illustrates that the forecasting interval closely coincides with the observation interval, indicating a satisfactory forecasting effect.

Fig. 8
figure 8

MAPE of forecasted values with different q and m for Scenario 2 (The first row) and 3 (The second row), respectively

For Scenarios 2 and 3, we conducted simulations in line with Scenario 1, and the results exhibited the same trends. Owing to space constraints, we only show the forecasted results in terms of MAPE in Fig. 8 for Scenario 2 with a low signal-to-noise ratio (the first row) and Scenario 3 with a high signal-to-noise ratio (the second row). The MAPE values in the first row of Fig. 8 are between 4.29 and 9.93%, while the corresponding values in the second row are between 1.89 and 4.33%. The left panel of Fig. 6 corresponds to the MAPE with a medium signal-to-noise ratio, ranging from 3.08 to 7.13%, which lies between those of the second and first rows of Fig. 8. This indicates that the accuracy of the forecasted results improves as the signal-to-noise ratio increases.

Empirical analysis

We illustrate the practical utility of the proposed method using three different types of real datasets: the OHLC series of the Kweichow Moutai, CSI 100 index, and 50 ETF in the Chinese financial market. For each case, we first briefly describe the dataset in “Raw OHLC dataset description” section, then apply the proposed method with different forecasting basement periods q and forecasting periods m and report the performance in terms of MAPE, RMSE, RMSEH, and AR in “Results of empirical analysis” section.

Raw OHLC dataset description

  • OHLC series of the Kweichow Moutai The Kweichow Moutai is a well-known company in the Chinese liquor industry with a long history, and its stock (SH: 600519) is an important constituent of the China Securities Index 100 (CSI 100). Here, we study its OHLC series from 27/8/2001 to 14/6/2019, yielding 4243 observations in total.

  • OHLC series of the CSI 100 index The CSI 100 index is one of the most important stock price indexes in China, reflecting the overall performance of the most influential companies in the Shanghai and Shenzhen stock markets. China Securities Index Co., Ltd. officially issued the CSI 100 index on 30/12/2005 with a base point of 1000. We collected the OHLC series of the CSI 100 index from 30/12/2005 to 14/6/2019, for a total of 3269 periods.

  • OHLC series of the 50 ETF The 50 ETF (code: 510050) is China’s first exchange-traded fund, compiled by the Shanghai Stock Exchange, whose base date and base point are 31/12/2003 and 1000, respectively. The investment objective of the 50 ETF is to closely track the Shanghai Stock Exchange 50 (SSE 50) index, minimizing tracking deviation and tracking error. This study collected 3481 OHLC observations of the 50 ETF from 23/2/2005 to 14/6/2019.

Results of empirical analysis

Based on the results of the simulation experiments, we used the proposed method, as shown in Fig. 3 with q varying from 30 to 70 and \(m=1\), to forecast the OHLC series of the Kweichow Moutai, CSI 100, and 50 ETF.

Specifically, we take the first vector time series \(\{\varvec{Y}_t\}_{t=1}^{90}\) (i.e., \(\varvec{Y}_1^{(90)}\) in Algorithm 2) of the 50 ETF as an example to illustrate our modeling process. At the significance level of \(\alpha =0.05\), all four time series are stationary except \(\{\varvec{y}_t^{(1)}\}_{t=1}^{90}\), whose ADF-test p-value is 0.628, indicating that \(\{\varvec{Y}_t\}_{t=1}^{90}\) cannot be modelled by a VAR. The ACF plots in Fig. 9 further demonstrate the distinct autocorrelation and non-stationarity of \(\{\varvec{y}_t^{(1)}\}_{t=1}^{90}\). The Johansen test is then applied to examine the cointegration relationship between the four variables in \(\{\varvec{Y}_t\}_{t=1}^{90}\); its “Trace” version determines the number of cointegration vectors, recorded as r. The results show that the null hypothesis \(r\le 2\) is rejected at the \(1\%\) level, whereas \(r\le 3\) cannot be rejected at the \(10\%\) level; thus, r in the VECM is determined to be 3. Finally, a VECM with order of cointegration \(r=3\) is established, and the one-step forecast \(\widehat{\varvec{Y}_1}^{(90,1)}\) is obtained from the fitted model. Using the inverse transformation method, we obtain the forecasted \(\widehat{\varvec{X}_1}^{(90,1)}\). Iterating through each vector time series \(\{\varvec{Y}_{t+l}\}_{t=1}^{90} \ (l=0,1,\ldots , T-91)\), the forecasting accuracy over the entire data sample can be evaluated.

Fig. 9
figure 9

ACF plots of the four time-series variables in \(\varvec{Y}_1^{(90)}\) of the 50 ETF

The forecasting accuracy of the VAR and VECM with the unconstrained transformation proposed in this study is satisfactory for q varying from 30 to 70. (1) The average MAPE of the Kweichow Moutai was between 1.247% and 1.369%, the RMSE was between 32.759 and 38.046, the RMSEH was between 39.842 and 45.818, and the AR was between 0.418 and 0.454. (2) The average MAPE of the CSI 100 index was between 1.000% and 1.070%, the RMSE was between 47.387 and 51.594, the RMSEH was between 55.389 and 60.486, and the AR was between 0.395 and 0.429. (3) The average MAPE of the 50 ETF was between 0.981% and 1.114%, the RMSE was between 0.039 and 0.045, the RMSEH was between 0.046 and 0.055, and the AR was between 0.403 and 0.442.

Furthermore, we take \(q=30\), \(q=50\), and \(q=70\) as three milestones and summarize the forecasting results in terms of MAPE, RMSE, RMSEH, AR, the ratios of VAR and VECM, and the numbers of the three types of forecasting failures. The forecasting results for the Kweichow Moutai, CSI 100, and CSI 50 ETF are presented in Tables 3, 4, and 5, respectively. Failure 1 refers to the forecasted low price becoming negative; that is, \(\hat{x}_t^{(l)} < 0\). Failure 2 indicates that the forecasted high price is lower than the forecasted low price, that is, \(\hat{x}_t^{(h)} < \hat{x}_t^{(l)}\). Failure 3 is for the forecasted open price (or forecasted close price) to break through the forecasted high-price and forecasted low-price boundaries, that is, \(\hat{x}_t^{(o)}, \hat{x}_t^{(c)} \notin [\hat{x}_t^{(l)},\hat{x}_t^{(h)}]\).

Meanwhile, we compare the VAR and VECM with the unconstrained method (marked as “Yes” in Tables 3–5) against two alternatives: (1) the Naive method proposed by Arroyo et al. (2011), which takes the previous day’s prices as the forecast for the current day; and (2) the VAR and VECM under the non-unconstrained method, which employ the raw OHLC data as input (marked as “No” in Tables 3–5). The results demonstrate the following patterns:

  1. (1)

    As q increases, the forecasting results of the VAR and VECM under both the unconstrained and non-unconstrained methods become more accurate: MAPE, RMSE, and RMSEH decrease, and AR increases.

  2. (2)

    Regarding MAPE and RMSE, the VAR and VECM under unconstrained and non-unconstrained methods possess better forecasting accuracy for \(x^{(o)}_t\), \(x^{(h)}_t\), and \(x^{(l)}_t\) than the Naive method while the Naive method has better forecasting accuracy for \(x^{(c)}_t\) than the VAR and VECM. As for RMSEH and AR, the VAR and VECM under unconstrained and non-unconstrained methods are superior to the Naive method.

  3. (3)

    The proportion of utilizing the VECM model is significantly greater than that of the VAR model, which indicates that the Chinese stock market has the same characteristics as the U.S. stock market. That is, stock prices are usually non-stationary, and a cointegration relationship exists between the quaternary OHLC prices or unconstrained variables (Cheung 2007).

  4. (4)

    The modeling of the VAR and VECM under the non-unconstrained method results in a large number of forecasting failures, while the VAR and VECM under the unconstrained method can always guarantee a meaningful forecast of OHLC data. The forecasting failures of the non-unconstrained method are mostly \(\hat{x}_t^{(o)}, \hat{x}_t^{(c)} \notin [\hat{x}_t^{(l)},\hat{x}_t^{(h)}]\). This is because it is unlikely that \(\hat{x}_t^{(l)} < 0\) or \(\hat{x}_t^{(l)} > \hat{x}_t^{(h)}\) will occur under accurate forecasting conditions.

  5. (5)

    For the Kweichow Moutai, the average MAPE, RMSE, RMSEH, and AR of the VAR and VECM under the unconstrained method are \(10.543\%\), \(10.23\%\), \(4.89\%\), and \(0.71\%\) better than those of the Naive method, respectively. For the CSI 100, the corresponding averages are improved by \(13.62\%\), \(12.48\%\), \(10.61\%\), and \(1.02\%\) compared with the Naive method, respectively. For the 50 ETF, the corresponding averages are \(10.94\%\), \(8.15\%\), \(3.82\%\), and \(1.16\%\) better than those of the Naive method, respectively.

  6. (6)

    The overall forecasting performance of the VAR and VECM under the non-unconstrained method is better than that under the unconstrained method. However, the forecasting failures of the non-unconstrained method can cause confusion for investors and significantly undermine their investment confidence (Huang et al. 2022a). These results are consistent with our original intention: to guarantee the integrity of the forecasted OHLC data structure at the possible cost of some forecasting accuracy.

Table 3 Results of the VAR and VECM for OHLC data of Kweichow Moutai
Table 4 Results of the VAR and VECM for OHLC data of CSI 100
Table 5 Results of the VAR and VECM for OHLC data of 50 ETF

To obtain a clear depiction of the forecasted performance, we also compare the actual and forecasted stock values of the Kweichow Moutai from November 5, 2003, to June 22, 2004 (left panel); the CSI 100 index from May 11, 2011, to December 16, 2011 (middle panel); and the 50 ETF from August 9, 2007, to March 24, 2008 (right panel) in Fig. 10. The data forecasted by the VAR and VECM combined with the unconstrained transformation are in line with reality. Specifically, for the Kweichow Moutai, the continuous rise that exists before April 9, 2004, and the subsequent fall are perfectly forecasted; for the CSI 100 index, the overall downward trend and two rebounds around July 4, 2011, and November 9, 2011, are also fully reflected; and, for the 50 ETF, two spikes around October 16, 2007, and January 15, 2008, coincide precisely.

Fig. 10
figure 10

Comparison of the real values (top row) and forecasted values (bottom row) of the Kweichow Moutai (left panel), CSI 100 index (middle panel), and 50 ETF (right panel)

This study was complemented by a machine-learning approach for modeling OHLC data using SVR. The selected SVR employs a linear kernel function, with the constant of the regularization term in the Lagrange formulation set to 1 and the epsilon of the insensitive-loss function set to 0.1. The SVR performs out-of-sample forecasting using the first 80% of the data as the training set and the last 20% as the testing set. Table 6 reports the forecasting accuracy of the SVR, from which several patterns emerge. (1) The overall forecasting accuracy of the SVR is significantly improved compared with that of the VAR and VECM modeling. Under the unconstrained transformation method, the MAPEs obtained by the SVR on the Kweichow Moutai, CSI 100, and 50 ETF datasets are 26.30%, 33.83%, and 15.48% lower, respectively, than those obtained by the VAR and VECM. In particular, the accuracy of the SVR in forecasting close prices improves significantly, with the close-price MAPEs for the three datasets decreasing by 74.33%, 71.84%, and 64.11%, respectively, compared with the VAR and VECM results. Meanwhile, the high- and low-price MAPEs obtained by the SVR are, on average, 32.36% and 14.36% lower than those obtained by the VAR and VECM, respectively. For the open price, however, the VAR and VECM forecast better than the SVR, with an average improvement of 37.39% across the three datasets. (2) The overall forecasting accuracy on the Kweichow Moutai under the unconstrained transformation method is higher than that under the non-unconstrained method, while it is slightly worse on the CSI 100 and 50 ETF. Interestingly, for all three datasets, the SVR under the unconstrained method forecasts the close price significantly more accurately than the SVR under the non-unconstrained method, with the MAPE reduced by 70.07%, 53.25%, and 40.10% and the RMSE reduced by 69.20%, 58.41%, and 48.72%, respectively.
The unconstrained transformation not only ensures the structural forecasting of OHLC data but also serves as a manual feature-extraction method that can effectively enhance the forecasting accuracy of machine-learning models for close prices. Given the importance of the close price in various trading strategies, this result demonstrates the significance of the proposed unconstrained method. (3) With high forecasting accuracy, the three types of forecasting failures rarely occur; the SVR under the non-unconstrained method produced a forecasting failure only in the CSI 100 dataset.
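The SVR setup described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the simulated series and the lag-1 feature construction are ours, since the text specifies only the kernel, regularization constant, epsilon, and the 80/20 chronological split.

```python
import numpy as np
from sklearn.svm import SVR

# Stand-in for a transformed unconstrained series: a slow random walk around 5
rng = np.random.default_rng(0)
Y = 5.0 + np.cumsum(0.01 * rng.standard_normal((250, 4)), axis=0)

X_feat, y_close = Y[:-1], Y[1:, 3]   # today's four-vector -> next close proxy
split = int(0.8 * len(X_feat))       # chronological 80/20 split, no shuffling

svr = SVR(kernel="linear", C=1.0, epsilon=0.1)   # settings from the text
svr.fit(X_feat[:split], y_close[:split])
pred = svr.predict(X_feat[split:])
mape = 100 * np.mean(np.abs((y_close[split:] - pred) / y_close[split:]))
```

In the paper's pipeline, each of the four unconstrained components would be forecasted analogously and mapped back to OHLC prices by the inverse transformation.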

Table 6 Results of the SVR for OHLC data of Kweichow Moutai, CSI 100, and 50 ETF

Conclusions

To solve the structural modeling issues of the OHLC data contained in the candlestick chart, we proposed a novel unconstrained transformation method to relax the inherent constraints of OHLC data along with its explicit inverse transformation. The proposed methodology facilitates the subsequent establishment of various forecasting models and guarantees meaningful, structurally forecasted OHLC prices. The unconstrained transformation method not only extends the range of modeling variables to \(\left( -\infty , +\infty \right)\) but also shares the flexibility of the well-known log and logit transformations. Based on this unconstrained transformation, we established a flexible and efficient framework for modeling OHLC data with full utilization of the information. As an application of the multivariate time series, we demonstrated a detailed modeling procedure using VAR and VECM.

The proposed unconstrained transformation has high practical utility owing to its flexibility, simple implementation, and straightforward interpretation. For instance, it is applicable to various positive interval data with internal variables, and the selected model can be generalized to other econometric or machine-learning models. From this perspective, the proposed method provides a novel and useful alternative for OHLC data analysis, thereby enriching the existing literature on the structural modeling of complex data.

We documented the finite-sample performance of the OHLC data modeling process via extensive simulation studies on various measurements. The simulation experiments demonstrated that the VAR and VECM under the unconstrained transformation obtained stable and satisfactory results with different forecast periods, time period basements, and signal-to-noise ratios, verifying the effectiveness and robustness of the proposed modeling approach. The analysis of OHLC data for three different types of financial products in the Chinese financial market (the Kweichow Moutai, CSI 100 index, and 50 ETF) also illustrated the utility of the unconstrained method. Using raw OHLC data directly as input to the VAR and VECM resulted in a large number of forecasting failures. In contrast, the unconstrained and inverse transformations guaranteed structural modeling of OHLC data at the cost of a small loss in forecasting accuracy.

As a complement to machine learning, this study further employed SVR for modeling OHLC data in the empirical analysis section. The SVR demonstrated superior performance in forecasting OHLC data: under the unconstrained transformation, it achieved higher forecast accuracy than the VAR and VECM while preserving the OHLC data structure. In addition, the SVR under the unconstrained method had significantly better forecasting accuracy for the close price than the SVR under the non-unconstrained method on all three datasets (the Kweichow Moutai, CSI 100, and 50 ETF). The proposed unconstrained method thus proves to be an effective pre-processing technique for machine-learning models. These results provide new evidence for the practical utility and extensibility of the unconstrained method.

Future research can embed other time-series forecasting models into the proposed unified framework based on the unconstrained transformation and its inverse, thereby achieving structural forecasting of OHLC data for various financial products. In particular, artificial neural networks could be employed to forecast OHLC data more accurately. Such forecasts can help investors manage and hedge their portfolios to earn profits and reduce risks (Huang et al. 2022a). Specifically, based on OHLC forecasting results, investors can capture overnight returns (Cooper et al. 2008), shorten fund-holding periods (Dunis et al. 2011), lower overnight exposure (Kelly and Clark 2011), and derive better profits from high-low price-range trades (von Mettenheim and Breitner 2012).

Availability of data and materials

Available on request.

Abbreviations

OHLC:

Open-high-low-close

EMH:

Efficient market hypothesis

VAR:

Vector autoregression

VECM:

Vector error correction model

SVR:

Support vector regression

COVID-19:

Coronavirus disease 2019

MAPE:

Mean absolute percentage error

SD:

Standard deviation

RMSE:

Root mean squared error

RMSEH:

RMSE based on the Hausdorff distance

AR:

Accuracy ratio

References

  • Ahmadi E, Jasemi M, Monplaisir L, Nabavi MA, Mahmoodi A, Jam PA (2018) New efficient hybrid candlestick technical analysis model for stock market timing on the basis of the support vector machine and heuristic algorithms of imperialist competition and genetic. Expert Syst Appl 94:21–31

  • Ariss RT, Rezvanian R, Mehdian SM (2011) Calendar anomalies in the gulf cooperation council stock markets. Emerg Mark Rev 12(3):293–307

  • Arroyo J, Espínola R, Maté C (2011) Different approaches to forecast interval time series: a comparison in finance. Comput Econ 37(2):169–191

  • Caginalp G, Laurent H (1998) The predictive power of price patterns. Appl Math Finance 5(3–4):181–205

  • Cagliero L, Fior J, Garza P (2023) Shortlisting machine learning-based stock trading recommendations using candlestick pattern recognition. Expert Syst Appl 216:119493

  • Chang P-C, Liao TW, Lin J-J, Fan C-Y (2011) A dynamic threshold decision system for stock trading signal detection. Appl Soft Comput 11(5):3998–4010

  • Chen Y, Hao Y (2020) A novel framework for stock trading signals forecasting. Soft Comput 24(16):12111–12130

  • Chen J, Wen Y, Nanehkaran YA, Suzauddola MD, Chen W, Zhang D (2023) Machine learning techniques for stock price prediction and graphic signal recognition. Eng Appl Artif Intell 121:106038

  • Cheung Y-W (2007) An empirical model of daily highs and lows. Int J Finance Econ 12(1):1–20

  • Cooper MJ, Cliff MT, Gulen H (2008) Return differences between trading and non-trading hours: like night and day. Available at SSRN 1004081

  • Corrado CJ, Lee S-H (1992) Filter rule tests of the economic significance of serial dependencies in daily stock returns. J Financ Res 15(4):369–387

  • Cuthbertson K, Hall SG, Taylor MP (1992) Applied econometric techniques. P. Allan

  • De Carvalho FAT, de Souza RMCR, Chavent M, Lechevallier Y (2006) Adaptive Hausdorff distances and dynamic clustering of symbolic interval data. Pattern Recognit Lett 27(3):167–179

  • Doyle JR, Chen CH (2009) The wandering weekday effect in major stock markets. J Bank Finance 33(8):1388–1399

  • Dunis CL, Laws J, Rudy J (2011) Profitable mean reversion after large price drops: a story of day and night in the S&P 500, 400 midcap and 600 smallcap indices. J Asset Manag 12:185–202

  • Fama EF (1970) Efficient capital markets: a review of theory and empirical work. J Finance 25:383–417

  • Fiess NM, MacDonald R (2002) Towards the fundamentals of technical analysis: analysing the information content of high, low and close prices. Econ Model 19(3):353–374

  • García A, Jaramillo-Morán MA (2020) Short-term European union allowance price forecasting with artificial neural networks. Entrep Sustain Issues 8(1):261

  • García-Martos C, Rodríguez J, Sánchez MJ (2013) Modelling and forecasting fossil fuels, CO2 and electricity prices and their volatilities. Appl Energy 101:363–375

  • González-Rivera G, Lin W (2013) Constrained regression for interval-valued data. J Bus Econ Stat 31(4):473–490

  • Goo Y, Chen D, Chang Y et al (2007) The application of Japanese candlestick trading strategies in Taiwan. Invest Manag Financ Innov 4(4):49–79

  • Guo J, Li W, Li C, Gao S (2012) Standardization of interval symbolic data based on the empirical descriptive statistics. Comput Stat Data Anal 56(3):602–610

  • Hao P, Guo J (2017) Constrained center and range joint model for interval-valued symbolic data regression. Comput Stat Data Anal 116:106–138

  • Hsu P-H, Kuan C-M (2005) Reexamining the profitability of technical analysis with data snooping checks. J Financ Econom 3(4):606–628

  • Hu C, He LT (2007) An application of interval methods to stock market forecasting. Reliab Comput 13(5):423–434

  • Huang W, Wang H, Qin H, Wei Y, Chevallier J (2022a) Convolutional neural network forecasting of European union allowances futures using a novel unconstrained transformation method. Energy Econ 110:106049

  • Huang W, Wang H, Wang S (2022b) A pseudo principal component analysis method for multi-dimensional open-high-low-close data in candlestick chart. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2022.2155787

  • Ilham RN, Sinta I, Sinurat M (2022) The effect of technical analysis on cryptocurrency investment returns with the 5 (five) highest market capitalizations in Indonesia. J Ekon 11(02):1022–1035

  • Johansen S (1988) Statistical analysis of cointegration vectors. J Econ Dyn Control 12(2–3):231–254

  • Johansen S (1991) Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econom J Econom Soc 59:1551–1580

  • Johansen S (1995) Likelihood-based inference in cointegrated vector autoregressive models. Oxford University Press, Oxford

  • Kelly MA, Clark SP (2011) Returns in trading versus non-trading hours: the difference is day and night. J Asset Manag 12:132–145

  • Kumar G, Sharma V (2019) Stock market index forecasting of nifty 50 using machine learning techniques with ANN approach. Int J Mod Comput Sci (IJMCS) 4(3):22–27

  • Lai KS, Lai M (1991) A cointegration test for market efficiency. J Futures Mark 11(5):567–575

  • Lan Q, Zhang D, Xiong L (2011) Reversal pattern discovery in financial time series based on fuzzy candlestick lines. Syst Eng Procedia 2:182–190

  • Levy T, Yagil J (2012) The week-of-the-year effect: evidence from around the globe. J Bank Finance 36(7):1963–1974

  • Liu H, Shen L (2020) Forecasting carbon price using empirical wavelet transform and gated recurrent unit neural network. Carbon Manag 11(1):25–37

  • Liu F, Wang J (2012) Fluctuation prediction of stock market index by Legendre neural network with random time strength function. Neurocomputing 83:12–21

  • Lu TH, Shiu Y-M, Liu T-C (2012) Profitable candlestick trading strategies-the evidence from a new perspective. Rev Financ Econ 21(2):63–68

  • Luo L, Chen X (2013) Integrating piecewise linear representation and weighted support vector machine for stock trading signal prediction. Appl Soft Comput 13(2):806–816

  • Lütkepohl H (2005) New introduction to multiple time series analysis. Springer, Berlin

  • Magdon-Ismail M, Atiya AF (2003) A maximum likelihood approach to volatility estimation for a Brownian motion using high, low and close price data. Quant Finance 3(5):376

  • Mager J, Paasche U, Sick B (2009) Forecasting financial time series with support vector machines based on dynamic kernels. In: IEEE conference on soft computing in industrial applications

  • Mahmoodi A, Hashemi L, Jasemi M, Laliberté J, Millar RC, Noshadi H (2023a) A novel approach for candlestick technical analysis using a combination of the support vector machine and particle swarm optimization. Asian J Econ Bank 7(1):2–24

  • Mahmoodi A, Hashemi L, Jasemi M, Mehraban S, Laliberté J, Millar RC (2023b) A developed stock price forecasting model using support vector machine combined with metaheuristic algorithms. OPSEARCH 60(1):59–86

  • Mahmoudi A, Hashemi L, Jasemi M, Pope J (2021) A comparison on particle swarm optimization and genetic algorithm performances in deriving the efficient frontier of stocks portfolios based on a mean-lower partial moment model. Int J Finance Econ 26(4):5659–5665

  • Manurung AH, Budiharto W, Prabowo H (2018) Algorithm and modeling of stock prices forecasting based on long short-term memory (LSTM). Int J Innov Comput Inf Control (ICIC) 12:12

  • Marshall BR, Young MR, Rose LC (2006) Candlestick technical trading strategies: Can they create value for investors? J Bank Finance 30(8):2303–2323

  • Marshall BR, Young MR, Cahan R (2008) Are candlestick technical trading strategies profitable in the Japanese equity market? Rev Quant Finance Account 31(2):191–207

  • Mehrjoo S, Jasemi M, Mahmoudi A (2014) A new methodology for deriving the efficient frontier of stocks portfolios: an advanced risk-return model. J AI Data Min 2(2):113–123

  • Neto EAL, de Carvalho FDAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52(3):1500–1515

  • Neto EAL, De Carvalho FDAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54(2):333–347

  • Nison S (2001) Japanese candlestick charting techniques: a contemporary guide to the ancient investment techniques of the Far East. Penguin

  • Pesaran MH, Shin Y (1998) An autoregressive distributed-lag modelling approach to cointegration analysis. Econom Soc Monogr 31:371–413

  • Qiu J, Wang B, Zhou C (2020) Forecasting stock prices with long-short term memory neural network based on attention mechanism. PLoS ONE 15(1):e0227222

  • Ramadhan A, Palupi I, Wahyudi BA (2022) Candlestick patterns recognition using CNN-LSTM model to predict financial trading position in stock market. J Comput Syst Inform (JoSYC) 3(4):339–347

  • Rogers LCG, Satchell SE (1991) Estimating variance from high, low and closing prices. Ann Appl Prob 1:504–512

  • Romeo A, Joseph G, Elizabeth DT (2015) A study on the formation of candlestick patterns with reference to Nifty index for the past five years. Int J Manag Res Rev 5(2):67

  • Santur Y (2022) Candlestick chart based trading system using ensemble learning for financial assets. Sigma J Eng Nat Sci 40(2):370–379

  • Shiu Y, Lu T (2011) Pinpoint and synergistic trading strategies of candlesticks. Int J Econ Finance 3(1):234–244

  • Sims CA (1980) Macroeconomics and reality. Econom J Econom Soc 48:1–48

  • Smith DM, Wang N, Wang Y, Zychowicz EJ (2016) Sentiment and the effectiveness of technical analysis: evidence from the hedge fund industry. J Financ Quant Anal 51(6):1991–2013

  • Staffini A (2022) Stock price forecasting by a deep convolutional generative adversarial network. Front Artif Intell 5:8

  • Sun G, Chen T, Wei Z, Sun Y, Zang H, Chen S (2016) A carbon price forecasting model based on variational mode decomposition and spiking neural networks. Energies 9(1):54

  • Tharavanij P, Siraprapasiri V, Rajchamaha K (2017) Profitability of candlestick charting patterns in the stock exchange of Thailand. SAGE Open 7(4):2158244017736799

  • Tsai CF, Quan Z-Y (2014) Stock prediction by searching for similarities in candlestick charts. ACM Trans Manag Inform Syst 5(2):1–21

  • Varghese AA, Krishnadas J, Satheesh KR (2023) Candlestick chart based stock analysis system using ensemble learning. In: 2023 International conference on networking and communications (ICNWC). IEEE, pp 1–7

  • von Mettenheim H, Breitner MH (2012) Forecasting and trading the high-low range of stocks and ETFs with neural networks. In: International conference on engineering applications of neural networks. Springer, pp 423–432

  • Xiaojie X, Zhang Y (2023) Coking coal futures price index forecasting with the neural network. Miner Econ 36(2):349–359

Acknowledgements

We are grateful for the grants and would like to express our sincere gratitude to the reviewers for their suggestions on our article.

Funding

The authors are grateful for the financial support from the Beijing Natural Science Foundation (Grant No. 9244030) and the National Natural Science Foundation of China (Grant Nos. 72021001, 11701023).

Author information

Contributions

WH: Methodology, software, formal analysis, writing—original draft. HW: Conceptualization, methodology, funding acquisition. SW: Validation, methodology, writing—review and editing.

Corresponding author

Correspondence to Shanshan Wang.

Ethics declarations

Competing interests

The authors have declared that no competing interests exist.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Huang, W., Wang, H. & Wang, S. A structural VAR and VECM modeling method for open-high-low-close data contained in candlestick chart. Financ Innov 10, 97 (2024). https://doi.org/10.1186/s40854-024-00622-6

Keywords