Can the Baidu Index predict realized volatility in the Chinese stock market?

Abstract

This paper incorporates the Baidu Index into various heterogeneous autoregressive type time series models and shows that the Baidu Index is a superior predictor of realized volatility in the SSE 50 Index. Furthermore, the predictability of the Baidu Index is found to rise as the forecasting horizon increases. We also find that continuous components enhance predictive power across all horizons, but that increases are only sustained in the short and medium terms, as the long-term impact on volatility is less persistent. Our findings should be expected to influence investors interested in constructing trading strategies based on realized volatility.

Introduction

Forecasting return volatility is a crucial task in investment, option pricing, and risk management. There are two main ways of forecasting return volatility: The first employs the implied volatility derived from option prices as a key predictor. Assessing this method, Latané and Rendleman (1976) and Chiras and Manaster (1978) show that implied volatility performs better than historical standard deviations, where implied volatility is based on past volatility—e.g., realized volatility and jumps (Busch et al. 2011; Christensen and Prabhala 1998). The second way is inferring information from historical data and incorporating data into a GARCH-type model (Bollerslev 1986; Glosten et al. 1993) and a stochastic volatility (SV) model (Harvey et al. 1994). However, these model types rely on low-frequency data—i.e., daily, weekly, or monthly data.

Notably, Andersen et al. (2003) found that models using high-frequency data outperform GARCH-type and SV models because low-frequency data omit important intraday information (Carnero et al. 2004). As historical data have come to be collected at ever higher frequencies, various variables and models have been proposed. Initially, Andersen and Bollerslev (1998) suggested using the realized volatility (RV), computed by summing the squared intraday returns. Following this, it was accepted that both the GARCH and SV models can be estimated with high-frequency data, but that the RV is more objective (Barndorff-Nielsen and Shephard 2001; Fleming et al. 2003). Meanwhile, the Internet has fundamentally changed information diffusion patterns in the stock market, such that scholars have begun to treat it as one of the most important information sources and to incorporate it into prediction models, for example through Internet news (Chua and Tsiaplias 2018; Zhang et al. 2016), Twitter (Behrendt and Schmidt 2018; Li et al. 2017), Sina Weibo (Jin et al. 2016), Internet stock message boards (Li et al. 2018), and Google Trends (Da et al. 2011; Dimpfl and Jank 2016). Reflecting the present impact of Internet information sources, this paper employs Internet data to forecast stock return volatility.

This paper focuses on the Chinese stock market because it is dominated by individual investors and has a large number of “netizens.” A recent survey by the Shenzhen Stock Exchange (2018) shows that individual investors account for 75.1% of the total in the Mainland China equities market. By contrast, individual investors account for only 27% and 12.4% of the U.S. equities market (U.S. Securities and Exchange Commission 2013) and the London Stock Exchange (U.K. Office of National Statistics 2020), respectively. According to the 44th China Statistical Report on Internet Development (China Internet Network Information Center 2019), there are about 854 million “netizens” in China. These country-level characteristics provide a rare opportunity to investigate the predictability of individual investors’ information-seeking behavior for return volatility, where the Baidu Index is selected as an appropriate proxy for individual investors’ information-seeking behavior, given that, as illustrated by Zhang et al. (2013), the Baidu Index provides more authentic, scientific, and objective results than Google Trends. For the empirical design, we consider the constituent stocks of the SSE 50 Index, comparing various forecasting models derived from Corsi’s (2009) heterogeneous autoregressive (HAR) model. The HAR-type models consider multiscaling features of financial data, where different market participant actions generate different volatility components. Thus, HAR-type models not only reproduce long-memory volatility (over months) but also deliver clear economic interpretations, and they perform better than fractional integration models. Notably, standard GARCH and SV models are not able to reproduce all these features.

Specifically, we construct a novel HAR-type model by incorporating the Baidu Index, namely the HAR-RSV-B model, which contains positive and negative realized semivariance, to forecast RV. Our paper contributes to the existing literature in two ways. First, it contributes to the forecasting literature (e.g., Andersen et al. 2007; Corsi and Reno 2009; Shen et al. 2017) by advocating the use of a novel and superior predictor, namely a weighted Baidu Index. In particular, we find that its predictive ability is stronger in the long run, which is interesting as most studies analyzing the Internet communication effect focus on the performance of investor attention in the short term (e.g., Audrino et al. 2020; Bollen et al. 2011; Hamid and Heiden 2015; Ramos et al. 2020; Tantaopas et al. 2016; Vozlyublennaia 2014). Second, our findings accord with recent studies on the interdependence between Internet-based activities and stock market performance (Ping and Li 2018; Wen et al. 2019; Yuan 2019). Our study analyzes the predictive power of jump and continuous components, semivariance, and signed jumps that coexist with investor attention, and provides evidence regarding the mechanisms of continuous (Andersen et al. 2007) and jump components (Martens et al. 2009).

The remainder of this paper is organized as follows. The “Literature review” section reviews the relevant literature; the “Methodology” section outlines the methodological approach; the “Data” section describes the data used; the “Empirical results and discussion” section presents the results; and the “Conclusion” section concludes.

Literature review

With intraday historical data collected at ever higher frequencies becoming commonplace in financial markets, more sophisticated methods of forecasting return volatility have recently been demanded. Blair et al. (2001) found that intraday data added little to forecasts based on implied volatility, and the empirical results of Martens and Zein (2002) and Pong et al. (2004) indicate that implied volatility is able to forecast at least as accurately as GARCH models using high-frequency data. Recent studies have therefore tended to combine the various methods: Koopman et al. (2005) introduced RV into a GARCH model to improve forecast performance; Deo et al. (2006) combined an ARFIMA model with a stochastic volatility model to forecast realized volatility; Dobrev and Szerszen (2010) estimated stochastic volatility by realized volatility measures; Hansen et al. (2012) proposed a measurement equation that added the realized measure to the conditional variance of returns; and Shin and Shin (2019) applied a vector error correction model to take advantage of the cointegration relation between realized volatility and implied volatility.

Intraday data contain many forms of disaggregated information that can improve the accuracy of volatility predictions. Andersen et al. (2004) showed that simple time series models based on RV outperform GARCH-class models. Barndorff-Nielsen and Shephard (2004) produced an asymptotic model to separate quadratic variation into its continuous and jump components. When these two parts are incorporated into the HAR model, the HAR-CJ models result (Andersen et al. 2007). The literature initially considered jumps to exhibit weak forecasting ability because of their high prevalence and less enduring nature, and continuous components to exhibit the opposite (Andersen et al. 2007; Forsberg and Ghysels 2006). However, after identifying a small-sample bias in the bi-power variation used to compute jumps, Corsi et al. (2010) proposed that jumps also have a significant impact on future volatility.

Additionally, semivariance is an important measurement. Since numerous empirical studies (e.g., Chunhachinda et al. 1997; Fama 1965) show that security returns are not symmetrically distributed, a variable is needed to measure the investment risk. Semivariance, as introduced by Markowitz (1959), is one of the common downside risk measures (Huang 2008a). Choobineh and Branting (1986) specify optimal estimators for semivariance, and semivariance is applied in asset pricing models by Ang et al. (2006) and in portfolio choice by Huang (2008b), as well as in other areas.

The use of realized volatility has advantages for long-memory models (Koopman et al. 2005). These long-memory fractional integration models were popular in the past (Shin 2018), but, more recently, diverse modifications based on the HAR model have been proposed by the literature. For instance, Corsi and Reno (2009) added negative returns to investigate the asymmetric leverage effect, a number of empirical analyses indicated that leveraged HAR models improve forecasting ability (e.g., Asai et al. 2012; Audrino and Knaus 2016), and Patton and Sheppard (2015) constructed various HAR-type models with realized semivariance and jumps.

Through more recent studies, scholars continued to improve the ability of models to forecast stock market volatility. Wu and Hou (2019) and Yuan (2019) find that time-varying parameters have greater forecasting accuracies than constant parameters, Wang et al. (2019) find that time-varying transition probabilities (TVTPs) also help the Markov-switching heterogeneous autoregressive (MS-HAR) model perform better, Ma et al. (2019) construct a new jump component in the U.S. stock market, and Ping and Li (2018) propose a truncated two-scale realized volatility (TTSRV) estimator as the continuous part of RV.

The study of the determinants of realized volatility is mainly divided into two aspects. The first relates to the investor agent and participant behavior. In this area of study, Lux and Marchesi (1999) found that noise trade can generate large fluctuations in periods of high volatility, Foucault et al. (2011) showed that retail traders contribute to about 23% of volatility in stock returns, and Barber and Odean (2008) discovered that individual investors are net buyers of attention-grabbing stocks. The second aspect is the effects of related factors on volatility. For instance, Peltomäki et al. (2018) estimate three practical innovations of the investor attention variable in equity and currency markets, Andrei and Hasler (2015) find that both attention and uncertainty are key determinants of asset prices, and Hervé et al. (2019) find investor attention and the participant structure of the market to be closely related.

There are two primary methods of measuring investor attention. The first is direct measurement from the asset itself. Avramov et al. (2006) classify informed and uninformed traders by trade sizes. Many attention-grabbing events have been proposed and confirmed, such as unusual trading volumes and extreme returns (Barber and Odean 2008), and returns and record events of broader market indices (Yuan 2015; Hu et al. 2020, 2021). The second is indirect proxies related to the asset. As investors now commonly use the Internet as a primary information channel, many recent studies have constructed novel proxies, linking them to investors’ psychological biases.

Methodology

This section provides an empirical definition of volatility and of the components extracted from intraday data and the Baidu Index that will be used in our models (i.e., continuous components, semivariance, signed jumps, and investor attention).

Realized volatility

For a given day \(t\) and sample frequency \(1/N\), the daily realized volatility proposed by Andersen and Bollerslev (1998) is defined as:

$${RV}_{t,N}=\sum_{j=2}^{N+1}{r}_{t,j}^{2}$$
(1)

where \({r}_{t,j}=100\left(ln{P}_{t,j}-ln{P}_{t,j-1}\right)\) is an intraday return (\(j=2,\dots ,N+1\)) on day \(t\). \({P}_{t,j}\) is the last price at time \(j\) on day \(t\). Therefore, there are \(N\) intervals and \(N+1\) intraday closing prices in one trading day. The call market dominates price discovery (Ellul et al. 2009), and is also a part of daily variance. As such, we adjust the realized volatility to:

$${RV}_{t}={RV}_{t,N}+{r}_{t,1}^{2}=\sum_{j=1}^{M}{r}_{t,j}^{2}$$
(2)

where \({r}_{t,1}=100\left(ln{P}_{t,1}-ln{P}_{t-1,end}\right)\) is the call auction return on day \(t\) (so that \({r}_{t,1}^{2}\) is the call auction variance), \({P}_{t,1}\) is the opening price of continuous trading on day \(t\), and \({P}_{t-1,end}\) is the closing price on day \(t-1\). \({RV}_{t}\) is the daily complete realized volatility on day \(t\). The length of the supplemented return series \({r}_{t,j}\) is \(M=N+1\).
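Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `realized_volatility` and the toy prices are hypothetical, and the input is assumed to be the \(N+1\) intraday prices of one day plus the previous day's close.

```python
import numpy as np

def realized_volatility(prices, prev_close):
    """Daily realized volatility in the sense of Eq. (2).

    prices     : the N+1 intraday prices P_{t,1}, ..., P_{t,N+1} of day t
                 (the first one is the opening price of continuous trading)
    prev_close : closing price P_{t-1,end} of the previous trading day
    """
    log_p = np.log(np.asarray(prices, dtype=float))
    intraday = 100.0 * np.diff(log_p)                     # r_{t,2}, ..., r_{t,N+1}
    overnight = 100.0 * (log_p[0] - np.log(prev_close))   # r_{t,1}, call auction return
    returns = np.concatenate(([overnight], intraday))     # supplemented series, length M = N+1
    return float(np.sum(returns ** 2)), returns

# Toy example with made-up prices (N = 3 intervals, so M = 4 returns).
rv, r = realized_volatility([100.0, 100.5, 100.2, 100.8], prev_close=99.5)
```

The returned series `r` is the supplemented return series of length \(M\); the later components (bi-power variation, semivariance) can be computed from it.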

Jump and continuous components

We employ a standard jump-diffusion process to estimate the log price of the SSE 50 index \(p(t)\) on a trading day:

$$dp(t)=\mu (t)dt+\sigma (t)d{W}_{t}+\kappa d{q}_{t}$$
(3)

where \(\mu (t)\) and \(\sigma (t)\) denote the drift and instantaneous volatility, \({W}_{t}\) is a standard Brownian motion and \(\kappa d{q}_{t}\) is the pure jump component. Barndorff-Nielsen and Shephard (2004) prove that when \(M\to \infty\) daily realized volatility is a consistent estimator of quadratic variation \({QV}_{t}\):

$$R{V}_{t}\stackrel{M\to \infty }{\to }{QV}_{t}={\int }_{t-1}^{t}{\sigma }_{s}^{2}ds+\sum_{t-1<s\le t}{\kappa }_{s}^{2}$$
(4)

where \({\int }_{t-1}^{t}{\sigma }_{s}^{2}ds\) is an integrated variation of the continuous component and \(\sum_{t-1<s\le t}{\kappa }_{s}^{2}\) is the jump component. Meanwhile, the continuous component can be estimated by the realized bi-power variation (RBV) proposed by Barndorff-Nielsen and Shephard (2004):

$${RBV}_{t}={\mu }_{1}^{-2}\frac{M}{M-2}\sum_{j=3}^{M}\left|{r}_{t,j}\right|\left|{r}_{t,j-2}\right|$$
(5)

where \({\mu }_{p}=E\left({\left|Z\right|}^{p}\right)={2}^{p/2}\frac{\Gamma \left(\left(p+1\right)/2\right)}{\Gamma \left(1/2\right)}\) is the \(p\)-th absolute moment of a standard normally distributed random variable \(Z\) and \(RBV\) is a consistent estimator of integrated variation. Following Barndorff-Nielsen and Shephard (2006) and Huang and Tauchen (2005), we use Z-statistics to test the significance of the jump component:

$${Z}_{t}=\frac{({RV}_{t}-{RBV}_{t}){RV}_{t}^{-1}}{\sqrt{({\mu }_{1}^{-4}+2{\mu }_{1}^{-2}-5)\frac{1}{M}\mathrm{max}(1,\frac{{RTQ}_{t}}{{RBV}_{t}^{2}})}}$$
(6)

where \({RTQ}_{t}=M{\mu }_{4/3}^{-3}(\frac{M}{M-4})\sum_{j=5}^{M}{\left|{r}_{t,j-4}\right|}^{4/3}{\left|{r}_{t,j-2}\right|}^{4/3}{\left|{r}_{t,j}\right|}^{4/3}\) is the jump-robust realized tri-power quarticity statistic, \({\mu }_{1}=\sqrt{2/\pi }\) and \({\mu }_{4/3}={2}^\frac{2}{3}\Gamma (\frac{7}{6})\Gamma {(\frac{1}{2})}^{-1}\).

Thus, the jump component \({J}_{t}\) can be defined as:

$${J}_{t}=\left({RV}_{t}-{RBV}_{t}\right)\times {I}_{\left[{Z}_{t}>{\Phi }_{\alpha }\right]}$$
(7)

where \(I(\bullet )\) is the indicator function used to identify significant jumps, \({\Phi }_{\alpha }\) is the \(\alpha\)-quantile of the standard normal distribution, and the significance threshold \(\alpha\) is 0.99, as per Andersen et al. (2007). Thus, the remainder of the realized volatility is continuous variation \({C}_{t}\), which can be calculated as:

$${C}_{t}={RBV}_{t}\times {I}_{\left[{Z}_{t}>{\Phi }_{\alpha }\right]}+{RV}_{t}\times {I}_{\left[{Z}_{t}\le {\Phi }_{\alpha }\right]}$$
(8)
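The decomposition in Eqs. (5) to (8) can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names `mu` and `jump_decomposition` are hypothetical, the input is one day's supplemented return series, and the day is assumed to contain enough non-zero returns for \({RBV}_{t}\) to be positive.

```python
import math
import numpy as np
from statistics import NormalDist

def mu(p):
    """p-th absolute moment E|Z|^p of a standard normal variable."""
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.gamma(0.5)

def jump_decomposition(returns, alpha=0.99):
    """Split one day's realized volatility into jump and continuous parts."""
    a = np.abs(np.asarray(returns, dtype=float))
    M = a.size
    rv = float(np.sum(a ** 2))
    # Eq. (5): bi-power variation from products |r_j||r_{j-2}|, j = 3..M
    rbv = mu(1) ** -2 * M / (M - 2) * float(np.sum(a[2:] * a[:-2]))
    # Tri-power quarticity from |r_{j-4}|, |r_{j-2}|, |r_j|, j = 5..M
    rtq = (M * mu(4 / 3) ** -3 * M / (M - 4)
           * float(np.sum((a[4:] * a[2:-2] * a[:-4]) ** (4 / 3))))
    # Eq. (6): ratio-type jump test statistic
    scale = (mu(1) ** -4 + 2 * mu(1) ** -2 - 5) / M * max(1.0, rtq / rbv ** 2)
    z = ((rv - rbv) / rv) / math.sqrt(scale)
    # Eqs. (7) and (8): keep the jump only when the test is significant
    significant = z > NormalDist().inv_cdf(alpha)
    jump = rv - rbv if significant else 0.0
    continuous = rbv if significant else rv
    return rv, rbv, z, jump, continuous
```

By construction, `jump + continuous` always equals `rv`, which mirrors the fact that Eqs. (7) and (8) partition the realized volatility.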

Semivariance and signed jumps

The realized semivariance is proposed by Barndorff-Nielsen et al. (2008). The negative realized semivariance estimator is defined as:

$${RSV}_{t}^{-}=\sum_{j=1}^{M}{r}_{t,j}^{2}\times {I}_{\left[{r}_{t,j}<0\right]}$$
(9)

Whilst the positive realized semivariance estimator is written as:

$${RSV}_{t}^{+}=\sum_{j=1}^{M}{r}_{t,j}^{2}\times {I}_{\left[{r}_{t,j}>0\right]}$$
(10)

The signed jumps defined by Patton (2011) can be constructed as:

$$\Delta {J}_{t}={RSV}_{t}^{+}-{RSV}_{t}^{-}$$
(11)

Furthermore, the signed jumps can be divided into positive signed jumps \(\Delta {J}_{t}{I}_{\left[\Delta {J}_{t}>0\right]}\) and negative signed jumps \(\Delta {J}_{t}{I}_{\left[\Delta {J}_{t}<0\right]}\).
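A short sketch of Eqs. (9) to (11) and of the positive/negative split may help. The function name `semivariances` and the sample returns are hypothetical.

```python
import numpy as np

def semivariances(returns):
    """Realized semivariances and signed jumps for one day's returns."""
    r = np.asarray(returns, dtype=float)
    rsv_neg = float(np.sum(r[r < 0] ** 2))   # Eq. (9)
    rsv_pos = float(np.sum(r[r > 0] ** 2))   # Eq. (10)
    signed_jump = rsv_pos - rsv_neg          # Eq. (11)
    positive_part = max(signed_jump, 0.0)    # signed jump when it is positive, else 0
    negative_part = min(signed_jump, 0.0)    # signed jump when it is negative, else 0
    return rsv_pos, rsv_neg, signed_jump, positive_part, negative_part

# Made-up intraday returns for illustration.
rsv_pos, rsv_neg, dj, dj_pos, dj_neg = semivariances([0.3, -0.1, 0.2, -0.4])
```

The two semivariances add up to the realized volatility of the same return series, so the signed jump measures how unevenly that volatility is split between up and down moves.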

Investor attention based on the Baidu Index

The Baidu Index is based on the number of times users search for keywords, such that it reflects the interest of search engine users in content related to those keywords. When investors are interested in a stock, they may search for the security name or its company name in a search engine. However, other users, who are not investors, are more likely to search the company name for contact or recruitment information rather than investment information. Therefore, as a proxy for investor attention, the search query volume of a company name is likely to include a lot of noise, such that the Baidu Index of a security name is more effective. Thus, to investigate the attention given to a security market index, we compute the capitalization-weighted sum of the Baidu Index of the market index components, rather than the Baidu Index of the market index name, as the proxy variable (Zhang and Wang 2015). We do so because individual investors are more likely to influence market index fluctuations by dealing in stocks than by trading stock index futures, and institutional investors generally do not search for stock index futures before trading them. The proxy variable for investor attention, \({B}_{t}\), is defined as:

$${B}_{t}=\frac{\sum_{c=1}^{S}({cap}_{c,t}\bullet \mathrm{ln}(1+{b}_{c,t}))}{\sum_{c=1}^{S}{cap}_{c,t}}$$
(12)

where \({cap}_{c,t}\) is the market capitalization of component security \(c\) in the given market index on day \(t\) and \({b}_{c,t}\) is the Baidu Index of the component security name. \(S\) is the number of component securities in the market index.
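Equation (12) is a capitalization-weighted average of log-transformed search volumes, which can be sketched as below. The function name `attention_index` and the input values are hypothetical; the inputs are assumed to be the capitalizations and Baidu Index values of the component securities on a single day.

```python
import numpy as np

def attention_index(cap, baidu):
    """Capitalization-weighted investor attention B_t as in Eq. (12)."""
    cap = np.asarray(cap, dtype=float)
    baidu = np.asarray(baidu, dtype=float)
    # ln(1 + b) keeps securities with zero search volume well defined
    return float(np.sum(cap * np.log1p(baidu)) / np.sum(cap))

# Two made-up securities: the larger one receives three quarters of the weight.
b_t = attention_index([1.0, 3.0], [0.0, np.e - 1.0])
```

Because the weights sum to one, \({B}_{t}\) stays on the scale of \(\mathrm{ln}(1+b)\) regardless of how many securities are included.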

Model specifications

This paper uses 22 models: 11 existing models and 11 models created for this analysis. These new models are nested models, formulated by adding \(B\) to previous models.

Model 1: HAR-RV

The HAR model, as proposed by Corsi (2009), forms the basis of all the models used in our research because it reproduces the long-memory effect of asset volatility. It is specified as:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1} RV_{t} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \varepsilon_{t}$$
(13)

where \(h\) is the forecasting horizon, \({RV}_{t+1,t+h}\) is the average realized volatility from \(t+1\) to \(t+h\), and \({RV}_{t+1,t+h}=({RV}_{t+1}+{RV}_{t+2}+\dots +{RV}_{t+h})/h\). The forecasting result considers the last 1-day, 1-week, and 1-month realized variance, which, according to Corsi (2009), correspond to short-term, medium-term and long-term effects.
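The regressors and target of Eq. (13) can be assembled from a daily RV series as in the sketch below. The helper names `har_design` and `fit_ols` are hypothetical, and the fit is a plain least-squares solve via `numpy.linalg.lstsq`; it illustrates the structure of the regression and is not the authors' estimation code.

```python
import numpy as np

def har_design(rv, h=1):
    """Build the HAR-RV design matrix and the h-day-ahead average target."""
    rv = np.asarray(rv, dtype=float)
    rows, target = [], []
    # Day t needs 22 days of history (t-21..t) and h future days (t+1..t+h).
    for t in range(21, rv.size - h):
        rows.append([
            1.0,                        # intercept
            rv[t],                      # daily component RV_t
            rv[t - 4:t + 1].mean(),     # weekly component RV_{t-4,t}
            rv[t - 21:t + 1].mean(),    # monthly component RV_{t-21,t}
        ])
        target.append(rv[t + 1:t + h + 1].mean())   # RV_{t+1,t+h}
    return np.array(rows), np.array(target)

def fit_ols(X, y):
    """Least-squares coefficients (beta_0, beta_1, beta_5, beta_22)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

The other HAR-type specifications in this section change only the columns of the design matrix, for example by appending \({J}_{t}\) or by splitting \({RV}_{t}\) into semivariances.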

Model 2: HAR-RV-J

Andersen et al. (2007) developed their HAR-RV-J model to improve forecast accuracy, adding the last daily jump component to the HAR-RV model to produce:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1} RV_{t} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \beta_{J1} J_{t} + \varepsilon_{t}$$
(14)

where \({J}_{t}\) is the jump variation on day \(t\), as computed by Eq. (7).

Model 3: HAR-CJ

The HAR-CJ model proposed by Andersen et al. (2007) is also based on the HAR-RV model, disaggregating realized volatility in each horizon into jump and continuous components, as below:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{C1} C_{t} + \beta_{J1} J_{t} + \beta_{C5} C_{t - 4,t} + \beta_{J5} J_{t - 4,t} + \beta_{C22} C_{t - 21,t} + \beta_{J22} J_{t - 21,t} + \varepsilon_{t}$$
(15)

where \({C}_{t}\) is the continuous component on day \(t\) defined in Eq. (8), \({C}_{t-4,t}\) is the average continuous variation over the period \([t-4, t]\), and \({C}_{t-21,t}\) is the average of the month-lag continuous component. \({J}_{t-4,t}\) and \({J}_{t-21,t}\) are the average weekly and monthly jumps, respectively.

Model 4: PS

The PS model proposed by Patton and Sheppard (2015) decomposes daily realized volatility into positive and negative realized semivariance, as below:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1}^{ - } RSV_{t}^{ - } + \beta_{1}^{ + } RSV_{t}^{ + } + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \varepsilon_{t}$$
(16)

where \({RSV}_{t}^{-}\) is the negative realized semivariance defined in Eq. (9) and \({RSV}_{t}^{+}\) is the positive realized semivariance specified in Eq. (10).

Model 5: PSLev

The PSLev model adds the leverage effect, as defined by Martens et al. (2009) and generated by negative returns, to the PS model. Patton and Sheppard (2015) proposed assessing if the leverage effect leads to a superior significance of the negative realized semivariance. The model is specified as:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1}^{ - } RSV_{t}^{ - } + \beta_{1}^{ + } RSV_{t}^{ + } + \beta_{m1} RV_{t} I_{{\left[ {r_{t} < 0} \right]}} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \varepsilon_{t}$$
(17)

where \({RV}_{t}{I}_{[{r}_{t}<0]}\) is the leverage effect and \({I}_{[{r}_{t}<0]}\) is an indicator function that equals one when the daily return \({r}_{t}\) is negative and zero otherwise, so that the realized volatility from Eq. (1) enters this term only on days with a negative return.

Model 6: HAR-RSV

The model developed by Patton and Sheppard (2015) divides realized volatility into positive realized semivariance and negative realized semivariance to assess whether positive and negative parts have different impacts on forecasting. The model is specified as:

$$\begin{aligned} RV_{t + 1,t + h} & = \beta_{0} + \beta_{1}^{ - } RSV_{t}^{ - } + \beta_{1}^{ + } RSV_{t}^{ + } + \beta_{5}^{ - } RSV_{t - 4,t}^{ - } + \beta_{5}^{ + } RSV_{t - 4,t}^{ + } \\ & \quad + \beta_{22}^{ - } RSV_{t - 21,t}^{ - } + \beta_{22}^{ + } RSV_{t - 21,t}^{ + } + \varepsilon_{t} \\ \end{aligned}$$
(18)

where \({RSV}_{t-4,t}^{+}\) and \({RSV}_{t-4,t}^{-}\) are average weekly positive and negative semivariance, respectively. \({RSV}_{t-21,t}^{+}\) and \({RSV}_{t-21,t}^{-}\) are semivariance for the month horizon.

Model 7: HAR-RSV-J

Chen and Ghysels (2011) produce their HAR-RSV-J model by adding the daily lag jump component to the HAR-RSV model, such that this model can be specified as:

$$\begin{aligned} RV_{t + 1,t + h} & = \beta_{0} + \beta_{1}^{ - } RSV_{t}^{ - } + \beta_{1}^{ + } RSV_{t}^{ + } + \beta_{5}^{ - } RSV_{t - 4,t}^{ - } + \beta_{5}^{ + } RSV_{t - 4,t}^{ + } \\ & \quad + \beta_{22}^{ - } RSV_{t - 21,t}^{ - } + \beta_{22}^{ + } RSV_{t - 21,t}^{ + } + \beta_{J1} J_{t} + \varepsilon_{t} \\ \end{aligned}$$
(19)

Model 8: HAR-RV-SJ

The HAR-RV-SJ model investigates the effect of signed jumps by replacing the daily realized volatility with continuous component and signed jumps in HAR-RV models. It is specified as:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{\delta J1} \Delta J_{t} + \beta_{C1} C_{t} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \varepsilon_{t}$$
(20)

where \(\Delta {J}_{t}\) is the signed jumps on day \(t\), which is defined in Eq. (11).

Model 9: HAR-CSJ

This model is identical to the HAR-CJ model except for the replacement of jump components with signed jumps. We consider longer-period signed jumps than previous HAR-RV-SJ models by specifying that:

$$\begin{aligned} RV_{t + 1,t + h} & = \beta_{0} + \beta_{\delta J1} \Delta J_{t} + \beta_{C1} C_{t} + \beta_{\delta J5} \Delta J_{t - 4,t} + \beta_{C5} C_{t - 4,t} \\ & \quad + \beta_{\delta J22} \Delta J_{t - 21,t} + \beta_{C22} C_{t - 21,t} + \varepsilon_{t} \\ \end{aligned}$$
(21)

where \(\Delta {J}_{t-4,t}\) and \(\Delta {J}_{t-21,t}\) are week-lag and month-lag signed jumps.

Model 10: HAR-RV-SJd

The HAR-RV-SJd model represents an improvement over the HAR-RV-SJ model by dividing daily signed jumps into positive signed jumps and negative signed jumps, as below:

$$\begin{aligned} RV_{t + 1,t + h} & = \beta_{0} + \beta_{\delta J1}^{ - } \Delta J_{t} I_{{\left[ {\Delta J_{t} < 0} \right]}} + \beta_{\delta J1}^{ + } \Delta J_{t} I_{{\left[ {\Delta J_{t} > 0} \right]}} + \beta_{C1} C_{t} \\ & \quad + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \varepsilon_{t} \\ \end{aligned}$$
(22)

where \(\Delta {J}_{t}{I}_{\left[\Delta {J}_{t}<0\right]}\) is the negative daily signed jump and \(\Delta {J}_{t}{I}_{\left[\Delta {J}_{t}>0\right]}\) is the positive daily signed jump.

Model 11: HAR-CSJd

The HAR-CSJd was proposed by Sévi (2014) and considers many previously stated factors, including dividing signed jumps into positive and negative parts, long-period variables and continuous components. It is written as:

$$\begin{aligned} RV_{t + 1,t + h} & = \beta_{0} + \beta_{\delta J1}^{ - } \Delta J_{t} I_{{\left[ {\Delta J_{t} < 0} \right]}} + \beta_{\delta J1}^{ + } \Delta J_{t} I_{{\left[ {\Delta J_{t} > 0} \right]}} + \beta_{C1} C_{t} \\ & \quad + \beta_{\delta J5}^{ - } \Delta J_{t - 4,t} I_{{\left[ {\Delta J_{t - 4,t} < 0} \right]}} + \beta_{\delta J5}^{ + } \Delta J_{t - 4,t} I_{{\left[ {\Delta J_{t - 4,t} > 0} \right]}} + \beta_{C5} C_{t - 4,t} \\ & \quad + \beta_{\delta J22}^{ - } \Delta J_{t - 21,t} I_{{\left[ {\Delta J_{t - 21,t} < 0} \right]}} + \beta_{\delta J22}^{ + } \Delta J_{t - 21,t} I_{{\left[ {\Delta J_{t - 21,t} > 0} \right]}} + \beta_{C22} C_{t - 21,t} + \varepsilon_{t} \\ \end{aligned}$$
(23)

Model 12: HAR-RV-B

The HAR-RV-B model is a new specification that adds investor attention to the HAR-RV model. We concentrate on the forecast accuracy improvement that \(B\) provides, by specifying:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1} RV_{t} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \beta_{B} B_{t} + \varepsilon_{t}$$
(24)

where \({B}_{t}\) is the capital-weighted Baidu Index defined in Eq. (12).

Models 13 to 22: B models

We then develop ten further models by adding \({B}_{t}\) to Models (2) to (11) to make Models (13) to (22), whose names all end with “−B.” To avoid repetition, we omit the descriptions of these new models, but Table 1 displays the names and IDs of all 22 models.
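As an illustration of this nesting, and assuming the attention term enters linearly with a single coefficient \({\beta }_{B}\) as it does in Eq. (24), the B version of Model 2 (which the naming rule would label HAR-RV-J-B) would read:

$$RV_{t + 1,t + h} = \beta_{0} + \beta_{1} RV_{t} + \beta_{5} RV_{t - 4,t} + \beta_{22} RV_{t - 21,t} + \beta_{J1} J_{t} + \beta_{B} B_{t} + \varepsilon_{t}$$

Under the same assumption, the remaining B models are obtained by appending \({\beta }_{B}{B}_{t}\) to Eqs. (15) to (23).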

Table 1 Model specifications

Model comparison

The model comparison consists of in-sample analysis and out-of-sample analysis, with OLS regression applied to investigate the aptness of a linear explanation. According to Giot and Laurent (2007), an out-of-sample analysis is the only effective way to evaluate forecasting performance in realized volatility. Generally, the DMW statistic, developed by Diebold and Mariano (1995) and West (1996), is widely used within the forecasting literature.

The DMW test needs a loss function to measure the difference between a real value and a forecasted result in the out-of-sample period. As we use a proxy to estimate the volatility instead of observing it directly, a robust loss function is needed to rank two competing models unbiasedly (Patton 2011). As a result of its robustness, Patton (2011) proposes the Q-LIKE loss function, which is defined as:

$$L\left({\widehat{\sigma }}^{2},v\right)=\mathrm{log}v+\frac{{\widehat{\sigma }}^{2}}{v}$$
(25)

where \({\widehat{\sigma }}^{2}\) is a conditionally unbiased volatility proxy, such as realized volatility, and \(v\) is the forecasted volatility. Then, the difference in the loss function between Models A and B at time \(t\) is defined as:

$${d}_{t,\left\{A,B\right\}}=L\left({\widehat{\sigma }}_{t}^{2},{v}_{t}^{A}\right)-L\left({\widehat{\sigma }}_{t}^{2},{v}_{t}^{B}\right)$$
(26)

With a given rolling window size, the moving process will compute a series of losses. The DMW statistic is then given by:

$${DM-QLIKE}_{\{A,B\}}=\frac{{\stackrel{-}{d}}_{\{A,B\}}}{\sqrt{\widehat{Var}\left({\stackrel{-}{d}}_{\{A,B\}}\right)}}$$
(27)

where \({\stackrel{-}{d}}_{\{A,B\}}\) is the mean of the loss differences and \(\widehat{Var}\left({\stackrel{-}{d}}_{\{A,B\}}\right)\) is an estimate of its asymptotic variance, which can be computed as:

$$\widehat{Var}\left(\stackrel{-}{d}\right)=\frac{1}{P}\left({\widehat{\gamma }}_{0}+2\sum_{k=1}^{h}{\widehat{\gamma }}_{k}\right)$$
(28)

where \(P\) is the length of the loss series and \(h\) is the forecast horizon. \({\widehat{\gamma }}_{k}\) is the sample autocovariance of \({d}_{t}\) at lag \(k\), which can be computed by:

$${\widehat{\gamma }}_{k}=\frac{1}{P}\sum_{t=k+1}^{P}\left({d}_{t}-\stackrel{-}{d}\right)\left({d}_{t-k}-\stackrel{-}{d}\right)$$
(29)
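Equations (25) to (29) translate into a short routine, sketched below. The function names `qlike` and `dm_qlike` are hypothetical, and the inputs are assumed to be aligned out-of-sample series of the volatility proxy and of the two competing forecasts. The variance estimate follows Eq. (28) with autocovariances up to lag \(h\); with this truncated sum the estimate is not guaranteed to be positive in small samples.

```python
import numpy as np

def qlike(proxy, forecast):
    """Q-LIKE loss of Eq. (25) for a volatility proxy and a forecast."""
    proxy = np.asarray(proxy, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.log(forecast) + proxy / forecast

def dm_qlike(proxy, forecast_a, forecast_b, h=1):
    """DMW statistic of Eq. (27) based on Q-LIKE loss differences."""
    d = qlike(proxy, forecast_a) - qlike(proxy, forecast_b)      # Eq. (26)
    P = d.size
    d_bar = d.mean()
    dc = d - d_bar
    # Eq. (29): sample autocovariances at lags 0..h
    gamma = [float(np.sum(dc[k:] * dc[:P - k])) / P for k in range(h + 1)]
    var_d_bar = (gamma[0] + 2.0 * sum(gamma[1:])) / P            # Eq. (28)
    return d_bar / np.sqrt(var_d_bar)
```

A negative statistic indicates that Model A has the lower average Q-LIKE loss, and a positive one favors Model B.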

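However, the DMW statistic is inappropriate when comparing nested models. Clark and West (2007) adjust the mean squared prediction error (MSPE) and propose the CW statistic. Under the null hypothesis that the additional predictors have no forecasting power, the MSPE of a parsimonious model is expected to be smaller than that of a larger model, because the larger model introduces noise by estimating parameters whose population values are zero; an adjusted MSPE is therefore needed to account for this noise (Clark and West 2007).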

By way of explanation, we take Model B as the larger model, which nests the smaller Model A. \(h\)-day-ahead forecasts are made at time \(t\), such that the real value at time \(t+h\) is \({y}_{t+h}\) and the forecasts of Models A and B are \({\widehat{y}}_{1t,t+h}\) and \({\widehat{y}}_{2t,t+h}\), with corresponding forecast errors \({y}_{t+h}-{\widehat{y}}_{1t,t+h}\) and \({y}_{t+h}-{\widehat{y}}_{2t,t+h}\). The sample MSPE of each model is the average of the squared errors \({\left({y}_{t+h}-{\widehat{y}}_{1t,t+h}\right)}^{2}\) and \({\left({y}_{t+h}-{\widehat{y}}_{2t,t+h}\right)}^{2}\), respectively. Improving on this form, the MSPE-adjusted loss difference is defined as:

$${\widehat{f}}_{t+h}={\left({y}_{t+h}-{\widehat{y}}_{1t,t+h}\right)}^{2}-\left[{\left({y}_{t+h}-{\widehat{y}}_{2t,t+h}\right)}^{2}-{\left({\widehat{y}}_{1t,t+h}-{\widehat{y}}_{2t,t+h}\right)}^{2}\right]$$
(30)

Letting \(\stackrel{-}{f}\) be the sample average of \({\widehat{f}}_{t+h}\), the test statistic becomes:

$$\frac{\sqrt{P}\stackrel{-}{f}}{\sqrt{var\left({\widehat{f}}_{t+h}-\stackrel{-}{f}\right)}}$$
(31)

where \(P\) is the number of forecasts. We reject the null hypothesis of equal MSPE if the statistic is greater than +1.282 at the 10% significance level or +1.645 at the 5% level.
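The CW statistic of Eqs. (30) and (31) can be sketched as follows. The function name `clark_west` and the toy data are hypothetical, and the variance in the denominator is taken to be the plain sample variance of \({\widehat{f}}_{t+h}\), as written in Eq. (31).

```python
import numpy as np

def clark_west(y, forecast_small, forecast_large):
    """MSPE-adjusted statistic for a small model nested in a larger one."""
    y, f1, f2 = (np.asarray(a, dtype=float)
                 for a in (y, forecast_small, forecast_large))
    # Eq. (30): squared error of the small model minus the adjusted
    # squared error of the large model
    f = (y - f1) ** 2 - ((y - f2) ** 2 - (f1 - f2) ** 2)
    P = f.size
    # Eq. (31): sqrt(P) * mean(f) / sample standard deviation of f
    return float(np.sqrt(P) * f.mean() / f.std(ddof=1))

# Toy check: the large model forecasts perfectly, the small one uses a constant.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
stat = clark_west(y, np.full(6, 3.5), y)
```

The test is one-sided, so only large positive values of the statistic count as evidence that the larger model forecasts better.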

Data

This paper uses data from the SSE 50 Index of China’s securities market. The SSE 50 Index contains 50 stocks of the Shanghai Stock Exchange that are sufficiently large in scale and have good liquidity, and are broadly representative of Chinese enterprises. The sampling frequency of realized variance is five minutes because very few frequencies can beat standard five-minute realized volatility measures in forecasting exercises (Liu et al. 2015). We downloaded all five-minute high-frequency price data from the RESSET dataset.

The Baidu Index data are taken from https://index.baidu.com, which supplies separate indices for different client devices and geographical regions. However, we use the complete index from all regions and devices to investigate the attention of the whole market. We downloaded the component security list, security name, and weight on each trading day from the RESSET dataset.

The investor attention \({B}_{t}\), defined in Eq. (12), is a weighted aggregate measure of the Baidu Index for all SSE 50 companies, except those securities whose names are not included in the keyword directory. Figure 1 illustrates the time series of \({B}_{t}\) over the entire sample period. It shows that investor attention boomed in 2015, when the Chinese stock market was experiencing large fluctuations. In other periods, fluctuations are not as exaggerated and occur over shorter periods.

Fig. 1 Investor attention (B) based on Baidu Index

Since Baidu only began publishing its “Baidu Index” product in January 2011, the study’s sample period is from January 2011 to May 2019. We remove all non-trading days and obtain 2029 daily observations. Each record contains one realized volatility, \({RV}_{t}\), the jump component, \({J}_{t}\), the continuous component, \({C}_{t}\), the positive semivariance, \({RSV}_{t}^{+}\), the negative semivariance, \({RSV}_{t}^{-}\), signed jumps \(\Delta {J}_{t}\), and investor attention, \({B}_{t}\), for that day.

Figure 2 shows the daily log-returns, realized volatility, and signed jumps for the SSE 50. Periods of relatively low volatility clustering are observed in 2013 and 2018, with a period of very high volatility in 2015. Daily returns and signed jumps are prone to large fluctuations, often moving in unison. Additionally, we find that negative signed jumps are more likely to cause higher volatility in 2013, 2015, and 2018, as would be expected when forecasting short-term volatility.

Fig. 2 Log-returns (top panel), realized volatility (middle panel), and square root of signed jumps (bottom panel) of the SSE 50 over the sample period. We take the square root of the absolute value of signed jumps and keep the sign to reduce the data range

Figure 3 compares the autocorrelation of realized volatility (\(RV\)), positive realized semivariance (\({RSV}^{+}\)), negative realized semivariance (\({RSV}^{-}\)), continuous components (\(C\)), jump components (\(J\)), and signed jumps (\(\Delta J\)). The results for \(RV\), \({RSV}^{+}\), \({RSV}^{-}\), and \(C\) all reveal autocorrelation and long-memory processes, but the continuous component displays a more regular autocorrelation. We observe that \(J\) and \(\Delta J\) are autocorrelated over only one day, indicating that long-term jumps and signed jumps are almost impossible to predict.
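For readers who wish to reproduce this kind of comparison, a sample autocorrelation function can be computed as in the sketch below. The function name `sample_acf` and the toy series are hypothetical; the normalization by the lag-zero sum of squares is the standard choice, but it is an assumption that the figure was produced in the same way.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag, normalized by the lag-0 sum of squares."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]

# Toy trending series, used only to show the call.
trending = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = sample_acf(trending, max_lag=2)  # approximately [0.4, -0.1]
```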

Fig. 3 Sample autocorrelation functions for realized volatility (\(RV\)), positive realized semivariance (\({RSV}^{+}\)), negative realized semivariance (\({RSV}^{-}\)), continuous component (\(C\)), jump component (\(J\)) and signed jump (\(\Delta J\))

Table 2 reports the statistical properties of all variables for all models. It reveals that the average values of the daily, weekly, and monthly variables are approximately equal, but that the variance gradually decreases as the timespan increases. According to the Ljung-Box Q-statistic results, the null hypothesis of no autocorrelation is rejected for all variables, which show dynamic dependence at lags of 5, 10, and 15 days. This dependence is beneficial for our regression models. The last column of Table 2 shows the results of an augmented Dickey-Fuller test, which indicate that all variables are stationary time series except for monthly mean realized volatility, \({RV}_{t-21,t}\), the monthly continuous component, \({C}_{t-21,t}\), and investor attention, \({B}_{t}\).
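For reference, the Ljung-Box statistic for \(m\) lags is \(Q = n(n+2)\sum_{k=1}^{m} \hat{\rho}_{k}^{2}/(n-k)\), where \(\hat{\rho}_{k}\) is the lag-\(k\) sample autocorrelation and \(n\) is the sample size; under the null of no autocorrelation it is asymptotically chi-square distributed with \(m\) degrees of freedom. The sketch below computes it in plain Python. The function name `ljung_box_q` and the toy series are hypothetical, and this is not the code used to produce Table 2.

```python
def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic using sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

# Toy persistent series, used only to show the call.
persistent = [1.0, 1.2, 1.1, 1.4, 1.6, 1.5, 1.9, 2.0, 2.2, 2.1]
q_stat = ljung_box_q(persistent, max_lag=2)
```

The resulting value would be compared against a chi-square critical value with `max_lag` degrees of freedom.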

Table 2 Summary statistics for all variables

Empirical results and discussion

This section provides the main results. Firstly, an in-sample analysis of all 22 models forecasting average realized volatility for 1–66 days is provided. We then compare the out-of-sample performance of both existing models and new models. The number of daily observations in our sample is 2008 (from January 2011 to May 2019). These observations are divided into two subgroups: in-sample volatility data covering the first 1000 days and out-of-sample data covering the remaining 1008 days.

In-sample analysis

We estimate Models (1) to (22) introduced in the previous section through OLS regression for \(h = 1\)–66 (a forecasting horizon ranging from 1 to 66 days that covers the short term, medium term and long term). This provides a clear picture of the performance of each model and the predictive power of various components.
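To make the regression setup concrete, the sketch below builds the regressors and the \(h\)-day-ahead target for a basic HAR-RV specification. It is a simplified illustration: the window lengths of 5 and 22 days, the function name `har_rows`, and the toy series are assumptions, and the attention, jump, and semivariance terms of the richer models are omitted.

```python
def har_rows(rv, horizon, week=5, month=22):
    """Return (daily, weekly, monthly, target) tuples for a basic HAR regression.

    daily   : RV on day t
    weekly  : mean RV over the last `week` days ending at t
    monthly : mean RV over the last `month` days ending at t
    target  : mean RV over days t+1 .. t+horizon
    """
    rows = []
    for t in range(month - 1, len(rv) - horizon):
        daily = rv[t]
        weekly = sum(rv[t - week + 1 : t + 1]) / week
        monthly = sum(rv[t - month + 1 : t + 1]) / month
        target = sum(rv[t + 1 : t + 1 + horizon]) / horizon
        rows.append((daily, weekly, monthly, target))
    return rows

# Toy series 1, 2, ..., 30 standing in for daily realized volatility.
rv = [float(v) for v in range(1, 31)]
rows = har_rows(rv, horizon=2)
```

Each tuple supplies one observation for an OLS fit of the target on a constant and the three lagged averages.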

First, Fig. 4 compares the performance of existing models and new models by plotting the mean adjusted \(R^{2}\) of each model type. These values are high in the short term but are much lower when the forecast horizon is longer than 15 days. As the time range increases, the gap between existing models and new models is found to widen and the new models perform even better in long-term forecasting. Evidently, it is investor attention \(({B}_{t})\) that improves the precision of forecasts.
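Because each new model adds one regressor to its existing counterpart, the comparison relies on the adjusted rather than the raw coefficient of determination. A minimal sketch of the standard formula, \(1-(1-R^{2})(n-1)/(n-k-1)\) with \(n\) observations and \(k\) regressors excluding the constant, is given below. The function name `adjusted_r2` and the numbers are hypothetical.

```python
def adjusted_r2(y, fitted, n_regressors):
    """Adjusted R-squared for a regression with an intercept and n_regressors slopes."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_regressors - 1)

# Toy observations and fitted values, used only to show the call.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.0, 4.1, 4.9]
adj_one = adjusted_r2(y, fitted, n_regressors=1)
adj_two = adjusted_r2(y, fitted, n_regressors=2)
```

With identical fitted values, the version with more regressors receives the lower score, so an extra variable such as \({B}_{t}\) raises the adjusted value only if it improves the fit enough to offset the penalty.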

Fig. 4 The mean adjusted \(R^{2}\) of 11 existing models and 11 new models. Existing models are Models (1) to (11) and new models are Models (12) to (22). Each model receives the same weight in the average

Table 3 presents more model-specific results over different time horizons. When predicting the realized volatility of the next day, the results of existing models and new models are found to be very similar, as the Baidu Index only improves accuracy in poor models. In forecasting medium-term volatility, the Baidu Index plays a more important role, such that new models perform better. The HAR-CSJd-type models perform the best, producing the highest adjusted \(R^{2}\) values either with or without the Baidu Index, but the gap between HAR-CJ-type, HAR-CSJ-type, and HAR-CSJd-type models is reduced. These three model types with continuous components offer improvements on all other models, whilst the positive and negative semivariance in the HAR-RSV-type and HAR-RSV-J-type models also improve forecasting ability. This confirms the positive impact of disaggregating the realized volatility in prediction.

Table 3 The adjusted \(R^{2}\) of existing models and new models

Finally, we choose the time ranges of 22, 44, and 66 days to assess the accuracy of long-term predictions. As the time horizon increases, the contribution of \({B}_{t}\) to all existing models is found to rise, which is consistent with the relationship observed in Fig. 4. For the long-term result, we can still discriminate between models with continuous components, but disparities between new models decrease. The difference between the best new model and the worst new model is 0.050 when h = 5, but this value falls to 0.005 when h = 66, which indicates the reduced importance of continuous components.

To draw conclusions about the significance of coefficients, we also consider the estimated parameters of new models. Table 4 reports the estimated result for a 1-day horizon and shows that investor attention is statistically significant at the 5% level for all models. In the HAR-RV-B model, the mean realized volatility of the last day and the last week are significantly positive, but \({RV}_{t-21,t}\) is not. The HAR-CJ-B model leads to a significant increase in explanatory power due to the decomposition of realized volatility. Jumps have a positive impact on the realized volatility in the short term but the coefficients of jumps over the medium and long term are negative, indicating that they offset the impact of short-term jumps, thereby echoing the conclusions reached by Andersen et al. (2007).

Table 4 Regression parameters of new models for 1-day horizon

The coefficient \({\beta }_{J1}\) in the HAR-RV-J-B and HAR-RSV-J-B models cannot show the real effect of \({J}_{t-1,t}\) because realized volatility and semivariance also contain jump factors. As defined in Eqs. (7) and (8), the realized volatility is the sum of jump and continuous components, such that, for example, the sum of \({\beta }_{1}\) and \({\beta }_{J1}\) is the actual coefficient of the daily jump component of the HAR-RV-J-B model. In Table 3, we show that the HAR-CJ-B, HAR-CSJ-B, and HAR-CSJd-B models with continuous components of each horizon possess the most explanatory power. The 1-day realized volatility is more closely related to the past short-term and medium-term continuous components.
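The relation between \({\beta }_{1}\) and \({\beta }_{J1}\) can be made explicit. Keeping only the daily terms of the HAR-RV-J-B regression, and writing the daily realized volatility and continuous component as \({RV}_{t-1,t}\) and \({C}_{t-1,t}\) (notation assumed here by analogy with \({J}_{t-1,t}\)), substituting \({RV}_{t-1,t} = {C}_{t-1,t} + {J}_{t-1,t}\) gives

\[
{\beta }_{1}{RV}_{t-1,t} + {\beta }_{J1}{J}_{t-1,t} = {\beta }_{1}\left({C}_{t-1,t} + {J}_{t-1,t}\right) + {\beta }_{J1}{J}_{t-1,t} = {\beta }_{1}{C}_{t-1,t} + \left({\beta }_{1} + {\beta }_{J1}\right){J}_{t-1,t}.
\]

The daily jump component therefore enters with the combined coefficient \({\beta }_{1} + {\beta }_{J1}\), while the continuous component enters with \({\beta }_{1}\) alone.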

In Table 4, Rows 4–7 report the results for models with positive and negative semivariance. Compared with the HAR-RV-B model, the decomposition by positive and negative semivariances contributes to the fit of the predictive regression. The 1-day-lagged negative semivariance has a positive effect on the realized volatility, in line with the significance of the downside risk identified by Barndorff-Nielsen et al. (2008). However, interestingly, the positive semivariance of the last week causes higher volatility, but this does not exhibit a strong leverage effect. In the HAR-RV-SJ-B, HAR-CSJ-B, HAR-RV-SJd-B, and HAR-CSJd-B models, signed jumps defined by subtracting negative semivariance from positive semivariance can be used to predict volatility. The higher adjusted \(R^{2}\) of the HAR-CSJ-B and HAR-CSJd-B models is consistent with the result obtained by Patton and Sheppard (2015), who find that the jump size and sign are the gains from realized jumps. The negative sign of coefficient \({\beta }_{\delta J1}\) matches the leverage effect of negative semivariance \({RSV}_{t}^{-}\) in the PS-B, PSLev-B, HAR-RSV-B, and HAR-RSV-J-B models. However, 1-day, 1-week, and 1-month signed jumps have different effects on short-term volatility prediction, which corresponds to the findings from the semivariance. Notably, but perhaps as a result of the nature of the asset assessed, this result contrasts with those obtained by Patton and Sheppard (2015) when analyzing oil futures markets. In the stock market, strong volatility appears likely to follow a positive medium-term semivariance. Thus, overall, the 1-day lagged and 1-week lagged variables are found to be the most important factors in short-term forecasting.
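The signed jump variation discussed above can be illustrated with a short sketch that splits squared intraday returns by sign. The function names and the return values are hypothetical; the definitions follow the usual realized semivariance construction, in which positive and negative squared returns are summed separately.

```python
def semivariances(returns):
    """Return (positive, negative) realized semivariance from intraday returns."""
    rsv_pos = sum(r * r for r in returns if r > 0)
    rsv_neg = sum(r * r for r in returns if r < 0)
    return rsv_pos, rsv_neg

def signed_jump(returns):
    """Positive semivariance minus negative semivariance."""
    rsv_pos, rsv_neg = semivariances(returns)
    return rsv_pos - rsv_neg

# Hypothetical five-minute returns for one day, dominated by one negative move.
intraday_returns = [0.01, -0.02, 0.005]
rsv_pos, rsv_neg = semivariances(intraday_returns)
delta_j = signed_jump(intraday_returns)
```

The two semivariances add up to the realized variance, and a day dominated by negative moves produces a negative signed jump.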

Table 5 reports the in-sample regression result when h = 5. The 1-month lagged realized volatility is not statistically significant, but the continuous and jump components extracted from this variable are significant, which indicates that the volatility follows a jump process. The HAR-CJ-B, HAR-CSJ-B, and HAR-CSJd-B models, with different horizons of continuous components, are shown to outperform other new models, confirming the findings of Andersen et al. (2007) that almost all of the predictability in return volatility comes from non-jump components. Yet, we also find evidence that the long-term historical jump components or signed jumps are more important in market volatility forecasting. For 1-week horizon forecasting, the explanatory power of monthly realized volatility is not significant. As the main component of realized volatility, the continuous component \({C}_{22}\) also has little predictive effect, and it is the monthly jump and signed jumps that contribute the most to the explanatory power. The opposite direction of the coefficient \({\beta }_{C22}\) in the HAR-CJ-B and HAR-CSJ-B models also indicates that the jump and signed jumps are more dominant than the continuous component. However, as a result of daily and weekly realized volatility, the effect does not appear in short-term and medium-term parameters. For medium-term forecasting, we note that the coefficients of monthly semivariance and signed jumps are all statistically significant, and exhibit a stronger downside risk effect than signed jumps at other horizons. This result demonstrates that China’s stock markets have significant “negative effects” over the long term.

Table 5 Regression parameters of new models for 1-week horizon

Table 6 reports the estimated parameters for the 1-month horizon. In forecasting long-term volatility, the coefficient of investor attention is larger, but those of other variables are reduced. This change confirms that it is investor attention that narrows the gap between different HAR-type models in volatility forecasting. Many of the short- and medium-term lagged factors are not statistically significant, including 1-day lagged jumps and signed jumps, 5-day lagged semivariance, and realized volatility. However, we note that long-term factors still play a key role in prediction. In addition, comparing the PS-B and the HAR-RSV-B models, we observe that the decomposition of medium-term and long-term semivariance produces a result that is consistent with the long-memory features highlighted by Corsi (2009). We find that the daily signed jump component is insignificant at the 10% level and that the adjusted \(R^{2}\) of the model is similar to that of the HAR-RV-J-B. This indicates that there is no specific gain to be made from considering signed jumps. However, all the continuous components remain significant with a strong explanatory potential in the long term.

Table 6 Regression parameters of new models for 1-month horizon

Summarizing the results of the in-sample analysis, we find that investor attention can significantly improve prediction accuracy over the long-term horizon. Comparing different forecast horizons, we find that the range of historical data matches the prediction period. For instance, the future long-term realized volatility depends upon historical monthly components, not 1-day lagged and 1-week lagged variables. This result also confirms the advantages of HAR-type models in forecasting long-term volatility. The decomposition of realized volatility advocated by Andersen et al. (2007) is found to have a significant impact on volatility forecasting (especially the continuous component), but signed jumps perform better than jump components in the SSE 50. Specifically, the HAR-CSJd-B model generates the highest adjusted \(R^{2}\) over the 1-day and 1-week horizons and the HAR-CSJ-B model produces the highest adjusted \(R^{2}\) over a 1-month horizon.

Out-of-sample analysis

In this section, we analyze the out-of-sample performance of the 11 existing models and the 11 new models. Specifically, we compare the existing models and their corresponding new models to identify the importance of investor attention. We then compare the different new models for short-term, medium-term, and long-term predictions. A rolling window method is employed to estimate the volatility forecasting results of each model, by adding one new day and removing the most distant day in turn. Therefore, the sample used to estimate the models remains fixed at length \(w=1000\) and the forecasts do not overlap. The number of daily out-of-sample observations is \(T=1008\). For each forecast horizon \(h\), each model is re-estimated \(P=T-h+1\) times, and its parameters vary over time with the different samples. Following this process, we produce the loss series of each model with length \(\tau\), and evaluate their out-of-sample performance.
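The index bookkeeping of this rolling scheme can be sketched as follows. The generator name `rolling_windows` and the zero-based, end-exclusive index convention are assumptions; the sketch only lays out which observations form each estimation window and each forecast target, and it does not fit any model.

```python
def rolling_windows(window, n_out, horizon):
    """Yield (train_start, train_end, target_start, target_end), end-exclusive.

    Each step drops the oldest day and adds one new day, so the estimation
    window keeps a fixed length; there are n_out - horizon + 1 steps in total.
    """
    n_forecasts = n_out - horizon + 1
    for i in range(n_forecasts):
        train_start, train_end = i, i + window
        yield train_start, train_end, train_end, train_end + horizon

# Values from the text: w = 1000, T = 1008; h = 5 chosen as an example.
windows = list(rolling_windows(window=1000, n_out=1008, horizon=5))
```

With these values the last target window ends at observation 2008, so \(P = T - h + 1 = 1004\) forecasts use the full sample.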

Table 7 reports the CW test result for the out-of-sample analysis between existing models and new models. Each new model nests its corresponding existing model; for example, the HAR-RV-B model is the larger model that nests the smaller HAR-RV model. There are only two non-significant values in Table 7, which are the HAR-CJ-B and HAR-CSJd-B models for the 1-day forecasting horizon, indicating that the investor attention in these two new models is unable to improve the accuracy of short-term prediction. In addition, the HAR-CJ-B and HAR-CSJd-B models outperform other new models in the in-sample analysis, which indicates that the continuous and jump components have strong predictive power. As the forecasting horizon increases, the gap between existing models and new models widens. Investor attention is thus playing an increasingly important role in volatility forecasting, further verifying the conclusion drawn from the previous analysis.
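A minimal sketch of a Clark and West (2007) style statistic for nested models is shown below. It forms the adjusted squared-error difference for each forecast and computes a t-type statistic for its mean. The function name `clark_west_stat`, the toy forecasts, and the use of a plain sample variance are assumptions; for multi-day horizons an autocorrelation-robust variance estimator would normally be preferred, and the exact form used for Table 7 is defined in the methodology section rather than here.

```python
import math

def clark_west_stat(actual, pred_small, pred_large):
    """t-type statistic on the Clark-West adjusted loss difference.

    f_t = (y - small)^2 - [(y - large)^2 - (small - large)^2]
    Assumption: plain sample variance, no autocorrelation correction.
    """
    f = [
        (y - s) ** 2 - ((y - g) ** 2 - (s - g) ** 2)
        for y, s, g in zip(actual, pred_small, pred_large)
    ]
    p = len(f)
    mean_f = sum(f) / p
    var_f = sum((x - mean_f) ** 2 for x in f) / (p - 1)
    return math.sqrt(p) * mean_f / math.sqrt(var_f)

# Toy values: the larger model tracks the outcome more closely.
actual = [1.0, 2.0, 3.0, 4.0]
pred_small = [1.5, 1.5, 3.5, 3.5]
pred_large = [1.1, 1.8, 3.1, 4.2]
cw = clark_west_stat(actual, pred_small, pred_large)
```

A value above +1.645 would reject equal predictive accuracy in favor of the larger model at the 5% level under the one-sided rule stated earlier.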

Table 7 The CW test between existing models and new models

Next, we compare the out-of-sample performance of new models and report the DMW statistics for various horizons in Tables 8 and 9. Table 8 presents the test result for h = 1, 5, and 10, which covers the short term and medium term. The results indicate that the differences between the new models are greater: In Panel A, the result obtained at Row HAR-RV-B Column HAR-RV-J-B is 4.0807, which indicates that the HAR-RV-J-B model performs better than the HAR-RV-B model when h = 1. The PS-B, PSLev-B, and HAR-RSV-B models, which only contain realized volatility and semivariance components, were outperformed by most of the other models, including the original HAR-RV-B model. Furthermore, the decomposition of realized volatility into semivariance does not contribute to volatility forecasting. As expected, given the results of the in-sample analysis, the jump and signed jump indeed play a significant role. The HAR-RSV-J-B model, with the help of the 1-day lagged jump component, outperforms the HAR-RSV-B model.
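The pairwise comparison can be sketched as a Diebold-Mariano type statistic on the difference between two loss series. The function name `dmw_stat`, the toy losses, and the plain sample variance are assumptions, and the loss function itself is taken as given. The sign convention follows the reading of Panel A above, where a positive value favors the column model.

```python
import math

def dmw_stat(loss_row, loss_col):
    """Mean loss difference (row minus column) divided by its standard error.

    Assumption: plain sample variance, no autocorrelation correction.
    Positive values indicate that the column model has the lower average loss.
    """
    d = [a - b for a, b in zip(loss_row, loss_col)]
    p = len(d)
    mean_d = sum(d) / p
    var_d = sum((x - mean_d) ** 2 for x in d) / (p - 1)
    return mean_d / math.sqrt(var_d / p)

# Toy loss series for two competing models.
loss_row = [0.5, 0.7, 0.6, 0.8]
loss_col = [0.4, 0.5, 0.5, 0.5]
stat = dmw_stat(loss_row, loss_col)
```

Swapping the two arguments flips the sign of the statistic, which is why each table entry has to be read together with its row and column labels.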

Table 8 The DMW statistic for new models in forecasting short-term and medium-term realized volatility
Table 9 The DMW statistic for new models in forecasting long-term realized volatility

Considering the 1-week horizon in Panel B, we note that the gap between the models increases and the models with semivariance still do not offer improved performance. The HAR-CJ-B and HAR-CSJd-B models outperform most of the other models, especially in the 1-week lagged and 1-month lagged jumps and signed jumps over the 1-week horizon. At the same time, the HAR-RV-J-B, HAR-RV-SJ-B, and HAR-RV-SJd-B models are outperformed by the HAR-CJ-B and HAR-CSJd-B models. The HAR-CSJ-B model also demonstrates prediction accuracy, but not as effectively as the HAR-CSJd-B model, which indicates that dividing the signed jump into positive and negative aspects is an effective approach.

Panel C shows that the HAR-CSJd-B model is still the most appropriate in the two-week forecasting horizon, but the HAR-CJ-B does not perform as well over the 1-day and 1-week forecasting horizon. The worst models are the PS-B, PSLev-B, and HAR-RSV-B models, which underperform against other models in the short and medium forecasting horizon.

Table 9 reports the DMW statistics for 1-month, two-month and three-month forecasts, with results over the long term differing quite significantly from short-term results. Based on the conclusion that investor attention is a strong predictor over the long term, we note that when h ≥ 22 all new models mainly rely on the Baidu Index, not the components extracted from realized volatility. In Panel A, the best model is the HAR-CSJ-B model, rather than the HAR-CJ-B or HAR-CSJd-B models. The HAR-RV-J-B model is only more effective than the worst two predictors, the PS-B and PSLev-B models. In Panel B, the HAR-CSJ-B model outperforms other models with significant results. However, the DMW statistic comparing the HAR-CJ-B and HAR-CSJ-B models is not significant. In Panel C, even the HAR-CSJ-B model only outperforms two models and the jump component does not have a significant predictive impact over the long term, unlike the result of the in-sample analysis.

We conclude that these results are caused by two factors. Firstly, the jump component often derives from macroeconomic events, which makes it difficult to predict and a major driver of short-term volatility. Secondly, the coefficient of the jump component may also be susceptible to external conditions. In the in-sample analysis, all observations are used to evaluate the parameters, but in the out-of-sample analysis, the model trained using historical data cannot forecast accurately if conditions change in the future. In addition, the HAR-CSJd-B model is also outperformed by the HAR-CSJ-B model with regard to long-term forecasting. The positive and negative signed jumps can provide more information in short-term and medium-term forecasting, but they lead to overfitting of the HAR-CSJd-B model as h increases.

To summarize, we conclude that investor attention is valuable in forecasting, but that positive and negative semivariance are not. Furthermore, the in-sample performance can be dramatically improved by disaggregating jump and continuous components over the entire forecasting horizon. However, in long-term forecasting, jumps do not contribute more than other factors extracted from realized volatility, whilst the predictive ability of jumps in long-term forecasting is also affected by other conditions in the stock market.

Conclusion

This paper investigates the impact of investor attention on forecasting volatility in the Chinese stock market. Specifically, it adds the Baidu Index as a proxy for investor attention to existing HAR-type models to forecast SSE 50 Index volatility. Using five-minute high-frequency data and collating the Baidu indices of the component security names in the SSE 50 Index, we propose 11 new models by adding the investor attention variable to 11 previously existing models. We then compare their in-sample and out-of-sample predictive power.

The comparison of the models identifies the predictive ability of the variables when taking investor attention into account. The continuous component is found to play an important role in prediction, while the jump component only significantly improves models in the short- and medium-term. Over the long-term horizon, predictive power is reduced by macroeconomic shocks.

It is also shown that investor attention is a useful indicator in forecasting volatility, especially over the long-term horizon. Thus, for security investors, our findings offer an effective risk management and option pricing tool. Specifically, as more option products can be traded in the future, the weighted Baidu Index of component securities will greatly improve the accuracy of original models in predicting long-term volatility. This result is of particular interest because much of the previous research finds the impacts of search query data to be short lived. Consequently, our article provides a new form of evidence within the investor attention research field. Based on our results, it appears feasible that long-term forecasting ability may be related to a discovered long-memory property (Fan et al. 2017), but we leave the analysis of this potential relationship to future research.

Availability of data and materials

The datasets used are available from the corresponding author upon reasonable request.

Notes

  1. For a detailed illustration, refer to Sect. Realized volatility: Description of Baidu Index of Zhang et al. (2013).

  2. Many novel proxies based on Internet information have been described in the Introduction. For brevity, we do not repeat them in this section.

  3. The Baidu Index (https://index.baidu.com/Helper/?tpl=help) provides data updates daily. The previous day’s Baidu Index is usually available before the beginning of market trading.

References

  1. Andersen TG, Bollerslev T (1998) Answering the skeptics: yes, standard volatility models do provide accurate forecasts. Int Econ Rev 39(4):885–905

  2. Andersen TG, Bollerslev T, Diebold FX, Labys P (2003) Modeling and forecasting realized volatility. Econometrica 71(2):579–625

  3. Andersen TG, Bollerslev T, Meddahi N (2004) Analytical evaluation of volatility forecasts. Int Econ Rev 45(4):1079–1110

  4. Andersen TG, Bollerslev T, Diebold FX (2007) Roughing it up: including jump components in the measurement, modeling, and forecasting of return volatility. Rev Econ Stat 89(4):701–720

  5. Andrei D, Hasler M (2015) Investor attention and stock market volatility. Rev Financ Stud 28(1):33–72

  6. Ang A, Chen J, Xing Y (2006) Downside risk. Rev Financ Stud 19(4):1191–1239

  7. Asai M, Mcaleer M, Medeiros MC (2012) Asymmetry and long memory in volatility modeling. J Financ Econom 10(3):495–512

  8. Audrino F, Knaus SD (2016) Lassoing the HAR model: a model selection perspective on realized volatility dynamics. Econom Rev 35:1485–1521

  9. Audrino F, Sigrist F, Ballinari D (2020) The impact of sentiment and attention measures on stock market volatility. Int J Forecast 36(2):334–357

  10. Avramov D, Chordia T, Goyal A (2006) The impact of trades on daily volatility. Rev Financ Stud 19(4):1241–1277

  11. Barber BM, Odean T (2008) All that glitters: The effect of attention and news on the buying behavior of individual and institutional investors. Rev Financ Stud 21(2):785–818

  12. Barndorff-Nielsen OE, Shephard N (2001) Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics. J R Stat Soc Ser B (Stat Methodol) 63(2):167–241

  13. Barndorff-Nielsen OE, Shephard N (2004) Power and bipower variation with stochastic volatility and jumps. J Financ Econom 2(1):1–37

  14. Barndorff-Nielsen OE, Shephard N (2006) Econometrics of testing for jumps in financial economics using bipower variation. J Financ Econom 4(1):1–30

  15. Barndorff-Nielsen OE, Kinnebrock S, Shephard N (2008) Measuring downside risk-realised semivariance. CREATES Research Paper (2008-42)

  16. Behrendt S, Schmidt A (2018) The Twitter myth revisited: intraday investor sentiment, Twitter activity and individual-level stock return volatility. J Bank Finance 96:355–367

  17. Blair BJ, Poon SH, Taylor SJ (2001) Forecasting S&P 100 volatility: the incremental information content of implied volatilities and high-frequency index returns. J Econom 105(1):5–26

  18. Bollen J, Mao H, Zeng X (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8

  19. Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econom 31(3):307–327

  20. Busch T, Christensen BJ, Nielsen MO (2011) The role of implied volatility in forecasting future realized volatility and jumps in foreign exchange, stock, and bond markets. J Econom 160(1):48–57

  21. Carnero MA, Pena D, Ruiz E (2004) Persistence and kurtosis in GARCH and stochastic volatility models. J Financ Econom 2(2):319–342

  22. Chen X, Ghysels E (2011) News—good or bad—and its impact on volatility predictions over multiple horizons. Rev Financ Stud 24(1):46–81. https://doi.org/10.1093/rfs/hhq071

  23. China Internet Network Information Center (2019) The 44th China statistical report on internet development. http://www.cac.gov.cn/pdf/20190829/44.pdf. Accessed 18 Nov 2020

  24. Chiras DP, Manaster S (1978) The information content of option prices and a test of market efficiency. J Financ Econ 6(2–3):213–234

  25. Choobineh F, Branting D (1986) A simple approximation for semivariance. Eur J Oper Res 27(3):364–370

  26. Christensen BJ, Prabhala NR (1998) The relation between implied and realized volatility. J Financ Econ 50(2):125–150

  27. Chua CL, Tsiaplias S (2018) Information flows and stock market volatility. J Appl Econom 34(1):129–148

  28. Chunhachinda P, Dandapani K, Hamid S, Prakash AJ (1997) Portfolio selection and skewness: evidence from international stock markets. J Bank Finance 21(2):143–167

  29. Clark TE, West KD (2007) Approximately normal tests for equal predictive accuracy in nested models. J Econom 138(1):291–311

  30. Corsi F (2009) A simple approximate long-memory model of realized volatility. J Financ Econom 7(2):174–196

  31. Corsi F, Reno R (2009) HAR volatility modelling with heterogeneous leverage and jumps. Available at SSRN 1316953

  32. Corsi F, Pirino D, Reno R (2010) Threshold bipower variation and the impact of jumps on volatility forecasting. J Econom 159(2):276–288

  33. Da Z, Engelberg J, Gao P (2011) In search of attention. J Finance 66(5):1461–1499

  34. Deo R, Hurvich C, Lu Y (2006) Forecasting realized volatility using a long-memory stochastic volatility model: estimation, prediction and seasonal adjustment. J Econom 131(1–2):29–58

  35. Diebold FX, Mariano RS (1995) Comparing predictive accuracy. J Bus Econ Stat 20(1):134–144

  36. Dimpfl T, Jank S (2016) Can internet search queries help to predict stock market volatility? Eur Financ Manag 22(2):171–192

  37. Dobrev D, Szerszen P (2010) The information content of high-frequency data for estimating equity return models and forecasting risk. Soc Sci Res Netw 2010(1005):1–42

  38. Ellul A, Shin HS, Tonks I (2009) Opening and closing the market: evidence from the London stock exchange. J Financ Quant Anal 40(4):779–801

  39. Fama EF (1965) Portfolio analysis in a stable Paretian market. Manag Sci 11(3):404–419

  40. Fan X, Yuan Y, Zhuang X, Jin X (2017) Long memory of abnormal investor attention and the cross-correlations between abnormal investor attention and trading volume, volatility respectively. Phys A 469:323–333

  41. Fleming J, Kirby C, Ostdiek B (2003) The economic value of volatility timing using “realized” volatility. J Financ Econ 67(3):473–509

  42. Forsberg L, Ghysels E (2006) Why do absolute returns predict volatility so well. J Financ Econom 5(1):31–67

  43. Foucault T, Sraer D, Thesmar DJ (2011) Individual investors and volatility. J Finance 66(4):1369–1406

  44. Giot P, Laurent S (2007) The information content of implied volatility in light of the jump/continuous decomposition of realized volatility. J Fut Mark 27(4):337–359

  45. Glosten LR, Jagannathan R, Runkle DE (1993) On the relation between the expected value and the volatility of the nominal excess return on stocks. J Finance 48(5):1779–1801

  46. Hamid A, Heiden M (2015) Forecasting volatility with empirical similarity and Google trends. J Econ Behav Organ 117:62–81

  47. Hansen PR, Huang Z, Shek HH (2012) Realized GARCH: a joint model for returns and realized measures of volatility. J Appl Econom 27(6):877–906

  48. Harvey A, Ruiz E, Shephard N (1994) Multivariate stochastic variance models. Rev Econ Stud 61(2):247–264

  49. Hervé F, Zouaoui M, Belvaux B (2019) Noise traders and smart money: evidence from online searches. Econ Model 83:141–149

  50. Hu Y, Li X, Shen D (2020) Attention allocation and international stock return comovement: evidence from the Bitcoin market. Res Int Bus Finance 54:101286

  51. Hu Y, Li X, Goodell JW, Shen D (2021) Investor attention shocks and stock co-movement: substitution or reinforcement? Int Rev Financ Anal 73:101617

  52. Huang XX (2008a) Mean-semivariance models for fuzzy portfolio selection. J Comput Appl Math 217(1):1–8

  53. Huang XX (2008b) Portfolio selection with a new definition of risk. Eur J Oper Res 186(1):351–357

  54. Huang X, Tauchen G (2005) The relative contribution of jumps to total price variance. J Financ Econom 3(4):456–499

  55. Jin X, Shen D, Zhang W (2016) Has microblogging changed stock market behavior? Evidence from China. Phys A 452:151–156

  56. Koopman SJ, Jungbacker B, Hol E (2005) Forecasting daily variability of the S&P 100 stock index using historical, realised and implied volatility measurements. J Empir Finance 12(3):445–475

  57. Latané HA, Rendleman RJ (1976) Standard deviations of stock price ratios implied in option prices. J Finance 31(2):369–381

  58. Li X, Shen D, Xue M, Zhang W (2017) Daily happiness and stock returns: the case of Chinese company listed in the United States. Econ Model 64:496–501

  59. Li X, Shen D, Zhang W (2018) Do Chinese internet stock message boards convey firm-specific information? Pac Basin Finance J 49:1–14

  60. Liu LY, Patton AJ, Sheppard K (2015) Does anything beat 5-minute RV? A comparison of realized measures across multiple asset classes. J Econom 187(1):293–311

  61. Lux T, Marchesi M (1999) Scaling and criticality in a stochastic multi-agent model of a financial market. Nature 397(6719):498–500

  62. Ma F, Wahab MIM, Zhang Y (2019) Forecasting the U.S. stock volatility: an aligned jump index from G7 stock markets. Pac Basin Finance J 54:132–146

  63. Markowitz H (1959) Portfolio selection: efficient diversification of investments. Wiley, Hoboken

  64. Martens M, Zein J (2002) Predicting financial volatility: high-frequency time-series forecasts vis-a-vis implied volatility. J Fut Mark 24(11):1005–1028

  65. Martens M, Van Dijk D, De Pooter M (2009) Forecasting S&P 500 volatility: long memory, level shifts, leverage effects, day-of-the-week seasonality, and macroeconomic announcements. Int J Forecast 25(2):282–303

  66. Patton AJ (2011) Volatility forecast comparison using imperfect volatility proxies. J Econom 160(1):246–256

  67. Patton AJ, Sheppard K (2015) Good volatility, bad volatility: signed jumps and the persistence of volatility. Rev Econ Stat 97(3):683–697

  68. Peltomäki J, Graham M, Hasselgren A (2018) Investor attention to market categories and market volatility: the case of emerging markets. Res Int Bus Finance 44:532–546

  69. Ping Y, Li R (2018) Forecasting realized volatility based on the truncated two-scales realized volatility estimator (TTSRV): evidence from China’s stock market. Finance Res Lett 25:222–229

  70. Pong SY, Shackleton MB, Taylor SJ, Xu XZ (2004) Forecasting currency volatility: a comparison of implied volatilities and AR(FI)MA models. J Bank Finance 28(10):2541–2563

  71. Ramos SB, Latoeiro P, Veiga H (2020) Limited attention, salience of information and stock market activity. Econ Model 87:92–108

  72. Sévi B (2014) Forecasting the volatility of crude oil futures using intraday data. Eur J Oper Res 235(3):643–659

  73. Shen D, Zhang Y, Xiong X, Zhang W (2017) Baidu index and predictability of Chinese stock returns. Financ Innov. https://doi.org/10.1186/s40854-017-0053-1

  74. Shenzhen Stock Exchange (2018) Individual Investor Status Survey Report: 2017. http://www.szse.cn/aboutus/trends/news/t20180315_519202.html. Accessed 18 Nov 2020

  75. Shin DW (2018) Forecasting realized volatility: a review. J Korean Stat Soc 47(4):395–404

  76. Shin JW, Shin D (2019) Vector error correction heterogeneous autoregressive forecast model of realized volatility and implied volatility. Commun Stat Simul Comput 48(5):1503–1515

  77. Tantaopas P, Padungsaksawasdi C, Treepongkaruna S (2016) Attention effect via internet search intensity in Asia-Pacific stock markets. Pac Basin Finance J 38:107–124

  78. U.K. Office of National Statistics (2020) Ownership of UK quoted shares: 2018. https://www.ons.gov.uk/economy/investmentspensionsandtrusts/bulletins/ownershipofukquotedshares/2018. Accessed 18 Nov 2020

  79. U.S. Securities and Exchange Commission (2013) Institutional Investors: Power and Responsibility. https://www.sec.gov/news/speech/2013-spch041913laahtm#P18_1663. Accessed 18 Nov 2020

  80. Vozlyublennaia N (2014) Investor attention, index performance, and return predictability. J Bank Finance 41:17–35

  81. Wang XX, Shrestha K, Sun Q (2019) Forecasting realised volatility: a Markov switching approach with time-varying transition probabilities. Account Finance 59:1947–1975

  82. Wen F, Xu L, Ouyang G, Kou G (2019) Retail investor attention and stock price crash risk: evidence from China. Int Rev Financ Anal 65:101376

  83. West KD (1996) Asymptotic inference about predictive ability. Econom J Econom Soc 64:1067–1084

  84. Wu XY, Hou XM (2019) Forecasting realized variance using asymmetric HAR model with time-varying coefficients. Finance Res Lett 30:89–95

  85. Yuan Y (2015) Market-wide attention, trading, and stock returns. J Financ Econ 116(3):548–564

  86. Yuan P (2019) Forecasting realized volatility dynamically based on adjusted dynamic model averaging (AMDA) approach: evidence from China’s stock market. J Account Finance 4(2):44

  87. Zhang B, Wang Y (2015) Limited attention of individual investors and stock performance: evidence from the ChiNext market. Econ Model 50:94–104

  88. Zhang W, Shen D, Zhang Y, Xiong X (2013) Open source information, investor attention, and asset pricing. Econ Model 33:613–619

  89. Zhang Y, Song W, Shen D, Zhang W (2016) Market reaction to internet news: information diffusion and price pressure. Econ Model 56:43–49

Funding

This work is supported by the National Natural Science Foundation of China (71790594, 71701150, and U1811462).

Author information

Contributions

All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Dehua Shen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, W., Yan, K. & Shen, D. Can the Baidu Index predict realized volatility in the Chinese stock market? Financ Innov 7, 7 (2021). https://doi.org/10.1186/s40854-020-00216-y

Keywords

  • Realized volatility
  • HAR model
  • Baidu Index
  • Chinese stock market