Detecting the lead–lag effect in stock markets: definition, patterns, and investment strategies

Abstract

Human activities widely exhibit power-law distributions. Treating stock trading as a typical human activity in the financial domain, this paper first examines whether the well-known power-law distribution can be observed in this activity. Based on 10 years of trading data, we find that the number of accumulated lead–lag days between stock pairs follows the power-law distribution in both the U.S. and Chinese stock markets. Based on this finding, this paper adopts the power-law distribution to formally define the lead–lag effect, detects stock pairs with the lead–lag effect, and then designs a pure lead–lag investment strategy as well as enhancement investment strategies that integrate the lead–lag strategy into classic alpha-factor strategies. Tests conducted on 20 different alpha-factor strategies demonstrate that both the pure and the enhancement strategies perform better than the selected benchmark strategy and that the lead–lag strategy provides useful signals that significantly improve the performance of basic alpha-factor strategies. Our results therefore indicate that the lead–lag effect may provide effective information for designing more profitable investment strategies.

Introduction

The lead–lag phenomenon, a phenomenon in which a security leads the price movement of another with some time delay, has been empirically evidenced as widely existing in financial markets (Gong et al. 2016). Although the “lead–lag effect” concept has been adopted in many studies (Kobayashi and Takaguchi 2018), few have provided a formal definition of this concept, and its underlying meaning is not always consistent. Some studies have focused on how to generate greater stock returns by utilizing the “lead–lag phenomenon” (Stübinger 2019) but have often failed to mine its embedded features. To this end, this study aims to answer the following questions: (1) Are there several stable patterns in stock markets that are characterized by the lead–lag phenomenon? (2) How can we formally define the lead–lag effect to provide a solid foundation for detecting such an effect? (3) Can detecting the lead–lag effect enable the design of more profitable investment strategies that are more likely to earn excess returns?

The definition of the “lead–lag effect” is not equivalent to that of the “lead–lag relationship.” If one stock’s price movement today mimics another stock’s price movement yesterday, the two stocks are said to have a “lead–lag relationship” over the two successive days, in which the former is the follower and the latter is the leader. In fact, it is quite common for one stock to follow another stock on some days during a year. Thus, an occasional lead–lag relationship could be regarded as random, which would not be very meaningful. However, if the number of lead–lag days of one stock pair is large enough to differ significantly from what a random event would produce, an effect can be deemed to exist between the pair. Accordingly, the first motivation of our work is to define the lead–lag effect by providing a statistical testing model, the goal of which is to judge whether the number of days characterized by a lead–lag relationship (hereafter, “lead–lag days”) is statistically significant.

Once the definition of the lead–lag effect is scientifically determined, a method for detecting stock pairs characterized by the lead–lag effect can be proposed. However, two questions must first be addressed: (1) how do external variables affect the detection results, and (2) are the detection results sensitive to these external variables? The answers to these two questions will deepen our understanding of the proposed detection model. Knowing how the external variables influence the results will allow us to apply the proposed model with appropriate variable values. Robustness matters for using the model in real-world investment practice, because a robust model is desirable when designing investment strategies. Accordingly, the answers to these two questions will reveal the properties of the proposed model.

As a typical application, the detected lead–lag effect is intended to guide investments in real-world stock markets. Detecting stock pairs with a significant lead–lag effect can benefit investors because the price movements of followers tend to mimic those of their leaders. Thus, this study will first examine the performance of the pure lead–lag strategy and judge whether it is satisfactory. If it is, we will regard the detected lead–lag effect as an enhancement signal and add it to some classic investment strategies to propose enhancement investment strategies. Generally, when a basic strategy is enhanced by another strategy, we refer to the result as a single-enhancement investment strategy. The alpha-factor strategy is selected as the basic strategy, and our proposed lead–lag strategy is adopted to enhance it. Accordingly, the third motivation is to design profitable investment strategies based on the detected lead–lag effect, and then test its performance in a pure investment strategy and the proposed enhancement strategies.

To sum up, the contributions of this study are as follows: (1) The features of the lead–lag phenomenon are explored in the context of both the U.S. and Chinese stock markets. As a result, the number of stock pairs characterized by the lead–lag relationship meets the well-known power-law distribution, which offers novel evidence that the power-law distribution exists widely in the real world (Clauset et al. 2009) and specifically in the financial domain (Gabaix et al. 2003). (2) A formal definition of the lead–lag effect is provided according to the principles of statistical testing, and a detection approach is proposed based on this definition. It is worth noting that most existing studies regard the lead–lag relationship between stocks as a phenomenon (Scherbina and Schlusche 2020; Dao et al. 2018; Huth and Abergel 2014), whereas this study elevates this phenomenon into an effect. Accordingly, the lead–lag effect must be formally defined via statistical testing, which lays a foundation for future studies to compare and detect the lead–lag effect in various scenarios. The rationality and robustness of the proposed detection approach are carefully examined by determining how external variables influence the lead–lag effect. (3) A few profitable investment strategies are designed and validated based on the detected lead–lag effect, in parallel to previous design and validation studies such as Shen et al. (2017), Xiong et al. (2020), Flori and Regoli (2021), and Zhang et al. (2021). Here both the pure lead–lag strategy and the enhancement strategies report sound results regarding the functionality of the detected lead–lag effect.

The remainder of this paper is organized as follows. “Section Related work” reviews the related work to clearly delineate the aforementioned contributions; “Section Method for detecting the lead–lag effect” defines the lead–lag effect and proposes a detection methodology; “Section Main results and validation in real-world stock markets” explores the features of the lead–lag phenomenon based on a selected real-world dataset, applies the proposed detection method, and tests the method’s robustness; “Section Investment strategies based on the detected lead–lag effect” designs investment strategies and validates their performance to reveal the functionality of the detected lead–lag effect; and “Section Conclusions and future work” summarizes the study and discusses potential future work.

Related work

Our work is directly related to two fields of existing studies: the lead–lag phenomenon in stock markets and the alpha-factor strategy widely adopted in stock markets. Each field is reviewed in turn in the following subsections.

Lead–lag phenomenon in stock markets

The lead–lag phenomenon is a classic financial topic that has attracted the attention of numerous researchers (Conlon et al. 2018). First, one fundamental question has been widely examined in the literature: does the lead–lag phenomenon exist in the stock market? Generally, the lead–lag phenomenon can be observed in high-frequency data such as 5-min stock price movements. Both Jong and Nijman (1997) and Huth and Abergel (2014) deemed the lead–lag relationship an essential stylized fact at high frequencies. Fonseca and Zaatour (2017), Dao et al. (2018), Buccheri et al. (2019), Campajola et al. (2020), and many others have discussed the existence of the lead–lag phenomenon in high-frequency data, its influencing factors, or even its potential origins. However, when observed in high-frequency data, this phenomenon is often called the “lead–lag relationship” rather than the “lead–lag effect”. In most cases, the lead–lag relationship is unstable in high-frequency data. Since, according to Tóth and Kertész (2006) and Curme et al. (2015), its appearance is likely to be occasional, our work aims to formulate a new approach to finding a stable lead–lag relationship over a long time period based on statistical testing and to rename such a stable lead–lag relationship “the lead–lag effect” as an indication of its statistical significance.

Second, the existing literature explores how to take advantage of the lead–lag phenomenon in designing investment strategies for real-world stock markets. Typically, investment strategies that utilize the lead–lag phenomenon are variations on high-frequency trading strategies, which is consistent with the phenomenon being observed mainly in high-frequency data. Stübinger (2019) developed an optimal causal path algorithm and designed statistical arbitrage strategies for high-frequency data based on the lead–lag phenomenon. However, designing an investment strategy based on high-frequency data still has drawbacks. According to Krauss (2017), high-frequency trading strategies are associated with greater commission fees and a higher transaction threshold for investors.

In contrast, the stable lead–lag effect discovered in low-frequency data facilitates the practices of small and medium investors because of its ample optional trading time and low technical threshold. Scherbina and Schlusche (2020) and Gupta and Chatterjee (2020) have pointed out that the lead–lag relationship enables out-of-sample forecasting and thus helps in the design of investment strategies. From this perspective, this study can also be seen as a development of our previously published work (Li et al. 2021), which focuses on identifying the factors that cause the lead–lag phenomenon. However, this study aims to develop new investment strategies by utilizing the lead–lag phenomenon, and thus the two have different research aims. In contrast to the successive lead–lag days analyzed in our previously published work, this study considers the number of cumulative lead–lag days, which helps extend the application of the model in real-world stock markets.

According to the aforementioned literature and gaps in the existing research, we believe that it is meaningful and even necessary to study the lead–lag effect in low-frequency data for the following reasons: (1) the definition of the lead–lag effect is not unified or discussed in depth in the existing literature, and thus the underlying significance of the lead–lag phenomenon often differs despite their use of the same name; (2) traditional studies on detecting the lead–lag phenomenon are conducted using classical econometrics or empirical research methods, and thus the use of data-driven technical analysis to detect the lead–lag effect can supplement existing studies with a new perspective; and (3) building on the traditional methods of designing investment strategies by using the discovered lead–lag phenomenon, our study may identify effective signals, which will have a guiding significance for the development of investment strategy. Accordingly, this study contributes to the literature by providing a unified and solid definition of the lead–lag effect and by utilizing the lead–lag effect to design profitable trading strategies in real-world stock markets.

Alpha-factor strategy

Concerning our targeted enhancement investment strategies, the alpha-factor strategy that originated with the capital asset pricing model is chosen as the primary strategy due to its popularity and effectiveness in real-world investment practices (Sharpe 1964; Makarov and Plantin 2015). The alpha factor in the alpha-factor strategy reflects one or several stock attributes; in other words, different alpha factors reflect different stock attributes. Thus, the alpha-factor strategy consists of numerous specific strategies using various alpha factors. Since alpha factors are used as buying and selling signals in the alpha-factor strategy, the choice of factors is the core of the strategy. Generally, existing studies focus mainly on the following two types of alpha factors: value alphas and transactional alphas.

Value alphas are derived from the fundamentals of a stock and describe its value attributes. Value alphas include but are not limited to value factors (Balatti et al. 2017; Eisdorfer et al. 2019), size factors (Liu et al. 2019), growth factors (Fama and French 1998), profitability factors (Hou et al. 2015; Fama and French 2015), and momentum factors (Fama and French 2012; Berggrun et al. 2020). Based on the mature factor model, value alphas provide not only a valuable tool for stock valuation, but also a reasonable explanation for the cross-section of stock returns (Harvey et al. 2016). Accordingly, when value alphas are adopted in a strategy, it indicates that the investor cares about the value investment’s underlying factors (Fama and French 2016). In contrast to traditional value alphas, transactional alphas pay more attention to the patterns embedded in trading behaviors (Casgrain and Jaimungal 2019). Transactional alphas are obtained by means of technical analysis and derived from transaction data. With the current progress of computer science, millions of transactional alpha factors have been identified by automated algorithms. Despite the lack of a good explanation, the marginal revenue contributed by transactional alpha factors is relatively satisfactory (Kakushadze 2016), and large financial institutions favor such transactional alphas. For example, the 101 alpha factors proposed by WorldQuant and the 191 alpha factors from Guotai Junan Securities have been welcomed by many institutions and investors.

The alpha-factor strategy is always used for stock selection. The proposed lead–lag strategy in our work helps allocate the weight of each selected stock in an investment portfolio. Therefore, it is convenient to combine the two strategies when designing an enhancement strategy. Since the alpha-factor strategy includes numerous specific models with various alpha factors, we select it as the primary strategy to demonstrate representativeness. The lead–lag effect falls into the category of technology-driven analysis and therefore resonates with transactional alphas, which are determined by technical analysis.

For this reason, it would be more natural to combine the lead–lag trading strategy with transactional alphas. Accordingly, this study focuses mainly on transactional alpha-factor strategies, regarded as the basic strategies when designing enhancement investment strategies. Our work exploits the great potential of the existing alpha factors and provides a framework for enhancement strategies by integrating the lead–lag effect into the existing alpha-factor strategies.

Method for detecting the lead–lag effect

The daily lead–lag network

Let \(r_{i,t}\) denote the yield rate of stock i on day t. Its mathematical expression is as follows:

$$r_{i,t} = \frac{{p_{i,t} - p_{i,t - 1} }}{{p_{i,t - 1} }},$$
(1)

where \(p_{i,t}\) denotes the closing price of stock i on day t. Here, the adopted stock price is the rights-adjusted (restored) price rather than the ex-rights price. If stock i is suspended on day t, then both \(r_{i,t}\) and \(r_{i,t+1}\) are set to “NAN.” Next, given a manufactured threshold Δ (0 ≤ Δ < 1), the condition under which stock i follows stock j on day t is defined as follows:

Definition 1

The conditions for forming a lead–lag link. Stock i follows stock j on day t if and only if the following condition holds:

$$\left\{ {\begin{array}{*{20}c} {(1 - \Delta )r_{j,t - 1} \le r_{i,t} \le (1 + \Delta )r_{j,t - 1} ,{\text{ when }}r_{i,t} \ge 0;} \\ {(1 + \Delta )r_{j,t - 1} \le r_{i,t} \le (1 - \Delta )r_{j,t - 1} ,{\text{ when }}r_{i,t} < 0.} \\ \end{array} } \right.$$
(2)

Definition 1 states that if the yield rate of stock i on day t lies within a relative band of width Δ around the yield rate of stock j on day t–1, stock i is judged to follow stock j on day t. Further, let \(G_t\) denote the lead–lag network on day t, whose element \(g_{ij,t}\) indicates whether stock i follows stock j on day t. If stock i is judged as the follower of stock j on day t according to Definition 1, then \(g_{ij,t} = 1\); otherwise, \(g_{ij,t} = 0\). Our model allows one stock to follow itself, and thus \(g_{ii,t} = 1\) is possible. Then, given the closing prices of all concerned stocks over T + 1 sequential trading days, we obtain T lead–lag networks according to Definition 1.
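The construction of a daily network from Eq. (1) and Definition 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`returns`, `follows`, `daily_network`) and the use of `None` to mark a suspended day are assumptions made for the example.

```python
import math

def returns(prices):
    """Yield rates per Eq. (1); a missing price (None) makes the adjacent rates NaN."""
    out = []
    for prev, cur in zip(prices, prices[1:]):
        if prev is None or cur is None:
            out.append(math.nan)  # suspension on day t voids r_t and r_{t+1}
        else:
            out.append((cur - prev) / prev)
    return out

def follows(r_it, r_j_prev, delta):
    """Condition of Eq. (2): does stock i (today) follow stock j (yesterday)?"""
    if math.isnan(r_it) or math.isnan(r_j_prev):
        return False
    lo, hi = (1 - delta) * r_j_prev, (1 + delta) * r_j_prev
    if r_it >= 0:
        return lo <= r_it <= hi
    return hi <= r_it <= lo  # the bounds swap order for negative yield rates

def daily_network(r_today, r_yesterday, delta):
    """Matrix g with g[i][j] = 1 if stock i follows stock j, else 0."""
    n = len(r_today)
    return [[1 if follows(r_today[i], r_yesterday[j], delta) else 0
             for j in range(n)] for i in range(n)]

# Illustrative yield rates for three stocks on two successive days.
r_yesterday = [0.010, -0.020, 0.030]
r_today = [0.011, 0.029, -0.018]
g = daily_network(r_today, r_yesterday, delta=0.20)
```

In this toy input, stock 0 follows itself, stock 1 follows stock 2, and stock 2 follows stock 1, which shows that self-links and mutual links are both admissible under Definition 1.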

During the targeted period (i.e., T + 1 trading days in total), the obtained T lead–lag networks tell us on how many days in total stock i follows stock j. Formally, let \(d_{ij}\) denote the number of accumulated days on which stock i follows stock j during the targeted period, which can be calculated as follows:

$$d_{ij} = \sum\limits_{t = 1}^{T} {g_{ij,t} } .$$
(3)

\(G_t\) is an asymmetric matrix in most cases, so \(d_{ij}\) is generally not equal to \(d_{ji}\).
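Eq. (3) amounts to an element-wise sum of the daily networks. The short sketch below illustrates this with three hypothetical 2 × 2 daily networks; the function name `accumulate` and the toy matrices are assumptions for the example only.

```python
def accumulate(daily_networks):
    """Element-wise sum of daily 0/1 matrices, i.e. d[i][j] of Eq. (3)."""
    n = len(daily_networks[0])
    d = [[0] * n for _ in range(n)]
    for g in daily_networks:
        for i in range(n):
            for j in range(n):
                d[i][j] += g[i][j]
    return d

# Three illustrative daily networks for two stocks (rows: follower, columns: leader).
days = [
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
    [[1, 1], [0, 0]],
]
d = accumulate(days)
```

Here stock 0 follows stock 1 on all three days while stock 1 follows stock 0 on one day only, so the accumulated matrix is not symmetric.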

Concerning the manufactured threshold Δ, a larger Δ will cause the daily lead–lag networks to have more directed links than a smaller Δ; thus, the threshold Δ affects network density. Because it is an artificial variable, we will explore how it affects the results and check whether our method is robust under different threshold values in Sect. 4.2.1. The mainstream literature defining the relationship between stock pairs, such as Huang et al. (2009), Kumar and Deo (2012), Peralta and Zareei (2016), Xia et al. (2018), Deev and Lyócsa (2020), and many others, has often adopted the correlation coefficient. Note that most of this literature pools data from a defined period to calculate the correlation coefficient, whereas our study uses daily data to define each day’s lead–lag relationship between stock pairs. This difference in how the data are used (a day-by-day definition of the lead–lag relationship versus a single correlation coefficient calculated over the selected period) is one of the distinctions between our study and the existing literature.

Definition and detection of the lead–lag effect

As explained in the introduction, when \(d_{ij}\) (defined in Eq. (3)) is large enough to be significantly distinct from the value obtained under random link formation, we tend to believe that the lead–lag effect from stock j to stock i holds, where stock j is the leader and stock i is the follower. However, the criterion for judging whether \(d_{ij}\) is sufficiently large should be determined before formally defining the lead–lag effect. Fortunately, statistical testing enables us to formulate this criterion. The null hypothesis is set to “all the links in the daily lead–lag networks are randomly formed,” which allows us to obtain the distribution of the accumulated days of all stock pairs. Then, given the statistical significance level (e.g., 0.10, 0.05, or 0.01), the criterion can be read directly from the obtained distribution. To clarify, let \(\hat{d}\) denote the criterion. The meaning of the term “lead–lag effect” is provided in Definition 2.

Definition 2

Lead–lag effect. Based on the calculated \(\hat{d}\), for any pair of stocks (e.g., i and j), if the \(d_{ij}\) obtained from Eq. (3) satisfies \(d_{ij} \ge \hat{d}\), the lead–lag effect from stock j to stock i is judged to hold. If \(d_{ii} \ge \hat{d}\), the lead–lag effect from stock i to itself is judged to hold.

Note that the principles of statistical testing imply that it is almost impossible for a rare event to occur in one random trial. Given the null hypothesis and the statistical significance level, the criterion for judging whether an event is rare or not can be achieved. Then, if a rare event occurs in the analyzed real-world data, we can reject the null hypothesis under the given statistical significance level, or we can determine that the rare event has a statistically significant effect. In fact, few studies have formally defined the lead–lag effect. As mentioned in the Related Work section, the lead–lag phenomenon or relationship was more often examined in the existing literature rather than the lead–lag effect. We detected the lead–lag effect via formal statistical tests and null reference networks, but the existing literature adopted different approaches to detecting the lead–lag relationship, such as the Granger test (Scherbina and Schlusche 2020; Zeng and Atta Mills 2021) and the optimal causal path algorithm (Jiang et al. 2019). Accordingly, the approach adopted leads to different definitions, so our definition is new in this field.

Table 1 shows the detailed process of achieving criterion \(\hat{d}\). Random networks are first generated to achieve the distribution of the accumulated days between all stock pairs under the null hypothesis. Then, criterion \(\hat{d}\) can be obtained given the statistical significance level in Step (1). Here, we refer to the configuration model proposed by Newman et al. (2001). The generated random networks retain the characteristics of the daily network as much as possible. Although the network indicator of the real-world lead–lag network changes each day, the adopted configuration model guarantees that each day’s random network and the same day’s real-world network share an almost identical node degree distribution, which is superior to the model that retains only the same edge number. Next, the statistical significance level δ is set to 0.001 since a lower significance level means a more rigorous criterion for determining the lead–lag effect. Once output \(\hat{d}\) is achieved based on the process shown in Table 1, Definition 2 directly judges which stock pair features the lead–lag effect. Hereafter, the stock pairs detected with the lead–lag effect are called “lead–lag stock pairs.”

Table 1 The process of achieving the criterion \(\hat{d}\)
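The general idea behind the criterion (randomize each day's network while roughly preserving node degrees, accumulate the random networks, repeat over many groups, and read off a tail quantile) can be sketched as follows. The steps of Table 1 are not reproduced here, so this is only an illustration under stated assumptions: the stub-shuffling routine is a simple stand-in for the configuration model, duplicate links are collapsed (so degrees are preserved only approximately), and all function names are hypothetical.

```python
import random
from collections import Counter

def randomize_network(g, rng):
    """Re-pair follower and leader stubs at random (assumed stand-in for the
    configuration model); duplicate pairs collapse into a single link."""
    n = len(g)
    followers = [i for i in range(n) for j in range(n) if g[i][j]]
    leaders = [j for i in range(n) for j in range(n) if g[i][j]]
    rng.shuffle(leaders)
    out = [[0] * n for _ in range(n)]
    for i, j in zip(followers, leaders):
        out[i][j] = 1
    return out

def criterion(daily_networks, delta_sig=0.001, groups=500, seed=0):
    """Smallest d_hat such that P(d >= d_hat) <= delta_sig under the null."""
    rng = random.Random(seed)
    n = len(daily_networks[0])
    counts = Counter()
    for _ in range(groups):
        d = [[0] * n for _ in range(n)]
        for g in daily_networks:
            r = randomize_network(g, rng)
            for i in range(n):
                for j in range(n):
                    d[i][j] += r[i][j]
        counts.update(v for row in d for v in row)
    total = sum(counts.values())
    tail = 0
    for value in sorted(counts, reverse=True):
        tail += counts[value]          # number of simulated pairs with d >= value
        if tail / total > delta_sig:   # value is not rare enough under the null
            return value + 1
    return min(counts)                 # only reached if delta_sig >= 1

def detect_pairs(d, d_hat):
    """(leader, follower) pairs whose accumulated days reach the criterion."""
    n = len(d)
    return [(j, i) for i in range(n) for j in range(n) if d[i][j] >= d_hat]
```

A stricter (smaller) significance level pushes the returned criterion upward, so fewer stock pairs pass the test in Definition 2.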

Example

A simple example is presented to show how the proposed detection method works. This example analyzes the closing price of 10 stocks on 11 sequential trading days, and then obtains 10 daily lead–lag networks using Eq. (2). As displayed in Fig. 1, each node represents one stock, and the direction of the link points from the leader to the follower. The color of the node distinguishes between differences in the node out-degree (i.e., the number of followers): the more significant the out-degree, the darker the color.

Fig. 1 Graphs of 10 daily lead–lag networks

Following Steps (1) and (2) in Table 1, one group of simulations yields the 10 sequential random networks shown in Fig. 2. Here, each day’s random network retains the node degree distribution of the same day’s real-world lead–lag network, which can be checked by comparing the counterparts in Figs. 1 and 2.

Fig. 2 Graphs of 10 sequential random networks

Then, by conducting 500 groups of simulations according to Step (3) in Table 1, the distribution of the accumulated days is obtained and displayed in Fig. 3. When the statistical significance level (\(\delta\)) is set to 0.001, the criterion \(\hat{d}\) is equal to 6, based on the distribution in Fig. 3. Accordingly, the detected leader–follower pairs are 3 → 4, 4 → 5, and 6 → 7.

Fig. 3 Distribution of the accumulated following days via 500 groups of simulations

Main results and validation in real-world stock markets

To move from the previous simple example to real-world stock markets, this section selects the stock markets of mainland China and the U.S. as the targets of analysis. It applies the proposed method to detect which stock pairs are characterized by the defined lead–lag effect and explores how the man-made variables embedded in the detection method affect the results. The following subsections introduce the process of data selection, report the main results in different stock markets, and discuss these findings.

Data preparation and main statistical results

Data preparation

Two stock sets are selected for the application and validation of the proposed method. One is the set of 300 stocks contained in the China Securities Index 300 (CSI 300), considering that these 300 stocks are the most liquid stocks in mainland China’s A-share stock market; therefore, they are often used to reflect its overall performance. The other is the set of stocks included in Standard & Poor’s 500 Index (S&P 500), which helps us understand the proposed method’s performance in the U.S. stock market. Note that the stocks in the CSI 300 and the S&P 500 are not permanent, although adjustments to the stock set are quite infrequent.

The closing price of each stock in the two selected stock sets is collected on each trading day between January 1, 2010, and December 31, 2019 (i.e., over 10 years). The data were obtained from the Compustat database at https://wrds-www.wharton.upenn.edu/; each year has an average of 250 trading days. The stocks featured in each stock set change over time because new stocks were added and others were removed during the chosen period. Almost every trading day witnessed stock suspensions for various reasons or rules, and thus the size of the daily lead–lag network fluctuates. As shown in Fig. 4, different stock sets feature different overall directed lead–lag networks in terms of their diameter (DM), density (DS), average path length, average node degree (ND), and clustering coefficient.

Fig. 4 The lead–lag networks achieved in the two selected stock sets (Δ = 0.20)

Recalling Eqs. (1) and (2) in Sect. 3.1, the daily lead–lag networks can be immediately obtained in each stock set based on the above-prepared data once the man-made threshold Δ is given. Here and hereafter, Δ = 0.20 unless stated otherwise. The upper part of Fig. 4 displays the lead–lag networks obtained in each stock set on December 12, 2019. The overall lead–lag network can be obtained in each stock set by summing up each day’s lead–lag network. The bottom part of Fig. 4 shows the overall lead–lag network of each stock set during the entire period; the link thickness is proportional to the number of cumulative days on which one stock followed the other in this directed link.

To display more results under different values of the man-made threshold, Δ, Fig. 5 shows the achieved lead–lag networks in the two selected stock sets when Δ = 0.10. By comparing Figs. 4 and 5, we find that different values of Δ cause only slight changes in the overall lead–lag networks and their corresponding indicators in both markets, except that the average ND decreases with a decrease in Δ. However, the change in Δ has a significant impact on the daily networks of both markets because the daily network is not as robust as the overall network.

Fig. 5 The lead–lag networks achieved in the two selected stock sets (Δ = 0.10)

Power-law distribution

Before formally describing the detailed analysis, we will first recall basic knowledge about the random network, the scale-free network, and the power-law distribution that is often seen in the fields of complexity science and network analysis. First, a random network indicates that the links in the network are randomly formed; in other words, the links are generated with a given probability (Barabási and Albert 1999). The random network is often used as a testable null hypothesis about network structure (Volz 2004). Its link distribution is thin-tailed, and our work follows this idea. In contrast to a random network, a scale-free network refers to one with a degree distribution that meets the power law, at least asymptotically (Barabási and Bonabeau 2003). Roughly speaking, the distribution discrepancy between a random network and a scale-free network often originates in human activity, that is, human activity causes the change from a thin-tailed to a heavy-tailed distribution (represented by the power-law distribution). In addition, human activity also makes the power-law distribution more prevalent and special in the field of complexity science, and even the power-law distribution is viewed as a signature of complexity by noting that such a distribution can reflect the underlying pattern of a complex process (Rickles 2011). Our study considers the function of human activity in the stock market and thus tests the power-law distribution as stated below.

Although the overall lead–lag network of each stock set is unique, we wonder whether some identical patterns exist for different stock sets. If they do, we can call the discovered identical pattern a feature, because different stock sets do not alter the features embedded in the lead–lag phenomenon. To answer this question, we focus on the link thickness displayed in Fig. 4 and examine its distribution. By the meaning of link thickness, its distribution is the same as that of the variable \(d_{ij}\) defined in Eq. (3). Figure 6 displays the distribution in each stock set using Δ = 0.20. As displayed in Fig. 6, the points in the tail of each distribution lie almost on a line in the log–log coordinates (i.e., a feature of the power-law distribution), indicating that the tested distribution is quite likely to meet the power-law distribution.

Fig. 6 Distribution of the link thickness in each stock set as well as the test results

Next, according to the mainstream testing methods used in the existing literature (Clauset et al. 2009; Malevergne et al. 2011; Toda 2012) to verify the power-law distribution, we apply three methods to obtain sound results: the Kolmogorov–Smirnov Test (K–S), the Kuiper Test (Kuiper 1960), and the Anderson–Darling test (A–D) (Scholz and Stephens 1987; Coronel-Brizio and Hernández-Montoya 2010). Recalling Eq. (2), the manufactured threshold, Δ, affects the achieved daily lead–lag networks as well as the overall lead–lag networks in both markets. Here, we test whether the power-law distribution holds under different values of Δ. Table 2 shows the results: none of the three methods rejects the null hypothesis that “the data meets the power-law distribution” at the statistical significance level of 0.05. Therefore, we believe that the power-law distribution can be regarded as a stable pattern underlying the lead–lag phenomenon.

Table 2 The results of the power-law distribution test under different manufactured thresholds, Δ
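To make the K–S comparison concrete, the following sketch fits a discrete power law to the tail of a sample and computes the K–S distance between the empirical and fitted distributions. It is not the authors' testing code: it assumes the discrete power-law model and the approximate maximum-likelihood exponent described by Clauset et al. (2009), the lower cutoff `xmin` is supplied by hand, and the helper names are hypothetical. It returns only the distance; turning it into a p-value needs an additional step (for example, a bootstrap as in Clauset et al. 2009) that is not shown.

```python
import math

def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate sum_{k>=0} (k + q)^(-alpha): partial sum plus an
    Euler-Maclaurin estimate (integral and half-term) of the remainder."""
    head = sum((k + q) ** -alpha for k in range(terms))
    m = terms + q
    return head + m ** (1 - alpha) / (alpha - 1) + 0.5 * m ** -alpha

def fit_alpha(data, xmin):
    """Approximate MLE of the exponent for integer data with x >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / (xmin - 0.5)) for x in tail)

def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the CDF of a
    discrete power law with exponent alpha and lower cutoff xmin."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    norm = hurwitz_zeta(alpha, xmin)
    worst, idx = 0.0, 0
    for x in range(xmin, tail[-1] + 1):
        while idx < n and tail[idx] <= x:
            idx += 1
        empirical = idx / n                               # P(X <= x) in the data
        model = 1.0 - hurwitz_zeta(alpha, x + 1) / norm   # P(X <= x) in the model
        worst = max(worst, abs(empirical - model))
    return worst

# Illustrative accumulated-day counts (not data from the paper).
sample = [2, 2, 2, 2, 3, 3, 4, 5, 8]
alpha_hat = fit_alpha(sample, xmin=2)
distance = ks_distance(sample, xmin=2, alpha=alpha_hat)
```

A small distance means the fitted power law tracks the empirical tail closely; the approximate exponent formula is known to be less accurate when `xmin` is very small.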

In addition, we conduct further tests to exclude other possible distributions and provide more evidence supporting the discovered power-law distribution. As both markets show steep decays in the log–log coordinates in Fig. 6, two candidate discrete, thin-tailed distributions, namely the Poisson distribution and the binomial distribution, are estimated and tested using the three testing methods. To make our results sound, we vary the value of Δ to test the sensitivity of the results to this manufactured parameter. The results for the two tested distributions are shown in Tables 3 and 4. When the statistical significance level is 0.05, the two distributions are rejected in both markets in most cases, although several exceptions exist for the binomial distribution at Δ = 0.15 under the Kuiper Test. In summary, these results provide more evidence that the verified distribution is likely to meet the power-law distribution.

Table 3 The results of the Poisson distribution test under different manufactured thresholds, Δ
Table 4 The results of the binomial distribution test under different manufactured thresholds, Δ

Based on these findings, we now address why the discovered power-law distribution is important in our work. Our proposed detection approach is more meaningful when facing a power-law distribution than a thin-tailed distribution because very few stock pairs (a negligible number) can be detected as having a lead–lag effect under a thin-tailed distribution, whereas the power-law distribution guarantees that a considerable number of stock pairs may be detected. As expected, more detected stock pairs imply more opportunities to utilize the information contained in the lead–lag effect to improve earnings, which lays a foundation for designing more profitable investment strategies.

Main results and validation

Recall that two manufactured variables in the proposed detection approach affect the detection results: the threshold Δ and the period ζ. As we have explained, the threshold, Δ, influences the daily lead–lag networks. The period, ζ, is also an influencing factor because the predictability is likely to differ when different periods are chosen. The following two subsections explore how these variables affect the detection results. These findings also partially answer questions related to the model’s robustness and the predictability of the results.

Detection results as a function of Δ

Recalling Eq. (2), the manufactured threshold Δ affects link formation in a daily lead–lag network and thereby the distribution of the variables dij (see Eq. (3) or Fig. 6). This subsection focuses on how Δ affects this distribution. If the distributions obtained under different values of Δ differ significantly, the output of our model is sensitive to Δ (in other words, not robust), and vice versa. To this end, DD(Δi, Δj) is defined in Eq. (4), following the K–S test (Massey 1951), to measure the difference between two distributions:

$$DD(\Delta_{i} ,\Delta_{j} ) = \mathop {\max }\limits_{d} \left| {cdf(d;\Delta_{i} ) - cdf(d;\Delta_{j} )} \right|.$$
(4)

where cdf(d; Δi) and cdf(d; Δj) denote the cumulative distribution functions of d under thresholds Δi and Δj, respectively. Because the measurement defined in Eq. (4) is a K–S statistic, the K–S test can be conducted to check whether the difference is significant. Considering different combinations of Δi and Δj, Tables 5 and 6 report the statistic DD(Δi, Δj) of each combination and its corresponding p value using the K–S test.
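The statistic in Eq. (4) can be sketched in a few lines when the two distributions are represented by samples of accumulated lead–lag days. This is an illustrative sketch only: the function names and the two small samples are hypothetical, and the empirical step-function CDF is assumed as the estimator of cdf(d; Δ).

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return a function d -> share of observations that are <= d."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda d: bisect_right(ordered, d) / n


def distribution_difference(sample_i, sample_j):
    """DD of Eq. (4): the largest absolute gap between two empirical CDFs."""
    cdf_i = empirical_cdf(sample_i)
    cdf_j = empirical_cdf(sample_j)
    # Step functions only change at observed values, so checking those suffices.
    support = set(sample_i) | set(sample_j)
    return max(abs(cdf_i(d) - cdf_j(d)) for d in support)


# Hypothetical accumulated lead-lag days under two thresholds.
days_under_delta_i = [1, 1, 1, 2, 2, 3, 5, 8]
days_under_delta_j = [1, 1, 2, 2, 3, 3, 5, 9]

dd = distribution_difference(days_under_delta_i, days_under_delta_j)
print(dd)
```

Because both CDFs are step functions, the maximum over d is always attained at one of the observed values, which is why the search is restricted to the pooled support.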

Table 5 Robustness results in CSI 300
Table 6 Robustness results in S&P 500

The numbers in bold in Tables 5 and 6 indicate that the difference between the two distributions is not significant at the significance level of 0.05. When |Δi − Δj| ≤ 10%, none of the distribution differences are significant, implying robustness as long as the two threshold values do not deviate too much. Moreover, not surprisingly, DD(Δi, Δj) increases with |Δi − Δj| in all combinations in the two stock markets, and, even when the deviation between the two threshold values is as great as 20%, the distribution difference under some combinations remains insignificant. Overall, the obtained distributions are robust in that they are not very sensitive to the parameter Δ.

Detection results as a function of ζ

Before discussing the function of ζ, we first focus on the prediction task: the detected leader during period ζ serves as a signal, and the price movements of the detected followers act as the predicted target. Specifically, if leader stock i and its follower stock j form one of the detected lead–lag stock pairs during the given period ζ (i.e., ζ months), the price movement of stock j on day t can be inferred from that of stock i on day t − 1. Then, we compare the real price movement of stock j with the movement predicted by its leader i on each trading day in the targeted month; thus, the prediction accuracy of the month can be calculated. To simplify the problem, we use 1, − 1, and 0 to denote the three price movements without considering their magnitude. In addition, if one follower has multiple leaders, the movement direction of the follower is determined by the majority of the leaders. When half of the leaders move up and half move down, the movement of the follower is predicted to be 0. Finally, by averaging all followers’ prediction accuracy, we obtain the performance of the prediction task in the targeted month. The detailed process of the prediction task is displayed in Fig. 7.

Fig. 7 The detailed process of the prediction task
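The prediction task described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names and the toy return series are hypothetical, and one assumption is added, namely that a leader with a zero price movement abstains from the majority vote (the text only specifies the tie between up and down moves).

```python
def sign(x):
    """Map a return to the movement code 1, -1, or 0."""
    return (x > 0) - (x < 0)


def predict_follower(leader_returns_prev_day):
    """Majority vote of the leaders' movements on day t - 1; a tie gives 0."""
    votes = sum(sign(r) for r in leader_returns_prev_day)
    return sign(votes)


def follower_accuracy(leader_series, follower_series):
    """Share of days on which the predicted movement equals the real one."""
    hits = 0
    for t in range(1, len(follower_series)):
        predicted = predict_follower([series[t - 1] for series in leader_series])
        hits += predicted == sign(follower_series[t])
    return hits / (len(follower_series) - 1)


def overall_accuracy(pairs):
    """Average the accuracy over all followers, as in the monthly measure."""
    scores = [follower_accuracy(leaders, follower) for leaders, follower in pairs]
    return sum(scores) / len(scores)


# Toy daily returns for two leaders and one follower (illustrative values).
leader_a = [0.01, -0.02, 0.03, 0.00]
leader_b = [0.02, 0.01, -0.01, 0.01]
follower = [0.00, 0.015, -0.01, 0.00]

accuracy = follower_accuracy([leader_a, leader_b], follower)
print(accuracy)
```

In the toy data the leaders agree on the first day and split on the next two, so the follower is predicted to move up once and to stay flat twice.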

Note that the detected lead–lag stock pairs depend on the variable ζ. Thus, this subsection explores the value of ζ that achieves the best prediction performance, measured by the overall accuracy shown in Fig. 7. On the one hand, the answer reveals the effect of ζ on the detection results and on the accuracy of the simple prediction task, laying a foundation for designing profitable investment strategies; on the other hand, it shows how much information is contained in the detected lead–lag stock pairs, even though the prediction task is quite simple. If the mean overall prediction accuracy is, as expected, significantly greater than 50% (i.e., a random guess), we tend to believe that the detected lead–lag stock pairs contain valuable information; a higher value means that they will be more helpful in designing profitable investment strategies in practice. Otherwise, we should consider how to better utilize and mine the information contained in the detection results.

Following the prediction task, Fig. 8 displays prediction accuracy under different values of ζ in each selected stock set. Here, the box plot under each value of ζ is achieved by 120 accuracy values, that is, the set of the overall accuracy obtained for each month (for prediction, as displayed in Fig. 7) over the 10 years between January 2010 and December 2019. As shown in Fig. 8, the medians of overall accuracy under different values of ζ are only a little higher than 0.50 for the CSI 300 and much higher than 0.50 for the S&P 500. Accordingly, the one-sample t-test is needed, especially for the CSI 300, to check whether the mean values of overall accuracy are significantly higher than 0.50 for every value of ζ, as stated in the previous paragraph. To this end, the results are listed in Table 7, showing that all the mean values are significantly higher than 0.50, at least under the significance level of 0.10, regardless of the stock set and the value of ζ.

Fig. 8 Prediction accuracy under different values of ζ in each stock set

Table 7 Results of one-sample t-tests comparing the mean value to 0.50 in the two stock sets
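For reference, the one-sample t statistic behind a comparison such as the one in Table 7 can be computed as sketched below. The six accuracy values are hypothetical placeholders (the study uses 120 monthly values per setting), and the function name is illustrative.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0=0.5):
    """t statistic for H0: the population mean equals mu0."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)  # stdev uses n - 1
    return (mean(values) - mu0) / standard_error


# Hypothetical monthly overall accuracies (not taken from the paper).
monthly_accuracy = [0.52, 0.49, 0.55, 0.51, 0.53, 0.50]

t_stat = one_sample_t(monthly_accuracy)
print(round(t_stat, 4))
```

The statistic is then compared with the one-sided critical value of Student's t distribution with n − 1 degrees of freedom at the chosen significance level.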

Combining the results reported in Fig. 8 and Table 7, we find that the information contained in the detected lead–lag stock pairs helps in designing profitable investment strategies. In addition, the performance is robust to the manufactured variable ζ: the discrepancy between the highest and lowest mean accuracy values is within 2% in both stock sets. Furthermore, the accuracies achieved in the S&P 500 are all higher than those in the CSI 300, implying that the detected lead–lag stock pairs will be much more beneficial in the S&P 500, which will be validated in the section “Investment strategies based on the detected lead–lag effect”.

In addition, following the prediction task, different combinations of the two parameters (i.e., Δ and ζ) yield varying accuracies. More importantly, knowing which combination performs best is useful for selecting parameters when designing investment strategies (see the next section). The two heat maps displayed in Fig. 9 show the results for each stock set. According to Fig. 9, when Δ is fixed, the prediction accuracy first increases and then decreases with an increase in ζ in most cases. When ζ is fixed, the prediction accuracy increases with a decline in Δ on average, although there are some exceptions. All the achieved accuracies are greater than 50%, demonstrating that the detected lead–lag stock pairs are helpful, even with the simple prediction task. Interestingly and more importantly, the best accuracy is achieved with the same parameter combination in the two stock sets; thus, the combination of Δ = 0.10 and ζ = 4 leads to the most profitable lead–lag stock pairs and is adopted to design a more complicated investment strategy in the next section.

Fig. 9 Prediction accuracy under different combinations of two parameters in each stock set

Investment strategies based on the detected lead–lag effect

The simple prediction task, described in Sect. 4.2.2, can be regarded as one of the most straightforward investment strategies because it only considers the direction of the predicted price movement without considering the trading details. In addition, the simple prediction task lays a foundation for designing more complicated investment strategies by revealing that the detected lead–lag stock pairs will yield more profitable information when Δ = 0.10 and ζ = 4. Based on the achieved parameter combination, this section extends the aforementioned simple prediction task into two types of practical investment strategies: the so-called “pure lead–lag strategy” and “enhancement strategies,” which are determined by integrating the pure lead–lag strategy into well-known alpha-factor strategies. The following subsections will first present the two strategies designed in this study and then report their performance to guide investors’ practices in real-world stock markets.

Pure lead–lag strategy

Our designed pure lead–lag strategy consists of three main steps based on the detected lead–lag stock pairs. The steps are listed in detail below.

Step 1: Calculate the strength of the influence of the leader on the follower.

A bipartite graph model is adopted to depict the leaders, the followers, and their relationships, where the sets of leaders and followers are denoted as N and M, respectively. For any \(p \in N\) and \(q \in M\), dpq denotes the number of accumulated days on which stock q follows stock p during the analyzed period (see Eq. (3)). Then, let spq represent the influence strength of leader p on follower q with the following mathematical expression:

$$s_{pq} = \frac{{d_{pq} }}{{\sum\nolimits_{p \in N,q \in M} {d_{pq} } }}.$$
(5)

According to Eq. (5), a greater number of accumulated days indicates a stronger influence. Once the detected bipartite graph is determined, the strength of the influence of all detected lead–lag stock pairs is also determined.

Step 2: Calculate each day’s accumulated influence on the follower.

Let wq,t reflect the accumulated influence of the leader set on follower q on day t, which can be calculated as follows:

$$w_{q,t} = \sum\limits_{p \in N} {r_{p,t} s_{pq} } ,$$
(6)

where rp,t is the yield rate of stock p on day t according to Eq. (1). Then, similar to Eq. (5), normalizing the calculated wq,t achieves the following ratio variable vq,t, which helps to determine the holding position of the follower stock q on day t. Equation (7) shows the specific expression for vq,t as follows:

$$v_{q,t} = \frac{{w_{q,t} }}{{\sum\nolimits_{q \in M} {w_{q,t} } }}.$$
(7)

Step 3: Adjust the holding position of follower stock q based on vq,t.

At the end of trading day t, the holding positions of all follower stocks can be adjusted based on the calculated vq,t. Here, we assume that our adjustments can be instantly completed according to each follower’s market price at the closing of the stock market that day. Let Ct denote total assets just before adjusting stock positions on trading day t; generally, Ct contains the holding stocks and currencies. Taking follower stock q as a representative example, the rule for adjusting stock positions is as follows: (1) when vq,t > 0, the market value of stock q held in hand is adjusted to vq,tCt through buying or selling, where the market value is measured at trading time; (2) when vq,t ≤ 0, the amount of stock q held in hand should be adjusted to 0.
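The three steps above can be sketched as follows, using Eqs. (5)–(7) and the position rule of Step 3. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the stock labels, the day counts, the returns, and the total assets are hypothetical, and returning zero ratios when the denominator of Eq. (7) is zero is an added assumption that the text does not discuss.

```python
def influence_strength(d):
    """Eq. (5): accumulated days of each pair divided by the total over all pairs."""
    total = sum(d.values())
    return {pair: days / total for pair, days in d.items()}


def accumulated_influence(s, leader_returns):
    """Eq. (6): sum over leaders p of r_{p,t} * s_{pq}, for each follower q."""
    w = {}
    for (p, q), strength in s.items():
        w[q] = w.get(q, 0.0) + leader_returns[p] * strength
    return w


def holding_ratio(w):
    """Eq. (7): normalize w_{q,t} by its sum over all followers."""
    total = sum(w.values())
    if total == 0:
        # Assumption: hold nothing when the denominator vanishes.
        return {q: 0.0 for q in w}
    return {q: value / total for q, value in w.items()}


def target_positions(v, total_assets):
    """Step 3: hold v * C_t of stock q when v > 0, otherwise hold none."""
    return {q: ratio * total_assets if ratio > 0 else 0.0 for q, ratio in v.items()}


# Hypothetical accumulated lead-lag days d_pq for leaders A, B and followers X, Y.
d = {("A", "X"): 30, ("A", "Y"): 10, ("B", "X"): 20, ("B", "Y"): 40}
returns_today = {"A": 0.03, "B": -0.01}  # illustrative r_{p,t}

s = influence_strength(d)
w = accumulated_influence(s, returns_today)
v = holding_ratio(w)
positions = target_positions(v, total_assets=1_000_000)
print(v, positions)
```

In this toy case follower Y receives a negative accumulated influence and is sold out. Because the ratios in Eq. (7) sum to 1, a negative entry pushes the remaining ratios above 1, so how such cases are capped in practice is left open here.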

Enhancement strategy

As mentioned in Sect. 2.2, the alpha-factor strategy selects stocks by calculating and ranking the value of the adopted alpha factor. Interestingly, the pure lead–lag strategy provides the selected stocks and the buy-and-sell signal. Naturally, the buy-and-sell signal can be adapted to the stock sets selected by both the alpha-factor and lead–lag strategies. As a result, the enhancement strategy can be designed by combining the buy-and-sell signal and stock selection, as explained previously. In addition, the trading framework of the commonly used alpha-factor strategy requires that the calculated value of the concerned alpha-factor should be updated each month. In practice, the value of the concerned alpha factor is calculated at the end of each month based on the technical data of that month, and then stock selection and trades are immediately made. Therefore, the trading day in the pure lead–lag strategy is identical to the alpha-factor strategy. The two strategies are coherent in terms of trading time when forming the enhancement strategy.

There are a total of four steps to conducting the enhancement strategy. The first step is to detect the lead–lag stock pairs based on the preceding work of this study and then determine the set of follower stocks, denoted as Q. The second step is to choose one alpha factor and then calculate the alpha value of each stock in set Q. Without any loss of generality, let αq denote the calculated alpha value of stock q for any stock \(q \in Q\). The third step is to achieve the variable value vq (for any \(q \in Q\), hereafter) by following the first two steps in the pure lead–lag strategy as stated in Sect. 5.1. The last step is to provide the trading rules based on αq and vq achieved in the foregoing steps and then use the rules to adjust the holding position of stock q.

Here, we take our designed alpha-01 as an example to describe the aforementioned steps of the enhancement strategy to present them more clearly. We assume that the first step has been conducted and designated the follower stock set Q. Then, according to the second step, Table 8 describes the detailed process of calculating the alpha-01 value of each stock belonging to set Q.

Table 8 The process of calculating the alpha-01 value of each stock in set Q in month T

Before conducting the steps of the proposed enhancement strategy, we first pay attention to the achieved \(\alpha_{i}^{01}\). On the one hand, if CORRi is low among all stocks, RKi will be a large number and thus \(\alpha_{i}^{01}\) will be high. In other words, if one stock’s opening price is not consistent with its trading volume in the analyzed month, the calculated alpha-01 value of this stock will be high. On the other hand, Eq. (8) guarantees that the calculated alpha-01 value ranges between − 0.5 and 0.5; half of the stocks traded on the stock market have a positive alpha-01 value. Note that different alpha factors have different calculation processes. Our study adopts 20 different alpha factors by following Kakushadze (2016) to ensure that our results are sound. Their detailed calculation processes are presented in Appendix 1.

Performance

This section aims to validate the performance of our proposed lead–lag strategy and test whether this strategy improves the performance of the pure alpha-factor strategies in the formed enhancement strategy. As in the work of Stübinger (2019), the trading cost is set to 0.25%, and the naïve buy-and-hold investment strategy (MKT) is chosen as the benchmark. In addition, we choose the upper 5% daily return rate, or the Sharpe ratio, of a series of random investment operations as another benchmark, where a random investment operation means buying one stock on a random day and selling it on a later random day. With the aim of obtaining sound results, we designed 20 different alpha-factor strategies (see Appendix 1) for validation and chose a testing period of 10 years (i.e., from January 2010 to December 2019).

Because the alpha-01 strategy is the example in Sect. 5.2, this section first focuses on the performance of the pure lead–lag strategy (PLL), the pure alpha-01 strategy (Pure-01), and the enhancement strategy of alpha-01 (Enhan-01). Table 9 reports their performance in the two target stock markets, where “mean returns” are obtained by averaging the daily return rate. By comparing the mean returns of each strategy, we find that the enhancement strategy performs best in both markets. This finding indicates that the proposed lead–lag strategy significantly improves the performance of the pure-01 factor strategy. Furthermore, PLL performs better than MKT in both markets, to different degrees, in terms of mean returns, meaning that PLL contains valuable information for investment. Moreover, when the signals provided by the lead–lag strategy are added to the pure-01 factor strategy, the resulting enhancement strategy performs better than either strategy alone, which implies that the lead–lag signals carry information beyond that contained in the pure strategy. Considering the other indices listed in Table 9, the higher positive values of skewness and the Sharpe ratio in Enhan-01 are desirable properties for any potential investor in both markets (Cont 2010; Fievet and Sornette 2018). In addition, compared to Pure-01, the use of the signal from the PLL significantly reduces the maximum drawdown of Enhan-01 and thus increases the advantage afforded by the enhancement strategy.

Table 9 The performance of PLL, Pure-01, Enhan-01, and MKT in each stock market
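Two of the risk indices reported in Table 9 can be computed from a series of daily return rates as sketched below. The text does not spell out the exact formulas, so the conventions here are assumptions: the Sharpe ratio is taken as the mean over the sample standard deviation of daily returns with a zero risk-free rate and no annualization, and the maximum drawdown is the largest peak-to-trough decline of the cumulative portfolio value. The function names and sample returns are hypothetical.

```python
from statistics import mean, stdev


def sharpe_ratio(daily_returns, risk_free=0.0):
    """Mean excess daily return divided by its sample standard deviation."""
    excess = [r - risk_free for r in daily_returns]
    return mean(excess) / stdev(excess)


def max_drawdown(daily_returns):
    """Largest relative fall of cumulative value from a running peak."""
    value = peak = 1.0
    worst = 0.0
    for r in daily_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Illustrative daily return rates (not from the paper).
returns = [0.10, -0.50, 0.20]

print(sharpe_ratio(returns), max_drawdown(returns))
```

Here the value rises to 1.1, falls to 0.55, and recovers to 0.66, so the drawdown is measured from the peak of 1.1 to the trough of 0.55.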

Next, we conduct random investment operations in each stock market during the selected 10-year period, record the average daily return rate (mean returns) and the Sharpe ratio of each operation, and then rank them in the figures. Figure 10 displays the results of 5,000 simulations, and the upper 5% of the ranked mean return value, or Sharpe ratio, is set as the threshold. When the corresponding value of one proposed strategy is above the threshold, we deem that the performance of the proposed strategy is significantly better than the benchmark at a significance level of 0.05. Recalling the performance results of Enhan-01 shown in Table 9, the mean returns are 0.00036 and 0.01257 in the U.S. and Chinese stock markets, respectively, both of which are higher than the corresponding highlighted thresholds reported in Panels (a) and (c) of Fig. 10. Concerning the Sharpe ratio, a similar result holds.

Fig. 10 The ranked mean returns and Sharpe ratios for each market via random investment operations
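The random-operation benchmark can be sketched as follows for the mean-return case. This is an illustrative sketch only: the price series are synthetic random walks, the stock labels and function names are hypothetical, and the choice of the 250th largest of 5,000 values as the upper 5% threshold is one reasonable reading of the description above.

```python
import random


def synthetic_prices(rng, days, start=100.0):
    """Random-walk closing prices; purely illustrative data."""
    prices = [start]
    for _ in range(days - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))
    return prices


def random_operation(prices_by_stock, rng):
    """Buy one random stock on a random day and sell it on a later random day."""
    stock = rng.choice(sorted(prices_by_stock))
    prices = prices_by_stock[stock]
    buy, sell = sorted(rng.sample(range(len(prices)), 2))
    daily = [prices[t] / prices[t - 1] - 1.0 for t in range(buy + 1, sell + 1)]
    return sum(daily) / len(daily)  # average daily return rate while holding


def upper_threshold(values, fraction=0.05):
    """Value at the boundary of the top `fraction` of the ranked results."""
    ranked = sorted(values, reverse=True)
    return ranked[max(int(len(ranked) * fraction) - 1, 0)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
market = {name: synthetic_prices(rng, 500) for name in ("S1", "S2", "S3")}

mean_returns = [random_operation(market, rng) for _ in range(5000)]
threshold = upper_threshold(mean_returns)
print(threshold)
```

A strategy whose mean return exceeds `threshold` beats at least 95% of the simulated random operations, which is the sense in which the comparison is read at the 0.05 level. The Sharpe-ratio threshold follows the same ranking procedure.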

Following this analytical process, the performance of the remaining 19 alpha-factor strategies and their corresponding enhancement strategies is also tested. All performance results are listed in Tables 10 and 11 for the two stock markets. To facilitate comparison, Fig. 11 displays the mean return of each selected alpha factor in each market as well as for the two types of benchmarks. Here, it is easy to see that the benchmark from random investment operations is higher than that from MKT. According to Fig. 11, we obtain the following findings: (1) all of the enhancement strategies perform better than the two benchmarks in both markets, demonstrating the usefulness of the proposed strategies; and (2) almost all of the enhancement strategies perform better than the corresponding pure alpha-factor strategies, illustrating that the signal provided by our proposed lead–lag strategy improves the performance of pure alpha-factor strategies in most cases.

Table 10 The performance of Pure-02 to Pure-20 and Enhan-02 to Enhan-20 in the CSI 300
Table 11 The performance of Pure-02 to Pure-20 and Enhan-02 to Enhan-20 in the S&P 500
Fig. 11 Performance of each selected alpha strategy and its enhancement strategy in each market

Discussion

Section 4.2.2, Fig. 8, and Table 7 show that the overall prediction accuracy is significantly higher than 50% (i.e., better than a random guess) in each case, implying that the stocks with the lead–lag effect provide useful information for prediction and strategy design. However, an accuracy only somewhat above 50% is not exceptionally high, particularly in the CSI 300; thus, some stock pairs perform worse than a random guess in each prediction period. Inspired by this result, we provide a refined process in which the stock pairs (i.e., the detected pairs with the lead–lag effect) with a prediction accuracy of less than 50% are eliminated from the selected stock set. Accordingly, the refined process makes the set of stock pairs with the lead–lag effect smaller by deleting those with inferior prediction performance in the training data. Then, by putting the refined process into the enhancement strategy proposed above, the so-called refined strategy is obtained, and its performance is tested in the same way as in Sect. 5.3.
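The refined process amounts to a filter on the detected pairs, as the following sketch shows. The function name, the pair labels, and the accuracy values are hypothetical; the per-pair accuracy is assumed to have been measured on the training period only.

```python
def refine_pairs(pair_accuracy, minimum=0.5):
    """Drop detected pairs whose training-period accuracy is below `minimum`."""
    return {pair: acc for pair, acc in pair_accuracy.items() if acc >= minimum}


# Hypothetical (leader, follower) pairs with their training-period accuracy.
detected = {
    ("A", "X"): 0.56,
    ("A", "Y"): 0.47,
    ("B", "X"): 0.50,
    ("B", "Y"): 0.43,
}

refined = refine_pairs(detected)
print(sorted(refined))
```

Only pairs with an accuracy strictly below 50% are removed, so a pair at exactly 50% is kept in this sketch.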

All performance results are listed in Table 12 for each market, and Fig. 12 displays the mean returns of each enhancement strategy and its corresponding refined strategy in each market with the two selected benchmarks. According to Fig. 12, the refined strategies have different degrees of improvement in profit over the original enhancement strategies in the CSI 300, indicating that the refined process does provide more practical information for investing in the CSI 300. However, for the S&P 500, most refined strategies outperform the original enhancement strategies, but some perform worse than the original enhancement strategy, especially when the original strategy is already very profitable. This result implies that the refined process does not always work better than the original process, possibly because some helpful information may be dropped during the refinement process.

Table 12 The performance of Refined-01 to Refined-20 in both markets
Fig. 12 Performance of each enhancement strategy and its refined strategies in each market

Furthermore, in the risk analysis, Table 12 shows that the refined strategy generally improves the Sharpe ratio and reduces the maximum draw-down rate in the CSI 300. In contrast, improvement is not evident in the S&P 500. These results indicate that the refinement process effectively discards risky lead–lag signals, but its performance depends on the application scenario. Overall, the refined strategy is more suitable for the CSI 300, but for the S&P 500, it may serve as an alternative to the original enhancement strategy.

Conclusions and future work

The power-law distribution is often observed in human activity, which may explain its widespread existence in stock markets. Interestingly, this study finds that the number of accumulated lead–lag days between stock pairs follows the power-law distribution in both the U.S. and Chinese stock markets based on 10 years of data. Because the power-law distribution features a heavy tail, this study also formally defines the lead–lag effect via statistical testing and then proposes a new method for detecting stock pairs characterized by this effect. The robustness of the detection method and the roles of its embedded parameters are tested and explored. As an application, a PLL investment strategy is first proposed based on stock pairs identified with the lead–lag effect. Although the proposed lead–lag strategy can beat a naïve buy-and-hold strategy, its edge is too limited to be satisfactory. To this end, enhancement strategies are also designed by integrating the lead–lag strategy into the selected basic alpha-factor strategies. Then, a series of validations are conducted on 20 different alpha factors to ensure sound results. The results demonstrate that the enhancement strategy significantly improves the performance of the basic alpha-factor strategies and the PLL strategy in most cases.

In theory, the discovered power-law distribution implies that the lead–lag phenomenon common in stock markets is attributable not only to random factors but is also influenced by human behaviors such as irrationality, herding, gaming behavior, and many others. Importantly, this finding provides new evidence in support of behavioral finance theory. The proposed detection method can be considered solid and credible because it originates from the principle of statistical testing, contributing to the existing methods for detecting the lead–lag phenomenon or effect. In practice, because the lead–lag effect is demonstrated in this study to provide effective information, it will benefit the designing of innovative and effective investment strategies that are especially suitable for low-frequency data due to its maneuverability. The idea of an enhancement strategy (i.e., a basic strategy supplemented by the lead–lag strategy) provides investors with a new framework for strategy design with potentially positive back-tested performance and practicality.

Our study does have some limitations. Although we selected two representative stock markets as the targets for this examination, the analysis and validation of additional stock markets is required. In future work, we will study the characteristics of the lead–lag phenomenon in different emerging markets at different stages of economic development. Many previous studies have confirmed that there are more opportunities for profit in emerging markets than in mature markets; thus, whether our proposed strategy is effective in various emerging markets remains a question of interest. Although our proposed lead–lag strategy and enhancement strategies exhibit a significant improvement compared to the selected benchmarks, only one type of basic investment strategy (i.e., the alpha strategy) is considered in this work. In future work, other investment strategies can be selected as the basic strategy to be enhanced by the lead–lag strategy, in order to design more competitive stock market investment strategies. As this is a preliminary examination of these two directions, richer findings and more profitable investment strategies are welcome in the future.

Availability of data and material

Data and codes are available at https://github.com/liuchaos03/Power-law-distribution-Lead-lag-effect-and-Investment-strategies-in-Stock-Markets.

References

  • Balatti M, Brooks C, Kappou K (2017) Fundamental indexation revisited: new evidence on alpha. Int Rev Financ Anal 51:1–15

  • Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512

  • Barabási AL, Bonabeau E (2003) Scale-free networks. Sci Am 288(5):60–69

  • Berggrun L, Cardona E, Lizarzaburu E (2020) Profitability of momentum strategies in Latin America. Int Rev Financ Anal 70:101502

  • Buccheri G, Corsi F, Peluso S (2019) High-frequency lead–lag effects and cross-asset linkages: a multi-asset lagged adjustment model. J Bus Econ Stat. https://doi.org/10.1080/07350015.2019.1697699

  • Campajola C, Lillo F, Tantari D (2020) Unveiling the relation between herding and liquidity with trader lead–lag networks. Quant Finance 20(11):1765–1778

  • Casgrain P, Jaimungal S (2019) Trading algorithms with learning in latent alpha models. Math Financ 29(3):735–772

  • Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703

  • Conlon T, Cotter J, Gencay R (2018) Long-run wavelet-based correlation for financial time series. Eur J Oper Res 271(2):676–696

  • Cont R (2010) Empirical properties of asset returns: stylized facts and statistical issues. Quant Finance 1(2):223–236

  • Coronel-Brizio HF, Hernández-Montoya AR (2010) The Anderson-Darling test of fit for the power-law distribution from left-censored samples. Physica A Stat Mech Appl 389(17):3508–3515

  • Curme C, Tumminello M, Mantegna RN, Stanley HE, Kenett DY (2015) Emergence of statistically validated financial intraday lead–lag relationships. Quant Finance 15(8):1375–1386

  • Dao TM, Mcgroarty F, Urquhart A (2018) Ultra-high-frequency lead–lag relationship and information arrival. Quant Finance 18(5):725–735

  • Deev O, Lyócsa Š (2020) Connectedness of financial institutions in Europe: a network approach across quantiles. Physica A Stat Mech Appl 550:124035–124041

  • Eisdorfer A, Goyal A, Zhdanov A (2019) Equity misvaluation and default options. J Financ 74(2):845–898

  • Fama EF, French KR (2012) Size, value, and momentum in international stock returns. J Financ Econ 105(3):457–472

  • Fama EF, French KR (1998) Value versus growth: the international evidence. J Financ 53:1975–1999

  • Fama EF, French KR (2015) A five-factor asset pricing model. J Financ Econ 116(1):1–12

  • Fama EF, French KR (2016) Dissecting anomalies with a five-factor model. Rev Financ Stud 29(1):69–103

  • Fievet L, Sornette D (2018) Decision trees unearth return sign predictability in the S&P 500. Quant Finance 18(11):1797–1814

  • Flori A, Regoli D (2021) Revealing pairs-trading opportunities with long short-term memory networks. Eur J Oper Res. https://doi.org/10.1016/j.ejor.2021.03.009

  • Fonseca DJ, Zaatour R (2017) Correlation and lead–lag relationships in a Hawkes microstructure model. J Futur Mark 37(3):260–285

  • Gabaix X, Gopikrishnan P, Plerou V, Stanley HE (2003) A theory of power-law distributions in financial market fluctuations. Nature 423(6937):267–270

  • Gong CC, Ji SD, Su LL, Li SP, Ren F (2016) The lead–lag relationship between stock index and stock index futures: a thermal optimal path method. Physica A 444:63–72

  • Gupta K, Chatterjee N (2020) Selecting stock pairs for pairs trading while incorporating lead–lag relationship. Physica A Stat Mech Appl 551:124103

  • Harvey CR, Liu Y, Zhu H (2016) … and the cross-section of expected returns. Rev Financ Stud 29(1):5–68

  • Hou K, Xue C, Zhang L (2015) Digesting anomalies: an investment approach. Rev Financ Stud 28(3):650–705

  • Huang WQ, Zhuang XT, Yao S (2009) A network analysis of the Chinese stock market. Physica A 388(14):2956–2964

  • Huth N, Abergel F (2014) High frequency lead/lag relationships—empirical facts. J Empir Financ 26:41–58

  • Jiang T, Bao S, Li L (2019) The linear and nonlinear lead–lag relationship among three SSE 50 Index markets: the index futures, 50ETF spot and options markets. Physica A Stat Mech Appl 525:878–893

  • Jong DF, Nijman T (1997) High frequency analysis of lead–lag relationships between financial markets. J Empir Financ 4(2–3):259–277

  • Kakushadze Z (2016) 101 formulaic alphas. Wilmott 2016(84):72–81

  • Kuiper NH (1960) Tests concerning random points on a circle. Proc Ser A 63(1):38–47

  • Kobayashi T, Takaguchi T (2018) Social dynamics of financial networks. EPJ Data Sci 7(1):15

  • Krauss C (2017) Statistical arbitrage pairs trading strategies: review and outlook. J Econ Surv 31(2):513–545

  • Kumar S, Deo N (2012) Correlation and network analysis of global financial indices. Phys Rev E 86(2):026101

  • Li Y, Liu C, Wang T, Sun B (2021) Dynamic patterns of daily lead–lag networks in stock markets. Quant Finance 21(12):2055–2068

  • Liu J, Stambaugh RF, Yuan Y (2019) Size and value in China. J Financ Econ 134(1):48–69

  • Makarov I, Plantin G (2015) Rewarding trading skills without inducing gambling. J Financ 70(3):925–962

  • Malevergne Y, Pisarenko V, Sornette D (2011) Testing the Pareto against the lognormal distributions with the uniformly most powerful unbiased test applied to the distribution of cities. Phys Rev E 83(3):

  • Massey FJ Jr (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78

  • Newman ME, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Phys Rev E 64(2):026118

  • Peralta G, Zareei A (2016) A network approach to portfolio selection. J Empir Financ 38:157–180

  • Rickles D (2011) Econophysics and the complexity of financial markets. In: Hooker C (ed) Philosophy of complex systems. North-Holland, Amsterdam, pp 531–565

  • Scherbina A, Schlusche B (2020) Follow the leader: using the stock market to uncover information flows between firms. Rev Finance 24(1):189–225

  • Scholz FW, Stephens MA (1987) K-sample Anderson-Darling tests. J Am Stat Assoc 82(399):918–924

  • Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J Financ 19(3):425–442

  • Shen D, Zhang Y, Xiong X, Zhang W (2017) Baidu index and predictability of Chinese stock returns. Financ Innov 3(1):1–8

  • Stübinger J (2019) Statistical arbitrage with optimal causal paths on high-frequency data of the S&P 500. Quant Finance 19(6):921–935

  • Toda AA (2012) The double power law in income distribution: explanations and evidence. J Econ Behav Org 84(1):364–381

  • Tóth B, Kertész J (2006) Increasing market efficiency: Evolution of cross-correlations of stock returns. Physica A 360(2):505–515

  • Volz E (2004) Random networks with tunable degree distribution and clustering. Phys Rev E 70(5):056115

  • Xia L, You D, Jiang X, Chen W (2018) Emergence and temporal structure of lead–lag correlations in collective stock dynamics. Physica A 502:545–553

  • Xiong X, Cui Y, Yan X, Liu J, He S (2020) Cost-benefit analysis of trading strategies in the stock index futures market. Financ Innov 6(1):1–17

  • Zeng K, Atta Mills EFE (2021) Can economic links explain lead–lag relations across firms? Int J Finance Econ. https://doi.org/10.1002/ijfe.2480

  • Zhang W, Yan K, Shen D (2021) Can the Baidu Index predict realized volatility in the Chinese stock market? Financ Innov 7(1):1–31

Funding

This work was supported by the National Natural Science Foundation of China (72171059, 71771041), the Fundamental Research Funds for the Central Universities (FRFCU5710000220) and the Natural Science Foundation of Heilongjiang Province, China (No. YQ2020G003).

Author information

Contributions

YL: Conceptualization, Methodology, Formal analysis, and Writing - Original Draft. TW: Methodology, Software, Formal analysis, and Writing - Original Draft. BS: Writing - Review & Editing, Supervision, Validation. CL: Visualization, Software, Validation, and Data Curation. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yongli Li or Baiqing Sun.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: The designed 20 alpha factors and their expressions

We present the 20 designed alpha factors in detail here by providing their formulaic expressions. Table 13 gives symbolic descriptions of the variables in the collected data, and Table 14 lists the operators and functions used in the formulaic expressions of the alpha factors.

Table 13 Symbolic descriptions of the data variables
Table 14 Operators and functions in the formulaic expressions

Accordingly, the 20 designed alpha formulas are listed below.

  • Alpha-01: \({\rm Rank}\left(-1*{\rm Correlation}\left(Open,Volume,20\right)\right)-0.5\)

  • Alpha-02: \({\rm Rank}(-1*{\rm Correlation}({\rm Rank}({\rm Delta}(Volume,10)),{\rm Rank}(\frac{Close-Open}{Open}),20))-0.5\)

  • Alpha-03: \({\rm Rank}(-1*{\rm Ts}\_{\rm Rank}(Low,20))-0.5\)

  • Alpha-04: \({\rm Rank}\left(-1*{\rm Correlation}\left(Open,Volume,20\right)\right)-0.5\)

  • Alpha-05: \({\rm Rank}\left({\rm Sign}\left({\rm Delta}\left(Volume,1\right)\right)*\left(-1*{\rm Delta}\left(Close,20\right)\right)\right)-0.5\)

  • Alpha-06: \(0.5-1*{\rm Rank}({\rm Covariance}({\rm Rank}(Close),{\rm Rank}(Volume),20))\)

  • Alpha-07: \({\rm Rank}((-1*{\rm Rank}({\rm Delta}(Returns,10)))*{\rm Correlation}(Open,Volume,20))-0.5\)

  • Alpha-08: \(0.5-1*{\rm Rank}({\rm Correlation}({\rm Rank}(High),{\rm Rank}(Volume), 20))\)

  • Alpha-09: \(0.5-1*{\rm Rank}({\rm Covariance}({\rm Rank}(High),{\rm Rank}(Volume),20))\)

  • Alpha-10: \(0.5-1*{\rm Rank}({\rm Correlation}({\rm Ts}\_{\rm Rank}(Volume,5),{\rm Ts}\_{\rm Rank}(High,5),15))\)

  • Alpha-11: \({\rm Rank}({\rm Correlation}(Adv20,Low,5)+((High+Low)/2)-Close)-0.5\)

  • Alpha-12: \({\rm Rank}({\rm Correlation}({\rm Delay}((Open-Close),1),Close,20))+{\rm Rank}((Open-Close))-0.5\)

  • Alpha-13: \({\rm Rank}(-1*{\rm Rank}({\rm Std}(High,20))*{\rm Correlation}(High,Volume,20))-0.5\)

  • Alpha-14: \({\rm Rank}(-1*{\rm Correlation}(High,{\rm Rank}(Volume),20))-0.5\)

  • Alpha-15: \({\rm Rank}(-1*{\rm Delta}((2*Close-Low-High)/(Close-Low),20))-0.5\)

  • Alpha-16: \({\rm Rank}({\rm Correlation}((Low-Close)*(Open^5),(Low-High)*(Close^5),20))-0.5\)

  • Alpha-17: \(0.5-{\rm Rank}({\rm Correlation}({\rm Rank}(\frac{Close-{\rm Min}(Low,12)}{{\rm Max}\left(High,12\right)-{\rm Min}\left(Low,12\right)}),{\rm Rank}(Volume),6))\)

  • Alpha-18: \({\rm Rank}({\rm Correlation}\left(Close-Open,High-Low,20\right))-0.5\)

  • Alpha-19: \({\rm Rank}(2-{\rm Rank}({\rm Std}(Returns,7)/{\rm Std}(Returns,20))-{\rm Rank}({\rm Delta}(Close,7)))-0.5\)

  • Alpha-20: \({\rm Rank}({\rm Ts}\_{\rm Rank}(Volume/Adv20,20)*{\rm Ts}\_{\rm Rank}(-1*{\rm Delta}(Close,7),7))-0.5\)
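
As an illustration of how such formulaic expressions can be evaluated, the following is a minimal sketch of Alpha-01 and Alpha-05 in plain Python. The definitions in Table 14 are not reproduced here, so the operator semantics are assumptions in the spirit of Kakushadze (2016): Rank is taken as a cross-sectional rank over all stocks on one day, scaled to [0, 1] with ties averaged; Correlation(x, y, d) as the Pearson correlation over the last d days; and Delta(x, d) as today's value minus the value d days ago. All function names (`correlation`, `delta`, `sign`, `cross_rank`, `alpha_01`, `alpha_05`) and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt

def correlation(x, y, d):
    """Pearson correlation of the last d observations of two series."""
    xs, ys = x[-d:], y[-d:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    # Assumption: a constant window has no defined correlation; use 0.0.
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def delta(x, d):
    """Latest value minus the value d days earlier."""
    return x[-1] - x[-1 - d]

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def cross_rank(values):
    """Cross-sectional rank scaled to [0, 1]; tied values share their average rank."""
    names = sorted(values, key=values.get)
    n = len(names)
    if n == 1:
        return {names[0]: 0.5}
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[names[j + 1]] == values[names[i]]:
            j += 1
        avg = (i + j) / 2  # zero-based average position of the tie group
        for k in range(i, j + 1):
            ranks[names[k]] = avg / (n - 1)
        i = j + 1
    return ranks

def alpha_01(open_, volume, d=20):
    """Rank(-1 * Correlation(Open, Volume, d)) - 0.5, for the latest day."""
    raw = {s: -correlation(open_[s], volume[s], d) for s in open_}
    return {s: r - 0.5 for s, r in cross_rank(raw).items()}

def alpha_05(close, volume):
    """Rank(Sign(Delta(Volume, 1)) * (-1 * Delta(Close, 20))) - 0.5, for the latest day."""
    raw = {s: sign(delta(volume[s], 1)) * -delta(close[s], 20) for s in close}
    return {s: r - 0.5 for s, r in cross_rank(raw).items()}

# Toy data: three made-up stocks with 21 daily observations each.
days = range(21)
open_ = {s: [10.0 + i for i in days] for s in ("A", "B", "C")}
volume = {
    "A": [100.0 + i for i in days],              # rises with the open price
    "B": [200.0 - i for i in days],              # falls as the open price rises
    "C": [(i - 10.5) ** 2 + 50.0 for i in days], # U-shaped, roughly uncorrelated
}
close = {
    "A": [10.0 + i for i in days],
    "B": [60.0 - 2.0 * i for i in days],
    "C": [10.0 for _ in days],
}

a1 = alpha_01(open_, volume)
a5 = alpha_05(close, volume)
print(a1)
print(a5)
```

Each factor value lies in [-0.5, 0.5], so subtracting 0.5 centers the cross-sectional rank around zero. A production implementation would typically vectorize these operators over a full price panel rather than loop over dictionaries.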

Appendix 2: List of abbreviations

See Table 15.

Table 15 Abbreviations and their full names

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Li, Y., Wang, T., Sun, B. et al. Detecting the lead–lag effect in stock markets: definition, patterns, and investment strategies. Financ Innov 8, 51 (2022). https://doi.org/10.1186/s40854-022-00356-3

  • DOI: https://doi.org/10.1186/s40854-022-00356-3

Keywords

  • Power-law distribution
  • Lead–lag effect
  • Stock market
  • Complex network
  • Investment strategy