Detecting the lead–lag effect in stock markets: definition, patterns, and investment strategies
Financial Innovation volume 8, Article number: 51 (2022)
Abstract
Human activities widely exhibit a power-law distribution. Considering stock trading as a typical human activity in the financial domain, the first aim of this paper is to validate whether the well-known power-law distribution can be observed in this activity. Interestingly, this paper determines that the number of accumulated lead–lag days between stock pairs meets the power-law distribution in both the U.S. and Chinese stock markets based on 10 years of trading data. Based on this finding, this paper adopts the power-law distribution to formally define the lead–lag effect, detect stock pairs with the lead–lag effect, and then design a pure lead–lag investment strategy as well as enhancement investment strategies by integrating the lead–lag strategy into classic alpha-factor strategies. Tests conducted on 20 different alpha-factor strategies demonstrate that both the pure lead–lag strategy and the enhancement strategies perform better than the selected benchmark strategy and that the lead–lag strategy provides useful signals that significantly improve the performance of basic alpha-factor strategies. Our results therefore indicate that the lead–lag effect may provide effective information for designing more profitable investment strategies.
Introduction
The lead–lag phenomenon, a phenomenon in which a security leads the price movement of another with some time delay, has been empirically evidenced as widely existing in financial markets (Gong et al. 2016). Although the “lead–lag effect” concept has been adopted in many studies (Kobayashi and Takaguchi 2018), few have provided a formal definition of this concept, and its underlying meaning is not always consistent. Some studies have focused on how to generate greater stock returns by utilizing the “lead–lag phenomenon” (Stübinger 2019) but have often failed to mine its embedded features. To this end, this study aims to answer the following questions: (1) Are there several stable patterns in stock markets that are characterized by the lead–lag phenomenon? (2) How can we formally define the lead–lag effect to provide a solid foundation for detecting such an effect? (3) Can detecting the lead–lag effect enable the design of more profitable investment strategies that are more likely to earn excess returns?
The definition of the “lead–lag effect” is not equivalent to that of the “lead–lag relationship.” That is, if one stock’s volatility today mimics another stock’s volatility yesterday, the two stocks are said to have a “lead–lag relationship” over the two successive days in which the former is the follower, and the latter is the leader. In fact, it is quite common for one stock to follow another stock on some days during a year. Thus, an occasional lead–lag relationship could be regarded as random, which would not be very meaningful. However, if the lead–lag days of one stock pair are long enough to differ significantly from a random event, an effect can be deemed to exist between the pair. Accordingly, the first motivation of our work is to define the lead–lag effect by providing a statistical testing model, the goal of which is to judge whether the days characterized by a lead–lag relationship (hereafter, “lead–lag days”) are significantly long in a statistical sense.
Once the definition of the lead–lag effect is scientifically determined, a method for detecting stock pairs characterized by the lead–lag effect can be proposed. However, two questions must first be addressed: (1) how do external variables affect the detection results, and (2) are the detection results sensitive to these influential external variables? The answers to these two questions will deepen our understanding of the proposed detection model. The patterns of external variables that influence the results will enable us to adopt the proposed model by selecting the appropriate variable values. The robustness of the proposed model matters for its use in investment practice in real-world stock markets because robustness is desirable when designing investment strategies. Accordingly, the answers to these two questions will reveal the properties of the proposed model.
As a typical application, the detected lead–lag effect is intended to guide investments in real-world stock markets. Clearly, detecting stock pairs with a significant lead–lag effect can benefit investors because the price movements of followers will mimic those of their leaders. Thus, this study will first examine the performance of the pure lead–lag strategy and then judge if it is satisfactory. If it is satisfactory, we will regard the detected lead–lag effect as an enhancement signal, and then add it to some classic investment strategies to propose enhancement investment strategies. Generally, when a basic strategy is enhanced by another strategy, we refer to it as a single-enhancement investment strategy. The alpha-factor strategy is selected as the basic strategy, and our proposed lead–lag strategy is adopted to enhance it. Accordingly, the third motivation is to design profitable investment strategies based on the detected lead–lag effect, and then test its performance in a pure investment strategy and the proposed enhancement strategies.
To sum up, the contributions of this study are as follows: (1) The features of the lead–lag phenomenon are explored in the context of both the U.S. and Chinese stock markets. As a result, the number of stock pairs characterized by the lead–lag relationship meets the well-known power-law distribution, which offers novel evidence that the power-law distribution exists widely in the real world (Clauset et al. 2009) and specifically in the financial domain (Gabaix et al. 2003). (2) A formal definition of the lead–lag effect is provided according to the principles of statistical testing, and a detection approach is proposed based on this definition. It is worth noting that most existing studies regard the lead–lag relationship between stocks as a phenomenon (Scherbina and Schlusche 2020; Dao et al. 2018; Huth and Abergel 2014), whereas this study elevates this phenomenon into an effect. Accordingly, the lead–lag effect must be formally defined via statistical testing, which lays a foundation for future studies to compare and detect the lead–lag effect in various scenarios. The rationality and robustness of the proposed detection approach are carefully examined by determining how external variables influence the lead–lag effect. (3) A few profitable investment strategies are designed and validated based on the detected lead–lag effect, in parallel to previous design and validation studies such as Shen et al. (2017), Xiong et al. (2020), Flori and Regoli (2021), and Zhang et al. (2021). Here both the pure lead–lag strategy and the enhancement strategies report sound results regarding the functionality of the detected lead–lag effect.
The remainder of this paper is organized as follows. “Section Related work” reviews the related work to clearly delineate the aforementioned contributions; “Section Method for detecting the lead–lag effect” defines the lead–lag effect and proposes a detection methodology; “Section Main results and validation in real-world stock markets” explores the features of the lead–lag phenomenon based on a selected real-world dataset, applies the proposed detection method, and tests the method’s robustness; “Section Investment strategies based on the detected lead–lag effect” designs investment strategies and validates their performance to reveal the functionality of the detected lead–lag effect; and “Section Conclusions and future work” summarizes the study and discusses potential future work.
Related work
Our work is directly related to two fields of existing studies: the lead–lag phenomenon in stock markets and the alpha-factor strategy widely adopted in stock markets. Each field is reviewed individually in the following sections.
Lead–lag phenomenon in stock markets
The lead–lag phenomenon is a classic financial topic that has attracted the attention of numerous researchers (Conlon et al. 2018). First, one fundamental question has been widely examined in the literature: does the lead–lag phenomenon exist in the stock market? Generally, the lead–lag phenomenon can be observed in high-frequency data such as 5-min stock price movements. Both Jong and Nijman (1997) and Huth and Abergel (2014) deemed that the lead–lag relationship is an essential stylized fact at high frequencies. Fonseca and Zaatour (2017), Dao et al. (2018), Buccheri et al. (2019), Campajola et al. (2020), and many others discuss the existence of the lead–lag phenomenon in high-frequency data, its influencing factors, and even its potential origins. However, when observed in high-frequency data, this phenomenon is often called the “lead–lag relationship” rather than the “lead–lag effect”. In most cases, the lead–lag relationship is unstable in high-frequency data. Since, according to Tóth and Kertész (2006) and Curme et al. (2015), its appearance is likely to be occasional, our work aims to formulate a new approach to finding a stable lead–lag relationship over a long time period based on statistical testing and to rename such a stable lead–lag relationship “the lead–lag effect” as an indication of its statistical significance.
Second, the existing literature explores how to take advantage of the lead–lag phenomenon in designing investment strategies for real-world stock markets. Typically, investment strategies that utilize the lead–lag phenomenon are variations on the high-frequency trading strategy, which accords with the high-frequency findings above. Stübinger (2019) developed an optimal causal path algorithm and designed statistical arbitrage strategies for high-frequency data based on the lead–lag phenomenon. However, designing an investment strategy based on high-frequency data still has drawbacks. According to Krauss (2017), high-frequency trading strategies are associated with greater commission fees and a higher transaction threshold for investors.
In contrast, the stable lead–lag effect discovered in low-frequency data facilitates the practices of small and medium investors because of its ample optional trading time and low technical threshold. Scherbina and Schlusche (2020) and Gupta and Chatterjee (2020) have pointed out that the lead–lag relationship enables out-of-sample forecasting and thus helps in the design of investment strategies. From this perspective, this study can also be seen as the development of our previously published work (Li et al. 2021), which focuses on identifying the factors that cause the lead–lag phenomenon. However, this study aims to develop new investment strategies by utilizing the lead–lag phenomenon, and thus the two have divergent research aims. In contrast to the successive lead–lag days analyzed in our previously published work, this study considers the number of cumulative lead–lag days, which helps extend the application of the model in real-world stock markets.
According to the aforementioned literature and gaps in the existing research, we believe that it is meaningful and even necessary to study the lead–lag effect in low-frequency data for the following reasons: (1) the definition of the lead–lag effect is not unified or discussed in depth in the existing literature, and thus the underlying significance of the lead–lag phenomenon often differs across studies despite the use of the same name; (2) traditional studies on detecting the lead–lag phenomenon are conducted using classical econometrics or empirical research methods, and thus the use of data-driven technical analysis to detect the lead–lag effect can supplement existing studies with a new perspective; and (3) building on the traditional methods of designing investment strategies by using the discovered lead–lag phenomenon, our study may identify effective signals, which will have a guiding significance for the development of investment strategy. Accordingly, this study contributes to the literature by providing a unified and solid definition of the lead–lag effect and by utilizing the lead–lag effect to design profitable trading strategies in real-world stock markets.
Alpha-factor strategy
Concerning our targeted enhancement investment strategies, the alpha-factor strategy that originated with the capital asset pricing model is chosen as the primary strategy due to its popularity and effectiveness in real-world investment practices (Sharpe 1964; Makarov and Plantin 2015). The alpha factor in the alpha-factor strategy reacts to one or some stock attributes; in other words, different alpha factors reflect different stock attributes. Thus, the alpha-factor strategy consists of numerous specific strategies using various alpha factors. Since alpha factors are used as buying and selling signals in the alpha-factor strategy, their choice is the core of the strategy. Generally, existing studies focus mainly on the following two types of alpha factors: value alphas and transactional alphas.
Value alphas are derived from the fundamentals of one stock and describe its value attributes. Value alphas include but are not limited to value factors (Balatti et al. 2017; Eisdorfer et al. 2019), size factors (Liu et al. 2019), growth factors (Fama and French 1998), profitability factors (Hou et al. 2015; Fama and French 2015), and momentum factors (Fama and French 2012; Berggrun et al. 2020). Based on the mature factor model, value alphas provide not only a valuable tool for stock valuation, but also a reasonable explanation for the cross-section of stock returns (Harvey et al. 2016). Accordingly, when value alphas are adopted in a strategy, it indicates that the investor cares about the value investment’s underlying factors (Fama and French 2016). In contrast to traditional value alphas, transactional alphas pay more attention to the patterns embedded in trading behaviors (Casgrain and Jaimungal 2019). Transactional alphas are obtained by means of technical analysis and derived from transaction data. With the current progression of computer science, millions of transactional alpha factors have been identified by automated algorithms. Despite the lack of a good explanation, the marginal revenue contributed by transactional alpha factors is relatively satisfactory (Kakushadze 2016); large financial institutions favor such transactional alphas. For example, the 101 alpha factors proposed by WorldQuant and the 191 alpha factors from Guotai Junan Securities have been welcomed by many institutions and investors.
The alpha-factor strategy is always used for stock selection. The proposed lead–lag strategy in our work helps allocate the weight of each selected stock in an investment portfolio. Therefore, it is convenient to combine the two strategies when designing an enhancement strategy. Since the alpha-factor strategy includes numerous specific models with various alpha factors, we select it as the primary strategy to demonstrate representativeness. The lead–lag effect falls into the category of technology-driven analysis and therefore resonates with transactional alphas, which are determined by technical analysis.
For this reason, it would be more natural to combine the lead–lag trading strategy with transactional alphas. Accordingly, this study focuses mainly on transactional alpha-factor strategies, regarded as the basic strategies when designing enhancement investment strategies. Our work exploits the great potential of the existing alpha factors and provides a framework for enhancement strategies by integrating the lead–lag effect into the existing alpha-factor strategies.
Method for detecting the lead–lag effect
The daily lead–lag network
Let r_{i,t} denote the yield rate of stock i on day t. Its mathematical expression is as follows:
\[ r_{i,t} = \frac{p_{i,t} - p_{i,t-1}}{p_{i,t-1}} \tag{1} \]
where p_{i,t} denotes the closing price of stock i on day t. Here, the adopted stock price is the rights-restored (adjusted) price rather than the ex-rights price. If a suspension occurs for stock i on day t, then both r_{i,t} and r_{i,t+1} are set to “NaN.” Next, given a manufactured threshold Δ (0 ≤ Δ < 1), the condition under which stock i follows stock j on day t is defined as follows:
Definition 1
The conditions for forming a lead–lag link. If and only if the following condition holds:
then, stock i follows stock j on day t.
Definition 1 states that if the difference between the yield rate of stock i on day t and that of stock j on day t–1 is within the given threshold, Δ, stock i is judged to follow stock j on day t. Further, let G_{t} denote the lead–lag network on day t, and its element g_{ij,t} reflects the status of stock i following stock j on day t. If stock i is judged as the follower of stock j on day t according to Definition 1, then g_{ij,t} = 1; otherwise, g_{ij,t} = 0. Our model allows one stock to follow itself, and thus it is possible that g_{ii,t} = 1 holds. Then, given the closing prices of all concerned stocks during the sequential T + 1 trading days, we can achieve T lead–lag networks according to Definition 1.
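The construction of one daily lead–lag network can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the yield rate is taken to be the simple return on closing prices, and the link condition is taken to be the absolute difference |r_{i,t} − r_{j,t−1}| ≤ Δ, which is one possible reading of "within the given threshold" (a relative difference could be passed in through the `condition` argument instead). The names `yield_rates`, `follows`, and `daily_network`, as well as the prices, are hypothetical.

```python
import math

NAN = float("nan")

def yield_rates(prices):
    """Daily yield rates from closing prices; None marks a suspension.

    Assumes the simple return (p_t - p_{t-1}) / p_{t-1}. A suspension on
    day t leaves both r_t and r_{t+1} undefined (NaN).
    """
    rates = [NAN]  # the first day has no previous close
    for prev, cur in zip(prices, prices[1:]):
        if prev is None or cur is None:
            rates.append(NAN)
        else:
            rates.append((cur - prev) / prev)
    return rates

def follows(r_i_today, r_j_yesterday, delta):
    """Assumed link condition: |r_{i,t} - r_{j,t-1}| <= delta."""
    if math.isnan(r_i_today) or math.isnan(r_j_yesterday):
        return False  # a suspended stock forms no link
    return abs(r_i_today - r_j_yesterday) <= delta

def daily_network(rates, t, delta, condition=follows):
    """G_t as a 0/1 matrix; g[i][j] = 1 when stock i follows stock j on day t."""
    n = len(rates)
    return [[1 if condition(rates[i][t], rates[j][t - 1], delta) else 0
             for j in range(n)]
            for i in range(n)]

# Hypothetical closing prices for three stocks over three days.
prices = [
    [100.0, 102.0, 103.0],  # stock 0
    [50.0, 50.5, 51.5],     # stock 1
    [20.0, None, 20.4],     # stock 2, suspended on day 1
]
rates = [yield_rates(p) for p in prices]
g2 = daily_network(rates, t=2, delta=0.002)
```

In this toy input, stock 0 follows stock 1 and stock 1 follows stock 0 on day 2, while the suspended stock 2 forms no links. The diagonal is evaluated like any other entry, so a stock may follow itself, as the model allows.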
During the targeted period (e.g., the total number of T + 1 trading days), the achieved T lead–lag networks can tell us how many days stock i follows stock j in total. Formally, let d_{ij} denote the number of accumulated days that stock i follows stock j during the targeted period, which can be calculated as follows:
\[ d_{ij} = \sum_{t=1}^{T} g_{ij,t} \tag{3} \]
G_{t} is an asymmetric matrix in most cases because g_{ij,t} need not equal g_{ji,t}; accordingly, d_{ij} is often not equal to d_{ji}.
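As a small illustration of the accumulated days, the following sketch sums a sequence of daily 0/1 networks entry by entry. The function name `accumulate_days` and the three toy networks are hypothetical.

```python
def accumulate_days(networks):
    """Sum daily 0/1 matrices so that d[i][j] counts the days i follows j."""
    n = len(networks[0])
    d = [[0] * n for _ in range(n)]
    for g in networks:
        for i in range(n):
            for j in range(n):
                d[i][j] += g[i][j]
    return d

# Three hypothetical daily networks for two stocks.
networks = [
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
    [[1, 1], [0, 0]],
]
d = accumulate_days(networks)
```

Here stock 0 follows stock 1 on all three days while stock 1 follows stock 0 on only one, so the accumulated matrix is asymmetric.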
Concerning the manufactured threshold Δ, a larger Δ will cause the achieved daily lead–lag networks to have more directed links than a smaller Δ; thus, the threshold Δ affects network density. Because it is an artificial variable, we will explore how it affects the results and check whether our method is robust under different threshold values in Sect. 4.2.1. The mainstream literature defining the relationship between stock pairs, such as Huang et al. (2009), Kumar and Deo (2012), Peralta and Zareei (2016), Xia et al. (2018), Deev and Lyócsa (2020), and many others, has often adopted the correlation coefficient. Note that most existing literature related to this definition uses data from a defined period to calculate the so-called “correlation coefficient,” whereas our study uses daily data to define each day’s lead–lag relationship between stock pairs. Therefore, how the selected data are used (i.e., day by day to define the lead–lag relationship, rather than pooled over the selected period to calculate a correlation coefficient) is one of the differences between our study and the existing literature.
Definition and detection of the lead–lag effect
As explained in the introduction, when d_{ij} (which is defined in Eq. (3)) is long enough to be significantly distinct from the amount achieved in a random event, we tend to believe that the lead–lag effect from stock j to stock i holds, where stock j is the leader and stock i is the follower. However, the criterion for judging whether d_{ij} is sufficiently long or not should be determined before formally defining the lead–lag effect. Fortunately, statistical testing enables us to formulate the following criterion: the null hypothesis is set to “all the links in the daily lead–lag networks are randomly formed,” and this null hypothesis allows us to obtain the distribution of the accumulated days of all stock pairs. Then, given the statistical significance level (e.g., 0.10, 0.05, or 0.01), the criterion can be immediately read from the obtained distribution. To clarify, let \(\hat{d}\) denote the criterion. The meaning of the term “lead–lag effect” is provided in Definition 2.
Definition 2
Lead–lag effect. Based on the calculated \(\hat{d}\), for any pair of stocks (e.g., i and j), if the d_{ij} achieved from Eq. (3) satisfies \(d_{ij} \ge \hat{d}\), the lead–lag effect from stock j to stock i is judged to hold. If \(d_{ii} \ge \hat{d}\), the lead–lag effect from stock i to itself is judged to hold.
Note that the principles of statistical testing imply that it is almost impossible for a rare event to occur in one random trial. Given the null hypothesis and the statistical significance level, the criterion for judging whether an event is rare or not can be achieved. Then, if a rare event occurs in the analyzed real-world data, we can reject the null hypothesis under the given statistical significance level, or we can determine that the rare event has a statistically significant effect. In fact, few studies have formally defined the lead–lag effect. As mentioned in the Related Work section, the lead–lag phenomenon or relationship was more often examined in the existing literature rather than the lead–lag effect. We detected the lead–lag effect via formal statistical tests and null reference networks, but the existing literature adopted different approaches to detecting the lead–lag relationship, such as the Granger test (Scherbina and Schlusche 2020; Zeng and Atta Mills 2021) and the optimal causal path algorithm (Jiang et al. 2019). Accordingly, the approach adopted leads to different definitions, so our definition is new in this field.
Table 1 shows the detailed process of achieving criterion \(\hat{d}\). Random networks are first generated to achieve the distribution of the accumulated days between all stock pairs under the null hypothesis. Then, criterion \(\hat{d}\) can be obtained given the statistical significance level in Step (1). Here, we refer to the configuration model proposed by Newman et al. (2001). The generated random networks retain the characteristics of the daily network as much as possible. Although the network indicator of the real-world lead–lag network changes each day, the adopted configuration model guarantees that each day’s random network and the same day’s real-world network share an almost identical node degree distribution, which is superior to the model that retains only the same edge number. Next, the statistical significance level δ is set to 0.001 since a lower significance level means a more rigorous criterion for determining the lead–lag effect. Once output \(\hat{d}\) is achieved based on the process shown in Table 1, Definition 2 directly judges which stock pair features the lead–lag effect. Hereafter, the stock pairs detected with the lead–lag effect are called “lead–lag stock pairs.”
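To make the idea of a degree-preserving null network concrete, the sketch below randomizes one daily network by stub matching, the usual construction behind configuration models. It is a simplified illustration and not the procedure of Table 1, which is not reproduced here: the function name `configuration_model` is hypothetical, and duplicate pairings are simply collapsed, so degrees are preserved only approximately.

```python
import random

def configuration_model(g, rng):
    """Randomly rewire a directed 0/1 network by re-pairing link stubs.

    g[i][j] = 1 means i follows j. Each link contributes one follower
    stub at i and one leader stub at j; leader stubs are shuffled and
    re-paired with follower stubs. Self-links are allowed, and duplicate
    pairings collapse into a single link.
    """
    n = len(g)
    followers = [i for i in range(n) for j in range(n) if g[i][j]]
    leaders = [j for i in range(n) for j in range(n) if g[i][j]]
    rng.shuffle(leaders)
    randomized = [[0] * n for _ in range(n)]
    for i, j in zip(followers, leaders):
        randomized[i][j] = 1
    return randomized

# A hypothetical daily network with six links among four stocks.
g = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
null_g = configuration_model(g, random.Random(42))
```

Repeating this for every trading day and summing the randomized networks would give one simulated set of accumulated days under the null hypothesis that links form at random.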
Example
A simple example is presented to show how the proposed detection method works. This example analyzes the closing price of 10 stocks on 11 sequential trading days, and then obtains 10 daily lead–lag networks using Eq. (2). As displayed in Fig. 1, each node represents one stock, and the direction of the link points from the leader to the follower. The color of the node distinguishes between differences in the node out-degree (i.e., the number of followers): the larger the out-degree, the darker the color.
Following Steps (1) and (2) in Table 1, one group of simulations can achieve the following 10 sequential random networks, as shown in Fig. 2. Here, each day’s random network retains the node degree distribution of the same day’s real-world lead–lag network, which can be checked by comparing the counterparts in Figs. 1 and 2.
Then, by conducting 500 groups of simulations according to Step (3) in Table 1, the distribution of all the accumulated days is obtained and displayed in Fig. 3. When the statistical significance level (\(\delta\)) is set to 0.001, the criterion \(\hat{d}\) is equal to 6, based on the achieved distribution in Fig. 3. Accordingly, the detected leader–follower pairs are 3 → 4, 4 → 5, and 6 → 7.
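One plausible way to read a criterion off a simulated null distribution is to take the smallest number of days whose upper-tail frequency does not exceed the significance level. The sketch below follows that reading; it is an assumption, since the exact rule in Table 1 is not shown here, and the function name `criterion` and the simulated counts are hypothetical.

```python
def criterion(null_days, delta):
    """Smallest d whose empirical tail frequency P(D >= d) is at most delta."""
    total = len(null_days)
    for d in range(max(null_days) + 2):
        tail = sum(1 for x in null_days if x >= d) / total
        if tail <= delta:
            return d

# Hypothetical accumulated days collected from simulated random networks.
null_days = [0] * 900 + [1] * 80 + [2] * 15 + [3] * 4 + [4] * 1

d_hat_loose = criterion(null_days, delta=0.01)
d_hat_strict = criterion(null_days, delta=0.0005)
```

A smaller significance level pushes the criterion further into the tail, which matches the remark that a lower level gives a more rigorous test. Under Definition 2, a pair would then be flagged when its observed accumulated days reach the criterion.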
Main results and validation in real-world stock markets
To extend the previous simple example to real-world stock markets, this section selects the stock markets of mainland China and the U.S. as the targets of analysis. This section applies the proposed method to detect which stock pairs are characterized by the defined lead–lag effect and explores how the man-made variables embedded in the detection method affect the results. The following subsections introduce the process of data selection, report the main results in different stock markets, and discuss these findings.
Data preparation and main statistical results
Data preparation
Two stock sets are selected for the application and validation of the proposed method. One is the set of 300 stocks contained in the China Securities Index 300 (CSI 300), considering that these 300 stocks are the most liquid stocks in mainland China’s A-share stock market; therefore, they are often used to reflect its overall performance. The other is the set of stocks included in Standard & Poor’s 500 Index (S&P 500), which helps us understand the proposed method’s performance in the U.S. stock market. Note that the stocks in the CSI 300 and the S&P 500 are not permanent, although adjustments to the stock set are quite infrequent.
The closing price of each stock in the two selected stock sets is collected on each trading day between January 1, 2010, and December 31, 2019 (i.e., over 10 years). The data were obtained from the Compustat database at https://wrds-www.wharton.upenn.edu/; each year has an average of 250 trading days. The stocks featured in each stock set change over time because new stocks were added and others were removed during the chosen period. Almost every trading day witnessed stock suspensions due to some reason or rule, and thus the size of the daily lead–lag network fluctuates. As shown in Fig. 4, different stock sets feature different overall directed lead–lag networks in terms of their diameter (DM), density (DS), average path length, average node degree (ND), and clustering coefficient.
Recalling Eqs. (1–2) in Sect. 3.1, the daily lead–lag networks can be immediately achieved in each stock set based on the above-prepared data once the man-made threshold Δ is given. Here and hereafter, Δ = 0.20 unless otherwise stated. The upper part of Fig. 4 displays the lead–lag networks achieved in each stock set on December 12, 2019. The overall lead–lag network can be obtained in each stock set by summing up each day’s lead–lag network. The bottom part of Fig. 4 shows the overall lead–lag network of each stock set during the entire period; the link thickness is proportional to the number of cumulative days on which one stock followed the other in this directed link.
To display more results under different values of the man-made threshold, Δ, Fig. 5 shows the achieved lead–lag networks in the two selected stock sets when Δ = 0.10. By comparing Figs. 4 and 5, we find that different values of Δ cause only slight changes in the overall lead–lag networks and their corresponding indicators in both markets, except that the average ND decreases with a decrease in Δ. However, the change in Δ has a significant impact on the daily networks of both markets because the daily network is not as robust as the overall network.
Power-law distribution
Before formally describing the detailed analysis, we will first recall basic knowledge about the random network, the scale-free network, and the power-law distribution that is often seen in the fields of complexity science and network analysis. First, a random network indicates that the links in the network are randomly formed; in other words, the links are generated with a given probability (Barabási and Albert 1999). The random network is often used as a testable null hypothesis about network structure (Volz 2004). Its link distribution is thin-tailed, and our work follows this idea. In contrast to a random network, a scale-free network refers to one with a degree distribution that meets the power law, at least asymptotically (Barabási and Bonabeau 2003). Roughly speaking, the distribution discrepancy between a random network and a scale-free network often originates in human activity; that is, human activity causes the change from a thin-tailed to a heavy-tailed distribution (represented by the power-law distribution). In addition, human activity also makes the power-law distribution more prevalent and special in the field of complexity science, and the power-law distribution is even viewed as a signature of complexity, since such a distribution can reflect the underlying pattern of a complex process (Rickles 2011). Our study considers the function of human activity in the stock market and thus tests the power-law distribution as stated below.
Although the overall lead–lag network of each stock set is unique, we wonder whether some identical patterns exist for different stock sets. If they do, we can call the discovered identical pattern a feature, because different stock sets do not alter the features embedded in the lead–lag phenomenon. To answer this question, we will focus on the link thickness displayed in Fig. 4 and examine its distribution. Given the meaning of link thickness, its distribution is equal to that of the variable d_{ij} defined in Eq. (3). Figure 6 displays the distribution in each stock set using Δ = 0.20. As displayed in Fig. 6, the points in the tail of each distribution are almost in a line in the log–log coordinates (i.e., a feature of the power-law distribution), indicating that the tested distribution is quite likely to meet the power-law distribution.
Next, according to the mainstream testing methods used in the existing literature (Clauset et al. 2009; Malevergne et al. 2011; Toda 2012) to verify the power-law distribution, we apply three methods to obtain sound results: the Kolmogorov–Smirnov Test (K–S), the Kuiper Test (Kuiper 1960), and the Anderson–Darling test (A–D) (Scholz and Stephens 1987; Coronel-Brizio and Hernández-Montoya 2010). Recalling Eq. (2), the manufactured threshold, Δ, affects the achieved daily lead–lag networks as well as the overall lead–lag networks in both markets. Here, we test whether the power-law distribution holds under different values of Δ. Table 2 shows the results: none of the three methods rejects the null hypothesis that “the data meets the power-law distribution” at the statistical significance level of 0.05. Therefore, we believe that the power-law distribution can be regarded as a stable pattern underlying the lead–lag phenomenon.
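For readers who want to see what a K–S check against a power law involves, the sketch below fits the exponent by maximum likelihood and measures the largest gap between the empirical and fitted distribution functions. It is an illustration only and rests on several assumptions: the tail is treated as continuous although accumulated days are integers, the lower cutoff `xmin` is taken as given, the data are synthetic, and the function names are hypothetical. Turning the distance into a p value would additionally require a resampling step of the kind described by Clauset et al. (2009).

```python
import math
import random

def fit_exponent(tail, xmin):
    """MLE of the power-law exponent for values >= xmin (continuous case)."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(tail)
    n = len(xs)
    worst = 0.0
    for k, x in enumerate(xs):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from k/n to (k+1)/n at x; check both sides.
        worst = max(worst, abs(model - k / n), abs(model - (k + 1) / n))
    return worst

# Synthetic tail drawn from a power law with exponent 2.5 (inverse transform).
rng = random.Random(0)
xmin, true_alpha = 10.0, 2.5
tail = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
        for _ in range(2000)]

alpha_hat = fit_exponent(tail, xmin)
distance = ks_distance(tail, xmin, alpha_hat)
```

A small distance means the fitted power law tracks the empirical tail closely, which is the situation in which a K–S test would not reject the power-law hypothesis.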
In addition, we conduct further tests to exclude other possible distributions and provide additional evidence supporting the discovered power-law distribution. As both markets witnessed steep decays in the log–log coordinates shown in Fig. 6, two candidate discrete and thin-tailed distributions, the Poisson distribution and the binomial distribution, are estimated and tested using the three testing methods. To make our results sound, we change the value of Δ to test the sensitivity of the results to this manufactured parameter. The results for the two tested distributions are shown in Tables 3 and 4. When the statistical significance level is 0.05, the two distributions are rejected in both markets in most cases, although several exceptions exist for the binomial distribution at Δ = 0.15 under the Kuiper Test. In summary, these results provide more evidence that the verified distribution is likely to meet the power-law distribution.
Based on these findings, we now address why the discovered power-law distribution is important in our work. Our proposed detection approach is more meaningful when facing a power-law distribution than a thin-tailed distribution because very few stock pairs (the number is negligible) can be detected as having a lead–lag effect with a thin-tailed distribution, but the power-law distribution guarantees that a considerable number of stock pairs may be detected. As expected, more detected stock pairs imply more opportunities to utilize the information contained in the lead–lag effect to improve earnings, which lays a foundation for designing more profitable investment strategies.
Main results and validation
Recall that two manually set variables affect the results of the proposed detection approach: the threshold Δ and the period ζ. As explained above, the threshold Δ influences the daily lead–lag networks. The period ζ is also an influencing factor because predictability is likely to differ when different periods are chosen. The following two subsections explore how these variables affect the detection results. These findings also partially answer questions related to the model’s robustness and the predictability of the results.
Detection results as a function of Δ
Recalling Eq. (2), the manually set threshold Δ affects link formation in a daily lead–lag network and thereby the distribution of the variables d_{ij} (see Eq. (3) and Fig. 6). This subsection focuses on how Δ affects this distribution. If the distributions obtained under different values of Δ differ significantly, the output of our model is sensitive to Δ, that is, not robust; otherwise, it is robust. To this end, DD(Δ_{i}, Δ_{j}) is defined in Eq. (4), following the K–S test (Massey 1951), to measure the difference between the distributions as follows:
where cdf(d, Δ_{i}) and cdf(d, Δ_{j}) denote the cumulative distribution functions under thresholds Δ_{i} and Δ_{j}, respectively. Because the measurement defined in Eq. (4) is a K–S statistic, the K–S test can be conducted to check whether the difference is significant. Considering different combinations of Δ_{i} and Δ_{j}, Tables 5 and 6 report the statistic DD(Δ_{i}, Δ_{j}) of each combination and its corresponding p value using the K–S test.
The numbers in bold in Tables 5 and 6 indicate that the difference between the two distributions is not significant at the 0.05 significance level. When |Δ_{i} − Δ_{j}| ≤ 10%, none of the distribution differences is significant, implying robustness when the two threshold values are not too far apart. Moreover, not surprisingly, DD(Δ_{i}, Δ_{j}) increases with |Δ_{i} − Δ_{j}| in all combinations in the two stock markets; even when the two threshold values differ by as much as 20%, the difference remains insignificant for some combinations. Overall, the distributions are robust in that they are not very sensitive to the parameter Δ.
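Because DD(Δ_{i}, Δ_{j}) is described as a K–S statistic, it can be read in the conventional two-sample form, \(\max_{d} |cdf(d, \Delta_{i}) - cdf(d, \Delta_{j})|\). The following is a minimal Python sketch under that assumption; the function names and the sample day counts are hypothetical and are not taken from the paper's data or code.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return a function d -> fraction of observations <= d."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda d: bisect_right(ordered, d) / n

def distribution_difference(sample_i, sample_j):
    """K-S style distance: largest absolute gap between two empirical CDFs."""
    cdf_i = empirical_cdf(sample_i)
    cdf_j = empirical_cdf(sample_j)
    # Both CDFs are step functions, so the gap only changes at observed values.
    support = set(sample_i) | set(sample_j)
    return max(abs(cdf_i(d) - cdf_j(d)) for d in support)

# Hypothetical accumulated lead-lag day counts d_ij under two thresholds.
days_delta_a = [1, 1, 2, 3, 5, 8]
days_delta_b = [1, 2, 2, 4, 6, 9]
dd = distribution_difference(days_delta_a, days_delta_b)
print(round(dd, 4))
```

A small value of the statistic means the two empirical distributions nearly coincide; whether the gap is significant still requires the K–S test itself, which this sketch does not perform.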
Detection results as a function of ζ
Before discussing the role of ζ, we first describe the prediction task: the detected leader during period ζ serves as a signal, and the price movements of the detected followers act as the prediction target. Specifically, if leader stock i and its follower stock j form one of the detected lead–lag stock pairs during the given period ζ (i.e., ζ months), the price movement of stock j on day t can be inferred from that of stock i on day t − 1. Then, we compare the real price movement of stock j with the movement predicted by its leader i on each trading day in the targeted month; thus, the prediction accuracy of the month can be calculated. To simplify the problem, we use 1, − 1, and 0 to denote the three price movements without considering their magnitude. In addition, if one follower has multiple leaders, the movement direction of the follower is determined by the majority of the leaders. When half of the leaders move up and half move down, the movement of the follower is predicted to be 0. Finally, by averaging all followers’ prediction accuracy, we obtain the performance of the prediction task in the targeted month. The detailed process of the prediction task is displayed in Fig. 7.
Note that the detected lead–lag stock pairs depend on the variable ζ. Thus, this subsection explores the value of ζ that achieves the best prediction performance, measured by the overall accuracy shown in Fig. 7. On the one hand, the answer reveals how ζ affects the detection results and the accuracy of the simple prediction task, laying a foundation for designing profitable investment strategies; on the other hand, it indicates how much information is contained in the detected lead–lag stock pairs, even though the prediction task is quite simple. If the mean overall prediction accuracy is, as expected, significantly greater than 50% (i.e., a random guess), we tend to believe that the detected lead–lag stock pairs contain valuable information; a higher value means that they will be more helpful in designing profitable investment strategies in practice. Otherwise, we should consider how to better utilize and mine the information contained in the detection results.
Following the prediction task, Fig. 8 displays the prediction accuracy under different values of ζ in each selected stock set. Here, the box plot for each value of ζ is built from 120 accuracy values, that is, the overall accuracy obtained for each predicted month (as displayed in Fig. 7) over the 10 years between January 2010 and December 2019. As shown in Fig. 8, the medians of overall accuracy under different values of ζ are only slightly higher than 0.50 for the CSI 300 and much higher than 0.50 for the S&P 500. Accordingly, a one-sample t-test is needed, especially for the CSI 300, to check whether the mean overall accuracy is significantly higher than 0.50 for every value of ζ, as stated in the previous paragraph. The results, listed in Table 7, show that all the mean values are significantly higher than 0.50, at least at the 0.10 significance level, regardless of the stock set and the value of ζ.
Combining the results reported in Fig. 8 and Table 7, we find that the information contained in the detected lead–lag stock pairs can help in designing profitable investment strategies. In addition, the performance is robust to the manually set variable ζ: the discrepancy between the highest and lowest mean accuracy values is within 2% in both stock sets. Furthermore, the accuracies achieved in the S&P 500 are all higher than those in the CSI 300, implying that the detected lead–lag stock pairs will be more beneficial in the S&P 500, which will be validated in the section “Investment strategies based on the detected lead–lag effect”.
In addition, different combinations of the two parameters (i.e., Δ and ζ) yield different accuracies in the prediction task. More importantly, knowing which combination performs best is useful for selecting parameters when designing investment strategies (see the next section). The two heat maps displayed in Fig. 9 show the results for each stock set. According to Fig. 9, when Δ is fixed, the prediction accuracy first increases and then decreases as ζ increases in most cases. When ζ is fixed, the prediction accuracy increases on average as Δ declines, with some exceptions. All the achieved accuracies are greater than 50%, demonstrating that the detected lead–lag stock pairs are helpful, even with the simple prediction task. Interestingly and more importantly, the best accuracy is achieved with the same parameter combination in the two stock sets; thus, the combination of Δ = 0.10 and ζ = 4 yields the best-performing lead–lag stock pairs and is adopted to design a more complicated investment strategy in the next section.
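The prediction rule for a single follower can be sketched in Python as follows. This is an illustrative reading of the task rather than the authors' code: the function names and the return series are hypothetical, and leaders with a flat movement are treated as abstaining from the vote, which is an assumption.

```python
def sign(x):
    """Map a return to its direction: 1 (up), -1 (down), or 0 (flat)."""
    return (x > 0) - (x < 0)

def predict_follower(leader_returns_prev_day):
    """Majority vote over the leaders' previous-day directions; ties give 0."""
    votes = sum(sign(r) for r in leader_returns_prev_day)
    return sign(votes)

def follower_accuracy(leader_returns, follower_returns):
    """Share of days on which the predicted direction equals the realized one.

    leader_returns[t] holds the returns of all leaders on day t, and
    follower_returns[t] is the follower's return on day t.
    Day t is predicted from the leaders' returns on day t - 1.
    """
    days = range(1, len(follower_returns))
    hits = 0
    for t in days:
        predicted = predict_follower(leader_returns[t - 1])
        hits += predicted == sign(follower_returns[t])
    return hits / len(days)

# Hypothetical daily returns for three leaders and one follower.
leader_returns = [
    [0.01, 0.02, -0.01],
    [-0.02, -0.01, 0.03],
    [0.01, -0.01, 0.0],
    [0.02, 0.01, 0.01],
]
follower_returns = [0.0, 0.015, -0.004, 0.002]
accuracy = follower_accuracy(leader_returns, follower_returns)
print(round(accuracy, 3))
```

Averaging this accuracy over all followers would give the monthly performance described above.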
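As a rough illustration of the check described here, the one-sample t statistic against a mean of 0.50 can be computed as below. This is a minimal sketch, not the authors' procedure: the six accuracy values are hypothetical stand-ins for the 120 monthly values, and it is an assumption that the test is one-sided.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, mu0=0.5):
    """t statistic for H0: mean(values) == mu0, with n - 1 degrees of freedom."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    return (mean(values) - mu0) / standard_error

# Hypothetical monthly overall accuracies (the paper uses 120 months).
monthly_accuracy = [0.52, 0.49, 0.53, 0.51, 0.50, 0.54]
t_stat = one_sample_t(monthly_accuracy)
print(round(t_stat, 3))
```

The statistic would then be compared with the critical value of the t distribution with n − 1 degrees of freedom at the chosen significance level.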
Investment strategies based on the detected lead–lag effect
The simple prediction task, described in Sect. 4.2.2, can be regarded as one of the most straightforward investment strategies because it only considers the direction of the predicted price movement without considering the trading details. In addition, the simple prediction task lays a foundation for designing more complicated investment strategies by revealing that the detected lead–lag stock pairs will yield more profitable information when Δ = 0.10 and ζ = 4. Based on the achieved parameter combination, this section extends the aforementioned simple prediction task into two types of practical investment strategies: the so-called “pure lead–lag strategy” and “enhancement strategies,” which are determined by integrating the pure lead–lag strategy into well-known alpha-factor strategies. The following subsections will first present the two strategies designed in this study and then report their performance to guide investors’ practices in real-world stock markets.
Pure lead–lag strategy
Our designed pure lead–lag strategy consists of three main steps based on the detected lead–lag stock pairs. The steps are listed in detail below.
Step 1: Calculate the strength of the influence of the leader on the follower.
A bipartite graph model is adopted to depict the leaders, followers, and their relationships, where the sets of leaders and followers are denoted as N and M, respectively. For any \(p \in N\) and \(q \in M\), d_{pq} denotes the number of accumulated days on which stock q follows stock p during the analyzed period (see Eq. (3)). Then, let s_{pq} represent the influence strength of leader p on follower q with the following mathematical expression:
According to Eq. (5), a greater number of accumulated days indicates a stronger influence. Once the detected bipartite graph is determined, the strength of the influence of all detected lead–lag stock pairs is also determined.
Step 2: Calculate each day’s accumulated influence on the follower.
Let w_{q,t} reflect the accumulated influence of the leader set on follower q on day t, which can be calculated as follows:
where r_{p,t} is the yield rate of stock p on day t according to Eq. (1). Then, similar to Eq. (5), normalizing the calculated w_{q,t} yields the ratio variable v_{q,t}, which determines the holding position of follower stock q on day t. Equation (7) gives the specific expression for v_{q,t} as follows:
Step 3: Adjust the holding position of follower stock q based on v_{q,t}.
At the end of trading day t, the holding positions of all follower stocks can be adjusted based on the calculated v_{q,t}. Here, we assume that our adjustments can be instantly completed according to each follower’s market price at the closing of the stock market that day. Let C_{t} denote the total assets just before adjusting stock positions on trading day t; generally, C_{t} comprises the stocks held and cash. Taking follower stock q as a representative example, the rule for adjusting stock positions is as follows: (1) when v_{q,t} > 0, the market value of stock q held is adjusted to v_{q,t}C_{t} through buying or selling, where the market value is measured at trading time; (2) when v_{q,t} ≤ 0, the holding of stock q is adjusted to 0.
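The three steps can be sketched in Python as follows. The exact forms of Eqs. (5) to (7) are not reproduced in this text, so every formula in the sketch is an assumption: s_{pq} is taken as d_{pq} normalized over the leaders of follower q, w_{q,t} as the strength-weighted sum of the leaders' returns, and v_{q,t} as w_{q,t} normalized over the followers with positive w_{q,t}. All names and numbers are hypothetical.

```python
def influence_strength(d):
    """Step 1 (assumed form): normalize accumulated days per follower.

    d maps (leader, follower) -> accumulated lead-lag days d_pq.
    """
    totals = {}
    for (_, q), days in d.items():
        totals[q] = totals.get(q, 0) + days
    return {(p, q): days / totals[q] for (p, q), days in d.items()}

def accumulated_influence(s, leader_returns):
    """Step 2 (assumed form): strength-weighted sum of leader returns."""
    w = {}
    for (p, q), strength in s.items():
        w[q] = w.get(q, 0.0) + strength * leader_returns[p]
    return w

def position_ratios(w):
    """Step 2 (assumed form): share of total assets per follower.

    Followers with w <= 0 get ratio 0, matching rule (2) of Step 3.
    """
    positive_total = sum(x for x in w.values() if x > 0)
    if positive_total == 0:
        return {q: 0.0 for q in w}
    return {q: (x / positive_total if x > 0 else 0.0) for q, x in w.items()}

def target_values(v, total_assets):
    """Step 3: target market value v_qt * C_t of each follower."""
    return {q: ratio * total_assets for q, ratio in v.items()}

# Hypothetical detected pairs, leader returns on day t, and total assets C_t.
d = {("A", "X"): 30, ("B", "X"): 10, ("A", "Y"): 20, ("C", "Y"): 20, ("B", "Z"): 15}
returns_t = {"A": 0.02, "B": 0.01, "C": -0.04}
s = influence_strength(d)
w = accumulated_influence(s, returns_t)
v = position_ratios(w)
targets = target_values(v, total_assets=1_000_000)
print(targets)
```

Under these assumptions the positive ratios sum to 1, so the portfolio is fully invested whenever at least one follower has a positive signal; the paper's own normalization may differ.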
Enhancement strategy
As mentioned in Sect. 2.2, the alpha-factor strategy selects stocks by calculating and ranking the value of the adopted alpha factor. The pure lead–lag strategy, in turn, provides both a set of selected stocks and a buy-and-sell signal. Naturally, the buy-and-sell signal can be applied to the stocks selected by both the alpha-factor and lead–lag strategies. The enhancement strategy is therefore designed by combining this buy-and-sell signal with the alpha-factor stock selection. In addition, the trading framework of the commonly used alpha-factor strategy requires that the value of the alpha factor concerned be updated each month. In practice, this value is calculated at the end of each month based on the technical data of that month, and then stock selection and trades are immediately made. Therefore, the trading day in the pure lead–lag strategy is identical to that in the alpha-factor strategy, and the two strategies are coherent in terms of trading time when forming the enhancement strategy.
The enhancement strategy consists of four steps. The first step is to detect the lead–lag stock pairs based on the preceding work of this study and then determine the set of follower stocks, denoted as Q. The second step is to choose one alpha factor and then calculate the alpha value of each stock in set Q. Without any loss of generality, let α_{q} denote the calculated alpha value of stock q for any stock \(q \in Q\). The third step is to obtain the variable value v_{q} (for any \(q \in Q\), hereafter) by following the first two steps of the pure lead–lag strategy as stated in Sect. 5.1. The last step is to define the trading rules based on the α_{q} and v_{q} obtained in the foregoing steps and then use these rules to adjust the holding position of stock q.
Here, we take our designed alpha01 as an example to describe the aforementioned steps of the enhancement strategy to present them more clearly. We assume that the first step has been conducted and designated the follower stock set Q. Then, according to the second step, Table 8 describes the detailed process of calculating the alpha01 value of each stock belonging to set Q.
Before conducting the steps of the proposed enhancement strategy, we first pay attention to the achieved \(\alpha_{i}^{01}\). On the one hand, if CORR_{i} is low among all stocks, RK_{i} will be a large number and thus \(\alpha_{i}^{01}\) will be high. In other words, if one stock’s opening price is not consistent with its trading volume in the analyzed month, the calculated alpha01 value of this stock will be high. On the other hand, Eq. (8) guarantees that the calculated alpha01 value ranges between − 0.5 and 0.5; half of the stocks traded on the stock market have a positive alpha01 value. Note that different alpha factors have different calculation processes. Our study adopts 20 different alpha factors by following Kakushadze (2016) to ensure that our results are sound. Their detailed calculation processes are presented in Appendix 1.
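The behaviour described above (a low open–volume correlation gives a high rank and therefore a high alpha01 value within − 0.5 and 0.5) can be illustrated with a short Python sketch. It assumes that alpha01 is the cross-sectional rank of the negated correlation, scaled into (0, 1], minus 0.5; the rank scaling, the function names, the handling of ties, and the price and volume data are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def cross_sectional_rank(values):
    """Rank each stock's value across stocks, scaled into (0, 1]; ties not handled."""
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return {stock: (position + 1) / n for position, stock in enumerate(ordered)}

def alpha01(open_prices, volumes):
    """Assumed form: Rank(-Correlation(Open, Volume)) - 0.5 for each stock."""
    neg_corr = {s: -pearson(open_prices[s], volumes[s]) for s in open_prices}
    return {s: rank - 0.5 for s, rank in cross_sectional_rank(neg_corr).items()}

# Hypothetical opening prices and trading volumes over a short window.
open_prices = {"A": [10, 11, 12, 13], "B": [10, 11, 12, 13], "C": [10, 11, 12, 13]}
volumes = {
    "A": [100, 110, 120, 130],  # moves with the price (correlation 1)
    "B": [130, 120, 110, 100],  # moves against the price (correlation -1)
    "C": [100, 130, 100, 130],  # weakly related to the price
}
alphas = alpha01(open_prices, volumes)
print(alphas)
```

In this toy example, stock B has the lowest correlation between opening price and volume and therefore receives the highest alpha01 value.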
Performance
This section validates the performance of our proposed lead–lag strategy and tests whether it improves the performance of the pure alpha-factor strategies within the enhancement strategy. As in the work of Stübinger (2019), the trading cost is set to 0.25%, and the naïve buy-and-hold investment strategy (MKT) is chosen as the benchmark. In addition, we choose the upper 5% of the daily return rate, or of the Sharpe ratio, of a series of random investment operations as another benchmark, where a random investment operation means buying one stock on a random day and selling it on a later random day. To obtain sound results, we designed 20 different alpha-factor strategies (see Appendix 1) for validation and chose a testing period of 10 years (i.e., from January 2010 to December 2019).
Because the alpha01 strategy is the example in Sect. 5.2, this section first focuses on the performance of the pure lead–lag strategy (PLL), the pure alpha01 strategy (Pure01), and the enhancement strategy of alpha01 (Enhan01). Table 9 reports their performance in the two target stock markets, where “mean returns” are obtained by averaging the daily return rate. Comparing the mean returns of each strategy, we find that the enhancement strategy performs best in both markets, indicating that the proposed lead–lag strategy significantly improves the performance of the Pure01 strategy. Furthermore, PLL outperforms MKT in both markets, to different degrees, in terms of mean returns, meaning that PLL contains valuable information for investment. Moreover, because adding the signals provided by the lead–lag strategy to the Pure01 strategy yields a better-performing enhancement strategy, the lead–lag signals appear to carry information beyond that captured by the pure strategy. Considering the other indices listed in Table 9, the higher positive values of skewness and the Sharpe ratio for Enhan01 are more desirable properties for any potential investor in both markets (Cont 2010; Fievet and Sornette 2018). In addition, compared to Pure01, using the signal from the PLL significantly reduces the maximum drawdown of Enhan01 and thus increases the advantage afforded by the enhancement strategy.
Next, we conduct random investment operations in each stock market during the selected 10-year period, record the average daily return rate (mean returns) and the Sharpe ratio of each operation, and then rank them in the figures. Figure 10 displays the results of 5,000 simulations, and the upper 5% of the ranked mean return values, or Sharpe ratios, is set as the threshold. When the corresponding value of a proposed strategy is above the threshold, we deem that the performance of the proposed strategy is significantly better than the benchmark at a significance level of 0.05. Recalling the performance of Enhan01 shown in Table 9, the mean returns are 0.00036 and 0.01257 in the U.S. and Chinese stock markets, respectively, both higher than the corresponding highlighted thresholds reported in Panels (a) and (c) of Fig. 10. A similar result holds for the Sharpe ratio.
Following this analytical process, the performance of the remaining 19 alpha-factor strategies and their corresponding enhancement strategies is also tested. All performance results are listed in Tables 10 and 11 for the two stock markets. To facilitate comparison, Fig. 11 displays the mean return of each selected alpha factor in each market as well as for the two types of benchmarks. Here, it is easy to see that the benchmark from random investment operations is higher than that from MKT. According to Fig. 11, we obtain the following findings: (1) all of the enhancement strategies perform better than the two benchmarks in both markets, demonstrating the usefulness of the proposed strategies; and (2) almost all of the enhancement strategies perform better than the corresponding pure alpha-factor strategies, illustrating that the signal provided by our proposed lead–lag strategy does improve the performance of pure alpha-factor strategies in most cases.
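For readers less familiar with the risk indices mentioned here, the following Python sketch shows one common way to compute a Sharpe ratio and a maximum drawdown from a series of daily returns. The paper's exact definitions (for example, the risk-free rate and any annualization) are not given in this text, so the choices below are assumptions, and the return series is hypothetical.

```python
from statistics import mean, stdev

def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Mean excess daily return divided by its standard deviation (not annualized)."""
    excess = [r - risk_free_daily for r in daily_returns]
    return mean(excess) / stdev(excess)

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of cumulative wealth, as a fraction of the peak."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst

# Hypothetical daily returns of a strategy.
daily_returns = [0.10, -0.20, 0.05, -0.10, 0.30]
sharpe = sharpe_ratio(daily_returns)
mdd = max_drawdown(daily_returns)
print(round(sharpe, 4), round(mdd, 4))
```

A higher Sharpe ratio and a lower maximum drawdown both indicate a more attractive risk profile, which is the sense in which the comparison above is made.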
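The random-operation benchmark can be sketched as follows. This is a simplified illustration, not the authors' simulation: it uses a single synthetic price series instead of a randomly chosen stock from the market, ignores trading costs, and uses hypothetical function names and parameters.

```python
import random

def random_operation_mean_return(prices, rng):
    """Buy on a random day, sell on a later random day; return the mean daily return."""
    buy = rng.randrange(len(prices) - 1)
    sell = rng.randrange(buy + 1, len(prices))
    daily = [prices[t] / prices[t - 1] - 1 for t in range(buy + 1, sell + 1)]
    return sum(daily) / len(daily)

def upper_quantile_threshold(values, upper_share=0.05):
    """Smallest value that lies within the top `upper_share` of the sample."""
    ordered = sorted(values)
    cutoff = int(len(ordered) * (1 - upper_share))
    return ordered[min(cutoff, len(ordered) - 1)]

# Synthetic price path standing in for one stock (an assumption for illustration).
rng = random.Random(0)
prices = [100.0]
for _ in range(250):
    prices.append(prices[-1] * (1 + rng.gauss(0.0003, 0.01)))

simulated = [random_operation_mean_return(prices, rng) for _ in range(5000)]
threshold = upper_quantile_threshold(simulated)
print(threshold)
```

A strategy whose mean return exceeds `threshold` would then be regarded as beating the random benchmark at the 0.05 level, in the sense described above.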
Discussion
Section 4.2.2, Fig. 8, and Table 7 show that the overall prediction accuracy is significantly higher than 50% (i.e., better than a random guess) in each case, implying that the stocks with the lead–lag effect provide useful information for prediction and strategy design. However, an accuracy only modestly above 50% is not exceptionally high, particularly in the CSI 300; thus, some stock pairs perform worse than a random guess in each prediction period. Inspired by this result, we provide a refined process in which the detected lead–lag stock pairs with a prediction accuracy of less than 50% are eliminated from the selected stock set. Accordingly, the refined process shrinks the set of stock pairs with the lead–lag effect by deleting those with inferior prediction performance in the training data. Then, by incorporating the refined process into the enhancement strategy proposed above, we obtain the so-called refined strategy, whose performance is tested as in Sect. 5.3.
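The refinement step amounts to a simple filter over the detected pairs. A minimal Python sketch, with hypothetical pair labels and accuracy values, is given below.

```python
def refine_pairs(pair_accuracy, min_accuracy=0.5):
    """Keep only lead-lag pairs whose training-period accuracy reaches the cutoff."""
    return {pair for pair, acc in pair_accuracy.items() if acc >= min_accuracy}

# Hypothetical (leader, follower) pairs with their training-period accuracy.
pair_accuracy = {("A", "X"): 0.56, ("B", "X"): 0.47, ("A", "Y"): 0.50}
refined = refine_pairs(pair_accuracy)
print(sorted(refined))
```

Only pairs with an accuracy below 50% are dropped, so a pair sitting exactly at 50% is kept in this sketch.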
All performance results are listed in Table 12 for each market, and Fig. 12 displays the mean returns of each enhancement strategy and its corresponding refined strategy in each market with the two selected benchmarks. According to Fig. 12, the refined strategies have different degrees of improvement in profit over the original enhancement strategies in the CSI 300, indicating that the refined process does provide more practical information for investing in the CSI 300. However, for the S&P 500, most refined strategies outperform the original enhancement strategies, but some perform worse than the original enhancement strategy, especially when the original strategy is already very profitable. This result implies that the refined process does not always work better than the original process, possibly because some helpful information may be dropped during the refinement process.
Furthermore, in the risk analysis, Table 12 shows that the refined strategy generally improves the Sharpe ratio and reduces the maximum drawdown rate in the CSI 300. In contrast, improvement is not evident in the S&P 500. These results indicate that the refinement process effectively discards risky lead–lag signals, but its performance depends on the application scenario. Overall, the refined strategy is more suitable for the CSI 300, but for the S&P 500, it may serve as an alternative to the original enhancement strategy.
Conclusions and future work
The power-law distribution is often observed in human activity, which may explain its widespread existence in stock markets. Interestingly, this study finds that the number of accumulated lead–lag days between stock pairs follows a power-law distribution in both the U.S. and Chinese stock markets based on 10 years of data. Because the power-law distribution features a heavy tail, this study also formally defines the lead–lag effect via statistical testing and then proposes a new method for detecting stock pairs characterized by the defined lead–lag effect. The robustness of the detection method and the roles of its embedded parameters are tested and explored. As an application, a PLL investment strategy is first proposed based on stock pairs identified with the lead–lag effect. Although the proposed lead–lag strategy can beat a naïve buy-and-hold strategy, its advantage is too limited to be satisfactory. To this end, enhancement strategies are also designed by integrating the lead–lag strategy into the selected basic alpha-factor strategies. Then, a series of validations are conducted on 20 different alpha factors to ensure sound results. The results demonstrate that the enhancement strategy significantly improves the performance of the basic alpha-factor strategies and the PLL strategy in most cases.
In theory, the discovered power-law distribution implies that the lead–lag phenomenon common in stock markets is attributable not only to random factors but also to human behaviors such as irrationality, herding, gaming behavior, and many others. Importantly, this finding provides new evidence in support of behavioral finance theory. The proposed detection method can be considered solid and credible because it originates from the principle of statistical testing, contributing to the existing methods for detecting the lead–lag phenomenon or effect. In practice, because the lead–lag effect is demonstrated in this study to provide effective information, it will benefit the design of innovative and effective investment strategies that are especially suitable for low-frequency data owing to their ease of operation. The idea of an enhancement strategy (i.e., a basic strategy supplemented by the lead–lag strategy) provides investors with a new framework for strategy design with potentially positive back-tested performance and practicality.
Our study does have some limitations. Although we selected two representative stock markets as the targets for this examination, the analysis and validation of additional stock markets is required. In future work, we will study the characteristics of the lead–lag phenomenon in different emerging markets at different stages of economic development. Many previous studies have confirmed that there are more opportunities for profit in emerging markets than in mature markets; thus, whether our proposed strategy is effective in various emerging markets remains a question of interest. Although our proposed lead–lag strategy and enhancement strategies exhibit a significant improvement compared to the selected benchmarks, only one type of basic investment strategy (i.e., the alpha strategy) is considered in this work. In future work, other investment strategies can be selected as the basic strategy to be enhanced with the lead–lag strategy, to design more competitive stock market investment strategies. As this is a preliminary examination of these two directions, richer findings and more profitable investment strategies are welcome in the future.
Availability of data and material
Data and codes are available at https://github.com/liuchaos03/PowerlawdistributionLeadlageffectandInvestmentstrategiesinStockMarkets.
References
Balatti M, Brooks C, Kappou K (2017) Fundamental indexation revisited: new evidence on alpha. Int Rev Financ Anal 51:1–15
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Barabási AL, Bonabeau E (2003) Scale-free networks. Sci Am 288(5):60–69
Berggrun L, Cardona E, Lizarzaburu E (2020) Profitability of momentum strategies in Latin America. Int Rev Financ Anal 70:101502
Buccheri G, Corsi F, Peluso S (2019) High-frequency lead–lag effects and cross-asset linkages: a multi-asset lagged adjustment model. J Bus Econ Stat. https://doi.org/10.1080/07350015.2019.1697699
Campajola C, Lillo F, Tantari D (2020) Unveiling the relation between herding and liquidity with trader lead–lag networks. Quant Finance 20(11):1765–1778
Casgrain P, Jaimungal S (2019) Trading algorithms with learning in latent alpha models. Math Financ 29(3):735–772
Clauset A, Shalizi CR, Newman MEJ (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703
Conlon T, Cotter J, Gencay R (2018) Long-run wavelet-based correlation for financial time series. Eur J Oper Res 271(2):676–696
Cont R (2010) Empirical properties of asset returns: stylized facts and statistical issues. Quant Finance 1(2):223–236
Coronel-Brizio HF, Hernández-Montoya AR (2010) The Anderson–Darling test of fit for the power-law distribution from left-censored samples. Physica A Stat Mech Appl 389(17):3508–3515
Curme C, Tumminello M, Mantegna RN, Stanley HE, Kenett DY (2015) Emergence of statistically validated financial intraday lead–lag relationships. Quant Finance 15(8):1375–1386
Dao TM, McGroarty F, Urquhart A (2018) Ultra-high-frequency lead–lag relationship and information arrival. Quant Finance 18(5):725–735
Deev O, Lyócsa Š (2020) Connectedness of financial institutions in Europe: a network approach across quantiles. Phys A Stat Mech Appl 550:124035–124041
Eisdorfer A, Goyal A, Zhdanov A (2019) Equity misvaluation and default options. J Financ 74(2):845–898
Fama EF, French KR (2012) Size, value, and momentum in international stock returns. J Financ Econ 105(3):457–472
Fama EF, French KR (1998) Value versus growth: the international evidence. J Financ 53:1975–1999
Fama EF, French KR (2015) A five-factor asset pricing model. J Financ Econ 116(1):1–12
Fama EF, French KR (2016) Dissecting anomalies with a five-factor model. Rev Financ Stud 29(1):69–103
Fievet L, Sornette D (2018) Decision trees unearth return sign predictability in the S&P 500. Quant Finance 18(11):1797–1814
Flori A, Regoli D (2021) Revealing pairs-trading opportunities with long short-term memory networks. Eur J Oper Res. https://doi.org/10.1016/j.ejor.2021.03.009
Fonseca DJ, Zaatour R (2017) Correlation and lead–lag relationships in a Hawkes microstructure model. J Futur Mark 37(3):260–285
Gabaix X, Gopikrishnan P, Plerou V, Stanley HE (2003) A theory of power-law distributions in financial market fluctuations. Nature 423(6937):267–270
Gong CC, Ji SD, Su LL, Li SP, Ren F (2016) The lead–lag relationship between stock index and stock index futures: a thermal optimal path method. Physica A 444:63–72
Gupta K, Chatterjee N (2020) Selecting stock pairs for pairs trading while incorporating lead–lag relationship. Phys A Stat Mech Appl 551:124103
Harvey CR, Liu Y, Zhu H (2016) … and the cross-section of expected returns. Rev Financ Stud 29(1):5–68
Hou K, Xue C, Zhang L (2015) Digesting anomalies: an investment approach. Rev Financ Stud 28(3):650–705
Huang WQ, Zhuang XT, Yao S (2009) A network analysis of the Chinese stock market. Physica A 388(14):2956–2964
Huth N, Abergel F (2014) High frequency lead/lag relationships—empirical facts. J Empir Financ 26:41–58
Jiang T, Bao S, Li L (2019) The linear and nonlinear lead–lag relationship among three SSE 50 Index markets: the index futures, 50ETF spot and options markets. Physica A Stat Mech Appl 525:878–893
Jong DF, Nijman T (1997) High frequency analysis of lead–lag relationships between financial markets. J Empir Financ 4(2–3):259–277
Kakushadze Z (2016) 101 formulaic alphas. Wilmott 2016(84):72–81
Kuiper NH (1960) Tests concerning random points on a circle. Proc Ser A 63(1):38–47
Kobayashi T, Takaguchi T (2018) Social dynamics of financial networks. EPJ Data Sci 7(1):15
Krauss C (2017) Statistical arbitrage pairs trading strategies: review and outlook. J Econ Surv 31(2):513–545
Kumar S, Deo N (2012) Correlation and network analysis of global financial indices. Phys Rev E 86(2):026101
Li Y, Liu C, Wang T, Sun B (2021) Dynamic patterns of daily lead–lag networks in stock markets. Quant Finance 21(12):2055–2068
Liu J, Stambaugh RF, Yuan Y (2019) Size and value in China. J Financ Econ 134(1):48–69
Makarov I, Plantin G (2015) Rewarding trading skills without inducing gambling. J Financ 70(3):925–962
Malevergne Y, Pisarenko V, Sornette D (2011) Testing the Pareto against the lognormal distributions with the uniformly most powerful unbiased test applied to the distribution of cities. Phys Rev E 83(3):
Massey FJ Jr (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78
Newman ME, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Phys Rev E 64(2):026118
Peralta G, Zareei A (2016) A network approach to portfolio selection. J Empir Financ 38:157–180
Rickles D (2011) Econophysics and the complexity of financial markets. In: Hooker C (ed) Philosophy of complex systems. NorthHolland, Amsterdam, pp 531–565
Scherbina A, Schlusche B (2020) Follow the leader: using the stock market to uncover information flows between firms. Rev Finance 24(1):189–225
Scholz FW, Stephens MA (1987) K-sample Anderson–Darling tests. J Am Stat Assoc 82(399):918–924
Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J Financ 19(3):425–442
Shen D, Zhang Y, Xiong X, Zhang W (2017) Baidu index and predictability of Chinese stock returns. Financ Innov 3(1):1–8
Stübinger J (2019) Statistical arbitrage with optimal causal paths on high-frequency data of the S&P 500. Quant Finance 19(6):921–935
Toda AA (2012) The double power law in income distribution: explanations and evidence. J Econ Behav Org 84(1):364–381
Tóth B, Kertész J (2006) Increasing market efficiency: evolution of cross-correlations of stock returns. Physica A 360(2):505–515
Volz E (2004) Random networks with tunable degree distribution and clustering. Phys Rev E 70(5):056115
Xia L, You D, Jiang X, Chen W (2018) Emergence and temporal structure of lead–lag correlations in collective stock dynamics. Physica A Stat Mech Appl 502:545–553
Xiong X, Cui Y, Yan X, Liu J, He S (2020) Cost-benefit analysis of trading strategies in the stock index futures market. Financ Innov 6(1):1–17
Zeng K, Atta Mills EFE (2021) Can economic links explain lead–lag relations across firms? Int J Finance Econ. https://doi.org/10.1002/ijfe.2480
Zhang W, Yan K, Shen D (2021) Can the Baidu Index predict realized volatility in the Chinese stock market? Financ Innov 7(1):1–31
Funding
This work was supported by the National Natural Science Foundation of China (72171059, 71771041), the Fundamental Research Funds for the Central Universities (FRFCU5710000220) and the Natural Science Foundation of Heilongjiang Province, China (No. YQ2020G003).
Author information
Contributions
YL: Conceptualization, Methodology, Formal analysis, and Writing – Original Draft. TW: Methodology, Software, Formal analysis, and Writing – Original Draft. BS: Writing – Review & Editing, Supervision, Validation. CL: Visualization, Software, Validation, and Data Curation. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: The 20 designed alpha factors and their expressions
This appendix presents the 20 designed alpha factors in detail by giving their formulaic expressions. Table 13 describes the symbols for the variables in the collected data, and Table 14 lists the operators and functions used in the expressions.
The 20 alpha formulas are listed below.

Alpha01: \({\rm Rank}\left(-1*{\rm Correlation}\left(Open,Volume,20\right)\right)-0.5\)

Alpha02: \({\rm Rank}(-1*{\rm Correlation}({\rm Rank}({\rm Delta}(Volume,10)),{\rm Rank}(\frac{Close-Open}{Open}),20))-0.5\)

Alpha03: \({\rm Rank}(-1*{\rm Ts}\_{\rm Rank}(Low,20))-0.5\)

Alpha04: \({\rm Rank}\left(-1*{\rm Correlation}\left(Open,Volume,20\right)\right)-0.5\)

Alpha05: \({\rm Rank}\left({\rm Sign}\left({\rm Delta}\left(Volume,1\right)\right)*\left(-1*{\rm Delta}\left(Close,20\right)\right)\right)-0.5\)

Alpha06: \(0.5-1*{\rm Rank}({\rm Covariance}({\rm Rank}(Close),{\rm Rank}(Volume),20))\)

Alpha07: \({\rm Rank}((-1*{\rm Rank}({\rm Delta}(Returns,10)))*{\rm Correlation}(Open,Volume,20))-0.5\)

Alpha08: \(0.5-1*{\rm Rank}({\rm Correlation}({\rm Rank}(High),{\rm Rank}(Volume),20))\)

Alpha09: \(0.5-1*{\rm Rank}({\rm Covariance}({\rm Rank}(High),{\rm Rank}(Volume),20))\)

Alpha10: \(0.5-1*{\rm Rank}({\rm Correlation}({\rm Ts}\_{\rm Rank}(Volume,5),{\rm Ts}\_{\rm Rank}(High,5),15))\)

Alpha11: \({\rm Rank}({\rm Correlation}(Adv20,Low,5)+((High+Low)/2)-Close)-0.5\)

Alpha12: \({\rm Rank}({\rm Correlation}({\rm Delay}((Open-Close),1),Close,20))+{\rm Rank}((Open-Close))-0.5\)

Alpha13: \({\rm Rank}(-1*{\rm Rank}({\rm Std}(High,20))*{\rm Correlation}(High,Volume,20))-0.5\)

Alpha14: \({\rm Rank}(-1*{\rm Correlation}(High,{\rm Rank}(Volume),20))-0.5\)

Alpha15: \({\rm Rank}(-1*{\rm Delta}((2*Close-Low-High)/(Close-Low),20))-0.5\)

Alpha16: \({\rm Rank}({\rm Correlation}((Low-Close)*(Open^5),(Low-High)*(Close^5),20))-0.5\)

Alpha17: \(0.5-{\rm Rank}({\rm Correlation}({\rm Rank}(\frac{Close-{\rm Min}(Low,12)}{{\rm Max}\left(High,12\right)-{\rm Min}\left(Low,12\right)}),{\rm Rank}(Volume),6))\)

Alpha18: \({\rm Rank}({\rm Correlation}\left(Close-Open,High-Low,20\right))-0.5\)

Alpha19: \({\rm Rank}(2-{\rm Rank}({\rm Std}(Returns,7)/{\rm Std}(Returns,20))-{\rm Rank}({\rm Delta}(Close,7)))-0.5\)

Alpha20: \({\rm Rank}({\rm Ts}\_{\rm Rank}(Volume/Adv20,20)*{\rm Ts}\_{\rm Rank}(-1*{\rm Delta}(Close,7),7))-0.5\)
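To make the notation concrete, the following minimal Python sketch evaluates Alpha01, read here as Rank(-1 * Correlation(Open, Volume, 20)) - 0.5, on a single trading day. The operator semantics are assumptions, since Table 14 is not reproduced in this appendix: Correlation is taken to be the Pearson correlation over the most recent d days of one stock, and Rank is taken to be a cross-sectional rank across stocks scaled to [0, 1]. The helper names `rolling_corr`, `cross_rank`, and `alpha01` are hypothetical, and the authors' actual implementation may differ.

```python
from math import sqrt


def rolling_corr(x, y, d):
    """Pearson correlation of the last d observations of x and y."""
    xs, ys = x[-d:], y[-d:]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: an undefined correlation is treated as 0
    return cov / sqrt(vx * vy)


def cross_rank(values):
    """Cross-sectional rank scaled to [0, 1]; tied values share an average rank.

    The (position / (n - 1)) scaling is an assumption made for this sketch.
    """
    names = sorted(values, key=values.get)
    n = len(names)
    if n == 1:
        return {names[0]: 0.5}
    ranks = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of stocks tied with names[i].
        while j + 1 < n and values[names[j + 1]] == values[names[i]]:
            j += 1
        avg_position = (i + j) / 2
        for k in range(i, j + 1):
            ranks[names[k]] = avg_position / (n - 1)
        i = j + 1
    return ranks


def alpha01(open_px, volume, d=20):
    """Alpha01 per stock: Rank(-1 * Correlation(Open, Volume, d)) - 0.5."""
    raw = {s: -1.0 * rolling_corr(open_px[s], volume[s], d) for s in open_px}
    return {s: r - 0.5 for s, r in cross_rank(raw).items()}


if __name__ == "__main__":
    # Toy data (not from the paper): three stocks, four days, window d = 4.
    open_px = {"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 2, 3, 4]}
    volume = {"A": [10, 20, 30, 40], "B": [40, 30, 20, 10], "C": [10, 30, 10, 30]}
    print(alpha01(open_px, volume, d=4))
```

In this toy example, stock B, whose volume falls as its opening price rises, receives the highest score (0.5), and stock A, whose volume rises with its price, receives the lowest (-0.5). Subtracting 0.5 centers the ranked factor around zero, so positive and negative values can be read as relative long and short signals.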
Appendix 2: List of abbreviations
See Table 15.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Y., Wang, T., Sun, B. et al. Detecting the lead–lag effect in stock markets: definition, patterns, and investment strategies. Financ Innov 8, 51 (2022). https://doi.org/10.1186/s40854-022-00356-3
DOI: https://doi.org/10.1186/s40854-022-00356-3