Novel modelling strategies for high-frequency stock trading data
Financial Innovation volume 9, Article number: 39 (2023)
Abstract
Full electronic automation in stock exchanges has recently become popular, generating high-frequency intraday data and motivating the development of near real-time price forecasting methods. Machine learning algorithms are widely applied to mid-price stock predictions. Processing raw data as inputs for prediction models (e.g., data thinning and feature engineering) can substantially affect the performance of the prediction methods. However, researchers rarely discuss this topic. This motivated us to propose three novel modelling strategies for processing raw data. We illustrate how our novel modelling strategies improve forecasting performance by analyzing high-frequency data of the Dow Jones 30 component stocks. In these experiments, our strategies often lead to statistically significant improvements in predictions. The three strategies improve the F1 scores of the SVM models by 0.056, 0.087, and 0.016, respectively.
High-frequency trading (HFT) arises from increased electronic automation in stock exchanges, which features the use of extraordinarily high-speed and sophisticated computer programs for generating, routing, and executing orders (Securities Commission 2010; Menkveld 2013). Investment banks, hedge funds, and institutional investors design and implement algorithmic trading strategies to identify emerging stock price surges (Parlour and Seppi 2008). The increase in transaction efficiency increases the complexity of limit order book (LOB) data. Compared with stock trading before electronic automation, more quote data are generated in the LOB during the high-frequency trading process. Extracting useful information and modelling the complexity of massive LOB data for precise stock mid-price predictions are empirical big data challenges for traditional time-series methods. For instance, Qian and Gao (2017) suggest that classical machine learning methods surpass traditional time-series models, such as ARIMA and GARCH, in precision for financial predictions. As computational resources and increasingly large, sophisticated datasets continue to expand in the financial field, scholars and practitioners have developed increasingly elaborate methods for analyzing complex financial markets. In particular, machine learning has gained popularity in the finance industry because of its ability to capture nonlinearity, its effectiveness, and its strong predictive power. Innovative studies have demonstrated promising results for a variety of tasks. For example, machine learning and other advanced models have been employed for financial data mining (Li et al. 2021), financial market microstructure investigations (Qiao and Beling 2016; Huang et al. 2017), and stock price analysis (Chen et al. 2003; Wen et al. 2019).
Quantitative analyses of financial price predictions are important because more accurate predictions lead to higher profits from trading strategies (Fletcher and Shawe-Taylor 2013; Kercheval and Zhang 2015). The quality of the prediction depends on two major factors: (1) the choice of statistical learning method used to train the prediction model and (2) the choice of input for machine learning methods, i.e., the extraction of information from large raw data, such as input variables (predictors) and subsets of training samples. The majority of the literature (Arévalo et al. 2016; Dixon 2016; Kong and Zhu 2018) focuses on enhancing prediction accuracy with advanced machine learning or deep learning models, which address the first factor discussed above. However, to the best of our knowledge, little attention has been paid to the second factor. This issue motivated us to study how to extract useful information from large amounts of raw data as inputs for machine learning methods. Next, we explain the importance of preprocessing raw data as input predictors, common practices to extract features from raw data, and the issues we want to address.
Although high-frequency data offer new opportunities to learn high-resolution information at the nanosecond level for financial analysis, they create new challenges in acquiring and utilizing massive amounts of information. Given the vast number of high-frequency data records, it is computationally prohibitive to consider the entire dataset. Furthermore, close observations in high-frequency data are highly correlated (Campbell et al. 1992; Campbell et al. 2012), which violates the independence assumption of most machine-learning models. Hence, it is critical to properly process the raw data and convert them into meaningful inputs for machine learning models. To address this issue, the common practice in the literature is to apply the event-based protocol (Ntakaris et al. 2018; Nousi et al. 2019) together with a sampling strategy, which randomly subsamples raw data at fixed events. Both the event-based protocol and the sampling strategy are forms of data thinning. Such approaches substantially reduce the size of the dataset and weaken the correlation among observations. However, this widely used data thinning approach has three disadvantages. First, data thinning compromises the high-resolution advantage of high-frequency trading data: while reducing the data density, data thinning (i.e., the event protocol and subsampling process) discards the inherent information between fixed events. Second, randomness in the data thinning procedure (e.g., different starting points of events and sampling strides) affects the models' robustness and reproducibility. Third, long-term trends in price history could provide useful information for prediction, but they are rarely used in current models.
High-frequency data over a long historical period are difficult for most models to handle because they lead to numerous correlated predictor variables, which dilutes the impact of all predictors in the model and creates severe collinearity problems. Researchers tend to construct scalar variables based on data at or close to a specific timestamp without leveraging information over a longer time scope (Kercheval and Zhang 2015; Ntakaris et al. 2019; Nousi et al. 2019; Ntakaris et al. 2020). In this study, we propose three novel modelling strategies that address these disadvantages to alleviate the insufficient use of high-frequency data and improve mid-price prediction performance.
To overcome the first disadvantage, we devise Strategy I, which uses a collection of variables that summarize and recover useful information discarded during data thinning. In response to the second disadvantage, our second strategy proposes a stock price prediction framework called 'Sampling+Ensemble'. This framework consists of two steps: the first step fits training models on many random subsets of the original samples, and the second step integrates the results from all models fitted in the first step and generates the final prediction through a voting scheme. This strategy combines the 'Sampling' step, which reduces between-sample correlation and the computational load in each subset, with the 'Ensemble' step, which increases the precision and robustness of predictions. The strategy is flexible, as users can choose from a wide range of machine learning models as their processors (base learners) to analyze each data subset generated in the 'Sampling' step. In real data experiments, we used the support vector machine (SVM) and elastic net (ENet) models as the base learners to obtain benchmark results for performance comparison. Owing to the ENet model's automatic feature selection property, we identified the importance of the predictor variables by ranking the total number of times each predictor was selected in the 'Sampling' step. Finally, our third novel strategy introduces a new feature to high-frequency stock price modelling, which emphasizes the importance of considering longer-term price trends. The functional principal component analysis (FPCA) method (Ramsay 2004) helps compress historical price information from a long, ordered list of correlated predictors into a few orthogonal predictors. We customize features that capture long-term price patterns over the past day and examine whether they improve the prediction model.
The proposed method can be applied to high-frequency trading algorithms to achieve improved forecasting performance and informational efficiency. We illustrate the performance improvements of our three novel strategies using high-frequency intraday data on the Dow Jones 30 component stocks from the New York Stock Exchange (NYSE) Trade and Quote (TAQ) database. Following the problem setup in previous work (Kercheval and Zhang 2015), we treat mid-price movement prediction as a three-class classification problem (i.e., up, down, or stationary mid-price states) for every next 5th event in each random subset of training data. We forecast the Dow Jones 30 stock prices using various machine learning methods with and without each of our novel strategies and evaluate the improvement in prediction performance attributable to each strategy. We used precision, recall, and F1 score as performance metrics, which are widely used in the machine learning community. To investigate the uncertainty of our comparison, we repeated our experiments 100 times on different random subsets of the original data and compared the performance metrics (e.g., F1 scores) using the nonparametric Wilcoxon signed-rank test. Evaluation results of SVM models show that our second strategy (Sampling+Ensemble) is consistently helpful, significantly outperforming the original models without this strategy in all 30 stocks, with up to a 0.23 increase in F1 scores. Our first strategy (recovering the information discarded by the data thinning process) is often helpful, significantly improving the prediction performance in 27 out of 30 stocks. Our third strategy (modelling long-term price trends by FPCA) can sometimes help, significantly improving prediction performance in 3 out of 30 stocks. Note that whether our first or third strategy helps depends on the characteristics of the data.
If the last observation of an event window always carries most of the information in the window, recovering the loss during data thinning cannot help. If the long-term trends of a stock are unstable, modelling longer-term trends cannot be helpful. Finally, the ENet models provide us with the most frequently selected features for predicting each mid-price direction, which is novel knowledge for extending the existing feature set for high-frequency mid-price prediction in further studies.
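The paired comparison described above (100 repeated experiments, Wilcoxon signed-rank test on F1 scores) can be sketched as follows; the F1 arrays here are synthetic stand-ins, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Synthetic F1 scores from 100 repeated experiments (illustrative only):
# the strategy adds a small but consistent improvement over the baseline.
f1_baseline = rng.uniform(0.45, 0.55, size=100)
f1_strategy = f1_baseline + rng.uniform(0.01, 0.05, size=100)

# Paired, nonparametric test of whether the improvement is significant.
stat, pvalue = wilcoxon(f1_strategy, f1_baseline, alternative="greater")
```

Because the test is paired, each repetition's baseline and strategy scores must come from the same random subset of data.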
In the remainder of this study, we describe the setup of our research problem and propose novel strategies for data preprocessing. Then, we provide a brief introduction to two machine learning methodologies that are used to illustrate our novel strategies. We demonstrate how our novel strategies improve the prediction performance using the TAQ data analysis results. Finally, we conclude the study and discuss its limitations.
Problem setup
This section introduces the research questions and defines the notations and evaluation criteria of model performance.
In this study, our goal is to predict mid-price changes based on high-frequency LOB data. A limit order is an order to buy or sell a security at a specific price or better. A buy limit order is an order to buy at the current or a lower price, while a sell limit order sells a security at no less than a specific price. The LOB accepts both buy and sell limit orders and matches buyers and sellers in the market. The highest bid price, denoted by \(P^{bid}\), is the best bid price, whereas the lowest ask price, denoted by \(P^{ask}\), is the best ask price. Their average defines the so-called mid-price, namely \(P^{mid}=(P^{bid}+P^{ask})/2\), whose movement is what we predict. Every new limit order submission from either a buyer or a seller creates and updates a new entry in the limit order book. More specifically, if the best bid price or best ask price is updated in the LOB, the mid-price is updated accordingly, which we define as a trading event.
Assume that a dataset consists of chronologically recorded LOB events with an index ranging from 1 to N. The occurrence of the N events (i.e., quotes) depends on the market. They do not have a steady inflow rate, i.e., the time intervals between two consecutive events vary tremendously, from nanoseconds to minutes. Following the literature on event-based inflow protocols, we grouped every k consecutive events in a window, which leads to N/k windows for downstream analysis. Previous studies (Ntakaris et al. 2018; Kercheval and Zhang 2015) proposed various choices for the value of the parameter k, ranging from 2 to 15. This value is not critical for illustrating the performance of our proposed novel strategies; therefore, we set \(k=5\) for simplicity in our discussion. To handle the window-based data structure, as illustrated in Fig. 1, we used a two-dimensional index system (i, j) as the subscript for each event, where \(i=1, \ldots , N/k\) denotes the ith window and \(j=1, \ldots , k\) denotes the event's position within the window. For example, the first LOB event in the 4th window occurred at time \(t_{4,1}\) and had mid-price \(P^{mid}_{4,1}\). To forecast this mid-price, we can use information from the previous windows.
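As a minimal sketch of this (i, j) indexing, assuming events arrive as a flat chronological array (our hypothetical helper, not the authors' code):

```python
import numpy as np

def window_index(n_events, k=5):
    """Map a flat event index 0..n_events-1 to (window i, position j),
    both 1-based as in the paper's (i, j) notation. A trailing partial
    window is still numbered here and would be dropped in practice."""
    idx = np.arange(n_events)
    i = idx // k + 1          # window number: 1, ..., ceil(n_events / k)
    j = idx % k + 1           # position within the window: 1, ..., k
    return i, j

# Example: 12 events with k = 5 fall into windows 1, 2 and a partial window 3.
i, j = window_index(12, k=5)
```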
The input data formats for supervised machine learning methods differ significantly from those for time-series methods. Time-series methods consider the data as a time-ordered vector of length N, whereas the input of machine learning methods consists of an N-dimensional outcome vector (the same as a time series) and a predictor matrix of dimension \(N \times p\). That is, each row of the machine learning input data consists of an outcome (the mid-price at a certain time point) and p predictors/features created from historical mid-price information before that time point (e.g., mid-prices of the last five trading events, or two weeks of historical mid-prices traced back from the current time point).
For a time series, the relationship between consecutive observations carries the most critical information. Such information is equivalent to the outcome-predictor relationship within each row of the machine learning input data. Hence, the order of the rows is not critical for machine learning methods. Moreover, the correlation between consecutive observations is helpful for prediction in time-series methods, whereas correlations between rows in the machine learning data matrix violate the independence assumptions of most machine learning methods. Therefore, decorrelation among the rows of the data matrix is important for machine learning methods.
We defined three types of predictor variables to summarize the high-frequency historical information at different resolution levels under our proposed strategies. The first type consists of variables at the window level, which are fetched using one event (usually the last one) in each window as the standard classic features used in the literature. The details of this type of variable are presented in Table 3 in the Data Cleaning and Multiresolution Features Construction section. The second type consists of variables that capture micro-trends within each window and is discussed in the section on our proposed Strategy I. The third type consists of variables that capture the trend of price change over the long-term history and is discussed in the section on our proposed Strategy III.
Following Ntakaris et al. (2019), we define the outcome variable based on the mid-price ratio between the average mid-price of all events in the current window, \(\sum _{j=1}^k P^{mid}_{i,j}/k\), and the last observed mid-price in its history, \(P^{mid}_{i-1,k}\). Using threshold values, we convert this ratio into a three-class categorical variable that represents three possible stock mid-price movement states: upwards, downwards, and stationary. Specifically, the outcome variable \(Y_i\) of the ith record (or window) is defined as follows:

$$ Y_i = \begin{cases} \text{upwards}, & \text{if } \dfrac{\sum _{j=1}^{k} P^{mid}_{i,j}/k}{P^{mid}_{i-1,k}} - 1 > \alpha, \\ \text{downwards}, & \text{if } \dfrac{\sum _{j=1}^{k} P^{mid}_{i,j}/k}{P^{mid}_{i-1,k}} - 1 < -\alpha, \\ \text{stationary}, & \text{otherwise,} \end{cases} \quad (1) $$
where \(\alpha\) is the parameter that determines the significance of the mid-price movement. In practice, we suggest choosing the value of \(\alpha\) using two rules: (1) the value should be large enough to be meaningful in practice, so that high-frequency trading decisions based on such \(\alpha\) values can make a profit; (2) the value cannot be too large, so that we have enough training data to model the "upwards" and "downwards" movements of stock prices.
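A minimal labelling sketch consistent with these rules, assuming a flat chronological mid-price series and hypothetical helper names:

```python
import numpy as np

def label_windows(midprice, k=5, alpha=1e-5):
    """Three-class labels from a flat mid-price series: window i is
    labelled by comparing its average mid-price with the last observed
    mid-price of window i-1 (our reading of the text, not the authors'
    code)."""
    n_win = len(midprice) // k
    win = np.asarray(midprice[: n_win * k]).reshape(n_win, k)
    avg = win.mean(axis=1)                  # average mid-price per window
    last_prev = win[:-1, -1]                # P^mid_{i-1,k}
    ratio = avg[1:] / last_prev - 1.0       # relative movement of window i
    labels = np.where(ratio > alpha, "up",
             np.where(ratio < -alpha, "down", "stationary"))
    return labels

# Toy series: one clear upward move, then a flat window.
prices = [10.0] * 5 + [10.1] * 5 + [10.1] * 5
labels = label_windows(prices)
```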
Machine learning methods require a different data format than time-series methods do. A time series is an N-dimensional vector indexed by time order. Time order is essential for time-series forecasting because it predicts future values based on previously observed values. In contrast, machine learning methods predict future responses based on input features. In machine-learning methods, the temporal information contained in the time order of observations (in time-series methods) is converted into an outcome-predictor relationship within each row of the input predictor matrix. In other words, the historical information is included in the predictor matrix at various resolutions, as stated above. Therefore, machine-learning methods do not require correlation information between consecutive observations. A strong correlation between the rows of the data matrix should be avoided to satisfy the independence requirements of most machine learning models. However, when the window size is not sufficiently large, the mid-price or the converted categorical outcome \(Y_i\) might be highly correlated with adjacent records. We propose our first two strategies to address this issue and discuss them in the Novel Strategies section.
To evaluate whether our novel strategies can improve prediction performance, for each fitted model we calculated its recall, precision, and \(F_1\) score, which are performance metrics widely used in the machine learning community. A novel strategy is considered helpful if it leads to a positive change in the performance metrics. The recall and precision metrics are defined as follows:

$$ \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \quad (2) $$
where TP is the number of true-positive predictions (e.g., correctly predicting 'upwards' as 'upwards'), FN is the number of false-negative predictions (e.g., incorrectly predicting 'upwards' as 'not upwards'), and FP is the number of false-positive predictions (e.g., incorrectly predicting 'not upwards' as 'upwards'). Both recall and precision are performance metrics commonly used in classification tasks. Recall denotes the proportion of true-positive cases that are correctly labelled as positive, while precision denotes the proportion of predicted-positive cases that are truly positive. A good classification model aims to achieve relatively high recall and precision simultaneously. When analyzing the results, we can either compare one measure while the other is held at a fixed level, or combine the two metrics into one. In this study, we used the \(F_1\) score, the harmonic mean of precision and recall, as a single measurement of the classification task:

$$ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (3) $$
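These metrics can be computed directly from the label counts; a minimal one-vs-rest sketch (our helper, not the authors' code):

```python
def one_vs_rest_f1(y_true, y_pred, positive):
    """Recall, precision and F1 for one class treated as 'positive'
    (e.g., 'up' vs 'not up' in the three-class setting)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

y_true = ["up", "up", "down", "stationary"]
y_pred = ["up", "down", "down", "stationary"]
recall, precision, f1 = one_vs_rest_f1(y_true, y_pred, positive="up")
```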
Novel strategies
In this section, we propose three novel strategies, describe the issues that they resolve, and explain the mechanisms behind them. Our objective is to preprocess high-frequency raw data into appropriate inputs for machine learning methods. All three novel strategies are independent of each other and can be applied separately or in combination. These strategies are not limited to mid-price prediction, but open avenues for high-frequency data applications in other fields.
Strategy I: recover information in data thinning
The aforementioned event-based inflow serves as a data-thinning strategy for high-resolution observations, which uses only one event (usually the last event) within each window. Using fewer events weakens the correlation between successive observations and reduces computational costs by shrinking the size of the dataset. However, each window carries much more useful information than can be captured by a single record. In particular, the records in the last window provide the most useful information for forecasting future prices. Using only one record in that window can result in a significant loss of information.
Our first strategy is to define a few new variables that recover discarded useful information within each window. Although observations within an event window can be highly correlated and carry redundant information, their trend can be helpful in predicting the movement of the next mid-price. Instead of using features built solely from the "record" events, we included new variables to extract and summarize features within each window. More specifically, we proposed an extensive collection of input features based on information that can be extracted from the events within each window, as depicted in Fig. 1. The feature set contains features such as the mean, variance, and range of mid-price observations, trade intensity, volatility, market depth, and bid-ask spread. As it summarizes the financial characteristics within each window, we call the new set of features "within-window high-frequency variables". Detailed descriptions and calculation formulas for these variables are summarized in Table 1.
This new collection of features can capture more temporal information and complement the variable set that is constructed based on the "record" observations. \(V_1\) and \(V_2\) are two types of returns, which measure the percentage changes in the best bid price and the best ask price relative to their counterparts at the previous "record" event. \(V_3\) denotes the bid-ask spread crossing return, which is an indicator of potential arbitrage profits. For example, a trader makes a profit when he buys the asset at time \(t_{i-1,1}\) at the lowest ask price and sells it at time \(t_{i-1,k}\) at the highest bid price. \(V_4,~V_5\), and \(V_6\) are the mean values of the best ask price, best bid price, and mid-price, respectively, among the five events within a window. The summed quantity quoted at the best bid and ask prices, revealing the market depth, is calculated in \(V_7\) and \(V_8\). The standard deviation of mid-price changes is also known as price volatility. In \(V_9\), we measure within-window volatility by calculating the standard deviation among all events in the two previous windows. Using events from two windows is preferable because the standard deviation computed from the events of only one window is most likely to be zero because of subtle volatility. The time length of the \((i-1)\)th window is determined by the time difference between the first and last events in that window, namely, \(t_{i-1,k}-t_{i-1,1}\). This represents the actual trading time for five events to occur prior to the given "record" event in the ith window. Therefore, its reciprocal, as computed in \(V_{10}\), measures the transaction intensity of the given window.
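A few of the within-window variables can be sketched as follows; the names \(V_4\)–\(V_{10}\) follow Table 1, but the exact formulas here are our reading of the text rather than the authors' code:

```python
import numpy as np

def within_window_features(ask, bid, t):
    """Illustrative 'within-window high-frequency variables' for one
    window of k events: mean best ask/bid/mid-price and the transaction
    intensity (reciprocal of the window's time span)."""
    ask, bid, t = map(np.asarray, (ask, bid, t))
    mid = (ask + bid) / 2.0
    return {
        "V4_mean_ask": ask.mean(),
        "V5_mean_bid": bid.mean(),
        "V6_mean_mid": mid.mean(),
        "V10_intensity": 1.0 / (t[-1] - t[0]),  # events per unit of time span
    }

# A toy window of three events spanning 2.0 seconds.
feats = within_window_features(ask=[10.2, 10.2, 10.4],
                               bid=[10.0, 10.0, 10.0],
                               t=[0.0, 0.5, 2.0])
```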
Strategy II: “sampling + ensemble” model
The characteristics of high-frequency trading data lie in their massive trading volume and the high dependence among observations. Although they provide us with high-resolution data to train our models, high-frequency data also lead to challenges in analysis. The massive amount of data is not manageable by most modern computers, and the high correlation among adjacent observations violates the independence assumptions of most machine-learning models. To address these challenges, the current standard approach is data thinning by randomly sampling event windows. However, such a sampling approach leads to further information loss and reduces the reliability of the results (i.e., they depend on the random subset selected for data analysis). To improve robustness and address information loss, we propose a second strategy that combines the sampling approach with ensemble machine learning. Specifically, we used the bagging approach (a popular ensemble machine learning method) to combine many models fitted on various random subsets of the original training data.
Our second modelling strategy, "Sampling+Ensemble", retains the benefits of the sampling approach discussed above and addresses its robustness and information loss issues. Specifically, we randomly generated 100 subsets of the training data, fitted a prediction model on each data subset, and used the average prediction of all 100 models as our final output. Each training subset uses only a portion of the original data, but the union of the 100 subsets covers the majority of the original data to avoid information loss. Integrating prediction results from models fitted on various subsets of the original data averages out the impact of subset selection, provides more robust results, and utilizes more information than a model fitted on a single subset.
Note that using 100 random subsets is our empirical choice after testing on many datasets. Using too many subsets substantially slows down the analysis while yielding little improvement in prediction performance. Using too few subsets cannot achieve the desired robustness and loses more of the information in the data. Users can adjust this setting according to their specific problems, if needed.
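The "Sampling+Ensemble" loop can be sketched generically; the base learner below is a toy nearest-centroid classifier standing in for the SVM or ENet models, and all names are our own:

```python
import numpy as np
from collections import Counter

def sampling_ensemble_predict(X, y, X_new, fit, n_subsets=100,
                              subset_size=200, seed=0):
    """'Sampling + Ensemble' sketch: fit a base learner on many random
    subsets of (X, y) and combine their predictions by majority vote.
    `fit(X_sub, y_sub)` must return a callable `predict(X_new) -> labels`."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_subsets):
        idx = rng.choice(len(y), size=min(subset_size, len(y)), replace=False)
        predict = fit(X[idx], y[idx])
        votes.append(predict(X_new))
    votes = np.asarray(votes)               # shape (n_subsets, n_new)
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])

def fit_centroid(X, y):
    """Toy base learner: predict the class with the nearest centroid."""
    classes = sorted(set(y))
    cent = {c: X[y == c].mean(axis=0) for c in classes}
    def predict(Xn):
        return np.array([min(classes,
                             key=lambda c: np.linalg.norm(x - cent[c]))
                         for x in Xn])
    return predict

# Two well-separated toy clusters labelled 'down' and 'up'.
X = np.vstack([np.zeros((20, 2)), np.ones((20, 2))])
y = np.array(["down"] * 20 + ["up"] * 20)
X_new = np.array([[0.1, 0.1], [0.9, 0.9]])
pred = sampling_ensemble_predict(X, y, X_new, fit_centroid,
                                 n_subsets=25, subset_size=10)
```

In the paper's setting, `fit` would train an SVM or ENet on each subset; the voting step is what averages out the impact of any single subset choice.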
Strategy III: combination of long-term and short-term resolution
The essential information used for predicting future mid-price movements consists of historical observations of mid-prices. In most prediction models, modelling a longer history of mid-prices requires including more historical observations as model predictors. Most machine learning models can only fit a limited number of predictors and hence cannot model how long-term history affects future mid-prices. This disadvantage is worse when analyzing high-frequency trading data because higher-frequency data lead to redundant events observed over a long period. Thus, most current prediction models for high-frequency trading data in the literature utilize only information from a short-term history. This motivated us to propose a third strategy that adds long-term price effect features to enhance the information capacity of the current feature set.
Strategy III uses functional principal component analysis (FPCA) to reduce the dimension of the long-term history data before including them in the prediction model. FPCA is a dimension-reduction method similar to principal component analysis (PCA). PCA considers observations as vectors whose order is interchangeable, whereas FPCA handles observations as functions for which the time order is essential. In other words, FPCA can utilize the temporal information in the mid-price sequence. We chose FPCA instead of PCA because the temporal information in the mid-price history plays a critical role in its prediction. In the prediction model, we represented the trends in the long-term history using a few FPCA scores instead of a long list of predictors (raw observed mid-prices).
In this study, we consider the long-term price effect of a one-day history based on our empirical results and calculate the top functional principal component (FPC) scores that account for \(99.9\%\) of the information in these historical trends. Users can use historical data of customized durations (e.g., three days or one week) according to their research objectives. Note that we expect the long-term impact variables to uplift the prediction performance if the trajectory of the mid-price movement has low volatility. By contrast, if the mid-price movement trajectory is unstable and has rapid reversals or momentum, incorporating long-term impact variables in the prediction model will backfire. We could include both long- and short-term variables in the preliminary model and use machine learning methods to decide whether to retain the long-term variables in the final model. For example, the elastic net model has feature selection functionality and is suitable for this type of task. Users can also decide whether to include long-term variables manually, according to the stocks' recent qualitative characteristics.
A detailed description of FPCA can be found in Ramsay (2004), Ramsay and Silverman (2007), and Kokoszka and Reimherr (2017). Here, we briefly introduce the key concept of FPCA. FPCA projects the input trajectories of mid-price history onto the functional space spanned by orthogonal FPCs, and the functional scores are the corresponding coordinates in the transformed functional space. Each component of the functional score vector is related exclusively to one FPC. The first FPC accounts for the largest proportion of variance in the data. By analogy, each subsequent FPC explains the largest proportion of the remaining variance after excluding the previously generated FPCs. Based on our empirical findings, the first few FPCs account for most of the variance in the one-day historical data. Therefore, we reduce the dimensionality of the data by choosing the few top FPCs that explain the majority of the variance and use the corresponding FPC scores to replace the raw data of a long mid-price history. We denote by \(s_{ij}~ (i=1,\dots , N; j=1,\dots ,K)\) the \(j\)th FPC score of the \(i\)th trajectory in the data, which is defined by

$$ s_{ij} = \int \delta _j(t) \left( X_i(t) - {\bar{X}}(t) \right) dt, \quad (4) $$
subject to the constraints

$$ \int \delta _j(t)^2 \, dt = 1 \quad \text{and} \quad \int \delta _j(t)\, \delta _l(t)\, dt = 0 \ \ \text{for}\ l < j, \quad (5) $$
where t is the continuous timestamp, \(X_i(t)\) is the mid-price of the \(i\)th trajectory at time t, and \({\bar{X}}(t)=\sum _{i=1}^N X_i(t)/N\) is the pointwise mean trajectory of all samples in the data. The first functional principal component, used as the weight function, is specified by \(\delta _1 (t)\), which maximizes the variance of the functional scores \(s_{i1}\) subject to Eq (5). The second, third, and higher-order principal components \(\delta _j (t)\) are defined in the same way, but each of them explains variance in the data beyond that of the previously established ones, and they must satisfy the same constraint requiring all functional principal components to be orthogonal.
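For densely observed trajectories on a common time grid, the FPC scores and weight functions can be approximated by an eigendecomposition of the sample covariance with simple quadrature weights. This is our numerical sketch, not the implementation of Ramsay (2004):

```python
import numpy as np

def fpc_scores(X, dt=1.0, var_explained=0.999):
    """FPC scores from trajectories X of shape (N, T) observed on a common
    grid with spacing dt. Keeps the fewest components whose cumulative
    variance reaches `var_explained` (e.g., the paper's 99.9%)."""
    Xc = X - X.mean(axis=0)                       # centre: X_i(t) - X̄(t)
    cov = Xc.T @ Xc / X.shape[0]                  # pointwise sample covariance
    vals, vecs = np.linalg.eigh(cov)
    vals, vecs = vals[::-1], vecs[:, ::-1]        # descending eigenvalues
    k = np.searchsorted(np.cumsum(vals) / vals.sum(), var_explained) + 1
    delta = vecs[:, :k] / np.sqrt(dt)             # so that ∫ δ_j(t)^2 dt = 1
    scores = Xc @ delta * dt                      # s_ij = ∫ δ_j (X_i - X̄) dt
    return scores, delta

# Toy curves that vary along a single shape: one FPC should suffice.
t = np.linspace(0.0, 1.0, 20)
amp = np.random.default_rng(0).normal(size=50)
X = np.outer(amp, np.sin(2 * np.pi * t))
scores, delta = fpc_scores(X, dt=t[1] - t[0])
```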
Using proposed strategies with machine learning methods
Our proposed novel strategies focus on preprocessing raw HFT data into input data for machine-learning methods. In this study, we illustrate the application of our strategies using two of the most popular machine learning methods: the support vector machine (SVM) (Tay and Cao 2001; Huang et al. 2005; Chalup and Mitschele 2008) and the elastic net (ENet) (Zou and Hastie 2005).
The SVM model categorizes the response variable into two classes according to the input features. To achieve this goal, the SVM maps the training samples into a feature space and constructs a hyperplane along with two support vectors based on the training data. The SVM then separates the samples of the two classes using the hyperplane, maximizing the margin between the two support vectors. New data points are mapped into the same space and classified according to their position relative to the hyperplane. The ENet is a regularized linear regression model. It places both a LASSO penalty and a ridge penalty on the regression coefficients. The LASSO penalty can force the coefficients of irrelevant predictors to zero, thereby achieving automated feature selection. The ridge penalty shrinks all predictor coefficients towards zero, which helps address collinearity and overfitting problems. The ENet model has two parameters that control the strength of the two penalties: \(\lambda\) controls the overall strength, and \(\alpha ^*\) controls the weighting between the two penalties. In the Empirical Application section, we provide details on how to select the values of these two parameters. In Section 1 of the supplementary document, we provide a more detailed description of the two methods.
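For concreteness, the ENet estimate described above can be written in the common glmnet-style parameterization (our notation; the authors' exact scaling may differ):

$$ \hat{\beta } = \arg \min _{\beta } \left\{ \frac{1}{2N} \sum _{i=1}^{N} \left( y_i - x_i^{\top }\beta \right) ^2 + \lambda \left[ \alpha ^* \Vert \beta \Vert _1 + \frac{1-\alpha ^*}{2} \Vert \beta \Vert _2^2 \right] \right\} , $$

so that \(\alpha ^*=1\) recovers the LASSO and \(\alpha ^*=0\) recovers ridge regression.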
Next, we use real data to show how our novel strategies improve the prediction performance of these two machine learning models.
Empirical application
Data
To illustrate the prediction performance improvement contributed by each of our novel strategies, we acquired data from the New York Stock Exchange (NYSE) Daily Trade and Quote (TAQ) database, which consists of high-frequency intraday quote and trade data for NYSE-traded securities across all public exchanges nationwide. The intraday order-level data cover the continuous trading time between 9:30 am and 4:00 pm on every trading day from June to August 2017 (64 trading days), with nanosecond (one billionth of a second) timestamps (e.g., HHMMSSxxxxxxxxx). We focused on the component stocks of the Dow Jones 30 (Footnote 1). The Dow Jones 30 includes the most prominent publicly traded companies in the U.S., providing a strong assessment of the market’s overall health and tendencies. The details of these 30 stocks, which span industry sectors such as conglomerates, financial services, and information technology, are listed in Appendix Table 4.
We plot the daily adjusted closing price of the Dow Jones 30 index during our sample period in Fig. 2. The index increased by 3.42% during the three-month study period, with no extreme price movements. In Table 2, we present summary statistics of market capitalization, trading volume, bid-ask spread, mid-price, and market depth. The average market capitalization of the Dow Jones 30 stocks at the beginning of our sample period was USD 368 billion. The Dow Jones 30 stocks are highly liquid, with an average bid-ask spread of 1.637 basis points and an average market depth of 2950 shares.
We turn the response variable (i.e., the stock mid-price) into a three-class categorical variable for prediction. We used a small value \(\alpha =10^{-5}\) in Eq. (1) to ensure that the stationary state has a sample size similar to the other two states and to make the upward and downward movements correspond to noticeable changes in a stock’s mid-price. The \(\alpha\) value depends on stock volatility. We also experimented with two other values, \(10^{-6}\) and \(10^{-4}\), and found that a threshold around \(10^{-5}\) yields a reasonable balance of the categorical responses for most stocks. The value \(\alpha =10^{-4}\) leads to extreme imbalance for most stocks, whereas \(\alpha = 10^{-6}\) produces a balance similar to our choice of \(\alpha =10^{-5}\) but is less financially meaningful. For details on the proportions of the response Y under different \(\alpha\) values, please refer to Additional file 1. Moreover, we used a stratified sampling approach to construct our training datasets, keeping the ratio of mid-price labels (i.e., upward, downward, and stationary) at 1:1:1 in each training subset. This improves the data balance and makes the prediction performance easier to compare.
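The thresholding step can be sketched as follows. Since Eq. (1) is not reproduced in this section, the relative-change formula below is our simplified reading of it, and the label names are illustrative.

```python
def label_movement(mid_now, mid_future, alpha=1e-5):
    """Map a mid-price change to a three-class label.

    The relative change is compared against the small threshold alpha
    (1e-5 in the paper). This is a simplified stand-in for the paper's
    Eq. (1), which may, e.g., average the future mid-price over a window."""
    rel = (mid_future - mid_now) / mid_now
    if rel > alpha:
        return "Upwards"
    if rel < -alpha:
        return "Downwards"
    return "Stationary"
```

A mid-price move from 100.00 to 100.01 (relative change 1e-4) exceeds the 1e-5 threshold and is labeled upward, while sub-threshold fluctuations fall into the stationary class.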
Data cleaning and multi-resolution feature construction
Following Hendershott and Moulton (2011), we cleaned our data in four steps to ensure legitimacy and consistency: (i) eliminate records outside the exchange opening hours of 9:30 am to 4:00 pm; (ii) eliminate quotes with a negative price or size, or with a bid price greater than the ask price; (iii) eliminate trades with zero quantity; and (iv) eliminate trades with prices more than 150\(\%\) (less than 50\(\%\)) of the previous trade price, and exclude quotes whose quoted spread exceeds 25\(\%\) of the quote midpoint or whose ask price exceeds 150\(\%\) of the bid price.
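A pandas sketch of these four steps, assuming timestamp-indexed quote and trade tables; the column names (`bid`, `ask`, `bid_size`, `ask_size`, `price`, `qty`) are illustrative and not taken from the paper.

```python
import pandas as pd

def clean_quotes_trades(quotes: pd.DataFrame, trades: pd.DataFrame):
    """Apply the four cleaning steps to DatetimeIndex-ed quote/trade frames."""
    # (i) keep records within regular trading hours
    q = quotes.between_time("09:30", "16:00")
    t = trades.between_time("09:30", "16:00")
    # (ii) drop quotes with negative price/size or a crossed market
    q = q[(q.bid > 0) & (q.ask > 0) & (q.bid_size > 0) & (q.ask_size > 0)
          & (q.bid <= q.ask)]
    # (iii) drop trades with zero quantity
    t = t[t.qty > 0]
    # (iv) drop trades outside 50%-150% of the previous trade price...
    prev = t.price.shift()
    t = t[prev.isna() | ((t.price <= 1.5 * prev) & (t.price >= 0.5 * prev))]
    # ...and quotes with spread > 25% of the midpoint or ask > 150% of bid
    mid = (q.bid + q.ask) / 2
    q = q[((q.ask - q.bid) <= 0.25 * mid) & (q.ask <= 1.5 * q.bid)]
    return q, t
```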
Next, we standardized the variables using winsorization and normalization, whose main purpose is to remove extreme values and alleviate the impact of the predictors’ different scales and units. In the winsorization step, we removed extreme values detected by the same approach used in the box-plot method. We first computed the first and third quartiles (Q1 and Q3) of our training sample and calculated the interquartile range IQR = Q3 − Q1. Next, we replaced observations falling outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] with the lower bound Q1 − 1.5·IQR or the upper bound Q3 + 1.5·IQR, respectively. The normalization step standardizes each variable using its mean and standard deviation computed from the training samples.
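These two steps can be sketched column-wise with numpy; note that the fences and the mean/standard deviation are computed on the training sample only and then reused on the test sample, as described above.

```python
import numpy as np

def winsorize_normalize(train, test):
    """Clip both samples to the training box-plot fences, then z-score
    both with the training mean and standard deviation (column-wise)."""
    q1, q3 = np.percentile(train, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    train_w = np.clip(train, lo, hi)
    test_w = np.clip(test, lo, hi)          # reuse the training fences
    mu, sd = train_w.mean(axis=0), train_w.std(axis=0)
    return (train_w - mu) / sd, (test_w - mu) / sd
```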
From the cleaned HFT data, we constructed features at three resolutions for the machine-learning prediction models: (1) window-level features used by standard methods in the literature, (2) within-window features proposed in Strategy I and listed in Table 1, and (3) long-term history represented by the FPCA scores proposed in Strategy III.
The window-level variables are presented in Table 3. Variables \(V_{11}\) to \(V_{15}\) are the best bid price/volume, best ask price/volume, and mid-price, respectively, fetched directly from the LOB data. These classic economic variables measure changes in commonly used financial indicators before the “record” event. \(V_{16}\) is an indicator of the bid-ask spread return. The bid-ask spread is the difference between the best ask price and the best bid price at the same timestamp. Typically, a narrow bid-ask spread indicates high demand, whereas a wide bid-ask spread may imply low demand and therefore affects the discrepancy in the asset price. Moreover, features \(V_{17}\) to \(V_{21}\) measure stock spikes through the average time derivatives of price and volume computed over the most recent second (Kercheval and Zhang 2015). These help us track relatively large upward or downward changes in trading prices and volumes within a very short period of time. Similarly, feature \(V_{22}\) measures the short-term average arrival rate by counting the number of quotes from both sides during the most recent second.
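For illustration, the spike features \(V_{17}\)–\(V_{21}\) and the arrival rate \(V_{22}\) can be sketched as below, approximating each average time derivative by the mean first difference over the most recent one-second window; the function name and column names are ours, not the paper's.

```python
import pandas as pd

def spike_and_arrival_features(quotes: pd.DataFrame, now: pd.Timestamp):
    """Features over the most recent second before `now`.

    quotes: DatetimeIndex-ed frame with illustrative columns
    bid / ask / bid_size / ask_size."""
    recent = quotes[(quotes.index > now - pd.Timedelta(seconds=1))
                    & (quotes.index <= now)]
    feats = {"arrival_rate": len(recent)}   # V22: quote count in the last second
    for col in ["bid", "ask", "bid_size", "ask_size"]:
        # V17-V21 analogue: mean first difference approximates d(col)/dt
        feats[f"d_{col}"] = recent[col].diff().mean() if len(recent) > 1 else 0.0
    return feats
```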
Note that, unlike other popular methods such as Kercheval and Zhang (2015), we did not include window-level variables that require depth levels greater than 1: our LOB records from the NYSE dataset provide information only on the best bid and ask (i.e., depth = 1), so such variables cannot be derived from our data.
Design of benchmark study using real data
We conducted a benchmark study to evaluate the prediction performance of each proposed strategy, using all component stocks of the Dow Jones 30 from our NYSE data. From each stock, we randomly sampled 8000 records as the training set to train the prediction model and 2000 records as the testing set to evaluate prediction performance. Because the records are randomly selected, the evaluation results of a single experiment can be severely affected by sampling bias. To remove this unwanted selection bias, we repeated the experiment 100 times with 100 different random training and testing sets, drew conclusions from all 100 experiments, and used their variation to gauge the uncertainty of our evaluation.
On the training sets, we fitted four types of models to investigate the prediction performance improvement contributed by each proposed strategy. First, we fit an SVM model using all predictor variables at the three resolutions: the standard window-level feature set (Table 3), the “within-window” feature set (Table 1), and the FPCA scores discussed above. This model utilizes Strategies I and III; we treat it as the baseline and compare it with the other three models. Next, we fit two reduced SVM models by removing, respectively, the “within-window” features (Strategy I) and the FPCA scores (Strategy III) from the baseline model. Comparing these two reduced models with the baseline allows us to evaluate the change in prediction performance attributable to Strategies I and III. Finally, we trained 100 baseline models on different random subsets of the data to construct an ensemble model and compared it with the baseline model to evaluate the usefulness of Strategy II. In summary, our experiment consists of four types of models: the baseline model (Strategies I, III), the ensemble model (Strategies I, II, III), the “within-window” model (Strategy I), and the FPCA model (Strategy III).
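The ensemble construction of Strategy II can be sketched as follows, assuming integer class labels and a simple majority vote over base learners fitted on random subsets; the subset fraction and voting scheme here are our illustrative choices, not necessarily the paper's.

```python
import numpy as np
from sklearn.svm import SVC

def ensemble_predict(X_train, y_train, X_test, n_models=100,
                     subset_frac=0.5, seed=0):
    """Fit SVM base learners on random training subsets and combine their
    test predictions by majority vote (Strategy II sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        model = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=0.25)
        preds.append(model.fit(X_train[idx], y_train[idx]).predict(X_test))
    preds = np.vstack(preds)                       # (n_models, n_test)
    # majority vote per test record, assuming non-negative integer labels
    return np.array([np.bincount(col).argmax() for col in preds.T])
```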
On the testing sets, we applied the trained SVM models to predict the mid-price movement of each record using historical trade data, then compared the predicted movements with the observed movements to calculate the prediction performance criteria: recall, precision, and F1 score, with the F1 score as the major criterion. To evaluate the performance improvement of a proposed strategy on each testing dataset, we calculated the F1 score difference between the two corresponding models (with and without that strategy). For example, the performance of Strategy I is evaluated by the F1 score of the baseline model (Strategies I and III) minus that of the FPCA model (Strategy III). In total, this yields 9000 F1 score differences from the combinations of 3 strategies, 30 stocks, and 100 experiments. Furthermore, to assess the performance of our strategies with other machine-learning methods, we repeated these experiments with a different learner, replacing all SVM models with ENet models.
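The per-dataset evaluation can be sketched with scikit-learn's metrics; macro averaging over the three movement classes is our assumption, since the paper's exact averaging scheme is not stated in this section.

```python
from sklearn.metrics import f1_score

def f1_improvement(y_true, pred_with, pred_without):
    """F1-score difference attributable to a strategy: the model using the
    strategy minus the reduced model without it, macro-averaged over the
    three movement classes (averaging scheme assumed)."""
    return (f1_score(y_true, pred_with, average="macro")
            - f1_score(y_true, pred_without, average="macro"))
```

A positive value indicates the strategy improved the prediction on that testing set.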
When training the SVM models, we used the polynomial kernel \(\kappa (x_i,x_j)=(x_i \cdot x_j +1)^d\) with d = 2 and the constraint parameter \(C = 0.25\), as suggested in Kercheval and Zhang (2015). When training the ENet models, we chose the values of the parameters \(\lambda\) and \(\alpha ^*\) in Eq. (8) of the supplementary document using a two-layer cross-validation (CV) approach. We applied a 5-fold CV grid search to each training sample: the regularization parameter \(\lambda\) takes 100 values evenly spaced on the log scale between \(10^{-8}\) and 5, and, for each fixed \(\lambda\), we searched \(\alpha ^*\) over a sequence of 4 values from 0.2 to 0.8 with a stride of 0.2. We evaluated every combination of the two parameters and selected the \(\lambda\) and \(\alpha ^*\) that yielded the best model performance (Friedman et al. 2010).
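The grid described above maps directly onto a scikit-learn grid search, where sklearn's `alpha` plays the role of \(\lambda\) and `l1_ratio` the role of \(\alpha ^*\); this is a sketch of the search setup, not the paper's exact two-layer implementation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# 100 log-spaced lambda values between 1e-8 and 5, and alpha* in
# {0.2, 0.4, 0.6, 0.8}, matching the grid described in the text
param_grid = {
    "alpha": np.logspace(-8, np.log10(5), 100),
    "l1_ratio": np.arange(0.2, 0.81, 0.2),
}

# 5-fold CV over every (lambda, alpha*) combination on a training sample
search = GridSearchCV(ElasticNet(max_iter=5000), param_grid, cv=5)
```

Calling `search.fit(X_train, y_train)` then exposes the selected pair via `search.best_params_`.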
The evaluation results are presented in the following subsections and the Appendix. In addition to the prediction performance, we evaluated the importance of each hand-crafted feature for the mid-price movement prediction task according to how frequently it was selected by the ENet models; the details are provided in Fig. 4.
Performance evaluation of proposed strategies
We conducted experiments using the component stocks of the Dow Jones 30 from the NYSE data. For each prediction performance criterion (precision, recall, and F1 score), we obtained 24,000 scores from the combinations of 2 methods (SVM and ENet), 4 models, 30 stocks, and 100 random repeats. For each setting, the median performance scores over the 100 random repeats are provided in Appendix Tables 6 (SVM models) and 7 (ENet models). In the remainder of the discussion, we focus on the F1 score, as it is the most popular classification performance criterion in the machine learning community. In each setting, we take the difference in F1 scores between the baseline model and each of the remaining three models to evaluate the performance improvement contributed by the corresponding strategies, which leads to 18,000 F1 score differences. We visualize these F1 score differences in Fig. 3, which comprises six panels: the top three show the results of the SVM models, and the bottom three show the results of the ENet models. From left to right, the panels show the F1 score improvement from each of the three proposed strategies. The results of the 30 stocks are represented by boxes from top to bottom of each panel, and each box represents the 100 F1 score differences obtained from the repeated experiments. Positive F1 score differences indicate that the corresponding strategy improves the prediction performance; hence, we label them F1 improvements in the panel titles. The dashed vertical line positioned at zero serves as a boundary identifying stocks whose mid-price prediction is improved by the proposed strategy, i.e., boxes on the right-hand side of the boundary. To assess the significance of the improvement represented by each box, we calculated the raw p-value of the Wilcoxon signed-rank test and applied the false discovery rate adjustment (Benjamini and Hochberg 1995) to avoid the Type-I error inflation caused by multiple testing.
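The testing procedure for one box can be sketched with scipy and a hand-rolled Benjamini-Hochberg step-up adjustment; the helper names are ours.

```python
import numpy as np
from scipy.stats import wilcoxon

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up FDR adjustment of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value down
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

def box_pvalue(f1_diffs):
    """One box in Fig. 3 collects 100 F1 differences for a stock/strategy;
    a one-sample Wilcoxon signed-rank test checks for a median shift from 0."""
    return wilcoxon(f1_diffs).pvalue
```

The raw p-values of all boxes are then passed through `bh_adjust` before comparison with the 0.05 cutoff.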
Appendix Table 8 presents the adjusted p-values corresponding to each box in Fig. 3. Boxes with small adjusted p-values (less than 0.05) are colored dark gray, indicating that a strategy significantly improves the prediction for that stock, whereas light gray boxes represent no significant improvement.
With SVM, the average improvements in the F1 score over the 30 stocks brought by Strategies I, II, and III are 0.02, 0.018, and 0.00036, respectively. The highest improvements of the three strategies are 0.056 (Strategy I on stock GS), 0.087 (Strategy II on stock DOW), and 0.016 (Strategy III on stock PFE). Likewise, for the ENet models, the average improvements in the F1 score from Strategies I, II, and III are 0.016, 0.019, and 0.00046, respectively, and the highest improvements are 0.058 (Strategy I on stock GS), 0.2 (Strategy II on stock PG), and 0.026 (Strategy III on stock PG).
We summarize the performance of our proposed strategies based on the dark gray boxes in Fig. 3. Strategy I (variables of “within-window” trends) significantly improved the prediction performance for 27 of the 30 stocks under both the SVM and ENet models. Ensemble learning based on models fitted to many random subsets (Strategy II) significantly improved prediction performance for all stocks, except that one stock under the ENet model shows a positive but non-significant trend. Note that ENet models are not guaranteed to converge: a substantial portion of the ENet models failed to converge for stocks PG, DOW, and AAPL, which may explain the light gray boxes for Strategies I and II in the ENet panels of Fig. 3. We therefore conclude that the first two strategies are useful for most applications. In contrast, the FPCA of the one-day historical trading record (Strategy III) helps the SVM models for only three stocks and shows no benefit for the remaining predictions. We find that the FPCA features are helpful only when the daily historical mid-prices are relatively stable. We suggest using Strategy III with caution because it works only in specific situations: users should test it on their data with various history lengths (e.g., one week, one day) and adopt it only if the FPCA of a certain history length appears helpful for prediction.
Appendix Table 5 shows the median computing times of the ENet and SVM models and their corresponding ensemble versions. The ENet models required much less computing time than the SVM models, especially under the ensemble strategy. We therefore recommend the ENet model, given that it requires fewer computing resources without sacrificing much prediction performance. In real-life applications such as HFT, where decision time is critical, this makes the ENet models even more favorable.
Importance of the predictors
The ENet model automatically selects important predictors by assigning zero coefficients to unimportant ones, so we can summarize predictor importance from the above experiments as a by-product. For each stock, we fit numerous ENet models and consider a predictor to have a high impact if it was selected (i.e., received a nonzero coefficient) by at least \(80\%\) of the fitted elastic net models. We regard the most useful predictors as those that consistently have a high impact across many stocks. Fig. 4 summarizes, for each variable, the number of the 30 Dow Jones component stocks on which it has a high impact; since there are 30 stocks, this frequency lies in the range [0, 30].
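The selection-frequency rule can be sketched as follows, given the coefficient vectors of all fitted ENet models for one stock stacked into a matrix; the function and variable names are illustrative.

```python
import numpy as np

def high_impact_predictors(coef_matrix, names, threshold=0.8):
    """Flag predictors selected (nonzero coefficient) in at least
    `threshold` of the fitted ENet models.

    coef_matrix: array of shape (n_models, n_predictors) holding the
    fitted coefficients of all ENet models for one stock."""
    freq = np.mean(coef_matrix != 0, axis=0)   # selection frequency per predictor
    return [name for name, f in zip(names, freq) if f >= threshold]
```

Counting, for each predictor, the number of stocks on which it is flagged reproduces the [0, 30] frequencies summarized in Fig. 4.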
From the observed frequencies, we find that for most component stocks, the best bid volume has a high impact in predicting the downward mid-price movement state, the within-window standard deviation has a high impact in predicting the stationary state, and the best ask volume has a high impact in predicting the upward state. Many factors, especially from the set of “within-window” high-frequency variables, are widely selected to help predict the stationary state, whereas the upward and downward states relate more directly to price differences or to quote volumes on the ask/bid sides. Furthermore, the FPCA score variables are popular in predicting the “stationary” direction, confirming that long-term mid-price movement trajectories are useful for predicting stable stock mid-price movements.
Conclusion
This study proposes three novel strategies to address common issues in predicting high-frequency stock prices with machine learning methods. Our data preprocessing strategies extract more information from the raw data and feed machine learning algorithms with high-quality input, which is of interest to high-frequency investors. The first strategy summarizes and introduces the “within-window” variables into the model, recovering information discarded by the event-based inflow protocol during the data-thinning process. The second strategy combines a random sampling approach with ensemble machine learning: the sampling alleviates the correlation between consecutive observations, while the ensemble mitigates the potential selection bias introduced by random sampling and thereby improves the robustness of the prediction results. The third strategy sheds light on the effect of long-term trading history on the model: the FPCA dimension reduction allows us to model longer-term price curves with a few FPCA scores, avoiding long vector representations of the sequence data.
We evaluated the three proposed strategies using intraday high-frequency trade and quote data from the NYSE and found that Strategies I and II significantly improve prediction performance in most applications, whereas Strategy III helps only in certain situations. The three strategies are independent and can be used separately or in combination, depending on users’ needs. We recommend using Strategies I and II in all applications with high-frequency data that require data thinning, but employing Strategy III only after testing its performance and carefully exploring the length of history to be used in the FPCA. Additionally, our strategies are add-ons for use in conjunction with machine-learning models. We illustrate them with SVM and ENet models; the ENet models are preferable because they are computationally faster without sacrificing much prediction performance.
The proposed method has three limitations, which we discuss here together with potential solutions. First, Strategy II can be time-consuming if extensive ensemble learning is involved, which is problematic in some real-life settings. When the complexity of a method is not linear in the sample size, we may borrow the concept of federated learning (Li et al. 2020; Kairouz et al. 2021), in which the model divides the data into many smaller samples, learns from them, and integrates the information by updating its parameters. Second, we used FPCA at the hourly resolution to illustrate our strategy, but this might not be the best resolution for reflecting a stock’s long-term history; we suggest that users explore different resolutions (such as daily or by the minute) and select the best one before applying the strategy to a new stock. Third, for illustration purposes we set the model parameters for all stocks using the same rule, so the performance achieved for an individual stock might not be optimal. In practice, we recommend that users fine-tune all relevant model parameters, including those in our three strategies, for each particular stock; readers can customize any detail in these strategies, such as the choice of machine-learning base learner, the number of trained models in the ensemble, and the voting scheme in ensemble learning, to obtain the best model performance for each stock.
Availability of data and materials
All the processing codes are available upon request. The data that support the findings of this study are available from the NYSE and TAQ datasets, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. However, the sample data are available from the authors upon reasonable request.
Notes
WBA replaced GE on June 28, 2018; DOW replaced DWDP on March 27, 2019.
Abbreviations
HFT: High-frequency trading
LOB: Limit order book
NYSE: New York Stock Exchange
TAQ: Trade and Quote
CV: Cross-validation
PCA: Principal component analysis
FPCA: Functional principal component analysis
FPC: Functional principal component
OLS: Ordinary least squares
IQR: Interquartile range
ENet: Elastic net
SVM: Support vector machine
R: Recall
P: Precision
F1: F1 score
MMM: 3M
AXP: American Express
AAPL: Apple
BA: Boeing
CAT: Caterpillar
CVX: Chevron
CSCO: Cisco
KO: Coca-Cola
DIS: Disney
DOW: Dow Chemical
XOM: Exxon Mobil
GS: Goldman Sachs
HD: Home Depot
IBM: IBM
INTC: Intel
JNJ: Johnson & Johnson
JPM: JPMorgan Chase
MCD: McDonald’s
MRK: Merck
MSFT: Microsoft
NKE: Nike
PFE: Pfizer
PG: Procter & Gamble
TRV: Travelers Companies Inc.
UTX: United Technologies
UNH: UnitedHealth
VZ: Verizon
V: Visa
WMT: Walmart
WBA: Walgreens
References
Arévalo A, Niño J, Hernández G, Sandoval J (2016) Highfrequency trading strategy based on deep neural networks. In: international conference on intelligent computing pp 424–436. Springer
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol) 57(1):289–300
Campbell JY, Lo AW, MacKinlay AC (2012) The econometrics of financial markets. Princeton University Press, Princeton, New Jersey
Campbell JY, Grossman SJ, Wang J (1992) Trading volume and serial correlation in stock returns. NBER working papers 4193, National Bureau of Economic Research, Inc. https://ideas.repec.org/p/nbr/nberwo/4193.html
Chalup SK, Mitschele A (2008) Kernel methods in finance. In: handbook on information technology in finance pp 655–687. Springer, Germany
Chen AS, Leung MT, Daouk H (2003) Application of neural networks to an emerging financial market: forecasting and trading the Taiwan stock index. Comput Oper Res 30(6):901–923
Dixon M (2016) High frequency market making with machine learning
Fletcher T, ShaweTaylor J (2013) Multiple kernel learning with fisher kernels for high frequency currency prediction. Comput Econ 42(2):217–240
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Hendershott T, Moulton PC (2011) Automation, speed, and stock market quality: the NYSE’s hybrid. J Financ Mark 14(4):568–604
Huang Y, Kou G, Peng Y (2017) Nonlinear manifold learning for early warnings in financial markets. Eur J Oper Res 258(2):692–702
Huang W, Nakamori Y, Wang SY (2005) Forecasting stock market movement direction with support vector machine. Comput Operat Res 32(10):2513–2522
Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Bhagoji AN, Bonawitz K, Charles Z, Cormode G, Cummings R et al (2021) Advances and open problems in federated learning. Found Trends® Machine Learn 14(1–2):1–210
Kercheval AN, Zhang Y (2015) Modelling highfrequency limit order book dynamics with support vector machines. Quant Finance 15(8):1315–1329
Kokoszka P, Reimherr M (2017) Introduction to functional data analysis. CRC Press, Boca Raton
Kong A, Zhu H (2018) Predicting trend of high frequency csi 300 index using adaptive input selection and machine learning techniques. J Syst Sci Inform 6(2):120–133
Li T, Sahu AK, Talwalkar A, Smith V (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag 37(3):50–60. https://doi.org/10.1109/MSP.2020.2975749
Li T, Kou G, Peng Y, Philip SY (2021) An integrated cluster detection, optimization, and interpretation approach for financial data. IEEE transactions on cybernetics
Menkveld AJ (2013) High frequency trading and the new market makers. J Finan Markets 16(4):712–740
Nousi P, Tsantekidis A, Passalis N, Ntakaris A, Kanniainen J, Tefas A, Gabbouj M, Iosifidis A (2019) Machine learning for forecasting midprice movements using limit order book data. IEEE Access 7:64722–64736
Ntakaris A, Kanniainen J, Gabbouj M, Iosifidis A (2020) Midprice prediction based on machine learning methods with technical and quantitative indicators. PLoS ONE 15(6):e0234107
Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A (2018) Benchmark dataset for midprice forecasting of limit order book data with machine learning methods. J Forecast 37(8):852–866
Ntakaris A, Mirone G, Kanniainen J, Gabbouj M, Iosifidis A (2019) Feature engineering for midprice prediction with deep learning. IEEE Access 7:82390–82412
Parlour CA, Seppi DJ (2008) Limit order markets: A survey. Handbook of financial intermediation and banking 5:63–95
Qian XY, Gao S (2017) Financial series prediction: Comparison between precision of time series models and machine learning methods. arXiv preprint arXiv:1706.00948, 1–9
Qiao Q, Beling PA (2016) Decision analytics and machine learning in economic and financial systems. Springer, USA
Ramsay JO (2004) Functional data analysis. Encyclopedia Stat Sci 4:554
Ramsay JO, Silverman BW (2007) Applied functional data analysis: methods and case studies. Springer, Germany
Securities and Exchange Commission (2010) Concept release on equity market structure. Release No. 34-61358; File No. S7-02-10
Tay FE, Cao L (2001) Application of support vector machines in financial time series forecasting. omega 29(4):309–317
Wen F, Xu L, Ouyang G, Kou G (2019) Retail investor attention and stock price crash risk: evidence from china. Int Rev Financ Anal 65:101376
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Royal Stat Soc Ser B 67(2):301–320
Acknowledgements
The authors acknowledge that this research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca).
Funding
Canada Research Chair (950231363, XZ), Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants (RGPIN20203530, LX), and the Social Sciences and Humanities Research Council of Canada (SSHRC) Insight Development Grants (430201800557, KX).
Author information
Authors and Affiliations
Contributions
XZ supervised this project. XZ and LX contributed to the conceptualization and design of the study. YH developed computer programs and conducted the experiments. KX provided data and supported modelling and the interpretation of the results' financial meaning. XZ and YH contributed to the manuscript preparation, and all authors contributed to the revision and approved the final draft.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Table S1. Proportion of the outcome variable Y (Upwards/Stationary/Downwards) according to different α values over the three-month trading period. Table S2. Proportion of the outcome variable Y (Upwards/Stationary/Downwards) according to different α values over June 2017. Table S3. Proportion of the outcome variable Y (Upwards/Stationary/Downwards) according to different α values over July 2017. Table S4. Proportion of the outcome variable Y (Upwards/Stationary/Downwards) according to different α values over August 2017.
Appendix
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, X., Huang, Y., Xu, K. et al. Novel modelling strategies for high-frequency stock trading data. Financ Innov 9, 39 (2023). https://doi.org/10.1186/s40854-022-00431-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40854-022-00431-9