Novel modelling strategies for high-frequency stock trading data

Full electronic automation in stock exchanges has recently become popular, generating high-frequency intraday data and motivating the development of near real-time price forecasting methods. Machine learning algorithms are widely applied to mid-price stock prediction. The way raw data are processed into inputs for prediction models (e.g., data thinning and feature engineering) can substantially affect the performance of the prediction methods. However, researchers rarely discuss this topic. This motivated us to propose three novel modelling strategies for processing raw data. We illustrate how our novel modelling strategies improve forecasting performance by analyzing high-frequency data of the Dow Jones 30 component stocks. In these experiments, our strategies often lead to statistically significant improvements in prediction. The three strategies improve the F1 scores of the SVM models by 0.056, 0.087, and 0.016, respectively.

Supplementary Information: The online version contains supplementary material available at 10.1186/s40854-022-00431-9.


Support Vector Machines
Support Vector Machines (SVMs) have gained wide popularity in the area of financial data prediction (e.g., [1], [2], and [3]). In this section, we give a brief introduction to the SVM model. For further details regarding model design and kernel selection, see [4] and [5].
We denote the $i$-th input observation as $(\mathbf{x}_i, y_i)$, $y_i \in \{+1, -1\}$, where $y_i$ is the binary response and $\mathbf{x}_i$ is the feature vector of that observation. To construct the optimal hyperplane regardless of the linear separability of the given data set, one first needs to transform the $p$-dimensional input into a $P$-dimensional space using a projection $\varphi : \mathbb{R}^p \rightarrow \mathbb{R}^P$. Our decision rules are as follows:
$$\mathbf{w}^{T}\varphi(\mathbf{x}_i) + b \ge +1 \ \text{ if } y_i = +1, \qquad \mathbf{w}^{T}\varphi(\mathbf{x}_i) + b \le -1 \ \text{ if } y_i = -1,$$
where $\mathbf{w}$ is the normal vector of the hyperplane and $b \in \mathbb{R}$ is an unknown intercept parameter. The margin between the closest data points on the positive and negative sides is $2/\|\mathbf{w}\|$, so we can maximize the margin by minimizing $\frac{1}{2}\|\mathbf{w}\|^2$. The optimal $\mathbf{w}$ and $b$ are found using quadratic programming, which first finds the $\alpha_i$ that maximize
$$L(\boldsymbol{\alpha}) = \sum_{i} \alpha_i - \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \,\varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j), \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_{i}\alpha_i y_i = 0.$$
This dual problem remains solvable for general forms of the dot product in a Hilbert space, where a kernel function defines the inner product of $\varphi(\mathbf{x})$:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j).$$
After solving for the $\alpha_i$, we get
$$\mathbf{w} = \sum_{i \in S} \alpha_i y_i \varphi(\mathbf{x}_i),$$
where the set of support vectors $S$ consists of the indices $i$ with $\alpha_i > 0$. The bias $b$ can then be calculated as
$$b = \frac{1}{|S|}\sum_{s \in S}\Big( y_s - \sum_{m \in S} \alpha_m y_m K(\mathbf{x}_m, \mathbf{x}_s) \Big).$$
For a new observation $\mathbf{x}$, the classification function is then given by
$$f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i \in S} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big).$$
In the implementation of the 3-category SVM model, SVM uses the 'one-against-one' approach: three binary classifiers are trained, and the predicted class is determined by a voting scheme.
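To make the multiclass scheme concrete, the following is a minimal sketch of a 3-category SVM in R, assuming the e1071 package (an interface to libsvm, which implements the one-against-one approach described above); the simulated features and class labels are hypothetical placeholders, not the paper's actual data.

```r
# Minimal sketch: 3-category SVM via e1071/libsvm (hypothetical simulated data)
library(e1071)

set.seed(1)
train <- data.frame(
  x1 = rnorm(300),
  x2 = rnorm(300),
  y  = factor(sample(c("down", "stable", "up"), 300, replace = TRUE))
)

# libsvm trains choose(3, 2) = 3 binary classifiers internally and
# predicts the class that wins the pairwise vote
fit  <- svm(y ~ x1 + x2, data = train, kernel = "radial", cost = 1)
pred <- predict(fit, newdata = train)
table(predicted = pred, observed = train$y)
```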

Elastic Net Model
In a linear regression problem, the classic way of obtaining the vector of unknown coefficients is ordinary least squares (OLS), which minimizes the residual sum of squares. Since OLS does not always perform well in prediction and interpretation, penalization techniques have been proposed to improve upon it. Ridge regression, proposed by Hoerl and Kennard [6], adds a size constraint, an $L_2$-norm penalty on the coefficients, to the regularization, thereby alleviating the high variance that can occur in the OLS coefficient estimates. However, ridge regression cannot produce a parsimonious model: as the amount of shrinkage grows, it only shrinks the coefficients towards zero and towards each other, without eliminating any of the variables. Another promising shrinkage method, the Lasso, was proposed by Tibshirani [7]; it minimizes the residual sum of squares subject to a bound on the $l_1$-norm of the coefficients. The Lasso automatically selects the variables that contribute most to model performance and shrinks the coefficients of insignificant variables to zero, yielding a more interpretable model. The Lasso also has a disadvantage under multicollinearity: it tends to select one variable at random from a group of highly correlated variables and ignore the others. To fix this problem, Zou and Hastie [8] proposed a new regularization technique named the Elastic Net (ENet). The ENet is a continuous shrinkage method intermediate between ridge regression and the Lasso, since its regularization is a mixture of the $l_1$ penalty of the Lasso and the $l_2$ penalty of ridge regression. Both penalties affect the estimation of the unknown coefficients. Hence, the ENet likewise performs automatic variable selection while simultaneously being able to select groups of correlated variables, as the penalized objective below illustrates.
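For reference, in the plain linear regression setting the ENet estimate can be written in the following standard form (our notation, stated here for completeness; $\lambda$ controls the overall amount of shrinkage and $\alpha^*$ is the mixing ratio used throughout this section):
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\Big( \alpha^* \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha^*}{2}\,\|\boldsymbol{\beta}\|_2^2 \Big),$$
so that $\alpha^* = 1$ recovers the Lasso and $\alpha^* = 0$ recovers ridge regression.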
In the case where the response variable $Y$ is a three-class categorical variable, the ENet fits a multinomial logistic regression model, where the probability of each class for the $i$-th observation is represented by a linear function as follows:
$$\Pr(g_i = k \mid \mathbf{x}_i) = \frac{\exp(\beta_{k0} + \mathbf{x}_i^{T}\beta_k)}{\sum_{l=1}^{3} \exp(\beta_{l0} + \mathbf{x}_i^{T}\beta_l)}, \quad k = 1, 2, 3,$$
where $\mathbf{x}_i$ is the input vector of the $i$-th observation, $\beta$ is a $p \times 3$ matrix of coefficients whose $k$-th column $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^{T}$ is the vector of $p$ unknown coefficients for class $k$, and $\beta_{k0}$ is the intercept for class $k$. There is no closed-form solution for $\beta$ in the multinomial logistic regression model. To implement the ENet model, we use the R function glmnet [9] to minimize the elastic net penalized negative log-likelihood function:
$$-\frac{1}{N}\sum_{i=1}^{N}\Big[ \sum_{l=1}^{3} y_{il}\big(\beta_{l0} + \mathbf{x}_i^{T}\beta_l\big) - \log \sum_{l=1}^{3} \exp\big(\beta_{l0} + \mathbf{x}_i^{T}\beta_l\big) \Big] + \lambda\Big[ \frac{1-\alpha^*}{2}\,\|\beta\|_F^2 + \alpha^* \sum_{j=1}^{p} \|\beta_j\|_1 \Big],$$
where $y_{il} = I(g_i = l)$ is the indicator-function response, $\|\cdot\|_F$ is the Frobenius norm, $\beta_j$ denotes the $j$-th row of $\beta$ (the coefficients of variable $j$ across the three classes), $\lambda$ is the complexity parameter that controls the amount of shrinkage, and $\alpha^* \in [0, 1]$ is the ENet mixing-ratio parameter that balances the quadratic and absolute penalty terms: when $\alpha^* = 1$, the ENet reduces to Lasso regression, and when $\alpha^* = 0$, it turns into ridge regression. The R function uses a so-called partial Newton algorithm to approximate the log-likelihood and thereby numerically derives $\beta$ [9]. With the estimated coefficients, we can assess the most probable category given the observed features and classify a new input accordingly. In our experiments, we mainly consider the ENet with $\alpha^* \in [0.2, 0.8]$, given the intricacy of the correlations among our handcrafted features: the variables of sequential prices are highly correlated, while the other variables are not. The details of choosing the values of the parameters $\alpha^*$ and $\lambda$ in our ENet model are explained in the empirical application section.
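As an illustration, the following is a minimal sketch of fitting the multinomial ENet with the glmnet package; the simulated feature matrix, the 3-level response, and the particular value of alpha are hypothetical placeholders rather than the paper's actual data or tuning choices.

```r
# Minimal sketch: multinomial ENet via glmnet (hypothetical simulated data)
library(glmnet)

set.seed(1)
X <- matrix(rnorm(300 * 10), nrow = 300)   # 10 hypothetical handcrafted features
y <- factor(sample(c("down", "stable", "up"), 300, replace = TRUE))

# 'alpha' is the mixing ratio alpha* in the text; lambda is chosen by
# cross-validation over glmnet's automatically generated lambda path
cv_fit <- cv.glmnet(X, y, family = "multinomial", alpha = 0.5)

# Classify inputs at the lambda minimizing the cross-validated deviance
pred <- predict(cv_fit, newx = X, s = "lambda.min", type = "class")
table(predicted = pred, observed = y)
```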

Proportion of Y based on α
In this section, we summarize four tables reporting the proportions of the categorical outcome variable $Y$ under different threshold values $\alpha$. Table S1 is the summary over the whole trading period; Tables S2, S3, and S4 are summaries over June, July, and August of 2017, respectively. In short, we find that when $\alpha = 10^{-4}$, all the stocks have an extreme class-imbalance issue. This issue is alleviated for some stocks when $\alpha$ equals $10^{-5}$ or $10^{-6}$. The optimal threshold varies across stocks. We suggest that users choose a proper threshold based on the stock's volatility and the distribution of the categorical responses, e.g., by computing the class proportions for candidate thresholds as sketched below.
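Since the exact labelling rule is defined in the main text rather than here, the following is a minimal sketch of one plausible construction, assuming $Y$ labels the relative mid-price change as 'down', 'stable', or 'up' with threshold $\alpha$; the function and variable names are hypothetical, not taken from the paper.

```r
# Hypothetical labelling rule (assumption: Y is the sign of the relative
# mid-price change, thresholded at alpha), plus a check of class proportions
label_response <- function(mid_now, mid_future, alpha = 1e-4) {
  rel_change <- (mid_future - mid_now) / mid_now
  factor(ifelse(rel_change > alpha, "up",
                ifelse(rel_change < -alpha, "down", "stable")),
         levels = c("down", "stable", "up"))
}

# Simulated mid-price path; inspect the class balance for several thresholds,
# mirroring the comparisons reported in Tables S1-S4
set.seed(1)
mid <- 100 * exp(cumsum(rnorm(1e4, sd = 1e-4)))
now <- mid[-length(mid)]
fut <- mid[-1]
for (a in c(1e-4, 1e-5, 1e-6)) {
  cat("alpha =", a, "\n")
  print(round(prop.table(table(label_response(now, fut, a))), 3))
}
```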