A hybrid model for stock price prediction based on multi-view heterogeneous data

Long, Wen; Gao, Jing; Bai, Kehan; Lu, Zhichen

doi:10.1186/s40854-023-00519-w

Research
Open access
Published: 29 February 2024

Financial Innovation volume 10, Article number: 48 (2024)

Abstract

Literature shows that both market data and financial media impact stock prices; however, using only one kind of data may lead to information bias. Therefore, this study uses market data and news to investigate their joint impact on stock price trends. However, combining these two types of information is difficult because of their completely different characteristics. This study develops a hybrid model called MVL-SVM for stock price trend prediction by integrating multi-view learning with a support vector machine (SVM). It works by simply inputting heterogeneous multi-view data simultaneously, which may reduce information loss. Compared with the ARIMA and classic SVM models based on single- and multi-view data, our hybrid model shows statistically significant advantages. In the robustness test, our model outperforms the others by at least 10% accuracy when the sliding windows of news and market data are set to 1–5 days, which confirms our model’s effectiveness. Finally, trading strategies based on single stock and investment portfolios are constructed separately, and the simulations show that MVL-SVM has better profitability and risk control performance than the benchmarks.

Introduction

Stock price predictions have always been a focus of financial research. Existing research on stock price prediction is primarily based on two data types. One is structured historical market data, and the other is unstructured text, such as financial news.

Stock market data, such as returns and trading volumes, play a vital role in stock price prediction, and many studies have used market data to predict stock price trends. White (1988) was the first to successfully predict the time series of the stock market using the Back Propagation Neural Network (BP-NN). Subsequently, Kolarik and Rudorfer (1994) compared the prediction results of Artificial Neural Network (ANN) with those of Autoregressive Integrated Moving Average model (ARIMA), showing that the ANN model was more effective. Bildirici and Ersin (2009) studied the historical stock data of the Istanbul stock market over the past 30 years by combining the Autoregressive Conditional Heteroskedasticity model (ARCH) or the Generalized Autoregressive Conditional Heteroskedasticity model (GARCH) with ANN and found that the hybrid model of GARCH and ANN had better prediction results than the hybrid model of ARCH and ANN. Hammad et al. (2007) applied the multi-layer BP-NN to predict the stock price, which showed a better prediction performance than other methods. In recent years, deep learning has been introduced into stock price predictions. Chen et al. (2015) realized the prediction of stock returns with the Long Short-Term Memory (LSTM) model. Fischer and Krauss (2018) used LSTM to predict stock prices and drew a short-term investment strategy. Long et al. (2019) put forward a multi-filters neural network model using deep learning methodologies and applied it to the Chinese stock market index CSI 300. Some studies have also utilized reinforcement learning for financial prediction; however, the algorithm usually requires training and testing over a very long period. Tan et al. (2011) developed a non-arbitrage algorithmic trading system based on reinforcement learning, which was tested on more than 20 stocks over 13 years from 1994 to 2006. Suhail et al. (2022) employed a reinforcement learning network to guide stock market trading, which used 11 years of Apple stock data from 2006 to 2016. Additionally, the performance of reinforcement learning is sometimes unsatisfactory. Li et al. (2007) adopted actor-only and actor–critic reinforcement learning to develop two prediction systems; however, both systems were unable to generate significant improvements. Kanwar (2019) also showed that deep reinforcement learning was less successful in capturing the dynamic changes in the stock market than originally thought.

Moreover, many studies have shown that in addition to market data, financial news has an impact on stock prices. News contains information about the company’s fundamentals and activities; hence, it will affect market participants’ expectations of future price changes, thus driving stock price movements. Dyck and Zingales (2003) proved that issuing earnings announcements through news media could increase volatility in the stock price. Shiller (2015) also held that media can fuel the fluctuations of the stock market. Hence, deriving information affecting stock prices from media coverage is very important. Wüthrich et al. (1998) chose the news in the most influential financial newspapers, such as the Wall Street Journal, as the object of empirical research and explored the forecasting effect of the news on market indexes. Lavrenko et al. (2000) constructed an e-Analyst news recommendation system to study the correlation between news and stock price time series. This system can recommend news that has a predictive effect on future stock price trends. Gidofalvi and Elkan (2001) applied the Bayesian text classifier and found that news indicators had a certain predictive effect on the stock price within 20 min before and after the news is released. Mittermayer and Knolmayer (2006) built a NewsCATS system to predict the intraday real-time price fluctuations of stocks caused by news. Compared with other automated text categorization algorithms, this system was found to have better predictive performance and higher system trading profitability. Schumaker and Chen (2009) found that news could be used by the SVM algorithm to make excellent predictions of stock prices 20 min after the news was released, and the prediction results could be used to guide trading. Long et al. (2019) proposed a new kernel S&S to study the impact of news on stock prices, which considered the information structures among news in addition to the news contents. With SVM algorithms, the new kernel outperformed other common kernels, such as the linear kernel, by at least 5% accuracy.

The aforementioned research works are based on single-view data, but stock price movement can be affected by both financial news and historical market data. Market and news data can be independently used to predict stock prices; however, if the model only uses single-view information, information deviation may occur. Figure 1 shows possible market scenarios. If the model only uses historical market data, the rational prediction in the left figure will be “rise,” and the rational prediction in the right figure will be “fall.” Therefore, if we witness an actual “fall” in the left figure or “rise” in the right figure, the prediction performance of the model is weakened. Moreover, if the model uses only financial news, it fails to explain why stock prices continue to increase when negative news is released in the left figure and why stock prices still fall when positive news is released in the right figure. Thus, analyzing the impact of multi-view data on stock prices comprehensively is important; only by this method can the model send the correct signal.

Many studies have tried incorporating the two kinds of data to improve prediction performance. However, owing to their different structures, combining them directly in one model is difficult. To solve this problem, studies usually apply indexing modeling; that is, they compile textual data into indexes so that the structured text can be used together with market data to predict stock prices (Deng et al. 2011a; Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020). Although this approach successfully fuses structured and unstructured text data, the indexing treatment has limitations. Because abundant news text is condensed into an index, either by directly using structured information about the text such as news frequency (Deng et al. 2011a) or by processing vectorized text into a structured sentiment index (Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020), some information contained in the text is inevitably lost. However, if common algorithms directly combine the text vector with stock market data for prediction, the news vector carries, alongside the information relevant to stock prediction, a large amount of unrelated information and noise, which may decrease prediction performance (Lin et al. 2022). Accordingly, appropriately extracting and exploiting the hidden information within raw multi-view heterogeneous data, including news and market data, to make accurate predictions is a challenging problem.

To solve this problem, this paper develops a hybrid model for predicting stock price fluctuations that combines multi-view learning, which directly fuses differently structured data, with a machine learning method called the support vector machine, which classifies stock price trends. This model, called MVL-SVM, maximizes the consistency between the information learned from the financial news view and the market data view; it therefore not only reduces the information loss in news text processing but also resolves the difficulty of integrating complicated news information with market data. To evaluate the performance of the model, a time series method and classic SVMs were introduced for comparison. Finally, a series of trading strategies were constructed based on this algorithm and applied to three trading scenarios.

This paper contributes in the following three aspects. (1) The proposed hybrid model based on the framework of multi-view learning can input heterogeneous information influencing stock price fluctuations, such as financial news and market data, into the prediction model simultaneously, which not only enriches the information types for stock price prediction but also reduces the information loss in the process of prediction. Most previous studies only considered single-view data; those that do use multi-view data tend to adopt the strategy of indexing modeling, which inevitably leads to a large loss of information. (2) This study also investigates the lag effect of news and market data on stock-price forecasts. Usually, the news cannot be fully absorbed by the stock price on the day of the news release, which means it may further affect the stock price the next day; however, few studies consider the time lag of this impact. We address this by studying a prediction problem with lag and different time windows to observe the ability of MVL-SVM to capture the information contained in multi-view heterogeneous data after or over a certain period. (3) This study constructs a series of trading strategies based on the proposed hybrid model and compares them with other prediction-based and common strategies. The simulation results show that MVL-SVM has better profitability and risk control ability than other models, which provides further evidence for evaluating the performance of our model. By contrast, related studies focus only on prediction accuracy.

The rest of the paper is organized as follows. “Literature review” section reviews the main existing methods related to stock price forecasting based on multi-view heterogeneous data. “Methods” section introduces the methods used in this study. “Experimental test” section presents our datasets and shows the results of the MVL-SVM model, which are compared with a time series model and some classic SVM models. “Robust test” section further evaluates the performance of the model under different sliding windows. “Trading strategy” section discusses a series of trading strategies designed with our model to assess its practical efficiency, and “Conclusion” section presents our conclusions.

Literature review

Some significant attempts have been made in the finance domain to incorporate news and market data in predicting stock prices. Relevant methods can be divided into two categories: indexing modeling and direct fusion methods.

Indexing modeling methods involve constructing indexes with news information and thus fusing the structured index with numerical market data for prediction. Frequently used methods include the statistical method, which calculates the frequency of the text data, and sentiment analysis, which uses the processed text from language preprocessing to provide polarity scores for social media data and news. Deng et al. (2011a) predicted the price movement with overall sentiment analysis and frequency based on news and comments, and technical analysis of historical market data. Mohan et al. (2019) extracted numerical data called text polarity from text articles and combined it with stock prices for prediction. Li et al. (2020) extracted sentiments from news and represented stock prices by technical indicators. Then, a layered deep learning model was used to learn the multi-view information, and a fully connected neural network was employed for stock predictions. Kesavan et al. (2020) represented news articles and social media contents by sentiment vectors and then used deep learning techniques to incorporate the polarity of the sentiments with financial time-series data to predict stock prices. This approach succeeded in fusing structured and unstructured textual data. However, when the structured information about text is directly used and constructed into some indexes, such as news frequency, or when text is first vectorized and then processed into a structured sentiment indicator, it may face information loss.

The direct fusion method aims to directly integrate structured and unstructured data to extract information or solve classification and prediction problems. Li et al. (2016) applied the extreme learning machine (ELM) to make stock price predictions based on the market news and stock prices concurrently and found that the accuracies of RBF ELM and RBF SVM are similar but higher than that of BP-NN; the prediction speeds of the two algorithms are also much faster than that of BP-NN. Wang et al. (2019) proposed a hybrid time-series predictive neural network to combine the daily K-line data with the news vectors and succeeded in stock volatility prediction. Ronaghi et al. (2022) predicted the market index with COVID-19-related Twitter data and historical market data via a deep fusion framework consisting of two parallel paths, one based on CNN and another that integrates CNN with bi-directional LSTM (BLSTM). Lin et al. (2022) developed a spatial-temporal attention-based convolutional network, which successfully extracted text and numerical information for stock price prediction using the attention mechanism, CNN, and LSTM. However, the aforementioned studies were mostly based on the neural network framework. Considering that a neural network is prone to falling into a local minimum, we attempt to develop a new model based on a different framework to fuse the two data types.

Multi-view learning, proposed by de Sa (1994), is a machine learning approach that can take heterogeneous data directly as input for training and usually performs well. Unlike the method of constructing an index from text, it substitutes labels using different views. It minimizes the inconsistency between the model outputs from distinct views to minimize classification errors. Yarowsky (1995) and Blum and Mitchell (1998) indicated that multi-view learning outperformed single-view learning in classification. Blum and Mitchell (1998) improved the algorithm by co-training distinct views when studying web page classification. Collins and Singer (1999) measured the consistency between distinct views by constructing an objective function. By maximizing the objective function, Dasgupta et al. (2001) presented an upper limit for the generalization error of multiple views. Multi-view learning has been applied to a variety of learning methods, such as dimensionality reduction (Sun et al. 2010) and classification methods (Han et al. 2022). Many scholars have noticed its usefulness and begun combining it with other traditional algorithms to obtain excellent performance. Xiao et al. (2022) utilized multi-view learning to handle data uncertainty, thereby successfully improving the ordinal regression (OR) classifier. Lv et al. (2021) developed a prediction model with market data by integrating multi-view learning with the classic RBF network, which showed excellent performance in forecasting stock prices. However, their model uses only market data and excludes financial news information, which many studies have shown to be predictive.

The intuition for building this hybrid model is as follows. On the one hand, SVM is a classic machine learning classification algorithm and is often used to predict stock prices with financial news (Schumaker and Chen 2009; Long et al. 2019). Many studies have shown that SVM performs better in financial forecasting when compared with some neural network frameworks (Kim 2003; Cao and Tay 2001, 2003; Li et al. 2016; Meesad and Thanh 2014). On the other hand, multi-view learning can learn common feature spaces or shared patterns by combining multiple data sources (Yan et al. 2021). Therefore, these two algorithms can be combined to use multi-view heterogeneous data to predict stock prices. Some scholars have theoretically proved the effectiveness of the multi-view model over the single-view model (Sun et al. 2022), and literature shows that this hybrid model has achieved excellent performance in fusing data with different structures for classification (Zhang et al. 2010; Xu et al. 2015; Ceci et al. 2015; Wang and Zhou 2021).

However, this model has not been widely applied in the financial field, and the limited related research can be divided into three categories. First, most studies based on this method considered only single-view data. Shynkevich et al. (2015a, b) used the model to predict the stock price based on financial news and found that fusing different news categories could improve the prediction performance. However, they only considered the impact of media on stock prices and did not consider the impact of market data. Second, although some studies simultaneously included numerical and textual data with a multi-view learning framework, they transformed the text into structured indexes to fuse text with numerical data (Deng et al. 2011a, b, 2014). As stated earlier, this approach incurs information loss. Third, a few studies applied a multi-view learning framework to merge historical stock prices with financial news vectors for stock price prediction (Li et al. 2011; Wang et al. 2012); however, they neither took into account the lag effect of news and market data on stock prices nor built trading strategies; hence, the model’s actual application performance in the financial market could not be judged.

Methods

Chinese news text processing

Considering that news is unstructured and involves a lot of noise or redundant information, we must eliminate noise and extract representative features containing the most useful information for accurate prediction. Therefore, this section introduces the method of transforming news text into a structured feature vector for training through news preprocessing, data cleaning, text representation, and feature extraction.

News preprocessing

Because trading in the Chinese stock market ends at 15:00, news released after 15:00 on trading $day_t$ can be assumed not to affect the stock price on $day_t$; similarly, news released on weekends, holidays, and other closed days cannot affect prices until the market reopens. Therefore, news released on a closed day or after 15:00 on a trading day is included in the news of the next trading day. We then sorted the news by the reassigned date for processing.
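The date reassignment described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assign_trading_day`, the `trading_days` calendar, and the choice to keep news stamped exactly 15:00 on the same day are all assumptions.

```python
from bisect import bisect_left, bisect_right
from datetime import date, datetime, time

MARKET_CLOSE = time(15, 0)  # closing time stated in the text


def assign_trading_day(published, trading_days):
    """Return the trading day a news item is attributed to.

    published    -- datetime of the news release
    trading_days -- sorted list of datetime.date (hypothetical calendar)
    Returns None when no later trading day exists in the calendar.
    """
    day = published.date()
    i = bisect_left(trading_days, day)
    is_trading_day = i < len(trading_days) and trading_days[i] == day
    # Assumption: news at exactly 15:00 still counts for the same day.
    if is_trading_day and published.time() <= MARKET_CLOSE:
        return day
    # Closed day, or released after the close: roll to the next trading day.
    j = bisect_right(trading_days, day)
    return trading_days[j] if j < len(trading_days) else None


if __name__ == "__main__":
    calendar = [date(2024, 2, 29), date(2024, 3, 1), date(2024, 3, 4)]
    print(assign_trading_day(datetime(2024, 3, 1, 15, 30), calendar))
```

In this sketch, a Friday 15:30 item and a Saturday item are both rolled forward to the following Monday, which matches the rule in the text.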

Data cleaning

In data cleaning, we first removed the punctuation and garbled characters in the news and then used the jieba package for Python to perform Chinese word segmentation. Finally, we filtered out the unimportant words listed in the Baidu Stop Word List. This step removes stop words and leaves representative words such as nouns, verbs, and adjectives.

Text representation

For a news article, the value of each word was calculated according to its classification importance. Words that were more important for classification were assigned higher weights. Thus, each article can be represented as a vector of word values. The bag-of-words model is a commonly used text representation method that represents the text as a bag of words, regardless of word order and grammar, while maintaining multiplicity. Based on the bag-of-words model, Salton et al. (1975) proposed a vector space model commonly used in text classification. Because news contains many new concepts and words, it is appropriate to assume the independence of each word in this model. Each news item is then represented as a vector composed of the weight of each word, and the weight is determined by the word’s importance in the news. According to Salton and Buckley (1988), the importance of words can be determined by TF-IDF, which supposes that words that rarely appear in the entire document but frequently appear in a text are of greater importance for classification. However, in practice, text length affects the weights obtained from this method. To better quantify the characteristic words, the influence of the length should be reduced. Therefore, we used the ltc method (Buckley et al. 1995) in this study, which combines length normalization (“l”), term frequency (“t”) and collection frequency (“c”) to calculate the weights of words. By normalizing the weight of words, the influence of article length can be avoided, and the importance of word frequency is weakened to a certain extent. This represents news articles in the following form:

$$news_{t} = \left( {w_{{t,1}} ,w_{{t,2}} , \ldots ,w_{{t,M}} } \right),$$

(1)

where $news_t$ represents the news vector on $day_t$ and $w_{t, m}$ is the weight of $word_m$ in $news_t$, which is expressed as

$$w_{t,m} = \frac{\left( \log \left( f_{t,m} \right) + 1.0 \right) * \log \frac{1}{F_{m}}}{\sqrt{\sum \limits_{j = 1}^{M} \left[ \left( \log (f_{t,j}) + 1.0 \right) * \log \frac{1}{F_{j}} \right]^{2}}},\;m = 1,2, \ldots ,M,$$

(2)

where $f_{t,m}$ represents the occurrence frequency of $word_m$ in $news_t$ (term frequency), $F_m$ is the occurrence frequency of $word_m$ in the news corpus (collection frequency). All symbols used in the equations are listed in the Appendix (see Table 22).
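As a rough illustration of Eq. (2), the sketch below computes ltc weights for one day's news. It rests on two assumptions that the text does not spell out: the denominator is taken to be the Euclidean norm of the unnormalized weights (the usual ltc cosine normalization), and $F_m$ is taken to be the share of news items in the corpus that contain $word_m$, so that $0 < F_m \le 1$. The function name `ltc_weights` is hypothetical.

```python
import math


def ltc_weights(term_counts, collection_freq):
    """Length-normalized log-tf * log(1/F) weights for one news vector.

    term_counts     -- dict word -> f_{t,m}, occurrences in the day-t news
    collection_freq -- dict word -> F_m, assumed share of news items
                       containing the word, in (0, 1]
    """
    raw = {
        word: (math.log(count) + 1.0) * math.log(1.0 / collection_freq[word])
        for word, count in term_counts.items()
        if count > 0  # log is undefined for absent words; their weight is 0
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        return {word: 0.0 for word in raw}
    return {word: v / norm for word, v in raw.items()}


if __name__ == "__main__":
    weights = ltc_weights({"merger": 3, "market": 1},
                          {"merger": 0.5, "market": 0.1})
    print(weights)
```

Under these assumptions the resulting vector has unit length, so long and short articles become comparable, which is the purpose the text gives for the normalization.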

Feature extraction

Because the news corpus involves many words, but only a small portion of words is contained in each news item, we use $\chi ^2$ statistics (Yang and Pedersen 1997) to extract features of the text. Instead of using all the words in the news corpus, it selects words that contribute more to text classification to make computation easier and prevent overfitting. For word w and category c, we define A as the number of times w and c co-occur, B as the number of times w occurs without c, C as the number of times c occurs without w, D as the number of times neither c nor w occurs, and n is the sample size.

$$\begin{aligned} \begin{aligned} \chi ^2(w,c)=\frac{n\times (AD-BC)^2}{(A+B)\times (A+C)\times (B+D)\times (C+D)}. \\ \end{aligned} \end{aligned}$$

(3)

Then, for a word w, the $\chi ^2$ score is obtained by combining the scores of each category:

$$\begin{aligned} \begin{aligned} \chi ^2(w)=P(-1)\times \chi ^2(w,-1)+P(1)\times \chi ^2(w,1), \\ \end{aligned} \end{aligned}$$

(4)

where P(c) represents the frequency of the category $c \in \{-1,1\}$ in the news corpus. Words with higher $\chi ^2$ scores are considered more informative for prediction, so we rank words by their $\chi ^2$ scores and keep the top-scoring ones; the number of words kept is the dimension of the news vector. This dimension was chosen by computing the prediction accuracy for different dimensions separately and selecting the one with the highest accuracy.
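Equations (3) and (4) can be sketched as below. The representation of each news item as a set of words, the function names, and the convention of returning 0 when a marginal count is zero are assumptions made for illustration.

```python
def chi2_word_category(A, B, C, D):
    """Eq. (3): chi-square statistic from the four co-occurrence counts."""
    n = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    if denom == 0:  # assumption: degenerate tables score 0
        return 0.0
    return n * (A * D - B * C) ** 2 / denom


def chi2_word(docs, labels, word):
    """Eq. (4): category-frequency-weighted chi-square score of a word.

    docs   -- list of sets of words, one set per news item
    labels -- list of -1 / +1 labels aligned with docs
    """
    n = len(docs)
    score = 0.0
    for c in (-1, 1):
        A = sum(1 for d, y in zip(docs, labels) if word in d and y == c)
        B = sum(1 for d, y in zip(docs, labels) if word in d and y != c)
        C = sum(1 for d, y in zip(docs, labels) if word not in d and y == c)
        D = n - A - B - C
        p_c = sum(1 for y in labels if y == c) / n
        score += p_c * chi2_word_category(A, B, C, D)
    return score


if __name__ == "__main__":
    docs = [{"up", "gain"}, {"up"}, {"down"}, {"down", "loss"}]
    labels = [1, 1, -1, -1]
    ranked = sorted({w for d in docs for w in d},
                    key=lambda w: chi2_word(docs, labels, w), reverse=True)
    print(ranked)
```

Keeping the first k entries of such a ranking, and repeating the prediction experiment for several values of k, corresponds to the dimension selection described above.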

MVL-SVM algorithm

The proposed MVL-SVM algorithm combines a support vector machine and multi-view learning, which can apply multi-view learning for multi-view data fusion and then use a support vector machine for classification or prediction. These two components are discussed further in this section. Moreover, SVM will also serve as a benchmark to test the effectiveness of our hybrid model.

Support vector machine (SVM)

SVM (Cortes and Vapnik 1995; Deng et al. 2012) has been widely applied to solve classification problems owing to its performance. It can learn from a set of two-class training instances and divide new instances into one of the classes to solve classification problems.

We denote $(x_1, y_1),(x_2, y_2),\ldots , (x_n, y_n)$ as a two-class training dataset, where $x_i$ for $i=1,\dots ,n$ represents a p-dimensional real vector, and $y_i \in \{-1,1\}$ indicates the class to which $x_i$ belongs. According to the classification method, SVM can be divided into linear and nonlinear SVM. The main idea of linear SVM is to find a “maximum margin hyperplane,” defined as $g(x)=\omega ^{T}x+b$, so that the two classes of samples can be accurately classified by this hyperplane and the sum of distances between the hyperplane and the closest point of each class is maximized. Mathematically, the classification problem is equivalent to solving the minimization problem as follows:

$$\begin{aligned} \begin{aligned} min \frac{1}{2}\omega ^{T}\omega&+C\sum _{i=1}^{n}{\xi _i},\\ s.t.\quad y_i(\omega ^{T}x_i+b)&+\xi _i \ge 1, \forall 1 \le i \le n,\\ \xi _i \ge 0&,\forall 1 \le i \le n. \end{aligned} \end{aligned}$$

(5)

where n refers to the sample size, $\xi _i$ is a slack variable, and C is a penalty term that controls the cost of misclassifying samples. The larger C is, the less tolerant the model is of classification errors, which makes it prone to overfitting. Conversely, when C is smaller, the model tolerates more errors and is therefore prone to underfitting. A 2-dimensional example is shown in Fig. 2 to demonstrate the workings of the linear SVM. Here, samples of different colors come from different classes. The red line represents the maximum margin hyperplane obtained by training on the samples.

However, in practice, not all the samples are linearly separable. Therefore, a nonlinear SVM can be introduced to solve this problem. A nonlinear SVM can implicitly map samples into a high-dimensional space with $\phi (x)$ to find a maximum margin hyperplane in this high-dimensional space. The optimization problem is as follows:

$$\begin{aligned} \begin{aligned} min \frac{1}{2}\omega ^{T}\omega&+C\sum _{i=1}^{n}{\xi _i},\\ s.t.\quad y_i(\omega ^{T}\phi (x_i)+b)&+\xi _i \ge 1, \forall 1 \le i \le n,\\ \xi _i \ge 0&,\forall 1 \le i \le n. \end{aligned} \end{aligned}$$

(6)

With Lagrange duality, the original problem can be transformed into the following dual problem.

$$\begin{aligned} max\quad W(\alpha )&=\sum _{i=1}^{n}{\alpha _{i}}-\frac{1}{2}\sum _{i=1}^{n}{\sum _{j=1}^{n}{\alpha _{i} \alpha _{j}y_{i}y_{j}(\phi (x_{i})\cdot \phi (x_{j}))}}\\&=\sum _{i=1}^{n}{\alpha _{i}}-\frac{1}{2}\sum _{i=1}^{n}{\sum _{j=1}^{n}{\alpha _{i}\alpha _{j}y_{i}y_{j}k(x_{i},x_{j})}},\\ s.t.&\quad \sum _{i=1}^{n}y_{i}\alpha _{i}=0,\\&\quad 0 \le \alpha _i \le C, \forall i=1,\ldots ,n. \end{aligned}$$

(7)

where $\alpha _{i}$ is a Lagrangian multiplier corresponding to sample $x_i$ and $k(x_{i},x_{j})=\phi (x_{i})\cdot \phi (x_{j})$ is a kernel function that is a symmetric positive definite function that satisfies Mercer’s conditions. By solving the above optimization problem, we can obtain the solutions $\alpha _{i}^{*}$ and $b^{*}$; the decision function is obtained as

$$f(x) = sgn\left\{ \sum \limits_{i = 1}^{n} y_{i} \alpha _{i}^{*} k(x_{i} ,x) + b^{*} \right\}.$$

(8)

Kernel function $k(x_{i},x_{j})$ determines the performance of the model. The linear kernel function is often used to solve linear classification problems, and the Gaussian kernel function is used to solve nonlinear classification problems.

1) Linear kernel function

$$\begin{aligned} \begin{aligned} k_{lin}(x_{i},x_{j})=x_{i}^Tx_{j}. \end{aligned} \end{aligned}$$

(9)

2) Gaussian kernel function

$$\begin{aligned} \begin{aligned} k_{Gau}(x_{i},x_{j})=exp\left(-\gamma ||x_{i}-x_{j}||^{2}\right). \end{aligned} \end{aligned}$$

(10)

where $\gamma$ is a Gaussian kernel parameter, which is important in determining kernel performance. When $\gamma$ is small, the model is prone to underfitting, whereas when $\gamma$ is large, the model is prone to overfitting.
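The two kernels in Eqs. (9) and (10) are simple to write down directly. The sketch below uses plain Python lists as vectors; the function names are illustrative only.

```python
import math


def linear_kernel(x, z):
    """Eq. (9): inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma):
    """Eq. (10): exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    x, z = [0.0, 0.0], [1.0, 1.0]
    for gamma in (0.1, 1.0, 10.0):
        # Similarity between the same two points shrinks as gamma grows.
        print(gamma, gaussian_kernel(x, z, gamma))
```

The loop gives a feel for the role of $\gamma$: with a large $\gamma$ only very close points count as similar, so the decision boundary can bend around individual samples, whereas a small $\gamma$ makes all points look alike.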

Multi-view learning

Generally, single-view data can be easily used in machine learning methods for classification, whereas using multi-view data in these methods is difficult. Multi-view learning algorithms emerged to solve this problem.

Multi-view learning designs a function for each perspective. All functions are optimized by maximizing the consistency between redundant views, and the model’s performance is improved. Owing to its outstanding performance in multi-view data applications, multi-view learning has gradually attracted increasing attention. There are three types of existing algorithms.

1) Co-training: Maximizing mutual agreement on different views of unlabeled data through alternate learning.

2) Multiple kernel learning: Linearly or non-linearly combining kernels for each view to improve training efficiency.

3) Subspace learning: Acquiring an appropriate subspace under the assumption that multiple views are generated from this appropriate subspace.

This study uses a multiple kernel learning framework to build a stock price prediction model. By selecting the appropriate kernels and kernel combination for training, each data source can be trained with the corresponding optimal kernel function; therefore, the model can perform better than the single-kernel model (Xu et al. 2013). As illustrated in Fig. 3, distinct kernels are selected for distinct views, and multiple pieces of information can be fused by combining distinct kernels. There are many combination methods that can be grouped into two categories: linear and nonlinear combinations.

However, no empirical results show that a nonlinear combination can improve the model’s performance, which raises the question of whether the nonlinear combination method is necessary and efficient. Therefore, we only used linear combination methods in this study. There are two basic categories.

1) Direct summation

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m=1}^{M}{k_{m}(x_{i},x_{j})}, \end{aligned} \end{aligned}$$

(11)

where $k_{m}(x_{i},x_{j})$ denotes the m-th kernel.

2) Weighted summation

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m=1}^{M}{\beta _{m}k_{m}(x_{i},x_{j})}. \end{aligned} \end{aligned}$$

(12)

Here, $\beta _{m}$ represents the weight of kernel $k_{m}(x_{i},x_{j})$.

As different types of information have different importance for prediction/classification, the direct summation method, which assigns equal priority to each kernel, is not ideal. Instead, we choose the weighted summation kernel in this study, and the kernel function can be written as

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m}{\beta _{m}k_{m}(x_{i},x_{j})},\beta _{m}\ge 0,\sum _{m}{\beta _{m}}=1. \end{aligned} \end{aligned}$$

(13)

The weight $\beta _{m}$ of the kernel $k_{m}(x_{i},x_{j})$ can be determined using kernel learning. By applying SVM with the above kernel function, we can obtain the decision function of MVL-SVM, as shown in Equation (14).

$$\begin{aligned} \begin{aligned} f(x)=\operatorname{sgn}\left\{ \sum _{i=1}^{n}\alpha _{i}^{*}y_{i}\sum _{m}\beta _{m}k_{m}(x_{i},x)+b^{*}\right\} . \end{aligned} \end{aligned}$$

(14)
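To make the decision function concrete, the following minimal sketch evaluates it for a new sample, with the kernel comparing each training sample against that new sample. It is not a trained model: the support vectors, the dual coefficients `alphas`, the bias, and the kernel weights `betas` are made-up values chosen so the arithmetic is easy to follow, and both views use a linear kernel purely for simplicity.

```python
def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def mvl_score(x_views, support, alphas, labels, bias, kernels, betas):
    """Value inside sgn{...}: sum_i alpha_i * y_i * sum_m beta_m * k_m(x_i, x) + b."""
    total = bias
    for sv_views, alpha, y in zip(support, alphas, labels):
        k = sum(
            beta * kernel(sv, xv)
            for beta, kernel, sv, xv in zip(betas, kernels, sv_views, x_views)
        )
        total += alpha * y * k
    return total

def mvl_predict(x_views, support, alphas, labels, bias, kernels, betas):
    """Return +1 (rise) or -1 (fall) from the sign of the score."""
    score = mvl_score(x_views, support, alphas, labels, bias, kernels, betas)
    return 1 if score >= 0 else -1

# Hypothetical support vectors, each with two views.
support = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [-1.0])]
labels = [1, -1]
alphas = [1.0, 1.0]    # assumed dual coefficients, not fitted
bias = 0.0             # assumed bias term
kernels = [linear_kernel, linear_kernel]
betas = [0.5, 0.5]     # assumed kernel weights

up = mvl_predict(([2.0, 0.0], [0.5]), support, alphas, labels, bias, kernels, betas)
down = mvl_predict(([0.0, 2.0], [-0.5]), support, alphas, labels, bias, kernels, betas)
```

In practice the coefficients and bias come from solving the SVM dual problem with the combined kernel, and the kernel weights come from kernel learning.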

ARIMA model

The ARIMA model (Box et al. 2015) is a commonly used time series model that predicts future values from historical data sequences; therefore, we use it to design one of the benchmarks based on single-view data. It contains three terms: the autoregression term, the integrated term, and the moving average term. A nonstationary data series can be converted into a stationary one by differencing to remove the impact of nonstationarity. The first-order differencing of a data series $z_t$ is expressed as

$$\begin{aligned} \begin{aligned} o_{t}=z_t-z_{t-1}. \end{aligned} \end{aligned}$$

(15)
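As a small illustration of Equation (15), the sketch below differences a series once and, by repeating the same step, d times. The function name `difference` and the sample values are hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: o_t = z_t - z_{t-1}."""
    out = list(series)
    for _ in range(d):
        # Each pass shortens the series by one observation.
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out

prices = [10, 12, 15, 19]                # made-up values with an accelerating trend
first_diff = difference(prices)          # removes the level
second_diff = difference(prices, d=2)    # removes the linear trend in the differences
```

Each round of differencing costs one observation, so a series of length n yields n - d usable values after d-order differencing.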

The stationarity of the time series is tested using the augmented Dickey-Fuller (ADF) test. After converting the data series into a stationary one through d-order differencing, we use the stationary series to construct a model that combines the autoregression and moving average models, and we obtain the future value by d-order integration. The autoregression model captures the impact of historical time-series values on the current value through linear regression. Because time series are usually affected by random disturbances in noisy environments, the moving average component is further introduced to capture the influence of past random disturbances on future values. The ARIMA(p, d, q) model with three parameters, namely the autoregression order p, the differencing order d, and the moving average order q, can then be expressed as

$$\begin{aligned} \begin{aligned} o_{t}=\sum _{i=1}^{p}\phi _{i}o_{t-i}+\sum _{j=1}^{q}\theta _{j}\epsilon _{t-j}+\epsilon _{t}, \end{aligned} \end{aligned}$$

(16)

where $\phi _{i}$ is the ith autoregression parameter, $\theta _{j}$ is the jth moving average parameter, and $\epsilon _{t}$ is the error term at time t. In practice, the autoregression order p and moving average order q can be determined using partial autocorrelation and autocorrelation diagrams, respectively.
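The forecasting recursion can be sketched as follows for the case d = 1, with the autoregressive term taken at lag i and the moving average term at lag j. This is a minimal illustration under stated assumptions, not the fitting procedure used in this study: the coefficients `phi` and `theta` and the past residuals are assumed to be already known, and all numbers are made up. The future error term is replaced by its expected value of zero.

```python
def arma_one_step(o, residuals, phi, theta):
    """One-step forecast of the differenced series.

    o         : past differenced values, most recent last
    residuals : past error terms, most recent last
    phi       : autoregression parameters phi_1..phi_p
    theta     : moving average parameters theta_1..theta_q
    """
    ar_part = sum(p * o[-i] for i, p in enumerate(phi, start=1))
    ma_part = sum(t * residuals[-j] for j, t in enumerate(theta, start=1))
    return ar_part + ma_part

def forecast_level(z, residuals, phi, theta):
    """ARIMA(p, 1, q) forecast: difference, predict the next difference, integrate back."""
    o = [curr - prev for prev, curr in zip(z, z[1:])]
    return z[-1] + arma_one_step(o, residuals, phi, theta)

z = [100.0, 102.0, 101.0, 104.0]   # hypothetical price levels
residuals = [0.0, 0.5, -0.5]       # assumed past errors of the differenced model
phi = [0.5, 0.25]                  # assumed AR(2) parameters
theta = [0.5]                      # assumed MA(1) parameter

next_level = forecast_level(z, residuals, phi, theta)
```

With these numbers the differenced series is 2, -1, 3, the autoregressive part is 0.5 × 3 + 0.25 × (-1) = 1.25, the moving average part is 0.5 × (-0.5) = -0.25, and the forecast level is 104 + 1 = 105.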

Experimental test

Section “Introduction” shows that financial news and market data are significant in predicting price trends. Because the MVL-SVM method can integrate multiple information sources for classification, we now apply it to predict whether prices will rise or fall based on financial news and market data. The results are then compared with those of classic SVMs using single-view and mixed data.

Data sources

The Shanghai Stock Exchange 50 index (SSE 50 index) comprises the 50 most representative stocks of the Shanghai Stock Exchange. It reflects the overall performance of the leading companies with the greatest market influence in the Chinese stock market. As these enterprises receive the most active news coverage and can thus provide sufficient news samples, we choose the constituent stocks of the SSE 50 index for empirical analysis. Due to the limitations of data sources, the period investigated in this study is from January 1, 2018 to December 31, 2020.

Table 1 Days of news releasing
