A hybrid model for stock price prediction based on multi-view heterogeneous data

Abstract

Literature shows that both market data and financial media impact stock prices; however, using only one kind of data may lead to information bias. Therefore, this study uses market data and news to investigate their joint impact on stock price trends. However, combining these two types of information is difficult because of their completely different characteristics. This study develops a hybrid model called MVL-SVM for stock price trend prediction by integrating multi-view learning with a support vector machine (SVM). It works by simply inputting heterogeneous multi-view data simultaneously, which may reduce information loss. Compared with the ARIMA and classic SVM models based on single- and multi-view data, our hybrid model shows statistically significant advantages. In the robustness test, our model outperforms the others by at least 10% accuracy when the sliding windows of news and market data are set to 1–5 days, which confirms our model’s effectiveness. Finally, trading strategies based on single stock and investment portfolios are constructed separately, and the simulations show that MVL-SVM has better profitability and risk control performance than the benchmarks.

Introduction

Stock price predictions have always been a focus of financial research. Existing research on stock price prediction is primarily based on two data types. One is structured historical market data, and the other is unstructured text, such as financial news.

Stock market data, such as returns and trading volumes, play a vital role in stock price prediction, and many studies have used market data to predict stock price trends. White (1988) was the first to successfully predict the time series of the stock market using the Back Propagation Neural Network (BP-NN). Subsequently, Kolarik and Rudorfer (1994) compared the prediction results of Artificial Neural Network (ANN) with those of Autoregressive Integrated Moving Average model (ARIMA), showing that the ANN model was more effective. Bildirici and Ersin (2009) studied the historical stock data of the Istanbul stock market over the past 30 years by combining the Autoregressive Conditional Heteroskedasticity model (ARCH) or the Generalized Autoregressive Conditional Heteroskedasticity model (GARCH) with ANN and found that the hybrid model of GARCH and ANN had better prediction results than the hybrid model of ARCH and ANN. Hammad et al. (2007) applied the multi-layer BP-NN to predict the stock price, which showed a better prediction performance than other methods. In recent years, deep learning has been introduced into stock price predictions. Chen et al. (2015) realized the prediction of stock returns with the Long Short-Term Memory (LSTM) model. Fischer and Krauss (2018) used LSTM to predict stock prices and drew a short-term investment strategy. Long et al. (2019) put forward a multi-filters neural network model using deep learning methodologies and applied it to the Chinese stock market index CSI 300. Some studies have also utilized reinforcement learning for financial prediction; however, the algorithm usually requires training and testing over a very long period. Tan et al. (2011) developed a non-arbitrage algorithmic trading system based on reinforcement learning, which was tested on more than 20 stocks over 13 years from 1994 to 2006. Suhail et al. (2022) employed a reinforcement learning network to guide stock market trading, which used 11 years of Apple stock data from 2006 to 2016. Additionally, the performance of reinforcement learning is sometimes unsatisfactory. Li et al. (2007) adopted actor-only and actor–critic reinforcement learning to develop two prediction systems; however, both systems were unable to generate significant improvements. Kanwar (2019) also showed that deep reinforcement learning was less successful in capturing the dynamic changes in the stock market than originally thought.

Moreover, many studies have shown that in addition to market data, financial news has an impact on stock prices. News contains information about the company’s fundamentals and activities; hence, it will affect market participants’ expectations of future price changes, thus driving stock price movements. Dyck and Zingales (2003) proved that issuing earnings announcements through news media could increase volatility in the stock price. Shiller (2015) also held that media can fuel the fluctuations of the stock market. Hence, deriving information affecting stock prices from media coverage is very important. Wüthrich et al. (1998) chose the news in the most influential financial newspapers, such as the Wall Street Journal, as the object of empirical research and explored the forecasting effect of the news on market indexes. Lavrenko et al. (2000) constructed an e-Analyst news recommendation system to study the correlation between news and stock price time series. This system can recommend news that has a predictive effect on future stock price trends. Gidofalvi and Elkan (2001) applied the Bayesian text classifier and found that news indicators had a certain predictive effect on the stock price within 20 min before and after the news is released. Mittermayer and Knolmayer (2006) built a NewsCATS system to predict the intraday real-time price fluctuations of stocks caused by news. Compared with other automated text categorization algorithms, this system was found to have better predictive performance and higher system trading profitability. Schumaker and Chen (2009) found that news could be used by the SVM algorithm to make excellent predictions of stock prices 20 min after the news was released, and the prediction results could be used to guide trading. Long et al. (2019) proposed a new kernel S&S to study the impact of news on stock prices, which considered the information structures among news in addition to the news contents. With SVM algorithms, the new kernel outperformed other common kernels, such as the linear kernel, by at least 5% accuracy.

The aforementioned research works are based on single-view data, but stock price movement can be affected by both financial news and historical market data. Market and news data can be independently used to predict stock prices; however, if the model only uses single-view information, information deviation may occur. Figure 1 shows possible market scenarios. If the model only uses historical market data, the rational prediction in the left figure will be “rise,” and the rational prediction in the right figure will be “fall.” Therefore, if we witness an actual “fall” in the left figure or “rise” in the right figure, the prediction performance of the model is weakened. Moreover, if the model uses only financial news, it fails to explain why stock prices continue to increase when negative news is released in the left figure and why stock prices still fall when positive news is released in the right figure. Thus, analyzing the impact of multi-view data on stock prices comprehensively is important; only by this method can the model send the correct signal.

Fig. 1 Possible market scenarios

Many studies have tried incorporating the two kinds of data to improve prediction performance. However, owing to their different structures, combining them directly in one model is difficult. To solve this problem, studies usually apply indexing modeling; that is, they use textual data to compile indexes so that textual data are structured to predict stock prices together with market data (Deng et al. 2011a; Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020). Although this approach successfully fuses structured and unstructured text data, the indexing treatment has some limitations. Because abundant news text data are condensed into an index by directly using structured information about text (Deng et al. 2011a), such as news frequency, or by processing vectorized text into a structured sentiment index (Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020), some information contained in the text is inevitably lost. However, if common algorithms directly combine the text vector with stock market data for prediction, the large amount of unrelated information and noise mixed in with the predictive news content may decrease the prediction performance (Lin et al. 2022). Accordingly, appropriately extracting and exploiting the hidden information within raw multi-view heterogeneous data, including news and market data, to make accurate predictions is a challenging problem.

To solve the problem, this paper develops a hybrid model for predicting stock price fluctuations that combines multi-view learning, which directly fuses differently structured data, with a support vector machine, a machine learning method used here for stock price trend classification. This model, called MVL-SVM, can maximize the consistency between the multi-view information learned from financial news and market data; it therefore not only reduces the information loss in news text processing but also overcomes the difficulty of integrating complicated news information with market data. To evaluate the performance of the model, the time series method and classic SVMs were introduced for comparison. Finally, a series of trading strategies were constructed based on this algorithm and applied to three trading scenarios.

This paper contributes in the following three aspects. (1) The proposed hybrid model based on the framework of multi-view learning can input heterogeneous information influencing stock price fluctuations, such as financial news and market data, into the prediction model simultaneously, which not only enriches the information types for stock price prediction but also reduces the information loss in the process of prediction. Most previous studies only considered single-view data; those that used multi-view data tended to adopt indexing modeling, which inevitably leads to a large loss of information. (2) This study also investigates the lag effect of news and market data on stock-price forecasts. Usually, the news cannot be fully absorbed by the stock price on the day of the news release, which means it may further affect the stock price the next day; however, few studies consider the time lag of this impact. We address this by studying a prediction problem with lag and different time windows to observe the ability of MVL-SVM to capture the information contained in multi-view heterogeneous data after or over a certain period. (3) This study constructs a series of trading strategies based on the proposed hybrid model and compares them with other prediction-based and common strategies. The simulation results show that MVL-SVM has better profitability and risk control ability than other models, which provides further evidence of our model’s performance; by contrast, related studies focus only on prediction accuracy.

The rest of the paper is organized as follows. “Literature review” section reviews the main existing methods related to stock price forecasting based on multi-view heterogeneous data. “Methods” section introduces the methods used in this study. “Experimental test” section presents our datasets and shows the results of the MVL-SVM model, which are compared with a time series model and some classic SVM models. “Robust test” section further evaluates the performance of the model under different sliding windows. “Trading strategy” section discusses a series of trading strategies designed with our model to assess its practical efficiency, and “Conclusion” section presents our conclusions.

Literature review

Some significant attempts have been made in the finance domain to incorporate news and market data in predicting stock prices. Relevant methods can be divided into two categories: indexing modeling and direct fusion methods.

Indexing modeling methods involve constructing indexes with news information and thus fusing the structured index with numerical market data for prediction. Frequently used methods include the statistical method, which calculates the frequency of the text data, and sentiment analysis, which uses the processed text from language preprocessing to provide polarity scores for social media data and news. Deng et al. (2011a) predicted the price movement with overall sentiment analysis and frequency based on news and comments, and technical analysis of historical market data. Mohan et al. (2019) extracted numerical data called text polarity from text articles and combined it with stock prices for prediction. Li et al. (2020) extracted sentiments from news and represented stock prices by technical indicators. Then, a layered deep learning model was used to learn the multi-view information, and a fully connected neural network was employed for stock predictions. Kesavan et al. (2020) represented news articles and social media contents by sentiment vectors and then used deep learning techniques to incorporate the polarity of the sentiments with financial time-series data to predict stock prices. This approach succeeded in fusing structured and unstructured textual data. However, when the structured information about text is directly used and constructed into some indexes, such as news frequency, or when text is first vectorized and then processed into a structured sentiment indicator, it may face information loss.

Direct fusion method aims to directly integrate structured and unstructured data to extract information or solve classification and prediction problems. Li et al. (2016) applied the extreme learning machine (ELM) to make stock price predictions based on the market news and stock prices concurrently and found that the accuracies of RBF ELM and RBF SVM are similar but higher than that of BP-NN; the prediction speeds of the two algorithms are also much faster than that of BP-NN. Wang et al. (2019) proposed a hybrid time-series predictive neural network to combine the daily K-line data with the news vectors and succeeded in stock volatility prediction. Ronaghi et al. (2022) predicted market index with COVID-19-related Twitter data and historical market data via a deep fusion framework consisting of two parallel paths, one based on CNN and another that integrates CNN with bi-directional LSTM (BLSTM). Lin et al. (2022) developed a spatial-temporal attention-based convolutional network, which successfully extracted text and numerical information for stock price prediction using the attention mechanism, CNN, and LSTM. However, the aforementioned studies were mostly based on the neural network framework. Considering that a neural network is prone to fall into a local minimum, we attempt to develop a new model based on a different framework to fuse the two data types.

Multi-view learning, proposed by de Sa (1994), is a machine learning algorithm that can directly input heterogeneous data for training and usually has an excellent performance. Unlike the method of constructing an index from text, it substitutes labels using different views. It minimizes the inconsistency between the model outputs from distinct views to minimize classification errors. Yarowsky (1995) and Blum and Mitchell (1998) indicated that multi-view learning outperformed single-view learning in classification. Blum and Mitchell (1998) improved the algorithm by co-training distinct views when studying web page classification. Collins and Singer (1999) measured the consistency between distinct views by constructing an objective function. By maximizing the objective function, Dasgupta et al. (2001) presented an upper limit for the generalization error of multiple views. Multi-view learning has been applied to a variety of learning methods, such as dimensionality reduction (Sun et al. 2010) and classification methods (Han et al. 2022). Many scholars have noticed its usefulness and begun combining it with other traditional algorithms to obtain excellent performance. Xiao et al. (2022) utilized multi-view learning to handle data uncertainty, thereby improving the Ordinal Regression (OR) classifier. Lv et al. (2021) developed a prediction model with market data by integrating multi-view learning with the classic RBF network, which showed excellent performance in forecasting stock prices. However, it only uses market data and excludes financial news information, which has been proven to be predictive by many studies.

The intuition for building this hybrid model is as follows. On the one hand, SVM is a classic machine learning classification algorithm and is often used to predict stock prices with financial news (Schumaker and Chen 2009; Long et al. 2019). Many studies have shown that SVM performs better in financial forecasting when compared with some neural network frameworks (Kim 2003; Cao and Tay 2001, 2003; Li et al. 2016; Meesad and Thanh 2014). On the other hand, multi-view learning can learn common feature spaces or shared patterns by combining multiple data sources (Yan et al. 2021). Therefore, these two algorithms can be combined to use multi-view heterogeneous data to predict stock prices. Some scholars have theoretically proved the effectiveness of the multi-view model over the single-view model (Sun et al. 2022), and literature shows that this hybrid model has achieved excellent performance in fusing data with different structures for classification (Zhang et al. 2010; Xu et al. 2015; Ceci et al. 2015; Wang and Zhou 2021).

However, this model has not been widely applied in the financial field, and the limited related research can be divided into three categories. First, most studies based on this method considered only single-view data. Shynkevich et al. (2015a, b) used the model to predict the stock price based on financial news and found that fusing different news categories could improve the prediction performance. However, they only considered the impact of media on stock prices and did not consider the impact of market data. Second, although some studies simultaneously included numerical and textual data with a multi-view learning framework, they transformed the text into structured indexes to fuse text with numerical data (Deng et al. 2011a, b, 2014). As stated earlier, this approach increases information loss. Third, a few studies applied a multi-view learning framework to merge historical stock prices with financial news vectors for stock price prediction (Li et al. 2011; Wang et al. 2012); however, they neither took into account the lag effect of news and market data on stock prices nor built trading strategies; hence, the model’s actual application performance in the financial market could not be judged.

Methods

Chinese news text processing

Considering that news is unstructured and involves a lot of noise or redundant information, we must eliminate noise and extract representative features containing the most useful information for accurate prediction. Therefore, this section introduces the method of transforming news text into a structured feature vector for training through news preprocessing, data cleaning, text representation, and feature extraction.

News preprocessing

Because trading in the Chinese stock market ends at 15:00, news released after 15:00 on trading \(day_t\) can be assumed not to affect the fluctuation of the stock price on \(day_t\); similarly, news released on weekends, holidays, and other closed days has no effect on prices until the market reopens. Therefore, news released on a closed day or after 15:00 on a trading day is included in the news of the next trading day. We then sorted the news by the reorganized date for processing.
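The date reassignment rule above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `effective_trading_day` and the `trading_days` calendar are hypothetical, and news stamped exactly 15:00 is assumed to belong to the same day.

```python
from datetime import date, datetime, time, timedelta

MARKET_CLOSE = time(15, 0)  # closing time stated in the text


def effective_trading_day(published, trading_days):
    """Return the trading day whose news set should contain this item.

    `trading_days` is a hypothetical set of dates on which the market is open.
    """
    day = published.date()
    # News released strictly after the close is moved to the following day.
    if published.time() > MARKET_CLOSE:
        day += timedelta(days=1)
    # Roll forward over weekends, holidays, and other closed days.
    last_known = max(trading_days)
    while day not in trading_days:
        if day > last_known:
            raise ValueError("no trading day on or after the release date")
        day += timedelta(days=1)
    return day
```

Sorting the news by the returned date then gives the reorganized order described in the text.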

Data cleaning

In data cleaning, we first removed the punctuation and garbled characters in the news, and then used the jieba package of Python to perform Chinese word segmentation. Finally, we filtered out unimportant words using the Baidu Stop Word List. This step removes stop words and leaves representative words such as nouns, verbs, and adjectives.
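A rough sketch of this cleaning pipeline is given below. To keep it self-contained, the segmenter is passed in as a callable: the paper uses jieba, whereas the usage example substitutes a plain whitespace splitter. The function name `clean_tokens` and the tiny stop-word set are illustrative assumptions, and the handling of garbled characters is not covered.

```python
import re


def clean_tokens(text, segment, stop_words):
    """Strip punctuation, segment the text, and drop stop words.

    `segment` is any callable mapping a string to a list of tokens
    (a stand-in for the word segmenter used in the paper).
    """
    # Replace everything that is neither a word character nor whitespace
    # with a space; \w matches CJK characters, so Chinese and Latin
    # punctuation are both removed.
    stripped = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in segment(stripped)
            if tok.strip() and tok not in stop_words]


# Usage with a whitespace splitter standing in for a real segmenter.
tokens = clean_tokens("股价 上涨 了 ，公司 的 利润 增长！", str.split, {"了", "的"})
```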

Text representation

For a news article, the value of each word was calculated according to its classification importance. Words that were more important for classification were assigned higher weights. Thus, each article can be represented as a vector of word values. The bag-of-words model is a commonly used text representation method that represents the text as a bag of words, regardless of word order and grammar, while maintaining multiplicity. Based on the bag-of-words model, Salton et al. (1975) proposed a vector space model commonly used in text classification. Because news contains many new concepts and words, it is appropriate to assume the independence of each word in this model. Each news item is then represented as a vector composed of the weight of each word, and the weight is determined by the word’s importance in the news. According to Salton and Buckley (1988), the importance of words can be determined by TF-IDF, which supposes that words that rarely appear in the entire document but frequently appear in a text are of greater importance for classification. However, in practice, text length affects the weights obtained from this method. To better quantify the characteristic words, the influence of the length should be reduced. Therefore, we used the ltc method (Buckley et al. 1995) in this study, which combines length normalization (“l”), term frequency (“t”) and collection frequency (“c”) to calculate the weights of words. By normalizing the weight of words, the influence of article length can be avoided, and the importance of word frequency is weakened to a certain extent. This represents news articles in the following form:

$$news_{t} = \left( {w_{{t,1}} ,w_{{t,2}} , \ldots ,w_{{t,M}} } \right),$$
(1)

where \(news_t\) represents the news vector on \(day_t\) and \(w_{t, m}\) is the weight of \(word_m\) in \(news_t\), which is expressed as

$$w_{t,m} = \frac{\left( \log (f_{t,m}) + 1.0 \right) * \log \frac{1}{F_{m}}}{\sqrt{\sum \limits_{j=1}^{M} \left[ \left( \log (f_{t,j}) + 1.0 \right) * \log \frac{1}{F_{j}} \right]^{2}}},\;m = 1,2, \ldots ,M,$$
(2)

where \(f_{t,m}\) represents the occurrence frequency of \(word_m\) in \(news_t\) (term frequency), \(F_m\) is the occurrence frequency of \(word_m\) in the news corpus (collection frequency). All symbols used in the equations are listed in the Appendix (see Table 22).
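As a hedged illustration of the ltc weighting, the sketch below computes the weights for one day's news. It assumes that the denominator is the Euclidean norm of the unnormalized weights (cosine length normalization, as in the standard ltc scheme) and that \(F_m\) is expressed as a fraction in (0, 1], so that \(\log (1/F_m)\) is non-negative. The function name `ltc_weights` is hypothetical.

```python
import math


def ltc_weights(term_counts, collection_freq):
    """Length-normalized weights for one day's news.

    term_counts: word -> count f_{t,m} in the day's news (counts >= 1).
    collection_freq: word -> F_m, assumed to be a fraction in (0, 1].
    """
    raw = {
        word: (math.log(count) + 1.0) * math.log(1.0 / collection_freq[word])
        for word, count in term_counts.items()
    }
    # Euclidean norm of the raw weights (assumed form of the denominator).
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0.0:
        return {word: 0.0 for word in raw}
    return {word: value / norm for word, value in raw.items()}


weights = ltc_weights({"profit": 3, "market": 1}, {"profit": 0.1, "market": 0.5})
```

Under this normalization the squared weights of a news vector sum to one, so long and short articles become comparable.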

Feature extraction

Because the news corpus involves many words, but only a small portion of words is contained in each news item, we use \(\chi ^2\) statistics (Yang and Pedersen 1997) to extract features of the text. Instead of using all the words in the news corpus, it selects words that contribute more to text classification to make computation easier and prevent overfitting. For word w and category c, we define A as the number of times w and c co-occur, B as the number of times w occurs without c, C as the number of times c occurs without w, D as the number of times neither c nor w occurs, and n is the sample size.

$$\begin{aligned} \begin{aligned} \chi ^2(w,c)=\frac{n\times (AD-BC)^2}{(A+B)\times (A+C)\times (B+D)\times (C+D)}. \\ \end{aligned} \end{aligned}$$
(3)

Then, for a word w, the \(\chi ^2\) score is obtained by combining the scores of each category:

$$\begin{aligned} \begin{aligned} \chi ^2(w)=P(-1)\times \chi ^2(w,-1)+P(1)\times \chi ^2(w,1), \\ \end{aligned} \end{aligned}$$
(4)

where P(c) represents the frequency of the category \(c \in \{-1,1\}\) in the news corpus. Words with higher \(\chi ^2\) scores are considered more informative for prediction, so we rank words by their \(\chi ^2\) scores and keep the top-ranked ones as the dimensions of the news vector. The number of words to keep was chosen by computing the prediction accuracy for different dimensions separately and selecting the dimension with the highest accuracy.
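Equations (3) and (4) can be sketched as below, treating each news item as a set of words with a label in \(\{-1, 1\}\). The function names and the toy data are illustrative assumptions; ties between equal scores are broken alphabetically, which the paper does not specify.

```python
def chi_square(a, b, c, d):
    """Eq. (3) from the counts A, B, C, D, with n = A + B + C + D."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # word or category never (or always) occurs
    return n * (a * d - b * c) ** 2 / denom


def word_score(docs, labels, word):
    """Eq. (4): category-frequency-weighted sum of per-category scores."""
    n = len(docs)
    score = 0.0
    for category in (-1, 1):
        pairs = list(zip(docs, labels))
        a = sum(1 for doc, y in pairs if word in doc and y == category)
        b = sum(1 for doc, y in pairs if word in doc and y != category)
        c = sum(1 for doc, y in pairs if word not in doc and y == category)
        d = n - a - b - c
        score += labels.count(category) / n * chi_square(a, b, c, d)
    return score


def select_features(docs, labels, k):
    """Return the k words with the highest chi-square scores."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda w: (-word_score(docs, labels, w), w))
    return ranked[:k]


docs = [{"gain", "profit"}, {"gain", "rise"}, {"loss", "drop"}, {"loss", "profit"}]
labels = [1, 1, -1, -1]
top_words = select_features(docs, labels, 2)
```

In this toy corpus, "gain" and "loss" separate the two classes perfectly, while "profit" appears equally often in both and scores zero.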

MVL-SVM algorithm

The proposed MVL-SVM algorithm combines a support vector machine and multi-view learning, which can apply multi-view learning for multi-view data fusion and then use a support vector machine for classification or prediction. These two components are discussed further in this section. Moreover, SVM will also serve as a benchmark to test the effectiveness of our hybrid model.

Support vector machine (SVM)

SVM (Cortes and Vapnik 1995; Deng et al. 2012) has been widely applied to solve classification problems owing to its performance. It can learn from a set of two-class training instances and divide new instances into one of the classes to solve classification problems.

We denote \((x_1, y_1),(x_2, y_2),\ldots , (x_n, y_n)\) as a two-class training dataset, where \(x_i\) for \(i=1,\dots ,n\) represents a p-dimensional real vector, and \(y_i \in \{-1,1\}\) indicates the class to which \(x_i\) belongs. According to the classification method, SVM can be divided into linear and nonlinear SVM. The main idea of linear SVM is to find a “maximum margin hyperplane,” defined as \(g(x)=\omega ^{T}x+b\), so that the two classes of samples can be accurately classified by this hyperplane and the sum of distances between the hyperplane and the closest point of each class is maximized. Mathematically, the classification problem is equivalent to solving the minimization problem as follows:

$$\begin{aligned} \begin{aligned} min \frac{1}{2}\omega ^{T}\omega&+C\sum _{i=1}^{n}{\xi _i},\\ s.t.\quad y_i(\omega ^{T}x_i+b)&+\xi _i \ge 1, \forall 1 \le i \le n,\\ \xi _i \ge 0&,\forall 1 \le i \le n. \end{aligned} \end{aligned}$$
(5)

where n refers to the sample size, \(\xi _i\) is a slack variable, and C is a penalty term that controls the cost of misclassifying samples. The larger C is, the less tolerant the model is of classification errors, which makes it prone to overfitting. Conversely, a smaller C tolerates more errors, which makes the model prone to underfitting. A 2-dimensional example is shown in Fig. 2 to demonstrate the workings of the linear SVM. Here, samples of different colors come from different classes. The red line represents the maximum margin hyperplane obtained by training on the samples.

Fig. 2 A 2-dimensional classification instance using SVM

However, in practice, not all the samples are linearly separable. Therefore, a nonlinear SVM can be introduced to solve this problem. A nonlinear SVM can implicitly map samples into a high-dimensional space with \(\phi (x)\) to find a maximum margin hyperplane in this high-dimensional space. The optimization problem is as follows:

$$\begin{aligned} \begin{aligned} min \frac{1}{2}\omega ^{T}\omega&+C\sum _{i=1}^{n}{\xi _i},\\ s.t.\quad y_i(\omega ^{T}\phi (x_i)+b)&+\xi _i \ge 1, \forall 1 \le i \le n,\\ \xi _i \ge 0&,\forall 1 \le i \le n. \end{aligned} \end{aligned}$$
(6)

With Lagrange duality, the original problem can be transformed into the following dual problem.

$$\begin{aligned} \begin{aligned} max\quad W(\alpha )&=\sum _{i=1}^{n}{\alpha _{i}}-\frac{1}{2}\sum _{i=1}^{n}{\sum _{j=1}^{n}{\alpha _{i} \alpha _{j}y_{i}y_{j}(\phi (x_{i})\cdot \phi (x_{j}))}}\\&=\sum _{i=1}^{n}{\alpha _{i}}-\frac{1}{2}\sum _{i=1}^{n}{\sum _{j=1}^{n}{\alpha _{i}\alpha _{j}y_{i}y_{j}k(x_{i},x_{j})}},\\ s.t.&\quad \sum _{i=1}^{n}y_{i}\alpha _{i}=0,\\&\quad 0 \le \alpha _i \le C, \forall i=1,\ldots ,n. \end{aligned} \end{aligned}$$
(7)

where \(\alpha _{i}\) is a Lagrangian multiplier corresponding to sample \(x_i\) and \(k(x_{i},x_{j})=\phi (x_{i})\cdot \phi (x_{j})\) is a kernel function that is a symmetric positive definite function that satisfies Mercer’s conditions. By solving the above optimization problem, we can obtain the solutions \(\alpha _{i}^{*}\) and \(b^{*}\); the decision function is obtained as

$$f(x) = sgn\left\{ {\sum\limits_{{i = 1}}^{n} {y_{i} } \alpha _{i}^{*} k(x_{i} ,x) + b^{*} } \right\}.$$
(8)

Kernel function \(k(x_{i},x_{j})\) determines the performance of the model. The linear kernel function is often used to solve linear classification problems, and the Gaussian kernel function is used to solve nonlinear classification problems.

1) Linear kernel function

$$\begin{aligned} \begin{aligned} k_{lin}(x_{i},x_{j})=x_{i}^Tx_{j}. \end{aligned} \end{aligned}$$
(9)

2) Gaussian kernel function

$$\begin{aligned} \begin{aligned} k_{Gau}(x_{i},x_{j})=exp\left(-\gamma ||x_{i}-x_{j}||^{2}\right). \end{aligned} \end{aligned}$$
(10)

where \(\gamma\) is a Gaussian kernel parameter, which is important in determining kernel performance. When \(\gamma\) is small, the model is prone to underfitting, whereas when \(\gamma\) is large, the model is prone to overfitting.
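To make the two kernels and the kernelized decision function concrete, the following is a small pure-Python sketch. It does not train an SVM: the multipliers `alphas` and the offset `bias` are hypothetical values standing in for the solutions \(\alpha _i^{*}\) and \(b^{*}\), and the kernel is evaluated between each training sample \(x_i\) and the new point x.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: inner product of the two vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma):
    """Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision(x, train_x, train_y, alphas, bias, kernel):
    """Sign of the kernel expansion over the training samples."""
    total = sum(a * y * kernel(xi, x)
                for xi, y, a in zip(train_x, train_y, alphas))
    return 1 if total + bias >= 0 else -1


# Hypothetical multipliers for two training points, not a fitted model.
train_x = [[1.0, 1.0], [-1.0, -1.0]]
train_y = [1, -1]
alphas = [0.5, 0.5]
label = decision([2.0, 2.0], train_x, train_y, alphas, 0.0, linear_kernel)
```

A larger \(\gamma\) makes the Gaussian kernel decay faster with distance, so each training point influences a smaller neighborhood, which is consistent with the overfitting tendency noted above.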

Multi-view learning

Generally, single-view data can be easily used in machine learning methods for classification, whereas using multi-view data in these methods is difficult. Multi-view learning algorithms were developed to solve this problem.

Multi-view learning designs a function for each perspective. All functions are optimized by maximizing the consistency between redundant views, and the model’s performance is improved. Owing to its outstanding performance in multi-view data applications, multi-view learning has gradually attracted increasing attention. There are three types of existing algorithms.

1) Co-training: Maximizing mutual agreement on different views of unlabeled data through alternate learning.

2) Multiple kernel learning: Linearly or non-linearly combining kernels for each view to improve training efficiency.

3) Subspace learning: Acquiring an appropriate subspace under the assumption that multiple views are generated from this appropriate subspace.

This study uses a multiple kernel learning framework to build a stock price prediction model. By selecting the appropriate kernels and kernel combination for training, each data source can be trained with the corresponding optimal kernel function; therefore, the model can perform better than the single-kernel model (Xu et al. 2013). As illustrated in Fig. 3, distinct kernels are selected for distinct views, and multiple pieces of information can be fused by combining distinct kernels. There are many combination methods that can be grouped into two categories: linear and nonlinear combinations.

Fig. 3 Sketch map of multiple kernel learning

However, no empirical results show that a nonlinear combination can improve the model’s performance, which raises the question of whether the nonlinear combination method is necessary and efficient. Therefore, we only used linear combination methods in this study. There are two basic categories.

1) Direct summation

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m=1}^{M}{k_{m}(x_{i},x_{j})}, \end{aligned} \end{aligned}$$
(11)

where \(k_{m}(x_{i},x_{j})\) denotes the m-th kernel.

2) Weighted summation

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m=1}^{M}{\beta _{m}k_{m}(x_{i},x_{j})}. \end{aligned} \end{aligned}$$
(12)

Here, \(\beta _{m}\) represents the weight of kernel \(k_{m}(x_{i},x_{j})\).

As different types of information have different importance for prediction/classification, using the direct summation method, which assigns equal priority to each kernel, is not ideal. Instead, we choose the weighted summation kernel in this study, and the kernel function can be written as

$$\begin{aligned} \begin{aligned} K(x_{i},x_{j})=\sum _{m}{\beta _{m}k_{m}(x_{i},x_{j})},\beta _{m}\ge 0,\sum _{m}{\beta _{m}}=1. \end{aligned} \end{aligned}$$
(13)

The weight \(\beta _{m}\) of the kernel \(k_{m}(x_{i},x_{j})\) can be determined using kernel learning. By applying SVM with the above kernel function, we can obtain the decision function of MVL-SVM, as shown in Equation (14).

$$\begin{aligned} \begin{aligned} f(x)=sgn\left\{ \sum _{i=1}^{n}\alpha _{i}^{*}y_{i}\sum _{m}\beta _{m}k_{m}(x_{i},x)+b^{*}\right\} . \end{aligned} \end{aligned}$$
(14)
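
The weighted kernel of Equation (13) and the resulting classifier can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy data, the choice of an RBF kernel for the market view and a linear kernel for the news view, and the hand-fixed weights `beta` are all assumptions, whereas the paper learns the weights through kernel learning.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC


def combined_kernel(views_a, views_b, kernels, beta):
    """Weighted sum of per-view kernel matrices, as in Eq. (13)."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0) or not np.isclose(beta.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(b * k(xa, xb)
               for b, k, xa, xb in zip(beta, kernels, views_a, views_b))


rng = np.random.default_rng(0)
n = 80
y = np.where(rng.random(n) > 0.5, 1, -1)
# Hypothetical toy views: 4 "market" columns and 30 "news" columns.
market = rng.normal(size=(n, 4)) + 0.8 * y[:, None]
news = rng.normal(size=(n, 30)) + 0.3 * y[:, None]

kernels = [lambda a, b: rbf_kernel(a, b, gamma=0.5), linear_kernel]
beta = [0.4, 0.6]  # fixed by hand here (assumption); not learned

train, test = np.arange(60), np.arange(60, n)
train_views = [market[train], news[train]]
test_views = [market[test], news[test]]
K_train = combined_kernel(train_views, train_views, kernels, beta)
K_test = combined_kernel(test_views, train_views, kernels, beta)

# An SVM on the precomputed combined kernel; predict() returns the sign
# of the decision function, which has the form of Eq. (14).
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y[train])
pred = clf.predict(K_test)
```

Passing a precomputed matrix keeps the kernel combination separate from the SVM solver, so any scheme for choosing the weights can be swapped in without changing the classifier.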

ARIMA model

The ARIMA model (Box et al. 2015) is a commonly used time-series model that takes historical data sequences as input for prediction; therefore, we use it to design one of the benchmarks based on single-view data. It contains three terms: the autoregression term, the integrated term, and the moving average term. A nonstationary data series can be converted into a stationary one by differencing. The first-order differencing of a data series \(z_t\) is expressed as

$$\begin{aligned} \begin{aligned} o_{t}=z_t-z_{t-1}. \end{aligned} \end{aligned}$$
(15)

The stationarity of the time series was tested using the augmented Dickey-Fuller (ADF) method. After converting the data series into a stationary one through d-order differencing, we use the stationary series to construct a model that combines the autoregression model and the moving average model, and we obtain the future value by d-order integration. The autoregression model captures the impact of historical time-series values on the current value by performing linear regression. Because time series are usually affected by random disturbances in noisy environments, the moving average method is further introduced to capture the influence of random disturbances on future values. Then, the ARIMA(p, d, q) model with three parameters, namely the autoregression order p, the differencing order d, and the moving average order q, can be expressed as

$$\begin{aligned} \begin{aligned} o_{t}=\sum _{i=1}^{p}\phi _{i}o_{t-i}+\sum _{j=1}^{q}\theta _{j}\epsilon _{t-j}+\epsilon _{t}, \end{aligned} \end{aligned}$$
(16)

where \(\phi _{i}\) is the ith autoregression parameter, \(\theta _{j}\) is the jth moving average parameter, and \(\epsilon _{t}\) is the error term at time t. In practice, the autoregression order p and moving average order q can be determined using partial autocorrelation and autocorrelation diagrams, respectively.
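
The differencing step of Equation (15) and a one-step forecast from an autoregression term and a moving average term can be illustrated as follows. This is a sketch under stated assumptions: the price levels, residuals, and coefficients are made-up numbers, and the coefficients are taken as given rather than estimated from data.

```python
import numpy as np


def difference(z):
    """First-order differencing: o_t = z_t - z_{t-1}."""
    z = np.asarray(z, dtype=float)
    return z[1:] - z[:-1]


def arma_one_step(o, errors, phi, theta):
    """One-step forecast from the last p values and the last q errors.

    phi[i-1] multiplies the value i steps back; theta[j-1] multiplies
    the error j steps back. The unknown current shock is replaced by
    its expectation, zero.
    """
    ar = sum(phi[i] * o[-1 - i] for i in range(len(phi)))
    ma = sum(theta[j] * errors[-1 - j] for j in range(len(theta)))
    return ar + ma


def integrate(last_level, o_forecast):
    """Undo first-order differencing for a single forecast."""
    return last_level + o_forecast


z = [10.0, 10.5, 10.2, 10.9, 11.4]   # hypothetical series levels
o = difference(z)                    # differenced series
errors = [0.1, -0.2, 0.05, 0.1]      # hypothetical past residuals
o_hat = arma_one_step(o, errors, phi=[0.5, 0.2], theta=[0.4])
z_hat = integrate(z[-1], o_hat)
```

With p = 2 and q = 1, the forecast combines the two most recent differenced values with the most recent residual, and integration adds the forecast back to the last observed level.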

Experimental test

Section “Introduction” shows that financial news and market data are significant in predicting price trends. Because the MVL-SVM method can integrate multiple information sources for classification, we now apply it to predict whether prices will rise or fall based on financial news and market data. Subsequently, the results were compared with classic SVMs using single-view and mixed data.

Data sources

The Shanghai Stock Exchange 50 index (SSE 50 index) comprises the 50 most representative stocks of the Shanghai Stock Exchange and reflects the overall situation of the leading companies with the greatest influence in the Chinese stock market. As these enterprises have the most active news coverage and can thus provide sufficient news samples, we choose the constituent stocks of the SSE 50 index for empirical analysis. Due to the limitations of data sources, the period investigated in this study is from January 1, 2018 to December 31, 2020.

Table 1 Days of news releasing

Table 1 shows the total number of days with news releases for each stock. Because the prices of newly listed stocks fluctuate unstably, we exclude stocks listed after January 1, 2016. Furthermore, three stocks (600745.SH, 600690.SH, and 601888.SH) were excluded from the sample because they had too little news. Consequently, the research sample consists of 37 stocks.

Fig. 4 Data visualization for stock code 600276.SH

For the structured data, considering the selected market data should comprehensively reflect the stock information, such as price changes, transaction activity, market liquidity, scale, and so on, we choose four widely used variables, including stock daily return (r), trading volume (tv), turnover rate (tr) and total market cap (mc) from Wind database (https://www.wind.com.cn/), to predict the stock price. We denote \(md_1, md_2,\ldots , md_t,\ldots\) as the market data sequences, and \(md_t=(r_t, tv_t, tr_t, mc_t)\) represents the four market variables on \(day_t\). The daily returns used in this study are all log returns, and all data are daily.

For financial news, the news release time, summary, and text were collected from the Uqer database (https://uqer.datayes.com/). The Uqer database provides a news API that collects news from 223 news websites, including reports on companies and coverage of the macroeconomic environment. For each stock, we selected the news containing the stock’s name as its stock news; in total, we collected 496,014 pieces of news for the 37 stocks. To illustrate our data in detail, we randomly selected one stock, 600276.SH, to visualize the data (see Fig. 4). Table 2 presents the basic statistical characteristics of the market variables. From Fig. 4a and b, we find that the amount of news in 2020 is the largest. Each year, the amounts of news in January and February are relatively small owing to holidays. In addition, from Fig. 4c, we find that, except for the linear relationship between the turnover rate and trading volume, the other variables are essentially uncorrelated.

Table 2 Descriptive statistics of the involved market variables

Fig. 5 The impact of COVID-19 on SSE 50 index

Considering the spread of the coronavirus that began in 2019, we further investigate whether it affects our data and the market trend. The total amount of news on COVID-19 in our sample was 3,467. The COVID-19 outbreak occurred at the end of December 2019. However, the disease was not considered serious in the early stages; hence, it did not cause large stock market fluctuations. On January 20, 2020, a report pointed out that COVID-19 was a fast-spreading communicable disease transmitted from human to human, for which a cure had not been found; panic began to spread, and the stock market began to fall, as shown in Fig. 5. On January 23, 2020, the lockdown of Wuhan was announced, and the stock market fell sharply. Affected by the epidemic, the US stock market triggered circuit breakers four times in March 2020, on March 9, March 12, March 16, and March 18. This also affected A-share investors, leading to a sharp drop in the SSE 50 index. Choosing 2018 to 2020 as our research period can also help us explore whether the results of our model remain robust under special circumstances, such as epidemics.

The data must be normalized before being input into the model for training. Because the market data can take both positive and negative values, they are transformed by Equation (17) to satisfy the normalization requirements.

$$\begin{aligned} \begin{aligned} norm(md^{k}_t)=\frac{md^{k}_t}{max\{|md^k|\}}, k=1,\ldots ,4, t=1,\ldots ,n. \end{aligned} \end{aligned}$$
(17)

Here, \(md^{k}_t\) denotes the k-th market variable on \(day_t\) and \(max\{|md^k|\}\) refers to the maximum absolute value of the k-th market variable. The values range from −1 to 1 after normalization. There is no need to normalize the news data because they are already normalized through the ltc method, as shown in Eq. (2).
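
As a small illustration of Equation (17), the sketch below divides each market variable by its largest absolute value. The sample rows are made-up numbers, and the handling of an all-zero column is an assumption added only to avoid division by zero.

```python
import numpy as np


def max_abs_normalize(md):
    """Scale each column by its largest absolute value (Eq. 17)."""
    md = np.asarray(md, dtype=float)
    scale = np.max(np.abs(md), axis=0)
    scale[scale == 0] = 1.0  # assumption: leave all-zero columns unchanged
    return md / scale


# Hypothetical rows of (return, trading volume, turnover rate, market cap).
md = np.array([
    [0.012, 3.0e6, 1.5, 2.0e11],
    [-0.030, 4.5e6, 2.0, 1.9e11],
    [0.006, 1.5e6, 0.5, 2.1e11],
])
normed = max_abs_normalize(md)
```

Because the divisor is an absolute value, the sign of each return is preserved, which is why the normalized values fall in the range from −1 to 1 rather than from 0 to 1.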

In this study, the high-dimensional news vector obtained from the Chinese news text processing methods and the four market data were all input to the MVL-SVM algorithm. After training on the training set, the algorithm can choose the optimal kernel for each data source and the optimal kernel combination weights. Therefore, combining multiple kernels can successfully fuse the multi-view data, and the new kernel can be input into the SVM classifier for classification. The framework of this multi-view stock price prediction model is illustrated in Fig. 6.

Fig. 6 Multi-view learning framework of stock price prediction model

Experimental analysis and comparison

In this section, we consider the joint influence of structured market data and unstructured financial news and apply the MVL-SVM model to predict the stock price trend on the same day or the next day. Each sample is labeled according to the daily stock return \(r_{t+i}\), as shown in Equation (18).

$$\begin{aligned} \begin{aligned} tag_{t+i}=\left\{ \begin{aligned} 1, if \ r_{t+i}>0\\ -1, if \ r_{t+i}\le 0 \end{aligned} \right. \end{aligned} \end{aligned}$$
(18)

Here, \(r_{t+i}\) represents the daily return of the stock on \(day_{t+i}\), \(tag_{t+i}\) indicates that the daily return on \(day_{ t+i}\) is used to label the sample, and \(i=0\) means that the model aims to predict the stock price of the day, whereas \(i=1\) means to predict that of the next day. Because it is meaningless to use \(r_t\) to predict the price fluctuation of \(day_t\), the market data can only be used to predict the next day’s price movement, while the news released before the closing of the stock market on \(day_t\) can be used to predict the rise and fall of stock prices on both \(day_t\) and \(day_{t+1}\). Therefore, \(i=0,1\) is valid for financial news, and \(i=1\) is valid for market data.

Table 3 Confusion matrix

As shown in Table 3, we used a confusion matrix to show the classification results, and the accuracy is given by

$$\begin{aligned} \begin{aligned} Accuracy=\frac{TP+TN}{TP+TN+FP+FN}, \end{aligned} \end{aligned}$$
(19)

where TP (true positive) counts the cases in which the stock price rises and the model correctly predicts the rise, and TN (true negative) counts the cases in which the price falls and the fall is correctly predicted. FN (false negative) counts the cases in which the price actually rises but the model predicts a fall, and FP (false positive) counts the cases in which the price actually falls but the model predicts a rise. Equation (19) therefore gives the share of correctly predicted samples in the total sample.
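
The accuracy of Equation (19) can be computed directly from the four confusion-matrix counts. The labels below are made-up values used only to illustrate the calculation, with 1 for a rise and −1 for a fall.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels in {1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correctly predicted samples, as in Eq. (19)."""
    return (tp + tn) / (tp + tn + fp + fn)


y_true = [1, 1, -1, -1, 1, -1]   # hypothetical actual movements
y_pred = [1, -1, -1, 1, 1, -1]   # hypothetical predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
```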

Stock price prediction based on one-day news and four market data

Here, we build an MVL-SVM model to predict stock price movement using one-day market data and financial news. In this section, the prediction accuracy is compared with that of the classic SVM models to evaluate the predictive performance.

As explained in Section “Experimental analysis and comparison”, news released before 15:00 on \(day_t\) can be used to predict the stock returns of \(day_t\) and \(day_{t+1}\). However, market data on \(day_t\) can only be used to predict stock returns on \(day_{t+1}\). Therefore, we consider two experimental settings: lag=0 (predict the price on the day of the news release) and lag=1 (predict the price on the next day of the news release).

In the case of lag=0, the sign of \(r_t\) is predicted by \(md_{t-1}\) and \(news_t\), where \(md_{t-1}=(r_{t-1}, tv_{t-1}, tr_{t-1}, mc_{t-1})\) represents the market data of \(day_{t-1}\) and \(news_t=(w_{t,1}, w_{t,2},\ldots , w_{t,M})\) represents the news vector on \(day_t\), which is obtained from “Chinese news text processing”. The input matrices for the market data (MD) and the news (News) are given in Equation (20), where each row of MD and News is a sample vector; the corresponding labels are also shown in Equation (20).

$$\begin{aligned} \begin{aligned} MD=\begin{bmatrix} r_{1}&{} tv_{1}&{} tr_{1}&{} mc_{1}\\ r_{2}&{} tv_{2}&{} tr_{2}&{} mc_{2}\\ \cdot &{}\cdot &{}\cdot &{}\cdot \\ \cdot &{}\cdot &{}\cdot &{}\cdot \\ r_{n-2}&{} tv_{n-2}&{} tr_{n-2}&{} mc_{n-2}\\ r_{n-1}&{} tv_{n-1}&{} tr_{n-1}&{} mc_{n-1}\\ \end{bmatrix}, News=\begin{bmatrix} news_2\\ news_3\\ \cdots \\ \cdots \\ news_{n-1}\\ news_{n}\\ \end{bmatrix}, Label=\begin{bmatrix} tag_2\\ tag_3\\ \cdots \\ \cdots \\ tag_{n-1}\\ tag_n\\ \end{bmatrix} \end{aligned} \end{aligned}$$
(20)

When lag = 1, the sign of \(r_{t+1}\) is predicted by \(md_t\) and \(news_t\). The input matrices MD and News and output vector Label are shown in Equation (21).

$$\begin{aligned} \begin{aligned} MD=\begin{bmatrix} r_{1}&{} tv_{1}&{} tr_{1}&{} mc_{1}\\ r_{2}&{} tv_{2}&{} tr_{2}&{} mc_{2}\\ \cdot &{}\cdot &{}\cdot &{}\cdot \\ \cdot &{}\cdot &{}\cdot &{}\cdot \\ r_{n-2}&{} tv_{n-2}&{} tr_{n-2}&{} mc_{n-2}\\ r_{n-1}&{} tv_{n-1}&{} tr_{n-1}&{} mc_{n-1}\\ \end{bmatrix}, News=\begin{bmatrix} news_1\\ news_2\\ \cdots \\ \cdots \\ news_{n-2}\\ news_{n-1}\\ \end{bmatrix}, Label=\begin{bmatrix} tag_2\\ tag_3\\ \cdots \\ \cdots \\ tag_{n-1}\\ tag_n\\ \end{bmatrix} \end{aligned} \end{aligned}$$
(21)
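
The row alignment in Equations (20) and (21), together with the labeling rule of Equation (18), can be sketched as below. The arrays are placeholders: each news row holds only its day number so that the shift between the two lag settings is easy to see.

```python
import numpy as np


def make_labels(returns):
    """Eq. (18): +1 for a strictly positive return, -1 otherwise."""
    return np.where(np.asarray(returns) > 0, 1, -1)


def align_views(md, news, returns, lag):
    """Pair market rows, news rows, and labels as in Eqs. (20) and (21).

    Row 0 is day 1. The market rows are always days 1..n-1 and the
    labels are always days 2..n; only the news rows shift with the lag.
    """
    if lag not in (0, 1):
        raise ValueError("lag must be 0 or 1")
    tags = make_labels(returns)
    md_in = md[:-1]
    labels = tags[1:]
    news_in = news[1:] if lag == 0 else news[:-1]
    return md_in, news_in, labels


md = np.arange(20).reshape(5, 4)               # placeholder market rows
news = np.arange(1, 6).reshape(5, 1)           # row t holds day number t
returns = [0.01, -0.02, 0.0, 0.03, -0.01]      # hypothetical returns

md0, news0, lab0 = align_views(md, news, returns, lag=0)
md1, news1, lab1 = align_views(md, news, returns, lag=1)
```

With lag=0 the news of the labeled day itself is paired with the previous day's market data, while with lag=1 both views come from the day before the labeled day.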

For the SVM model, three cases are considered: (i) inputting only the market data (SVMMD), (ii) inputting only financial news (SVMFN), and (iii) inputting concatenated multi-view data of market data and financial news (SVMMV). Considering our sample size, we adopted three training/testing proportions: 60%/40%, 75%/25%, and 90%/10%. Because the model parameters considerably affect the training performance, five-fold cross-validation and the grid search method are applied to the training set to select the optimal parameters. For the penalty term C, the candidate values are \(C=10^a\), where \(a \in \{-3,-2,-1,0,1,2,3\}\), and for the Gaussian kernel parameter \(\gamma\), the candidate values are \(\gamma =2^b\), where \(b \in \{-4,-3,-2,-1,0,1,2,3,4\}\). Therefore, for a nonlinear SVM, there are 63 combinations of parameters. A grid search builds a grid in which each node refers to a parameter combination and traverses all the grid nodes to determine the optimal parameter combination of the model. We used five-fold cross-validation for each parameter combination to observe the model’s performance. That is, the training set was divided into five parts, and each part was used as the validation set once, while the other four parts were used for training the model. The average prediction accuracy of the five experiments was taken as the model performance under this parameter combination. Among all parameter combinations, the one with the best performance was chosen to fit the final model for independent testing. We used Python and the Sklearn package in this study to implement the model. The prediction accuracies for the 37 stocks are shown in Table 4.
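
The parameter search described above can be sketched with scikit-learn. This is an illustration on synthetic data rather than the authors' script: the toy features, the chronological 90%/10% split, and the use of `GridSearchCV` with its default (stratified, unshuffled) five-fold splitter are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values: C = 10^a for a in -3..3, gamma = 2^b for b in -4..4.
param_grid = {
    "C": [10.0 ** a for a in range(-3, 4)],
    "gamma": [2.0 ** b for b in range(-4, 5)],
}

rng = np.random.default_rng(0)
y = np.where(rng.random(100) > 0.5, 1, -1)
X = rng.normal(size=(100, 4)) + 0.7 * y[:, None]  # hypothetical features

# Assumption: rows are ordered in time, so the last 10% is the test set.
split = int(0.9 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Five-fold cross-validation over every node of the grid; the best
# combination is refit on the whole training set by default.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5,
                      scoring="accuracy")
search.fit(X_train, y_train)

n_combinations = len(search.cv_results_["params"])
test_accuracy = search.score(X_test, y_test)
```

Here `n_combinations` equals the 7 × 9 = 63 grid nodes, and `search.best_params_` holds the selected pair of C and gamma.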

Table 4 indicates that the MVL-SVM model always had the highest accuracy, regardless of the training set proportion. Surprisingly, the prediction accuracy of the MVL-SVM model was approximately 30% higher than that of the SVM model when both types of data were input. In particular, when we predict the price trend one day after the news release with a 90% training proportion, the average prediction accuracies of the SVM models based only on market data and only on financial news are 50.77% and 61.70%, respectively, whereas the MVL-SVM model reaches nearly 88% accuracy. This shows that the MVL-SVM model significantly outperformed the baseline models.

Table 4 The predicting accuracy of four models with different lags

For the three SVM models in Table 4, we find, on the one hand, that the prediction accuracy improves when news data are added to the market data, indicating that news information contributes to price prediction. On the other hand, when the training set proportion was 60% or 75%, the accuracy of the SVMMV model was close to that of the SVMFN model, but when the proportion was 90%, the SVM model with multi-view data performed worse than the SVMFN model. In contrast, the MVL-SVM model showed an obvious improvement in prediction accuracy. This shows that heterogeneous data, each of which can be used alone to predict the stock price, do not yield ideal results if they are simply concatenated and input into the SVM model, because of the differences in their data characteristics, such as dimensions. This further demonstrates the advantages of the proposed model. Because the MVL-SVM model can learn and minimize the inconsistency between distinct views, it can effectively combine data with different structures to make reasonable predictions of stock price trends. The experiment also shows that the prediction performance of MVL-SVM is stable: its average prediction accuracy changes by only 2.37% when the training proportion is changed, whereas that of the SVM model based only on financial news changes by 8.79%. Further, under the 60%/40% training/testing proportion, the training period runs from January 2, 2018, to October 23, 2019, before COVID-19 had emerged, and the testing period runs from October 24, 2019, to December 31, 2020, when COVID-19 broke out and affected the stock market. The models with the other two training/testing proportions were trained on data that include the COVID-19 period. However, Table 4 shows that even when our model was trained without COVID-19-related data, the average prediction accuracy was still approximately 85%. For lag = 0 and lag = 1, the differences between the average prediction accuracy of the models with the 75%/25% training/testing proportion and those with the 60%/40% proportion are only 1.68% and 0.84%, respectively, indicating that the spread of the coronavirus has little impact on the prediction performance of our model.

Here, we have further simplified the follow-up research. Because roughly concatenating heterogeneous data will weaken the training efficiency of the model, considering the SVMMV model is not necessary. Therefore, we only considered the prediction performance of the SVM models with single-view data and the MVL-SVM model. Additionally, as shown in Table 4, using one training/testing proportion was sufficient to illustrate the effectiveness of the different models. It appears that there is little difference between the prediction accuracies with lag = 0 and lag = 1. Considering that the result only based on market data with a 0-day lag is unavailable and for the convenience of explanation, it is better to choose the same lag period for multi-view data in the MVL-SVM model. Accordingly, in this section, we will only provide the results with a training proportion of 90% and consider the case with a 1-day lag.

Stock price prediction based on one-day news and daily return

We use four market variables to predict the stock price movement in the above study. Still, literature shows that many studies forecast stock price trends based only on historical stock daily returns (Jarrett and Schilling 2008; Sun 2017; Vo and Ślepaczuk 2022), showing that researchers pay more attention to the daily returns among the four market variables. Therefore, we changed the input of the four market data sequences to only daily return sequences to discuss the predictive ability of the MVL-SVM model in this case. Considering that ARIMA is a commonly used time-series model that can input historical return sequences for prediction, we use this model to build a benchmark based on single-view data in our study. The ARIMA model was implemented using the R language.

Table 5 The results of ADF test

According to the ADF test (see Table 5), the p-values for all stocks are significant, so the series of daily returns are stationary. The orders of the ARIMA model can be determined using autocorrelation and partial autocorrelation diagrams. The results are presented in Table 6, and the average prediction accuracy for the 37 stocks is 49.61%.

Table 6 The results of ARIMA models

Before evaluating the MVL-SVM model, we first observe whether the performance of SVM model changes if daily returns replace the market data. The experimental settings are as follows. We consider the case of lag = 1, that is, we use \(md_t\) and \(news_t\) to predict \(r_{t+1}\). Instead of using all four variables, we input \(r_t\) as \(md_t\) to conduct the experiment. The results of the SVM model based on daily returns (SVMDR) are presented in Table 7. The performance was compared with that of the SVM model based on four-market data (SVMMD).

Table 7 The predicting accuracy of SVM models based on one-day news and market data or daily return

As shown in Table 7, there are 22 of 37 stocks (59.46%) whose prediction accuracy of the SVMDR model is higher than that of the SVMMD model. The average prediction accuracies of SVMMD and SVMDR were 50.77% and 51.72%, respectively. This indicates that the SVMDR model is better than the SVMMD and ARIMA models and implies that historical data of daily stock returns contain most of the price fluctuation information among the market data.

The prediction accuracy of MVL-SVM is given in Table 8. For simplicity, we name the MVL-SVM model based on news and market data MVL-SVMMD and the MVL-SVM model based on news and the daily return MVL-SVMDR.

The average prediction accuracies of MVL-SVMMD and MVL-SVMDR are 87.41% and 87.89%, respectively; MVL-SVMDR is slightly higher, and it outperforms MVL-SVMMD for nearly half of the sample. Our experiments imply that after excluding the other three market variables (total market cap, turnover rate, and trading volume), the predictive information in the market data does not necessarily decrease and may even become more effective because the data are more refined. This also supports the practice, common in the literature, of using only stock returns for prediction.

Table 8 The predicting accuracy of MVL-SVM models based on one-day news and market data or daily return

Statistical analysis

The above analysis is based on a numerical comparison. Next, we further evaluated the model from the perspective of statistical analysis.

Fig. 7 Comparison of pairwise algorithms with the Nemenyi test

The nonparametric test (Demšar 2006) is a useful approach for classifier comparison over multiple datasets. Here, we employ two non-parametric tests, the Nemenyi test (Demšar 2006) and contrast estimation based on medians (García et al. 2010), to compare the relative performances of the pairwise algorithms. Nemenyi test can determine whether one algorithm yields competitive performance compared to the other methods. The algorithms were ranked according to their performance on multiple datasets, and the average rank of each algorithm was calculated. The performance of each pairwise model whose average rank differs by at least one critical difference (CD) is considered significantly different, and the critical difference can be calculated using the following formula:

$$\begin{aligned} \begin{aligned} CD=q_{\alpha }\sqrt{\frac{K(K+1)}{6N_{stock}}}, \end{aligned} \end{aligned}$$
(22)

where \(q_{\alpha }\) is the critical value of the Nemenyi test, K represents the number of algorithms involved, and \(N_{stock}\) is the number of stocks. In this section, we compare the performance of all the models involved, including ARIMA, SVMDR, SVMMD, SVMFN, SVMMV, MVL-SVMDR and MVL-SVMMD. Therefore, the K value in our test is seven, and we have \(q_{\alpha }=2.949\) at a significance level \(\alpha =0.05\). The CD diagram for prediction accuracy is shown in Fig. 7. In the figure, if the average ranks of two models are within one CD, the two models are linked in the CD diagram, whereas the performances of unlinked models are considered significantly different. Clearly, the MVL-SVM algorithm is ranked first on average and is significantly different from the benchmark methods.
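
Plugging the values stated above into Equation (22) gives the critical difference used in the comparison. The short sketch below only evaluates the formula; the function name is illustrative.

```python
import math


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference of the Nemenyi test, as in Eq. (22)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Values from the text: seven models, 37 stocks, q_alpha = 2.949.
cd = nemenyi_cd(q_alpha=2.949, k=7, n_datasets=37)
print(round(cd, 3))  # roughly 1.481
```

Two models whose average ranks differ by less than this value are linked in the CD diagram.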

Moreover, contrast estimation based on medians can obtain a quantitative difference calculated from the medians between comparison algorithms over multiple datasets. Using this method, researchers can successfully estimate the difference between the performance of the two algorithms. Table 9 lists the results, where a positive value suggests that the row algorithm outperforms the corresponding column algorithm. Our model always achieves positive values concerning the baseline models.

Table 9 Contrast estimation based on medians among all models

Robust test

To further observe the ability of MVL-SVM to capture the joint impacts of news and market data on stock price trends over a certain period of time, we set the sliding window for news and market data from 1 to 5 days to predict the price trend with a 1-day lag after news releases. We define \(\lambda =max\{T_1,T_2\}\), where \(T_1\) denotes the news window, and \(T_2\) denotes the market data window. When we obtain n days for the sample, considering the use of sliding windows, the actual length of the output is \(n-\lambda\). Then, the input–output process of the model is formulated by equation (23), where md denotes the original market data sequence, and the dimension of the input matrix MD is expanded according to the sliding window. \(news_t^{T_1}\) is obtained by gathering the news between \(day_t\) and \(day_{t-T_1+1}\) and inputting it into the Chinese news text processing algorithm. \(w^{T_1}_{t,m}\) represents the weight of \(word_m\) obtained from \(T_1\)-days of news.

$$\begin{aligned} \begin{aligned} md=\begin{bmatrix} md_1\\ md_2\\ \cdots \\ md_{n-1}\\ md_n \end{bmatrix} MD=\begin{bmatrix} md_\lambda &{}md_{\lambda -1}&{}md_{\lambda -2}&{}\cdots &{} md_{\lambda -T_2+1}\\ md_{\lambda +1} &{}md_{\lambda }&{}md_{\lambda -1}&{}\cdots &{} md_{\lambda -T_2+2}\\ \cdot &{}\cdot &{}\cdot &{}\cdots &{}\cdot \\ \cdot &{}\cdot &{}\cdot &{}\cdots &{}\cdot \\ md_{n-2} &{}md_{n-3}&{}md_{n-4}&{}\cdots &{} md_{n-T_2-1}\\ md_{n-1} &{}md_{n-2}&{}md_{n-3}&{}\cdots &{} md_{n-T_2}\\ \end{bmatrix},\\ News=\begin{bmatrix} news_\lambda ^{T_1}\\ news_{\lambda +1}^{T_1}\\ \cdots \\ \cdots \\ news_{n-2}^{T_1}\\ news_{n-1}^{T_1}\\ \end{bmatrix}, where \ news^{T_1}_t=\begin{bmatrix} w^{T_1}_{t,1}\\ w^{T_1}_{t,2}\\ \cdots \\ \cdots \\ w^{T_1}_{t,M} \end{bmatrix}', Label=\begin{bmatrix} tag_{\lambda +1}\\ tag_{\lambda +2}\\ \cdots \\ \cdots \\ tag_{n-1}\\ tag_{n}\\ \end{bmatrix}. \end{aligned} \end{aligned}$$
(23)
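
The construction of the expanded market matrix and the news windows in Equation (23) can be sketched as follows. The sketch uses a single placeholder market variable whose value is the day number, so each output row shows which days it contains; the news side returns only the day numbers to gather, since the text processing itself is outside this illustration.

```python
import numpy as np


def sliding_market_windows(md, t1, t2):
    """Stack T2 days of market rows, newest first, as in Eq. (23).

    md has one row per day (row 0 is day 1). Returns the expanded
    matrix and the 0-based row index of the day each sample predicts.
    """
    md = np.asarray(md, dtype=float)
    n = len(md)
    lam = max(t1, t2)
    rows, target_idx = [], []
    for t in range(lam, n):              # t = last input day, lam..n-1
        window = md[t - t2:t][::-1]      # days t, t-1, ..., t-T2+1
        rows.append(window.ravel())
        target_idx.append(t)             # row index of day t+1
    return np.array(rows), np.array(target_idx)


def news_window_days(n, t1, t2):
    """Day numbers (1-based) whose news is gathered for each sample."""
    lam = max(t1, t2)
    return [list(range(t - t1 + 1, t + 1)) for t in range(lam, n)]


md = np.arange(1, 7).reshape(6, 1)       # placeholder: value = day number
MD, target_idx = sliding_market_windows(md, t1=2, t2=3)
news_days = news_window_days(6, t1=2, t2=3)
```

With six days, a two-day news window, and a three-day market window, the larger window of three days determines the first usable sample, so three samples remain.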

Notably, the case that both sliding windows are one day has been discussed in “Experimental test” section. The prediction results of MVL-SVMMD are listed in Table 10.

Table 10 The predicting accuracy of MVL-SVMMD in different sliding windows

Heatmaps can show the results more clearly. In Fig. 8, the horizontal axis represents the sliding windows of financial news, whereas the vertical axis represents market data. We find that the color darkens when the sliding window of financial news changes from five days to one day. This means MVL-SVMMD can reach the highest prediction accuracy when using one-day news, indicating that it can capture the prompt impact of news on stock prices.

Fig. 8 The heatmaps of the predicting accuracy of MVL-SVMMD models

We also find that the average prediction accuracy of the MVL-SVMMD model shows little change for different sliding windows of market data, indicating that the impact of market data is persistent, and the largest prediction accuracy is up to 88%, showing that this model has good performance. Furthermore, we want to determine whether the model is superior to the SVM models and whether the single-view data are sufficient to predict stock prices. Hence, we used the SVMMD and the SVMFN model in different sliding windows to conduct the experiment and compare them with the MVL-SVM model. The results are illustrated in Table 11, indicating that the MVL-SVM model performed much better than the SVM models. The prediction accuracy of the SVMMD model was between 51% and 53%, that of the SVMFN model was between 61% and 64%, and that of the MVL-SVM model was between 73% and 88%, which was at least 10% higher than that of the two SVM models. This shows that the MVL-SVM model can successfully combine data with different structures and extract information about stock price rise and fall.

Table 11 The predicting accuracy of SVMMD and SVMFN models in different sliding windows

Furthermore, as mentioned in “Stock price prediction based on one-day news and daily return” section, removing the total market cap, turnover rate, and trading volume from the market data may improve the performance of the MVL-SVM model. Therefore, in this section, we attempt to refine the model in the same way and set the sliding windows for news and daily returns to 1–5 days separately.

Before discussing the results of MVL-SVM model, we observe the performance change of the SVM model in different sliding windows after replacing the four-market data with daily returns. From Table 12, we can see that in most cases, the average prediction accuracy of SVMDR is higher than that of the ARIMA and SVMMD models. Moreover, with an increase in the length of the sliding window, the average prediction accuracy of both models shows the same trend of first rising, then falling, and then rising (shown in Fig. 9). They reached the highest average prediction accuracy when the sliding window was 2 days.

Table 12 The predicting accuracy of SVMMD and SVMDR in different sliding windows

Fig. 9 Average and median predicting accuracy of SVMMD and SVMDR models

Because the performance of the SVM model is improved by replacing the four market variables with daily returns, we infer that the same adjustment will lead to a similar improvement in the MVL-SVM model. The prediction accuracies of the MVL-SVM model for different sliding windows are presented in Table 13.

Table 13 The predicting accuracy of MVL-SVM models in different sliding windows

We evaluated the validity of the MVL-SVM model based on daily returns and news by comparing its prediction results with those of the ARIMA model, SVMDR, and SVMFN models, as shown in Tables 6, 11, and 12. The prediction accuracy of the MVL-SVM model with daily returns and news is much higher than that of the other three baseline models, which is the same as the results of the MVL-SVM model based on four market variables and news. The heatmaps of average and median prediction accuracies of the MVL-SVMDR model are shown in Fig. 10. We speculate from the heatmaps that the accuracy of the MVL-SVM models is related to the sliding windows of financial news because the prediction accuracy shows a downward trend with the increase in the length of the news sliding window. In particular, when the sliding window of financial news is set to one day, the model shows the best average and median accuracy performance.

Fig. 10 The heatmaps of the predicting accuracy of MVL-SVMDR models

In addition, by comparing the average and median prediction accuracy of MVL-SVMDR with those of MVL-SVMMD, we find that for most cases, the prediction accuracy of the MVL-SVM model based on news and daily returns is slightly higher than that based on news and four market data. This confirms our conjecture that using only daily returns in the market data can improve the model’s performance.

Trading strategy

From the above experimental analysis, news and market data play an important role in stock price prediction, and the model trained by MVL-SVM has the best predictive performance. To test its effectiveness in practical applications, we design and evaluate a series of trading strategies based on this model in this section.

The trading setup is as follows. In Section “Stock price prediction based on one-day news and four market data”, we adopted three training/testing proportions: 60%/40%, 75%/25%, and 90%/10%. To keep a sufficient number of samples in both the training and testing sets, we chose the middle proportion of 75%/25% to implement the trading strategy. We divide the data set into 75% training and 25% testing and train the models on the training set to predict the future price trends of the 37 stocks from April 4, 2020, to December 31, 2020. We apply five-fold cross-validation and the grid search method on the training set to find the model parameters that achieve the highest average prediction accuracy on the validation set. Then, the model with the selected parameters is applied to the testing set, and the prediction results are used as the signal to guide trading. If the stock price is predicted to rise on \(day_{t+1}\), we buy the stock at the closing price on \(day_{t}\) and sell it at the closing price on \(day_{t+1}\). If the stock price is predicted to fall on \(day_{t+1}\), no operation is carried out. For convenience, we assume that transactions have no cost, which is common in trading simulations. Moreover, from the analysis in Section “Robust test”, the prediction performance of the MVL-SVMMD model varies with different sliding windows, and the model achieves the highest prediction accuracy when the sliding windows of news and market data are one and two days, respectively. Therefore, we use this optimal sliding window setting in the trading strategy.

All previously designed models are considered in this section for comparison, including the ARIMA, SVMMD, SVMDR, SVMFN, SVMMV, MVL-SVMMD, and MVL-SVMDR models. We also introduce the momentum trading strategy, the buy-and-hold strategy, and the randomly buy strategy to compare with these seven models. For the momentum trading strategy, we consider absolute momentum, also known as price momentum. This strategy is based on the momentum effect, which holds that future returns are positively correlated with past returns. It measures the average stock return over a past period and assumes that the stock price will rise when this average is positive.
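The absolute-momentum rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `momentum_signals` and the look-back length `lookback` are assumptions, because the text does not state how long the "past period" is.

```python
def momentum_signals(returns, lookback=5):
    """Return one 0/1 signal per day.

    The signal is 1 when the mean return over the previous `lookback`
    days is positive, and 0 otherwise. `lookback` is a hypothetical
    parameter; the article does not give its value.
    """
    signals = []
    for i in range(len(returns)):
        if i < lookback:
            signals.append(0)  # not enough history yet, so stay out
            continue
        window = returns[i - lookback:i]  # strictly past returns, no look-ahead
        signals.append(1 if sum(window) / lookback > 0 else 0)
    return signals
```

Only returns observed before day i enter the window, so the signal for day i never uses that day's own return.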

In Sections “Experimental test” and “Robust test”, we focus on prediction accuracy, which can measure the predicting ability of different models. However, investors are concerned about whether they can profit from these strategies. Therefore, we introduce the annual return rate (AR) to measure the model’s profitability. The equation used is as follows:

$$\begin{aligned} \begin{aligned} AR=\frac{250}{t}\sum _{i=1}^{t}r(i), \end{aligned} \end{aligned}$$
(24)

where 250 is the approximate number of trading days in a year, t is the number of trading days in the testing period of the simulation, and r(i) represents the return obtained on trading \(day_i\) from the trading strategy, which is calculated by

$$\begin{aligned} \begin{aligned} r(i)=r_i\times signal_i, i=1,\ldots ,t, \end{aligned} \end{aligned}$$
(25)

where \(r_i\) represents the stock’s daily return on trading \(day_i\) and \(signal_i\) is a dummy variable representing the corresponding strategy signal. When \(signal_i\) is 0, the strategy predicts that the price will fall on \(day_i\), and we take no action; conversely, when \(signal_i\) is 1, the price is predicted to rise on \(day_i\), so we buy the stock at the close of \(day_{i-1}\) and sell it at the close of \(day_i\). The cumulative return from trading \(day_1\) to \(day_t\) can therefore be used to observe the real-time performance of each strategy.
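Equations (24) and (25) translate directly into code. The sketch below is a minimal illustration; the function names `strategy_returns` and `annual_return` are hypothetical and are not taken from the article.

```python
def strategy_returns(daily_returns, signals):
    """Equation (25): r(i) = r_i * signal_i, with signal_i in {0, 1}."""
    if len(daily_returns) != len(signals):
        raise ValueError("daily_returns and signals must have equal length")
    return [r * s for r, s in zip(daily_returns, signals)]


def annual_return(strategy_r, days_per_year=250):
    """Equation (24): AR = (250 / t) * sum of r(i) over the t test days."""
    t = len(strategy_r)
    if t == 0:
        raise ValueError("at least one trading day is required")
    return days_per_year / t * sum(strategy_r)
```

For example, with daily returns 0.01, -0.02, 0.03, 0.00 and signals 1, 0, 1, 0, the strategy earns 0.04 over four days, which annualizes to 250/4 × 0.04 = 2.5, that is, 250%.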

Investment risk plays a vital role in evaluating the performance of a trading system; therefore, annual volatility (AV) and maximum drawdown (MDD) are measured to evaluate the risk. AV can be calculated using the following equation:

$$\begin{aligned} \begin{aligned} AV=\sqrt{\frac{250}{t-1}\sum _{i=1}^{t}[r(i)-\bar{r}]^{2}}, \end{aligned} \end{aligned}$$
(26)

where

$$\begin{aligned} \begin{aligned} \bar{r}=\frac{1}{t}\sum _{i=1}^{t}r(i). \end{aligned} \end{aligned}$$
(27)
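Equations (26) and (27) amount to the sample standard deviation of the strategy's daily returns scaled by the square root of 250. A minimal sketch follows; the function name `annual_volatility` is an assumption made for illustration.

```python
import math


def annual_volatility(strategy_r, days_per_year=250):
    """Equation (26): AV = sqrt(250 / (t - 1) * sum((r(i) - mean)^2))."""
    t = len(strategy_r)
    if t < 2:
        raise ValueError("at least two trading days are required")
    mean_r = sum(strategy_r) / t  # Equation (27)
    squared_dev = sum((r - mean_r) ** 2 for r in strategy_r)
    return math.sqrt(days_per_year / (t - 1) * squared_dev)
```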

MDD is the maximum loss from a peak to a trough before a new peak is attained. A lower MDD indicates a lower maximum possible loss during a trading period.
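The article gives no formula for MDD, so the following is a sketch under stated assumptions: the input is a series of strictly positive account values (an equity curve), and the drawdown is measured relative to the running peak. The function name `max_drawdown` is hypothetical.

```python
def max_drawdown(equity_curve):
    """Largest relative decline from a running peak to a later trough.

    Assumes every value in `equity_curve` is strictly positive.
    """
    if not equity_curve:
        raise ValueError("equity_curve must not be empty")
    peak = equity_curve[0]
    mdd = 0.0
    for value in equity_curve:
        peak = max(peak, value)            # highest value seen so far
        drawdown = (peak - value) / peak   # relative loss from that peak
        mdd = max(mdd, drawdown)
    return mdd
```

For the curve 100, 120, 90, 110, 130, 104, the deepest decline runs from 120 to 90, so the MDD is 0.25; the later drop from 130 to 104 is only 0.20.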

The annual Sharpe ratio (ASR) is often regarded as a measure of risk-adjusted profit and is introduced to assess the stability of trading systems. It is the ratio of the average excess return to the volatility of excess returns. The formula of ASR is as follows:

$$\begin{aligned} \begin{aligned} ASR=\sqrt{250} \times \frac{\bar{r_{e}}}{\sigma _{e}}, \end{aligned} \end{aligned}$$
(28)

where \(\bar{r_{e}}\) and \(\sigma _{e}\) represent the average and volatility of daily excess returns during the simulation period, respectively.

$$\begin{aligned} \begin{aligned} \bar{r_{e}}=\frac{1}{n}\sum _{i=1}^{n}[r(i)-r_{f}(i)],\\ \sigma _{e}=\sqrt{\frac{1}{n-1}\sum _{i=1}^{n}[r(i)-r_{f}(i)-\bar{r_{e}}]^{2}}. \end{aligned} \end{aligned}$$
(29)

Here, \(r_{f}(i)\) represents the risk-free interest rate on trading \(day_i\). From Equation (28), a large ASR indicates that investors can obtain high profits per unit of risk. This also implies a more stable trading system.
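Equations (28) and (29) can be sketched as follows. The function name `annual_sharpe_ratio` is an assumption, and the daily risk-free rates are passed in as a list because the article does not say which rate series is used.

```python
import math


def annual_sharpe_ratio(strategy_r, risk_free_r, days_per_year=250):
    """Equation (28): ASR = sqrt(250) * mean(excess) / std(excess)."""
    n = len(strategy_r)
    if n < 2 or n != len(risk_free_r):
        raise ValueError("need two equally long series with at least two days")
    excess = [r - rf for r, rf in zip(strategy_r, risk_free_r)]
    mean_e = sum(excess) / n  # first line of Equation (29)
    sigma_e = math.sqrt(sum((e - mean_e) ** 2 for e in excess) / (n - 1))
    if sigma_e == 0:
        raise ValueError("excess returns have zero volatility")
    return math.sqrt(days_per_year) * mean_e / sigma_e
```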

Table 14 The average AR, ASR, MDD and AV of different strategies

Fig. 11 The cumulative return curves of different trading strategies on four randomly selected stocks

The four indicators of AR, ASR, MDD, and AV obtained from the different trading strategies are shown in Table 14. Moreover, we randomly selected four stocks to demonstrate the cumulative return curves in Fig. 11, which can be used to observe the performance of different strategies in detail.

From Fig. 11, we find that the MVL-SVM strategy proposed in this study is significantly more profitable than the other strategies. Although both SVM and MVL-SVM use multi-view heterogeneous data, the SVM strategy behaves unstably. In some cases, such as stocks 600031.SH and 600196.SH, SVM based on multi-view data ranks third, behind only the two MVL-SVM strategies. In other cases, such as stocks 600585.SH and 601088.SH, its performance is similar to or worse than that of the randomly buy strategy. In comparison, the MVL-SVM model exhibits good results in almost all cases. Even when the daily return replaces the four market data, it still obtains a much higher return than the buy-and-hold strategy, the randomly buy strategy, the momentum trading strategy, and the other prediction-based strategies.

Next, we analyze the performance of each strategy according to AR, ASR, AV, and MDD. Table 15 shows that only the MVL-SVM strategies (both MVL-SVMMD and MVL-SVMDR) achieve positive returns for all stocks. Among the ten strategies, the MVL-SVM strategies have the highest AR for every stock except one (code 600276.SH). The average AR of the MVL-SVMMD strategy is higher than that of the buy-and-hold and randomly buy strategies by 63.17% and 81.09%, respectively. For 36 of the 37 stocks (97.30%), the AR of MVL-SVM is the highest and surpasses that of the traditional algorithms based on single-view data, including the momentum trading strategy, ARIMA, SVMMD, and SVMFN. However, although the average AR of SVM based on multi-view heterogeneous data is better than that of the single-view models, for 24 of the 37 stocks (64.86%) the AR of SVMMV is lower than that of the single-view models. This confirms that naively concatenating heterogeneous data leads to unsatisfactory results.

Table 15 The results of AR

As shown in Table 16, for 26 of the 37 stocks (70.27%), the ASR of the MVL-SVM strategy is at least twice that of the other strategies. From the average values shown in Table 14, the average ASR of MVL-SVMMD is higher than that of the buy-and-hold strategy, randomly buy strategy, and momentum trading strategy by 3.87, 4.22 and 4.10, respectively. In particular, the average ASR of the time series and traditional machine learning strategies is lower than 1, whereas that of MVL-SVM is above 4. This demonstrates that MVL-SVM has higher stability and can gain higher profits per unit of risk.

Table 16 The results of ASR

Regarding risk, Table 14 shows that the MVL-SVM models have the lowest average MDD and a relatively low average AV among the strategies. For most stocks, the maximum peak-to-trough loss of MVL-SVM is smaller than that of most traditional strategies, including the buy-and-hold, randomly buy, momentum trading, ARIMA, and SVMFN strategies (see Table 17), and for most stocks, the AV of MVL-SVM is lower than that of the buy-and-hold, randomly buy, momentum trading, and ARIMA strategies (see Table 18). This implies that MVL-SVM sends relatively stable trading signals and carries relatively low risk.

Table 17 The results of MDD

Moreover, we compare the practical performance of MVL-SVMMD and MVL-SVMDR. Table 15 shows that for nearly 70% of the stocks, the AR of MVL-SVMDR is higher than that of MVL-SVMMD. Overall, the average AR of MVL-SVMDR is higher than that of MVL-SVMMD by 7.64%, as shown in Table 14, while the average values of the other indicators differ little. This indicates that, compared with the MVL-SVMMD model, the MVL-SVMDR model can provide more appropriate guidance for investors and help them obtain higher returns.

Table 18 The results of AV

Finally, to evaluate the statistical significance of the advantages of MVL-SVM over other baseline strategies, we apply nonparametric statistical analyses again, including the Nemenyi test and contrast estimation based on medians. As demonstrated in Fig. 12, MVL-SVM strategies are ranked first in the performance of AR, ASR, and MDD, and are significantly different from other baseline strategies in the profitability metrics, including AR and ASR. This illustrates the profitability of our model and its ability to control risks, which can also be demonstrated by the positive values in rows “MVL-SVMMD” and “MVL-SVMDR” in Table 19. In addition, the results also show that the performance of MVL-SVM can be improved by transforming market data into daily returns, as MVL-SVMDR significantly outperforms MVL-SVMMD according to AR. In contrast, other metrics do not show significant differences.
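For readers unfamiliar with the Nemenyi test, two methods are judged significantly different when their average ranks differ by at least the critical difference, CD = q_α · sqrt(k(k+1)/(6N)), where k is the number of compared methods and N the number of data sets. The sketch below is illustrative only: the 5% significance level is an assumption (this section does not state the level used), the function name `critical_difference` is hypothetical, and the q values are the commonly tabulated critical values for the two-tailed Nemenyi test.

```python
import math

# Tabulated critical values q_alpha for alpha = 0.05 (assumed level),
# indexed by the number of compared methods k.
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def critical_difference(k, n_datasets, q_table=Q_ALPHA_005):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    if k not in q_table:
        raise ValueError("no tabulated critical value for this k")
    return q_table[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))
```

With the ten strategies and 37 stocks of this section, the sketch gives a CD of roughly 2.23 under the assumed 5% level.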

Fig. 12 Comparison of pairwise algorithms with the Nemenyi test in terms of each evaluation metric

The aforementioned trading strategies are based on a single stock. If an investor is optimistic about a certain stock and plans to invest in it, our strategy can help them find better time nodes for trading. However, investors often tend to invest in a basket of stocks, which implies that building an appropriate trading strategy for an investment portfolio is necessary.

Table 19 Contrast estimation based on medians among all models

We further take the 37 stocks of the sample as a basket to build a trading strategy with our proposed algorithm. Considering that some investors tend to construct portfolios according to a certain market index, we add a passive trading strategy that holds the SSE 50 index for comparative evaluation. Assuming that the initial capital is 1,000,000 CNY, which is invested equally in each stock, we apply the same operation to each stock as in the single-stock trading strategy. That is, if the algorithm sends an up signal for a certain stock, our system spends 1/37 of the capital to buy it at the day’s closing price and sells it at the closing price the next day; if it sends a down signal, no operation is carried out. Trading signals are executed only when the total cash balance exceeds 100,000 CNY. The division of the training/testing set and the selection of sliding windows are the same as those in the single-stock trading strategy. Therefore, the return on trading \(day_i\) is transformed into

$$\begin{aligned} \begin{aligned} r(i)=\frac{1}{N_{stock}}\sum _{j=1}^{N_{stock}}r_{i,j}\times signal_{i,j}, i=1,\ldots ,n, \end{aligned} \end{aligned}$$
(30)

where \(N_{stock}\) is the number of stocks, \(r_{i,j}\) is the daily return of stock j on trading \(day_i\) and \(signal_{i,j}\in \{0,1\}\) is the corresponding strategy signal of stock j. Figure 13 shows that the MVL-SVM models performed excellently during the simulation. From Table 20, we find that our models are significantly superior to the other baseline models, not only in terms of profits but also in terms of risk control, indicating that they can pick out quality stocks from a basket and invest in them at appropriate trading points to make profits.
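Equation (30) is an equally weighted average of the per-stock strategy returns. A minimal sketch is given below; the function name `portfolio_daily_returns` and the row-per-day input layout are assumptions, and the cash-balance rule mentioned above is deliberately left out.

```python
def portfolio_daily_returns(returns, signals):
    """Equation (30): r(i) = (1 / N_stock) * sum_j r[i][j] * signal[i][j].

    `returns[i][j]` and `signals[i][j]` refer to stock j on trading day i.
    """
    portfolio = []
    for day_returns, day_signals in zip(returns, signals):
        if len(day_returns) != len(day_signals):
            raise ValueError("each day needs one signal per stock")
        n_stock = len(day_returns)
        traded = sum(r * s for r, s in zip(day_returns, day_signals))
        portfolio.append(traded / n_stock)
    return portfolio
```

Stocks with a signal of 0 contribute nothing to the day's return but still count in the denominator, which reflects capital that stays in cash.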

Fig. 13 The cumulative return curves of market simulations based on different trading strategies for portfolios

Table 20 Four metrics of different trading strategies for portfolios

Moreover, the single-stock simulation results of the MVL-SVMMD strategy show that the AR ranges from 22.51% to 232.5%, indicating that our model does not always yield high returns. We therefore conduct a simulation to determine whether our model can help investors profit when it underperforms. For each stock, we calculate the minimum difference between MVL-SVMMD and the other baseline strategies and select the five stocks with the lowest differences, that is, the stocks on which MVL-SVMMD performs least well relative to the baselines. These five stocks, 600276.SH, 601288.SH, 601668.SH, 600016.SH and 600570.SH, are treated as a new equally weighted portfolio and traded with 1/5 of the capital each time, with the same operation as above. Figure 14 and Table 21 show that the ARIMA strategy has the worst performance in the portfolio of five stocks. Although the buy-and-hold strategy has the highest cumulative return in July 2020, our model performs the best in the long run. This indicates that even if the MVL-SVM model does not significantly outperform the others, it can still help investors achieve considerable returns with relatively low risk.
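The selection step described above can be sketched as follows. This is an illustration under assumptions: the comparison metric is taken to be AR, the function name `weakest_stocks` and the nested-dictionary input are hypothetical, and the data in any usage would come from the per-stock results in Table 15.

```python
def weakest_stocks(ar_by_stock, target="MVL-SVMMD", k=5):
    """Return the k stocks where `target` has the smallest lead.

    `ar_by_stock` maps a stock code to a dict {strategy name: AR}.
    For each stock, the lead is the minimum of (target AR - baseline AR)
    over all baselines; a negative lead means some baseline did better.
    """
    gaps = {}
    for stock, ar in ar_by_stock.items():
        gaps[stock] = min(
            ar[target] - value for name, value in ar.items() if name != target
        )
    # Smallest gap first: these are the stocks where the target is weakest.
    return sorted(gaps, key=gaps.get)[:k]
```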

Fig. 14 The cumulative return curves of different trading strategies for an investment portfolio of 5 stocks

Table 21 Four metrics of different trading strategies for a portfolio of 5 stocks

The three trading simulations above show that, compared with common trading strategies such as momentum trading and with prediction-based strategies such as ARIMA and traditional SVM, MVL-SVM based on multi-view heterogeneous data performs excellently in both profitability and risk control. Even where some prediction-based strategies fail to exceed the passive strategy of holding the SSE 50 index and the buy-and-hold strategy, MVL-SVM achieves much better results than the passive strategies and the other benchmarks. In particular, when the dimension of the input data is reduced, that is, when the four market variables are replaced by one (the daily return), both predictive accuracy and profitability improve owing to the reduction in redundant information.

Conclusion

In this study, we propose a hybrid model for stock price prediction called MVL-SVM. It combines multi-view learning with a support vector machine to investigate the joint impact of financial news and market data on stock price movements. MVL-SVM can fuse multiple data sources directly with the multi-view learning algorithm and classify stock price fluctuations with a support vector machine, which enriches the information sources and reduces information loss in the fusion process.

In the experiment, we take 37 constituent stocks of the SSE 50 index as the research object and use unstructured financial news and structured market data as inputs to predict the price trend of each stock. By comparing MVL-SVM with classic SVM models based on single-view and multi-view data, we find that simply concatenating and inputting multi-view heterogeneous data yields unsatisfactory results because of the differing characteristics of the views. In contrast, the MVL-SVM model can learn and minimize the inconsistency among multiple data sources and thus demonstrates outstanding performance in this situation. We then seek to improve our model further. Considering the importance of daily returns among the four market variables, we replace the four variables with daily return sequences to construct a new model and compare it with the ARIMA model and classic SVM models. MVL-SVM based on news and daily return sequences significantly outperforms the other baseline models, and its performance surpasses that of MVL-SVM based on news and the four market variables. This shows the important role of daily returns in market data and supports the many studies that use only stock returns for research.

In the robustness test, we try to observe the model’s ability to capture the joint impact of the multi-view data within a certain period, thus setting the sliding windows of news and market data to 1–5 days. It can be concluded from the results that MVL-SVM can capture the prompt impact of news on stock prices because the sliding window of news can influence the prediction accuracy of MVL-SVM, and the model based on 1-day news has the best performance. The comparison demonstrates that MVL-SVM surpasses the benchmarks by at least 10% accuracy, which is a meaningful improvement.

Finally, a series of trading strategies is constructed from the prediction results of the two MVL-SVM models and compared with other prediction-based strategies based on single-view and multi-view data, as well as three common strategies: the buy-and-hold strategy, the randomly buy strategy, and the momentum trading strategy. When building trading strategies for a basket of stocks, a passive strategy of holding the SSE 50 index is also considered for comparative evaluation. The results show that the MVL-SVM strategy has excellent profitability and risk-control performance in various scenarios. Moreover, its performance can be improved by replacing the four market variables with daily return sequences.

In summary, from the prediction perspective, the proposed MVL-SVM model based on multi-view heterogeneous data can predict stock price movements more accurately than the other models. From the perspective of trading strategy, it can help investors gain higher profits with better risk control. However, some limitations remain. In this article, we consider only two information sources: market data and news. Future work can include more data sources in the model, such as online posts on social media and companies’ financial statements. In addition, when building the SVM and MVL-SVM models, this study considers only two kernel functions, the linear and Gaussian kernels. In the future, the models can be constructed with additional kernel functions, such as the polynomial and sigmoid kernels. As the proposed model places no restrictions on the type of financial asset, we will also attempt to apply it to other financial assets.

Availability of data and materials

The market data used in this article are available in the WIND database, https://www.wind.com.cn/. The financial news is available in the Uqer database, https://uqer.datayes.com/.

Abbreviations

SVM: Support vector machine

SVMMD: SVM based on market data

SVMFN: SVM based on financial news

SVMMV: SVM based on news and market data

MVL-SVM: Model integrating multi-view learning with SVM

MVL-SVMMD: MVL-SVM model based on news and market data

MVL-SVMDR: MVL-SVM model based on news and daily returns

AR: Annual return rate

ASR: Annual Sharpe ratio

MDD: Maximum drawdown

AV: Annual volatility

CD: Critical difference

References

  • Bildirici M, Ersin ÖÖ (2009) Improving forecasts of GARCH family models with the artificial neural networks: an application to the daily returns in Istanbul stock exchange. Expert Syst. Appl. 36(4):7355–7362

    Article  Google Scholar 

  • Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory, pp 92–100

  • Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. Wiley, Hoboken

    Google Scholar 

  • Buckley C, Salton G, Allan J, and Singhal A (1995) Automatic query expansion using SMART: TREC 3. NIST special publication sp, pp 69–69

  • Cao L, Tay F (2001) Financial forecasting using support vector machines. Neural Comput Appl 10(2):184–192

    Article  Google Scholar 

  • Cao LJ, Tay FEH (2003) Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans Neural Netw 14(6):1506–1518

    Article  CAS  PubMed  Google Scholar 

  • Ceci M, Pio G, Kuzmanovski V, Džeroski S (2015) Semi-supervised multi-view learning for gene network reconstruction. PLoS ONE 10(12):e0144031

    Article  PubMed  PubMed Central  Google Scholar 

  • Chen K, Zhou Y, and Dai F (2015) A LSTM-based method for stock returns prediction: a case study of china stock market. In: 2015 IEEE international conference on big data (big data), pp 2823–2824. IEEE

  • Collins M and Singer Y (1999) Unsupervised models for named entity classification. In: 1999 joint SIGDAT conference on empirical methods in natural language processing and very large corpora

  • Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

    Article  Google Scholar 

  • Dasgupta S, Littman ML, and Mcallester DA (2001) PAC generalization bounds for co-training. In: Advances in neural information processing systems 14 [neural information processing systems: natural and synthetic, NIPS 2001, 3–8 Dec 2001, Vancouver, British Columbia, Canada], pp 375–382

  • de Sa VR (1994) Learning classification with unlabeled data. Morgan Kaufmann Publishers, Burlington, pp 112–112

    Google Scholar 

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

    MathSciNet  Google Scholar 

  • Deng N, Tian Y, Zhang C (2012) Support vector machines: optimization based theory, algorithms, and extensions. CRC Press, Boca Raton

    Book  Google Scholar 

  • Deng S, Mitsubuchi T, Sakurai A (2014) Stock price change rate prediction by utilizing social network activities. Sci World J. https://doi.org/10.1155/2014/861641

    Article  Google Scholar 

  • Deng S, Mitsubuchi T, Shioda K, Shimada T, and Sakurai A (2011a) Combining technical analysis with sentiment analysis for stock price prediction. In: 2011 IEEE ninth international conference on dependable, autonomic and secure computing, pp 800–807. IEEE

  • Deng S, Mitsubuchi T, Shioda K, Shimada T, and Sakurai A (2011b) Multiple kernel learning on time series data and social networks for stock price prediction. In: 2011 10th international conference on machine learning and applications and workshops, vol 2, pp 228–234. IEEE

  • Dyck A , Zingales L (2003) The media and asset prices. Technical report, Working Paper, Harvard Business School, Harvard

  • Fischer T, Krauss C (2018) Deep learning with long short-term memory networks for financial market predictions. Eur J Oper Res 270(2):654–669

    Article  MathSciNet  Google Scholar 

  • García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064

    Article  Google Scholar 

  • Gidofalvi G, Elkan C (2001) Using news articles to predict stock price movements. University of California, San Diego, p 17

    Google Scholar 

  • Hammad AAA, Ali SMA, Hall EL (2007) Forecasting the Jordanian stock price using artificial neural network. Intell Eng Syst Through Artif Neural Netw 17:1–6

    Google Scholar 

  • Han Z, Zhang C, Fu H, Zhou JT (2022) Trusted multi-view classification with dynamic evidential fusion. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2022.3171983

    Article  PubMed  Google Scholar 

  • Jarrett JE, Schilling J (2008) Daily variation and predicting stock market returns for the frankfurter börse (stock market). J Bus Econ Manag 9(3):189–198

    Article  Google Scholar 

  • Kanwar N (2019) Deep reinforcement learning-based portfolio management. PhD thesis, The University of Texas at Arlington

  • Kesavan M, Karthiraman J, Ebenezer RT, Adhithyan S (2020) Stock market prediction with historical time series data and sentimental analysis of social media data. In: 2020 4th international conference on intelligent computing and control systems (ICICCS)

  • Kim KJ (2003) Financial time series forecasting using support vector machines. Neurocomputing 55(1/2):307–319

    Article  Google Scholar 

  • Kolarik T, Rudorfer G (1994) Time series forecasting using neural networks. ACM Sigapl Apl Quote Quad 25(1):86–94

    Article  Google Scholar 

  • Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Language models for financial news recommendation. In: Proceedings of the ninth international conference on Information and knowledge management, pp 389–396

  • Li X, Xie H, Wang R, Cai Y, Cao J, Wang F, Min H, Deng X (2016) Empirical analysis: stock market prediction via extreme learning machine. Neural Comput Appl 27(1):67–78

    Article  CAS  Google Scholar 

  • Li X, Wu P, Wang W (2020) Incorporating stock prices and news sentiments for stock market prediction: a case of Hong Kong. Inf Process Manag 57:102212

    Article  Google Scholar 

  • Li H, Dagli CH, Enke D (2007) Short-term stock market timing prediction under reinforcement learning schemes. In: 2007 IEEE international symposium on approximate dynamic programming and reinforcement learning, pp 233–240. IEEE

  • Lin CT, Wang YK, Huang PL, Shi Y, Chang YC (2022) Spatial-temporal attention-based convolutional network with text and numerical information for stock price prediction. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07234-0

    Article  PubMed  PubMed Central  Google Scholar 

  • Li X, Wang C, Dong J, Wang F, Deng X, Zhu S (2011) Improving stock market prediction by integrating both market news and stock prices. In: International conference on database and expert systems applications, pp 279–293. Springer, Berlin

  • Long W, Lu Z, Cui L (2019) Deep learning-based feature engineering for stock price movement prediction. Knowl-Based Syst 164:163–173

    Article  Google Scholar 

  • Long W, Song L, Tian Y (2019) A new graphic kernel method of stock price trend prediction based on financial news semantic and structural similarity. Expert Syst Appl 118:411–424

    Article  Google Scholar 

  • Lv B, Jiang Y, Li Q (2021) Prediction of short-term stock price trend based on multiview RBF neural network. Intell Neuroscience. https://doi.org/10.1155/2021/8495288

    Article  Google Scholar 

  • Meesad P and Thanh H (2014) Stock market trend prediction based on text mining of corporate web and time series data. J Adv Comput Intell Intell Inf. https://doi.org/10.20965/jaciii.2014.p0022

  • Mittermayer M-A, Knolmayer GF (2006) Newscats: a news categorization and trading system. In: Sixth international conference on data mining (ICDM’06), pp 1002–1007. IEEE

  • Mohan S, Mullapudi S, Sammeta S, Vijayvergia P, Anastasiu DC (2019) Stock price prediction using news sentiment analysis. In: 2019 IEEE Fifth international conference on big data computing service and applications (BigDataService), pp 205–208. IEEE

  • Ronaghi F, Salimibeni M, Naderkhani F, Mohammadi A (2022) COVID19-HPSMP: COVID-19 adopted hybrid and parallel deep information fusion framework for stock price movement prediction. Expert Syst Appl 187:115879

    Article  PubMed  Google Scholar 

  • Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523

    Article  Google Scholar 

  • Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

    Article  Google Scholar 

  • Schumaker RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial news: the AZFin text system. ACM Trans Inf Syst (TOIS) 27(2):1–19

    Article  Google Scholar 

  • Shiller RJ (2015) Irrational exuberance. Princeton University Press, Princeton

    Book  Google Scholar 

  • Shynkevich Y, McGinnity TM, Coleman S, Belatreche A (2015a) Predicting stock price movements based on different categories of news articles. In: 2015 IEEE symposium series on computational intelligence, pp 703–710. IEEE

  • Shynkevich Y, McGinnity TM, Coleman S, Belatreche A (2015b) Stock price prediction based on stock-specific and sub-industry-specific news articles. In: 2015 International joint conference on neural networks (IJCNN), pp 1–8. IEEE

  • Suhail K, Sankar S, Kumar AS, Nestor T, Soliman NF, Algarni AD, El-Shafai W, Abd El-Samie FE (2022) Stock market trading based on market sentiments and reinforcement learning. CMC-Comput Mater Continua 70(1):935–950

    Article  Google Scholar 

  • Sun K et al (2017) Equity return modeling and prediction using hybrid ARIMA-GARCH model. Int J Financ Res 8(3):154–161

    Article  Google Scholar 

  • Sun S, Yu M, Shawe-Taylor J, Mao L (2022) Stability-based PAC-bayes analysis for multi-view learning algorithms. Inf Fusion 86:76–92

    Article  Google Scholar 

  • Sun L, Ceran B, Ye J (2010) A scalable two-stage approach for a class of dimensionality reduction techniques. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 313–322

  • Tan Z, Quek C, Cheng PY (2011) Stock trading with cycles: a financial application of ANFIS and reinforcement learning. Expert Syst Appl 38(5):4741–4755

    Article  Google Scholar 

  • Vo N, Ślepaczuk R (2022) Applying hybrid ARIMA-SGARCH in algorithmic investment strategies on S &P500 index. Entropy 24(2):158

    Article  ADS  PubMed  PubMed Central  Google Scholar 

  • Wang H, Zhou Z (2021) Multi-view learning based on maximum margin of twin spheres support vector machine. J Intell Fuzzy Syst 40(6):11273–11286

    Article  MathSciNet  Google Scholar 

  • Wang Y, Liu H, Guo Q, Xie S, Zhang X (2019) Stock volatility prediction by hybrid neural network. IEEE Access 7:154524–154534

    Article  Google Scholar 

  • Wang F, Liu L, Dou C (2012) Stock market volatility prediction: a service-oriented multi-kernel learning approach. In: 2012 IEEE ninth international conference on services computing, pp 49–56. IEEE

  • White H (1988) Economic prediction using neural networks: the case of IBM daily stock returns. In: ICNN, vol 2, pp 451–458

  • Wüthrich B, Permunetilleke D, Leung S, Lam W, Cho V, Zhang J (1998) Daily prediction of major stock indices from textual www data. Hkie Trans 5(3):151–156

    Article  Google Scholar 

  • Xiao Y, Li X, Liu B, Zhao L, Kong X, Alhudhaif A, Alenezi F (2022) Multi-view support vector ordinal regression with data uncertainty. Inf Sci 589:516–530

    Article  Google Scholar 

  • Xu C, Tao D, Xu C (2015) Multi-view learning with incomplete views. IEEE Trans Image Process 24(12):5812–5825

    Article  ADS  MathSciNet  PubMed  Google Scholar 

  • Xu C, Tao D, Xu C (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634

  • Yan X, Hu S, Mao Y, Ye Y, Yu H (2021) Deep multi-view learning methods: a review. Neurocomputing 448:106–129

    Article  Google Scholar 

  • Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: ICML, vol 97, pp 35. CiteSeer

  • Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd annual meeting of the association for computational linguistics, pp 189–196

  • Zhang T, Liu S, Xu C, Lu H (2010) Human action recognition via multi-view learning. In: Proceedings of the second international conference on internet multimedia computing and service, pp 23–28

Download references

Acknowledgements

We are thankful to the reviewers for their valuable comments, which improved the manuscript.

Funding

This research was partly supported by National Natural Science Foundation of China (No. 71771204, 72231010) and the Fundamental Research Funds for the Central Universities (No. E0E48946X2).

Author information

Contributions

All the authors were involved in the research that led to the article and in its writing. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Jing Gao.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Variables and the corresponding definition are listed in Table 22.

Table 22 Variables and the corresponding definition

Some related works are listed in Table 23.

Table 23 Related works

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Long, W., Gao, J., Bai, K. et al. A hybrid model for stock price prediction based on multi-view heterogeneous data. Financ Innov 10, 48 (2024). https://doi.org/10.1186/s40854-023-00519-w

  • DOI: https://doi.org/10.1186/s40854-023-00519-w

Keywords