A hybrid model for stock price prediction based on multiview heterogeneous data
Financial Innovation volume 10, Article number: 48 (2024)
Abstract
Literature shows that both market data and financial media impact stock prices; however, using only one kind of data may lead to information bias. Therefore, this study uses market data and news to investigate their joint impact on stock price trends. However, combining these two types of information is difficult because of their completely different characteristics. This study develops a hybrid model called MVLSVM for stock price trend prediction by integrating multiview learning with a support vector machine (SVM). It works by simply inputting heterogeneous multiview data simultaneously, which may reduce information loss. Compared with the ARIMA and classic SVM models based on single and multiview data, our hybrid model shows statistically significant advantages. In the robustness test, our model outperforms the others by at least 10% accuracy when the sliding windows of news and market data are set to 1–5 days, which confirms our model’s effectiveness. Finally, trading strategies based on single stock and investment portfolios are constructed separately, and the simulations show that MVLSVM has better profitability and risk control performance than the benchmarks.
Introduction
Stock price predictions have always been a focus of financial research. Existing research on stock price prediction is primarily based on two data types. One is structured historical market data, and the other is unstructured text, such as financial news.
Stock market data, such as returns and trading volumes, play a vital role in stock price prediction, and many studies have used market data to predict stock price trends. White (1988) was the first to successfully predict the time series of the stock market using the Back Propagation Neural Network (BPNN). Subsequently, Kolarik and Rudorfer (1994) compared the prediction results of an Artificial Neural Network (ANN) with those of the Autoregressive Integrated Moving Average (ARIMA) model, showing that the ANN model was more effective. Bildirici and Ersin (2009) studied 30 years of historical stock data from the Istanbul stock market by combining the Autoregressive Conditional Heteroskedasticity (ARCH) model or the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model with ANN and found that the hybrid of GARCH and ANN produced better predictions than the hybrid of ARCH and ANN. Hammad et al. (2007) applied the multilayer BPNN to predict stock prices and reported better prediction performance than other methods. In recent years, deep learning has been introduced into stock price prediction. Chen et al. (2015) predicted stock returns with the Long Short-Term Memory (LSTM) model. Fischer and Krauss (2018) used LSTM to predict stock prices and derived a short-term investment strategy. Long et al. (2019) put forward a multi-filters neural network model using deep learning methodologies and applied it to the Chinese stock market index CSI 300. Some studies have also utilized reinforcement learning for financial prediction; however, the algorithm usually requires training and testing over a very long period. Tan et al. (2011) developed a non-arbitrage algorithmic trading system based on reinforcement learning, which was tested on more than 20 stocks over 13 years from 1994 to 2006. Suhail et al. (2022) employed a reinforcement learning network to guide stock market trading, which used 11 years of Apple stock data from 2006 to 2016. Additionally, the performance of reinforcement learning is sometimes unsatisfactory. Li et al. (2007) adopted actor-only and actor–critic reinforcement learning to develop two prediction systems; however, neither system generated significant improvements. Kanwar (2019) also showed that deep reinforcement learning was less successful in capturing the dynamic changes in the stock market than originally thought.
Moreover, many studies have shown that in addition to market data, financial news has an impact on stock prices. News contains information about a company's fundamentals and activities; hence, it affects market participants' expectations of future price changes and thus drives stock price movements. Dyck and Zingales (2003) showed that issuing earnings announcements through news media could increase volatility in the stock price. Shiller (2015) also held that media can fuel the fluctuations of the stock market. Hence, deriving information affecting stock prices from media coverage is very important. Wüthrich et al. (1998) chose the news in the most influential financial newspapers, such as the Wall Street Journal, as the object of empirical research and explored the forecasting effect of the news on market indexes. Lavrenko et al. (2000) constructed the eAnalyst news recommendation system to study the correlation between news and stock price time series. This system can recommend news that has a predictive effect on future stock price trends. Gidofalvi and Elkan (2001) applied a Bayesian text classifier and found that news indicators had a certain predictive effect on the stock price within 20 min before and after the news was released. Mittermayer and Knolmayer (2006) built the NewsCATS system to predict the intraday real-time price fluctuations of stocks caused by news. Compared with other automated text categorization algorithms, this system was found to have better predictive performance and higher trading profitability. Schumaker and Chen (2009) found that news could be used by the SVM algorithm to make excellent predictions of stock prices 20 min after the news was released, and the prediction results could be used to guide trading. Long et al. (2019) proposed a new kernel, S&S, to study the impact of news on stock prices, which considered the information structures among news in addition to the news contents. With SVM algorithms, the new kernel outperformed other common kernels, such as the linear kernel, by at least 5% accuracy.
The aforementioned research works are based on singleview data, but stock price movement can be affected by both financial news and historical market data. Market and news data can be independently used to predict stock prices; however, if the model only uses singleview information, information deviation may occur. Figure 1 shows possible market scenarios. If the model only uses historical market data, the rational prediction in the left figure will be “rise,” and the rational prediction in the right figure will be “fall.” Therefore, if we witness an actual “fall” in the left figure or “rise” in the right figure, the prediction performance of the model is weakened. Moreover, if the model uses only financial news, it fails to explain why stock prices continue to increase when negative news is released in the left figure and why stock prices still fall when positive news is released in the right figure. Thus, analyzing the impact of multiview data on stock prices comprehensively is important; only by this method can the model send the correct signal.
Many studies have tried incorporating the two kinds of data to improve prediction performance. However, owing to their different structures, combining them directly in a model is difficult. To solve this problem, studies usually apply indexing modeling; that is, they use textual data to compile indexes so that the textual data are structured and can be used together with market data to predict stock prices (Deng et al. 2011a; Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020). Although this approach successfully fuses structured data and unstructured text data, the indexing treatment has some limitations. Because abundant news text data are condensed into an index, either by directly using structured information about the text, such as news frequency (Deng et al. 2011a), or by processing vectorized text into a structured sentiment index (Mohan et al. 2019; Li et al. 2020; Kesavan et al. 2020), some information contained in the text will inevitably be lost. However, if common algorithms directly use the text vector together with stock market data for prediction, the complicated news information, which mixes predictive signals with a large amount of unrelated information and noise, may decrease the prediction performance (Lin et al. 2022). Accordingly, appropriately extracting and exploiting the hidden information within raw multiview heterogeneous data, including news and market data, to make accurate predictions becomes a challenging problem.
To solve the problem, this paper develops a hybrid model of stock price fluctuation prediction by combining multiview learning, which directly fuses differently structured data, with a support vector machine (SVM), a machine learning method used here for stock price trend classification. This model, called MVLSVM, can maximize the consistency between the multiview information learned from financial news and market data and therefore not only reduces the information loss in news text processing but also solves the difficulty of integrating complicated news information with market data. To evaluate the performance of the model, a time series method and classic SVMs were introduced for comparison. Finally, a series of trading strategies were constructed based on this algorithm and applied to three trading scenarios.
This paper contributes in the following three aspects. (1) The proposed hybrid model, based on the framework of multiview learning, can input heterogeneous information influencing stock price fluctuations, such as financial news and market data, into the prediction model simultaneously, which not only enriches the information types for stock price prediction but also reduces the information loss in the process of prediction. Most previous studies only considered singleview data; those that did use multiview data tended to adopt indexing modeling, which inevitably leads to a large loss of information. (2) This study also investigates the lag effect of news and market data on stock price forecasts. Usually, the news cannot be fully absorbed by the stock price on the day of the news release, which means it may further affect the stock price the next day; however, few studies consider the time lag of this impact. We address this by studying a prediction problem with lags and different time windows to observe the ability of MVLSVM to capture the information contained in multiview heterogeneous data after or over a certain period. (3) This study constructs a series of trading strategies based on the proposed hybrid model and compares them with other prediction-based and common strategies. The simulation results show that MVLSVM has better profitability and risk control ability than other models, which provides further evidence for evaluating the performance of our model. By contrast, related studies focus only on prediction accuracy.
The rest of the paper is organized as follows. “Literature review” section reviews the main existing methods related to stock price forecasting based on multiview heterogeneous data. “Methods” section introduces the methods used in this study. “Experimental test” section presents our datasets and shows the results of the MVLSVM model, which are compared with a time series model and some classic SVM models. “Robust test” section further evaluates the performance of the model under different sliding windows. “Trading strategy” section discusses a series of trading strategies designed with our model to assess its practical efficiency, and “Conclusion” section presents our conclusions.
Literature review
Some significant attempts have been made in the finance domain to incorporate news and market data in predicting stock prices. Relevant methods can be divided into two categories: indexing modeling and direct fusion methods.
Indexing modeling methods involve constructing indexes with news information and then fusing the structured index with numerical market data for prediction. Frequently used methods include the statistical method, which calculates the frequency of the text data, and sentiment analysis, which uses the processed text from language preprocessing to provide polarity scores for social media data and news. Deng et al. (2011a) predicted price movement with overall sentiment analysis and frequency based on news and comments, together with technical analysis of historical market data. Mohan et al. (2019) extracted numerical data called text polarity from text articles and combined it with stock prices for prediction. Li et al. (2020) extracted sentiments from news and represented stock prices by technical indicators. Then, a layered deep learning model was used to learn the multiview information, and a fully connected neural network was employed for stock predictions. Kesavan et al. (2020) represented news articles and social media contents by sentiment vectors and then used deep learning techniques to incorporate the polarity of the sentiments with financial time-series data to predict stock prices. This approach succeeded in fusing structured and unstructured textual data. However, when structured information about the text, such as news frequency, is directly compiled into indexes, or when text is first vectorized and then processed into a structured sentiment indicator, information may be lost.
Direct fusion methods aim to directly integrate structured and unstructured data to extract information or solve classification and prediction problems. Li et al. (2016) applied the extreme learning machine (ELM) to make stock price predictions based on market news and stock prices concurrently and found that the accuracies of RBF ELM and RBF SVM are similar but higher than that of BPNN; the prediction speeds of the two algorithms are also much faster than that of BPNN. Wang et al. (2019) proposed a hybrid time-series predictive neural network to combine the daily K-line data with the news vectors and succeeded in stock volatility prediction. Ronaghi et al. (2022) predicted a market index with COVID-19-related Twitter data and historical market data via a deep fusion framework consisting of two parallel paths, one based on CNN and another that integrates CNN with bidirectional LSTM (BLSTM). Lin et al. (2022) developed a spatial-temporal attention-based convolutional network, which successfully extracted text and numerical information for stock price prediction using the attention mechanism, CNN, and LSTM. However, the aforementioned studies were mostly based on the neural network framework. Considering that a neural network is prone to falling into a local minimum, we attempt to develop a new model based on a different framework to fuse the two data types.
Multiview learning, proposed by de Sa (1994), is a machine learning approach that can directly input heterogeneous data for training and usually performs excellently. Unlike the method of constructing an index from text, it substitutes labels using different views. It minimizes the inconsistency between the model outputs from distinct views to minimize classification errors. Yarowsky (1995) and Blum and Mitchell (1998) indicated that multiview learning outperformed singleview learning in classification. Blum and Mitchell (1998) improved the algorithm by co-training distinct views when studying web page classification. Collins and Singer (1999) measured the consistency between distinct views by constructing an objective function. By maximizing the objective function, Dasgupta et al. (2001) presented an upper limit for the generalization error of multiple views. Multiview learning has been applied to a variety of learning methods, such as dimensionality reduction (Sun et al. 2010) and classification methods (Han et al. 2022). Many scholars have noticed its usefulness and begun combining it with other traditional algorithms to obtain excellent performance. Xiao et al. (2022) utilized multiview learning to handle data uncertainty, thus successfully improving the Ordinal Regression (OR) classifier. Lv et al. (2021) developed a prediction model with market data by integrating multiview learning with the classic RBF network, which showed excellent performance in forecasting stock prices. However, that model uses only market data and excludes financial news information, which many studies have found to be predictive.
The intuition for building this hybrid model is as follows. On the one hand, SVM is a classic machine learning classification algorithm and is often used to predict stock prices with financial news (Schumaker and Chen 2009; Long et al. 2019). Many studies have shown that SVM performs better in financial forecasting when compared with some neural network frameworks (Kim 2003; Cao and Tay 2001, 2003; Li et al. 2016; Meesad and Thanh 2014). On the other hand, multiview learning can learn common feature spaces or shared patterns by combining multiple data sources (Yan et al. 2021). Therefore, these two algorithms can be combined to use multiview heterogeneous data to predict stock prices. Some scholars have theoretically proved the effectiveness of the multiview model over the singleview model (Sun et al. 2022), and literature shows that this hybrid model has achieved excellent performance in fusing data with different structures for classification (Zhang et al. 2010; Xu et al. 2015; Ceci et al. 2015; Wang and Zhou 2021).
However, this model has not been widely applied in the financial field, and the limited related research can be divided into three categories. First, most studies based on this method considered only singleview data. Shynkevich et al. (2015a, b) used the model to predict the stock price based on financial news and found that fusing different news categories could improve the prediction performance. However, they only considered the impact of media on stock prices and did not consider the impact of market data. Second, although some studies simultaneously included numerical and textual data within a multiview learning framework, they transformed the text into structured indexes to fuse it with numerical data (Deng et al. 2011a, b, 2014). As stated earlier, this approach increases information loss. Third, a few studies applied a multiview learning framework to merge historical stock prices with financial news vectors for stock price prediction (Li et al. 2011; Wang et al. 2012); however, they neither took into account the lag effect of news and market data on stock prices nor built trading strategies; hence, the model's actual application performance in the financial market could not be judged.
Methods
Chinese news text processing
Considering that news is unstructured and involves a lot of noise or redundant information, we must eliminate noise and extract representative features containing the most useful information for accurate prediction. Therefore, this section introduces the method of transforming news text into a structured feature vector for training through news preprocessing, data cleaning, text representation, and feature extraction.
News preprocessing
Because trading in the Chinese stock market ends at 15:00, news released after 15:00 on trading \(day_t\) can be assumed not to affect the fluctuation of the stock price on \(day_t\); similarly, news released on weekends, holidays, and other closed days has no effect on prices on those days. Therefore, news released on a closed day or after 15:00 on a trading day is included in the news of the next trading day. We then sorted the news by the reorganized date for processing.
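The reassignment rule can be illustrated with a minimal Python sketch. The function name `assign_trading_day` and the three-day calendar are hypothetical, and treating a release at exactly 15:00 as same-day is an assumption; a real calendar would come from the exchange's trading-day list.

```python
from bisect import bisect_left, bisect_right
from datetime import date, datetime, time

MARKET_CLOSE = time(15, 0)  # closing time of the Chinese stock market

def assign_trading_day(released_at, trading_days):
    """Return the trading day to which a news item is assigned.

    trading_days must be a sorted list of datetime.date objects.
    Returns None when no later trading day is known.
    """
    day = released_at.date()
    i = bisect_left(trading_days, day)
    is_trading_day = i < len(trading_days) and trading_days[i] == day
    if is_trading_day and released_at.time() <= MARKET_CLOSE:
        return day
    # Closed day, or released after the close: roll to the next trading day.
    j = bisect_right(trading_days, day)
    return trading_days[j] if j < len(trading_days) else None

# Hypothetical calendar: Friday 3 Jan 2020, Monday 6 Jan, Tuesday 7 Jan.
calendar = [date(2020, 1, 3), date(2020, 1, 6), date(2020, 1, 7)]
before_close = assign_trading_day(datetime(2020, 1, 3, 10, 30), calendar)
after_close = assign_trading_day(datetime(2020, 1, 3, 16, 0), calendar)
weekend = assign_trading_day(datetime(2020, 1, 4, 9, 0), calendar)
```

Here the Friday-morning item stays on Friday, while the Friday-evening and Saturday items both roll to Monday.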
Data cleaning
In data cleaning, we first removed the punctuation and garbled characters in the news and then used the jieba package for Python to perform Chinese word segmentation. Finally, we filtered out unimportant words using the Baidu Stop Word List. This step removes stop words and leaves representative words such as nouns, verbs, and adjectives.
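A minimal sketch of this cleaning pipeline is given below. To keep it runnable without third-party packages, the segmenter is passed in as a callable and defaults to whitespace splitting; for Chinese text, `jieba.lcut` could be supplied instead. The stop-word set and the function name `clean_tokens` are hypothetical stand-ins, not the Baidu Stop Word List or the authors' code.

```python
import re

# Hypothetical stop words; the study uses the Baidu Stop Word List instead.
STOP_WORDS = {"the", "of", "a", "and"}

def clean_tokens(text, segment=str.split, stop_words=STOP_WORDS):
    """Strip punctuation, segment the text, and drop stop words.

    `segment` is any callable mapping a string to a list of tokens.
    Whitespace splitting is used here only so the sketch runs as-is.
    """
    # Replace every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in segment(no_punct)
            if tok.strip() and tok not in stop_words]

tokens = clean_tokens("The profit of the company rose, and shares jumped!".lower())
```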
Text representation
For a news article, the value of each word was calculated according to its classification importance. Words that were more important for classification were assigned higher weights. Thus, each article can be represented as a vector of word values. The bag-of-words model is a commonly used text representation method that represents the text as a bag of words, regardless of word order and grammar, while maintaining multiplicity. Based on the bag-of-words model, Salton et al. (1975) proposed a vector space model commonly used in text classification. Because news contains many new concepts and words, it is appropriate to assume the independence of each word in this model. Each news item is then represented as a vector composed of the weight of each word, and the weight is determined by the word's importance in the news. According to Salton and Buckley (1988), the importance of words can be determined by TF-IDF, which supposes that words that rarely appear in the entire corpus but frequently appear in a given text are of greater importance for classification. However, in practice, text length affects the weights obtained from this method. To better quantify the characteristic words, the influence of the length should be reduced. Therefore, we used the ltc method (Buckley et al. 1995) in this study, which combines length normalization (“l”), term frequency (“t”) and collection frequency (“c”) to calculate the weights of words. By normalizing the weight of words, the influence of article length can be avoided, and the importance of word frequency is weakened to a certain extent. This represents news articles in the following form:
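With \(V\) denoting the number of words retained in the vocabulary (a symbol introduced here for illustration only), the vector can be written as

\[
news_t = \left(w_{t,1},\, w_{t,2},\, \ldots,\, w_{t,V}\right)
\]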
where \(news_t\) represents the news vector on \(day_t\) and \(w_{t, m}\) is the weight of \(word_m\) in \(news_t\), which is expressed as
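One common variant of ltc weighting (logarithmic term frequency, inverse collection frequency, cosine length normalization) is sketched below. It assumes that \(n\) denotes the number of news items in the corpus and \(V\) the vocabulary size; neither symbol is defined in the surrounding text, so the exact form used in the study may differ.

\[
w_{t,m} = \frac{\log\left(f_{t,m}+1\right)\,\log\left(n/F_m\right)}{\sqrt{\sum_{k=1}^{V}\left[\log\left(f_{t,k}+1\right)\,\log\left(n/F_k\right)\right]^{2}}}
\]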
where \(f_{t,m}\) represents the occurrence frequency of \(word_m\) in \(news_t\) (term frequency) and \(F_m\) is the occurrence frequency of \(word_m\) in the news corpus (collection frequency). All symbols used in the equations are listed in the Appendix (see Table 22).
Feature extraction
Because the news corpus involves many words, but only a small portion of them appears in each news item, we use \(\chi ^2\) statistics (Yang and Pedersen 1997) to extract features of the text. Instead of using all the words in the news corpus, this method selects the words that contribute most to text classification, which eases computation and helps prevent overfitting. For word w and category c, we define A as the number of times w and c co-occur, B as the number of times w occurs without c, C as the number of times c occurs without w, D as the number of times neither c nor w occurs, and n as the sample size.
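In the notation of Yang and Pedersen (1997), which matches the quantities A, B, C, D, and n defined above, the statistic for a word and category pair is

\[
\chi^2(w,c) = \frac{n\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
\]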
Then, for a word w, the \(\chi ^2\) score is obtained by combining the scores of each category:
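Following the category-weighted average of Yang and Pedersen (1997), which is consistent with the role of \(P(c)\) described in the next sentence, the combined score is

\[
\chi^2(w) = \sum_{c \in \{-1,\,1\}} P(c)\,\chi^2(w,c)
\]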
where P(c) represents the frequency of the category \(c \in \{-1, 1\}\) in the news corpus. Words with higher \(\chi ^2\) scores are considered more informative for prediction, so we rank words by their \(\chi ^2\) scores and take the number of top-ranked words that yields the best prediction performance as the dimension of news. This number was determined by calculating the prediction accuracy separately for different dimensions and choosing the one that maximized accuracy.
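The scoring and ranking steps can be sketched in plain Python as follows. The toy corpus, the representation of each news item as a set of words, and the function names are hypothetical; counts are taken per news item, following the \(\chi ^2\) statistic of Yang and Pedersen (1997).

```python
def chi2_score(a, b, c, d):
    """Chi-square statistic for one word-category pair.

    a: items with the word and in the category
    b: items with the word, not in the category
    c: items in the category, without the word
    d: items with neither
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def chi2_average(docs, labels, word):
    """Category-weighted chi-square score of `word` for labels in {-1, 1}."""
    n = len(docs)
    total = 0.0
    for cat in (-1, 1):
        a = sum(1 for doc, y in zip(docs, labels) if word in doc and y == cat)
        b = sum(1 for doc, y in zip(docs, labels) if word in doc and y != cat)
        c = sum(1 for doc, y in zip(docs, labels) if word not in doc and y == cat)
        d = n - a - b - c
        p_cat = (a + c) / n  # frequency of the category in the corpus
        total += p_cat * chi2_score(a, b, c, d)
    return total

def select_features(docs, labels, k):
    """Return the k words with the highest scores (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda w: (-chi2_average(docs, labels, w), w))
    return ranked[:k]

# Toy corpus: each news item is a set of words; 1 marks a rise, -1 a fall.
docs = [{"profit", "growth"}, {"profit", "record"},
        {"loss", "lawsuit"}, {"loss", "growth"}]
labels = [1, 1, -1, -1]
top_words = select_features(docs, labels, 2)
```

In this toy corpus, "growth" appears equally often in both classes and scores zero, whereas "profit" and "loss" separate the classes perfectly and rank first.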
MVLSVM algorithm
The proposed MVLSVM algorithm combines a support vector machine and multiview learning, which can apply multiview learning for multiview data fusion and then use a support vector machine for classification or prediction. These two components are discussed further in this section. Moreover, SVM will also serve as a benchmark to test the effectiveness of our hybrid model.
Support vector machine (SVM)
SVM (Cortes and Vapnik 1995; Deng et al. 2012) has been widely applied to solve classification problems owing to its strong performance. It learns from a set of two-class training instances and assigns new instances to one of the two classes.
We denote \((x_1, y_1),(x_2, y_2),\ldots , (x_n, y_n)\) as a two-class training dataset, where \(x_i\) for \(i=1,\dots ,n\) represents a p-dimensional real vector, and \(y_i \in \{-1,1\}\) indicates the class to which \(x_i\) belongs. According to the classification method, SVM can be divided into linear and nonlinear SVM. The main idea of linear SVM is to find a “maximum margin hyperplane,” defined as \(g(x)=\omega ^{T}x+b\), so that the two classes of samples can be accurately classified by this hyperplane and the sum of distances between the hyperplane and the closest point of each class is maximized. Mathematically, the classification problem is equivalent to solving the minimization problem as follows:
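In the standard soft-margin form (Cortes and Vapnik 1995), which uses the symbols defined in the next sentence, the problem is

\[
\min_{\omega,\,b,\,\xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\left(\omega^{T}x_i + b\right) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n
\]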
where n refers to the sample size, \(\xi _i\) is a slack variable, and C is a penalty term that controls the cost of misclassification of samples. The larger C is, the less tolerant the model is of classification errors, which makes it prone to overfitting. Conversely, when C is smaller, the model is more tolerant and therefore prone to underfitting. A two-dimensional example is shown in Fig. 2 to clearly demonstrate the workings of the linear SVM. Here, samples of different colors come from different classes. The red line represents the maximum margin hyperplane obtained by training the samples.
However, in practice, not all the samples are linearly separable. Therefore, a nonlinear SVM can be introduced to solve this problem. A nonlinear SVM can implicitly map samples into a high-dimensional space with \(\phi (x)\) to find a maximum margin hyperplane in this high-dimensional space. The optimization problem is as follows:
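Replacing \(x_i\) with its image \(\phi (x_i)\) in the standard soft-margin problem gives

\[
\min_{\omega,\,b,\,\xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\left(\omega^{T}\phi(x_i) + b\right) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n
\]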
With Lagrange duality, the original problem can be transformed into the following dual problem.
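In its standard form, using the multipliers and kernel defined in the next sentence, the dual problem reads

\[
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, k(x_i,x_j)
\quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C,\quad i=1,\dots,n
\]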
where \(\alpha _{i}\) is a Lagrangian multiplier corresponding to sample \(x_i\) and \(k(x_{i},x_{j})=\phi (x_{i})\cdot \phi (x_{j})\) is a kernel function that is a symmetric positive definite function that satisfies Mercer’s conditions. By solving the above optimization problem, we can obtain the solutions \(\alpha _{i}^{*}\) and \(b^{*}\); the decision function is obtained as
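In the usual notation, the resulting decision function for a new sample \(x\) is

\[
f(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i^{*}\, y_i\, k(x_i,x) + b^{*}\right)
\]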
Kernel function \(k(x_{i},x_{j})\) determines the performance of the model. The linear kernel function is often used to solve linear classification problems, and the Gaussian kernel function is used to solve nonlinear classification problems.
1) Linear kernel function
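The linear kernel is the inner product of the two samples:

\[
k(x_i,x_j) = x_i^{T}x_j
\]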
2) Gaussian kernel function
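In the common parameterization, consistent with the role of \(\gamma\) described next, the Gaussian kernel is

\[
k(x_i,x_j) = \exp\left(-\gamma\,\|x_i-x_j\|^{2}\right)
\]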
where \(\gamma\) is a Gaussian kernel parameter, which is important in determining kernel performance. When \(\gamma\) is small, the model is prone to underfitting, whereas when \(\gamma\) is large, the model is prone to overfitting.
Multiview learning
Generally, singleview data can be easily used in machine learning methods for classification, whereas using multiview data in these methods is difficult. Multiview learning algorithms emerged to solve this problem.
Multiview learning designs a function for each perspective. All functions are optimized by maximizing the consistency between redundant views, and the model’s performance is improved. Owing to its outstanding performance in multiview data applications, multiview learning has gradually attracted increasing attention. There are three types of existing algorithms.

1) Co-training: Maximizing mutual agreement on different views of unlabeled data through alternate learning.
2) Multiple kernel learning: Linearly or nonlinearly combining kernels for each view to improve training efficiency.
3) Subspace learning: Acquiring an appropriate subspace under the assumption that multiple views are generated from this appropriate subspace.
This study uses a multiple kernel learning framework to build a stock price prediction model. By selecting the appropriate kernels and kernel combination for training, each data source can be trained with the corresponding optimal kernel function; therefore, the model can perform better than the single-kernel model (Xu et al. 2013). As illustrated in Fig. 3, distinct kernels are selected for distinct views, and multiple pieces of information can be fused by combining distinct kernels. There are many combination methods that can be grouped into two categories: linear and nonlinear combinations.
However, no empirical results show that a nonlinear combination can improve the model’s performance, which raises the question of whether the nonlinear combination method is necessary and efficient. Therefore, we only used linear combination methods in this study. There are two basic categories.
1) Direct summation
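With \(M\) denoting the number of kernels (a symbol assumed here), direct summation gives every kernel the same weight:

\[
k(x_i,x_j) = \sum_{m=1}^{M} k_m(x_i,x_j)
\]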
where \(k_{m}(x_{i},x_{j})\) denotes the mth kernel.
2) Weighted summation
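With \(M\) again denoting the number of kernels (an assumed symbol), weighted summation scales each kernel before adding:

\[
k(x_i,x_j) = \sum_{m=1}^{M} \beta_m\, k_m(x_i,x_j)
\]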
Here, \(\beta _{m}\) represents the weight of kernel \(k_{m}(x_{i},x_{j})\).
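The two linear combinations can be illustrated with a small Python sketch. The assignment of a linear kernel to the news view and a Gaussian kernel to the market-data view, the sample values, and the weights 0.3 and 0.7 are all hypothetical choices made for illustration.

```python
import math

def linear_kernel(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(u, v, gamma=0.5):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def combined_kernel(sample_i, sample_j, kernels, weights=None):
    """Combine per-view kernels by direct or weighted summation.

    Each sample is a tuple of view-specific feature vectors, and kernels[m]
    is applied to view m. With weights=None every kernel gets weight 1.
    """
    if weights is None:
        weights = [1.0] * len(kernels)
    return sum(
        beta * k(view_i, view_j)
        for beta, k, view_i, view_j in zip(weights, kernels, sample_i, sample_j)
    )

# Hypothetical samples: (news vector, market-data vector).
s1 = ([1.0, 0.0, 0.5], [0.01, 0.2])
s2 = ([0.0, 1.0, 0.5], [0.01, 0.2])
kernels = [linear_kernel, gaussian_kernel]

direct = combined_kernel(s1, s2, kernels)                # k_1 + k_2
weighted = combined_kernel(s1, s2, kernels, [0.3, 0.7])  # 0.3*k_1 + 0.7*k_2
```

Because the market-data vectors are identical here, the Gaussian kernel equals 1, so the weighted sum is dominated by the market view.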
As different types of information have different importance for prediction/classification, using the direct summation method, which assigns equal priority to each kernel, is not ideal. In comparison, we choose the weighted summation kernel in this study, and the kernel function can be written as
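For the two views considered in this study, and assuming that each sample is written as \(x_i = (news_i, md_i)\) with \(k_1\) acting on the news vectors and \(k_2\) on the market data vectors (the indexing is an assumption), the weighted kernel takes the form

\[
k(x_i,x_j) = \beta_1\, k_1(news_i, news_j) + \beta_2\, k_2(md_i, md_j)
\]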
The weight \(\beta _{m}\) of the kernel \(k_{m}(x_{i},x_{j})\) can be determined using kernel learning. By applying SVM with the above kernel function, we can obtain the decision function of MVLSVM, as shown in Equation (14).
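Substituting the weighted kernel into the SVM decision function, with \(M\) denoting the number of kernels (an assumed symbol), gives

\[
f(x) = \operatorname{sign}\left(\sum_{i=1}^{n}\alpha_i^{*}\, y_i \sum_{m=1}^{M}\beta_m\, k_m(x_i,x) + b^{*}\right) \tag{14}
\]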
ARIMA model
The ARIMA model (Box et al. 2015) is a commonly used time series model that can input historical data sequences for prediction; therefore, we use this model to design one of the benchmarks based on singleview data. It contains three terms: the autoregression term, the integrated term, and the moving average term. A nonstationary data series can be converted into a stationary one by differencing to remove the impact of nonstationarity. The first-order differencing of a data series \(z_t\) is expressed as
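Using \(\nabla\) as the difference operator, the first-order difference is

\[
\nabla z_t = z_t - z_{t-1}
\]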
The stationarity of the time series was tested using the ADF method. After converting the data series into a stationary one through d-order differencing, we use the stationary series to construct a model that combines the autoregression model and the moving average model, and obtain the future value by d-order integration. The autoregression model captures the impact of historical time-series values on the current value by performing linear regression. Because time series are usually affected by random disturbances in noisy environments, the moving average method is further introduced to observe the influence of random disturbances on future time series. Then, the ARIMA(p, d, q) model with three parameters, including the autoregression order p, differencing order d and moving average order q, can be expressed as
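Written for the d-times differenced series \(\nabla^{d} z_t\), without a constant term and with the plus-sign convention for the moving average terms (both assumptions about the form used), the model is

\[
\nabla^{d} z_t = \sum_{i=1}^{p}\phi_i\, \nabla^{d} z_{t-i} + \epsilon_t + \sum_{j=1}^{q}\theta_j\, \epsilon_{t-j}
\]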
where \(\phi _{i}\) is the ith autoregression parameter, \(\theta _{j}\) is the jth moving average parameter, and \(\epsilon _{t}\) is the error term at time t. In practice, the autoregression order p and moving average order q can be determined using partial autocorrelation and autocorrelation diagrams, respectively.
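The differencing and integration steps that surround the ARMA fit can be sketched in plain Python. This is only an illustration of the d-order transformation and its inverse, not a full ARIMA estimator; the function names and the price series are hypothetical.

```python
def difference(series, d=1):
    """Apply d-order differencing; each pass shortens the series by one.

    Returns the differenced series and the first value of each pass,
    which is needed later to undo the differencing.
    """
    heads = []
    for _ in range(d):
        heads.append(series[0])
        series = [b - a for a, b in zip(series, series[1:])]
    return series, heads

def integrate(diffed, heads):
    """Invert `difference` using the stored first value of each pass."""
    for head in reversed(heads):
        restored = [head]
        for step in diffed:
            restored.append(restored[-1] + step)
        diffed = restored
    return diffed

prices = [10.0, 12.0, 15.0, 19.0, 24.0]
second_diff, heads = difference(prices, d=2)
recovered = integrate(second_diff, heads)
```

In this example the second difference is constant, so the series becomes stationary at \(d=2\), and integrating twice recovers the original values.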
Experimental test
Section “Introduction” shows that financial news and market data are significant in predicting price trends. Because the MVLSVM method can integrate multiple information sources for classification, we now apply it to predict whether prices will rise or fall based on financial news and market data. Subsequently, the results were compared with classic SVMs using singleview and mixed data.
Data sources
The Shanghai Stock Exchange 50 index (SSE 50 index) comprises the most representative 50 stocks of the Shanghai Stock Exchange. This indicates the overall situation of several leading companies with the greatest market influence in the Chinese stock market. As these enterprises have the most active news reports and can thus provide sufficient news samples, we choose the constituent stocks of the SSE 50 index for empirical analysis. Due to the limitations of data sources, the period investigated in this study was from January 1, 2018 to December 31, 2020.
Table 1 shows the total number of days with news releases for each stock. As the prices of newly listed stocks fluctuate unstably, we exclude stocks listed after January 1, 2016. Furthermore, because of insufficient news, three stocks (600745.SH, 600690.SH, and 601888.SH) were not included in the sample. Consequently, the research sample consisted of 37 stocks.
For the structured data, considering the selected market data should comprehensively reflect the stock information, such as price changes, transaction activity, market liquidity, scale, and so on, we choose four widely used variables, including stock daily return (r), trading volume (tv), turnover rate (tr) and total market cap (mc) from Wind database (https://www.wind.com.cn/), to predict the stock price. We denote \(md_1, md_2,\ldots , md_t,\ldots\) as the market data sequences, and \(md_t=(r_t, tv_t, tr_t, mc_t)\) represents the four market variables on \(day_t\). The daily returns used in this study are all log returns, and all data are daily.
For financial news, the news release time, summary, and text were collected from the Uqer database (https://uqer.datayes.com/). The Uqer database provides a news API that collects news from 223 news websites, including reports on the company and coverage related to the macroeconomic environment. For each stock, we selected the news containing the stock’s name as its stock news; in total, we collected 496,014 pieces of news for the 37 stocks. To illustrate our data in detail, we randomly selected one stock, 600276.SH, to visualize the data (see Fig. 4). Table 2 presents the basic statistical characteristics of the market variables. From Fig. 4a and b, we find that the amount of news is largest in 2020. Each year, the amounts of news in January and February are relatively small owing to holidays. In addition, from Fig. 4c, we find that except for the linear relationship between the turnover rate and trading volume, the other variables are essentially uncorrelated.
Considering the spread of the Coronavirus in 2019, we further investigate whether this issue impacts our data and the market trend. The total amount of news on COVID-19 in our sample was 3,467. The COVID-19 outbreak occurred at the end of December 2019. However, the disease was not considered serious in the early stages; hence, it did not cause large stock market fluctuations. On January 20, 2020, a report pointed out that COVID-19 was a fast-spreading communicable disease transmitted from human to human for which a cure had not been found; panic began to spread, and the stock market began to fall, as shown in Fig. 5. On January 23, 2020, the lockdown of Wuhan was announced, and the stock market fell sharply. Affected by the epidemic, the US stock market experienced four circuit breakers in March 2020, on March 9, March 12, March 16, and March 18. This also affected A-share investors, leading to a sharp drop in the SSE 50 index. Choosing 2018 to 2020 as our research period can also help us explore whether the results of our model remain robust under special circumstances, such as epidemics.
The data must be normalized before being input into the model for training. Because the market data can take both positive and negative values, they are transformed by Equation (17) to satisfy the normalization requirements.
Here, \(md^{k}_t\) denotes the kth market variable on \(day_t\) and \(max\{md^k\}\) refers to the maximum value of the kth market variable. The values range from −1 to 1 after normalization. There is no need to normalize the news data because they are already normalized through the ltc method, as shown in Eq. (2).
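The normalization step can be illustrated with a minimal Python sketch. Equation (17) is not reproduced in this text, so the sketch assumes the simplest scaling that maps signed values into the range −1 to 1, namely dividing each variable by its largest absolute value; the function name `max_abs_normalize` is hypothetical.

```python
def max_abs_normalize(series):
    """Scale a sequence of floats into [-1, 1].

    Assumption: each value is divided by the largest absolute value of the
    series, which may differ in detail from the paper's Equation (17).
    """
    peak = max(abs(v) for v in series)
    if peak == 0:
        # A constant zero series stays at zero.
        return [0.0 for _ in series]
    return [v / peak for v in series]


# Example: daily log returns with both signs.
returns = [0.02, -0.04, 0.01]
normalized = max_abs_normalize(returns)
```

Under this assumption, the sign of each value is preserved, which matters for a variable such as the daily return.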
In this study, the high-dimensional news vector obtained from the Chinese news text processing methods and the four market variables were all input to the MVLSVM algorithm. After training on the training set, the algorithm can choose the optimal kernel for each data source and the optimal kernel combination weights. Therefore, combining multiple kernels can successfully fuse the multiview data, and the new kernel can be input into the SVM classifier for classification. The framework of this multiview stock price prediction model is illustrated in Fig. 6.
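The idea of fusing views through a combined kernel can be sketched as follows. This is only an illustration under stated assumptions: one Gaussian kernel per view and fixed, hand-picked weights, whereas MVLSVM learns the kernel for each view and the combination weights during training. The function names and the toy inputs are hypothetical.

```python
import math


def rbf_kernel(X, gamma):
    """Gaussian kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
         for j in range(n)]
        for i in range(n)
    ]


def combine_kernels(kernels, weights):
    """Weighted sum of same-shaped kernel matrices.

    Assumption: weights are non-negative and sum to 1 (a convex combination),
    which keeps the result a valid kernel.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two samples seen from two views with different dimensions.
market_view = [[0.0, 1.0], [1.0, 0.0]]
news_view = [[1.0], [1.0]]
K = combine_kernels(
    [rbf_kernel(market_view, 0.5), rbf_kernel(news_view, 0.5)],
    [0.5, 0.5],
)
```

The combined matrix has one row and column per sample regardless of each view's dimension, which is why views of very different sizes can be fused at the kernel level rather than by concatenating features.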
Experimental analysis and comparison
In this section, we consider the joint influence of structured market data and unstructured financial news and apply the MVLSVM model to predict the stock price trend on the same day or the next day. Each sample is labeled according to the daily stock return \(r_{t+i}\), as shown in Equation (18).
Here, \(r_{t+i}\) represents the daily return of the stock on \(day_{t+i}\), \(tag_{t+i}\) indicates that the daily return on \(day_{ t+i}\) is used to label the sample, and \(i=0\) means that the model aims to predict the stock price of the day, whereas \(i=1\) means to predict that of the next day. Because it is meaningless to use \(r_t\) to predict the price fluctuation of \(day_t\), the market data can only be used to predict the next day’s price movement, while the news released before the closing of the stock market on \(day_t\) can be used to predict the rise and fall of stock prices on both \(day_t\) and \(day_{t+1}\). Therefore, \(i=0,1\) is valid for financial news, and \(i=1\) is valid for market data.
As shown in Table 3, we used a confusion matrix to show the classification results, and the accuracy is given by \(Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\) (19)
where TP (true positive) means that the stock price trends up and the model correctly predicts the upward trend; TN (true negative) means that the stock price trends down and the model correctly predicts the downward trend; FN (false negative) means that the stock price actually rises but the model predicts a downward trend; and FP (false positive) means that the stock price actually falls but the model predicts an upward trend. Formula (19) gives the percentage of correctly predicted samples in the total sample.
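The accuracy computation from the confusion matrix can be sketched in a few lines of Python. The label encoding (1 for an upward trend, -1 for a downward trend) and the function names are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for labels where 1 = up and -1 = down."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == -1 and p == -1)
    fp = sum(1 for a, p in pairs if a == -1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == -1)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correctly predicted samples in the total sample."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + tn) / total


actual = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, 1, 1]
counts = confusion_counts(actual, predicted)
acc = accuracy(*counts)
```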
Stock price prediction based on one-day news and four market data
Here, we build an MVLSVM model to predict stock price movement using one-day market data and financial news. In this section, the prediction accuracy is compared with that of the classic SVM models to evaluate the predictive performance.
As explained in Section “Experimental analysis and comparison”, news released before 15:00 on \(day_t\) can be used to predict the stock returns of \(day_t\) and \(day_{t+1}\). However, market data on \(day_t\) can only be used to predict stock returns on \(day_{t+1}\). Therefore, we consider two experimental settings: lag=0 (predict the price on the day of the news release) and lag=1 (predict the price on the next day of the news release).
In the case of lag=0, the sign of \(r_t\) is predicted by \(md_{t-1}\) and \(news_t\), where \(md_{t-1}=(r_{t-1}, tv_{t-1}, tr_{t-1}, mc_{t-1})\) represents the market data of \(day_{t-1}\) and \(news_t=(w_{t,1}, w_{t,2},\ldots , w_{t,M})\) represents the news vector on \(day_t\), which is obtained from “Chinese news text processing”. The input matrices of market data, MD, and of news, News, are formulated in Equation (20), where each row of MD and News denotes a vector, and the labels of these vectors are also shown in Equation (20).
When lag = 1, the sign of \(r_{t+1}\) is predicted by \(md_t\) and \(news_t\). The input matrices MD and News and output vector Label are shown in Equation (21).
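The two experimental settings differ only in how the market data, news, and labels are aligned in time. The Python sketch below illustrates that alignment; it is not the authors' implementation. The label encoding (1 if the return is positive, -1 otherwise, including a zero return) and the function name `align_inputs` are assumptions, since Equations (18), (20), and (21) are not reproduced in this text.

```python
def align_inputs(md, news, returns, lag):
    """Pair market data and news vectors with the label they should predict.

    lag = 0: (md[t-1], news[t]) predicts the sign of returns[t].
    lag = 1: (md[t],   news[t]) predicts the sign of returns[t+1].
    """
    def sign(r):
        return 1 if r > 0 else -1  # assumed encoding; zero counts as down

    samples = []
    if lag == 0:
        for t in range(1, len(returns)):
            samples.append((md[t - 1], news[t], sign(returns[t])))
    elif lag == 1:
        for t in range(len(returns) - 1):
            samples.append((md[t], news[t], sign(returns[t + 1])))
    else:
        raise ValueError("lag must be 0 or 1")
    return samples


md = [[1], [2], [3]]
news = ["a", "b", "c"]
returns = [0.1, -0.2, 0.3]
same_day = align_inputs(md, news, returns, lag=0)
next_day = align_inputs(md, news, returns, lag=1)
```

In both settings, n days of data yield n − 1 samples, because one day is consumed either by the lagged market data or by the next-day label.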
For the SVM model, three cases are considered: (i) inputting only the market data (SVMMD), (ii) inputting only financial news (SVMFN), and (iii) inputting concatenated multiview data of market data and financial news (SVMMV). Considering our sample size, we adopted three training/testing proportions: 60%/40%, 75%/25%, and 90%/10%. Because the model parameters considerably affect the training performance, five-fold cross-validation and the grid search method are applied to the training set to select the optimal parameters. For the penalty term C, the candidate values are set to \(C=10^a\), where \(a \in \{-3,-2,-1,0,1,2,3\}\), and for the Gaussian kernel parameter \(\gamma\), the candidate values are \(\gamma =2^b\), where \(b \in \{-4,-3,-2,-1,0,1,2,3,4\}\). Therefore, for a nonlinear SVM, there are 63 parameter combinations. A grid search builds a grid in which each node refers to a parameter combination and traverses all the grid nodes to determine the optimal parameter combination of the model. We used five-fold cross-validation for each parameter combination to observe the model’s performance. That is, the training set was divided into five parts, and each part was used as the validation set once, while the other four parts were used for training the model. The average prediction accuracy of the five experiments was taken as the model performance under this parameter combination. Among all parameter combinations, the one with the best performance was used to fit the final model for independent testing. We used Python and the Sklearn package in this study to implement the model. The prediction accuracies for the 37 stocks are shown in Table 4.
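The parameter search described above can be sketched with the Python standard library alone. This is an illustration of the procedure, not the authors' Sklearn code: the contiguous fold layout and the pluggable `score_fn` callback (which would train an SVM on the training indices and return the validation accuracy) are assumptions.

```python
from itertools import product

# Candidate values: C = 10^a for a in -3..3, gamma = 2^b for b in -4..4.
C_GRID = [10.0 ** a for a in range(-3, 4)]
GAMMA_GRID = [2.0 ** b for b in range(-4, 5)]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def grid_search(n_samples, score_fn, k=5):
    """Return (C, gamma, mean_accuracy) of the best grid node.

    score_fn(C, gamma, train_idx, val_idx) is a hypothetical callback that
    fits a model on train_idx and returns its accuracy on val_idx.
    """
    best = None
    for C, gamma in product(C_GRID, GAMMA_GRID):
        scores = [score_fn(C, gamma, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean_acc = sum(scores) / len(scores)
        if best is None or mean_acc > best[2]:
            best = (C, gamma, mean_acc)
    return best
```

With seven values of C and nine values of γ, the loop visits 7 × 9 = 63 grid nodes, each evaluated on five folds.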
Table 4 indicates that the MVLSVM model always had the highest accuracy, regardless of the training set proportion. Notably, the prediction accuracy of the MVLSVM model was approximately 30% higher than that of the SVM model when both types of data were input. In particular, when we predict the price trend one day after the news release with a 90% training proportion, the average prediction accuracies of the SVM models based only on market data and only on financial news are 50.77% and 61.70%, respectively, whereas the MVLSVM model reaches nearly 88% accuracy. This shows that the MVLSVM model significantly outperformed the baseline models.
For the three SVM models in Table 4, on the one hand, the prediction accuracy improves when news data are added to the market data, indicating that news information contributes to price prediction. On the other hand, when the training set proportion was 60% or 75%, the accuracy of the SVMMV model was close to that of the SVMFN model, but when the proportion was 90%, the SVM model with multiview data performed worse than the SVMFN model. With the MVLSVM model, however, there was an obvious improvement in prediction accuracy. This shows that heterogeneous data, each of which can be used alone to predict the stock price, cannot produce ideal results if they are simply concatenated and input into the SVM model, because of the differences in their data characteristics, such as dimensions. This further demonstrates the advantages of the proposed model. Because the MVLSVM model can learn and minimize the inconsistency between distinct views, it can effectively combine data with different structures to make reasonable predictions of stock price trends. The experiment also shows that the prediction performance of MVLSVM is stable: its average prediction accuracy changes by only 2.37% when the training proportion is changed, whereas that of the SVM model based only on financial news changes by 8.79%. Further, the training/testing proportion of 60%/40% corresponds to training from January 2, 2018, to October 23, 2019, when COVID-19 had not yet emerged, and testing from October 24, 2019, to December 31, 2020, when COVID-19 broke out and affected the stock market. The models with the other two training/testing proportions were trained on data that include the COVID-19 period. However, Table 4 shows that even when our model was trained without COVID-19-related data, the average prediction accuracy was still approximately 85%.
Considering the case of lag = 0 and lag = 1, the differences between the average prediction accuracy of the models with 75%/25% training/testing proportions and those with 60%/40% training/testing proportions, are only 1.68% and 0.84%, respectively, indicating that the spread of Coronavirus has little impact on the prediction performance of our model.
Here, we simplify the follow-up research. Because simply concatenating heterogeneous data weakens the training efficiency of the model, considering the SVMMV model is not necessary. Therefore, we only consider the prediction performance of the SVM models with single-view data and the MVLSVM model. Additionally, as shown in Table 4, one training/testing proportion is sufficient to illustrate the effectiveness of the different models, and there is little difference between the prediction accuracies with lag = 0 and lag = 1. Considering that a result based only on market data with a 0-day lag is unavailable, and for convenience of explanation, it is better to choose the same lag period for the multiview data in the MVLSVM model. Accordingly, in the following, we only provide the results with a training proportion of 90% and consider the case with a 1-day lag.
Stock price prediction based on one-day news and daily return
We use four market variables to predict the stock price movement in the above study. However, many studies forecast stock price trends based only on historical daily stock returns (Jarrett and Schilling 2008; Sun 2017; Vo and Ślepaczuk 2022), showing that researchers pay more attention to the daily return than to the other market variables. Therefore, we changed the input from the four market data sequences to the daily return sequence only, to discuss the predictive ability of the MVLSVM model in this case. Considering that ARIMA is a commonly used time-series model that can take historical return sequences as input for prediction, we use this model to build a benchmark based on single-view data in our study. The ARIMA model was implemented using the R language.
According to the ADF test (see Table 5), the p-values for all stocks are significant, so the series of daily returns are stationary. The orders of the ARIMA model can be determined using autocorrelation and partial autocorrelation diagrams. The results are presented in Table 6, and the average prediction accuracy for the 37 stocks is 49.61%.
Before evaluating the MVLSVM model, we first observe whether the performance of the SVM model changes if daily returns replace the market data. The experimental settings are as follows. We consider the case of lag = 1, that is, we use \(md_t\) and \(news_t\) to predict \(r_{t+1}\). Instead of using all four variables, we input \(r_t\) as \(md_t\) to conduct the experiment. The results of the SVM model based on daily returns (SVMDR) are presented in Table 7. The performance was compared with that of the SVM model based on the four market variables (SVMMD).
As shown in Table 7, the SVMDR model achieves higher prediction accuracy than the SVMMD model for 22 of the 37 stocks (59.46%). The average prediction accuracies of SVMMD and SVMDR were 50.77% and 51.72%, respectively. This indicates that the SVMDR model is better than the SVMMD and ARIMA models and implies that the historical daily stock returns contain most of the price fluctuation information in the market data.
The prediction accuracy of MVLSVM is given in Table 8. For simplicity, we name the MVLSVM model based on news and market data MVLSVMMD and the MVLSVM model based on news and the daily return MVLSVMDR.
The average prediction accuracies of MVLSVMMD and MVLSVMDR are 87.41% and 87.89%, respectively, showing that MVLSVMDR is slightly more accurate, and the MVLSVMDR model performs better than the MVLSVMMD model for nearly half of the sample. Our experiments imply that after excluding the other three market variables (total market cap, turnover rate, and trading volume), the predictive information in the market data does not necessarily decrease and may even become more effective owing to the refinement of the data. This also supports the practice, common in many studies, of using only stock returns for prediction.
Statistical analysis
The above analysis is based on a numerical comparison. Next, we further evaluated the model from the perspective of statistical analysis.
The nonparametric test (Demšar 2006) is a useful approach for classifier comparison over multiple datasets. Here, we employ two nonparametric tests, the Nemenyi test (Demšar 2006) and contrast estimation based on medians (García et al. 2010), to compare the relative performances of the pairwise algorithms. The Nemenyi test can determine whether one algorithm yields competitive performance compared to the other methods. The algorithms were ranked according to their performance on multiple datasets, and the average rank of each algorithm was calculated. Two models whose average ranks differ by at least one critical difference (CD) are considered significantly different, and the critical difference can be calculated using the following formula: \(CD=q_{\alpha }\sqrt{\frac{K(K+1)}{6N_{stock}}}\) (22)
where \(q_{\alpha }\) is the critical value of the Nemenyi test, K represents the number of algorithms involved, and \(N_{stock}\) is the number of stocks. In this section, we compare the performance of all the models involved, including ARIMA, SVMDR, SVMMD, SVMFN, SVMMV, MVLSVMDR and MVLSVMMD. Therefore, the K value in our test is seven, and we have \(q_{\alpha }=2.949\) at a significance level \(\alpha =0.05\). The CD diagram for accuracy is shown in Fig. 7. In the figure, if the average ranks of two models are within one CD, the two models are linked in the CD diagram, whereas the performances of unlinked models are considered significantly different. Clearly, the MVLSVM algorithm is ranked first on average and is significantly different from the benchmark methods.
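As a worked example, the critical difference can be evaluated for the values reported here (K = 7 models, 37 stocks, \(q_{\alpha }=2.949\)). The sketch assumes the standard Nemenyi formula from Demšar (2006), \(CD=q_{\alpha }\sqrt{K(K+1)/(6N)}\); the function name is hypothetical.

```python
import math


def nemenyi_critical_difference(q_alpha, n_models, n_datasets):
    """Critical difference CD = q_alpha * sqrt(K (K + 1) / (6 N))."""
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))


# Values reported in the text: K = 7 models, N = 37 stocks, alpha = 0.05.
cd = nemenyi_critical_difference(2.949, 7, 37)
```

Under this assumption, the CD is roughly 1.48, so two models would need average ranks about one and a half positions apart to be considered significantly different.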
Moreover, contrast estimation based on medians can obtain a quantitative difference calculated from the medians between comparison algorithms over multiple datasets. Using this method, researchers can successfully estimate the difference between the performance of the two algorithms. Table 9 lists the results, where a positive value suggests that the row algorithm outperforms the corresponding column algorithm. Our model always achieves positive values concerning the baseline models.
Robust test
To further observe the ability of MVLSVM to capture the joint impacts of news and market data on stock price trends over a certain period of time, we set the sliding window for news and market data from 1 to 5 days to predict the price trend with a 1-day lag after news releases. We define \(\lambda =max\{T_1,T_2\}\), where \(T_1\) denotes the news window and \(T_2\) denotes the market data window. When the sample covers n days, the use of sliding windows makes the actual length of the output \(n-\lambda\). The input–output process of the model is formulated by Equation (23), where md denotes the original market data sequence, and the dimension of the input matrix MD is expanded according to the sliding window. \(news_t^{T_1}\) is obtained by gathering the news between \(day_{t-T_1+1}\) and \(day_t\) and inputting it into the Chinese news text processing algorithm. \(w^{T_1}_{t,m}\) represents the weight of \(word_m\) obtained from \(T_1\) days of news.
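The windowing described above can be illustrated with a short Python sketch. It only shows the index bookkeeping under stated assumptions: market data for the window are flattened into one vector, news items are simply gathered per window (the text-processing step is not reproduced), and labels use the assumed encoding 1 for a positive next-day return and -1 otherwise. The function name is hypothetical.

```python
def sliding_window_samples(md, news_days, returns, t1, t2):
    """Build (market_window, news_window, label) samples with a 1-day lag.

    t1 is the news window and t2 the market data window, both in days.
    With n days of data the result has n - max(t1, t2) samples.
    """
    if t1 < 1 or t2 < 1:
        raise ValueError("windows must be at least one day")
    lam = max(t1, t2)
    n = len(returns)
    samples = []
    for t in range(lam - 1, n - 1):
        # Flatten the last t2 days of market data into one input vector.
        md_window = [x for day in md[t - t2 + 1:t + 1] for x in day]
        # Gather the last t1 days of news for later text processing.
        news_window = news_days[t - t1 + 1:t + 1]
        label = 1 if returns[t + 1] > 0 else -1
        samples.append((md_window, news_window, label))
    return samples


# Ten days of toy data with four market variables per day.
md = [[float(d)] * 4 for d in range(10)]
news_days = [["news item of day %d" % d] for d in range(10)]
returns = [0.01 if d % 2 else -0.01 for d in range(10)]
samples = sliding_window_samples(md, news_days, returns, t1=3, t2=5)
```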
Notably, the case that both sliding windows are one day has been discussed in “Experimental test” section. The prediction results of MVLSVMMD are listed in Table 10.
Heatmaps can show the results more clearly. In Fig. 8, the horizontal axis represents the sliding windows of financial news, whereas the vertical axis represents those of market data. We find that the color darkens when the sliding window of financial news changes from five days to one day. This means that MVLSVMMD reaches the highest prediction accuracy when using one-day news, indicating that it can capture the prompt impact of news on stock prices.
We also find that the average prediction accuracy of the MVLSVMMD model shows little change for different sliding windows of market data, indicating that the impact of market data is persistent, and the highest prediction accuracy is up to 88%, showing that this model performs well. Furthermore, we want to determine whether the model is superior to the SVM models and whether single-view data are sufficient to predict stock prices. Hence, we used the SVMMD and SVMFN models with different sliding windows to conduct the experiment and compared them with the MVLSVM model. The results are presented in Table 11, indicating that the MVLSVM model performed much better than the SVM models. The prediction accuracy of the SVMMD model was between 51% and 53%, that of the SVMFN model was between 61% and 64%, and that of the MVLSVM model was between 73% and 88%, which was at least 10% higher than that of the two SVM models. This shows that the MVLSVM model can successfully combine data with different structures and extract information about stock price rises and falls.
Furthermore, as mentioned in the “Stock price prediction based on one-day news and daily return” section, removing the total market cap, turnover rate, and trading volume from the market data may improve the performance of the MVLSVM model. Therefore, in this section, we attempt to refine the model in the same way and set the sliding windows for news and daily returns to 1–5 days separately.
Before discussing the results of the MVLSVM model, we observe how the performance of the SVM model changes in different sliding windows after replacing the four market variables with daily returns. From Table 12, we can see that in most cases, the average prediction accuracy of SVMDR is higher than that of the ARIMA and SVMMD models. Moreover, as the length of the sliding window increases, the average prediction accuracy of both models shows the same trend of first rising, then falling, and then rising again (shown in Fig. 9). They reached the highest average prediction accuracy when the sliding window was 2 days.
Because the performance of the SVM model is improved by replacing the four market variables with daily returns, we infer that the same adjustment will lead to a similar improvement for the MVLSVM model. The prediction accuracies of the MVLSVM model for different sliding windows are presented in Table 13.
We evaluated the validity of the MVLSVM model based on daily returns and news by comparing its prediction results with those of the ARIMA model, SVMDR, and SVMFN models, as shown in Tables 6, 11, and 12. The prediction accuracy of the MVLSVM model with daily returns and news is much higher than that of the other three baseline models, which is the same as the results of the MVLSVM model based on four market variables and news. The heatmaps of average and median prediction accuracies of the MVLSVMDR model are shown in Fig. 10. We speculate from the heatmaps that the accuracy of the MVLSVM models is related to the sliding windows of financial news because the prediction accuracy shows a downward trend with the increase in the length of the news sliding window. In particular, when the sliding window of financial news is set to one day, the model shows the best average and median accuracy performance.
In addition, by comparing the average and median prediction accuracy of MVLSVMDR with those of MVLSVMMD, we find that for most cases, the prediction accuracy of the MVLSVM model based on news and daily returns is slightly higher than that based on news and four market data. This confirms our conjecture that using only daily returns in the market data can improve the model’s performance.
Trading strategy
From the above experimental analysis, news and market data play an important role in stock price prediction, and the model trained by MVLSVM has the best predictive performance. To test its effectiveness in practical applications, we design and evaluate a series of trading strategies based on this model in this section.
The trading setup is as follows. In Section “Stock price prediction based on one-day news and four market data”, we adopted three training/testing proportions: 60%/40%, 75%/25%, and 90%/10%. To keep a sufficient number of samples in both the training and testing sets, we chose the middle training/testing proportion of 75%/25% to implement the trading strategy. We divide the data set into 75% training and 25% testing and train the models on the training set to predict future price trends of the 37 stocks from April 4, 2020, to December 31, 2020. We apply five-fold cross-validation and the grid search method on the training set to find the model parameters that achieve the highest average prediction accuracy on the validation set. Then, the model with the selected parameters was applied to the testing set, and the prediction results were used as the signal to guide trading. If the stock price is predicted to rise on \(day_{t+1}\), we buy it at the closing price on \(day_{t}\) and sell it at the closing price on \(day_{t+1}\). If the stock price is predicted to fall on \(day_{t+1}\), no operation is carried out. For convenience, we assume that transactions have no cost, which is common in trading simulations. Moreover, from the analysis in Section “Robust test”, the prediction performance of the MVLSVMMD model varies with different sliding windows, and the model achieves the highest prediction accuracy when the sliding windows of news and market data are one and two days, respectively. Therefore, we use this optimal sliding-window setting in the trading strategy.
All previously designed models are considered in this section for comparison, including the ARIMA, SVMMD, SVMDR, SVMFN, SVMMV, MVLSVMMD, and MVLSVMDR models. Additionally, we introduce the momentum trading strategy, the buy-and-hold strategy, and the randomly buy strategy to compare with the above seven models. For the momentum trading strategy, we consider absolute momentum, also known as price momentum. This strategy is based on the momentum effect, which holds that future returns are positively correlated with past returns. It measures the average stock return over the past period and assumes that when this average is positive, the stock price will rise.
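The absolute momentum rule described above can be sketched as follows. The length of the look-back period is not stated in this text, so `lookback` is a hypothetical parameter, as is the function name.

```python
def momentum_signals(returns, lookback):
    """Return one 0/1 signal per day from day index `lookback` onward.

    The signal is 1 (expect a rise) when the average return over the
    previous `lookback` days is positive, and 0 otherwise.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least one day")
    signals = []
    for t in range(lookback, len(returns)):
        past = returns[t - lookback:t]
        signals.append(1 if sum(past) / lookback > 0 else 0)
    return signals


daily_returns = [0.01, 0.02, -0.05, 0.01]
signals = momentum_signals(daily_returns, lookback=2)
```

The signal for a given day uses only returns that were available before that day, so the rule does not look ahead.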
In Sections “Experimental test” and “Robust test”, we focus on prediction accuracy, which can measure the predicting ability of different models. However, investors are concerned about whether they can profit from these strategies. Therefore, we introduce the annual return rate (AR) to measure the model’s profitability. The equation used is as follows:
where 250 is the total average number of trading days in a year, t is the number of trading days for testing during the simulation, and r(i) represents the return obtained on trading \(day_i\) from the trading strategy, which is calculated by
where \(r_i\) represents the stock’s daily return on trading \(day_i\) and \(signal_i\) is a dummy variable representing the corresponding strategy signal. When \(signal_i\) is 0, the strategy predicts that the price will fall on \(day_i\) and we take no action; when \(signal_i\) is 1, the price is predicted to trend up on \(day_i\), so we buy the stock on \(day_{i-1}\) and sell it on \(day_i\). Therefore, the cumulative return from trading \(day_1\) to \(day_t\) can be used to observe the real-time performance of each strategy.
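The strategy return and the annual return rate can be sketched in Python. The equations themselves are not reproduced in this text, so the sketch assumes that the daily strategy return is \(signal_i \cdot r_i\) and that AR scales the summed daily log returns by 250/t; the authors' exact annualization may differ, and the function names are hypothetical.

```python
def strategy_returns(returns, signals):
    """Daily strategy return: the stock's return when signal is 1, else 0."""
    return [s * r for r, s in zip(returns, signals)]


def cumulative_returns(strategy_rets):
    """Running sum of daily log returns from the first trading day."""
    total, curve = 0.0, []
    for r in strategy_rets:
        total += r
        curve.append(total)
    return curve


def annual_return(strategy_rets, days_per_year=250):
    """Assumed annualization: (250 / t) times the sum of daily returns."""
    t = len(strategy_rets)
    if t == 0:
        raise ValueError("no trading days")
    return days_per_year / t * sum(strategy_rets)


rets = strategy_returns([0.01, -0.02, 0.03], [1, 0, 1])
curve = cumulative_returns(rets)
ar = annual_return(rets)
```

In this toy example, the strategy avoids the losing day because its signal is 0, so only the two positive returns accumulate.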
Investment risk plays a vital role in evaluating the performance of a trading system; therefore, annual volatility (AV) and maximum drawdown (MDD) are measured to evaluate the risk. AV can be calculated using the following equation:
where
MDD is the maximum loss from a peak to a trough before a new peak is attained. A lower MDD indicates a lower maximum possible loss during a trading period.
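The two risk measures can be sketched as follows. The AV equation is not reproduced in this text, so the sketch assumes the common definition of annualized volatility as the sample standard deviation of daily returns scaled by \(\sqrt{250}\), and it computes MDD on a wealth curve built from daily log returns. Both choices and the function names are assumptions.

```python
import math


def annual_volatility(daily_rets, days_per_year=250):
    """Assumed AV: sqrt(250) times the sample std of daily returns."""
    t = len(daily_rets)
    if t < 2:
        raise ValueError("need at least two trading days")
    mean = sum(daily_rets) / t
    variance = sum((r - mean) ** 2 for r in daily_rets) / (t - 1)
    return math.sqrt(days_per_year * variance)


def max_drawdown(daily_log_rets):
    """Largest relative loss from a running peak of the wealth curve."""
    wealth, peak, mdd = 1.0, 1.0, 0.0
    for r in daily_log_rets:
        wealth *= math.exp(r)          # log return -> wealth multiplier
        peak = max(peak, wealth)       # highest wealth seen so far
        mdd = max(mdd, (peak - wealth) / peak)
    return mdd


# Wealth doubles and then halves: the drawdown from the peak is 50%.
mdd = max_drawdown([0.6931471805599453, -0.6931471805599453])
av = annual_volatility([0.01, 0.01, 0.01])
```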
The annual Sharpe ratio (ASR) is often regarded as a risk-adjusted profit measure and is introduced to assess the stability of trading systems. It is the ratio of average excess returns to the volatility of excess returns. The formula of ASR is as follows:
where \(\bar{r_{e}}\) and \(\sigma _{e}\) represent the average and volatility of daily excess returns during the simulation period, respectively.
Here, \(r_{f}(i)\) represents the riskfree interest rate on trading \(day_i\). From Equation (29), a large ASR indicates that investors can obtain high profits under unit risk. This also implies a more stable trading system.
The four indicators of AR, ASR, MDD, and AV obtained from the different trading strategies are shown in Table 14. Moreover, we randomly selected four stocks to demonstrate the cumulative return curves in Fig. 11, which can be used to observe the performance of different strategies in detail.
From Fig. 11, we find that the MVLSVM strategy proposed in this study is significantly more profitable than the other strategies. Although both SVM and MVLSVM use multiview heterogeneous data, the SVM strategy behaves in an unstable manner. In some cases, such as stocks 600031.SH and 600196.SH, the SVM based on multiview data ranks third, behind only the two MVLSVM strategies. However, in other cases, such as stocks 600585.SH and 601088.SH, its performance is similar to or worse than that of the randomly buy strategy. In comparison, the MVLSVM model exhibits good results in almost all cases. Even if the daily return is used to replace the four market variables, it can still obtain a much higher return than the buy-and-hold strategy, randomly buy strategy, momentum trading strategy, and other prediction-based strategies.
Then, we specifically analyze the performance of each strategy according to AR, ASR, AV, and MDD. Table 15 shows that only the MVLSVM strategies (both MVLSVMMD and MVLSVMDR) achieve positive returns for all stocks. Among the ten strategies, the MVLSVM strategies always have the highest AR except for one stock (code 600276.SH). The average AR of the MVLSVMMD strategy is higher than that of the buy-and-hold and randomly buy strategies by 63.17% and 81.09%, respectively. For 36 of the 37 stocks (97.30%), the AR of MVLSVM is the highest and surpasses those of the traditional algorithms based on single-view data, including the momentum trading strategy, ARIMA, SVMMD, and SVMFN. However, although the average AR of the SVM based on multiview heterogeneous data is better than that of the single-view models, for 24 of the 37 stocks (64.86%), the AR of SVMMV is lower than those of the single-view models. This confirms that simply concatenating heterogeneous data leads to unsatisfactory results.
As shown in Table 16, for 26 of the 37 stocks (70.27%), the ASR of the MVLSVM strategy is at least twice that of the other strategies. From the average values shown in Table 14, the average ASR of MVLSVMMD is higher than that of the buy-and-hold strategy, randomly buy strategy, and momentum trading strategy by 3.87, 4.22 and 4.10, respectively. In particular, the average ASR of the strategies based on time series and traditional machine learning is lower than 1, whereas that of MVLSVM is above 4. This demonstrates that MVLSVM has higher stability and can gain higher profits per unit of risk.
Regarding risk, Table 14 shows that the MVLSVM models have the lowest average MDD and a relatively low average AV among the strategies. For most stocks, the maximum loss from a peak to a trough of MVLSVM is less than that of most traditional strategies, including the buy-and-hold, randomly buy, momentum trading, ARIMA, and SVMFN strategies (shown in Table 17), and for most stocks, the AV of MVLSVM is lower than that of the buy-and-hold, randomly buy, momentum trading, and ARIMA strategies (see Table 18). This implies that MVLSVM has a relatively good capacity to send stable trading signals and carries relatively low risk.
Moreover, we compared the practical performances of MVLSVMMD and MVLSVMDR. Table 15 shows that for nearly 70% of the stocks, the AR of MVLSVMDR is higher than that of MVLSVMMD. In general, the average AR of MVLSVMDR is higher than that of MVLSVMMD by 7.64%, as shown in Table 14, while the average values of the other indicators differ little. This indicates that, compared with the MVLSVMMD model, the MVLSVMDR model can provide more appropriate guidance for investors and help them obtain higher returns.
Finally, to evaluate the statistical significance of the advantages of MVLSVM over other baseline strategies, we apply nonparametric statistical analyses again, including the Nemenyi test and contrast estimation based on medians. As demonstrated in Fig. 12, MVLSVM strategies are ranked first in the performance of AR, ASR, and MDD, and are significantly different from other baseline strategies in the profitability metrics, including AR and ASR. This illustrates the profitability of our model and its ability to control risks, which can also be demonstrated by the positive values in rows “MVLSVMMD” and “MVLSVMDR” in Table 19. In addition, the results also show that the performance of MVLSVM can be improved by transforming market data into daily returns, as MVLSVMDR significantly outperforms MVLSVMMD according to AR. In contrast, other metrics do not show significant differences.
The aforementioned trading strategies are based on a single stock. If an investor is optimistic about a certain stock and plans to invest in it, our strategy can help them find better times to trade. However, investors often invest in a basket of stocks, which implies that building an appropriate trading strategy for an investment portfolio is necessary.
We further take the 37 stocks of the sample as a basket to build a trading strategy with our proposed algorithm. Considering that some investors tend to construct portfolios according to a certain market index, we add a passive trading strategy that holds the SSE 50 index for comparative evaluation. Assuming that the initial capital is 1,000,000 CNY, which is invested equally in each stock, we apply the same operation to each stock as in the single-stock trading strategy. That is, if the algorithm sends an up signal for a certain stock, our system will spend 1/37 of the capital to buy it at the day’s closing price and have a short position at the closing price the next day; if it sends a down signal, no operation will be carried out. Trading signals are executed only when the total cash balance exceeds 100,000 CNY. The division of the training/testing set and the selection of sliding windows are the same as those in the single-stock trading strategy. Therefore, the return on trading \(day_i\) is transformed into
\[R_i = \frac{1}{N_{stock}} \sum_{j=1}^{N_{stock}} signal_{i,j}\, r_{i,j},\]
where \(N_{stock}\) is the number of stocks, \(r_{i,j}\) is the daily return of stock j on trading \(day_i\), and \(signal_{i,j}\in \{0,1\}\) is the corresponding strategy signal of stock j. Figure 13 shows that the MVLSVM models performed excellently during the simulation. From Table 20, we find that our models are significantly superior to the baseline models, not only in terms of profits but also in terms of risk control, indicating that they can pick out quality stocks from a basket and invest in them at appropriate trading points to make profits.
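One way to read the definitions above is that each stock receives an equal share of the capital and contributes its daily return only on days when its signal is 1. The sketch below implements that reading for a single trading day. It is an interpretation made for illustration: the function name and the toy numbers are hypothetical, and the cash-balance condition is not modeled.

```python
def portfolio_daily_return(signals, returns):
    """Equal-weight portfolio return for one trading day.

    signals[j] is 0 or 1 for stock j; returns[j] is that stock's daily return.
    Stocks with signal 0 contribute nothing, which corresponds to holding cash.
    """
    if len(signals) != len(returns):
        raise ValueError("signals and returns need one entry per stock")
    n_stock = len(returns)
    return sum(s * r for s, r in zip(signals, returns)) / n_stock


# Hypothetical day with four stocks: only the first and third are bought.
day_signals = [1, 0, 1, 0]
day_returns = [0.02, -0.01, -0.01, 0.03]
day_portfolio_return = portfolio_daily_return(day_signals, day_returns)
```

Because the divisor is the full number of stocks rather than the number of active signals, a day with few up signals leaves most of the capital uninvested, which dampens both gains and losses.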
Moreover, from the simulation results of the MVLSVMMD strategies based on a single stock, we find that the AR ranges from 22.51% to 232.5%, indicating that our model does not always yield high returns. Therefore, we conduct a simulation to determine whether our model can help investors gain profit when it underperforms. Specifically, for each stock, we calculate the minimum difference between MVLSVMMD and the baseline strategies and select the five stocks with the lowest differences, that is, the stocks on which MVLSVMMD does not perform well. These five stocks, namely 600276.SH, 601288.SH, 601668.SH, 600016.SH, and 600570.SH, are regarded as a new equally weighted portfolio and traded with 1/5 of the capital each time, with the same operation as above. Figure 14 and Table 21 show that the ARIMA strategy has the worst performance in the portfolio of five stocks. Although the buy-and-hold strategy has the highest cumulative return in July 2020, our model performs the best in the long run. This indicates that even if the MVLSVM model does not significantly outperform the others, it can still help investors achieve considerable returns with relatively low risk.
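The selection step described above can be sketched as follows. The sketch assumes that the "difference" is measured in AR and that it is taken as the model's AR minus each baseline's AR; the text does not state the metric explicitly, so this is an interpretation. The function name, the strategy labels, and the toy figures are hypothetical.

```python
def least_favourable_stocks(ar_by_stock, model="MVLSVMMD", n=5):
    """Return the n stocks on which `model` has the smallest margin over baselines.

    ar_by_stock maps stock code -> {strategy name: AR}. For each stock the margin
    is the minimum of (model AR - baseline AR) over all baseline strategies.
    """
    margins = {}
    for stock, ars in ar_by_stock.items():
        margins[stock] = min(ars[model] - value
                             for name, value in ars.items() if name != model)
    # Smallest margin first: these are the stocks where the model looks weakest.
    return sorted(margins, key=margins.get)[:n]


# Hypothetical AR values for three stocks and two baselines (illustrative only).
toy_ar = {
    "A": {"MVLSVMMD": 0.50, "buy-and-hold": 0.40, "ARIMA": 0.60},
    "B": {"MVLSVMMD": 0.90, "buy-and-hold": 0.20, "ARIMA": 0.30},
    "C": {"MVLSVMMD": 0.30, "buy-and-hold": 0.25, "ARIMA": 0.10},
}
weakest_two = least_favourable_stocks(toy_ar, n=2)
```

A negative margin means at least one baseline beat the model on that stock, so such stocks are selected first.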
The three trading simulations above show that, compared with common trading strategies such as momentum trading and with prediction-based strategies such as the ARIMA and traditional SVM strategies, MVLSVM based on multiview heterogeneous data performs excellently in both profitability and risk control. Even when some prediction-based strategies fail to exceed the passive strategy of holding the SSE 50 index and the buy-and-hold strategy, MVLSVM achieves much better results than the passive strategies and the other benchmarks. In particular, when the dimension of the input data is reduced, that is, when the four market variables are replaced by one (the daily return), both predictive performance and profitability improve owing to the reduction in redundant information.
Conclusion
In this study, we propose a hybrid model for stock price prediction called MVLSVM. It combines multiview learning with a support vector machine to investigate the joint impact of financial news and market data on stock price movements. MVLSVM can fuse multiple data sources directly with the multiview learning algorithm and classify stock price fluctuations with a support vector machine, which enriches the information sources and reduces information loss in the fusion process.
In the experiment, we take 37 constituent stocks of the SSE 50 index as the research object and use unstructured financial news and structured market data as inputs to predict the price trend of each stock. By comparing MVLSVM with classic SVM models based on single-view and multiview data, we find that simply concatenating multiview heterogeneous data and feeding it to the model yields unsatisfactory results because of the differing characteristics of the distinct views. In contrast, the MVLSVM model can learn and minimize the inconsistency among multiple data sources and thus demonstrates outstanding performance in this situation. Furthermore, we aim to improve our model. Considering the importance of daily returns among the four market variables, we replace the four variables with daily return sequences to construct a new model and compare it with the ARIMA model and classic SVM models. MVLSVM based on news and daily return sequences significantly outperforms the baseline models, and its performance also surpasses that of MVLSVM based on news and the four market variables. This shows the important role of daily returns in market data and supports the many studies that use only stock returns for research.
In the robustness test, we examine the model’s ability to capture the joint impact of the multiview data within a certain period by setting the sliding windows of news and market data to 1–5 days. The results indicate that MVLSVM can capture the prompt impact of news on stock prices: the sliding window of news influences the prediction accuracy of MVLSVM, and the model based on 1-day news performs best. The comparison demonstrates that MVLSVM surpasses the benchmarks by at least 10% in accuracy, which is a meaningful improvement.
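To make the notion of a sliding window concrete, the sketch below builds look-back samples from a daily series for window lengths of 1 to 5 days. It is a generic illustration: the helper name and the toy series are hypothetical, and how the study aggregates news within a window is not specified here, so the example only covers a numeric series such as daily returns.

```python
def sliding_windows(series, window):
    """Pair each value with the `window` values that immediately precede it.

    Returns (features, targets) where features[t] is a list of length `window`
    and targets[t] is the value that follows it in the series.
    """
    features, targets = [], []
    for t in range(window, len(series)):
        features.append(series[t - window:t])
        targets.append(series[t])
    return features, targets


# Hypothetical daily returns (illustrative numbers only).
daily_returns = [0.01, -0.02, 0.03, 0.00, 0.01, -0.01]

# Build one data set per window length, mirroring the 1-5 day range of the test.
samples_by_window = {w: sliding_windows(daily_returns, w) for w in range(1, 6)}
```

Longer windows give each sample more history but leave fewer samples, which is one reason accuracy can vary with the window length.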
Finally, a series of trading strategies is constructed based on the prediction results of the two MVLSVM models. These strategies are compared with other prediction-based strategies built on single-view and multiview data, as well as with three common strategies: the buy-and-hold strategy, the randomly buy strategy, and the momentum trading strategy. When building trading strategies for a basket of stocks, a passive strategy of holding the SSE 50 index is also considered for comparative evaluation. The results show that the MVLSVM strategy has excellent profitability and risk-control performance in various scenarios. Moreover, its performance can be improved by replacing the four market variables with daily return sequences.
In summary, from the prediction perspective, the proposed MVLSVM model based on multiview heterogeneous data can predict stock price movement more accurately than other models. From the perspective of trading strategy, it can help investors gain higher profits and control risk better. However, some limitations remain. In this article, we consider only two information sources: market data and news. Future work can include more data sources in the model, such as online posts on social media and companies’ financial statements. In addition, when building the SVM and MVLSVM models, this study considers only two kernel functions, the linear and Gaussian kernels. In the future, the models can be constructed with additional kernel functions, such as the polynomial kernel and the sigmoid kernel. As the proposed model places no restrictions on the type of financial asset, we will further attempt to apply it to other financial assets.
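For reference, the four kernel functions named above have the following textbook forms. The sketch is a plain Python illustration; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for the example and are not settings reported in this study.

```python
import math


def _dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = <x, y>."""
    return _dot(x, y)


def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    return (gamma * _dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * _dot(x, y) + coef0)
```

The linear and Gaussian kernels are the two used in this study; the polynomial and sigmoid kernels are the candidates mentioned for future work.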
Availability of data and materials
The market data used in this article are available in the WIND database, https://www.wind.com.cn/, and the financial news is available in the Uqer database, https://uqer.datayes.com/.
Abbreviations
SVM: Support vector machine
SVMMD: SVM based on market data
SVMFN: SVM based on financial news
SVMMV: SVM based on news and market data
MVLSVM: Model integrating multiview learning with SVM
MVLSVMMD: MVLSVM model based on news and market data
MVLSVMDR: MVLSVM model based on news and daily returns
AR: Annual return rate
ASR: Annual Sharpe ratio
MDD: Maximum drawdown
AV: Annual volatility
CD: Critical difference
References
Bildirici M, Ersin ÖÖ (2009) Improving forecasts of GARCH family models with the artificial neural networks: an application to the daily returns in Istanbul stock exchange. Expert Syst. Appl. 36(4):7355–7362
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with cotraining. In: Proceedings of the eleventh annual conference on computational learning theory, pp 92–100
Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. Wiley, Hoboken
Buckley C, Salton G, Allan J, and Singhal A (1995) Automatic query expansion using SMART: TREC 3. NIST special publication sp, pp 69–69
Cao L, Tay F (2001) Financial forecasting using support vector machines. Neural Comput Appl 10(2):184–192
Cao LJ, Tay FEH (2003) Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans Neural Netw 14(6):1506–1518
Ceci M, Pio G, Kuzmanovski V, Džeroski S (2015) Semisupervised multiview learning for gene network reconstruction. PLoS ONE 10(12):e0144031
Chen K, Zhou Y, and Dai F (2015) A LSTMbased method for stock returns prediction: a case study of china stock market. In: 2015 IEEE international conference on big data (big data), pp 2823–2824. IEEE
Collins M and Singer Y (1999) Unsupervised models for named entity classification. In: 1999 joint SIGDAT conference on empirical methods in natural language processing and very large corpora
Cortes C, Vapnik V (1995) Supportvector networks. Mach Learn 20(3):273–297
Dasgupta S, Littman ML, and Mcallester DA (2001) PAC generalization bounds for cotraining. In: Advances in neural information processing systems 14 [neural information processing systems: natural and synthetic, NIPS 2001, 3–8 Dec 2001, Vancouver, British Columbia, Canada], pp 375–382
de Sa VR (1994) Learning classification with unlabeled data. Morgan Kaufmann Publishers, Burlington, pp 112–112
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Deng N, Tian Y, Zhang C (2012) Support vector machines: optimization based theory, algorithms, and extensions. CRC Press, Boca Raton
Deng S, Mitsubuchi T, Sakurai A (2014) Stock price change rate prediction by utilizing social network activities. Sci World J. https://doi.org/10.1155/2014/861641
Deng S, Mitsubuchi T, Shioda K, Shimada T, and Sakurai A (2011a) Combining technical analysis with sentiment analysis for stock price prediction. In: 2011 IEEE ninth international conference on dependable, autonomic and secure computing, pp 800–807. IEEE
Deng S, Mitsubuchi T, Shioda K, Shimada T, and Sakurai A (2011b) Multiple kernel learning on time series data and social networks for stock price prediction. In: 2011 10th international conference on machine learning and applications and workshops, vol 2, pp 228–234. IEEE
Dyck A , Zingales L (2003) The media and asset prices. Technical report, Working Paper, Harvard Business School, Harvard
Fischer T, Krauss C (2018) Deep learning with long shortterm memory networks for financial market predictions. Eur J Oper Res 270(2):654–669
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064
Gidofalvi G, Elkan C (2001) Using news articles to predict stock price movements. University of California, San Diego, p 17
Hammad AAA, Ali SMA, Hall EL (2007) Forecasting the Jordanian stock price using artificial neural network. Intell Eng Syst Through Artif Neural Netw 17:1–6
Han Z, Zhang C, Fu H, Zhou JT (2022) Trusted multiview classification with dynamic evidential fusion. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2022.3171983
Jarrett JE, Schilling J (2008) Daily variation and predicting stock market returns for the frankfurter börse (stock market). J Bus Econ Manag 9(3):189–198
Kanwar N (2019) Deep reinforcement learningbased portfolio management. PhD thesis, The University of Texas at Arlington
Kesavan M, Karthiraman J, Ebenezer RT, Adhithyan S (2020) Stock market prediction with historical time series data and sentimental analysis of social media data. In: 2020 4th international conference on intelligent computing and control systems (ICICCS)
Kim KJ (2003) Financial time series forecasting using support vector machines. Neurocomputing 55(1/2):307–319
Kolarik T, Rudorfer G (1994) Time series forecasting using neural networks. ACM Sigapl Apl Quote Quad 25(1):86–94
Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Language models for financial news recommendation. In: Proceedings of the ninth international conference on Information and knowledge management, pp 389–396
Li X, Xie H, Wang R, Cai Y, Cao J, Wang F, Min H, Deng X (2016) Empirical analysis: stock market prediction via extreme learning machine. Neural Comput Appl 27(1):67–78
Li X, Wu P, Wang W (2020) Incorporating stock prices and news sentiments for stock market prediction: a case of Hong Kong. Inf Process Manag 57:102212
Li H, Dagli CH, Enke D (2007) Shortterm stock market timing prediction under reinforcement learning schemes. In: 2007 IEEE international symposium on approximate dynamic programming and reinforcement learning, pp 233–240. IEEE
Lin CT, Wang YK, Huang PL, Shi Y, Chang YC (2022) Spatialtemporal attentionbased convolutional network with text and numerical information for stock price prediction. Neural Comput Appl. https://doi.org/10.1007/s00521022072340
Li X, Wang C, Dong J, Wang F, Deng X, Zhu S (2011) Improving stock market prediction by integrating both market news and stock prices. In: International conference on database and expert systems applications, pp 279–293. Springer, Berlin
Long W, Lu Z, Cui L (2019) Deep learningbased feature engineering for stock price movement prediction. KnowlBased Syst 164:163–173
Long W, Song L, Tian Y (2019) A new graphic kernel method of stock price trend prediction based on financial news semantic and structural similarity. Expert Syst Appl 118:411–424
Lv B, Jiang Y, Li Q (2021) Prediction of shortterm stock price trend based on multiview RBF neural network. Intell Neuroscience. https://doi.org/10.1155/2021/8495288
Meesad P and Thanh H (2014) Stock market trend prediction based on text mining of corporate web and time series data. J Adv Comput Intell Intell Inf. https://doi.org/10.20965/jaciii.2014.p0022
Mittermayer MA, Knolmayer GF (2006) Newscats: a news categorization and trading system. In: Sixth international conference on data mining (ICDM’06), pp 1002–1007. IEEE
Mohan S, Mullapudi S, Sammeta S, Vijayvergia P, Anastasiu DC (2019) Stock price prediction using news sentiment analysis. In: 2019 IEEE Fifth international conference on big data computing service and applications (BigDataService), pp 205–208. IEEE
Ronaghi F, Salimibeni M, Naderkhani F, Mohammadi A (2022) COVID19HPSMP: COVID19 adopted hybrid and parallel deep information fusion framework for stock price movement prediction. Expert Syst Appl 187:115879
Salton G, Buckley C (1988) Termweighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Schumaker RP, Chen H (2009) Textual analysis of stock market prediction using breaking financial news: the AZFin text system. ACM Trans Inf Syst (TOIS) 27(2):1–19
Shiller RJ (2015) Irrational exuberance. Princeton University Press, Princeton
Shynkevich Y, McGinnity TM, Coleman S, Belatreche A (2015a) Predicting stock price movements based on different categories of news articles. In: 2015 IEEE symposium series on computational intelligence, pp 703–710. IEEE
Shynkevich Y, McGinnity TM, Coleman S, Belatreche A (2015b) Stock price prediction based on stockspecific and subindustryspecific news articles. In: 2015 International joint conference on neural networks (IJCNN), pp 1–8. IEEE
Suhail K, Sankar S, Kumar AS, Nestor T, Soliman NF, Algarni AD, ElShafai W, Abd ElSamie FE (2022) Stock market trading based on market sentiments and reinforcement learning. CMCComput Mater Continua 70(1):935–950
Sun K et al (2017) Equity return modeling and prediction using hybrid ARIMAGARCH model. Int J Financ Res 8(3):154–161
Sun S, Yu M, ShaweTaylor J, Mao L (2022) Stabilitybased PACbayes analysis for multiview learning algorithms. Inf Fusion 86:76–92
Sun L, Ceran B, Ye J (2010) A scalable twostage approach for a class of dimensionality reduction techniques. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 313–322
Tan Z, Quek C, Cheng PY (2011) Stock trading with cycles: a financial application of ANFIS and reinforcement learning. Expert Syst Appl 38(5):4741–4755
Vo N, Ślepaczuk R (2022) Applying hybrid ARIMASGARCH in algorithmic investment strategies on S &P500 index. Entropy 24(2):158
Wang H, Zhou Z (2021) Multiview learning based on maximum margin of twin spheres support vector machine. J Intell Fuzzy Syst 40(6):11273–11286
Wang Y, Liu H, Guo Q, Xie S, Zhang X (2019) Stock volatility prediction by hybrid neural network. IEEE Access 7:154524–154534
Wang F, Liu L, Dou C (2012) Stock market volatility prediction: a serviceoriented multikernel learning approach. In: 2012 IEEE ninth international conference on services computing, pp 49–56. IEEE
White H (1988) Economic prediction using neural networks: the case of IBM daily stock returns. In: ICNN, vol 2, pp 451–458
Wüthrich B, Permunetilleke D, Leung S, Lam W, Cho V, Zhang J (1998) Daily prediction of major stock indices from textual www data. Hkie Trans 5(3):151–156
Xiao Y, Li X, Liu B, Zhao L, Kong X, Alhudhaif A, Alenezi F (2022) Multiview support vector ordinal regression with data uncertainty. Inf Sci 589:516–530
Xu C, Tao D, Xu C (2015) Multiview learning with incomplete views. IEEE Trans Image Process 24(12):5812–5825
Xu C, Tao D, Xu C (2013) A survey on multiview learning. arXiv preprint arXiv:1304.5634
Yan X, Hu S, Mao Y, Ye Y, Yu H (2021) Deep multiview learning methods: a review. Neurocomputing 448:106–129
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: ICML, vol 97, pp 35. CiteSeer
Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd annual meeting of the association for computational linguistics, pp 189–196
Zhang T, Liu S, Xu C, Lu H (2010) Human action recognition via multiview learning. In: Proceedings of the second international conference on internet multimedia computing and service, pp 23–28
Acknowledgements
We thank the reviewers for their valuable comments, which helped improve the manuscript.
Funding
This research was partly supported by the National Natural Science Foundation of China (Nos. 71771204, 72231010) and the Fundamental Research Funds for the Central Universities (No. E0E48946X2).
Author information
Contributions
All the authors were involved in the research that led to the article and in its writing. All the authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Long, W., Gao, J., Bai, K. et al. A hybrid model for stock price prediction based on multiview heterogeneous data. Financ Innov 10, 48 (2024). https://doi.org/10.1186/s4085402300519w