Data description
The dataset utilized in this study includes the daily direction (up or down) of the closing price of the SPDR S&P 500 ETF (ticker symbol: SPY) as the output, along with 60 financial and economic factors as input features. The daily data are collected from the 2518 trading days between June 1, 2003, and May 31, 2013. The 60 potential features can be divided into 10 groups, including the SPY return for the current day and the three previous days, the relative difference in percentage of the SPY return, the exponential moving averages of the SPY return, Treasury bill (T-bill) rates, certificate of deposit rates, financial and economic indicators, term and default spreads, exchange rates between the USD and four other currencies, the return of seven major world indices (other than the S&P 500), the SPY trading volume, and the return of eight large capitalization companies within the S&P 500 (which is a market cap weighted index and driven by the larger capitalization companies within the index). These features, which are a mixture of those identified by various researchers (Cao & Tay, 2001; Thawornwong & Enke, 2004; Armano, Marchesi, & Murru, 2005; Enke & Thawornwong, 2005; Niaki & Hoseinzade, 2013; Zhong & Enke, 2017a, 2017b), are included as long as their values are released without a gap of more than five consecutive trading days during the study period. The details of these 60 financial and economic factors, including their descriptions, sources, and calculation formulas, are given in Table 10 of the Appendix.
Data preprocessing
Data normalization
Given that the data used in this study cover 60 factors over 2518 trading days, there invariably exist missing values, mismatched samples, and outliers. Because data quality can make a meaningful difference in prediction accuracy, preprocessing the raw data is necessary. First, samples collected on days other than the 2518 trading days during the 10-year period are deleted. Next, if n values of any variable (column) are missing consecutively, the average of the n existing values on each side of the gap is used to fill in the n missing values. A simple but classical statistical principle is then employed to detect possible outliers (Navidi, 2011), which are adjusted using a method similar to that of Cao & Tay (2001). Specifically, for each of the 60 factors (columns) in the data, any value beyond the interval (Q1 − 1.5 ∗ IQR, Q3 + 1.5 ∗ IQR) is regarded as a possible outlier and replaced by the closer boundary of the interval. Here, Q1 and Q3 are the first and third quartiles, respectively, of all the values in that column, and IQR = Q3 − Q1 is the interquartile range of those values. The symmetry of all adjusted and cleaned columns can be checked using histograms or statistical tests. For example, Figure 1 includes the histograms of the factor SPYt (i.e., the current daily SPY return) before and after data preprocessing (Zhong & Enke, 2017a). It can be observed that the outliers are removed and symmetry is achieved after the adjustments.
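For concreteness, a minimal Python sketch of this cleaning step is given below. It assumes each factor is held as a pandas Series (column) of a DataFrame named raw; the names raw, fill_gaps, and clip_outliers are illustrative and not those of the original implementation.

import pandas as pd

def fill_gaps(col: pd.Series) -> pd.Series:
    # For each run of n consecutive missing values, fill all n of them with
    # the average of the (up to) n observed values on each side of the gap.
    s = col.copy()
    isna = s.isna().to_list()
    i = 0
    while i < len(s):
        if isna[i]:
            j = i
            while j < len(s) and isna[j]:
                j += 1
            n = j - i
            neighbours = pd.concat([s.iloc[max(0, i - n):i], s.iloc[j:j + n]]).dropna()
            s.iloc[i:j] = neighbours.mean()
            i = j
        else:
            i += 1
    return s

def clip_outliers(col: pd.Series) -> pd.Series:
    # Tukey fences: values outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR) are treated as
    # possible outliers and replaced by the closer boundary of the interval.
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Applied column by column, e.g.: cleaned = raw.apply(fill_gaps).apply(clip_outliers)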
In this study, the ANNs and DNNs for pattern recognition are used as the classifiers. At the start of the classification mining procedure, the cleaned data are sequentially partitioned into three parts: training data (the first 70% of the data), validation data (the next 15%, i.e., the last 15% of the first 85% of the data), and testing data (the final 15% of the data).
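A sequential (non-shuffled) split of this form can be sketched as follows, assuming the cleaned data are stored in a DataFrame ordered by trading day; the helper name sequential_split is illustrative.

import pandas as pd

def sequential_split(cleaned: pd.DataFrame):
    # First 70% of the rows for training, the next 15% (the last 15% of the
    # first 85%) for validation, and the final 15% for testing.
    n = len(cleaned)
    train_end, valid_end = int(round(0.70 * n)), int(round(0.85 * n))
    return cleaned.iloc[:train_end], cleaned.iloc[train_end:valid_end], cleaned.iloc[valid_end:]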
Data transformation using PCA
As one of the earliest multivariate techniques, PCA aims to construct a low-dimensional representation of the data that preserves as much of the variance-covariance structure of the data as possible (Jolliffe, 1986). To achieve this goal, a linear mapping W that maximizes \( {\boldsymbol{W}}^{\boldsymbol{T}}\operatorname{var}\left(\boldsymbol{X}\right)\boldsymbol{W} \), where var(X) is the variance-covariance matrix of the data X, needs to be created. Given that W is formed by the principal eigenvectors of var(X), PCA reduces to the eigenproblem var(X)W = λW, where λ represents the eigenvalues of var(X). It is also known that applying PCA to the raw data X rather than to standardized data tends to emphasize variables with high variances over variables with very low variances, especially when the variables are measured in inconsistent units. In this study, not all variables are measured in the same units. Thus, PCA is applied to the standardized version of the cleaned data X. The specific procedure is given below.
First, the linear mapping W∗ is sought such that
$$ corr\left(\boldsymbol{X}\right){\boldsymbol{W}}^{\ast }={\boldsymbol{\lambda}}^{\ast}{\boldsymbol{W}}^{\ast }, $$
(1)
where corr(X) is the correlation matrix of the data X. Assume that the data X has the format X = (X1 X2⋯XM); then corr(X) = ρ is an M × M matrix, where M is the dimensionality of the data, and the ijth element of the correlation matrix is
$$ corr\left({\boldsymbol{X}}_{\boldsymbol{i}},{\boldsymbol{X}}_{\boldsymbol{j}}\right)={\rho}_{ij}=\frac{\sigma_{ij}}{\sigma_i{\sigma}_j}, $$
where
$$ {\sigma}_{ij}=\mathit{\operatorname{cov}}\left({\boldsymbol{X}}_{\boldsymbol{i}},{\boldsymbol{X}}_{\boldsymbol{j}}\right),{\sigma}_i=\sqrt{\mathit{\operatorname{var}}\left({\boldsymbol{X}}_{\boldsymbol{i}}\right)},{\sigma}_j=\sqrt{\mathit{\operatorname{var}}\left({\boldsymbol{X}}_{\boldsymbol{j}}\right)},\mathrm{and}\;i,j=1,2,\dots, M. $$
(2)
Let \( {\boldsymbol{\lambda}}^{\ast}={\left\{{\lambda}_i^{\ast}\right\}}_{i=1}^M \) denote the eigenvalues of the correlation matrix corr(X) such that \( \kern0.5em {\lambda}_1^{\ast}\ge {\lambda}_2^{\ast}\ge \cdots \ge {\lambda}_M^{\ast } \) and the vectors \( {\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}}=\left({e}_{i1}\ {e}_{i2}\cdots {e}_{iM}\right) \) denote the eigenvectors of corr(X) corresponding to the eigenvalues \( {\lambda}_i^{\ast } \), i = 1, 2, … , M. The elements of these eigenvectors can be proven to be the coefficients of the principal components.
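This first step amounts to an eigen-decomposition of the correlation matrix. A minimal NumPy sketch is shown below, assuming X is an N × M array of the cleaned factor values (rows are trading days, columns are factors); the name correlation_eig is illustrative.

import numpy as np

def correlation_eig(X: np.ndarray):
    # Eigen-decomposition of the correlation matrix corr(X) = rho.
    rho = np.corrcoef(X, rowvar=False)          # M x M correlation matrix
    eigvals, eigvecs = np.linalg.eigh(rho)      # eigh, since rho is symmetric
    order = np.argsort(eigvals)[::-1]           # sort lambda*_1 >= ... >= lambda*_M
    return eigvals[order], eigvecs[:, order]    # column i of eigvecs is e_i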
Secondly, the standardized data are presented as
$$ \boldsymbol{Z}=\left({\boldsymbol{Z}}_{\mathbf{1}}\ {\boldsymbol{Z}}_{\mathbf{2}}\cdots {\boldsymbol{Z}}_{\boldsymbol{M}}\right), $$
where
$$ {\boldsymbol{Z}}_{\boldsymbol{w}}^{\boldsymbol{T}}=\left({Z}_{1w}\ {Z}_{2w}\cdots {Z}_{Nw}\right),\kern0.5em {Z}_{vw}=\frac{X_{vw}-{\mu}_w}{\sigma_w},\kern0.5em v=1,2,\dots, N,\ \mathrm{and}\;w=1,2,\dots, M, $$
(3)
with N denoting the number of observations (trading days) and \( {\mu}_w \) and \( {\sigma}_w \) the sample mean and standard deviation of the wth factor. The principal components of the standardized data can then be written as
$$ {\boldsymbol{Y}}_{\boldsymbol{i}}={\sum}_{j=1}^M{e}_{ij}{\boldsymbol{Z}}_{\boldsymbol{j}},i=1,2,\dots, M $$
(4)
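Given the eigenvectors from the previous step, the principal component scores of Eq. (4) can be computed as in the following sketch, where the standardization of Eq. (3) is applied column-wise; the name principal_components is illustrative.

import numpy as np

def principal_components(X: np.ndarray, eigvecs: np.ndarray) -> np.ndarray:
    # Standardize each factor as in Eq. (3) and project onto the eigenvectors
    # as in Eq. (4): the ith column of the result holds Y_i = sum_j e_ij * Z_j.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return Z @ eigvecs                          # N x M matrix of scores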
Using the spectral decomposition theorem,
$$ \boldsymbol{\rho} =\sum \limits_{i=1}^M{\lambda}_i^{\ast }{\boldsymbol{e}}_{\boldsymbol{i}}{\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}} $$
(5)
and the fact that \( {\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{e}}_{\boldsymbol{i}}=\sum \limits_{j=1}^M{e}_{ij}^2=1 \), together with the fact that different eigenvectors are orthogonal to each other such that \( {\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{e}}_{\boldsymbol{j}}=0 \) for i ≠ j, we can prove that
$$ \mathit{\operatorname{var}}\left({\boldsymbol{Y}}_{\boldsymbol{i}}\right)=\sum \limits_{k=1}^M\sum \limits_{l=1}^M{e}_{ik} corr\left({\boldsymbol{X}}_{\boldsymbol{k}},{\boldsymbol{X}}_{\boldsymbol{l}}\right){e}_{il}=\kern0.5em {\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}}\boldsymbol{\rho} {\boldsymbol{e}}_{\boldsymbol{i}}={\lambda}_i^{\ast}\kern0.5em $$
(6)
and
$$ \mathit{\operatorname{cov}}\left({\boldsymbol{Y}}_{\boldsymbol{i}},{\boldsymbol{Y}}_{\boldsymbol{j}}\right)=\sum \limits_{k=1}^M\sum \limits_{l=1}^M{e}_{ik} corr\left({\boldsymbol{X}}_{\boldsymbol{k}},{\boldsymbol{X}}_{\boldsymbol{l}}\right){e}_{jl}={\boldsymbol{e}}_{\boldsymbol{i}}^{\boldsymbol{T}}\boldsymbol{\rho} {\boldsymbol{e}}_{\boldsymbol{j}}=0,\kern0.5em i\ne j. $$
(7)
That is, the variance of the ith (largest) principal component is equal to the ith largest eigenvalue, and the principal components are mutually uncorrelated.
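These two properties can be verified numerically, as in the sketch below, which assumes Y and the sorted eigenvalues come from the earlier sketches; check_pc_properties is an illustrative name.

import numpy as np

def check_pc_properties(Y: np.ndarray, eigvals: np.ndarray) -> None:
    # Sample covariance matrix of the principal component scores: its diagonal
    # should match the sorted eigenvalues (Eq. 6), and its off-diagonal entries
    # should be numerically zero (Eq. 7).
    cov_Y = np.cov(Y, rowvar=False)
    assert np.allclose(np.diag(cov_Y), eigvals)
    assert np.allclose(cov_Y - np.diag(np.diag(cov_Y)), 0.0, atol=1e-6)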
In summary, the principal components can be written as linear combinations of all the factors, with the corresponding coefficients equal to the elements of the eigenvectors. Different numbers of principal components explain different proportions of the variance-covariance structure of the data, and the eigenvalues can be used to rank the eigenvectors according to how much of the data variation each principal component captures.
Theoretically, the information loss due to reducing the dimensionality of the data space from M to k is insignificant if the proportion of the variation explained by the first k principal components is large enough. In practice, the chosen principal components should be those that best explain the data while simplifying the data structure as much as possible.
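For illustration, selecting the number of retained components by a cumulative explained-variance threshold can be sketched as follows; the 0.95 target and the name choose_k are assumptions for the example, not values or names prescribed in this study.

import numpy as np

def choose_k(eigvals: np.ndarray, target: float = 0.95) -> int:
    # Cumulative proportion of the total variation explained by the first k
    # principal components; keep the smallest k that reaches the target.
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(explained, target)) + 1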