Weighted-indexed semi-Markov model: calibration and application to financial modeling

We address the calibration issues of the weighted-indexed semi-Markov chain (WISMC) model applied to high-frequency financial data. Specifically, we propose to automate the discretization of the price returns and the volatility index by using four different approaches, two based on statistical quantities, namely, the quantile and sigma discretization, and two derived by the application of two popular machine learning algorithms, namely the k-means and Gaussian mixture model (GMM). Moreover, by comparing the Bayesian information criterion (BIC) scores, the GMM approach allows for the selection of the number of states of returns and index. An application to Bitcoin prices at 1-min and 1-s intervals shows the validity and usefulness of the proposed discretization approaches. In particular, GMM discretization is well suited for high-frequency returns, whereas the quantile approach works better for low-frequency intervals. Finally, by comparing the results of the Monte Carlo simulation, we show that the WISMC model, applied with the proposed discretization, can reproduce the long-range serial correlation of the squared returns, which is typical of the financial markets and, in particular, the cryptocurrency market.

Other authors have employed the semi-Markov process to model the limit order book (Swishchuk et al. 2017).
However, the approach that has reported the best results is the weighted-indexed semi-Markov chain (WISMC) model by D'Amico and Petroni (2012b) and its multivariate extensions (D'Amico and Petroni 2018, 2021). The model has been shown to reproduce important stylized facts of financial time series, such as first-passage-time distributions and the persistence of volatility. Moreover, it has also been employed in other applications. Specifically, D'  applied the WISMC approach to model financial volumes, whereas D'Amico et al. (2020b) employed the model to study some risk measures in a high-frequency financial setting. In other fields, a simple indexed version of the model has been applied to analyze wind-power generation (D'Amico et al. 2020a).
The WISMC model can be regarded as a generalization of the semi-Markov chain model. Whereas the latter employs two random variables, namely, the observed price returns and the time between each price change, the WISMC includes a third variable that accounts for the history of the price returns and of the times between them, thus allowing for better reproduction of the observed quantities. However, in their original paper, D'Amico and Petroni (2012b) highlighted that applying the WISMC model to financial time series requires calibration of several parameters involved in the model. Mainly, we have to deal with converting continuous returns into a discrete state space. Moreover, the inclusion of an index that captures the history of the process requires further discretization. In D'Amico and Petroni (2012b), both conversions were based on visual inspection of the distribution of both processes, thus imposing a subjective choice. D'Amico et al. (2019) addressed the partition of the state space of an indexed Markov chain employing a change point approach.
In this study, we explore the possibility of automating the discretization of both price and index processes by testing the effectiveness of two simple discretizations, one based on quantiles and the other based on the returns standard deviation, and two algorithms taken from the machine learning literature, namely, the k-means and Gaussian mixture model (GMM). We included two machine learning algorithms because clustering and feature selection are two important research areas in applied financial research, especially given the complex distribution of financial data, and their respective literature is rapidly expanding. For example, Li et al. (2021) proposed an integrated cluster detection approach for financial applications, such as credit evaluation and fraud detection. Furthermore, Kou et al. (2021) employed machine learning algorithms to predict the bankruptcy of small and medium-sized enterprises (SMEs) using transactional data and payment network-based variables. Moreover, with automatic discretization, we can limit the discretion to the choice of the number of states. However, at the end of the paper, we show that using the GMM approach allows us to find the optimal number of states for both the returns and the index based on the Bayesian information criterion (BIC).
In addition, considering that the WISMC model has only been tested on stock markets, we apply the model to the cryptocurrency market in this study, particularly to the most recent Bitcoin prices from the Binance market, which is one of the most active cryptocurrency markets. The aim is to capture the typical stylized facts of this type of financial market, specifically the extremely high volatility inherent in Bitcoin prices, its high persistence, heavy tail behavior, and vulnerability to speculative bubbles. For example, Hafner (2020) found evidence of bubbles and extreme volatility by testing 11 of the largest cryptocurrencies. Meanwhile, Bariviera et al. (2017) analyzed the stylized facts of the Bitcoin market and found long memory in returns time series, indicating price predictability and market inefficiency. Moreover, Tan et al. (2020) assessed the volatility of 102 cryptocurrencies using Garman-Klass volatility measures, demonstrating the complexity of understanding such assets.
The application of the WISMC model to the Bitcoin market shows that the algorithms are useful in the discretization of both the returns process and index. More specifically, the quantile approach works better for lower-frequency data, whereas the GMM approach is better suited for higher-frequency returns. In addition, the BIC score of the GMM approach allows for the automation of choosing the number of states.
The remainder of this paper is organized as follows. "The model" section describes the WISMC model's theory, whereas "Discretization algorithms" section introduces four discretization approaches. "Application to financial data" section explores the challenges of the calibration process and shows the data along with the discretization results. Finally, "Conclusion" section concludes the paper.

The model
First, we introduce the semi-Markov processes from which the weighted-indexed semi-Markov process is derived. They were first proposed independently by Levy (1954) and Smith (1955) and further studied by Pyke (1961a, 1961b) and Çinlar (1975). Subsequently, they found applications in many fields, from industrial to financial markets, and the theory has been further implemented and expanded (see, e.g., Vasileiou and Vassiliou 2006; Swishchuk et al. 2017; Pasricha et al. 2020). For an in-depth analysis, we refer the readers to Janssen and Manca (2006) and Barbu and Limnios (2009).
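Semi-Markov processes can be viewed as a generalization of renewal processes and Markov chains. Let us consider a finite state space $E = \{1, \dots, k\}$ and a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The two random variables

$$X_n : \Omega \to E, \qquad T_n : \Omega \to \mathbb{R}_+, \tag{1}$$

where $n \in \mathbb{N}$ and $0 = T_0 < T_1 < T_2 < \dots$, form a Markov renewal process $(X, T)$ with state space $E \times \mathbb{R}_+$ if

$$\mathbb{P}[X_{n+1} = j,\ T_{n+1} - T_n \le t \mid X_0, \dots, X_n = i;\ T_0, \dots, T_n] = \mathbb{P}[X_{n+1} = j,\ T_{n+1} - T_n \le t \mid X_n = i] =: Q(i, j, t). \tag{2}$$

Assuming that the process is temporally homogeneous, the probability is independent of $n$, and $Q$ is called a semi-Markov kernel. In general, for each pair $(i, j)$,

$$P(i, j) = \lim_{t \to \infty} Q(i, j, t), \tag{3}$$

where $P(i, j) \ge 0$ and $\sum_{j \in E} P(i, j) = 1$, $i, j \in E$. The quantities $P(i, j)$ are the transition probabilities of the Markov chain $\{X_n\}_{n \in \mathbb{N}}$ with state space $E$. Moreover, we can define the conditional waiting time distribution function as

$$G(i, j, t) = \mathbb{P}[T_{n+1} - T_n \le t \mid X_n = i,\ X_{n+1} = j], \tag{4}$$

which can be computed as

$$G(i, j, t) = \frac{Q(i, j, t)}{P(i, j)}, \tag{5}$$

with the convention that $G(i, j, t) = 1$ if $P(i, j) = 0$, and it can be proven that the increments $T_{n+1} - T_n$ are conditionally independent given the Markov chain $X_n$ (see, e.g., Çinlar 1975).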
In particular, when the state space E is composed of a single point, the increments are independent and identically distributed nonnegative random variables, and we obtain a renewal process.
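As an illustration of the quantities P(i, j) and G(i, j, t), the following minimal sketch estimates them by simple frequency counts from one observed path of states and jump times. The count-based estimators, the function names `estimate_kernel` and `waiting_cdf`, and the toy path are assumptions made for this example and are not taken from the cited references.

```python
from collections import Counter, defaultdict

def estimate_kernel(states, times):
    """Return empirical P(i, j) and the observed waiting times per transition (i, j)."""
    counts = Counter()
    waits = defaultdict(list)
    for n in range(len(states) - 1):
        pair = (states[n], states[n + 1])
        counts[pair] += 1
        waits[pair].append(times[n + 1] - times[n])
    visits = Counter()
    for (i, _), c in counts.items():
        visits[i] += c
    P = {(i, j): c / visits[i] for (i, j), c in counts.items()}
    return P, dict(waits)

def waiting_cdf(waits, i, j, t):
    """Empirical G(i, j, t); set to 1 by convention when (i, j) is never observed."""
    observed = waits.get((i, j))
    if not observed:
        return 1.0
    return sum(w <= t for w in observed) / len(observed)

# Toy path (made-up numbers): visited states X_n and jump times T_n.
states = [1, 2, 1, 1, 2, 1]
times = [0, 2, 3, 6, 7, 9]
P, waits = estimate_kernel(states, times)
print(P[(1, 2)], waiting_cdf(waits, 1, 2, 1))
```

The returned waiting times can also be resampled directly, which is convenient when simulating from an empirical waiting time distribution.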
We can now define the semi-Markov process with state space $E$ and transition kernel $Q(i, j, t)$ as a continuous-time parameter process:

$$Y_t = X_n \quad \text{for } T_n \le t < T_{n+1}. \tag{6}$$

This process can be considered the state at time $t$ of a system that moves from one state to another with random sojourn times in between (Çinlar 1975). The length of the sojourn interval $[T_n, T_{n+1})$ is a random variable with a distribution that depends on the state being visited, $X_n$, and the next state to be visited, $X_{n+1}$.
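The continuous-time process can be read off the embedded chain and the jump times with a sorted lookup. The sketch below is only an illustration: the function name `state_at` and the toy values are hypothetical, and returning the last visited state for times beyond the final observed jump is an assumption of the example.

```python
from bisect import bisect_right

def state_at(t, states, times):
    """Return the state X_n occupied at time t, i.e. the n with T_n <= t < T_{n+1}."""
    if t < times[0]:
        raise ValueError("t precedes the first jump time")
    n = bisect_right(times, t) - 1   # index of the last jump time not exceeding t
    return states[n]

# Toy embedded chain and jump times (made-up values).
states = ["a", "b", "c"]
times = [0, 2, 5]
print(state_at(1.9, states, times), state_at(2, states, times))
```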
The semi-Markov process is called so because it cannot be fully considered a Markovian process as it is not a memoryless process. In contrast, it follows the Markov property only at jump instants. In addition, when sojourn times are exponentially distributed, the semi-Markov process becomes a continuous-time Markov chain. Instead, we obtain a discrete-time Markov chain if we ignore time variables.
The semi-Markov process can be further extended by including the memory of the process using high-order semi-Markov processes (see, e.g., Limnios and Oprişan 2003; D'Amico et al. 2013). However, this method requires the estimation of several parameters. A more parsimonious model considers the dependence of the semi-Markov process on a third variable that accounts for the history of the process. This approach was initially considered in D'Amico (2011) and was further extended to financial applications in D'Amico and Petroni (2011, 2012b).
Let $U_n$ be a stochastic process with values in $\mathbb{R}$. This random variable represents the indexing process that stores the historical information of the semi-Markov process and can be expressed, as in D'Amico and Petroni (2021), through a function $f$ of the past states and jump times, where $f : E \times \mathbb{N} \times \mathbb{R} \to \mathbb{R}$ is a Borel measurable bounded function and $U_0$ is known and non-random. The size of the vector of parameters $\theta$ depends on the chosen function $f$.
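The process $Y_t$ is said to be a WISMC if, $\forall n \in \mathbb{N}$, the following equality holds true:

$$\mathbb{P}[X_{n+1} = j,\ T_{n+1} - T_n \le t \mid X_0, \dots, X_n = i;\ T_0, \dots, T_n;\ U_0, \dots, U_n = v] = \mathbb{P}[X_{n+1} = j,\ T_{n+1} - T_n \le t \mid X_n = i,\ U_n = v] =: Q(i, j, t, v), \tag{8}$$

where the function $Q$ is called the indexed semi-Markov kernel.
Condition (8) states that to assess the probability of the next state of the process, we only need knowledge of the last state $i$ and the last value of the indexing process $U_n$. Therefore, the triple process $\{X_n, T_n, U_n\}$ describes the system at any jump time $T_n$. Note that if the indexed semi-Markov kernel is constant in $v$, then it degenerates into a semi-Markov kernel, and the WISMC process becomes a semi-Markov process.
Moreover, for each pair $(i, j)$ and each value of the index, we have $Q(i, j, 0, v) = 0$ and

$$P(i, j, v) = \lim_{t \to \infty} Q(i, j, t, v). \tag{9}$$

The quantities $P(i, j, v)$ are the transition probabilities of the Markov chain $\{X_n\}_{n \in \mathbb{N}}$ with state space $E$. These differ from the probabilities in (3) because they depend on the index level. Moreover, the conditional waiting time distribution function includes dependence on the index level:

$$G(i, j, t, v) = \mathbb{P}[T_{n+1} - T_n \le t \mid X_n = i,\ X_{n+1} = j,\ U_n = v] = \begin{cases} \dfrac{Q(i, j, t, v)}{P(i, j, v)} & \text{if } P(i, j, v) \ne 0, \\ 1 & \text{if } P(i, j, v) = 0. \end{cases} \tag{10}$$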

Discretization algorithms
We encounter several calibration issues when applying the WISMC or semi-Markov model to financial data. The first step of the application is the discretization of the price returns, as the WISMC model is defined on a discrete state space, whereas the returns we observe in real life are continuous. In their application, D'Amico and Petroni (2012b) relied on an arbitrary discretization based on the visual observation of the returns histogram. Unfortunately, this approach cannot be used for automated routines. Therefore, we introduce four algorithms to deal with the discretization of price returns. The first two approaches are simple, as they are based on the statistical properties of returns. The first merely consists of splitting the observations into k quantiles, where k is the number of states. We refer to this approach as quantile discretization. The splitting points then define the edges of the states. Although this discretization is easy to implement, it may present some issues. For example, if we select a high number of quantiles when dealing with a highly leptokurtic distribution, which is typical of a financial series, observations with a high frequency, typically the zero return, might be distributed between two contiguous states, thus resulting in nonunique state edges. The second approach was proposed by De Blasis (2020) for the return series, and we refer to it as sigma discretization. The idea is to select the width of the states as the standard deviation of the observations. Then, based on the number of states and centering them on zero, that is, the null return, we build the edges of the states. If the number of states is odd, then the central bin contains all zero returns together with the smaller returns within a half standard deviation radius from the zero return.
Then, departing from this central state, the edges of the other bins are placed one standard deviation apart, with the extreme states extending to the minimum and maximum values of the returns. In the case of an even number of states, the central zero return state is omitted, and we have only positive and negative return states. 2 Table 1 shows the concept of both odd and even numbers of states. This approach is well designed to reproduce symmetric distributions of continuous returns, especially when choosing an odd number of states, as it can provide an immediate idea of the direction of returns and includes a portion of the market noise within the central bin.
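The two statistical discretizations can be sketched in a few lines. The code below is a minimal illustration rather than the implementation used in the paper: the function names, the use of the sample standard deviation, the default interpolation rule of `statistics.quantiles`, and the toy sample are all assumptions of the example. Returns that fall exactly on an edge are assigned to the upper state.

```python
import statistics
from bisect import bisect_right

def quantile_edges(returns, k):
    """Inner edges splitting the sample into k equal-frequency states."""
    return statistics.quantiles(returns, n=k)

def sigma_edges(returns, k):
    """Inner edges of k states, one standard deviation wide and centred on zero."""
    sigma = statistics.stdev(returns)
    # Odd k: central bin is [-sigma/2, sigma/2]; even k: zero itself is an edge.
    return [(m - (k - 2) / 2) * sigma for m in range(k - 1)]

def to_states(returns, edges):
    """Map each return to a state label 0..k-1."""
    return [bisect_right(edges, r) for r in returns]

def has_unique_edges(edges):
    """False when two contiguous states share the same edge."""
    return len(set(edges)) == len(edges)

# Toy sample dominated by zero returns (made-up values).
sample = [0.0] * 6 + [-0.2, -0.1, 0.1, 0.3]
print(quantile_edges(sample, 5), has_unique_edges(quantile_edges(sample, 5)))
print(sigma_edges(sample, 5), to_states(sample, sigma_edges(sample, 5)))
```

On this toy sample the 5-quantile edges contain the value zero twice, which reproduces the nonunique-edge problem described above for distributions with a large mass at the zero return.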
The other two discretization approaches employ two popular clustering algorithms: k-means and GMM. The k-means algorithm is a simple unsupervised algorithm developed independently by Sebestyen (1962) and MacQueen (1967). The idea is to partition the observations so that the within-cluster sum of squares is minimized using an iterative algorithm. 3 Once we define the number of clusters k, that is, the number of states of the WISMC model in our specific application, the algorithm returns the discretization by associating each continuous return with a specific state, thereby minimizing the variance within the clusters. The advantage of this approach is that it is completely endogenous and follows the empirical distribution of price returns. However, with many observations, the k-means algorithm can converge slowly. To speed up the algorithm, we use a variation called the mini-batch k-means introduced by Sculley (2010), which lowers the computational cost by using random samples of the full dataset, thus reducing the number of distances to compute at the cost of a lower quality of the clusters.
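To make the clustering step concrete, the following sketch runs plain Lloyd iterations on one-dimensional data and derives state edges as midpoints between adjacent centroids. It is deliberately the full-batch k-means, not the mini-batch variant of Sculley (2010); the deterministic initialization at evenly spaced order statistics, the function names, and the toy data are assumptions of the example.

```python
def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd iterations in one dimension; returns sorted centroids and labels."""
    data = sorted(values)
    # Deterministic start at evenly spaced order statistics (an assumption).
    centroids = [data[(2 * c + 1) * len(data) // (2 * k)] for c in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Empty clusters keep their previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    centroids.sort()
    labels = [min(range(k), key=lambda c: abs(x - centroids[c])) for x in values]
    return centroids, labels

def state_edges(centroids):
    """Edges between contiguous states: midpoints of adjacent centroids."""
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]

# Toy returns forming three well-separated groups (made-up values).
values = [-1.0, -0.9, -1.1, 0.0, 0.05, -0.05, 1.0, 0.9, 1.1]
centroids, labels = kmeans_1d(values, 3)
print(centroids, labels, state_edges(centroids))
```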
Because the k-means algorithm presents some limitations (see, e.g., Li et al. 2021), we include a fourth discretization performed using the GMM algorithm, which is based on the assumption that the observations are generated by a mixture of Gaussian distributions with unknown parameters. The first studies in this direction were proposed by Wolfe (1963) and Scott and Symons (1971), and the approach was further studied by many other authors (see, e.g., Banfield and Raftery 1993; Fraley and Raftery 2002). 4 Let us assume that the observations $\{z_1, \dots, z_t\}$ (i.e., the price returns) are realizations of a random variable $Z \in \mathbb{R}$ and the unobserved state labels $\{y_1, \dots, y_t\}$ are realizations of a random variable $Y \in E$. If we denote $g$ as the probability density function of $Z$, then the GMM is

$$g(z) = \sum_{i=1}^{k} \pi_i\, \phi(z; \theta_i), \tag{11}$$

where $\pi_i$ is the mixture proportion with the constraint $\sum_{i=1}^{k} \pi_i = 1$ and $\phi(z; \theta_i)$ is the Gaussian density with parameters $\theta_i = (\mu_i, \sigma_i)$, which are generally estimated using the expectation-maximization (EM) algorithm proposed by Dempster et al. (1977). One of the advantages of the GMM algorithm is that it allows us to select the optimal number of clusters based on the BIC.

2 The zero returns can be included in either the positive or negative state.
3 For a review of the k-means clustering methods, we refer the reader to Steinley (2006).
4 For a comprehensive review of the finite mixture clustering, we refer the reader to Bouveyron and Brunet-Saumard (2014) and Bouveyron et al. (2019).
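The following sketch shows how a univariate Gaussian mixture can be fitted by EM and scored with the BIC. It is an illustration under stated assumptions, not the routine used in the paper: the initialization, the fixed number of iterations, the floor on the standard deviations, the function names, and the synthetic sample are all hypothetical. The BIC is written in the convention where lower values are better, with 3k − 1 free parameters for k components.

```python
import math
import random

def normal_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_gmm_1d(data, k, n_iter=100, min_sigma=1e-6):
    """EM for a univariate k-component Gaussian mixture; returns (pi, mu, sigma, loglik)."""
    srt, n = sorted(data), len(data)
    mu = [srt[(2 * c + 1) * n // (2 * k)] for c in range(k)]   # spread-out start
    sigma = [(srt[-1] - srt[0]) / (2 * k) or 1.0] * k
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for z in data:
            w = [pi[c] * normal_pdf(z, mu[c], sigma[c]) for c in range(k)]
            s = sum(w) or 1e-300
            resp.append([x / s for x in w])
        # M-step: update proportions, means, and standard deviations.
        for c in range(k):
            nc = sum(r[c] for r in resp) or 1e-300
            pi[c] = nc / n
            mu[c] = sum(r[c] * z for r, z in zip(resp, data)) / nc
            var = sum(r[c] * (z - mu[c]) ** 2 for r, z in zip(resp, data)) / nc
            sigma[c] = max(math.sqrt(var), min_sigma)
    loglik = sum(
        math.log(sum(pi[c] * normal_pdf(z, mu[c], sigma[c]) for c in range(k)) or 1e-300)
        for z in data
    )
    return pi, mu, sigma, loglik

def bic(loglik, k, n):
    """BIC = p ln(n) - 2 ln(L) with p = 3k - 1 free parameters (lower is better)."""
    return (3 * k - 1) * math.log(n) - 2 * loglik

# Synthetic sample from two well-separated Gaussians (for illustration only).
random.seed(0)
sample = [random.gauss(-3, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]
scores = {k: bic(fit_gmm_1d(sample, k)[3], k, len(sample)) for k in (1, 2, 3)}
print(scores)
```

Each observation can then be assigned to the component with the largest posterior probability, which gives the discrete state.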

Application to financial data
The application to financial data requires the formalization of the functional form of the index $U_n(\theta)$. D'Amico and Petroni (2011) initially proposed using a moving average of the squared process, $(X_n)^2$. By taking the square of the returns, the authors introduced the dependence of the process dynamics on volatility, which is an observed stylized fact in financial markets. In a later study, the authors opted for an exponentially weighted moving average (EWMA) of the squared returns (D'Amico and Petroni 2012b). Using the EWMA changes the function $f$ to the form referred to as (12). The output values of the EWMA function in (12) are continuous. Therefore, similar to price returns, the index values need to be discretized into finite states using the discretization algorithms proposed in "Discretization algorithms" section.
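For intuition, an EWMA index of squared returns can be sketched as follows. The exact functional form of the index is not reproduced here, so the weighting scheme below is an assumption: each past squared return receives a weight equal to a decay parameter (named `lam` in the code) raised to the time elapsed since that observation, and the weights are normalized to sum to one. The function name and toy inputs are hypothetical.

```python
def ewma_index(returns, times, lam=0.97):
    """Normalized EWMA of squared returns, evaluated at each observation time.

    Assumed form: weight lam ** (T_n - T_m) on the squared return observed at T_m.
    """
    index = []
    for n in range(len(returns)):
        weights = [lam ** (times[n] - times[m]) for m in range(n + 1)]
        weighted = sum(w * r * r for w, r in zip(weights, returns))
        index.append(weighted / sum(weights))
    return index

# With lam = 1 all weights are equal and the index is a plain running mean of squares.
print(ewma_index([1.0, 2.0, 3.0], [0, 1, 2], lam=1.0))
print(ewma_index([1.0, 2.0], [0, 1], lam=0.5))
```

The quadratic cost of this direct form is acceptable for a sketch; a recursive update would be preferable on long high-frequency samples.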
Finally, to test the validity of the proposed approach for discretization, we performed a Monte Carlo simulation. We simulated a WISMC process using the following algorithm (D'Amico and Petroni 2012b):
1. Set $n = 0$, $X_0 = i$, $T_0 = 0$, $U_0 = v$, and the horizon time $T$.
2. Given $X_n$ and $U_n$, sample $X$ from $P(i, j, v)$ and set $X_{n+1} = X$.
3. Given $X_n$ and $X_{n+1}$, sample $W$ from $G(i, j, t, v)$ and set $T_{n+1} = T_n + W$.
4. Set $U_{n+1}$ using (7) and (12).
5. If $T_{n+1} \ge T$, stop; otherwise, set $n = n + 1$ and go to step 2.
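The simulation steps above can be sketched as a short loop. Everything passed to the function is a hypothetical stand-in: the dictionary `P` maps a (state, index state) pair to next-state probabilities, `waits` holds observed waiting times that are resampled as an empirical version of G, `update_index` replaces the index function of (7) and (12) with a simple recursion, and `index_state` discretizes the index with an arbitrary threshold.

```python
import random

def simulate_wismc(P, waits, update_index, index_state, x0, u0, horizon, seed=None):
    """Simulate (X_n, T_n, U_n) until the jump time reaches the horizon."""
    rng = random.Random(seed)
    xs, ts, us = [x0], [0], [u0]                       # step 1
    while True:
        i, v = xs[-1], index_state(us[-1])
        targets = list(P[(i, v)])
        probs = [P[(i, v)][s] for s in targets]
        j = rng.choices(targets, weights=probs)[0]     # step 2: next state
        w = rng.choice(waits[(i, j, v)])               # step 3: waiting time
        xs.append(j)
        ts.append(ts[-1] + w)
        us.append(update_index(xs, ts, us))            # step 4: new index value
        if ts[-1] >= horizon:                          # step 5
            return xs, ts, us

# Hypothetical toy inputs: three return states and two index states.
E = (-1, 0, 1)
calm = {-1: 0.2, 0: 0.6, 1: 0.2}
turbulent = {-1: 0.4, 0: 0.2, 1: 0.4}
P = {(i, v): (calm if v == 0 else turbulent) for i in E for v in (0, 1)}
waits = {(i, j, v): ([1, 2, 3] if v == 0 else [1]) for i in E for j in E for v in (0, 1)}

def update_index(xs, ts, us):
    return 0.9 * us[-1] + 0.1 * xs[-1] ** 2   # stand-in recursion, not the paper's index

def index_state(u):
    return int(u > 0.5)                       # arbitrary two-state threshold

xs, ts, us = simulate_wismc(P, waits, update_index, index_state, x0=0, u0=0.0, horizon=50, seed=1)
print(len(xs), ts[-1])
```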
To estimate the transition probability matrices $P(i, j, v)$ and conditional waiting time distribution $G(i, j, t, v)$, we refer the readers to Appendix B in D' . We then verify whether the simulated series reproduces the long-range serial correlation of the squared returns, which is typical of financial return series. We recall the autocorrelation function of the squared returns:

$$\Sigma(\tau) = \frac{\operatorname{Cov}\left(Y^2(t + \tau),\ Y^2(t)\right)}{\operatorname{Var}\left(Y^2(t)\right)}, \tag{13}$$

where $Y$ is the process of returns and $\tau$ is the time lag. We estimate $\Sigma(\tau)$ for the real and simulated returns and compute the root mean square error (RMSE) and mean absolute error (MAE) to compare the use of different parameter estimations. We tested the validity of the discretization algorithm on Bitcoin spot data sourced from the Binance public website. We specifically selected Bitcoin data because the cryptocurrency market is open 24/7; thus, there are no gaps in the time series. In addition, we chose the Binance exchange because it is the largest cryptocurrency exchange in the world and is less subject to market manipulation (De Blasis and Webb 2022).
Following the approach of D'Amico and Petroni (2012b), we sample the price returns at 1-min intervals using Bitcoin data from March 1, 2021, to February 28, 2022. In addition, we test the application on 1-s interval returns with data ranging from February 21, 2022, to February 28, 2022. The date ranges vary because we aim to have a roughly similar number of observations in both samples. The summary statistics of the percentage log-returns are reported in Table 2. We observe a zero return on average with a standard deviation of 0.116% and 0.0153% for the 1-min and 1-s intervals, respectively. Both return distributions appear to be symmetric and follow the typical financial return distribution, with high excess kurtosis and fat tails, as shown in Fig. 1. In addition, the 1-s distribution presents a very high frequency around the null return. As described in "Discretization algorithms" section, we discretize the continuous returns using four different approaches: quantile, sigma, k-means, and GMM discretization. The only discretion is left to the choice of the number of states, which, in our application, is set at three and five. The 4-state returns discretization is excluded from the analysis as an odd number of states would better follow the typical shape of the financial returns, which presents an almost symmetric distribution and a high frequency around the zero return. For space reasons, we report only the results of the 5-state discretization, which, for the sigma discretization, is identified by one central state representing the zero return surrounded by two positive and two negative states, corresponding to positive and negative returns, respectively. Table 3 lists the edges of each discretization bin for the four approaches. Panel A shows the discretization for the 1-min interval returns, whereas Panel B reports the edges of the bins for the 1-s interval returns.
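The comparison between real and simulated series rests on the sample autocorrelation of squared returns and on two error measures. A minimal sketch follows; the biased sample estimator (division by the full sample size at every lag), the function names, and the toy series are assumptions of the example.

```python
import math

def acf_squared(returns, max_lag):
    """Sample autocorrelation of squared returns for lags 1..max_lag."""
    sq = [r * r for r in returns]
    n = len(sq)
    mean = sum(sq) / n
    var = sum((s - mean) ** 2 for s in sq) / n   # must be non-zero
    return [
        sum((sq[t] - mean) * (sq[t + lag] - mean) for t in range(n - lag)) / (n * var)
        for lag in range(1, max_lag + 1)
    ]

def rmse(a, b):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mae(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy series (made-up values): alternating returns give a sign-alternating ACF.
real = [1.0, 0.0] * 50
simulated = [1.0, 0.0, 0.0, 1.0] * 25
acf_real = acf_squared(real, 10)
acf_sim = acf_squared(simulated, 10)
print(rmse(acf_real, acf_sim), mae(acf_real, acf_sim))
```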
Note that the quantile discretization in this latter case fails because there is no way to attribute the continuous returns to State 0, State −1, or State 1. Therefore, we excluded this case from the subsequent analysis.
The results of the return discretization are also presented in Fig. 2, which shows the histograms built from the bins defined in Table 3. Quantile discretization is excluded from the charts as it results in a flat histogram. All discretizations present the highest frequency around the zero return; however, only the sigma discretization is symmetric around the zero return by construction. The k-means and GMM discretizations of the 1-min returns appear to be asymmetric, whereas the resulting distributions are more balanced when using the 1-s returns. Moreover, the fourth state of the GMM discretization at the 1-min interval is minimal compared to the other states, which could result in a biased application. In this respect, we must highlight that the use of different discretizations leads to different distributions of the WISMC process $Y_t$, which could be in different states simultaneously for different discretizations.
Once the returns are filtered into discrete states, we compute the index using the EWMA function. This stage requires calibration of the weighting parameter $\lambda$ using the technique discussed in D'Amico and Petroni (2012b), that is, by minimizing the RMSE or MAE between the autocorrelation functions of the simulated and real squared returns. However, as reported by the authors and tested in our samples, the optimum is reached when $\lambda$ varies between 0.95 and 0.99, and the overall RMSE or MAE values do not change visibly within that range. Therefore, following D'Amico and Petroni (2012b), we fixed $\lambda = 0.97$ for all our analyses, focusing our results mainly on the discretization algorithms. Furthermore, we note that when $\lambda = 1$, the EWMA function reduces to the moving average index proposed in D'Amico and Petroni (2011). As stated earlier, the index has values in $\mathbb{R}$; therefore, it must be discretized like the returns. D'Amico and Petroni (2012b) discretize the index into five states, specifically low, medium-low, medium, medium-high, and high volatility, choosing manual bins based on the visual observation of the distribution. By contrast, we employ the discussed discretization algorithms. We exclude only the sigma approach because the distribution of the index is not always symmetrical. Moreover, we did not limit the index discretization to five states. Table 4 presents the RMSE values for comparing the simulated and real autocorrelation values of the WISMC process. The MAE values are not reported for space reasons; however, they are equivalent to the RMSE values. The table reports two combinations of returns/index discretization, that is, 3-state returns and 3-state index, and 5-state returns and 5-state index. The 3-state GMM discretization for 1-min interval returns is not reported, as the algorithm resulted in a 2-state discretization. Similarly, the 5-state quantile discretization for the 1-s interval has not been reported because of the ambiguity of the state attribution, as described previously.
The results show that quantile/quantile discretization is better suited for lower frequency intervals, whereas GMM/GMM or similarly GMM/k-means discretization works better at higher frequencies. Overall, the quantile/quantile with five states applied to the 1-min interval appears to be the best fit. In addition, we note that when using GMM discretization for the returns, the discretization of the index ceases to be relevant, leaving discretion over the choice of the algorithm. The results also show that the sigma and k-means discretization for the returns do not produce good results compared to the other two approaches.
In addition, to better understand the effect of the discretization approaches, we plotted the autocorrelation function of the simulated WISMC process using the best combinations from our results and compared it to the autocorrelation function of the observed data. Figure 3 clearly shows that the 5-state quantile/quantile discretization applied to the 1-min interval data performs much better than the 3-state quantile-quantile approach. However, we note a slight deviation between the simulated and real autocorrelation at low lags; more specifically, the simulated autocorrelation is underestimated up to the 20th lag. In contrast, the 3-state GMM/GMM discretization applied to 1-s interval data performs better than the 5-state GMM/k-means approach, which is the worst performer overall. In the GMM/GMM case, the simulated autocorrelations deviate from the real ones only for high lag values. Thus, this discretization better captures the short autocorrelation.
The presented results depend on the choice of the number of states for returns and index discretization. However, because one of the advantages of GMM discretization is the possibility of using the BIC score to choose the number of states, and considering that the GMM works well for higher frequencies, we automate the selection of the number of states using the BIC score and apply this methodology only to the 1-s interval returns. First, we compute the BIC score for the return discretization and choose the optimal number of states; then, given the selected number of states for the returns, we compute the BIC score for the index discretization. The state selection is shown in Fig. 4, where the top chart indicates the optimal number of states for the returns, and the bottom chart indicates the optimal number of states for the index. The return discretization clearly reports the best score for the 3-state GMM approach, and this result appears to be in line with the RMSE results, where the 3-state discretization performed better than the 5-state one. Therefore, we fixed the number of states for the returns to three and proceeded with selecting the number of states for the index. In this case, we cannot directly choose the optimal BIC because its values appear to decline as the number of states increases. Note that adding states will result in estimating additional parameters, such as transition probabilities and sojourn time distributions. Therefore, we employ the elbow method to select the optimal score. We observed a significant drop in the BIC score from two to three states, followed by another smaller drop when four states were reached. Subsequently, from four to nine states, the decrease is reduced. Thus, we can easily select a 4-state GMM discretization for the index as a good trade-off between improving model performance and reducing the number of parameters to be estimated. Figure 5 compares the autocorrelation functions of both the simulated and real WISMC processes between the 3-, 4-, 5-, and 9-state index discretization. In all cases, short-run autocorrelation was well-fitted by the simulated data. However, we note that the 4-state discretization performs slightly better than the 3-state one, but adding more states to the index discretization does not significantly improve the performance of the model.

Fig. 3 Autocorrelation function of the simulated WISMC process against the real process. The discretization combinations are: quantile 3-state returns, quantile 3-state index 1-min interval (top-left); GMM 3-state returns, GMM 3-state index 1-sec interval (top-right); quantile 5-state returns, quantile 5-state index 1-min interval (bottom-left); GMM 5-state returns, k-means 5-state index 1-sec interval (bottom-right)
Fig. 4 States selection based on the BIC scores. Returns discretization on top and index discretization at the bottom