Analysis of the cryptocurrency market using different prototype-based clustering techniques

Lorenzo, Luis; Arroyo, Javier

doi:10.1186/s40854-021-00310-9

Research
Open access
Published: 12 January 2022

Financial Innovation volume 8, Article number: 7 (2022)

Abstract

Since the emergence of Bitcoin, cryptocurrencies have grown significantly, not only in terms of capitalization but also in number. Consequently, the cryptocurrency market can be a conducive arena for investors, as it offers many opportunities. However, it is difficult to understand. This study aims to describe, summarize, and segment the main trends of the entire cryptocurrency market in 2018, using data analysis tools. Accordingly, we propose a new clustering-based methodology that provides complementary views of the financial behavior of cryptocurrencies and looks for associations between the clustering results and other factors that are not involved in clustering. Specifically, the methodology applies three different partitional clustering algorithms, each of which uses a different representation of cryptocurrencies: the yearly mean and standard deviation of the returns; the distribution of returns, which has not been applied to financial markets previously; and the time series of returns. Because each representation provides a different outlook of the market, we also examine the integration of the three clustering results, to obtain a fine-grained analysis of the main trends of the market. Finally, we analyze the association of the clustering results with other descriptive features of cryptocurrencies, including the age, technological attributes, and financial ratios derived from them. This helps to enhance the profiling of the clusters with additional descriptive insights, and to find associations with other variables.
Consequently, this study describes the whole market based on graphical information, and offers a scalable methodology that can be reproduced by investors who want to understand the main trends in the market quickly, and by those who look for cryptocurrencies with different financial performance. In our analysis of 2018, extended to 2019, we found that the market can typically be segmented into a few clusters (five or fewer) and that, even considering the intersections, the six most populated ones account for 75% of the market. Regarding the associations between the clusters and descriptive features, we find associations between some clusters and volume, market capitalization, and some financial ratios, which could be explored in future research.

Introduction

The cryptocurrency market comprises more than 4000 cryptocoins,^{Footnote 1} with over 800 trades per second, and more than 280 exchanges. It has become a huge new market in a very short time, considering that Bitcoin (Nakamoto 2009), the first peer-to-peer and decentralized digital currency, was proposed in 2008, and the first Bitcoin was mined in 2009. While cryptocurrencies were originally created to enable anonymous wire transfers and online purchases, they have become a powerful investment tool.

However, this new market is diverse. Cryptocurrencies with different technologies, purposes, and user bases coexist and form a highly heterogeneous market that is difficult to understand and manage for those seeking a good investment allocation.

As with other assets, the value of cryptocurrencies swings based on news events. However, cryptocurrencies have no physical assets or governments to back their value. Moreover, the cryptocurrency market is new, based on a still-developing technology, highly speculative, and small compared to other markets. Consequently, it is highly volatile, with large upswings, bubbles, and sudden market downturns.

Being so novel, big, diverse, and volatile, this market needs to be clearly understood. So far, several categorization efforts have been made. For example, the Cryptocompare website^{Footnote 2} analyzed over 200 cryptoassets according to regulatory aspects, level of decentralization, supply issuance, economic incentives, and others. Such a taxonomy is useful, even if it only covers approximately 5% of the cryptocurrencies existing at that time. As another example, Burniske and Tatar (2017) classify over 200 cryptocurrencies into three classes of assets based on traditional financial markets, namely capital, consumable/transformable, and store-of-value assets. However, this classification is highly subjective, because a cryptocurrency may often combine several of these classes. Furthermore, these approaches typically cover only a small fraction of the cryptocurrencies, namely the most important in terms of volume and popularity, and focus on qualitative aspects or aspects that change little over time.

A different approach involves analyzing the financial performance of cryptocurrencies, and describing them from a statistical point of view. Chan et al. (2017) analyzed a few cryptocoins (Bitcoin, Dash, Dogecoin, Litecoin, MaidSafeCoin, Monero and Ripple), which exhibited heavy-tailed distributions, that fitted the generalized hyperbolic distributions. Hu et al. (2019) analyzed the stylized facts, and the return properties of 222 cryptocurrencies, and found a large degree of skewness, and volatility in the population of returns. Furthermore, according to Pele et al. (2020), cryptocurrencies can be clearly separated from classical assets, mainly owing to their tail behavior, high variance, and high departure from normality. However, their results also show that the behavior of cryptocurrencies is diverse.

The same conclusion can be drawn from other clustering analyses of cryptocurrencies. Stosic et al. (2018) represent the correlations of 119 cryptocurrency markets as a complex network, and discover distinct community structures in its minimum spanning tree. Song et al. (2019) analyze 76 cryptocurrencies using correlation-based clustering after filtering out the linear influences of Bitcoin and Ethereum, and detect six clusters, which, however, do not remain stable after the announcement of regulations from various countries. The time dimension also plays an important role: Sigaki et al. (2019) cluster 437 time series of cryptocurrencies using hierarchical techniques and detect four different groups whose behavior evolves differently in terms of information efficiency.

All these approaches show that it is possible to establish different groups of cryptocurrencies in terms of their financial performance. This is useful not only to better understand the cryptocurrency market, but also to build a diversified portfolio. These studies use different representations of the cryptocurrencies: correlations (Stosic et al. 2018; Song et al. 2019), factors extracted from the correlation matrix (Pele et al. 2020), and time series (Sigaki et al. 2019). Each representation focuses on different aspects of cryptocurrencies that are meaningful to the purpose of the analysis.

However, it would be possible to combine the clustering results obtained from different representations of cryptocurrencies, each considering different aspects of them. Thus, the combination of the clustering results makes it possible to characterize each cryptocurrency in several dimensions, one for each clustering strategy. If the clusters for each clustering strategy are meaningful, their combinations would offer a more detailed characterization of the market and useful insights for portfolio management.

This study aims to propose a new methodology that helps us to arrange and understand the main trends of the market at a glance, based on the financial behavior of cryptocurrencies.

Each of the clustering methods considered should offer a complementary view of cryptocurrencies, and a meaningful graphical representation that makes it possible to observe the main characteristics of each segment of the market at a glance.

The combination of the clustering methods will make it possible to profile each cryptocurrency, considering the different clusters to which it belongs. Furthermore, the most populated clustering intersections will help us detect the main trends in the market. Finally, the analyst can spot the intersection that has a particular financial behavior by choosing the prototype of interest for each clustering method. This can help investors target interesting cryptocurrencies, depending on the investment profile.

Finally, the methodology includes the study of the associations between the clustering results and other cryptocurrency features not considered in the clustering. Thus, the methodology helps to discover potential relationships between the resulting clusters and other aspects beyond the data characterization used in the clustering methods.

The proposed methodology is fully supported by different statistical tools that ensure the robustness of the results. Further, it is easily scalable, to manage a growing, and dynamic market. Regarding computations, we used R (R Core Team 2013) and several R libraries, as shown in Table 15.

In our study, we include all cryptocurrencies in the market in 2018 (more than 1,700 cryptocurrencies), going beyond the few dozen (or few hundred) cryptocurrencies analyzed in other studies.

Regarding the data characterization, we describe each cryptocurrency considering the log-return transformation of the daily price in 2018 based on three different levels of granularity:

Mean and standard deviation of the daily returns
Distribution of the daily returns
Time-series of the daily returns

In the first case, we provide a meaningful summary commonly used to describe financial assets over time, analogous to the annualized return and volatility, that is, the central tendency and dispersion of returns. In the second case, we consider the whole distribution of returns, which accounts not only for the central tendency and dispersion of an asset, but also for the whole aggregated behavior, including asymmetry, kurtosis, and tails. The methods for analyzing distributional data belong to the field of symbolic data analysis (Noirhomme-Fraiture and Brito 2011), where observations account for internal variation that can be represented as intervals or distributions, and have been previously used in finance (Arroyo and Maté 2009; Arroyo et al. 2011; González-Rivera and Arroyo 2012). In the third case, we consider the observed data, that is, the log-return time series, which accounts for variations over time and makes it possible to identify when volatile or stable periods occur in each cryptocurrency.
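To make the three levels of granularity concrete, the following minimal Python sketch builds all three representations from one series of daily log-returns. It is only an illustration: the function name `representations`, the toy returns, the fixed bin edges, and the use of the population standard deviation are assumptions, not details taken from the study (which used R).

```python
import statistics

def representations(returns, bin_edges):
    """Build three views of one asset's daily log-returns (illustrative)."""
    # 1) Two-moment summary: mean and population standard deviation.
    moments = (statistics.fmean(returns), statistics.pstdev(returns))
    # 2) Distribution: relative frequencies over half-open bins
    #    [edge_i, edge_{i+1}); the last bin also includes its right edge.
    #    Values outside the edges are ignored in this sketch.
    counts = [0] * (len(bin_edges) - 1)
    for r in returns:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= r < bin_edges[i + 1]
            if in_bin or (is_last and r == bin_edges[-1]):
                counts[i] += 1
                break
    histogram = [c / len(returns) for c in counts]
    # 3) Time series: the ordered observations themselves.
    series = list(returns)
    return moments, histogram, series

# Hypothetical toy data, not taken from the paper.
toy_returns = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.03, 0.00]
edges = [-0.04, -0.02, 0.0, 0.02, 0.04]
moments, histogram, series = representations(toy_returns, edges)
```

The first view discards everything but location and spread, the second keeps the shape of the distribution but not the ordering in time, and the third keeps the ordering.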

There is a high diversity of clustering techniques. However, in our case, interest lies in the different perspectives shown at each level of granularity. Therefore, for all representations, we use partitional prototype-based clustering algorithms with a similarity measure (distance), that is meaningful for each kind of representation. Thus, we will have a prototype describing the behavior of each cluster using the same representation of the data. Prototypes make it possible to assign financial meaning to the entire cluster.

We further combined the three clustering results and analyzed the most numerous intersections with the help of visual tools. Such approaches have been successfully used in biostatistics (L’Yi et al. 2015; Kern et al. 2017). In our case, we use them to represent the main trends in the cryptocurrency market. If several cryptocurrencies belong to the same clusters in the three clustering results, we can consider them to be very similar. We further inspect the relationships among the three clustering results with the help of visualization tools.
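As a rough illustration of how cluster intersections can be tabulated, the sketch below counts how many assets share the same triple of cluster labels and reports the most populated triples with their market share. The label vectors and the function name `top_intersections` are hypothetical; the study relies on visual tools rather than this exact procedure.

```python
from collections import Counter

def top_intersections(labels_a, labels_b, labels_c, n=3):
    """Return the n most populated (a, b, c) label triples with counts and shares."""
    profiles = Counter(zip(labels_a, labels_b, labels_c))
    total = sum(profiles.values())
    return [(p, c, c / total) for p, c in profiles.most_common(n)]

# Hypothetical cluster labels for eight assets under three clusterings.
moment_labels = [1, 1, 2, 2, 1, 3, 1, 2]
hist_labels = [1, 1, 1, 2, 1, 2, 1, 2]
series_labels = [2, 2, 1, 1, 2, 1, 2, 1]

top = top_intersections(moment_labels, hist_labels, series_labels, n=2)
```

Assets that fall in the same triple share a cluster under all three representations, which is the sense in which the text treats them as very similar.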

The proposed approach provides a screening mechanism that allows us to explore the entire market, despite its complexity and size. The intersection of the clustering results can also help investors in selecting a suitable cryptocurrency for the portfolio, as it characterizes the market in more detail.

In a further step, we investigate the association between the clustering results and different features of cryptocurrencies, such as technological variables, market capitalization, the maturity (age) of the cryptocurrency, and some asset portfolio ratios. We aim to inspect whether some clusters are tightly associated with aspects that are not considered in the clustering process. We conducted statistical inference tests to assess whether the associations were significant. These associations enhance the profiling of the different clusters. Throughout the analysis, we refer to concrete cryptocurrencies in the market, most of which are not widely known. Finally, we discuss our results and exemplify how they could be used, and then present our conclusions, including some points to stimulate further research.

Literature review

Clustering financial data

Clustering analysis is a well-known data analysis tool that has been used in different fields (Henning et al. 2016). In finance, the seminal work of Mantegna (1999) used the cross-correlation of the return time series and minimum spanning trees (MST) to group the stocks of the New York Stock Exchange from 1989 to 1995, applying the MST to represent the stock market as a network. Bonanno et al. (2004) further applied the same methodology considering different time horizons, to compare the return and volatility networks. The methodology of Mantegna has been applied with different variations in other contexts (Onnela et al. 2003; Mizuno et al. 2006; Brida and Risso 2009). Furthermore, Marti et al. (2017) proposed different alternatives and variants for this methodology.

Another important strand applies fuzzy clustering to financial time series, typically grouping stocks to develop portfolios. For example, D’Urso et al. (2013) and D’Urso et al. (2016) applied a model-based approach with different variations of fuzzy clustering to financial markets, using different distance metrics (autoregressive, Caiado). Similarly, D’Urso et al. (2020) proposed a fuzzy clustering method based on cepstral representation, using the daily Sharpe ratio as the clustering variable.

The main application of clustering in finance is building portfolios. For example, Nanda et al. (2010) applied K-means, fuzzy C-means, and self-organizing maps (SOM) to returns and financial ratios from Indian stocks, to classify them into different clusters and subsequently develop portfolios from these clusters. Chaudhuri and Ghosh (2015) propose an approach that groups daily Indian market volatility, comparing Kernel K-means, SOM, and Gaussian clustering models, to achieve accurate volatility prediction using the clusters as predictors.

Liao (2007) and Liao and Chou (2013) cluster daily market data and apply different association rules between the K-means groups, indices, and some market categories. These associations help analyze and describe co-movement among the different markets.

Regarding the use of time series as objects for clustering, Aghabozorgi and Teh (2014) proposed a three-phase clustering model to categorize companies based on the shape similarity of their stock prices, using dynamic time warping (DTW) (Berndt and Clifford 1994). D’Urso et al. (2019) apply a trimming procedure to a fuzzy clustering of the stocks comprising the FTSE MIB, with DTW as the distance metric, to mitigate the effect of outliers on the time series, with good results.

From traditional finance to cryptoasset markets

Yermack (2013) analyzes the Bitcoin market in depth and considers it a speculative investment rather than a currency. According to that study, the Bitcoin market poses a high risk for the management of transactions and credit markets. Finally, a deflationary scenario is anticipated owing to the limited number of bitcoins that can be issued (21 million). This study anticipated many aspects of cryptocurrency markets prevalent today (excessive volatility and the high level of computer knowledge required for using and integrating them into the web of international payments). A more recent view of this innovative market concerns cryptocurrency exchange rates. Drozdz et al. (2018, 2019) show that BTC/USDT, ETH/USDT, and ETH/BTC were almost indistinguishable from exchange rate quotes in the Forex market. The authors show that cryptocurrency exchange rates behave similarly to those of more mature markets such as stocks, commodities, or Forex. Complementarily, Drozdz et al. (2020a) point to the disconnection of cryptocurrencies from conventional markets and state that Bitcoin plays a role in the cryptomarket similar to that of the USD in the Forex market, while Drozdz et al. (2020b) report that cryptocurrencies began to be correlated with traditional assets only from 2020.

Corbet et al. (2019) analyzed in depth the high growth of the cryptocurrency market and its heterogeneity since 2014. They consider different aspects, including regulation, cyber-criminality, market efficiency, and bubble dynamics, and make recommendations for further investigation in different domains. We consider a couple of them, and address some characteristics based on liquidity (with volume as a proxy), market cap, and other key metrics or ratios, such as the Beta or Sharpe ratio. More recently, Fang et al. (2021) published an updated survey covering various cryptocurrency trading aspects, including unsupervised machine learning techniques and others (e.g., cryptocurrency trading systems, bubbles and extreme conditions, prediction of volatility and return, crypto-asset portfolio construction, and technical trading).

The characterization of cryptocurrencies from a statistical point of view has been tackled by different studies. Chan et al. (2017) analyze the distributions of a few cryptocurrencies (Bitcoin, Dash, Dogecoin, Litecoin, MaidSafeCoin, Monero and Ripple) and show that they exhibit heavy-tailed distributions that fit the generalized hyperbolic distributions. Our study considers heavy-tail and associated power-law distribution analyses. As part of a benchmark with other markets, Baek and Elbeck (2015) show that the Bitcoin market is 26 times more volatile than the S&P 500 Index.

Zhang et al. (2018) analyze the stylized facts of eight cryptocurrencies that represent almost 70% of the market capitalization and find, among other things, heavy tails for the returns, return autocorrelations that decay quickly while the autocorrelations for absolute returns decay slowly, strong volatility clustering and leverage effects in the returns, and a power-law correlation between price and volume. The study of stylized facts has been extended by increasing the number of digital coins to 222 (Hu et al. 2019). Similarly, we consider it important to include as many cryptocurrencies as possible in our study, to characterize the market fully.

Clustering of cryptocurrencies

The classical methodology based on MST algorithms (Mantegna 1999) is applied by Song et al. (2019) after filtering out the influence of Bitcoin and Ethereum; it detects six homogeneous clusters. However, the structure found does not remain stable after the announcement of regulations from various countries. Interestingly, the use of clustering together with other methods, such as VAR models and Granger causality tests (Zieba et al. 2019), shows that Bitcoin price shocks are not transmitted to the prices of other cryptocurrencies, with Litecoin and Dogecoin being the most influential actors. According to these results, Bitcoin exhibits a weaker relationship with other cryptocurrencies. Another approach is the use of random matrix theory and hierarchical structures in an MST on 119 cryptocurrencies from 2016 to 2018 (Stosic et al. 2018). They find multiple collective behaviors in the cryptocurrency market, which contrasts with the intuitive idea that Bitcoin has a global influence on the entire market.

Furthermore, the time dimension was also considered. Sigaki et al. (2019) first classify 437 cryptocurrencies according to information efficiency, using permutation entropy and statistical complexity, and then cluster their time series using dynamic time warping and hierarchical clustering, to find four groups where the behavior in terms of information efficiency evolves differently.

All these articles show the complexity of the underlying structure in the cryptocurrency market, where some cryptocurrencies influence others, even in unexpected ways.

The comparative study of cryptocurrency markets and traditional financial markets is also a key research area. Corbet et al. (2018) show that cryptocurrencies are highly connected among themselves and disconnected from mainstream assets (bonds, stocks, S&P500, gold). Along the same lines, Pele et al. (2020) combined classification based on asset profiles with the dynamic evolution of clusters. First, they characterize a selected group of assets by their log-returns, including 150 cryptocurrencies, stocks, commodities, and exchange rates, and estimate a multidimensional vector by applying dimensionality reduction with factor analysis. They further used classification, where K-means is one of the techniques applied. The main difference between cryptocurrencies and traditional assets is the higher variance and longer tails of the log-return distribution. The work also shows that individual cryptocurrencies tend to develop over time with similar characteristics (synchronic evolution).

Methodology

Dataset description

We retrieved data from https://www.cryptocompare.com/ for all cryptocurrencies traded in 2018. Many new cryptocurrencies have emerged in recent years, but many of them are short-lived and barely traded. We aim to include as many of them as possible in our study. First, we eliminate NaN and Inf values, which are mostly caused by zero prices in log transformations. Second, we filter out cryptocurrencies that were in the market less than 95% of the days (92 cryptocurrencies in 2018). We kept for clustering those that were in the market but were not traded, that is, those with zero return and volatility or zero volume. In 2018, there were 306 such barely traded cryptocurrencies on exchanges. We decided to include them in the clustering because they are a substantial part of the cryptocurrency market. Even if they are of no interest to investors, we are interested in knowing where they are allocated.

Our final dataset in 2018 consisted of 1,723 cryptocurrencies. However, we decided to eliminate cryptocurrencies with low or no activity from the second part of our analysis, the association tests. Low market activity may cause heavy tails in the return distribution and affect the consistency of the results. The remaining dataset for association tests consisted of 1,262 cryptocurrencies with higher statistical quality, ensuring the existence of the first and second statistical moments.

We also downloaded the data for 2019, to extend our experiment to a longer time frame and assess how well the results generalize.

In addition to cryptocurrency data, we use daily data from CCI30^{Footnote 3} to represent the global behavior of the cryptocurrency market. The CCI30 is a market-cap-weighted index (Rivin and Scevola 2018) that tracks the 30 largest cryptocurrencies by market capitalization, which makes it a good representative of the market. Other crypto indexes (such as CRIX or BGCI, the Bloomberg Galaxy Crypto Index) could be used in our methodology, as all of them correlate highly with the market (Häusler and Xia 2021). We chose the CCI30 index owing to its data availability and transparent methodology. In general, the proposed methodology could be used with any index, provided that it accurately represents the trend of the entire market. We also retrieve daily Treasury Bill (T-Bill) rates from the US Department of the Treasury.^{Footnote 4} We use both for the computation of some financial benchmarking ratios, such as the Beta and Sharpe ratio, which we explain in the subsequent sections.

Regarding the cryptocurrencies, we constructed the following variables:

Daily log-returns: The use of returns instead of prices for financial time series is widespread and well established, owing to their more suitable statistical properties and better comparability. Returns have also been used in cryptocurrency markets (Letra 2016; Stosic et al. 2018). The return for cryptocurrency i on day t is computed as:
$$\begin{aligned} r_i(t)=\ln (P_i(t))-\ln (P_i(t-1)) \end{aligned}$$
where $P_i(t)$ is the daily price of cryptocurrency i on day t.
Heavy tail: Heavy tail behavior in a return distribution means that extreme price fluctuations are relatively frequent. This might be related to the finite-size effects in the number of active agents, linked to the liquidity and volume of the market (Watorek et al. 2020). The return distributions of less liquid cryptocurrencies are characterized by thicker tails and poorer scaling.^{Footnote 5} We aim to identify cryptocurrencies prone to extreme behavior and whether they are associated with some clusters. We flag a cryptocurrency as heavy-tailed with a binary variable if it has a tail index lower than 2, according to Newman (2005). Such a tail index would call into question the existence of finite first and second moments of the underlying distributions, which is not a problem in our case, as we use the observed sample statistics in a descriptive manner.
Volume: The daily traded volume, in units of the base cryptocurrency, is used as a liquidity proxy. We transform the volume into an ordinal variable using quantile functions. Three cryptocurrencies represent 66% of the trading volume of the market in 2018, namely Bitcoin (46%), Ethereum (16.5%), and EOS (4%); in total, 10 cryptocurrencies (BTC, ETH, EOS, BCH, XRP, LTC, ICX, HSR, ETC, IOT) represent 80% of the daily volume.
Market cap: The market capitalization on a single day, February 4, 2019. Three cryptocurrencies represent 60% of the market cap: WBTC* (26.8%), BTC (22.4%), and NPC (11.5%), and five cryptocurrencies (WBTC*, BTC, NPC, XRP, AMIS) represent 80% of the total market cap.
Beta and Sharpe ratios: We compute and discretize Beta and Sharpe ratio for each cryptocurrency. These variables enrich the characterization and give us a financial flavor of the clusters that will help us with interpretability.
Technological variables: We represent the encryption, and consensus algorithms of the cryptocurrency as nominal variables:
- Encryption: There are 105 different values. The most relevant are Scrypt, SHA256, SHA256D, X11, X13, X15, PoS, Multiple, and CryptoNight. Note that this information is not available for 35% of cryptocurrencies (599 obs.) in 2018.
- Consensus: There are 60 possible values, including the well-known Proof of Work (PoW) and Proof of Stake (PoS). The most predominant are PoW/PoS, PoW, and PoS, although this information is missing for 31% of the cryptocurrencies (536 obs.) in 2018.
Age: We estimate the time on the market of each cryptocurrency and transform it into an ordinal variable using a quantile function. The terms age and maturity are used interchangeably in our study.
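Two of the variable constructions above lend themselves to a short sketch: the daily log-return and the quantile-based transformation of a continuous variable (such as volume or age) into an ordinal one. The code below is a hedged illustration, not the authors' implementation: the choice of four levels, the rank-based rule, the way non-positive prices are skipped, and all names and numbers are assumptions.

```python
import math

def log_returns(prices):
    """r(t) = ln P(t) - ln P(t-1).

    Days where either price is non-positive are skipped, mimicking the
    removal of NaN/Inf values caused by zero prices (assumed handling).
    """
    out = []
    for prev, cur in zip(prices, prices[1:]):
        if prev > 0 and cur > 0:
            out.append(math.log(cur) - math.log(prev))
    return out

def quantile_level(value, sample, n_levels=4):
    """Ordinal level 1..n_levels from the empirical rank of value in sample."""
    rank = sum(1 for v in sample if v <= value) / len(sample)
    return min(n_levels, max(1, math.ceil(rank * n_levels)))

# Hypothetical prices with one zero quote, and hypothetical volumes.
returns = log_returns([100.0, 110.0, 0.0, 121.0, 121.0])
volumes = [5, 10, 20, 40, 80, 160, 320, 640]
levels = [quantile_level(v, volumes) for v in volumes]
```

With four levels, each ordinal class holds roughly a quarter of the assets, which keeps the later association tests from being dominated by the few cryptocurrencies that concentrate most of the volume.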

Methods

We aim to group the cryptocurrencies based on the behavior of their log-returns in 2018. For this purpose, we use different clustering algorithms that deal with the three representations of the log-returns described above: statistical moments, observed probability distribution, and observed daily time series.

We use centroid-based clustering algorithms, as the centroids provide an interpretable summary of the elements of each cluster, which will help us identify the most relevant features of the cluster elements. However, this type of clustering algorithm requires the desired number of clusters (k) to be known in advance, which is a drawback. We applied different quality criteria to determine the optimum number of clusters, depending on the technique used. The evaluation of clustering performance is intrinsically difficult, owing to the lack of objective measures, as there is no ground truth. Different approaches have been applied to compare clustering techniques, for instance, multiple criteria decision making (MCDM) in Kou et al. (2014), with different methods (i.e., TOPSIS, DEA, and VIKOR) and 11 performance measures. In our case, and for K-means, as we detail later, we mostly rely on the straightforward majority rule criterion implemented in the R package of Charrad et al. (2014), which applies 30 performance measures. This is a simpler methodology than MCDM, but adequate in our case, as we do not benchmark different clustering techniques.

Moreover, we use distance-based clustering algorithms, which are simple, intuitive, and applicable to a wide variety of scenarios (Aggarwal et al. 2013). The algorithms considered are based on meaningful dissimilarity measures or distances that help in the interpretability of the clusters. This is especially important for more complex representations, such as distributions or time series. For example, in the case of distributions, the measure should relate to the properties of the density function (central tendency, spread, symmetry), while in the case of time series, it should relate more to the shape of the series. Meaningful measures will help us better understand the resulting clusters and interpret the nearness of the observations to the centroid. Additionally, each clustering algorithm provides a prototype or centroid for every cluster, which facilitates the characterization of the resulting clusters.

The cluster intersections help us merge the results of the different clusters, and identify the most prominent cryptocurrency profiles in 2018, according to different characteristics through the three techniques. Furthermore, we analyzed the association between the clustering results found for the three representations, and the different attributes of cryptocurrencies.

K-means clustering algorithm for the first and second statistical moments

Regarding the bivariate (or two-moment) representation, where the two variables are the yearly mean and standard deviation of the log-returns, we use K-means clustering (MacQueen 1967), which is one of the most extensively used clustering algorithms in general (Wu et al. 2008) and in cryptocurrency markets in particular (Fang et al. 2021). We standardized the two variables to homogenize the differences between their ranges. K-means clustering minimizes within-cluster variances, that is, squared Euclidean distances in our case, which makes the result easy to understand and interpret. Before clustering, we compute the Hopkins statistic (Banerjee and Dave 2004) to rule out the possibility that the dataset was generated by a uniform random distribution.
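The standardize-then-cluster step can be sketched in a few lines of plain Python. Note the assumptions: this uses the simple Lloyd iteration rather than the Hartigan-Wong variant used in the study, the initial centroids are passed in explicitly for reproducibility, and the six (mean, standard deviation) pairs are made-up toy values.

```python
import statistics

def standardize(rows):
    """Z-score each column so both variables share a comparable range."""
    cols = list(zip(*rows))
    mus = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def sq_dist(p, q):
    """Squared Euclidean distance, the quantity K-means minimizes."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = list(init_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(statistics.fmean(c) for c in zip(*members))
    return labels, centroids

# Hypothetical (mean, std) of daily log-returns for six assets.
raw = [(0.001, 0.02), (0.002, 0.03), (0.000, 0.025),
       (-0.010, 0.15), (-0.012, 0.18), (-0.008, 0.16)]
z = standardize(raw)
labels, centroids = kmeans(z, init_centroids=[z[0], z[3]])
```

Because the centroids live in the same (mean, standard deviation) space as the data, each one can be read directly as the typical return and volatility profile of its cluster.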

To select the number of clusters (k), we compute several internal cluster validity indices (CVIs) for crisp partitions (Arbelaitz et al. 2013), including Silhouette, Dunn, COP, Davies-Bouldin, Calinski-Harabasz, and the score function, and then apply the majority rule to choose the best number of clusters.
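The majority rule itself is a simple vote: each validity index proposes the k it considers best, and the most frequently proposed k wins. The sketch below assumes the per-index optima have already been computed; the vote values are hypothetical and are not results from the study, and breaking ties in favor of the smaller k is an assumption.

```python
from collections import Counter

def majority_rule(best_k_by_index):
    """Pick the k proposed by most indices; ties go to the smaller k (assumed)."""
    votes = Counter(best_k_by_index.values())
    top = max(votes.values())
    chosen = min(k for k, v in votes.items() if v == top)
    return chosen, votes

# Hypothetical optima proposed by each index (illustrative only).
proposed = {
    "silhouette": 3,
    "dunn": 4,
    "cop": 3,
    "davies_bouldin": 3,
    "calinski_harabasz": 5,
}
chosen_k, votes = majority_rule(proposed)
```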

We apply clustering ensemble techniques (Acharya 2011) to reduce the randomness of partitional clustering results. We run the K-means algorithm 10 times and combine the outcomes into an ensemble based on the Euclidean dissimilarity between partitions. We confirm that the dissimilarity among the different runs is close to zero, which makes the ensemble cluster a stable representation. For each algorithm run, we apply the Hartigan-Wong method for clustering (Hartigan and Wong 1979) with ten iterations to reach convergence, and consider 50 random starts for each iteration. Once we have the 10 algorithm runs, we compute the medoid of the ensemble of partitions, that is, the element of the ensemble minimizing the sum of dissimilarities to all other elements (Hornik 2005, 2019).
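The medoid of an ensemble of partitions can be illustrated as follows. As a stand-in for the Euclidean partition dissimilarity used in the study, this sketch uses the fraction of point pairs on which two partitions disagree (one minus the Rand index), which is insensitive to how cluster labels are numbered. The four runs are hypothetical.

```python
from itertools import combinations

def partition_dissimilarity(a, b):
    """Share of point pairs grouped together in one partition but not the other."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

def medoid_partition(partitions):
    """Index of the partition minimizing the summed dissimilarity to all others."""
    totals = [
        sum(partition_dissimilarity(p, q) for q in partitions)
        for p in partitions
    ]
    best = min(range(len(partitions)), key=totals.__getitem__)
    return best, totals

# Hypothetical label vectors from four runs on six observations.
runs = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same grouping as the first run, labels permuted
    [0, 0, 1, 1, 1, 2],
    [0, 0, 0, 1, 2, 2],
]
best, totals = medoid_partition(runs)
```

When the totals are all near zero, as the text reports for the actual runs, the choice of medoid matters little and simply confirms that the partition is stable.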

Dynamic clustering algorithm for histograms

Regarding the yearly log-return distribution, we apply a clustering algorithm that deals with the histogram-data form. More precisely, we apply the dynamic clustering algorithm for histogram data based on the $l _2$ Wasserstein distance (Irpino and Verde 2006; Irpino et al. 2014). Thus, we group the cryptocurrencies with similar distributions of log-returns in 2018.

The dynamic clustering algorithm needs a dissimilarity function to assign the observations to the clusters, which is the $l _2$ Wasserstein distance. Given two histograms $h_1$ and $h_2$, the $l _2$ Wasserstein distance is defined as

$$\begin{aligned} d_W(h_1,h_{2}):=\sqrt{\int _{0}^{1}\left[ F_1^{-1}(t)-F_{2}^{-1}(t)\right] ^2dt} \end{aligned}$$

(1)

where $F_1^{-1}$ and $F_{2}^{-1}$ are the inverse of the cumulative distribution functions, that is, the quantile functions of $h_1$ and $h_2$, respectively. This distance can be decomposed as follows:

$$\begin{aligned} d_W(h_1,h_{2})=\sqrt{\left( \mu _1-\mu _{2}\right) ^2 + \left( \sigma _1-\sigma _{2}\right) ^2+2\sigma _1\sigma _{2}\left( 1-\rho _{1,2}\right) } \end{aligned}$$

(2)

where $\mu _i$ and $\sigma _i$ are the mean and the standard deviation of $h_i$, respectively, and $\rho _{1,2}$ is the correlation between the quantile functions of $h_1$ and $h_2$ (Irpino and Verde 2015). Consequently, the squared $l _2$ Wasserstein distance can be decomposed into the sum of three elements that account for the histogram differences in terms of location, spread, and shape. Interestingly, this distance matches the perceptual similarity that humans observe when comparing distributions (Arroyo and Maté 2009). All these aspects make it a suitable distance for clustering distributions and, in our case, log-return distributions.
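Equations (1) and (2) can be checked numerically with a small sketch. It assumes two raw samples of equal size instead of binned histograms, so that the empirical quantile functions are simply the sorted samples. The sample values and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def wasserstein_l2(x, y):
    """Empirical version of Eq. (1) for two equal-size samples."""
    qx, qy = sorted(x), sorted(y)  # empirical quantile functions
    return sqrt(mean([(a - b) ** 2 for a, b in zip(qx, qy)]))

def wasserstein_l2_decomposed(x, y):
    """Eq. (2): location, spread and shape components of the squared distance."""
    qx, qy = sorted(x), sorted(y)
    mx, my = mean(qx), mean(qy)
    sx, sy = pstdev(qx), pstdev(qy)  # assumes neither sample is constant
    cov = mean([(a - mx) * (b - my) for a, b in zip(qx, qy)])
    rho = cov / (sx * sy)            # correlation of the quantile functions
    location = (mx - my) ** 2
    spread = (sx - sy) ** 2
    shape = 2 * sx * sy * (1 - rho)
    return sqrt(location + spread + shape), (location, spread, shape)

# Hypothetical daily log-returns of two assets.
x = [-0.03, -0.01, 0.00, 0.02, 0.05]
y = [-0.20, -0.05, 0.01, 0.04, 0.30]
d_direct = wasserstein_l2(x, y)
d_decomposed, components = wasserstein_l2_decomposed(x, y)
```

Both routes give the same distance, and `components` shows how much of the squared distance is due to the difference in means, in standard deviations, and in shape.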

The dynamic clustering algorithm for histogram data based on the Wasserstein distance (Hist-DAWass) is a k-means-like algorithm for clustering a set of observations described by histogram variables (Irpino and Verde 2006; Irpino et al. 2014). Each of the k clusters is represented by a centroid or prototype, and observations are assigned to the closest prototype. The prototype is an average histogram of the histograms observed for each variable. In our case, observations are described by a single histogram variable representing the distribution of log-returns, and the resulting prototype is a histogram that averages the histograms of the observations that belong to the cluster (Irpino and Verde 2015). Consequently, the prototypes can be interpreted in a financial context as log-return distributions.
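The two building blocks of this k-means-like scheme, the prototype update and the assignment step, can be sketched as follows. The sketch assumes equal-size raw samples rather than binned histograms, and relies on the fact that, under the $l _2$ Wasserstein distance, the average of one-dimensional distributions has a quantile function equal to the average of their quantile functions. All names and values are hypothetical.

```python
from math import sqrt

def w2(qa, qb):
    """l2 Wasserstein distance between two sorted equal-size samples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(qa, qb)) / len(qa))

def prototype(members):
    """Average distribution: pointwise mean of the members' quantile functions."""
    quantile_functions = [sorted(m) for m in members]
    return [sum(vals) / len(vals) for vals in zip(*quantile_functions)]

def assign(sample, prototypes):
    """Index of the closest prototype for one observation."""
    q = sorted(sample)
    return min(range(len(prototypes)), key=lambda c: w2(q, prototypes[c]))

# Two hypothetical clusters of log-return samples: calm and volatile.
calm = [[-0.01, 0.00, 0.01], [-0.02, 0.00, 0.02]]
wild = [[-0.20, 0.00, 0.20], [-0.30, 0.10, 0.20]]
prototypes = [prototype(calm), prototype(wild)]

label_a = assign([0.015, -0.01, 0.0], prototypes)
label_b = assign([-0.25, 0.30, 0.0], prototypes)
```

Because each prototype is itself a quantile function, it can be read directly as a log-return distribution.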

We used the clustering implementation in the R package HistDAWass (Irpino 2016). This implementation provides a quality measure, namely the percentage of the sum of squared (SS) deviations explained by the model. We run the clustering algorithm 20 times for each k and keep the best solution among the repetitions, that is, the one that maximizes the explained SS.

TADPole clustering for time-series

Time-series clustering is a challenging domain for clustering owing to the high dimensionality of objects and their ordering. Consequently, many approaches have been proposed over time (Liao 2005; Rani and Sikka 2012; Aghabozorgi et al. 2015).

We aim to cluster the time series with similar volatility patterns in the same period. For this purpose, the Euclidean distance may fail to produce an intuitively correct measure of similarity between two time series, as it is very sensitive to small distortions in the time axis. Other measures, such as dynamic time warping (DTW), handle this problem by non-linearly warping the time dimension to estimate the similarity between the series. Currently, DTW is considered one of the most popular and useful shape-based measures (Aghabozorgi et al. 2015).
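The effect of warping can be seen in a minimal DTW sketch. It uses the absolute difference as the local cost and a symmetric window constraint, both of which are assumptions made for the illustration. The two series are hypothetical and differ only by a one-step shift.

```python
from math import inf

def dtw(a, b, window=None):
    """Dynamic time warping cost with an optional window on |i - j|."""
    n, m = len(a), len(b)
    # The window must be at least |n - m| for a warping path to exist.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The same bump, shifted by one time step.
s1 = [0, 0, 1, 2, 1, 0, 0]
s2 = [0, 1, 2, 1, 0, 0, 0]

lockstep = sum(abs(x - y) for x, y in zip(s1, s2))  # point-by-point comparison
warped = dtw(s1, s2, window=3)
```

The lock-step comparison penalizes the shift at every point of the bump, whereas the warped alignment matches the two bumps and reports no difference. A window of 0 forces the diagonal path and reproduces the lock-step cost.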

However, DTW is intrinsically slow owing to its quadratic time complexity, which hampers its applicability in clustering. Therefore, we use the DTW-based clustering algorithm TADPole (Time-series Anytime Density Peak) (Begum et al. 2015), which extends the density peak (DP) clustering framework (Rodriguez and Laio 2014) and exploits the upper and lower bounds of DTW to prune unnecessary distance computations, which accelerates the convergence of the algorithm. Consequently, TADPole quickly produces a usable answer and then refines it until it converges to the exact answer. Moreover, the clustering algorithm only requires two parameters, which makes it easy to use: first, a cut-off distance that defines the threshold to select the series, which we set to 2; and second, a window size that defines the time frame for comparing the series, which we set to 3. Optionally, we can also select the number of clusters (k), or let the algorithm choose the optimal one based on the local density of points (closer series at some time based on some cut-off distance) using a “knee point finding” algorithm, in which points with higher values of $\rho _i \cdot \delta _i$ are selected as cluster centers, where $\rho _i$ is the local density of point i and $\delta _i$ is its distance to the nearest point with higher local density.
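The density peak step on which TADPole builds can be sketched as follows. This is plain density peak clustering on a precomputed distance matrix: it omits TADPole's pruning with DTW bounds and its anytime behavior, and it takes k as given instead of using a knee-point search. The toy points, the absolute-difference distance, and the tie-breaking by index are assumptions.

```python
def density_peaks(dist, cutoff, k):
    """Density peak clustering: centers are the k points with largest rho * delta."""
    n = len(dist)
    # rho: number of neighbours closer than the cut-off distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Process points from densest to sparsest; ties broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_denser = [None] * n
    delta[order[0]] = max(dist[order[0]])  # convention for the densest point
    for pos in range(1, n):
        i = order[pos]
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], nearest_denser[i] = dist[i][j], j
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:k]
    labels = [None] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:  # non-centers inherit the label of their nearest denser point
        if labels[i] is None:
            labels[i] = labels[nearest_denser[i]]
    return rho, delta, centers, labels

# Toy one-dimensional points standing in for series; in the setting above,
# the matrix would hold DTW distances between log-return series.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(p - q) for q in points] for p in points]
rho, delta, centers, labels = density_peaks(dist, cutoff=0.15, k=2)
```

Points that are both dense and far from any denser point stand out with a large product of the two quantities, which is why the middle point of each group is selected as a center here.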

We consider different numbers of clusters k and compute an internal cluster validity index (CVI) for each of them. As this clustering algorithm uses three distances, we use Calinski-Harabasz as the CVI to secure the convergence of the algorithm for the asymmetric distance measure.

TADPole allows for the clustering of time-series with arbitrary shapes, which is very useful in our case owing to the heterogeneity of the cryptocurrency market. However, DTW is not a geometric distance, because it does not satisfy all three fundamental metric properties: non-negativity, symmetry, and the triangle inequality. Consequently, TADPole clusters cannot be represented as “balls” in a metric space, as in K-means. Each cluster is instead summarized by a medoid-type centroid, as in partitioning around medoids (PAM), which is obtained with the DTW similarity measure and can be represented only in a DTW space. This centroid is a time-series that helps to identify the volatility patterns of the resulting clusters.

We apply the implementation of the TADPole algorithm in the R library DTWCLUST (Sarda-Espinosa 2019; Sardá-Espinosa 2019). The time-series are log-return values, which facilitates the characterization of the clusters from a financial perspective. The DTW measure implemented in the package follows the estimation in Lemire (2008).

Combination of clustering results

Once we have the results of the clustering algorithms, we combine them by intersecting the clusters. Potentially, we have $T_1 \cdot T_2 \cdot T_3$ intersections, where $T_n$ is the number of clusters obtained for the clustering algorithm n. The combination of the clustering results makes it possible to characterize each cryptocurrency in several dimensions, one for each clustering strategy. The resulting multidimensional categorical datasets can be shown using visualization techniques supported by graph theory (L’Yi et al. 2015; Kern et al. 2017). To better highlight the changes in the clustering between the different techniques, we visualize such changes by means of a so-called alluvial diagram, a good example of which can be found in Rosvall and Bergstrom (2010). We use the alluvial visualization implemented in R (Bojanowski and Edwards 2016) to show the main flows of cryptocurrencies.
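Intersecting the clusterings amounts to counting how many objects share each combination of labels. The following sketch assumes three label vectors aligned on the same objects. The labels are hypothetical and the function name is an illustrative choice.

```python
from collections import Counter

def intersect_clusterings(*labelings):
    """Count the objects falling in each combination of cluster labels."""
    return Counter(zip(*labelings))

# Hypothetical labels for six assets under the three techniques.
kmeans_labels = [1, 1, 1, 2, 2, 3]      # T_1 = 3 clusters
histogram_labels = [1, 1, 2, 2, 2, 2]   # T_2 = 2 clusters
tadpole_labels = [1, 1, 1, 1, 2, 2]     # T_3 = 2 clusters

cells = intersect_clusterings(kmeans_labels, histogram_labels, tadpole_labels)
largest_cell, size = cells.most_common(1)[0]
potential = 3 * 2 * 2  # T_1 * T_2 * T_3
```

In this toy case only 5 of the 12 potential intersections are non-empty, and the combination with the highest cardinality is the natural starting point for profiling.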

We can also numerically compare two partitions by means of a $c_1 \times c_2$ contingency matrix, where $n_{ij}$ is the number of objects in group i of partition 1 ($i=1,...,c_1$) and group j of partition 2 ($j=1,...,c_2$), $n_{i\cdot }$ and $n_{\cdot j}$ are the corresponding row and column totals, and n is the total number of objects. The labeling of the two partitions is arbitrary. Hubert and Arabie (1985) developed the Adjusted Rand Index (ARI) with a correction for chance as

$$\begin{aligned} ARI = \frac{\sum _{i,j}\left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) -\sum _{i}\left( {\begin{array}{c}n_{i\cdot }\\ 2\end{array}}\right) \sum _{j}\left( {\begin{array}{c}n_{\cdot j}\\ 2\end{array}}\right) /\left( {\begin{array}{c}n\\ 2\end{array}}\right) }{\frac{1}{2}\left[ \sum _{i}\left( {\begin{array}{c}n_{i\cdot }\\ 2\end{array}}\right) +\sum _{j}\left( {\begin{array}{c}n_{\cdot j}\\ 2\end{array}}\right) \right] -\sum _{i}\left( {\begin{array}{c}n_{i \cdot }\\ 2\end{array}}\right) \sum _{j}\left( {\begin{array}{c}n_{\cdot j}\\ 2\end{array}}\right) /\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \end{aligned}$$

(3)

The index is based on the proportion of the total of $\left( {\begin{array}{c}n\\ 2\end{array}}\right)$ object pairs that agree, that is, they are either (i) in the same cluster according to partition 1 and the same cluster according to partition 2, or (ii) in different clusters according to partition 1 and in different clusters according to partition 2, and it corrects this proportion for the agreement expected by chance. The higher the ARI, the higher the agreement.^{Footnote 6} In our case, a higher value means that more cryptocurrencies share clusters across the different partitions. We used the function implemented in the R package MCLUST by Scrucca et al. (2016). We also focus on the cluster intersections with higher cardinality for a better profiling of the main trends of the cryptocurrency market.
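Equation (3) translates almost term by term into code. The sketch below is a plain re-implementation for illustration, not the MCLUST function. The partitions are hypothetical, and the degenerate case in which both partitions consist of a single cluster (zero denominator) is not handled.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(p1, p2):
    """Adjusted Rand Index of two label vectors, following Eq. (3)."""
    n = len(p1)
    n_ij = Counter(zip(p1, p2))  # contingency table cells
    n_i = Counter(p1)            # row totals
    n_j = Counter(p2)            # column totals
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())
    expected = sum_i * sum_j / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (maximum - expected)

# Same partition under different labels: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Partially overlapping partitions.
ari_partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

The first call illustrates that the arbitrary labeling of the partitions does not affect the index.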

Association test

Finally, we enhance the descriptive information of each cluster by examining its level of association with different independent variables not considered by the clustering algorithms. We analyze the association between the clusters and the categorical variables defined in Table 1 by applying Fisher’s exact tests and analyzing the Pearson residuals of the contingency tables, which we explain later. Beforehand, quantitative variables are transformed into ordinal variables using quantiles.
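The quantile transformation and the Pearson residuals can be sketched as follows. Fisher's exact test itself is not implemented here. The rank-based binning, the function names, and the market capitalization figures are assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def quantile_bins(values, n_bins=3):
    """Turn a quantitative variable into ordinal bins of nearly equal size."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

def pearson_residuals(rows, cols):
    """(observed - expected) / sqrt(expected) for each contingency table cell."""
    n = len(rows)
    observed = Counter(zip(rows, cols))
    row_totals, col_totals = Counter(rows), Counter(cols)
    residuals = {}
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            residuals[(r, c)] = (observed[(r, c)] - expected) / sqrt(expected)
    return residuals

# Hypothetical cluster labels and market capitalizations for six assets.
clusters = [1, 1, 1, 2, 2, 2]
market_cap = [900.0, 800.0, 700.0, 30.0, 20.0, 10.0]
cap_level = quantile_bins(market_cap, n_bins=2)  # 0 = low, 1 = high
residuals = pearson_residuals(clusters, cap_level)
```

A large positive residual marks a cell with more objects than expected under independence, and a large negative residual marks a cell with fewer, which is what makes the residuals useful for describing which categories drive an association.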

Table 1 Categorical variables used in the association tests and their values
