Blockchain-oriented approach for detecting cyber-attack transactions

Abstract

With the rapid development of decentralized applications, account-based blockchain platforms have become a hotbed of financial scams and hacks due to their anonymity and high financial value. Financial security has become a top priority for the sustainable development of blockchain-based platforms because of the increasing number of cyber attacks, which have caused huge losses of crypto assets in recent years. Therefore, it is imperative to study the real-time detection of cyber attacks to facilitate effective supervision and regulation. To this end, this paper proposes the weighted and extended isolation forest algorithm and designs a novel framework for the real-time detection of cyber-attack transactions by thoroughly studying and summarizing real-world examples. Furthermore, this study develops a new detection approach for locating the compromised address of a cyber attack to resolve the data scarcity of hack addresses and reduce time consumption. Moreover, three experiments are carried out, both to apply the proposed approach to different types of cyber attacks and to compare it with widely used existing methods. The results demonstrate the high efficiency and generality of the proposed approach. Finally, the lower time consumption and robustness of our method are validated through additional experiments. In conclusion, the proposed blockchain-oriented approach can handle real-time detection of cyber attacks and has significant scope for applications.

Introduction

As a new decentralized infrastructure and disruptive core technology, public blockchain technology has piqued the interest of researchers, and the number of academic studies on blockchain is growing rapidly (Xu et al. 2019). According to Fang et al. (2022), Ethereum has become the mainstream platform for public blockchains, accounting for most of the total market capitalization. Currently, Ethereum is the largest decentralized open-source blockchain system that provides a Turing-complete programming language for developing smart contracts. Consequently, several decentralized applications (dapps) based on smart contracts, such as Uniswap, Aave, and PancakeSwap, have emerged, and they have been applied to many areas, especially finance, arts, and collectibles. However, with the proliferation of dapps on the public blockchain, account-based blockchain platforms have become a breeding ground for various financial scams and hacks due to their anonymity and enormous financial value. Cyber attacks and illegal activities are increasing on account-based blockchain platforms (e.g., Ethereum, Binance Smart Chain (BSC), and Solana). Furthermore, according to the Rekt database, over $1.9 billion was lost in 161 attacks on decentralized applications in 2021, indicating that cyber attacks have become a critical issue for the public blockchain. As the anonymity of blockchain provides convenience for hackers, an increasing number of financial regulators in various countries are attempting to strengthen blockchain supervision (Sebastião and Godinho 2021). To address this issue, this paper proposes a general and real-time approach to detecting cyber attacks to facilitate effective supervision and regulation in the field of the public blockchain.

In particular, focusing on the top 30 cyber attacks (ranked by funds lost) in recent years, as shown in Table 1, account-based blockchain platforms have been targeted by three types of cyber attacks: Smart contract exploits, Flash loan attacks, and Identity theft. We briefly introduce these three types (Table 1) because they are the targets that the proposed approach must detect. The Smart contract exploit has been the most frequent cyber attack in recent years, since decentralized applications run on open-source smart contracts, which are written in programming languages and controlled solely by their own code. Hence, hackers have the opportunity to review the code and probe the networks for code vulnerabilities of smart contracts, such as re-entrancy, integer overflow, and multisig vulnerabilities (Aspris et al. 2021; Efanov and Roschin 2018; Harvey et al. 2021). Note that these vulnerabilities mainly exist in smart contracts on account-based blockchains; therefore, the compromised addresses of Smart contract exploits belong to smart contracts. A real-world example of a Smart contract exploit is described in “Appendix 2”. In a Flash loan attack, hackers exploit economic vulnerabilities in the interaction between the decentralized applications of flash loans and other smart contracts. This enables hackers to borrow, arbitrage, and liquidate assets in an extremely short period, resulting in illegal profits (Qin et al. 2021). The most common method is arbitrage trading, in which the price of a crypto asset is manipulated on one decentralized exchange and the asset is quickly resold on another. In general, the compromised address of arbitrage trading based on flash loans is the address of a decentralized application, which also belongs to a smart contract. “Appendix 2” also provides detailed information about arbitrage trading.
Aside from the two types mentioned, Identity theft is a common type of cyber attack (Fang et al. 2022; Xu 2016). It refers to a scenario in which a hacker gains unauthorized access to an individual’s or organization’s private key through phishing attacks, malware, or social engineering tactics, allowing the hacker to access the associated blockchain account and transfer funds to the hacker’s own account. According to the transaction definition (see “Appendix 1”), all transactions must be signed with the corresponding private keys before they are submitted to the account-based blockchain. Therefore, losing control of a private key is equivalent to losing the funds on the blockchain. The externally owned account (EOA), explained in “Appendix 1”, is thus the compromised address of Identity theft, since a private key on the blockchain controls only an EOA. In summary, the compromised addresses of Smart contract exploits and Flash loan attacks belong to smart contracts, whereas those of Identity theft belong to EOAs. Meanwhile, Smart contract exploits and Flash loan attacks exploit code vulnerabilities and economic vulnerabilities, respectively. The variety of cyber-attack types implies that the key clues to be detected differ from one type to another. Accordingly, the first motivation for our work is to propose a general framework capable of dealing with multiple types of cyber attacks rather than just one or two types, especially since we believe that new types will emerge in the near future.

Table 1 Top 30 Cyber-attacks on the account-based platforms in recent years

Although supervised machine learning (SML) methods have succeeded in numerous fields, they face several challenges when applied to this paper’s problem of detecting cyber-attack transactions in blockchain. The first difficulty stems from insufficient data. SML is a postmortem analysis technology, which means that existing public transaction information is required to carry out the identity inference of illegal addresses, such as the behavior analysis of the addresses and account identification. However, dynamic and ever-changing cyber attacks are hard to keep up with. Only historical types of cyber attacks, rather than the latest ones, can be learned, because not all labels are available immediately (Carcillo et al. 2018; Dal Pozzolo et al. 2014). The second difficulty stems from data imbalance. The annotated information on blockchain addresses published on third-party sites is relatively scarce, resulting in imbalanced data. As a consequence, SML will be biased toward the majority class and will classify minority classes poorly, because only the majority class of cyber attacks can be fully learned (Thabtah et al. 2020). The third challenge lies in time consumption. Real-time detection of cyber attacks is crucial for victims and managers to take corresponding measures to prevent potential losses. According to Etherscan and Bscscan, the average block time on BSC and Ethereum is approximately 2.5 and 12 s, respectively. However, SML consumes far more time than the average block time (Chen and Guestrin 2016). Therefore, completing the analysis of real-time transactions within such a short period is one of the greatest challenges in detecting cyber-attack transactions.
Facing these challenges that SML is hard to cope with, the second motivation for this paper is to develop a new real-time method (i.e., with very low time consumption) that does not require adequate and balanced data.

Compared with SML, unsupervised machine learning (UML) methods can compensate for these deficiencies to some extent. Many existing studies demonstrate this point: (i) UML has been applied to credit card fraud detection by dealing with fraudsters’ ability to invent novel fraud behaviors and with changes in customer behaviors (Carcillo et al. 2021); (ii) to telecommunications fraud detection by correcting the misclassification of behavior types and recognizing the dynamic appearance of new fraud types (Hilas and Mastorocostas 2008); and (iii) to bot recognition in a web store by identifying more camouflaged agents (Rovetta et al. 2020), among others. Accordingly, UML can handle many types of cyber attacks, whether or not they have been seen before, because it can discover patterns and information that seem strange or suspicious. The detection of cyber attacks falls under the category of anomaly detection, and the isolation forest (Liu et al. 2012), a typical UML method, has been demonstrated to be an effective anomaly detection method with low time consumption and high efficiency in several fields, such as biostatistics and semiconductor manufacturing (Liu et al. 2012; Puggini and McLoone 2018). Furthermore, recent work has improved the traditional isolation forest into an extended isolation forest (EIF) by adjusting the way branch cuts are made (Hariri et al. 2019). However, in our experiments, the EIF also performs poorly. The third motivation of this paper is therefore to improve the classic isolation forest and EIF so that they achieve satisfactory results in detecting cyber attacks on the account-based blockchain.

Facing the three listed motivations, the main work is introduced as follows. First, we propose a novel method for extracting real-time account-based blockchain transactions from open-source websites, such as Etherscan and Bscscan, and identifying the addresses with the highest expenditure as the target addresses based on the accumulated expenditures of various crypto assets. Because most target addresses have a long usage history, the data deficiency of the hack addresses can be solved in this manner, responding to the first motivation. Second, the target addresses will be filtered by the funds expenditure threshold, and only a few data points will be fed into the next stage of the detection system, resulting in a reduction in time consumption, which responds to the second motivation. Third, we extract historical transaction data of the target addresses using open-source websites and use various data preprocessing methods to process the original data feature, allowing more useful information to be extracted. Then, in response to the third motivation, we propose an improved algorithm that assigns an anomaly score to the depth of the isolation tree based on the traditional EIF, dubbed weighted and extended isolation forest (WEIF).

To summarize, the main contributions are as follows. First, this work is one of the first to conduct an in-depth study of the real-time detection of cyber attacks on account-based blockchains. It not only considers various types of cyber-attack transactions by studying real-world cyber-attack examples, but also develops a new general UML-based framework with low data requirements and computation costs. The second contribution is an effective strategy for identifying suspected compromised addresses and filtering them using a fund expenditure threshold, which significantly reduces the number of analyzed targets. The third contribution is the designed dynamic modeling technology, which refers to the development of an evolving model for mining behavioral differences between suspected compromised targets. As a result, the designed dynamic model can detect real-time and constantly changing cyber-attack transactions. Last but not least, by adding weights to the depths of the EIF, a new algorithm called WEIF is created. When the weights are introduced, the gap between the average depth of normal transactions and the average depth of cyber-attack transactions becomes larger than in the original EIF. The larger the gap, the easier it is to distinguish cyber-attack transactions. The results on three types of cyber attacks show the high efficiency and generality of WEIF.

The remainder of this paper is organized as follows to present our work and contributions logically. “Related work” section examines SML and UML related works, and the detailed information on the traditional isolation forest and its extension. “Methodology” section presents the overall framework for detecting cyber attacks, as well as our proposed algorithm and its validation results based on simulation data. “Experimental evaluation” section shows the detailed information about the training dataset and the analysis of experimental results. Finally, “Conclusion and future work” section concludes and discusses future work.

Related work

In essence, detecting cyber-attack transactions can be considered as identifying rare transactions that deviate significantly from most of the transactions. Because blockchain’s openness makes it easier for researchers to access transaction data, an increasing number of researchers are working on developing new technologies to detect various types of cyber attacks on the blockchain. The majority of related studies focused on the use of SML, with only a few studies focusing on UML. Therefore, we introduce related studies of SML and UML. As aforementioned, our proposed algorithm is developed based on the traditional isolation forest and its extension. Thus, the theories of the standard isolation forest and its extension are also elaborated in this section. These mentioned methods reviewed in this section will be compared with our proposed algorithm in “Experimental evaluation” section.

SML

SML is a type of machine learning in which an algorithm is trained on a labeled dataset to recognize patterns and make predictions. The labeled dataset for SML consists of input data and corresponding output labels. SML evaluates its accuracy using the loss function and learns from training data until the error is decreased sufficiently. As an increasing number of cyber attacks of blockchain are provided on open-source websites, such as Etherscan, Bscscan, and Rekt database, most researchers attempt to collect malicious addresses via these open-source websites and focus on the use of SML to detect malicious transactions by learning the transaction behaviors of malicious addresses.

Farrugia et al. (2020) used the XGBoost classifier on a balanced dataset (4,699 accounts) to detect malicious accounts on the Ethereum blockchain. Despite its 96% accuracy, the execution time of this model is more than 62 s, which is much longer than the average block times of Ethereum and BSC. This indicates that the XGBoost classifier is incapable of detecting cyber-attack transactions on account-based blockchains in real time. According to Aziz et al. (2022), various SML methods, including random forest, XGBoost, and the light gradient boosting machine (LGBM), have recently been used to detect fraudulent transactions by learning the transaction behavior of labeled accounts, and all of these models achieve an F1 score of more than 93%. All three models are ensemble learning techniques, which are introduced in Table 2.

Table 2 A brief introduction of SML methods

Furthermore, some researchers have applied graph convolutional network (GCN) techniques (Shen et al. 2021; Yu et al. 2021) to the identity inference of phishing scams and Ponzi scheme based on the balanced dataset, and these techniques have also achieved good performance with around 90% on F1 score. A brief description of GCN is also provided in Table 2.

According to the existing SML-related studies, SML methods are effective for identifying malicious addresses based on balanced datasets. However, the datasets for anomaly detection of cyber attacks on account-based blockchains are extremely imbalanced, which may result in poor performance and limited generality of SML methods. In recent years, various types of UML have been used to deal with imbalanced datasets, and UML is introduced in detail as follows.

UML

As mentioned above, SML methods are much more resource intensive because they need labeled data, whereas UML methods discover hidden patterns or data clusters without human intervention. Currently, UML is the primary technology for detecting anomalies in many fields, including finance, telecommunications, and network administration (Carcillo et al. 2021; Hilas and Mastorocostas 2008; Rovetta et al. 2020). However, UML methods for detecting cyber attacks on the blockchain have received little attention. Only kernel-based techniques have been used to detect abnormal addresses in the historical transaction network (Patel et al. 2020), achieving an F1 score of approximately 80%. To explain how UML techniques work and how they can be applied to the detection of cyber-attack transactions on account-based blockchains, this section introduces distance-based, clustering-based, histogram-based, kernel-based, neural network-based, and ensemble-based techniques, as shown in Table 3.

Table 3 A brief introduction of UML methods

Among the UML methods shown in Table 3, the existing experiments on public datasets have shown that the isolation forest algorithm outperforms the other common UML techniques in terms of efficiency and accuracy while consuming significantly less memory (Falcão et al. 2019; Liu et al. 2012). However, the random isolation forest cuts are always horizontal or vertical, resulting in bias and artifacts in the anomaly score map (Hariri et al. 2019). Hariri et al. (2019) proposed the EIF to mitigate bias by using a random slope and a random intercept for branch cuts. Therefore, we use the traditional isolation forest and EIF as the base of our proposed model in this paper, and the detailed information on the isolation forest and EIF is further introduced as follows.

Isolation forest and its extensions

Isolation forest

An isolation forest, like a random forest, is built with decision trees, but it is a UML method because there are no predefined labels. The central idea behind the isolation forest is to isolate anomalies by constructing a series of isolation trees with random attributes (Liu et al. 2012). An isolation tree is constructed using the algorithm shown in Table 12 (“Appendix 3”) by splitting the subsample observations at a split value of a randomly selected attribute. Observations whose value for that attribute is less than the split value go left, whereas the others go right, and the process is repeated recursively until the tree is fully constructed. The split value is randomly selected between the minimum and maximum values of the selected attribute. Although the isolation forest is a typical UML method, its random cuts are always axis-parallel, that is, either horizontal or vertical in two dimensions. Therefore, several extensions of the isolation forest have been developed in recent years (Hariri et al. 2019).
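
The construction described above can be illustrated with a minimal Python sketch. It is not the algorithm of Table 12 itself, which is not reproduced in this section; the dictionary-based tree layout, the `max_depth` limit, and the function names are assumptions made for illustration.

```python
import random

def build_isolation_tree(points, depth=0, max_depth=8, rng=random):
    """Recursively split `points` (lists of floats) on a randomly selected attribute."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}            # external (leaf) node
    attr = rng.randrange(len(points[0]))        # randomly selected attribute
    lo = min(p[attr] for p in points)
    hi = max(p[attr] for p in points)
    if lo == hi:                                # simplification: stop if this attribute cannot split
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                 # split value between the attribute's min and max
    left = [p for p in points if p[attr] < split]
    right = [p for p in points if p[attr] >= split]
    return {
        "attr": attr,
        "split": split,
        "left": build_isolation_tree(left, depth + 1, max_depth, rng),
        "right": build_isolation_tree(right, depth + 1, max_depth, rng),
    }

def path_length(tree, x, depth=0):
    """Number of edges traversed from the root until x reaches a leaf."""
    if "size" in tree:
        return depth
    branch = "left" if x[tree["attr"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)
```

Averaged over many such trees, an observation far from the bulk of the data tends to reach a leaf after fewer splits than a typical observation, which is the property the anomaly score relies on.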

EIF

Among the extensions of the isolation forest, the EIF performs better (Hariri et al. 2019) because it eliminates this disadvantage by changing the way branch cuts are made. In contrast to the isolation forest, the EIF draws a random slope and a random intercept for each branch cut on a multidimensional dataset. The methods for generating the random slope and intercept are briefly introduced here. The random slope is a normal vector denoted as \({\varvec{n}}\), obtained by drawing a random number for each coordinate of \({\varvec{n}}\) from the standard normal distribution N(0,1). As a result, the branch cut for a high-dimensional dataset is a hyperplane rather than an axis-parallel line. The intercept, denoted as \({\varvec{p}}\), is chosen from the value range of the training data. For a given point \({\varvec{x}}\), the branching criterion for data splitting is as follows:

$$\left( {{\varvec{x}} - {\varvec{p}}} \right) \cdot {\varvec{n}} \le 0,$$
(1)

where a data point \({\varvec{x}}\) that satisfies the condition is passed to the left branch, and any other point moves down to the right branch. In this way, the value of the intercept is restricted to the available data at each branch point as trees with larger depths are constructed. This criterion for choosing the intercept yields more possible branching options in areas of data concentration and fewer in areas with few observations.
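
A single EIF branch cut following Eq. (1) might be sketched as below. Drawing each coordinate of the intercept uniformly from the range of the data at the current node is an assumption about how the intercept is "chosen from the value range", and the function names are hypothetical.

```python
import random

def random_cut(points, rng=random):
    """Draw a random normal vector n and an intercept p for one branch cut."""
    dim = len(points[0])
    # Each coordinate of n is drawn from the standard normal distribution N(0, 1).
    n = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Assumption: each coordinate of p is uniform over the data range at this node.
    p = [
        rng.uniform(min(x[j] for x in points), max(x[j] for x in points))
        for j in range(dim)
    ]
    return n, p

def goes_left(x, n, p):
    """Branching criterion of Eq. (1): (x - p) . n <= 0."""
    return sum((xj - pj) * nj for xj, pj, nj in zip(x, p, n)) <= 0

def split(points, n, p):
    """Partition the points at a node into left and right branches."""
    left = [x for x in points if goes_left(x, n, p)]
    right = [x for x in points if not goes_left(x, n, p)]
    return left, right
```

Because the intercept is redrawn from the points that actually reach each node, deeper cuts stay inside the region the data occupies, which is the restriction described in the preceding paragraph.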

Except for the construction of isolation trees, there are few differences between the isolation forest algorithm and the architecture of EIF, which includes four procedures: isolation tree construction, depth computation, EIF construction, and anomaly score computation (Hariri et al. 2019). First, the isolation trees for the EIF are built using the Eq. (1) as described in Table 13 (“Appendix 3”). Secondly, how to compute the depth of an observation on a given extended isolation tree is elaborated in Table 14 (“Appendix 3”). Finally, Table 15 (“Appendix 3”) shows the construction of an EIF and the computation of anomaly scores based on the isolation tree and depth computation.
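
Tables 13 to 15 are not reproduced in this section, so the following is only a sketch of the final step under the assumption that the anomaly score is the one defined for the original isolation forest by Liu et al. (2012): the mean depth of an observation is normalized by c(n), the average path length of an unsuccessful binary search tree lookup over n points, and mapped to (0, 1]. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers

def c(n):
    """Normalizing constant: average path length of an unsuccessful BST search over n points."""
    if n > 2:
        harmonic = math.log(n - 1) + EULER_GAMMA   # H(n - 1) is approximated by ln(n - 1) + gamma
        return 2.0 * harmonic - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0

def anomaly_score(mean_depth, n):
    """Score near 1 suggests an anomaly; scores well below 0.5 suggest a normal point (requires n >= 2)."""
    return 2.0 ** (-mean_depth / c(n))
```

Under this definition, an observation whose mean depth equals c(n) receives a score of exactly 0.5, and shallower observations score higher.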

Methodology

First, we propose a general framework for detecting cyber attacks in the context of the account-based blockchain. As shown in Fig. 1, this framework comprises four stages: data source, data clean, feature process, and model training and prediction. In the data source stage, the hundreds of real-time transactions in a block (about 200 transactions every 12 s on Ethereum or about 150 transactions every 3 s on BSC) are extracted through open-source websites. The data clean stage is separated into the identification of compromised addresses, data extraction of historical transactions, and feature generation based on transaction behaviors, which are executed in sequence. In the feature process stage, all continuous features are processed by data normalization and the three sigma process, and then all discrete features and the processed continuous features are merged and fed into correlation analysis. In the final stage of model training and prediction, all the training data are used to train our proposed new algorithm, WEIF, one of the main contributions of this work. Each procedure in the proposed framework is described in detail in the following subsections, together with the evaluation metric.

Fig. 1 The framework of anomaly detection proposed in this work

Data source

For the data source stage, all transactions are obtained from Etherscan and Bscscan, which are the leading platforms for Ethereum and BSC, respectively. A complex transaction on Etherscan is used as an example to demonstrate all of the detailed transaction information used in this paper (Fig. 2). The transaction details are divided into six parts, as shown in Fig. 2. The block number in Part 1 is the location where transactions are stored and encrypted; a new block is generated every 2.5 s on BSC and every 12 s on Ethereum. Part 2 contains all of the smart contract function calls. Part 3 includes the transaction initiator, which refers to the EOA mentioned in “Appendix 1”. Part 4 shows all the transfers of the native crypto assets, namely ETH and BNB. Part 5 contains all the token transfers, also known as nonnative crypto asset transfers. Finally, Part 6 includes the transaction value and the transaction fee, where the transaction fee is equal to the product of the gas price and the quantity of gas consumed in the transaction.

Fig. 2 The transaction details of account-based blockchains as an example

According to the definition of transaction on the blockchain, the transaction details can also be divided into four categories (i.e., external transaction, internal transfer, token transfer and transaction action). With regard to the external transaction, the transaction details of Part 1, Part 3, and Part 6 in Fig. 2 make up the basic information of external transactions. Regarding the internal transfer, the information in Part 4 contains all the internal transfer information. With regards to the token transfer, Part 5 represents the token transfer of the transaction. Apart from the three categories introduced above, Part 2 is named as the transaction action for transaction on the blockchain.

Note that not all transactions contain all six parts of the transaction details in Fig. 2. The more complicated a transaction is, the more parts of the transaction details it contains on the account-based blockchain. For instance, the cyber-attack transactions of Identity theft are mainly composed of external transactions and token transfers, while the cyber-attack transactions of Flash loan attacks and Smart contract exploits usually involve all parts in Fig. 2.

Data clean

Three different procedures will be executed in sequence during the data clean stage, which are the identification of compromised addresses, data extraction, and feature generation.

Identification of compromised address

The compromised address is the target of hackers because it holds a large quantity of crypto assets. Meanwhile, the majority of compromised addresses exhibit long-term usage behavior. In contrast, because the majority of hacker addresses are newly created for the cyber attacks, only a few of their transaction records are available for behavior analysis. Because the compromised address has more historical transactions than the hacker address, its behavior is easier to analyze.

As a result, a specific strategy is proposed in this paper to find the address with the highest spend volume as the suspected compromised address, despite the fact that the compromised addresses for real-time transactions are unknown. Based on the transaction details shown in Fig. 2, the suspected compromised addresses are located by computing the cumulative total expenditure of various crypto assets. The data deficiency of hacker addresses can thus be addressed to some extent.

Furthermore, over one million transactions are generated every day on the blockchain according to the statistics of Etherscan. Consequently, analyzing all the transactions in real time is not feasible. To make our proposed framework suitable for real-time analysis, we propose a novel approach that filters the transactions according to the expenditures of the suspected compromised addresses. With this filtering approach, only a small number of transactions are fed into the subsequent analysis, so the time cost of the proposed framework is significantly reduced.
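
The identification and filtering strategy can be sketched as follows. The data layout is hypothetical: each transaction is assumed to be a list of `(sender, receiver, value)` transfers whose values have already been converted to a common unit across crypto assets, and `threshold` stands in for the fund expenditure threshold, whose actual value is not given in this section.

```python
from collections import defaultdict

def suspected_compromised_address(transfers):
    """Return the address with the highest cumulative expenditure in one transaction."""
    spent = defaultdict(float)
    for sender, _receiver, value in transfers:
        spent[sender] += value                  # accumulate outflow across all asset transfers
    address = max(spent, key=spent.get)
    return address, spent[address]

def filter_transactions(block, threshold):
    """Keep only transactions whose suspected address spends at least `threshold`.

    `block` maps a transaction hash to its list of transfers (hypothetical layout).
    """
    kept = []
    for tx_hash, transfers in block.items():
        if not transfers:                       # nothing was transferred, nothing to inspect
            continue
        address, total = suspected_compromised_address(transfers)
        if total >= threshold:
            kept.append((tx_hash, address, total))
    return kept
```

Only the transactions returned by `filter_transactions` would proceed to data extraction, which is how the filter limits the number of addresses whose history must be fetched.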

Data extraction

In fact, the real-time detection of cyber attacks only depends on the behavior of recent transactions due to the ever-changing behavior of transactions. Meanwhile, the suspected compromised address’s numerous transactions make data extraction time consuming. Consequently, as shown in Fig. 3, we set a limit of transaction records, referring to the dynamic window, to construct the dataset of machine learning methods. For the compromised address of cyber-attack transaction, the window size is set as s, and \(T_{k}\) represents the cyber-attack transaction marked as red, where k is the transaction number of compromised address. Then, if the number of historical transactions before the cyber-attack transaction is larger than s, the transactions \([T_{k - s} , T_{k - s + 1} , \ldots , T_{k} ]\), marked as blue, will be extracted. Otherwise, the window size is equal to the number of historical transactions before the cyber attack and the transactions \([T_{1} , T_{2} , \ldots , T_{k} ]\) will be extracted. This way, the time cost can be further reduced by the data extraction method. In particular, the different window sizes will be chosen to verify the stability and efficiency of our proposed algorithm.
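
The dynamic window rule above amounts to a bounded slice of the address history. The sketch below assumes the history is a Python list ordered from oldest to newest, with \(T_{1}\) stored at index 0; the function name is illustrative.

```python
def dynamic_window(history, k, s):
    """Return [T_{k-s}, ..., T_k] if more than s transactions precede T_k, else [T_1, ..., T_k].

    history: transactions ordered oldest to newest (history[0] is T_1).
    k: 1-based index of the transaction under inspection.
    s: window size.
    """
    start = max(1, k - s)          # fall back to T_1 when the history is short
    return history[start - 1:k]    # convert the 1-based range [start, k] to a slice
```

For example, with ten historical transactions and s = 3, the window for the tenth transaction holds the seventh through tenth transactions, whereas the window for the third transaction holds only the first three.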

Fig. 3 Using dynamic window to construct the dataset from the historical transactions of compromised address. The cyber-attack transaction is marked as red, and the transactions marked with blue are recent transactions within the window size

Feature generation

Following the transaction details introduced above, we establish a general feature set for all types of cyber attacks based on the external transaction, internal transfer, token transfer, and transaction action. All the features are therefore separated into four categories (Table 4). Seven external transaction features are generated from the basic information of external transactions, in which an EOA sends native crypto assets directly to another EOA or smart contract (see “Appendix 1”). The internal transfer features, namely transfer volume and transfer count, are computed from the internal transfers carried out through a smart contract as an intermediary. The token transfer features are extracted from the transactions with token transfers, which refer to the transfers of ERC-20 or BEP-20 tokens in the transaction details. Finally, the transaction action features are generated from the specific function calls of the smart contract, as shown in the transaction action in Fig. 2.

Table 4 Brief feature descriptions

Feature process

We process feature data using three methods (data normalization, three sigma process, and correlation matrix) to extract more useful information and speed up model training for anomaly detection.

Data normalization

Variables measured at different scales do not contribute equally to model fitting and may result in bias. To address this potential problem, Standard Scaler, also known as z score, is used for data normalization to speed up the model training and improve the model performance (Ioffe and Szegedy 2015). All feature values are rescaled to the new distribution so that the mean of observed values is 0 and the standard deviation is 1. The specification is expressed as

$$z = \frac{x - \mu }{\sigma },$$
(2)

where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the original data.
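
Eq. (2) can be written in a few lines of standard-library Python. The sketch uses the population standard deviation, which is an assumption about the exact estimator, and returns zeros for a constant feature to avoid division by zero.

```python
import statistics

def z_score(values):
    """Rescale a feature so that its mean is 0 and its standard deviation is 1 (Eq. 2)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)      # population standard deviation (assumed estimator)
    if sigma == 0:
        return [0.0 for _ in values]       # constant feature: every value equals the mean
    return [(x - mu) / sigma for x in values]
```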

Three sigma process

In fact, the probability of occurrence decreases as the value deviates from the mean. Consequently, we apply the probabilistic rules of normal distribution to process transaction features. According to the normal distribution, the standard deviation, that is, sigma (σ), defines how far the normal distribution is spread around the mean. For an approximately normal distributed dataset, it follows a set of probabilistic rules described as follows: 68% of all values fall in [mean − σ, mean + σ], 95% of all values fall in [mean − 2σ, mean + 2σ], and 99.7% of all values fall in [mean − 3σ, mean + 3σ]. According to the rules, there are only 0.3% values falling outside three times the sigma range (3σ), and thus we can judge these values that fall outside [mean − 3σ, mean + 3σ] to be anomalous.
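
The three sigma rule can be sketched as a simple flagging function. How the flagged values are subsequently encoded as features is not specified in this section, so returning a list of Boolean flags is an assumption made for illustration.

```python
import statistics

def three_sigma_flags(values):
    """Flag values that fall outside [mean - 3*sigma, mean + 3*sigma]."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)      # population standard deviation (assumed estimator)
    return [abs(x - mu) > 3 * sigma for x in values]
```

Note that with very few observations no value can lie more than three standard deviations from the mean, so the rule is only informative when the window contains enough transactions.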

Correlation analysis

Some features, while highly relevant to a specific type of cyber attack, may be redundant: if two features are highly correlated with each other, one of them adds little information. Eliminating redundant features may therefore cause no significant loss of accuracy while yielding a more efficient model. In our proposed framework, correlation analysis is used for feature selection by removing redundant features. The correlation coefficient, denoted r, ranges from − 1 to + 1 and quantifies the direction and strength of the linear association between two features. It is defined as

$$r = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {p_{i} - \overline{p}} \right)\left( {q_{i} - \overline{q}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {p_{i} - \overline{p}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {q_{i} - \overline{q}} \right)^{2} } }}$$
(3)

where \(n\) is the number of observations, \(p_{i}\) and \(q_{i}\) are the \(i\)-th values of the two features, and \(\overline{p}\) and \(\overline{q}\) are the mean values of the two features.
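The sketch below computes Eq. (3) directly and uses it to drop redundant features. The greedy selection rule, the `threshold` of 0.95, and the feature names are assumptions made for illustration; the paper does not state the cut-off it applies.

```python
from math import sqrt


def pearson_r(p, q):
    """Correlation coefficient of Eq. (3) for two equally long feature columns."""
    n = len(p)
    p_bar, q_bar = sum(p) / n, sum(q) / n
    cov = sum((pi - p_bar) * (qi - q_bar) for pi, qi in zip(p, q))
    ss_p = sum((pi - p_bar) ** 2 for pi in p)
    ss_q = sum((qi - q_bar) ** 2 for qi in q)
    if ss_p == 0 or ss_q == 0:
        # r is undefined for a constant column; treat it as uncorrelated here.
        return 0.0
    return cov / (sqrt(ss_p) * sqrt(ss_q))


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if |r| with every already kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson_r(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns; "volume_max" is exactly twice "volume".
features = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume_max": [2.0, 4.0, 6.0, 8.0, 10.0],
    "count": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_redundant(features)
```

In this toy input, `volume_max` is perfectly correlated with `volume` and is dropped, while `count` is only weakly correlated and is kept.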

WEIF

Proposed algorithm

According to the definition of EIF, a random slope and intercept are drawn for each branch cut during EIF construction, and a lower average depth indicates a more abnormal observation. Because the slope and intercept are selected at random, a normal observation may end up close to the root on a few trees, whereas an abnormal observation may end up far from the root on a few trees. As a result, an observation’s anomaly score, calculated from its average depth over the extended isolation trees, may be biased by these atypical depths. To mitigate the bias, we propose a novel algorithm, named WEIF, that weights the original depths of a given observation in EIF; the anomaly score of the observation is then computed from the average of the weighted depths produced by the algorithm in Table 5.

Table 5 Algorithm of weighted depth

In general, when the random trees of a forest produce shorter path lengths for some specific points, those points are highly likely to be anomalies. According to our proposed algorithm in Table 5, depths less than the first quartile of the original depths are increased if the median of the original depths is greater than their mean. Conversely, depths greater than the third quartile of the original depths are decreased if the median is less than the mean. If the median of the original depths equals their mean, the depths do not change. In this way, the depth difference between normal and abnormal observations becomes larger. The complexities of our proposed algorithm for training and prediction are \(O\left( {t\psi \log \left( \psi \right)} \right)\) and \(O\left( {nt\log \left( \psi \right)} \right)\), respectively, where \(t\) is the number of trees, \(\psi\) is the subsample size, and \(n\) is the number of observations in the dataset.
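The weighting rule described in this paragraph can be sketched as follows. This is not the algorithm of Table 5 itself: the way a depth is increased or decreased (multiplying or dividing by a hypothetical `multiplier`) is an assumption, as are the function name and the sample depths. Only the quartile, median, and mean conditions come from the text.

```python
from statistics import fmean, median, quantiles


def weight_depths(depths, multiplier=2.0):
    """Adjust one observation's per-tree depths following the rule in the text.

    ASSUMPTION: depths are increased by multiplying and decreased by dividing
    by `multiplier`; the exact update in Table 5 may differ.
    """
    q1, _, q3 = quantiles(depths, n=4)  # first, second, and third quartiles
    mean, med = fmean(depths), median(depths)
    if med > mean:
        # Mostly deep paths with a few shallow ones: lift the shallow depths.
        return [d * multiplier if d < q1 else d for d in depths]
    if med < mean:
        # Mostly shallow paths with a few deep ones: shrink the deep depths.
        return [d / multiplier if d > q3 else d for d in depths]
    return list(depths)  # median equals mean: leave the depths unchanged


# Hypothetical per-tree depths for one normal and one abnormal observation.
normal_depths = [9, 10, 10, 11, 10, 2, 10, 9]
abnormal_depths = [2, 3, 2, 3, 2, 9, 3, 2]
weighted_normal = weight_depths(normal_depths)
weighted_abnormal = weight_depths(abnormal_depths)
```

In this toy input, the single shallow depth of the normal observation is raised and the single deep depth of the abnormal observation is lowered, so the gap between their average depths widens, which is the effect the text describes.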

Following EIF, the architecture of WEIF contains four steps (Fig. 4): generating sub-datasets, constructing isolation trees, weighting the depths, and computing the anomaly score. Compared with the architecture of the traditional isolation forest and its extension, the only substantial change is the weighting of the depths based on the algorithm shown in Table 5. Specifically, the anomaly score for an observation y is defined as

$$S\left( {y,n} \right) = 2^{{ - \frac{{E\left( {h\left( y \right)} \right)}}{c\left( n \right)}}} ,$$
(4)

where \(E\left( {h\left( y \right)} \right)\) is the mean of the weighted depths of a given observation over all trees, and \(c\left( n \right)\) normalizes the average path length \(E\left( {h\left( y \right)} \right)\). Here \(c\left( n \right)\) is defined as the average path length of an unsuccessful search in a binary search tree (Liu et al. 2012), i.e.,

$$c\left( n \right) = 2H\left( {n - 1} \right) - \left( {\frac{{2\left( {n - 1} \right)}}{n}} \right),$$
(5)

where \(H\left( i \right)\) can be calculated by \(ln\left( i \right)\) + 0.5772156649 (Euler’s constant) and n is the number of observations in a given dataset (Liu et al. 2012).
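Eqs. (4) and (5) translate directly into a few lines of Python. The function names below are illustrative assumptions, and the guard for `n < 2` is added only to avoid taking the logarithm of zero.

```python
from math import log

EULER_GAMMA = 0.5772156649  # Euler's constant, as given in the text


def harmonic_estimate(i):
    """Approximate the harmonic number H(i) by ln(i) + Euler's constant."""
    return log(i) + EULER_GAMMA


def c(n):
    """Normalizing term of Eq. (5) for a dataset of n observations."""
    if n < 2:
        raise ValueError("c(n) in Eq. (5) needs n >= 2")
    return 2.0 * harmonic_estimate(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Anomaly score of Eq. (4); mean_depth is E(h(y)) over all trees."""
    return 2.0 ** (-mean_depth / c(n))


# A mean depth equal to c(n) gives a score of exactly 0.5.
score_at_c = anomaly_score(c(256), 256)
```

A mean depth well below \(c(n)\) pushes the score toward 1, while a mean depth well above it pushes the score toward 0, so shallower observations score as more anomalous.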

Fig. 4 Architecture of weighted and extended isolation forest. Firstly, several sub-datasets are generated by randomly sampling from the training dataset. Secondly, all the sub-datasets are passed into the construction of the isolation trees, and the original depths of the observations in the sub-datasets are computed for each isolation tree. Thirdly, all the original depths are weighted based on the algorithm in Table 5. Finally, the anomaly score is calculated on the average depth of the weighted depths

Recalling Eq. (4), a smaller \(E\left( {h\left( y \right)} \right)\) means a higher anomaly score. All observations are passed through the isolation trees and assigned an anomaly score, and an observation with a higher anomaly score is more anomalous. The threshold of the anomaly score shown in Fig. 4 is determined by the expected proportion of anomalies in the whole dataset, which is called contamination in this paper.
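One simple way to turn a contamination value into a score threshold is sketched below. The rank-based rule, the function names, and the sample scores are assumptions for illustration; the paper does not spell out how the threshold is computed from the contamination.

```python
def threshold_from_contamination(scores, contamination=0.05):
    """Return the score cut-off that flags roughly `contamination` of the observations."""
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must lie strictly between 0 and 1")
    n_anomalies = max(1, round(contamination * len(scores)))
    ranked = sorted(scores, reverse=True)  # highest (most anomalous) scores first
    return ranked[n_anomalies - 1]


def flag_anomalies(scores, contamination=0.05):
    """Flag every observation whose score reaches the cut-off (ties are all flagged)."""
    cutoff = threshold_from_contamination(scores, contamination)
    return [s >= cutoff for s in scores]


# Hypothetical anomaly scores: 95 ordinary observations and 5 high-scoring ones.
scores = [0.40 + 0.001 * i for i in range(95)] + [0.70, 0.72, 0.75, 0.80, 0.85]
cutoff = threshold_from_contamination(scores)
flags = flag_anomalies(scores)
```

With a contamination of 5% and 100 scores, the five highest-scoring observations are flagged.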

Illustrative examples

To illustrate how WEIF works, we provide two examples based on a two-dimensional dataset sampled from a two-dimensional distribution with a zero mean vector and an identity covariance matrix. The first example compares normal and abnormal observations processed by our proposed algorithm, while the second example demonstrates the differences between the outputs of our proposed algorithm (WEIF) and EIF.

For the first example, the depths of normal and abnormal observations are compared in several forms. First, all observations of the two-dimensional dataset are plotted in the scatter plot in Fig. 5a, where a few samples lie far from the center of the dataset. Second, the contour plot in Fig. 5b is obtained from the average depths of all observations computed by our proposed algorithm. Here, observations far from the data center are colored darker than observations close to it, indicating that abnormal observations can be effectively identified by our proposed algorithm. Furthermore, the depths of a normal and an abnormal observation computed by our proposed algorithm are compared in Fig. 6. As shown in Fig. 6a, the abnormal observation is quickly isolated, whereas the normal observation continues all the way to the bottom of the tree. The depths of the observations on all isolation trees are shown as straight lines in Fig. 6b, and the majority of the depths of the abnormal observation are clearly shorter than those of the normal observation.

Fig. 5 Scatter plot and anomaly score map in this example. a The data of the scatter plot are sampled from a two-dimensional distribution with zero mean vector and identity covariance matrix. b The anomaly score map is plotted based on the weighted depths processed by WEIF. A darker color indicates a more anomalous observation

Fig. 6 Structure of a single tree and depths of WEIF. The results of the normal observation are marked in blue, while the results of the abnormal observation are marked in red. a The paths for a normal observation and an abnormal observation are plotted in a single isolation tree. b The depths of an observation in the whole forest are displayed as radial lines, and the length of each line represents the value of the depth

For the second example, the depths of a normal and an abnormal observation are computed by both WEIF and EIF. To compare the two methods, the depths of the normal observation are plotted in Fig. 7a, and the depths of the abnormal observation are plotted in Fig. 7b. The blue and red lines represent the depths on each tree computed by EIF and WEIF, respectively. Compared with EIF, our proposed algorithm increases the depths of the normal observation, as shown in Fig. 7a, whereas it decreases the depths of the abnormal observation in Fig. 7b. Consequently, the difference in average depth between the normal and abnormal observations is larger for our proposed algorithm than for EIF.

Fig. 7 Comparison of the depths processed by WEIF and EIF. The red and blue lines represent the depths computed by WEIF and EIF, respectively

Evaluation metric

The F1 score, the harmonic mean of precision and recall, is chosen as the main metric of the models. It ranges over [0, 1] and reflects both how precise the classifier is (precision) and how many real attacks it finds (recall). A greater F1 score indicates better performance. The precision, recall, and F1 score are formulated as follows:

$$Precision = \frac{True\;Positive}{{True\;Positive + False\;Positive }}$$
(6)
$$Recall = \frac{True\;Positive}{{True\;Positive + False\;Negative }}$$
(7)
$$F1\;score = 2*\frac{Precision*Recall}{{Precision + Recall}}.$$
(8)

True positive represents the number of real cyber attacks correctly detected, false positive represents the number of normal transactions wrongly detected as cyber attacks, and false negative represents the number of real cyber attacks wrongly classified as normal transactions.
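Eqs. (6) to (8) can be checked on a small example. In the sketch below, label 1 denotes a cyber-attack transaction and label 0 a normal transaction; this encoding, the function name, and the sample labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 score (Eqs. 6 to 8) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: 4 real attacks, of which 3 are detected, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 score all equal 0.75.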

Experimental evaluation

Several experiments on various types of cyber attacks are carried out to assess the efficiency and robustness of our proposed algorithm against all of the methods mentioned in “Related work” section.

Dataset and experimental setup

Dataset

All the detailed information on cyber-attack transactions is extracted from open-source websites, namely Etherscan, Bscscan, and the Rekt Dataset. First, all labeled cyber-attack transactions are extracted from the Rekt Dataset and classified into Smart contract exploit, Flash loan attack, and Identity theft. Second, 66, 62, and 58 compromised addresses of cyber-attack transactions are extracted for the Smart contract exploit, the Flash loan attack, and the Identity theft, respectively, based on the identification strategy of compromised address. In addition, 200 addresses that hackers have not attacked are randomly extracted from Etherscan and Bscscan to verify the effectiveness of the model; they consist of smart contracts and EOAs and are similar to the compromised addresses of the cyber-attack transactions. Finally, the historical transactions of each compromised address are extracted from Etherscan and Bscscan using the data extraction strategy shown in Fig. 3, with the window size set to 2000.

To display the distribution of the datasets for the compromised addresses and the differences between the three types of cyber attacks, this study computes descriptive statistics of the transaction quantity for the compromised addresses, as shown in Table 6. The descriptive statistics yield three findings. First, more than 50% of the compromised addresses of Identity theft have more than 2000 historical transactions, whereas more than 75% of the compromised addresses of the other cyber attacks have fewer than 2000 historical transactions. This indicates that most Flash loan attacks and Smart contract exploits were launched early in the life of the decentralized applications, whereas the compromised addresses of Identity theft had typically been in use for a long time or for high-frequency trading. Second, the number of historical transactions decreases in the order Identity theft, Smart contract exploit, and Flash loan attack according to the median, third quartile, and max values (see Table 6), demonstrating that the three types of cyber attacks differ from each other. Finally, the first quartile is larger than 100 for all types of transactions, which indicates that the data deficiency of hack addresses is resolved by extracting the historical transactions of the compromised addresses.

Table 6 Descriptive statistics of transaction numbers for the compromised addresses

After the data extraction, several features (mentioned in “Methodology” section) are generated from external transactions, internal transfers, token transfers, and transaction actions. To take full advantage of the transaction features, descriptive statistics and correlation analyses are carried out, and the results are shown in Table 7 and Fig. 8. As shown in Table 7, most of the min values of the features are equal to 0, since most normal transactions are simple compared with the cyber-attack transactions.

Table 7 Summary statistics of features
Fig. 8 Results of correlation analysis

According to the result of the correlation analysis in Fig. 8, the type of cyber attack is strongly related to the external features and token transfer features (i.e., gas fee, spend volume, spend-anomaly, to–ts-count and to-ts–count-anomaly). The correlation analysis also reveals redundant features in the feature data. For example, the correlation coefficients between internal transfer features (i.e., inter-tx-volume, inter-ts-volume-50%, and inter-ts-volume-max) are extremely high, which lowers model efficiency. To improve the detection model’s efficiency, we remove redundant features prior to model training.

Experimental setup

To provide sound results, the dataset of the compromised address is split into a train dataset and a test dataset according to the ratios of 70% and 30%. Furthermore, the expected proportion of anomalies in the entire dataset is set at 5% for all UML methods, as mentioned in “Methodology” section. For our proposed algorithm, several parameters must be decided before model training, such as the multiplier in Table 5, the extension level, and the number of isolation trees in Table 15 (“Appendix 3”). Grid search is used to find the optimal parameters of WEIF, as it is widely used to tune algorithms such as SVM and neural networks (Pontes et al. 2016; Syarif et al. 2016). Based on parameter estimation using sklearn’s GridSearchCV,Footnote 12 the multiplier, the extension level, and the number of isolation trees of WEIF are set to 2, 4, and 100, respectively.
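The setup can be illustrated with a plain Python stand-in for the split and the exhaustive search. This sketch does not use sklearn's GridSearchCV; the ordered (unshuffled) split, the function names, the parameter grid, and the toy scoring function are all assumptions, and the toy optimum is unrelated to the values reported above.

```python
from itertools import product


def train_test_split_ordered(rows, train_ratio=0.7):
    """Split rows into the first 70% (train) and the last 30% (test).

    ASSUMPTION: the rows keep their original order; the paper does not say
    whether the data are shuffled before splitting.
    """
    cut = round(train_ratio * len(rows))
    return rows[:cut], rows[cut:]


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(multiplier, extension_level, n_trees):
    """Hypothetical stand-in for a validation F1 score; not a real model evaluation."""
    return -abs(multiplier - 1.5) - abs(extension_level - 1) - abs(n_trees - 50) / 100


train, test = train_test_split_ordered(list(range(10)))
param_grid = {
    "multiplier": [1.0, 1.5, 2.0],
    "extension_level": [0, 1, 2],
    "n_trees": [50, 100],
}
best_params, best_score = grid_search(param_grid, toy_score)
```

In practice, `toy_score` would be replaced by a function that trains the detector with the given parameters and returns its F1 score on held-out data.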

Results

Based on the experimental setup, this section aims to provide two types of results: one is about the efficiency and generality of our proposed algorithm (WEIF) and the other is about the importance of the generated features in detecting cyber-attack transactions on blockchain.

The performance of WEIF

Our proposed algorithm and the comparative algorithms mentioned in “Related work” section are applied to three different types of cyber attacks. The results for the Smart contract exploit, the Flash loan attack, and the Identity theft are shown in Tables 8, 9 and 10. All recalls of GCN are equal to zero because of the heavy data imbalance; hence, its results are not included. The experimental results lead to the following findings.

Table 8 Result on Smart contract exploit
Table 9 Result on Flash loan attack
Table 10 Result on Identity theft

According to the results of the Smart contract exploit in Table 8, WEIF has the highest F1 score on the test dataset. The difference in F1 score between the train and test datasets is small, and the average F1 score of UML is around 0.83 for the test dataset. All SML methods perform well on the train dataset when compared to UML methods, but the average F1 score of SML on the test dataset is less than 0.5. As for the Flash loan attack in Table 9, WEIF outperforms all comparative models with an F1 score of 0.906 on the test dataset, and the average F1 score of UML is about 0.87 on the test dataset. However, although the SML methods perform well on the train dataset, they perform poorly on the test dataset, as shown in Table 9. As for Identity theft in Table 10, WEIF achieves the best performance with an F1 score of 0.753. Most UML methods perform poorly on the detection of Identity theft, except for the Avg k-nearest neighbors (KNN), KNN, EIF, and WEIF, and their average difference in F1 score between the train and test datasets is about 0.05. Similar to the results in Tables 8 and 9, SML performs well on the train dataset but poorly on the test dataset, as shown in Table 10.

To summarize, three types of cyber attacks can be detected based on the machine learning methods. First, for all three types of cyber attacks, the proposed algorithm achieves the highest F1 score on the test datasets, demonstrating the efficiency and generality of WEIF on different types of cyber attacks. Second, all of the F1 scores of SML on train datasets are significantly higher than those on test datasets, whereas the differences in F1 scores of UML for the two types of data sets are negligible. As a result, SML performs less effectively than UML in detecting cyber attacks on account-based blockchains. Last but not least, when compared to EIF, our proposed WEIF achieves F1 score improvements of about 0.03, 0.04 and 0.05, respectively, indicating that computing the anomaly score based on the weighted depths of the EIF makes our proposed model more efficient.

Feature importance

Generally speaking, a higher feature importance score means that the feature has a larger effect on the model. To determine which features are significant for cyber-attack detection, random forest is used to analyze feature importance, owing to its best performance in cyber-attack classification among all SML methods, as shown in Tables 8, 9 and 10. Figures 9, 10 and 11 show the feature importance results for the three types of cyber attacks, which reveal both differences and similarities.

Fig. 9 Feature importance of Smart contract exploit

Fig. 10 Feature importance of Flash loan attack

Fig. 11 Feature importance of Identity theft

In terms of the differences between the three types of cyber attacks, the number of significant features differs based on the results of feature importance. In particular, the top five features account for more than 70% of the impact in detecting Smart contract exploits, as shown in Fig. 9, whereas the cumulative importance score of the top ten features is about 60% for the detection of Flash loan attack in Fig. 10 and the top three features account for about 90% of impact for the detection of Identity theft in Fig. 11. In terms of similarity between three types of cyber attacks, some features are shared by all types of cyber attacks, as illustrated in Figs. 9, 10 and 11. For example, the feature of gas fee is included in the top ten significant features for all types of cyber attacks, demonstrating that more complicated operations within the cyber-attack transaction result in more gas consumed when compared to normal transactions. To sum up, although the cyber-attack transactions are different, the features designed in this paper are general for all types of cyber-attack detection based on the high F1 scores achieved by WEIF, shown in Tables 8, 9 and 10. Furthermore, the significant features for the three types of cyber attacks are almost the same as the results of feature importance for LGBM, as shown in Figs. 16, 17, 18 (see “Appendix 4”). This demonstrates that the result of feature importance does not significantly depend on the selection of SML methods.

Additional validations

Two additional validations are carried out to evaluate our proposed algorithm’s efficiency and robustness. The first validation is to conduct a robustness test of WEIF based on a new dataset. The second validation is to conduct experiments on different window sizes of training data, mentioned in the data extraction, to determine whether our proposed approach is robust and suitable for the real-time detection of cyber-attack transactions.

The first validation is performed on the new dataset, which contains all of the compromised addresses from cyber attacks as well as an alternate set of 100 random addresses that were not attacked by hackers. The robustness tests on WEIF for all three types of cyber attacks follow the same process as the previous section, and the results are listed in Table 11. Compared to the results shown in Tables 8, 9 and 10, the performances for all three types do not decrease, which indicates the robustness of WEIF.

Table 11 Robustness test of WEIF

Six experiments are designed in the second validation to test the robustness and execution time of WEIF using different sizes of training data mentioned in “Methodology” section, i.e., different window sizes of historical transactions of the compromised address for model training. Figures 12 and 13 show the results. As shown in Fig. 12, although the training data window size decreases from 2000 to 300, the F1 score of WEIF decreases by only 0.05, with the lowest F1 score being 0.815, again indicating that WEIF is robust and not sensitive to the size of the training data. According to Fig. 13, the average execution times of EIF and WEIF, that is, the average time consumed by model training and prediction for a single transaction, are almost the same, and the gap between them narrows as the window size decreases. This finding implies that, compared to EIF, WEIF does not significantly increase time consumption. Furthermore, WEIF’s lowest execution time is about 0.6 s, which is significantly less than the average block time mentioned above. Given the stability and robustness of our proposed algorithm, the window size can be set to a lower value for blockchains with shorter block times and increased for blockchains with longer block times.

Fig. 12 Results of F1 score with different window sizes of training data

Fig. 13 Results of execution time on different window sizes of training data. The blue and red bars represent the execution time of EIF and WEIF, respectively

Discussion

Based on the results of the experiments above, our study offers theoretical value for researchers and practical value for industry practitioners. On the theoretical side, we propose a general framework for cyber-attack detection that incorporates the compromised address recognizer, real-time transaction filter system, general feature generator, and detection model. Our proposed framework addresses the issue of data imbalance and data scarcity of hackers’ addresses. Furthermore, within this framework, we propose a novel algorithm named WEIF, based on the isolation forest and its extension, as the detection model; it outperforms all compared methods on three types of cyber attacks in our experiments. Based on its strong performance against various types of cyber attacks, our proposed framework and algorithm can serve as a theoretical foundation for improving the supervision and regulation of public blockchains.

On the practical side, the experimental results above support the generality, robustness, and low time consumption of our proposed algorithm on different types of cyber attacks. First, its generality and robustness make it well suited to detecting dynamic and ever-changing cyber attacks on account-based blockchains. Second, because of its low time consumption, our proposed algorithm can process real-time transactions in a short period of time. Meanwhile, to reduce the number of real-time transactions analyzed by our proposed algorithm, a novel approach of filtering real-time transactions based on the expenditures of compromised addresses is proposed. Finally, multiprocessing or multithreading can be used to accelerate the detection of cyber attacks. Together, this evidence shows that our proposed algorithm can be directly applied to the real-time detection of cyber attacks in the blockchain industry.

Conclusion and future work

Dynamic and ever-changing cyber attacks have frequently occurred on account-based blockchains in recent years. However, only a few machine learning techniques have been applied to the real-time detection of cyber attacks. To this end, we propose a systematic and comprehensive anomaly detection method to cope with this problem. First, a novel algorithm, namely WEIF, is developed for anomaly detection based on the standard isolation forest and its extended model. Then, we propose a general framework on the basis of our proposed algorithm through a comprehensive study of real-world examples of cyber attacks. Within this general framework, a novel approach for identifying the compromised address is created to solve the data deficiency of hack addresses and reduce the time consumption of our proposed framework. Next, several experiments are carried out on different types of cyber attacks to verify the efficiency and generality of our proposed algorithm. As expected, the experimental results demonstrate the advantage of our proposed method over many widely used state-of-the-art techniques. The results also indicate that SML techniques are not suitable for real-time detection of cyber attacks, owing to data imbalance and data deficiency. Finally, the results of additional experiments provide further evidence of the lower time consumption and the robustness of our proposed approach, illustrating that it is capable of real-time detection of cyber attacks on account-based blockchains.

In the future, we plan to extend our work from three aspects. First, we will try to apply the multivariate time-series analysis algorithms to the anomaly detections in the context of the account-based blockchain, since the historical transactions belong to the dataset with a time-series format. Second, crypto exchanges are the main gateway for connecting real-world user information to pseudonymous addresses on account-based blockchain, but few studies related to crypto exchanges have been conducted. Therefore, we plan to thoroughly study different types of crypto exchanges. Finally, we will develop applications to analyze the fund flows of illegal activities and automatically extract the funds’ path from the large transaction networks.

Availability of data and materials

Data and codes are available at https://github.com/fung2022/A-blockchain-oriented-approach-for-detecting-cyber-attack-transactions.

Notes

  1. To help readers understand the concepts of blockchain, “Appendix 1” lists and explains common concepts such as account, transaction, block, cryptocurrency, flash loan and decentralized exchange (DEX).

  2. https://docs.uniswap.org/.

  3. https://aave.com/.

  4. https://pancakeswap.finance/.

  5. https://ethereum.org/en/.

  6. https://www.bnbchain.org/en.

  7. https://solanaminer.com/.

  8. https://defiyield.app/rekt-database.

  9. https://bscscan.com/.

  10. ERC-20 introduces a standard for fungible tokens on Ethereum; in other words, each token is exactly the same (in type and value) as any other token.

  11. The BEP-20 token standard serves pretty much the same function as the ERC-20 token standard, but it applies to tokens built on the Binance Smart Chain (BSC).

  12. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Abbreviations

dapps: Decentralized applications
BSC: Binance smart chain
DEX: Decentralized exchange
SML: Supervised machine learning methods
PTA: Postmortem analysis technology
UML: Unsupervised machine learning methods
RF: Random forest
LGBM: Light gradient boosting machine
GCN: Graph convolutional network
LOF: Local outlier factor
CBLOF: Cluster-based local outlier factor
HBOS: Histogram-based outlier score
OCSVM: One-class SVM
VAE: Variational autoencoder
DeepSVDD: Deep support vector data description
FB: Feature bagging
IF: Isolation forest
EIF: Extended isolation forest
WEIF: Weighted and extended isolation forest
EOA: Externally owned account
ETH: Ether

References

  • Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: European conference on principles of data mining and knowledge discovery, pp 15–27

  • Aspris A, Foley S, Svec J, Wang L (2021) Decentralized exchanges: the “wild west” of cryptocurrency trading. Int Rev Financ Anal 77:101845

  • Aziz RM, Baluch MF, Patel S, Ganie AH (2022) LGBM: a machine learning approach for Ethereum fraud detection. Int J Inf Technol 1–11

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, pp 93–104

  • Carcillo F, Dal Pozzolo A, Le Borgne Y-A, Caelen O, Mazzer Y, Bontempi G (2018) Scarff: a scalable framework for streaming credit card fraud detection with spark. Inf Fus 41:182–194

  • Carcillo F, Le Borgne Y-A, Caelen O, Kessaci Y, Oblé F, Bontempi G (2021) Combining unsupervised and supervised learning in credit card fraud detection. Inf Sci 557:317–331

  • Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794

  • Dal Pozzolo A, Caelen O, Le Borgne Y-A, Waterschoot S, Bontempi G (2014) Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst Appl 41(10):4915–4928

  • Efanov D, Roschin P (2018) The all-pervasiveness of the blockchain technology. Procedia Comput Sci 123:116–121

  • Falcão F, Zoppi T, Silva CBV, Santos A, Fonseca B, Ceccarelli A, Bondavalli A (2019) Quantitative comparison of unsupervised anomaly detection algorithms for intrusion detection. In: Proceedings of the 34th ACM/SIGAPP symposium on applied computing, pp 318–327

  • Fang F, Ventre C, Basios M, Kanthan L, Martinez-Rego D, Wu F, Li L (2022) Cryptocurrency trading: a comprehensive survey. Financ Innov 8(1):1–59

  • Farrugia S, Ellul J, Azzopardi G (2020) Detection of illicit accounts over the Ethereum blockchain. Expert Syst Appl 150:113318

  • Goldstein M, Dengel A (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. In: KI-2012: poster and demo track, pp 59–63

  • Hariri S, Kind MC, Brunner RJ (2019) Extended isolation forest. IEEE Trans Knowl Data Eng 33(4):1479–1489

  • Harvey CR, Ramachandran A, Santoro J (2021) DeFi and the future of finance. Wiley

  • He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers. Pattern Recogn Lett 24(9–10):1641–1650

  • Hilas CS, Mastorocostas PA (2008) An application of supervised and unsupervised learning approaches to telecommunications fraud detection. Knowl-Based Syst 21(7):721–726

  • Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pp 448–456

  • Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y (2017) Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems, vol 30

  • Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv preprint http://arxiv.org/abs/1312.6114

  • Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint http://arxiv.org/abs/1609.02907

  • Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, pp 157–166

  • Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data (TKDD) 6(1):1–39

  • Patel V, Pan L, Rajasegarar S (2020) Graph deep learning based anomaly detection in ethereum blockchain network. In: International conference on network and system security, pp 132–148

  • Pontes FJ, Amorim G, Balestrassi PP, Paiva A, Ferreira JR (2016) Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing 186:22–34

  • Puggini L, McLoone S (2018) An enhanced variable selection and Isolation Forest based methodology for anomaly detection with OES data. Eng Appl Artif Intell 67:126–135

  • Qin K, Zhou L, Livshits B, Gervais A (2021) Attacking the defi ecosystem with flash loans for fun and profit. In: International conference on financial cryptography and data security, pp 3–32

  • Rovetta S, Suchacka G, Masulli F (2020) Bot recognition in a Web store: an approach based on unsupervised learning. J Netw Comput Appl 157:102577

  • Ruff L, Vandermeulen R, Goernitz N, Deecke L, Siddiqui SA, Binder A, Müller E, Kloft M (2018) Deep one-class classification. In: International conference on machine learning, pp 4393–4402

  • Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471

  • Sebastião H, Godinho P (2021) Forecasting and trading cryptocurrencies with machine learning under changing market conditions. Financ Innov 7(1):1–30

  • Shen J, Zhou J, Xie Y, Yu S, Xuan Q (2021) Identity inference on blockchain using graph neural network. In: International conference on blockchain and trustworthy systems, pp 3–17

  • Syarif I, Prugel-Bennett A, Wills G (2016) SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA (telecommun Comput Electron Control) 14(4):1502–1509

  • Thabtah F, Hammoud S, Kamalov F, Gonsalves A (2020) Data imbalance in classification: experimental evaluation. Inf Sci 513:429–441

  • Xu JJ (2016) Are blockchains immune to all malicious attacks? Financ Innov 2(1):1–9

  • Xu M, Chen X, Kou G (2019) A systematic review of blockchain. Financ Innov 5(1):1–14

  • Yu S, Jin J, Xie Y, Shen J, Xuan Q (2021) Ponzi scheme detection in ethereum transaction network. In: International conference on blockchain and trustworthy systems, pp 175–186

Acknowledgements

We are grateful to Dr. Chao Liu and Dr. Baoqiang Zhan for helpful discussions.

Funding

This work was supported by the National Natural Science Foundation of China (72171059, 71771041, 72121001), the Fundamental Research Funds for the Central Universities (FRFCU5710000220) and the Natural Science Foundation of Heilongjiang Province, China (No. YQ2020G003).

Author information

Contributions

ZF: Conceptualization, methodology, visualization, software, validation, data curation, and writing (original draft). YL: Writing (review and editing), conceptualization, supervision, and validation. XM: Writing (review and editing), visualization, and data curation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yongli Li.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Common concepts on account-based blockchain

In this appendix, the common concepts of account-based blockchains that are relevant to the analysis of cyber attacks are introduced as follows.

Account. There are two types of accounts, as described in the Ethereum and BSC whitepapers: the externally owned account (EOA), which is controlled by a private key, and the smart contract account, which is controlled by its code. An EOA is made up of a cryptographic key pair: a public key and a private key. The two keys are analogous to an online bank account and its password, and losing the private key is equivalent to losing the funds of the corresponding account. A smart contract account holds code written in a programming language such as Solidity, and all smart contracts are executed on the blockchain. Applications can call smart contract functions, change their state, and initiate transactions. Both account types can receive, hold, and send crypto assets, and can interact with deployed smart contracts.

Transaction. A transaction on an account-based blockchain is an action initiated by an externally owned account, that is, by an account managed by a human rather than by a contract. It is a signed data message sent from an externally owned account to another account on the blockchain; in a simple transfer, for example, the recipient ends up with more crypto assets and the sender with less. The message contains the sender, the recipient, the amount of cryptocurrency to be transferred, and the transaction fee the sender is willing to pay. An internal transfer, by contrast, is the consequence of smart contract logic triggered by an external transaction sent from an EOA to a smart contract. Executing a transaction with a smart contract may therefore result in further transactions, depending on the contract's code.

Transaction Fee. The transaction fee, also known as the gas fee, is the fee paid to the nodes (miners) for executing a transaction. When a user transfers funds on an account-based blockchain, a miner must pack the transaction and put it on the blockchain to complete it. In this process the nodes consume computing resources, for which the miner should be compensated. The amount of gas consumed depends only on the complexity of the transaction: the more complicated the transaction, the more gas it consumes. The gas price, however, can be set by the user, and it affects the transaction speed. Because miners give priority to transactions with high gas prices, a transaction whose gas price is set too low will be processed more slowly.
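The relation between gas, gas price, and ordering can be illustrated with a minimal Python sketch. It assumes the simplified model in which the fee equals the gas consumed multiplied by the user-chosen gas price and miners simply pick the highest gas price first; the class `PendingTx`, the function `order_for_inclusion`, and the sample values are hypothetical and serve illustration only.

```python
from dataclasses import dataclass

GWEI = 10**9  # 1 gwei = 10^9 wei


@dataclass
class PendingTx:
    sender: str
    gas_used: int        # gas units; grows with the complexity of the transaction
    gas_price_gwei: int  # price per gas unit, chosen by the sender

    def fee_wei(self) -> int:
        # Simplified fee model (assumption): fee = gas consumed * gas price.
        return self.gas_used * self.gas_price_gwei * GWEI


def order_for_inclusion(pool):
    # Simplifying assumption: miners take the highest gas price first.
    return sorted(pool, key=lambda tx: tx.gas_price_gwei, reverse=True)


pool = [
    PendingTx("alice", 21_000, 20),   # plain transfer
    PendingTx("bob", 150_000, 50),    # more complex contract call
    PendingTx("carol", 21_000, 5),    # low gas price, likely to wait
]

ordered = [tx.sender for tx in order_for_inclusion(pool)]
fees = {tx.sender: tx.fee_wei() for tx in pool}
print(ordered)
print(fees)
```

In this toy pool, the transaction with the lowest gas price is placed last even though it consumes as little gas as the first plain transfer.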

Block. A block is mined when it is added to the account-based network once consensus is reached, and a transaction is said to be mined when it is included in the blockchain in a new block. Each block therefore contains several transactions. To preserve the historical transaction records, every new block contains the unique identifier of its parent block; this is how all of the blocks are linked in the blockchain. As a result, all blocks on the account-based blockchain are strictly ordered, as are the transactions within each block. According to Etherscan and BscScan, the average block times, that is, the time it takes to mine a new block, are 2.5 s for BSC and 12 s for Ethereum.
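The parent-identifier linkage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 over a JSON encoding stands in for the real block hash, and the block fields and helper names (`block_hash`, `append_block`, `is_linked`) are hypothetical.

```python
import copy
import hashlib
import json


def block_hash(block: dict) -> str:
    # Stand-in identifier (assumption): SHA-256 of the JSON-encoded block.
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    # Every new block stores the identifier of its parent block.
    parent = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "number": len(chain),
        "parent_hash": parent,
        "transactions": transactions,  # ordered within the block
    })


def is_linked(chain: list) -> bool:
    # The chain is consistent if each block points at its actual parent.
    return all(
        chain[i]["parent_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["tx_a", "tx_b"])
append_block(chain, ["tx_c"])
append_block(chain, ["tx_d", "tx_e"])
intact = is_linked(chain)

# Changing a historical block breaks the link held by its child.
tampered = copy.deepcopy(chain)
tampered[1]["transactions"] = ["tx_forged"]
still_linked = is_linked(tampered)
print(intact, still_linked)
```

Because each block commits to its parent, altering any historical transaction invalidates the links of all later blocks, which is why the transaction history is strictly ordered.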

Cryptocurrency. Ether (ETH) and BNB are the digital fuel for Ethereum and BSC, respectively, much as gasoline is the fuel for cars. For instance, the transaction fee pays for the computing power required to execute a transaction, and it is paid in ETH or BNB. Alongside ETH and BNB, ERC-20 and BEP-20 tokens are the most commonly used tokens on the Ethereum and BSC networks; they are implemented by smart contracts. According to Etherscan and BscScan, there are more than 500,000 types of ERC-20 tokens on Ethereum and 2,568,483 types of BEP-20 tokens on BSC, with a market capitalization of over 100 billion dollars.

Flash loan. A flash loan is a decentralized application based on smart contracts. Flash loan tools rely on the state-reverting feature of Ethereum and BSC to offer an uncollateralized lending service: users obtain an unsecured loan from lenders without intermediaries. The rule is that the borrower must pay back the loan before the transaction ends. Otherwise, the transaction is rejected and the smart contract reverts it, as if the loan had never happened on Ethereum or BSC.
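The repay-or-revert rule can be illustrated with a small Python sketch. It is not the code of any real lending protocol: the state dictionary, the function `run_flash_loan`, and the two borrower strategies are hypothetical, fees are ignored, and a deep copy of the state stands in for the blockchain's state-reverting feature.

```python
import copy


class Reverted(Exception):
    """Signals that the whole transaction must be rolled back."""


def run_flash_loan(state, amount, borrower_logic):
    snapshot = copy.deepcopy(state)      # state before the transaction
    try:
        if amount > state["pool"]:
            raise Reverted("insufficient liquidity")
        state["pool"] -= amount          # lend without collateral
        state["borrower"] += amount
        borrower_logic(state)            # arbitrary use of the borrowed funds
        if state["borrower"] < amount:   # must repay before the transaction ends
            raise Reverted("loan not repaid")
        state["borrower"] -= amount
        state["pool"] += amount
        return True
    except Reverted:
        state.clear()                    # roll back: the loan never happened
        state.update(snapshot)
        return False


def profitable(s):
    s["borrower"] += 5                   # stands in for a gain made with the loan


def losing(s):
    s["borrower"] -= 10                  # funds lost elsewhere
    s["market"] += 10


state = {"pool": 1000, "borrower": 0, "market": 0}
ok = run_flash_loan(state, 500, profitable)
after_ok = dict(state)
reverted = run_flash_loan(state, 500, losing)
print(ok, after_ok, reverted, state)
```

In the second call the borrower cannot repay, so every change made during the transaction, including the transfer to `market`, is undone.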

Decentralized exchange (DEX). DEXs are blockchain-based applications that let users trade crypto assets without intermediaries. A DEX works entirely through automated algorithms implemented by a set of smart contracts. Unlike centralized exchanges, DEXs do not support exchange between crypto assets and fiat currency. DEXs also do not hold users' crypto assets; instead, users keep all their assets in their own wallets at all times.

Appendix 2: Real-world examples of cyber-attack transaction

In this section, two real-world examples of cyber-attack transactions, a re-entrancy attack and a Flash loan attack, are elaborated below.

Real-world example of Smart contract exploit. Taking the re-entrancy attack as an example, its brief process is shown in Fig. 14, where DAO, the victim contract, is marked in blue and the malicious proxy contract is marked in gray. The detailed process of the re-entrancy attack contains the following steps:

  • Step 1. The malicious contract calls the withdrawBalance function of DAO, attempting to withdraw a certain amount A of ETH from an account containing a larger amount B of ETH.

  • Step 2. The withdrawBalance function of DAO checks that the withdrawal is valid, i.e., that B > A.

  • Step 3. The withdrawBalance function of DAO transfers the requested A ETH to the malicious contract.

  • Step 4. This transfer triggers the Fallback function of the malicious contract, which calls the withdrawBalance function of DAO again, requesting another withdrawal of A ETH.

  • Step 5. The withdrawBalance function of DAO checks that the withdrawal is valid; it is, since the recorded account balance of the malicious contract is still B and B > A.

  • Step 6. The withdrawBalance function of DAO transfers the requested A ETH to the malicious contract.

  • Step 7. The Fallback function of the malicious contract returns without performing any action.

  • Step 8. The withdrawBalance function of DAO updates its state to reflect the withdrawal in Step 6, reducing the account balance of the malicious contract to (B–A) ETH.
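The steps above can be replayed in a toy Python simulation. It is a sketch of the general pay-before-update pattern, not the actual DAO contract code: the classes `VulnerableDAO` and `MaliciousContract`, the method names, and the amounts (B = 10, A = 6, plus 90 ETH deposited by other users) are assumptions chosen for illustration.

```python
class VulnerableDAO:
    """Toy victim contract: pays out before updating its ledger."""

    def __init__(self):
        self.ledger = {}  # recorded balance per account
        self.vault = 0    # ETH actually held by the contract

    def deposit(self, account, amount):
        self.ledger[account] = self.ledger.get(account, 0) + amount
        self.vault += amount

    def withdraw_balance(self, caller, amount):
        # Steps 2 and 5: the check uses the recorded, not yet updated, balance.
        if self.ledger.get(caller.name, 0) < amount or self.vault < amount:
            return False
        # Steps 3 and 6: transfer first, which hands control to the caller.
        self.vault -= amount
        caller.fallback(self, amount)
        # Step 8: the ledger is updated only after the external call returns.
        self.ledger[caller.name] -= amount
        return True


class MaliciousContract:
    def __init__(self, name, reentries=1):
        self.name = name
        self.received = 0
        self.reentries_left = reentries

    def fallback(self, dao, amount):
        self.received += amount
        if self.reentries_left > 0:
            # Step 4: call withdraw_balance again before the ledger changes.
            self.reentries_left -= 1
            dao.withdraw_balance(self, amount)
        # Step 7: otherwise return without performing any action.


dao = VulnerableDAO()
dao.deposit("honest_users", 90)
attacker = MaliciousContract("attacker", reentries=1)
dao.deposit(attacker.name, 10)       # B = 10
dao.withdraw_balance(attacker, 6)    # A = 6

print(attacker.received, dao.vault, dao.ledger["attacker"])
```

With a single re-entry the attacker receives 2A = 12 ETH against a recorded balance of B = 10 ETH, so part of the payout comes from other users' deposits. In this Python model the ledger entry simply becomes negative once the nested call unwinds; updating the ledger before the transfer would make the second check fail.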

Real-world example of Flash loan attack. As an example, the Flash loan attack that happened on 18 February 2020 is shown in Fig. 15. First, the hacker obtained a flash loan of 7500 ETH from the bZx protocol and split the total amount of ETH into three parts (3518, 900 and 3082). Second, the hacker converted 3518 ETH to sUSD on Synthetix; the synthetic USD token (sUSD) is enabled by the Synthetix protocol, and the sUSD were bought at the price of $1. Third, the hacker swapped 900 ETH for sUSD through Kyber in two batches. The first batch, 540 ETH, was sold on KyberSwap, and the second batch was sold on Kyber in 18 trades of 20 ETH each, effectively inflating the price of sUSD in Kyber up to $2. (In this way, the supply of sUSD in Kyber decreases while the supply of ETH increases, so the ratio of the ETH supply to the sUSD supply rises, which pushes the price of sUSD up.) After this operation, the hacker had finished all the preparatory work, namely accumulating sUSD and inflating its price in Kyber to the target level. Fourth, since bZx relies on Kyber for its real-time price feed, the hacker used the spiked sUSD/ETH price (which was higher than the actual price) to attack bZx by borrowing 6796 ETH against all of the collected sUSD. Finally, the hacker repaid the 7500 ETH flash loan to bZx, with a profit of 2378 ETH.
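The ETH amounts quoted above can be cross-checked with a few lines of Python. The check rests on one assumption that the text does not state explicitly: the 3082 ETH part of the flash loan was held unspent until repayment. Fees are ignored, and all variable names are illustrative.

```python
# All amounts in ETH, taken from the description above.
loan = 7500
to_synthetix, to_kyber, held = 3518, 900, 3082

# The three parts add up to the flash loan.
assert to_synthetix + to_kyber + held == loan

# The Kyber swaps: one batch of 540 ETH plus 18 trades of 20 ETH each.
kyber_batches = [540] + [20] * 18
assert sum(kyber_batches) == to_kyber

# ETH borrowed from bZx against the accumulated sUSD.
borrowed_from_bzx = 6796

# Assumption: the held part plus the borrowed ETH is used to repay the loan.
profit = held + borrowed_from_bzx - loan
print(profit)  # 2378
```

Under that assumption the figures are mutually consistent: 3082 + 6796 − 7500 = 2378 ETH, which matches the reported profit.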

Fig. 14
figure 14

An illustration of the re-entrancy attack

Fig. 15
figure 15

Flowchart of bZx Flash loan attack scheme

Appendix 3: Algorithms of the standard isolation forest and its extension

In this section, the algorithms of the standard isolation forest and its extension are introduced as follows (see Tables 12, 13, 14, 15).

Table 12 Algorithm of isolation tree (IF)
Table 13 Algorithm of constructing isolation tree (EIF)
Table 14 Algorithm of depth computation (EIF)
Table 15 Algorithm of constructing isolation forest (EIF)
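
For orientation, the following is a minimal Python sketch of the standard isolation forest in the spirit of Liu et al. (2012). It is not a transcription of Tables 12 to 15 and does not implement the weighted or extended variants; the function names, parameters, and toy data are assumptions. The extended isolation forest of Hariri et al. (2019) replaces the axis-parallel split used here with a random hyperplane.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}          # external node
    q = rng.randrange(len(points[0]))         # random attribute
    lo = min(p[q] for p in points)
    hi = max(p[q] for p in points)
    if lo == hi:                              # simplification: stop splitting
        return {"size": len(points)}
    split = rng.uniform(lo, hi)               # random axis-parallel split
    left = [p for p in points if p[q] < split]
    right = [p for p in points if p[q] >= split]
    return {"q": q, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(x, node, depth=0):
    if "size" in node:
        return depth + c(node["size"])        # adjust for the unbuilt subtree
    child = node["left"] if x[node["q"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def build_forest(data, n_trees=100, sample_size=256, seed=0):
    rng = random.Random(seed)
    psi = min(sample_size, len(data))         # subsample size per tree
    max_depth = math.ceil(math.log2(psi))     # height limit
    forest = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, psi


def anomaly_score(x, forest, psi):
    mean_depth = sum(path_length(x, t) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(psi))      # closer to 1 = more anomalous


# Toy data (assumption): a Gaussian cluster plus one distant point.
rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
forest, psi = build_forest(normal + [outlier], n_trees=100,
                           sample_size=128, seed=1)
score_outlier = anomaly_score(outlier, forest, psi)
score_inlier = anomaly_score((0.0, 0.0), forest, psi)
print(score_outlier, score_inlier)
```

Points that are isolated after only a few random splits have short average paths and therefore scores closer to 1, which is why the distant point scores higher than the point at the center of the cluster.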

Appendix 4: Analysis of feature importance based on LGBM

In this section, LGBM is also used to analyze feature importance, in order to verify whether the feature-importance results depend on the choice of SML method. The feature-importance results for LGBM are shown in Figs. 16, 17 and 18.

Fig. 16
figure 16

Feature importance of Smart contract exploit based on LGBM

Fig. 17
figure 17

Feature importance of Flash loan attack based on LGBM

Fig. 18
figure 18

Feature importance of Identity theft based on LGBM

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Feng, Z., Li, Y. & Ma, X. Blockchain-oriented approach for detecting cyber-attack transactions. Financ Innov 9, 81 (2023). https://doi.org/10.1186/s40854-023-00490-6

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40854-023-00490-6

Keywords