Comprehensive review of text-mining applications in finance

Text-mining technologies have substantially affected financial industries. As the data in every sector of finance have grown immensely, text mining has emerged as an important field of research in the domain of finance. Therefore, reviewing the recent literature on text-mining applications in finance can be useful for identifying areas for further research. This paper focuses on the text-mining literature related to financial forecasting, banking, and corporate finance. It also analyses the existing literature on text mining in financial applications and provides a summary of some recent studies. Finally, the paper briefly discusses various text-mining methods being applied in the financial domain, the challenges faced in these applications, and the future scope of text mining in finance.

can help us process information from top to bottom and analyse entire documents as well as individual words (Pandya et al. 2019;Parekh et al. 2020).
Human-generated 'natural' data in the form of text, audio, video, and so on are rapidly increasing (Shah et al. 2020a, b). This has led to a rise in interest in methods and tools that can help extract useful information automatically from enormous amounts of unstructured data (Jaseena and David 2014; David and Balakrishnan 2011). One crucial method is text mining, which draws on techniques such as data mining, machine learning, and computational linguistics, among others. Text mining aims to extract information and patterns from textual data (Talib et al. 2016b; Fan et al. 2006). The simplest approach to text mining is manual, in which a human reads the text and searches for useful information in it. A more practical approach is automatic, which mines text efficiently in terms of speed and cost (Herranz et al. 2018; Sukhadia et al. 2020; Pathan et al. 2020).
According to the India Brand Equity Foundation (IBEF 2019), the Indian financial industry alone had US $340.48 billion in assets under management as of February 2019. This value only provides us with a limited indication of the actual size and reach of the global finance industry. Technology has paved the way for digitalisation in this rapidly growing behemoth. 'FinTech' is a developing domain in the finance industry, which has been defined as a union of finance and information technology (Zavolokina et al. 2016). Marrara et al. (2019) examined how FinTech relates to Italian small and medium-sized enterprises (SMEs), where FinTech has witnessed huge growth in terms of investment and development, and how it has proved fruitful for the SME market in a short amount of time. FinTech has popularised the use of data in the financial industry. This data is substantially in the form of structured or unstructured text. Therefore, traditionally and technically, textual data can be regarded as always having been a prevailing and essential element in the finance sector.
Unstructured textual data have been increasing rapidly in the finance industry (Lewis and Young 2019). This is where text mining has a lot of potential. Kumar and Ravi (2016) explored various applications in the financial domain in which text mining could play a significant role. They concluded that it had numerous applications in this industry, such as various kinds of predictions, customer relationship management, and cybersecurity issues, among others. Many novel methods have been proposed for analysing financial results in recent years, and artificial intelligence has made it possible to analyse and even predict financial outcomes based on historical data.
Finance has been an important force in human life since the earliest civilisations. It is noteworthy that from barter systems to cryptocurrencies, finance has always been associated with data, such as transactions, accounts, prices, and reports. Manual approaches to processing data have been reduced in use and significance over time. Researchers and practitioners have come to prefer digitised and automated approaches for studying and analysing financial data. Financial data contain a significant amount of latent information. If the latent information were to be extracted manually from a huge corpus of data, it might take years. Advancements in text mining have made it possible to efficiently examine textual data pertaining to finance. Bach et al. (2019) published a literature review on text mining for big-data analysis in finance. They structured the review in terms of three critical questions. These questions pertained to the intellectual core of finance, the text-mining techniques used in finance, and the data sources of financial sectors. Kumar and Ravi (2016) discussed the model presented by Vu et al. (2012) that implemented text mining on Twitter messages to perform sentiment analysis for the prediction of stock prices. They also mentioned the model of Lavrenko et al. (2000), which could classify news stories in a way that could help identify which of them affected trends in finance and to what degree. We will further discuss text-mining applications in finance in subsequent sections.
Apart from finance, we present a brief overview of text mining in other industries. On social media, people generate text data in the form of posts, blogs, and web forum activity, among many others (Agichtein et al. 2008). Despite the vast quantity of data available, the relatively low proportion of content of significant quality is still a problem (Kinsella et al. 2011), which is an issue that can be solved by text mining (Salloum et al. 2017). In the biomedical field too, there is a need for effective text-mining and classification methods (Krallinger et al. 2011). On e-commerce websites, text mining is used to prevent the repetition of information to the same audience (Da-sheng et al. 2009) and improve product listings through reviews (Kang and Park 2016; Ur-Rahman and Harding 2012). In healthcare, researchers have worked on applications such as the identification of healthcare topics directly from personal messages over the Internet (Lu 2013), classification of online data (Srivastava et al. 2018), and analysis of patient feedback (James et al. 2017). The agriculture industry has also used text mining in, for example, the classification of agricultural regulations (Espejo-Garcia et al. 2018), ontology-based agricultural text clustering, and analysis of agricultural network public opinions (Lee 2019). Text mining has also been utilised in the detection of malicious web URLs which evolve over time and have complex features (Li et al. 2020a, b, c). This paper discusses the use of text mining in the financial domain in detail, taking into consideration three major areas of application: financial forecasting, banking, and corporate finance. We also discuss the widely used methodologies and techniques for text mining in finance, the challenges faced by researchers, and the future scope for text-mining methods in finance.

Overview of text-mining methodologies
Text mining is a process through which the user derives high-quality information from a given piece of text. Text mining has seen a significant increase in demand over the last few years. Coupled with big data analytics, the field of text mining is evolving continuously. Finance is one major sector that can benefit from these techniques; the analysis of large volumes of financial data is both a need and an advantage for corporates, government, and the general public. This section discusses some important and widely used techniques in the analysis of textual data in the context of finance.

Sentiment analysis (SA)
One of the most important techniques in the field is SA. It has applications in numerous sectors. This technique extracts the underlying opinions within textual data and is therefore also referred to as opinion mining (Akaichi et al. 2013). It is of prime use in a number of domains, such as e-commerce platforms, blogs, online social media, and microblogs. The motives behind sentiment analysis can be broadly divided into emotion recognition and polarity detection. Emotion detection is focused on the extraction of a set of emotion labels, and polarity detection is more of a classifier-oriented approach with discrete outputs (e.g., positive and negative) (Cambria 2016).
There are two main approaches for SA, namely lexicon-based (dictionary-based) and machine learning (ML). The latter is further classified into supervised and unsupervised learning approaches (Pradhan et al. 2016). Lexicon-based approaches use SentiWordNet word maps, whereas ML considers SA as a classification problem and uses established techniques for it. In lexicon-based approaches, the overall score for sentiment is calculated by dividing the sentiment frequency by the sum of positive and negative sentiments. In ML approaches, the major techniques that are used are the Naïve Bayes (NB) classifier and support vector machines (SVMs), which use labelled data for classification. SA using ML has an edge over the lexicon approach, as it does not require word dictionaries, which are highly costly. However, ML requires domain-specific datasets, which can be considered a limitation (Al-Natour and Turetken 2020). After data preprocessing, feature selection is performed as per the requirement, following which one obtains the final results after the analysis of the given data as per the adopted approach (Hassonah et al. 2019).
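The lexicon-based score described above can be sketched as follows. This is a minimal illustration rather than the SentiWordNet procedure itself: the word lists POSITIVE and NEGATIVE are hypothetical stand-ins for a real lexicon, and the score is taken here as the share of positive hits among all sentiment-bearing words, which is one assumed reading of the formula.

```python
import re

# Hypothetical mini-lexicon; a real system would load SentiWordNet or a
# finance-specific dictionary instead.
POSITIVE = {"gain", "growth", "profit", "strong", "upgrade"}
NEGATIVE = {"loss", "decline", "risk", "weak", "downgrade"}


def lexicon_score(text):
    """Return the share of positive words among all sentiment words.

    The result lies in [0, 1]; None is returned when the text contains
    no word from either list, because the ratio is then undefined.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return None
    return pos / (pos + neg)


headline = "Strong profit growth offsets currency risk"
print(lexicon_score(headline))
```

A score above 0.5 would be read as predominantly positive under this convention. Negation, word sense, and domain-specific meanings are ignored in this sketch.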
In the financial domain, stock market prediction is one of the applications in which SA has been used to predict future stock market trends and prices from the analysis of financial news articles. Joshi et al. (2016) compared three ML algorithms and observed that random forest (RF) and SVMs performed better than NB. Renault (2019) used StockTwits (a platform where people share ideas about the stock market) as a data source and applied five algorithms, namely NB, a maximum entropy method, a linear SVM, an RF, and a multilayer perceptron and concluded that the maximum entropy and linear SVM methods gave the best results. Over the years, researchers have combined deep learning methods with traditional machine learning techniques (e.g., construction of sentiment lexicon), thus obtaining more promising results (Yang et al. 2020).

Information extraction
Information extraction (IE) is used to extract predefined data types from a text document. IE systems mainly aim for object identification by extracting relevant information from the fragments and then putting all the extracted pieces in a framework. Post extraction, DiscoTEX (Discovery from Text EXtraction) is one of the core methods used to convert the structured data into meaningful data to discover knowledge from it (Salloum et al. 2018).
In finance, named-entity recognition (NER) is used for extracting predefined types of data from a document. In banking, transaction order documents of customers may come via fax, which results in very diverse documents because of the lack of a fixed template and creates the need for proper feature extraction to obtain a structured document (Emekligil et al. 2016).
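As a simplified illustration of turning a free-form transaction order into a structured record, the sketch below uses regular expressions rather than the trained NER models used in the cited work. The field names, patterns, and sample text are all hypothetical.

```python
import re

# Hypothetical patterns for three fields of a money-transfer order.
PATTERNS = {
    "amount": re.compile(r"\b(\d[\d,]*(?:\.\d+)?)\s*(USD|EUR|TRY)\b"),
    "iban": re.compile(r"\b([A-Z]{2}\d{2}[A-Z0-9]{10,30})\b"),
    "date": re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"),
}


def extract_fields(text):
    """Return a dict with the first match for each field, or None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            record[name] = None
        elif name == "amount":
            # Keep the number and the currency code together.
            record[name] = (match.group(1).replace(",", ""), match.group(2))
        else:
            record[name] = match.group(1)
    return record


order = "Please transfer 12,500.00 EUR to DE44500105175407324931 on 03/11/2016."
print(extract_fields(order))
```

Fixed patterns of this kind break down quickly on documents without a template, which is the motivation for learned feature extraction in the banking setting described above.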

Natural language processing (NLP)
NLP is a part of the artificial intelligence domain and attempts to help transform imprecise and ambiguous messages into unambiguous and precise messages. In the financial sector, it has been used to assess a firm's current and future performance, domain standards, and regulations. It is often used to mine documents to obtain insights for developing conclusions (Fisher et al. 2016). NLP can help perform various analyses, such as NER, which further helps in identifying the relationships and other information to identify the key concept. However, NLP lacks a dictionary list for all the named entities used for identification (Talib et al. 2016a). As NLP is a pragmatic research approach to analyse the huge amount of available data, Xing et al. (2017) applied it to bridge the gap between NLP and financial forecasting by considering topics that would interest both the research fields. Figure 1 provides an intuitive grasp of natural language-based financial forecasting (NLFF). Chen et al. (2020) discussed the role of NLP in FinTech in the past, present, and future. They reviewed three aspects, namely know your customer (KYC), know your product (KYP), and satisfy your customer (SYC). In KYC, a lot of textual data is generated in the process of acquiring information about customers (corporate sector and retail). With respect to KYP, salespersons are required to know all the attributes of their product, which again requires data in order to know the prospects, risks, and opportunities of the product. In SYC, salespersons/traders and researchers try to make the financial activities more efficient to satisfy the customers in the business-to-customer as well as customer-to-customer business models. Herranz et al. (2018) discussed the role of NLP in teaching finance and reported that it enhanced the transfer of knowledge within an environment overloaded with information.

Text classification
Text classification is a four-step process comprising feature extraction, dimension reduction, classifier selection, and evaluation. Feature extraction can be done with common techniques such as term frequency and Word2Vec; then, dimensionality reduction is performed using techniques such as principal component analysis and linear discriminant analysis. Choosing a classifier is an important step, and it has been observed that deep learning approaches have surpassed the results of other machine learning algorithms. The evaluation step helps in understanding the performance of the model; it is conducted using various measures, such as the Matthews correlation coefficient (MCC), area under the ROC curve (AUC), and accuracy. Accuracy is the simplest of these to evaluate. Figure 2 shows an overview of the text classification process (Kowsari et al. 2019). Brindha et al. (2016) compared the performance of various text classification techniques, namely NB, k-nearest neighbour (KNN), SVM, decision tree, and regression, and found that based on the precision, recall, and F1 measures, SVM provided better results than the others.
Gupta et al. Financ Innov (2020) 6:39
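The evaluation measures named in the text classification paragraph can be computed directly from the confusion counts of a binary classifier. The sketch below covers accuracy, F1, and MCC; AUC is left out because it needs ranked scores rather than hard labels. The label vectors are made up for illustration.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 1 = "good news", 0 = "bad news".
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), f1_score(*counts), mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one class is much rarer than the other.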

Deep learning
Deep learning is a part of machine learning, which trains a data model to make predictions about new data. Deep learning has a layered architecture, where the input data goes into the lowest level and the output data is generated at the highest level. The input is transformed at the various middle levels by applying algorithms to extract features, transform features into factors, and then input the factors into the deeper layer again to obtain transformed features (Heaton et al. 2016). Widiastuti (2018) focused on the input data, as it plays an important role in the performance of any algorithm. The author concluded that modification of the network architecture with deep learning algorithms can markedly affect performance and provide good results.
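The layered structure described above, in which each level transforms the output of the level below it, can be sketched with a two-layer forward pass. The weights and biases are hypothetical and untrained; learning them from data, which is the core of deep learning, is not shown.

```python
def relu(x):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in order, lowest level first."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu),
    ([[1.0, 0.5]], [0.5], identity),
]
print(forward([1.0, 2.0], layers))
```

Each tuple in `layers` plays the role of one level; adding more tuples deepens the network without changing `forward`.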
In finance, deep learning solves the problem of complexity and ambiguity of natural language. Kraus and Feuerriegel (2017) used a corpus of 13,135 German ad hoc announcements in English to predict stock market movements and concluded that deep learning was better than the traditional bag-of-words approach. The results also showed that the long short-term memory models outperformed all the existing machine learning algorithms when transfer learning was performed to pre-train word embeddings.

Review of text-mining applications in finance
As mentioned in earlier sections, this paper focuses on the applications of text mining in three sectors of finance, namely financial predictions, banking, and corporate finance. In the subsections, we review various studies. Some literature has been summarised in detail, and in the end, a tabular summary of some more studies is included. Figure 3 shows a summarised link between the text-mining techniques and their corresponding applications in the respective domains. Although the following subsections discuss the studies pertaining to each sector individually, there has also been research on techniques that can be applied to multiple financial sectors. One such system was proposed by Li et al. (2020a), which was a classifier based on adaptive hyper-spheres. It could be helpful in tasks such as credit scoring, stock price prediction, and anti-fraud analysis.

Prediction of financial trends
Using the ever-expanding pool of textual data to improve the dynamics of the market has long been a practice in the financial industry. The increasing volume of press releases, financial data, and related news articles has been motivating continued and sophisticated analysis, dating back to the 1980s, in order to derive a competitive advantage (Xing et al. 2017). Abundant data investigated with text mining can deliver an advantage in a variety of scenarios. As per Tkáč and Verner (2016) and Schneider and Gupta (2016), among the many ideas covered in financial forecasting, from credit scoring to inflation rate prediction, a large proportion of focus is on stock market and forex prediction. Wen et al. (2019) proposed an idea regarding how retail investor attention can be used for evaluation of the stock price crash risk.

Fig. 3
An overview of how text mining can be used in the financial domain. This paper follows a systematic approach for reviewing text-mining applications, as depicted by the flowchart in the figure. The two independent entities, namely finance and text mining, are linked together to show the possible applications of various text-mining techniques in various financial domains.

Wu et al. (2012) proposed a model that combined the features of technical analysis of stocks with sentiment analysis, as stock prices also depend on the decisions of investors who read stock news articles. They focused on obtaining the overall sentiment behind each news article and assigned it the respective sentiment based on the weight it carried. Next, using different indicators, such as price, direction, and volume, technical analysis was performed and the learning prediction model was generated. The model was used to predict Taiwan's stock market, and the results proved to be more promising than those of models that employed only one of the two. This indicates an efficient system that can be integrated with even better features in the future.
Al-Rubaiee et al. (2015) analysed the relationship between Saudi Twitter posts and the country's stock market (Tadawul). They used a number of algorithms, such as SVM, KNN, and NB, to classify Arabic text for the purpose of stock trading. Their major focus was on properly preprocessing data before the analysis. By comparing the results, they found that SVM had the best recall, and KNN had the best precision. The one-to-one model that they built showcased the positive and negative sentiments as well as the closing values of the Tadawul All Share Index (TASI). The relationship between a rise in the TASI and an increase in positive sentiments was found to be stronger than that between a decline in the index and negative sentiments. The researchers mentioned that in future work they would incorporate the Saudi stock market closing values and sentiment features on tweets to explore the patterns between the Saudi stock index and public opinion on Twitter.
Vijayan and Potey (2016) proposed a model based on recent news headlines that predicted the forex trends based on the given market situations. The information about the past forex currency pair trends was analysed along with the news headlines corresponding to that timeline, and it was assumed that the market would behave in the future as it had done in the past. The researchers focused on the elimination of redundancy, and their model focused on news headlines rather than on entire articles. Multilayer dimension reduction algorithms were used for text mining, the Synchronous Targeted Label Prediction algorithm was used for optimal feature reduction, and the J48 algorithm was used for the generation of decision trees. The main focus was on fundamental analysis that targeted unstructured textual data in addition to technical analysis to make predictions based on historical data. The J48 algorithm resulted in an improvement in the accuracy and performance of the overall system, better efficiency, and less runtime. In fact, the researchers reported that the algorithm could be applied to diverse subjects, such as movie reviews.
Nassirtoussi et al. (2015) proposed an approach for forex prediction wherein the major focus was on strengthening text-mining aspects that had not been focused upon in previous studies. Dimensionality reduction, semantic integration, and sentiment analysis enabled efficient results. The system predicted the directional movement of a currency pair based on news headlines in the sector from a few hours before. Again, headlines were taken into consideration for the analysis, and a multilayer algorithm was used to address semantics, sentiments, and dimensionality reduction. The model achieved an accuracy of up to 83%. The strong results obtained in that study indicate that the studied relationships exist. The models can be applied to other contexts as well.
Nikfarjam et al. (2010) discussed the components that constitute a forecasting model in this sector and the prototypes that had been recently introduced. The main components were compared with each other. Feature selection and feature weighting, used either individually or in combination, were applied to select pieces of news and assign weights to them. Next, feature weighting was used to calculate the weights for the given terms. The feature weighting methodology was based on the study by Fung et al. (2002), who had assigned more weights to enhance the term frequency-inverse document frequency (TF-IDF) weighting. For text classification, most researchers have applied SVMs to classify the input text into either good or bad news. Some researchers have used Bayesian classifiers, and some others have used a combination of binary classifiers to achieve the final classification decision. Many authors have focused on news features but not equally addressed the available market data. The focus of most studies has been on the analysis of news and indicator values separately, which has proved to be less efficient. The combination of both market news and the status of market trends at the same time is expected to provide stronger results.
Gupta et al. (2019) proposed a combination of two models: the primary model obtained the dataset for prediction, preprocessed the dataset using logistic regression to remove redundancy, and employed a genetic algorithm, KNN, and support vector regression (SVR). In a comparison of all three, KNN was the basis for their predictions, with an efficiency of more than 50%. The genetic algorithm was used next in search of better accuracy. In an attempt to further support the genetic algorithm, SVR was used, which gave the opening price for any day in the future. For sentiment analysis, Twitter was used, as it was considered the most popular source for related news. The model divided the tweets into two categories, and the rise or fall of the market was predicted taking into consideration the huge pool of keywords. In the end, the model had an accuracy of about 70-75%, which seems reasonable for a dynamic environment.
Nguyen et al. (2015) focused on sentiment analysis of social media. They obtained the sentiments behind specific topics of discussion of the company on social media and achieved promising results in comparison with the accuracy of stocks in the preceding year. Sentiments annotated by humans on social media with regard to stock prediction were analysed, and the percentage of desired sentiments was calculated for each class. For the remaining messages without explicit sentiments, a classification model was trained using the annotated sentiments on the dataset. For both of these tasks, an SVM was used as the classification model. In another study, after lemmatisation by CoreNLP, latent Dirichlet allocation (LDA) (Blei et al. 2003) was used as the generative probabilistic model. The authors also implemented the JST model (Lin and He 2009) and Aspect-based Sentiment Analysis for analysing topic sentiments for stock prediction. The study's limitation was that the topics and models were selected beforehand. The accuracy was around 54%; however, the overall prediction in the model passed only if the stock went up or down. As the model just focused on sentiments and historical prices, the authors intended to add more factors to build a more accurate model.
Li et al. (2009) approached financial risk analysis through the available financial data on sentiments and used machine learning and sentiment analysis. The uniqueness of their study was the volume of data and the information sentiments. A generalised autoregressive conditional heteroskedasticity (GARCH)-based artificial neural network and a GARCH-based SVM were used. A special training process, named the 'dynamic training technique', was applied because the data was non-stationary and noisy and could have resulted in overfitting. For analysing news, the semantic orientation-based approach was adopted, mainly because of the number of articles that were analysed in the study. The future work on this model was expected to include more input data and better sentiment analysis algorithms to obtain better results.
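For readers unfamiliar with GARCH, the conditional-variance recursion of a GARCH(1,1) model can be sketched as below. This is only the volatility component in isolation: the parameter values omega, alpha, and beta are illustrative assumptions rather than fitted estimates, the return series is made up, and the neural-network and SVM stages of the hybrid models discussed above are not reproduced.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances of a zero-mean GARCH(1,1) process.

    sigma2[t] = omega + alpha * returns[t-1]**2 + beta * sigma2[t-1]
    The recursion starts from the unconditional variance
    omega / (1 - alpha - beta), which requires alpha + beta < 1.
    """
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha, beta >= 0, alpha + beta < 1")
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2


# Made-up daily returns (fractions); the shock on day 3 raises the
# variance for day 4, which then decays gradually.
returns = [0.001, -0.002, 0.030, 0.001, -0.001]
variances = garch11_variance(returns, omega=1e-6, alpha=0.1, beta=0.85)
for day, (r, v) in enumerate(zip(returns, variances), start=1):
    print(day, r, round(v ** 0.5, 5))
```

In a hybrid setup of the kind described above, a volatility series like this would typically be one input among others (for example, news sentiment) to the downstream learner; how the cited study wires these together is not specified here.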
The use of sentiment analysis as a tool to facilitate investment and risk decisions by stock investors was demonstrated by Wu et al. (2014). Sina Finance, an experimental platform, was the basis for the collection of financial data for this model. The method incorporated machine learning based on SVM and GARCH with sentiment analysis. At the specific opening and closing times for each day, the GARCH-based SVM was used to identify the relations between the obtained information's sentiment and stock price volatility. This model showed better results when predicting individual stocks rather than at the industry level. The machine learning approach was about 6% more accurate than the lexicon-based semantic approach, and it performed better with bigger datasets. The model performed better on datasets relating to small companies, as small companies were observed to be more sensitive to online reviews. The authors mentioned their future scope as expanding their dataset and attempting to create a more efficient sentiment calculation algorithm to increase the overall accuracy, similar to the one made by Li et al. (2009).
A slightly different approach was used by Ahmad et al. (2006), who focused on sentiment analysis of financial news streams in multiple languages. Three widely spoken languages, namely Arabic, Chinese, and English, were used for replication for automatic sentiment analysis. The authors adopted a local grammar approach using a local archive of the three languages. A statistical criterion in the training collection of texts helped in the identification of keywords. The most widely available corpus was for English, followed by Chinese and Arabic. Based on the frequencies of various words, the most widely utilised words were ranked and selected. Through manual evaluation, the accuracy of extraction ranged from 60 to 75%. A more robust evaluation of this model would be necessary for use in real-time markets, with the inclusion of more than one news vendor at a time.
Over the years, deep learning has become acknowledged as a useful machine learning technique that enables state-of-the-art results. It uses multiple layers to create representations and features from the input data. Text-mining analysis has also continuously evolved. The early basic model used lexicon-based analysis to account for a particular entity (sentiment analysis). Considering the complexity of language, a complete understanding of what any piece of text aims to convey requires a more complex analysis to identify and target relevant entities and related aspects (Dohaiha et al. 2018). The most important aspect is the relationship between the words in the text, and how the same is dominant in determining the meaning of the content. Several language elements, such as implications (Ray and Chakrabarti 2019) and sarcasm, require high-level methods for handling. This problem requires the use of deep learning models that can help completely understand a given piece of text. Deep learning may incorporate time series analysis and aspect-based sentiment analysis, which enhances data mining, feature selection, and fast information retrieval. Deep learning models learn features during the process of learning. They create abstract representations of the given data and therefore are unchanged with local changes to the input data (Sohangir et al. 2018). Word embeddings target words that are similar in context. By the measurement of similarities between words (e.g., cosine similarity in the case of vectors), one can employ word embeddings in the initial data preprocessing layers for faster and more efficient NLP execution (Young et al. 2018).
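The cosine similarity mentioned above as a measure of closeness between word vectors can be sketched as follows. The three-dimensional embeddings are hypothetical values chosen by hand for illustration; trained embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: two related finance terms and one unrelated term.
embeddings = {
    "profit": [0.9, 0.1, 0.3],
    "earnings": [0.8, 0.2, 0.4],
    "lawsuit": [-0.5, 0.9, 0.1],
}

print(cosine_similarity(embeddings["profit"], embeddings["earnings"]))
print(cosine_similarity(embeddings["profit"], embeddings["lawsuit"]))
```

Values near 1 indicate vectors pointing in nearly the same direction, and the measure ignores vector length, which is why it is a common choice for comparing embeddings.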
The huge volume of streaming financial news and articles cannot be processed by humans for interpretation and application on a daily basis. In a number of uses, such as portfolio construction, forecasting a financial time series is essential. The application of deep learning (DL) techniques on such data for forecasting purposes is of interest to industry professionals. It has been reported that repeated patterns of price movements can be estimated using econometric and statistical models (Souma et al. 2019). Even though the market is dynamic, a combination of deep learning models and past market trends is very useful for accurate predictions. In a comparison of real trades with the generated market trades with the use of SA, Kordonis et al. (2016) found a considerable effect of sentiments on the predictions. Because of the promising results, the use of artificial intelligence and deep learning has attracted the interest of many researchers and practitioners seeking to improve forecasting.
With the use of deep learning, one has to perform little work by hand, while being able to harness a large amount of computation and data. DL techniques that use distributed representation are considered state-of-the-art methods for a large variety of NLP problems. We expect these models to improve and get better at handling unlabelled data through the development and use of approaches such as reinforcement learning.
Owing to the advancements in technology, there are several factors that can be used in models that aim to predict market movements. Not only the price models but also a number of different related models include macroeconomic variables (e.g., investment). Although macroeconomic indicators are important, they tend to be updated infrequently. Unlike such economic factors, public mood and sentiments (Xing et al. 2018a, b) are dynamic and can be instantaneously monitored. For instance, behavioural science researchers have found that the stock market is affected by the investors' psychology (Daniel et al. 2001). Depending on their mood states, investors make numerous decisions, a big proportion of which are risky. The impact of sentiment and attention measures on stock market volatility (Audrino et al. 2018) can be gauged through news articles, social media, and search engine results. The models that incorporate technical indicators of the market with sentiments obtained from the aforementioned sources outperform those that rely on only one of the two. In a study pertaining to optimal portfolio allocation, Malandri et al. (2018) used historical data of the New York Stock Exchange and combined it with sentiment data to get comparatively better returns for the portfolios taken under consideration.
Empirical studies have shown that current market prices reflect recently published news; this is consistent with the Efficient Market Hypothesis (Fama 1991). Rather than depending on existing information, price changes are markedly affected by new information or news. ML and DL methods have allowed data scientists to play a part in financial sector analysis and prediction (Picasso et al. 2019). There has been an increasing use of text-mining methods to make trading decisions (Wu et al. 2012). Different kinds of models, including neural networks, are used to obtain sentiment embeddings from news, tweets, and financial blogs. Mudinas et al. (2019) studied whether sentiments alone Granger-cause changes in stock prices; although this did not provide promising results, integrating the sentiments with prediction models gave better results. This is because sentiments alone cannot be determinant factors, but they can be used with prediction models to obtain better and more dynamic results.
As discussed above, a plethora of proposals and approaches in relation to financial forecasting have been studied, the two main applications of which have been stock prediction and forex. The main focus of these studies was on obtaining sentiments from news headlines and not from entire articles. Researchers have used a variety of text-mining approaches to integrate the abundant amount of useful information with financial patterns. Table 1 summarises some more research studies that have been conducted in recent years on the subject of text mining in financial predictions.
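The idea of combining technical indicators with text-derived sentiment, discussed in this section, can be illustrated with a minimal sketch. It is not the method of any study cited here: the word lists, the function names (`headline_sentiment`, `momentum`, `build_features`), and the sample prices and headlines are all hypothetical, and a real system would use a domain-specific lexicon or a trained sentiment model.

```python
from statistics import mean

# Hypothetical mini-lexicon for illustration; not taken from any cited study.
POSITIVE = {"gain", "beat", "surge", "upgrade", "profit"}
NEGATIVE = {"loss", "miss", "plunge", "downgrade", "fraud"}


def headline_sentiment(headline):
    """Return (positive - negative) / matched words, in [-1, 1]; 0.0 if no match."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in headline.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def momentum(prices, window):
    """Last price relative to the simple moving average of the last `window` prices."""
    sma = mean(prices[-window:])
    return prices[-1] / sma - 1.0


def build_features(prices, headlines, window=3):
    """Combine one technical indicator with the mean headline sentiment."""
    scores = [headline_sentiment(h) for h in headlines]
    return {
        "momentum": momentum(prices, window),
        "sentiment": mean(scores) if scores else 0.0,
    }


# Synthetic inputs, made up for this example.
features = build_features(
    prices=[100.0, 101.0, 103.0, 102.0, 105.0],
    headlines=[
        "Shares surge after profit beat",
        "Analysts downgrade on fraud probe",
    ],
)
print(features)
```

The resulting feature dictionary could be passed to any downstream prediction model; the sketch only shows how the two information sources end up side by side in one feature vector.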

Banking and related applications
Banking is one of the largest and fastest-growing industries in this era of globalisation. The industry is heading towards adopting the most efficient practices for each of its departments. The total lending in the financial year 2017-2018 increased from US $429.92 billion to $1347.18 billion at a CAGR of 10.94% (Ministry of Commerce and Industry, Government of India, 2019). This huge rise is promoting strong economic growth, increasing incomes, enhancing trouble-free access to bank credit, and increasing consumerism. In the midst of an IT revolution, competitive reasons have led to the rising importance and adoption of banking automation. IT enables the implementation of various techniques for risk controls and smooth flow of transactions over electronic mediums and supports financial product innovation and development. Gao and Ye (2007) proposed a framework for preventing money laundering with the help of the transaction histories of customers. They did this by identifying suspicious data from various textual reports from law enforcement agencies. They also mined unstructured databases and text documents for knowledge discovery in order to automatically extract the profiles of the entities that could be involved in money laundering. They employed SVM, decision trees, and Bayesian inference to develop a hierarchical structure of the suspicious reports and regression to identify hidden patterns. Bholat et al. (2015) analysed the utility of text mining in central banks (CB), as a wide range of data sources is required for evaluating monetary and financial stability and for achieving policy objectives. Therefore, text-mining techniques are more powerful than manual means. The authors elucidated two major approaches: the use of text as data for research purposes in CB, and the various text-mining techniques for this purpose. 
For the former, they suggested that textual data in the form of social narratives can be used by central banks as financial indicators for risk and uncertainty management by employing topic clustering on the narratives. The latter aspect involved preprocessing of data to de-duplicate it, convert it into text files, and reduce it into tokens by various tokenisation techniques. Thereafter, text-mining techniques, such as dictionary techniques, vector space models, latent semantic analysis, LDA, and the NB algorithm, were applied to the tokenised data. The authors concluded that, taken together, these can be a very useful addition to the efficient functioning of the CB.
Bach et al. (2019) stated that a huge amount of unstructured data from various sources has created a requirement for the extraction of keywords in the banking sector. They mentioned four different procedures for the extraction of keywords, which were obtained from the study by Bharti and Babu (2017). Bach et al. further discussed how keyword extraction can be implemented to extract related useful comments and documents and to compare banking institutions as well. They also reviewed some other text-mining techniques that can be utilised by banks. NER was used on large datasets for the extraction of entities such as persons, locations, and organisations. Sentiment analysis was done to analyse customer opinions, which is crucial for a bank's functioning. Topic extraction was found to be useful mainly in credit banking. Social network analysis, a graph theory-based methodology to study the social media user structure, provided an outlook on how customers are connected on social media and how impactful they are in sharing information with their network of interests. This social network analysis could then be coupled with text mining to identify the keywords that correspond to the customers' common interests.
Yap et al. (2011) discussed the issue faced by recreational clubs with respect to potential defaulters and non-defaulters. They proposed a credit scoring model that utilised text mining for estimating the financial obligations of credit applicants. A scorecard was built with the help of past performance reports of the borrowers, wherein different clubs used different criteria for evaluating the historic data. The data was split in a 70:30 ratio for training and validation, respectively. They used three different models, namely a credit scorecard model, a logistic regression model, and a decision tree model, with accuracy rates of 72.0%, 71.9%, and 71.2%, respectively. Although the model benefitted the club administration, it also had a few limitations, such as poor quality of the scorecard and biased samples used to evaluate new applicants, as the model was built on historic data.
Xiong et al. (2013) devised a model for personal bankruptcy prediction using sequence mining techniques. The sequence results showed good prediction ability. This model has potential value in many industries. For clustering categorical sequences, a model-based k-means algorithm was designed. A comparative study of three models, namely SVM, credit scoring, and the one proposed by them, found that the accuracies were 89.3%, 80.54%, and 94.07%, respectively. The sequence mining used in the proposed model outperformed the other two models. In terms of loss prediction, the KNN algorithm had the potential to identify bad accounts with promising predictive ability.
Bhattacharyya et al. (2011) explored the use of text mining in credit card fraud detection by evaluating two predictive models: one based on SVM, and the other based on a combination of random forest with logistic regression. They discussed various challenges and problems in the implementation of the models. They recommended that the models should always be kept updated to account for the growing malpractices.
The original dataset used in the study comprised more than 50 million real-time credit card transactions. The dataset was split into multiple datasets as per the requirements of different techniques. Because the data was imbalanced, performance was measured not solely by overall accuracy but also by sensitivity, specificity, and area under the curve. Although the random forest model showed the highest overall accuracy of 96.2%, the study provided some other noteworthy observations. The accuracy of each model varied according to the proportion of fraudulent cases, with all of them having more than 99% accuracy for a dataset with a 2% fraud rate. The authors concluded with suggestions for future exploration: modifying the models to make them more accurate and devising a more reliable approach to split datasets into training and testing sets. Kou et al. (2014) used data regarding credit approval and bankruptcy risk from credit card applications to analyse financial risks using clustering algorithms. They made evaluations based on 11 performance measures using multicriteria decision-making (MCDM) methods. A previous study by Kou et al. (2012) had proposed these MCDM methods for the evaluation of classification algorithms. In a later study, they employed these methods for assessing the feature selection methods for text classification.
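The reason accuracy alone is insufficient on imbalanced fraud data can be seen in a small sketch. The labels below are synthetic (2 fraudulent cases in 100, chosen to mirror a 2% fraud rate), and the function names are hypothetical; nothing here reproduces the dataset or models of the studies above. The `auc` function uses the pairwise-ranking definition of the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives; label 1 means fraud."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Overall accuracy plus the two class-wise rates."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # fraud caught
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # legitimate kept
    }


def auc(y_true, scores):
    """Probability that a random fraud case outscores a random legitimate one.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels: 2 fraudulent transactions out of 100.
y_true = [1, 1] + [0] * 98
always_legitimate = [0] * 100  # a useless model that never flags fraud

print(evaluate(y_true, always_legitimate))
```

A model that never flags fraud scores 98% accuracy here while detecting no fraud at all, which is why sensitivity, specificity, and the area under the curve are reported alongside accuracy.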
In addition to the above-discussed literature in this section, Table 2 provides a summary of some more studies related to the banking finance industry. As visible in Table 2, banking has a lot of different text-mining applications. Risk assessment, quality assessment, money laundering detection, and customer relationship management are just a few examples from the wide pool of possible text-mining applications in banking.
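Of the techniques mentioned in this section, the vector space model is perhaps the easiest to illustrate. The following is a minimal sketch that represents each text as a bag-of-words count vector and compares texts by cosine similarity. The function names and sample strings are hypothetical, and practical systems would normally add stop-word removal and term weighting.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts after lowercasing and stripping punctuation."""
    tokens = (t.strip(".,;:!?()").lower() for t in text.split())
    return Counter(t for t in tokens if t)


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up snippets standing in for customer comments.
complaint = bag_of_words("Loan approval was slow.")
similar = bag_of_words("Slow loan approval again!")
unrelated = bag_of_words("Great mobile app.")

print(cosine_similarity(complaint, similar))
print(cosine_similarity(complaint, unrelated))
```

Texts that share many words receive a score close to 1, whereas texts with no words in common receive 0, which is the property that keyword-based retrieval and document comparison rely on.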

Applications in corporate finance
Corporate finance is an important aspect of the financial domain because it integrates a company's functioning with its financial structure. Various corporate documents, such as the annual reports of a company, have a lot of hidden financial context. Text-mining techniques can be employed to extract this hidden information and also to predict the company's future financial sustainability. Guo et al. (2016) implemented text-mining algorithms that are widely used in accounting and finance. They merged the Thomson Reuters News Archive database and the News Analytics database. The former provides original news, and the latter provides sentiment scores ranging from −1 to 1 with positive, negative, and neutral scores. To balance the dataset, 3000 news articles were randomly selected for training and 500 for testing. Three algorithms, namely NB, SVM, and neural network, were run on the dataset. The overall output accuracies were 58.7%, 78.2%, and 79.6%, respectively. As the neural network had the highest accuracy, it was concluded that it can be used for text mining-based finance studies. Another model based on semantic analysis was also implemented, which used LDA. LDA was used to extract document relationships and the most relevant information from the documents. According to the authors, in accounting and finance, this technique has proven to be advantageous for examining analyst reports and financial reporting. Lewis and Young (2019) discussed the importance of text mining in financial reports. They preferred NLP methods. They highlighted the explosive growth of unstructured textual data in corporate reporting, which opens numerous possibilities for financial applications. According to the authors, NLP methods for text mining provide solutions for two significant problems. One, they prevent overload through automated procedures to deal with immense amounts of data. Two, unlike human cognition, they are able to identify the underlying important latent features.
The authors reviewed the widely used methodologies for financial reporting. These include keyword searches and word counts, attribute dictionaries, NB classification, cosine similarity, and LDA. Some factors, such as limited access to the text data resources and insufficient collaboration between various sectors and disciplines, were identified as challenges that are hindering progress in the application of text mining to finance. Arguing that corporate sustainability reports (CSR) have increased dramatically, become crucial from the financial reporting perspective, and are not amenable to manual analysis processes, Shahi et al. (2014) proposed an automated model based on text-mining approaches for more intelligent scoring of CSR reports. After preprocessing of the dataset, four classification algorithms were implemented, namely NB, random subspace, decision table, and neural networks. Various parameters were evaluated and the training categories and feature selection algorithms were tuned to determine the most effective model. NB with the Correlation-based Feature Selection (CFS) filter was chosen as the preferred model. Based on this model, software was designed for CSR report scoring that lets the user input a CSR report to get its score as an automated output. The software was tested and had an overall effectiveness of 81.10%. The authors concluded that the software could be utilised for other purposes such as the popularity of performance indicators as well.
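Because NB classification recurs throughout the studies reviewed here, a compact sketch of a multinomial NB text classifier with add-one (Laplace) smoothing may help to make the technique concrete. The class name `NaiveBayesText`, the training sentences, and the labels are hypothetical and are not drawn from any of the cited datasets; feature selection steps such as CFS are omitted.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {
            w for counts in self.word_counts.values() for w in counts
        }
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in document.lower().split():
                if word in self.vocabulary:  # skip words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences for illustration only.
train_docs = [
    "profit rose and revenue grew",
    "strong growth and record profit",
    "loss widened and debt rose",
    "weak sales and heavy loss",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict("record profit and growth"))
print(model.predict("heavy debt and loss"))
```

Scores are accumulated in log space to avoid numerical underflow on long documents, and smoothing keeps a single unseen word from forcing a class probability to zero.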
Holton (2009) implemented a model for preventing corporate financial fraud from a different and interesting perspective. The author considered employee disgruntlement or dissatisfaction as a hidden indicator that is responsible for fraud. A minimal dataset of intra-company communication messages and emails on online discussion groups was prepared. After document clustering was used to verify that the data possessed sufficient predictive power, the NB classifier was implemented to classify the messages into disgruntled/non-disgruntled classes, and an accuracy of 89% was achieved. The author proposed the use of the model for fraud risk assessment in corporations and organisations with the motivation that it can be used to prevent huge financial losses. The performance of other models such as neural networks and decision trees was to be compared in future work. Chan and Franklin (2011) developed a new decision-support system to predict the occurrence of an event by analysing patterns and extracting sequences from financial reports. After text preprocessing, textual information generalisation was performed with the help of a shallow parser, which had an F-measure of 85%. The extracted information was stored in a separate database. From this database, the event sequences were identified and extracted. A decision tree model was then implemented on these sequences to create an inference engine that could predict the occurrence of new events based on the training sequences. With an 85:15 training-to-testing split, the model achieved an overall accuracy of 89.09%. The authors concluded by highlighting that their model had better and more robust performance than the prevailing models. Humpherys et al. (2011) reviewed various text-mining methods and theories that have been proposed for the detection of corporate fraud in financial statements and subsequently devised a methodology of their own.
Their dataset comprised the Management's Discussion and Analysis section of corporate annual financial reports. After basic analysis and reduction, various statistical and machine learning algorithms were implemented on the dataset, among which the NB and C4.5 decision tree models both gave the highest accuracy of 67.3% for classifying 10-K reports into fraudulent and non-fraudulent. The authors suggested that their model can be used by auditors for detecting fraudulent statements in reports with the aid of the Agent99 analyser tool. Loughran and McDonald (2011) came up with the argument that the word lists contained in the Harvard Dictionary, which is commonly used for textual analysis, are not suitable for financial text classification because a lot of negative words in the Harvard list are not actually considered a negative in the financial context. Corporate 10-K reports were taken as data sources to create a new dictionary with new word lists for financial purposes. The authors advised the use of term weighting for the word lists. The new word lists were compared with the Harvard word lists on multiple financial data items, such as 10-K filing returns, material weaknesses, and standardised unexpected earnings. Although a significant difference between the word lists was not observed for classification, the authors still suggested the use of their lists in order to be more careful and prevent any erroneous results.
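The effect of choosing a general-purpose versus a finance-specific word list, and of weighting terms rather than counting them, can be sketched as follows. Both word lists below are tiny hypothetical stand-ins, not the published dictionaries, and the weighting is one common tf-idf variant that is assumed here for illustration rather than taken from the study.

```python
import math
from collections import Counter

# Hypothetical word lists for illustration only; not the published lists.
GENERAL_NEGATIVE = {"liability", "tax", "cost", "loss", "impairment"}
FINANCE_NEGATIVE = {"loss", "impairment", "restated", "litigation"}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    stripped = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in stripped if t]


def raw_share(tokens, word_list):
    """Fraction of tokens that appear in the word list (unweighted count)."""
    return sum(t in word_list for t in tokens) / len(tokens)


def weighted_tone(docs_tokens, word_list):
    """Per-document sum of tf-idf weights of list words, scaled by length.

    Assumed weight: (1 + ln tf) * ln(N / df). A word present in every
    document therefore contributes nothing.
    """
    n_docs = len(docs_tokens)
    doc_freq = Counter(w for tokens in docs_tokens for w in set(tokens))
    tones = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weight = sum(
            (1 + math.log(tf[w])) * math.log(n_docs / doc_freq[w])
            for w in tf
            if w in word_list
        )
        tones.append(weight / len(tokens))
    return tones


# Made-up sentences standing in for report excerpts.
docs = [
    "Tax cost and liability figures were stable.",
    "The company reported a loss and an impairment.",
    "Tax cost rose while litigation continued.",
]
doc_tokens = [tokenize(d) for d in docs]

print([raw_share(t, GENERAL_NEGATIVE) for t in doc_tokens])
print([raw_share(t, FINANCE_NEGATIVE) for t in doc_tokens])
print(weighted_tone(doc_tokens, FINANCE_NEGATIVE))
```

In this toy example the first sentence looks strongly negative under the general list but neutral under the finance-specific one, which is the kind of misclassification the argument above is concerned with.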
Whereas other researchers have mostly focused on fraud detection and financial predictions from corporate financial reports, Song et al. (2018) focused on sentiment analysis of these reports with respect to the CSR score. The sentences in the sample reports were manually labelled as positive or negative in order to create sample data for the machine learning algorithm. SVM was implemented on the dataset with a 3:1 training-to-testing split and achieved a precision of 86.83%. Following this, an object library was created, with objects referring to the internal and external environment of the company. Sentiment analysis was conducted on these objects. Then, six regression models were developed to obtain the CSR score, with the model comprising the Political, Economic, Social, Technological, Environmental and Legal (PESTEL), Porter's Five Forces, and Primary and Support Activities frameworks showing the best performance in predicting the CSR score. The authors concluded that CSR plays a vital role in a company's sustainability, and that their research could aid stakeholders in their company-related decision-making.
There have been more studies on CSR reports and sustainability. Liew et al. (2014) analysed process industries for their sustainability trends with the help of CSR and sustainability reports of a large number of big companies. The RapidMiner tool was used for text preprocessing followed by generating frequency statistics, pruning, and further text refinement, which generated sustainability-related terms for analysis. The most occurring terms were taken into consideration to create a hierarchical tree model. Environment, health and safety, and social were identified as the key concepts for sustainability. Based on term occurrence and involvement, the authors classified the sustainability issues as specific, critical, rare, and general. Table 3 presents some more studies on the applications of text mining in corporate finance. As evident from the table and the above-mentioned studies, the annual corporate reports are the most commonly used data source for text-mining applications.

Challenges and future scope
The financial sector is a significant driver of broader industry, and the increasing amount of data in this field has given rise to a number of applications that can be used to improve the field and achieve commercial objectives. Figure 4 shows some common challenges faced by various text-mining techniques in the financial sector. The huge amount of data available is highly unstructured and has explicit meanings in addition to implicit ones. The data needs to undergo proper preprocessing before it can be used for analysis. Although lexicon lists are available for various domains, the financial sector needs a specific dictionary for such approaches, so as to assign proper weights to the corresponding aspects in the document. In addition to this, there is still restricted access to classified information, which is a significant obstacle. Lastly, the current techniques focus on obtaining static results that are true only for a given period of time. There is a need for a system that applies text-mining techniques to dynamically obtained data and outputs real-time results to enable even better insights. Combining text-mining techniques with financial data analytics could potentially produce the most efficient model for this problem domain. The results obtained from mining textual data can be integrated with those from financial analysis, thereby providing models that focus on historical data as well as opinions from diverse sources.

Conclusion
This paper conducted an organised qualitative review of recent literature pertaining to three specific sectors of finance. First, this paper analysed the growing importance of text mining in predicting financial trends. While the prior consensus may have been that financial markets are unpredictable, text mining has challenged this notion. The second area of study was banking, which has seen constant growth in technological innovation over the years, especially in digitisation. Text mining has played a key role in supporting these advancements both directly and indirectly through combination with other technologies. Corporate finance was the third study area. We discussed the importance of text mining in enabling the utilisation of corporate reports and financial statements for serving various purposes in addition to supporting corporate sustainability goals. The use of text mining in financial applications is not limited to these sectors. Researchers are increasingly showing interest in text-mining applications and constantly seeking to build more accurate models. There are still many unexplored possibilities in the financial domain, and the related research can help develop more robust and accurate predictive and analytic systems.