A comparative analysis of text representation, classification and clustering methods over real project proposals

Meltem Aksoy*, Seda Yanık, Mehmet Fatih Amasyali

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Purpose: When a large number of project proposals are evaluated to allocate available funds, grouping them based on their similarities is beneficial. Current approaches to grouping proposals rely primarily on manually matching similar topics, discipline areas and keywords declared by project applicants. As the number of proposals increases, this task becomes complex and requires excessive time. This paper aims to demonstrate how to effectively use the rich information in the titles and abstracts of Turkish project proposals to group them automatically.

Design/methodology/approach: This study proposes a model that groups Turkish project proposals by combining word embedding, clustering and classification techniques. The proposed model uses FastText, BERT and term frequency/inverse document frequency (TF/IDF) word-embedding techniques to extract terms from the titles and abstracts of project proposals in Turkish. The extracted terms were grouped using both clustering and classification techniques. Natural groups contained within the corpus were discovered using the k-means, k-means++, k-medoids and agglomerative clustering algorithms. Additionally, this study employs classification approaches to predict the target class for each document in the corpus. To classify project proposals, various classifiers are used, including k-nearest neighbors (KNN), support vector machines (SVM), artificial neural networks (ANN), classification and regression trees (CART) and random forest (RF). Empirical experiments were conducted to validate the effectiveness of the proposed method using real data from the Istanbul Development Agency.

Findings: The results show that the generated word embeddings can effectively represent proposal texts as vectors and can be used as inputs for clustering or classification algorithms. Using clustering algorithms, the document corpus is divided into five groups. In addition, the results demonstrate that the proposals can easily be categorized into predefined categories using classification algorithms. SVM-Linear achieved the highest prediction accuracy (89.2%) with the FastText word embedding method. A comparison of manual grouping with the automatic classification and clustering results revealed that both classification and clustering techniques have a high success rate.

Research limitations/implications: The proposed model automatically benefits from the rich information in project proposals and significantly reduces the time-consuming tasks that managers must otherwise perform manually. Thus, it eliminates the drawbacks of the current manual methods and yields significantly more accurate results. In the future, additional experiments should be conducted to validate the proposed method using data from other funding organizations.

Originality/value: This study presents the application of word embedding methods to effectively use the rich information in the titles and abstracts of Turkish project proposals. Existing studies on the automatic grouping of proposals use traditional frequency-based word embedding methods for feature extraction to represent project proposals. Unlike previous research, this study employs two high-performing neural network-based textual feature extraction techniques to obtain terms representing the proposals: BERT as a contextual word embedding method and FastText as a static word embedding method. Moreover, to the best of our knowledge, no research has been conducted on the grouping of project proposals in Turkish.
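The pipeline described in the abstract (represent each proposal as a vector, then cluster or classify the vectors) can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: it uses TF-IDF alone (no FastText or BERT), a plain k-means with a simple first-k initialization, and a 1-nearest-neighbor classifier with leave-one-out evaluation in place of the tuned classifiers named above. The four short English "proposals", their theme labels and all function names are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (list of floats) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed variant for this sketch).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, n_iter=20):
    """Lloyd's k-means; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def leave_one_out_1nn_accuracy(vectors, labels):
    """Classify each document by its most similar other document (cosine)."""
    correct = 0
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        # Vectors are unit length, so the dot product is the cosine similarity.
        best = max(others, key=lambda j: sum(a * b for a, b in zip(query, vectors[j])))
        correct += labels[best] == labels[i]
    return correct / len(vectors)


# Invented toy corpus: stand-ins for proposal titles plus abstracts.
docs = [
    "startup funding for young entrepreneurs",
    "youth sports program for children",
    "entrepreneurs startup incubator funding",
    "children youth education program",
]
themes = ["entrepreneurship", "children and youth",
          "entrepreneurship", "children and youth"]

vectors = tfidf_vectors(docs)
cluster_labels = kmeans(vectors, k=2)
loo_accuracy = leave_one_out_1nn_accuracy(vectors, themes)

print("cluster labels:", cluster_labels)
print("leave-one-out 1-NN accuracy:", loo_accuracy)
```

In this sketch, clustering uses no labels and only recovers groups from vector similarity, whereas classification needs the manually assigned themes. That distinction mirrors the two routes the abstract compares against manual grouping.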

Original language: English
Pages (from-to): 595-628
Number of pages: 34
Journal: International Journal of Intelligent Computing and Cybernetics
Volume: 16
Issue number: 3
Publication status: Published - 12 Jul 2023

Bibliographical note

Publisher Copyright:
© 2023, Emerald Publishing Limited.

Funding

Project selection is a multistep decision-making process that addresses a range of strategic, tactical and operational issues. In this process, proposals are evaluated against the criteria specified by the financial support program to determine which projects are eligible for grants. illustrates the general project-selection process proposed by . While this workflow was created specifically for research projects, it also reflects the general structure of the project selection process at various funding organizations.

As part of a financial support program, the Istanbul Development Agency (IDA) invited qualified applicants to submit project proposals aligned with previously designated themes and criteria. To validate the proposed approach, we used a real dataset containing 2,434 project proposals submitted to IDA between 2012 and 2021. These proposals were submitted to different financial support programs under four themes: innovation, entrepreneurship, creative industries, and children and youth. Because the full proposal texts are long, using them in their entirety was deemed unnecessary; only the title and abstract of each proposal were used to represent it. The statistical characteristics of the datasets are summarized in .

Keywords

  • Project proposal selection
  • Text classification
  • Text clustering
  • Text mining
  • Word embedding
