Title: Deep Learning-Based Analysis of Ancient Greek Literary Texts in English Version: A Statistical Model Based on Word Frequency and Noise Probability for the Classification of Texts / Gál Zoltán, Tóth Erzsébet
Notes: In this paper we present a methodology for clustering texts based on word frequency in the English translations of selected ancient Greek texts. As a basis for categorizing the literary works, we used the classification system of the ancient Library of Alexandria, devised by the prominent Greek scholar-poet Callimachus in the 3rd century BC. In our content analysis, we determined a triplet of values (a, b, c) describing a power function that fits the curve given by the word frequencies in the texts. In addition, we identified 16 features of the texts that correspond to the token categories investigated in each text, such as the part of speech of a word in context, numerals, subordinating conjunctions, symbols, etc. We developed a cognitive model in which several hundred different subtexts were used for supervised learning with the aim of recognizing the class of a subtext. For 200 subtexts, the (a, b, c) triplets, the subtext classes, and the 16-dimensional feature vectors were used to train a Recurrent Neural Network (RNN). The Long Short-Term Memory (LSTM) RNN could efficiently predict the class of a chosen subtext without interpreting its content. The influence of the non-zero error rate of new communication services on the meaning of the transferred texts was also investigated. The impact of noise on classification accuracy was found to depend linearly on the character error rate.
Published in: Infocommunications Journal. - 16 : Joint Special Issue on Cognitive Infocommunications and Cognitive Aspects of Virtual Reality (2024), p. 2-11.
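The character error rate mentioned in the notes above can be illustrated with a minimal sketch. It assumes a simple substitution model in which each character is independently replaced by a random lowercase letter; the function name `add_character_noise`, the replacement alphabet, and the fixed seed are illustrative assumptions and are not taken from the paper.

```python
import random
import string


def add_character_noise(text, char_error_rate, seed=0):
    """Corrupt text by substituting characters at a given rate.

    Assumption: each character is replaced independently, with
    probability char_error_rate, by a random lowercase letter.
    The paper's actual noise model may differ.
    """
    if not 0.0 <= char_error_rate <= 1.0:
        raise ValueError("char_error_rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    noisy = []
    for ch in text:
        if rng.random() < char_error_rate:
            noisy.append(rng.choice(string.ascii_lowercase))
        else:
            noisy.append(ch)
    return "".join(noisy)
```

A noisy copy of each subtext produced this way could then be passed to a classifier at several error rates to observe how accuracy changes.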
Title: Deep learning-based analysis of ancient Greek literary texts: A statistical model based on word frequency for the classification of texts / Gál Zoltán, Tóth Erzsébet
Published in: 12th IEEE International Conference on Cognitive Infocommunications: CogInfoCom 2021: Proceedings / ed. Jan Nikodem, Ryszard Klempous. - p. 529-535.
Title: Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification / Tóth Erzsébet, Gál Zoltán
Notes: A parallel corpus comprising Croatian EU legislative documents automatically translated into English spans 28 years and is enriched with metadata, including creation year and hierarchical classifier tags denoting descriptors, document types, and fields. However, nearly two-thirds of the approximately 1.5 thousand texts lack complete metadata, and completing it requires labor-intensive manual work that is difficult for human administrators to sustain. The same incompleteness can be observed on official legal sites that operate as regular service-provisioning databases. In response, this paper introduces an artificial cognitive, multilabel classification approach that expedites the tagging process with only a fraction of the manual effort. Leveraging the Latent Dirichlet Allocation (LDA) algorithm, our method assigns field values or tags to incompletely labeled documents. We implement a Flexible LDA variant that incorporates the influence of topics close to the most probable topic, regulated by a relative probability threshold (RPT). We evaluate how the LDA prediction depends on document prefiltering and on the RPT value. Furthermore, we investigate how quantitative linguistic properties depend on the type and specificity of the pre-processing tasks. Our algorithm, built on error-correcting optimizing codes, successfully predicts a mixture of topic probabilities for these legal texts. The prediction is obtained by calculating the Hamming distance between binary feature vectors created from the legal fields of the EUROVOC multilingual thesaurus.
Published in: Infocommunications Journal. - 16 : Joint Special Issue on Cognitive Infocommunications and Cognitive Aspects of Virtual Reality (2024), p. 58-66.
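The relative probability threshold (RPT) described in the notes above can be sketched as follows. This is only one plausible reading, assuming that a topic is retained when its probability is at least RPT times the probability of the most probable topic; the function name `select_topics` and this exact rule are assumptions, not the authors' published definition.

```python
def select_topics(topic_probs, rpt):
    """Return the indices of topics close to the most probable topic.

    Assumption: a topic is kept when its probability is at least
    rpt * max(topic_probs). With rpt = 1.0 only the top topic(s)
    survive; lower values admit more topics.
    """
    if not 0.0 <= rpt <= 1.0:
        raise ValueError("rpt must be between 0 and 1")
    top = max(topic_probs)
    return [i for i, p in enumerate(topic_probs) if p >= rpt * top]


# Example: a document whose LDA topic mixture is dominated by two topics.
print(select_topics([0.5, 0.4, 0.1], rpt=0.7))  # [0, 1]
```

Under this reading, lowering the threshold turns a single-label prediction into a multilabel one, which is the behaviour the notes attribute to the flexible variant.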
Title: Multilabel Clustering Analysis of the Croatian-English Parallel Corpus Based on Latent Dirichlet Allocation Algorithm / Tóth Erzsébet, Gál Zoltán
Notes: A parallel corpus of Croatian EU legislative documents, automatically translated into English, covers 28 years; the year of creation and hierarchical classifier tags (descriptors, document types, and fields) are assigned to each text as meta-information. Only two thirds of the roughly 1.5 thousand texts have all fields completed, and completing the rest manually would be too time-consuming for human administration. Similar incompleteness may appear on official legal sites operated as regular service-provisioning databases. In this paper we propose an artificial cognitive, multilabel classification method that finds the necessary tags for the corpus automatically, in a tiny fraction of the manual tagging time. The Latent Dirichlet Allocation algorithm assigns field values or tags to incompletely labelled documents. We present how quantitative linguistic properties depend on the type and specificity of the preprocessing tasks. We successfully applied this algorithm, built on no error-correcting optimising codes, to predict a mixture of topic probabilities for these legal texts on the basis of the Hamming distance between binary feature vectors created from the legal fields of the EUROVOC multilingual thesaurus.
Published in: 14th IEEE International Conference on Cognitive Infocommunications : Proceedings / IEEE. - p. 25-32.
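The Hamming-distance comparison of binary feature vectors named in the notes above can be illustrated with a short sketch. The codebook layout, the field names in the usage example, and the helper `nearest_label` are hypothetical; the notes do not specify how the vectors are matched.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_label(vector, codebook):
    """Return the label whose reference vector is closest in Hamming distance.

    Assumption: codebook maps each label to one binary reference vector.
    """
    return min(codebook, key=lambda label: hamming_distance(vector, codebook[label]))


# Hypothetical field labels and 4-bit reference vectors.
codebook = {"law": [1, 0, 1, 0], "trade": [0, 1, 1, 1]}
print(nearest_label([1, 0, 0, 0], codebook))  # law
```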
Title: A mesterséges intelligencia alkalmazása görög irodalmi szövegek elemzésére [The application of artificial intelligence to the analysis of Greek literary texts] / Tóth Erzsébet, Gál Zoltán
Notes: In this paper we developed a classification model in which several hundred different ancient Greek text entities were used for supervised learning, so that the model recognizes the class of a text entity. We determined the values of the triplet (a, b, c) describing a power function that closely fits the curve given by the relative frequency of the words in the selected texts. For the 200 text entities, to estimate the values of the (a, b, c) triplet, we used the class identifiers of the text entities and the 16-dimensional feature vectors to train the Recurrent Neural Network (RNN). We concluded that the LSTM (Long Short-Term Memory) RNN efficiently predicted the class into which a selected text entity can be sorted.
Published in: XXIII. Energetika-Elektrotechnika - ENELKO és XXXII. Számítástechnika és Oktatás : SzámOkt Multi-konferencia / ed. Sebestyén-Pál György, Szabó Loránd. - p. 173-179.
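The power-function fit mentioned in the notes above can be sketched under stated assumptions. The notes do not give the functional form, so the sketch assumes f(r) = a * r**b + c over word ranks r = 1, 2, ...; the function name `fit_power_function` and the fitting procedure (a grid over c combined with log-log least squares) are illustrative and are not the authors' method.

```python
import math


def fit_power_function(freqs, steps=30):
    """Fit f(r) = a * r**b + c to positive frequencies sorted by rank.

    Assumed form and procedure: for each candidate offset c below the
    smallest frequency, fit a and b by linear least squares on
    log(f - c) versus log(r), and keep the candidate with the smallest
    squared error in the original scale. Needs at least two frequencies.
    """
    ranks = range(1, len(freqs) + 1)
    xs = [math.log(r) for r in ranks]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for k in range(steps):
        c = k * min(freqs) / steps  # always below min(freqs), so f - c > 0
        ys = [math.log(f - c) for f in freqs]
        mean_y = sum(ys) / len(ys)
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        a = math.exp(mean_y - b * mean_x)
        sse = sum((a * r ** b + c - f) ** 2 for r, f in zip(ranks, freqs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]
```

The resulting (a, b, c) triplet could then serve as three input values per text entity, alongside the 16-dimensional feature vector described in the notes.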
