BibID (field 001): BIBFORM086778
First author: Pethő Gergely (linguist)
Title: An n-gram-based language identification algorithm for variable-length and variable-language texts / Pethő Gergely, Mózes Eszter
Notes: The aim of this paper is to describe a new language identification method that uses language models based on character statistics, or more specifically, character n-gram frequency tables or Markov chains. An important advantage of this method is that it uses a very simple and straightforward algorithm, which is similar to those that have been used for the past 20 years for this purpose. In addition, it can also handle input such as target texts in an unknown language or in more than one language, which the traditional approaches inherently classify incorrectly. We systematically compare and contrast our method with others that have been proposed in the literature, and measure its accuracy using a series of experiments. These experiments demonstrate that our solution works not only for whole documents but also delivers usable results for input strings as short as a single word, and the identification rate reaches 99.9% for strings about 100 characters long, i.e. the length of a short sentence.
Subject headings: Humanities; Linguistics; foreign-language journal article in a Hungarian journal
character statistics
language identification
Markov chain
Published in: Argumentum. - 10 (2014), p. 56-82.
Additional authors: Mózes Eszter (1985-) (linguist)
Internet address: URL provided by the author
Version stored in the institutional repository (DEA)