
Evaluating full lemmatization of Croatian texts / Agić, Željko ; Tadić, Marko ; Dovedan, Zdravko.

By: Agić, Željko.
Contributor(s): Tadić, Marko [aut] | Dovedan Han, Zdravko [aut].
Material type: Article
Description: 133-144.
ISBN: 978953-55375-1-9.
Other title: Evaluating Full Lemmatization of Croatian Texts [Title in English].
Subject(s): 2.09 | 5.04 | 6.03 | full lemmatization, morphosyntactic tagging, Croatian language hrv | full lemmatization, morphosyntactic tagging, Croatian language eng
In: Technologies for the Processing and Retrieval of Semi-Structured Documents: Experience from the CADIAL Project, pp. 133-144 / Tadić, Marko ; Dalbelo Bašić, Bojana ; Moens, Marie-Francine
Summary: The chapter presents the implementation and evaluation of a module for full lemmatization of Croatian texts. The module implements several lemmatization procedures, all of them based on merging the output of the previously developed stochastic morphosyntactic tagger CroTag with the inflectional lexicon of Croatian. Evaluation of the lemmatization module on two test cases, simulating realistic and ideal operating conditions, yielded full lemmatization accuracy scores of 96.96 and 98.15 percent, respectively, on a newspaper corpus. It is also shown that a majority of errors in this framework, 57.14 percent in the realistic testing scenario, occur on word forms with external homography. Moreover, approximately 80 percent of all lemmatization errors occur on nouns, adjectives, verbs and adverbs, in that order. Language resources, the testing environment and procedure descriptions are provided in the paper, along with a discussion of the results obtained and possible future research directions.
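
The summary describes lemmatization as a merge of tagger output with an inflectional lexicon. The following is only a minimal sketch of that general idea, not the procedure implemented in the chapter: the toy lexicon, the tag strings, the fallback order and the names `LEXICON`, `lemmatize` and `accuracy` are all assumptions made for illustration. It shows how a tag proposed by a tagger could select among several lexicon readings of a word form whose readings belong to different lemmas (external homography).

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon (an assumption, not the real resource):
# word form -> list of (lemma, morphosyntactic tag) readings.
# The tag strings are illustrative only.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "kose": [("kosa", "Ncfsg"), ("kositi", "Vmr3p")],  # two different lemmas
    "kuće": [("kuća", "Ncfsg")],
}


def lemmatize(form: str, tag: str) -> str:
    """Pick a lemma for `form` given the tag a tagger proposed for it."""
    readings = LEXICON.get(form.lower(), [])
    # 1. Prefer a lexicon reading whose tag matches the tagger's tag exactly.
    for lemma, msd in readings:
        if msd == tag:
            return lemma
    # 2. Otherwise accept a reading with the same part of speech,
    #    taken here to be the first character of the tag.
    for lemma, msd in readings:
        if msd[:1] == tag[:1]:
            return lemma
    # 3. Otherwise fall back to the first lexicon reading, or to the
    #    word form itself when the form is not in the lexicon.
    return readings[0][0] if readings else form


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

In this sketch a tagging error on a form such as "kose" propagates directly into a wrong lemma, which is one way to picture why word forms with external homography can dominate the error counts.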

This is a corrected version of a paper published in Klopotek, M. ; Przepiorkowski, A. ; Wierzchon, S. ; Trojanowski, K. (eds.) (2009) Recent Advances in Intelligent Information Systems, Academic Publishing House EXIT, Warsaw, 175-184.


Project MZOS 036-1300646-1986

Project MZOS 130-1300646-0645

Project MZOS 130-1300646-1776

ENG
