
Lemmatization and morphosyntactic tagging of Croatian and Serbian / Agić, Željko ; Ljubešić, Nikola ; Merkler, Danijela.

By: Agić, Željko.
Contributor(s): Merkler, Danijela [aut] | Ljubešić, Nikola, computer scientist [aut].
Material type: Article
Description: 48-57.
Other title: Lemmatization and Morphosyntactic Tagging of Croatian and Serbian [Title in English].
Subject(s): 5.04 | lemmatization, tagging, Croatian, Serbian hrv | lemmatization, tagging, Croatian, Serbian eng
In: 4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013) (8-9.08.2013; Sofia, Bulgaria). Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, pp. 48-57
Summary: We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part of speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.


Project MZOS 130-1300646-1776

ENG
