
Efficient discrimination between closely related languages / Tiedemann, Jörg ; Ljubešić, Nikola.

By: Tiedemann, Jörg.
Contributor(s): Ljubešić, Nikola, computer scientist [aut].
Material type: Article
Description: pp. 2619-2634.
Other title: Efficient Discrimination Between Closely Related Languages [Title in English].
Subject(s): 5.04 | language identification, language discrimination, closely related languages hrv | language identification, language discrimination, closely related languages eng
In: COLING 2012 (10-15 December 2012; Mumbai, India), Proceedings of COLING 2012, pp. 2619-2634
Summary: In this paper, we revisit the problem of language identification with the focus on proper discrimination between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using standard methods that have been proposed in the literature. Dedicated models that focus on specific discrimination tasks help to improve the accuracy of general-purpose language identification tools. We propose and compare methods based on simple document classification techniques trained on parallel corpora of closely related languages and methods that emphasize discriminating features in terms of blacklisted words. Our experiments demonstrate that these techniques are highly accurate for the difficult task of discriminating between Bosnian, Croatian and Serbian. The best setup yields an absolute improvement of over 9% in accuracy over the best performing baseline using a state-of-the-art language identification tool.
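
The summary mentions methods that emphasize discriminating features "in terms of blacklisted words". This record does not describe the paper's actual algorithm or word lists, so the following is only a minimal sketch of one way such a classifier could work. It assumes that each language has a list of words that should not occur in it, and that the language whose blacklist is matched least often wins. The names `BLACKLISTS`, `blacklist_hits` and `classify`, and the toy word lists, are hypothetical illustrations and are not taken from the paper.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical toy blacklists (an assumption for illustration only):
# for each language code, words assumed NOT to occur in that language.
BLACKLISTS: Dict[str, Set[str]] = {
    "hr": {"ko", "hleb", "hiljada"},
    "sr": {"tko", "kruh", "tisuća"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def blacklist_hits(text: str,
                   blacklists: Dict[str, Set[str]] = BLACKLISTS) -> Dict[str, int]:
    """Count, per language, how many tokens appear on that language's blacklist."""
    tokens = tokenize(text)
    return {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in blacklists.items()
    }


def classify(text: str,
             blacklists: Dict[str, Set[str]] = BLACKLISTS) -> Optional[str]:
    """Return the language with the fewest blacklist hits, or None on a tie."""
    hits = blacklist_hits(text, blacklists)
    fewest = min(hits.values())
    winners = [lang for lang, n in hits.items() if n == fewest]
    return winners[0] if len(winners) == 1 else None
```

In this sketch a tie (for example, a text that matches no blacklist at all) returns `None`, so a caller could fall back to a general-purpose language identification tool in that case.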


Project: MZOS FP7-288342

ENG
