Discriminating between Closely Related Languages on Twitter / Nikola Ljubešić ; Denis Kranjčić.

By: Ljubešić, Nikola, computer scientist.
Contributor(s): Kranjčić, Denis [aut].
Material type: Article
Publisher: 2015
Description: pp. 1-8
Other title: Discriminating between Closely Related Languages on Twitter [Title in English].
Subject(s): 5.04 | microblogging; language identification; closely related languages
Online resources: Electronic version
In: Informatica (Ljubljana) 39 (2015), 1; pp. 1-8
Summary: In this paper we tackle the problem of discriminating Twitter users by the language they tweet in, taking into account very similar South-Slavic languages – Bosnian, Croatian, Montenegrin and Serbian. We apply the supervised machine learning approach by annotating a subset of 500 users from an existing Twitter collection by the language the users primarily tweet in. We show that by using a simple bag-of-words model, univariate feature selection, 320 strongest features and a standard classifier, we reach user classification accuracy of ∼98%. Annotating the whole 63,160 users strong Twitter collection with the best performing classifier and visualizing it on a map via tweet geo-information, we produce a Twitter language map which clearly depicts the robustness of the classifier.
No physical items for this record

Project: MZOS project

ENG
