hrWaC and slWaC: compiling web corpora for Croatian and Slovene / Ljubešić, Nikola ; Erjavec, Tomaž.

By: Ljubešić, Nikola, computer scientist.
Contributor(s): Erjavec, Tomaž [aut].
Material type: Article
Description: pp. 395-402.
ISBN: 978-3-642-23537-5.
Other title: hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene [Title in English].
Subject(s): 5.04 | web corpus, Croatian, Slovene, topic modeling hrv | web corpus, Croatian, Slovene, topic modeling eng
Online resources: Electronic version
In: Text, Speech and Dialogue : 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011 : Proceedings. International Conference, TSD 2011 (14 ; 2011 ; Pilsen, Czech Republic), pp. 395-402 / Ivan Habernal and Vaclav Matousek
Summary: Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the limited amount of available web data. The modifications are described in the paper, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall. The paper also investigates text-types of the acquired corpora using topic modeling, comparing the two corpora among themselves and with ukWaC.
No physical items for this record

Project MZOS 130-1301679-1380

ENG
