Producing monolingual and parallel web corpora at the same time - SpiderLing and Bitextor's love affair / Nikola Ljubešić ; Miquel Esplà-Gomis ; Antonio Toral ; Sergio Ortiz Rojas ; Filip Klubička.

By: Ljubešić, Nikola, computer scientist.
Contributor(s): Esplà-Gomis, Miquel [aut] | Toral, Antonio [aut] | Ortiz Rojas, Sergio [aut] | Klubička, Filip [aut].
Material type: Article
Publisher: 2016
Description: pp.
Other title: Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair [Title in English].
Subject(s): 5.04 | crawling; top-level domain; monolingual corpus; parallel corpus
In: Language Resources and Evaluation Conference
Summary: Abstract "This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain ".hr" and the Slovene top-level domain ".si", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs."

Project: MZOS project

ENG
