Bilingual lexicon extraction from comparable corpora: a comparative study / Ljubešić, Nikola ; Fišer, Darja ; Vintar, Špela ; Pollak, Senja.

By: Ljubešić, Nikola, computer scientist.
Contributor(s): Fišer, Darja [aut] | Vintar, Špela [aut] | Pollak, Senja [aut].
Material type: Article
Description: str.
Other title: Bilingual lexicon extraction from comparable corpora: A comparative study [Title in English].
Subject(s): 5.04 | comparable corpora, bilingual lexicon extraction hrv | comparable corpora, bilingual lexicon extraction eng
Online resources: Electronic version
In: First International Workshop on Lexical Resources (1-5 August 2011; Ljubljana, Slovenia) First International Workshop on Lexical Resources, An ESSLLI 2011 Workshop, Ljubljana, Slovenia - August 1-5, 2011
Summary: This paper presents a comparative study of the impact of the key parameters for bilingual lexicon extraction for nouns from comparable corpora. The parameters we analyzed are: corpus size and comparability, dictionary size and type, feature selection for context vectors and window size, and association and similarity measures. Evaluation against the gold standard shows that a window size of 7 with encoded position yields the best results. The consistently best-performing combination of association and similarity measures is log-likelihood with Jensen-Shannon divergence. We have shown that very good results can be achieved with small but purpose-built seed lexicons, and that problems arising from dissimilarities between the source and the target corpus can be compensated for by sufficiently large corpora.
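The summary names two building blocks: context vectors with position-encoded features, and Jensen-Shannon divergence as the similarity measure. The following is a minimal sketch of both, not the authors' implementation. The function names `context_vector` and `js_divergence` are hypothetical, and `window=3` (three tokens on each side, a 7-token span including the target) is only one possible reading of "window size of 7", which the record does not define. The sketch uses raw counts, so it omits the log-likelihood association weighting and the seed-lexicon translation of features that a cross-lingual setup would need.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=3):
    """Count position-encoded context features around each occurrence of target.

    Each feature is an (offset, word) pair, so the same word at a different
    relative position counts as a different feature.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the target itself and positions outside the token list.
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            counts[(offset, tokens[j])] += 1
    return counts


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two count vectors.

    Counts are normalised to probability distributions first.
    Lower values mean more similar contexts.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    if p_total == 0 or q_total == 0:
        raise ValueError("both vectors must contain at least one count")
    div = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0) / p_total
        q = q_counts.get(key, 0) / q_total
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            div += 0.5 * p * math.log2(p / m)
        if q > 0:
            div += 0.5 * q * math.log2(q / m)
    return div
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared features), so candidate translations can be ranked by ascending divergence from the source word's vector.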

Project MZOS 130-1301679-1380

ENG
