Dissemination of the eCare web corpus


The eCare web corpus has been presented at two academic events, a conference and a workshop. In May, Wiktor Strandqvist presented “Towards a quality assessment of web corpora for language technology applications”, co-authored by Wiktor Strandqvist, Marina Santini, Leili Lind and Arne Jönsson, at the TISLID18 Conference, Ghent University, Belgium.

In September, Marina Santini presented the experiments described in “Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora”, co-authored by Marina Santini, Wiktor Strandqvist, Mikael Nyström, Marjan Alirezai and Arne Jönsson, at the international workshop TIR 2018, held in Regensburg, Germany.

The eCare corpus is a public text corpus of pages downloaded from the web. It has been designed as a dynamic and extensible corpus whose size can be increased over time. It is a concept-specific medical collection, i.e. it contains web pages about chronic diseases (e.g. “ansiktstics”, facial tics, or “lungemfysem”, pulmonary emphysema). More information on the eCare corpus can be found here.