Dissemination at the Swedish Language Technology Conference in Stockholm


The project will be disseminated at the Seventh Swedish Language Technology Conference, to be held in Stockholm on November 7-9. In their paper, Marina Santini, Wiktor Strandqvist and Arne Jönsson describe an approach to profiling the domain specificity of specialized web corpora in Swedish. The proposed approach is based on burstiness, a statistical measure that identifies words with an uneven distribution across the documents of a corpus. We apply burstiness to two medical web corpora that differ in size and in domain granularity. Results are promising and show that burstiness is an appropriate measure for profiling domain specificity when matched against reference lists (gold standards) that represent the target domains. However, further research is needed to find adequate evaluation metrics, less empirical cut-off points and a more principled gold standard design.
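
To give a feel for what burstiness captures, here is a minimal Python sketch. It assumes one common formulation, in which a word's burstiness is its total number of occurrences in the corpus divided by the number of documents that contain it. This announcement does not state which exact formula the paper uses, so the function name `burstiness`, the whitespace tokenization and the toy documents are illustrative assumptions only.

```python
from collections import Counter


def burstiness(docs):
    """Return {word: total occurrences / number of documents containing it}.

    Assumed formulation for illustration; the paper may define it differently.
    """
    total = Counter()     # occurrences of each word in the whole corpus
    doc_freq = Counter()  # number of documents in which each word appears
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenization
        total.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: total[word] / doc_freq[word] for word in total}


# Toy corpus (invented for this example, not taken from the paper's data).
docs = [
    "insulin insulin insulin dose for the patient",
    "the patient visited the clinic",
    "the clinic is open",
]
scores = burstiness(docs)

# Print words from most to least bursty.
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}\t{score:.2f}")
```

Under this assumed definition, a word that clusters in a single document (here "insulin") scores high, while a word spread thinly over many documents (here "the") stays close to 1. Words above some cut-off point could then be compared with a reference list for the target domain.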