(last updated: 27 August 2017)
A Web Corpus for eCare
The eCare corpus is a public textual corpus of pages downloaded from the web. It has been designed as a dynamic and extensible corpus whose size can be increased over time. It is a concept-specific medical collection, i.e. it contains web pages that discuss chronic diseases (e.g. “ansiktstics”, facial tics, or “lungemfysem”, pulmonary emphysema).
At present, the eCare corpus includes only web documents written in Swedish. This version of the corpus is called eCare_SV_01 and contains 801 web documents. Each of these documents has been labelled as “lay” or “specialized” by a lay annotator (i.e. an annotator without medical education) and by an expert annotator (i.e. an annotator with medical education). The annotation is stored in the corpus in XML format.
Each text in the corpus has a “text header” in XML. Not all the fields in the text header have been filled in. At present, only the fields related to “sublanguage” have been completed. The other fields will be filled in over time.
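As a rough illustration of how the “sublanguage” fields of a text header might be read, here is a minimal Python sketch. The header layout shown (the `textHeader` and `sublanguage` element names, the `annotator` attribute and the sample document) is hypothetical and is not taken from the eCare schema; the names should be adjusted to match the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical header layout: the element and attribute names below are
# assumptions for illustration only, not the actual eCare schema.
SAMPLE = """<text id="doc001">
  <textHeader>
    <sublanguage annotator="lay">specialized</sublanguage>
    <sublanguage annotator="expert">lay</sublanguage>
    <domain/>
  </textHeader>
  <body>...</body>
</text>"""


def read_sublanguage_labels(xml_string):
    """Return a dict mapping annotator type to its sublanguage label.

    Fields that are empty (not yet filled in) are skipped.
    """
    root = ET.fromstring(xml_string)
    labels = {}
    for node in root.iter("sublanguage"):
        annotator = node.get("annotator")
        value = (node.text or "").strip()
        if annotator and value:
            labels[annotator] = value
    return labels


if __name__ == "__main__":
    print(read_sublanguage_labels(SAMPLE))
```

Because unfilled fields are skipped rather than treated as errors, the same function could be run over headers in which only some fields have been completed.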
eCare_SV_01 is distributed under the following disclaimer:
Copyright is held by the author/owner(s) of the web documents included in the corpus. The documents in the corpus can be used for research purposes ONLY.
The corpus, the scripts and the output of the classification models are available for download from the following page:
We encourage the use of the corpus and the replication of the experiments, and welcome improvements and further discussion.