Skip to main navigation Skip to search Skip to main content

W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking

Research output: Chapter in book/report/conference proceedingConference contributionResearchpeer review

Abstract

Topic detection and tracking in document streams is a critical task in many important applications, hence has been attracting research interest in recent decades. With the large size of data streams, there have been a number of works from different approaches that propose automatic methods for the task. However, there is only a few small benchmark datasets that are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained groundtruth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E - a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.

Original languageEnglish
Title of host publicationCIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
EditorsNorman Paton, Selcuk Candan, Haixun Wang, James Allan, Rakesh Agrawal, Alexandros Labrinidis, Alfredo Cuzzocrea, Mohammed Zaki, Divesh Srivastava, Andrei Broder, Assaf Schuster
PublisherAssociation for Computing Machinery (ACM)
Pages1847-1850
Number of pages4
ISBN (Electronic)9781450360142
DOIs
Publication statusPublished - Oct 2018
Event27th ACM International Conference on Information and Knowledge Management, CIKM 2018 - Torino, Italy
Duration: 22 Oct 201826 Oct 2018

Conference

Conference27th ACM International Conference on Information and Knowledge Management, CIKM 2018
Country/TerritoryItaly
CityTorino
Period22 Oct 201826 Oct 2018

Keywords

  • Benchmark dataset
  • Topic detection
  • Topic tracking

ASJC Scopus subject areas

  • General Business,Management and Accounting
  • General Decision Sciences

Cite this