A Dataset for Evaluating Entity Retrieval over Time (DEERT v.0)


The TREC Novelty track in 2004 consisted of a collection of news articles and a set of topics for evaluating the retrieval of novel information over time-ordered lists of documents for each topic. Systems had to retrieve information (sentences, in this case) that was relevant to the topic and not yet present in the retrieved results. A time-stamped list of documents is provided for every topic, reflecting the temporal flow of the story the topic is about.

We created a new collection for evaluating entity retrieval over time, based on the one developed for the TREC 2004 Novelty track.

The new collection

We selected the 25 event topics from the latest TREC Novelty collection (2004). We annotated the documents associated with those topics using a state-of-the-art NLP tool (the SuperSense tagger, also used for annotating the English Wikipedia) in order to extract entities of type person, location, organization, and product based on WSJ annotations.
The annotation system detected 7481 entity occurrences in the collection: 26% persons, 10% locations, 57% organizations, and 7% products.

Six human judges assessed the relevance of the entities in each document with respect to the topic, grading each entity on a 3-point scale: Relevant, Related, Not Relevant. An additional category, 'Not an entity', was used to mark entities that had been wrongly annotated by the NLP tool.
The collection contains a total of 21213 entity-document-topic judgements.

The data we release consists of relevance judgements about the entities in the following format:

topicID documentID entity judgement
N51 NYT19981028.0437 1-Spain Relevant
N51 NYT19981028.0437 2-European NotAnEntity
N51 NYT19981028.0437 3-Chile NotRelevant
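
As an illustration of how the judgement lines might be read, here is a minimal Python sketch. It rests on several assumptions that the description above does not confirm: that fields are separated by whitespace, that the judgement is always a single token at the end of the line, that the number before the first hyphen in the entity field is an index of the entity within the document, and that the file may or may not begin with the header line. The names Judgement, parse_line, and parse_judgements are hypothetical and not part of the released data.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Judgement(NamedTuple):
    topic_id: str
    document_id: str
    entity_index: int  # assumed meaning of the numeric prefix, e.g. "1" in "1-Spain"
    entity: str
    label: str


def parse_line(line: str) -> Judgement:
    """Parse one 'topicID documentID entity judgement' line."""
    parts = line.split()
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 fields, got: {line!r}")
    topic_id, document_id, label = parts[0], parts[1], parts[-1]
    # Everything between the document ID and the label is treated as the
    # entity field, so entity names containing spaces would be kept whole.
    raw_entity = " ".join(parts[2:-1])
    # Split only on the first hyphen, so hyphens inside the name survive.
    index, _, name = raw_entity.partition("-")
    return Judgement(topic_id, document_id, int(index), name, label)


def parse_judgements(lines: Iterable[str]) -> List[Judgement]:
    """Parse all lines, skipping blank lines and an optional header line."""
    result = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("topicID"):
            continue
        result.append(parse_line(stripped))
    return result


sample = [
    "topicID documentID entity judgement",
    "N51 NYT19981028.0437 1-Spain Relevant",
    "N51 NYT19981028.0437 2-European NotAnEntity",
    "N51 NYT19981028.0437 3-Chile NotRelevant",
]

judgements = parse_judgements(sample)
label_counts = Counter(j.label for j in judgements)

for j in judgements:
    print(j.topic_id, j.document_id, j.entity_index, j.entity, j.label)
print(dict(label_counts))
```

Keeping the topic and document IDs alongside each entity makes it straightforward to group judgements per topic or per document later, for instance to follow an entity across the time-ordered documents of a topic.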

DOWNLOAD the dataset.

Related Publications

Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, and Hugo Zaragoza. Entity Summarization of News Articles. In: 33rd Annual ACM SIGIR Conference (SIGIR 2010 poster session), Geneva, Switzerland, July 2010.
Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, and Hugo Zaragoza. TAER: Time Aware Entity Retrieval. In: The 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), Toronto, Canada, October 2010.