INitiative for the Evaluation of XML Retrieval

XML Entity Ranking (XER) 2008


See also INEX 2008

21.Jan.2009 UPDATE: The evaluation results have been updated because of a bug found in the evaluation script. A new version of the script is available as well.

23.Jun.2009 UPDATE: Test collection released: final set of topics and relevance judgements for the XER and LC tasks available.

Testing Data

The testing data consists of 35 topics (inex08-xer-topics-final.xml) and their assessments in trec_eval format.

Genuine XER Topics

Topics 101-149 are genuine XER topics: the participants created them specifically for the track, and almost all of them have been assessed by their original topic authors.

From the originally proposed topics, we have dropped topics with fewer than 7 relevant entities (that is, 149, 111, and 120) and topics with more than 74 relevant entities (that is, 103, 101, 102, 146, 137, 148, 105, and 142).

Topic 145 has been excluded on request of the topic assessor.

Topics 107 and 131 have been dropped because their assessments were never finished.

The final set consists of 35 genuine XER topics (inex08-xer-topics-final.xml) with assessments.

Entity Relationship Topics

34 Entity Relationship topics have been developed based on the 49 XER Topics.

After the selection described in the previous section, 23 Entity Relationship topics remain in the final set of genuine XER topics considered in the evaluation.

Relevance assessments have not yet been performed for Entity Relationship topics.

Assessments Data Format (Qrels)

Assessments for the entity ranking topics are provided in qrels format.

The identifier for article ###.xml in the collection is given as WP###.
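
For illustration only, a line in this format presumably follows the usual TREC qrels layout: topic number, an iteration/stratum field, the WP### entity identifier, and a relevance grade. The identifiers below are invented; check the exact columns against the released qrels files:

    101 0 WP1834432 1
    101 0 WP2201984 0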

Use inex08-xer-testing-qrels-entity-ranking.txt to evaluate your results on the entity ranking task, and inex08-xer-testing-qrels-list-completion.txt to evaluate results for the list completion task. The difference between the two files is whether the example entities are included as relevant answers or left out. Note that, for the list completion task, your system should not include the given example entities in its answer set; a sketch of such filtering follows below.
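
As a minimal sketch (not part of the official tooling), the Python snippet below removes each topic's example entities from a run in TREC format before the list completion evaluation. The file names and the examples dictionary are placeholders you would fill in from the topic file:

    # Placeholder mapping: topic id -> example entity ids given in the topic
    examples = {
        "101": {"WP1834432", "WP2201984"},
    }

    # Copy the run, skipping any result that is an example entity of its topic
    with open("myrun_lc.trec") as src, open("myrun_lc.filtered.trec", "w") as dst:
        for line in src:
            parts = line.split()
            if len(parts) < 3:
                continue
            topic, docno = parts[0], parts[2]
            if docno not in examples.get(topic, set()):
                dst.write(line)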

For the entity ranking task, the organizers checked and corrected all cases where example entities provided at topic creation time had been inserted into the pool and judged non-relevant at assessment time.

The official evaluation measure is xinfAP, as defined in [1], which uses stratified sampling to estimate Average Precision. This measure can be computed with the script sample_eval.pl (kindly provided by Emine Yilmaz), using the qrels and the run files.
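
Assuming sample_eval.pl follows the usual trec_eval-style invocation (qrels file first, run file second), evaluating an entity ranking run would look roughly like:

    perl sample_eval.pl inex08-xer-testing-qrels-entity-ranking.txt myrun_er.trec

Here myrun_er.trec is a placeholder for your own run file in TREC format; check the script itself for the exact options it supports.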

Guidelines and results

The evaluation results measured with xinfAP for the entity ranking task are:

1_FMIT_ER_TC_nopred-cat-baseline-a1-b8:0.341
1_cirquid_ER_TEC_idg.trec:0.326
4_UAms_ER_TC_cats:0.317
2_UAms_ER_TC_catlinksprop:0.314
1_UAms_ER_TC_catlinks:0.311
3_cirquid_ER_TEC.trec:0.277
2_cirquid_ER_TC_idg.trec:0.274
2_500_L3S08_ER_TDC:0.265
1_L3S08_ER_TC_mandatoryRun:0.256
3_UAms_ER_TC_overlap:0.253
1_CSIR_ER_TC_mandatoryRun:0.236
4_cirquid_ER_TC.trec:0.235
4_UAms_ER_TC_cat-exp:0.232
1_UAms_ER_TC_mixture:0.222
3_UAms_ER_TC_base:0.159
6_UAms_ER_T_baseline:0.111

The evaluation results measured with xinfAP for the list completion task are:

1_FMIT_LC_TE_nopred-stat-cat-a1-b8:0.402
1_FMIT_LC_TE_pred-2-class-stat-cat:0.382
1_FMIT_LC_TE_nopred-stat-cat-a2-b6:0.363
1_FMIT_LC_TE_pred-4-class-stat-cat:0.353
5_UAms_LC_TE_LC1:0.325
6_UAms_LC_TEC_LC2:0.323
1_CSIR_fixed:0.322
2_UAms_LC_TCE_dice:0.319
5_cirquid_LC_TE_idg.trec.fixed:0.305
1_L3S08_LC_TE_mantadoryRun:0.288
2_L3S08_LC_TE:0.286
5_cirquid_LC_TE_idg.trec:0.274
6_cirquid_LC_TE.trec.fixed:0.272
1_CSIR_LC_TE_mandatoryRun:0.257
6_cirquid_LC_TE.trec:0.249
5_UAms_LC_TE_baseline:0.133

Guidelines are archived in the original INEX 2008 Entity Ranking guidelines document.

Judging INEX Entity Ranking

See the judging pages.

References

[1] Emine Yilmaz, Evangelos Kanoulas, and Javed A. Aslam. A simple and efficient sampling method for estimating AP and NDCG. In Proceedings of SIGIR 2008.