21.Jan.2009 UPDATE: The evaluation results have been updated because of a bug found in the evaluation script. A new version of the script is available as well.
23.Jun.2009 UPDATE: Test collection released: final set of topics and relevance judgements for the XER and LC tasks available.
The testing data consists of 35 topics (inex08-xer-topics-final.xml) and their assessments in trec_eval format.
Topics 101-149 are genuine XER topics, in that the participants created these topics specifically for the track, and (almost all) topics have been assessed by the original topic authors.
From the originally proposed topics, we have dropped topics with fewer than 7 relevant entities (that is, 149, 111, and 120) and topics with more than 74 relevant entities (that is, 103, 101, 102, 146, 137, 148, 105, and 142).
Topic 145 has been excluded on request of the topic assessor.
Topics 107 and 131 have been dropped because their assessments were never finished.
The final set consists of 35 genuine XER topics (inex08-xer-topics-final.xml) with assessments.
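The selection rules above can be sketched in a few lines. This is only an illustration: the function name `select_topics` and the input shape (a mapping from topic id to its number of relevant entities) are assumptions, not part of the track's tooling; the thresholds and the manually excluded topics are taken from the text.

```python
from typing import Dict, List

# Thresholds and manual exclusions as stated in the text above.
MIN_RELEVANT = 7    # topics with fewer relevant entities were dropped
MAX_RELEVANT = 74   # topics with more relevant entities were dropped
MANUALLY_EXCLUDED = {145, 107, 131}  # assessor request; unfinished assessments


def select_topics(relevant_counts: Dict[int, int]) -> List[int]:
    """Return the sorted topic ids that survive the selection rules.

    `relevant_counts` is a hypothetical input: topic id -> number of
    entities judged relevant for that topic.
    """
    kept = []
    for topic, n_relevant in relevant_counts.items():
        if topic in MANUALLY_EXCLUDED:
            continue
        if n_relevant < MIN_RELEVANT or n_relevant > MAX_RELEVANT:
            continue
        kept.append(topic)
    return sorted(kept)
```

Both bounds are inclusive for kept topics, so a topic with exactly 7 or exactly 74 relevant entities stays in the set.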
34 Entity Relationship topics have been developed based on the 49 XER Topics.
After the selection described in the previous section, 23 Entity Relationship topics are part of the final set of genuine XER topics considered in the evaluation.
Relevance assessments have not yet been performed for Entity Relationship topics.
Assessments for the entity ranking topics are provided in qrels format, with the following structure:
The identifier for article ###.xml in the collection is given as WP###.
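As a small illustration of this naming convention, the following sketch converts between an article file name and its qrels identifier. The helper names are hypothetical and not part of any released script.

```python
def article_to_entity_id(filename: str) -> str:
    """Map an article file name such as '12345.xml' to 'WP12345'."""
    name = filename.rsplit("/", 1)[-1]  # tolerate a leading directory path
    if not name.endswith(".xml"):
        raise ValueError(f"not an article file name: {filename!r}")
    return "WP" + name[: -len(".xml")]


def entity_id_to_article(entity_id: str) -> str:
    """Map a qrels identifier such as 'WP12345' back to '12345.xml'."""
    if not entity_id.startswith("WP"):
        raise ValueError(f"not an entity identifier: {entity_id!r}")
    return entity_id[len("WP"):] + ".xml"
```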
Use inex08-xer-testing-qrels-entity-ranking.txt to evaluate your results on the entity ranking task, and inex08-xer-testing-qrels-list-completion.txt to evaluate results for the list completion task. The two files differ only in whether the entity examples are included as relevant answers or left out. Note that your system should not include the given example entities in the answer set when evaluating the list completion task!
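One way to make sure a list completion run contains no example entities is to filter the ranked list before writing the run file. The sketch below is an assumption about how a participant might do this; `strip_examples` is a hypothetical helper, and entity identifiers are assumed to use the WP### form.

```python
from typing import Iterable, List


def strip_examples(ranked: List[str], examples: Iterable[str]) -> List[str]:
    """Drop the topic's example entities from a ranked answer list.

    The relative order of the remaining entities is preserved.
    """
    blocked = set(examples)
    return [entity for entity in ranked if entity not in blocked]
```

For instance, `strip_examples(["WP3", "WP1", "WP9"], ["WP1"])` keeps `WP3` and `WP9` in their original order.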
In the case of the entity ranking task, the organizers checked and fixed all the cases where the examples of relevant entities provided at topic creation time were inserted in the pool and judged as non-relevant at assessment time.
The official evaluation measure is xinfAP as defined in [1], which uses stratified sampling to estimate Average Precision. To compute this measure, the script sample_eval.pl (kindly provided by Emine Yilmaz) can be used together with the qrels and the run files.
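For orientation, the quantity that xinfAP estimates is ordinary Average Precision. The sketch below computes plain Average Precision under the assumption that every entity has been judged; it is not xinfAP, does not model the stratified sampling of [1], and is not a substitute for sample_eval.pl. The function name and input shapes are illustrative only.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Plain (fully judged) Average Precision for one topic.

    Assumes `ranked` contains no duplicate entities. Relevant entities
    that are never retrieved contribute zero precision.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)
```

With a ranking `["WP1", "WP2", "WP3", "WP4"]` and relevant set `{"WP1", "WP3"}`, the precisions at the two relevant ranks are 1/1 and 2/3, which average to 5/6.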
The evaluation results measured with xinfAP for the entity ranking task are:
| Run | xinfAP |
| --- | --- |
| 1_FMIT_ER_TC_nopred-cat-baseline-a1-b8 | 0.341 |
| 1_cirquid_ER_TEC_idg.trec | 0.326 |
| 4_UAms_ER_TC_cats | 0.317 |
| 2_UAms_ER_TC_catlinksprop | 0.314 |
| 1_UAms_ER_TC_catlinks | 0.311 |
| 3_cirquid_ER_TEC.trec | 0.277 |
| 2_cirquid_ER_TC_idg.trec | 0.274 |
| 2_500_L3S08_ER_TDC | 0.265 |
| 1_L3S08_ER_TC_mandatoryRun | 0.256 |
| 3_UAms_ER_TC_overlap | 0.253 |
| 1_CSIR_ER_TC_mandatoryRun | 0.236 |
| 4_cirquid_ER_TC.trec | 0.235 |
| 4_UAms_ER_TC_cat-exp | 0.232 |
| 1_UAms_ER_TC_mixture | 0.222 |
| 3_UAms_ER_TC_base | 0.159 |
| 6_UAms_ER_T_baseline | 0.111 |
The evaluation results measured with xinfAP for the list completion task are:
| Run | xinfAP |
| --- | --- |
| 1_FMIT_LC_TE_nopred-stat-cat-a1-b8 | 0.402 |
| 1_FMIT_LC_TE_pred-2-class-stat-cat | 0.382 |
| 1_FMIT_LC_TE_nopred-stat-cat-a2-b6 | 0.363 |
| 1_FMIT_LC_TE_pred-4-class-stat-cat | 0.353 |
| 5_UAms_LC_TE_LC1 | 0.325 |
| 6_UAms_LC_TEC_LC2 | 0.323 |
| 1_CSIR_fixed | 0.322 |
| 2_UAms_LC_TCE_dice | 0.319 |
| 5_cirquid_LC_TE_idg.trec.fixed | 0.305 |
| 1_L3S08_LC_TE_mantadoryRun | 0.288 |
| 2_L3S08_LC_TE | 0.286 |
| 5_cirquid_LC_TE_idg.trec | 0.274 |
| 6_cirquid_LC_TE.trec.fixed | 0.272 |
| 1_CSIR_LC_TE_mandatoryRun | 0.257 |
| 6_cirquid_LC_TE.trec | 0.249 |
| 5_UAms_LC_TE_baseline | 0.133 |
Guidelines are archived in the original INEX 2008 Entity Ranking guidelines document.
See the judging pages.
[1] A simple and efficient sampling method for estimating AP and NDCG. Emine Yilmaz, Evangelos Kanoulas, and Javed A. Aslam. SIGIR'08.