This repository was archived by the owner on May 30, 2025. It is now read-only.
<p><span>There </span><span>will be three evaluation scenarios:</span></p>
<h3><span>Scenario 1: </span><span>Only plain text is given (Subtasks A, B, C).</span></h3>
<p><span>In this first scenario, the participants will perform the three subtasks consecutively and provide the corresponding development output files. The only input provided is a set of plain text files </span><span>input_&lt;topic&gt;.txt</span><span> for a particular list of topics that were not released with the training data.</span></p>
<p><span>Systems will be ranked according to an aggregated </span><span>F1</span><span> metric computed over the three tasks, considering </span><span>precision</span><span> and </span><span>recall</span><span> as follows:</span></p>
<p><img src="https://tass18-task3.github.io/website/images/image10.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image11.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image1.png" alt="" /></p>
<p><span>Besides this aggregated F1 score, individual F1 scores for each of the subtasks will also be reported.</span></p>
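<p><span>As a rough illustration of how such an aggregated metric can be computed, the sketch below micro-averages hypothetical per-subtask counts of correct, spurious, and missing matches and then applies the standard F1 formula. The count names and the pooling strategy are illustrative assumptions; the official definitions are the ones given by the formulas above.</span></p>

```python
def aggregated_f1(tasks):
    """Aggregate an F1 score over several subtasks (illustrative sketch).

    `tasks` is a list of (correct, spurious, missing) count triples, one per
    subtask -- hypothetical names, not the competition's official counters.
    The counts are pooled (micro-averaged) before computing precision,
    recall, and F1 once over the totals.
    """
    correct = sum(c for c, _, _ in tasks)
    spurious = sum(s for _, s, _ in tasks)
    missing = sum(m for _, _, m in tasks)
    # Precision penalizes spurious predictions; recall penalizes misses.
    precision = correct / (correct + spurious) if correct + spurious else 0.0
    recall = correct / (correct + missing) if correct + missing else 0.0
    if precision + recall == 0:
        return 0.0
    # Standard harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

<p><span>Micro-averaging is only one possible aggregation; a macro-average (computing F1 per subtask and averaging the three scores) would weight the subtasks equally regardless of their size.</span></p>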
<h3><span>Scenario 2: </span><span>Plain text and manually annotated key phrase boundaries are given (Subtasks B, C).</span></h3>
<p><span>In this second scenario, participants will perform tasks B and C sequentially and provide the corresponding output files. As input, they receive both the plain text files (</span><span>input_&lt;topic&gt;.txt</span><span>) and the corresponding gold files for task A (</span><span>output_A_&lt;topic&gt;.txt</span><span>). The purpose of this scenario is to evaluate the quality of tasks B and C independently of task A. As in the previous scenario, an aggregated </span><span>F1</span><span> metric is reported, based on the following </span><span>precision</span><span> and </span><span>recall</span><span>:</span></p>
<p><img src="https://tass18-task3.github.io/website/images/image12.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image13.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image1.png" alt="" /></p>
<p><span>Besides the aggregated F1 metric, individual F1 scores for each of the subtasks are also reported.</span></p>
<h3><span>Scenario 3: </span><span>Plain text with manually annotated key phrases and their types are given (Subtask C).</span></h3>
<p><span>In this scenario, the gold outputs for both task A and task B are provided, and participants must only produce the task C output files. The purpose of this scenario is to evaluate the quality of task C independently of the complexity of tasks A and B. As before, an aggregated F1 metric is reported, based on the following </span><span>precision</span><span> and </span><span>recall</span><span>:</span></p>
<p><img src="https://tass18-task3.github.io/website/images/image14.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image15.png" alt="" /></p>
<p><img src="https://tass18-task3.github.io/website/images/image1.png" alt="" /></p>