%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  SemEval-2007 task#9
%  Multilevel Semantic Annotation of Catalan and Spanish 
%  <Test Data Sets> 
%  Released on: March 10, 2007
%
%  Task organizers:
%  Lluís Màrquez, Luis Villarejo
%    TALP Research Center
%    Technical University of Catalonia (UPC)
%  Antònia Martí, Mariona Taulé,
%    Centre de Llenguatge i Computació, CLiC
%    Universitat de Barcelona                      
%  
%  Contact e-mail address: semeval-msacs@lsi.upc.edu
%  Task website: http://www.lsi.upc.edu/~nlp/semeval/msacs.html
%  
%  These datasets are distributed to support the SemEval-2007 task#9 on
%  Multilevel Semantic Annotation of Catalan and Spanish. They are free
%  for research and educational purposes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

License and usage of datasets
=============================

Training/test datasets are free for research and academic
purposes. However, participants must first sign a usage license and
deliver it to the attention of Maria Antònia Martí (head researcher of
CLiC, Universitat de Barcelona). The license form is available at the
official website for task #9:
http://www.lsi.upc.edu/~nlp/semeval/msacs_download.html.

Fill in the form and submit an electronically signed copy to
amarti@ub.edu. Alternatively, you can print it and send it by regular
mail to:

  M. Antònia Martí Antonín,
  Centre de Llenguatge i Computació
  Universitat de Barcelona
  Gran Via de les Corts Catalanes 585, 
  08007 Barcelona, Spain

In that case, we also ask you to fax the form to: 
+34 93 3189822 or +34 93 4489434
	
Whenever you publish results obtained with these datasets, please cite
the CESS-ECE project:

  M. Antònia Martí, Mariona Taulé, Lluís Màrquez and Manuel Bertran
  (2007). CESS-ECE: A Multilingual and Multilevel Annotated
  Corpus. Pending publication. Currently available at
  http://www.lsi.upc.edu/~mbertran/cess-ece/publications.


About this data distribution
============================

* This distribution contains the test datasets for Catalan (ca) and
  Spanish (es). For each language there are two test sets, coming from
  two different corpus sources: ca-3LB and es-3LB are in-domain
  fragments extracted from the same corpora used to prepare the
  training sets (3LB corpus), while ca-CESS-ECE and es-CESS-ECE are
  fragments from a different corpus, with slight variations in domain
  and genre (CESS-ECE corpus). The out-of-domain test fragments are
  intended to check the robustness of the participating systems. The
  number of lexical tokens in each test set is: 5,303 (ca-3LB), 5,145
  (es-3LB), 5,146 (ca-CESS-ECE) and 5,330 (es-CESS-ECE). Thus, there
  are slightly more than 10 thousand lexical tokens per language.

* This distribution was released on March 10 and posted on the
  SemEval-2007 website on March 12.

* These datasets are to be downloaded exclusively through the official
  SemEval-2007 website, once you are registered as a team. The test
  period is one week (included in the complete 4-week period), starting
  from the moment the participants click the test-set download button.

* As in the trial/training data distributions, test data comes in two
  directories: 'ca/' and 'es/', one for each language (ca=Catalan;
  es=Spanish) containing the following files (<lang> stands for 'ca'
  or 'es', and <corpus> stands for '3LB' or 'CESS-ECE' in the names 
  of the files):

   test.<lang>.<corpus>.trees.txt: Contains the syntactic trees in PTB
                                   format. We distribute the original
                                   complete trees only to ease the
                                   readability of the information and
                                   its connection to the column-based
                                   format. Note that the task will be
                                   evaluated strictly on the
                                   column-based format. In this test
                                   set, all the semantic information
                                   described as OUTPUT has been removed
                                   from the syntactic trees.

   test.<lang>.<corpus>.BII.txt : Contains the Basic Input Information
                                  that the participants need: lexical
                                  tokens and target verbs (for SRL) and
                                  nouns (for NSD).

   test.<lang>.<corpus>.EII.txt : Contains the Extra Input Information
                                  provided to the participants (lemmas,
                                  PoS tags and full parsing).


* Data formatting is exactly the same as in the previously released
  trial/training datasets, except that the semantic levels of
  annotation (NE, NS, and SR) are no longer provided. Instead, they
  have to be predicted by the participant systems and uploaded to the
  SemEval interface.


Systems outputs: procedure for evaluation
=========================================

* Participants are expected to generate (and upload to the
  SemEval-2007 website) a tarball including one file per output column,
  i.e., for each language, corpus source and semantic layer
  ({ca,es} x {3LB,CESS-ECE} x {NER,NSD,SRL}), for a total of 12
  files. The names of the individual files must be:

    test.ca.3LB.NER.txt
    test.ca.CESS-ECE.NER.txt
    test.es.3LB.NER.txt
    test.es.CESS-ECE.NER.txt

    test.ca.3LB.NSD.txt
    test.ca.CESS-ECE.NSD.txt
    test.es.3LB.NSD.txt
    test.es.CESS-ECE.NSD.txt

    test.ca.3LB.SRL.txt
    test.ca.CESS-ECE.SRL.txt
    test.es.3LB.SRL.txt
    test.es.CESS-ECE.SRL.txt
 
  and each file must match the corresponding test ".BII" file
  distributed here row by row (e.g., make a 'paste' of both files and
  check the alignment). This is crucial for evaluation with the
  official scorer.
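
  The row-by-row check above can be automated. The sketch below is a
  hypothetical helper (not part of the official scorer): it assumes all
  files sit in a single directory under the naming scheme described in
  this README, and simply compares line counts between each output file
  and its ".BII" counterpart before you build the tarball.

```python
# Sketch: check that every output column file is row-aligned with its
# .BII counterpart before uploading. The flat directory layout and the
# helper names are assumptions for illustration; the official scorer
# remains the final authority on formatting.
from itertools import product

def count_lines(path):
    """Count the lines in a file (one row per token, plus separators)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_alignment(base_dir="."):
    """Return (file, got, expected) triples for misaligned output files."""
    problems = []
    for lang, corpus in product(("ca", "es"), ("3LB", "CESS-ECE")):
        expected = count_lines(f"{base_dir}/test.{lang}.{corpus}.BII.txt")
        for layer in ("NER", "NSD", "SRL"):
            out = f"{base_dir}/test.{lang}.{corpus}.{layer}.txt"
            got = count_lines(out)
            if got != expected:
                problems.append((out, got, expected))
    return problems
```

  An empty result means all 12 output files line up with the input; any
  triple returned names a file whose row count differs from its ".BII"
  reference.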

* The SemEval site will allow you to upload results more than once
  during the evaluation period (useful if, for instance, you discover
  bugs/errors and want to upload a fixed version of the output). The
  last upload before the deadline is taken as the final result of your
  system.

* If you cannot provide some of the output files, the baseline outputs
  prepared by the organization will be used in the global evaluation of
  your system. Evaluation will be carried out by the task organizers.
  The gold annotations for the test sets won't be publicly released
  until the evaluation period is over.

* Further instructions for evaluation and paper preparation will be
  posted on the task website when the evaluation period is over.

*Final note* The training dataset has been updated (an error in the
  order of the predicate-annotation columns has been fixed) and posted
  again on March 12 (together with the test distribution). Please
  download the latest version and check the changes in the README file
  (http://nlp.cs.swarthmore.edu/semeval/tasks/task09/data.shtml). Also,
  consult the task#9 website periodically:
  http://www.lsi.upc.edu/~nlp/semeval/msacs-download.html It will
  contain the latest versions of the task documentation and software
  tools (including the official scoring software).


GOOD LUCK!
and thanks again for your participation

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 



