%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  SemEval-2007 task#9
%  Multilevel Semantic Annotation of Catalan and Spanish 
%  <Training Data Sets: version 3 - final training sets>
%  Released on: March 10, 2007
%
%  Task organizers:
%  Lluís Màrquez, Luis Villarejo
%    TALP Research Center
%    Technical University of Catalonia (UPC)
%  Antònia Martí, Mariona Taulé,
%    Centre de Llenguatge i Computació, CLiC
%    Universitat de Barcelona                      
%  
%  Contact e-mail address: semeval-msacs@lsi.upc.edu
%  Task website: http://www.lsi.upc.edu/~nlp/semeval/msacs.html
%  
%  These datasets are distributed to support the SemEval-2007 task#9 on
%  Multilevel Semantic Annotation of Catalan and Spanish. They are free
%  for research and educational purposes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Current version and scheduling
==============================

The present version of the datasets consists of the full training
corpora for Catalan and Spanish. They contain 97,758 and 90,661
lexical tokens, respectively.

*IMPORTANT note*: the present training datasets contain strictly the
same material as the previous distribution of the full training sets
(released on March 5th). We recommend discarding the previous datasets
and working with the current ones, since the SRL columns have been
reordered according to the textual order of the predicates and the
annotation of one sentence has been fixed. By textual order we mean
that the left-most SRL column corresponds to the first predicate in
the sentence, the second left-most SRL column corresponds to the
second predicate, and so on. Both prediction and gold files must
follow the textual order of the predicates, since this is a
requirement for the evaluation.
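As a sanity check, the required ordering can be verified programmatically. The sketch below is illustrative only (the function name and the assumption that the PROPS columns start at the 10th whitespace-separated field are ours, not part of the official tools):

```python
# Illustrative check (not an official tool): the PROPS columns are assumed
# to start at the 10th whitespace-separated field (index 9).

def props_in_textual_order(sentence_rows, first_props_col=9):
    """sentence_rows: one whitespace-split list of fields per token line."""
    n_props = len(sentence_rows[0]) - first_props_col
    verb_positions = []
    for col in range(first_props_col, first_props_col + n_props):
        for tok_idx, row in enumerate(sentence_rows):
            if row[col].startswith("(V"):  # the (V*) tag marks the predicate
                verb_positions.append(tok_idx)
                break
    # Columns are correctly ordered when predicate positions strictly increase.
    return all(a < b for a, b in zip(verb_positions, verb_positions[1:]))
```

A file that fails this check could have its PROPS columns permuted accordingly before submission.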
 
The test data will be about 10 times smaller than the training corpus
and will contain examples from two different sources (one of them is
the same as the training corpus; the other is a corpus from a slightly
different domain and genre) in order to test the robustness of the
participating systems. It will be released by March 12. Stay tuned to
the SemEval and Task#9 websites. The README accompanying the test set
distribution will include precise instructions on how to upload
systems' results.

All these datasets are to be downloaded through the official SemEval
website, once you are registered as a team. The 4-week evaluation
period for the task starts from the moment the participant team
clicks the download button for training (this also holds for the
partial training-set-1). Test time will be 1 week (included in the
complete 4-week period). This 4-week period has to fall within the
SemEval-2007 evaluation frame: February 26 to April 1.


License and usage of datasets
=============================

Training/test datasets are free for research and academic
purposes. However, participants must first sign a usage license,
which has to be delivered to the attention of Maria Antònia Martí
(head researcher of CLiC, Universitat de Barcelona). The license form
is available at the official website for task #9:
http://www.lsi.upc.edu/~nlp/semeval/msacs_download.html.

Fill in the form and submit an electronic version with electronic
signature to amarti@ub.edu. Optionally, you can print and send it by
regular mail to:

  M. Antònia Martí Antonín,
  Centre de Llenguatge i Computació
  Universitat de Barcelona
  Gran Via de les Corts Catalanes 585, 
  08007 Barcelona, Spain

In that case, we also ask you to fax the form to: 
+34 93 3189822 or +34 93 4489434
	
Whenever you are publishing results using the present datasets you are
requested to appropriately cite the CESS-ECE project: 

  M. Antònia Martí, Mariona Taulé, Lluís Màrquez and Manuel Bertran.
  (2007). CESS-ECE: A Multilingual and Multilevel Annotated
  Corpus. Publication pending. Currently available at
  http://www.lsi.upc.edu/~mbertran/cess-ece/publications.


Disclaimer 
==========
The information contained in this README file is indicative only and
might be incomplete or inaccurate at some points. For a complete
description of the task setting, datasets, resources, etc., we refer
the reader to the official task webpage, which is periodically updated:
http://www.lsi.upc.edu/~nlp/semeval/msacs.html


General information and formats
===============================

The sentences in the training datasets are properly tokenized,
POS-tagged and lemmatized, and include full syntactic annotation
(gold-standard constituency parse trees with function tags). Semantic
annotations are also included, describing named entities (NE), noun
senses (NS) and semantic roles (SR); these constitute the target
knowledge to be learned.

Data formatting is exactly the same as that of the previously
released trial dataset. The test data will also share this
formatting but will exclude the semantic levels of annotation (NE, NS,
and SR), which have to be predicted by the participant systems. The
parse trees of the test set will also be the manually revised
gold-standard ones [unfortunately, we have not had time to develop
automatic parsers for both languages to provide automatically
generated syntactic input levels, as initially planned]. More
instructions on how to format and upload the results of participant
systems will be given with the test set release (March 12).

Note that the trial dataset has been updated (a number of errors have
been fixed) and posted again on February 22. Download the new version
(http://nlp.cs.swarthmore.edu/semeval/tasks/task09/data.shtml) and
check the changes in the README file. The trial datasets are already
included in the complete training distribution, so there is no need to
use them as extra training material for developing the final systems.

Also, note that participants are allowed to use external resources
apart from the training datasets to build a system for the
task. However, we strongly encourage participant teams to explicitly
describe all the external resources used in the system description
paper to be prepared after the evaluation period. By "external
resources" we mean any knowledge or data that cannot be directly
inferred from the training sets provided in this release.

* Data formats (copied from the trial dataset description)

Data formats are highly similar to those of the CoNLL-2005 shared task
(column style presentation of levels of annotation), in order to be
able to share evaluation tools and already developed scripts for
format conversion. Note that the PROPS columns must follow textual order 
of the predicates as described previously.

Here you can find an example of a fully annotated sentence:

INPUT--------------------------------------------------------------->  OUTPUT------------------------------------->
BASIC_INPUT_INFO----->  EXTRA_INPUT_INFO---------------------------->  NE--->  NS------>  SR---------------------->
WORD            TN  TV  LEMMA        POS       SYNTAX                  NE      NS         SC  PROPS--------------->
--------------------------------------------------------------------------------------------------------------------

Las             -   -   el           da0fp0   (S(sn-SUJ(espec.fp*        *   -          -            *  (Arg1-TEM*
conclusiones    *   -   conclusión   ncfp000       (grup.nom.fp*         *   05059980n  -            *           *
de              -   -   de           sps00            (sp(prep*)         *   -          -            *           *
la              -   -   el           da0fs0       (sn(espec.fs*)     (ORG*   -          -            *           *
comisión        *   -   comisión     ncfs000       (grup.nom.fs*         *   06172564n  -            *           *
Zapatero        -   -   Zapatero     np00000         (grup.nom*)     (PER*)  -          -            *           *
,               -   -   ,            Fc                  (S.F.R*         *   -          -            *           *
que             -   -   que          pr0cn000     (relatiu-SUJ*)         *   -          -   (Arg0-CAU*)          *
ampliará        -   *   ampliar      vmif3s0               (gv*)         *   -          a1         (V*)          *
el              -   -   el           da0ms0    (sn-CD(espec.ms*          *   -          -   (Arg1-PAT*           *
plazo           *   -   plazo        ncms000       (grup.nom.ms*         *   10935385n  -            *           *
de              -   -   de           sps00            (sp(prep*)         *   -          -            *           *
trabajo         *   -   trabajo      ncms000  (sn(grup.nom.ms*)))))      *   00377835n  -            *)          *
,               -   -   ,            Fc                 *))))))          *)  -          -            *           *)
quedan          -   *   quedar       vmip3p0               (gv*)         *   -          b3           *         (V*)
para            -   -   para         sps00         (sp-CC(prep*)         *   -          -            *  (ArgM-TMP*
después_del     -   -   después_del  spcms            (sp(prep*)         *   -          -            *           *
verano          *   -   verano       ncms000  (sn(grup.nom.ms*))))       *   10946199n  -            *           *)
.               -   -   .            Fp                       *)         *   -          -            *           *


There is one line for each token, and a blank line after the last
token of each sentence. The columns, separated by blank spaces,
represent different annotations of the sentence with a tagging along
words. For structured annotations (named entities, parse trees and
arguments), we use the Start-End format.

The Start-End format represents phrases (syntactic constituents, named
entities, and arguments) that constitute a well-formed bracketing in a
sentence (that is, phrases do not overlap, though they admit
embedding). Each tag is of the form STARTS*ENDS, and represents
phrases that start and end at the corresponding word. A phrase of type
k places a '(k' parenthesis at the STARTS part of the first word, and
a ')' parenthesis at the ENDS part of the last word.
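As an illustration, the Start-End tags of one column can be decoded into labeled spans with a few lines of code. This is only a sketch (the function name is ours); it takes the last '*' of each tag as the meta character, following the disambiguation rule given in NOTE-2 below:

```python
# Sketch of decoding the Start-End format: each tag is STARTS*ENDS; '(k'
# opens a phrase of type k at this word, ')' closes the innermost open phrase.

def decode_start_end(tags):
    """tags: one Start-End tag per word; returns (label, start, end) spans."""
    spans, stack = [], []
    for i, tag in enumerate(tags):
        starts, _, ends = tag.rpartition("*")   # last '*' is the meta character
        for label in filter(None, starts.split("(")):
            stack.append((label, i))            # open a phrase at word i
        for _ in range(ends.count(")")):
            label, start = stack.pop()          # close the innermost phrase
            spans.append((label, start, i))
    return spans
```

For example, the NE column tags `(ORG* * *) (PER*)` decode to an ORG span over words 0-2 and a PER span on word 3.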

The different annotations in a sentence are grouped in five main
categories:

[1] BASIC_INPUT_INFO. The basic input information that the
    participants need:
    * WORD (column 1): words of the sentence.
    * TN (column 2): target nouns of the sentence (those that are to
      be assigned WordNet synsets); marked with '*'
    * TV (column 3): target verbs of the sentence (those that are to
      be annotated with semantic roles); marked with '*'

[2] EXTRA_INPUT_INFO. The extra input information provided to the
    participants:
    * LEMMA (column 4): lemmas of the words
    * POS (column 5): part-of-speech tags
    * SYNTAX (column 6): Full syntactic tree. 

[3] NE (column 7). Named Entities (output information = to be
    predicted when testing ; available only for trial/training sets).

[4] NS (column 8). WordNet sense of target nouns (output information)

[5] SRL. Information on semantic roles:

    * SC (column 9). The lexico-semantic class of the verb (output
      information).
    * PROPS (columns 10 to 10+N-1). For each of the N target verbs, a
      column representing the argument structure of that target verb
      (output information). Core numbered arguments are enriched with
      the thematic role label (e.g., Arg1-TEM). The ArgM's are the adjuncts.
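For illustration, one token line can be mapped to the named columns described above as follows (the `Token` class and `parse_token` helper are hypothetical names of ours, not part of the distributed scripts):

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str         # column 1: WORD
    tn: str           # column 2: TN, target-noun marker
    tv: str           # column 3: TV, target-verb marker
    lemma: str        # column 4: LEMMA
    pos: str          # column 5: POS
    syntax: str       # column 6: SYNTAX (Start-End tag)
    ne: str           # column 7: NE (Start-End tag)
    ns: str           # column 8: NS, WordNet sense of a target noun
    sc: str           # column 9: SC, lexico-semantic class of a target verb
    props: List[str]  # columns 10..10+N-1: one Start-End tag per target verb

def parse_token(line: str) -> Token:
    fields = line.split()
    return Token(*fields[:9], fields[9:])
```

The variable-width PROPS tail is kept as a list, so sentences with any number of target verbs parse uniformly.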

NOTE-1: All these annotations in column format are extracted
automatically from the syntactic-semantic trees of the CESS-ECE
corpora, which are also distributed with the datasets (see the
description below). These are constituency trees enriched with
semantic labels for NE, NS and SR. The format is similar to that of
the Penn Treebank and is fully described in the accompanying
documentation. As an example, the following tree corresponds to the
column-format example shown above.

(
  (S
    (sn-SUJ-Arg1-TEM
      (espec.fp
        (da0fp0 Las el))
      (grup.nom.fp
        (ncfp000 conclusiones conclusión 01207975n)
        (sp
          (prep
            (sps00 de de))
          (sno
            (espec.fs
              (da0fs0 la el))
            (grup.nom.fs
              (ncfs000 comisión comisión 01207975n)
              (snp
                (grup.nom
                  (np0000p Zapatero Zapatero)))
              (S.F.R
                (Fc , ,)
                (relatiu-SUJ-Arg0-CAU
                  (pr0cn000 que que))
                (gv
                  (vmif3s0 ampliará ampliar-a1)
                (sn-CD-Arg1-PAT
                  (espec.ms
                    (da0ms0 el el))
                  (grup.nom.ms
                    (ncms000 plazo plazo 01207975n)
                    (sp
                      (prep
                        (sps00 de de))
                      (sn
                        (grup.nom.ms
                          (ncms000 trabajo trabajo 01207975n))))))
                (Fc , ,)))))))
    (gv
      (vmip3p0 quedan quedar-b3))
    (sp-CC-ArgM-TMP
      (prep
        (sps00 para para))
      (sp
        (prep
          (spcms después_del después_del)
        (sn
          (grup.nom.ms
            (ncms000 verano verano 01207975n)))))
    (Fp . .)))


The scripts for automatically converting these trees into the column
format are also distributed as part of the resources for the task. If
you want to use them, see the Download section of the task#9 official
webpage for instructions on how to download, install, and use the
software.

NOTE-2: some syntactic labels of tree constituents contain the '*'
symbol (e.g., S.F.R*; see the files tagset-constituents.ca.pdf and
tagset-constituents.es.pdf for details). This symbol is also used as a
meta character for our column based codification of the syntactic
trees, possibly leading to confusion. For instance, you may find:

   "(S.F.AComp*.j(conj.subord*)"    (Spanish sentence 1; line 31) 
   "(S.F.R**"                       (Catalan sentence 2; line 20) 

However, note that this codification is not ambiguous: each line
contains exactly one '*' meta character. If more than one '*' appears
in a line, the meta character is the last one (it can also be
identified as being either the last symbol of the field or the symbol
preceding a closing parenthesis, ')').
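The disambiguation rule above reduces to one line of code: the meta character is simply the last '*' of the field. A minimal sketch (the helper name is ours):

```python
# Split a SYNTAX field into (STARTS, ENDS) around the meta '*', which is
# always the LAST '*' in the field even when a label itself contains '*'.

def split_syntax_field(field):
    meta = field.rfind("*")
    return field[:meta], field[meta + 1:]
```

For the two examples above, `"(S.F.R**"` splits into `("(S.F.R*", "")` and `"(S.F.AComp*.j(conj.subord*)"` into `("(S.F.AComp*.j(conj.subord", ")")`.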


Training data organization
==========================

As in the trial data distribution, training data comes in two
directories: 'ca' and 'es', one for each language (ca=Catalan;
es=Spanish) containing the following files (<lang> stands for 
'ca' or 'es' in the names of the files):

   trial.<lang>.trees.txt.gz: It contains the syntactic trees enriched with all the semantic 
                              information (original trees from the CESS-ECE corpora; see an 
                              example above). We distribute the original complete trees merely 
                              for easing the comprehension/readability of the information 
                              presented and its connection to the column-based format. Note that 
                              the task will be evaluated strictly using the column-based format.
                              In the test set all the information described as OUTPUT will be 
                              removed from the syntactic trees. Don't be tempted to use it 
                              during training.
   trial.<lang>.BII.txt.gz  : It contains the Basic Input Information that the participants need (input info).
   trial.<lang>.EII.txt.gz  : It contains the Extra Input Information provided to the participants (input info).
   trial.<lang>.NER.txt.gz  : It contains the Named Entity tags (output information).
   trial.<lang>.NSD.txt.gz  : It contains the WordNet senses of target nouns (output information).
   trial.<lang>.SRL.txt.gz  : It contains the lexico-semantic class of the verb (output information).
			      And, for each target verb, a column representing the arguments of the 
		              target verb (output information).
   trial.<lang>.ALL.txt.gz  : It contains ALL the previous files pasted (in columns) in the same order 
                              described in the example.
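For reference, a minimal sketch of reading one of these gzipped column files into blank-line-separated sentences (the function name and the Latin-1 encoding are assumptions on our part; check the actual encoding of the distributed files):

```python
import gzip

def read_sentences(path):
    """Read a gzipped column file; sentences are separated by blank lines,
    and each token line holds whitespace-separated columns."""
    sentences, current = [], []
    with gzip.open(path, "rt", encoding="latin-1") as f:
        for line in f:
            if line.strip():
                current.append(line.split())
            elif current:            # blank line ends the current sentence
                sentences.append(current)
                current = []
    if current:                      # file may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Each sentence then comes back as a list of token rows, ready for the column layout described in the formats section.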


Accompanying Documentation
==========================

Accompanying documentation needed to properly understand the details
of formats and tag sets is distributed through the task#9 web site.
Please consult the URL: http://www.lsi.upc.edu/~nlp/semeval/msacs_download.html
It will contain the latest versions of the following documentation and
software tools:

  Descriptions of syntactic tagsets:
    - tagset_POS.pdf : tagset with part-of-speech labels for Catalan and Spanish 
    - tagset-constituents.ca.pdf : list of tree constituents for Catalan
    - tagset-constituents.es.pdf : list of tree constituents for Spanish     
    - tagset_syntactic_functions.ca.pdf : syntactic functions for Catalan
    - tagset_syntactic_functions.es.pdf : syntactic functions for Spanish

  Description of the annotation of named entities and the associated tagset:
    - NE_annotation_criteria.pdf 

  Description of the annotation of noun senses and associated tagset:
    - WordNet_annotation_of_nouns.pdf

  Description of the annotation of semantic roles:
    - semantic_classes.pdf : description of the verbal semantic classes
    - thematic_roles_tagset.pdf : complete tagset of 'argument+thematic-role' labels
    - verb_lexical_entry.pdf : description of the entries of the verbal lexicon (rolesets)

  Formatting scripts
    - tree2column: Format conversion script. It receives as input
      sentences in the standard CESS-ECE format (similar to that of
      Penn Treebank) and outputs the sentences in column style
      presentation of levels of annotation. Already available updated
      version: semeval9-0.6.tar.gz (see the README file in the software
      package). It can be useful for those working directly with the
      tree format instead of the column format.

  Official evaluation script
    - msacs-eval: Official script for evaluation in SemEval-2007 task
      #9. It offers the capabilities described in the evaluation
      section.

  Baselines
    - A baseline system for each subtask and language will be provided
      by the organization.
      * SRL: it will consist of a series of simple language-dependent
        heuristics that perform a basic SRL tagging (e.g., tag the
        first sn or sn* before the target verb as A0). This baseline
        is adapted from the CoNLL-2005 shared task.
      * NSD: it will consist of a most-frequent-sense tagging
        strategy: every noun is tagged with the first sense from the
        Spanish or Catalan WordNet.
      * NER: it will consist of the application of a gazetteer
        (collected from the training data) and a series of simple
        heuristics that perform a basic NER tagging (e.g., if POS=W
        then tag=DAT).

  Other Resources
    - Full Catalan and Spanish WordNets, which are linked to English
      WordNet 1.6.
    - Link to Multilingual Central Repository developed under the
      MEANING project.
    - Dictionary of senses (according to the Catalan and Spanish
      WordNets) for all nouns treated in the dataset
    - Full style guides for syntax annotation :
      * annotation-of-constituents-guidelines.ca.pdf : Annotation of
        Catalan constituents (document in Catalan).
      * annotation-of-constituents-guidelines.es.pdf : Annotation of
        Spanish constituents (document in Spanish).
      * annotation-of-functions-guidelines.ca.pdf : Annotation of
        Catalan functions (document in Catalan).
      * annotation-of-functions-guidelines.es.pdf : Annotation of
        Spanish functions (document in Spanish).
    - Full verbal lexicon : roleset descriptions for all verbs in the
      training/test corpora


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


