This section provides scripts to process the data, evaluation
software, complementary materials, baseline systems, etc. It does
not contain the official datasets, which are distributed through the
download section of the SemEval-2007 website.
Data and accompanying documentation
See details in the README file. Updated 9th March!
- Train and test data release calendar:
February 26, 2007 | First release of training data (~42,000 words for Catalan and ~87,000 words for Spanish) |
March 5, 2007 | Release of complete training data |
March 12, 2007 | Release of test data |
LICENSE and USAGE of data:
Training/test datasets are free for research and academic purposes.
However, participants must first sign a usage license, which must be
delivered to the attention of Maria Antònia Martí (Head researcher
of CLiC, Universitat de Barcelona). The license form is available here
(.rtf / .doc / .swx). Fill in the form and submit an electronic version
with an electronic signature to amarti@ub.edu.
Alternatively, you can print it and send it by regular mail to: M. Antònia Martí, Centre de Llenguatge i
Computació, Universitat de Barcelona, Gran Via de les Corts Catalanes
585, 08007 Barcelona, Spain. In that case, we also ask you to
fax the form to +34 93 3189822 or +34 93 4489434. See the README
file accompanying the training data distribution for further
instructions.
- The updated trial data (February 22) is available at the SemEval-2007
webpage for task #9. Download it and consult the README file to learn
about the minor changes and get started. Updated!
The following documents contain necessary information to properly
interpret the data annotations:
Descriptions of syntactic tagsets
Description of the annotation of named entities and the associated tagset
Description of the annotation of noun senses and the associated tagset
Description of the annotation of semantic roles
Note: these documents are also distributed with the trial dataset tarball (at
the SemEval website), but we cannot guarantee that the copies there are the
latest versions. To stay up to date with the complementary
documentation and scripts, download them directly from this webpage.
Software
Formatting scripts
- tree2column: Format conversion script. It takes as input sentences in the standard
CESS-ECE format (similar to that of the Penn Treebank) and outputs the
sentences in a column-style presentation of the levels of annotation. An
updated version is already available: semeval9-1.4.tar.gz
(see the README file in the software package). It can be useful for those working
directly with the tree format instead of the column format. Updated! 20th March
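
To give a rough idea of what a tree-to-column conversion involves, here is a minimal Python sketch. It is not the tree2column script: the bracketed input, the node labels, and the three-column output (word, POS, syntax in a CoNLL-style start-end bracket notation) are all assumptions made for illustration. The actual input and output formats are defined in the README of the software package.

```python
import re


def tree_to_columns(tree):
    """Convert a Penn-style bracketed tree into [word, pos, syntax] rows.

    Assumed (hypothetical) input shape: "(LABEL (POS word) ...)".
    The syntax column uses a start-end notation: brackets opened before
    a word precede "*", brackets closed after it follow "*".
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    rows = []      # one [word, pos, syntax] row per leaf
    pending = ""   # constituents opened since the previous leaf
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            label = tokens[i + 1]
            # A preterminal has the shape "( POS word )".
            if tokens[i + 2] not in ("(", ")") and tokens[i + 3] == ")":
                rows.append([tokens[i + 2], label, pending + "*"])
                pending = ""
                i += 4
            else:
                pending += "(" + label
                i += 2
        else:
            # ")" closes the innermost open constituent on the last leaf.
            rows[-1][2] += ")"
            i += 1
    return rows


# Illustrative sentence; the labels are sample values, not the official tagset.
example = "(S (sn (d El) (n gato)) (grup.verb (v duerme)))"
for word, pos, syntax in tree_to_columns(example):
    print(f"{word}\t{pos}\t{syntax}")
```

The sketch does no error handling: a malformed or unbalanced tree raises an IndexError instead of being reported cleanly.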
Official evaluation script
- msacs-eval: Official script for evaluation in SemEval-2007 task #9. It offers the
capabilities described in the evaluation section. It is
already available in semeval9-1.4.tar.gz
(see the README file in the software package). Remember that SRL columns
must follow the textual order of the predicates. Updated! 12th April
Baselines
The organizers computed a baseline system for each subtask and language.
- SRL: a series of simple language-dependent heuristics that
perform a basic SRL tagging (e.g., tag the first sn or sn* before the target verb as A0). This
baseline was adapted from the CoNLL 2005 shared task.
- NSD: a most-frequent-sense tagging strategy. Every noun is
tagged with the first sense from the training corpus, backing off to the Spanish or Catalan WordNets.
- NER: a gazetteer (collected from the training
data) plus a series of simple heuristics that perform a basic NER
tagging (e.g., if POS=W then tag=DAT).
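
The NSD and NER baselines described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the organizers' code. The function names, the data structures, and the sample sense and entity labels are hypothetical. The NSD part reads "most-frequent-sense" as the sense seen most often for a lemma in the training data, which is one possible interpretation of the description. Only the POS=W to DAT rule is taken from the text.

```python
from collections import Counter, defaultdict


def train_most_frequent_sense(training_pairs):
    """Map each noun lemma to its most frequent sense in the training data.

    training_pairs: iterable of (lemma, sense) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for lemma, sense in training_pairs:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def nsd_baseline(lemma, trained_senses, wordnet_first_sense):
    """Return the trained sense, backing off to a WordNet lookup table.

    wordnet_first_sense: dict lemma -> sense, standing in for the Spanish
    or Catalan WordNet (assumption). Returns None for unknown lemmas.
    """
    if lemma in trained_senses:
        return trained_senses[lemma]
    return wordnet_first_sense.get(lemma)


def ner_baseline(tokens, gazetteer):
    """Tag (word, pos) pairs with a gazetteer plus one simple heuristic.

    gazetteer: dict word -> entity tag collected from training data
    (assumed layout). Untagged tokens get None.
    """
    tags = []
    for word, pos in tokens:
        if word in gazetteer:
            tags.append(gazetteer[word])
        elif pos == "W":            # heuristic quoted in the text
            tags.append("DAT")
        else:
            tags.append(None)
    return tags


# Sample data; sense identifiers and the "LOC" tag are placeholders.
senses = train_most_frequent_sense(
    [("banco", "s1"), ("banco", "s2"), ("banco", "s2")]
)
print(nsd_baseline("banco", senses, {"casa": "w1"}))
print(ner_baseline([("Barcelona", "NP"), ("2007", "W")], {"Barcelona": "LOC"}))
```

A real gazetteer would also need to handle multi-word entities, which this token-by-token lookup ignores.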
Other Resources
- Full dictionaries relating lemmas (nouns, verbs, adjectives and adverbs) and WordNet senses. New! 24th March
- Full Catalan and Spanish WordNets, which are linked to English WordNet 1.6. New!
- Multilingual Central Repository developed under the MEANING project. It includes the
Spanish and Catalan WordNets, though we cannot guarantee that they are
exactly the same versions as the ones we distribute for task #9.
- Full style guides for syntax annotation
- Full verbal lexicon: roleset descriptions
for all verbs in the training/test corpora. Updated! 9th March
Last update: May 22nd, 2007