Task #09: Multilevel Semantic Annotation of Catalan and Spanish


This section provides scripts to process the data, evaluation software, complementary materials, baseline systems, etc. It does not contain the official datasets, which are distributed through the download section of the SemEval-2007 website.

Data and accompanying documentation

See details in the README file. Updated 9th March!
  • Train and test data release calendar:
    February 26, 2007: First release of training data
    (~42,000 words for Catalan and ~87,000 words for Spanish)
    March 5, 2007: Release of complete training data
    March 12, 2007: Release of test data

LICENSE and USAGE of data: Training/test datasets are free for research and academic purposes. However, participants must first sign a usage license, which must be sent to the attention of Maria Antònia Martí (Head researcher of CLiC, Universitat de Barcelona). The license form is available here (.rtf / .doc / .swx). Fill in the form and submit an electronic version with an electronic signature to amarti@ub.edu. Alternatively, you can print it and send it by regular mail to: M. Antònia Martí, Centre de Llenguatge i Computació, Universitat de Barcelona. Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain. In that case, we also ask you to fax the form to: +34 93 3189822 or +34 93 4489434. Consult the README file accompanying the training data distribution for further instructions.
  • The updated trial data (February 22) is available at the SemEval-2007 webpage for task #9. Download it and consult the README file to learn about the minor changes and get started. Updated!
  • The following documents contain the information needed to interpret the data annotations properly:
    • Descriptions of syntactic tagsets
    • Description of the annotation of named entities and the associated tagset
    • Description of the annotation of noun senses and the associated tagset
    • Description of the annotation of semantic roles
Note: these documents are also distributed with the trial dataset tarball (at the SemEval website), but we cannot guarantee that the copies there are the latest versions. To stay up to date with the complementary documentation and scripts, download them directly from this webpage.


Formatting scripts
  • tree2column: Format conversion script. It takes as input sentences in the standard CESS-ECE format (similar to that of the Penn Treebank) and outputs them in column style, with the levels of annotation presented in columns. An updated version is available in semeval9-1.4.tar.gz (see the README file in the software package). It can be useful for those working directly with the tree format instead of the column format. Updated! 20th March
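
To give a rough idea of what a tree-to-column conversion involves, here is a minimal Python sketch. It is not the tree2column script and does not handle the real CESS-ECE format, whose details are given in the README of the software package. It assumes a plain Penn-Treebank-style bracketing in which every leaf is written as "(label word)"; the function name tree_to_columns and the labels in the example are hypothetical.

```python
import re

def tree_to_columns(bracketed):
    """Return one (word, preterminal_label) row per leaf of a bracketed tree.

    Assumption: every leaf is written as "(label word)", as in simple
    Penn-Treebank-style bracketings. Real CESS-ECE trees may differ.
    """
    # Split the string into "(", ")" and plain symbols.
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    rows = []
    prev = None        # token just before the current one
    prev_prev = None   # token two positions back
    for tok in tokens:
        is_symbol = tok not in ("(", ")")
        # A word is a symbol that follows a label, which itself follows "(".
        if is_symbol and prev not in (None, "(", ")") and prev_prev == "(":
            rows.append((tok, prev))
        prev_prev, prev = prev, tok
    return rows

if __name__ == "__main__":
    example = "(S (sn (da El) (nc gato)) (grup.verb (vm duerme)))"
    for word, label in tree_to_columns(example):
        print(f"{word}\t{label}")
```

Each output row holds a word and its preterminal label. The actual script produces further columns for the other levels of annotation, which this sketch does not attempt to reproduce.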
Official evaluation script
  • msacs-eval: Official evaluation script for SemEval-2007 task #9. It offers the capabilities described in the evaluation section. It is available in semeval9-1.4.tar.gz (see the README file in the software package). Remember that SRL columns must follow the textual order of the predicates. Updated! 12th April


The organizers built a baseline system for each subtask and language:
  • SRL: it consists of a series of simple language-dependent heuristics that perform a basic SRL tagging (e.g., tag the first sn or sn* before the target verb as A0). This baseline was adapted from the CoNLL 2005 shared task.
  • NSD: it consists of a most-frequent-sense tagging strategy. Every noun is tagged with the first sense from the training corpus, with backup to the Spanish or Catalan WordNets.
  • NER: it consists of the application of a gazetteer (collected from the training data) and a series of simple heuristics that perform a basic NER tagging (e.g., if POS=W then tag=DAT).
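
The NSD and NER strategies above can be illustrated with a short Python sketch. This is not the organizers' code: the function names, the data layout, and the sense and entity labels other than DAT are assumptions made for the example, and the WordNet backup is modelled as a plain dictionary from lemma to first sense.

```python
from collections import Counter, defaultdict

def train_mfs(tagged_nouns):
    """Build a lemma -> most frequent sense table.

    tagged_nouns is an iterable of (lemma, sense) pairs taken from a
    training corpus (hypothetical layout, not the official column format).
    """
    counts = defaultdict(Counter)
    for lemma, sense in tagged_nouns:
        counts[lemma][sense] += 1
    # Keep only the most frequent sense of each lemma.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def tag_noun(lemma, mfs, wordnet_first_sense):
    """Use the training-corpus sense; otherwise back off to the lexicon."""
    if lemma in mfs:
        return mfs[lemma]
    # Returns None when the lemma is unknown to both resources.
    return wordnet_first_sense.get(lemma)

def tag_entity(token, pos, gazetteer):
    """Gazetteer lookup first, then the POS=W -> DAT heuristic."""
    if token in gazetteer:
        return gazetteer[token]
    if pos == "W":
        return "DAT"
    return None

if __name__ == "__main__":
    mfs = train_mfs([("banco", "s1"), ("banco", "s2"), ("banco", "s1")])
    print(tag_noun("banco", mfs, {}))              # sense seen in training
    print(tag_noun("casa", mfs, {"casa": "s9"}))   # backup to the lexicon
    print(tag_entity("2007", "W", {}))             # date heuristic
```

The real baselines apply more heuristics than the single rule shown here. The sketch only makes the order of decisions concrete: training data first, then the backup resource or rule.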

Other Resources

Last update: May 22nd, 2007

For more information, visit the SemEval-2007 home page.