SemEval-2007

Task #09: Multilevel Semantic Annotation of Catalan and Spanish


This section provides scripts to process the data, evaluation software, complementary materials, baseline systems, etc. It does not contain the official datasets, which are distributed through the download section of the SemEval-2007 website.

Data and accompanying documentation

See details in the README file. Updated 9th March!

Train and test data release calendar:

February 26, 2007	First release of training data (~42,000 words for Catalan and ~87,000 words for Spanish)
March 5, 2007	Release of complete training data
March 12, 2007	Release of test data

LICENSE and USAGE of data: Training and test datasets are free for research and academic purposes. However, participants must first sign a usage license and deliver it to Maria Antònia Martí (head researcher of CLiC, Universitat de Barcelona). The license form is available here (.rtf / .doc / .swx). Fill in the form and submit an electronic version with an electronic signature to amarti@ub.edu. Alternatively, you can print it and send it by regular mail to: M. Antònia Martí, Centre de Llenguatge i Computació, Universitat de Barcelona. Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain. In that case, we also ask you to fax the form to: +34 93 3189822 or +34 93 4489434. Consult the README file accompanying the training data distribution for further instructions.

The updated trial data (February 22) is available at the SemEval-2007 webpage for task #9. Download it and consult the README file to learn about the minor changes and get started. Updated!

The following documents contain the information needed to interpret the data annotations correctly:

Descriptions of syntactic tagsets:

tagset_POS.pdf : tagset with part-of-speech labels for Catalan and Spanish
tagset-constituents.ca.pdf : list of tree constituents for Catalan
tagset-constituents.es.pdf : list of tree constituents for Spanish
tagset_syntactic_functions.ca.pdf : syntactic functions for Catalan
tagset_syntactic_functions.es.pdf : syntactic functions for Spanish

Description of the annotation of named entities and the associated tagset:

NE_annotation_criteria.pdf

Description of the annotation of noun senses and associated tagset:

WordNet_annotation_of_nouns.pdf

Description of the annotation of semantic roles:

semantic_classes.pdf : description of the verbal semantic classes
thematic_roles_tagset.pdf : complete tagset of 'argument+thematic-role' labels
verb_lexical_entry.pdf : description of the entries of the verbal lexicon (rolesets)

Note that these documents are also distributed with the trial dataset tarball (at the SemEval website), but we cannot guarantee that the tarball contains the latest versions. To stay up to date with the complementary documentation and scripts, download them directly from this webpage.

Software

Formatting scripts

tree2column: Format conversion script. It takes as input sentences in the standard CESS-ECE format (similar to that of the Penn Treebank) and outputs them in a column-style presentation of the annotation levels. An updated version is already available: semeval9-1.4.tar.gz (see the README file in the software package). It can be useful for those working directly with the tree format instead of the column format. Updated! 20th March
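To illustrate the kind of conversion tree2column performs, here is a minimal Python sketch. It is not the distributed script: it assumes Penn Treebank style bracketing with leaves written as "(TAG word)", and it emits only two columns (word and tag). The function names, the example sentence, and the leaf tags DET, NOUN and VERB are hypothetical; the real CESS-ECE format and the real column layout are described in the README of the software package.

```python
import re


def tree_to_columns(tree):
    """Return (word, tag) pairs for each leaf of a bracketed tree.

    Assumption: leaves look like "(TAG word)" with no nested brackets.
    The actual CESS-ECE format may carry more information per node.
    """
    leaf = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")
    return [(word, tag) for tag, word in leaf.findall(tree)]


def format_columns(rows):
    """Render the pairs one token per line, with the words left-aligned."""
    width = max((len(word) for word, _ in rows), default=0)
    return "\n".join(f"{word.ljust(width)}  {tag}" for word, tag in rows)


# Hypothetical input: "sn" is a constituent label, the leaf tags are made up.
example = "(S (sn (DET El) (NOUN gat)) (grup.verb (VERB dorm)))"
rows = tree_to_columns(example)
print(format_columns(rows))
```

The sketch keeps only the leaves and discards the constituent structure, which the real script preserves in additional columns.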

Official evaluation script

msacs-eval: Official script for evaluation in SemEval-2007 task #9. It offers the capabilities described in the evaluation section. It is already available in semeval9-1.4.tar.gz (see the README file in the software package). Remember that SRL columns must follow the textual order of the predicates. Updated! 12th April

Baselines

The organizers computed a baseline system for each subtask and language.

SRL: it consists of a series of simple language-dependent heuristics that perform a basic SRL tagging (e.g., tag the first sn or sn* before the target verb as A0). This baseline was adapted from the CoNLL 2005 shared task.
NSD: it consists of a most-frequent-sense tagging strategy. Every noun is tagged with the first sense from the training corpus, with backup to the Spanish or Catalan WordNets.
NER: it consists of the application of a gazetteer (collected from the training data) and a series of simple heuristics that perform a basic NER tagging (e.g., if POS=W then tag=DAT).
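The NSD and NER baselines can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' code: the function names, the sense identifiers, the LOC label and the data structures are hypothetical. We also assume that "first sense" means the sense seen most often in training, and that the gazetteer is consulted before the POS heuristic. The only rule taken from the description above is "if POS=W then tag=DAT".

```python
from collections import Counter, defaultdict


def train_mfs(tagged_nouns):
    """Map each lemma to its most frequent sense among (lemma, sense) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_nouns:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def tag_noun(lemma, mfs, wordnet_first_sense):
    """Use the training-corpus sense; back off to a WordNet lookup table."""
    if lemma in mfs:
        return mfs[lemma]
    # Returns None when the lemma is unknown to both resources.
    return wordnet_first_sense.get(lemma)


def tag_entity(word, pos, gazetteer):
    """Gazetteer lookup first (assumed order), then the POS=W heuristic."""
    if word in gazetteer:
        return gazetteer[word]
    if pos == "W":
        return "DAT"
    return None


# Hypothetical toy data; sense identifiers and the LOC label are made up.
training = [("banc", "sense-2"), ("banc", "sense-1"), ("banc", "sense-2")]
mfs = train_mfs(training)
wordnet_backup = {"riu": "sense-1"}
gazetteer = {"Barcelona": "LOC"}

print(tag_noun("banc", mfs, wordnet_backup))
print(tag_noun("riu", mfs, wordnet_backup))
print(tag_entity("Barcelona", "NP", gazetteer))
print(tag_entity("2007", "W", gazetteer))
```

Keeping the backup lexicon as a plain dictionary makes the fallback order explicit: training corpus first, WordNet second, and no tag when both fail.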

Other Resources

Full dictionaries relating lemmas (nouns, verbs, adjectives and adverbs) and WordNet senses. New! 24th March
- Full Catalan dictionary.
- Full Spanish dictionary.
- README file.
Full Catalan and Spanish WordNets, which are linked to English WordNet 1.6. New!
- WordNet files.
- README file.

Multilingual Central Repository developed under the MEANING project. It includes the Spanish and Catalan WordNets, though we cannot guarantee that they are exactly the same versions as the ones we distribute for task#9.

Web interface.

Full style guides for syntax annotation:

annotation-of-constituents-guidelines.ca.pdf : Annotation of Catalan constituents (document in Catalan).
annotation-of-constituents-guidelines.es.pdf : Annotation of Spanish constituents (document in Spanish).
annotation-of-functions-guidelines.ca.pdf : Annotation of Catalan functions (document in Catalan).
annotation-of-functions-guidelines.es.pdf : Annotation of Spanish functions (document in Spanish).

Full verbal lexicon: roleset descriptions for all verbs in the training/test corpora. Updated! 9th March

Last update: May 22nd, 2007

For more information, visit the SemEval-2007 home page.