%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SemEval-2007 task#9
% Multilevel Semantic Annotation of Catalan and Spanish
%
% Released on: March 10, 2007
%
% Task organizers:
%   Lluís Màrquez, Luis Villarejo
%   TALP Research Center
%   Technical University of Catalonia (UPC)
%
%   Antònia Martí, Mariona Taulé
%   Centre de Llenguatge i Computació, CLiC
%   Universitat de Barcelona
%
% Contact e-mail address: semeval-msacs@lsi.upc.edu
% Task website: http://www.lsi.upc.edu/~nlp/semeval/msacs.html
%
% These datasets are distributed to support the SemEval-2007 task#9 on
% Multilevel Semantic Annotation of Catalan and Spanish. They are free
% for research and educational purposes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Current version and scheduling
==============================

The present version of the datasets consists of the full training corpora for
Catalan and Spanish, containing 97,758 and 90,661 lexical tokens, respectively.

*IMPORTANT note*: the present training datasets strictly contain the previous
distribution with full training sets (released on March 5th). We recommend
discarding the previous datasets and working with the current ones, since the
SRL columns have been reordered according to the textual order of the
predicates, and the annotation of one sentence has been fixed. By textual
order we mean that the left-most SRL column corresponds to the first predicate
in the sentence, the second left-most SRL column corresponds to the second
predicate in the sentence, and so on. Both prediction and gold files must
follow the textual order of the predicates, since this is a requirement for
performing the evaluation.

The test data will be about 10 times smaller than the training corpus and will
contain examples from two different sources (one is the same corpus the
training data was drawn from; the other is a corpus from a slightly different
domain and genre) in order to test the robustness of the participating
systems.
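The textual-order requirement on the SRL columns can be expressed as a small
sanity check. The following sketch is illustrative only and not part of the
official tools: the helper name is ours, and it assumes each token line has
already been split into its whitespace-separated fields, with the TV marker in
column 3 (index 2) and the PROPS columns starting at column 10 (index 9), as
described in the format section later in this README.

```python
# Illustrative sketch (helper name ours, not part of the distributed tools):
# check that the i-th PROPS column carries the (V*) tag of the i-th target
# verb in textual order. `rows` is a sentence as a list of token lines, each
# already split into fields.
def srl_columns_in_textual_order(rows, first_props_col=9):
    # Target verbs are the rows whose TV field (index 2) is '*'.
    verb_rows = [i for i, fields in enumerate(rows) if fields[2] == "*"]
    # The i-th target verb must carry its (V*) tag in the i-th PROPS column.
    return all("(V*" in rows[r][first_props_col + i]
               for i, r in enumerate(verb_rows))

# Two target verbs with their SRL columns in textual order:
rows = [
    ["ampliará", "-", "*", "ampliar", "vmif3s0", "(gv*)", "*", "-", "a1", "(V*)", "*"],
    ["quedan",   "-", "*", "quedar",  "vmip3p0", "(gv*)", "*", "-", "b3", "*", "(V*)"],
]
print(srl_columns_in_textual_order(rows))  # -> True
```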
It will be released by March 12. Stay tuned to the SemEval and Task#9
websites. The README accompanying the test set distribution will include
precise instructions on how to upload system results.

All these datasets are to be downloaded through the official SemEval website,
once you are registered as a team. The 4-week evaluation period for the task
starts from the moment the participant team clicks the download button for the
training data (this also holds for the partial training-set-1). Test time will
be 1 week (included in the complete 4-week period). This 4-week period must
fall within the SemEval-2007 evaluation frame: February 26 to April 1.

License and usage of datasets
=============================

Training/test datasets are free for research and academic purposes. However,
participants must first sign a usage license, which has to be delivered to the
attention of Maria Antònia Martí (head researcher of CLiC, Universitat de
Barcelona). The license form is available at the official website for task #9:
http://www.lsi.upc.edu/~nlp/semeval/msacs_download.html. Fill in the form and
submit an electronically signed version to amarti@ub.edu. Optionally, you can
print it and send it by regular mail to:

    M. Antònia Martí Antonín
    Centre de Llenguatge i Computació
    Universitat de Barcelona
    Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain

In that case, we also ask you to fax the form to: +34 93 3189822 or
+34 93 4489434.

Whenever you publish results using the present datasets, you are requested to
cite the CESS-ECE project appropriately:

    M. Antònia Martí, Mariona Taulé, Lluís Màrquez and Manuel Bertran (2007).
    CESS-ECE: A Multilingual and Multilevel Annotated Corpus. Pending
    publication. Currently available at
    http://www.lsi.upc.edu/~mbertran/cess-ece/publications.

Disclaimer
==========

The information contained in this README file is for orientation only and
might be incomplete or inaccurate at some points.
For a complete description of the task setting, datasets, resources, etc., we
refer the reader to the official task webpage, which is periodically updated:
http://www.lsi.upc.edu/~nlp/semeval/msacs.html

General information and formats
===============================

The sentences in the training datasets are properly tokenized, POS tagged and
lemmatized, and include full syntactic annotation (gold-standard constituency
parse trees with function tags). Semantic annotations are also included,
describing named entities (NE), noun senses (NS) and semantic roles (SR),
which constitute the target knowledge to be learned. Data formatting is
exactly the same as that of the previously released trial dataset. The test
data will share the same formatting but will exclude the semantic levels of
annotation (NE, NS, and SR), which have to be predicted by the participant
systems. The parse trees of the test set will also be the manually revised
gold-standard ones [unfortunately, we have had no time to develop automatic
parsers for both languages to provide the automatically generated syntactic
input levels, as we initially planned]. More instructions on how to format
and upload the results of participant systems will be given with the test set
release (March 12).

Note that the trial dataset has been updated (a number of errors have been
fixed) and posted again on February 22. Download the new version
(http://nlp.cs.swarthmore.edu/semeval/tasks/task09/data.shtml) and check the
changes in the README file. The trial datasets are already included in the
complete training distribution, so there is no need to use them as extra
training material for developing the final systems.

Also, note that using external resources beyond the training datasets to
build a system for the task is not forbidden. However, we strongly encourage
participant teams to explicitly describe all the external resources used in
the system description paper to be prepared after the evaluation period.
By "external resources" we mean any knowledge or data that cannot be directly
inferred from the training sets provided in this release.

* Data formats (copied from the trial dataset description)

Data formats are highly similar to those of the CoNLL-2005 shared task (column
style presentation of the levels of annotation), in order to be able to share
evaluation tools and already developed scripts for format conversion. Note
that the PROPS columns must follow the textual order of the predicates, as
described previously. Here is an example of a fully annotated sentence:

INPUT--------------------------------------------------------------> OUTPUT-------------------------------------->
BASIC_INPUT_INFO-----> EXTRA_INPUT_INFO---------------------------> NE---> NS------> SR----------------------->
WORD         TN TV LEMMA        POS      SYNTAX                NE     NS         SC  PROPS---------------->
-------------------------------------------------------------------------------------------------------------------
Las          -  -  el           da0fp0   (S(sn-SUJ(espec.fp*)  *      -          -   *            (Arg1-TEM*
conclusiones *  -  conclusión   ncfp000  (grup.nom.fp*         *      05059980n  -   *            *
de           -  -  de           sps00    (sp(prep*)            *      -          -   *            *
la           -  -  el           da0fs0   (sn(espec.fs*)        (ORG*  -          -   *            *
comisión     *  -  comisión     ncfs000  (grup.nom.fs*         *      06172564n  -   *            *
Zapatero     -  -  Zapatero     np00000  (grup.nom*)           (PER*) -          -   *            *
,            -  -  ,            Fc       (S.F.R*               *      -          -   *            *
que          -  -  que          pr0cn000 (relatiu-SUJ*)        *      -          -   (Arg0-CAU*)  *
ampliará     -  *  ampliar      vmif3s0  (gv*)                 *      -          a1  (V*)         *
el           -  -  el           da0ms0   (sn-CD(espec.ms*)     *      -          -   (Arg1-PAT*   *
plazo        *  -  plazo        ncms000  (grup.nom.ms*         *      10935385n  -   *            *
de           -  -  de           sps00    (sp(prep*)            *      -          -   *            *
trabajo      *  -  trabajo      ncms000  (sn(grup.nom.ms*))))) *      00377835n  -   *)           *
,            -  -  ,            Fc       *))))))               *)     -          -   *            *)
quedan       -  *  quedar       vmip3p0  (gv*)                 *      -          b3  *            (V*)
para         -  -  para         sps00    (sp-CC(prep*)         *      -          -   *            (ArgM-TMP*
después_del  -  -  después_del  spcms    (sp(prep*)            *      -          -   *            *
verano       *  -  verano       ncms000  (sn(grup.nom.ms*))))  *      10946199n  -   *            *)
.            -  -  .            Fp       *)                    *      -          -   *            *

There is one line for each token, and a blank line after the last token of
each sentence. The columns, separated by whitespace, represent different
annotations of the sentence, tagged along the words. For structured
annotations (named entities, parse trees and arguments), we use the Start-End
format. The Start-End format represents phrases (syntactic constituents,
named entities, and arguments) that constitute a well-formed bracketing of a
sentence (that is, phrases do not overlap, though they admit embedding). Each
tag is of the form STARTS*ENDS and represents the phrases that start and end
at the corresponding word. A phrase of type k places a '(k' parenthesis in the
STARTS part of its first word, and a ')' parenthesis in the ENDS part of its
last word.

The different annotations in a sentence are grouped in five main categories:

[1] BASIC_INPUT_INFO. The basic input information that the participants need:
    * WORDS (column 1): words of the sentence.
    * TN (column 2): target nouns of the sentence (those that are to be
      assigned WordNet synsets); marked with '*'.
    * TV (column 3): target verbs of the sentence (those that are to be
      annotated with semantic roles); marked with '*'.

[2] EXTRA_INPUT_INFO. The extra input information provided to the
    participants:
    * LEMMA (column 4): lemmas of the words.
    * POS (column 5): part-of-speech tags.
    * SYNTAX (column 6): full syntactic tree.

[3] NE (column 7). Named entities (output information = to be predicted at
    test time; available only for the trial/training sets).

[4] NS (column 8). WordNet senses of the target nouns (output information).

[5] SRL. Information on semantic roles:
    * SC (column 9): the lexico-semantic class of the verb (output
      information).
    * PROPS (columns 10-[10+N-1]): for each of the N target verbs, a column
      representing the argument structure of that target verb (output
      information). Core numbered arguments are enriched with the thematic
      role label (e.g., Arg1-TEM). ArgM's are the adjuncts.
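As an illustration of the Start-End format, the following sketch (the helper
name is ours; it is not part of the distributed tools) recovers the phrases
encoded by one column of tags. It assumes the simple case of exactly one '*'
per field; see NOTE-2 below for fields whose labels themselves contain '*'.

```python
import re

# Illustrative sketch (helper name ours): recover (label, start, end) phrases
# from one Start-End column, given the column as a list of per-token tags.
# Assumes exactly one '*' per field (see NOTE-2 for the exceptions).
def parse_start_end(tags):
    stack, phrases = [], []
    for i, tag in enumerate(tags):
        starts, ends = tag.split("*", 1)
        for label in re.findall(r"\(([^()]+)", starts):
            stack.append((label, i))        # phrase of type `label` opens at token i
        for _ in range(ends.count(")")):
            label, start = stack.pop()      # innermost open phrase closes at token i
            phrases.append((label, start, i))
    return phrases

# An NE column for a 4-token fragment: an ORG span over tokens 0-3 with a
# PER entity embedded at token 2.
print(parse_start_end(["(ORG*", "*", "(PER*)", "*)"]))
# -> [('PER', 2, 2), ('ORG', 0, 3)]
```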
NOTE-1: All these annotations in column format are extracted automatically
from the syntactic-semantic trees of the CESS-ECE corpora, which are also
distributed with the datasets (see the description below). These are
constituency trees enriched with semantic labels for NE, NS and SR. The
format is similar to that of the Penn Treebank and is fully described in the
accompanying documentation. As an example, the following tree represents the
complete previous example in column format:

( (S
    (sn-SUJ-Arg1-TEM
      (espec.fp (da0fp0 Las el))
      (grup.nom.fp
        (ncfp000 conclusiones conclusión 01207975n)
        (sp (prep (sps00 de de))
          (sno (espec.fs (da0fs0 la el))
            (grup.nom.fs
              (ncfs000 comisión comisión 01207975n)
              (snp (grup.nom (np0000p Zapatero Zapatero)))
              (S.F.R (Fc , ,)
                (relatiu-SUJ-Arg0-CAU (pr0cn000 que que))
                (gv (vmif3s0 ampliará ampliar-a1))
                (sn-CD-Arg1-PAT
                  (espec.ms (da0ms0 el el))
                  (grup.nom.ms
                    (ncms000 plazo plazo 01207975n)
                    (sp (prep (sps00 de de))
                      (sn (grup.nom.ms (ncms000 trabajo trabajo 01207975n))))))
                (Fc , ,)))))))
    (gv (vmip3p0 quedan quedar-b3))
    (sp-CC-ArgM-TMP (prep (sps00 para para))
      (sp (prep (spcms después_del después_del))
        (sn (grup.nom.ms (ncms000 verano verano 01207975n)))))
    (Fp . .)))

The scripts for automatically converting these trees into the column format
are also distributed as part of the resources for the task. If you want to
use them, see the Download section of the task#9 official webpage for
instructions on how to download, install, and use the software.

NOTE-2: Some syntactic labels of tree constituents contain the '*' symbol
(e.g., S.F.R*; see the files tagset-constituents.ca.pdf and
tagset-constituents.es.pdf for details). This symbol is also used as a meta
character in our column-based codification of the syntactic trees, possibly
leading to confusion.
For instance, you may find:

    "(S.F.AComp*.j(conj.subord*)"   (Spanish sentence 1, line 31)
    "(S.F.R**"                      (Catalan sentence 2, line 20)

However, note that this codification is not ambiguous: each line contains
exactly one '*' meta character. If more than one '*' appears in a line, the
meta character is the last one (it can also be identified as being either the
last symbol of the field or the symbol immediately preceding a closing
parenthesis, ')').

Training data organization
==========================

As in the trial data distribution, the training data comes in two
directories, 'ca' and 'es', one for each language (ca=Catalan; es=Spanish),
containing the following files (<lang> stands for 'ca' or 'es' in the file
names):

trial.<lang>.trees.txt.gz : Contains the syntactic trees enriched with all
    the semantic information (original trees from the CESS-ECE corpora; see
    the example above). We distribute the original complete trees merely to
    ease the comprehension/readability of the information presented and its
    connection to the column-based format. Note that the task will be
    evaluated strictly on the column-based format. In the test set, all the
    information described as OUTPUT will be removed from the syntactic trees.
    Do not be tempted to use it during training.

trial.<lang>.BII.txt.gz : Contains the Basic Input Information that the
    participants need (input info).

trial.<lang>.EII.txt.gz : Contains the Extra Input Information provided to
    the participants (input info).

trial.<lang>.NER.txt.gz : Contains the Named Entity tags (output
    information).

trial.<lang>.NSD.txt.gz : Contains the WordNet senses of the target nouns
    (output information).

trial.<lang>.SRL.txt.gz : Contains the lexico-semantic class of the verb
    and, for each target verb, a column representing its arguments (output
    information).

trial.<lang>.ALL.txt.gz : Contains ALL the previous files pasted together
    (in columns) in the same order as described in the example.
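The disambiguation rule stated in NOTE-2 above can be sketched in a few lines
(the function name is ours, not part of the distributed software): since the
meta '*' is always the last '*' in the field, a right-to-left search
suffices.

```python
# Sketch of the NOTE-2 disambiguation rule (helper name ours): the meta '*'
# is always the LAST '*' in the field, so split the field there into its
# STARTS and ENDS parts.
def split_meta(field):
    i = field.rfind("*")              # position of the meta character
    return field[:i], field[i + 1:]   # (STARTS part, ENDS part)

# The two examples from NOTE-2:
print(split_meta("(S.F.AComp*.j(conj.subord*)"))  # -> ('(S.F.AComp*.j(conj.subord', ')')
print(split_meta("(S.F.R**"))                     # -> ('(S.F.R*', '')
```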
Accompanying Documentation
==========================

The accompanying documentation needed to properly understand the details of
the formats and tag sets is distributed through the task#9 web site. Please
consult the URL:

    http://www.lsi.upc.edu/~nlp/semeval/msacs_download.html

It will contain the latest versions of the following documentation and
software tools:

Descriptions of syntactic tagsets:
 - tagset_POS.pdf : tagset with part-of-speech labels for Catalan and Spanish
 - tagset-constituents.ca.pdf : list of tree constituents for Catalan
 - tagset-constituents.es.pdf : list of tree constituents for Spanish
 - tagset_syntactic_functions.ca.pdf : syntactic functions for Catalan
 - tagset_syntactic_functions.es.pdf : syntactic functions for Spanish

Description of the annotation of named entities and the associated tagset:
 - NE_annotation_criteria.pdf

Description of the annotation of noun senses and the associated tagset:
 - WordNet_annotation_of_nouns.pdf

Description of the annotation of semantic roles:
 - semantic_classes.pdf : description of the verbal semantic classes
 - thematic_roles_tagset.pdf : complete tagset of 'argument+thematic-role'
   labels
 - verb_lexical_entry.pdf : description of the entries of the verbal lexicon
   (rolesets)

Formatting scripts
 - tree2column : Format conversion script. It receives as input sentences in
   the standard CESS-ECE format (similar to that of the Penn Treebank) and
   outputs the sentences in the column-style presentation of the levels of
   annotation. An updated version is already available: semeval9-0.6.tar.gz
   (see the README file in the software package). It can be useful for those
   working directly with the tree format instead of the column format.

Official evaluation script
 - msacs-eval : Official script for evaluation in SemEval-2007 task #9. It
   offers the capabilities described in the evaluation section.

Baselines
 - A baseline system for each subtask and language will be provided by the
   organization.
   * SRL: it will consist of a series of simple language-dependent heuristics
     that perform a basic SRL tagging (e.g., tag the first sn or sn* before
     the target verb as A0). This baseline is adapted from the CoNLL-2005
     shared task.
   * NSD: it will consist of a most-frequent-sense tagging strategy. Every
     noun is tagged with the first sense from the Spanish or Catalan WordNet.
   * NER: it will consist of the application of a gazetteer (collected from
     the training data) and a series of simple heuristics that perform a
     basic NER tagging (e.g., if POS=W then tag=DAT).

Other Resources
 - Full Catalan and Spanish WordNets, which are linked to the English
   WordNet 1.6.
 - Link to the Multilingual Central Repository developed under the MEANING
   project.
 - Dictionary of senses (according to the Catalan and Spanish WordNets) for
   all nouns treated in the dataset.
 - Full style guides for syntax annotation:
   * annotation-of-constituents-guidelines.ca.pdf : annotation of Catalan
     constituents (document in Catalan).
   * annotation-of-constituents-guidelines.es.pdf : annotation of Spanish
     constituents (document in Spanish).
   * annotation-of-functions-guidelines.ca.pdf : annotation of Catalan
     functions (document in Catalan).
   * annotation-of-functions-guidelines.es.pdf : annotation of Spanish
     functions (document in Spanish).
 - Full verbal lexicon : roleset descriptions for all verbs in the
   training/test corpora.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%