========================================================================

   SENSEVAL-3 evaluation
   Catalan Lexical Sample Task: March 8, 2004

========================================================================

Task organizers:

   Lluís Màrquez (contact person)
   lluism@lsi.upc.es
   TALP Research Center, LSI, Universitat Politècnica de Catalunya

   Antònia Martí
   amarti@ub.edu
   CLiC, Universitat de Barcelona

   Mariona Taulé
   mtaule@uoc.edu
   CLiC, Universitat de Barcelona

========================================================================

General description of the Catalan Lexical Sample Task:

We propose a "Lexical-Sample" task for Catalan in order to evaluate
supervised and semi-supervised learning systems for WSD. Each
participant will be provided with a relatively small set of labeled
examples (two thirds of 75+15*#senses) and a much larger set of
unlabeled examples (ten times more, when possible) for 27 words. The
test set will consist of the remaining one third of 75+15*#senses.

We target two types of participants: supervised systems (not using
unlabeled data) and semi-supervised systems (those taking advantage of
the unlabeled data). Unsupervised systems can of course also
participate.

The MiniDir sense inventory, which was developed specially for the
task, is manually linked to WordNet 1.5 (automatic links to
WordNet1.6/1.7 will also be provided).

This task is coordinated with other lexical-sample tasks (Basque,
Spanish, English, Italian, Romanian) in order to share around 10 of
the target words.

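As a rough illustration of the sizing rule above, the following Python
sketch computes the nominal number of examples for a word with a given
number of senses. The function name nominal_sizes is hypothetical and
is not part of the distributed data or tools.

```python
from typing import Tuple


def nominal_sizes(n_senses: int) -> Tuple[int, int, int]:
    """Return (total, train, test) under the 75+15*#senses rule.

    Illustrative only: the name and return layout are assumptions made
    for this sketch, not something shipped with the task data.
    """
    total = 75 + 15 * n_senses  # nominal number of annotated examples
    test = total // 3           # one third goes to the test set
    train = total - test        # the remaining two thirds are training
    return total, train, test


if __name__ == "__main__":
    for n in (2, 3, 5):
        total, train, test = nominal_sizes(n)
        print(n, total, train, test)
```

Under this rule a word with 2 senses would nominally get 105 examples
(70 for training, 35 for test). These are nominal figures only; the
actual per-word counts are the ones listed in the "words.info" file.
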
========================================================================

All data sets and complementary information are organized in the
following files:

   README                            - this file
   words.info                        - information about the 27 words
                                       treated
   MiniDir.xml                       - sense inventory used to annotate
                                       examples
   Catalan-samples.train.raw.xml     - training examples for all 27 words
   Catalan-samples.train.tagged.xml  - training examples for all 27 words:
                                       lemmatized and POS tagged version
   Catalan-samples.unlab.raw.xml     - unlabeled examples for all 27 words
   Catalan-samples.unlab.tagged.xml  - unlabeled examples for all 27 words:
                                       lemmatized and POS tagged version
   Catalan-samples.test.raw.xml      - test examples for all 27 words
   Catalan-samples.test.tagged.xml   - test examples for all 27 words:
                                       lemmatized and POS tagged version

All these files are grouped into the following 3 gzipped tar packages
for downloading:

   - LexicalSample.ca.train.raw.tgz (11.1Mb): contains the informative
     files, MiniDir.xml, and the "raw" version of the training
     examples, both labeled and unlabeled.

   - LexicalSample.ca.train.tagged.tgz (39.2Mb): contains the "tagged"
     versions of the training examples, both labeled and unlabeled.

   - LexicalSample.ca.test.tgz (4.5Mb): contains the test datasets,
     both "raw" and "tagged".

Additional information about each of the files and datasets is given
below.

========================================================================

Set of words treated: the "words.info" file

This file contains basic information about the 27 words treated: word,
PoS (noun, verb, or adjective), number of valid senses, and number of
training/test/unlabeled examples. It looks like:

   ----------------------------------------------------
   word        PoS  #senses  #train  #test  #unlab
   ----------------------------------------------------
   actuar       v      2       197     99    2442
   apuntar      v      5       184     93    1881
   autoritat    n      2       188     93     102
   baixar       v      3       189     92    1572
   ...
   verd         a      2       128     64    1315
   vital        a      3       160     81     220
   ----------------------------------------------------
   TOTAL                      4469   2253   23935
   ----------------------------------------------------

========================================================================

Training Examples: the "Catalan-samples.train.raw.xml" file

This file contains the manually annotated set of examples for all 27
words. All the examples have been extracted from the 2000-2003 corpus
of the Catalan ACN News Agency (Agència Catalana de Notícies).

Each example has been tagged by two independent expert human
annotators, and cases of disagreement have been resolved by another
lexicographer (assigning a unique sense to each example).

The senses corresponding to multi-word expressions have been manually
filtered out. Examples of senses with low frequency in the reference
corpus have also been discarded. More information can be found in the
MiniDir.xml file.

The "Catalan-samples.train.raw.xml" file is compliant with the standard
Senseval-2 XML formats, with a few additions.

1. Each example is provided with a (non null) list of category-labels
   marked according to the internal ACN annotation scheme.

   Example:
   ========

2. The context of each example consists of three paragraphs, which
   have been marked with the , , and labels. Previous and following
   paragraphs may be empty.

   Example:
   ========

   El conseller d'Interior de la Generalitat, Xavier Pomés, ha dit que
   les condicions climatològiques actuals han permès retardar l'inici
   del dispositiu de bombers previst per la campanya d'enguany fins el
   15 de maig en comptes de fer-lo efectiu a primers del mes vinent.
   D'aquesta manera, ha afegit Pomés, es pot garantir tenir els
   efectius necessaris fins a l'octubre. El conseller s'ha desplaçat
   avui a Lleida per inaugurar l'assignatura d' Incendis Forestals del
   curs d'Enginyers Tècnics Forestals i Enginyers de Monts de la
   facultat d'Agrònoms de la Universitat de Lleida.
   D'altra banda, el conseller ha defensat la preparació dels bombers
   de la Generalitat com dels 'pompièrs' de la Val d'Aran per actuar
   en cas d'accidents a l'interior dels túnels. Pomés ha fet aquestes
   declaracions en resposta a un informe europeu fet públic ahir a la
   seu del RACC a Barcelona en el que es posa de manifest que tant els
   bombers de la Generalitat com els de l'Aran no han rebut
   l'entrenament especialitzat per situacions dins de túnels.

========================================================================

Unlabeled Examples: the "Catalan-samples.unlabeled.raw.xml" file.

This file contains the unlabeled set of examples for all 27 words. All
the examples have also been extracted from the 2000-2003 corpus of the
Catalan ACN News Agency. The file has the same format as
"Catalan-samples.train.raw.xml", except that the examples contain no
information about correct senses.

Unlabeled examples have been extracted automatically with no manual
post-processing, so they are not free of errors. Based on the
information compiled from the manual inspection of the set of training
examples, an informative element has been included for each word in
the "Catalan-samples.unlabeled.raw.xml" file.

   Example:
   ========

   Expected proportion of sense 1: 31.40%
   Expected proportion of sense 2: 7.51%
   Expected proportion of sense 3: 57.00%
   Expected proportion of sense 4: 0.68%
   Expected proportion of unfiltered multi-word expressions: 3.07%
   Expected proportion of other senses: 0.34%
   Expected POS errors: 0

========================================================================

Test Examples: the "Catalan-samples.test.raw.xml" file.

This file contains the test set of examples for all 27 words. All the
examples have also been extracted from the 2000-2003 corpus of the
Catalan ACN News Agency and annotated following the same procedure as
the training set.
The test file has the same format as "Catalan-samples.train.raw.xml",
except that the examples contain no information about correct senses.

========================================================================

POS tagged and lemmatized examples: "Catalan-samples.train.tagged.xml",
"Catalan-samples.unlab.tagged.xml", and "Catalan-samples.test.tagged.xml"
files

To help teams with few resources, the "train.raw", "unlab.raw", and
"test.raw" files are complemented with the corresponding
"train.tagged", "unlab.tagged", and "test.tagged" files, in which the
contexts of the examples are also tokenized, lemmatized and POS tagged.
This annotation is performed using the Catalan linguistic processors
developed at the TALP research center, UPC.

For instance, the context:

   El tribunal de la secció setena ha valorat el fet que el jove va
   actuar molt afectat pel consum d'alcohol, que li mermava les
   capacitats volitives, i per això ha rebaixat considerablement la
   petició que va formular el fiscal, que era de set anys de presó.

is complemented with a element, in which there is an annotation for
each token:

   ...
   ...

The attribute "frm" contains the form, "lem" contains the lemma, and
"pos" contains the part-of-speech tag. Note that the target word is
marked with an additional attribute: head="yes".

This annotation includes a proper segmentation into lexical tokens
referring to:

   - basic multiword expressions: etc.
   - proper nouns and named entities:
   - dates and temporal expressions:
   - numbers, money, and percentages: etc.

You can find a raw list with the meaning of the tags used by these
analyzers at:

   http://www.lsi.upc.es/~nlp/tools/parole-eng.html

It is a PAROLE compliant tag set specially developed for Spanish and
Catalan.
A more detailed description of the Spanish and Catalan tagsets can be
found at:

   http://www.lsi.upc.es/~nlp/tools/parole-sp.html
   http://www.lsi.upc.es/~nlp/tools/parole-cat.html

========================================================================

Sense inventory: the "MiniDir.xml" file

This file contains the dictionary entries for all 27 words treated.
The MiniDir Spanish and Catalan dictionaries have been developed
specifically for automatic Word Sense Disambiguation by CLiC
(Universitat de Barcelona).

Each entry will contain:

   - general information:

   - the list of multi-word expressions filtered out from the training
     set:

   - and information relative to each sense:

Each sense contains: an identifier, the definition, a list of
examples, a list of synonyms (possibly empty), a list of typical
collocations (possibly empty), and the corresponding set of equivalent
synsets in WordNet-1.5 (possibly empty).

MiniDir is coarser grained than WordNet, so one MiniDir sense
typically corresponds to several WordNet senses. However, a MiniDir
sense may have no WordNet equivalent at all (see "canal.4" for
instance).

Senses discarded because of their low frequency have the attribute
used="no" (e.g., see sense "actuar.3").

Note that the dictionary contains some additional entries that were
not finally included in the evaluation (e.g., art.n, corona.n, etc.).

========================================================================

Other Resources provided:

   - Spanish/Catalan/English WordNet1.5 can be consulted online at:
     http://nipadio.lsi.upc.es/cgi-bin/wei/public/wei.consult.perl

   - Mappings between WordNet1.5, WordNet1.6, and WordNet1.7 can be
     freely obtained at:
     http://www.lsi.upc.es/~nlp/tools/mapping.html

   - For access to a more complete online interface to the
     Spanish/Catalan/English WordNets, please contact the organizer
     (lluism@lsi.upc.es) in order to get a username and password.

========================================================================
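
The "words.info" table described earlier in this file can be read with
a few lines of Python. The sketch below is only an illustration: the
names WordInfo and parse_words_info are hypothetical, and it assumes
that the real file uses the same whitespace-separated columns as the
excerpt shown in this README (word, PoS, #senses, #train, #test,
#unlab). Separator, header, and TOTAL lines are skipped.

```python
from typing import Iterable, List, NamedTuple


class WordInfo(NamedTuple):
    """One row of words.info (field names are assumptions of this sketch)."""
    word: str
    pos: str      # "n", "v", or "a"
    senses: int
    train: int
    test: int
    unlab: int


def parse_words_info(lines: Iterable[str]) -> List[WordInfo]:
    """Keep only lines that look like data rows: six fields, a one-letter
    PoS code, and four integer counts."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6 or fields[1] not in {"n", "v", "a"}:
            continue  # separator, header, TOTAL, or "..." line
        try:
            counts = [int(f) for f in fields[2:]]
        except ValueError:
            continue  # not a data row after all
        rows.append(WordInfo(fields[0], fields[1], *counts))
    return rows


# Sample input copied from the excerpt in this README.
SAMPLE = """\
----------------------------------------------------
word PoS #senses #train #test #unlab
----------------------------------------------------
actuar v 2 197 99 2442
apuntar v 5 184 93 1881
autoritat n 2 188 93 102
baixar v 3 189 92 1572
...
verd a 2 128 64 1315
vital a 3 160 81 220
----------------------------------------------------
TOTAL 4469 2253 23935
----------------------------------------------------
"""

rows = parse_words_info(SAMPLE.splitlines())

if __name__ == "__main__":
    for row in rows:
        print(row.word, row.pos, row.senses, row.train, row.test, row.unlab)
```

To read the distributed file instead of the sample, pass an open file
object to parse_words_info. The character encoding of the file is not
stated in this README, so it should be checked before opening it.
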