NLP'07 - Semantics assignment:

Supervised Word Sense Disambiguation

[INTRODUCTION]

Not long ago, building a supervised WSD system meant compiling a corpus, annotating it manually with senses, running semantic and syntactic analyzers over it, extracting features, and implementing some machine learning technique before finally having a system. Today we have access to many freely available resources, and the effort of building a supervised WSD system has been dramatically reduced.

The task for this assignment is to build two versions of a supervised WSD system using freely available resources.


[RESOURCES]

1) SemCor corpus (with machine learning features already extracted):

In the context of the SemEval-2007 competition, many resources have been made available. Among them, the IXA NLP group has released machine learning features for all content words with more than 10 occurrences in SemCor. These features can be freely used to develop all-words supervised Word Sense Disambiguation systems. The sense tags correspond to synsets of WordNet 1.6, but the senses can easily be mapped to other versions (see, for instance, http://www.lsi.upc.es/~nlp/tools/mapping.html).
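If you work with a WordNet version other than 1.6, you will need to translate the sense tags through one of the mapping files mentioned above. The following is a minimal sketch of that translation; it ASSUMES a mapping-file layout of one "source-offset target-offset score" triple per line, which you should check against the actual files you download.

```python
# Hedged sketch: remapping WordNet 1.6 synset tags to another version.
# ASSUMPTION: each mapping line looks like
#   "<wn16_offset> <target_offset> <score>"
# (verify against the real mapping files before relying on this).

def load_mapping(lines):
    """Build a dict from WN 1.6 offsets to the best-scoring target offset."""
    best = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        src, tgt, score = parts[0], parts[1], float(parts[2])
        # keep only the highest-scoring target for each source synset
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _score) in best.items()}

def remap_tags(tags, mapping):
    """Translate sense tags, keeping the original tag when no mapping exists."""
    return [mapping.get(t, t) for t in tags]
```

Keeping unmapped tags unchanged (rather than dropping them) makes it easy to see afterwards how many senses failed to map.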

You can download it from the Semeval WSD-CLIR task website or directly from here.

2) Machine Learning algorithms:

  1. SVMlight software.
  2. TiMBL software.

[TASK STEPS]

The task consists of the following steps:

  1. Download the corpus of machine learning features and split it into two datasets:
    1. Training, containing the examples used for learning (80% of the total number of examples).
    2. Test, containing the examples used to compare the two sets of features (20% of the total number of examples).
  2. Design two feature sets, one as the basis of each WSD system.
  3. Choose a Machine Learning algorithm and build both systems on the training dataset.
  4. Evaluate the two systems on the test dataset and compare the results.

Note: Optionally, you can split the corpus into three datasets (80% for training, 10% for test and 10% for development), so you have a development dataset for tuning the trained systems and trying to improve the results.
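The split in step 1 can be sketched as follows. Doing it per word (rather than over the whole corpus at once) is our suggested refinement: it keeps every lemma represented in both sets in roughly the 80/20 proportion. Adjusting the ratio and calling the function twice gives the optional 80/10/10 split.

```python
import random

# Hedged sketch of the step-1 corpus split: an 80/20 train/test split
# performed per word, so each lemma's examples are distributed
# proportionally between the two sets.

def split_per_word(examples_by_word, test_ratio=0.2, seed=7):
    """examples_by_word: dict mapping each lemma to its list of examples.
    Returns (train, test) lists; a fixed seed keeps the split reproducible."""
    rng = random.Random(seed)
    train, test = [], []
    for word, examples in examples_by_word.items():
        shuffled = examples[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_ratio))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test
```

The `max(1, ...)` guard ensures that even low-frequency lemmas contribute at least one test example.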

[EVALUATION]

Write a short report (around 3 pages) describing the experimental setup and the features you used, focusing on the differences in the results achieved by the two systems.
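For step 4 and the report, the comparison boils down to measuring each system's accuracy on the held-out test examples. A minimal sketch (the per-system dictionary output is just one convenient way to lay the numbers side by side):

```python
# Hedged sketch: accuracy of each WSD system on the test set, so the two
# feature sets can be compared in the report.

def accuracy(gold, predicted):
    """Fraction of test examples whose predicted sense matches the gold sense."""
    assert len(gold) == len(predicted)
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def compare(gold, preds_a, preds_b):
    """Overall accuracy of both systems, side by side."""
    return {"system A": accuracy(gold, preds_a),
            "system B": accuracy(gold, preds_b)}
```

Reporting accuracy per word as well as overall often makes the differences between the two feature sets easier to explain.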

[BIBLIOGRAPHY]

Here you can find a study by Audibert on different feature types and their impact on WSD. Be creative! ;)
Audibert, L. (2004). Word sense disambiguation criteria: a systematic study. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004).

[QUESTIONS]

Don't hesitate to send any questions or doubts to villarejo at gmail dot com