================================================================
[1] Named Entity Recognition and Classification: Comparing
    Learning Approaches and Feature Sets
================================================================

GOAL: The goal of this work is twofold. First, we aim to compare
two learning architectures for sequential learning on Named Entity
Recognition and Classification (NERC). Second, we aim to develop
and study a rich and effective set of features for the task. All
NERC prototypes must be built on the datasets from CoNLL-2002.

NOTE-1: you can restrict yourself to the Spanish data if you want.
Download the version with words + POS tags.

SETTING: The task setting is fixed to that of the CoNLL-2002 shared
task on named entity recognition. Please visit the website

    http://www.cnts.ua.ac.be/conll2002/ner/

to obtain the datasets, the problem description, and a large
state-of-the-art bibliography from which you can borrow the set of
standard features commonly used in the task.

CONSTRUCTING THE NERC PROTOTYPES

* The learning software to use is:

  - A machine learning sequential tagger based on local classifiers
    trained for each of the labels. Here, we can distinguish
    between greedy local inference (with or without left chaining)
    and Viterbi-style decoding, which optimizes the probability of
    the whole sequence. You may use the YamCha software, which uses
    SVMs as its basic learning module and implements window-based
    feature extraction and sequence labeling. The software is
    available at:

        http://chasen.org/~taku/software/yamcha/

  - SVMstruct. More concretely, the SVM-hmm instantiation, which is
    a large-margin global learning algorithm for sequential
    labeling. The software is available at the following URLs:

        http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
        http://www.cs.cornell.edu/People/tj/svm_light/svm_hmm.html

    Install it and try the basic examples. Using CRFs is an
    alternative.

* Features: the SVM-hmm tool doesn't provide feature extraction for
  the target task, so you have to program a feature extractor that
  encodes the training/test examples to feed the learner. YamCha,
  on the other hand, provides basic feature extraction from the
  column-based input format. However, you must add extra columns
  for any feature other than exact word forms and POS tags. In
  general, you may base your features on those of the
  state-of-the-art systems from CoNLL-2002. In principle, don't
  worry about generating features from external sources (e.g.,
  gazetteers, trigger words, etc.). Program only a core of basic
  features for the task (contextual and also orthographic).
  Gazetteers and trigger words can be included as an optional
  extension. Contact the work advisor if you want to do so.

NOTE-2: YamCha uses polynomial kernels to combine features, but
SVM-hmm is a linear classifier, so consider encoding some features
as conjunctions of basic features (e.g., n-grams of POS tags and
words) if you want to increase performance.

NOTE-3: make all features binary (e.g., "the_previous_word_is_Mr.").

NOTE-4: you probably want to filter out low-frequency features
(simply set a threshold N and discard the features occurring fewer
than N times in the training set). Apart from discarding less
informative features, the aim is to keep the training set size
reasonably small and make training feasible on a standard PC. If
training is very slow, you can work with a subset of the training
set for all the experiments required below.

NOTE-5: proceed incrementally, encoding a very small set of
features first and then designing the more sophisticated ones. Use
the development set to test the improvements of the system.

NOTE-6: you also have to program a simple format converter that
takes the output of SVM-hmm and produces input for the official
CoNLL-2002 scoring software.

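As an illustration of NOTE-2, NOTE-3 and NOTE-4, the following Python
sketch extracts binary window features, one POS-bigram conjunction,
and applies a frequency threshold. It is a minimal sketch under stated
assumptions, not a reference solution: the function names, the feature
naming scheme, the window size of 2 and the toy sentences are all
hypothetical. The line layout produced by format_line ("TAG qid:N
FEAT:VAL ...") is meant to match the SVM-hmm input format; please check
it against the documentation of the version you install.

```python
from collections import Counter


def token_features(sent, i, window=2):
    """Return the set of binary feature names for token i.

    sent is a list of (word, pos) pairs. Feature names are hypothetical.
    """
    feats = set()
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(sent):
            word, pos = sent[j]
            feats.add(f"w[{off}]={word.lower()}")
            feats.add(f"pos[{off}]={pos}")
        else:
            feats.add(f"w[{off}]=<PAD>")  # position outside the sentence
    word, pos = sent[i]
    # Orthographic features of the current word.
    if word[:1].isupper():
        feats.add("cap")
    if word.isupper():
        feats.add("allcaps")
    if any(c.isdigit() for c in word):
        feats.add("hasdigit")
    # Conjunction feature (NOTE-2): POS bigram of previous and current token.
    if i > 0:
        feats.add(f"pos[-1]|pos[0]={sent[i - 1][1]}|{pos}")
    return feats


def build_feature_index(sentences, min_count):
    """Map each feature seen at least min_count times to a 1-based id (NOTE-4)."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent)):
            counts.update(token_features(sent, i))
    kept = sorted(f for f, c in counts.items() if c >= min_count)
    return {f: k + 1 for k, f in enumerate(kept)}


def encode(sent, i, index):
    """Return the sorted ids of the retained features that fire on token i."""
    return sorted(index[f] for f in token_features(sent, i) if f in index)


def format_line(tag_id, sent_id, feat_ids):
    """Build one example line; the layout is assumed, see the note above."""
    return f"{tag_id} qid:{sent_id} " + " ".join(f"{k}:1" for k in feat_ids)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    train = [
        [("El", "DA"), ("Sr.", "NC"), ("Pérez", "NP")],
        [("El", "DA"), ("gato", "NC")],
    ]
    index = build_feature_index(train, min_count=2)
    for sent_id, sent in enumerate(train, start=1):
        for i in range(len(sent)):
            # Tag id 1 is a placeholder; real ids come from a tag-to-integer map.
            print(format_line(1, sent_id, encode(sent, i, index)))
```

Because every feature is binary, each retained feature is written with
value 1 and absent features are simply omitted. Features unseen in
training (or filtered out by the threshold) are dropped at test time by
the same encode function.
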
(suggested) CONCRETE ASPECTS OF THE EVALUATION:

[a] Accuracy on the task: report the best accuracy values obtained
    and compare them to the state of the art (see the CoNLL-2002
    webpage; use the official scorer).

[b] Parameter tuning: report the results of experiments with
    different parameter settings. Are the methods stable, or does
    their performance depend heavily on the parameterization?

[c] Impact of feature types: report the results of your system
    trained with each family of features separately. Where does
    the biggest contribution come from?

[d] Efficiency: report training/testing times for all experiments.
    Are the learning approaches feasible/efficient?

[e] Training set size: provide a learning curve showing performance
    for increasing training set sizes (e.g., 1%, 5%, 10%, 20%, ...,
    100%). Is the training set size critical?

Proposals from students that deviate from the roadmap suggested
above are also welcome, but please contact the work advisor in
advance to obtain authorization.

================================================================