

This page is (and will remain) under construction throughout the
semester. Check regularly for new content.

News
2007/01/28: The course is over!
2007/01/28: Each student is about to receive an email with personal marks
2007/01/28: The slides corresponding to the students' presentations are available
2006/12/14: The presentation of the practical works will take place on January 22, 11-14h, in room Omega-S217
2006/12/14: Slides for the presentation of complementary readings are being posted (see scheduling)
2006/11/29: New slides on Information Extraction have been posted (Adaptability)
2006/11/29: Final assignments and calendar of complementary readings are available here
2006/11/20: New links to SRL materials (point 3.3) have been added
2006/11/17: New slides on Information Extraction (Multilinguality and Evaluation) have been posted
2006/11/13: The slides for "structure learning for NLP" have been updated to completion
2006/11/10: Deadline for selecting the complementary readings: November 20; contact L. Màrquez when you make your choice
2006/11/10: All the remaining complementary readings have been posted (6,7,8)
2006/11/10: The teams for the two practical works are set
2006/10/27: Slight change in the scheduling: point 4.3 suppressed; the sessions devoted to the presentation of students' readings have been extended to Dec. 11, 15, and 18.
2006/10/27: The two practical works are already available
2006/10/06: The course has started; thanks for attending!
2006/09/02: The Web page has been set; welcome to the course!

Timetable
Monday: 12h-14h
Friday: 12h-14h
Course start: October 6th, 2006
Room: S219, UPC, Campus Nord

Advisors
Lluís Màrquez (LM) (Campus Nord, Omega-S120, lluism@lsi.upc.edu)
Jordi Turmo (JT) (Campus Nord, Omega-215, turmo@lsi.upc.es)

Summary
The main goal of this course is to provide students with in-depth
knowledge of the techniques, methods, and tools, both symbolic and
empirical, of Natural Language Processing (NLP). The course focuses on
systems that analyze and process massive quantities of textual data.
Applications in this domain usually work in batch mode and operate over
the Internet and very large text databases. After taking this course,
students are expected to be familiar with the basic bibliography of
this area of NLP and to have the skills needed to carry out in-depth
research on any of the themes covered by the course. In addition, the
range of applications studied allows students to bridge the gap between
the language technologies studied and the real-world applications in
which they are used. A final goal of the course is to present the most
active research areas within its topics.
This course is closely linked to the course covering Natural Language
applications for person-machine communication (Natural Language
Processing for Human-Machine Communication). By taking both courses,
students can acquire sufficient knowledge of the two basic paradigms
of NLP in the framework of the two most frequent scenarios.
A full description of the course and the evaluation method is available here
(an even more complete description is available in Catalan).

Program
1. Introduction
1.1 The necessity of automatically processing massive quantities of textual data. Main applications in this domain.
2. Advanced Topics in Machine Learning
2.0 Review of the main concepts of Machine Learning
2.1 Statistical Methods: Maximum Entropy modeling, MEMMs, Conditional Random Fields
2.2 Discriminative Learning Methods: Boosting, Support Vector Machines
2.3 Learning & Inference for relational and structured domains
2.4 Semi-supervised Learning: Bootstrapping, co-training and variants
3. Generic Tasks
3.1 Partial parsing: chunking and clause boundary detection
3.2 Word Sense Disambiguation
3.3 Semantic Role Labeling
4. Applications
4.1 Information Extraction: typology, adaptability, multilinguality, evaluation
4.2 Document Categorization: thematic classification, using hierarchies of concepts from the Web, subjective classification (intention, sentiment, etc.)
4.3 Automatic Summarization: single document, multi-document, multilingual

Scheduling
October'06: 6 (LM, 1, 2.0), 9 (LM, 2.2), 16 (LM, 2.2), 20 (LM, 2.1), 23 (LM, 2.3), 27 (LM, 2.3), 30 (LM, 2.4)
November'06: 3 (JT, 4.1), 6 (LM, 3.1), 10 (JT, 4.1), 13 (LM, 3.2), 17 (JT, 4.1), 20 (LM, 3.3), 24 (JT, 4.1), 27 (LM, 3.3)
December'06: 1 (JT, 4.1), 4 (LM, 4.2)
11: complementary readings on co-training and multitask learning
15: complementary readings on CRFs and sentiment classification & IE
18: complementary readings on relation extraction and sentiment classification

Download course materials
Session 1: (points 1 and 2.0 of the program)
Introduction to the course
An introductory talk on Machine Learning for NLP (given at UdG in 2003)
An introductory talk on Learning and Inference in NLP problems (given at OSU in 2004)
Sessions 2 and 3: (point 2.2 of the program)
slides on AdaBoost
a talk on SVMs given in the 2002 Summer Course on Machine Learning at UPV/EHU
complementary slides on linear classifiers: Perceptron, Winnow and SNoW
an introduction and a technical paper on AdaBoost (by R. Schapire & Y. Singer)
find here a good application of AdaBoost to Text Classification (BoosTexter; Schapire & Singer)
a survey paper on SVMs and a book chapter (in Spanish); more surveys/tutorials on SVMs here
complementary readings (1): Tree Kernels
original paper (Collins and Duffy, 2002)
an application to Semantic Role Labeling (Moschitti, Pighin and Basili, 2006)
Session 4: (point 2.1 of the program)
slides on the EMNLP-2005 course by Lluís Padró (consider only the MaxEnt section)
other tutorials on Maximum Entropy can be found here
complementary readings (2): Conditional Random Fields
original paper and applications to chunking (Lafferty, McCallum, and Pereira, 2001; Sha and Pereira, 2003)
application to semantic role labeling (Roth and Yih, 2005)
Sessions 5 and 6: (point 2.3 of the program)
structure learning for NLP (second part pending)
complementary slides on generative approaches (an applied example to named entity recognition)
slides on the paper Discovering Entities and Relations: A Linear Programming Formulation (Yih and Roth, CoNLL-2004)
slides from Xavier Carreras' PhD thesis defense
slides on an SVM-based learning algorithm for Natural Language Learning (Michael Collins)
complementary readings (3): Re-ranking
application to parsing (Michael Collins, 2000; Collins and Koo, 2005)
application to semantic role labeling (Toutanova, Haghighi, and Manning, 2005)
complementary readings (4): Multitask learning (via Alternating Structure Optimization)
original formulation of ASO and an application to semi-supervised chunking (Ando and Zhang 2005; journal version at JMLR)
an application to WSD (Ando 2006)
Session 7: (point 2.4 of the program)
complementary readings (5): Co-training and variants
the original paper (Blum and Mitchell, 1998)
two applications to WSD (Mihalcea, 2004; Pham, Ng and Lee, 2005)
Sessions 8, 10, 12, 14, and 16: (point 4.1 of the program)
First set of slides: Introduction and architectures of IE systems
Second set of slides: Multilinguality and Evaluation
Third set of slides: Adaptability
complementary readings (6): Relation Extraction
CRFs applied to relation extraction on the ACE-2005 setting (Cox et al., 2005)
Kernels over SVMs for relation extraction in the ACE-2005 corpus (Zhao and Grishman, 2005)
Session 9: (point 3.1 of the program)
Session 11: (point 3.2 of the program)
Sessions 13 and 15: (point 3.3 of the program)
Automatic Semantic Role Labeling: HLT-NAACL 2006 tutorial by Scott Yih and Kristina Toutanova
Introduction to the CoNLL-2005 shared task (slides in PDF)
Spotlights from the CoNLL-2005 shared task: partial vs full parsing; system combination
Session 17: (point 4.2 of the program)
complementary readings (7): Sentiment classification
Automatic humor recognition (Mihalcea and Strapparava, 2005)
Identifying perspectives of documents and sentences (Lin et al., 2006)
Detection of Opinion Bearing Words and Sentences (Kim and Hovy, 2005)
complementary readings (8): Sentiment classification and Information Extraction
Subjectivity classification for improved Information Extraction (Riloff et al., 2005)
CRFs and extraction patterns for identifying sources of opinions (Choi et al., 2005)
Session 18: (point 4.3 of the program)
Pending downloadable materials will appear a few days in advance of each session (stay tuned)

Practical works

Basic References

Some useful links

Other NLP courses at the AI master
PLN-PMT: Natural Language Processing for Human-Machine Communication (specific web page of the course)

If you need more information, don't hesitate to email me (not necessarily in English :-)
lluism@lsi.upc.es

Last Update: January 15, 2007