

The contents of this page may change
during the semester. Check it regularly!

News
2008/01/16: Presentation of students'
practical works: January 23rd, 11:00 at Omega S205
2007/12/20: Today's class
has been cancelled!
2007/12/20: Slides
for the 3rd session of "complementary-readings" have been
posted.
2007/12/10: One of the "complementary-reading"
presentations scheduled for December 20 has been cancelled
2007/12/05: Assignments of
the "complementary-reading" presentations have been set (see scheduling
below)
2007/11/29: Deadline for
selecting the complementary
readings: December 5
2007/11/15: Work teams have
been set and provided with practical works.
2007/11/15: Practical works
have been posted. Form groups and make your choice ASAP.
2007/11/12: The November 8
session was suspended. It has been moved to Nov. 23. The rest of the
scheduling has been shifted accordingly.
2007/10/15: More changes in the
course scheduling (see corrections in blue)
2007/10/08: A few changes in
October's scheduling (see corrections in red)
2007/10/04: The course has
started; thanks for attending!
2007/10/01: The Web page has
been set; welcome to the course!

Timetable
Thursday: 12h-14h
Friday: 10h-12h
Course start:
October 4th 2007
Room: A1106, UPC, Campus Nord

Advisors
Lluís Màrquez (LM) (Campus Nord, Omega-S120, lluism@lsi.upc.edu)
Jordi Turmo (JT) (Campus Nord, Omega-215, turmo@lsi.upc.es)

Summary
The main goal of this course is to give students in-depth knowledge
of the techniques, methods and tools, both symbolic and empirical, of
Natural Language Processing (NLP). The course focuses on systems that
analyze and process massive quantities of textual data. Applications in
this domain usually work in batch mode and operate over the Internet and
very large textual databases. After taking this course, we expect
students to be familiar with the basic bibliography of this area of NLP
and to have the skills needed to carry out in-depth research on any of
the themes covered by the course. Also, the range of applications
studied allows students to bridge the gap between the language
technologies studied and the real-world applications in which they take
part. A final goal of the course is to present the most active research
areas within the topics of the course.
This course is closely tied to the course on Natural Language
applications for person-machine communication (Natural Language
Processing for Human-Machine Communication). By taking both courses,
students will gain sufficient knowledge of the two basic paradigms of
NLP in the framework of the two most frequent scenarios.
Find a full description of the course and the evaluation method here
(an even more complete description in Catalan)

Program
1. Introduction
1.1 The necessity of automatically processing
massive quantities of textual data.
Main applications in this domain.
2. Advanced Topics in Machine Learning
2.0 Review of the main concepts of Machine
Learning
2.1 Statistical Methods: Maximum Entropy modeling:
MEMMs; Conditional Random Fields
2.2 Discriminative Learning Methods: Boosting,
Support Vector Machines
2.3 Learning & Inference for relational and
structured domains
2.4 Semi-supervised Learning: Bootstrapping,
co-training and variants
3. Generic Tasks
3.1 Partial parsing: chunking and clause boundary detection
3.2 Word Sense Disambiguation
3.3 Semantic Role Labeling
4. Applications
4.1 Information Extraction: typology, adaptability,
multilinguality, evaluation
4.2 Document Categorization: thematic
classification, using hierarchies of concepts from the
Web, subjective classification (intention, sentiment, etc.)
4.3 Automatic Summarization: single document,
multi-document, multilingual <not covered this year>

Scheduling
October 2007:
4 (LM, 1, 2.0), 5 (JT, 4.1), 11 (LM, 2.2), 18 (JT, 4.1), 19 (LM, 2.2),
25 (LM, 2.1), 26 (JT, 4.1), 31 (JT, 4.1)
November 2007:
8 <suspended>, 9 (LM, 2.3), 15 (LM, 2.3), 16 (LM, 2.4), 22 (LM, 3.1),
23 (JT, 4.1), 29 (LM, 3.2), 30 (LM, 3.3)
December 2007:
13 (LM, 3.3)
14: Dependency Parsing (10-11h) and Sentiment classification (11-12h)
20: Relation Extraction (12-13h) and Sentiment Classification and IE (13-14h) (cancelled!)
21: Multitask learning (10-11h) and Co-training (11-12h)
[find the final assignment of readings and the guidelines here]

Download course materials
Session 1: (points 1 and 2.0 of the program)
Introduction to the course
An introductory talk on Machine Learning for NLP (Given at UdG in 2003)
An introductory talk on Learning and Inference in NLP problems (Given at OSU in 2004)
Sessions 2, 5, 7, 8, and
10: (point 4.1 of the program)
First set of slides: Introduction
and architectures of IE
systems
Second set of slides: Multilinguality
and Evaluation
Third set of slides: Adaptability
complementary
readings (1): Relation Extraction
CRFs applied to relation extraction on the ACE-2005
setting (Cox
et al., 2005)
Kernels over SVMs for relation extraction in the
ACE-2005 corpus (Zhao
and
Grishman, 2005)
Sessions 3 and 4: (point 2.2 of the program)
slides on AdaBoost
a talk on SVMs given in the 2002 Summer Course on Machine Learning at UPV/EHU
complementary slides on linear classifiers: Perceptron, Winnow and SNoW
an introduction and a technical paper on AdaBoost (by R. Schapire & Y. Singer, 2004)
find here a good application of AdaBoost to Text Classification (Boostexter; Schapire & Singer 2000)
a survey paper on SVMs and a book chapter (in Spanish); more surveys/tutorials on SVM here
complementary
readings (2): Tree Kernels
original
paper (Collins
and Duffy, 2002)
an application to Semantic Role Labeling (Moschitti,
Pighin and Basili, 2006)
Session 6: (point 2.1 of the
program) [postponed]
Slides on
ME and CRFs
slides on the
EMNLP-2005
course by Lluís
Padró (consider only the MaxEnt section)
other tutorials on Maximum Entropy can be
found here
complementary
readings (3): Conditional Random Fields
original paper and applications to chunking (Lafferty,
McCallum, and Pereira, 2001; Sha and
Pereira, 2003)
application to semantic role labeling (Roth and
Yih, 2005)
Sessions
9 and 11: (point 2.3
of the
program)
Structure
learning for NLP
complementary
slides on generative approaches (an applied example to named entity
recognition)
slides on the paper Discovering
Entities and Relations: A Linear Programming Formulation (Yih and
Roth, CoNLL-2004)
slides
from Xavier Carreras'
PhD thesis defense
slides on an SVM-based
learning algorithm for Natural Language Learning (Michael Collins)
complementary
readings (4): Re-ranking
application to parsing (Michael
Collins, 2000; Collins
and Koo, 2005)
application to semantic role labeling
(Toutanova,
Haghighi, and Manning, 2005)
complementary
readings (5): Multitask learning (via Alternating Structure
Optimization)
original formulation of ASO and an
application to semi-supervised chunking (Ando
and Zhang 2005; journal
version at JMLR)
an application to WSD (Ando 2006)
See also the related paper on Structural Correspondence Learning (Blitzer,
McDonald and Pereira, 2006). Similar ideas applied to domain
adaptation.
Session 12: (point 2.4 of the
program)
Slides on
semi-supervised learning
complementary
readings (6): Co-training and variants
the original paper (Blum
and Mitchell, 1998)
two applications to WSD (Mihalcea,
2004; Pham,
Ng and Lee, 2005)
Session 13:
(point 3.1 of the program)
Session 14: (point 3.2 of the
program)
Partial Parsing and WSD are covered by examples in
previous materials
Sessions
15 and 16: (point 3.3 of the program)
Semantic
Role Labeling
Automatic
Semantic Role Labeling HLT-NAACL 2006 tutorial by Scott Yih and
Kristina Toutanova.
Introduction
to the CoNLL-2005 shared task (slides
in PDF)
Spotlights from CoNLL-2005 shared task: partial
vs full parsing; system
combination
Session
17: (point 4.2 of the program) [cancelled]
Document
Categorization
complementary
readings (7): Sentiment classification
Automatic humor recognition (Mihalcea
and Strapparava, 2005)
Identifying perspectives of document and
sentences (Lin
et al., 2006)
Detection of Opinion Bearing Words and Sentences (Kim
and Hovy, 2005)
complementary
readings (8): Sentiment classification and Information Extraction
Exploiting Subjectivity Classification to
Improve Information Extraction (Riloff
et al., 2005)
CRFs and extraction patterns for identifying sources
of opinions (Choi
et al., 2005)

Practical works


Basic References

Some useful links

Other NLP courses at the AI master
PLN-PMT: Natural
Language Processing for Human-Machine Communication (specific web
page of the course)

If you need more information, don't hesitate to email me (not
necessarily in English :-)
lluism@lsi.upc.es

Last Update: December 13, 2007