----------------------------------------

PLN-PMT: Natural Language Processing for Massive Textual Data Management
Master in Artificial Intelligence
First term 2007/08

----------------------------------------

The contents of this page may change during the semester. Check it regularly!

----------------------------------------
News


2008/01/16: Presentation of students' practical works: January 23rd, 11:00 at Omega S205
2007/12/20: Today's class has been cancelled!
2007/12/20: Slides for the 3rd session of "complementary-readings" have been posted.
2007/12/10: Cancellation of a "complementary-reading" presentation from December 20
2007/12/05: Assignments of the "complementary-reading" presentations have been set (see scheduling below)
2007/11/29: Deadline for selecting the complementary readings: December 5
2007/11/15: Work teams have been set and provided with practical works.
2007/11/15: Practical works have been posted. Form groups and make your choice asap.
2007/11/12: The November 8 session was suspended. It has been moved to Nov. 23. The rest of the scheduling has been shifted accordingly.
2007/10/15: More changes in the course scheduling (see corrections in blue)
2007/10/08: A few changes in October's scheduling (see corrections in red)
2007/10/04: The course has started; thanks for attending!
2007/10/01: The Web page has been set; welcome to the course!

----------------------------------------
Timetable

    Thursday: 12h-14h
    Friday: 10h-12h
    Course start: October 4th 2007
    Room: A1106, UPC, Campus Nord

----------------------------------------
Advisors

Lluís Màrquez (LM)
(Campus Nord, Omega-S120, lluism@lsi.upc.edu )
Jordi Turmo (JT)
(Campus Nord, Omega-215, turmo@lsi.upc.es)

----------------------------------------
Summary

The main goal of this course is to provide students with in-depth knowledge of the techniques, methods and tools, both symbolic and empirical, of Natural Language Processing (NLP). The course focuses on systems that analyze and process massive quantities of textual data. Applications in this domain usually work in batch mode and operate over the Internet and very large textual databases. After taking this course, we expect students to be familiar with the basic bibliography of this area of NLP and to have the capacity and skills to carry out in-depth research on any of the themes covered by the course. Also, the range of applications studied allows students to bridge the gap between the language technologies studied and the real-world applications in which they take part. A final goal of the course is to present the most active research areas within the topics of the course.

This course is closely coupled with the course covering Natural Language applications for person-machine communication (Natural Language Processing for Human-Machine Communication). By taking both courses, students will acquire sufficient knowledge of the two basic paradigms of NLP in the framework of the two most frequent scenarios.

Find a full description of the course and the evaluation method here (an even more complete description in Catalan)

----------------------------------------
Program

1. Introduction

    1.1 The necessity of automatically processing massive quantities of textual data. 
          Main applications in this domain.

2. Advanced Topics in Machine Learning

    2.0 Review of the main concepts of Machine Learning
    2.1 Statistical Methods: Maximum Entropy modeling: MEMMs; Conditional Random Fields
    2.2 Discriminative Learning Methods: Boosting, Support Vector Machines
    2.3 Learning & Inference for relational and structured domains
    2.4 Semi-supervised Learning: Bootstrapping, co-training and variants

3.  Generic Tasks

   3.1 Partial parsing: chunking and clause boundary detection
   3.2 Word Sense Disambiguation
   3.3 Semantic Role Labeling

4. Applications

    4.1. Information Extraction: typology, adaptability, multilinguality, evaluation
    4.2. Document Categorization: thematic classification, using hierarchies of concepts
           from the Web, subjective classification (intention, sentiment, etc.)
    4.3. Automatic Summarization: single document, multi-document, multilingual <not covered this year>

----------------------------------------
Scheduling

October 2007:
    4 (LM, 1,2.0), 5 (JT, 4.1), 11 (LM, 2.2), 18 (JT, 4.1), 19 (LM, 2.2), 25 (LM, 2.1), 26 (JT, 4.1),  31 (JT, 4.1)

November 2007:
    8 <suspended>, 9 (LM, 2.3), 15 (LM, 2.3), 16 (LM, 2.4), 22 (LM, 3.1), 23 (JT, 4.1), 29 (LM, 3.2), 30 (LM, 3.3)

December 2007:
    13 (LM, 3.3)
    14: Dependency Parsing (10-11h) and Sentiment Classification (11-12h)
    20: Relation Extraction (12-13h) and Sentiment Classification and IE (13-14h) (cancelled!)
    21: Multitask learning (10-11h) and Co-training (11-12h)
    [find the final assignment of readings and the guidelines here]

January 2008:
    23: Public presentation of students' practical works: 11:00-13:00 at Omega S205
    find some guidelines here
  

----------------------------------------
Download course materials

Session 1: (points 1 and 2.0 of the program)
   Introduction to the course
   An introductory talk on Machine Learning for NLP (Given at UdG in 2003)
   An introductory talk on Learning and Inference in NLP problems (Given at OSU in 2004)

Sessions 2, 5, 7, 8, and 10: (point 4.1 of the program)
    First set of slides: Introduction and architectures of IE systems
    Second set of slides: Multilinguality and Evaluation
    Third set of slides: Adaptability

    complementary readings (1): Relation Extraction 

      CRFs applied to relation extraction on the ACE-2005 setting (Cox et al., 2005)
      Kernels over SVMs for relation extraction in the ACE-2005 corpus (Zhao and Grishman, 2005)

Sessions 3 and 4: (point 2.2 of the program)
    slides on AdaBoost
    a talk on SVMs given in the 2002 Summer Course on Machine Learning at UPV/EHU
    complementary slides on linear classifiers: Perceptron, Winnow and SNoW
    an introduction and a technical paper on AdaBoost (by R. Schapire & Y. Singer, 2004)
    find here a good application of AdaBoost to Text Classification (Boostexter; Schapire & Singer 2000)
    a survey paper on SVMs and a book chapter (in Spanish); more surveys/tutorials on SVM here

    complementary readings (2): Tree Kernels
    original paper (Collins and Duffy, 2002)
    an application to Semantic Role Labeling (Moschitti, Pighin and Basili, 2006)

Session 6: (point 2.1 of the program)  [postponed]
    Slides on ME and CRFs
    slides on the EMNLP-2005 course by Lluís Padró (consider only the MaxEnt section)
     other tutorials on Maximum Entropy can be found here

     complementary readings (3):  Conditional Random Fields
     original paper and applications to chunking (Lafferty, McCallum, and Pereira, 2001; Sha and Pereira, 2003)
     application to semantic role labeling (Roth and Yih, 2005)

Sessions 9 and 11: (point 2.3 of the program)
    Structure learning for NLP
    complementary slides on generative approaches (an applied example to named entity recognition)
     slides on the paper Discovering Entities and Relations: A Linear Programming Formulation (Yih and Roth, CoNLL-2004)
     slides from Xavier Carreras' PhD thesis defense
     slides on a SVM-based learning algorithm for Natural Language Learning (Michael Collins)

     complementary readings (4): Re-ranking
     application to parsing (Michael Collins, 2000; Collins and Koo, 2005)
     application to semantic role labeling (Toutanova, Haghighi, and Manning, 2005)

     complementary readings (5): Multitask learning (via Alternating Structure Optimization)
     original formulation of ASO and an application to semi-supervised chunking (Ando and Zhang 2005; journal version at JMLR)
     an application to WSD (Ando 2006)
     See also the related paper on Structural Correspondence Learning (Blitzer, McDonald and Pereira, 2006). Similar ideas applied to domain adaptation.

Session 12: (point 2.4 of the program)
   Slides on semi-supervised learning

    complementary readings (6): Co-training and variants
     the original paper (Blum and Mitchell, 1998)
     two applications to WSD (Mihalcea, 2004; Pham, Ng and Lee, 2005)


Session 13: (point 3.1 of the program)
Session 14: (point 3.2 of the program)
   Partial Parsing and WSD are covered by examples in previous materials 

Sessions 15 and 16: (point 3.3 of the program)
    Semantic Role Labeling
    Automatic Semantic Role Labeling HLT-NAACL 2006 tutorial by Scott Yih and Kristina Toutanova.
     Introduction to the CoNLL-2005 shared task  (slides in PDF)
     Spotlights from CoNLL-2005 shared task: partial vs full parsing; system combination

Session 17: (point 4.2 of the program) [cancelled]
    Document Categorization

    complementary readings (7): Sentiment classification
     Automatic humor recognition (Mihalcea and Strapparava, 2005)
    Identifying perspectives at the document and sentence levels (Lin et al., 2006)
    Detection of Opinion Bearing Words and Sentences (Kim and Hovy, 2005)

    complementary readings (8): Sentiment classification and Information Extraction
    Exploiting Subjectivity Classification to Improve Information Extraction (Riloff et al., 2005)
    CRFs and extraction patterns for identifying sources of opinions (Choi et al., 2005)


----------------------------------------
Practical works

[1] Named Entity Recognition and Classification: Comparing Learning Approaches and Feature Sets (advisor: LM)
[2] Semantic Role Labeling. Combining outputs (advisor: LM)
[3] Study of different feature sets to learn perceptrons useful for extracting the ACE mentions of relations (advisor: JT)

All works are to be carried out by teams of two students.

----------------------------------------
Basic References

Natural Language Processing
* R. Dale, H. Moisl, H. Somers (eds.). Handbook of Natural Language Processing, Marcel Dekker, New York, 2000.
* D. Jurafsky, J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, Upper Saddle River, N.J., 2000.
* C. Manning, H. Schütze. Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Mass., 1999.
* R. Mitkov (ed.). The Oxford Handbook of Computational Linguistics, Oxford University Press, 2004.

Machine Learning
* N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
* T. Hastie, R. Tibshirani and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
* Tom Mitchell, Machine Learning, McGraw Hill, 1997.
* J. Hernández-Orallo, M. J. Ramírez-Quintana, C. Ferri. Introducción a la Minería de Datos, Prentice Hall / Addison-Wesley, 2004.

Surveys/Tutorials on techniques, tasks, and applications
* Xavier Carreras, Lluís Màrquez, and Enrique Romero. Máquinas de Vectores Soporte. Chapter in Introducción a la Minería de Datos, Hernández, J., Ramírez, M. J. and Ferri, C. (eds.), Pearson Prentice Hall, 353-382.
* C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
* Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1-40.
* L. Màrquez, G. Escudero, D. Martínez and G. Rigau. Supervised Corpus-based Methods for Word Sense Disambiguation. Chapter in Eneko Agirre and Phil Edmonds (Eds.) Word Sense Disambiguation. Algorithms and Applications, Kluwer, 2006 (draft version available).
* J. Turmo, A. Ageno, N. Català (2006). Adaptive Information Extraction. ACM Computing Surveys, vol. 38, issue 2. (draft version in pdf)
* Fabrizio Sebastiani. Text categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109--129.
* Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
* Alonso, Laura; Castellon, Irene; Climent, Salvador; Fuentes, María; Padró, Lluís; Rodríguez, Horacio (2003). Approaches to Text Summarization: Questions and Answers. Revista Iberoamericana de Inteligencia Artificial (November 2003). Special Issue on Multilingual Information Access.
* Mani, Inderjeet. Automatic Summarization. John Benjamins, xi+285pp, paperback ISBN 1-58811-060-5, Natural Language Processing, 3, 2001.

 
----------------------------------------
Some useful links

Research groups/institutions/organizations/etc.
* Association for Computational Linguistics ACL
* ACL Anthology
* The ACL wiki
* Information Society Technology IST
* Oficina del Español en la Sociedad de la Información OESI
* Sociedad Española para el procesamiento del lenguaje natural SEPLN
* TALP Research Center (UPC)
* Research Group on Natural Language Processing (GPLN), LSI-UPC
* Cognitive Computation Group (UIUC): Demos page
* Portal on Support Vector Machines and Kernel Methods
* Automatic Content Extraction (ACE)
* Document Understanding Conferences (DUC)
* CoNLL conferences and shared tasks
* A bibliography on Boosting (R. Schapire)

Resources and Toolkits for Natural Language Processing
* Stanford University NLP Resources
* FreeLing 1.5: Open Source suite of Language Analyzers
* SVMTool: Open Source generator of sequential taggers based on Support Vector Machines
* YamCha: tagger for sequential structures
* Natural Language Toolkit, NLTK
* OpenNLP
* TnT--Statistical Part-of-Speech Tagging

Machine Learning Toolkits
* Maximum Entropy Modeling
* MALLET: Advanced Machine Learning for Language
* Software on SVMs and Kernel Machines
* WEKA: Machine Learning and Data Mining Suite
* SVMstruct: Support Vector Machine for Complex Outputs
* TiMBL: Tilburg Memory Based Learner
* The SNoW Learning Architecture
* Fast Transformation-Based Learning Toolkit (fnTBL)

----------------------------------------
Other NLP courses at the AI master

PLN-PMT: Natural Language Processing for Human-Machine Communication (specific web page of the course)

----------------------------------------
If you need more information don't hesitate to email me (not necessarily in English :-)
  lluism@lsi.upc.es

----------------------------------------
Last Update:  December 13, 2007