PLN-PMT: Natural Language Processing for Massive Textual Data Management
Master in Artificial Intelligence

First term 2009/10


News
 
    2009/09/16: Detailed program and the main materials for the course have been posted  
    2009/09/16: Classes will start next week (October 22 and 23)
    2009/09/15: The Web page has been set up; welcome to the course!


Timetable


    Thursday: 11:00 - 13:00
    Friday: 12:00 - 14:00
    Course start: October 22nd 2009
    Rooms: (Thursdays) A5104, UPC, Campus Nord
                 (Fridays) A6105, UPC, Campus Nord

Advisors

    Lluís Màrquez (LM, classes on Thursdays)
    (Campus Nord, Omega-S120, lluism@lsi.upc.edu)
    Jordi Turmo (JT, classes on Fridays)
    (Campus Nord, Omega-215, turmo@lsi.upc.es)

Summary


The main goal of this course is to provide students with in-depth knowledge of the techniques, methods and tools, both symbolic and empirical, of Natural Language Processing (NLP). The course focuses on systems that analyze and process massive quantities of textual data. Applications in this domain usually work in batch mode and have the Internet and very large textual databases as their basic framework. After taking this course, we expect students to be familiar with the basic bibliography of this area of NLP and to have the capacity and skills to carry out in-depth research in any of the themes covered by the course. In addition, the range of applications studied allows students to bridge the gap between the language technologies studied and the real-world applications in which they are used. A final goal of the course is to present the most active research areas within the topics of the course.

This course is closely coupled with the course on Natural Language applications for person-machine communication (Natural Language Processing for Human-Machine Communication). By taking both courses, students will gain sufficient knowledge of the two basic paradigms of NLP in the framework of the two most frequent scenarios.

A full description of the course and the evaluation method is available here (an even more complete description is available in Catalan).

Detailed program


1. Introduction (5%)
    1.1 The necessity of automatically processing massive quantities of textual data. 
          Main applications in this domain.

2. Advanced Topics in Machine Learning (30%)
    2.1 Review of the main concepts of Machine Learning

    2.2 Discriminative Learning Methods: Boosting, Support Vector Machines

    2.3 Machine Learning for relational and structured prediction

    2.4 Semi-supervised Learning: Bootstrapping, co-training and variants (not covered this year)

3.  Generic Subtasks (20%)
   3.1 Partial parsing: chunking and clause boundary detection
   3.2 Word Sense Disambiguation (not covered this year)
   3.3 Semantic Role Labeling

4. Information Extraction: typology, adaptability, multilinguality, evaluation (45%)

5. Other Applications (5%)
    5.1. Document Categorization: thematic classification, using hierarchies of concepts
           from the Web, subjective classification (intention, sentiment, etc.)
    5.2. Automatic Summarization: single document, multi-document, multilingual (not covered this year)

Scheduling


    October 2009
       22 (LM), 23 (JT), 29 (LM), 30 (LM)
    November 2009
       5 (LM), 6 (JT), 12 (LM), 13 (JT), 19 (LM), 20 (JT), 26 (LM)
    December 2009
       3 (LM), 4 (JT), 10 (LM), 11 (JT), 17 (LM), 18 (JT)
    January 2010
       14 (tentative): Presentation and discussion of students' complementary readings
       21 (tentative): Public presentation of students' practical works

Course materials


First package for topics in points 1, 2 and 3
Tutorial on Semantic Role Labeling at ACL-IJCNLP 2009 (pdf, bibliography) (point 3.3)
Complementary materials for points 1, 2 and 3
Slides on Information Extraction (point 4):
    Introduction and architectures of IE systems
    Multilinguality and Evaluation
    Adaptability

Presentation of complementary readings


    List of candidate papers (to appear soon)
    Some guidelines for the presentation
 
    Tentative date: January 14, 2010 (morning session)
    Room and hour to be announced

Practical works


    Some guidelines for the presentation
 
    Tentative date: January 21, 2010 (morning session)
    Room and hour to be announced


References


Natural Language Processing
* R. Dale, H. Moisl, H. Somers (eds.). Handbook of Natural Language Processing, Marcel Dekker, New York, 2000.
* D. Jurafsky, James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, Upper Saddle River, N.J., 2000.
* C. Manning, H. Schütze. Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Mass., 1999.
* R. Mitkov (editor). The Oxford Handbook of Computational Linguistics, Oxford University Press, 2004.

Machine Learning
* N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
* Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning. Springer.
* Tom Mitchell, Machine Learning, McGraw Hill, 1997.
* J. Hernández-Orallo, M. J. Ramírez-Quintana, C. Ferri. Introducción a la Minería de Datos, Prentice Hall / Addison-Wesley, 2004.

Surveys/Tutorials on techniques, tasks, and applications
* Xavier Carreras, Lluís Màrquez, and Enrique Romero. Máquinas de Vectores Soporte. Chapter in Introducción a la Minería de Datos, Hernández, J., Ramírez, M. J. and Ferri, C. (eds.), Pearson Prentice Hall, 353-382.
* C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
* Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1-40.
* L. Màrquez, G. Escudero, D. Martínez and G. Rigau. Supervised Corpus-based Methods for Word Sense Disambiguation. Chapter in Eneko Agirre and Phil Edmonds (Eds.) Word Sense Disambiguation. Algorithms and Applications, Kluwer, 2006 (draft version available).
* J. Turmo, A. Ageno, N. Català (2006). Adaptive Information Extraction. ACM Computing Surveys, vol. 38, issue 2. (draft version in pdf)
* Fabrizio Sebastiani. Text categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109-129.
* Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
* Alonso, Laura; Castellón, Irene; Climent, Salvador; Fuentes, María; Padró, Lluís; Rodríguez, Horacio (2003). Approaches to Text Summarization: Questions and Answers. Revista Iberoamericana de Inteligencia Artificial (November 2003). Special Issue on Multilingual Information Access.
* Mani, Inderjeet. Automatic Summarization. John Benjamins, xi+285pp, paperback ISBN 1-58811-060-5, Natural Language Processing, 3, 2001.


If you need more information, don't hesitate to email me: lluism@lsi.upc.es
Last Update:  October 16, 2009