PLN-PMT: Natural Language Processing for Massive Textual Data Management
UPC-URV-UB Master in Artificial Intelligence

2011 Fall semester

    2012/01/02: Examination days have changed (January 17: presentation of readings; January 24: presentation of the shared task system)
    2011/12/14: Examination days have been fixed (January 18: presentation of readings; January 25: presentation of the shared task system)
    2011/11/30: Papers on complementary readings have been posted.
    2011/11/27: There is a swap of instructors in the classes this week (Monday 28: Jordi Turmo; Wednesday 30: Lluís Màrquez). We'll discuss the details on the practical work next Wednesday, November 30; please bring the articles and the information about the corpus you might have collected; it is important that we have full attendance for that class.
    2011/10/02: The course starts next Monday, October 3 2011, at 15:00 (Lluís Màrquez; Introduction)
    2011/10/02: Slides for the first sessions on Information Extraction and Machine Learning have been posted
    2011/10/02: The Web page has been set up; welcome to the course! (announcements are also posted at the "FIB Racó" for this course)


    Monday: 15:00 - 17:00
    Wednesday: 15:00 - 17:00
    Course start: October 3, 2011
    Room: S215, Omega building, Campus Nord UPC


    Lluís Màrquez (LM, classes on Mondays; main instructor)
    (Campus Nord, Omega-S120)
    Jordi Turmo (JT, classes on Wednesdays)
    (Campus Nord, Omega-S113)


The main goal of this course is to provide students with in-depth knowledge of the techniques, methods and tools, both symbolic and empirical, of Natural Language Processing (NLP). The course focuses on systems that analyze and process massive quantities of textual data. Applications in this domain usually work in batch mode and operate on the Internet and on very large textual databases. After taking this course, we expect students to be familiar with the basic bibliography of this area of NLP and to have the skills needed to carry out future in-depth research on any of the themes covered by the course. In addition, the range of applications studied allows students to bridge the gap between the language technologies studied and the real-world applications in which they take part. A final goal of the course is to present the most active research areas within its topics.

This course is closely coupled with the course covering Natural Language applications for person-machine communication (Natural Language Processing for Human-Machine Communication). By taking both courses, students will gain sufficient knowledge of the two basic paradigms of NLP in the framework of the two most frequent scenarios.

Find a full description of the course and the evaluation method here (an even more complete description in Catalan)

Detailed program

1. Introduction (5%)
    1.1 The necessity of automatically processing massive quantities of textual data. 
          Main applications in this domain.

2. Advanced Topics in Machine Learning (30%)
    2.1 Review of the main concepts of Machine Learning

    2.2 Discriminative Learning Methods: Boosting, Support Vector Machines

    2.3 Machine Learning for relational and structured prediction

    2.4 Semi-supervised Learning: Bootstrapping, co-training and variants (not covered)

3. Generic Subtasks (20%)
    3.1 Partial parsing: chunking and clause boundary detection
    3.2 Word Sense Disambiguation (not covered)
    3.3 Semantic Role Labeling

4. Information Extraction (40%)
    Typology, adaptability, multilinguality, evaluation

5. Other Applications (5%)
    5.1 Document Categorization: thematic classification, using hierarchies of concepts
          from the Web, subjective classification (intention, sentiment, etc.)
    5.2 Automatic Summarization: single document, multi-document, multilingual (not covered)


    October 2011
       3 (LM), 5 (JT), 10 (LM), 17 (LM), 19 (JT), 24 (LM), 26 (JT)
    November 2011
       2 (JT), 7 (LM), 9 (JT), 14 (LM), 16 (JT), 21 (LM), 23 (JT), 28 (JT), 30 (LM)
    December 2011
       12 (LM), 14 (JT), 19 (LM), 21 (JT)
    January 2012
       17: presentation and discussion of students' complementary readings
       24: public presentation of students' practical works

Course materials

Main package of slides for topics in points 1, 2 and 3
Slides on Information Extraction (point 4):

    Introduction and architectures of IE systems
    Multilinguality and Evaluation
    IE system adaptability-I
    IE system adaptability-II

Complementary readings

  1. A Unified Model of Phrasal and Sentential Evidence for Information Extraction (Siddharth Patwardhan and Ellen Riloff; EMNLP 2009)
  2. Template-Based Information Extraction without the Templates (Nathanael Chambers and Dan Jurafsky; ACL 2011)
  3. Active Learning Selection Strategies for Information Extraction (Aidan Finn and Nicholas Kushmerick; Workshop Adaptive Text Extraction and Mining 2003)
  4. Unsupervised Relation Extraction by Massive Clustering (Edgar Gonzàlez and Jordi Turmo; IEEE International Conference on Data Mining 2009)
  5. Convolutional Kernels applied to NLP problems. Choose one of the following three: syntactic/semantic kernels applied to Textual Entailment (Mehdad et al., 2010; NAACL), Question Answering (Moschitti, 2009; EACL), or opinion detection (Johansson and Moschitti, 2010; COLING). Original paper on Tree Kernels: Collins and Duffy, 2002; ACL
  6. Search-based Structured Prediction (Hal Daumé III, John Langford and Daniel Marcu; Machine Learning Journal 2009)
  7. Dual Decomposition for Parsing with Non-Projective Head Automata (Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola and David Sontag; EMNLP 2010)
    Some guidelines for the presentation
    Date and hour: January 17, 15:00 - 18:00
    Room: TBA

Practical works

    We have formed a team with all students of the course to participate in the Spatial Role Labeling task of SemEval 2012
    Check the following tarball with a set of related papers:

    Some guidelines for the presentation
    Date and hour: January 24, 15:00 - 17:00
    Room: TBA


Natural Language Processing
* R. Dale, H. Moisl, H. Somers (eds.). Handbook of Natural Language Processing, Marcel Dekker, New York, 2000.
* D. Jurafsky, J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, Upper Saddle River, N.J., 2000.
* C. Manning, H. Schütze. Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Mass., 1999.
* R. Mitkov (ed.). The Oxford Handbook of Computational Linguistics, Oxford University Press, 2004.

Machine Learning
* N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
* T. Hastie, R. Tibshirani and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
* Tom Mitchell, Machine Learning, McGraw Hill, 1997.
* J. Hernández-Orallo, M. J. Ramírez-Quintana, C. Ferri. Introducción a la Minería de Datos, Prentice Hall / Addison-Wesley, 2004.

Surveys/Tutorials on techniques, tasks, and applications
* Xavier Carreras, Lluís Màrquez, and Enrique Romero. Máquinas de Vectores Soporte. Chapter in Introducción a la Minería de Datos, J. Hernández, M. J. Ramírez and C. Ferri (eds.), Pearson Prentice Hall, 353-382.
* C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
* Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1-40.
* L. Màrquez, G. Escudero, D. Martínez and G. Rigau. Supervised Corpus-based Methods for Word Sense Disambiguation. Chapter in Eneko Agirre and Phil Edmonds (Eds.) Word Sense Disambiguation. Algorithms and Applications, Kluwer, 2006 (draft version available).
* J. Turmo, A. Ageno, N. Català (2006). Adaptive Information Extraction. ACM Computing Surveys, vol. 38, issue 2. (draft version in pdf)
* Fabrizio Sebastiani. Text categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109-129.
* Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
* Alonso, Laura; Castellon, Irene; Climent, Salvador; Fuentes, María; Padró, Lluís; Rodríguez, Horacio (2003). Approaches to Text Summarization: Questions and Answers. Revista Iberoamericana de Inteligencia Artificial (November 2003). Special Issue on Multilingual Information Access.
* Mani, Inderjeet. Automatic Summarization. John Benjamins, xi+285pp, paperback ISBN 1-58811-060-5, Natural Language Processing, 3, 2001.

If you need more information don't hesitate to email us: {lluism,turmo}
Last Update:  October 02, 2011