Natural Language Processing for Massive Textual Data Management (PLN-PMT)

Basic Information
=================

Course type: optional
Semester: third
ECTS: 6 (12 "punts de docència")
Periodicity: annual
UPC structural unit responsible for the course: Software Department (LSI)
Teachers: Lluís Màrquez (main), Jordi Turmo
Language: Catalan / Spanish / English
Requirements: the following subjects must be completed before taking PLN-PMT: Processament del Llenguatge Natural, Aplicacions de la Intel·ligència Artificial, Aprenentatge

Description and Goals
=====================

The main goal of this course is to provide students with in-depth knowledge of the techniques, methods and tools, both symbolic and empirical, of Natural Language Processing (NLP). The course focuses on systems that analyze and process massive quantities of textual data. Applications in this domain usually work in batch mode and operate on the Internet and on very large textual databases.

After taking this course, students are expected to be familiar with the basic bibliography of this area of NLP and to have the capacity and skills to carry out future in-depth research on any of the themes covered by the course. The range of applications studied also allows students to bridge the gap between the language technologies studied and the real-world applications in which they take part.

This course is closely coupled with the course covering Natural Language applications for person-machine communication (Natural Language Processing for Human-Machine Communication). By taking both courses, students will gain sufficient knowledge of the two basic paradigms of NLP in the framework of the two most frequent scenarios.

A final goal of the course is to present the most active research areas within the topics of the course.

Contents
========

The content of the course is organized into three main blocks:

(1) The most representative applications based on massive processing of textual data. These applications are currently used mainly for Internet processing and the automatic organization of very large document databases, but they are still the focus of very active research. The specific applications covered by the course are: Document Categorization, Information Extraction, and Automatic Summarization.

(2) Basic generic tasks, which are useful for the applications listed above, among others. We will cover only those that have not been introduced in previous mandatory courses of the Master. Specifically, the generic tasks studied will be: partial parsing, word sense disambiguation, and semantic role labeling.

(3) Advanced Machine Learning techniques for Natural Language Processing. These algorithms and techniques are very useful for implementing most of the generic tasks described in the previous point. By "advanced" we mean that the ML topics covered extend the basic techniques that students already know from the previous mandatory courses of the Master.

The table of contents is structured in four themes. The number accompanying each title indicates the approximate percentage of the course devoted to the corresponding theme. As can be seen, the main focus will be on the applications.

Table of Contents
=================

1. Introduction (5%)
   1.1 The need for automatically processing massive quantities of textual data. Main applications in this domain.

2. Advanced Topics in Machine Learning (30%)
   2.0 Review of the main concepts of Machine Learning
   2.1 Statistical Methods: Maximum Entropy modeling
   2.2 Discriminative Learning Methods: Boosting, Support Vector Machines
   2.3 Learning & Inference for relational and structured domains
   2.4 Semi-supervised Learning: Bootstrapping, co-training and variants

3. Generic Tasks (25%)
   3.1 Partial parsing: chunking, clause boundary detection
   3.2 Word Sense Disambiguation
   3.3 Semantic Role Labeling

4. Applications (40%)
   4.1 Document Categorization: thematic classification, using hierarchies of concepts from the Web, subjective classification (intention, sentiment, etc.)
   4.2 Information Extraction: typology, adaptability, multilinguality, evaluation
   4.3 Automatic Summarization: single document, multi-document, multilingual

Evaluation Method
=================

Three aspects will be subject to evaluation:

1) Development and implementation of a linguistic tool related to the tasks described in points 3 and 4 of the program. 40% of the final score. Group work.

2) Oral defense of the preceding work. 20% of the final score.

3) Writing a survey (state of the art) on one of the suggested themes or, alternatively, an oral presentation and class discussion of one of the complementary readings suggested during the course. 40% of the final score.