PLN-PMT: Natural Language Processing for Massive Textual Data Management
Master in Artificial Intelligence
First term 2006/07


This page is (and will remain) under construction throughout the semester. Check regularly for new content.


2007/01/28: The course is over!
2007/01/28: Each student is about to receive an email with personal marks
2007/01/28: The slides corresponding to the students' presentations are available
2006/12/14: The presentation of the practical works will take place on January 22, 11-14h, in room Omega-S217
2006/12/14: Slides for the presentation of complementary readings are being posted (see scheduling)

2006/11/29: New slides on Information Extraction have been posted (Adaptability)

2006/11/29: Final assignments and calendar of complementary readings are available here

2006/11/20: New links to SRL materials (point 3.3) have been added

2006/11/17: New slides on Information Extraction (Multilinguality and Evaluation) have been posted

2006/11/13: The slides for "structure learning for NLP" have been updated to completion

2006/11/10: Deadline for selecting the complementary readings: November 20;
                    contact L. Màrquez when you make your choice

2006/11/10: All the remaining complementary readings have been posted (6,7,8)
2006/11/10: The teams for the two practical works are set

2006/10/27: Slight change in the scheduling: point 4.3 suppressed; sessions devoted to the
                    presentation of students' readings have been extended to Dec. 11, 15, and 18.

2006/10/27: The two practical works are already available

2006/10/06: The course has started; thanks for attending!
2006/09/02: The Web page has been set; welcome to the course!


    Monday: 12h-14h
    Friday: 12h-14h
    Course start: October 6th 2006
    Room: S219, UPC, Campus Nord


Lluís Màrquez (LM)
(Campus Nord, Omega-S120, lluism@lsi.upc.edu )
Jordi Turmo (JT)
(Campus Nord, Omega-215, turmo@lsi.upc.es)


The main goal of this course is to provide students with in-depth knowledge of the techniques, methods and tools, both symbolic and empirical, of Natural Language Processing (NLP). The course focuses on systems that analyze and process massive quantities of textual data. Applications in this domain usually work in batch mode and operate on the Internet and on very large textual databases. After taking this course, we expect students to be familiar with the basic bibliography of this area of NLP and to have the skills needed to carry out in-depth research on any of the themes covered by the course. The range of applications studied also allows students to bridge the gap between the language technologies studied and the real-world applications that use them. A final goal of the course is to present the most active research areas within its topics.

This course is closely coupled with the course covering Natural Language applications for person-machine communication (Natural Language Processing for Human-Machine Communication). By taking both courses, students will gain sufficient knowledge of the two basic paradigms of NLP in the framework of the two most frequent scenarios.

Find a full description of the course and the evaluation method here (an even more complete description in Catalan)


1. Introduction

    1.1 The necessity of automatically processing massive quantities of textual data. 
          Main applications in this domain.

2. Advanced Topics in Machine Learning

    2.0 Review of the main concepts of Machine Learning
    2.1 Statistical Methods: Maximum Entropy modeling: MEMMs; Conditional Random Fields
    2.2 Discriminative Learning Methods: Boosting, Support Vector Machines
    2.3 Learning & Inference for relational and structured domains
    2.4 Semi-supervised Learning: Bootstrapping, co-training and variants

3.  Generic Tasks

   3.1 Partial parsing: chunking and clause boundary detection
   3.2 Word Sense Disambiguation
   3.3 Semantic Role Labeling

4. Applications

    4.1. Information Extraction: typology, adaptability, multilinguality, evaluation
    4.2. Document Categorization: thematic classification, using hierarchies of concepts
           from the Web, subjective classification (intention, sentiment, etc.)
    4.3. Automatic Summarization: single document, multi-document, multilingual


    October: 6 (LM, 1, 2.0), 9 (LM, 2.2), 16 (LM, 2.2), 20 (LM, 2.1), 23 (LM, 2.3), 27 (LM, 2.3), 30 (LM, 2.4)

    November: 3 (JT, 4.1), 6 (LM, 3.1), 10 (JT, 4.1), 13 (LM, 3.2), 17 (JT, 4.1), 20 (LM, 3.3), 24 (JT, 4.1), 27 (LM, 3.3)

    December: 1 (JT, 4.1), 4 (LM, 4.2), and the students' presentations of complementary readings (Dec. 11, 15, and 18):
        complementary readings on co-training and multitask learning
        complementary readings on CRFs and sentiment classification & IE
        complementary readings on relation extraction and sentiment classification

    January 22, from 11h to 14h: Public presentation of students' practical works (room Omega-S217)
    (find some instructions here for preparing the presentations)
    1st presentation: chunking (slides)
    2nd presentation: relation extraction (slides, report)

Download course materials

Session 1: (points 1 and 2.0 of the program)
   Introduction to the course
   An introductory talk on Machine Learning for NLP (given at UdG in 2003)
   An introductory talk on Learning and Inference in NLP problems (given at OSU in 2004)

Sessions 2 and 3: (point 2.2 of the program)
    slides on AdaBoost
    a talk on SVMs given in the 2002 Summer Course on Machine Learning at UPV/EHU
    complementary slides on linear classifiers: Perceptron, Winnow and SNoW
    an introduction and a technical paper on AdaBoost (by R. Schapire & Y. Singer)
    find here a good application of AdaBoost to Text Classification (Boostexter; Schapire & Singer)
    a survey paper on SVMs and a book chapter (in Spanish); more surveys/tutorials on SVM here

    complementary readings (1): Tree Kernels
    original paper (Collins and Duffy, 2002)
    an application to Semantic Role Labeling (Moschitti, Pighin and Basili, 2006)

Session 4: (point 2.1 of the program) 
    slides on the EMNLP-2005 course by Lluís Padró (consider only the MaxEnt section)
     other tutorials on Maximum Entropy can be found here

     complementary readings (2):  Conditional Random Fields
     original paper and applications to chunking (Lafferty, McCallum, and Pereira, 2001; Sha and Pereira, 2003)
     application to semantic role labeling (Roth and Yih, 2005)

Sessions 5 and 6: (point 2.3 of the program)
    structure learning for NLP (second part pending)
    complementary slides on generative approaches (an applied example to named entity recognition)
     slides on the paper Discovering Entities and Relations: A Linear Programming Formulation (Yih and Roth, CoNLL-2004)
     slides from Xavier Carreras' PhD thesis defense
     slides on an SVM-based learning algorithm for Natural Language Learning (Michael Collins)

     complementary readings (3): Re-ranking
     application to parsing (Michael Collins, 2000; Collins and Koo, 2005)
     application to semantic role labeling (Toutanova, Haghighi, and Manning, 2005)

     complementary readings (4): Multitask learning (via Alternating Structure Optimization)
     original formulation of ASO and an application to semi-supervised chunking (Ando and Zhang 2005; journal version at JMLR)
     an application to WSD (Ando 2006)

Session 7: (point 2.4 of the program)

     complementary readings (5): Co-training and variants
     the original paper (Blum and Mitchell, 1998)
     two applications to WSD (Mihalcea, 2004; Pham, Ng and Lee, 2005)

Sessions 8, 10, 12, 14, and 16: (point 4.1 of the program)
    First set of slides: Introduction and architectures of IE systems
    Second set of slides: Multilinguality and Evaluation
    Third set of slides: Adaptability

    complementary readings (6): Relation Extraction 
      CRFs applied to relation extraction on the ACE-2005 setting (Cox et al., 2005)
      Kernels over SVMs for relation extraction in the ACE-2005 corpus (Zhao and Grishman, 2005)

Session 9: (point 3.1 of the program)
Session 11: (point 3.2 of the program)

Sessions 13 and 15: (point 3.3 of the program)
    Automatic Semantic Role Labeling: HLT-NAACL 2006 tutorial by Scott Yih and Kristina Toutanova
     Introduction to the CoNLL-2005 shared task  (slides in PDF)
     Spotlights from CoNLL-2005 shared task: partial vs full parsing; system combination

Session 17: (point 4.2 of the program)

    complementary readings (7): Sentiment classification
    Automatic humor recognition (Mihalcea and Strapparava, 2005)
    Identifying perspectives of documents and sentences (Lin et al., 2006)
    Detection of Opinion Bearing Words and Sentences (Kim and Hovy, 2005)

    complementary readings (8): Sentiment classification and Information Extraction
    Subjectivity classification for improved Information Extraction (Riloff et al., 2005)
    CRFs and extraction patterns for identifying sources of opinions (Choi et al., 2005)

Session 18: (point 4.3 of the program)

Pending downloadable materials will appear a few days before each session (stay tuned)

Practical works

[1] Comparative Study of Learning Approaches for Sequential Labeling: A Case Study on Syntactic Chunking

[2] Study of different feature sets to learn SVM models useful for extracting the ACE mentions of relations

Both works are to be carried out by teams of three students.
The teams are already set.

Basic References

Natural Language Processing
* R. Dale, H. Moisl, H. Somers (eds.). Handbook of Natural Language Processing, Marcel Dekker, New York, 2000.
* D. Jurafsky, J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, Upper Saddle River, N.J., 2000.
* C. Manning, H. Schütze. Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Mass., 1999.
* R. Mitkov (ed.). The Oxford Handbook of Computational Linguistics, Oxford University Press, 2004.

Machine Learning
* N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
* T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
* Tom Mitchell, Machine Learning, McGraw Hill, 1997.
* J. Hernández-Orallo, M. J. Ramírez-Quintana, C. Ferri. Introducción a la Minería de Datos, Prentice Hall / Addison-Wesley, 2004.

Surveys/Tutorials on techniques, tasks, and applications
* Xavier Carreras, Lluís Màrquez, and Enrique Romero. Máquinas de Vectores Soporte. Capítulo en Introducción a la Minería de Datos, Hernández, J., Ramírez, M. J., and Ferri, C. (eds.), Pearson Prentice Hall, 353-382.
* C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
* Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1), 1-40.
* L. Màrquez, G. Escudero, D. Martínez and G. Rigau. Supervised Corpus-based Methods for Word Sense Disambiguation. Chapter in Eneko Agirre and Phil Edmonds (Eds.) Word Sense Disambiguation. Algorithms and Applications, Kluwer, 2006 (draft version available).
* J. Turmo, A. Ageno, N. Català (2006). Adaptive Information Extraction. ACM Computing Surveys, vol. 38, issue 2. (draft version in pdf)
* Fabrizio Sebastiani. Text categorization. In Alessandro Zanasi (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 109--129.
* Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
* Alonso, Laura; Castellon, Irene; Climent, Salvador; Fuentes, María; Padró, Lluís; Rodríguez, Horacio (2003). Approaches to Text Summarization: Questions and Answers. Revista Iberoamericana de Inteligencia Artificial (noviembre de 2003). Special Issue on Multilingual Information Access.
* Mani, Inderjeet. Automatic Summarization. John Benjamins, xi+285pp, paperback ISBN 1-58811-060-5, Natural Language Processing, 3, 2001.

Some useful links

Research groups/institutions/organizations/etc.
* Association for Computational Linguistics (ACL)
* ACL Anthology
* The ACL wiki
* Information Society Technology (IST)
* Oficina del Español en la Sociedad de la Información (OESI)
* Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN)
* TALP Research Center (UPC)
* Research Group on Natural Language Processing (GPLN), LSI-UPC
* Cognitive Computation Group (UIUC): Demos page
* Portal on Support Vector Machines and Kernel Methods
* Automatic Content Extraction (ACE)
* Document Understanding Conferences (DUC)
* CoNLL conferences and shared tasks
* A bibliography on Boosting (R. Schapire)

Resources and Toolkits for Natural Language Processing
* Stanford University NLP Resources
* FreeLing 1.5: Open Source suite of Language Analyzers
* SVMTool: Open Source generator of sequential taggers based on Support Vector Machines
* YamCha: tagger for sequential structures
* Natural Language Toolkit, NLTK
* OpenNLP
* TnT--Statistical Part-of-Speech Tagging

Machine Learning Toolkits
* Maximum Entropy Modeling
* MALLET: Advanced Machine Learning for Language
* Software on SVMs and Kernel Machines
* WEKA: Machine Learning and Data Mining Suite
* SVMstruct: Support Vector Machine for Complex Outputs
* TiMBL: Tilburg Memory Based Learner
* The SNoW Learning Architecture
* Fast Transformation-Based Learning Toolkit (fnTBL)

Other NLP courses at the AI master

PLN-PMT: Natural Language Processing for Human-Machine Communication (specific web page of the course)

If you need more information, don't hesitate to email me (not necessarily in English :-)

Last Update:  January 15, 2007