What we do, day by day:
March 13
Introduction to the course: brainstorming, what is text mining? what is it useful for?
Evaluation: how do you evaluate what you still don't know? baselines, hypothesis tests
compulsory reading: W. Fan, L. Wallace, S. Rich, Z. Zhang, 2005. Tapping into the power of text mining, Communications of the ACM.
recommended reading: success cases:
- Ashok N. Srivastava and Brett Zane-Ulman. Discovering Recurring Anomalies in Text Reports Regarding Complex Space Systems. NASA Intelligent Systems Division [this one can be presented in class and will count for the final mark]
- some examples of collaborative filtering: Stumble Upon, Menéame
- applications of data mining to other areas of knowledge (not economics): discovery of favorable antibiotic combinations, a breakthrough in interventional radiology
- interview with Usama Fayyad (interviewing Piatetsky Shapiro) (also one in 2001)
- check more success stories from SAS, for example one about mining tourist guides!
- I think you'll enjoy Marti Hearst's essay What is Text Mining?
homework: think about what you would like to do as a course project; find corpora that could be used to obtain some ordered learning.
You can also have a tour:
- http://filebox.vt.edu/users/wfan/text_mining.html
- newsletters on data mining: KDNuggets, SigKDD Explorations (of special interest: Explorations' special issue on Text Mining)
slides: Untangling Text Data Mining, by Marti Hearst
March 15
Natural Language Processing: classical architectures, data-driven solutions
compulsory reading: Introduction to Speech and Language Processing, by D. Jurafsky and J. H. Martin (2000)
recommended reading:
- J. Allen (1987) Natural Language Understanding
- Lluís Màrquez's survey of Machine Learning Applied to Natural Language Processing
homework: download Weka, install it, and get it to run; check that you can open and read the manual
slides: Lluís Padró on Statistical Methods for NLP (we'll be using a significantly reduced version, but do not hesitate to enjoy the whole set ;) ); we will also be using the first part of my talk on free NLP software resources.
March 20
Math Foundations and Linguistic Essentials
compulsory reading: none!
recommended reading: chapters 1 and 2 from Foundations of Statistical NLP.
homework: prepare corpus data to work with Weka in clustering
slides: we will be working with an extract of Lluís Padró's slides; we might use some examples from the slides on Linguistic Essentials to illustrate linguistic phenomena, and we may also resort to some other slides on Math Foundations.
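If you are unsure how to prepare your corpus data for Weka, here is a toy sketch (in Python; the documents, relation name, and attributes are all invented for illustration) that turns a few documents into a term-frequency dataset in Weka's ARFF format:

```python
# Build a tiny term-frequency dataset in Weka's ARFF format.
# The documents and vocabulary below are invented for illustration.
docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
vocab = sorted({w for d in docs for w in d.split()})

lines = ["@relation toy_corpus", ""]
for w in vocab:
    lines.append(f"@attribute {w} numeric")  # one numeric attribute per word
lines += ["", "@data"]
for d in docs:
    counts = [d.split().count(w) for w in vocab]  # term frequencies
    lines.append(",".join(str(c) for c in counts))

arff = "\n".join(lines)
print(arff)
```

A real experiment would of course use many more documents and some preprocessing (lowercasing, stopword removal), but the file layout is the same.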
March 22
Data-driven characterization of linguistic phenomena: clustering
compulsory reading: Clustering, chapter 14 of Foundations of Statistical NLP
recommended reading: chapter on exploratory data analysis from the NIST/SEMATECH e-Handbook of Statistical Methods
homework: get acquainted with Weka's functionalities for clustering (at least read the manual!)
slides: we'll be using Chris Manning's slides on document clustering (also in .pdf), since they follow the chapter quite closely.
March 27 and March 29
Data-driven characterization of linguistic phenomena: clustering to find equivalence classes
compulsory reading: Pantel, P. and Lin, D. 2002. Discovering Word Senses from Text. KDD-02
recommended reading:
- D. Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of ICML-98. Madison, Wisconsin.
- Senellart, P., Blondel, V. D. 2003. Automatic discovery of similar words. In A Comprehensive Survey of Text Mining. Springer-Verlag.
homework: toy experiments with clustering words to find equivalence classes. To do that, you can use a relatively small corpus of Spanish or a relatively big one. You should be able to present succinctly (2-3 minutes) the results of your experiments in class, in one or two weeks' time. A short written report (1-2 pages long) is also requested.
You can also play with Dekang Lin's demo page on similarity of words and discuss the results in class.
if you need some hints for the task...
slides: a bunch of slides of mine commenting on the reading (believe it or not, for once it was me who made the slides!)
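To see the intuition behind clustering words into equivalence classes, here is a toy sketch (this is not Pantel & Lin's CBC algorithm, just the underlying distributional idea, on an invented six-line "corpus"): represent each word by a vector of its neighbouring words and compare the vectors by cosine similarity.

```python
from collections import defaultdict
from math import sqrt

# Invented toy corpus; real experiments need millions of words.
corpus = ("the cat drinks milk . the dog drinks water . "
          "the cat chases the mouse . the dog chases the cat").split()

# Context vectors: co-occurrence counts within a +/-1 word window.
vectors = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            vectors[w][corpus[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# "cat" and "dog" occur in similar contexts (both drink and chase),
# so they come out far more similar than, say, "cat" and "milk".
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

Clustering the words then amounts to grouping together the pairs with high cosine, which is what you can try at toy scale for the homework.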
April 3
Data-driven characterization of linguistic phenomena: clustering to find word associations
compulsory reading: Collocations, chapter 5 of Foundations of Statistical NLP
recommended reading: Ken Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16 (1)
homework: toy experiments with clustering to find word associations
slides: Ines Rehbein's slides on the chapter for the NCLT reading group at the School of Computing at Dublin City University.
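The core measure in the Church & Hanks reading, pointwise mutual information, is easy to try on your own data. A minimal sketch (in Python, on an invented toy text; PMI estimates only become meaningful on large corpora):

```python
from collections import Counter
from math import log2

# Invented toy text; "new york" is the collocation we hope to find.
tokens = ("new york is big . new york is busy . "
          "a new idea . york is old").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2 of how much more often they co-occur than chance predicts."""
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x = unigrams[w1] / N
    p_y = unigrams[w2] / N
    return log2(p_xy / (p_x * p_y))
```

Here `pmi("new", "york")` comes out clearly higher than for an accidental pair like `(".", "new")`, which is the signal you would rank bigrams by in the homework.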
April 5
Data-driven characterization of linguistic phenomena, incorporating knowledge: word sense disambiguation
compulsory reading: Word Sense Disambiguation, chapter 7 of Foundations of Statistical NLP
recommended reading:
- Dan Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL.
- J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18 (3)
homework: present the results of your toy experiments with clustering, possibly comparing them with some of the readings scheduled for today.
April 10
today we'll be discussing this nice paper on unsupervised word sense disambiguation using graphs (Deme is doing the talking)
recommended reading: J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18 (3)
homework: present the results of your toy experiments with clustering -- class discussion, recommendations for handing in short reports after Easter.
slides: we'll be skimming through some of the slides from the tutorial Advances in Word Sense Disambiguation given by Rada Mihalcea and Ted Pedersen at IBERAMIA-2004, ACL-2005, and AAAI-2005
April 12
Data-driven characterization of linguistic phenomena, incorporating knowledge: subcategorization acquisition
compulsory reading: Lexical Acquisition, chapter 8 of Foundations of Statistical NLP
recommended reading:
- Chris Manning. 1993. Automatic acquisition of a large subcategorisation dictionary from corpora. ACL
- T. Briscoe, J. Carroll. 1997. Automatic extraction of subcategorization from corpora. ACL
- Riloff, E. and Shepherd, J. 1997. A corpus-based approach for building semantic lexicons. EMNLP-1997.
homework: think about what you'd like to do as a project, so that you can meditate on it during the holidays.
slides: Manning & Schütze's slides for their chapter on Lexical Acquisition, to be found at the companion website for the book.
April 17
Decision Trees and Association Rules
compulsory reading: Xavier Carreras, Lluís Màrquez and Lluís Padró, Named Entity Extraction using AdaBoost, CoNLL-02
recommended reading: AAAI's site on Decision Trees, with plenty of links and resources; you can also take a look at the Wikipedia article on decision trees.
homework: get to know Weka's functionalities for Decision Trees.
slides: Lluís Màrquez on Decision Tree Induction.
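The quantity that drives decision tree induction is information gain: how much a split on an attribute reduces the entropy of the class labels. A minimal sketch (the data is an invented caricature of named-entity features, not from the CoNLL-02 paper):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(rows, labels, attr):
    """Drop in label entropy after splitting on attribute index attr."""
    total = entropy(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total - remainder

# Invented toy data: (capitalized?, sentence-initial?) -> entity or not.
rows = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
labels = ["NE", "NE", "O", "O", "NE", "O"]
```

On this toy data attribute 0 (capitalization) separates the classes perfectly, so its gain is maximal, while attribute 1 tells us nothing; a tree learner would split on attribute 0 first.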
April 19
Language as a sequence: multiple sequence alignment
compulsory reading: Statistical Alignment and Machine Translation, chapter 13 of Foundations of Statistical NLP
recommended reading: check Patrick Lambert's Bibliography for Statistical Alignment and Machine Translation (strongly recommended), and also:
- Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, March.
- W. A. Gale and Ken Church. 1991. Identifying word correspondences in parallel texts.
- F.J. Och, H. Ney. 2000. Improved Statistical Alignment Models. ACL
homework: skim A Statistical MT Tutorial Workbook, prepared for the JHU summer workshop, and get a good (discussable) idea of it.
just for fun: take a look at alphamalig, the multiple alignment tool with parametrisable distances and alphabets.
slides: the skeleton of the lecture was Nathalie Japkowicz's slides on chapter 13 for her course Natural Language Processing, A Statistical Approach. We saw some examples of difficult cases for alignment, and the IBM models, in Chris Manning's slides for his lecture on Statistical Machine Translation in his course on Natural Language Processing. The slides for dynamic programming techniques applied to sequence alignment were taken from Robert W. Robinson.
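The dynamic programming idea behind sequence alignment fits in a few lines. Here is a minimal sketch of global alignment scoring (Needleman-Wunsch style; the scoring values are arbitrary choices, and it returns only the score, not the alignment itself):

```python
def align(a, b, gap=-1, match=1, mismatch=-1):
    """Global alignment score between two sequences, by dynamic
    programming: score[i][j] is the best score aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a aligned against nothing
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b aligned against nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[m][n]
```

The same table, filled with word tokens instead of characters and a similarity-based substitution cost, is what the sentence- and word-alignment methods in the chapter build on.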
April 24
Language as a sequence: statistical machine translation
compulsory reading: Statistical Alignment and Machine Translation, chapter 13 of Foundations of Statistical NLP
recommended reading: A. Venugopal, S. Vogel, A. Waibel. 2003. Effective Phrase Translation Extraction from Alignment Models. ACL
homework:
please make appointments to come by the office to discuss what you'll be working on in your project
remember that you can submit something to IBERAMIA's TIL workshop...
April 26
Language as a sequence: paraphrase extraction
compulsory reading: Regina Barzilay and Kathy McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL
recommended reading: Regina Barzilay, Lillian Lee. Bootstrapping Lexical Choice via Multiple-Sequence Alignment.
homework:
define the schedule for your project; you should already be reading the state of the art on the subject
May 3
Language as a sequence: Markov models
compulsory reading: Statistical Inference: n-gram Models over Sparse Data, chapter 6 of Foundations of Statistical NLP
recommended reading: M.L. Forcada. 2001. Corpus-based stochastic finite-state predictive text entry for reduced keyboards: application to Catalan. SEPLN
homework:
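The sparse-data problem in chapter 6 shows up even on tiny examples: most bigrams are never observed, so maximum-likelihood estimates assign them probability zero. A minimal sketch of a bigram model with add-one (Laplace) smoothing, on an invented toy text:

```python
from collections import Counter

# Invented toy training text.
tokens = "the cat sat on the mat and the cat slept".split()
vocab = set(tokens)
V = len(vocab)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(w2, w1):
    """Add-one smoothed bigram probability P(w2 | w1): every possible
    continuation gets one phantom count, so nothing has probability zero."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
```

Note that the smoothed probabilities over the vocabulary still sum to one for a given history (for histories that never end the text), and unseen pairs like "mat sat" now get a small but nonzero probability; the chapter discusses better ways of distributing that reserved mass.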
May 8
Data-driven characterization of linguistic phenomena: latent semantic analysis, principal component analysis
compulsory reading: T. K. Landauer, P. W. Foltz, & D. Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25.
recommended reading:
- T.K. Landauer, S.T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review.
- Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. UAI-99.
homework: you can try the Open Source LSA Package for R; you can also take a look at the LSA page of the University of Colorado.
slides: some very nice slides on principal component analysis from a course in Princeton... authorship unknown!
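Computationally, LSA is just a truncated singular value decomposition of the term-document matrix. A minimal sketch (using numpy, which is an extra dependency; the matrix below is an invented toy example in the spirit of the Landauer et al. reading):

```python
import numpy as np

# Invented toy term-document matrix (rows: terms, columns: documents).
terms = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([[1, 0, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# LSA: keep only the k largest singular values of the SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation
```

The rows of `U[:, :k]` (scaled by the singular values) are the k-dimensional "latent" representations of the terms; words that co-occur with similar documents end up close together even when they never co-occur directly.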
May 10
Text as a graph: graph-based algorithms for NLP
compulsory reading: Avrim Blum and Shuchi Chawla. 2001. Learning from Labeled and Unlabeled Data using Graph Mincut. ICML'01.
recommended reading: papers presented at the workshop on graph-based algorithms for NLP (HLT-06), particularly
- Einat Minkov, William Cohen and Andrew Ng. 2006. A Graphical Framework for Contextual Search and Name Disambiguation in Email
- Diego Molla. 2006. Learning of Graph-based Question Answering Rules
homework:
slides: Regina Barzilay's slides on graph-based methods, from her course on Natural Language Processing.
May 15
Linear segmentation and lexical chains
compulsory reading:
- Hearst, M. 1997. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March.
- Morris, J. and Hirst, G. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21-43
recommended reading: Freddy Y. Y. Choi, Peter Wiemer-Hastings, Johanna Moore. 2001. Latent Semantic Analysis for Text Segmentation. Proceedings of 6th EMNLP
slides: some of Noémie Elhadad's slides on Discourse from her course Natural Language Processing (2006) and some of Marti Hearst's slides on discourse processing and text segmentation from her course Applied Natural Language Processing (2004).
by now you should have decided what you will be working on in your project
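To make the block-comparison idea behind TextTiling concrete, here is a toy sketch (a drastic simplification of Hearst's algorithm, on an invented six-sentence "document": compare adjacent units by lexical overlap and put a boundary where the similarity dips):

```python
from math import sqrt

# Invented toy "document": two topics, three sentences each.
sentences = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "the cat slept",
    "stocks fell on monday",
    "the market lost value on monday",
    "investors sold stocks in the market",
]

def bag(sentence):
    """Bag-of-words count vector for one sentence."""
    counts = {}
    for w in sentence.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

# Lexical similarity of each adjacent sentence pair; the deepest dip
# marks the topic shift (between sentences 2 and 3 here).
sims = [cosine(bag(a), bag(b)) for a, b in zip(sentences, sentences[1:])]
boundary = sims.index(min(sims)) + 1
```

Hearst's real algorithm compares multi-sentence blocks, smooths the similarity curve, and picks boundaries by depth scores, but the dip-detection intuition is the same.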
May 17
Multiple Sequence Alignment to discover the structure of natural languages
compulsory reading: Z. Solan, D. Horn, E. Ruppin, and S. Edelman, Unsupervised learning of natural languages, PNAS
recommended reading: W. R. Pearson and D. J. Lipman. 1988. Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
slides: Peter Adriaans on Language Learning and D. Horn on Adios.
the outline of your work and all major decisions must be settled by June, so that you can give an oral presentation of it in front of the class.
May 29
Ontology Induction and Population
compulsory reading: H. Davulcu, S. Vadrevu, S. Nagarajan. OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites. In the First International Workshop on Semantic Web and Databases, September 2003, Berlin, Germany
recommended reading: Pantel, P. 2005. Inducing Ontological Co-occurrence Vectors. ACL-05.
slides: Eduard Hovy's slides for introducing ontologies and Patrick Pantel's slides for inducing ontological co-occurrence vectors.
May 31
Template Induction
compulsory reading: Zvika Marx, Ido Dagan and Eli Shamir. 2002. Cross-component Clustering for Template Induction. Workshop on Text Learning (TextML 2002)
recommended reading:
- Ion Muslea, Steve Minton, Craig Knoblock. 2001. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4.
- Ion Muslea. 1999. Extraction Patterns for Information Extraction Tasks: A Survey. The AAAI-99 Workshop on Machine Learning for Information Extraction
- Riloff, E. and Jones, R. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. AAAI-99.
homework:
DEADLINE for the TIL workshop at IBERAMIA
June 5
Learning of Morphology
compulsory reading: Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics. 27 (2)
recommended reading: Yu Hu, I. Matveeva, J. Goldsmith, C. Sprague. 2005. Using Morphology and Syntax Together in Unsupervised Learning. ACL workshop PsychoCompLA-2005
homework:
June 7
Democratically Elected Topic
compulsory reading:
recommended reading:
homework:
DEADLINE for handing in the papers for the presentation of your work
June 12
presentation of your work (schedule to be defined)
homework: read your classmates' work and prepare to critique it in class
June 14
presentation of your work (schedule to be defined)
homework: read your classmates' work and prepare to critique it in class
wrap-up, beer and the like.