What we do, day by day:
March 13
Introduction to the course: brainstorming, what is text mining? what is it useful for?
Evaluation: how do you evaluate what you still don't know? baselines, hypothesis tests
compulsory reading: W. Fan, L. Wallace, S. Rich, Z. Zhang, 2005. Tapping into the power of text mining, Communications of the ACM.
recommended reading: success cases:
- Ashok N. Srivastava and Brett Zane-Ulman. Discovering Recurring Anomalies in Text Reports Regarding Complex Space Systems. NASA Intelligent Systems Division [this one can be presented in class and will count for the final mark]
- some examples of collaborative filtering: Stumble Upon, Menéame
- applications of data mining to other areas of knowledge (not economics): discovery of favorable antibiotic combinations, a breakthrough in interventional radiology
- interview with Usama Fayyad (interviewing Piatetsky Shapiro) (also one in 2001)
- check more success stories from SAS, for example one about mining tourist guides!
- I think you'll enjoy Marti Hearst's essay What is Text Mining?
homework: think about what you would like to do as a course project; find corpora that could be used to obtain some ordered learning.
You can also have a tour:
- http://filebox.vt.edu/users/wfan/text_mining.html
- newsletters on data mining: KDNuggets, SigKDD Explorations (of special interest: Explorations' special issue on Text Mining)
slides: Untangling Text Data Mining, by Marti Hearst
March 15
Natural Language Processing: classical architectures, data-driven solutions
compulsory reading: Introduction to Speech and Language Processing, by D. Jurafsky and J. H. Martin (2000)
recommended reading:
- J. Allen (1987) Natural Language Understanding
- Lluís Màrquez's survey of Machine Learning Applied to Natural Language Processing
homework: download Weka, install it, and get it to run; check that you can open and read the manual
slides: Lluís Padró on Statistical Methods for NLP (we'll be using a significantly reduced version, but do not hesitate to enjoy the whole set ;) ); we will also be using the first part of my talk on free NLP software resources.
March 20
Math Foundations and Linguistic Essentials
compulsory reading: none!
recommended reading: chapters 1 and 2 from Foundations of Statistical NLP.
homework: prepare corpus data to work with Weka in clustering
slides: we will be working with an extract of Lluís Padró's slides; we might use some examples from the slides on Linguistic Essentials to illustrate linguistic phenomena, and we may also resort to some other slides on Math Foundations.
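If you are unsure how to prepare your corpus data for Weka, here is a toy sketch (in Python; the documents, relation name, and attributes are all invented for illustration) that turns a few documents into a term-frequency dataset in Weka's ARFF format:

```python
# Build a tiny term-frequency dataset in Weka's ARFF format.
# The documents and vocabulary below are invented for illustration.
docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
vocab = sorted({w for d in docs for w in d.split()})

lines = ["@relation toy_corpus", ""]
for w in vocab:
    lines.append(f"@attribute {w} numeric")  # one numeric attribute per word
lines += ["", "@data"]
for d in docs:
    counts = [d.split().count(w) for w in vocab]  # term frequencies
    lines.append(",".join(str(c) for c in counts))

arff = "\n".join(lines)
print(arff)
```

A real experiment would of course use many more documents and some preprocessing (lowercasing, stopword removal), but the file layout is the same.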
March 22
Data-driven characterization of linguistic phenomena: clustering
compulsory reading: Clustering, chapter 14 of Foundations of Statistical NLP
recommended reading: chapter on exploratory data analysis from the NIST/SEMATECH e-Handbook of Statistical Methods
homework: get acquainted with Weka's functionalities for clustering (at least read the manual!)
slides: we'll be using Chris Manning's slides on document clustering (also in .pdf), since they follow the chapter quite closely.
March 27 and March 29
Data-driven characterization of linguistic phenomena: clustering to find equivalence classes
compulsory reading: Pantel, P. and Lin, D. 2002. Discovering Word Senses from Text. KDD-02
recommended reading:
- D. Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of ICML-98. Madison, Wisconsin.
- Senellart, P., Blondel, V. D. 2003. Automatic discovery of similar words. In A Comprehensive Survey of Text Mining. Springer-Verlag.
homework: toy experiments with clustering words to find equivalence classes. To do that, you can use a relatively small corpus of Spanish or a relatively big one. You should be able to present succinctly (2-3 minutes) the results of your experiments in class, in one or two weeks' time. A short written report (1-2 pages long) is also requested.
You can also play with Dekang Lin's demo page on similarity of words and discuss the results in class.
if you need some hints for the task...
slides: a bunch of slides of mine commenting on the reading (believe it or not, for once it was me who made the slides!)
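To see the intuition behind clustering words into equivalence classes, here is a toy sketch (this is not Pantel & Lin's CBC algorithm, just the underlying distributional idea, on an invented six-line "corpus"): represent each word by a vector of its neighbouring words and compare the vectors by cosine similarity.

```python
from collections import defaultdict
from math import sqrt

# Invented toy corpus; real experiments need millions of words.
corpus = ("the cat drinks milk . the dog drinks water . "
          "the cat chases the mouse . the dog chases the cat").split()

# Context vectors: co-occurrence counts within a +/-1 word window.
vectors = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            vectors[w][corpus[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# "cat" and "dog" occur in similar contexts (both drink and chase),
# so they come out far more similar than, say, "cat" and "milk".
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

Clustering the words then amounts to grouping together the pairs with high cosine, which is what you can try at toy scale for the homework.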
April 3
Data-driven characterization of linguistic phenomena: clustering to find word associations
compulsory reading: Collocations, chapter 5 of Foundations of Statistical NLP
recommended reading: Ken Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16 (1)
homework: toy experiments with clustering to find word associations
slides: Ines Rehbein's slides on the chapter for the NCLT reading group at the School of Computing at Dublin City University.
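The core measure in the Church & Hanks reading, pointwise mutual information, is easy to try on your own data. A minimal sketch (in Python, on an invented toy text; PMI estimates only become meaningful on large corpora):

```python
from collections import Counter
from math import log2

# Invented toy text; "new york" is the collocation we hope to find.
tokens = ("new york is big . new york is busy . "
          "a new idea . york is old").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2 of how much more often they co-occur than chance predicts."""
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x = unigrams[w1] / N
    p_y = unigrams[w2] / N
    return log2(p_xy / (p_x * p_y))
```

Here `pmi("new", "york")` comes out clearly higher than for an accidental pair like `(".", "new")`, which is the signal you would rank bigrams by in the homework.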
April 5
Data-driven characterization of linguistic phenomena, incorporating knowledge: word sense disambiguation
compulsory reading: Word Sense Disambiguation, chapter 7 of Foundations of Statistical NLP
recommended reading:
- Dan Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL.
- J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18 (3)
homework: present the results of your toy experiments with clustering, possibly comparing them with some of the readings scheduled for today.
April 10
today we'll be discussing this nice paper on unsupervised word sense disambiguation using graphs (Deme is doing the talking)
recommended reading: J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18 (3)
homework: present the results of your toy experiments with clustering -- class discussion, recommendations for handing in short reports after Easter.
slides: we'll be skimming through some of the slides from the tutorial Advances in Word Sense Disambiguation given by Rada Mihalcea and Ted Pedersen at IBERAMIA-2004, ACL-2005, and AAAI-2005
April 12
Data-driven characterization of linguistic phenomena, incorporating knowledge: subcategorization acquisition
compulsory reading: Lexical Acquisition, chapter 8 of Foundations of Statistical NLP
recommended reading:
- Chris Manning. 1993. Automatic acquisition of a large subcategorisation dictionary from corpora. ACL
- T. Briscoe, J. Carroll. 1997. Automatic extraction of subcategorization from corpora. ACL
- Riloff, E. and Shepherd, J. 1997. A corpus-based approach for building semantic lexicons. EMNLP-1997.
homework: think about what you'd like to do as a project, so that you can meditate on it during the holidays.
slides: Manning & Schütze's slides for their chapter on Lexical Acquisition, to be found at the companion website for the book.
April 17
Decision Trees and Association Rules
compulsory reading: Xavier Carreras, Lluís Màrquez and Lluís Padró, Named Entity Extraction using AdaBoost, CoNLL-02
recommended reading: AAAI's site on Decision Trees, with plenty of links and resources; you can also take a look at the Wikipedia article on decision trees.
homework: get to know Weka's functionalities for Decision Trees.
slides: Lluís Màrquez on Decision Tree Induction.
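The quantity that drives decision tree induction is information gain: how much a split on an attribute reduces the entropy of the class labels. A minimal sketch (the data is an invented caricature of named-entity features, not from the CoNLL-02 paper):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(rows, labels, attr):
    """Drop in label entropy after splitting on attribute index attr."""
    total = entropy(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total - remainder

# Invented toy data: (capitalized?, sentence-initial?) -> entity or not.
rows = [(1, 0), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
labels = ["NE", "NE", "O", "O", "NE", "O"]
```

On this toy data attribute 0 (capitalization) separates the classes perfectly, so its gain is maximal, while attribute 1 tells us nothing; a tree learner would split on attribute 0 first.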
April 19
Language as a sequence: multiple sequence alignment
compulsory reading: Statistical Alignment and Machine Translation, chapter 13 of Foundations of Statistical NLP
recommended reading: check Patrick Lambert's Bibliography for Statistical Alignment and Machine Translation (strongly recommended), and also:
- Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, March.
- W. A. Gale and Ken Church. 1991. Identifying word correspondences in parallel texts.
- F.J. Och, H. Ney. 2000. Improved Statistical Alignment Models. ACL
homework: skim A Statistical MT Tutorial Workbook, prepared for the JHU summer workshop, and get a good (discussable) idea of it.
just for fun: take a look at alphamalig, the multiple alignment tool with parametrisable distances and alphabets.
slides: the skeleton of the lecture was Nathalie Japkowicz's slides on chapter 13 for her course Natural Language Processing, A Statistical Approach. We saw some examples of difficult cases for alignment, and the IBM models, in Chris Manning's slides for his lecture on Statistical Machine Translation in his course on Natural Language Processing. The slides for dynamic programming techniques applied to sequence alignment were taken from Robert W. Robinson.
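The dynamic programming idea behind sequence alignment fits in a few lines. Here is a minimal sketch of global alignment scoring (Needleman-Wunsch style; the scoring values are arbitrary choices, and it returns only the score, not the alignment itself):

```python
def align(a, b, gap=-1, match=1, mismatch=-1):
    """Global alignment score between two sequences, by dynamic
    programming: score[i][j] is the best score aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a aligned against nothing
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b aligned against nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[m][n]
```

The same table, filled with word tokens instead of characters and a similarity-based substitution cost, is what the sentence- and word-alignment methods in the chapter build on.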
April 24
Language as a sequence: statistical machine translation
compulsory reading: Statistical Alignment and Machine Translation, chapter 13 of Foundations of Statistical NLP
recommended reading: A. Venugopal, S. Vogel, A. Waibel. 2003. Effective Phrase Translation Extraction from Alignment Models. ACL
homework:
please make appointments to come by the office to discuss what you'll be working on in your project
remember that you can submit something to IBERAMIA's TIL workshop...
April 26
Language as a sequence: paraphrase extraction
compulsory reading: Regina Barzilay and Kathy McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL
recommended reading: Regina Barzilay, Lillian Lee. Bootstrapping Lexical Choice via Multiple-Sequence Alignment.
homework:
define the schedule for your project; you should already be reading the state of the art on the subject
May 3
Language as a sequence: Markov models
compulsory reading: Statistical Inference: n-gram Models over Sparse Data, chapter 6 of Foundations of Statistical NLP
recommended reading: M.L. Forcada. 2001. Corpus-based stochastic finite-state predictive text entry for reduced keyboards: application to Catalan. SEPLN
homework:
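The sparse-data problem in chapter 6 shows up even on tiny examples: most bigrams are never observed, so maximum-likelihood estimates assign them probability zero. A minimal sketch of a bigram model with add-one (Laplace) smoothing, on an invented toy text:

```python
from collections import Counter

# Invented toy training text.
tokens = "the cat sat on the mat and the cat slept".split()
vocab = set(tokens)
V = len(vocab)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_laplace(w2, w1):
    """Add-one smoothed bigram probability P(w2 | w1): every possible
    continuation gets one phantom count, so nothing has probability zero."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
```

Note that the smoothed probabilities over the vocabulary still sum to one for a given history (for histories that never end the text), and unseen pairs like "mat sat" now get a small but nonzero probability; the chapter discusses better ways of distributing that reserved mass.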
May 8
Data-driven characterization of linguistic phenomena: latent semantic analysis, principal component analysis
compulsory reading: T. K. Landauer, P. W. Foltz, & D. Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25.
recommended reading:
- T.K. Landauer, S.T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review.
- Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. UAI-99.
homework: you can try the Open Source LSA Package for R; you can also take a look at the LSA page of the University of Colorado.
slides: some very nice slides on principal component analysis from a course in Princeton... authorship unknown!
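Computationally, LSA is just a truncated singular value decomposition of the term-document matrix. A minimal sketch (using numpy, which is an extra dependency; the matrix below is an invented toy example in the spirit of the Landauer et al. reading):

```python
import numpy as np

# Invented toy term-document matrix (rows: terms, columns: documents).
terms = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([[1, 0, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# LSA: keep only the k largest singular values of the SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation
```

The rows of `U[:, :k]` (scaled by the singular values) are the k-dimensional "latent" representations of the terms; words that co-occur with similar documents end up close together even when they never co-occur directly.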
May 10
Text as a graph: graph-based algorithms for NLP
compulsory reading: Avrim Blum and Shuchi Chawla. 2001. Learning from Labeled and Unlabeled Data using Graph Mincut. ICML'01.
recommended reading: papers presented at the workshop on graph-based algorithms for NLP (HLT-06), particularly
- Einat Minkov, William Cohen and Andrew Ng. 2006. A Graphical Framework for Contextual Search and Name Disambiguation in Email
- Diego Molla. 2006. Learning of Graph-based Question Answering Rules
homework:
slides: Regina Barzilay's slides on graph-based methods, from her course on Natural Language Processing.
May 15
Linear segmentation and lexical chains
compulsory reading:
- Hearst, M. 1997. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March.
- Morris, J. and Hirst, G. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21-43
recommended reading: Freddy Y. Y. Choi, Peter Wiemer-Hastings, Johanna Moore. 2001. Latent Semantic Analysis for Text Segmentation. Proceedings of 6th EMNLP
slides: some of Noémie Elhadad's slides on Discourse from her course Natural Language Processing (2006) and some of Marti Hearst's slides on discourse processing and text segmentation from her course Applied Natural Language Processing (2004).
by now you should have decided what you will be working on in your project
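To make the block-comparison idea behind TextTiling concrete, here is a toy sketch (a drastic simplification of Hearst's algorithm, on an invented six-sentence "document": compare adjacent units by lexical overlap and put a boundary where the similarity dips):

```python
from math import sqrt

# Invented toy "document": two topics, three sentences each.
sentences = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "the cat slept",
    "stocks fell on monday",
    "the market lost value on monday",
    "investors sold stocks in the market",
]

def bag(sentence):
    """Bag-of-words count vector for one sentence."""
    counts = {}
    for w in sentence.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

# Lexical similarity of each adjacent sentence pair; the deepest dip
# marks the topic shift (between sentences 2 and 3 here).
sims = [cosine(bag(a), bag(b)) for a, b in zip(sentences, sentences[1:])]
boundary = sims.index(min(sims)) + 1
```

Hearst's real algorithm compares multi-sentence blocks, smooths the similarity curve, and picks boundaries by depth scores, but the dip-detection intuition is the same.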
May 17
Multiple Sequence Alignment to discover the structure of natural languages
compulsory reading: Z. Solan, D. Horn, E. Ruppin, and S. Edelman, Unsupervised learning of natural languages, PNAS
recommended reading: W. R. Pearson and D. J. Lipman. 1988. Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
slides: Peter Adriaans on Language Learning and D. Horn on Adios.
the outline of your work and all major decisions must be settled by June, so that you can give an oral presentation of it in front of the class.
May 29
Ontology Induction and Population
compulsory reading: H. Davulcu, S. Vadrevu, S. Nagarajan. OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites. In the First International Workshop on Semantic Web and Databases, September 2003, Berlin, Germany
recommended reading: Pantel, P. 2005. Inducing Ontological Co-occurrence Vectors. ACL-05.
slides: Eduard Hovy's slides for introducing ontologies and Patrick Pantel's slides for inducing ontological co-occurrence vectors.
May 31
Template Induction
compulsory reading: Zvika Marx, Ido Dagan and Eli Shamir. 2002. Cross-component Clustering for Template Induction. Workshop on Text Learning (TextML 2002)
recommended reading:
- Ion Muslea, Steve Minton, Craig Knoblock. 2001. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4.
- Ion Muslea. 1999. Extraction Patterns for Information Extraction Tasks: A Survey. The AAAI-99 Workshop on Machine Learning for Information Extraction
- Riloff, E. and Jones, R. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. AAAI-99.
homework:
DEADLINE for the TIL workshop at IBERAMIA
June 5
Learning of Morphology
compulsory reading: Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics. 27 (2)
recommended reading: Yu Hu, I. Matveeva, J. Goldsmith, C. Sprague. 2005. Using Morphology and Syntax Together in Unsupervised Learning. ACL workshop PsychoCompLA-2005
homework:
June 7
Democratically Elected Topic
compulsory reading:
recommended reading:
homework:
DEADLINE for handing in the papers for the presentation of your work
June 12
presentation of your work (schedule to be defined)
homework: read your classmates' work and prepare to critique it in class
June 14
presentation of your work (schedule to be defined)
homework: read your classmates' work and prepare to critique it in class
wrap-up, beer and the like.