- Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. ACL 2008.
- Zhifei Li and David Yarowsky. 2008. Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora. ACL 2008.
- Mamoru Komachi, Taku Kudo, Masashi Shimbo and Yuji Matsumoto. 2008. Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms. EMNLP 2008.
August 15
Introduction to the course: brainstorming, what is text mining? what is it useful for?
Evaluation: how do you evaluate what you still don't know? baselines, hypothesis tests
compulsory reading: W. Fan, L. Wallace, S. Rich, Z. Zhang. 2005. Tapping into the Power of Text Mining. Communications of the ACM.
recommended reading: success cases:
- Ashok N. Srivastava and Brett Zane-Ulman. Discovering Recurring Anomalies in Text Reports Regarding Complex Space Systems. NASA Intelligent Systems Division. [this one can be presented in class and will count towards the final mark]
- applications of data mining to other areas of knowledge (not business-related): the discovery of favorable antibiotic combinations, a breakthrough in interventional radiology
- an interview with Usama Fayyad (interviewing Piatetsky-Shapiro) (there is also one from 2001)
- check out more success stories from SAS, for example one about mining tourist guides!
- I think you'll enjoy Marti Hearst's essay What is Text Mining?
homework: think about what you would like to do as your course project; find corpora that could be used to learn something structured.
You can also have a tour:
- http://filebox.vt.edu/users/wfan/text_mining.html
- newsletters on data mining: KDNuggets, SIGKDD Explorations (of special interest: Explorations' special issue on Text Mining)
slides: Untangling Text Data Mining, by Marti Hearst
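Since this session touches on baselines and hypothesis tests, here is a minimal sketch (my own, not from the readings) of a paired bootstrap significance test; the per-document accuracies for the two systems are invented for illustration:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """How often does system A beat system B when the per-item scores
    are resampled with replacement? A crude significance estimate."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)   # fixed seed for reproducibility
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Hypothetical per-document accuracies (1 = correct, 0 = wrong).
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print(paired_bootstrap(a, b))  # a value near 1.0 favours system A
```

The returned fraction approximates how confident we can be that system A genuinely outperforms system B, rather than winning by luck on this particular test set.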
August 20
Natural Language Processing: classical architectures, data-driven solutions
compulsory reading: Introduction to Speech and Language Processing, by D. Jurafsky and J. H. Martin (2000)
recommended reading: J. Allen (1987) Natural Language Understanding
homework: download Weka, install it and get it to run, check that you can open and successfully read the manual
slides: we will be using the first part of my talk on free NLP software resources and we will work on the blackboard.
August 22
Math Foundations and Linguistic Essentials
compulsory reading: none!
recommended reading: chapters 1 and 2 from Foundations of Statistical NLP.
homework: prepare corpus data to work with Weka in clustering
slides: we will be working with an extract of Lluís Padró's slides on Statistical Methods for NLP. We will also peruse Lluís Màrquez's slides on Machine Learning for NLP. We might use some examples of the slides on Linguistic Essentials to illustrate linguistic phenomena and we may also resort to some other slides on Math Foundations.
August 27
PROVINCIAL CENSUS
August 29
Data-driven characterization of linguistic phenomena: clustering
compulsory reading: Clustering, chapter 14 of Foundations of Statistical NLP
recommended reading: chapter on exploratory data analysis from the NIST/SEMATECH e-Handbook of Statistical Methods
homework 1: get acquainted with Weka's functionalities for clustering (at least read the manual!)
homework 2: toy experiments with clustering words to find equivalence classes.
You should be able to present succinctly (2-3 minutes) the results of your experiments in class, in one or two weeks' time.
slides: we'll be using Chris Manning's slides on document clustering (also in .pdf), since they follow the chapter quite closely.
You can also play with Dekang Lin's demo page on word similarity and discuss the results in class.
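For the toy clustering homework, here is an illustrative sketch of my own (not from the readings): represent each word by its co-occurrence contexts and compare words by cosine similarity. The tiny corpus is made up:

```python
from collections import Counter
import math

def cooc_vectors(sentences, window=2):
    """Represent each word by counts of the words co-occurring with it
    within a fixed-size window."""
    vecs = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# A made-up corpus: "cat" and "dog" occur in identical contexts,
# so they should fall into the same equivalence class.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"],
             ["a", "cat", "ran"], ["a", "dog", "ran"]]
vecs = cooc_vectors(sentences)
print(cosine(vecs["cat"], vecs["dog"]))  # → 1.0
```

Clustering then amounts to grouping words whose pairwise cosine exceeds a threshold; once the vectors are exported in ARFF format, Weka's clusterers can do the same job.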
September 3 and 5
fully supervised >> mildly supervised >> fully unsupervised
(a comparison on the task of word sense disambiguation)
compulsory reading:
- Eneko Agirre and Philip Edmonds. 2006. Introduction. In Word Sense Disambiguation: Algorithms and Applications. Springer.
- David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL'95.
- Hinrich Schütze. 1998. Automatic Word Sense Discrimination. Computational Linguistics.
recommended reading: Word Sense Disambiguation, chapter 7 of Foundations of Statistical NLP
homework:
slides: we'll be skimming through some of the slides from the tutorial Advances in Word Sense Disambiguation, given by Rada Mihalcea and Ted Pedersen at IBERAMIA-2004, ACL-2005 and AAAI-2005
September 12
refining clustering for word sense discrimination
compulsory reading: Pantel, P. and Lin, D. 2002. Discovering Word Senses from Text. KDD-02
recommended reading:
- D. Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of ICML-98, Madison, Wisconsin.
- Senellart, P. and Blondel, V. D. 2003. Automatic Discovery of Similar Words. In Survey of Text Mining. Springer-Verlag.
slides: a bunch of slides of mine commenting on the reading (believe it or not, for once it was me who made the slides!)
September 17
small-world graphs to discover minor word senses for fine-grained Information Retrieval
compulsory reading: J. Véronis. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech and Language, 18(3)
September 19
unsupervised, graph-based techniques as applied to Biomedical Text Mining
compulsory reading: Amgad Madkour, Kareem Darwish, Hany Hassan, Ahmed Hassan and Ossama Emam. 2007. BioNoculars: Extracting Protein-Protein Interactions from Biomedical Text. In Proceedings of the Workshop on Biological, Translational and Clinical Language Processing, ACL'07.
recommended reading: Takaaki Hasegawa, Satoshi Sekine and Ralph Grishman. 2004. Discovering Relations among Named Entities from Large Corpora. ACL 2004
September 24
HOLIDAY
September 26
HOLIDAY
October 1
Association rules as applied to text
compulsory reading: Berardi, M., Lapi, M., Leo, P., Loglisci, C. 2005. Mining Generalized Association Rules on Biomedical Literature. In: Ali, M., Esposito, F. (eds.): Innovations in Applied Artificial Intelligence. Lecture Notes in Artificial Intelligence 3353 (2005) 500-509
slides: slides for Chapter 6, Association Analysis: Basic Concepts and Algorithms (612KB), from the book Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar
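To make the session concrete, here is a minimal Apriori-style sketch of my own (not from the paper) that treats each document as a "transaction" of terms and extracts high-confidence rules; the document collection is invented:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) enumeration of the term sets whose
    support reaches min_support (a fraction of the transactions)."""
    freq = {}
    k = 1
    candidates = [frozenset([t]) for t in {t for tr in transactions for t in tr}]
    while candidates:
        counts = {c: sum(1 for tr in transactions if c <= tr) for c in candidates}
        level = {c: n for c, n in counts.items()
                 if n / len(transactions) >= min_support}
        freq.update(level)
        k += 1
        candidates = list({a | b for a in level for b in level if len(a | b) == k})
    return freq

def rules(freq, min_conf):
    """Rules lhs -> rhs with confidence support(lhs ∪ rhs) / support(lhs)."""
    out = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / freq[lhs]  # subsets of frequent sets are frequent
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

# Invented document collection, each document reduced to its set of terms.
docs = [{"gene", "protein", "interaction"}, {"gene", "protein"},
        {"gene", "expression"}, {"protein", "interaction"}]
fs = frequent_itemsets(docs, min_support=0.5)
print(rules(fs, min_conf=0.9))  # → [({'interaction'}, {'protein'}, 1.0)]
```

Generalized association rules, as in the compulsory reading, additionally let the items range over an ontology (e.g. MeSH concepts) rather than surface terms.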
October 3
Language as a sequence: multiple sequence alignment and paraphrase extraction
compulsory reading: Regina Barzilay and Kathleen McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL 2001
recommended reading: check Patrick Lambert's Bibliography for Statistical Alignment and Machine Translation (strongly recommended), and also:
- Statistical Alignment and Machine Translation, chapter 13 of Foundations of Statistical NLP
- Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, March.
- W. A. Gale and Ken Church. 1991. Identifying Word Correspondences in Parallel Texts.
- F. J. Och and H. Ney. 2000. Improved Statistical Alignment Models. ACL 2000.
- Regina Barzilay and Lillian Lee. 2002. Bootstrapping Lexical Choice via Multiple-Sequence Alignment. EMNLP 2002.
homework: skim A Statistical MT Tutorial Workbook, prepared for the JHU summer workshop, and come away with a good (discussable) idea of it.
just for fun: take a look at alphamalig, the multiple alignment tool with parametrisable distances and alphabets.
slides: the skeleton of the lecture was Nathalie Japkowicz's slides on chapter 13 from her course Natural Language Processing, A Statistical Approach. We saw some examples of difficult cases for alignment, and the IBM models, in Chris Manning's slides for his lecture on Statistical Machine Translation in his course on Natural Language Processing. The slides on dynamic programming techniques applied to sequence alignment were taken from Robert W. Robinson.
define the schedule for your project; you should already be reading the state of the art on the subject
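The dynamic-programming alignment this session revolves around fits in a few lines. This is a plain Needleman-Wunsch global alignment over word sequences; the scoring constants are arbitrary illustrative choices, not taken from any of the readings:

```python
def align(xs, ys, gap=-1, match=2, mismatch=-1):
    """Global alignment of two word sequences by dynamic programming
    (Needleman-Wunsch); returns the best score and one alignment."""
    n, m = len(xs), len(ys)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if xs[i - 1] == ys[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # traceback: recover one optimal alignment as (word, word) pairs,
    # with None standing for a gap
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + \
                (match if xs[i - 1] == ys[j - 1] else mismatch):
            pairs.append((xs[i - 1], ys[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((xs[i - 1], None)); i -= 1
        else:
            pairs.append((None, ys[j - 1])); j -= 1
    return score[n][m], pairs[::-1]

s, al = align("the cat sat on the mat".split(), "the cat lay on a mat".split())
print(s, al)  # score 6; "sat"/"lay" and "the"/"a" are aligned as substitutions
```

Barzilay and Lee's work generalises exactly this pairwise recurrence to multiple sequences at once; the same DP also underlies the sentence- and word-alignment models in the other readings.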
October 6
Multiple Sequence Alignment to discover the structure of natural languages
compulsory reading: Z. Solan, D. Horn, E. Ruppin and S. Edelman. 2005. Unsupervised Learning of Natural Languages. PNAS
recommended reading: W. R. Pearson and D. J. Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444- 2448
slides: D. Horn's slides on ADIOS.
October 8
Linear segmentation and word chains
compulsory reading:
- Hearst, M. 1997. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23(1), pp. 33-64, March 1997.
- Morris, J. and Hirst, G. 1991. Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text. Computational Linguistics, 17(1):21-43.
recommended reading: Freddy Y. Y. Choi, Peter Wiemer-Hastings and Johanna Moore. 2001. Latent Semantic Analysis for Text Segmentation. Proceedings of the 6th EMNLP
slides: some of Marti Hearst's slides on discourse processing and text segmentation from her course Applied Natural Language Processing (2006).
October 10
Learning of Morphology
compulsory reading: Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics. 27 (2)
recommended reading: Yu Hu, I. Matveeva, J. Goldsmith, C. Sprague. 2005. Using Morphology and Syntax Together in Unsupervised Learning. ACL workshop PsychoCompLA-2005
homework:
October 15
Ontology Induction and Population
compulsory reading:
M. Ruiz-Casado, E. Alfonseca and P. Castells. 2005. Automatic extraction of semantic relationships for WordNet by means of pattern learning from Wikipedia. Proceedings of NLDB-2005. In Natural Language Processing and Information Systems.
or
Pantel, P. 2005. Inducing Ontological Co-occurrence Vectors. ACL-05.
recommended reading: take a tour around the PASCAL ontology learning challenge (2006) and OLP3, the 3rd Workshop on Ontology Learning and Population, held at ECAI 2008
slides: Eduard Hovy's slides for introducing ontologies and Patrick Pantel's slides for inducing ontological co-occurrence vectors.
October 17
Text as a graph: graph-based algorithms for NLP
compulsory reading: Avrim Blum and Shuchi Chawla. 2001. Learning from Labeled and Unlabeled Data using Graph Mincut. ICML'01.
recommended reading:
For a nice, short discussion of what transductive learning is, take a look at Thorsten Joachims' paper Transductive Learning via Spectral Graph Partitioning, Proceedings of the International Conference on Machine Learning (ICML), 2003. His page on Spectral Graph Partitioning also has interesting material.
papers presented at the workshop on graph-based algorithms for NLP'06 (HLT-06), TextGraphs-07 and TextGraphs-08.
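As a taste of the transductive, graph-based flavour of these papers, here is a tiny label-propagation sketch on an invented word co-occurrence graph. Label propagation is a relative of the mincut formulation, not Blum and Chawla's exact algorithm; the graph and seeds below are made up:

```python
def label_propagation(edges, seeds, iters=50):
    """Transductive toy: clamp seed nodes to their labels (+1.0 / -1.0)
    and repeatedly replace every other node's score by the average of
    its neighbours' scores. Assumes unlabelled nodes have neighbours."""
    nodes = {n for e in edges for n in e}
    nbrs = {n: [] for n in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    score = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        score = {n: seeds[n] if n in seeds
                 else sum(score[m] for m in nbrs[n]) / len(nbrs[n])
                 for n in nodes}
    return score

# Invented co-occurrence graph around the ambiguous word "bank".
edges = [("bank", "money"), ("money", "loan"),
         ("bank", "river"), ("river", "water")]
seeds = {"money": 1.0, "water": -1.0}   # two hand-labelled sense anchors
scores = label_propagation(edges, seeds)
print(scores["loan"], scores["river"])  # loan ends up positive, river negative
```

In the limit each unlabelled node settles on a weighted average of the seed labels reachable from it, and the sign gives the predicted class; here "bank" leans (weakly) towards the money side because of its position in the graph.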
October 22
Feature Selection
compulsory reading: Introduction to Feature Extraction, Foundations and Applications by Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi Zadeh (eds). Series Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.
or
Huan Liu and Lei Yu. 2005. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), April 2005
recommended reading: the Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery, at ECML PKDD 2008
also the Special Issue on Variable and Feature Selection of the Journal of Machine Learning Research, 2003.
October 24
Learning Selectional Preferences
compulsory reading: Shane Bergsma, Dekang Lin and Randy Goebel. 2008. Discriminative Learning of Selectional Preference from Unlabeled Text. EMNLP 2008
October 29
Information Extraction
compulsory reading: Matthew Michelson and Craig A. Knoblock. 2007. Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web. International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics.
or
Marius Pasca and Benjamin Van Durme. 2007. What You Seek is What You Get: Extraction of Class Attributes from Query Logs, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07).
or
Marius Pasca. 2007. Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds, Proceedings of the 16th International World Wide Web Conference (WWW-07).
recommended reading: Roman Yangarber, Ralph Grishman, Pasi Tapanainen and Silja Huttunen. 2000. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. In Proceedings of the Conference on Applied Natural Language Processing, ANLP-NAACL 2000, pp. 282-289, Seattle, WA.
October 31
Fact Extraction (mildly supervised)
compulsory reading: Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits and Alpa Jain. 2006. Names and Similarities on the Web: Fact Extraction in the Fast Lane. ACL 2006
recommended reading:
- Heng Ji and Ralph Grishman. 2008. Refining Event Extraction through Unsupervised Cross-document Inference. ACL 2008.
- Hany Hassan, Ahmed Hassan and Ossama Emam. 2006. Unsupervised Information Extraction Approach Using Graph Mutual Reinforcement. EMNLP 2006.
November 5
Inference Rules
compulsory reading: Rahul Bhagat, Patrick Pantel and Eduard Hovy. 2007. LEDIR: An Unsupervised Algorithm for Learning Directionality of Inference Rules. EMNLP'07.
November 7
Data-driven characterization of linguistic phenomena: latent semantic analysis, principal component analysis
compulsory reading: T. K. Landauer, P. W. Foltz, & D. Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25.
recommended reading:
- T. K. Landauer and S. T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. Psychological Review.
- Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. UAI-99.
homework: you can try the open-source LSA package for R; you can also take a look at the LSA page of the University of Colorado.
slides: some very nice slides on principal component analysis from a course at Princeton... authorship unknown!
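The SVD at the heart of LSA fits in a few lines of numpy. The term-document matrix below is invented (echoing the classic ship/boat example from the LSA literature), and the rank-2 truncation is an arbitrary illustrative choice:

```python
import numpy as np

# Invented term-document count matrix: rows = terms, columns = documents.
terms = ["ship", "boat", "ocean", "wood", "tree"]
X = np.array([[1, 0, 1, 0, 0],    # ship
              [0, 1, 0, 0, 0],    # boat
              [1, 1, 1, 0, 0],    # ocean
              [0, 0, 0, 1, 1],    # wood
              [0, 0, 0, 1, 0]],   # tree
             dtype=float)

# LSA: keep only the k strongest latent dimensions of the SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]      # term representations in latent space

def sim(w1, w2):
    u, v = term_vecs[terms.index(w1)], term_vecs[terms.index(w2)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "ship" and "boat" share no document, yet LSA places them together
# because both co-occur with "ocean".
print(sim("ship", "boat"))   # close to 1.0
print(sim("ship", "wood"))   # close to 0.0
```

This is the whole trick behind the "Plato's problem" argument in the recommended reading: second-order co-occurrence, surfaced by the truncated SVD, relates words that never appear together.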
November 12
mining Wikipedia
compulsory reading: Fadi Biadsy, Julia Hirschberg and Elena Filatova. 2008. An Unsupervised Approach to Biography Production Using Wikipedia. ACL 2008.
or
Elif Yamangil and Rani Nelken. 2008. Mining Wikipedia Revision Histories for Improving Sentence Compression. ACL 2008.
recommended reading: list of academic papers that use Wikipedia
November 14
co-reference resolution
compulsory reading: Hoifung Poon and Pedro Domingos. 2008. Joint Unsupervised Coreference Resolution with Markov Logic. EMNLP 2008.
or
Vincent Ng. 2008. Unsupervised Models for Coreference Resolution. EMNLP 2008.
November 19
surprise topic: decipherment!!!
compulsory reading: Kevin Knight, Anish Nair, Nishit Rathod and Kenji Yamada. 2006. Unsupervised Analysis For Decipherment Problems. COLING-ACL 2006 (poster)
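For flavour, a toy decipherment sketch of my own, far simpler than the problems in the paper: brute-force a Caesar cipher by scoring each of the 26 shifts against a rough English letter-frequency ranking (the usual "etaoin shrdlu" mnemonic). Lowercase input is assumed:

```python
def caesar_decipher(ciphertext, common="etaoinshrdlu"):
    """Try all 26 shifts of a (lowercase) Caesar cipher and keep the one
    whose deciphered letters best match common English letters."""
    weights = {c: len(common) - i for i, c in enumerate(common)}
    def score(k):
        return sum(weights.get(chr((ord(c) - 97 - k) % 26 + 97), 0)
                   for c in ciphertext if c.isalpha())
    best = max(range(26), key=score)
    return "".join(chr((ord(c) - 97 - best) % 26 + 97) if c.isalpha() else c
                   for c in ciphertext)

plain = "it was the best of times it was the worst of times"
cipher = "".join(chr((ord(c) - 97 + 3) % 26 + 97) if c.isalpha() else c
                 for c in plain)   # encipher with shift 3
print(caesar_decipher(cipher) == plain)  # → True
```

The paper tackles much harder cases (arbitrary substitution ciphers, phonetic decipherment) with EM over channel models, but the underlying idea is the same: pick the decoding that makes the output look most like language.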
November 21
informal presentation of projects (as they stand at the time; no need for slides!), brainstorming on each other's projects.
discussion of the course: what was good, what could/should be improved, what did you learn, what did you want to learn?
what is the future of text mining?
take a look at the Grand Challenge of Text Mining, as stated by Ronen Feldman in the KDD-2006 panel report What are the Grand Challenges for Data Mining?
Text Mining is an exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR and knowledge management. Text Mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules etc) and visualization of the results.
[...] we would like to have (this is our text mining grand challenge) Text mining systems that will be able to pass standard reading comprehension tests such as SAT, GRE, GMAT etc.
Systems that will be able to pass the average scores will win the grand challenge. The systems can utilize the web when answering the test questions. We view this grand challenge as an extension of the classic Turing test. This grand challenge satisfies most of the criteria that were set for the various challenges. First, there are no systems today that are able to get above average score in any of the standard tests. Second, the criterion for success is very well defined. Then, we believe that within 5 years researchers will be able to build such systems based on technologies that are developed for annual competitions such as ACE, TREC and TIDES. Finally, having such systems will contribute to the advance of humankind as the underlying technologies deployed by these systems can be utilized by children and adults to more rapidly acquire knowledge about various topics.