- Mine the site Cooking for Engineers:
- induce a wrapper
- obtain correspondences between texts and tabular summaries
[ this task requires close supervision in virtually all stages ]
[ you can team up for this one ]
- Maybe you can consider working on the discovery of
morphology, starting from John Goldsmith's Linguistica
- What about a critical survey of available toolkits
that can be used for text mining?
[ this should include machine learning environments as well ]
- Any ideas about mining system logs?
- What about the exploration of weblogs?
- You can build lexica by finding equivalence classes in a
newspaper ads corpus... ask me!
- It's not a bad idea to replicate experiments from
CoNLL: Named Entity Recognition, Clause
Splitting, etc.
- If you like e-mail, the Enron email dataset is available, for
example, to perform social-network analysis; it includes
useful mappings between the MD5 digests of the email
bodies and such things as authors, recipients, etc.
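Joining your own hash of each message body against such digest tables is straightforward; here is a minimal sketch, assuming the messages have already been parsed into records (the `emails` field names below are hypothetical, not the dataset's actual schema):

```python
import hashlib
from collections import defaultdict

def body_digest(body):
    """MD5 hex digest of an email body, the key used in the mappings."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()

def index_by_digest(emails):
    """Group (sender, recipients) pairs by the digest of the body,
    collapsing exact-duplicate messages onto a single key."""
    by_digest = defaultdict(list)
    for msg in emails:
        by_digest[body_digest(msg["body"])].append((msg["from"], msg["to"]))
    return by_digest
```

The (sender, recipient) pairs collected under each key can then serve as edges when building the social-network graph.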
- For those who want to try statistical machine
translation and the like, both the Hansards of the Parliament of Canada
and the proceedings of the European Parliament
are available as parallel texts.
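A natural first experiment on either parallel corpus is estimating word-translation probabilities with EM, in the style of IBM Model 1. A toy sketch (the sentence pairs here are made up for illustration, not drawn from either corpus):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM estimation of word-translation probabilities t(f|e)
    from sentence pairs (f_tokens, e_tokens), IBM Model 1 style."""
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts per e
        # E-step: distribute each f over the e's in its sentence
        for f_sent, e_sent in pairs:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)  # normalizer
                for e in e_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize into probabilities
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

On two overlapping pairs such as ("la maison" / "the house") and ("la" / "the"), the co-occurrence evidence concentrates probability on the right pairings after a few iterations.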
Tasks given in other courses related to text mining:
- Acquisition of the Meaning of Nouns,
from the Course on Lexical Semantics given at
the ISI in Fall 2005. This assignment
corresponds to the Computational Lexical Semantics
Module given by Professor Patrick Pantel.
- From Tom Mitchell's course on Machine Learning (for
graduates):
Natural Language Feature Selection for Exponential Models
A predictive model recently introduced by Chen & Rosenfeld can
incorporate arbitrary features using exponential distributions and
sampling (see http://www.cs.cmu.edu/~roni/wsme.ps). The model was
originally developed for modeling natural language, and it has
highlighted feature selection as the main challenge in that domain.
In this project you will be expected to read and understand this
paper. You will then be given two corpora: the first consists of
transcribed over-the-phone conversations; the second is artificial,
generated from the best existing language model (which was trained
on the first corpus). Your job is to use machine learning and
statistical methods of your choice (and other methods if you wish)
to find systematic differences between the two corpora. These
differences translate directly into new features, which will be
added to the model in an attempt to improve on it (an improvement
in language modeling can increase the quality of language
technologies such as speech recognition, machine translation, text
classification, spell checking, etc.).
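One concrete way to start hunting for such systematic differences is to score how differently each word behaves in the two corpora, for instance with Dunning's log-likelihood (G²) statistic. A sketch, assuming both corpora are available as plain token lists (the project's actual methods are up to you):

```python
import math
from collections import Counter

def g2(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio for a word occurring
    k1 times in n1 tokens vs. k2 times in n2 tokens."""
    def ll(k, n, p):
        s = 0.0
        if k > 0:
            s += k * math.log(p)
        if n - k > 0:
            s += (n - k) * math.log(1 - p)
        return s
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2 * (ll(k1, n1, k1 / n1) + ll(k2, n2, k2 / n2)
                - ll(k1, n1, p) - ll(k2, n2, p))

def ranked_differences(tokens1, tokens2):
    """Vocabulary ranked by how differently each word is
    distributed across the two corpora (highest G^2 first)."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    vocab = set(c1) | set(c2)
    return sorted(((g2(c1[w], n1, c2[w], n2), w) for w in vocab),
                  reverse=True)
```

Words near the top of the ranking are candidate features: the artificial corpus over- or under-produces them relative to the real conversations, so an indicator feature for them gives the exponential model something to correct.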