A choice of tasks for your practical assignments:

  • Mine the site Cooking for Engineers:

    • induce a wrapper
    • obtain correspondences between texts and tabular summaries

    [ this task requires close supervision in virtually all stages ]
    [ you can team up for this one ]
  • Maybe you can consider working on the discovery of morphology, starting from John Goldsmith's Linguistica
  • What about a critical survey of available toolkits that can be used for text mining?
    [ this should include machine learning environments as well ]
  • Any ideas about mining system logs?
  • What about the exploration of weblogs?
  • You can build lexica by finding equivalence classes in a newspaper-ads corpus... ask me!
  • It's not a bad idea to replicate experiments from CoNLL: Named Entity Recognition, Clause Splitting, etc.
  • If you like e-mail, the Enron email dataset is available; you could, for example, perform social-network analysis, including useful mappings between the MD5 digests of the email bodies and such things as authors, recipients, etc.
  • For those who want to try statistical machine translation and the like, both the Hansards of the Parliament of Canada and the documentation of the European Union Parliament are available as parallel texts.
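For the Enron task, a minimal sketch of the digest-to-metadata mapping might look like the following (the two inline messages are toy stand-ins for files from the Enron maildir; in the real assignment you would read the messages from disk):

```python
import hashlib
from email import message_from_string

# Two toy RFC 2822 messages standing in for files from the Enron maildir
# (illustrative only; the real corpus is read from disk).
raw_messages = [
    "From: alice@enron.com\nTo: bob@enron.com\nSubject: meeting\n\nSee you at 3pm.\n",
    "From: bob@enron.com\nTo: alice@enron.com, carol@enron.com\nSubject: re: meeting\n\nOK.\n",
]

# Map MD5(body) -> (author, recipients); duplicate bodies collapse to one key,
# which is what makes the digest useful for deduplication and cross-referencing.
digest_index = {}
for raw in raw_messages:
    msg = message_from_string(raw)
    body = msg.get_payload()
    key = hashlib.md5(body.encode("utf-8")).hexdigest()
    recipients = [addr.strip() for addr in msg.get("To", "").split(",")]
    digest_index[key] = (msg.get("From"), recipients)

for digest, (author, recipients) in digest_index.items():
    print(digest, author, recipients)
```

From such an index, a social network can be built by adding an edge from the author to each recipient of every distinct message body.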



Tasks given in other courses related to text mining:

  • Acquisition of the Meaning of Nouns, from the Course on Lexical Semantics given at the ISI in Fall 2005. This assignment corresponds to the Computational Lexical Semantics Module given by Professor Patrick Pantel.
  • From Tom Mitchell's course on Machine Learning (for graduates):
    Natural Language Feature Selection for Exponential Models

    A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see http://www.cs.cmu.edu/~roni/wsme.ps). The model was originally developed for modeling natural language, and has highlighted feature selection as the main challenge in that domain. In this project you will be expected to read and understand this paper. Then, you will be given two corpora. The first one consists of transcribed over-the-phone conversations. The second corpus is artificial, and was generated from the best existing language model (which was trained on the first corpus). Your job is to use machine learning and statistical methods of your choice (and other methods if you wish) to find systematic differences between the two corpora. These differences translate directly into new features, which will be added to the model in an attempt to improve it (an improvement in language modeling can increase the quality of language technologies such as speech recognition, machine translation, text classification, spellchecking, etc.)
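    One simple baseline for the corpus-comparison step is to rank words by a smoothed log-odds ratio between the two corpora; words strongly over-represented in the real corpus (relative to the model-generated one) are candidate features the model is failing to capture. The two corpora below are toy stand-ins, and the add-one smoothing is just one reasonable choice:

```python
import math
from collections import Counter

# Toy stand-ins for the transcribed corpus and the model-generated corpus
# (illustrative; the real project uses the two corpora handed out).
real_corpus = "uh i mean you know we could uh maybe meet you know tomorrow".split()
generated_corpus = "we could meet tomorrow and we could also meet today".split()

real_counts = Counter(real_corpus)
gen_counts = Counter(generated_corpus)
vocab = set(real_counts) | set(gen_counts)

def log_odds(word):
    # Add-one smoothed probabilities, so unseen words get nonzero mass.
    p_real = (real_counts[word] + 1) / (len(real_corpus) + len(vocab))
    p_gen = (gen_counts[word] + 1) / (len(generated_corpus) + len(vocab))
    return math.log(p_real / p_gen)

# Words with the largest scores are over-represented in the real corpus
# (here, disfluencies like "uh" that a language model tends to under-generate).
ranked = sorted(vocab, key=log_odds, reverse=True)
for w in ranked[:3]:
    print(w, round(log_odds(w), 3))
```

    The same idea extends to bigrams or longer n-grams; each highly ranked item can then be turned into a binary feature for the exponential model.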